Saturday, March 2, 2019

Troubleshooting - Namenode Stuck In Safe Mode

--Symptom: writes fail because the name node is in safe mode
hadoop fs -copyFromLocal test /tmp/
copyFromLocal: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file /tmp/test. Name node is in safe mode.
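
--A minimal sketch for confirming the state before retrying the write. `hdfs dfsadmin -safemode get` is a standard dfsadmin subcommand. The helper name safemode_state is hypothetical, and it only classifies text passed on stdin, so the wording it matches ("Safe mode is ON" / "Safe mode is OFF") is an assumption based on the output shown in these notes.

```shell
# Classify the output of `hdfs dfsadmin -safemode get` (read from stdin).
# safemode_state is a hypothetical helper name, not part of Hadoop.
safemode_state() {
  sm_status=$(cat)
  case "$sm_status" in
    *"Safe mode is ON"*)  echo ON ;;
    *"Safe mode is OFF"*) echo OFF ;;
    *)                    echo UNKNOWN ;;   # e.g. the namenode was unreachable
  esac
}

# Usage on a cluster node (assumes the hdfs client is on PATH):
#   hdfs dfsadmin -safemode get | safemode_state
```

Treating unrecognised output as UNKNOWN rather than OFF avoids retrying writes when the namenode could not be reached at all.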

--From name node web UI or logs
Safe mode is ON. The reported blocks 8900092 needs additional 6476 blocks to reach the threshold 1.0000 of total blocks 8906567

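--The threshold comes from this hdfs-site.xml property (default 0.999; the message above shows 1.0000 on this cluster)
dfs.namenode.safemode.threshold-pct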
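
--A minimal sketch of the arithmetic behind that message, assuming the log line keeps the wording shown above. The helper name safemode_gap is hypothetical. It prints the reported-to-total block ratio as a percentage, which is the figure compared against dfs.namenode.safemode.threshold-pct. The message's own "additional" count (6476) is one higher than the plain difference this sketch prints.

```shell
# Parse the safe mode message (stdin) and report how far the namenode is
# from the block threshold. safemode_gap is a hypothetical helper name.
safemode_gap() {
  # Pull out the "reported blocks N" and "total blocks M" numbers.
  sed -n 's/.*reported blocks \([0-9][0-9]*\) .*total blocks \([0-9][0-9]*\).*/\1 \2/p' |
    awk '$2 > 0 {
      printf "reported=%d total=%d missing=%d pct=%.4f\n", $1, $2, $2 - $1, 100 * $1 / $2
    }'
}

# Usage with the message from the web UI or namenode log:
#   echo "The reported blocks 8900092 needs additional 6476 blocks to reach the threshold 1.0000 of total blocks 8906567" | safemode_gap
```

For the message above the ratio works out to roughly 99.93%, which would clear a 0.999 threshold but not the 1.0000 shown in the message.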

####Reason 1 - Loss of datanodes or cluster running low on resources####

sudo -u hdfs hdfs dfsadmin -report

Safe mode is ON
Configured Capacity: 6807953326080 (6.19 TB)
Present Capacity: 5076746797056 (4.62 TB)
DFS Remaining: 5076745936896 (4.62 TB)
DFS Used: 860160 (840 KB)
DFS Used%: 0.00%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
Missing blocks (with replication factor 1): 0
-------------------------------------------------
Live datanodes (5):
Name: 10.15.230.42:50010 (node2)
Hostname: node2
Rack: /default
Decommission Status : Normal
Configured Capacity: 1361590665216 (1.24 TB)
DFS Used: 172032 (168 KB)
Non DFS Used: 425847939072 (396.60 GB)
DFS Remaining: 935742554112 (871.48 GB)
DFS Used%: 0.00%
DFS Remaining%: 68.72%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 2
Last contact: Thu May 07 21:52:10 EDT 2018

Name: 10.15.230.44:50010 (node4)
Hostname: node4
Rack: /default
Decommission Status : Normal
Configured Capacity: 1361590665216 (1.24 TB)
DFS Used: 172032 (168 KB)
Non DFS Used: 219371347968 (204.31 GB)
DFS Remaining: 1142219145216 (1.04 TB)
DFS Used%: 0.00%
DFS Remaining%: 83.89%
Configured Cache Capacity: 4294967296 (4 GB)
Cache Used: 0 (0 B)
Cache Remaining: 4294967296 (4 GB)
Cache Used%: 0.00%
Cache Remaining%: 100.00%
Xceivers: 2
Last contact: Thu May 07 21:52:10 EDT 2018


--Same report on a cluster that has run out of DFS space (DFS Remaining: 0, DFS Used%: 100.00%)
sudo -u hdfs hdfs dfsadmin -report

Safe mode is ON
Configured Capacity: 52710469632 (49.09 GB)
Present Capacity: 213811200 (203.91 MB)
DFS Remaining: 0 (0 B)
DFS Used: 213811200 (203.91 MB)
DFS Used%: 100.00%
Under replicated blocks: 39
Blocks with corrupt replicas: 0
Missing blocks: 0
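
--A minimal sketch for spotting datanodes that are short on space in a `dfsadmin -report` listing, assuming the per-node "Hostname:" and "DFS Remaining%:" lines look like the ones in the reports above. The helper name low_space_nodes and its default cutoff of 10 percent are hypothetical choices, not Hadoop settings.

```shell
# Print "<hostname> <remaining pct>" for every datanode in a
# `hdfs dfsadmin -report` listing (stdin) whose DFS Remaining% is below
# the cutoff given as $1 (default 10). Hypothetical helper, not part of Hadoop.
low_space_nodes() {
  awk -F': ' -v min="${1:-10}" '
    /^[ \t]*Hostname: /       { host = $2 }
    /^[ \t]*DFS Remaining%: / {
      pct = $2
      sub(/%$/, "", pct)                 # drop the trailing percent sign
      if (pct + 0 < min + 0) print host, pct
    }
  '
}

# Usage on a cluster node (assumes the hdfs client is on PATH):
#   sudo -u hdfs hdfs dfsadmin -report | low_space_nodes 15
```

The cluster-wide summary has no "DFS Remaining%:" line in the reports above, so this only covers the per-node sections; the summary's "DFS Remaining:" and "DFS Used%:" lines still need a manual look.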

####Reason 2 - Block corruption####

hdfs fsck / 
 
Connecting to namenode via http://master:50070
FSCK started by hdfs (auth:SIMPLE) from /10.15.230.22 for path / at Thu May 07 21:56:17 EDT 2018
..
/accumulo/tables/!0/table_info/A00009pl.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073806971
/accumulo/tables/!0/table_info/A00009pl.rf: MISSING 1 blocks of total size 891 B..
/accumulo/tables/!0/table_info/A00009pm.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073806989
/accumulo/tables/!0/table_info/A00009pm.rf: MISSING 1 blocks of total size 891 B..
/accumulo/tables/!0/table_info/F00009pn.rf: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073807006
............................................
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jaspic_1.0_spec-1.0.jar: MISSING 1 blocks of total size 30548 B..
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jta_1.1_spec-1.1.1.jar: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073743585
/user/oozie/share/lib/lib_20180408141046/sqoop/geronimo-jta_1.1_spec-1.1.1.jar: MISSING 1 blocks of total size 16030 B..
/user/oozie/share/lib/lib_20180408141046/sqoop/groovy-all-2.1.6.jar: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073743592
............................................
/tmp/logs/fincalc/logs/application_1430405794825_0003/DAST-node5_8041: MISSING 9 blocks of total size 1117083218 B..
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800217
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800222
/tmp/logs/fincalc/logs/application_1430405794825_0004/DAST-node1_8041: CORRUPT blockpool BP-2034730372-10.15.230.22-1428441473000 block blk_1073800230

Total size:    154126314807 B (Total open files size: 186 B)
 Total dirs:    3350
 Total files:   1790
 Total symlinks:                0 (Files currently being written: 2)
 Total blocks (validated):      2776 (avg. block size 55521006 B) (Total open file blocks (not validated): 2)
  ********************************
  CORRUPT FILES:        1764
  MISSING BLOCKS:       2776
  MISSING SIZE:         154126314807 B
  CORRUPT BLOCKS:       2776
  ********************************
 Minimally replicated blocks:   0 (0.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     0.0
 Corrupt blocks:                2776
 Missing replicas:              0
 Number of data-nodes:          5
 Number of racks:               1
FSCK ended at Thu May 07 21:56:18 EDT 2018 in 516 milliseconds

The filesystem under path '/' is CORRUPT 
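
--A minimal sketch for boiling a long fsck listing down to the affected paths, assuming the ": CORRUPT blockpool" line format shown above. The helper name corrupt_files is hypothetical. Each file is printed once, however many of its blocks are flagged.

```shell
# Read `hdfs fsck` output on stdin and print each path that has at least
# one ": CORRUPT blockpool ..." line, de-duplicated and sorted.
# corrupt_files is a hypothetical helper name, not part of Hadoop.
corrupt_files() {
  sed -n 's/: CORRUPT blockpool .*$//p' | sort -u
}

# Usage on a cluster node (assumes the hdfs client is on PATH):
#   hdfs fsck / | corrupt_files > /tmp/corrupt_files.txt
```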


--List only the files with corrupt blocks (sample output from a different, healthy cluster)
hdfs fsck / -list-corruptfileblocks

Connecting to namenode via http://ec2-80-80-80-80.compute-1.amazonaws.com:50070
The filesystem under path '/' has 0 CORRUPT files

--Show blocks and replica locations under a path (healthy example)
hdfs@ip-172-31-45-216:~$ hdfs fsck /user/hirw/input/stocks -files -blocks -locations

Connecting to namenode via http://ec2-80-80-80-80.compute-1.amazonaws.com:50070
FSCK started by hdfs (auth:SIMPLE) from /172.31.45.216 for path /user/hirw/input/stocks at Mon Sep 18 11:22:00 UTC 2017
/user/hirw/input/stocks <dir>
/user/hirw/input/stocks/stocks 428223209 bytes, 4 block(s):  OK
0. BP-2125152513-172.31.45.216-1410037307133:blk_1074178780_437980 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
1. BP-2125152513-172.31.45.216-1410037307133:blk_1074178781_437981 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
2. BP-2125152513-172.31.45.216-1410037307133:blk_1074178782_437982 len=134217728 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]
3. BP-2125152513-172.31.45.216-1410037307133:blk_1074178783_437983 len=25570025 Live_repl=3 [DatanodeInfoWithStorage[172.31.45.216:50010,DS-d0d9eb5c-f35f-4a12-bfdf-544085d693a3,DISK], DatanodeInfoWithStorage[172.31.46.124:50010,DS-fe24aecb-f56f-4c9c-8cf9-a3b1259bc0d0,DISK], DatanodeInfoWithStorage[172.31.45.217:50010,DS-7e4a4aef-39b9-4087-b0ff-3199c5b8b8bb,DISK]]

Status: HEALTHY


--Same check on a file whose 3 blocks are all missing
$ hdfs fsck /user/hadooptest/test1 -locations -blocks -files
FSCK started by hadoop (auth:SIMPLE) from /10-10-10-10 for path /user/hadooptest/test1 at Thu Dec 15 23:32:30 PST 2016
/user/hadooptest/test1 339281920 bytes, 3 block(s): 
/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741830

/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741831

/user/hadooptest/test1: CORRUPT blockpool BP-762523015-10-10-10-10-1480061879099 block blk_1073741832
 MISSING 3 blocks of total size 339281920 B
0. BP-762523015-10-10-10-10-1480061879099:blk_1073741830_1006 len=134217728 MISSING!
1. BP-762523015-10-10-10-10-1480061879099:blk_1073741831_1007 len=134217728 MISSING!
2. BP-762523015-10-10-10-10-1480061879099:blk_1073741832_1008 len=70846464 MISSING!
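
--A minimal sketch for totalling the per-block lines of a `-files -blocks -locations` listing, assuming each block line carries a "len=<bytes>" field and lost blocks end in "MISSING!" as above. The helper name block_totals is hypothetical.

```shell
# Read `hdfs fsck <path> -files -blocks -locations` output on stdin and
# print the number of block lines, how many are MISSING, and the summed
# block length in bytes. block_totals is a hypothetical helper name.
block_totals() {
  awk '
    / len=[0-9]+/ {
      blocks++
      if (match($0, /len=[0-9]+/))
        bytes += substr($0, RSTART + 4, RLENGTH - 4)   # digits after "len="
      if ($0 ~ /MISSING!/) missing++
    }
    END { printf "blocks=%d missing=%d bytes=%.0f\n", blocks, missing, bytes }
  '
}

# Usage on a cluster node (assumes the hdfs client is on PATH):
#   hdfs fsck /user/hadooptest/test1 -files -blocks -locations | block_totals
```

For the listing above the three lengths add up to 339281920 bytes, the same size fsck reports for the file, so every byte of the file sits in a missing block.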


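--Leave safe mode manually (run as the HDFS superuser, e.g. with sudo -u hdfs as above) and delete the corrupt file; its blocks are missing, so the data is lost unless a copy exists elsewhere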
hdfs dfsadmin -safemode leave

hadoop fs -rm /user/hadooptest/test1
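
--A minimal sketch for when more than one file has to go: it turns a list of paths (one per line) into `hadoop fs -rm` commands that are printed for review instead of being run. The helper name print_rm_commands is hypothetical, and the single-quoting assumes no path contains a single quote.

```shell
# Read HDFS paths on stdin (one per line) and print a `hadoop fs -rm`
# command for each, without executing anything.
# print_rm_commands is a hypothetical helper name, not part of Hadoop.
print_rm_commands() {
  while IFS= read -r path; do
    if [ -n "$path" ]; then
      # Single quotes keep characters such as "!" from being expanded by
      # an interactive shell; paths containing a single quote are not handled.
      printf "hadoop fs -rm '%s'\n" "$path"
    fi
  done
}

# Usage: review the generated commands before running any of them.
#   print_rm_commands < /tmp/corrupt_files.txt
```

Printing rather than executing leaves room to copy anything still readable off the cluster before the files are removed.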