Saturday, March 2, 2019

Troubleshooting - Namenode - Failure and Recovery

--Error in the namenode logs

EditLogInputStream caught exception initializing

<<few lines down>>

EditLogFileInputStream$LogHeaderCorruptException: Reached EOF when reading log header
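
To check whether a NameNode log contains the two messages quoted above, a small grep wrapper may help. This is a minimal bash sketch; the function name and the log path in the usage line are hypothetical, and the real log location depends on the installation.

```shell
# Print log lines (with line numbers) that match the edit-log corruption
# messages quoted above. Returns non-zero when nothing matches.
find_editlog_errors() {
  local log_file="$1"
  # -n prefixes each match with its line number; -E enables alternation.
  grep -nE 'EditLogInputStream caught exception initializing|LogHeaderCorruptException' "$log_file"
}
```

Usage, with a made-up path: find_editlog_errors /var/log/hadoop/namenode.log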



--hdfs-site.xml

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///disk1/dfs/name,file:///disk2/dfs/name,file:///disk3/dfs/name</value>
</property>
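
The recovery mode warning further down in this post asks for a backup of the edit log and fsimage before proceeding. A minimal bash sketch of one way to do that: split the dfs.namenode.name.dir value into local paths and archive each current/ directory. The function names, the backup location and the archive naming are assumptions for illustration, not part of Hadoop.

```shell
# Split a dfs.namenode.name.dir value into plain local paths, one per line.
# Accepts both "file:///path" URIs and bare "/path" entries.
name_dirs_from_value() {
  local value="$1" entry
  local IFS=','
  for entry in $value; do
    entry="${entry#file://}"   # drop the URI scheme if present
    printf '%s\n' "$entry"
  done
}

# Archive the current/ subdirectory of every name directory into backup_root.
# Prints the path of each archive written; directories without current/ are skipped.
backup_name_dirs() {
  local value="$1" backup_root="$2" dir archive
  mkdir -p "$backup_root" || return 1
  name_dirs_from_value "$value" | while IFS= read -r dir; do
    if [ ! -d "$dir/current" ]; then
      echo "skipping $dir (no current/ directory)" >&2
      continue
    fi
    # Turn /disk1/dfs/name into disk1_dfs_name for the archive file name.
    archive="$backup_root/$(printf '%s' "${dir#/}" | tr '/' '_').tar.gz"
    tar -czf "$archive" -C "$dir" current || return 1
    echo "$archive"
  done
}
```

Usage, with a hypothetical backup directory: backup_name_dirs "file:///disk1/dfs/name,/disk2/dfs/name,/disk3/dfs/name" /backup/namenode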





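--Copy the edit logs from a healthy name directory over the corrupt one (this assumes the copy under /disk1 is the damaged one and the copy under /disk2 is intact)

cp /disk2/dfs/name/current/edits* /disk1/dfs/name/current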
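
After the copy, it can be worth confirming that the files in the target directory are byte-for-byte identical to the source. A minimal bash sketch using cmp; the function name is hypothetical and the directories are passed in as arguments.

```shell
# Compare every edits* file in src_dir with the same-named file in dst_dir.
# Prints one line per file that differs or is missing in dst_dir, and
# returns non-zero if any such file was found.
verify_edits_copy() {
  local src_dir="$1" dst_dir="$2" src_file name status=0
  for src_file in "$src_dir"/edits*; do
    [ -e "$src_file" ] || continue   # the glob matched nothing
    name="$(basename "$src_file")"
    if ! cmp -s "$src_file" "$dst_dir/$name"; then
      echo "MISMATCH: $name"
      status=1
    fi
  done
  return "$status"
}
```

Usage for the directories above: verify_edits_copy /disk2/dfs/name/current /disk1/dfs/name/current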


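--Start the namenode in recovery mode (on Hadoop 2 and later, hdfs namenode -recover is the preferred form)

hadoop namenode -recover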



You have selected Metadata Recovery mode.  This mode is intended to recover
lost metadata on a corrupt filesystem.  Metadata recovery mode often
permanently deletes data from your HDFS filesystem.  Please back up your edit
log and fsimage before trying this!



Are you ready to proceed? (Y/N)

 (Y or N)



11:10:41,443 ERROR FSImage:147 - Error replaying edit log at offset 71.  Expected transaction ID was 3
Recent opcode offsets: 17 71
org.apache.hadoop.fs.ChecksumException: Transaction is corrupt. Calculated checksum is -1642375052 but read checksum -6897
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.validateChecksum(FSEditLogOp.java:2356)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.decodeOp(FSEditLogOp.java:2341)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogOp$Reader.readOp(FSEditLogOp.java:2247)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:110)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:74)
        at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:140)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:74)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:138)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:93)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:683)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:639)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:247)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:498)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:390)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:354)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.doRecovery(NameNode.java:1033)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1103)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1164)
11:10:41,444 ERROR MetaRecoveryContext:96 - We failed to read txId 3

11:10:41,444  INFO MetaRecoveryContext:64 -

Enter 'c' to continue, skipping the bad section in the log

Enter 's' to stop reading the edit log here, abandoning any later edits

Enter 'q' to quit without saving

Enter 'a' to always select the first choice in the future without prompting. (c/s/q/a)



12:22:38,829  INFO MetaRecoveryContext:105 - Continuing.

12:22:38,860 ERROR MetaRecoveryContext:96 - There appears to be a gap in the edit log.  We expected txid 3, but got txid 4.

12:22:38,860  INFO MetaRecoveryContext:64 -

Enter 'c' to continue, ignoring missing  transaction IDs

Enter 's' to stop reading the edit log here, abandoning any later edits

Enter 'q' to quit without saving

Enter 'a' to always select the first choice in the future without prompting. (c/s/q/a)



12:22:42,205  INFO MetaRecoveryContext:105 - Continuing.

12:22:42,207  INFO FSEditLogLoader:199 - replaying edit log: 4/5 transactions completed. (80%)

12:22:42,208  INFO FSImage:95 - Edits file /opt/hadoop/run4/name1/current/edits_0000000000000000001-0000000000000000005 of size 1048580 edits # 4 loaded in 4 seconds.

12:22:42,212  INFO FSImage:504 - Saving image file /opt/hadoop/run4/name2/current/fsimage.ckpt_0000000000000000005 using no compression

12:22:42,213  INFO FSImage:504 - Saving image file /opt/hadoop/run4/name1/current/fsimage.ckpt_000000000000000