About the author: an ops engineer whose résumé doesn't claim to be "proficient" in anything. The mind map below shows the planned content and current progress (updated from time to time).
Middleware, as I define it, is the software that a given business function depends on. It includes:
Web servers
Proxy servers
ZooKeeper
Kafka
RabbitMQ
Hadoop HDFS (this chapter)
In the previous section we deployed a 3-node HDFS high-availability cluster; in this section we look at how each component actually achieves high availability.
NameNode
[root@node1 ~]# hdfs haadmin -getAllServiceState
node1:8020 active
node2:8020 standby
[root@node1 ~]#
Our deployment runs two NameNodes, but only the one in the active state serves client requests; the other is a standby. If we now stop the NameNode on node1, the ZKFC (ZKFailoverController) promotes the NameNode on node2 to active, and when the failed NameNode on node1 recovers, it rejoins as standby.
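Stopping the active NameNode is a one-liner; a minimal sketch, assuming the Hadoop 3 "hdfs --daemon" syntax (on Hadoop 2.x this would be hadoop-daemon.sh stop namenode):
# run on node1, which currently holds the active role
[root@node1 ~]# hdfs --daemon stop namenode
The ZKFC log then records the promotion of node2: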
2025-07-11 00:41:46,909 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at node2/192.168.31.162:8020 active...
2025-07-11 00:41:47,570 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at node2/192.168.31.162:8020 to active state
The output below shows the state while nn1 has not yet recovered: nn2 has become active.
[root@node1 ~]# hdfs haadmin -getAllServiceState
25/07/11 00:46:43 INFO ipc.Client: Retrying connect to server: node1/192.168.31.161:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
node1:8020 Failed to connect: Call From node1/192.168.31.161 to node1:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
node2:8020 active
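For planned maintenance you can also hand the active role over gracefully instead of killing a process. A sketch, assuming the logical NameNode IDs configured in dfs.ha.namenodes.<nameservice> are nn1 (node1) and nn2 (node2); substitute the IDs from your own hdfs-site.xml:
# bring the stopped NameNode on node1 back, then fail over from nn2 to nn1
[root@node1 ~]# hdfs --daemon start namenode
[root@node1 ~]# hdfs haadmin -failover nn2 nn1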
JournalNode
In our architecture there are three JournalNode processes. How do they achieve high availability? The log below traces a write operation.
2025-07-11 01:11:08,942 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 26 58
2025-07-11 01:11:09,054 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=192.168.31.161:50010, 192.168.31.163:50010, 192.168.31.162:50010 for /test1/core-site.xml._COPYING_
2025-07-11 01:11:09,387 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741825_1001 is COMMITTED but not COMPLETE(numNodes= 0 < minimum = 1) in file /test1/core-site.xml._COPYING_
2025-07-11 01:11:09,794 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /test1/core-site.xml._COPYING_ is closed by DFSClient_NONMAPREDUCE_-277858816_1
2025-07-11 01:11:24,383 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.31.161
2025-07-11 01:11:24,383 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2025-07-11 01:11:24,383 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 171, 177
2025-07-11 01:11:24,424 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 8 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 1 Number of syncs: 7 SyncTimes(ms): 174 142
2025-07-11 01:11:24,429 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data/hadoop/nn/current/edits_inprogress_0000000000000000171 -> /data/hadoop/nn/current/edits_0000000000000000171-0000000000000000178
2025-07-11 01:11:24,429 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 179
1. A new file is uploaded and a block is allocated for it: allocate blk_1073741825_1001.
2. The block is written as 3 replicas to 3 nodes, under a temporary name (the original file name with a ._COPYING_ suffix).
3. The write is in progress: the block is COMMITTED but not yet COMPLETE, because the minimum of 1 live replica has not been reached (this minimum is configurable, e.g. to 2).
4. The write succeeds; the temporary file is renamed to its final name and closed.
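The write above most likely came from a plain put; a sketch that reproduces it (the source path under $HADOOP_HOME is an assumption, while the target /test1 matches the log):
# create the target directory and upload a small config file into it
[root@node1 ~]# hdfs dfs -mkdir -p /test1
[root@node1 ~]# hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /test1/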
That is the normal write path. Next, two of the three JournalNodes are stopped: a majority can no longer be reached, so the edit-log write fails. This shows how JournalNodes achieve high availability: run multiple nodes and require a majority (quorum) of them to acknowledge every write.
2025-07-11 01:34:55,243 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.31.161:8485, 192.168.31.162:8485, 192.168.31.163:8485], stream=QuorumOutputStream starting at txid 206))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 1 successful responses:
192.168.31.163:8485: null [success]
2 exceptions thrown:
192.168.31.162:8485: Call From node1/192.168.31.161 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.161:8485: Call From node1/192.168.31.161 to node1:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
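The quorum in this error comes from the QJM shared-edits setting: both NameNodes write every edit to all three JournalNodes and need a majority (2 of 3 here) to acknowledge before the write counts. The ensemble can be read back from the running configuration; a sketch (the hosts and ports mirror the error above, while the trailing nameservice segment, mycluster here, is cluster-specific):
[root@node1 ~]# hdfs getconf -confKey dfs.namenode.shared.edits.dir
qjournal://node1:8485;node2:8485;node3:8485/mycluster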
DataNode
The DataNodes are where the data actually lives, so their availability is what real high availability comes down to. HDFS protects data with 3-way replication: each piece of data (a block, in HDFS terms) is stored as three copies, so losing one copy still leaves two and the data remains available, while the NameNode re-replicates the block back to three in the background.
For the demonstration below I picked a very small file that occupies a single block.
[root@node1 ~]# hdfs fsck /test1/core-site.xml -files -blocks -locations
Connecting to namenode via http://192.168.31.161:9870/fsck?ugi=root&files=1&blocks=1&locations=1&path=%2Ftest1%2Fcore-site.xml
FSCK started by root (auth:SIMPLE) from /192.168.31.161 for path /test1/core-site.xml at Fri Jul 11 01:50:48 CST 2025
/test1/core-site.xml 1148 bytes, 1 block(s): OK
0. BP-1905858294-192.168.31.161-1752155830873:blk_1073741825_1001 len=1148 Live_repl=3
[DatanodeInfoWithStorage[192.168.31.163:50010,DS-01219cbd-15f4-47c9-bba4-0b7fb22c1721,DISK],
DatanodeInfoWithStorage[192.168.31.162:50010,DS-c5f8264e-2571-420b-9dd6-bd62bf8f56da,DISK],
DatanodeInfoWithStorage[192.168.31.161:50010,DS-3599cf97-4c6f-4c09-8064-300d518e789f,DISK]]
Status: HEALTHY
Total size: 1148 B
Total dirs: 0
Total files: 1
Total symlinks: 0
Total blocks (validated): 1 (avg. block size 1148 B)
Minimally replicated blocks: 1 (100.0 %)
Over-replicated blocks: 0 (0.0 %)
Under-replicated blocks: 0 (0.0 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor: 3
Average block replication: 3.0
Corrupt blocks: 0
Missing replicas: 0 (0.0 %)
Number of data-nodes: 3
Number of racks: 1
FSCK ended at Fri Jul 11 01:50:49 CST 2025 in 7 milliseconds
The filesystem under path '/test1/core-site.xml' is HEALTHY
What this shows is that the file's single block is stored on all three DataNodes:
DatanodeInfoWithStorage[192.168.31.163:50010,DS-01219cbd-15f4-47c9-bba4-0b7fb22c1721,DISK],
DatanodeInfoWithStorage[192.168.31.162:50010,DS-c5f8264e-2571-420b-9dd6-bd62bf8f56da,DISK],
DatanodeInfoWithStorage[192.168.31.161:50010,DS-3599cf97-4c6f-4c09-8064-300d518e789f,DISK]
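The replication factor can also be inspected and changed per file with standard hdfs commands; a sketch reusing the file from the fsck output above (exact output lines may vary by version):
# the cluster-wide default, matching the fsck report above
[root@node1 ~]# hdfs getconf -confKey dfs.replication
3
# lower this one file to 2 replicas, wait for it to take effect, then re-check
[root@node1 ~]# hdfs dfs -setrep -w 2 /test1/core-site.xml
[root@node1 ~]# hdfs fsck /test1/core-site.xml -files -blocks -locations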