Hadoop HDFS: High-Availability Cluster Verification

Author: 运维小路 · Published 2025-07-24

About the author: an operations engineer whose résumé claims no "expert-level" skill. The mind map below lays out the planned content and current progress (updated from time to time).

Middleware, as I define it, is the software that a given business function depends on. It covers the following:

Web servers

Proxy servers

ZooKeeper

Kafka

RabbitMQ

Hadoop HDFS (this chapter)

In the previous section we deployed a 3-node HDFS high-availability cluster. In this section we look at how each component actually achieves high availability.

NameNode

[root@node1 ~]# hdfs haadmin -getAllServiceState
node1:8020                                         active    
node2:8020                                         standby   
[root@node1 ~]# 

Our deployment has two NameNodes (nn), and only the one in the active state serves client requests; the other is a standby. If we now stop the NameNode on node1, the ZKFC promotes the NameNode on node2 to active; when the failed NameNode on node1 comes back, it rejoins as standby.
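A minimal sketch of triggering this failover by hand. It assumes the Hadoop 2.x daemon scripts (on 3.x use hdfs --daemon stop namenode) and NameNode serviceIds of nn1/nn2; check dfs.ha.namenodes in hdfs-site.xml for the names your cluster actually uses.

[root@node1 ~]# hadoop-daemon.sh stop namenode      # simulate a failure of the active NN
[root@node1 ~]# hdfs haadmin -getServiceState nn2   # within a few seconds should print: active

The ZKFC log on node2 then records the transition: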

2025-07-11 00:41:46,909 INFO org.apache.hadoop.ha.ZKFailoverController: Trying to make NameNode at node2/192.168.31.162:8020 active...
2025-07-11 00:41:47,570 INFO org.apache.hadoop.ha.ZKFailoverController: Successfully transitioned NameNode at node2/192.168.31.162:8020 to active state

The output below shows the state while nn1 is still down and nn2 has become active.

[root@node1 ~]# hdfs haadmin -getAllServiceState
25/07/11 00:46:43 INFO ipc.Client: Retrying connect to server: node1/192.168.31.161:8020. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=1, sleepTime=1000 MILLISECONDS)
node1:8020                                         Failed to connect: Call From node1/192.168.31.161 to node1:8020 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
node2:8020                                         active  
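Behind the scenes the election state lives in ZooKeeper. As a sketch, assuming the nameservice ID is mycluster (a placeholder; substitute the value of dfs.nameservices from your hdfs-site.xml), the znodes created by the ZKFC can be inspected with the ZooKeeper CLI:

[root@node1 ~]# zkCli.sh -server node1:2181
ls /hadoop-ha/mycluster                              # expect: ActiveBreadCrumb, ActiveStandbyElectorLock
get /hadoop-ha/mycluster/ActiveStandbyElectorLock    # payload names the NameNode currently holding the lock

ActiveStandbyElectorLock is an ephemeral znode: it disappears when the active NameNode's ZKFC session dies, which is what allows the other ZKFC to grab the lock and promote its NameNode.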

JournalNode

In our architecture there are three JournalNode processes. How do they achieve high availability? The log below shows a write operation.

2025-07-11 01:11:08,942 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 2 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 0 Number of syncs: 1 SyncTimes(ms): 26 58 
2025-07-11 01:11:09,054 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocate blk_1073741825_1001, replicas=192.168.31.161:50010, 192.168.31.163:50010, 192.168.31.162:50010 for /test1/core-site.xml._COPYING_
2025-07-11 01:11:09,387 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: BLOCK* blk_1073741825_1001 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 1) in file /test1/core-site.xml._COPYING_
2025-07-11 01:11:09,794 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /test1/core-site.xml._COPYING_ is closed by DFSClient_NONMAPREDUCE_-277858816_1
2025-07-11 01:11:24,383 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Roll Edit Log from 192.168.31.161
2025-07-11 01:11:24,383 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Rolling edit logs
2025-07-11 01:11:24,383 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Ending log segment 171, 177
2025-07-11 01:11:24,424 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Number of transactions: 8 Total time for transactions(ms): 2 Number of transactions batched in Syncs: 1 Number of syncs: 7 SyncTimes(ms): 174 142 
2025-07-11 01:11:24,429 INFO org.apache.hadoop.hdfs.server.namenode.FileJournalManager: Finalizing edits file /data/hadoop/nn/current/edits_inprogress_0000000000000000171 -> /data/hadoop/nn/current/edits_0000000000000000171-0000000000000000178
2025-07-11 01:11:24,429 INFO org.apache.hadoop.hdfs.server.namenode.FSEditLog: Starting log segment at 179

1. A new file is uploaded and a block is allocated for it: allocate blk_1073741825_1001.

2. The block is written as 3 replicas to 3 nodes under a temporary name (the original file name with a ._COPYING_ suffix).

3. Writing is in progress but the minimum-replica requirement of 1 is not yet met (this minimum can also be configured to 2).

4. The write succeeds; the temporary file is renamed to its final name and closed. (A sketch of reproducing this flow follows.)
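The log above came from a plain upload. A minimal sketch of reproducing it; the local source path is an assumption, and any small file works:

[root@node1 ~]# hdfs dfs -mkdir -p /test1
[root@node1 ~]# hdfs dfs -put $HADOOP_HOME/etc/hadoop/core-site.xml /test1/   # visible as /test1/core-site.xml._COPYING_ while in flight
[root@node1 ~]# hdfs dfs -ls /test1/                                          # the final name appears once the file is closed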

That was a normal write. Next, two of the three JournalNodes are stopped, so a majority can no longer be reached and the write fails. This shows how JournalNodes achieve high availability: run several of them and require a majority (a quorum) to acknowledge every edit-log write.
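A sketch of how this failure was induced (the stop command is hadoop-daemon.sh stop journalnode on Hadoop 2.x, hdfs --daemon stop journalnode on 3.x). With N = 3 JournalNodes the write quorum is floor(N/2) + 1 = 2, so losing two of three makes every edit-log write fail:

[root@node1 ~]# hadoop-daemon.sh stop journalnode   # first JN down: a quorum of 2 is still reachable
[root@node2 ~]# hadoop-daemon.sh stop journalnode   # second JN down: only 1 of 3 left, quorum lost

Any subsequent metadata change (an upload, a rename, an edit-log roll) now fails, and the active NameNode logs the FATAL error below and shuts itself down rather than continue without a durable journal: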

2025-07-11 01:34:55,243 FATAL org.apache.hadoop.hdfs.server.namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [192.168.31.161:8485, 192.168.31.162:8485, 192.168.31.163:8485], stream=QuorumOutputStream starting at txid 206))
org.apache.hadoop.hdfs.qjournal.client.QuorumException: Got too many exceptions to achieve quorum size 2/3. 1 successful responses:
192.168.31.163:8485: null [success]
2 exceptions thrown:
192.168.31.162:8485: Call From node1/192.168.31.161 to node2:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
192.168.31.161:8485: Call From node1/192.168.31.161 to node1:8485 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused

DataNode

The DataNode is where the data actually lives, so only its availability makes the cluster truly highly available. HDFS protects data through replication, 3 replicas by default: each piece of data (a block, in HDFS terms) is stored 3 times, and as long as 2 copies remain we consider the data available.
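A quick sketch of inspecting these settings; the property names are the standard HDFS ones:

[root@node1 ~]# hdfs getconf -confKey dfs.replication                 # default replication factor: 3
[root@node1 ~]# hdfs getconf -confKey dfs.namenode.replication.min    # the "minimum = 1" seen in the write log above
[root@node1 ~]# hdfs dfs -setrep -w 3 /test1/core-site.xml            # set (and wait for) 3 replicas on an existing file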

For the demonstration below I picked a very small file that occupies only a single block.

[root@node1 ~]# hdfs fsck /test1/core-site.xml -files -blocks -locations
Connecting to namenode via http://192.168.31.161:9870/fsck?ugi=root&files=1&blocks=1&locations=1&path=%2Ftest1%2Fcore-site.xml
FSCK started by root (auth:SIMPLE) from /192.168.31.161 for path /test1/core-site.xml at Fri Jul 11 01:50:48 CST 2025
/test1/core-site.xml 1148 bytes, 1 block(s):  OK
0. BP-1905858294-192.168.31.161-1752155830873:blk_1073741825_1001 len=1148 Live_repl=3
 [DatanodeInfoWithStorage[192.168.31.163:50010,DS-01219cbd-15f4-47c9-bba4-0b7fb22c1721,DISK], 
 DatanodeInfoWithStorage[192.168.31.162:50010,DS-c5f8264e-2571-420b-9dd6-bd62bf8f56da,DISK], 
 DatanodeInfoWithStorage[192.168.31.161:50010,DS-3599cf97-4c6f-4c09-8064-300d518e789f,DISK]]

Status: HEALTHY
 Total size:	1148 B
 Total dirs:	0
 Total files:	1
 Total symlinks:		0
 Total blocks (validated):	1 (avg. block size 1148 B)
 Minimally replicated blocks:	1 (100.0 %)
 Over-replicated blocks:	0 (0.0 %)
 Under-replicated blocks:	0 (0.0 %)
 Mis-replicated blocks:		0 (0.0 %)
 Default replication factor:	3
 Average block replication:	3.0
 Corrupt blocks:		0
 Missing replicas:		0 (0.0 %)
 Number of data-nodes:		3
 Number of racks:		1
FSCK ended at Fri Jul 11 01:50:49 CST 2025 in 7 milliseconds


The filesystem under path '/test1/core-site.xml' is HEALTHY

In other words, this file's block is stored on three separate nodes:

DatanodeInfoWithStorage[192.168.31.163:50010,DS-01219cbd-15f4-47c9-bba4-0b7fb22c1721,DISK], 
DatanodeInfoWithStorage[192.168.31.162:50010,DS-c5f8264e-2571-420b-9dd6-bd62bf8f56da,DISK], 
DatanodeInfoWithStorage[192.168.31.161:50010,DS-3599cf97-4c6f-4c09-8064-300d518e789f,DISK]
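To close the loop, a sketch of verifying this (the daemon command again depends on the Hadoop version): stop one DataNode and confirm the file is still readable from the surviving replicas.

[root@node3 ~]# hadoop-daemon.sh stop datanode      # take one of the three replicas offline

[root@node1 ~]# hdfs dfs -cat /test1/core-site.xml > /dev/null && echo OK   # still readable: OK
[root@node1 ~]# hdfs fsck /test1/core-site.xml -files -blocks -locations

fsck keeps reporting three live replicas until the NameNode marks the node dead (about 10.5 minutes with default heartbeat settings), after which the block is re-replicated to another DataNode to restore the replication factor.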