前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >Kafka异常复盘NotLeaderForPartitionException【实战笔记】

Kafka异常复盘NotLeaderForPartitionException【实战笔记】

作者头像
瓜农老梁
发布2019-08-30 14:45:35
8.3K0
发布2019-08-30 14:45:35
举报
文章被收录于专栏:瓜农老梁
一、情况分析

1.客户端异常报警

晚上10点20分接到使用方电话,日志持续报以下异常,持续时间已有10多分钟。

代码语言:javascript
复制
ERROR 2019-05-15 23:05:23,221 [kafka-producer-network-thread | producer-1] A failure occurred sending a message to Kafka.
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
WARN 2019-05-15 23:05:23,237 [main] Error sending event to listener kafkahandler, status: ABEND, event: Commit transaction

2.集群状况查看

监控及Kafka Manager显示节点数量正常。问题集群有6个节点,3个副本。查看日志发现,都在大量输出选举的日志,日志中暂时没有发现明显的ERROR和FATAL日志。类似内容如下:

代码语言:javascript
复制
[2019-05-15 22:03:49,511] INFO Rolled new log segment for 'order-KSTREAM-AGGREGATE-STATE-STORE-0000000002-repartition-196' in 1 ms. (kafka.log.Log)
[2019-05-15 22:20:13,880] INFO Updated PartitionLeaderEpoch. New: {epoch:1, offset:5910507}, Current: {epoch:0, offset4932329} for Partition: zto_big_marks_mq-46. Cache now contains 1 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-15 22:20:13,881] INFO Updated PartitionLeaderEpoch. New: {epoch:78, offset:1070147099}, Current: {epoch:77, offset1041132585} for Partition: zto_scan_rec-11. Cache now contains 1 entries. 

3.GC日志查看 通过查看各个节点GC日志发现,其中一台节点GC回收异常,堆内存回后,只能回收600来M的空间。应该可以初步确定为问题节点。 日志内容如下:

代码语言:javascript
复制
2019-05-15T23:51:14.885+0800: 21994518.023: [GC pause (G1 Evacuation Pause) (young), 0.0749305 secs]
   [Parallel Time: 70.4 ms, GC Workers: 33]
      [GC Worker Start (ms): Min: 21994518023.0, Avg: 21994518023.5, Max: 21994518023.9, Diff: 0.9]
      [Ext Root Scanning (ms): Min: 14.9, Avg: 15.7, Max: 17.1, Diff: 2.2, Sum: 518.0]
      [Update RS (ms): Min: 3.4, Avg: 5.0, Max: 7.8, Diff: 4.5, Sum: 166.5]
         [Processed Buffers: Min: 1, Avg: 1.7, Max: 7, Diff: 6, Sum: 57]
      [Scan RS (ms): Min: 0.0, Avg: 2.1, Max: 3.7, Diff: 3.7, Sum: 69.4]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 45.9, Avg: 46.6, Max: 47.2, Diff: 1.3, Sum: 1536.2]
      [Termination (ms): Min: 0.0, Avg: 0.1, Max: 0.1, Diff: 0.1, Sum: 3.3]
         [Termination Attempts: Min: 1, Avg: 38.3, Max: 52, Diff: 51, Sum: 1263]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 2.8]
      [GC Worker Total (ms): Min: 69.1, Avg: 69.6, Max: 70.1, Diff: 1.0, Sum: 2296.1]
      [GC Worker End (ms): Min: 21994518093.0, Avg: 21994518093.1, Max: 21994518093.2, Diff: 0.2]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.0 ms]
   [Other: 3.4 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.5 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.9 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.1 ms]
      [Free CSet: 0.5 ms]
   [Eden: 380.0M(380.0M)->0.0B(356.0M) Survivors: 28.0M->52.0M Heap: 5649.6M(8192.0M)->5466.5M(8192.0M)]
 [Times: user=2.31 sys=0.00, real=0.07 secs]
2019-05-15T23:51:44.433+0800: 21994547.571: [GC pause (G1 Evacuation Pause) (mixed), 0.1036003 secs]
   [Parallel Time: 81.6 ms, GC Workers: 33]
      [GC Worker Start (ms): Min: 21994547572.2, Avg: 21994547573.2, Max: 21994547574.3, Diff: 2.1]
      [Ext Root Scanning (ms): Min: 20.8, Avg: 24.4, Max: 80.8, Diff: 60.1, Sum: 805.5]
      [Update RS (ms): Min: 0.0, Avg: 2.0, Max: 3.4, Diff: 3.4, Sum: 65.7]
         [Processed Buffers: Min: 0, Avg: 2.0, Max: 5, Diff: 5, Sum: 65]
      [Scan RS (ms): Min: 0.0, Avg: 10.1, Max: 10.9, Diff: 10.9, Sum: 334.2]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 33.4, Max: 35.2, Diff: 35.2, Sum: 1102.0]
      [Termination (ms): Min: 0.0, Avg: 10.3, Max: 10.7, Diff: 10.7, Sum: 338.8]
         [Termination Attempts: Min: 1, Avg: 26.2, Max: 38, Diff: 37, Sum: 865]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.4, Diff: 0.3, Sum: 2.8]
      [GC Worker Total (ms): Min: 79.2, Avg: 80.3, Max: 81.2, Diff: 2.1, Sum: 2649.0]
      [GC Worker End (ms): Min: 21994547653.4, Avg: 21994547653.5, Max: 21994547653.5, Diff: 0.1]
   [Code Root Fixup: 0.1 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.7 ms]
   [Other: 20.3 ms]
      [Choose CSet: 0.2 ms]
      [Ref Proc: 16.5 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.8 ms]
      [Humongous Register: 0.2 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.9 ms]
   [Eden: 356.0M(356.0M)->0.0B(364.0M) Survivors: 52.0M->44.0M Heap: 5822.5M(8192.0M)->5240.1M(8192.0M)]
 [Times: user=2.69 sys=0.00, real=0.10 secs]
二、故障恢复

1.问题节点下线

确定为问题节点后,先将该节点下线。由于三个副本,所以对可用性不能造成影响。在该节点下线后,客户端报警解除,恢复正常。

2.问题节点恢复

问题节点为正常下线(kill -s TERM), 当时在恢复过程中出现卡顿,其中一个分区卡顿长达20多分钟没有继续往下走。 尝试将该问题分区移除日志目录,重新启动,该topic又出现了其他问题的卡顿。索性将该topic的所有分区全部移除日志目录。但是依然有其他topic的分区卡顿不前,在反复近10次重启后,依然没有恢复该节点。最后使用空目录启动,节点恢复。

三、问题复盘

1.有几个问题需要澄清:

1.配置了GC回收短信告警,而告警没有发出 2.什么原因导致频繁的Leader选举 3.GC为何不能正常回收 4.为什么正常关机而重新启动后不能正常恢复

故障后的第二天,需要对以上问题作出分析。

2.查看问题节点server日志

发现Unable to reconnect to ZooKeeper service出现频率非常高,没隔1到2分钟就会打印一次。

代码语言:javascript
复制
[2019-05-15 22:20:14,397] INFO Opening socket connection to server 192.168.x.x/192.168.x.x2181. Will not attempt to authenticate using SASL (unknown error) (org.apache.zookeeper.ClientCnxn)
[2019-05-15 22:20:14,398] INFO Socket connection established to 192.168.x.x/192.168.x.x2181, initiating session (org.apache.zookeeper.ClientCnxn)
[2019-05-15 22:20:14,399] WARN Unable to reconnect to ZooKeeper service, session 0x167772767ad92fd has expired (org.apache.zookeeper.ClientCnxn)
[2019-05-15 22:20:14,399] INFO zookeeper state changed (Expired) (org.I0Itec.zkclient.ZkClient)
[2019-05-15 22:20:14,399] INFO Updated PartitionLeaderEpoch. New: {epoch:21, offset:248090181}, Current: {epoch:20, offset234983013} for Partition: ZTO_BALANCE_SITE_LIST-7. Cache now contains 1 entries. (kafka.server.epoch.LeaderEpochFileCache)
[2019-05-15 22:20:14,399] INFO Unable to reconnect to ZooKeeper service, session 0x167772767ad92fd has expired, closing socket connection (org.apache.zookeeper.ClientCnxn)
[2019-05-15 22:20:14,399] INFO Initiating client connection, connectString=192.168.x.x:2181,192.168.x.x:2181,192.168.x.x:2181 sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@50a638b5 (org.apache.zookeeper.ZooKeeper)

3.查看zk日志

代码语言:javascript
复制
2019-05-15 22:49:58,992 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@942] - Client attempting to establish new session at /192.168.x.x:20482
2019-05-15 22:49:58,993 [myid:3] - INFO  [CommitProcessor:3:ZooKeeperServer@687] - Established session 0x367772767999305 with negotiated timeout 6000 for client /192.168.x.x:20482
2019-05-15 22:49:58,994 [myid:3] - INFO  [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x367772767999305 type:create cxid:0x3 zxid:0x204ae6e7d txntype:-1 reqpath:n/a Error Path:/brokers Error:KeeperErrorCode = NodeExists for /brokers
2019-05-15 22:49:58,994 [myid:3] - INFO  [ProcessThread(sid:3 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x367772767999305 type:create cxid:0x4 zxid:0x204ae6e7e txntype:-1 reqpath:n/a Error Path:/brokers/ids Error:KeeperErrorCode = NodeExists for /brokers/ids
2019-05-15 22:50:01,633 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /127.0.0.1:14379
2019-05-15 22:50:01,636 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@942] - Client attempting to establish new session at /127.0.0.1:14379
2019-05-15 22:50:01,637 [myid:3] - INFO  [CommitProcessor:3:ZooKeeperServer@687] - Established session 0x367772767999306 with negotiated timeout 30000 for client /127.0.0.1:14379
2019-05-15 22:50:01,958 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@368] - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x367772767999306, likely client has closed socket
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
    at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
    at java.lang.Thread.run(Thread.java:745)

4.查看GC日志

代码语言:javascript
复制
2019-05-15T22:25:59.422+0800: 21989402.560: [Full GC (Allocation Failure)  8071M->4285M(8192M), 10.3987109 secs]
   [Eden: 0.0B(408.0M)->0.0B(408.0M) Survivors: 0.0B->0.0B Heap: 8071.2M(8192.0M)->4285.7M(8192.0M)], [Metaspace: 32321K->32321K(1077248K)]
 [Times: user=21.74 sys=0.00, real=10.39 secs]
2019-05-15T22:26:09.824+0800: 21989412.962: [GC concurrent-mark-abort]
2019-05-15T22:26:09.915+0800: 21989413.053: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0274477 secs]
   [Parallel Time: 22.3 ms, GC Workers: 33]
      [GC Worker Start (ms): Min: 21989413054.9, Avg: 21989413055.4, Max: 21989413055.8, Diff: 0.9]
      [Ext Root Scanning (ms): Min: 13.3, Avg: 14.2, Max: 15.2, Diff: 1.9, Sum: 468.0]
      [Update RS (ms): Min: 0.7, Avg: 1.2, Max: 2.1, Diff: 1.3, Sum: 40.4]
         [Processed Buffers: Min: 1, Avg: 2.7, Max: 6, Diff: 5, Sum: 89]
      [Scan RS (ms): Min: 0.0, Avg: 0.3, Max: 0.6, Diff: 0.6, Sum: 8.6]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 4.6, Avg: 5.8, Max: 6.6, Diff: 2.0, Sum: 191.1]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.4]
         [Termination Attempts: Min: 1, Avg: 1.4, Max: 3, Diff: 2, Sum: 46]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.2, Diff: 0.2, Sum: 2.9]
      [GC Worker Total (ms): Min: 21.1, Avg: 21.6, Max: 22.0, Diff: 0.9, Sum: 711.5]
      [GC Worker End (ms): Min: 21989413076.9, Avg: 21989413076.9, Max: 21989413077.0, Diff: 0.2]
   [Code Root Fixup: 0.2 ms]
   [Code Root Purge: 0.1 ms]
   [Clear CT: 0.5 ms]
   [Other: 4.3 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.2 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.9 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.3 ms]
   [Eden: 192.0M(408.0M)->0.0B(396.0M) Survivors: 0.0B->12.0M Heap: 4476.9M(8192.0M)->4295.3M(8192.0M)]
 [Times: user=0.73 sys=0.00, real=0.03 secs]
 
 2019-05-15T22:26:49.312+0800: 21989452.450: [Full GC (Allocation Failure)  8056M->3891M(8192M), 10.2331784 secs]
   [Eden: 0.0B(408.0M)->0.0B(408.0M) Survivors: 0.0B->0.0B Heap: 8056.7M(8192.0M)->3891.8M(8192.0M)], [Metaspace: 32321K->32321K(1077248K)]
 [Times: user=21.25 sys=0.00, real=10.23 secs]
2019-05-15T22:26:59.548+0800: 21989462.686: [GC concurrent-mark-abort]
2019-05-15T22:26:59.554+0800: 21989462.692: [GC pause (G1 Humongous Allocation) (young) (initial-mark), 0.0472311 secs]
   [Parallel Time: 41.6 ms, GC Workers: 33]
      [GC Worker Start (ms): Min: 21989462693.5, Avg: 21989462693.8, Max: 21989462694.2, Diff: 0.7]
      [Ext Root Scanning (ms): Min: 14.4, Avg: 15.6, Max: 16.7, Diff: 2.4, Sum: 515.6]
      [Update RS (ms): Min: 0.4, Avg: 1.2, Max: 5.1, Diff: 4.7, Sum: 38.4]
         [Processed Buffers: Min: 1, Avg: 2.5, Max: 8, Diff: 7, Sum: 82]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 19.3, Avg: 23.8, Max: 25.1, Diff: 5.8, Sum: 783.9]
      [Termination (ms): Min: 0.0, Avg: 0.2, Max: 0.2, Diff: 0.2, Sum: 5.8]
         [Termination Attempts: Min: 1, Avg: 72.2, Max: 93, Diff: 92, Sum: 2384]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 1.6]
      [GC Worker Total (ms): Min: 40.4, Avg: 40.8, Max: 41.1, Diff: 0.7, Sum: 1345.4]
      [GC Worker End (ms): Min: 21989462734.6, Avg: 21989462734.6, Max: 21989462734.6, Diff: 0.1]
   [Code Root Fixup: 0.3 ms]
   [Code Root Purge: 0.1 ms]
   [Clear CT: 0.8 ms]
   [Other: 4.6 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.6 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 1.1 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.1 ms]
   [Eden: 40.0M(408.0M)->0.0B(388.0M) Survivors: 0.0B->20.0M Heap: 3933.8M(8192.0M)->3912.6M(8192.0M)]

备注: 频繁Full GC, 间隔在1到2分钟,每次Full GC的时间长达10秒以上。 查看server和zk日志,发现问题节点在频繁的连接zk,导致partition leader的频繁选举。

4.监控查看

问题节点堆内存使用持续居高:

问题节点老年代使用持续居高:

回收时间明显增加:

回收次数明显增加:

五、复盘总结

1.由于堆内存不能正常回收,导致频繁(1到2分钟)的Full GC. 2.每次Full GC长达10秒以上,导致连接zk的session过期(默认6秒) 3.Session的过期导致问题节点频繁重连zk集群,造成频繁Partition Leader选举

六、未尽事项

1.由于未dump当时的堆照,分析堆内存异常原因比较困难,如果调查三天来有哪些业务使用造成的,也比较困难。

2.恢复GC短信告警,发生时问题时告警短信异常。

3.除了增加GC回收告警外,增加Full GC告警

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2019-08-27,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 瓜农老梁 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 一、情况分析
  • 二、故障恢复
  • 三、问题复盘
  • 五、复盘总结
  • 六、未尽事项
相关产品与服务
短信
腾讯云短信(Short Message Service,SMS)可为广大企业级用户提供稳定可靠,安全合规的短信触达服务。用户可快速接入,调用 API / SDK 或者通过控制台即可发送,支持发送验证码、通知类短信和营销短信。国内验证短信秒级触达,99%到达率;国际/港澳台短信覆盖全球200+国家/地区,全球多服务站点,稳定可靠。
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档