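k8s-java-thread-dumper is an open-source tool designed to capture thread pool information from Java applications under high load in k8s environments, to help with troubleshooting and remediation.
In the earlier article "Open Source! Automatically Printing Java Thread Stacks When a Pod Is Under High Load", I released the first version of k8s-java-thread-dumper and received plenty of feedback and suggestions through WeChat and GitHub issues. Several of the better suggestions are implemented in this new release; the feature overview and usage instructions follow.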
01. Improvements
The features added in this release are shown in orange below:
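02. Workflow
Integrated with Grafana alerting and combined with Alibaba's arthas, the tool captures the stacks of threads with high CPU usage.
The overall flow is as follows:
Integrated with Prometheus Alertmanager alerting and combined with Alibaba's arthas, the tool captures the stacks of threads with high CPU usage.
The overall flow is as follows: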
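As a rough illustration of the capture step, the sketch below builds a `kubectl exec` command that runs the arthas `thread -n` command (which lists the busiest threads with their stacks) through the `-c` option of arthas-boot. The function name, the jar location `/tmp/arthas-boot.jar`, and the assumption that the JVM runs as PID 1 are all hypothetical; the article does not state how the tool actually invokes arthas.

```python
from typing import List


def build_thread_dump_command(
    namespace: str,
    pod: str,
    container: str,
    top_n: int = 3,
    jar_path: str = "/tmp/arthas-boot.jar",  # hypothetical path inside the container
    java_pid: int = 1,  # assumption: the JVM is PID 1 in the container
) -> List[str]:
    """Build a kubectl exec command that asks arthas for the busiest threads."""
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    # "thread -n N" prints the N threads using the most CPU, with stack traces.
    arthas_command = f"thread -n {top_n}"
    return [
        "kubectl", "exec", "-n", namespace, pod, "-c", container, "--",
        "java", "-jar", jar_path, "-c", arthas_command, str(java_pid),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(build_thread_dump_command("prod", "demo-pod", "app")))
```

Returning an argument list rather than a shell string keeps the arthas command (which contains spaces) as a single argument when it is passed to something like `subprocess.run`.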
03. Preview
http://xxxxxx:8099/stacks/
04. Supported environments
Grafana v10.x (v9.x should also work, but is untested)
05. Configuration
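server:
  # Port the service listens on
  port: 8099
  # Maximum number of concurrent executions per node (10 here)
  maxNodeLockManager: 10
  # Domain the service listens on
  domain: "http://127.0.0.1:8099"
wework:
  # WeCom (Enterprise WeChat) webhook URL
  webhook: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxx"
arthas:
  # Whether to copy arthas remotely; when false, it is downloaded by the crawl.sh script
  remoteCopy: true
  # Path where the source arthas-boot.jar is stored
  path: "tools/arthas-boot.jar"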
(... can be modified in crawl.sh)
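The `maxNodeLockManager` setting caps how many executions run on one node at the same time. One way to picture such a cap is a per-node counter guarded by a lock, as in the minimal sketch below. The class and method names are hypothetical, and this is not claimed to be how the tool implements the limit.

```python
import threading
from collections import defaultdict


class NodeLimiter:
    """Caps the number of concurrent dump jobs per Kubernetes node."""

    def __init__(self, max_per_node: int) -> None:
        if max_per_node < 1:
            raise ValueError("max_per_node must be at least 1")
        self._max = max_per_node
        self._guard = threading.Lock()
        self._running = defaultdict(int)  # node name -> jobs in flight

    def try_acquire(self, node: str) -> bool:
        """Reserve a slot on the node; return False if the node is at its cap."""
        with self._guard:
            if self._running[node] >= self._max:
                return False
            self._running[node] += 1
            return True

    def release(self, node: str) -> None:
        """Free a slot previously reserved with try_acquire."""
        with self._guard:
            if self._running[node] > 0:
                self._running[node] -= 1
```

A caller would invoke `try_acquire(node)` when an alert arrives, skip or defer the dump if it returns `False`, and call `release(node)` in a `finally` block once the dump finishes.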
06. Usage
docker pull registry.cn-hangzhou.aliyuncs.com/yilingyi/k8s-java-thread-dumper:2.1.0
Pull the source code
git clone https://github.com/yilingyi/k8s-java-thread-dumper.git
Build the image
make docker IMAGE=yilingyi/k8s-java-thread-dumper:2.1.0
Deploy to Kubernetes
kubectl create namespace monitor
kubectl apply -f . -n monitor
This creates the k8s resources defined in the manifests below.
Deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8s-java-thread-dumper
  labels:
    app: k8s-java-thread-dumper
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8s-java-thread-dumper
  template:
    metadata:
      labels:
        app: k8s-java-thread-dumper
    spec:
      containers:
        - name: k8s-java-thread-dumper
          image: registry.cn-hangzhou.aliyuncs.com/yilingyi/k8s-java-thread-dumper:2.1.0
          ports:
            - containerPort: 8099
          volumeMounts:
            - name: config-volume
              mountPath: /app/config/config.yaml
              subPath: config.yaml
      volumes:
        - name: config-volume
          configMap:
            name: k8s-java-thread-dumper-config
Service.yaml
apiVersion: v1
kind: Service
metadata:
  name: k8s-java-thread-dumper-service
  labels:
    app: k8s-java-thread-dumper
spec:
  selector:
    app: k8s-java-thread-dumper
  ports:
    - protocol: TCP
      port: 8099
      targetPort: 8099
  type: NodePort
ConfigMap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: k8s-java-thread-dumper-config
data:
  config.yaml: |
    server:
      port: 8099
      maxNodeLockManager: 10
      domain: "http://xxxxx:8099"
    wework:
      webhook: "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxxxxxx"
    arthas:
      remoteCopy: true
      path: "tools/arthas-boot.jar"
Save the following as rolebinding.yaml and create it with kubectl apply -f rolebinding.yaml, replacing <target-namespace> with the target namespace.
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  namespace: <target-namespace>
  name: pod-exec-role
rules:
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["create"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: pod-exec-role-binding
  namespace: <target-namespace>
subjects:
  - kind: ServiceAccount
    name: default
    namespace: monitor
roleRef:
  kind: Role
  name: pod-exec-role
  apiGroup: rbac.authorization.k8s.io
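After applying the Role and RoleBinding, it can be worth checking that the `default` ServiceAccount in `monitor` is really allowed to create `pods/exec` in the target namespace. The sketch below only assembles a `kubectl auth can-i` command using impersonation (`--as`) and the `--subresource` flag; the function name is hypothetical, and the check is a suggestion rather than a step from the original article.

```python
from typing import List


def build_can_i_command(
    target_namespace: str,
    sa_namespace: str = "monitor",  # namespace of the ServiceAccount in the RoleBinding
    sa_name: str = "default",       # ServiceAccount name in the RoleBinding
) -> List[str]:
    """Build a kubectl command that checks the 'create pods/exec' permission."""
    service_account = f"system:serviceaccount:{sa_namespace}:{sa_name}"
    return [
        "kubectl", "auth", "can-i", "create", "pods",
        "--subresource=exec",
        f"--as={service_account}",
        "-n", target_namespace,
    ]


if __name__ == "__main__":
    # Print the command; running it should answer "yes" once the RoleBinding exists.
    print(" ".join(build_can_i_command("prod")))
```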
Callback endpoints
http://xxxxxx:8099/hooks/grafana
http://xxxxxx:8099/hooks/prometheus
Grafana alert rule
sum(irate(container_cpu_usage_seconds_total{prometheus_name=~"gz",pod=~".*",container =~".*",container !="",container!="POD",node=~".*",namespace=~"(prod)"}[2m])) by (namespace, pod, node, container) / (sum(container_spec_cpu_quota{prometheus_name=~"gz",pod=~".*",container =~".*",container !="",container!="POD",node=~".*",namespace=~"(prod)"}/100000) by (namespace, pod, node, container)) * 100
{{node}} - {{namespace}} - {{pod}} - {{container}}
Once configured, it looks like this:
Select webhook as the type and set the URL to http://xxxxx/hooks/grafana
Once configured, it looks like this:
rules:
  - alert: HighPodCPUUsage
    expr: sum(irate(container_cpu_usage_seconds_total{prometheus_name=~"gz",pod=~".*",container =~".*",container !="",container!="POD",node=~".*",namespace=~"(prod)"}[2m])) by (namespace, pod, node, container) / (sum(container_spec_cpu_quota{prometheus_name=~"gz",pod=~".*",container =~".*",container !="",container!="POD",node=~".*",namespace=~"(prod)"}/100000) by (namespace, pod, node, container)) * 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage detected on pod {{ $labels.pod }} in namespace {{ $labels.namespace }}"
      description: "CPU usage is above 90% for more than 5 minutes.\n VALUE = {{ $value }}\n POD = {{ $labels.pod }}\n NAMESPACE = {{ $labels.namespace }}"
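The arithmetic behind the alert expression may be easier to follow with numbers. `container_spec_cpu_quota` is expressed in microseconds per scheduling period, so dividing it by 100000 (the period the expression assumes) gives the CPU limit in cores; the usage rate in cores divided by that limit, times 100, is the percentage compared against 90. The sketch below reproduces that calculation with hypothetical function names.

```python
def cpu_usage_percent(
    usage_cores: float,
    cpu_quota_us: float,
    period_us: float = 100_000,  # same divisor as in the alert expression
) -> float:
    """Return CPU usage as a percentage of the container's CPU limit."""
    if cpu_quota_us <= 0:
        raise ValueError("cpu_quota_us must be positive (no CPU limit set?)")
    limit_cores = cpu_quota_us / period_us
    return usage_cores / limit_cores * 100


def should_alert(usage_cores: float, cpu_quota_us: float, threshold: float = 90) -> bool:
    """Mirror the '> 90' comparison from the rule."""
    return cpu_usage_percent(usage_cores, cpu_quota_us) > threshold


if __name__ == "__main__":
    # A container limited to 2 cores (quota 200000) that burns 1.9 cores.
    print(should_alert(1.9, 200_000))
```

Containers without a CPU limit have no meaningful quota, which is why the sketch rejects non-positive values instead of dividing by them.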
Add a route that sends alerts whose alertname is HighPodCPUUsage to the receiver high-pod-cpu-usage, which then calls the callback endpoint http://xxxxx/hooks/prometheus
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        alertname: 'HighPodCPUUsage'
      receiver: 'high-pod-cpu-usage'
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://default-webhook-url/api/v1/alerts'
  - name: 'high-pod-cpu-usage'
    webhook_configs:
      - url: 'http://xxxxx/hooks/prometheus'
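With this routing in place, Alertmanager posts a JSON body to the webhook whose `alerts` array carries each alert's `status` and `labels`. Because the rule aggregates `by (namespace, pod, node, container)`, those four labels are available on every alert. The sketch below shows one way a receiver could pull the dump targets out of such a body; the function name is hypothetical and the tool's real parsing logic is not described in this article.

```python
import json
from typing import Dict, List

# Labels kept by the "by (namespace, pod, node, container)" clause of the rule.
REQUIRED_LABELS = ("namespace", "pod", "container", "node")


def extract_firing_targets(body: str) -> List[Dict[str, str]]:
    """Return the pod coordinates of every firing alert in a webhook body."""
    payload = json.loads(body)
    targets = []
    for alert in payload.get("alerts", []):
        # Resolved alerts are also delivered; only firing ones need a dump.
        if alert.get("status") != "firing":
            continue
        labels = alert.get("labels", {})
        # Skip alerts that lack any label needed to locate the container.
        if all(name in labels for name in REQUIRED_LABELS):
            targets.append({name: labels[name] for name in REQUIRED_LABELS})
    return targets
```

Filtering on `status` matters because the same webhook URL also receives the notification sent when the alert resolves.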
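07. Conclusion
This release adds Prometheus callback support and remote copying of arthas, so the tool can also be used in minimal container environments. If you run into problems or have suggestions while using it, feedback is welcome. That's all for this post, thanks!
Source code: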
https://github.com/yilingyi/k8s-java-thread-dumper.git