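SkyWalking is an application performance monitoring (APM) system designed for microservice, cloud-native, and container-based (Docker, Kubernetes, Mesos) architectures. It provides monitoring, tracing, and diagnostics for distributed systems in cloud-native architectures. Its core features are as follows: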
# Create the namespace: monitoring
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
# Create the RBAC resources for SkyWalking
# The upstream k8s templates are at https://github.com/apache/skywalking-kubernetes/tree/master/chart/skywalking/templates
apiVersion: v1
kind: ServiceAccount
metadata:
  labels:
    app: skywalking-oap-server
    release: 8.3.0
  name: skywalking-oap-server
  namespace: monitoring
---
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # ClusterRole is cluster-scoped, so it takes no namespace
  name: skywalking-oap-server
  labels:
    app: skywalking-oap-server
    release: 8.3.0
rules:
  - apiGroups: [""]
    resources: ["pods", "endpoints", "services"]
    verbs: ["get", "watch", "list"]
  # Deployments and ReplicaSets are served from the apps group on current clusters
  - apiGroups: ["extensions", "apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: skywalking-oap-server
subjects:
  - kind: ServiceAccount
    name: skywalking-oap-server
    namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: skywalking-oap-server
  labels:
    app: skywalking-oap-server
    release: 8.3.0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: skywalking-oap-server
subjects:
  - kind: ServiceAccount
    name: skywalking-oap-server
    namespace: monitoring
# Create the alarm-settings.yml ConfigMap for SkyWalking
kind: ConfigMap
apiVersion: v1
metadata:
  name: alarm-settings
  namespace: monitoring
data:
  alarm-settings.yml: |
    rules:
      # Rule unique name, must be ended with `_rule`.
      # 1. Average service response time exceeded 1 second in 3 of the last 10 minutes.
      service_resp_time_rule:
        metrics-name: service_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 3
        silence-period: 60
        message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
      # 2. Service success rate was below 80% in 2 of the last 10 minutes.
      service_sla_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_sla
        op: "<"
        # service_sla is reported in units of 0.01%, so 8000 means 80%
        threshold: 8000
        # The length of time to evaluate the metrics
        period: 10
        # How many times after the metrics match the condition, will trigger alarm
        count: 2
        # How many times of checks, the alarm keeps silence after alarm triggered, default as same as period.
        silence-period: 60
        message: Successful rate of service {name} is lower than 80% in 2 minutes of last 10 minutes
      # 3. A service response-time percentile (p50, p75, p90, p95, p99) exceeded 1000 ms in 3 of the last 10 minutes.
      service_resp_time_percentile_rule:
        # Metrics value need to be long, double or int
        metrics-name: service_percentile
        op: ">"
        threshold: 1000,1000,1000,1000,1000
        period: 10
        count: 3
        silence-period: 60
        message: Percentile response time of service {name} alarm in 3 minutes of last 10 minutes, due to more than one condition of p50 > 1000, p75 > 1000, p90 > 1000, p95 > 1000, p99 > 1000
      # 4. Average service instance response time exceeded 1 second in 2 of the last 10 minutes.
      service_instance_resp_time_rule:
        metrics-name: service_instance_resp_time
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
        message: Response time of service instance {name} is more than 1000ms in 2 minutes of last 10 minutes
      database_access_resp_time_rule:
        metrics-name: database_access_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of database access {name} is more than 1000ms in 2 minutes of last 10 minutes
      endpoint_relation_resp_time_rule:
        metrics-name: endpoint_relation_resp_time
        threshold: 1000
        op: ">"
        period: 10
        count: 2
        silence-period: 60
        message: Response time of endpoint relation {name} is more than 1000ms in 2 minutes of last 10 minutes
      # Active endpoint related metrics alarm will cost more memory than service and service instance metrics alarm.
      # Because the number of endpoint is much more than service and instance.
      # 5. Average endpoint response time exceeded 1 second in 2 of the last 10 minutes.
      endpoint_avg_rule:
        metrics-name: endpoint_avg
        op: ">"
        threshold: 1000
        period: 10
        count: 2
        silence-period: 60
        message: Response time of endpoint {name} is more than 1000ms in 2 minutes of last 10 minutes
# Create the SkyWalking OAP Deployment. The container opens ports 11800 (gRPC) and 12800 (REST), which are
# published to the internal network through a NodePort Service so that hosts outside this k8s cluster can reach them.
# For convenience, the Aliyun Elasticsearch 7.7 cloud service is used as SkyWalking's storage; the other supported
# storage backends are listed at https://github.com/apache/skywalking/tree/master/oap-server/server-storage-plugin
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
    release: 8.3.0
spec:
  replicas: 2
  selector:
    matchLabels:
      app: skywalking-oap-server
  template:
    metadata:
      labels:
        app: skywalking-oap-server
        devops: k8s-app
    spec:
      serviceAccountName: skywalking-oap-server
      containers:
        - name: skywalking-oap-server
          # `latest` can drift away from the 8.3.0 agent used later; pinning a release tag is safer
          image: apache/skywalking-oap-server:latest
          imagePullPolicy: IfNotPresent
          livenessProbe:
            tcpSocket:
              port: 12800
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            tcpSocket:
              port: 12800
            initialDelaySeconds: 15
            periodSeconds: 20
          securityContext:
            allowPrivilegeEscalation: false
          ports:
            - name: grpc
              containerPort: 11800
            - name: rest
              containerPort: 12800
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "4Gi"
              cpu: 4
          env:
            - name: JAVA_OPTS
              value: "-Xmx2g -Xms2g"
            # Cluster coordination through the Kubernetes API
            - name: SW_CLUSTER
              value: kubernetes
            - name: SW_CLUSTER_K8S_NAMESPACE
              value: monitoring
            # Dynamic configuration read from a ConfigMap
            - name: SW_CONFIGURATION
              value: k8s-configmap
            - name: SW_CONFIG_CONFIGMAP_PERIOD
              value: "60"
            - name: SKYWALKING_COLLECTOR_UID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
            # Storage backend: Elasticsearch 7
            - name: SW_STORAGE
              value: elasticsearch7
            - name: SW_STORAGE_ES_CLUSTER_NODES
              value: xxxxxxx.elasticsearch.aliyuncs.com:9200
            - name: SW_ES_USER
              value: elastic
            - name: SW_ES_PASSWORD
              value: xxxxx
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
            - name: alarm-settings
              mountPath: /skywalking/config/alarm-settings.yml
              readOnly: true
              subPath: alarm-settings.yml
      volumes:
        - name: zone
          hostPath:
            path: /etc/localtime
        - name: alarm-settings
          configMap:
            name: alarm-settings
---
apiVersion: v1
kind: Service
metadata:
  name: skywalking-oap-server
  namespace: monitoring
  labels:
    app: skywalking-oap-server
spec:
  selector:
    app: skywalking-oap-server
  ports:
    - name: grpcport
      port: 11800
      targetPort: 11800
      protocol: TCP
      nodePort: 31180
    - name: restport
      port: 12800
      targetPort: 12800
      protocol: TCP
      nodePort: 31280
  type: NodePort
# Create the SkyWalking UI. Note that spec.template.spec.containers.env.SW_OAP_ADDRESS must match the
# Service name defined in sky-deployment.yaml, followed by the REST port. The UI is exposed under a domain name
# through a Traefik 2 IngressRoute.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: skywalking-ui
  template:
    metadata:
      labels:
        app: skywalking-ui
    spec:
      containers:
        - name: skywalking-ui
          image: apache/skywalking-ui:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8080
              name: page
          resources:
            requests:
              memory: "128Mi"
            limits:
              memory: "3G"
              cpu: 2
          env:
            - name: SW_OAP_ADDRESS
              value: skywalking-oap-server:12800
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
      volumes:
        - name: zone
          hostPath:
            path: /etc/localtime
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: skywalking-ui
  name: skywalking-ui
  namespace: monitoring
spec:
  ports:
    - port: 80
      targetPort: 8080
      protocol: TCP
      name: page
  selector:
    app: skywalking-ui
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: skywalking-ui
  namespace: monitoring
  labels:
    app: skywalking-ui
spec:
  entryPoints:
    - http
  routes:
    - match: Host(`sw.domain.com`) && PathPrefix(`/`)
      kind: Rule
      priority: 10
      middlewares:
        - name: net-offical
          namespace: default
      services:
        - name: skywalking-ui
          namespace: monitoring
          port: 80
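Apply the manifests with kubectl apply in the order above to deploy SkyWalking; once the deployment completes, you can inspect the SkyWalking resources.
Opening sw.domain.com in a browser shows that the SkyWalking UI is ready. No service is connected yet, so every page is empty.
Next, we prepare the SkyWalking Agent so that Java services can attach to it.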
# SkyWalking Agent Dockerfile
FROM alpine:3.8
LABEL maintainer=xiayun
ENV SKYWALKING_VERSION=8.3.0
ADD http://mirrors.tuna.tsinghua.edu.cn/apache/skywalking/${SKYWALKING_VERSION}/apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz /
# Unpack the release, enable the trace-ignore plugin, and append env-overridable settings
RUN tar -zxvf /apache-skywalking-apm-${SKYWALKING_VERSION}.tar.gz && \
    mv apache-skywalking-apm-bin skywalking && \
    mv /skywalking/agent/optional-plugins/apm-trace-ignore-plugin* /skywalking/agent/plugins/ && \
    chmod -R 777 /skywalking/agent && \
    echo -e "\n# Ignore Path" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo "# see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md" >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'trace.ignore_path=${SW_AGENT_TRACE_IGNORE_PATH:/health}' >> /skywalking/agent/config/apm-trace-ignore-plugin.config && \
    echo 'agent.namespace=${SW_AGENT_NAMESPACE:default-namespace}' >> /skywalking/agent/config/agent.config && \
    echo 'logging.max_file_size=${SW_LOGGING_MAX_FILE_SIZE:1073741824}' >> /skywalking/agent/config/agent.config
Build the skywalking-agent:r1.0 image from this SkyWalking Agent Dockerfile and push it to Nexus3 (for deploying Nexus3 on k8s, see the previous article on the official account, <<云原生利器 -- Nexus3>>).
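The last three echo lines of the Dockerfile write settings of the form ${ENV_NAME:default}, which lets each pod override the value through an environment variable. The following is only a simplified model of that substitution, written to illustrate the idea; the function name resolve is hypothetical and this is not the agent's actual implementation.

```python
import re

# Matches ${ENV_NAME:default}; the default part may be empty.
_PLACEHOLDER = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*):([^}]*)\}")


def resolve(value, env):
    """Replace each ${NAME:default} with env[NAME] when set, else with the default."""
    return _PLACEHOLDER.sub(lambda m: env.get(m.group(1), m.group(2)), value)


line = "trace.ignore_path=${SW_AGENT_TRACE_IGNORE_PATH:/health}"

# No environment variable set: the default after the colon is used.
print(resolve(line, {}))

# Variable set in the pod's env: it replaces the default.
print(resolve(line, {"SW_AGENT_TRACE_IGNORE_PATH": "/health,/actuator/prometheus,/prometheus"}))
```

Under this model, a pod that sets no SW_AGENT_TRACE_IGNORE_PATH ignores only /health, while a pod that sets it gets exactly the list it supplied.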
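The Java service's Dockerfile needs to pass ${JAVA_OPTS} to the JVM, for example CMD java ${JAVA_OPTS} -jar jar-name, and the k8s manifest supplies the value through env variables. Then add initContainers to the Java service's k8s manifest so that the SkyWalking agent is deployed in k8s sidecar fashion: an init container copies the agent into a volume shared with the application container.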
# k8s manifest for the Java service
apiVersion: apps/v1
kind: Deployment
metadata:
  name: server-name
  namespace: ENV
  labels:
    prometheus: ENV-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server-name
  template:
    metadata:
      labels:
        app: server-name
        prometheus: ENV-server
        devops: k8s-app
    spec:
      initContainers:
        # Copies the agent into the shared sw-agent volume before the application starts
        - name: skywalking-agent
          image: skywalking-agent:r1.0
          securityContext:
            allowPrivilegeEscalation: false
          resources:
            limits:
              memory: 1Gi
            requests:
              memory: 100Mi
          command:
            - 'sh'
            - '-c'
            - 'set -ex;mkdir -p /vmskywalking/agent;cp -r /skywalking/agent/* /vmskywalking/agent'
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
            - name: sw-agent
              mountPath: /vmskywalking/agent
      containers:
        - name: server-name
          image: 172.16.10.13/ENV-server/server-name:<BUILD_TAG>
          imagePullPolicy: Always
          securityContext:
            allowPrivilegeEscalation: false
          readinessProbe:
            tcpSocket:
              port: 8081
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            tcpSocket:
              port: 8081
            initialDelaySeconds: 300
            periodSeconds: 5
          ports:
            - name: web
              protocol: TCP
              containerPort: 8081
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              memory: "MAXMEM"
          env:
            # The agent jar is read from the sw-agent volume mounted at /usr/lib/agent
            - name: JAVA_OPTS
              value: -javaagent:/usr/lib/agent/skywalking-agent.jar
            - name: SW_AGENT_NAME
              value: ENV-server-name
            # gRPC address of the OAP Service in the monitoring namespace
            - name: SW_AGENT_COLLECTOR_BACKEND_SERVICES
              value: skywalking-oap-server.monitoring.svc.cluster.local:11800
            - name: SW_LOGGING_LEVEL
              value: ERROR
            - name: SW_LOGGING_MAX_FILE_SIZE
              value: "1073741824"
            - name: SW_AGENT_NAMESPACE
              value: ENV
            - name: SW_MOUNT_FOLDERS
              value: plugins,activations
            - name: SW_AGENT_TRACE_IGNORE_PATH
              value: /health,/actuator/prometheus,/prometheus
          volumeMounts:
            - name: zone
              mountPath: /etc/localtime
              readOnly: true
            - name: app-logs
              mountPath: /home/admin/server-name/logs
            - name: fonts
              mountPath: /usr/share/fonts
              subPath: fonts
              readOnly: true
            - name: sw-agent
              mountPath: /usr/lib/agent
      volumes:
        - name: zone
          hostPath:
            path: /etc/localtime
        - name: app-logs
          emptyDir: {}
        - name: sw-agent
          emptyDir: {}
        - name: fonts
          persistentVolumeClaim:
            claimName: fonts
---
apiVersion: v1
kind: Service
metadata:
  name: server-name-svc
  namespace: ENV
  labels:
    prometheus: ENV-server
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8081"
    prometheus.io/path: "/actuator/prometheus"
spec:
  selector:
    app: server-name
  ports:
    - name: web
      port: 80
      targetPort: 8081
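Once the configuration is complete, start the Java service. Here is the basic architecture of SkyWalking on k8s at this point:
Aliyun Elasticsearch serves as SkyWalking's storage, the SkyWalking server and UI both run on k8s, and the SkyWalking agent is deployed in k8s sidecar fashion, sharing the container space with each microservice.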
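Because the OAP Service publishes gRPC and REST on NodePorts 31180 and 31280, a host outside the cluster can confirm that those ports answer before any agent is pointed at them. The sketch below is one way to do that with a plain TCP connect; the function name is_reachable and the node address in the usage comment are assumptions for illustration, not part of the original setup.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Usage from a host on the internal network (replace the hypothetical node IP):
#   is_reachable("192.0.2.10", 31180)   # OAP gRPC via NodePort
#   is_reachable("192.0.2.10", 31280)   # OAP REST via NodePort
```

A successful connect only shows that the port is open; it does not prove that the OAP server behind it is healthy.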
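Log in to the SkyWalking UI and click refresh in the upper-right corner; the newly added Java service appears.
In the APM dashboard, you can see the top 10 for Services Load, Slow Services, Un-Health Service, and Slow Endpoints. The topology view shows the call relationships between services across the whole environment.
The trace view shows the details of each service's call chain.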
To exclude certain paths from tracing, such as the monitoring URIs /health, /actuator/prometheus, and /prometheus, set env.SW_AGENT_TRACE_IGNORE_PATH in the Java service's k8s manifest. For wildcard paths, follow the form trace.ignore_path=/your/path/1/**,/your/path/2/**. For details, see https://github.com/apache/skywalking/blob/8.3.0/docs-hotfix/docs/en/setup/service-agent/java-agent/agent-optional-plugins/trace-ignore-plugin.md
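To make the wildcard form concrete, here is a rough sketch of how a comma-separated ignore list could be matched against a request path. It assumes Ant-style semantics (`?` for one character, `*` within a path segment, `**` across segments); the helper names are hypothetical, and the plugin's own matcher may treat edge cases differently.

```python
import re


def pattern_to_regex(pattern):
    """Translate an Ant-style path pattern into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # any characters, including "/"
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # any characters within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")     # exactly one character within a segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def is_ignored(path, ignore_path):
    """Return True if path matches any pattern in the comma-separated ignore list."""
    patterns = [p.strip() for p in ignore_path.split(",") if p.strip()]
    return any(pattern_to_regex(p).fullmatch(path) for p in patterns)


ignore = "/health,/actuator/prometheus,/prometheus,/your/path/1/**"
print(is_ignored("/actuator/prometheus", ignore))
print(is_ignored("/your/path/1/a/b", ignore))
print(is_ignored("/api/orders", ignore))
```

Exact entries such as /health match only that URI, while a `**` suffix covers everything beneath the prefix.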
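Profiling and logs are not in use yet, so they are not covered here; they will be added in a later update.
The alarm view shows the alarm details for the current services. Alarm rules are configured in alarm-settings.yml, and alarms can be delivered through webhooks such as the Dingtalk Hook, WeChat Hook, Slack Chat Hook, and gRPCHook.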
rules:
  service_resp_time_rule:
    metrics-name: service_resp_time
    op: ">"
    threshold: 1000
    period: 10
    count: 3
    silence-period: 60
    message: Response time of service {name} is more than 1000ms in 3 minutes of last 10 minutes.
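With this configuration, service_resp_time_rule raises an alarm when the service's average response time exceeds 1 second in at least 3 of the last 10 minutes, and then stays silent for 60 minutes. The main points of an alarm rule are:
1. metrics-name: the metric to check; its value must be a long, double, or int.
2. op and threshold: the comparison operator and the value the metric is compared against.
3. period: the length of time over which the metric is evaluated.
4. count: how many times the condition must match within the period before the alarm triggers.
5. silence-period: how long the alarm stays silent after it triggers; it defaults to the same value as period.
6. message: the text sent with the alarm.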
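The interplay of period, count, and silence-period is easier to see in a small simulation. The class below is a simplified model that assumes one check per minute over a sliding window; AlarmRule and its method names are hypothetical, and the OAP server's real implementation may differ in detail.

```python
import operator
from collections import deque

# Comparison operators as they appear in the `op` field of a rule.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le, "==": operator.eq}


class AlarmRule:
    def __init__(self, op, threshold, period, count, silence_period):
        self.compare = OPS[op]
        self.threshold = threshold
        self.window = deque(maxlen=period)  # keeps only the last `period` minutes
        self.count = count
        self.silence_period = silence_period
        self.silence_left = 0

    def check(self, value):
        """Feed one per-minute metric value; return True if the alarm fires now."""
        self.window.append(value)
        if self.silence_left > 0:
            self.silence_left -= 1          # still silent after an earlier alarm
            return False
        matched = sum(1 for v in self.window if self.compare(v, self.threshold))
        if matched >= self.count:
            self.silence_left = self.silence_period
            return True
        return False


# service_resp_time_rule from the configuration above
rule = AlarmRule(op=">", threshold=1000, period=10, count=3, silence_period=60)
minutes = [1200, 800, 1500, 900, 1100, 2000]
print([rule.check(v) for v in minutes])
```

In this run the third value above 1000 ms arrives in the fifth minute, so the alarm fires there; the sixth minute also exceeds the threshold but falls inside the silence period and produces no new alarm.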
1. https://github.com/apache/skywalking
2. https://github.com/apache/skywalking-kubernetes
3. https://skywalking-handbook.netlify.app/