Achieve high-scale application monitoring with Prometheus [Containers]

王欣壳

Last modified 2019-11-11 03:06:08

Collected in the column: Opensource Translation Column

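There are good reasons why Prometheus is an increasingly popular open source tool for monitoring and alerting on applications and servers. Its great strength is monitoring server-side metrics, which it stores as time-series data. Prometheus does not lend itself to application performance management, active control, or user experience monitoring (although a GitHub extension does make user browser metrics available to Prometheus). Even so, its capability as a monitoring system and its ability to scale out through a federation of servers make it a strong choice for a wide variety of use cases.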

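In this article, we take a closer look at Prometheus' architecture and functionality, and then walk through a detailed example of the tool in action.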

Prometheus architecture and components

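Prometheus consists of the Prometheus server (which handles service discovery, metrics retrieval and storage, and time-series data analysis through the PromQL query language), a data model for metrics, a graphing GUI, and native support for Grafana. There is also an optional alert manager, which lets users define alerts through the query language, and an optional push gateway for monitoring short-lived applications. The following diagram shows where these components sit.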

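Prometheus can capture standard metrics automatically by using agents that execute general-purpose code in the application environment. It can also capture custom metrics through instrumentation, that is, custom code placed in the source code of the monitored application. Prometheus officially supports client libraries for Go, Python, Ruby, and Java/Scala, and lets users write their own libraries. Many unofficial libraries for other languages are available as well.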

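Developers can also use third-party exporters to enable instrumentation automatically for many popular software products they may already be running. For example, users of JVM-based applications such as open source Apache Kafka and Apache Cassandra can collect metrics easily with the existing JMX exporter. In other cases no exporter is needed, because the application already exposes metrics in the Prometheus format. Cassandra users may also find Instaclustr's freely available Cassandra Exporter for Prometheus helpful, since it integrates Cassandra metrics from a self-managed cluster into Prometheus application monitoring.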

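Also important: developers can use the available node exporter to monitor kernel metrics and host hardware. Prometheus also provides a Java client with a number of collectors that can be registered either one by one or all at once with a single DefaultExports.initialize() call. These include memory pools, garbage collection, JMX, class loading, and thread counts.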

Prometheus data modeling and metrics

Prometheus provides four metric types:

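Counter: counts incrementing values; a restart can return these values to zero
Gauge: tracks metrics that can go up and down
Histogram: observes data according to specified response sizes or durations, and counts both the sum of the observed values and the number of observations in configurable buckets
Summary: counts observed data much like a histogram, and also offers configurable quantiles calculated over a sliding time window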
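The histogram type is the least obvious of the four, so here is a minimal plain-Java sketch of the bookkeeping it implies. HistogramSketch and its methods are hypothetical names invented for illustration and are not part of the Prometheus client library. The sketch assumes cumulative buckets, which is how Prometheus histograms behave: each observation is counted in every bucket whose upper bound is at least the observed value, plus an implicit +Inf bucket.

```java
import java.util.Arrays;

// Hypothetical illustration class; the real client library provides its own Histogram type.
final class HistogramSketch {
    private final double[] upperBounds; // configurable bucket bounds, sorted ascending
    private final long[] bucketCounts;  // last slot is the implicit +Inf bucket
    private double sum;                 // sum of all observed values
    private long count;                 // number of observations

    HistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        Arrays.sort(this.upperBounds);
        this.bucketCounts = new long[this.upperBounds.length + 1];
    }

    void observe(double value) {
        // Buckets are cumulative: count the value in every bucket whose bound is >= value.
        for (int i = 0; i < upperBounds.length; i++) {
            if (value <= upperBounds[i]) {
                bucketCounts[i]++;
            }
        }
        bucketCounts[upperBounds.length]++; // +Inf always matches
        sum += value;
        count++;
    }

    long bucketCount(int index) { return bucketCounts[index]; }
    double sum() { return sum; }
    long count() { return count; }

    public static void main(String[] args) {
        HistogramSketch durations = new HistogramSketch(0.1, 0.5, 1.0);
        for (double seconds : new double[] {0.05, 0.3, 0.7, 2.0}) {
            durations.observe(seconds);
        }
        // Cumulative counts for le=0.1, le=0.5, le=1.0, le=+Inf
        System.out.println(Arrays.toString(durations.bucketCounts));
        System.out.println("count=" + durations.count());
    }
}
```

Because the buckets are cumulative, the +Inf bucket always equals the total observation count, and the share of observations below a bound can be read off by dividing one bucket by the count.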

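Each Prometheus time-series metric has a string name that follows a naming convention: it includes the name of the monitored data subject, the logical type, and the unit of measure used. Each metric consists of streams of 64-bit float values timestamped to the millisecond, plus a set of key:value pairs that label the dimensions it measures. Prometheus automatically adds Job and Instance labels to every metric, to keep track of the configured job name of the data target and the <host>:<port> part of the scraped target URL, respectively.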
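To make the name-plus-labels model concrete, the following plain-Java sketch renders one sample the way it would look in the Prometheus text format: the metric name, the label set in braces, and the float value. SampleLineSketch and render are hypothetical names used only here. The job and instance values are borrowed from the configuration shown later in this article; they are attached by the Prometheus server at scrape time rather than written by the application.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration class; not part of the Prometheus client library.
final class SampleLineSketch {

    // Renders name{key="value",...} value; the braces are omitted when there are no labels.
    static String render(String name, Map<String, String> labels, double value) {
        StringJoiner labelSet = new StringJoiner(",", "{", "}");
        labelSet.setEmptyValue("");
        for (Map.Entry<String, String> label : labels.entrySet()) {
            labelSet.add(label.getKey() + "=\"" + escape(label.getValue()) + "\"");
        }
        return name + labelSet + " " + value;
    }

    // Label values escape backslash, double quote, and newline.
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        Map<String, String> labels = new LinkedHashMap<>();
        labels.put("stage", "producer");          // application-defined dimension
        labels.put("job", "testprometheus");      // added by the server from the job name
        labels.put("instance", "localhost:1234"); // added by the server from <host>:<port>
        System.out.println(render("prometheusTest_requests_total", labels, 42.0));
    }
}
```

Every distinct combination of label values is its own time series, which is why a single metric name can cover all pipeline stages.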

Prometheus example: the Anomalia Machina anomaly detection experiment

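To show how Prometheus can be put into practice for application monitoring at scale, let's look at the experimental Anomalia Machina project we recently completed at Instaclustr. The project is only a test case, not a commercially available solution. It uses Kafka and Cassandra in an application deployed on Kubernetes that performs anomaly detection on streaming data. (Such detection is important for use cases including IoT applications and digital ad fraud.) The experimental application relies heavily on Prometheus to collect application metrics across distributed instances and make them easy to view.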

This diagram shows the architecture of the experiment:

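Our goals in using Prometheus included monitoring the application's more generic metrics, such as throughput, as well as the response times delivered by the Kafka load generator (the Kafka producer), the Kafka consumer, and the Cassandra client responsible for detecting any anomalies in the data. Prometheus also monitors the system's hardware metrics, such as the CPU of each AWS EC2 instance running the application. The project further relies on Prometheus to monitor application-specific metrics, such as the total number of rows returned by each Cassandra read and, crucially, the number of anomalies detected. All of this monitoring is centralized for simplicity.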

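In practice, this means building a test pipeline out of producer, consumer, and detector methods, together with the following three metrics: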

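A counter called prometheusTest_requests_total is incremented each time a pipeline stage executes without incident. Its stage label tracks the successful executions of each individual stage, and the label value total tracks the count for the pipeline as a whole.
Another counter, called prometheusTest_anomalies_total, counts any detected anomalies.
Finally, a gauge called prometheusTest_duration_seconds tracks the duration of each stage in seconds (again using the stage label and the total value).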

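The code behind these measurements increments the counter metrics with the inc() method and sets the time value of the gauge metric with the setToTime() method. The following annotated example code demonstrates this: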

import java.io.IOException;
import io.prometheus.client.Counter;
import io.prometheus.client.Gauge;
import io.prometheus.client.exporter.HTTPServer;
import io.prometheus.client.hotspot.DefaultExports;

// https://github.com/prometheus/client_java
// Demo of how we plan to use the Prometheus Java client to instrument Anomalia Machina.
// Note that the Anomalia Machina application will have the Kafka producer, the Kafka consumer,
// and the rest of the pipeline running in multiple separate processes/instances,
// so metrics from each will have different host/port combinations.
public class PrometheusBlog {
    static String appName = "prometheusTest";

    // Counters can only increase in value (until process restart).
    // Execution count. Use a single Counter for all stages of the pipeline;
    // stages are distinguished by labels.
    static final Counter pipelineCounter = Counter.build()
        .name(appName + "_requests_total").help("Count of executions of pipeline stages")
        .labelNames("stage")
        .register();

    // In theory we could also use pipelineCounter to count anomalies with another label,
    // but a separate counter has less potential for confusion. It doesn't need a label.
    static final Counter anomalyCounter = Counter.build()
        .name(appName + "_anomalies_total").help("Count of anomalies detected")
        .register();

    // A Gauge can go up and down, and is used to measure the current value of some variable.
    // pipelineGauge measures the duration in seconds of each stage, using labels.
    static final Gauge pipelineGauge = Gauge.build()
        .name(appName + "_duration_seconds").help("Gauge of stage durations in seconds")
        .labelNames("stage")
        .register();

    public static void main(String[] args) {
        // Allow default JVM metrics to be exported
        DefaultExports.initialize();

        // Metrics are pulled by Prometheus, so create an HTTP server as the endpoint.
        // Note: if multiple processes run on the same server, change the port number
        // and add all IPs and port numbers to the Prometheus configuration file.
        HTTPServer server;
        try {
            server = new HTTPServer(1234);
        } catch (IOException e) {
            // Without the endpoint nothing can be scraped, so stop here
            // (this also avoids calling stop() on a null server later).
            e.printStackTrace();
            return;
        }

        // Now run 1000 executions of the complete pipeline with random time delays
        // and an increasing rate.
        int max = 1000;
        for (int i = 0; i < max; i++) {
            // Total time for the complete pipeline, and increment anomalyCounter
            pipelineGauge.labels("total").setToTime(() -> {
                producer();
                consumer();
                if (detector()) {
                    anomalyCounter.inc();
                }
            });
            // Total pipeline count
            pipelineCounter.labels("total").inc();
            System.out.println("i=" + i);

            // Increase the rate of execution by sleeping less on each iteration
            try {
                Thread.sleep(max - i);
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        }
        server.stop();
    }

    // The 3 stages of the pipeline. For each one we increase the stage counter
    // and set the Gauge duration time. The local class is a trick to obtain the
    // enclosing method's name, which is used as the stage label value.
    public static void producer() {
        class Local {};
        String name = Local.class.getEnclosingMethod().getName();
        pipelineGauge.labels(name).setToTime(() -> {
            try {
                Thread.sleep(1 + (long) (Math.random() * 20));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        });
        pipelineCounter.labels(name).inc();
    }

    public static void consumer() {
        class Local {};
        String name = Local.class.getEnclosingMethod().getName();
        pipelineGauge.labels(name).setToTime(() -> {
            try {
                Thread.sleep(1 + (long) (Math.random() * 10));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        });
        pipelineCounter.labels(name).inc();
    }

    // detector returns true if an anomaly is detected, else false
    public static boolean detector() {
        class Local {};
        String name = Local.class.getEnclosingMethod().getName();
        pipelineGauge.labels(name).setToTime(() -> {
            try {
                Thread.sleep(1 + (long) (Math.random() * 200));
            } catch (InterruptedException e) {
                e.printStackTrace();
            }
        });
        pipelineCounter.labels(name).inc();
        return (Math.random() > 0.95);
    }
}

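Prometheus collects metrics by polling ("scraping") the instrumented code, unlike some other monitoring solutions that receive metrics through a push model. The code example above creates the required HTTP server on port 1234 so that Prometheus can scrape metrics as needed. The following sample takes care of the Maven dependencies: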

<!-- The client -->
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient</artifactId>
    <version>LATEST</version>
</dependency>
<!-- Hotspot JVM metrics -->
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_hotspot</artifactId>
    <version>LATEST</version>
</dependency>
<!-- Exposition HTTPServer -->
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_httpserver</artifactId>
    <version>LATEST</version>
</dependency>
<!-- Pushgateway exposition -->
<dependency>
    <groupId>io.prometheus</groupId>
    <artifactId>simpleclient_pushgateway</artifactId>
    <version>LATEST</version>
</dependency>

The following sample tells Prometheus where it should look for metrics. For basic deployments and tests, simply add it to the configuration file (default: prometheus.yml).

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.

# scrape_configs has jobs and targets to scrape for each.
scrape_configs:
  # Job 1 is for testing Prometheus instrumentation from multiple application processes.
  # The job name is added as a label job=<job_name> to any time series scraped from this config.
  - job_name: 'testprometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    # This is where to put multiple targets, e.g. for Kafka load generators and detectors.
    static_configs:
      - targets: ['localhost:1234', 'localhost:1235']

  # Job 2 provides operating system metrics (e.g. CPU, memory, etc.).
  - job_name: 'node'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s

    static_configs:
      - targets: ['localhost:9100']

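Note the job named "node" in this configuration file, which uses port 9100. This job provides node metrics and requires the Prometheus node exporter to be running on the same server as the application. Polling for metrics should be done with care: polling too often can overload the application, while polling too rarely introduces lag. Where application metrics cannot be polled, Prometheus also offers a push gateway.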
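To illustrate what a single poll amounts to, here is a minimal sketch that uses only the JDK's built-in HTTP server rather than the Prometheus client library. ScrapeSketch, startEndpoint, scrapeOnce, and the metric name demo_requests_total are all hypothetical names made up for this illustration. The sketch assumes the plain-text exposition format served at /metrics; the scrapeOnce method stands in for the Prometheus server, which would issue one such GET per scrape interval.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration; a real application would use the client library's HTTPServer.
final class ScrapeSketch {
    // Stands in for a counter metric; the name demo_requests_total is invented.
    static final AtomicLong requests = new AtomicLong();

    // Starts a /metrics endpoint on an ephemeral port (port 0 lets the OS choose one).
    static HttpServer startEndpoint() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
            server.createContext("/metrics", exchange -> {
                byte[] body = ("demo_requests_total " + requests.get() + "\n")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.getResponseHeaders().add("Content-Type", "text/plain; version=0.0.4");
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(body);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Plays the role of the Prometheus server: one HTTP GET is one scrape.
    static String scrapeOnce(int port) {
        URI target = URI.create("http://127.0.0.1:" + port + "/metrics");
        try (InputStream in = target.toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startEndpoint();
        try {
            requests.set(3); // pretend three requests have been handled so far
            System.out.print(scrapeOnce(server.getAddress().getPort()));
        } finally {
            server.stop(0);
        }
    }
}
```

Each scrape costs the application one HTTP request plus the work of rendering every metric, which is why a very short scrape_interval across many targets can add noticeable load.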

Viewing Prometheus metrics and results

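Our experiment first used expressions, and later Grafana, to visualize data and to overcome Prometheus' lack of default dashboards. Using the Prometheus interface (or http://localhost:9090/metrics), select metrics by name and then enter them in the expression box to execute them. (Note that error messages are common at this stage, so don't be discouraged if you run into a few problems.) Once an expression runs correctly, its results can be displayed in a table or a graph, as appropriate.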

Applying the irate or rate function to a counter metric produces a useful rate graph: