A Practical Summary of Using XGBoost for Time Series Anomaly Detection

Original

durdendong-董善东

Published 2024-10-17 21:24:50

Included in the column: 智能运维解决方案 (Intelligent Operations Solutions)

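Introduction to XGBoost

XGBoost (Extreme Gradient Boosting) is a machine learning algorithm based on gradient-boosted decision trees. Anyone familiar with machine learning will know it well: XGBoost is a regular in traditional machine learning competitions and is often among the winning models.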

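XGBoost can be applied to time series anomaly detection in two forms:

Supervised (classification): anomalous points are labeled as one class and normal points as the other, and an XGBoost model is trained to separate the two. This works well when the dataset comes with clear anomaly labels.
Unsupervised (regression): an XGBoost model forecasts future values of the series, and the forecasts are compared with the observed values to flag points that deviate significantly from the prediction.

Here we mainly used the first form.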
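For contrast, the second (regression) form can be sketched as follows. This is only a minimal illustration: the forecasts are passed in as a plain list (in practice they would come from a trained regressor), and the function name `flag_by_residual`, the threshold `k`, and the median/MAD scoring rule are assumptions chosen for the example, not something taken from the article.

```python
import statistics


def flag_by_residual(actual, predicted, k=3.0):
    """Flag points whose forecast residual is far from the typical residual.

    The distance is measured in robust standard deviations
    (median absolute deviation scaled by 1.4826).
    """
    residuals = [a - p for a, p in zip(actual, predicted)]
    center = statistics.median(residuals)
    mad = statistics.median([abs(r - center) for r in residuals])
    scale = 1.4826 * mad or 1e-9  # avoid division by zero on flat residuals
    return [abs(r - center) / scale > k for r in residuals]


# Hypothetical observed values and forecasts for seven time steps.
actual = [10, 11, 10, 12, 30, 11, 10]
predicted = [10, 10, 11, 11, 11, 11, 10]
flags = flag_by_residual(actual, predicted)
print(flags)
```

Only the fifth point, where the observation is far above its forecast, is flagged in this toy input.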

For more background on the XGBoost algorithm, see: https://www.cnblogs.com/mantch/p/11164221.html

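Workflow

In practice, time series anomaly detection with XGBoost usually involves the following steps:

Data preparation: cleaning the data, handling missing values and outliers, and converting the time series into a form suitable for supervised learning.
Feature engineering: creating lag features, rolling statistics (such as moving averages), seasonal features, and so on, to help the model capture complex patterns in the series.
Model training: training the model with XGBoost, which may require tuning hyperparameters such as the learning rate, maximum tree depth, and regularization parameters.
Model evaluation: evaluating the model on a test set with appropriate metrics such as accuracy, recall, and F1 score.
Anomaly detection: applying the trained model to new time series data to identify anomalous points.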
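The data preparation and feature engineering steps above can be sketched in a few lines. This is a simplified illustration, assuming a labeled series held in plain Python lists; the function name `make_supervised_rows` and the choice of three lags and a three-point rolling mean are hypothetical and are not the feature set used in the article.

```python
def make_supervised_rows(values, labels, n_lags=3, roll=3):
    """Turn a labeled series into (features, label) rows.

    Features for point t: the n_lags previous values, the mean of the
    previous `roll` values, and the gap between values[t] and that mean.
    """
    rows = []
    start = max(n_lags, roll)
    for t in range(start, len(values)):
        lags = [values[t - i] for i in range(1, n_lags + 1)]
        window = values[t - roll:t]
        rolling_mean = sum(window) / roll
        features = lags + [rolling_mean, values[t] - rolling_mean]
        rows.append((features, labels[t]))
    return rows


# Hypothetical series with one labeled anomaly (label 1) at index 4.
values = [1.0, 2.0, 3.0, 4.0, 50.0, 6.0]
labels = [0, 0, 0, 0, 1, 0]
rows = make_supervised_rows(values, labels)
for features, label in rows:
    print(label, features)
```

Each row can then be fed to a classifier as one training example; the first `max(n_lags, roll)` points are dropped because they lack enough history.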

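Input data

Characteristics of time series

Time series typically exhibit ordering, trend, seasonality, cyclicity, and randomness.

Ordering: the data points are arranged in time order, which is the foundation of time series analysis.
Trend: the series may show long-term growth or decline, driven by factors such as economic growth or technological progress.
Seasonality: many series show recurring patterns with a fixed period (yearly, quarterly, monthly, or weekly) tied to seasons or specific time slots.
Cyclicity (Cyclic): unlike seasonality, cyclic variation has no fixed period and may fluctuate on different time scales.
Randomness (Irregularity): the series may contain random fluctuations or irregular changes that are usually unpredictable, such as those caused by natural disasters or unexpected events.

Beyond these basic characteristics, a data point in a time series is usually correlated with the points before it; that is, the series is autocorrelated. Time series also often contain missing values and may be stationary or non-stationary.

Because of these characteristics, some preprocessing is usually needed: missing-value imputation, normalization, stationarity testing, and STL decomposition.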
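Two of the preprocessing steps named above, missing-value imputation and normalization, can be sketched with the standard library alone. This is an illustrative version under simple assumptions: missing points are represented as `None`, gaps are filled by linear interpolation, and normalization is a z-score. The helper names `fill_missing` and `z_normalize` are hypothetical.

```python
import statistics


def fill_missing(values):
    """Linearly interpolate None gaps; gaps at the edges copy the nearest value."""
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            step = (values[right] - values[left]) * (i - left) / (right - left)
            filled[i] = values[left] + step
    return filled


def z_normalize(values):
    """Scale to zero mean and unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


raw = [None, 2.0, None, None, 8.0, None]
filled = fill_missing(raw)
normed = z_normalize(filled)
print(filled)
print(normed)
```

Stationarity testing and STL decomposition are typically done with a statistics library and are left out of this sketch.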

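Sample construction

There is a trade-off between covering the periodicity of a series and keeping the series short. In business scenarios, day-over-day and week-over-week comparisons usually carry clear signal, so we combine a short recent window with periodic windows into one input sample, as follows:

The point to be detected plus the preceding 180 minutes of data.
The 180 minutes before and after the corresponding time point one day earlier and one week earlier.
The 180-minute window is an empirical value; we tried 30 minutes, 60 minutes, 2 hours, 3 hours, 6 hours, and 12 hours.
The day-over-day and week-over-week windows cover 180 minutes on both sides mainly because periodic drift is very common. The trailing 180 minutes also show how the data went on to change in those earlier periods.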
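The sample layout described above can be sketched as index arithmetic on a per-minute series. This is an assumption-laden illustration: the series is a plain list with one value per minute and no gaps, and the function name `build_sample` and the concatenation order (week-ago window, day-ago window, recent window) are choices made for the example.

```python
MINUTES_PER_DAY = 1440
WINDOW = 180


def build_sample(series, t, window=WINDOW):
    """Assemble one input sample for the point at index t of a per-minute series."""
    day = t - MINUTES_PER_DAY
    week = t - 7 * MINUTES_PER_DAY
    if week - window < 0 or t >= len(series):
        raise ValueError("not enough history around index t")
    recent = series[t - window:t + 1]                    # window + 1 points, ends at t
    day_ref = series[day - window:day + window + 1]      # 2 * window + 1 points
    week_ref = series[week - window:week + window + 1]   # 2 * window + 1 points
    return week_ref + day_ref + recent


# Eight days of dummy per-minute data; each value equals its own index.
series = list(range(8 * MINUTES_PER_DAY))
t = len(series) - 1
sample = build_sample(series, t)
print(len(sample))
```

With a 180-minute window each sample holds 181 + 361 + 361 = 903 points, compared with 10,080 points for seven full days of per-minute data.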

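Comparison:

Feeding in a full 7 days of data at 1,440 points per day would mean more than 10,000 points (7 × 1,440 = 10,080), which is very expensive to query in real time. Combining the short recent window with the day-over-day and week-over-week windows keeps the key parts of the sample while greatly reducing the cost of fetching data.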

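Feature engineering

Compute time series features, which fall into three categories:

Statistical features: maximum, minimum, range, mean, median, variance, kurtosis, same-period comparison (for example day-over-day or week-over-week), change from the previous interval, periodicity, autocorrelation coefficient, coefficient of variation.

Fitting features: moving average, weighted moving average, exponential moving average, double exponential moving average, triple exponential moving average, singular value decomposition, autoregression, deep learning algorithms.

Classification features: entropy features, wavelet analysis features, value-distribution features (histogram distribution, distribution of data volume by time segment).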
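To make the three categories concrete, here is one small example from each, written with the standard library only. These are simplified stand-ins, not the Metis implementations: the function names, the smoothing factor `alpha`, and the number of histogram bins are all assumptions for illustration.

```python
import math
import statistics


def statistical_features(x):
    """Statistical category: a few summary statistics plus lag-1 autocorrelation."""
    mean = statistics.mean(x)
    std = statistics.pstdev(x)
    var_sum = sum((v - mean) ** 2 for v in x)
    cross = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(len(x) - 1))
    return {
        "max": max(x),
        "min": min(x),
        "range": max(x) - min(x),
        "mean": mean,
        "median": statistics.median(x),
        "variance": std ** 2,
        "cv": std / mean if mean else 0.0,
        "acf1": cross / var_sum if var_sum else 0.0,
    }


def ema_residual(x, alpha=0.5):
    """Fitting category: gap between the last point and an EMA of the earlier points."""
    ema = x[0]
    for v in x[1:-1]:
        ema = alpha * v + (1 - alpha) * ema
    return x[-1] - ema


def histogram_entropy(x, bins=5):
    """Classification category: Shannon entropy of an equal-width histogram."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return 0.0
    counts = [0] * bins
    for v in x:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    probs = [c / len(x) for c in counts if c]
    return -sum(p * math.log(p) for p in probs)


x = [1.0, 1.0, 1.0, 1.0, 9.0]
stats = statistical_features(x)
feature_vector = list(stats.values()) + [ema_residual(x), histogram_entropy(x)]
print(feature_vector)
```

The resulting flat list of numbers is the kind of per-sample feature vector a tree model consumes.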

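In traditional XGBoost training, model accuracy is usually improved along two lines:

Adding more features
Adding more training samples

In Metis, we ended up with more than 200 features. Admittedly, some of them are overly complex: they are essentially intermediate features borrowed from other unsupervised detection models, and they feel rather hand-crafted.

To compute these time series features, you can use the open-source package Tsfresh directly: https://tsfresh.readthedocs.io/en/latest/index.html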

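Dataset

Training an XGBoost anomaly detection model that performs as expected takes roughly 10,000 samples or more.

To train the Metis XGBoost model, we built a sample set of the following size:

positive denotes anomalous samples; negative denotes normal samples.

Reaching this sample size requires work on two fronts:

Good labeling and sample management tools, which improve labeling efficiency. For example:
1. Clustering the time series to be labeled
2. Pre-labeling with unsupervised detection models
Bringing in open-source datasets.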
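As an illustration of the pre-labeling idea, a very simple unsupervised rule can suggest a label for each sample and order the review queue so that annotators see the most suspicious samples first. This sketch uses an n-sigma rule on the last point of each sample; the function names, the threshold `k`, and the sample identifiers are hypothetical, and the label convention (1 for anomalous) follows the positive/negative definition above.

```python
import statistics


def prelabel(sample, k=3.0):
    """Suggest a label for the last point of a sample from its own history.

    Returns (suggested_label, score), where 1 means "looks anomalous".
    """
    history, last = sample[:-1], sample[-1]
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    if std > 0:
        score = abs(last - mean) / std
    else:
        score = 0.0 if last == mean else float("inf")
    return (1 if score > k else 0), score


def review_queue(samples, k=3.0):
    """Return (sample_id, suggested_label, score) tuples, most suspicious first."""
    scored = [(sid,) + prelabel(values, k) for sid, values in samples.items()]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical unlabeled samples keyed by metric name.
samples = {
    "mem": [50, 51, 50, 51, 50, 51, 50],
    "cpu": [10, 11, 10, 11, 10, 11, 30],
}
queue = review_queue(samples)
for sample_id, label, score in queue:
    print(sample_id, label, round(score, 2))
```

The suggested labels are only a starting point; a human still confirms or corrects each one before it enters the training set.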

XGBoost in practice: code

1. Extracting time series features

1.1. Main feature extraction function

# absolute imports so the module also loads under Python 3
from time_series_detector.feature import statistical_features
from time_series_detector.feature import classification_features
from time_series_detector.feature import fitting_features
from time_series_detector.common import tsd_common


def extract_features(time_series, window):
    """
    Extracts three types of features from the time series.

    :param time_series: the time series to extract the feature of
    :type time_series: pandas.Series
    :param window: the length of window
    :type window: int
    :return: the value of features
    :return type: list with float
    """
    if not tsd_common.is_standard_time_series(time_series, window):
        # add your report of this error here...
        return []

    # split time_series into its sub-series
    split_time_series = tsd_common.split_time_series(time_series, window)
    # normalize time_series (two different normalizations)
    normalized_split_time_series = tsd_common.normalize_time_series(split_time_series)
    max_min_normalized_time_series = tsd_common.normalize_time_series_by_max_min(split_time_series)
    # statistical features are computed on the sub-series at index 4 only
    s_features = statistical_features.get_statistical_features(normalized_split_time_series[4])
    f_features = fitting_features.get_fitting_features(normalized_split_time_series)
    c_features = classification_features.get_classification_features(max_min_normalized_time_series)
    # combine features with types
    features = s_features + f_features + c_features
    return features

1.2. Example: statistical_features

def get_statistical_features(x):
    # each time_series_* helper returns one scalar feature for the series x
    statistical_features = [
        time_series_maximum(x),
        time_series_minimum(x),
        time_series_mean(x),
        time_series_variance(x),
        time_series_standard_deviation(x),
        time_series_skewness(x),
        time_series_kurtosis(x),
        time_series_median(x),
        time_series_abs_energy(x),
        time_series_absolute_sum_of_changes(x),
        time_series_variance_larger_than_std(x),
        time_series_count_above_mean(x),
        time_series_count_below_mean(x),
        time_series_first_location_of_maximum(x),
        time_series_first_location_of_minimum(x),
        time_series_last_location_of_maximum(x),
        time_series_last_location_of_minimum(x),
        int(time_series_has_duplicate(x)),
        int(time_series_has_duplicate_max(x)),
        int(time_series_has_duplicate_min(x)),
        time_series_longest_strike_above_mean(x),
        time_series_longest_strike_below_mean(x),
        time_series_mean_abs_change(x),
        time_series_mean_change(x),
        time_series_percentage_of_reoccurring_datapoints_to_all_datapoints(x),
        time_series_ratio_value_number_to_time_series_length(x),
        time_series_sum_of_reoccurring_data_points(x),
        time_series_sum_of_reoccurring_values(x),
        time_series_sum_values(x),
        time_series_range(x)
    ]
    # append your own statistical features here...
    return statistical_features

https://github.com/Tencent/Metis/blob/master/time_series_detector/feature/statistical_features.py
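To give a feel for what a few of the listed features compute, here is an independent pure-Python sketch of four of them. The function names mirror entries in the list above, but the bodies are written for this illustration under the usual definitions of these statistics; they are not copied from the Metis source linked above.

```python
def time_series_count_above_mean(x):
    """Number of values strictly greater than the mean."""
    mean = sum(x) / len(x)
    return sum(1 for v in x if v > mean)


def time_series_longest_strike_above_mean(x):
    """Length of the longest run of consecutive values above the mean."""
    mean = sum(x) / len(x)
    best = run = 0
    for v in x:
        run = run + 1 if v > mean else 0
        best = max(best, run)
    return best


def time_series_absolute_sum_of_changes(x):
    """Sum of absolute differences between consecutive values."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def time_series_mean_abs_change(x):
    """Mean of absolute differences between consecutive values."""
    return time_series_absolute_sum_of_changes(x) / (len(x) - 1)


x = [1, 5, 6, 2, 7, 8, 9, 1]
print(time_series_count_above_mean(x))
print(time_series_longest_strike_above_mean(x))
print(time_series_absolute_sum_of_changes(x))
print(time_series_mean_abs_change(x))
```

For this input the mean is 4.875, so five values lie above it and the longest run above the mean is 7, 8, 9.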

1.3. Full code:

https://github.com/Tencent/Metis/tree/master/time_series_detector/feature

2. xgboost train & predict

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
Tencent is pleased to support the open source community by making Metis available.
Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
Licensed under the BSD 3-Clause License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://opensource.org/licenses/BSD-3-Clause
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
"""

import os
import numpy as np  # used in predict(); imported explicitly instead of relying on the wildcard imports
import xgboost as xgb
from time_series_detector.feature import feature_service
from time_series_detector.common.tsd_errorcode import *
from time_series_detector.common.tsd_common import *
MODEL_PATH = os.path.join(os.path.dirname(__file__), '../model/')
DEFAULT_MODEL = MODEL_PATH + "xgb_default_model"


class XGBoosting(object):
    """
    XGBoost is an optimized distributed gradient boosting library designed to be highly efficient,
    flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.
    XGBoost provides a parallel tree boosting (also known as GBDT, GBM) that solve many data science problems
    in a fast and accurate way. The same code runs on major distributed environment (Hadoop, SGE, MPI)
    and can solve problems beyond billions of examples.

    https://github.com/dmlc/xgboost
    """

    def __init__(self,
                 threshold=0.15,
                 max_depth=10,
                 eta=0.05,
                 gamma=0.1,
                 silent=1,
                 min_child_weight=1,
                 subsample=0.8,
                 colsample_bytree=1,
                 booster='gbtree',
                 objective='binary:logistic',
                 eval_metric='auc'):
        """
        :param threshold: The critical point of normal; scores below it are treated as abnormal.
        :param max_depth: Maximum tree depth for base learners.
        :param eta: Step size shrinkage (learning rate). Smaller values make the model more robust to overfitting but slower to train.
        :param gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
        :param silent: 0 prints running messages, 1 enables silent mode. Deprecated in newer XGBoost releases in favor of verbosity.
        :param min_child_weight: Minimum sum of instance weight(hessian) needed in a child.
        :param subsample: Subsample ratio of the training instance.
        :param colsample_bytree: Subsample ratio of columns when constructing each tree.
        :param booster: Specify which booster to use: gbtree, gblinear or dart.
        :param objective: Specify the learning task and the corresponding learning objective or a custom objective function to be used (see note below).
        :param eval_metric: If a str, should be a built-in evaluation metric to use. See doc/parameter.md. If callable, a custom evaluation metric.
        """
        self.threshold = threshold
        self.max_depth = max_depth
        self.eta = eta
        self.gamma = gamma
        self.silent = silent
        self.min_child_weight = min_child_weight
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.booster = booster
        self.objective = objective
        self.eval_metric = eval_metric

    def __save_libsvm_format(self, data, feature_file_name):
        """
        Save the time features to libsvm format.

        :param data: list of [feature values, label] pairs
        :param feature_file_name: file saves the time features and label
        """
        try:
            # the with-block closes and flushes the file before it is read back
            with open(feature_file_name, "w") as f:
                for times, temp in enumerate(data):
                    if times > 0:
                        f.write("\n")
                    # libsvm feature indices are written 1-based
                    result = ['{0}:{1}'.format(int(index) + 1, value) for index, value in enumerate(temp[0])]
                    f.write(str(temp[1]))
                    for x in result:
                        f.write(' ' + x)
        except Exception as ex:
            return TSD_CAL_FEATURE_ERR, str(ex)
        return TSD_OP_SUCCESS, ""

    def __calculate_features(self, data, feature_file_name, window=DEFAULT_WINDOW):
        """
        Calculate time features and save as libsvm format.

        :param data: the time series to detect of
        :param feature_file_name: the file to use
        :param window: the length of window
        """
        features = []
        for index in data:
            if is_standard_time_series(index["data"], window):
                temp = []
                temp.append(feature_service.extract_features(index["data"], window))
                temp.append(index["flag"])
                features.append(temp)
        try:
            ret_code, ret_data = self.__save_libsvm_format(features, feature_file_name)
        except Exception as ex:
            ret_code = TSD_CAL_FEATURE_ERR
            ret_data = str(ex)
        return ret_code, ret_data

    def xgb_train(self, data, task_id, num_round=300):
        """
        Train an xgboost model.

        :param data: Training dataset.
        :param task_id: The id of the training task.
        :param num_round: Max number of boosting iterations.
        """
        model_name = MODEL_PATH + task_id + "_model"
        feature_file_name = MODEL_PATH + task_id + "_features"
        ret_code, ret_data = self.__calculate_features(data, feature_file_name)
        if ret_code != TSD_OP_SUCCESS:
            return ret_code, ret_data
        try:
            dtrain = xgb.DMatrix(feature_file_name)
        except Exception as ex:
            return TSD_READ_FEATURE_FAILED, str(ex)
        params = {
            'max_depth': self.max_depth,
            'eta': self.eta,
            'gamma': self.gamma,
            'silent': self.silent,
            'min_child_weight': self.min_child_weight,
            'subsample': self.subsample,
            'colsample_bytree': self.colsample_bytree,
            'booster': self.booster,
            'objective': self.objective,
            'eval_metric': self.eval_metric,
        }
        try:
            bst = xgb.train(params, dtrain, num_round)
            bst.save_model(model_name)
        except Exception as ex:
            return TSD_TRAIN_ERR, str(ex)
        return TSD_OP_SUCCESS, ""

    def predict(self, X, window=DEFAULT_WINDOW, model_name=DEFAULT_MODEL):
        """
        Use an xgboost model to predict whether a particular sample is an outlier.

        :param X: the time series to detect of
        :type X: pandas.Series
        :param window: the length of window
        :param model_name: path of the xgboost model file to load
        :return: [label, score]; label 1 denotes normal, 0 denotes abnormal.
        """
        if is_standard_time_series(X, window):
            ts_features = []
            # placeholder value for column 0: the training file uses 1-based
            # feature indices, so the real features start at column 1
            features = [10]
            features.extend(feature_service.extract_features(X, window))
            ts_features.append(features)
            res_pred = xgb.DMatrix(np.array(ts_features))
            bst = xgb.Booster({'nthread': 4})
            bst.load_model(model_name)
            xgb_ret = bst.predict(res_pred)
            # scores below the threshold are reported as abnormal (0)
            if xgb_ret[0] < self.threshold:
                value = 0
            else:
                value = 1
            return [value, xgb_ret[0]]
        else:
            # input that is not a standard time series is reported as abnormal with score 0
            return [0, 0]
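The class above writes training features in libsvm text format and turns the predicted score into a 0/1 decision with a threshold. The following standalone sketch shows both ideas without any XGBoost dependency. The helper names `to_libsvm_line`, `parse_libsvm_line`, and `apply_threshold` are hypothetical; the 1-based feature indices and the default threshold of 0.15 follow the code above.

```python
def to_libsvm_line(label, features):
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices."""
    pairs = ["{0}:{1}".format(i + 1, v) for i, v in enumerate(features)]
    return " ".join([str(label)] + pairs)


def parse_libsvm_line(line):
    """Inverse of to_libsvm_line: return (label, {index: value})."""
    parts = line.split()
    label = int(parts[0])
    features = {}
    for pair in parts[1:]:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return label, features


def apply_threshold(score, threshold=0.15):
    """Mirror the decision rule in predict(): below the threshold means abnormal (0)."""
    return 0 if score < threshold else 1


line = to_libsvm_line(1, [0.5, 0.0, 2.25])
print(line)
print(parse_libsvm_line(line))
print(apply_threshold(0.08), apply_threshold(0.9))
```

Because the written indices start at 1, a dense array passed at prediction time needs an extra leading column so that the real features line up with the indices seen during training, which is what the placeholder value in `predict` provides.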

3. Model file

The model trained with the input samples, feature engineering, and XGBoost setup described above: https://github.com/Tencent/Metis/blob/master/time_series_detector/model/xgb_default_model

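4. Serving the model

Mature machine learning platforms, such as Alibaba Cloud PAI, can now handle putting machine learning models into production.

If the algorithm engineering team builds its own serving stack, an approach like tensorflow-serving also works. That is the approach I used at the time.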

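Summary

As a supervised model, XGBoost works reasonably well for time series anomaly detection. With thorough feature engineering and a dataset on the order of 10,000 samples, overall accuracy can reach 85% or higher.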
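When measuring a figure like this, it can help to look at recall and F1 alongside accuracy, as the workflow section suggests, because anomalies are usually rare and accuracy alone can look good even when many of them are missed. The sketch below computes these metrics from label lists; the function name and the toy labels are made up for illustration, with 1 standing for the anomalous class.

```python
def evaluate(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for the given positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels: 3 true anomalies among 10 samples.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1, 0, 1, 0, 0]
report = evaluate(y_true, y_pred)
print(report)
```

In this toy case accuracy is 0.8 while precision and recall are both about 0.67, which shows how the metrics can diverge on imbalanced data.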

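Original content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

For infringement concerns, contact cloudcommunity@tencent.com for removal.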
