XGBoost (eXtreme Gradient Boosting) is a machine learning algorithm based on gradient-boosted decision trees. Anyone familiar with machine learning will know this model well: in traditional machine learning competitions, XGBoost is a regular fixture, and often the winning model.
xgboost can be applied to time series anomaly detection in two forms: as a supervised classifier over features extracted from the series, or as a forecaster whose prediction residuals flag anomalies.
Here, we mainly adopt the first form.
If you are interested in the xgboost algorithm itself, see: https://www.cnblogs.com/mantch/p/11164221.html
In practice, time series anomaly detection with xgboost usually involves the following steps: preprocessing the raw series, engineering features, building a labeled sample set, training the model, and deploying it online.
Time series typically exhibit ordering, trend, seasonality, periodicity, and randomness.
Beyond these basics, each point in a time series tends to be correlated with the points before it, i.e. the series is autocorrelated. Series also commonly contain missing values and may be stationary or non-stationary.
These properties mean preprocessing should include: missing value imputation, normalization, stationarity checks, and STL decomposition, as sketched below.
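To make this concrete, here is a minimal preprocessing sketch using pandas and statsmodels; the helper name, the column names, and the 1440-points-per-day period are illustrative assumptions, not part of Metis:

import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import STL

def preprocess(series: pd.Series, period: int = 1440) -> pd.DataFrame:
    # Fill missing values by linear interpolation (back-fill any leading gap).
    series = series.interpolate(method="linear").bfill()

    # Min-max normalization to [0, 1].
    span = series.max() - series.min()
    normalized = (series - series.min()) / (span if span else 1.0)

    # Stationarity check: ADF test, p < 0.05 suggests stationarity.
    p_value = adfuller(normalized.values)[1]
    print("ADF p-value: %.4f (stationary: %s)" % (p_value, p_value < 0.05))

    # STL decomposition into trend / seasonal / residual components.
    stl = STL(normalized, period=period).fit()
    return pd.DataFrame({
        "trend": stl.trend,
        "seasonal": stl.seasonal,
        "resid": stl.resid,
    })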
Weighing the trade-off between a series's period and its length, day-over-day and week-over-week comparisons usually carry clear signal in business scenarios. We can therefore combine a short recent window with period-aligned windows into a single input sample, as sketched after this paragraph.
Feeding in a full 7 days of data at 1,440 points per day would mean over 10,000 points per sample, a huge cost for real-time data queries. Combining a short window with day- and week-aligned segments keeps the key parts of the sample while drastically cutting the cost of data retrieval.
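A minimal sketch of this sample layout, assuming a 1-minute-granularity series stored as a flat numpy array; the segment length of 180 points and the helper name build_sample are illustrative, not the exact Metis layout:

import numpy as np

POINTS_PER_DAY = 1440   # 1-minute granularity
WINDOW = 180            # length of each segment, in minutes

def build_sample(series: np.ndarray, t: int, window: int = WINDOW) -> np.ndarray:
    """Concatenate three aligned segments ending at index t: the recent
    window, the same window yesterday, and the same window one week ago.
    Assumes t >= 7 * POINTS_PER_DAY + window."""
    now = series[t - window: t]
    yesterday = series[t - POINTS_PER_DAY - window: t - POINTS_PER_DAY]
    last_week = series[t - 7 * POINTS_PER_DAY - window: t - 7 * POINTS_PER_DAY]
    return np.concatenate([last_week, yesterday, now])

This keeps each input at 3 x 180 = 540 points instead of the 10,000+ needed for seven raw days, while preserving the day- and week-aligned context.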
Compute time series features. These fall into the three classes used below: statistical features, fitting features, and classification features.
In traditional xgboost model training, accuracy is usually improved from two directions: richer feature engineering and a larger, better-labeled sample set.
In Metis, the feature count eventually reached 200+. Admittedly, some of those features are overly complex; they essentially pull in intermediate features from other unsupervised detection models, so the hand-crafted flavor is rather heavy.
To compute these time series features, you can use the open source package tsfresh directly: https://tsfresh.readthedocs.io/en/latest/index.html. A minimal usage sketch follows.
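For reference, here is a small tsfresh example; the long-format id/time/value layout follows the tsfresh documentation, and the toy data is made up:

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# tsfresh expects a long-format frame: one row per point,
# with an id column grouping points into individual series.
df = pd.DataFrame({
    "id": [1] * 5,
    "time": range(5),
    "value": [1.0, 2.0, 1.5, 3.0, 2.5],
})
features = extract_features(
    df, column_id="id", column_sort="time",
    default_fc_parameters=MinimalFCParameters(),  # small, fast feature set
)
print(features.shape)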
To train an xgboost anomaly detection model that performs up to expectations, you need roughly 10,000 samples or more.
Previously, to train Metis's xgboost model, we built a sample set on that order of magnitude, where positive denotes anomalous samples and negative denotes normal ones.
Reaching a sample set of that magnitude takes deliberate effort on both fronts. The features themselves are computed in Metis's feature_service module, whose entry point is shown below:
import statistical_features
import classification_features
import fitting_features
from time_series_detector.common import tsd_common


def extract_features(time_series, window):
    """
    Extracts three types of features from the time series.

    :param time_series: the time series to extract the feature of
    :type time_series: pandas.Series
    :param window: the length of window
    :type window: int
    :return: the value of features
    :return type: list with float
    """
    if not tsd_common.is_standard_time_series(time_series, window):
        # add your report of this error here...
        return []

    # split the time series into period-aligned segments
    split_time_series = tsd_common.split_time_series(time_series, window)
    # normalize the time series
    normalized_split_time_series = tsd_common.normalize_time_series(split_time_series)
    max_min_normalized_time_series = tsd_common.normalize_time_series_by_max_min(split_time_series)
    s_features = statistical_features.get_statistical_features(normalized_split_time_series[4])
    f_features = fitting_features.get_fitting_features(normalized_split_time_series)
    c_features = classification_features.get_classification_features(max_min_normalized_time_series)
    # combine the three feature types into one flat list
    features = s_features + f_features + c_features
    return features
def get_statistical_features(x):
    statistical_features = [
        time_series_maximum(x),
        time_series_minimum(x),
        time_series_mean(x),
        time_series_variance(x),
        time_series_standard_deviation(x),
        time_series_skewness(x),
        time_series_kurtosis(x),
        time_series_median(x),
        time_series_abs_energy(x),
        time_series_absolute_sum_of_changes(x),
        time_series_variance_larger_than_std(x),
        time_series_count_above_mean(x),
        time_series_count_below_mean(x),
        time_series_first_location_of_maximum(x),
        time_series_first_location_of_minimum(x),
        time_series_last_location_of_maximum(x),
        time_series_last_location_of_minimum(x),
        int(time_series_has_duplicate(x)),
        int(time_series_has_duplicate_max(x)),
        int(time_series_has_duplicate_min(x)),
        time_series_longest_strike_above_mean(x),
        time_series_longest_strike_below_mean(x),
        time_series_mean_abs_change(x),
        time_series_mean_change(x),
        time_series_percentage_of_reoccurring_datapoints_to_all_datapoints(x),
        time_series_ratio_value_number_to_time_series_length(x),
        time_series_sum_of_reoccurring_data_points(x),
        time_series_sum_of_reoccurring_values(x),
        time_series_sum_values(x),
        time_series_range(x)
    ]
    # append your own statistical features here...
    return statistical_features
https://github.com/Tencent/Metis/blob/master/time_series_detector/feature/statistical_features.py
https://github.com/Tencent/Metis/tree/master/time_series_detector/feature
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
"""
Tencent is pleased to support the open source community by making Metis available.
Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
Licensed under the BSD 3-Clause License (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
https://opensource.org/licenses/BSD-3-Clause
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
"""
import os

import numpy as np
import xgboost as xgb

from time_series_detector.feature import feature_service
from time_series_detector.common.tsd_errorcode import *
from time_series_detector.common.tsd_common import *

MODEL_PATH = os.path.join(os.path.dirname(__file__), '../model/')
DEFAULT_MODEL = MODEL_PATH + "xgb_default_model"
class XGBoosting(object):
    """
    XGBoost is an optimized distributed gradient boosting library designed to be highly
    efficient, flexible and portable. It implements machine learning algorithms under the
    Gradient Boosting framework. XGBoost provides parallel tree boosting (also known as
    GBDT, GBM) that solves many data science problems in a fast and accurate way. The same
    code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems
    beyond billions of examples.

    https://github.com/dmlc/xgboost
    """
    def __init__(self,
                 threshold=0.15,
                 max_depth=10,
                 eta=0.05,
                 gamma=0.1,
                 silent=1,
                 min_child_weight=1,
                 subsample=0.8,
                 colsample_bytree=1,
                 booster='gbtree',
                 objective='binary:logistic',
                 eval_metric='auc'):
        """
        :param threshold: Decision threshold on the predicted score; below it a sample is labeled abnormal.
        :param max_depth: Maximum tree depth for base learners.
        :param eta: Step size shrinkage; smaller values make the model more robust to overfitting but slower to train.
        :param gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree.
        :param silent: If 1, suppress xgboost's messages during training; if 0, print them.
        :param min_child_weight: Minimum sum of instance weight (hessian) needed in a child.
        :param subsample: Subsample ratio of the training instances.
        :param colsample_bytree: Subsample ratio of columns when constructing each tree.
        :param booster: Specify which booster to use: gbtree, gblinear or dart.
        :param objective: The learning task and the corresponding learning objective, or a custom objective function.
        :param eval_metric: If a str, a built-in evaluation metric to use; if callable, a custom evaluation metric.
        """
        self.threshold = threshold
        self.max_depth = max_depth
        self.eta = eta
        self.gamma = gamma
        self.silent = silent
        self.min_child_weight = min_child_weight
        self.subsample = subsample
        self.colsample_bytree = colsample_bytree
        self.booster = booster
        self.objective = objective
        self.eval_metric = eval_metric
    def __save_libsvm_format(self, data, feature_file_name):
        """
        Save the time series features in libsvm format: "<label> <index>:<value> ...".

        :param data: list of [feature_values, label] pairs
        :param feature_file_name: file that saves the features and labels
        """
        try:
            f = open(feature_file_name, "w")
        except Exception as ex:
            return TSD_CAL_FEATURE_ERR, str(ex)
        times = 0
        for temp in data:
            if times > 0:
                f.write("\n")
            # libsvm feature indices are 1-based
            result = ['{0}:{1}'.format(int(index) + 1, value) for index, value in enumerate(temp[0])]
            f.write(str(temp[1]))
            for x in result:
                f.write(' ' + x)
            times = times + 1
        f.close()
        return TSD_OP_SUCCESS, ""
    def __calculate_features(self, data, feature_file_name, window=DEFAULT_WINDOW):
        """
        Calculate time series features and save them in libsvm format.

        :param data: the labeled time series records to extract features from
        :param feature_file_name: the file to write the features to
        :param window: the length of window
        """
        features = []
        for index in data:
            if is_standard_time_series(index["data"], window):
                temp = []
                temp.append(feature_service.extract_features(index["data"], window))
                temp.append(index["flag"])
                features.append(temp)
        try:
            ret_code, ret_data = self.__save_libsvm_format(features, feature_file_name)
        except Exception as ex:
            ret_code = TSD_CAL_FEATURE_ERR
            ret_data = str(ex)
        return ret_code, ret_data
    def xgb_train(self, data, task_id, num_round=300):
        """
        Train an xgboost model.

        :param data: Training dataset.
        :param task_id: The id of the training task.
        :param num_round: Max number of boosting iterations.
        """
        model_name = MODEL_PATH + task_id + "_model"
        feature_file_name = MODEL_PATH + task_id + "_features"
        ret_code, ret_data = self.__calculate_features(data, feature_file_name)
        if ret_code != TSD_OP_SUCCESS:
            return ret_code, ret_data
        try:
            dtrain = xgb.DMatrix(feature_file_name)
        except Exception as ex:
            return TSD_READ_FEATURE_FAILED, str(ex)
        params = {
            'max_depth': self.max_depth,
            'eta': self.eta,
            'gamma': self.gamma,
            'silent': self.silent,
            'min_child_weight': self.min_child_weight,
            'subsample': self.subsample,
            'colsample_bytree': self.colsample_bytree,
            'booster': self.booster,
            'objective': self.objective,
            'eval_metric': self.eval_metric,
        }
        try:
            bst = xgb.train(params, dtrain, num_round)
            bst.save_model(model_name)
        except Exception as ex:
            return TSD_TRAIN_ERR, str(ex)
        return TSD_OP_SUCCESS, ""
    def predict(self, X, window=DEFAULT_WINDOW, model_name=DEFAULT_MODEL):
        """
        Use an xgboost model to predict whether a particular sample is an outlier.

        :param X: the time series to detect
        :type X: pandas.Series
        :param window: the length of window
        :param model_name: path of the xgboost model to load
        :return: [label, score], where label 1 denotes normal and 0 denotes abnormal
        """
        if is_standard_time_series(X, window):
            ts_features = []
            # placeholder in column 0 so the feature columns line up with
            # the 1-based indices written by __save_libsvm_format
            features = [10]
            features.extend(feature_service.extract_features(X, window))
            ts_features.append(features)
            res_pred = xgb.DMatrix(np.array(ts_features))
            bst = xgb.Booster({'nthread': 4})
            bst.load_model(model_name)
            xgb_ret = bst.predict(res_pred)
            if xgb_ret[0] < self.threshold:
                value = 0
            else:
                value = 1
            return [value, xgb_ret[0]]
        else:
            # series does not match the expected layout; report abnormal with score 0
            return [0, 0]
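For completeness, a hypothetical usage sketch of the class above. The training records follow the {"data": [...], "flag": 0 or 1} shape implied by __calculate_features; load_labeled_samples and new_series are stand-ins for your own data plumbing:

detector = XGBoosting(threshold=0.15)

# Train: writes <MODEL_PATH>/demo_model on success.
train_set = load_labeled_samples()  # hypothetical loader returning labeled records
ret_code, ret_msg = detector.xgb_train(train_set, task_id="demo")
assert ret_code == TSD_OP_SUCCESS, ret_msg

# Predict: returns [label, score]; label 1 means normal, 0 means abnormal.
label, score = detector.predict(new_series, model_name=MODEL_PATH + "demo_model")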
Based on the input samples, feature engineering, and xgboost model described above, the trained model is available at: https://github.com/Tencent/Metis/blob/master/time_series_detector/model/xgb_default_model
For taking machine learning models online, there are now mature ML platforms that can help, such as Alibaba Cloud's PAI.
Of course, if the algorithm engineering team prefers to build its own serving stack, it can take an approach analogous to tensorflow-serving; that is the route I took at the time, sketched below.
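XGBoost has no official serving counterpart to tensorflow-serving, so a self-built service typically wraps the booster behind a small HTTP endpoint. A minimal Flask sketch; the route, payload shape, and model path are assumptions:

import numpy as np
import xgboost as xgb
from flask import Flask, jsonify, request

app = Flask(__name__)
bst = xgb.Booster()
bst.load_model("xgb_default_model")  # model path is an assumption

@app.route("/detect", methods=["POST"])
def detect():
    # Expected payload: {"features": [f1, f2, ...]} -- already-extracted features.
    features = request.get_json()["features"]
    score = float(bst.predict(xgb.DMatrix(np.array([features])))[0])
    # 0.15 mirrors the XGBoosting class's default threshold.
    return jsonify({"score": score, "normal": score >= 0.15})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)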
The supervised xgboost model works quite well for time series anomaly detection: with thorough feature engineering and a dataset on the order of 10,000 samples, overall accuracy can exceed 85%.