Learning the GBDT + LR Fusion Approach

Source: 企鹅号 - 菜鸟的学习历程

What is GBDT

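GBDT stands for Gradient Boosting Decision Tree. It is a widely used nonlinear model built on the boosting idea from ensemble learning: each iteration builds a new decision tree along the gradient direction that reduces the residual left by the trees before it, so the number of iterations equals the number of trees. Too many iterations can easily lead to overfitting. A simple example illustrates the idea. Suppose an item costs 100 yuan. We first fit it with 80 yuan and find a remaining loss of 20. We then fit that loss with 10 yuan, which leaves 10, and then with 5 yuan, which leaves 5. As long as iterations remain, we keep going, and each iteration reduces the fitting error on the training data. For the gradient part, a look at the backpropagation (BP) algorithm makes the idea quite intuitive.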
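The 100 yuan example can be traced in a few lines of code. This is only a toy sketch: the stage outputs (80, 10, 5) are the numbers from the text, whereas a real GBDT learns each stage with a regression tree. For squared-error loss, the residual computed here is also the negative gradient of the loss, which is where the "gradient" in the name comes from.

```python
# Toy trace of the "100 yuan" example: each stage fits part of the
# residual left by the previous stages.
target = 100.0
stage_outputs = [80.0, 10.0, 5.0]   # numbers taken from the text, not learned

prediction = 0.0
residuals = []
for step, output in enumerate(stage_outputs, start=1):
    prediction += output             # add the new stage to the ensemble
    residual = target - prediction   # what is still left to explain
    residuals.append(residual)
    print(f"step {step}: prediction={prediction}, residual={residual}")
```

The final prediction is the sum of all stage outputs, which is how a boosted ensemble combines its trees.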

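GBDT is naturally good at discovering discriminative features and feature combinations, and the paths of its decision trees can be used as input features for LR. This approach is used often on Kaggle; in one Kaggle CTR competition, GBDT + LR/FM achieved good results.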
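To make "tree paths as LR features" concrete, here is a minimal sketch using scikit-learn. The dataset size, the number of trees, the depth, and the seeds are arbitrary choices for illustration. Each sample is described by the leaf it reaches in every tree, and those leaf indices are one-hot encoded into a sparse binary feature vector.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder

# Small synthetic dataset; sizes and seeds are arbitrary.
X, y = make_classification(n_samples=200, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=3, max_depth=2, random_state=0)
gbdt.fit(X, y)

# apply() returns the index of the leaf each sample lands in, per tree.
# For binary classification the shape is (n_samples, n_estimators, 1).
leaves = gbdt.apply(X)[:, :, 0]

# One-hot encode the leaf indices: one binary column per (tree, leaf) pair.
encoder = OneHotEncoder()
leaf_features = encoder.fit_transform(leaves)

print(leaves[:3])            # leaf index per tree for the first three samples
print(leaf_features.shape)   # (200, number of distinct leaves seen)
```

Every row of `leaf_features` has exactly one active column per tree, so a linear model on top of it effectively learns one weight per leaf.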

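I used to use the two models separately. Today I found that the ROC curve from using them separately is not as good as the one from combining them, so I decided to share this. LR is not covered in detail here. It can be thought of as a linear function followed by a sigmoid (S-shaped) function that maps the output into the interval from 0 to 1, which can be read as a probability and used for binary classification.

I use Jupyter with Python as the development tool, which lets me see the result of each step right away.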
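The description of LR above, a linear function passed through a sigmoid, can be sketched in plain Python. The weights, bias, and input below are hypothetical values chosen only for illustration; a fitted model would learn them from data.

```python
import math

def sigmoid(z):
    """Map any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Logistic regression score: sigmoid of a linear function of the features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights and input, for illustration only.
weights = [0.5, -1.0]
bias = 0.0
p = predict_proba(weights, bias, [2.0, 1.0])   # z = 0.5*2 - 1.0*1 = 0, so p = 0.5
label = 1 if p >= 0.5 else 0                   # threshold the probability at 0.5
```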

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve
from sklearn.pipeline import make_pipeline

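Those are the Python packages we need. Next we set the number of iterations to 10, generate a synthetic binary classification dataset with make_classification, and split it. The training part is split once more so that GBDT and LR are fitted on different samples, which helps avoid overfitting.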

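n_estimator = 10
X, y = make_classification(n_samples=100000)
# test_size=0.8 keeps 20% of the data for training and 80% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8)
# Split the training data in half: one half for GBDT, the other for LR
X_train, X_train_lr, y_train, y_train_lr = train_test_split(X_train, y_train, test_size=0.5)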

Then we define the GBDT, fit it, and use the leaves of the decision trees it produces as the input to LR.

gbdt = GradientBoostingClassifier(
    max_depth=3, n_estimators=n_estimator, random_state=0)
gbdt_enc = OneHotEncoder()
gbdt_lr = LogisticRegression()
# Fit GBDT on the first half of the training data
gbdt.fit(X_train, y_train)
# apply() gives each sample's leaf index in every tree; one-hot encode those indices
gbdt_enc.fit(gbdt.apply(X_train)[:, :, 0])
# Fit LR on the encoded leaves of the other half
gbdt_lr.fit(gbdt_enc.transform(gbdt.apply(X_train_lr)[:, :, 0]), y_train_lr)
y_pred_gbdt_lr = gbdt_lr.predict_proba(
    gbdt_enc.transform(gbdt.apply(X_test)[:, :, 0]))[:, 1]
fpr_gbdt_lr, tpr_gbdt_lr, _ = roc_curve(y_test, y_pred_gbdt_lr)

Now look at how GBDT performs on its own:

y_pred_gbdt = gbdt.predict_proba(X_test)[:, 1]
fpr_gbdt, tpr_gbdt, _ = roc_curve(y_test, y_pred_gbdt)
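Two ROC curves can be hard to compare by eye, so one option is to summarize each as a single number, the area under the curve (AUC). The sketch below is self-contained and rests on assumptions not in the original: it uses `roc_auc_score`, a smaller dataset, fixed `random_state` values, `handle_unknown="ignore"` on the encoder, and `max_iter=1000` for LR. The numbers it prints depend on the data and the seeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

# Smaller dataset and fixed seeds so the sketch runs quickly and repeatably.
X, y = make_classification(n_samples=20000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)
X_gbdt, X_lr, y_gbdt, y_lr = train_test_split(
    X_train, y_train, test_size=0.5, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=10, max_depth=3, random_state=0)
gbdt.fit(X_gbdt, y_gbdt)

# handle_unknown="ignore" keeps transform() from failing on a leaf
# that did not appear while the encoder was fitted.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(gbdt.apply(X_gbdt)[:, :, 0])

lr = LogisticRegression(max_iter=1000)
lr.fit(enc.transform(gbdt.apply(X_lr)[:, :, 0]), y_lr)

# Score both models on the same held-out test set.
auc_gbdt = roc_auc_score(y_test, gbdt.predict_proba(X_test)[:, 1])
auc_gbdt_lr = roc_auc_score(
    y_test, lr.predict_proba(enc.transform(gbdt.apply(X_test)[:, :, 0]))[:, 1])
print(f"AUC GBDT:      {auc_gbdt:.4f}")
print(f"AUC GBDT + LR: {auc_gbdt_lr:.4f}")
```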

Next, let's look at the ROC plot:

plt.figure(1)
# Draw the diagonal from (0, 0) to (1, 1) first; "k--" means a black dashed line
plt.plot([0, 1], [0, 1], "k--")
plt.plot(fpr_gbdt, tpr_gbdt, label='GBT')
plt.plot(fpr_gbdt_lr, tpr_gbdt_lr, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve')
plt.legend(loc='best')
plt.show()

The result is easy to see at a glance.

What does it look like if we zoom in on the upper-left region?

plt.figure(2)
plt.xlim(0, 0.2)
plt.ylim(0.8, 1)
plt.plot([0, 1], [0, 1], "k--")
plt.plot(fpr_gbdt, tpr_gbdt, label='GBT')
plt.plot(fpr_gbdt_lr, tpr_gbdt_lr, label='GBT + LR')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.title('ROC curve bigger')
plt.legend(loc='best')
plt.show()
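The full plot and the zoomed plot differ only in their axis limits, so the two can share one helper. The function `plot_roc` below is a hypothetical name, not something from the original post, and the curve points are made up purely to exercise it. In the article's setting you would pass `(fpr_gbdt, tpr_gbdt)` and `(fpr_gbdt_lr, tpr_gbdt_lr)` instead.

```python
import matplotlib.pyplot as plt

def plot_roc(curves, xlim=(0.0, 1.0), ylim=(0.0, 1.0), title="ROC curve"):
    """Draw several ROC curves on one axes; xlim/ylim select the visible region."""
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, 1], "k--")            # diagonal = random classifier
    for label, (fpr, tpr) in curves.items():
        ax.plot(fpr, tpr, label=label)
    ax.set_xlim(*xlim)
    ax.set_ylim(*ylim)
    ax.set_xlabel("False positive rate")
    ax.set_ylabel("True positive rate")
    ax.set_title(title)
    ax.legend(loc="best")
    return ax

# Made-up points, for illustration only.
demo_curves = {
    "GBT": ([0.0, 0.1, 1.0], [0.0, 0.85, 1.0]),
    "GBT + LR": ([0.0, 0.1, 1.0], [0.0, 0.9, 1.0]),
}
full_ax = plot_roc(demo_curves)
zoom_ax = plot_roc(demo_curves, xlim=(0.0, 0.2), ylim=(0.8, 1.0),
                   title="ROC curve (zoomed)")
```

Calling `plt.show()` afterwards displays both figures.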

Published: 2018-03-28 22:48:05
Original link: http://kuaibao.qq.com/s/20180328G1WLWQ00?refer=cp_1026