[Machine Learning in Practice] Telecom Customer Churn Prediction

机器学习司猫白

Published 2025-01-21 09:42:00

This article is part of the column: 机器学习实战 (Machine Learning in Practice)

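This article mainly introduces a feature selection method.

In this project we show how to predict customer churn in the telecom industry with machine learning. We first use a random forest (RF) together with recursive feature elimination with cross-validation (RFECV) to select features, keeping the variables with the most predictive value. We then train LightGBM, an efficient gradient boosting algorithm, and use oversampling to address the heavy imbalance between positive and negative samples. Finally, SHAP (SHapley Additive exPlanations) is used to interpret the model, showing how it reaches its decisions and how much each feature contributes.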

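The importance of feature selection

Feature selection is a key step in machine learning: it directly affects a model's performance, its efficiency, and its final predictions. Its main benefits are: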

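Improves model performance: selecting features that are strongly related to the target helps the model learn the key patterns in the data. Removing redundant or irrelevant features reduces the risk of overfitting and improves generalization, so the model tends to predict more accurately on new data.
Reduces overfitting: a model with many irrelevant or noisy features may overfit, meaning it performs very well on the training data but poorly on new data. Removing such features simplifies the model, lowers the risk of overfitting, and makes the model more robust.
Speeds up training: training time usually grows with the number of features. Using fewer features can cut training time considerably, which matters most for high-dimensional data such as images, text, or genomic data. Lower dimensionality improves computational efficiency and speeds up both training and tuning.
Improves interpretability: when a model contains many features, it is hard to see how much each one contributes to a prediction. Keeping only the most important features makes the model easier to interpret and its decisions more transparent, which is essential where models must be explained, for example in finance and healthcare.
Lowers data collection and storage costs: collecting and storing data usually takes significant resources. Reducing the number of features lowers storage requirements and processing costs, and the savings are largest on large datasets.
Reduces multicollinearity: multicollinearity means that features are strongly linearly correlated with one another, which can make a model's estimates unstable and inaccurate. Removing redundant features reduces multicollinearity, making the model more stable and lowering the variance of its parameters.
Makes the data easier to work with: large datasets can contain a very large number of features, and not all of them are useful for prediction. Feature selection reduces the data to a form that is easier to handle and analyze, which in turn supports faster and better business decisions.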
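
As a small illustration of the multicollinearity point, the sketch below flags a nearly duplicated feature through pairwise correlations. The data is synthetic, and the column names and the 0.95 threshold are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = rng.normal(size=n)
df = pd.DataFrame({
    "a": a,
    "a_copy": a + rng.normal(scale=0.01, size=n),  # nearly a duplicate of "a"
    "b": b,
})

# Absolute pairwise correlations; keep the upper triangle so each pair appears once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flag every column that correlates above the threshold with an earlier column.
threshold = 0.95  # assumed cut-off for this example
redundant = [col for col in upper.columns if (upper[col] > threshold).any()]
print(redundant)  # ['a_copy']
```

A correlation filter like this only catches pairwise linear redundancy; RFECV, introduced next, ranks features by how much a model actually relies on them.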

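Introduction to RFECV

RFECV (Recursive Feature Elimination with Cross-Validation) is a feature selection method that combines recursive feature elimination (RFE) with cross-validation (CV). The core idea is to train a model repeatedly, removing the least important features each time, while using cross-validation to evaluate performance, and in this way to find the best feature subset. RFECV is often used to improve model performance, especially when there are many features, because it can identify the ones with the most predictive power.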

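How RFECV works:

Recursive feature elimination (RFE):

RFE trains a model repeatedly and removes the least important feature at each iteration (or several features, depending on the step size), shrinking the feature set until only the most important features remain. At each iteration the model ranks the features by some criterion, such as weights or coefficients, and the lowest-ranked feature is dropped. In a linear model, for example, importance is usually judged by the absolute value of each coefficient.

Cross-validation (CV):

Cross-validation is a technique for evaluating model performance. It splits the dataset into several subsets and takes turns using different subsets for training and testing, which reduces the variation caused by any single split. In RFECV, cross-validation evaluates the model on different feature subsets. Averaging the scores across folds gives the mean performance of each subset, and the subset that generalizes best is selected.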
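
The two building blocks can be tried separately before they are combined. The following is a minimal sketch on synthetic data: `RFE` keeps a fixed number of features, and `cross_val_score` then scores a model on them. The dataset sizes, the logistic regression estimator, and the choice of 3 features are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: 8 features, of which 3 are informative.
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

estimator = LogisticRegression(max_iter=1000)

# RFE: drop one feature per iteration until 3 remain.
rfe = RFE(estimator, n_features_to_select=3, step=1).fit(X, y)
print(rfe.ranking_)  # rank 1 marks the kept features; larger ranks were dropped earlier

# CV: 5-fold accuracy of the same estimator on the kept features.
scores = cross_val_score(estimator, X[:, rfe.support_], y, cv=5)
print(scores.mean())
```

With plain RFE the number of features to keep has to be chosen in advance. RFECV removes that choice by scoring every candidate size with cross-validation.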

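The RFECV workflow:

Initial stage:

Train a model (for example a support vector machine or a random forest) on all features and evaluate it with cross-validation, usually K-fold.

Feature elimination:

Using the feature importances of the current model (weights, importance scores, and so on), identify the least important feature and remove it from the feature set.

Retraining and evaluation:

After the feature is removed, retrain the model and evaluate it again with cross-validation. The performance of the current feature subset is usually measured by the mean cross-validation score.

Repetition:

Repeat the steps above, removing features one at a time and evaluating the model after each removal. Each time a feature is removed, the model is refitted and cross-validated again.

Selecting the best feature subset:

Once every round of elimination has been scored, the feature subset with the highest cross-validation score is chosen as the final feature set.

Output:

Finally, RFECV returns a model fitted on the best feature subset, together with the corresponding performance scores.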
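
The workflow above can be written out as a short loop. This is a simplified sketch for illustration, not scikit-learn's implementation (the library runs the elimination separately inside each cross-validation fold); the synthetic data and the small random forest are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

remaining = list(range(X.shape[1]))  # indices of the features still in play
history = []                         # (feature subset, mean CV score) per round

while remaining:
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    # Evaluate the current subset with 3-fold cross-validation.
    score = cross_val_score(model, X[:, remaining], y, cv=3).mean()
    history.append((list(remaining), score))
    if len(remaining) == 1:
        break
    # Refit on the current subset and drop the feature with the smallest importance.
    model.fit(X[:, remaining], y)
    weakest = int(np.argmin(model.feature_importances_))
    remaining.pop(weakest)

# Keep the subset with the highest mean cross-validation score.
best_subset, best_score = max(history, key=lambda item: item[1])
print(len(best_subset), round(best_score, 3))
```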

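Key advantages of RFECV:

Automatic selection of the best features: RFECV automatically keeps the features that contribute most to predictive performance and removes redundant or irrelevant ones. This reduces noise and improves generalization.

Robustness from cross-validation: because every candidate subset is scored across several data splits, the selected subset is less dependent on any single split, which improves the model's generalization and robustness.

Guarding against overfitting: since RFECV scores each subset with cross-validation, it helps avoid the overfitting that too many features can cause, keeping accuracy while avoiding a model tailored to one particular dataset.

Works with many models: RFECV can be used with linear models (logistic regression, Lasso regression, and so on) as well as non-linear models such as random forests, as long as the fitted estimator exposes feature importances, by default through coef_ or feature_importances_. A support vector machine qualifies when it uses a linear kernel, which provides coef_. This makes RFECV widely applicable across machine learning tasks.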
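
When the estimator is wrapped in a pipeline, the importances live on an inner step, and RFECV can be told where to find them through its `importance_getter` parameter. The sketch below assumes a pipeline whose final step is named `clf`; the step names and the synthetic data are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)

# Scaling happens inside the pipeline, so it is refitted on every feature subset.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

selector = RFECV(
    pipe,
    step=1,
    cv=3,
    scoring="roc_auc",
    importance_getter="named_steps.clf.coef_",  # where the coefficients live
)
selector.fit(X, y)
print(selector.n_features_)
```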

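Implementation steps for RFECV:

Data preparation: collect and prepare the input data, including the feature matrix and the target variable.
Choosing a model: pick a base model such as logistic regression, a support vector machine, or a random forest. RFECV ranks features using the model's feature importances.
Applying RFECV: use the RFECV class from sklearn (scikit-learn, the Python machine learning library), passing in the base model and the cross-validation settings. The class then performs recursive feature elimination with cross-validation automatically.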
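
These three steps fit in a few lines. The following is a minimal sketch on synthetic data rather than the churn dataset; the sample sizes and model settings are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

# Step 1: data preparation (synthetic stand-in for a real feature matrix and target).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           n_redundant=2, random_state=42)

# Step 2: choose a base model that exposes feature_importances_.
base_model = RandomForestClassifier(n_estimators=30, random_state=42)

# Step 3: apply RFECV with stratified 3-fold cross-validation.
selector = RFECV(
    estimator=base_model,
    step=1,                      # remove one feature per iteration
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    min_features_to_select=1,
)
selector.fit(X, y)

# transform() keeps only the selected columns.
X_reduced = selector.transform(X)
print(selector.n_features_, X_reduced.shape)
```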

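Source code

Reading the file

import pandas as pd
import warnings

# Suppress all warnings
warnings.filterwarnings("ignore")

# Raw string, so the backslash before "t" is not read as a tab escape (\t)
file_path = r'ata\train.csv'
# Read the file with GBK encoding
train_df = pd.read_csv(file_path, encoding='GBK')

train_df.info(verbose=True)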

Feature engineering

# For months 6, 7 and 8, convert last_date_of_month to a Unix timestamp
for n in [6, 7, 8]:
    # Build the column names
    column_name = f'last_date_of_month_{n}'
    timestamp_column_name = f'last_date_of_month_{n}_timestamp'

    # Convert the column to datetime
    train_df[column_name] = pd.to_datetime(train_df[column_name])

    # Convert to a timestamp in seconds (datetime64[ns] is stored in nanoseconds)
    train_df[timestamp_column_name] = train_df[column_name].astype('int64') // 10**9

train_df['date_of_last_rech_6'] = pd.to_datetime( train_df['date_of_last_rech_6'])
train_df['date_of_last_rech_7'] = pd.to_datetime( train_df['date_of_last_rech_7'])
train_df['date_of_last_rech_8'] = pd.to_datetime( train_df['date_of_last_rech_8'])

train_df['cha_67'] = (train_df['date_of_last_rech_7'] - train_df['date_of_last_rech_6']).dt.days
train_df['cha_78'] = (train_df['date_of_last_rech_8'] - train_df['date_of_last_rech_7']).dt.days
# Total call minutes (outgoing plus incoming)
train_df['zongshichang_6'] = train_df['total_og_mou_6'] + train_df['total_ic_mou_6']
train_df['zongshichang_7'] = train_df['total_og_mou_7'] + train_df['total_ic_mou_7']
train_df['zongshichang_8'] = train_df['total_og_mou_8'] + train_df['total_ic_mou_8']
# Month-over-month change in total call minutes
train_df['zongshicha_78'] = train_df['zongshichang_8'] - train_df['zongshichang_7']
train_df['zongshicha_67'] = train_df['zongshichang_7'] - train_df['zongshichang_6']
# Average amount per recharge (a zero recharge count yields inf or NaN)
train_df['rech_avg_6'] = train_df['total_rech_amt_6'] / train_df['total_rech_num_6']
train_df['rech_avg_7'] = train_df['total_rech_amt_7'] / train_df['total_rech_num_7']
train_df['rech_avg_8'] = train_df['total_rech_amt_8'] / train_df['total_rech_num_8']

# Ratio of on-network to off-network call minutes (inf or NaN when the off-network value is zero)
train_df['bili_6'] = train_df['onnet_mou_6'] / train_df['offnet_mou_6']
train_df['bili_7'] = train_df['onnet_mou_7'] / train_df['offnet_mou_7']
train_df['bili_8'] = train_df['onnet_mou_8'] / train_df['offnet_mou_8']
# Call minutes to and from other operators' networks
train_df['yiwangshichang_6'] = (train_df['loc_og_t2m_mou_6'] +train_df['std_og_t2m_mou_6'] + train_df['loc_ic_t2m_mou_6'] + train_df['std_ic_t2m_mou_6'])
train_df['yiwangshichang_7'] = (train_df['loc_og_t2m_mou_7'] +train_df['std_og_t2m_mou_7'] + train_df['loc_ic_t2m_mou_7'] + train_df['std_ic_t2m_mou_7'])
train_df['yiwangshichang_8'] = (train_df['loc_og_t2m_mou_8'] +train_df['std_og_t2m_mou_8'] + train_df['loc_ic_t2m_mou_8'] + train_df['std_ic_t2m_mou_8'])
# Preview the first rows
train_df.head()
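
The ratio features above divide by columns that may contain zeros. The sketch below shows what pandas returns in that case and one possible way to avoid infinite values. The column names are taken from the code above, but the three rows are made up for illustration, and mapping zero denominators to NaN is an assumption rather than the article's approach.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "total_rech_amt_6": [100.0, 0.0, 50.0],
    "total_rech_num_6": [4, 0, 0],
})

# Plain division: 0/0 gives NaN and 50/0 gives inf.
raw = demo["total_rech_amt_6"] / demo["total_rech_num_6"]

# Treat a zero denominator as missing, so the ratio becomes NaN instead of inf.
safe = demo["total_rech_amt_6"] / demo["total_rech_num_6"].replace(0, np.nan)

print(raw.tolist())   # [25.0, nan, inf]
print(safe.tolist())  # [25.0, nan, nan]
```

The NaN values produced this way are then handled by the same missing-value step as the rest of the data.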

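Removing outliers

# Loop over every numeric column in the DataFrame
for column in train_df.select_dtypes(include=['int64', 'float64']).columns:
    # Compute the first and third quartiles of the column
    Q1 = train_df[column].quantile(0.25)
    Q3 = train_df[column].quantile(0.75)
    IQR = Q3 - Q1

    # Define the lower and upper bounds for outliers
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Replace outliers with the column median (vectorized; NaN values are left untouched)
    median = train_df[column].median()
    is_outlier = (train_df[column] < lower_bound) | (train_df[column] > upper_bound)
    train_df[column] = train_df[column].mask(is_outlier, median)
train_df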
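
A worked example on seven made-up values shows what the IQR rule does. Note that when Q1 equals Q3 the IQR is 0, so every value different from that common value is treated as an outlier; this can affect columns dominated by a single value.

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 11, 200])

q1, q3 = s.quantile(0.25), s.quantile(0.75)  # 11.0 and 12.5
iqr = q3 - q1                                # 1.5
lower = q1 - 1.5 * iqr                       # 8.75
upper = q3 + 1.5 * iqr                       # 14.75

# 200 lies above the upper bound and is replaced by the median (12).
cleaned = s.mask((s < lower) | (s > upper), s.median())
print(cleaned.tolist())
```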

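Handling missing values

# Equal-frequency binning of rech_avg_6 into 3 bins, labeled 1 to 3
train_df['fenxiang_6'] = pd.qcut(train_df['rech_avg_6'], q=3, labels=[1, 2, 3])

import numpy as np

# Fill missing values with the mode of each column
for column in train_df.columns:
    if train_df[column].isnull().any():  # check whether the column has missing values
        mode_value = train_df[column].mode()[0]  # most frequent value
        # Assign the result back instead of calling fillna(inplace=True) on a column selection
        train_df[column] = train_df[column].fillna(mode_value)

label_df = train_df['label']
# Drop the label and the raw datetime columns. scikit-learn cannot convert datetime
# values to floats, and their information is carried by the derived timestamp and
# day-difference features. (The original list named date_of_last_rech_6 twice and
# omitted date_of_last_rech_8.)
date_columns = [f'{prefix}_{n}' for prefix in ('date_of_last_rech', 'last_date_of_month') for n in (6, 7, 8)]
train_df = train_df.drop(columns=['label'] + date_columns)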
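
The two operations in this step can be checked on a toy frame. In the sketch below, `rech_avg_6` comes from the code above, while the `plan` column and all values are hypothetical.

```python
import pandas as pd

demo = pd.DataFrame({
    "rech_avg_6": [5.0, 10.0, 20.0, 40.0, 80.0, 160.0],
    "plan": ["A", None, "B", "A", "A", None],  # hypothetical categorical column
})

# Equal-frequency binning: each of the 3 bins receives two of the six values.
demo["bin"] = pd.qcut(demo["rech_avg_6"], q=3, labels=[1, 2, 3])

# Mode imputation: "A" is the most frequent value, so it fills the gaps.
demo["plan"] = demo["plan"].fillna(demo["plan"].mode()[0])

print(demo["bin"].tolist())   # [1, 1, 2, 2, 3, 3]
print(demo["plan"].tolist())  # ['A', 'A', 'B', 'A', 'A', 'A']
```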

Encoding

from sklearn.preprocessing import OneHotEncoder
from pypinyin import pinyin, Style
import pandas as pd

# Convert a Chinese column name to pinyin
def to_pinyin(chinese_name):
    # Romanize the Chinese characters with pypinyin, without tone marks
    pinyin_name = ''.join([item[0] for item in pinyin(chinese_name, style=Style.NORMAL)])
    return pinyin_name

# Create the OneHotEncoder; sparse_output=False returns a dense matrix (scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)

# Collect the names of all object-typed columns
object_columns = train_df.select_dtypes(include='object').columns

# One-hot encode each object-typed column in turn
for column in object_columns:
    # Apply one-hot encoding
    encoded_data = encoder.fit_transform(train_df[[column]])

    # Wrap the encoded data in a DataFrame with pinyin column names;
    # reuse the original index so that concat aligns the rows correctly
    encoded_df = pd.DataFrame(
        encoded_data,
        columns=[to_pinyin(col_name) for col_name in encoder.get_feature_names_out([column])],
        index=train_df.index,
    )

    # Append the encoded columns to the original DataFrame
    train_df = pd.concat([train_df, encoded_df], axis=1)

    # Drop the original object column
    train_df.drop(column, axis=1, inplace=True)

# Inspect the resulting column names
print(train_df.columns)

Y = pd.DataFrame(label_df)
# Map '是' (yes) to 1 and '否' (no) to 0
Y['label'] = Y['label'].map({'是': 1, '否': 0})
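
For a quick check of what one-hot encoding and label mapping produce, the sketch below uses `pandas.get_dummies` in place of `OneHotEncoder` (a deliberate swap to keep the example short). The `plan` column and its values are hypothetical; the '是'/'否' labels follow the article.

```python
import pandas as pd

demo = pd.DataFrame({
    "plan": ["basic", "premium", "basic"],  # hypothetical categorical feature
    "label": ["是", "否", "是"],
})

# One indicator column per category, named <column>_<category>.
features = pd.get_dummies(demo[["plan"]], columns=["plan"], dtype=int)

# Map the yes/no labels to 1/0.
target = demo["label"].map({"是": 1, "否": 0})

print(features.columns.tolist())  # ['plan_basic', 'plan_premium']
print(target.tolist())            # [1, 0, 1]
```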

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
plt.rcParams['font.family'] = 'Times New Roman'
plt.rcParams['axes.unicode_minus'] = False
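# Separate the features and the target variable
X = train_df
y = Y['label']  # 1-D Series, the shape scikit-learn expects for a target
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
# Initialize the random forest; fewer and shallower trees keep the computation time down
clf = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42, n_jobs=-1)

# Use StratifiedKFold with 3 folds for cross-validation
cv = StratifiedKFold(n_splits=3)

## Available scoring options:
## 'accuracy': share of predictions that are correct.
## 'precision': share of predicted positives that are actually positive.
## 'recall': share of actual positives that are predicted as positive.
## 'f1': F1 score, the harmonic mean of precision and recall.
## 'roc_auc': area under the receiver operating characteristic curve (AUC).
## 'average_precision': average precision, a summary of the precision-recall curve for binary problems.
## 'neg_log_loss': negated log loss (scikit-learn scorers follow a higher-is-better convention).

# Recursive feature elimination with cross-validation
rfecv = RFECV(estimator=clf, step=1, cv=cv, scoring='accuracy')    ## accuracy, precision, recall, f1, roc_auc, average_precision, neg_log_loss
rfecv.fit(X_train, y_train)

# Print the optimal number of features
print(f"Optimal number of features: {rfecv.n_features_}")

# Get the cross-validation results
cv_results = rfecv.cv_results_

# Extract the scores of each of the 3 folds
fold_scores = [cv_results[f'split{i}_test_score'] for i in range(3)]

# Mean score across folds
mean_scores = cv_results['mean_test_score']

# Print the selected feature columns
selected_features = X_train.columns[rfecv.support_]
print(f"Selected features: {list(selected_features)}")

# Keep only the selected features from the original DataFrame
df_selected = train_df[selected_features]
df_selected.head()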

import matplotlib.pyplot as plt

# Set up the figure
plt.figure(figsize=(12, 8), dpi=1200)
plt.title('Recursive Feature Elimination with Cross-Validation (RFECV)', fontsize=16, fontweight='bold', pad=20)
plt.xlabel('Number of features selected', fontsize=14, labelpad=15)
plt.ylabel('Cross-validation score (accuracy)', fontsize=14, labelpad=15)

# Set the background color
plt.gca().set_facecolor('#f7f7f7')

# Draw one gray line per fold of the 3-fold cross-validation
for i in range(3):  # must match the number of folds
    plt.plot(range(1, len(fold_scores[i]) + 1), fold_scores[i], 
             marker='o', color='gray', linestyle='-', linewidth=0.8, alpha=0.6)

# Draw the mean cross-validation score as a thick dark gray line
plt.plot(range(1, len(mean_scores) + 1), mean_scores, 
         marker='o', color='#696969', linestyle='-', linewidth=5, label='Mean CV Accuracy')

# Mark the optimal number of features with a vertical line
plt.axvline(x=rfecv.n_features_, color='#E76F51', linestyle='--', 
            linewidth=2, label=f'Optimal = {rfecv.n_features_}')

# Legend
plt.legend(fontsize=12, loc='best', frameon=True, shadow=True, facecolor='white', framealpha=0.9)

# Grid
plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.7)

# Tick label font size
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)

# Adjust the figure margins
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)

# Save the figure as a PDF file
plt.savefig('分类.pdf', format='pdf', bbox_inches='tight')

# Show the figure
plt.show()

RFECV selects 104 features as the optimal subset.

Y.value_counts()

label
0        62867
1         7132
Name: count, dtype: int64

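The positive and negative classes are heavily imbalanced (roughly nine negatives for every positive), so next we use oversampling to balance the samples.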
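
With the class counts shown above, accuracy on its own says little. The sketch below scores a trivial baseline that always predicts "no churn" on labels built from those counts; the baseline is an illustration and not part of the article's pipeline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels reconstructed from the class counts above.
y_true = np.array([0] * 62867 + [1] * 7132)

# Baseline that predicts the majority class for every customer.
y_majority = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_majority)
rec = recall_score(y_true, y_majority)
print(round(acc, 4), rec)  # 0.8981 0.0
```

The baseline reaches about 90% accuracy without identifying a single churner, which is why the model below is judged on precision, recall, F1, and AUC.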

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
import lightgbm as lgb
import joblib

x = df_selected
y = Y['label']  # 1-D Series, the shape scikit-learn expects for a target

# Stratify with the same random_state as in the feature selection step, so the test
# rows here are the same rows that were held out when the features were selected
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

# Standardization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training set
X_test_scaled = scaler.transform(X_test)        # apply the same scaler to the test set


# Oversample the training set with SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_scaled, y_train)

# LightGBM model parameters
params = {
  'objective': 'binary',
  'scale_pos_weight': 1,  # classes are already balanced by oversampling, so keep this at 1
  'max_depth': 7,
  'min_child_weight': 3,
  'subsample': 0.8,
  'colsample_bytree': 0.8,
  'metric': 'auc',
  'reg_lambda': 5,  # L2 regularization
  'reg_alpha': 3,  # L1 regularization (LightGBM's 'alpha' belongs to the Huber and quantile objectives)
  'random_state': 42
}

# Initialize the LightGBM model
model1 = lgb.LGBMClassifier(**params)

# Train on the oversampled data
model1.fit(X_resampled, y_resampled)

# Predict on the untouched test set
y_pred = model1.predict(X_test_scaled)
y_proba = model1.predict_proba(X_test_scaled)[:, 1]  # predicted probabilities, used for AUC

# Evaluate the model
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_proba)  # compute AUC

# Print the evaluation results
print("Precision: {:.2f}".format(precision))
print("Recall: {:.2f}".format(recall))
print("F1 score: {:.2f}".format(f1))
print("AUC: {:.2f}".format(auc))

# # Save the model and the scaler
# joblib.dump(model1, 'lightgbm_model.joblib')
# joblib.dump(scaler, 'scaler.joblib')
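
To make the oversampling step less of a black box, here is a simplified sketch of the idea behind SMOTE: each synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. The function `smote_sketch` is hypothetical and written only for illustration; it is not imblearn's implementation, and the random two-dimensional data is made up.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is returned as its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)
    base = rng.integers(0, len(X_min), size=n_new)         # random minority samples
    neigh = idx[base, rng.integers(1, k + 1, size=n_new)]  # one of the k neighbors of each
    gap = rng.random((n_new, 1))                           # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

rng = np.random.default_rng(1)
X_minority = rng.normal(size=(20, 2))  # made-up minority-class samples
X_synthetic = smote_sketch(X_minority, n_new=30)
print(X_synthetic.shape)  # (30, 2)
```

Because the synthetic points are built from training samples, oversampling is applied only after the train/test split, as in the code above, so the test set keeps its original class distribution.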