The idea for this post comes from: Causal Inference and Counterfactual Prediction: a Hema KDD 2021 Paper (23).
The Hema paper raises the problem of price-elasticity coefficients.
Of course, Hema proposes price elasticity over very fine-grained product categories; this post does not have data at that granularity, so that approach is set aside for now.
Note, though, that price elasticity is actually equivalent to the CATE discussed here when estimated with a log-log DML regression (a minimal sketch follows below).
Moreover, since in the Hema paper each category gets its own price elasticity, predictions there can flex freely with x/t, which may fit algorithm-engineering practice better.
I will try price elasticity later as well, though the data is insufficient; for related material see:
Causal Inference and Counterfactual Prediction: Computing Price Elasticity with DML (24)
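As a side note, here is a minimal sketch of that log-log DML idea on toy synthetic data; the variables (price, demand) and all DGP coefficients are hypothetical and separate from the experiment below. On the log scale, the CATE returned by effect() is exactly the price elasticity d log(demand) / d log(price).

import numpy as np
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
n = 1000
X_el = rng.uniform(0, 1, size=(n, 2))      # features driving heterogeneity
W_el = rng.normal(0, 1, size=(n, 3))       # confounders
price = np.exp(0.3 * W_el[:, 0] + rng.normal(0, 0.2, size=n))
true_elasticity = -1.5 + X_el[:, 0]        # hypothetical heterogeneous elasticity
demand = np.exp(true_elasticity * np.log(price)
                + 0.5 * W_el[:, 1] + rng.normal(0, 0.1, size=n))
# log-log DML: theta(X) = d log(demand) / d log(price), i.e. the elasticity
est_el = LinearDML(model_y=RandomForestRegressor(),
                   model_t=RandomForestRegressor(),
                   random_state=0)
est_el.fit(np.log(demand), np.log(price), X=X_el, W=W_el)
print(est_el.effect(X_el[:5]))             # per-sample elasticity estimates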
Another supplementary question: why not just use model_y from DML to predict directly?
When model_y is trained, T is simply dropped from the features, and the training set contains both T=0 and T=1 samples, so its prediction mixes the two potential outcomes. My approach here does not strictly follow what is described in the article "Double Machine Learning (DML) for causal inference / uplift modeling".
Here I simply follow the EconML example code to generate the data; this is only an experiment and not especially rigorous.
Software versions used:
econml.__version__,keras.__version__,xgboost.__version__,tensorflow.__version__
>>> ('0.12.0', '2.6.0', '1.3.3', '2.6.0')
Data generation:
import econml
## Ignore warnings
import warnings
warnings.filterwarnings("ignore")
# Main imports
from econml.dml import DML, LinearDML, SparseLinearDML, CausalForestDML
# Helper imports
import numpy as np
from itertools import product
from sklearn.linear_model import (Lasso, LassoCV, LogisticRegression,
                                  LogisticRegressionCV, LinearRegression,
                                  MultiTaskElasticNet, MultiTaskElasticNetCV)
from sklearn.ensemble import RandomForestRegressor,RandomForestClassifier
from sklearn.preprocessing import PolynomialFeatures
import matplotlib.pyplot as plt
import matplotlib
from sklearn.model_selection import train_test_split
%matplotlib inline
# Treatment effect function
def exp_te(x):
    return np.exp(2 * x[0])

# DGP constants
np.random.seed(123)
n = 2000
n_w = 30
support_size = 4
n_x = 6
# Outcome support
support_Y = np.random.choice(range(n_w), size=support_size, replace=False)
coefs_Y = np.random.uniform(0, 1, size=support_size)
epsilon_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Treatment support
support_T = support_Y
coefs_T = np.random.uniform(0, 1, size=support_size)
eta_sample = lambda n: np.random.uniform(-1, 1, size=n)
# Generate controls, covariates, treatments and outcomes
W = np.random.normal(0, 1, size=(n, n_w))
X = np.random.uniform(0, 1, size=(n, n_x))
# Heterogeneous treatment effects
TE = np.array([exp_te(x_i) for x_i in X])
# Define treatment
log_odds = np.dot(W[:, support_T], coefs_T) + eta_sample(n)
T_sigmoid = 1/(1 + np.exp(-log_odds))
T = np.array([np.random.binomial(1, p) for p in T_sigmoid])
# Define the outcome
Y = TE * T + np.dot(W[:, support_Y], coefs_Y) + epsilon_sample(n)
# Split into training and validation sets
Y_train, Y_val, T_train, T_val, X_train, X_val, W_train, W_val = train_test_split(Y, T, X, W, test_size=.2)
# Generate test data
#X_test = np.array(list(product(np.arange(0, 1, 0.01), repeat=n_x)))
W.shape,T.shape,X.shape,Y.shape#,X_test.shape
>>> ((2000, 30), (2000,), (2000, 6), (2000,))
Here the confounders W are 30-dimensional, T is a 0/1 treatment variable, and X is a 6-dimensional feature vector.
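Written out, the data-generating process in the code above is

$$T \sim \mathrm{Bernoulli}\big(\sigma(W_S^\top \gamma + \eta)\big), \qquad Y = \theta(X)\,T + W_S^\top \beta + \varepsilon, \qquad \theta(x) = e^{2 x_0},$$

where $W_S$ is the 4-column support of $W$, $\beta$ and $\gamma$ are the Uniform(0, 1) coefficients, and $\eta, \varepsilon \sim \mathrm{Uniform}(-1, 1)$ are noise terms. The true heterogeneous effect $\theta(X)$ depends only on the first coordinate of $X$.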
Reference:
Causal Inference Notes: DML, a Double Machine Learning Case Study (16)
Four DML models are tested here:
LinearDML; SparseLinearDML; DML; CausalForestDML
# Default Setting
est = LinearDML(model_y=RandomForestRegressor(),
                model_t=RandomForestRegressor(),
                random_state=123)
est.fit(Y_train, T_train, X=X_train, W=W_train, cache_values=True)
#te_pred = est.effect(X_test)
print('LinearDML')
# Polynomial Features for Heterogeneity
est1 = SparseLinearDML(model_y=RandomForestRegressor(),
                       model_t=RandomForestRegressor(),
                       featurizer=PolynomialFeatures(degree=3),
                       random_state=123)
est1.fit(Y_train, T_train, X=X_train, W=W_train)
#te_pred1 = est1.effect(X_test)
print('SparseLinearDML')
# Polynomial Features with regularization
est2 = DML(model_y=RandomForestRegressor(),
           model_t=RandomForestRegressor(),
           model_final=Lasso(alpha=0.1, fit_intercept=False),
           featurizer=PolynomialFeatures(degree=10),
           random_state=123)
est2.fit(Y_train, T_train, X=X_train, W=W_train)
#te_pred2 = est2.effect(X_test)
print('DML')
# CausalForestDML
est3 = CausalForestDML(model_y=RandomForestRegressor(),
                       model_t=RandomForestRegressor(),
                       criterion='mse', n_estimators=1000,
                       min_impurity_decrease=0.001,
                       random_state=123)
est3.tune(Y_train, T_train, X=X_train, W=W_train)
est3.fit(Y_train, T_train, X=X_train, W=W_train)
#te_pred3 = est3.effect(X_test)
print('CausalForestDML')
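Since the data is synthetic and the true effect exp(2 * x[0]) is known, one sanity check (not part of the original run) is to score each estimator's CATE directly against the ground truth on the validation set:

from sklearn.metrics import mean_squared_error
# Ground-truth heterogeneous effect on the validation units
te_true_val = np.array([exp_te(x_i) for x_i in X_val])
for name, e in [('LinearDML', est), ('SparseLinearDML', est1),
                ('DML', est2), ('CausalForestDML', est3)]:
    print(name, 'CATE MSE:', mean_squared_error(te_true_val, e.effect(X_val)))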
Next come the two tree-based test models for the treatment, i.e. the ones described in section 1.2:
import xgboost
#import shap
#shap.initjs()
import numpy as np
import pandas as pd
from sklearn import preprocessing
import lightgbm as lgb
from sklearn.metrics import mean_squared_error   # mean squared error
from sklearn.metrics import mean_absolute_error  # mean absolute error
from sklearn.metrics import r2_score             # R squared
# Test model 3: keep only the T=0 samples
Y_train_2 = np.array([Y_train[n] for n, i in enumerate(T_train) if i == 0])
T_train_2 = np.array([T_train[n] for n, i in enumerate(T_train) if i == 0])
if X_train.shape[1] == 1:
    X_train_2 = np.array([X_train[n] for n, i in enumerate(T_train) if i == 0]).reshape((-1, 1))
else:
    X_train_2 = np.array([X_train[n] for n, i in enumerate(T_train) if i == 0])
W_train_2 = np.array([W_train[n] for n, i in enumerate(T_train) if i == 0])
# Training sets
XW_train_0 = np.hstack((X_train_2, W_train_2))  # test model 3: only T=0 samples
XW_train_0_1 = np.hstack((X_train, W_train))
XWT_train_0_1 = np.hstack((XW_train_0_1, T_train.reshape((-1, 1))))  # test model 1: W, X and T all as features
# Validation sets
XW_val = np.hstack((X_val, W_val))
XWT_Val = np.hstack((XW_val, T_val.reshape((-1, 1))))
That completes the construction of the training and validation data.
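As an aside, the T=0 subset above can be built more idiomatically with a boolean mask; this is an equivalent alternative, not what the original run used:

mask0 = (T_train == 0)
Y_train_2 = Y_train[mask0]
X_train_2 = X_train[mask0]  # slicing keeps the 2-D shape, no reshape needed
W_train_2 = W_train[mask0]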
Then comes the very simple training and prediction step:
# Test model 3: trained only on the T=0 samples
model_0 = xgboost.XGBRegressor().fit(XW_train_0, Y_train_2)
# Test model 1: the XWT model, with W, X and T all included
model_01 = xgboost.XGBRegressor().fit(XWT_train_0_1, Y_train)
# Test model 3: validation-set predictions
y_val_xgb_0 = model_0.predict(XW_val)
# Test model 1: validation-set predictions
y_val_xgb_01 = model_01.predict(XWT_Val)
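A side note not in the original post: because test model 1 carries T as an ordinary feature, it can also be queried counterfactually by toggling the T column, giving a naive (and, as discussed above, potentially biased) uplift estimate:

# Predict both potential outcomes for the same validation units (sketch)
n_val = XW_val.shape[0]
XWT_val_t0 = np.hstack((XW_val, np.zeros((n_val, 1))))
XWT_val_t1 = np.hstack((XW_val, np.ones((n_val, 1))))
uplift_xgb = model_01.predict(XWT_val_t1) - model_01.predict(XWT_val_t0)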
This section references: Causal Inference Notes: Instrumental Variables, Endogeneity and DeepIV (6).
Unlike test models 1 and 3, DeepIV does not use T as a plain feature: T is handled as an endogenous treatment within the IV framework, and here W is passed in as the instrument Z.
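For reference, the two DeepIV stages (following Hartford et al., 2017) are

$$\text{Stage 1:}\ \hat F(t \mid x, z)\ \text{via a mixture density network}, \qquad \text{Stage 2:}\ \min_h \sum_i \Big(y_i - \int h(t, x_i)\, d\hat F(t \mid x_i, z_i)\Big)^2,$$

where the integral is approximated by sampling from the fitted mixture; the n_samples and n_gradient_samples arguments below control that sampling.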
#T_train.shape,X_train.shape,W_train.shape
t_x = 1 + X_train.shape[1]
w_x = W_train.shape[1] + X_train.shape[1]
print('t+x',t_x)
print('w+x',w_x)
from econml.iv.nnet import DeepIV
import keras
import numpy as np
import matplotlib.pyplot as plt
# Build the networks. If the dimensions of W or X change, input_shape must be adjusted to match.
# Treatment network input: w + x dimensions
treatment_model = keras.Sequential([keras.layers.Dense(128, activation='relu', input_shape=(w_x,)),
                                    keras.layers.Dropout(0.17),
                                    keras.layers.Dense(64, activation='relu'),
                                    keras.layers.Dropout(0.17),
                                    keras.layers.Dense(32, activation='relu'),
                                    keras.layers.Dropout(0.17)])
# Response network input: t + x dimensions; adjust if the T or X dimensions change
response_model = keras.Sequential([keras.layers.Dense(128, activation='relu', input_shape=(t_x,)),
                                   keras.layers.Dropout(0.17),
                                   keras.layers.Dense(64, activation='relu'),
                                   keras.layers.Dropout(0.17),
                                   keras.layers.Dense(32, activation='relu'),
                                   keras.layers.Dropout(0.17),
                                   keras.layers.Dense(1)])
# Initialize the DeepIV model
keras_fit_options = {"epochs": 100,
                     "validation_split": 0.1}
deepIvEst = DeepIV(n_components=10,  # number of gaussians in the mixture density network
                   m=lambda z, x: treatment_model(keras.layers.concatenate([z, x])),  # treatment model
                   h=lambda t, x: response_model(keras.layers.concatenate([t, x])),  # response model
                   n_samples=1,  # number of samples used to estimate the response
                   use_upper_bound_loss=False,  # whether to use an approximation to the true loss
                   n_gradient_samples=1,  # number of samples in the second response estimate (makes the loss estimate unbiased)
                   optimizer='adam',  # Keras optimizer to use for training - see https://keras.io/optimizers/
                   first_stage_options=keras_fit_options,  # options for training the treatment model
                   second_stage_options=keras_fit_options)  # options for training the response model
# Train the DeepIV model
deepIvEst.fit(Y=Y_train, T=T_train, X=X_train, Z=W_train)
# Predict with DeepIV
y_val_deepiv_01 = deepIvEst.predict(T_val, X_val)
Note that the input dimensions of treatment_model and response_model need to be adjusted by hand to match your data.
# Test model 3: four DML estimators, hence four sets of predicted effect increments on Y
# When T=0 the XGBoost prediction is used directly; when T=1 it is y_xgb + y_dml
te_pred = est.effect(X_val)
te_pred1 = est1.effect(X_val)
te_pred2 = est2.effect(X_val)
te_pred3 = est3.effect(X_val)
model_name = ['LinearDML', 'SparseLinearDML', 'DML', 'CausalForestDML']
print('Test model 1 MSE:', mean_squared_error(Y_val, y_val_xgb_01))
print('Test model 2 (DeepIV) MSE:', mean_squared_error(Y_val, y_val_deepiv_01))
for tn, tp in enumerate([te_pred, te_pred1, te_pred2, te_pred3]):
    y_val_xgb_0_dml1 = []
    for n, t in enumerate(T_val):
        x = y_val_xgb_0[n]
        if t == 1:
            y_val_xgb_0_dml1.append(x + tp[n])
        else:
            y_val_xgb_0_dml1.append(x)
    print(f'Test model 3 (DML, {model_name[tn]}) MSE:', mean_squared_error(Y_val, y_val_xgb_0_dml1))
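The inner loop can also be collapsed into a single vectorized line; an equivalent sketch:

# Equivalent vectorized form of the T=0 / T=1 combination above
y_val_comb = y_val_xgb_0 + T_val * tp  # tp is one estimator's CATE vector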
The final results, measured by MSE:
Test model 1 MSE: 0.4982044649307843
Test model 2 (DeepIV) MSE: 5.159633681241892
Test model 3 (DML, LinearDML) MSE: 0.5558297771296007
Test model 3 (DML, SparseLinearDML) MSE: 1.8249646076083048
Test model 3 (DML, DML) MSE: 0.9855352650079277
Test model 3 (DML, CausalForestDML) MSE: 0.4753863023209694
This is only an experiment, but a few things stand out:
Test model 1 performs reasonably well;
Test model 2 (DeepIV) shows a very high MSE; possibly I made a mistake somewhere;
Test model 3 (DML) fluctuates quite a bit across the different DML estimators, and here CausalForestDML beats test model 1.
The results above are for 6-dimensional X; here is how the final results change as the dimensionality of X grows:
# X with 1 dimension
Test model 1 MSE: 0.5160967348769703
Test model 2 (DeepIV) MSE: 35.59973032150524
Test model 3 (DML, LinearDML) MSE: 0.5813808010457113
Test model 3 (DML, SparseLinearDML) MSE: 0.5019110708791529
Test model 3 (DML, DML) MSE: 0.9961407722015089
Test model 3 (DML, CausalForestDML) MSE: 0.5089520789034898
# X with 3 dimensions
Test model 1 MSE: 0.5449129530089527
Test model 2 (DeepIV) MSE: 22.62998950191628
Test model 3 (DML, LinearDML) MSE: 0.5069041205691804
Test model 3 (DML, SparseLinearDML) MSE: 0.5152944232346934
Test model 3 (DML, DML) MSE: 1.031471234778512
Test model 3 (DML, CausalForestDML) MSE: 0.4678926411195991
# X with 6 dimensions
Test model 1 MSE: 0.4982044649307843
Test model 2 (DeepIV) MSE: 5.159633681241892
Test model 3 (DML, LinearDML) MSE: 0.5558297771296007
Test model 3 (DML, SparseLinearDML) MSE: 1.8249646076083048
Test model 3 (DML, DML) MSE: 0.9855352650079277
Test model 3 (DML, CausalForestDML) MSE: 0.4753863023209694
# X with 9 dimensions
Test model 1 MSE: 0.5646551531847374
Test model 2 (DeepIV) MSE: 3.2337960384156053
Test model 3 (DML, LinearDML) MSE: 0.6796584984496488
Test model 3 (DML, SparseLinearDML) MSE: 10.997935994944733
Test model 3 (DML, DML) MSE: 0.864753230235102
Test model 3 (DML, CausalForestDML) MSE: 0.5614196162924773
A very simple plot of the results above shows:
DeepIV's MSE drops sharply as features are added, so DNNs really do handle high-dimensional inputs well;
the tree-based test model 1 is remarkably stable;
among the DML estimators, CausalForestDML performs consistently well.
Looking back at the Hema paper:
it also shows DeepIV's potential there: with little data the loss is very large, but the model keeps improving as the data grows.
So overall, for relatively complex data at larger scale, DeepIV is definitely a model worth considering.
Test model 3 is the one whose results I most wanted to see; none of the models were tuned at all, so under these "natural" conditions model 3's counterfactual predictions do hold up.
BUT test model 1, which takes the treatment as a plain feature, is likely biased and not rigorous, yet it is still a decent option; as for applicability, if X and W are very high-dimensional, test model 1 could run into serious problems as well.