我正在使用statmodel OLS进行迭代异常值消除。我已经用来拟合模型了。
ols_result = sm.OLS(y,X).fit()
然后我可以得到学院化的删除残差外部和bonferroni
ols_result.outlier_test(method="bonf")
我正在删除超过bonferroni p> %10的样本,其中cooks距离在每次迭代中也是最高的。直到我没有得到bonf(p)>%10的样本,然后我得到了原始样本的子集。
假设我有400个样本,在删除异常值后,我得到了380个样本。现在我想找出学院化的删除残差和bonferroni,再用400个样本对380个样本进行回归拟合。以查看删除的异常值是否真的是异常值。
这就是问题所在。我正在寻找一种简单的方法来使用statmodels OLS模型来获得拟合值的残差和厨师距离,而不是自己编写这些函数。但是.outlier_test()
和.get_influence()
似乎可以处理OLS Result对象。
你们有没有什么简单的方法来实现这些测试而不需要太多的代码等等。
提前感谢
发布于 2019-12-26 15:42:57
经过对这种实用性的深入研究。我找不到用statmodels库做这件事的简单方法。相反,我编写了自己的代码,也得到了statmodel源代码的帮助。如果有人需要我在下面分享的代码
def internally_studentized_residual(X,Y,y_hat):
"""
Calculate studentized residuals internal
Parameters
______________________________________________________________________
X: List
Variable x-axis
Y: List
Response variable of the X
y_hat: List
Predictions of the response variable with the given X
Returns
______________________________________________________________________
Dataframe : List
Studentized Residuals
"""
# print(len(Y))
X = np.array(X, dtype=float)
Y = np.array(Y, dtype=float)
y_hat = np.array(y_hat,dtype=float)
mean_X = np.mean(X)
mean_Y = np.mean(Y)
n = len(X)
# print(X.shape,Y.shape,y_hat.shape)
residuals = Y - y_hat
X_inverse = np.linalg.pinv(X.reshape(-1,1))[0]
h_ii = X_inverse.T * X
Var_e = math.sqrt(sum((Y - y_hat) ** 2)/(n-2))
SE_regression = Var_e/((1-h_ii) ** 0.5)
studentized_residuals = residuals/SE_regression
return studentized_residuals
def deleted_studentized_residual(X,Y,y_hat):
"""
Calculate studentized residuals external
Parameters
______________________________________________________________________
X: List
Variable x-axis
Y: List
Response variable of the X
y_hat: List
Predictions of the response variable with the given X
Returns
______________________________________________________________________
Dataframe : List
Studentized Residuals External
"""
#formula from https://newonlinecourses.science.psu.edu/stat501/node/401/
r = internally_studentized_residual(X,Y,y_hat)
n = len(r)
return [r_i*math.sqrt((n-2-1)/(n-2-r_i**2)) for r_i in r]
def outlier_test(X,Y,y_hat,alpha=0.1):
"""
outlier test for the points
Parameters
______________________________________________________________________
X: List
Variable x-axis
Y: List
Response variable of the X
y_hat: List
Predictions of the response variable with the given X
alpha: float
alpha value for multiple test
Returns
______________________________________________________________________
Dataframe : studentized, unadjusted p values and benferroni multiple
test dataframe
"""
resid = deleted_studentized_residual(X,Y,y_hat)
df = len(X) - 1
p_vals = stats.t.sf(np.abs(resid),df) * 2
bonf_test = multipletests(p_vals,alpha,method="bonf")
df_result = pd.DataFrame()
df_result.loc[:,"student_resid"] = resid
df_result.loc[:,"unadj_p"] = p_vals
df_result.loc[:,"bonf(p)"] = bonf_test[1]
df_result.index = X.index
return df_result
这可能是一个有点脏的代码。我没有时间去清理和提高效率。它使用numpy
模块和scipy.stats
https://stackoverflow.com/questions/59466607
复制相似问题