I recently developed a fully functional random forest regression application based on scikit-learn's RandomForestRegressor, and I am now interested in comparing its performance against other libraries. I found the scikit-learn API for XGBoost random forest regression (XGBRFRegressor) and ran a small software test using X features and Y targets that are all zeros.
from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor
tree_number = 100
depth = 10
jobs = 1
dimension = 19
sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                               n_jobs=jobs)
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, random_state=42,
                         n_jobs=jobs)
dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])
sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)
sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))令人惊讶的是,xgb_VAL模型的输入样本为全零的预测结果为非零:
sk_prediction = [0.]
xgb_prediction = [0.02500369]

What is the error in how I have constructed this evaluation or comparison that leads to this result?
Posted on 2021-04-16 18:53:19
It appears that XGBoost includes a global bias in the model and fixes it at 0.5 rather than computing it from the input data. This has been raised as an issue in the XGBoost GitHub repository (see https://github.com/dmlc/xgboost/issues/799). The corresponding hyperparameter is base_score; if you set it to zero, your model will predict zero as expected.
from numpy import array
from xgboost import XGBRFRegressor
from sklearn.ensemble import RandomForestRegressor
tree_number = 100
depth = 10
jobs = 1
dimension = 19
sk_VAL = RandomForestRegressor(n_estimators=tree_number, max_depth=depth, random_state=42, n_jobs=jobs)
# base_score=0 overrides the global bias that XGBoost otherwise fixes at 0.5
xgb_VAL = XGBRFRegressor(n_estimators=tree_number, max_depth=depth, base_score=0, random_state=42, n_jobs=jobs)
dataset = array([[0.0] * dimension, [0.0] * dimension])
y_val = array([0.0, 0.0])
sk_VAL.fit(dataset, y_val)
xgb_VAL.fit(dataset, y_val)
sk_predict = sk_VAL.predict(array([[0.0] * dimension]))
xgb_predict = xgb_VAL.predict(array([[0.0] * dimension]))
print("sk_prediction = {}\nxgb_prediction = {}".format(sk_predict, xgb_predict))
#sk_prediction = [0.]
#xgb_prediction = [0.]
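If you want to confirm which global bias a fitted model actually used, one option (a minimal sketch, not part of the original answer) is to read the fitted booster's saved configuration; the exact nested JSON layout shown here is an assumption about recent xgboost releases and may differ between versions. Run it after the code above:

import json

# Dump the fitted booster's internal configuration as JSON and look up the
# stored base_score; it should reflect the value passed in above (0 here,
# 0.5 with the default settings discussed in the answer).
# The key path "learner" -> "learner_model_param" -> "base_score" is an
# assumption and may vary across xgboost versions.
config = json.loads(xgb_VAL.get_booster().save_config())
print(config["learner"]["learner_model_param"]["base_score"])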
Source: https://stackoverflow.com/questions/67122859