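After years of knocking around in machine learning, I've noticed that newcomers keep getting lost in the concepts and jargon. Today I'll use a hands-on example to get you started with sklearn and build a reliable predictive model.
Setting up the environment is simple; installing sklearn is all it takes: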
pip install scikit-learn
Let's use a classic house-price prediction example. It isn't too complicated, yet it covers the core ideas:
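from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import numpy as np
# load_boston was removed in scikit-learn 1.2, so load the California
# housing dataset instead (downloaded and cached on first use)
housing = fetch_california_housing()
X = housing.data
y = housing.target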
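Splitting the dataset is a required step in machine learning. A held-out test set tells us how the model performs on data it never saw during training:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)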
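To see what the split actually does, here is a minimal sketch on toy data. The arrays `X_demo` and `y_demo` are made up for illustration and are not part of the housing example. It shows that `test_size=0.2` holds out 20% of the rows, that rows of `X` stay aligned with `y`, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples, 2 features; row i is [2*i, 2*i + 1]
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)

# Splitting again with the same seed gives the same held-out rows
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
same_split = np.array_equal(X_te, X_te2)
print(same_split)
```

Without a fixed seed the split changes on every run, which makes scores hard to compare between experiments.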
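Linear regression is simple, but it's an excellent starting point for understanding machine learning. The model amounts to finding the best-fitting hyperplane in a multidimensional space: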
model = LinearRegression()
model.fit(X_train, y_train)
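If "best-fitting hyperplane" sounds abstract, the sketch below may help. It fits ordinary least squares by hand with `np.linalg.lstsq` and compares the result with `LinearRegression`. The synthetic data, `true_w`, and the variable names are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical synthetic data: y = X @ true_w + 4 + small noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
true_w = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_w + 4.0 + rng.normal(scale=0.1, size=50)

# Append a column of ones so the intercept is learned as one more weight
A = np.column_stack([X_demo, np.ones(len(X_demo))])
w, *_ = np.linalg.lstsq(A, y_demo, rcond=None)

model_demo = LinearRegression().fit(X_demo, y_demo)
print("by hand :", w[:3], w[3])
print("sklearn :", model_demo.coef_, model_demo.intercept_)
```

The two sets of numbers should agree up to floating-point error, because both minimize the same sum of squared residuals.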
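Model evaluation is the moment of truth. Remember, an overfitted model is like a student who memorizes the textbook: fine on questions seen before, likely to fail the exam on new ones. For regressors, score returns the R2 coefficient of determination: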
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(f"Training R2 score: {train_score:.3f}")
print(f"Test R2 score: {test_score:.3f}")
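It helps to know what that score is. For a regressor, `score` returns R2, defined as one minus the ratio of the residual sum of squares to the total sum of squares. The helper `r2_by_hand` and the synthetic data below are my own names and assumptions for illustration, not part of sklearn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r2_by_hand(y_true, y_pred):
    """R2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # error left after the model
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

model_demo = LinearRegression().fit(X_demo, y_demo)
manual = r2_by_hand(y_demo, model_demo.predict(X_demo))
builtin = model_demo.score(X_demo, y_demo)
print(f"{manual:.6f} {builtin:.6f}")

# A model that always predicts the mean scores exactly zero
baseline = r2_by_hand(y_demo, np.full_like(y_demo, y_demo.mean()))
print(baseline)
```

So an R2 of 0 means "no better than guessing the average", and a negative R2 on the test set means the model is doing worse than that.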
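Feature engineering is often the key to better model performance. Here is a feature standardization example. Note that the scaler is fitted on the training set only and then applied to the test set, so no test-set statistics leak into training: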
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model_scaled = LinearRegression()
model_scaled.fit(X_train_scaled, y_train)
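One caveat worth checking for yourself: for plain least-squares regression with an intercept, standardizing the features changes the coefficients but not the predictions, so R2 stays the same. Scaling pays off with models that are sensitive to feature scale, such as regularized or distance-based ones. The sketch below uses made-up data with deliberately mismatched scales; all names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three features on very different scales
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 0.01])
y_demo = X_demo @ np.array([3.0, 0.02, 50.0]) + rng.normal(size=200)
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# Fit the scaler on the training rows only, then reuse it on the test rows
scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)
X_te_s = scaler_demo.transform(X_te)

print(X_tr_s.mean(axis=0).round(6), X_tr_s.std(axis=0).round(6))

raw_score = LinearRegression().fit(X_tr, y_tr).score(X_te, y_te)
scaled_score = LinearRegression().fit(X_tr_s, y_tr).score(X_te_s, y_te)
print(f"{raw_score:.6f} {scaled_score:.6f}")
```

After scaling, each training column has mean 0 and standard deviation 1, while the test columns are only approximately so, which is expected.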
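Cross-validation makes the evaluation more reliable by averaging over several splits instead of depending on one lucky (or unlucky) split:
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(model, X, y, cv=5)
print(f"Cross-validation score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")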
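Under the hood there is no magic. For a regressor, `cv=5` means an unshuffled 5-fold `KFold`: train on four folds, score on the fifth, rotate. Here is a rough sketch of that loop on synthetic data (the data and variable names are assumptions for illustration), checked against `cross_val_score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic regression data
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=100)

manual_scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_demo):
    # A fresh model per fold, so nothing carries over between folds
    fold_model = LinearRegression().fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

cv_scores_demo = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
print(manual_scores.round(4))
print(cv_scores_demo.round(4))
```

Because the default folds are not shuffled, data sorted by target or by time can give misleading scores; passing `KFold(n_splits=5, shuffle=True, random_state=...)` as `cv` is one way around that.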
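Model tuning takes some skill. Grid search tries every combination of the listed parameter values and keeps the one with the best cross-validated score: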
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5]
}
rf = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(rf, param_grid, cv=5)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")
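Before launching a grid search it's worth counting how much work you are asking for. The sketch below uses `ParameterGrid` from sklearn to enumerate the same grid; `param_grid_demo`, `candidates`, and `n_cv_fits` are names I made up for the illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the article, copied here so the snippet runs on its own
param_grid_demo = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5],
}

candidates = list(ParameterGrid(param_grid_demo))
n_folds = 5
n_cv_fits = len(candidates) * n_folds  # one fit per candidate per fold

print(len(candidates), n_cv_fits)
print(candidates[0])
```

That is 12 candidates and 60 cross-validation fits, plus one final refit of the best candidate on the whole training set. On a larger dataset, passing `n_jobs=-1` to `GridSearchCV` spreads those fits across CPU cores.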
Before deployment, saving the model with pickle is standard practice:
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(grid_search.best_estimator_, f)
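Saving is only half the story, so here is a minimal round-trip sketch: dump a model, load it back, and check that it predicts the same values. The small forest, the synthetic data, and the temporary file path are assumptions for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical small model trained on synthetic data
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(60, 4))
y_demo = X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=60)
model_demo = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Write to a temporary directory so the sketch leaves no file in the project
path = os.path.join(tempfile.mkdtemp(), "model_demo.pkl")
with open(path, "wb") as f:
    pickle.dump(model_demo, f)

with open(path, "rb") as f:
    loaded = pickle.load(f)

same_predictions = np.allclose(model_demo.predict(X_demo), loaded.predict(X_demo))
print(same_predictions)
```

Two things to keep in mind: only unpickle files you trust, since loading a pickle can execute arbitrary code, and load the model with the same scikit-learn version that saved it.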
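In real projects, a preprocessing pipeline keeps the code cleaner and guarantees that training and prediction go through exactly the same steps: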
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
numeric_features = [0, 1, 2, 3]  # example numeric feature indices
categorical_features = [4, 5]  # example categorical feature indices
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first'))
])
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])
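The pipeline above is only defined, never fitted, so here is a rough end-to-end sketch on made-up data whose layout matches the example indices: columns 0 to 3 numeric, columns 4 and 5 holding category codes 0, 1, or 2. The data and the `_demo` names are assumptions for illustration, and the transformers are passed directly instead of being wrapped in single-step pipelines.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical mixed data: 4 numeric columns followed by 2 categorical columns
rng = np.random.default_rng(4)
n = 120
numeric_part = rng.normal(size=(n, 4))
categorical_part = rng.integers(0, 3, size=(n, 2))
X_demo = np.column_stack([numeric_part, categorical_part])
y_demo = (numeric_part[:, 0]
          + 2.0 * (categorical_part[:, 0] == 1)
          + rng.normal(scale=0.1, size=n))

preprocessor_demo = ColumnTransformer(transformers=[
    ("num", StandardScaler(), [0, 1, 2, 3]),
    ("cat", OneHotEncoder(drop="first"), [4, 5]),
])
pipeline_demo = Pipeline(steps=[
    ("preprocessor", preprocessor_demo),
    ("regressor", RandomForestRegressor(n_estimators=20, random_state=42)),
])

# One fit call trains the scaler, the encoder, and the forest together
pipeline_demo.fit(X_demo[:100], y_demo[:100])
preds = pipeline_demo.predict(X_demo[100:])

# 4 scaled columns + 2 categorical columns * (3 categories - 1 dropped) = 8
n_out = pipeline_demo.named_steps["preprocessor"].transform(X_demo[:1]).shape[1]
print(preds.shape, n_out)
```

Because the preprocessing lives inside the pipeline, you can hand the whole thing to `cross_val_score` or `GridSearchCV` and the scaler and encoder get refitted on each training fold, which avoids leaking validation data into the preprocessing.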
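Once you've worked through this flow, you have the fundamentals of machine learning in hand. Keep in mind, though, that theory and practice go together. However pretty the code, it's useless if you don't understand the principles behind it. I'd suggest digging into how the models work, for example why a random forest reduces overfitting and why cross-validation uses k folds.
One last reminder: this code looks simple, but using it well in production means thinking about a lot more, including data quality, feature selection, model interpretability, and compute resources. So practice a lot and think a lot, and over time you'll find the practices that fit your own projects.
If this wasn't enough for you, have a look at the official sklearn documentation, which covers many more advanced uses. Remember, on the machine learning road, staying eager to learn matters more than anything else.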