This article is a translation of the function reference for learning_curve on scikit-learn.org, including the User Guide section on learning curves that it references.
learning_curve(
    estimator,
    X,
    y,
    *,
    groups=None,
    train_sizes=array([0.1, 0.325, 0.55, 0.775, 1.]),
    cv=None,
    scoring=None,
    exploit_incremental_learning=False,
    n_jobs=None,
    pre_dispatch='all',
    verbose=0,
    shuffle=False,
    random_state=None,
    error_score=nan,
    return_times=False,
    fit_params=None,
)
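Learning curve.
Determines cross-validated training and test scores for different training set sizes.
A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes are used to train the estimator, and a score is computed for each training subset size and for the test set. Afterwards, the scores are averaged over all k runs for each training subset size.
Read more in the User Guide, which is translated below as well: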
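The procedure described above can be sketched by hand. This is only an illustration under stated assumptions, not scikit-learn's actual implementation: it assumes a synthetic regression dataset, a Ridge estimator, plain 5-fold KFold without shuffling, and a hypothetical helper named manual_learning_curve.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold


def manual_learning_curve(estimator, X, y, train_sizes, n_splits=5):
    """Hypothetical helper: fit on prefixes of each training fold and score them."""
    train_scores = np.empty((len(train_sizes), n_splits))
    test_scores = np.empty((len(train_sizes), n_splits))
    for fold, (train_idx, test_idx) in enumerate(KFold(n_splits=n_splits).split(X)):
        for tick, size in enumerate(train_sizes):
            subset = train_idx[:size]  # first `size` samples of this training fold
            model = clone(estimator).fit(X[subset], y[subset])
            train_scores[tick, fold] = model.score(X[subset], y[subset])
            test_scores[tick, fold] = model.score(X[test_idx], y[test_idx])
    return train_scores, test_scores


# Synthetic data, chosen only for illustration.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)
train_scores, test_scores = manual_learning_curve(Ridge(), X, y, [20, 50, 80])

# Average over the k runs for each training subset size.
mean_train = train_scores.mean(axis=1)
mean_test = test_scores.mean(axis=1)
print(mean_train, mean_test)
```

Each row of the score arrays corresponds to one training subset size and each column to one of the k splits, so averaging along axis 1 gives one value per size.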
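Every estimator has its advantages and drawbacks. Its generalization error can be decomposed into bias, variance and noise. The bias of an estimator is its average error over different training sets. The variance of an estimator indicates how sensitive it is to varying training sets. Noise is a property of the data.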
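For squared-error loss, this decomposition is commonly written as shown below. This is the standard textbook form rather than something stated in the reference itself, and the symbols are assumptions introduced for illustration: \(f\) is the true function, \(\hat{f}_D\) is the estimator fitted on a training set \(D\), and \(y = f(x) + \varepsilon\) with zero-mean noise of variance \(\sigma^2\) that is independent of \(D\).

```latex
\mathbb{E}_{D,\varepsilon}\!\left[\big(y - \hat{f}_D(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[\hat{f}_D(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}_D\!\left[\big(\hat{f}_D(x) - \mathbb{E}_D[\hat{f}_D(x)]\big)^2\right]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{noise}}
```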
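In the following plot, we see a function \(f(x) = \cos(\frac{3}{2} \pi x)\) and some noisy samples from that function. We use three different estimators to fit the function: linear regression with polynomial features of degree 1, 4 and 15. The first estimator can at best provide only a poor fit to the samples and the true function because it is too simple (high bias). The second estimator approximates it almost perfectly. The last estimator fits the training data perfectly but does not fit the true function very well, i.e. it is very sensitive to varying training data (high variance).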
[Figure: the true function, the noisy samples and the fitted models for polynomial degrees 1, 4 and 15]
import numpy as np
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
def true_fun(X):
    return np.cos(1.5 * np.pi * X)
np.random.seed(0)
n_samples = 30
degrees = [1, 4, 15]
X = np.sort(np.random.rand(n_samples))
y = true_fun(X) + np.random.randn(n_samples) * 0.1
plt.figure(figsize=(14, 5))
for i in range(len(degrees)):
    ax = plt.subplot(1, len(degrees), i + 1)
    plt.setp(ax, xticks=(), yticks=())
    polynomial_features = PolynomialFeatures(degree=degrees[i], include_bias=False)
    linear_regression = LinearRegression()
    pipeline = Pipeline(
        [
            ("polynomial_features", polynomial_features),
            ("linear_regression", linear_regression),
        ]
    )
    pipeline.fit(X[:, np.newaxis], y)
    # Evaluate the models using cross-validation
    scores = cross_val_score(
        pipeline, X[:, np.newaxis], y, scoring="neg_mean_squared_error", cv=10
    )
    X_test = np.linspace(0, 1, 100)
    plt.plot(X_test, pipeline.predict(X_test[:, np.newaxis]), label="Model")
    plt.plot(X_test, true_fun(X_test), label="True function")
    plt.scatter(X, y, edgecolor="b", s=20, label="Samples")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.xlim((0, 1))
    plt.ylim((-2, 2))
    plt.legend(loc="best")
    plt.title(
        "Degree {}\nMSE = {:.2e}(+/- {:.2e})".format(
            degrees[i], -scores.mean(), scores.std()
        )
    )
plt.show()
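Bias and variance are inherent properties of estimators, and we usually have to select learning algorithms and hyperparameters so that both bias and variance are as low as possible (see Bias-variance dilemma). Another way to reduce the variance of a model is to use more training data. However, you should only collect more training data if the true function is too complex to be approximated by an estimator with a lower variance.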
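Bias and variance can also be estimated empirically by refitting the same model on many freshly drawn training sets. The following is a rough Monte Carlo sketch under assumptions of my own choosing: the number of repetitions, the evaluation grid x_grid and the noise level are illustrative, and the true function is the cosine used in the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures


def true_fun(x):
    return np.cos(1.5 * np.pi * x)


rng = np.random.RandomState(0)
x_grid = np.linspace(0.05, 0.95, 50)  # points at which predictions are compared
n_repeats, n_samples = 200, 30

results = {}
for degree in (1, 4, 15):
    preds = np.empty((n_repeats, x_grid.size))
    for r in range(n_repeats):
        # Draw a fresh noisy training set from the same true function.
        x = rng.rand(n_samples)
        y = true_fun(x) + rng.randn(n_samples) * 0.1
        model = make_pipeline(
            PolynomialFeatures(degree=degree, include_bias=False), LinearRegression()
        )
        model.fit(x[:, np.newaxis], y)
        preds[r] = model.predict(x_grid[:, np.newaxis])
    # Squared bias: gap between the average prediction and the true function.
    bias_sq = np.mean((preds.mean(axis=0) - true_fun(x_grid)) ** 2)
    # Variance: spread of the predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree={degree:2d}  bias^2={bias_sq:.4g}  variance={variance:.4g}")
```

Under these settings the degree-1 model should show the largest squared bias and the degree-15 model the largest variance, which matches the qualitative picture in the text.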
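In the simple one-dimensional problem that we have seen in the example, it is easy to see whether the estimator suffers from bias or variance. In high-dimensional spaces, however, models can become very difficult to visualize. For this reason, it is often helpful to use the tools described below.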
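We need a score function to validate a model (see Metrics and scoring: quantifying the quality of predictions), for example accuracy for classifiers. The proper way of choosing multiple hyperparameters of an estimator is of course grid search or a similar method (see Tuning the hyper-parameters of an estimator) that selects the hyperparameters with the maximum score on one or more validation sets. Note that if we optimize the hyperparameters based on a validation score, that validation score is biased and is no longer a good estimate of the generalization. To get a proper estimate of the generalization we have to compute the score on another test set.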
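The point about the biased validation score can be illustrated with a small sketch. The dataset (iris), the SVC estimator and the parameter grid below are assumptions chosen for illustration; the grid values are not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
# Hold out a test set that the hyperparameter search never sees.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Illustrative grid only.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Used to pick the hyperparameters, so it tends to be optimistic.
validation_score = search.best_score_
# Computed on data that played no role in the selection.
test_score = search.score(X_test, y_test)
print(search.best_params_, validation_score, test_score)
```

Reporting test_score rather than validation_score keeps the final estimate separate from the data that drove the selection.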
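However, it is sometimes helpful to plot the influence of a single hyperparameter on the training score and the validation score to find out whether the estimator is overfitting or underfitting for some hyperparameter values.
The function validation_curve can help in this case: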
>>> import numpy as np
>>> from sklearn.model_selection import validation_curve
>>> from sklearn.datasets import load_iris
>>> from sklearn.linear_model import Ridge
>>> np.random.seed(0)
>>> X, y = load_iris(return_X_y=True)
>>> indices = np.arange(y.shape[0])
>>> np.random.shuffle(indices)
>>> X, y = X[indices], y[indices]
>>> train_scores, valid_scores = validation_curve(
... Ridge(), X, y, param_name="alpha", param_range=np.logspace(-7, 3, 3),
... cv=5)
>>> train_scores
array([[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.93..., 0.94..., 0.92..., 0.91..., 0.92...],
[0.51..., 0.52..., 0.49..., 0.47..., 0.49...]])
>>> valid_scores
array([[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
[0.90..., 0.84..., 0.94..., 0.96..., 0.93...],
[0.46..., 0.25..., 0.50..., 0.49..., 0.52...]])
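The returned arrays hold one row per parameter value and one column per cross-validation fold, so they are usually averaged over the folds before being compared or plotted. A minimal self-contained sketch, assuming the same Ridge and iris setup as above (the shuffling is done here with a RandomState permutation, so the individual numbers may differ from the output shown):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge
from sklearn.model_selection import validation_curve

X, y = load_iris(return_X_y=True)
indices = np.random.RandomState(0).permutation(y.shape[0])
X, y = X[indices], y[indices]

param_range = np.logspace(-7, 3, 3)
train_scores, valid_scores = validation_curve(
    Ridge(), X, y, param_name="alpha", param_range=param_range, cv=5
)

# One row per alpha value, one column per fold: average over the folds.
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
best_alpha = param_range[np.argmax(valid_mean)]
for alpha, tr, va in zip(param_range, train_mean, valid_mean):
    print(f"alpha={alpha:g}  train={tr:.3f}  validation={va:.3f}")
```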
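If the training score and the validation score are both low, the estimator is underfitting. If the training score is high and the validation score is low, the estimator is overfitting; otherwise it is working very well. A low training score together with a high validation score is usually not possible. The plot below shows underfitting, overfitting and well-performing models for an SVM on the digits dataset at different values of the parameter \(\gamma\).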
[Figure: validation curve of an SVM on the digits dataset for varying \(\gamma\)]
Code:
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import validation_curve
X, y = load_digits(return_X_y=True)
subset_mask = np.isin(y, [1, 2]) # binary classification: 1 vs 2
X, y = X[subset_mask], y[subset_mask]
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
    SVC(),
    X,
    y,
    param_name="gamma",
    param_range=param_range,
    scoring="accuracy",
    n_jobs=2,
)
train_scores_mean = np.mean(train_scores, axis=1)
train_scores_std = np.std(train_scores, axis=1)
test_scores_mean = np.mean(test_scores, axis=1)
test_scores_std = np.std(test_scores, axis=1)
plt.title("Validation Curve with SVM")
plt.xlabel(r"$\gamma$")
plt.ylabel("Score")
plt.ylim(0.0, 1.1)
lw = 2
plt.semilogx(
    param_range, train_scores_mean, label="Training score", color="darkorange", lw=lw
)
plt.fill_between(
    param_range,
    train_scores_mean - train_scores_std,
    train_scores_mean + train_scores_std,
    alpha=0.2,
    color="darkorange",
    lw=lw,
)
plt.semilogx(
    param_range, test_scores_mean, label="Cross-validation score", color="navy", lw=lw
)
plt.fill_between(
    param_range,
    test_scores_mean - test_scores_std,
    test_scores_mean + test_scores_std,
    alpha=0.2,
    color="navy",
    lw=lw,
)
plt.legend(loc="best")
plt.show()
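A learning curve shows the validation and training score of an estimator for varying numbers of training samples. It is a tool to find out how much we benefit from adding more training data and whether the estimator suffers more from a variance error or a bias error. Consider the following example, where we plot the learning curve of a naive Bayes classifier and an SVM.
For the naive Bayes classifier, both the validation score and the training score converge to a fairly low value as the training set grows. Thus, we will probably not benefit much from more training data.
In contrast, for small amounts of data, the training score of the SVM is much greater than the validation score. Adding more training samples will most likely increase generalization.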
[Figure: learning curves of a naive Bayes classifier and an SVM]
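We can use the function learning_curve to generate the values that are required to plot such a learning curve (the number of samples that have been used, the average scores on the training sets and the average scores on the validation sets):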
>>> from sklearn.model_selection import learning_curve
>>> from sklearn.svm import SVC
>>> train_sizes, train_scores, valid_scores = learning_curve(
... SVC(kernel='linear'), X, y, train_sizes=[50, 80, 110], cv=5)
>>> train_sizes
array([ 50, 80, 110])
>>> train_scores
array([[0.98..., 0.98 , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.98...],
[0.98..., 1. , 0.98..., 0.98..., 0.99...]])
>>> valid_scores
array([[1. , 0.93..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...],
[1. , 0.96..., 1. , 1. , 0.96...]])
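To get the averages mentioned above, the per-fold scores still have to be reduced along the fold axis. A minimal self-contained sketch, assuming the shuffled iris data and the linear SVC from the example (the shuffling uses a RandomState permutation here, so the exact numbers may differ from the output shown):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
indices = np.random.RandomState(0).permutation(y.shape[0])
X, y = X[indices], y[indices]

train_sizes, train_scores, valid_scores = learning_curve(
    SVC(kernel="linear"), X, y, train_sizes=[50, 80, 110], cv=5
)

# Rows follow train_sizes, columns follow the cross-validation folds.
train_mean, train_std = train_scores.mean(axis=1), train_scores.std(axis=1)
valid_mean, valid_std = valid_scores.mean(axis=1), valid_scores.std(axis=1)
for n, tm, vm in zip(train_sizes, train_mean, valid_mean):
    print(f"n_train={n:3d}  train={tm:.3f}  validation={vm:.3f}")
```

The means give the two curves, and the standard deviations can be drawn as shaded bands around them, as in the validation curve code earlier.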
Parameters
estimator : object type that implements the "fit" and "predict" methods
    An object of that type which is cloned for each validation.
X : array-like of shape (n_samples, n_features)
    Training vector, where n_samples is the number of samples and n_features is the number of features.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    Target relative to X for classification or regression; None for unsupervised learning.
groups : array-like of shape (n_samples,), default=None
    Group labels for the samples used while splitting the dataset into train/test set. Only used in conjunction with a "Group" cv instance (e.g., GroupKFold).
train_sizes : array-like of shape (n_ticks,), default=np.linspace(0.1, 1.0, 5)
    Relative or absolute numbers of training examples that will be used to generate the learning curve. If the dtype is float, it is regarded as a fraction of the maximum size of the training set (that is determined by the selected validation method), i.e. it has to be within (0, 1]. Otherwise it is interpreted as absolute sizes of the training sets. Note that for classification the number of samples usually has to be big enough to contain at least one sample from each class.
cv : int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy. Possible inputs for cv are:
    - None, to use the default 5-fold cross validation,
    - int, to specify the number of folds in a (Stratified)KFold,
    - CV splitter,
    - An iterable yielding (train, test) splits as arrays of indices.
    For int/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.
    Refer to the User Guide for the various cross-validation strategies that can be used here.
    Changed in version 0.22: cv default value if None changed from 3-fold to 5-fold.
scoring : str or callable, default=None
    A str (see model evaluation documentation) or a scorer callable object / function with signature scorer(estimator, X, y).
exploit_incremental_learning : bool, default=False
    If the estimator supports incremental learning, this will be used to speed up fitting for different training set sizes.
n_jobs : int, default=None
    Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the different training and test sets. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See the Glossary entry for n_jobs for more details.
pre_dispatch : int or str, default='all'
    Number of predispatched jobs for parallel execution (default is all). The option can reduce the allocated memory. The str can be an expression like '2*n_jobs'.
verbose : int, default=0
    Controls the verbosity: the higher, the more messages.
shuffle : bool, default=False
    Whether to shuffle training data before taking prefixes of it based on train_sizes.
random_state : int, RandomState instance or None, default=None
    Used when shuffle is True. Pass an int for reproducible output across multiple function calls. See the Glossary entry for random_state.
error_score : 'raise' or numeric, default=np.nan
    Value to assign to the score if an error occurs in estimator fitting. If set to 'raise', the error is raised. If a numeric value is given, FitFailedWarning is raised.
    Added in version 0.20.
return_times : bool, default=False
    Whether to return the fit and score times.
fit_params : dict, default=None
    Parameters to pass to the fit method of the estimator.
    Added in version 0.24.
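A few of the parameters above are easier to follow with a concrete call. The sketch below is illustrative only: the synthetic regression data, the Ridge estimator, the ShuffleSplit settings and the scorer name neg_mae are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import ShuffleSplit, learning_curve

X, y = make_regression(n_samples=150, n_features=4, noise=5.0, random_state=0)

# Float train_sizes are fractions of the largest training set. With 150 samples
# and 5-fold CV, each training fold holds 120 samples.
sizes_frac, _, _ = learning_curve(Ridge(), X, y, train_sizes=[0.1, 0.5, 1.0], cv=5)

# Integer train_sizes are taken as absolute sample counts.
sizes_abs, _, _ = learning_curve(Ridge(), X, y, train_sizes=[30, 90], cv=5)


def neg_mae(estimator, X, y):
    """Hypothetical scorer with the scorer(estimator, X, y) signature."""
    return -mean_absolute_error(y, estimator.predict(X))


# A CV splitter object and a callable scorer can replace the defaults;
# shuffle=True shuffles each training fold before the prefixes are taken.
cv = ShuffleSplit(n_splits=4, test_size=0.2, random_state=0)
sizes, train_scores, test_scores = learning_curve(
    Ridge(), X, y, train_sizes=[0.25, 1.0], cv=cv, scoring=neg_mae,
    shuffle=True, random_state=0,
)
print(sizes_frac, sizes_abs, sizes, train_scores.shape)
```

Because the scorer returns the negated mean absolute error, higher values are still better, which is the convention scikit-learn scorers follow.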
Returns
train_sizes_abs : array of shape (n_unique_ticks,)
    Numbers of training examples that have been used to generate the learning curve. Note that the number of ticks might be less than n_ticks because duplicate entries will be removed.
train_scores : array of shape (n_ticks, n_cv_folds)
    Scores on training sets.
test_scores : array of shape (n_ticks, n_cv_folds)
    Scores on test set.
fit_times : array of shape (n_ticks, n_cv_folds)
    Times spent for fitting in seconds. Only present if return_times is True.
score_times : array of shape (n_ticks, n_cv_folds)
    Times spent for scoring in seconds. Only present if return_times is True.
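The return values can be inspected with a short example. The data and estimator below are assumptions chosen for illustration; the train_sizes list deliberately contains two fractions that map to the same absolute size, to show the duplicate removal mentioned for train_sizes_abs (scikit-learn is expected to emit a warning about the removed entry).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

X, y = make_regression(n_samples=100, n_features=4, noise=5.0, random_state=0)

# With return_times=True, five arrays are returned instead of three.
# 0.5 and 0.501 of the 80-sample training folds both truncate to 40 samples,
# so only two unique ticks remain.
train_sizes_abs, train_scores, test_scores, fit_times, score_times = learning_curve(
    Ridge(), X, y, train_sizes=[0.5, 0.501, 1.0], cv=5, return_times=True
)

print(train_sizes_abs)       # unique absolute training set sizes
print(train_scores.shape)    # (number of unique ticks, number of folds)
print(fit_times.mean(axis=1), score_times.mean(axis=1))
```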