KNN is one of the simplest algorithms in machine learning. A classic one-line summary: birds of a feather flock together.
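The k-nearest neighbor algorithm (k-nearest neighbor, kNN) is a basic method for classification and regression. For classification, the input is the feature vector of an instance and the output is the class of that instance; there can be more than two classes. The main points of the algorithm are: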
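Prediction: the class of a new instance is predicted from the classes of its k nearest training instances, for example by majority vote.
Choice of k: how should k be chosen? Is larger always better? Should it be odd or even? What is a good empirical value?
Intuitive explanation: given a training data set and a new input instance, find the k instances in the training set that are nearest to it. If the majority of these k instances belong to a certain class, the new instance is assigned to that class.
Input: the training data set
T=\{(x_1,y_1),(x_2,y_2),…,(x_N,y_N)\}
where x_i is the feature vector of an instance and y_i is its class, i=1,2,…,N.
Output: the class y to which instance x belongs.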
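According to the given distance metric, find the k points in the training set T that are nearest to x. The neighborhood of x that covers these k points is denoted N_k(x).
Within N_k(x), the class y of x is decided by a classification decision rule such as majority vote:
y=\arg\max_{c_j}\sum_{x_i\in N_k(x)}I(y_i=c_j), i=1,2,…,N; j=1,2,…,K
In the formula above, I is the indicator function: it equals 1 when y_i=c_j and 0 otherwise.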
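The procedure above (find the k nearest points, then take a majority vote) can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation: the function name `knn_predict`, the toy data, and the use of Euclidean distance are assumptions made for the example.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x, k=3):
    """Predict the class of x by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x = np.asarray(x, dtype=float)
    # Euclidean distance from x to every training instance
    distances = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    # Indices of the k closest training instances, i.e. the neighborhood N_k(x)
    nearest = np.argsort(distances)[:k]
    # Count how often each class occurs in N_k(x) and return the most common one
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two well-separated groups in the plane
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))
```

When two classes receive the same number of votes, this sketch returns the tied class whose neighbor was encountered first (the nearer one); other tie-breaking rules are equally possible.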
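The special case k=1 is called the nearest neighbor algorithm: for a new input instance x, the class of the training point nearest to x is taken as the class of x.
The k-nearest neighbor model has three main elements: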
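the distance metric
the choice of k
the classification decision rule
The distance between two instance points in the feature space reflects how similar they are. The feature space of the k-nearest neighbor model is generally the n-dimensional real vector space R^n. The Euclidean distance is the usual choice, but other distances can be used, such as the more general L_p distance (Minkowski distance).
Suppose the feature space X is the n-dimensional real vector space R^n and x_i,x_j\in X, where x_i=(x_i^{(1)},x_i^{(2)},…,x_i^{(n)}) and x_j=(x_j^{(1)},x_j^{(2)},…,x_j^{(n)}). The L_p distance between x_i and x_j is defined as
L_p(x_i,x_j)=\left(\sum_{l=1}^{n}|x_i^{(l)}-x_j^{(l)}|^p\right)^{\frac{1}{p}}
This is the Minkowski distance.
Here p\geq1 is required. With p=1 it is the Manhattan distance and with p=2 the Euclidean distance.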
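To make the role of p concrete, here is a small sketch of the L_p (Minkowski) distance in NumPy. The function name `minkowski_distance` and the two sample vectors are assumptions chosen for illustration.

```python
import numpy as np


def minkowski_distance(a, b, p=2):
    """L_p distance between two vectors, for p >= 1."""
    if p < 1:
        raise ValueError("p must be >= 1")
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Sum of the p-th powers of the coordinate differences, then the p-th root
    return float((np.abs(a - b) ** p).sum() ** (1.0 / p))


a = [0.0, 0.0]
b = [3.0, 4.0]
print(minkowski_distance(a, b, p=1))  # Manhattan distance: |3| + |4| = 7
print(minkowski_distance(a, b, p=2))  # Euclidean distance: sqrt(9 + 16) = 5
```

As p grows, the result approaches the largest coordinate difference (4 for these two vectors), so different values of p can rank neighbors differently.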
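k is generally chosen to be small, and usually odd so that a two-class vote cannot end in a tie; the best value of k is typically selected by cross-validation.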
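Selecting k by cross-validation could look like the sketch below. It assumes scikit-learn's `cross_val_score` with 5 folds on the digits data set; the list of candidate values is an arbitrary choice for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

digits = datasets.load_digits()
X, y = digits.data, digits.target

# Candidate odd k values; this range is an assumption for illustration
candidate_ks = [1, 3, 5, 7, 9]
mean_scores = {}
for k in candidate_ks:
    clf = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    mean_scores[k] = cross_val_score(clf, X, y, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(best_k, mean_scores[best_k])
```

In practice the search would normally be run on the training portion only, so that the test set stays untouched for the final evaluation.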
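The classification decision in the k-nearest neighbor method is usually made by majority vote: the class of the input instance is the majority class among its k nearest training instances. If the loss function is the 0-1 loss, the classification function is
f:R^n\rightarrow\{c_1,c_2,…,c_K\}
Under this loss, the majority vote rule is equivalent to empirical risk minimization.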
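The equivalence can be seen from a short calculation. Assuming the neighborhood N_k(x) is assigned the class c_j, the misclassification rate on the k neighbors (the empirical risk under 0-1 loss) is:

```latex
\frac{1}{k}\sum_{x_i\in N_k(x)} I(y_i\neq c_j)
  = 1-\frac{1}{k}\sum_{x_i\in N_k(x)} I(y_i=c_j)
```

Minimizing the left-hand side over c_j is therefore the same as maximizing \sum_{x_i\in N_k(x)} I(y_i=c_j), which is exactly what the majority vote does.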
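Data normalization: map all the data onto the same scale.
Min-max normalization suits data whose distribution has clear bounds, such as exam scores or pixel values. Drawback: it is sensitive to outliers and works poorly for data without clear bounds, such as income.
Mean-variance standardization suits data whose distribution has no clear bounds and which may contain extreme values.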
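The two situations described here correspond to what are commonly called min-max normalization and mean-variance standardization. The sketch below applies both to a hypothetical feature column containing one extreme value; the variable names and data are assumptions for illustration.

```python
import numpy as np

# Hypothetical feature column with one extreme value
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Min-max normalization: maps the values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Mean-variance standardization: zero mean and unit standard deviation
x_standard = (x - x.mean()) / x.std()

print(x_minmax)
print(x_standard)
```

With min-max normalization the single extreme value squeezes the other four values close to 0, which is the outlier sensitivity mentioned in the text; standardization does not confine the values to a fixed interval.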
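# Import the class used for mean-variance standardization
from sklearn.preprocessing import StandardScaler
standScaler = StandardScaler()
# Learn the per-feature mean and standard deviation from the training data
standScaler.fit(X_train)
standScaler.mean_
# scale_ holds the per-feature standard deviation, not the variance
standScaler.scale_
# transform returns a standardized copy; X_train itself is not modified
standScaler.transform(X_train)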
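A scaler fitted on the training data is normally reused, unchanged, on the test data. The following sketch shows one way to do that before fitting a kNN classifier; the digits data set, k=3, and the variable names are assumptions for illustration.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=666)

scaler = StandardScaler()
scaler.fit(X_train)                      # statistics come from the training set only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # reuse the training mean and scale

knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train_std, y_train)
print(knn_clf.score(X_test_std, y_test))
```

Fitting a separate scaler on the test set would rescale it with different statistics, so distances between training and test points would no longer be comparable.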
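Split the loaded sample data into a training set (train) and a test set (test); a common ratio is 8:2, that is, 20% of the samples are held out for testing. Fixing the random seed (random_state) makes the split reproducible.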
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
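What such a split does can be sketched by hand: shuffle the sample indices with a fixed seed and cut them at the chosen ratio. The helper `simple_train_test_split` below is hypothetical and written for illustration; it is not scikit-learn's implementation and will not reproduce the same shuffle as `train_test_split`.

```python
import numpy as np


def simple_train_test_split(X, y, test_ratio=0.2, seed=None):
    """Shuffle the indices and split X, y into training and test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(len(X))
    test_size = int(len(X) * test_ratio)
    # The first test_size shuffled indices form the test set, the rest the training set
    test_idx, train_idx = shuffled[:test_size], shuffled[test_size:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Hypothetical demo data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)
X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo, test_ratio=0.2, seed=666)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Calling the function again with the same seed returns the same split, which is what makes experiments repeatable.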
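The hyperparameter p is the exponent of the Minkowski distance; in the grid search below it is only varied together with weights='distance'.
import numpy as np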
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the data set
digits = datasets.load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
# Split into training data and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
# KNeighborsClassifier is scikit-learn's kNN classifier
from sklearn.neighbors import KNeighborsClassifier
# n_neighbors is the value of k
knn_clf = KNeighborsClassifier(n_neighbors=3, weights='uniform')
# Fit the model to the training data
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)
0.9888888888888889
# Hyperparameter grid to search
param_grid = [
{
'weights': ['uniform'],
'n_neighbors': [i for i in range(1, 11)]
},
{
'weights': ['distance'],
'n_neighbors': [i for i in range(1, 11)],
'p': [i for i in range(1, 6)]
}
]
# Instantiate the classifier
knn_clf = KNeighborsClassifier()
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)
%%time
grid_search.fit(X_train, y_train)
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=5, p=2,
weights='uniform'),
iid='warn', n_jobs=None,
param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': ['uniform']},
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=0)
grid_search.best_estimator_
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=None, n_neighbors=3, p=3,
weights='distance')
# Best mean cross-validated accuracy
grid_search.best_score_
0.9853862212943633
# Best parameter combination found
grid_search.best_params_
{'n_neighbors': 3, 'p': 3, 'weights': 'distance'}
# Accuracy on the test data
knn_clf = grid_search.best_estimator_
knn_clf.score(X_test,y_test)
0.9833333333333333
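# n_jobs: number of CPU cores to use in parallel (-1 uses all of them) to speed up the search; verbose: how much progress information to print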
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=1, verbose=2)
grid_search.fit(X_train, y_train)
c:\users\admin\venv\lib\site-packages\sklearn\model_selection\_split.py:1978: FutureWarning: The default value of cv will change from 3 to 5 in version 0.22. Specify it explicitly to silence this warning.
warnings.warn(CV_WARNING, FutureWarning)
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 60 candidates, totalling 180 fits
[CV] n_neighbors=1, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total= 0.6s
[CV] n_neighbors=1, weights=uniform ..................................
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.5s remaining: 0.0s
[CV] ................... n_neighbors=1, weights=uniform, total= 0.6s
[CV] n_neighbors=1, weights=uniform ..................................
[CV] ................... n_neighbors=1, weights=uniform, total= 0.5s
....
[CV] ............ n_neighbors=10, p=5, weights=distance, total= 0.6s
GridSearchCV(cv='warn', error_score='raise-deprecating',
estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30,
metric='minkowski',
metric_params=None, n_jobs=None,
n_neighbors=3, p=3,
weights='distance'),
iid='warn', n_jobs=1,
param_grid=[{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'weights': ['uniform']},
{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'p': [1, 2, 3, 4, 5], 'weights': ['distance']}],
pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
scoring=None, verbose=2)