In this recipe, we'll look at another way of working with categorical variables. When only one or two categories of a feature are important, it can be wise to avoid the extra dimensionality that encoding every category would create.
Getting ready
There's another way to work with categorical variables. Instead of handling them with OneHotEncoder, we can use LabelBinarizer, which combines thresholding with the handling of categorical variables.
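To make the relationship between the two encoders concrete, here is a minimal sketch on a small made-up label vector (the colors array is an assumption for illustration, not part of the recipe). It suggests that both produce the same indicator columns, and that the difference lies mainly in the input shape and the output type.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer, OneHotEncoder

# Hypothetical labels, used only for illustration.
colors = np.array(["red", "green", "blue", "green"])

# LabelBinarizer takes the 1-D label vector directly.
lb_out = LabelBinarizer().fit_transform(colors)

# OneHotEncoder expects a 2-D feature matrix, so reshape to one column.
# Its default output is a SciPy sparse matrix, hence .toarray().
ohe_out = OneHotEncoder().fit_transform(colors.reshape(-1, 1)).toarray()

print(lb_out)
print(ohe_out)
```

Both encoders order the columns by sorted class value, so the two arrays line up column by column.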
To show how this works, load the iris dataset:
from sklearn import datasets as d
iris = d.load_iris()
target = iris.target
How to do it...
Import the LabelBinarizer class and create an object:
from sklearn.preprocessing import LabelBinarizer
label_binarizer = LabelBinarizer()
Now, simply transform the target outcomes to the new feature space:
new_target = label_binarizer.fit_transform(target)
Let's look at new_target and the label_binarizer object to get a feel for what happened:
new_target.shape
(150, 3)
new_target[:5]
array([[1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0],
       [1, 0, 0]])
new_target[-5:]
array([[0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1],
       [0, 0, 1]])
label_binarizer.classes_
array([0, 1, 2])
How it works...
The iris target has a cardinality of 3, that is, it has three unique values. LabelBinarizer converts the N x 1 vector into an N x C matrix, where C is the cardinality of the N x 1 vector. Note that once the object has been fit, transforming a value that was not seen during fitting does not raise an error. The unseen value matches none of the known classes, so its row contains only the negative label:
label_binarizer.transform([4])
array([[0, 0, 0]])
There's more...
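To make the N x C mapping and the all-zero row for unseen values concrete, here is a hand-rolled sketch in plain Python. It is not scikit-learn's implementation: binarize and its parameters are hypothetical names chosen for illustration, and the class list is written out by hand rather than learned from data.

```python
def binarize(labels, classes, neg_label=0, pos_label=1):
    """Map each label to a row with pos_label in its class column.

    Hypothetical helper for illustration only; not scikit-learn code.
    """
    rows = []
    for label in labels:
        # A label outside `classes` matches no column, so its row
        # contains only neg_label.
        rows.append([pos_label if label == c else neg_label for c in classes])
    return rows


classes = [0, 1, 2]  # the classes that fitting on the iris target learns
encoded = binarize([0, 2, 4], classes)
print(encoded)
```

The label 4 is absent from classes, so its row has no positive entry, which mirrors the array([[0, 0, 0]]) result above.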
Zero and one do not have to represent the negative and positive instances of the target value. For example, if we want positive values to be represented by 1000 and negative values by -1000, we simply say so when we create label_binarizer:
label_binarizer = LabelBinarizer(neg_label=-1000, pos_label=1000)
label_binarizer.fit_transform(target)[:5]
array([[ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000],
       [ 1000, -1000, -1000]])
The positive and negative values must be integers, with neg_label strictly less than pos_label.
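As a minimal sketch of how scikit-learn rejects a pair in which neg_label is not strictly less than pos_label, the snippet below tries one valid and one invalid pair. The short target list stands in for iris.target, and try_labels is a hypothetical helper, not part of the library. Depending on the scikit-learn version, the ValueError may be raised at construction or at fit time, so both calls sit inside the try block.

```python
from sklearn.preprocessing import LabelBinarizer

target = [0, 1, 2, 1, 0]  # small stand-in for iris.target


def try_labels(neg_label, pos_label):
    """Return the binarized target, or the error message if the pair is rejected."""
    try:
        binarizer = LabelBinarizer(neg_label=neg_label, pos_label=pos_label)
        return binarizer.fit_transform(target)
    except ValueError as exc:
        return str(exc)


ok = try_labels(-1000, 1000)   # valid: neg_label < pos_label
rejected = try_labels(5, 5)    # invalid: neg_label is not below pos_label

print(ok[:2])
print(rejected)
```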
This article is a translation of a foreign-language text.
If it infringes any rights, please contact cloudcommunity@tencent.com to have it removed.