I have a dataset with categorical features, and my preprocessing looks like this:
dataset = data_df.values
# Splitting the dataset:
X = dataset[:,0:8]
y = dataset[:,8]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
# Columns to one-hot encode:
categorical_cols = ['emp','ed','st','jb','br']
categorical_cols_idx = [data_df.columns.get_loc(c) for c in categorical_cols]
del data_df
#Train Encoding
# Note: scikit-learn >= 1.2 names this argument sparse_output; sparse was removed in 1.4.
ohe = OneHotEncoder(sparse = False)
X_train_enc = ohe.fit_transform(X_train[:,categorical_cols_idx])
X_train_new = np.concatenate((X_train[:,:3],X_train_enc),axis = 1)
del X_train
del X_train_enc
#Test Encoding
X_test_enc = ohe.transform(X_test[:,categorical_cols_idx])
X_test_new = np.concatenate((X_test[:,:3],X_test_enc),axis = 1)
del X_test
del X_test_enc
But I doubt that my preprocessing steps are optimal. How should I optimize them?
Posted on 2021-10-18 08:55:45
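You can use make_column_transformer and make_column_selector to do this. make_column_transformer specifies which operations to apply to which columns, while make_column_selector selects columns based on their dtype. In addition, the remainder parameter controls what happens to the columns that the ColumnTransformer does not otherwise handle. Here, remainder='passthrough' simply returns the features that are not of dtype object unchanged.
For example: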
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer, make_column_selector
import pandas as pd

df = pd.DataFrame({
    'emp': ['A', 'A', 'B'],
    'ed': ['A', 'A', 'B'],
    'st': ['A', 'A', 'B'],
    'jb': ['A', 'A', 'B'],
    'br': ['A', 'A', 'B'],
    'other_feature_1': [0, 1, 2],
    'other_feature_2': [2, 3, 4]
})

# One-hot encode every object-dtype column; pass the remaining columns through unchanged.
# sparse_output requires scikit-learn >= 1.2; on older versions use sparse=False instead.
preprocessing = make_column_transformer(
    (OneHotEncoder(sparse_output=False), make_column_selector(dtype_include='object')),
    remainder='passthrough'
)
preprocessing.fit_transform(df)
This outputs the following array:
array([[1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 0., 2.],
[1., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 3.],
[0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 2., 4.]])
https://stackoverflow.com/questions/69613098