Column | Description |
---|---|
survival | Survived: 0 = No, 1 = Yes |
pclass | Ticket class |
sex | Sex |
Age | Age |
sibsp | Number of siblings/spouses aboard |
parch | Number of parents/children aboard |
ticket | Ticket number |
fare | Ticket fare |
cabin | Cabin number |
embarked | Port of embarkation: C = Cherbourg, Q = Queenstown, S = Southampton |
import pandas as pd  # data analysis
import matplotlib.pyplot as plt  # plotting
plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels, if any are used
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly with that font
data_train = pd.read_csv("/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/train.csv")
print(data_train.columns)
# Survival by passenger class
Survived_0 = data_train.Pclass[data_train.Survived == 0].value_counts()
print(Survived_0)
Survived_1 = data_train.Pclass[data_train.Survived == 1].value_counts()
print(Survived_1)
df = pd.DataFrame({'Not survived': Survived_0, 'Survived': Survived_1})
print(df)
df.plot(kind='bar', stacked=True)
plt.title("Survival by passenger class")
plt.xlabel("Passenger class")
plt.ylabel("Number of passengers")
plt.show()
Result:
Pclass | Not survived | Survived |
---|---|---|
1 | 80 | 136 |
2 | 97 | 87 |
3 | 372 | 119 |
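The raw counts are easier to compare as rates. The following is a minimal sketch that recomputes the survival rate per class from the counts in the result table above; the column names `not_survived`, `survived`, and `survival_rate` are chosen here for illustration and do not come from the original code.

```python
import pandas as pd

# Counts copied from the result table above (index = Pclass).
counts = pd.DataFrame(
    {"not_survived": [80, 97, 372], "survived": [136, 87, 119]},
    index=[1, 2, 3],
)

# Survival rate = survivors / all passengers in that class.
# The right-hand side is evaluated before the new column is added,
# so the row sum only covers the two count columns.
counts["survival_rate"] = counts["survived"] / counts.sum(axis=1)
print(counts.round(3))
```

Read this way, the gap between classes is clearer than in the stacked bar chart: first class has the highest share of survivors and third class the lowest.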
# Survival by port of embarkation
Survived_0 = data_train.Embarked[data_train.Survived == 0].value_counts()
Survived_1 = data_train.Embarked[data_train.Survived == 1].value_counts()
df = pd.DataFrame({'Not survived': Survived_0, 'Survived': Survived_1})
print(df)
Result:
Embarked | Not survived | Survived |
---|---|---|
S | 427 | 217 |
C | 75 | 93 |
Q | 47 | 30 |
# Survival by sex
Survived_m = data_train.Survived[data_train.Sex == 'male'].value_counts()
Survived_f = data_train.Survived[data_train.Sex == 'female'].value_counts()
df = pd.DataFrame({'Female': Survived_f, 'Male': Survived_m})
print(df)
Result:
Survived | Female | Male |
---|---|---|
0 | 81 | 468 |
1 | 233 | 109 |
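Each of the count tables above is built from two filtered `value_counts()` calls. As a possible shortcut, `pd.crosstab` produces the same kind of table in one call. The sketch below runs on a small made-up frame named `toy` (a stand-in for `data_train`, not real Titanic rows), so the numbers it prints are illustrative only.

```python
import pandas as pd

# Toy stand-in for data_train; the values are made up for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Rows are Survived (0/1), columns are Sex, cells are passenger counts.
table = pd.crosstab(toy["Survived"], toy["Sex"])
print(table)

# normalize="columns" turns the counts into per-column proportions,
# i.e. the share of each sex that did or did not survive.
rates = pd.crosstab(toy["Survived"], toy["Sex"], normalize="columns")
print(rates)
```

On the real data the same call would be `pd.crosstab(data_train.Survived, data_train.Sex)`, assuming `data_train` has been loaded as earlier in the post.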
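data_train.info()
The output shows missing values: Age is non-null in only 714 of the 891 rows, and Cabin and Embarked are also incomplete.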
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
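The `info()` output reports non-null counts, so the number of missing values has to be worked out by subtraction (for example 891 - 714 = 177 for Age). A small sketch of reading the missing counts directly with `isnull()` follows; the frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in Age and Cabin; the values are made up.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() marks each missing cell; sum() counts them per column,
# mean() gives the fraction of rows that are missing.
missing = toy.isnull().sum()
share = toy.isnull().mean()
print(pd.DataFrame({"missing": missing, "share": share}))
```

Applied to `data_train`, the same two lines should report the gaps in Age, Cabin, and Embarked without any manual arithmetic.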
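Treat Age as the prediction target: fit a model on the passengers whose age is known, then use it to fill in the missing ages.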
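from sklearn.ensemble import RandomForestRegressor
# Take the numeric columns from the training data; Age comes first so it can be split off as the target
age_df = data_train[['Age', 'Fare', 'Parch', 'SibSp', 'Pclass']]
# Split passengers into those with a known age and those without
known_age = age_df[age_df.Age.notnull()].to_numpy()  # as_matrix() was removed in pandas 1.0
unknown_age = age_df[age_df.Age.isnull()].to_numpy()
# y is the target: age
y = known_age[:, 0]
# X holds the feature values
X = known_age[:, 1:]
# Fit a RandomForestRegressor
rfr = RandomForestRegressor(random_state=0, n_estimators=2000, n_jobs=-1)
rfr.fit(X, y)
# Predict the unknown ages with the fitted model
predictedAges = rfr.predict(unknown_age[:, 1:])
# Fill the missing values with the predictions
data_train.loc[data_train.Age.isnull(), 'Age'] = predictedAges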
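For comparison, a much lighter-weight way to fill the gaps is to use the median age of each passenger class. This is a different technique from the random forest above, shown only as a baseline sketch; the frame `toy` and its values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame: two passengers have no recorded age. Values are made up.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, np.nan, 50.0, 20.0, 30.0, np.nan],
})

# Median age per class, broadcast back to every row of that class.
# Missing ages are skipped when the median is computed.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Keep known ages, fill only the gaps with the class median.
filled = toy["Age"].fillna(class_median)
print(filled.tolist())
```

A baseline like this is cheap to compute and gives a reference point for judging whether the model-based imputation is worth its extra cost.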
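Finally, only eight features are kept: Pclass, Age, SibSp, Parch, Sex, Cabin, Fare, and Embarked. The categorical ones are expanded with dummy (one-hot) encoding, which gives the following columns: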
Pclass 891 non-null int64
Age 891 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Fare 891 non-null float64
Sex_female 891 non-null uint8
Sex_male 891 non-null uint8
Cabin_NO 891 non-null uint8
Cabin_YES 891 non-null uint8
Embarked_C 891 non-null uint8
Embarked_Q 891 non-null uint8
Embarked_S 891 non-null uint8
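The column expansion listed above is what `pd.get_dummies` produces for the categorical features. A minimal sketch on an invented three-row frame (`toy` is hypothetical, and Cabin is assumed to have already been reduced to a YES/NO flag):

```python
import pandas as pd

# Toy frame with one numeric and three categorical columns; values are made up.
toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Cabin": ["NO", "YES", "NO"],
    "Embarked": ["S", "C", "Q"],
})

# Each listed column is replaced by one indicator column per category,
# named <column>_<category>; Pclass is left untouched.
encoded = pd.get_dummies(toy, columns=["Sex", "Cabin", "Embarked"])
print(list(encoded.columns))
```

Listing the columns explicitly keeps the encoding independent of how pandas infers dtypes; the indicator dtype itself (uint8 in the listing above) may differ between pandas versions.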
Modeling:
import numpy as np
import pandas as pd
test_file = "/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/test.csv"
train_file = "/Users/wangsen/ai/03/3day_feature/code/a8_titanic/data/train.csv"
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)
# Separate the label from the features
train_target = train_df.pop('Survived')
# Drop identifier-like columns that are not used as features
train_df = train_df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
test_df = test_df.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
train_len = train_df.shape[0]
test_len = test_df.shape[0]
all_df = pd.concat((train_df, test_df), axis=0)  # combined frame; not used below
# Reduce Cabin to a YES/NO flag for whether a cabin number is recorded
has_cabin = train_df.Cabin.notnull()
train_df.loc[has_cabin, 'Cabin'] = 'YES'
train_df.loc[~has_cabin, 'Cabin'] = 'NO'
print(train_df['Cabin'].value_counts())
# Fill missing ages with the mean age
train_df.Age = train_df.Age.fillna(train_df.Age.mean())
# One-hot encode the categorical columns (Sex, Cabin, Embarked)
train_dummy_df = pd.get_dummies(train_df)
#train_df.info()
#print(train_df.describe())
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(train_dummy_df, train_target)
# Accuracy on the training data itself
score = lr.score(train_dummy_df, train_target)
print("Accuracy of simple logistic regression:", score)
Result:
Accuracy of simple logistic regression: 0.8002244668911336
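The Titanic dataset is fairly clean, and only a small number of missing values need to be handled. A linear logistic regression model reaches about 80% accuracy, measured here on the training data.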
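The accuracy above is computed on the same rows the model was fit on, so it may overstate how well the model generalizes. One common check is to hold out part of the data. The sketch below shows the pattern on synthetic data generated with NumPy (not the Titanic features); the variable names and the 25% split are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data: the label depends on the first
# feature plus noise. This is a stand-in, not the Titanic dataset.
rng = np.random.RandomState(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression()
model.fit(X_train, y_train)

# Compare accuracy on the fitted rows with accuracy on the held-out rows.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print("train accuracy:", train_acc)
print("held-out accuracy:", test_acc)
```

With the Titanic data, the same split could be applied to `train_dummy_df` and `train_target` before calling `fit`, so that the reported score reflects rows the model has not seen.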