Preface

The exploratory data analysis and feature engineering in this post draw heavily on a Kaggle tutorial: https://www.kaggle.com/ash316/eda-to-prediction-dietanic
I spent several days putting this article together. It covers the basic workflow of a Kaggle competition: exploratory data analysis, feature engineering, modeling, and parameter tuning.

Importing libraries and loading the data

import warnings

warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Data overview

print(train.shape)
train.head()
(891, 12)
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
train.tail()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
print(test.shape)
test.head()
(418, 11)
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
train.describe() # summary statistics for the numeric features
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
train.describe(include='O') # summary of the object (string) features
Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... male 1601 C23 C25 C27 S
freq 1 577 7 4 644
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

What the features mean:

  • Survived: 1 means rescued, 0 means perished
  • Pclass: ticket class, 1 is the highest, 3 the lowest
  • Sex: sex
  • Age: age
  • SibSp: number of siblings and spouses aboard
  • Parch: number of parents and children aboard
  • Ticket: ticket number
  • Fare: ticket price
  • Cabin: cabin number
  • Embarked: port of embarkation

Exploratory data analysis

The two calls below show which features are numeric and which are strings/categorical:

train.select_dtypes(include='number').columns
Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')
train.select_dtypes(include=['object']).columns
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')

Survived: the target variable

train.Survived.value_counts(normalize=True)
0    0.616162
1    0.383838
Name: Survived, dtype: float64
train.Survived.value_counts()
0    549
1    342
Name: Survived, dtype: int64
sns.countplot('Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bd52c88>

[figure: countplot of Survived]

Judging by raw counts alone, survivors make up well under half of all passengers (about 38%).

Sex: unordered categorical variable

train.groupby(['Sex','Survived'])['Survived'].count()
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64
pd.crosstab(train.Sex, train.Survived, margins=True) 
Survived 0 1 All
Sex
female 81 233 314
male 468 109 577
All 549 342 891

The crosstab shows that women were far more likely to be rescued; a plot makes this even clearer.

sns.countplot('Sex', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bdcb1d0>

[figure: countplot of Sex, split by Survived]

Embarked: unordered categorical variable

train[['Embarked', 'Survived']].groupby('Embarked').sum()
Survived
Embarked
C 93
Q 30
S 217
pd.crosstab(train.Embarked, train.Survived, margins=True)
Survived 0 1 All
Embarked
C 75 93 168
Q 47 30 77
S 427 217 644
All 549 340 889
sns.countplot('Embarked', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1be2e898>

[figure: countplot of Embarked, split by Survived]

Although port S contributed the largest number of survivors, the highest survival rate belongs to passengers who embarked at port C.

We saw earlier that women had a much higher chance of being rescued; let's factor that in here:

sns.factorplot('Pclass', 'Survived',hue='Sex', col='Embarked', data=train)
<seaborn.axisgrid.FacetGrid at 0x1bd3de10>

[figure: survival rate by Pclass and Sex, one panel per Embarked]

Pclass: ordered categorical variable

train[['Pclass', 'Survived']].groupby('Pclass').sum() # a bare groupby sum only gives survivor totals; the crosstab below is more informative
Survived
Pclass
1 136
2 87
3 119
pd.crosstab(train.Pclass, train.Survived, margins=True)
Survived 0 1 All
Pclass
1 80 136 216
2 97 87 184
3 372 119 491
All 549 342 891
sns.countplot('Pclass', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bf66d68>

[figure: countplot of Pclass, split by Survived]

Class 3 had by far the most passengers, yet its survivors still don't add up to as many as class 1's. It's good to be rich...

Sex is clearly an important feature; let's bring it in again:

pd.crosstab([train.Sex, train.Survived], 
           train.Pclass, margins=True)
Pclass 1 2 3 All
Sex Survived
female 0 3 6 72 81
1 91 70 72 233
male 0 77 91 300 468
1 45 17 47 109
All 216 184 491 891
sns.factorplot('Pclass', 'Survived', hue='Sex', data=train)
<seaborn.axisgrid.FacetGrid at 0x1bfad9e8>

[figure: survival rate by Pclass, by Sex]

Women clearly had priority, and money helped on top of that: the survival rate of class-1 women is nearly 1.

Age: continuous variable

print('min:', train.Age.min())
print('max:', train.Age.max())
print('mean:', train.Age.mean())
print('median:', train.Age.median())
print('kurt:', train.Age.kurt())
print('skew:', train.Age.skew())

min: 0.42
max: 80.0
mean: 29.69911764705882
median: 28.0
kurt: 0.17827415364210353
skew: 0.38910778230082704

The missing values haven't been handled yet, so sns.distplot() can't draw the distribution directly (it fails on NaN).
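A minimal workaround, assuming we just want a quick look before any imputation, is to drop the NaNs for plotting only:

# plot the Age distribution, ignoring missing values for now
sns.distplot(train.Age.dropna(), bins=30)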

sns.violinplot( 'Pclass', 'Age', hue='Survived', data=train, split=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1c0067b8>

[figure: violin plot of Age by Pclass, split by Survived]

sns.violinplot( 'Sex', 'Age', hue='Survived', data=train, split=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1c04af98>

[figure: violin plot of Age by Sex, split by Survived]

The violin plots show the age distribution within each class: older passengers tend to sit in the higher classes, and those aged roughly 20-40 had somewhat better odds of being rescued.

Fare: continuous variable

print('min:', train.Fare.min())
print('max:', train.Fare.max())
print('mean:', train.Fare.mean())
print('median:', train.Fare.median())
print('kurt:', train.Fare.kurt())
print('skew:', train.Fare.skew())

min: 0.0
max: 512.3292
mean: 32.204207968574636
median: 14.4542
kurt: 33.39814088089868
skew: 4.787316519674893

The large skewness reflects a long right tail: most fares are small, while a handful of very expensive tickets drag the mean well above the median.

from scipy.stats import norm
plt.figure(figsize=(10, 10))
sns.distplot(train.Fare, kde=True, bins=30, fit=norm)
<matplotlib.axes._subplots.AxesSubplot at 0x1bffa358>

[figure: Fare distribution with fitted normal curve]

Clearly far from a normal distribution; roughly normal features are generally considered easier for many models to work with.
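A common remedy for this kind of right skew (not applied in this walkthrough) is a log transform; np.log1p also handles the zero fares. A quick sketch:

# log1p compresses the long right tail, pulling the skewness down sharply
log_fare = np.log1p(train.Fare)
print('skew before:', train.Fare.skew())
print('skew after: ', log_fare.skew())
sns.distplot(log_fare, bins=30, fit=norm)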

sns.factorplot('Pclass', 'Fare', hue='Survived', data=train)
<seaborn.axisgrid.FacetGrid at 0x1c0f1ef0>

[figure: mean Fare by Pclass, split by Survived]

Parch and SibSp: discrete counts

pd.crosstab(train.Parch, train.Survived, margins=True)
Survived 0 1 All
Parch
0 445 233 678
1 53 65 118
2 40 40 80
3 2 3 5
4 4 0 4
5 4 1 5
6 1 0 1
All 549 342 891
pd.crosstab(train.SibSp, train.Survived, margins=True)
Survived 0 1 All
SibSp
0 398 210 608
1 97 112 209
2 15 13 28
3 12 4 16
4 15 3 18
5 5 0 5
8 7 0 7
All 549 342 891
sns.barplot('SibSp', 'Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bfae630>

[figure: survival rate by SibSp (bar plot)]

sns.factorplot('SibSp', 'Survived', data=train)
<seaborn.axisgrid.FacetGrid at 0x1c08ca90>

[figure: survival rate by SibSp (factor plot)]

sns.countplot('Parch', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1c1a4400>

[figure: countplot of Parch, split by Survived]

sns.countplot('SibSp', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1c1a0e10>

[figure: countplot of SibSp, split by Survived]

In absolute numbers, passengers traveling alone account for the most survivors, though as we'll see with the family-size feature their survival rate is actually on the low side.

Correlations between features

f = plt.figure(figsize=(10, 8))
sns.heatmap(train.corr(), annot=True, cmap='RdYlGn', linewidth=0.2)
<matplotlib.axes._subplots.AxesSubplot at 0x1c241f60>

[figure: correlation heatmap of the numeric features]

Feature engineering and data cleaning

At this step, the data from train.csv and test.csv must be processed together so that both end up with identical features; otherwise the trained model cannot be used to predict on test.csv.

train.shape, test.shape
((891, 12), (418, 11))
all_data = pd.concat([train, test])
print(all_data.shape)
all_data.head(2)
(1309, 12)
Age Cabin Embarked Fare Name Parch PassengerId Pclass Sex SibSp Survived Ticket
0 22.0 NaN S 7.2500 Braund, Mr. Owen Harris 0 1 3 male 1 0.0 A/5 21171
1 38.0 C85 C 71.2833 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 2 1 female 1 1.0 PC 17599
len(all_data[all_data.Survived.isnull()])
418
len(all_data[all_data.Survived.notnull()])
891

After the concatenated data has been processed uniformly, the two sets can be separated again simply by checking whether Survived is null.

Handling missing values

all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
all_data.describe(include='O')
Cabin Embarked Name Sex Ticket
count 295 1307 1309 1309 1309
unique 186 3 1307 2 929
top C23 C25 C27 S Connolly, Miss. Kate male CA. 2343
freq 6 914 2 843 11
all_data.Age.fillna(all_data.Age.median(), inplace=True)  # impute Age with the median
all_data.Embarked.fillna('S', inplace=True)  # 'S' is by far the most common port
all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age            1309 non-null float64
Cabin          295 non-null object
Embarked       1309 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived       891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB

Cabin has far too many missing values to fill in sensibly, so we won't impute it; it gets dropped outright.
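An alternative worth knowing, though not used in this pipeline: keep only the deck letter from Cabin and treat missing as its own category. A sketch:

# the first letter of Cabin is the deck; NaN becomes the category 'U' (unknown)
deck = all_data.Cabin.str[0].fillna('U')
print(deck.value_counts())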

Age

We now split the continuous Age feature into bands, creating a new feature for it.

According to the referenced kernel, continuous features can cause problems for some machine-learning models and are better discretized. I'm not sure this claim is correct; with my current knowledge I remain skeptical.

Age ranges from 0 to 80, so splitting it into 5 bands gives each band a width of 16.

all_data['Age_band'] = 0
all_data.loc[all_data.Age <= 16, 'Age_band'] = 0
all_data.loc[(all_data.Age > 16) & (all_data.Age <= 32), 'Age_band'] = 1
all_data.loc[(all_data.Age > 32) & (all_data.Age <= 48), 'Age_band'] = 2
all_data.loc[(all_data.Age > 48) & (all_data.Age <= 64), 'Age_band'] = 3
all_data.loc[all_data.Age > 64, 'Age_band'] = 4
all_data.Age_band.value_counts().to_frame()
Age_band
1 787
2 269
0 134
3 106
4 13
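The same banding can be done in one line with pd.cut; a sketch that should reproduce the manual assignment above:

# five equal-width bins over (0, 80], encoded as integer codes 0-4
age_band = pd.cut(all_data.Age, bins=[0, 16, 32, 48, 64, 80], labels=False)
print(age_band.value_counts().sort_index())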

With the new feature created, we can analyze it with a plot as well:

sns.factorplot('Age_band', 'Survived', col='Pclass', data=all_data)
<seaborn.axisgrid.FacetGrid at 0x1c2f7160>

[figure: survival rate by Age_band, one panel per Pclass]

The pattern holds within every Pclass: survival rate drops as the age band increases.

Family

all_data['Family_size'] = all_data.Parch + all_data.SibSp  # total number of relatives aboard

all_data['Alone'] = 0
all_data.loc[all_data.Family_size == 0, 'Alone'] = 1
f, ax = plt.subplots(1, 2, figsize=(10, 6))
sns.factorplot('Family_size', 'Survived', data=all_data, ax=ax[0])
sns.factorplot('Alone', 'Survived', data=all_data, ax=ax[1])
plt.close(2)
plt.close(3)

[figure: survival rate by Family_size and by Alone]

Traveling alone means a low chance of survival, and survival also falls off once family size exceeds 4.

Fare

Fare is also continuous, so we discretize it as well, this time with pd.qcut (quantile-based binning).

all_data['Fare_range'] = pd.qcut(all_data.Fare, 4)
all_data.groupby('Fare_range')['Survived'].mean().to_frame()
Survived
Fare_range
(-0.001, 7.896] 0.197309
(7.896, 14.454] 0.303571
(14.454, 31.275] 0.441048
(31.275, 512.329] 0.600000
type(all_data.Fare_range[0])  # the concat left duplicate index labels, so label 0 matches two rows and returns a Series
pandas.core.series.Series

The more the ticket cost, the better the odds of survival...

Fare_range currently holds intervals of the raw values; as with Age, we should convert them into single integer codes.

all_data['Fare_cat']=0
all_data.loc[all_data['Fare']<=7.91,'Fare_cat']=0
all_data.loc[(all_data['Fare']>7.91)&(all_data['Fare']<=14.454),'Fare_cat']=1
all_data.loc[(all_data['Fare']>14.454)&(all_data['Fare']<=31),'Fare_cat']=2
all_data.loc[(all_data['Fare']>31)&(all_data['Fare']<=513),'Fare_cat']=3
sns.factorplot('Fare_cat', 'Survived', hue='Sex', data=all_data)
<seaborn.axisgrid.FacetGrid at 0x1c222f28>

[figure: survival rate by Fare_cat, by Sex]
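Note that the manual thresholds (7.91, 14.454, 31) are just the quartile edges from Fare_range, lightly rounded. pd.qcut can emit the integer codes directly; a sketch:

# quartile-based codes 0-3 in a single step
fare_cat = pd.qcut(all_data.Fare, 4, labels=False)
# the one missing Fare in the test half stays NaN here and would still need filling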

Converting strings to numbers

all_data['Sex'].replace(['male', 'female'], [0,1], inplace=True)
all_data['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2], inplace=True)
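One caveat: label-encoding Embarked imposes an artificial order (S < C < Q). Tree-based models don't mind much, but for linear models one-hot encoding is often safer. A sketch of the alternative, shown on the untouched train frame where Embarked still holds letters:

# one column per port instead of a single 0/1/2 code
dummies = pd.get_dummies(train.Embarked, prefix='Embarked')
print(dummies.head())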

Dropping unneeded features

plt.figure(figsize=(10, 8))
all_data.drop(['Name', 'Age', 'Ticket', 'Fare', 'Cabin', 'Fare_range',
           'PassengerId'], axis=1, inplace=True)
sns.heatmap(all_data.corr(), annot=True, cmap='RdYlGn', linewidth=0.2)
<matplotlib.axes._subplots.AxesSubplot at 0x1c794ba8>

[figure: correlation heatmap of the engineered features]

all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 10 columns):
Embarked       1309 non-null int64
Parch          1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null int64
SibSp          1309 non-null int64
Survived       891 non-null float64
Age_band       1309 non-null int64
Family_size    1309 non-null int64
Alone          1309 non-null int64
Fare_cat       1309 non-null int64
dtypes: float64(1), int64(9)
memory usage: 152.5 KB

Outlier detection

To be filled in when I find the time.
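As a placeholder, here is a minimal IQR-based sketch on Fare (a common rule of thumb, not something applied in this pipeline):

# flag fares outside 1.5 * IQR of the quartiles
q1, q3 = train.Fare.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = train[(train.Fare < q1 - 1.5 * iqr) | (train.Fare > q3 + 1.5 * iqr)]
print(len(outliers), 'potential Fare outliers')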

Building models

Preparing the data

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
train_data = all_data[all_data.Survived.notnull()]
train_data.shape
(891, 10)
test_data = all_data[all_data.Survived.isnull()].copy()  # copy to avoid SettingWithCopyWarning on the drop below
test_data.drop('Survived', axis=1, inplace=True)
test_data.shape
(418, 9)
train_data.head(1)
Embarked Parch Pclass Sex SibSp Survived Age_band Family_size Alone Fare_cat
0 0 0 3 0 1 0.0 1 1 0 0
X = train_data.drop('Survived', axis=1)
y = train_data['Survived']
X.shape, y.shape
((891, 9), (891,))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((712, 9), (179, 9), (712,), (179,))
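One caveat: without a fixed random_state the split changes on every run, so the scores below won't be exactly reproducible; stratifying also keeps the class ratio stable across the two parts. A hedged variant:

# reproducible, class-balanced split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)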

Logistic regression

lg = LogisticRegression(C=14)
lg.fit(X_train, y_train)
LogisticRegression(C=14, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
lg.score(X_test, y_test)
0.7877094972067039
params = np.arange(1, 20 )
scores = []
for c in params:
    lg = LogisticRegression(C=c)
    score = cross_val_score(lg, X_train, y_train, cv=10).mean()
    scores.append(score)
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x22ec81d0>]

[figure: mean CV accuracy vs C for logistic regression]

params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.7993600491839927
params_score[max_score]
6
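A small caveat on the dict(zip(scores, params)) trick: if two C values happen to tie on score, the later one silently overwrites the earlier. Indexing with np.argmax avoids the collision:

best_c = params[int(np.argmax(scores))]  # first parameter achieving the best mean CV score
print(best_c, max(scores))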

Logistic regression with built-in tuning

from sklearn.linear_model import LogisticRegressionCV
log_cv = LogisticRegressionCV()
log_cv.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
           fit_intercept=True, intercept_scaling=1.0, max_iter=100,
           multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
           refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
log_cv.score(X_test, y_test)
0.776536312849162

SVM

clf = SVC(C=1)
clf.fit(X_train, y_train)
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
clf.score(X_test, y_test)
0.7988826815642458
params = np.arange(1, 20,)
scores = []

for c in params:
    clf = SVC(C=c)
    score = cross_val_score(clf, X_train, y_train, cv=10).mean()
    scores.append(score)
    
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x22f0e908>]

[figure: mean CV accuracy vs C for SVC]

params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.8245372233400403
params_score[max_score]
1

Decision tree

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=17, min_samples_split=17, min_samples_leaf=9, max_leaf_nodes=10, max_features=6)
dt.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=17,
            max_features=6, max_leaf_nodes=10, min_impurity_decrease=0.0,
            min_impurity_split=None, min_samples_leaf=9,
            min_samples_split=17, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')

There are a lot of tunable parameters, and I don't yet fully understand what each one does; the values here were picked more or less at random.

dt.score(X_test, y_test)
0.7988826815642458
params = np.arange(2, len(X_train.columns))
scores = []

for c in params:
    clf = DecisionTreeClassifier(max_depth=17, min_samples_split=17, min_samples_leaf=9,
                                 max_leaf_nodes=10, max_features=c)  # vary max_features with the loop
    score = cross_val_score(clf, X_train, y_train, cv=10).mean()
    scores.append(score)
    
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x231b0208>]

[figure: mean CV accuracy vs max_features for the decision tree]

params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
print(max_score)
params_score[max_score]
0.8316398390342054
5
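Rather than guessing these values by hand, a grid search can tune several of them jointly. A sketch, assuming the grid below is a reasonable starting range:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [5, 9, 13, 17],
              'min_samples_leaf': [1, 5, 9],
              'max_features': [3, 5, 7]}
gs = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10)
gs.fit(X_train, y_train)
print(gs.best_params_, gs.best_score_)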

Random forest

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=117, oob_score=True)
rf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=117, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
rf.oob_score_
0.8148148148148148
params = np.arange(1, 200)
scores = []

for c in params:
    clf = RandomForestClassifier(n_estimators=c, max_depth=17, min_samples_split=17, min_samples_leaf=9, max_leaf_nodes=10,
                            max_features=6, oob_score=True)
    clf.fit(X, y)
    score = clf.oob_score_
    scores.append(score)
    
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x2348f7f0>]

[figure: OOB score vs n_estimators for the random forest]

params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.8260381593714927
params_score[max_score]
193

kNN

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=12, p=2,
           weights='uniform')
knn.score(X_test, y_test)
0.7597765363128491
params = np.arange(1, 20,)
scores = []

for c in params:
    clf = KNeighborsClassifier(n_neighbors=c)
    score = cross_val_score(clf, X_train, y_train, cv=10).mean()
    scores.append(score)
    
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x2487d438>]

[figure: mean CV accuracy vs n_neighbors for kNN]

params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.7964442208808407
params_score[max_score]
6

Ensemble learning

Voting

from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegressionCV()),
    ('svm_clf', SVC(C=6)),
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
    ('dt_clf', DecisionTreeClassifier(max_depth=8, min_samples_split=7, min_samples_leaf=8, max_leaf_nodes=7))
])
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
0.7988826815642458
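The classifier above uses hard (majority) voting. Soft voting averages the predicted probabilities instead, which can help when the members are reasonably calibrated; it requires every estimator to expose predict_proba, hence probability=True on the SVC. A sketch:

soft_clf = VotingClassifier(estimators=[
    ('log_clf', LogisticRegressionCV()),
    ('svm_clf', SVC(C=6, probability=True)),  # probability=True enables predict_proba
    ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
], voting='soft')
soft_clf.fit(X_train, y_train)
print(soft_clf.score(X_test, y_test))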

There are other ensemble methods as well; I'll add them when I have time.

Submission

# svc = SVC(C=2)
# svc.fit(X, y)
clf = RandomForestClassifier(n_estimators=277, max_depth=17, min_samples_split=17, min_samples_leaf=9, max_leaf_nodes=10,
                            max_features=6, oob_score=True)
clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=17, max_features=6, max_leaf_nodes=10,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=9, min_samples_split=17,
            min_weight_fraction_leaf=0.0, n_estimators=277, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False)
# clf = VotingClassifier(estimators=[
#     ('log_clf', LogisticRegressionCV()),
#     ('svm_clf', SVC(C=6)),
#     ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
#     ('dt_clf', DecisionTreeClassifier(max_depth=8, min_samples_split=7, min_samples_leaf=8, max_leaf_nodes=7))
# ])
# clf.fit(X, y)
y_pred = clf.predict(test_data).astype(int)  # Survived became float after the concat; cast predictions back to int
sub = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': y_pred})
sub.to_csv('submit.csv', index=None)
