Preface
The exploratory data analysis and feature engineering in this post draw heavily on a Kaggle kernel: https://www.kaggle.com/ash316/eda-to-prediction-dietanic
It took me a few days to put this article together. It covers the basic workflow of a Kaggle competition: exploratory data analysis, feature engineering, modeling, and parameter tuning.
Import libraries and load the data
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
Data overview
print(train.shape)
train.head()
(891, 12)
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train.tail()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
print(test.shape)
test.head()
(418, 11)
| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
train.describe()  # summary statistics of the numeric features
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
train.describe(include='O')  # summary of the string (object) features
| | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Backstrom, Mrs. Karl Alfred (Maria Mathilda Gu... | male | 1601 | C23 C25 C27 | S |
| freq | 1 | 577 | 7 | 4 | 644 |
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
Meaning of each feature:
- Survived: 1 = survived, 0 = died
- Pclass: ticket class; 1 is the highest, 3 the lowest
- Sex: gender
- Age: age in years
- SibSp: number of siblings and spouses aboard
- Parch: number of parents and children aboard
- Ticket: ticket number
- Fare: ticket price
- Cabin: cabin number
- Embarked: port of embarkation
Exploratory data analysis
The two calls below show which features are numeric and which are strings/categorical:
train.select_dtypes(include='number').columns
Index(['PassengerId', 'Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare'], dtype='object')
train.select_dtypes(include=['object']).columns
Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')
Survived: the target variable
train.Survived.value_counts(normalize=True)
0 0.616162
1 0.383838
Name: Survived, dtype: float64
train.Survived.value_counts()
0 549
1 342
Name: Survived, dtype: int64
sns.countplot('Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bd52c88>
Looking at counts alone, survivors are the minority: 342 of 891 passengers, about 38%.
Sex: an unordered categorical variable
train.groupby(['Sex','Survived'])['Survived'].count()
Sex Survived
female 0 81
1 233
male 0 468
1 109
Name: Survived, dtype: int64
pd.crosstab(train.Sex, train.Survived, margins=True)
| Sex | Survived=0 | Survived=1 | All |
|---|---|---|---|
| female | 81 | 233 | 314 |
| male | 468 | 109 | 577 |
| All | 549 | 342 | 891 |
The crosstab shows that women were far more likely to survive; a plot makes it even clearer:
sns.countplot('Sex', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bdcb1d0>
Embarked: an unordered categorical variable
train[['Embarked', 'Survived']].groupby('Embarked').sum()
| Embarked | Survived |
|---|---|
| C | 93 |
| Q | 30 |
| S | 217 |
pd.crosstab(train.Embarked, train.Survived, margins=True)
| Embarked | Survived=0 | Survived=1 | All |
|---|---|---|---|
| C | 75 | 93 | 168 |
| Q | 47 | 30 | 77 |
| S | 427 | 217 | 644 |
| All | 549 | 340 | 889 |
sns.countplot('Embarked', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1be2e898>
Port S contributed the most survivors in absolute terms, but the highest survival rate belongs to passengers who boarded at port C.
We saw earlier that women were much more likely to survive; let's factor that in here:
sns.factorplot('Pclass', 'Survived',hue='Sex', col='Embarked', data=train)
<seaborn.axisgrid.FacetGrid at 0x1bd3de10>
Pclass: an ordered categorical variable
train[['Pclass', 'Survived']].groupby('Pclass').sum()  # Survived is 0/1, so the sum is the survivor count per class
| Pclass | Survived |
|---|---|
| 1 | 136 |
| 2 | 87 |
| 3 | 119 |
pd.crosstab(train.Pclass, train.Survived, margins=True)
| Pclass | Survived=0 | Survived=1 | All |
|---|---|---|---|
| 1 | 80 | 136 | 216 |
| 2 | 97 | 87 | 184 |
| 3 | 372 | 119 | 491 |
| All | 549 | 342 | 891 |
sns.countplot('Pclass', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bf66d68>
Class 3 carried by far the most passengers, yet its total survivors still fall short of class 1's. It pays to be rich...
Sex is a very important feature, so let's look at it again:
pd.crosstab([train.Sex, train.Survived],
train.Pclass, margins=True)
| Sex | Survived | Pclass=1 | Pclass=2 | Pclass=3 | All |
|---|---|---|---|---|---|
| female | 0 | 3 | 6 | 72 | 81 |
| female | 1 | 91 | 70 | 72 | 233 |
| male | 0 | 77 | 91 | 300 | 468 |
| male | 1 | 45 | 17 | 47 | 109 |
| All | | 216 | 184 | 491 | 891 |
sns.factorplot('Pclass', 'Survived', hue='Sex', data=train)
<seaborn.axisgrid.FacetGrid at 0x1bfad9e8>
Women clearly got priority, and money amplified it: the survival rate of women in class 1 is nearly 1 (91 of 94)...
Age: a continuous variable
print('min:', train.Age.min())
print('max:', train.Age.max())
print('mean:', train.Age.mean())
print('median:', train.Age.median())
print('kurt:', train.Age.kurt())
print('skew:', train.Age.skew())
min: 0.42
max: 80.0
mean: 29.69911764705882
median: 28.0
kurt: 0.17827415364210353
skew: 0.38910778230082704
Age still contains missing values, so sns.distplot() cannot draw its distribution directly.
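If we just want a quick look at the distribution's shape before imputation, one workaround (a minimal sketch) is to drop the NaNs first:
sns.distplot(train.Age.dropna(), bins=30)  # drop missing ages just for plotting
plt.show()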
sns.violinplot( 'Pclass', 'Age', hue='Survived', data=train, split=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1c0067b8>
sns.violinplot( 'Sex', 'Age', hue='Survived', data=train, split=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1c04af98>
The violin plots show the age distribution within each class and sex: older passengers tend to sit in the higher classes, and the 20-40 age group had somewhat better survival odds.
Fare: a continuous variable
print('min:', train.Fare.min())
print('max:', train.Fare.max())
print('mean:', train.Fare.mean())
print('median:', train.Fare.median())
print('kurt:', train.Fare.kurt())
print('skew:', train.Fare.skew())
min: 0.0
max: 512.3292
mean: 32.204207968574636
median: 14.4542
kurt: 33.39814088089868
skew: 4.787316519674893
The distribution is heavily right-skewed (skew ≈ 4.79): a handful of very expensive tickets stretch the tail.
from scipy.stats import norm
plt.figure(figsize=(10, 10))
sns.distplot(train.Fare, kde=True, bins=30, fit=norm)
<matplotlib.axes._subplots.AxesSubplot at 0x1bffa358>
It is clearly far from normal; data that is roughly normally distributed is generally considered easier for models to work with.
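A common remedy, not applied in this pipeline but worth sketching, is a log transform to compress the right tail:
fare_log = np.log1p(train.Fare)  # log(1 + x) is safe for the zero fares
print('skew before:', train.Fare.skew())
print('skew after:', fare_log.skew())
sns.distplot(fare_log, bins=30, fit=norm)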
sns.factorplot('Pclass', 'Fare', hue='Survived', data=train)
<seaborn.axisgrid.FacetGrid at 0x1c0f1ef0>
Parch, SibSp: discrete count variables
pd.crosstab(train.Parch, train.Survived, margins=True)
| Parch | Survived=0 | Survived=1 | All |
|---|---|---|---|
| 0 | 445 | 233 | 678 |
| 1 | 53 | 65 | 118 |
| 2 | 40 | 40 | 80 |
| 3 | 2 | 3 | 5 |
| 4 | 4 | 0 | 4 |
| 5 | 4 | 1 | 5 |
| 6 | 1 | 0 | 1 |
| All | 549 | 342 | 891 |
pd.crosstab(train.SibSp, train.Survived, margins=True)
| SibSp | Survived=0 | Survived=1 | All |
|---|---|---|---|
| 0 | 398 | 210 | 608 |
| 1 | 97 | 112 | 209 |
| 2 | 15 | 13 | 28 |
| 3 | 12 | 4 | 16 |
| 4 | 15 | 3 | 18 |
| 5 | 5 | 0 | 5 |
| 8 | 7 | 0 | 7 |
| All | 549 | 342 | 891 |
sns.barplot('SibSp', 'Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1bfae630>
sns.factorplot('SibSp', 'Survived', data=train)
<seaborn.axisgrid.FacetGrid at 0x1c08ca90>
sns.countplot('Parch', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1c1a4400>
sns.countplot('SibSp', hue='Survived', data=train)
<matplotlib.axes._subplots.AxesSubplot at 0x1c1a0e10>
Passengers with no siblings or spouses aboard account for the most survivors in absolute numbers, but their survival rate is actually lower than that of passengers with 1 or 2 such relatives...
Correlation between features
f = plt.figure(figsize=(10, 8))
sns.heatmap(train.corr(), annot=True, cmap='RdYlGn', linewidth=0.2)
<matplotlib.axes._subplots.AxesSubplot at 0x1c241f60>
Feature engineering and data cleaning
At this step, the data from train.csv and test.csv have to be processed together; otherwise the transformations won't match and the trained model can't be used to predict on test.csv.
train.shape, test.shape
((891, 12), (418, 11))
all_data = pd.concat([train, test])
print(all_data.shape)
all_data.head(2)
(1309, 12)
| | Age | Cabin | Embarked | Fare | Name | Parch | PassengerId | Pclass | Sex | SibSp | Survived | Ticket |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 22.0 | NaN | S | 7.2500 | Braund, Mr. Owen Harris | 0 | 1 | 3 | male | 1 | 0.0 | A/5 21171 |
| 1 | 38.0 | C85 | C | 71.2833 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 2 | 1 | female | 1 | 1.0 | PC 17599 |
len(all_data[all_data.Survived.isnull()])
418
len(all_data[all_data.Survived.notnull()])
891
After the joint processing, the two sets can be separated again simply by using the Survived feature: it is NaN exactly for the test rows.
Handling missing values
all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age 1046 non-null float64
Cabin 295 non-null object
Embarked 1307 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
all_data.describe(include='O')
| | Cabin | Embarked | Name | Sex | Ticket |
|---|---|---|---|---|---|
| count | 295 | 1307 | 1309 | 1309 | 1309 |
| unique | 186 | 3 | 1307 | 2 | 929 |
| top | C23 C25 C27 | S | Connolly, Miss. Kate | male | CA. 2343 |
| freq | 6 | 914 | 2 | 843 | 11 |
all_data.Age.fillna(all_data.Age.median(), inplace=True)  # fill with the median age
all_data.Embarked.fillna('S', inplace=True)  # 'S' is by far the most frequent port
all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
Age 1309 non-null float64
Cabin 295 non-null object
Embarked 1309 non-null object
Fare 1308 non-null float64
Name 1309 non-null object
Parch 1309 non-null int64
PassengerId 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null object
SibSp 1309 non-null int64
Survived 891 non-null float64
Ticket 1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 132.9+ KB
Cabin has far too many missing values to be worth imputing, so we will simply drop it.
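One more detail worth handling here (my addition, not from the referenced kernel): the info() output above shows Fare still has one missing value (1308 non-null). Filling it, say with the median, keeps the Fare binning later on from silently assigning that row the default category:
all_data.Fare.fillna(all_data.Fare.median(), inplace=True)  # suggested: fill the single missing Fare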
Age
We bin the continuous Age feature into segments, creating a new feature for the band.
According to the referenced kernel, continuous features are said to cause problems in machine-learning models and should be discretized. I'm not sure this claim is correct; with my current knowledge I remain skeptical.
Age ranges from 0 to 80, so splitting it into 5 bands gives a width of 16 per band.
all_data['Age_band'] = 0
all_data.loc[all_data.Age <= 16, 'Age_band'] = 0
all_data.loc[(all_data.Age > 16) & (all_data.Age <= 32), 'Age_band'] = 1
all_data.loc[(all_data.Age > 32) & (all_data.Age <= 48), 'Age_band'] = 2
all_data.loc[(all_data.Age > 48) & (all_data.Age <= 64), 'Age_band'] = 3
all_data.loc[all_data.Age > 64, 'Age_band'] = 4
all_data.Age_band.value_counts().to_frame()
| | Age_band |
|---|---|
| 1 | 787 |
| 2 | 269 |
| 0 | 134 |
| 3 | 106 |
| 4 | 13 |
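The same banding can be written in one line with pd.cut; a quick sketch to check that it matches the manual version:
age_band = pd.cut(all_data.Age, bins=[0, 16, 32, 48, 64, 80], labels=False)  # integer codes 0-4
(age_band == all_data.Age_band).all()  # True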
With the new feature in place, we can also plot it for analysis:
sns.factorplot('Age_band', 'Survived', col='Pclass', data=all_data)
<seaborn.axisgrid.FacetGrid at 0x1c2f7160>
The pattern holds regardless of Pclass: survival rate falls as the age band rises.
Family
all_data['Family_size'] = all_data.Parch + all_data.SibSp
all_data['Alone'] = 0
all_data.loc[all_data.Family_size == 0, 'Alone'] = 1
f, ax = plt.subplots(1, 2, figsize=(10, 6))
sns.factorplot('Family_size', 'Survived', data=all_data, ax=ax[0])
sns.factorplot('Alone', 'Survived', data=all_data, ax=ax[1])
plt.close(2)
plt.close(3)
Traveling alone comes with a low survival probability, and families larger than 4 also show reduced survival.
Fare
Fare is also continuous, so we discretize it too, this time with pd.qcut, which bins on sample quantiles:
all_data['Fare_range'] = pd.qcut(all_data.Fare, 4)
all_data.groupby('Fare_range')['Survived'].mean().to_frame()
| Fare_range | Survived |
|---|---|
| (-0.001, 7.896] | 0.197309 |
| (7.896, 14.454] | 0.303571 |
| (14.454, 31.275] | 0.441048 |
| (31.275, 512.329] | 0.600000 |
type(all_data.Fare_range[0])  # the concatenated index repeats (0 appears in both train and test rows), so [0] returns a Series
pandas.core.series.Series
The more a passenger paid, the better the odds of survival...
Fare_range currently holds interval objects over the raw values; as with Age, we should map them to single ordinal values (the thresholds below follow the referenced kernel, which computed quartiles on the training set only):
all_data['Fare_cat'] = 0
all_data.loc[all_data['Fare'] <= 7.91, 'Fare_cat'] = 0
all_data.loc[(all_data['Fare'] > 7.91) & (all_data['Fare'] <= 14.454), 'Fare_cat'] = 1
all_data.loc[(all_data['Fare'] > 14.454) & (all_data['Fare'] <= 31), 'Fare_cat'] = 2
all_data.loc[(all_data['Fare'] > 31) & (all_data['Fare'] <= 513), 'Fare_cat'] = 3
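Equivalently, pd.qcut can emit integer codes directly (a sketch). Note that it would use the combined-data quartiles (7.896, 14.454, 31.275), so a few boundary fares land in different bins than with the kernel's thresholds above:
fare_cat_q = pd.qcut(all_data.Fare, 4, labels=False)  # quartile codes 0-3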
sns.factorplot('Fare_cat', 'Survived', hue='Sex', data=all_data)
<seaborn.axisgrid.FacetGrid at 0x1c222f28>
Convert strings to numbers
all_data['Sex'].replace(['male', 'female'], [0,1], inplace=True)
all_data['Embarked'].replace(['S', 'C', 'Q'], [0, 1, 2], inplace=True)
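Mapping Embarked to 0/1/2 imposes an arbitrary order on a nominal variable; one-hot encoding is a common alternative (a sketch only, not applied here; it would replace the replace() call rather than follow it):
embarked_dummies = pd.get_dummies(all_data.Embarked, prefix='Embarked')  # one indicator column per port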
Drop the unneeded features
plt.figure(figsize=(10, 8))
all_data.drop(['Name', 'Age', 'Ticket', 'Fare', 'Cabin', 'Fare_range',
'PassengerId'], axis=1, inplace=True)
sns.heatmap(all_data.corr(), annot=True, cmap='RdYlGn', linewidth=0.2)
<matplotlib.axes._subplots.AxesSubplot at 0x1c794ba8>
all_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 10 columns):
Embarked 1309 non-null int64
Parch 1309 non-null int64
Pclass 1309 non-null int64
Sex 1309 non-null int64
SibSp 1309 non-null int64
Survived 891 non-null float64
Age_band 1309 non-null int64
Family_size 1309 non-null int64
Alone 1309 non-null int64
Fare_cat 1309 non-null int64
dtypes: float64(1), int64(9)
memory usage: 152.5 KB
Outlier detection
To be added later.
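In the meantime, a common quick check is the 1.5 × IQR rule; a minimal sketch on Fare (flagging only, nothing is removed):
q1, q3 = train.Fare.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = train[(train.Fare < q1 - 1.5 * iqr) | (train.Fare > q3 + 1.5 * iqr)]
print(len(outliers), 'potential Fare outliers')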
Building models
Prepare the data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
train_data = all_data[all_data.Survived.notnull()]
train_data.shape
(891, 10)
test_data = all_data[all_data.Survived.isnull()].drop('Survived', axis=1)  # drop in one step to avoid a SettingWithCopyWarning
test_data.shape
(418, 9)
train_data.head(1)
| | Embarked | Parch | Pclass | Sex | SibSp | Survived | Age_band | Family_size | Alone | Fare_cat |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 3 | 0 | 1 | 0.0 | 1 | 1 | 0 | 0 |
X = train_data.drop('Survived', axis=1)
y = train_data['Survived']
X.shape, y.shape
((891, 9), (891,))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((712, 9), (179, 9), (712,), (179,))
Logistic regression
lg = LogisticRegression(C=14)
lg.fit(X_train, y_train)
LogisticRegression(C=14, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
lg.score(X_test, y_test)
0.7877094972067039
params = np.arange(1, 20)
scores = []
for c in params:
lg = LogisticRegression(C=c)
score = cross_val_score(lg, X_train, y_train, cv=10).mean()
scores.append(score)
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x22ec81d0>]
params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.7993600491839927
params_score[max_score]
6
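One caveat: keying the dict by score silently drops ties between equal scores. np.argmax is a safer lookup (a sketch):
best_c = params[int(np.argmax(scores))]  # index of the best mean CV score
best_c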
Self-tuning logistic regression
from sklearn.linear_model import LogisticRegressionCV
log_cv = LogisticRegressionCV()
log_cv.fit(X_train, y_train)
LogisticRegressionCV(Cs=10, class_weight=None, cv=None, dual=False,
fit_intercept=True, intercept_scaling=1.0, max_iter=100,
multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
refit=True, scoring=None, solver='lbfgs', tol=0.0001, verbose=0)
log_cv.score(X_test, y_test)
0.776536312849162
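The regularization strength chosen by the internal cross-validation is stored on the fitted estimator:
log_cv.C_  # array holding the selected C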
SVM
clf = SVC(C=1)
clf.fit(X_train, y_train)
SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)
clf.score(X_test, y_test)
0.7988826815642458
params = np.arange(1, 20)
scores = []
for c in params:
clf = SVC(C=c)
score = cross_val_score(clf, X_train, y_train, cv=10).mean()
scores.append(score)
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x22f0e908>]
params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.8245372233400403
params_score[max_score]
1
Decision tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier(max_depth=17, min_samples_split=17, min_samples_leaf=9, max_leaf_nodes=10, max_features=6)
dt.fit(X_train, y_train)
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=17,
max_features=6, max_leaf_nodes=10, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=9,
min_samples_split=17, min_weight_fraction_leaf=0.0,
presort=False, random_state=None, splitter='best')
There are a lot of tunable parameters and I don't yet fully understand what each one does; these values were picked more or less at random.
dt.score(X_test, y_test)
0.7988826815642458
params = np.arange(2, len(X_train.columns))
scores = []
for c in params:
    clf = DecisionTreeClassifier(max_depth=17, min_samples_split=17, min_samples_leaf=9, max_leaf_nodes=10, max_features=c)  # sweep max_features over params
score = cross_val_score(clf, X_train, y_train, cv=10).mean()
scores.append(score)
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x231b0208>]
params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
print(max_score)
params_score[max_score]
0.8316398390342054
5
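Rather than hand-picking values, a grid search cross-validates every combination; a sketch with a small, purely illustrative grid:
from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 8, 17],
              'min_samples_leaf': [1, 5, 9],
              'max_leaf_nodes': [10, 20, None]}  # illustrative values, not tuned
grid = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)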
Random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=117, oob_score=True)
rf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=117, n_jobs=1,
oob_score=True, random_state=None, verbose=0, warm_start=False)
rf.oob_score_
0.8148148148148148
params = np.arange(1, 200)
scores = []
for c in params:
clf = RandomForestClassifier(n_estimators=c, max_depth=17, min_samples_split=17, min_samples_leaf=9, max_leaf_nodes=10,
max_features=6, oob_score=True)
clf.fit(X, y)
score = clf.oob_score_
scores.append(score)
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x2348f7f0>]
params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.8260381593714927
params_score[max_score]
193
KNN
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=12)
knn.fit(X_train, y_train)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=12, p=2,
weights='uniform')
knn.score(X_test, y_test)
0.7597765363128491
params = np.arange(1, 20)
scores = []
for c in params:
clf = KNeighborsClassifier(n_neighbors=c)
score = cross_val_score(clf, X_train, y_train, cv=10).mean()
scores.append(score)
plt.plot(params, scores)
[<matplotlib.lines.Line2D at 0x2487d438>]
params_score = dict(zip(scores, params))
max_score = np.array(scores).max()
max_score
0.7964442208808407
params_score[max_score]
6
Ensemble learning
Voting
from sklearn.ensemble import VotingClassifier
voting_clf = VotingClassifier(estimators=[
('log_clf', LogisticRegressionCV()),
('svm_clf', SVC(C=6)),
('knn_clf', KNeighborsClassifier(n_neighbors=7)),
('dt_clf', DecisionTreeClassifier(max_depth=8, min_samples_split=7, min_samples_leaf=8, max_leaf_nodes=7))
])
voting_clf.fit(X_train, y_train)
voting_clf.score(X_test, y_test)
0.7988826815642458
There are other ensemble methods; I'll add them when I have time.
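For instance, a minimal bagging sketch (hyperparameters illustrative, not tuned):
from sklearn.ensemble import BaggingClassifier

bag_clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                            max_samples=0.8, oob_score=True)  # bootstrap-aggregated trees
bag_clf.fit(X_train, y_train)
bag_clf.score(X_test, y_test)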
Submission
# svc = SVC(C=2)
# svc.fit(X, y)
clf = RandomForestClassifier(n_estimators=277, max_depth=17, min_samples_split=17, min_samples_leaf=9, max_leaf_nodes=10,
max_features=6, oob_score=True)
clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=17, max_features=6, max_leaf_nodes=10,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=9, min_samples_split=17,
min_weight_fraction_leaf=0.0, n_estimators=277, n_jobs=1,
oob_score=True, random_state=None, verbose=0, warm_start=False)
# clf = VotingClassifier(estimators=[
# ('log_clf', LogisticRegressionCV()),
# ('svm_clf', SVC(C=6)),
# ('knn_clf', KNeighborsClassifier(n_neighbors=7)),
# ('dt_clf', DecisionTreeClassifier(max_depth=8, min_samples_split=7, min_samples_leaf=8, max_leaf_nodes=7))
# ])
# clf.fit(X, y)
y_pred = clf.predict(test_data)
sub = pd.DataFrame({'PassengerId': test['PassengerId'], 'Survived': y_pred.astype(int)})  # cast predictions back to int labels
sub.to_csv('submit.csv', index=False)