Exploring Data Visually

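Contents

Titanic

Bar charts

Use 1: viewing the distribution of a field's values
Use 2: checking how a categorical field relates to the class label

Pie charts
Scatter plots
Probability density plots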
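Using Kaggle's introductory competition as an example, this article introduces visualization techniques commonly used to explore the features of a dataset.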

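Titanic

Dataset download: https://www.kaggle.com/c/titanic/data

First, load the data. (Some inputs and outputs from my own exploration have been omitted, which is why the prompt numbers skip, for example from In[5] straight to In[7].)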

In [1]: %matplotlib
Using matplotlib backend: Qt5Agg

In [2]: import numpy as np

In [3]: import pandas as pd

In [4]: import matplotlib.pyplot as plt

In [5]: train=pd.read_csv('train.csv',dtype={'Age':np.float64})

In [7]: test=pd.read_csv('test.csv',dtype={'Age':np.float64})

In [8]: full_data=[train,test]

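Bar charts

Use 1: viewing the distribution of a field's values

#Calling value_counts() on the Survived Series counts how often each value occurs and returns a new Series; plot(kind='bar') then draws it as a bar chart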
In [24]:  train.Survived.value_counts().plot(kind='bar')		
Out[24]: <matplotlib.axes._subplots.AxesSubplot at 0x4cb771bcc0>

In [25]: plt.title('survived (1:survived)')		#set the title
Out[25]: Text(0.5,1,'survived (1:survived)')

In [26]: plt.ylabel('people')		#set the y-axis label
Out[26]: Text(38.3472,0.5,'people')


Similar code shows the distribution of passenger classes (Pclass). Close the window generated earlier first.

In [30]: train.Pclass.value_counts().plot(kind='bar')
Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0x4cb748d320>

In [31]: plt.ylabel('people')
Out[31]: Text(38.3472,0.5,'people')

In [32]: plt.title('cabin_level')
Out[32]: Text(0.5,1,'cabin_level')


Count the passengers who boarded at each port of embarkation:

In [30]: train.Embarked.value_counts().plot(kind='bar')
Out[30]: <matplotlib.axes._subplots.AxesSubplot at 0xcaa5cbecc0>

In [31]: plt.title('embarked')
Out[31]: Text(0.5,1,'embarked')

In [32]: plt.ylabel('people')
Out[32]: Text(38.3472,0.5,'people')
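The bar charts above show absolute counts. When proportions or missing values matter, `value_counts` accepts `normalize` and `dropna` arguments. The following is a minimal sketch; the `embarked` Series is a made-up stand-in for `train.Embarked`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for train.Embarked; the real column comes from train.csv.
embarked = pd.Series(['S', 'C', 'S', 'Q', 'S', 'C', None, 'S'], name='Embarked')

counts = embarked.value_counts()                     # absolute counts, missing values dropped
shares = embarked.value_counts(normalize=True)       # fractions of the non-missing values
with_missing = embarked.value_counts(dropna=False)   # also counts the missing entry

print(counts)
print(shares.round(2))
print(with_missing)
```

Calling `shares.plot(kind='bar')` would draw the same bars as before, with the y axis in fractions instead of head counts.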


Use 2: checking how a categorical field relates to the class label

In [40]: survived_0=train.Pclass[train.Survived==0].value_counts()

In [41]: survived_1=train.Pclass[train.Survived==1].value_counts()

In [42]: df=pd.DataFrame({'survived':survived_1,'not survived':survived_0})

In [44]: df
Out[44]:
   not survived  survived
1            80       136
2            97        87
3           372       119

In [45]: df.plot(kind='bar',stacked=True)
Out[45]: <matplotlib.axes._subplots.AxesSubplot at 0xcaaacab860>

In [46]: plt.title('level-survived')
Out[46]: Text(0.5,1,'level-survived')

In [47]: plt.xlabel('level')
Out[47]: Text(0.5,28.3172,'level')

In [48]: plt.ylabel('people')
Out[48]: Text(38.3472,0.5,'people')

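The figure above shows that the survival rate differs by passenger class. First class has the largest share of survivors (the orange segment): 136 of 216, about 63%, against 87 of 184 (about 47%) in second class and 119 of 491 (about 24%) in third.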

In the same way, we can look at how sex affects the survival rate (sex is also a categorical attribute).

In [49]: sur_0=train.Sex[train.Survived==0].value_counts()

In [50]: sur_1=train.Sex[train.Survived==1].value_counts()

In [51]: df=pd.DataFrame({'survived':sur_1,'not_survived':sur_0})

In [52]: df
Out[52]:
        not_survived  survived
female            81       233
male             468       109

In [53]: df.plot(kind='bar',stacked=True)
Out[53]: <matplotlib.axes._subplots.AxesSubplot at 0xcaa6430fd0>

The survival rate clearly differs by sex: 233 of 314 women (about 74%) survived, against 109 of 577 men (about 19%).
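Stacked counts make it hard to compare groups of different sizes. One option is to divide each row by its total so that every bar sums to 1. This sketch rebuilds the passenger-class table from the counts printed in Out[44] rather than reading train.csv; the variable name `rates` is illustrative.

```python
import pandas as pd

# Counts copied from the Out[44] table (Pclass vs. survival).
df = pd.DataFrame(
    {'not survived': [80, 97, 372], 'survived': [136, 87, 119]},
    index=[1, 2, 3],
)

# Divide each row by its row total, turning counts into per-class rates.
rates = df.div(df.sum(axis=1), axis=0)
print(rates.round(3))
```

`rates.plot(kind='bar', stacked=True)` then draws bars of equal height, so the survived share can be compared directly across classes. With the raw data loaded, `pd.crosstab(train.Pclass, train.Survived, normalize='index')` should produce the same rates in one step.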

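Pie charts

Compared with a bar chart, a pie chart makes proportions easier to read. Below we look at the share of each sex (a categorical field) among the passengers who survived (the class label we care about).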

In [62]: s=train.Sex[train.Survived==1].value_counts()

In [63]: s
Out[63]:
female    233
male      109
Name: Sex, dtype: int64

In [64]: s.plot(kind='pie')
Out[64]: <matplotlib.axes._subplots.AxesSubplot at 0xcaa5b569e8>

In [65]: plt.axis('equal')		#use equal axis scaling so the pie is drawn as a circle
Out[65]:
(-1.110477771437977,
 1.1004989744035067,
 -1.1099133246241948,
 1.106324041274519)
 
In [66]: plt.title('survived probability')
Out[66]: Text(0.5,1,'survived probability')

The figure above shows that women make up a much larger share of the survivors than men: 233 of 342, about 68%.
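To print the percentages on the wedges instead of estimating them by eye, the `autopct` argument of matplotlib's pie function can be passed through pandas' `plot`. This is a sketch under two assumptions: it reuses the counts from Out[63] instead of loading the data, and it selects the non-interactive `Agg` backend so it runs as a plain script.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit this in an interactive session
import matplotlib.pyplot as plt
import pandas as pd

# Survivor counts by sex, copied from Out[63].
s = pd.Series({'female': 233, 'male': 109}, name='Sex')

fig, ax = plt.subplots()
s.plot(kind='pie', ax=ax, autopct='%.1f%%')  # label each wedge with its percentage
ax.set_aspect('equal')                       # keep the pie circular
ax.set_ylabel('')                            # drop the default 'Sex' axis label
ax.set_title('share of survivors by sex')

pct = s / s.sum() * 100                      # the same percentages, as numbers
print(pct.round(1))
```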

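Scatter plots

A scatter plot is used to check the correlation between fields and to spot outliers.

In [34]: plt.scatter(train.Survived,train.Age)		#x axis is the Survived field (the one we want to predict), y axis is the Age field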
Out[34]: <matplotlib.collections.PathCollection at 0x4cb779b978>

In [35]: plt.ylabel('age')
Out[35]: Text(47.0972,0.5,'age')

In [43]: plt.grid(axis='y')		#add horizontal reference lines at the y-axis ticks

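Stretching the figure vertically makes it easier to compare the age distribution of passengers who did not survive (the x=0 column) with that of passengers who survived (the x=1 column).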
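Because Survived takes only the values 0 and 1, every point lands on one of two vertical lines and many points overlap. A common workaround, not used in the original session, is to add a little random horizontal jitter and make the markers semi-transparent. The sketch below uses randomly generated data (`demo`) in place of train.csv, and the jitter width of 0.15 is an arbitrary choice.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit this in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data standing in for train[['Survived', 'Age']].
demo = pd.DataFrame({
    'Survived': rng.integers(0, 2, size=200),
    'Age': rng.uniform(1, 80, size=200),
})

# Shift each point sideways by a small random amount so overlapping points separate.
jitter = rng.uniform(-0.15, 0.15, size=len(demo))

fig, ax = plt.subplots(figsize=(4, 8))       # tall figure, as suggested in the text
ax.scatter(demo.Survived + jitter, demo.Age, alpha=0.4, s=15)
ax.set_xticks([0, 1])                        # only the two real label values
ax.set_xlabel('survived')
ax.set_ylabel('age')
ax.grid(axis='y')
```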

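Probability density plots

#Approach: calling plot(kind='kde') on a Series draws the probability density curve of its values
#Here we first use boolean indexing to select the Age of passengers in each class, then draw one density curve per class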
In [14]: train.Age[train.Pclass==1].plot(kind='kde')
Out[14]: <matplotlib.axes._subplots.AxesSubplot at 0xcaa5cbecc0>

In [15]: train.Age[train.Pclass==2].plot(kind='kde')
Out[15]: <matplotlib.axes._subplots.AxesSubplot at 0xcaa5cbecc0>

In [16]: train.Age[train.Pclass==3].plot(kind='kde')
Out[16]: <matplotlib.axes._subplots.AxesSubplot at 0xcaa5cbecc0>

In [17]: plt.xlabel('age')
Out[17]: Text(0.5,23.1922,'age')

In [18]: plt.ylabel('probability')
Out[18]: Text(25.0972,0.5,'probability')

In [23]: plt.title("probability distribution of age")
Out[23]: Text(0.5,1,'probability distribution of age')

In [27]: plt.legend(('level_1','level_2','level_3'),loc='best')		#set the legend
Out[27]: <matplotlib.legend.Legend at 0xca9a48b588>
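The three near-identical plot calls can also be written as a loop over `groupby`, with a `label` per curve so the legend does not depend on the order in which the curves were drawn. This is a sketch on randomly generated ages (`demo`), not the Titanic data, and like `plot(kind='kde')` in the session it assumes SciPy is installed.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: headless run; omit this in an interactive session
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical data standing in for train[['Pclass', 'Age']]; the means are made up.
demo = pd.DataFrame({
    'Pclass': np.repeat([1, 2, 3], 50),
    'Age': np.concatenate([
        rng.normal(38, 14, 50),
        rng.normal(30, 13, 50),
        rng.normal(25, 12, 50),
    ]),
})
demo.loc[::10, 'Age'] = np.nan  # mimic missing ages; the kde plot drops them

fig, ax = plt.subplots()
# groupby yields one (class, ages) pair per passenger class, in sorted order.
for pclass, ages in demo.groupby('Pclass')['Age']:
    ages.plot(kind='kde', ax=ax, label=f'level_{pclass}')

ax.set_xlabel('age')
ax.set_ylabel('density')
ax.legend(loc='best')
```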
