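pandas in Practice

Reference: (Qin Lu) https://mp.weixin.qq.com/s/RcrQmqty1FHEDbQfxv2XTQ (the practice data set can be obtained from there)

This walkthrough covers reading data, getting an overview of it, cleaning and reshaping it, and then analysis and visualization.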

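I. Reading data

In pandas, the most commonly used loading functions are read_csv, read_excel, and read_table; read_table can also read plain txt files. In server-side deployments you will also use read_sql to query a database directly, but it needs a database driver package installed (a MySQL connector, for example).

1. read_csv: common parameters

encoding is one of the most frequently used parameters. It specifies the character encoding of the CSV file; gb2312 is used here.

The sep parameter is the delimiter. Some CSV files separate columns with commas, some with semicolons, and some with \t, so it has to be set to match the file.

The header parameter controls which row supplies the column names; by default the first row is used.

The names parameter sets your own column names. For example, if the header in the CSV is in Chinese, it is usually more convenient to switch to English names inside pandas.

import pandas
df = pandas.read_csv('data.csv', encoding='gb2312')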
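To make the sep, header, and names parameters concrete, here is a minimal sketch. The semicolon-separated text and the English column names are made up for illustration and are not part of the practice data set; io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data with a Chinese header row.
raw = "城市;薪资\n上海;7k-9k\n北京;10k-15k\n"

df_demo = pd.read_csv(
    io.StringIO(raw),          # stands in for a file path
    sep=";",                   # columns are separated by semicolons
    header=0,                  # row 0 is the header row in the file...
    names=["city", "salary"],  # ...and is replaced by these names
)
print(df_demo)
```

Passing header=0 together with names tells pandas that the file has its own header row, which should be discarded in favor of the supplied names. Without header=0, the original header row would be read as a data row.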

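2. Viewing part of the data

The data set is fairly large, so often we only want to look at a small part of it.

head(): shows the first rows, 5 by default; pass a number to change that.

tail(): shows the last rows.

df.head(5)
df.tail(5)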

3. Displaying data

options and settings

When there is too much data, pandas truncates the output with "..." by default. The following sets the maximum number of displayed rows and columns to 10 each:

pandas.set_option('display.max_rows', 10)
pandas.set_option('display.max_columns', 10)

pandas.DataFrame.to_string

Common parameters:

columns : (sequence, optional) the subset of columns to write; default None writes all columns
col_space : (int, optional) the minimum width of each column
index : (bool, optional) whether to print index (row) labels, default True
line_width : (int, optional) width to wrap a line in characters, default no wrap

Show two columns (city and companyFullName) of the first five rows:

df.head(5).to_string(columns=['city', 'companyFullName'])
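The display options and to_string can be tried on a small frame. This is a sketch on made-up rows (the company names are placeholders); it uses pandas.option_context, which applies display options only inside the with block, as an alternative to changing them globally with set_option.

```python
import pandas as pd

# Made-up rows; only the column names mirror the practice data set.
df_demo = pd.DataFrame({
    "city": ["上海", "北京"],
    "companyFullName": ["A Ltd", "B Ltd"],
    "salary": ["7k-9k", "10k-15k"],
})

# Render only two columns and hide the row index.
text = df_demo.to_string(columns=["city", "companyFullName"], index=False)
print(text)

# Temporarily limit the displayed rows and columns.
with pd.option_context("display.max_rows", 10, "display.max_columns", 10):
    print(df_demo)
```

to_string returns an ordinary Python string, so it can be printed, logged, or written to a file.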

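II. Data cleaning and preparation

1. Checking for duplicate rows

Background: positionId is the job posting ID and should be unique, so the data needs to be deduplicated on it.

First check whether there are duplicates. The total number of rows is known, and unique() returns the distinct values. If the total length differs from the number of distinct values, the data contains duplicates.

import pandas
df = pandas.read_csv('data.csv', encoding='gb2312')
print(len(df))                      # total number of rows
print(len(df.positionId.unique()))  # number of distinct positionId values

Use drop_duplicates to remove the duplicate rows.

df_duplicates = df.drop_duplicates(subset='positionId', keep='first')

drop_duplicates returns a new DataFrame rather than modifying df, so the result is assigned to df_duplicates, the name used in the rest of this article. The subset parameter chooses which column is used to detect duplicates. The keep parameter decides which copy survives: first keeps the first occurrence and drops the later duplicates, while last drops the earlier ones and keeps the last.

duplicated works in a similar way, but it returns boolean values marking the duplicate rows instead of removing them.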
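The difference between keep='first', keep='last', and duplicated is easiest to see on a few rows. A minimal sketch on made-up data, where positionId 101 appears twice:

```python
import pandas as pd

# Made-up rows; positionId 101 is duplicated on purpose.
df_demo = pd.DataFrame({
    "positionId": [101, 102, 101, 103],
    "salary": ["7k-9k", "10k-15k", "8k-10k", "15k以上"],
})

# Duplicates exist when the row count exceeds the number of distinct IDs.
has_duplicates = len(df_demo) != df_demo["positionId"].nunique()

# duplicated() only flags rows; it does not remove anything.
mask = df_demo.duplicated(subset="positionId", keep="first")

# drop_duplicates() returns a new frame with the flagged rows removed.
first_kept = df_demo.drop_duplicates(subset="positionId", keep="first")
last_kept = df_demo.drop_duplicates(subset="positionId", keep="last")

print(has_duplicates)
print(mask.tolist())
print(first_kept)
print(last_kept)
```

With keep='first' the row with salary "7k-9k" survives for ID 101; with keep='last' the row with "8k-10k" survives instead.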

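2. Next, process the salary field. The goal is to compute the lower and upper bound of each salary range.

The salary strings follow no strict pattern: some use a lowercase k, some an uppercase K, and some use the awkward form "k以上" ("k and above"). For "k以上" the best we can do is treat the lower and upper bounds as equal.

This calls for apply in pandas. It can operate on a DataFrame row by row or column by column, and it accepts a user-defined function.

def cut_word(word):
    # '7k-9k' -> '7': slice up to the k that sits just before '-'
    position = word.find('-')
    bottomSalary = word[:position-1]
    return bottomSalary
df_duplicates['bottomSalary'] = df_duplicates.salary.apply(cut_word)

We defined a function cut_word. It finds the position of the "-" character and slices from the start of the string up to the k, which gives the number we want as the lower bound. apply runs cut_word on every row of the salary column.

What about dirty values such as "k以上"? find returns -1, so the original slice becomes word[:-2], which is not what we want. An if check is needed.

Because Python is case sensitive, we use upper to turn every k into K and then slice at K. Slicing at "以上" is not recommended, because some dirty rows do not contain those two characters.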

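def cut_word(word):
    position = word.find('-')
    if position != -1:
        bottomSalary = word[:position-1]
    else:
        # No '-' (e.g. '15k以上'): slice at the first K, ignoring case
        bottomSalary = word[:word.upper().find('K')]
    # Return for both branches, not only for the else branch
    return bottomSalary
df_duplicates['bottomSalary'] = df_duplicates.salary.apply(cut_word)

Convert bottomSalary to numbers. If the conversion succeeds, every salary figure was extracted cleanly.

df_duplicates.bottomSalary = df_duplicates.bottomSalary.astype(int)

The upper bound topSalary follows the same idea, except that it slices the second half of the string. The cut_word function gets a new parameter that decides whether to return the bottom or the top value.

With apply, extra arguments for the function are passed to apply itself, after the function name, not inside a call to the function. This is easy to get wrong.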

def cut_word(word, method):
    position = word.find('-')
    length = len(word)
    if position != -1:
        # 'Ak-Bk': lower bound before the first k, upper bound between '-' and the last k
        bottomSalary = word[:position-1]
        topSalary = word[position+1:length-1]
    else:
        # No '-' (e.g. '15k以上'): both bounds are the number before K
        bottomSalary = word[:word.upper().find('K')]
        topSalary = bottomSalary
    if method == 'bottom':
        return bottomSalary
    else:
        return topSalary
df_duplicates['topSalary'] = df_duplicates.salary.apply(cut_word, method='top')
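To check the slicing logic without the real data set, the function can be run on a few representative strings. This sketch repeats cut_word from the text and applies it to a made-up Series that covers the three cases mentioned: lowercase k, uppercase K, and "k以上".

```python
import pandas as pd

def cut_word(word, method):
    # Same logic as in the text: slice around '-' when present,
    # otherwise cut at the first K (case-insensitive).
    position = word.find('-')
    length = len(word)
    if position != -1:
        bottomSalary = word[:position-1]
        topSalary = word[position+1:length-1]
    else:
        bottomSalary = word[:word.upper().find('K')]
        topSalary = bottomSalary
    if method == 'bottom':
        return bottomSalary
    else:
        return topSalary

# Made-up sample values, one per pattern.
salary = pd.Series(["7k-9k", "10K-15K", "15k以上"])

# Extra arguments go to apply itself and are forwarded to cut_word.
bottom = salary.apply(cut_word, method='bottom').astype(int)
top = salary.apply(cut_word, method='top').astype(int)

print(bottom.tolist())
print(top.tolist())
```

The same argument can also be passed positionally with salary.apply(cut_word, args=('bottom',)). Note that the slicing assumes each range ends with a single k or K; a string with trailing text after the upper bound would need extra handling.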

Next, compute the average salary.

Convert bottomSalary and topSalary to a numeric type and add an avgSalary column to the data set.

df_duplicates.topSalary = df_duplicates.topSalary.astype(int)
df_duplicates.bottomSalary = df_duplicates.bottomSalary.astype(int)
df_duplicates['avgSalary'] = df_duplicates.apply(lambda x: (x.bottomSalary + x.topSalary) / 2, axis=1)

That completes the data cleaning. Now select the columns we want for the analysis that follows.

# .copy() gives an independent frame, so columns can be added to df_clean later
df_clean = df_duplicates[['city', 'companyShortName', 'companySize', 'education', 'positionName', 'positionLables', 'workYear', 'avgSalary']].copy()

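3. Descriptive statistics

value_counts counts how many times each distinct value occurs (missing values are skipped by default) and returns a Series sorted in descending order. The data shows that Beijing is far ahead in the number of data analyst openings.

df_clean.city.value_counts()

describe(): quickly produces a set of summary statistics

df_clean.describe()

The mean data analyst salary is 17k and the median is 15k, so the two are close. The maximum is 75k, which is probably data scientist or analytics director level. The standard deviation is 8.99k, so there is noticeable spread; most analyst salaries fall within roughly 17 ± 9k.

As a rule, use value_counts for categorical data and describe for numeric data. These are the two most commonly used summary functions.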
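The two summary functions can be compared side by side on a small frame. A sketch on made-up numbers, including one missing city to show that value_counts skips missing values by default:

```python
import pandas as pd

# Made-up data; one city is missing on purpose.
df_demo = pd.DataFrame({
    "city": ["北京", "上海", "北京", None, "北京", "上海"],
    "avgSalary": [10.0, 20.0, 30.0, 15.0, 25.0, 20.0],
})

# Categorical column: frequency of each distinct value, most frequent first.
counts = df_demo["city"].value_counts()

# Numeric column: count, mean, std, min, quartiles, max.
stats = df_demo["avgSalary"].describe()

print(counts)
print(stats)
```

Passing dropna=False to value_counts would add a separate entry for the missing values.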

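III. Data analysis

1. Count the entries in every column for each city. Because there are no NaN values, every column gives the same number, so here the result is equivalent to value_counts.

df_clean.groupby('city').count()

groupby: groups the rows by city. Called on its own, df_clean.groupby('city') does not return the grouped results; it only displays a GroupBy object with its memory address. At that point it is just an object, and nothing is computed until an aggregation such as count() is applied.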
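The lazy behavior of groupby can be seen directly. A minimal sketch on made-up data without missing values, which also checks that the per-city count matches value_counts in that case:

```python
import pandas as pd

# Made-up data without any NaN values.
df_demo = pd.DataFrame({
    "city": ["北京", "上海", "北京", "北京", "上海"],
    "education": ["本科", "本科", "硕士", "本科", "硕士"],
    "avgSalary": [10.0, 20.0, 30.0, 25.0, 20.0],
})

# No computation happens here; this is only a grouping object.
grouped = df_demo.groupby("city")
print(type(grouped).__name__)
print(grouped.ngroups)

# The aggregation triggers the actual work.
per_city = grouped.count()
print(per_city)

# Without NaN values, any column of count() matches value_counts().
same = (
    per_city["avgSalary"].sort_index().values
    == df_demo["city"].value_counts().sort_index().values
).all()
print(same)
```

If a column contained NaN values, count() for that column would be lower than the value_counts() figure for the same city, because count() only counts non-missing entries.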

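2. Compute the average salary per city. mean only applies to numeric data, and avgSalary is the only numeric column, so it is the only column in the result. Recent pandas versions raise an error instead of silently dropping the non-numeric columns, so numeric_only=True is passed explicitly.

df_clean.groupby('city').mean(numeric_only=True)

groupby also accepts a list of columns, in which case the result has a hierarchical index. The following computes the average salary grouped by city and education; unstack then turns the education level of the index into columns.

df_clean.groupby(['city', 'education']).mean(numeric_only=True).unstack()

3. Aggregate only avgSalary, so that other columns are not mixed into the result.

df_clean.groupby(['city', 'education']).avgSalary.mean().unstack()

4. Compute the count and the mean for each company. This uses agg, which accepts user-defined functions in addition to the built-in ones.

df_clean.groupby('companyShortName').avgSalary.agg(['count', 'mean']).sort_values(by='count', ascending=False)
df_clean.groupby('companyShortName').avgSalary.agg(lambda x: max(x) - min(x))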
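Built-in and user-defined aggregations can be combined in a single agg call. A sketch on made-up data; salary_range is a hypothetical helper name introduced here, equivalent to the lambda in the text but named so that the result column gets a readable label.

```python
import pandas as pd

# Made-up data: company A has three postings, company B has two.
df_demo = pd.DataFrame({
    "companyShortName": ["A", "A", "A", "B", "B"],
    "avgSalary": [10.0, 20.0, 30.0, 15.0, 15.0],
})

def salary_range(values):
    # Spread between the highest and lowest salary within one group.
    return values.max() - values.min()

summary = (
    df_demo.groupby("companyShortName")["avgSalary"]
    .agg(["count", "mean", salary_range])
    .sort_values(by="count", ascending=False)
)
print(summary)
```

When a named function is passed in the list, pandas uses the function name as the column label, which is why the result has a salary_range column.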

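IV. Data visualization

1. Histograms

pandas has built-in plotting functions. They are a wrapper around matplotlib, so the two can be used together.

a. Use hist to draw a histogram of the data analyst salary distribution. Most salaries are concentrated below 20k, so for finer granularity the bin width is reduced further (via bins).

import matplotlib.pyplot
matplotlib.pyplot.style.use('ggplot')
matplotlib.pyplot.hist(df_clean.avgSalary, bins=15)
# or, through pandas:
df_clean.avgSalary.hist(bins=15)

matplotlib.pyplot.style.use('ggplot') applies a plotting style modeled on the color scheme of R's ggplot2, purely for looks.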

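b. Compare the salary data of Shanghai and Beijing as histograms. The number of analyst postings differs a lot between the two cities, so the raw counts cannot be compared directly; they need to be converted to densities with the density parameter (called normed in older matplotlib versions, which has since been removed). Setting alpha makes the bars translucent so that both distributions remain visible. This view is more intuitive than a box plot.

matplotlib.pyplot.hist(
    x=df_clean[df_clean.city == '上海'].avgSalary,
    bins=15,
    density=True,      # replaces the removed normed=1
    facecolor='blue',
    alpha=0.5,
)
matplotlib.pyplot.hist(
    x=df_clean[df_clean.city == '北京'].avgSalary,
    bins=15,
    density=True,
    facecolor='red',
    alpha=0.5,
)
matplotlib.pyplot.show()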
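Why density normalization makes groups of different sizes comparable can be shown numerically, without drawing anything. This is a sketch using numpy.histogram on made-up salary values; with density=True each bar height is divided by the sample size and the bin width, so the bar areas of every group sum to 1.

```python
import numpy as np

# Made-up salaries (in k); the two groups differ in size on purpose.
shanghai = np.array([8, 10, 12, 15, 20], dtype=float)
beijing = np.array([6, 9, 10, 12, 14, 15, 18, 20, 25, 30], dtype=float)

# Shared bin edges: six bins of width 5 covering 0 to 30.
edges = np.linspace(0, 30, 7)
widths = np.diff(edges)

sh_density, _ = np.histogram(shanghai, bins=edges, density=True)
bj_density, _ = np.histogram(beijing, bins=edges, density=True)

# Total bar area is 1 for both groups, regardless of how many rows each has.
sh_area = (sh_density * widths).sum()
bj_area = (bj_density * widths).sum()
print(sh_area, bj_area)
```

Using the same bin edges for both groups also keeps the bars aligned, which the two separate hist calls with bins=15 do not guarantee, since each call derives its edges from its own data range.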

2. Box plots

A core idea in data analysis is to break the data down by dimension. Here we look at how city and education together affect salary. A box plot is the best way to observe this.

import matplotlib.pyplot
matplotlib.pyplot.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
matplotlib.pyplot.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly
df_clean.head(300).boxplot(column='avgSalary', by=['education', 'city'], figsize=(30, 6))

pandas.DataFrame.boxplot

figsize: The size of the figure to create in matplotlib.

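3. Bar charts

Bar chart of the average salary in each city:

df_clean.groupby('city').mean(numeric_only=True).plot.bar()

Bar chart of the average salary for each education level in each city:

df_clean.groupby(['city', 'education']).mean(numeric_only=True).unstack().plot.bar()

Now process the data further by assigning each salary to a level.

bins = [0, 3, 5, 10, 15, 20, 30, 100]
level = ['0-3', '3-5', '5-10', '10-15', '15-20', '20-30', '30+']
df_clean['level'] = pandas.cut(df_clean['avgSalary'], bins=bins, labels=level)
df_level = df_clean.groupby(['city', 'level']).avgSalary.count().unstack()
df_level_prop = df_level.apply(lambda x: x / x.sum(), axis=1)
df_level_prop.plot.bar(stacked=True, figsize=(14, 6))

cut performs binning, a common technique in data analysis. It divides values into levels, that is, it turns numeric data into categorical data, and it is used a lot in feature engineering for machine learning. cut can split into equal-width bins if you simply pass a number. Here, for better separation, I passed a list of bin edges chosen by hand, together with the matching labels.

The lambda converts the counts to proportions per city, which are then drawn as a 100% stacked bar chart (matplotlib does not seem to have a function that does this directly). The chart shows fairly clearly what share of postings falls into each salary level in each city. Its advantage over box plots and histograms is that the hand-picked bins carry business meaning: 0-3 is the intern range, 3-5 is fresh graduates without experience who mostly tidy up data, 5-10 is people with some experience, and so on.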
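The binning and proportion steps can be checked on a few rows before plotting. A sketch on made-up salaries, assuming bin edges that start at 0 so that the '0-3' label covers its whole range; observed=False is passed so that empty levels are kept as zero counts.

```python
import pandas as pd

# Made-up average salaries (in k) for two cities.
df_demo = pd.DataFrame({
    "city": ["上海", "上海", "上海", "北京", "北京"],
    "avgSalary": [2.5, 8.0, 12.0, 17.5, 40.0],
})

bins = [0, 3, 5, 10, 15, 20, 30, 100]
labels = ['0-3', '3-5', '5-10', '10-15', '15-20', '20-30', '30+']

# Each interval is open on the left and closed on the right, e.g. (0, 3].
df_demo["level"] = pd.cut(df_demo["avgSalary"], bins=bins, labels=labels)

# Count postings per city and level, one column per level.
df_level = (
    df_demo.groupby(["city", "level"], observed=False)["avgSalary"]
    .count()
    .unstack()
    .fillna(0)
)

# Divide each row by its total to get the share of each level per city.
df_level_prop = df_level.div(df_level.sum(axis=1), axis=0)
print(df_level_prop)
```

Each row of df_level_prop sums to 1, so calling df_level_prop.plot.bar(stacked=True) on it yields bars of equal height, which is what makes the chart a 100% stacked bar chart.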