【Posted】: 2016-10-23 06:55:46
【Question】:
Given the following:
import pandas as pd
import numpy as np
df=pd.DataFrame({'County':['A','B','A','B','A','B','A','B','A','B'],
'Hospital':['a','b','e','f','i','j','m','n','b','r'],
'Enrollment':[44,55,95,54,81,54,89,76,1,67],
'Year':['2012','2012','2012','2012','2012','2013',
'2013','2013','2013','2013']})
d2=pd.pivot_table(df,index=['County','Hospital'],columns=['Year'])#.sort_columns
d2
                Enrollment
Year                  2012  2013
County Hospital
A      a              44.0   NaN
       b               NaN   1.0
       e              95.0   NaN
       i              81.0   NaN
       m               NaN  89.0
B      b              55.0   NaN
       f              54.0   NaN
       j               NaN  54.0
       n               NaN  76.0
       r               NaN  67.0
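If a hospital such as 'b' appears more than once and has no data for the previous year (the first occurrence of 'b'), I want to assign it the previous year's Enrollment value from the other 'b' row, and then delete that other 'b' row (the one with no data for the most recent year), like this: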
                Enrollment
Year                  2012  2013
County Hospital
A      a              44.0   NaN
       b              55.0   1.0
       e              95.0   NaN
       i              81.0   NaN
       m               NaN  89.0
B      f              54.0   NaN
       j               NaN  54.0
       n               NaN  76.0
       r               NaN  67.0
So far I can identify the duplicated rows and delete them, but I'm stuck on replacing the NaN with the needed value:
- Identify duplicated hospitals after resetting the index:
    d2=d2.reset_index()
    d2['dup']=d2.duplicated('Hospital',keep=False)
- Flag for deletion the duplicated hospitals that have no data for the most recent year:
    Hospital=d2.columns.levels[0][1]
    Y1=d2.columns.levels[1][0]
    Y2=d2.columns.levels[1][1]
    d2['Delete']=np.nan
    d2.loc[(pd.isnull(d2.Enrollment[Y2]))&(d2['dup']==True),'Delete']='Yes'
- Keep all rows except those flagged for deletion:
    d2=d2.loc[d2['Delete']!='Yes']
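For illustration, here is a minimal sketch of one way the missing fill step could be combined with the steps above. It is not a confirmed answer from the thread. It assumes that pivoting with `values='Enrollment'` (so the year columns are not nested under a second level) is acceptable, that there are exactly two years, and that at most one row per hospital carries a first-year value. The names `wide`, `donor`, `first_year`, `last_year` and `result` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'County': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
                   'Hospital': ['a', 'b', 'e', 'f', 'i', 'j', 'm', 'n', 'b', 'r'],
                   'Enrollment': [44, 55, 95, 54, 81, 54, 89, 76, 1, 67],
                   'Year': ['2012', '2012', '2012', '2012', '2012', '2013',
                            '2013', '2013', '2013', '2013']})

# Passing values= as a scalar keeps the year columns flat ('2012', '2013').
wide = df.pivot_table(index=['County', 'Hospital'], columns='Year',
                      values='Enrollment').reset_index()

first_year, last_year = '2012', '2013'  # assumption: only these two years exist

# Hospitals that occur in more than one row (here only 'b').
dup = wide.duplicated('Hospital', keep=False)

# For each hospital, take the first non-null first-year value found in any
# of its rows; groupby's 'first' skips NaN.
donor = wide.groupby('Hospital')[first_year].transform('first')

# Fill the missing first-year values from the matching hospital row.
wide[first_year] = wide[first_year].fillna(donor)

# Drop duplicated hospitals that have no data for the most recent year.
result = wide[~(dup & wide[last_year].isna())].set_index(['County', 'Hospital'])
print(result)
```

Hospitals that appear only once are unaffected by the fill, because their only candidate value is their own. If a hospital could have several first-year values, a rule for choosing between them would have to be added.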
【Discussion】:
Tags: python-3.x pandas filter group-by multi-index