【发布时间】:2020-04-06 10:21:41
【问题描述】:
尝试使用来自多个数组/组的值(在循环中)创建 DF
数据帧:
sample_data = [['USA', 'gdp', 2001, 10],['USA', 'avgIQ', 2001, 100],['USA', 'people', 2001, 1000],['USA', 'dragons', 2001, 3],['CHN', 'gdp', 2001, 12], ['CHN', 'avgIQ', 2001, 120],['CHN', 'people', 2001, 2000],['CHN', 'dragons', 2001, 1],['RUS', 'gdp', 2001, 11],['RUS', 'avgIQ', 2001, 105], ['RUS', 'people', 2001, 1500],['RUS', 'dragons', 2001, np.nan],['USA', 'gdp', 2002, 12],['USA', 'avgIQ', 2002, 105],['USA', 'people', 2002, 1200], ['USA', 'dragons', 2002, np.nan],['CHN', 'gdp', 2002, 14],['CHN', 'avgIQ', 2002, 127],['CHN', 'people', 2002, 3100],['CHN', 'dragons', 2002, 4], ['RUS', 'gdp', 2002, 11],['RUS', 'avgIQ', 2002, 99],['RUS', 'people', 2002, 1600],['RUS', 'dragons', 2002, np.nan],['USA', 'gdp', 2003, 15], ['USA', 'avgIQ', 2003, 115],['USA', 'people', 2003, 2000],['USA', 'dragons', 2003, np.nan],['CHN', 'gdp', 2003, 16],['CHN', 'avgIQ', 2003, 132], ['CHN', 'people', 2003, 4000],['CHN', 'dragons', 2003, 6],['RUS', 'gdp', 2003, 11],['RUS', 'avgIQ', 2003, 108],['RUS', 'people', 2003, 2000], ['RUS', 'dragons', 2003, np.nan],['USA', 'gdp', 2004, 18],['USA', 'avgIQ', 2004, 111],['USA', 'people', 2004, 2500],['USA', 'dragons', 2004, np.nan], ['CHN', 'gdp', 2004, 18],['CHN', 'avgIQ', 2004, 140],['CHN', 'people', 2004, np.nan],['CHN', 'dragons', 2004, np.nan], ['RUS', 'gdp', 2004, 15],['RUS', 'avgIQ', 2004, 103],['RUS', 'people', 2004, 2800],['RUS', 'dragons', 2004, np.nan], ['USA', 'gdp', 2005, 23],['USA', 'avgIQ', 2005, 111],['USA', 'people', 2005, 3700],['USA', 'dragons', 2005, 8],['CHN', 'gdp', 2005, 22], ['CHN', 'avgIQ', 2005, 143],['CHN', 'people', 2005, 6000],['CHN', 'dragons', 2005, 15],['RUS', 'gdp', 2005, 17],['RUS', 'avgIQ', 2005, np.nan], ['RUS', 'people', 2005, 3000],['RUS', 'dragons', 2005, 3]]
sample_df = pd.DataFrame(sample_data, columns = ['A','B','C','D'])
sample_df['C'] = sample_df['C'].astype(float)
sample_df.head()
Data columns (total 4 columns):
A 60 non-null object
B 60 non-null object
C 60 non-null float64
D 50 non-null float64
dtypes: float64(2), object(2)
from impyute.imputation.cs import mice
sample_group = sample_df.groupby(['A', 'B'])
for group_index, group in sample_group:
if group.isnull().values.any() == True:
group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
print(group)
else:
print(group)
continue
A B C D
5 CHN avgIQ 2,001.00 120.00
17 CHN avgIQ 2,002.00 127.00
29 CHN avgIQ 2,003.00 132.00
41 CHN avgIQ 2,004.00 140.00
53 CHN avgIQ 2,005.00 143.00
A B C D
7 CHN dragons 2,001.00 1.00
19 CHN dragons 2,002.00 4.00
31 CHN dragons 2,003.00 6.00
43 CHN dragons 2,004.00 10.86
55 CHN dragons 2,005.00 15.00
A B C D
4 CHN gdp 2,001.00 12.00
16 CHN gdp 2,002.00 14.00
28 CHN gdp 2,003.00 16.00
40 CHN gdp 2,004.00 18.00
52 CHN gdp 2,005.00 22.00
A B C D
6 CHN people 2,001.00 2,000.00
18 CHN people 2,002.00 3,100.00
30 CHN people 2,003.00 4,000.00
42 CHN people 2,004.00 5,014.29
54 CHN people 2,005.00 6,000.00
A B C D
9 RUS avgIQ 2,001.00 105.00
21 RUS avgIQ 2,002.00 99.00
33 RUS avgIQ 2,003.00 108.00
45 RUS avgIQ 2,004.00 103.00
57 RUS avgIQ 2,005.00 104.50
A B C D
11 RUS dragons 2,001.00 3.00
23 RUS dragons 2,002.00 3.00
35 RUS dragons 2,003.00 3.00
47 RUS dragons 2,004.00 3.00
59 RUS dragons 2,005.00 3.00
A B C D
8 RUS gdp 2,001.00 11.00
20 RUS gdp 2,002.00 11.00
32 RUS gdp 2,003.00 11.00
44 RUS gdp 2,004.00 15.00
56 RUS gdp 2,005.00 17.00
A B C D
10 RUS people 2,001.00 1,500.00
22 RUS people 2,002.00 1,600.00
34 RUS people 2,003.00 2,000.00
46 RUS people 2,004.00 2,800.00
58 RUS people 2,005.00 3,000.00
A B C D
1 USA avgIQ 2,001.00 100.00
13 USA avgIQ 2,002.00 105.00
25 USA avgIQ 2,003.00 115.00
37 USA avgIQ 2,004.00 111.00
49 USA avgIQ 2,005.00 111.00
A B C D
3 USA dragons 2,001.00 3.00
15 USA dragons 2,002.00 4.25
27 USA dragons 2,003.00 5.50
39 USA dragons 2,004.00 6.75
51 USA dragons 2,005.00 8.00
A B C D
0 USA gdp 2,001.00 10.00
12 USA gdp 2,002.00 12.00
24 USA gdp 2,003.00 15.00
36 USA gdp 2,004.00 18.00
48 USA gdp 2,005.00 23.00
A B C D
2 USA people 2,001.00 1,000.00
14 USA people 2,002.00 1,200.00
26 USA people 2,003.00 2,000.00
38 USA people 2,004.00 2,500.00
50 USA people 2,005.00 3,700.00
请注意,nan 值已正确填写^^^(到目前为止一切正常)
无法根据更新后的值制作类似于我开始使用的数据框:
我按照此处看到的类似问题的解决方案:How to build and fill pandas dataframe from for loop?
sample_group = sample_df.groupby(['A', 'B'])
d = []
for group_index, group in sample_group:
if group.isnull().values.any() == True:
group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
d.append({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
else:
d.append({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
continue
d = pd.DataFrame(d)
d.head(10)
- 您可以看到这些值都是准确的,但它会返回每个索引的整个值列表
使用.concat 的替代方法(似乎需要更长的时间):
sample_group = sample_df.groupby(['A', 'B'])
d = pd.DataFrame()
for group_index, group in sample_group:
if group.isnull().values.any() == True:
group['D'] = ((mice(group.apply({'C': lambda x: x.values, 'D': lambda y: y.values})))[1]).values
temp = pd.DataFrame({'A': group.A.values, 'B': group.B.values, 'C': group.C.values, 'D': group.D.values})
else:
temp = pd.DataFrame({'A': group.A, 'B': group.B, 'C': group.C.values, 'D': group.D.values})
continue
d = pd.concat([d, temp])
d.head()
- 此输出看起来正确,但如您所见,仅返回第一组...
期望的输出:
我想要一个看起来与最初的一模一样的 DataFrame,但在各自的位置有更新的 nan 值。
【问题讨论】:
-
我也尝试过使用类似的方法,其中我使用
.concat,虽然这有效,但它只适用于第一组(没有一直迭代)。此外,这种方法的速度明显较慢,并且在“最佳实践”方面似乎不受欢迎。 -
我也看到了对
.transform的引用,但我不知道这是否适用于这里......
标签: python pandas numpy dataframe