【Question title】: DataFrame cleaning
【Posted】: 2021-11-21 17:26:36
【Question】:

I have an Excel spreadsheet that, once imported, looks something like this:

from datetime import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})
2021-08-01  2021-09-01  2021-10-01  2021-11-01  2021-12-01
       120         NaN         NaN          80         NaN
       NaN         NaN          40         NaN          20
       NaN          50         NaN          50         NaN
       NaN         NaN         100         NaN         NaN
       300         NaN         NaN         NaN         NaN

I am looking to convert it (in Python) into something like this:

shouldbe = pd.DataFrame({
    "PayDate1":
        [datetime(2021,8,1), datetime(2021,10,1), datetime(2021,9,1), datetime(2021,10,1), datetime(2021,8,1)],
    "Amount1": [120, 40, 50, 100, 300],
    "PayDate2":
        [datetime(2021,11,1), datetime(2021,12,1), datetime(2021,11,1), pd.NaT, pd.NaT],
    "Amount2": [80, 20, 50, np.nan, np.nan]})
  PayDate1  Amount1    PayDate2  Amount2
2021-08-01      120  2021-11-01       80
2021-10-01       40  2021-12-01       20
2021-09-01       50  2021-11-01       50
2021-10-01      100         NaT      NaN
2021-08-01      300         NaT      NaN

I am looking for some examples of how to achieve this transformation. Thanks in advance for your help.

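【Comments】:

  • Have a look at pandas.DataFrame.pivot, or get the list of dates and build the data by hand
  • @2e0byo. Using pivot is not as straightforward as it looks. There is still a long way to go to reach the final DataFrame. Check my answer if you like :)
  • @Corralien Indeed there is; nice answer. I didn't have time to work it out, though looking at your answer I would just loop and accept the execution time rather than fight with pandas. Very nice though!

Tags: python pandas dataframe data-cleaning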


【Solution 1】:

For completeness, here is a non-pandas way of doing it:

import math
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1, 00, 00, 00): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1, 00, 00, 00): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1, 00, 00, 00): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1, 00, 00, 00): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1, 00, 00, 00): [np.nan, 20, np.nan, np.nan, np.nan]})

dates = df.columns
# One list per date column, collecting the non-NaN amounts found in it
out = {k: [] for k in dates}

for row in df.iterrows():
    # iterrows() yields (index, Series) pairs; row[1] holds the row's values
    for i, val in enumerate(row[1]):
        d = dates[i]
        if not math.isnan(val):
            out[d].append(val)

print(out)

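This isn't pandasy (in fact, the final output here isn't even a pandas DataFrame, although converting it back to one is simple), but I think it is easier to read and therefore more Pythonic (TM). More importantly, it may be a better fit for some use cases.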
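The loop above groups the amounts by date. If the wide PayDate/Amount layout from the question is what is wanted, the same row-by-row idea can build it directly. This is only a minimal sketch under that assumption; the names `rows`, `record` and `wide` are mine and are not part of the original answer.

```python
import math
from datetime import datetime

import numpy as np
import pandas as pd

df = pd.DataFrame({
    datetime(2021, 8, 1): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1): [np.nan, 20, np.nan, np.nan, np.nan],
})

rows = []  # hypothetical name: one dict per input row
for _, row in df.iterrows():
    record = {}
    n = 0  # how many payments this row has produced so far
    for date, val in row.items():
        if math.isnan(val):
            continue  # empty cell, no payment on this date
        n += 1
        record[f"PayDate{n}"] = date
        record[f"Amount{n}"] = val
    rows.append(record)

# Keys missing from a record (rows with a single payment) become NaN/NaT
wide = pd.DataFrame(rows)
print(wide)
```

Rows with fewer payments than the longest row simply lack the later keys, so pandas fills those cells with missing values when the list of dicts is turned into a DataFrame.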

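【Comments】:

  • Thank you, this is a nice way of looping over the rows, very useful.
【Solution 2】:

You can use melt, groupby and pivot to get the expected DataFrame:

  1. Reshape your DataFrame with melt: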
out = df.reset_index() \
        .melt(id_vars='index', var_name='PayDate', value_name='Amount') \
        .dropna()
print(out)

# Output
    index    PayDate  Amount
0       0 2021-08-01   120.0  # <- index 0, 1st occurrence
4       4 2021-08-01   300.0  # <- index 4, 1st occurrence
7       2 2021-09-01    50.0  # <- index 2, 1st occurrence
11      1 2021-10-01    40.0  # <- index 1, 1st occurrence
13      3 2021-10-01   100.0  # <- index 3, 1st occurrence
15      0 2021-11-01    80.0  # <- index 0, 2nd occurrence
17      2 2021-11-01    50.0  # <- index 2, 2nd occurrence
21      1 2021-12-01    20.0  # <- index 1, 2nd occurrence
  2. Group by index and apply cumcount to create the index of the new columns ('1' and '2' as strings, for the later concatenation):
out['col'] = out.groupby('index').cumcount().add(1).astype(str)
print(out)

# Output:
    index    PayDate  Amount  col
0       0 2021-08-01   120.0    1
4       4 2021-08-01   300.0    1
7       2 2021-09-01    50.0    1
11      1 2021-10-01    40.0    1
13      3 2021-10-01   100.0    1
15      0 2021-11-01    80.0    2
17      2 2021-11-01    50.0    2
21      1 2021-12-01    20.0    2
  3. Pivot the DataFrame:
out = out.pivot(index='index', columns='col', values=['PayDate', 'Amount'])
print(out)

# Output
         PayDate            Amount      
col            1          2      1     2
index                                   
0     2021-08-01 2021-11-01  120.0  80.0
1     2021-10-01 2021-12-01   40.0  20.0
2     2021-09-01 2021-11-01   50.0  50.0
3     2021-10-01        NaT  100.0   NaN
4     2021-08-01        NaT  300.0   NaN
  4. Get the final DataFrame:
cols = out.columns.get_level_values(1).argsort()
out.columns = out.columns.to_flat_index().map(''.join)
out.index.name = None

out = out[out.columns[cols]]
print(out)

# Output
    PayDate1 Amount1   PayDate2 Amount2
0 2021-08-01   120.0 2021-11-01    80.0
1 2021-10-01    40.0 2021-12-01    20.0
2 2021-09-01    50.0 2021-11-01    50.0
3 2021-10-01   100.0        NaT     NaN
4 2021-08-01   300.0        NaT     NaN
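The four steps can also be wrapped in one function so they are easy to reuse. The following is a sketch rather than the answer's exact code: `to_wide` is a hypothetical helper name, and the final column order is built from an explicit list instead of `argsort` (my substitution), so that it does not rely on the sort being stable. It assumes the input has at least one non-NaN cell.

```python
from datetime import datetime

import numpy as np
import pandas as pd


def to_wide(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: turn each row's non-NaN cells into PayDateN/AmountN pairs."""
    # Step 1: long format, one line per (row, date) cell, empty cells dropped
    long = (
        df.reset_index()
        .melt(id_vars="index", var_name="PayDate", value_name="Amount")
        .dropna()
    )
    # Step 2: number the payments within each original row (1, 2, ...)
    occurrence = long.groupby("index").cumcount().add(1)
    long["col"] = occurrence.astype(str)
    # Step 3: one line per original row, one column per (field, occurrence)
    wide = long.pivot(index="index", columns="col", values=["PayDate", "Amount"])
    # Step 4: flatten ('PayDate', '1') into 'PayDate1' and interleave the pairs
    wide.columns = wide.columns.to_flat_index().map("".join)
    wide.index.name = None
    ordered = [
        f"{name}{i}"
        for i in range(1, occurrence.max() + 1)
        for name in ("PayDate", "Amount")
    ]
    return wide[ordered]


df = pd.DataFrame({
    datetime(2021, 8, 1): [120, np.nan, np.nan, np.nan, 300],
    datetime(2021, 9, 1): [np.nan, np.nan, 50, np.nan, np.nan],
    datetime(2021, 10, 1): [np.nan, 40, np.nan, 100, np.nan],
    datetime(2021, 11, 1): [80, np.nan, 50, np.nan, np.nan],
    datetime(2021, 12, 1): [np.nan, 20, np.nan, np.nan, np.nan],
})

wide = to_wide(df)
print(wide)
```

Because the column list is generated from the largest occurrence number, the same function should also handle rows with three or more payments without further changes.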

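【Comments】:

  • Thanks for the insight; there are many functions here that I had never seen before (.get_level_values, .argsort, .to_flat_index and cumcount). It is also nice to see melt used together with pivot.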