Efficient Pandas stock calculation: answers

【Question title】: Efficient Pandas stock calculation
【Posted】: 2020-07-10 14:35:01
【Question】:

I have a dataset of birth and death dates (years), like this:

import pandas as pd
d1 = {'Birth_date': [1800,1810,1802,1804], 'Death_date': [1805, 1880,1854,1832]}
d1 = pd.DataFrame(data=d1)  # the answers below use d1 as a DataFrame

   Birth_date  Death_date
0        1800        1805
1        1810        1880
2        1802        1854
3        1804        1832

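I want to compute:

The stock of living people of a given age in a given year (e.g. the number of 18-year-olds alive in 1825)
The number of deaths at a given age in a given year (e.g. the number of 18-year-olds who died in 1825)

In principle, the output should look like this: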

   Date  Number ind. aged 1  Number ind. aged 2  Number ind. aged k
0  1800                   .                   .                   .
1  1801                   .                   .                   .
2  1802                   .                   .                   .
3  1803                   .                   .                   .

and

   Date  Number death aged 1  Number death aged 2  Number death aged k
0  1800                    .                    .                    .
1  1801                    .                    .                    .
2  1802                    .                    .                    .
3  1803                    .                    .                    .

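I don't see any simple way to compute this. Has anyone run into a similar problem?

【Comments】:

Tags: python pandas data-manipulation

【Solution 1】:

Q1: Stock of living people of a given age in a given year:

Given the DataFrame d1 from the question above: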

# Build one row per person per calendar year lived (birth and death years
# included); age is the number of years since birth.
d2 = pd.concat(
    [pd.DataFrame({'id': idx,
                   'year': range(birth, death + 1),
                   'age': range(death - birth + 1)})
     for idx, birth, death in d1.itertuples()])

d2 looks like:

    id  year  age
0    0  1800    0
1    0  1801    1
2    0  1802    2
3    0  1803    3
4    0  1804    4
..  ..   ...  ...
24   3  1828   24
25   3  1829   25
26   3  1830   26
27   3  1831   27
28   3  1832   28

[159 rows x 3 columns]
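
Building one small DataFrame per person can get slow when d1 has many rows. As a rough sketch (not part of the original answer), the same long table can be assembled with NumPy repeats instead. The names n_years, starts and d2_fast are made up for illustration, and the row index here is a plain 0..158 rather than restarting for each person.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame({'Birth_date': [1800, 1810, 1802, 1804],
                   'Death_date': [1805, 1880, 1854, 1832]})

# Number of calendar years each person appears in (birth and death years included).
n_years = (d1['Death_date'] - d1['Birth_date'] + 1).to_numpy()

# Repeat each person's id once per year lived.
ids = np.repeat(d1.index.to_numpy(), n_years)

# Age restarts at 0 for each person: global row position minus the
# position at which that person's block of rows begins.
starts = np.cumsum(n_years) - n_years
age = np.arange(n_years.sum()) - np.repeat(starts, n_years)

d2_fast = pd.DataFrame({
    'id': ids,
    'year': np.repeat(d1['Birth_date'].to_numpy(), n_years) + age,
    'age': age,
})
```

d2_fast holds the same id, year and age values as d2, so it should be usable in the pivot step in the same way.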

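id identifies the individual and is taken from the index of d1. The next step is simply to pivot d2 to count the people alive at a given age in a given year: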

nlvng = pd.pivot_table(d2, columns = 'age', index = 'year', values = 'id', aggfunc = 'count', fill_value=0)

Result:

age   0   1   2   3   4   5   6   7   8   ...  62  63  64  65  66  67  68  69  70
year                                      ...                                    
1800   1   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1801   0   1   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1802   1   0   1   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1803   0   1   0   1   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1804   1   0   1   0   1   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ..  ..  ..  ..  ..  ..  ..  ..  ..
1876   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   1   0   0   0   0
1877   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   1   0   0   0
1878   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   1   0   0
1879   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   1   0
1880   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   1

[81 rows x 71 columns]
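
To answer a concrete question such as "how many 18-year-olds were alive in 1825", the pivot can be read cell by cell. Below is a minimal, self-contained sketch; alive() is a hypothetical helper, not part of the original answer, and it assumes d1 keeps its default 0..n-1 index so that enumerate() reproduces the ids.

```python
import pandas as pd

d1 = pd.DataFrame({'Birth_date': [1800, 1810, 1802, 1804],
                   'Death_date': [1805, 1880, 1854, 1832]})

# Rebuild d2: one row per person and calendar year lived.
d2 = pd.concat([
    pd.DataFrame({'id': i,
                  'year': list(range(b, d + 1)),
                  'age': list(range(d - b + 1))})
    for i, (b, d) in enumerate(zip(d1['Birth_date'], d1['Death_date']))
])

nlvng = pd.pivot_table(d2, columns='age', index='year', values='id',
                       aggfunc='count', fill_value=0)

def alive(year, age):
    """Hypothetical helper: people of `age` alive in `year` (0 if off the grid)."""
    if year in nlvng.index and age in nlvng.columns:
        return int(nlvng.at[year, age])
    return 0

print(alive(1825, 18))  # 0: the people alive in 1825 are aged 15, 23 and 21
print(alive(1825, 21))  # 1: the person born in 1804
```

The membership check is there because the pivot only has rows and columns for years and ages that actually occur in d2.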

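Q2: Number of deaths at a given age in a given year:

Here we take the d2 computed earlier and merge it with d1, matching id against d1.index and year against Death_date: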

d3 = d2.merge(d1, left_on = ['id','year'], right_on = [d1.index,'Death_date'], how = 'outer')

ndeaths = pd.pivot_table(d3, columns = 'age', index = 'year', values = 'Death_date', aggfunc = 'count', fill_value=0)

Output:

age   0   1   2   3   4   5   6   7   8   ...  62  63  64  65  66  67  68  69  70
year                                      ...                                    
1800   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1801   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1802   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1803   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1804   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
  ..  ..  ..  ..  ..  ..  ..  ..  ..  ...  ..  ..  ..  ..  ..  ..  ..  ..  ..
1876   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1877   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1878   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1879   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   0
1880   0   0   0   0   0   0   0   0   0  ...   0   0   0   0   0   0   0   0   1

[81 rows x 71 columns]
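
Since each person dies exactly once, the death counts can probably also be obtained straight from d1, without going through d2. This is only a sketch under that assumption, using pd.crosstab in place of the merge; age_at_death, deaths and ndeaths_full are illustrative names.

```python
import pandas as pd

d1 = pd.DataFrame({'Birth_date': [1800, 1810, 1802, 1804],
                   'Death_date': [1805, 1880, 1854, 1832]})

# Age at death, counted from 0 as in d2.
age_at_death = (d1['Death_date'] - d1['Birth_date']).rename('age')

# Cross-tabulate death year against age at death; only the years and
# ages in which someone actually died appear here.
deaths = pd.crosstab(d1['Death_date'].rename('year'), age_at_death)

# Expand to the full year x age grid, filling the empty cells with 0.
years = list(range(int(d1['Birth_date'].min()), int(d1['Death_date'].max()) + 1))
ages = list(range(int(age_at_death.max()) + 1))
ndeaths_full = deaths.reindex(index=years, columns=ages, fill_value=0)
```

For the sample data, ndeaths_full has the same 81 x 71 shape as the pivot above, with a single 1 in each person's death year.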

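【Comments】:

【Solution 2】:

Edit: sorry, my first answer was completely wrong.

I now think this is probably close to what was asked. It may not be the most efficient solution; perhaps someone else will find a better one?

The solution first builds an artificial df that contains every possible year, with one column per person. It then computes each person's age in every year, and finally counts, for each year, how many people have each age.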

import pandas as pd


def ind_age(x, min_val, max_val):
    if min_val <= x < max_val:
        return x - min_val + 1  # ages start at 1; 0 is reserved for "not alive"
    else:
        return 0

# init df
d1 = {'Birth_date': [1800, 1810, 1802, 1804], 'Death_date': [1805, 1880, 1854, 1832]}
d1 = pd.DataFrame(data=d1)

# min and max years to init df
min_year = d1[['Birth_date', 'Death_date']].min().min()
max_year = d1[['Birth_date', 'Death_date']].max().max()

# get all years possible as a column
df_years = pd.DataFrame(range(min_year, max_year + 1))
df_years.columns = ['years']

# transpose to prepare left join
# the left join will make it possible to insert custom values
# for each year and person
d1 = d1.transpose()

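for colname in d1.columns:
    # left-join each person's birth and death years onto the year column;
    # every other year stays NaN in that person's column
    df_years = pd.merge(left=df_years, right=pd.DataFrame(d1[colname]), how='left', left_on='years', right_on=colname)

for col in df_years.columns[1:]:
    # the only non-NaN values in the column are the birth and death years
    col_min = df_years[col].min()
    col_max = df_years[col].max()
    # replace the column with the person's age in each year (0 = not alive)
    df_years[col] = df_years['years'].apply(lambda x: ind_age(x, col_min, col_max))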

df_years.set_index('years', inplace=True)

result = df_years.apply(pd.Series.value_counts, axis=1).fillna(0)

The result looks like this:

       0.0   1.0   2.0   3.0   4.0   5.0   ...  65.0  66.0  67.0  68.0  69.0  70.0
years                                      ...                                    
1800    3.0   1.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0
1801    3.0   0.0   1.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0
1802    2.0   1.0   0.0   1.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0
1803    2.0   0.0   1.0   0.0   1.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0
1804    1.0   1.0   0.0   1.0   0.0   1.0  ...   0.0   0.0   0.0   0.0   0.0   0.0
     ...   ...   ...   ...   ...   ...  ...   ...   ...   ...   ...   ...   ...
1876    3.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   1.0   0.0   0.0   0.0
1877    3.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   1.0   0.0   0.0
1878    3.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   1.0   0.0
1879    3.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   1.0
1880    4.0   0.0   0.0   0.0   0.0   0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0
[81 rows x 71 columns]

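For deaths, you can modify ind_age() so that it returns a value only in the year of death (x == max_val), namely the corresponding age at death. Whether age is counted from 0 or from 1 is up to you.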
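
One way that modification might look is sketched below. death_age() and per_person are hypothetical names, age is counted from 1 as an assumption to match ind_age(), and the year grid is built directly instead of through the merge loop to keep the example short.

```python
import pandas as pd

def death_age(x, min_val, max_val):
    # Hypothetical variant of ind_age(): nonzero only in the death year.
    # Age is counted from 1 here; drop the "+ 1" to count from 0.
    return x - min_val + 1 if x == max_val else 0

d1 = pd.DataFrame({'Birth_date': [1800, 1810, 1802, 1804],
                   'Death_date': [1805, 1880, 1854, 1832]})

years = list(range(int(d1['Birth_date'].min()), int(d1['Death_date'].max()) + 1))

# One column per person, one row per year; 0 means "did not die in this year".
per_person = pd.DataFrame(
    {i: [death_age(y, b, d) for y in years]
     for i, (b, d) in enumerate(zip(d1['Birth_date'], d1['Death_date']))},
    index=years,
)

# Count the values in each row, then drop the 0 ("no death") column.
deaths = per_person.apply(pd.Series.value_counts, axis=1).fillna(0).drop(columns=0)
```

With this counting convention the person born in 1800 who died in 1805 shows up at age 6; counting from 0 would put the same death at age 5.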

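【Comments】:

Thanks, but no. This only gives the number of births and deaths at a given date. It does not give the number of people alive at a given age at a given date.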