Conditional Cumulative Sums in Pandas: Answers

【Question Title】: Conditional Cumulative Sums in Pandas
【Posted】: 2019-10-29 07:21:15
【Question】:

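I'm a former Excel power user repenting for my sins. I need help recreating a calculation that is common for me.

I am trying to calculate the performance of a loan portfolio. In the numerator, I calculate the cumulative total of losses. In the denominator, I need the original balance of the loans that are included in that cumulative total.

I cannot figure out how to do a conditional groupby in Pandas to accomplish this. It is very simple in Excel, so I'm hoping I am overthinking it.

I could not find much about this problem on StackOverflow, but this was the closest: python pandas conditional cumulative sum

What I cannot figure out is how to handle conditions that depend both on values in the index and on values held in the columns.

Here is my data: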

| Loan    | Origination | Balance | NCO Date  | NCO | As of Date | Age (Months) | NCO Age (Months) |
|---------|-------------|---------|-----------|-----|------------|--------------|------------------|
| Loan 1  | 1/31/2011   | 1000    | 1/31/2018 | 25  | 5/31/2019  | 100          | 84               |
| Loan 2  | 3/31/2011   | 2500    |           | 0   | 5/31/2019  | 98           |                  |
| Loan 3  | 5/31/2011   | 3000    | 1/31/2019 | 15  | 5/31/2019  | 96           | 92               |
| Loan 4  | 7/31/2011   | 2500    |           | 0   | 5/31/2019  | 94           |                  |
| Loan 5  | 9/30/2011   | 1500    | 3/31/2019 | 35  | 5/31/2019  | 92           | 90               |
| Loan 6  | 11/30/2011  | 2500    |           | 0   | 5/31/2019  | 90           |                  |
| Loan 7  | 1/31/2012   | 1000    | 5/31/2019 | 5   | 5/31/2019  | 88           | 88               |
| Loan 8  | 3/31/2012   | 2500    |           | 0   | 5/31/2019  | 86           |                  |
| Loan 9  | 5/31/2012   | 1000    |           | 0   | 5/31/2019  | 84           |                  |
| Loan 10 | 7/31/2012   | 1250    |           | 0   | 5/31/2019  | 82           |                  |
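
For readers who want to experiment, the table above can be rebuilt as a DataFrame. This is only a sketch: the post does not say how the two age columns were derived, so the `months_between` helper below (a hypothetical name) assumes they are whole calendar-month differences from the origination date, which happens to reproduce every value in the table.

```python
import pandas as pd

df = pd.DataFrame({
    "Loan": [f"Loan {i}" for i in range(1, 11)],
    "Origination": ["1/31/2011", "3/31/2011", "5/31/2011", "7/31/2011", "9/30/2011",
                    "11/30/2011", "1/31/2012", "3/31/2012", "5/31/2012", "7/31/2012"],
    "Balance": [1000, 2500, 3000, 2500, 1500, 2500, 1000, 2500, 1000, 1250],
    "NCO Date": ["1/31/2018", None, "1/31/2019", None, "3/31/2019",
                 None, "5/31/2019", None, None, None],
    "NCO": [25, 0, 15, 0, 35, 0, 5, 0, 0, 0],
})
df["As of Date"] = pd.Timestamp("2019-05-31")
for col in ["Origination", "NCO Date"]:
    # Missing NCO dates become NaT
    df[col] = pd.to_datetime(df[col], format="%m/%d/%Y")


def months_between(start, end):
    """Whole calendar months from start to end (assumed definition of 'age')."""
    return (end.dt.year - start.dt.year) * 12 + (end.dt.month - start.dt.month)


df["Age (Months)"] = months_between(df["Origination"], df["As of Date"])
# Loans without an NCO date get NaN here
df["NCO Age (Months)"] = months_between(df["Origination"], df["NCO Date"])
print(df[["Loan", "Age (Months)", "NCO Age (Months)"]])
```

Loans that never charged off keep a missing NCO age, which matches the blank cells in the table.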

In Excel, I would calculate these totals with the following formulas:

Outstanding Balance: =SUMIFS(Balance, Age (Months), ">="&Reference Age)

Cumulative NCO: =SUMIFS(NCO, Age (Months), ">="&Reference Age, NCO Age (Months), "<="&Reference Age)

Resulting data:

| Reference Age       | 85    | 90    | 95   | 100  |
|---------------------|-------|-------|------|------|
| Outstanding Balance | 16500 | 13000 | 6500 | 1000 |
| Cumulative NCO      | 25    | 60    | 40   | 25   |

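The goal is for Outstanding Balance to include only loans that are old enough to have an NCO observation at the reference age. Cumulative NCO is the total NCO, up to that age, on those same loans.

Edit:

I have since done the calculation this way. But is it the most efficient approach?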

import numpy as np
import pandas as pd

age_bins = list(np.arange(85, 101, 5))  # reference ages 85, 90, 95, 100
df = df.fillna(value=0)
df["NCO Age (Months)"] = df["NCO Age (Months)"].astype(int)

rows = []
for age in age_bins:
    # NCOs charged off by this age, on loans that have reached this age
    nco = df.loc[(df["Age (Months)"] >= age) & (df["NCO Age (Months)"] <= age), "NCO"].sum()
    # Original balance of loans that have reached this age
    bal = df.loc[df["Age (Months)"] >= age, "Balance"].sum()
    rows.append({"Age": age, "Cumulative NCO": nco, "Outstanding Balance": bal})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the rows
final_df = pd.DataFrame(rows, index=age_bins)
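
On the efficiency question: with only a handful of reference ages, a loop like this is usually fine. One possible loop-free variant compares every loan against every reference age at once with NumPy broadcasting. The sketch below assumes the sample table is the whole data set; the names `seasoned` and `charged_off` are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Balance": [1000, 2500, 3000, 2500, 1500, 2500, 1000, 2500, 1000, 1250],
    "NCO": [25, 0, 15, 0, 35, 0, 5, 0, 0, 0],
    "Age (Months)": [100, 98, 96, 94, 92, 90, 88, 86, 84, 82],
    "NCO Age (Months)": [84, np.nan, 92, np.nan, 90, np.nan, 88, np.nan, np.nan, np.nan],
})
ref_ages = np.arange(85, 101, 5)  # 85, 90, 95, 100

# Column vectors of shape (n_loans, 1) broadcast against ref_ages of shape (n_refs,)
age = df["Age (Months)"].to_numpy()[:, None]
nco_age = df["NCO Age (Months)"].to_numpy()[:, None]

seasoned = age >= ref_ages          # loan has reached the reference age
charged_off = nco_age <= ref_ages   # NaN compares as False, so no-NCO loans drop out

balance = df["Balance"].to_numpy()[:, None]
nco = df["NCO"].to_numpy()[:, None]

result = pd.DataFrame(
    {
        "Outstanding Balance": (balance * seasoned).sum(axis=0),
        "Cumulative NCO": (nco * (seasoned & charged_off)).sum(axis=0),
    },
    index=pd.Index(ref_ages, name="Reference Age"),
).T
print(result)
```

The intermediate masks have one cell per loan and reference age, so memory grows with the product of the two counts; for very large portfolios the loop may still be the more practical choice.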

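【Comments】:

What is Reference Age?
Sorry, I mislabeled that. The reference is the age in months from the data section. I will edit the post.
Is a cumulative sum in Excel the same as in pandas/Python? This is where I always trip up: functions with the same name behave differently. For example, Python's round uses banker's rounding (halves go to the nearest even number), whereas Excel rounds halves away from zero. That caused me some trouble when rewriting VBA code in Python! The basis of your question is good, but I find it hard to get from your example to your output.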
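
The rounding difference mentioned in the comment above can be seen directly. The `excel_round` helper below is a hypothetical name, and it is only a sketch of half-away-from-zero rounding built on the standard `decimal` module, not a full reimplementation of Excel's ROUND.

```python
from decimal import Decimal, ROUND_HALF_UP


def excel_round(value, digits=0):
    """Round halves away from zero (hypothetical helper mimicking worksheet ROUND)."""
    quantum = Decimal(1).scaleb(-digits)  # 1, 0.1, 0.01, ...
    # Going through str() avoids binary floating-point artifacts such as 2.675
    return float(Decimal(str(value)).quantize(quantum, rounding=ROUND_HALF_UP))


# Built-in round() sends halves to the nearest even integer
print(round(0.5), round(1.5), round(2.5))
# The helper sends halves away from zero
print(excel_round(0.5), excel_round(1.5), excel_round(2.5), excel_round(-2.5))
```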

Tags: python pandas pandas-groupby

【Solution 1】:

You could try using pd.cut to build groups of loans within given age ranges, and then use groupby. Something like this:

import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [7, 8, 9, 10, 11]], index=['age', 'value']).T
df['groups'] = pd.cut(df.age, [0, 1, 3, 5]) # define bins (0,1], (1,3], (3,5]
df.groupby('groups')['value'].sum()

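【Comments】:

The problem is that groupby just takes a simple sum, whereas here certain values have to be excluded as time goes on. In the example, you can see that both the balance and the cumulative NCO decline at later ages. If we have originated $100k of loans but they have not all aged to 90 months, including all of them would understate our loss rate at 90 months, because not every loan has had the chance to go bad by month 90. Thanks for the answer :)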
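
Part of the calculation does still fit the pd.cut idea. Because "age >= reference age" is a one-sided condition, Outstanding Balance is a reverse cumulative sum of the per-bin balances. The sketch below assumes the ages and balances from the question's table and left-closed bins chosen for illustration. Cumulative NCO does not reduce to this, since it combines a condition on the loan age with one on the NCO age.

```python
import numpy as np
import pandas as pd

ages = pd.Series([100, 98, 96, 94, 92, 90, 88, 86, 84, 82])
balances = pd.Series([1000, 2500, 3000, 2500, 1500, 2500, 1000, 2500, 1000, 1250])

# Left-closed bins [85, 90), [90, 95), [95, 100), [100, inf), labelled by lower edge
edges = [85, 90, 95, 100, np.inf]
bins = pd.cut(ages, edges, right=False, labels=[85, 90, 95, 100])

# Loans younger than 85 months fall outside every bin and are dropped by groupby
per_bin = balances.groupby(bins, observed=False).sum()

# Summing from the oldest bin downwards gives "balance of loans aged >= edge"
outstanding = per_bin[::-1].cumsum()[::-1]
print(outstanding)
```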

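【Solution 2】:

Your conditions are complex and depend on a variable. It is easy to find a vectorized approach for a simple cumulative sum, but I cannot think of a nice one for Cumulative NCO.

So I would fall back on a Python comprehension: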

data = [
    { 'Reference Age': ref,
      'Outstanding Balance': df.loc[df.iloc[:,6]>=ref,'Balance'].sum(),
      'Cumulative NCO': df.loc[(df.iloc[:,6]>=ref)&(df.iloc[:,7]<=ref),
                   'NCO'].sum() }
    for ref in [85, 90, 95, 100]]

result = pd.DataFrame(data).set_index('Reference Age').T

It gives:

Reference Age          85     90    95    100
Cumulative NCO          25     60    40    25
Outstanding Balance  16500  13000  6500  1000
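
If a pandas-only formulation is preferred over the comprehension, one option is a cross join: pair every loan with every reference age, keep the pairs that satisfy the age condition, and aggregate. This is a sketch under the assumption that pandas 1.2 or later is available (for `how="cross"`); the `Counted NCO` column is a temporary name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Balance": [1000, 2500, 3000, 2500, 1500, 2500, 1000, 2500, 1000, 1250],
    "NCO": [25, 0, 15, 0, 35, 0, 5, 0, 0, 0],
    "Age (Months)": [100, 98, 96, 94, 92, 90, 88, 86, 84, 82],
    "NCO Age (Months)": [84, np.nan, 92, np.nan, 90, np.nan, 88, np.nan, np.nan, np.nan],
})
refs = pd.DataFrame({"Reference Age": [85, 90, 95, 100]})

# One row per (loan, reference age) pair
pairs = df.merge(refs, how="cross")

# Keep only loans that have reached the reference age
seasoned = pairs[pairs["Age (Months)"] >= pairs["Reference Age"]]

# An NCO counts only if it happened by the reference age; NaN compares as False
counted_nco = seasoned["NCO"].where(
    seasoned["NCO Age (Months)"] <= seasoned["Reference Age"], 0
)

result = (
    seasoned.assign(**{"Counted NCO": counted_nco})
    .groupby("Reference Age")
    .agg(**{
        "Outstanding Balance": ("Balance", "sum"),
        "Cumulative NCO": ("Counted NCO", "sum"),
    })
    .T
)
print(result)
```

The cross join materializes one row per loan and reference age, so it trades memory for the absence of an explicit loop.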

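【Comments】:

The balance looks right, but the NCO is wrong. It is picking up all of the NCOs in the 85 column, even though only one NCO occurred in that time period.
@RussW Since I am not an Excel formula expert, and my Excel shows the formula names in French, I did not understand how Cumulative NCO is computed. This answer is wrong and will be deleted unless I find the correct way.
@RussW: It is less optimized, but it now meets the SUMIFS requirements.
Thanks! This looks like exactly what I needed.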

【Solution 3】:

I am not sure I completely follow the exact logic you want, but you can reproduce SUMIFS with a combination of pandas query and groupby.

Example

import pandas as pd
import numpy as np

age = np.random.randint(85, 100, 50)
balance = np.random.randint(1000, 2500, 50)
nco = np.random.randint(85, 100, 50)

df = pd.DataFrame({'age': age, 'balance': balance, 'nco':nco})


df['reference_age'] = df['age'].apply(lambda x: 5 * round(float(x)/5))

outstanding_balance = (
   df
   .query('age >= reference_age')
   .groupby('reference_age')
   [['balance']]
   .sum()
   .rename(columns={'balance': 'Outstanding Balance'})
   )

cumulative_nco = (
   df
   .query('age < reference_age')
   .groupby('reference_age')
   [['nco']]
   .sum()
   .rename(columns={'nco': 'cumulative nco'})
   .cumsum()
   )


result = outstanding_balance.join(cumulative_nco).T

Result

reference_age            85       90       95
Outstanding Balance  2423.0  16350.0  13348.0
cumulative nco          NaN    645.0   1107.0

【Comments】: