pandas: row-wise operation to get change over time

[Question title]: pandas: row-wise operation to get change over time
[Posted]: 2021-07-03 22:45:29
[Question description]:

I have a large DataFrame. A sample is shown below:

| year | sentences         | company |
|------|-------------------|---------|
| 2020 | [list of strings] | A       |
| 2019 | [list of strings] | A       |
| 2018 | [list of strings] | A       |
| ...  | ....              | ...     |
| 2020 | [list of strings] | Z       |
| 2019 | [list of strings] | Z       |
| 2018 | [list of strings] | Z       |

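I want to compare the sentences column year over year within each company to get the year-on-year change.
Example: for company A, I want to apply an operator such as sentence similarity or some distance metric to [list of strings]2020 and [list of strings]2019, and then to [list of strings]2019 and [list of strings]2018.

The same applies to companies B, C, ... Z.

How can this be done?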

Edit

The length of [list of strings] varies. So some simple quantifying operators could be:

Difference in number of elements --> length([list of strings]2020) - length([list of strings]2019)
Count of common elements --> length(set([list of strings]2020, [list of strings]2019))
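
The two operators above could be written in plain Python roughly as follows. This is only a sketch: the function names and the sample lists are invented for illustration, and "common elements" is read here as the intersection of the two lists treated as sets.

```python
def length_difference(current, previous):
    """Difference in number of elements: len(current) - len(previous)."""
    return len(current) - len(previous)


def common_count(current, previous):
    """Number of distinct elements that appear in both lists."""
    return len(set(current) & set(previous))


# Hypothetical sample data, invented for illustration
sentences_2020 = ["revenue grew", "we hired staff", "costs fell"]
sentences_2019 = ["revenue grew", "costs rose"]

print(length_difference(sentences_2020, sentences_2019))  # 1
print(common_count(sentences_2020, sentences_2019))       # 1
```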

The comparison should look like this:

| years     | Y-o-Y change (Some function) | company |
|-----------|------------------------------|---------|
| 2020-2019 | 15                           | A       |
| 2019-2018 | 3                            | A       |
| 2018-2017 | 55                           | A       |
| ...       | ....                         | ...     |
| 2020-2019 | 33                           | Z       |
| 2019-2018 | 32                           | Z       |
| 2018-2017 | 27                           | Z       |

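[Comments]:

It is not entirely clear to me from your question what you are trying to do. If you want to apply a function to a column, you can do that easily with df.apply. From there, I could show you how to compute the year-on-year change of a numeric feature?
Are the lists of strings [list of strings] all the same length? How should the comparison be done (0->0, 1->1 or 0->1, 0->2, 0->N, 1->0, 1->1, 1->N)? Please give at least one example for 2020-A, 2019-A and 2018-A.
I added an edit to clarify.

Tags: pandas bert-language-model sentence-similarity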

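[Solution 1]:

TL;DR: see the full code at the bottom.

You have to break the task down into simpler subtasks. Basically, you want to apply one or more computations to consecutive rows of the DataFrame, grouped by company. That means you need groupby and apply.

Let's start by generating an example DataFrame. Here I use lowercase letters as the words of the "sentences" column.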

import numpy as np
import pandas as pd
import string

df = pd.DataFrame({'date':      np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company':   np.repeat(list('ABC'), 10)})
df

Output:

    date                    sentences company
0   2020                          [z]       A
1   2019  [s, f, g, a, d, a, h, o, c]       A
2   2018                          [b]       A
…
26  2014                          [q]       C
27  2013                       [i, w]       C
28  2012     [o, p, i, d, f, w, k, d]       C
29  2011                 [l, f, h, p]       C

Join the columns of the next row (the previous year) onto each row, with a "_pre" suffix:

pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)

Output:

    date                    sentences company  date_pre                sentences_pre company_pre
0   2020                          [z]       A    2019.0  [s, f, g, a, d, a, h, o, c]           A
1   2019  [s, f, g, a, d, a, h, o, c]       A    2018.0                          [b]           A
2   2018                          [b]       A    2017.0           [x, n, r, a, s, d]           A
3   2017           [x, n, r, a, s, d]       A    2016.0  [u, n, g, u, k, s, v, s, o]           A
4   2016  [u, n, g, u, k, s, v, s, o]       A    2015.0     [v, g, d, i, b, z, y, k]           A
5   2015     [v, g, d, i, b, z, y, k]       A    2014.0                    [q, o, p]           A
6   2014                    [q, o, p]       A    2013.0                    [j, s, s]           A
7   2013                    [j, s, s]       A    2012.0              [g, u, l, g, n]           A
8   2012              [g, u, l, g, n]       A    2011.0              [v, p, y, a, s]           A
9   2011              [v, p, y, a, s]       A    2020.0                 [a, h, c, w]           B
…
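
Note that row 9 in the output above pairs company A's 2011 row with company B's 2020 row, because a plain shift ignores company boundaries. One possible way to avoid this, sketched here on hypothetical toy data rather than the answer's random DataFrame, is to shift within each group using groupby before concatenating:

```python
import pandas as pd

# Hypothetical toy data: two companies, years in descending order
df = pd.DataFrame({
    "date": [2020, 2019, 2018, 2020, 2019, 2018],
    "sentences": [["a", "b"], ["b"], ["c", "d"], ["x"], ["x", "y"], ["z"]],
    "company": ["A", "A", "A", "B", "B", "B"],
})

# Shift within each company so the last row of A is not paired with B
shifted = (df.groupby("company")[["date", "sentences"]]
             .shift(-1)
             .add_suffix("_pre"))
paired = pd.concat([df, shifted], axis=1)
print(paired)
```

With this variant the last row of every company gets missing values in the "_pre" columns, so those rows can be dropped in one step instead of being handled group by group.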

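Define a function that computes some distance metrics (here, the two defined in the question). Exceptions are caught to handle rows that have nothing to compare against (this happens once per group, for the earliest year).

def compare_lists(s):
    l1 = s['sentences_pre']  # previous year
    l2 = s['sentences']      # current year
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2)-len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                         })
    except (TypeError, ValueError):
        # the '_pre' values are NaN when there is no previous year:
        # formatting NaN with %d raises ValueError, len(NaN) raises TypeError
        return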

This works on a sub-DataFrame filtered down to a single company:

df2 = df.query('company == "A"')
pd.concat([df2, df2.shift(-1).add_suffix('_pre')], axis=1).dropna().apply(compare_lists, axis=1)

Output:

       years  yoy_diff_len  yoy_nb_common company
0  2020–2019            -4              0       A
1  2019–2018             6              1       A
2  2018–2017             1              0       A
3  2017–2016             1              0       A
4  2016–2015            -7              0       A
5  2015–2014             4              0       A
6  2014–2013             1              0       A
7  2013–2012            -1              0       A
8  2012–2011            -5              1       A

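Now you can write a function that builds this DataFrame for each group and applies the computation:

def group_compare(df):
    # drop the earliest year of the group, which has no previous year to compare with
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)

And apply this function to each group:

df.groupby('company').apply(group_compare)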
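
The shift(-1) step relies on each company's rows already being ordered from the newest year to the oldest, as they are in the generated example. If real data is not guaranteed to arrive in that order, sorting first may be a reasonable precaution. A minimal sketch on hypothetical unsorted data (the name df_sorted is an assumption, not part of the answer):

```python
import pandas as pd

# Hypothetical unsorted input, invented for illustration
df = pd.DataFrame({
    "date": [2018, 2020, 2019, 2019, 2020],
    "sentences": [["c"], ["a", "b"], ["b"], ["x", "y"], ["x"]],
    "company": ["A", "A", "A", "B", "B"],
})

# Order rows by company, then newest year first, and renumber the index
df_sorted = (df.sort_values(["company", "date"], ascending=[True, False])
               .reset_index(drop=True))
print(df_sorted)
```

The sorted frame can then be passed to groupby('company').apply(group_compare) in place of df.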

Full code:

import numpy as np
import pandas as pd
import string

df = pd.DataFrame({'date':      np.tile(range(2020, 2010, -1), 3),
                   'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
                   'company':   np.repeat(list('ABC'), 10)})

def compare_lists(s):
    l1 = s['sentences_pre']  # previous year
    l2 = s['sentences']      # current year
    try:
        return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
                          'yoy_diff_len': len(l2)-len(l1),
                          'yoy_nb_common': len(set(l1).intersection(set(l2))),
                          'company': s['company'],
                         })
    except (TypeError, ValueError):
        # only reached if a row without a previous year slips through
        return

def group_compare(df):
    # drop the earliest year of the group, which has no previous year to compare with
    df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
    return df2.apply(compare_lists, axis=1)

# append .reset_index(level=0, drop=True) to remove the "company" index level
df.groupby('company').apply(group_compare)

Output:

            years     yoy_diff_len     yoy_nb_common     company
company                     
A     0     2020–2019            -8            0            A
      1     2019–2018             8            0            A
      2     2018–2017            -5            0            A
      3     2017–2016            -3            2            A
      4     2016–2015             1            3            A
      5     2015–2014             5            0            A
      6     2014–2013             0            0            A
      7     2013–2012            -2            0            A
      8     2012–2011             0            0            A
B    10     2020–2019             3            0            B
     11     2019–2018            -6            1            B
     12     2018–2017             3            0            B
     13     2017–2016            -5            1            B
     14     2016–2015             2            2            B
     15     2015–2014             4            1            B
     16     2014–2013             3            0            B
     17     2013–2012            -8            0            B
     18     2012–2011             1            1            B
C    20     2020–2019             8            1            C
     21     2019–2018            -7            0            C
     22     2018–2017             0            1            C
     23     2017–2016             7            0            C
     24     2016–2015            -3            0            C
     25     2015–2014             3            0            C
     26     2014–2013            -1            0            C
     27     2013–2012            -6            2            C
     28     2012–2011             4            2            C
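
Since the question also mentions similarity measures, another metric could be plugged into compare_lists in the same way. The sketch below uses Jaccard similarity on the sets of elements. This is a substitute chosen for illustration, not something the answer above uses, and the function name and the empty-list convention are assumptions.

```python
def jaccard_similarity(current, previous):
    """Share of distinct elements common to both lists, from 0.0 to 1.0."""
    a, b = set(current), set(previous)
    if not a and not b:
        # Convention chosen for this sketch: two empty lists count as identical
        return 1.0
    return len(a & b) / len(a | b)


print(jaccard_similarity(["a", "b", "c"], ["b", "c", "d"]))  # 0.5
```

To use it, an entry such as 'yoy_jaccard': jaccard_similarity(l2, l1) could be added to the dictionary built inside compare_lists.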

[Discussion]: