TL;DR:查看底部的完整代码
您必须将任务分解为更简单的子任务。基本上,您希望对连续行的数据框应用一个或多个计算,这按公司分组。这意味着您必须使用groupby 和apply。
让我们从生成示例数据框开始。这里我使用小写字母作为“句子”列的单词。
import numpy as np
import string
df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
'company': np.repeat(list('ABC'), 10)})
df
输出:
date sentences company
0 2020 [z] A
1 2019 [s, f, g, a, d, a, h, o, c] A
2 2018 [b] A
…
26 2014 [q] C
27 2013 [i, w] C
28 2012 [o, p, i, d, f, w, k, d] C
29 2011 [l, f, h, p] C
连接下一行(上一年)的“句子”列:
pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
输出:
date sentences company date_pre sentences_pre company_pre
0 2020 [z] A 2019.0 [s, f, g, a, d, a, h, o, c] A
1 2019 [s, f, g, a, d, a, h, o, c] A 2018.0 [b] A
2 2018 [b] A 2017.0 [x, n, r, a, s, d] A
3 2017 [x, n, r, a, s, d] A 2016.0 [u, n, g, u, k, s, v, s, o] A
4 2016 [u, n, g, u, k, s, v, s, o] A 2015.0 [v, g, d, i, b, z, y, k] A
5 2015 [v, g, d, i, b, z, y, k] A 2014.0 [q, o, p] A
6 2014 [q, o, p] A 2013.0 [j, s, s] A
7 2013 [j, s, s] A 2012.0 [g, u, l, g, n] A
8 2012 [g, u, l, g, n] A 2011.0 [v, p, y, a, s] A
9 2011 [v, p, y, a, s] A 2020.0 [a, h, c, w] B
…
定义一个函数来计算一些距离度量(这里是问题中定义的两个)。捕获 TypeError 以处理没有可比较的行的情况(每组出现一次)。
def compare_lists(s):
l1 = s['sentences_pre']
l2 = s['sentences']
try:
return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
'yoy_diff_len': len(l2)-len(l1),
'yoy_nb_common': len(set(l1).intersection(set(l2))),
'company': s['company'],
})
except TypeError:
return
这适用于过滤后仅匹配一家公司的子数据框:
df2 = df.query('company == "A"')
pd.concat([df2, df2.shift(-1).add_suffix('_pre')], axis=1).dropna().apply(compare_lists, axis=1
输出:
years yoy_diff_len yoy_nb_common company
0 2020–2019 -4 0 A
1 2019–2018 6 1 A
2 2018–2017 1 0 A
3 2017–2016 1 0 A
4 2016–2015 -7 0 A
5 2015–2014 4 0 A
6 2014–2013 1 0 A
7 2013–2012 -1 0 A
8 2012–2011 -5 1 A
现在您可以创建一个函数来构造每个组的每个数据帧并应用计算:
def group_compare(df):
df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1)
return df2.apply(compare_lists, axis=1)
并使用此函数应用于每个组:
df.groupby('company').apply(group_compare)
完整代码:
import numpy as np
import string
df = pd.DataFrame({'date': np.tile(range(2020, 2010, -1), 3),
'sentences': [np.random.choice(list(string.ascii_lowercase), size=np.random.randint(10)) for i in range(30)],
'company': np.repeat(list('ABC'), 10)})
def compare_lists(s):
l1 = s['sentences_pre']
l2 = s['sentences']
try:
return pd.Series({'years': '%d–%d' % (s['date'], s['date_pre']),
'yoy_diff_len': len(l2)-len(l1),
'yoy_nb_common': len(set(l1).intersection(set(l2))),
'company': s['company'],
})
except TypeError:
return
def group_compare(df):
df2 = pd.concat([df, df.shift(-1).add_suffix('_pre')], axis=1).dropna()
return df2.apply(compare_lists, axis=1)
## uncomment below to remove "company" index
df.groupby('company').apply(group_compare) #.reset_index(level=0, drop=True)
输出:
years yoy_diff_len yoy_nb_common company
company
A 0 2020–2019 -8 0 A
1 2019–2018 8 0 A
2 2018–2017 -5 0 A
3 2017–2016 -3 2 A
4 2016–2015 1 3 A
5 2015–2014 5 0 A
6 2014–2013 0 0 A
7 2013–2012 -2 0 A
8 2012–2011 0 0 A
B 10 2020–2019 3 0 B
11 2019–2018 -6 1 B
12 2018–2017 3 0 B
13 2017–2016 -5 1 B
14 2016–2015 2 2 B
15 2015–2014 4 1 B
16 2014–2013 3 0 B
17 2013–2012 -8 0 B
18 2012–2011 1 1 B
C 20 2020–2019 8 1 C
21 2019–2018 -7 0 C
22 2018–2017 0 1 C
23 2017–2016 7 0 C
24 2016–2015 -3 0 C
25 2015–2014 3 0 C
26 2014–2013 -1 0 C
27 2013–2012 -6 2 C
28 2012–2011 4 2 C