【Posted】: 2021-04-30 07:07:48
【Question】:
My goal is the output below.
| A | B | C | D | E | F |
|---|---|---|---|---|---|
| 0000 | ZZZ | 987 | QW1 | 8 | first three-four col and offset |
| 0000 | ZZZ | 987 | QW1 | -8 | first three-four col and offset |
| 1111 | AAA | 123 | AB1 | 1 | first three-four col and offset |
| 1111 | AAA | 123 | CD1 | -1 | first three-four col and offset |
| 2222 | BBB | 456 | EF1 | -4 | first three-four col and offset |
| 2222 | BBB | 456 | GH1 | -1 | first three-four col and offset |
| 2222 | BBB | 456 | IL1 | 5 | first three-four col and offset |
| 3333 | CCC | 789 | MN1 | 2 | first two col and offset |
| 3333 | CCC | 101 | MN1 | -2 | first two col and offset |
| 4444 | DDD | 121 | UYT | 6 | first two col and offset |
| 4444 | DDD | 131 | FB1 | -5 | first two col and offset |
| 4444 | DDD | 141 | UYT | -1 | first two col and offset |
| 5555 | EEE | 151 | CB1 | 3 | first two col and offset |
| 5555 | EEE | 161 | CR1 | -3 | first two col and offset |
| 6666 | FFF | 111 | CB1 | 4 | first or no match |
| 7777 | GGG | 222 | ZB1 | 10.5 | first three-four col and small offset |
| 7777 | GGG | 222 | ZB1 | -10 | first three-four col and small offset |
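Rule 1) The first three columns must be equal to each other. The fourth column does not matter; it may or may not be equal. The associated numbers (col E) of each combination must offset to zero (a combination may contain from 2 to X records).
Rule 2) The first two columns must be equal to each other. The fourth column does not matter; it may or may not be equal. The associated numbers (col E) of each combination must offset to zero (a combination may contain from 2 to X records).
Rule 3) No match.
Rule 4) The first three columns must be equal to each other. The fourth column does not matter; it may or may not be equal. The numbers (col E) of each combination may differ by AT MOST 0.5 and must NOT offset to zero (a combination may contain from 2 to X records).
Please see my code below.
I am fully aware that I have not written it in the most efficient way. Could you suggest a more efficient way to achieve this?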
# Rule 1: same A, B, C and the two E values sum to zero
for i in range(0, len(df)-1):
    for j in range(i+1, len(df)):
        if (df['A'][i] == df['A'][j]) & (df['B'][i] == df['B'][j]) & (df['C'][i] == df['C'][j]) & (df['E'][i] + df['E'][j] == 0):
            df['E'][i] = 'first three-four col and offset'
            df['E'][j] = 'first three-four col and offset'
# Rule 2: same A, B and the two E values sum to zero, skipping rows already labeled
for i in range(0, len(df)-1):
    for j in range(i+1, len(df)):
        if (df['A'][i] == df['A'][j]) & (df['B'][i] == df['B'][j]) & (df['E'][i] + df['E'][j] == 0) & (df['E'][i] != 'first three-four col and offset') & (df['E'][j] != 'first three-four col and offset'):
            df['E'][i] = 'first two col and offset'
            df['E'][j] = 'first two col and offset'
# Rule 4: same A, B, C and a non-zero sum within +/- 0.5, skipping rows already labeled
for i in range(0, len(df)-1):
    for j in range(i+1, len(df)):
        if (df['A'][i] == df['A'][j]) & (df['B'][i] == df['B'][j]) & (df['C'][i] == df['C'][j]) & (df['E'][i] + df['E'][j] != 0) & (df['E'][i] + df['E'][j] <= 0.5) & (df['E'][i] + df['E'][j] >= -0.5) & (df['E'][i] != 'first three-four col and offset') & (df['E'][j] != 'first three-four col and offset') & (df['E'][i] != 'first two col and offset') & (df['E'][j] != 'first two col and offset'):
            df['E'][i] = 'first three-four col and small offset'
            df['E'][j] = 'first three-four col and small offset'
Is there a way to get the expected result more efficiently?
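One possible direction, sketched here under a simplifying assumption: a "combination" is taken to be the whole group of rows sharing the key columns, rather than an arbitrary subset of them. With that assumption the nested loops can be replaced by `groupby(...).transform(...)`, applied rule by rule to the rows that are still unlabeled. The function name `label_offsets`, the `tol` and `eps` parameters, and the sample frame are illustrative and not part of the original post.

```python
import pandas as pd

NO_MATCH = "first or no match"

def label_offsets(df, tol=0.5, eps=1e-9):
    """Label rows in a new column F, assuming whole groups must offset."""
    out = df.copy()
    out["F"] = NO_MATCH
    # (key columns, label, largest allowed |sum of E|), in priority order
    rules = [
        (["A", "B", "C"], "first three-four col and offset", 0.0),
        (["A", "B"], "first two col and offset", 0.0),
        (["A", "B", "C"], "first three-four col and small offset", tol),
    ]
    for keys, label, limit in rules:
        # Only rows not labeled by an earlier rule take part
        todo = out[out["F"] == NO_MATCH]
        grouped = todo.groupby(keys)["E"]
        total = grouped.transform("sum")
        count = grouped.transform("count")
        # A group needs at least 2 rows and a sum within the limit
        hit = (count >= 2) & (total.abs() <= limit + eps)
        out.loc[hit[hit].index, "F"] = label
    return out

rows = [
    ("0000", "ZZZ", 987, "QW1", 8), ("0000", "ZZZ", 987, "QW1", -8),
    ("1111", "AAA", 123, "AB1", 1), ("1111", "AAA", 123, "CD1", -1),
    ("2222", "BBB", 456, "EF1", -4), ("2222", "BBB", 456, "GH1", -1),
    ("2222", "BBB", 456, "IL1", 5),
    ("3333", "CCC", 789, "MN1", 2), ("3333", "CCC", 101, "MN1", -2),
    ("4444", "DDD", 121, "UYT", 6), ("4444", "DDD", 131, "FB1", -5),
    ("4444", "DDD", 141, "UYT", -1),
    ("5555", "EEE", 151, "CB1", 3), ("5555", "EEE", 161, "CR1", -3),
    ("6666", "FFF", 111, "CB1", 4),
    ("7777", "GGG", 222, "ZB1", 10.5), ("7777", "GGG", 222, "ZB1", -10),
]
sample = pd.DataFrame(rows, columns=["A", "B", "C", "D", "E"])
result = label_offsets(sample)
print(result)
```

Each rule here costs one grouping pass instead of a comparison of every pair of rows. Note that this is not equivalent to the pairwise loops: it also catches groups of three or more rows that sum to zero, but it will miss a zero-sum subset hidden inside a larger group that does not sum to zero as a whole.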
I also know that the following code does not work. I tried to update the record with the correct comment, but to no avail.
for ... :
    if.... :
        df['col'][index] = 'comment'
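Let's further assume that I want to keep my code in this "not very efficient" form, which seems to work (except for the last line of code). How should I change that last line so that my script works?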
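A likely cause is chained indexing: `df['col'][index] = ...` first selects a column and then assigns into the result, and pandas may apply that assignment to a temporary copy instead of the original frame. A minimal sketch of the usual alternative, a single `.loc` call with the row label and column name together, is shown below. It assumes the frame has the default integer index and that the comment goes into a separate column `F`; the tiny frame and the label `'offset'` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"E": [8, -8, 4]})
df["F"] = ""  # separate comment column, so E stays numeric

for i in range(len(df) - 1):
    for j in range(i + 1, len(df)):
        if df.loc[i, "E"] + df.loc[j, "E"] == 0:
            # One indexing operation: row label and column name together
            df.loc[i, "F"] = "offset"
            df.loc[j, "F"] = "offset"

print(df)
```

Writing the comment into its own column also keeps the later `df['E'][i] + df['E'][j]` comparisons from mixing numbers with strings.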
【Comments】:
- Have you checked pd.DataFrame.where()? pandas.pydata.org/docs/reference/api/…. It would also help to see the raw data.
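To illustrate what the comment may have had in mind, here is a small sketch using `Series.where`, which behaves like `DataFrame.where` for a single column: values are kept where the condition is True and replaced elsewhere. The three-row frame, the grouping on column `A` alone, and the label `'offset'` are assumptions made for brevity.

```python
import pandas as pd

df = pd.DataFrame({"A": ["0000", "0000", "6666"], "E": [8.0, -8.0, 4.0]})
df["F"] = "first or no match"

# Per-row group statistics, aligned with the original index
group_total = df.groupby("A")["E"].transform("sum")
group_count = df.groupby("A")["E"].transform("count")
offsets = (group_count >= 2) & (group_total == 0)

# Keep F where the row is NOT an offset; otherwise write the label
df["F"] = df["F"].where(~offsets, "offset")
print(df)
```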