Efficient for loops for Pandas Dataframes: answers

【Question title】: Efficient for loops for Pandas Dataframes
【Posted】: 2020-07-09 05:53:16
【Question】:

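I have 2 Pandas DataFrames, X_ol and y_ol, with shapes 29000 x 29 and 29000 x 21 respectively, and I am running a nested for loop over this data to generate more data (shown below). What I am trying to achieve with this for loop is this: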

    DataFrame X_ol                              DataFrame y_ol
    id     Date      c1      c2      c3         c1      c2      c3
    1      2000      0       1       1          0       1       1
    2      2001      1       0       1          1       0       1
    3      2002      1       1       0          1       1       0
    4      2003      1       1       1          1       1       1

    # (New DataFrame X)                         # (Second New DataFrame, y)
    id     Date      c1      c2      c3         c1      c2      c3 
    1      2000      0       0       1          0       1       0
    1      2000      0       1       0          0       0       1
    2      2001      0       0       1          1       0       0
    2      2001      1       0       0          0       0       1
    3      2002      0       1       0          1       0       0
    3      2002      1       0       0          0       1       0
    4      2003      0       1       1          1       0       0
    4      2003      1       0       1          0       1       0
    4      2003      1       1       0          0       0       1

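So it goes through the y_ol DataFrame row by row. For every cell with a value of 1, it creates a new row in DataFrame X with that cell switched off, and a new row in DataFrame y with the corresponding cell switched on and all other values on that row switched off. I wrote this code, which does it correctly but takes a lot of time: more than 12 minutes to generate 2 DataFrames of 60,000 rows. Is there a built-in pandas function/method that would make this more efficient, or another approach that eliminates the for loops entirely?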

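from copy import deepcopy
import pandas as pd

X, y = pd.DataFrame(), pd.DataFrame()  # accumulators (not shown in the original post)
for i in range(len(y_ol)):
    ab = y_ol.iloc[i].where(y_ol.iloc[i] == 1)
    abInd = ab[ab == 1.0].index  # labels of the columns set to 1 in row i
    for j in abInd:
        y_tmp = deepcopy(y_ol.iloc[i:i+1, :])
        y_ol.loc[i, j] = 0  # temporarily switch cell (i, j) off (assumes a default integer index)
        conc = pd.concat([X_ol.iloc[i:i+1, :], y_ol.iloc[i:i+1, :]], axis=1)
        X = pd.concat([X, conc])  # was X.append(conc); DataFrame.append was removed in pandas 2.0
        y_tmp.iloc[:, :] = 0
        y_tmp[j] = 1  # one-hot row for column j
        y = pd.concat([y, y_tmp])  # was y.append(y_tmp)
        y_ol.loc[i, j] = 1  # restore the original value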

Thanks in advance.

【Comments】:

Just to be sure: are the columns c1, c2 and c3 the same, and aligned row by row, between X_ol and y_ol?
@Ben.T Yes

Tags: python pandas for-loop

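【Solution 1】:

To create the new y_ol, you can use stack after replacing the 0s with NaN using where. Then reset_index on level 1, which holds the name of the column in y_ol where the value was originally 1.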

df_ = y_ol.where(y_ol.eq(1)).stack().reset_index(level=1)
print (df_)
  level_1    0
0      c2  1.0
0      c3  1.0
1      c1  1.0
1      c3  1.0
2      c1  1.0
2      c2  1.0
3      c1  1.0
3      c2  1.0
3      c3  1.0
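
A self-contained sketch of this step on the sample data from the question. The explicit `dropna()` is an addition that is not in the answer: how `stack` treats missing values has changed across pandas versions, and the extra call does nothing wherever `stack` has already dropped them.

```python
import pandas as pd

# Sample target frame from the question.
y_ol = pd.DataFrame({'c1': [0, 1, 1, 1], 'c2': [1, 0, 1, 1], 'c3': [1, 1, 0, 1]})

# Keep only the cells equal to 1 (everything else becomes NaN), switch to
# long format, and move the column-name level of the index into a column.
df_ = (
    y_ol.where(y_ol.eq(1))
    .stack()
    .dropna()  # added for robustness; not part of the original answer
    .reset_index(level=1)
)

# One output row per 1-cell: the index is the source row,
# and 'level_1' is the source column.
print(df_.index.tolist())
print(df_['level_1'].tolist())
```

The index of `df_` repeats each source row label once per 1-cell, which is what later allows the rows of X_ol to be duplicated.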

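Use this column, named level_1, and numpy broadcasting to compare it with the column names of y_ol and get True/False values. Change the type to int and build the new y_ol DataFrame as required.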

y_ol_new = pd.DataFrame((df_['level_1'].to_numpy()[:, None] 
                         == y_ol.columns.to_numpy()).astype(int),
                        columns=y_ol.columns)
print (y_ol_new)
   c1  c2  c3
0   0   1   0
1   0   0   1
2   1   0   0
3   0   0   1
4   1   0   0
5   0   1   0
6   1   0   0
7   0   1   0
8   0   0   1
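
The broadcasting comparison can be seen in isolation with plain NumPy arrays. In this sketch, `labels` and `columns` are hypothetical stand-ins for `df_['level_1']` and `y_ol.columns`: a column vector of shape (n, 1) compared with a row of shape (k,) gives an (n, k) boolean matrix with exactly one True per row.

```python
import numpy as np

labels = np.array(['c2', 'c3', 'c1', 'c3'])   # one label per output row
columns = np.array(['c1', 'c2', 'c3'])        # all possible labels

# labels[:, None] has shape (4, 1); comparing it with shape (3,)
# broadcasts both sides to shape (4, 3).
one_hot = (labels[:, None] == columns).astype(int)
print(one_hot)
# [[0 1 0]
#  [0 0 1]
#  [1 0 0]
#  [0 0 1]]
```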

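Now for X_ol, you can reindex it with the index of df_ to duplicate the rows. Then you just need to subtract y_ol_new from the matching columns, which switches off the corresponding cell in each row.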

X_ol_new = X_ol.reindex(df_.index).reset_index(drop=True)
X_ol_new[y_ol_new.columns] -= y_ol_new
print (X_ol_new)
   id  Date  c1  c2  c3
0   1  2000   0   0   1
1   1  2000   0   1   0
2   2  2001   0   0   1
3   2  2001   1   0   0
4   3  2002   0   1   0
5   3  2002   1   0   0
6   4  2003   0   1   1
7   4  2003   1   0   1
8   4  2003   1   1   0
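
Putting the steps together, here is a minimal end-to-end sketch on the question's sample data. It is a variation rather than the answer's exact code: it locates the 1-cells by position with `np.nonzero` instead of `stack`, and the function name `split_ones` is made up for illustration. It assumes the columns of `y_ol` also exist in `X_ol` with the same values, as confirmed in the comments on the question.

```python
import numpy as np
import pandas as pd

def split_ones(X_ol, y_ol):
    """Return (X_new, y_new) with one output row per cell of y_ol equal to 1."""
    target_cols = list(y_ol.columns)
    # Positions of the 1-cells, in row-major order (row by row, left to right).
    rows, cols = np.nonzero(y_ol.to_numpy() == 1)
    # One-hot matrix: output row k has its single 1 in column cols[k].
    one_hot = (cols[:, None] == np.arange(len(target_cols))).astype(int)
    y_new = pd.DataFrame(one_hot, columns=target_cols)
    # Repeat each source row once per 1-cell, then switch that cell off.
    X_new = X_ol.iloc[rows].reset_index(drop=True)
    X_new[target_cols] = X_new[target_cols].to_numpy() - one_hot
    return X_new, y_new

X_ol = pd.DataFrame({'id': [1, 2, 3, 4], 'Date': [2000, 2001, 2002, 2003],
                     'c1': [0, 1, 1, 1], 'c2': [1, 0, 1, 1], 'c3': [1, 1, 0, 1]})
y_ol = X_ol[['c1', 'c2', 'c3']].copy()

X_new, y_new = split_ones(X_ol, y_ol)
print(X_new)
print(y_new)
```

Working by position avoids any dependence on index labels, at the cost of assuming that `X_ol` and `y_ol` have the same number of rows in the same order.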

【Comments】:

Hi, your solution works well and takes only a few seconds. Thank you very much.
@TochiBedford Glad it worked for you, happy coding :)

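【Solution 2】:

I would process the DataFrames column by column: for each column of y_ol, take the rows where that column contains a 1, then concatenate the temporary DataFrames obtained for each column.

Assuming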

import pandas as pd

x_ol = pd.DataFrame({'id': [1, 2, 3, 4],  'Date': [2000, 2001, 2002, 2003],
                     'c1': [0, 1, 1, 1], 'c2': [1, 0, 1, 1], 'c3': [1, 1, 0, 1]})
y_ol = pd.DataFrame({'c1': [0, 1, 1, 1], 'c2': [1, 0, 1, 1], 'c3': [1, 1, 0, 1]})

I would build the new DataFrames this way:

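cols = ['c1', 'c2', 'c3']
# kind='mergesort' is a stable sort: rows with equal keys keep their
# concatenation order, so x_new and y_new stay aligned row by row
x_new = pd.concat((x_ol[y_ol[c] == 1].assign(**{c: 0}) for c in cols)).sort_values('id', kind='mergesort')
y_new = pd.concat((y_ol[y_ol[c] == 1].assign(**{x: 1 if x == c else 0 for x in cols})
                   for c in cols)).sort_index(kind='mergesort')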

It gives, as expected:

print(x_new)

   id  Date  c1  c2  c3
0   1  2000   0   0   1
0   1  2000   0   1   0
1   2  2001   0   0   1
1   2  2001   1   0   0
2   3  2002   0   1   0
2   3  2002   1   0   0
3   4  2003   0   1   1
3   4  2003   1   0   1
3   4  2003   1   1   0

and

print(y_new)

   c1  c2  c3
0   0   1   0
0   0   0   1
1   1   0   0
1   0   0   1
2   1   0   0
2   0   1   0
3   1   0   0
3   0   1   0
3   0   0   1
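
One way to check that the two outputs stay row-aligned is to verify that every row of the second frame is one-hot and that adding it back to the first frame restores the source row. The sketch below rebuilds both frames on the sample data. Sorting both by index with `kind='mergesort'` (a stable sort) is a choice made here so that rows sharing an index keep the same relative order in both frames; the names `one_hot_ok` and `restored_ok` are illustrative.

```python
import pandas as pd

cols = ['c1', 'c2', 'c3']
x_ol = pd.DataFrame({'id': [1, 2, 3, 4], 'Date': [2000, 2001, 2002, 2003],
                     'c1': [0, 1, 1, 1], 'c2': [1, 0, 1, 1], 'c3': [1, 1, 0, 1]})
y_ol = x_ol[cols].copy()

# One temporary frame per column, concatenated and then stably sorted by index.
x_new = pd.concat(
    x_ol[y_ol[c] == 1].assign(**{c: 0}) for c in cols
).sort_index(kind='mergesort')
y_new = pd.concat(
    y_ol[y_ol[c] == 1].assign(**{k: int(k == c) for k in cols}) for c in cols
).sort_index(kind='mergesort')

# Check 1: every row of y_new contains exactly one 1.
one_hot_ok = bool((y_new[cols].sum(axis=1) == 1).all())

# Check 2: putting the switched-off cell back gives the source row of x_ol.
restored = x_new[cols].to_numpy() + y_new[cols].to_numpy()
source = x_ol.loc[x_new.index, cols].to_numpy()
restored_ok = bool((restored == source).all())

print(one_hot_ok, restored_ok)
```

If the two frames were sorted in a way that reordered rows with the same index differently, the second check would be expected to fail, which makes it a cheap guard to run on the full 29000-row data.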

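【Comments】:

Thank you very much, but your approach still uses for loops, which is what caused the slowness in the first place. The final output will be about 60,000 rows, and the for loops take a terrible amount of time. Thanks again.
@TochiBedford The loop here is over the columns, not over the rows, so in terms of efficiency it will be much faster than your original solution :)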