How to perform operations over arrays in a pandas DataFrame efficiently? Answers

【Question title】: How to perform operations over arrays in a pandas DataFrame efficiently?
【Posted】: 2022-01-12 21:44:34
【Question description】:

I have a pandas DataFrame that contains NumPy arrays in some of its columns:

import numpy as np, pandas as pd

data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}

df = pd.DataFrame(data)

I need to store a large frame like this one in a CSV file, but the arrays must be written as strings, as shown below:

col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10

To get this result I currently iterate over every column and every row of the DataFrame, but my solution does not seem efficient.

This is my current solution:

pd.options.mode.chained_assignment = None
array_columns = [column for column in df.columns if isinstance(df[column].iloc[0], np.ndarray)]

for index, row in df.iterrows():
    for column in array_columns:
        # Here 'tuple' is only used to replace brackets for parenthesis
        df[column][index] = str(tuple(row[column]))

I tried using apply, although I have heard it is usually not an efficient option:

def array_to_str(array):
    return str(tuple(array))

df[array_columns] = df[array_columns].apply(array_to_str)

But my arrays turned into NaN:

   col1  col2  col3
0   NaN   NaN     9
1   NaN   NaN    10

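I tried other, similar solutions, but this error keeps coming up:

ValueError: Must have equal len keys and value when setting with an iterable

Is there a more efficient way to perform the same operation? My real DataFrame can contain many columns and thousands of rows.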

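【Comments】:

CSV is text-based and should not be used for nested data structures. Why do you need CSV? Can you store the data in binary form, e.g. df.to_pickle?
One requirement is to match the output format of the data that people can get from a specific web page/archive belonging to a branch of the company I work for. This quoted, parenthesized format has been in use for years and will not change.
df[column][index] = scalar should never be used. Use df.at[index, column] = scalar instead. Also, please don't do this: pd.options.mode.chained_assignment = None. Those warnings exist for a good reason.
In any case, nothing you do will be particularly efficient; holding numpy.ndarray objects in a DataFrame is not what pandas was designed for.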
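The df.at suggestion from the comments can be sketched as follows. This is only an illustration of label-based scalar assignment (row label first, then column label), not a recommendation to keep the loop; the small frame and the use of tolist() are assumptions made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [np.array([1, 2]), np.array([3, 4])],
    "col3": [9, 10],
})

# df.at takes the row label first and the column label second.
# It writes directly into the frame, so no chained assignment is involved.
for index in df.index:
    array = df.at[index, "col1"]
    # tolist() converts to plain Python numbers before formatting
    df.at[index, "col1"] = str(tuple(array.tolist()))

print(df)
```

This still visits every cell one by one, so it addresses the warning, not the speed.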

Tags: python arrays pandas dataframe numpy

【Solution 1】:

Try this:

tupcols = ['col1', 'col2']
df[tupcols] = df[tupcols].apply(lambda col: col.apply(tuple)).astype('str')
df.to_csv()
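A self-contained variant of the same idea, as a rough sketch. Two things here are assumptions rather than part of the answer: the helper name array_to_str, and the call to tolist(), which is added because recent NumPy releases (2.0 and later) print scalars inside a tuple as np.int64(1) instead of 1. Series.map is used so the helper sees one array at a time, whereas DataFrame.apply hands it a whole column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [np.array([1, 2]), np.array([3, 4])],
    "col2": [np.array([5, 6]), np.array([7, 8])],
    "col3": [9, 10],
})

def array_to_str(arr):
    # tolist() yields plain Python numbers, so the text does not
    # depend on how the installed NumPy prints its scalar types
    return str(tuple(arr.tolist()))

tupcols = ["col1", "col2"]
out = df.copy()  # leave the original frame with its arrays untouched
for col in tupcols:
    out[col] = out[col].map(array_to_str)

csv_text = out.to_csv(index=False)
print(csv_text)
```

With the default quoting, to_csv puts double quotes around any field that contains a comma, so the tuple strings come out quoted without further work.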

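【Discussion】:

【Solution 2】:

You need to convert the arrays to tuple to get the right representation. To do that, you can apply the tuple function to the columns that have object dtype.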

to_save = df.apply(lambda x: x.map(lambda y: tuple(y)) if x.dtype=='object' else x)

to_save.to_csv(index=False)

Output:

col1,col2,col3
"(1, 2)","(5, 6)",9
"(3, 4)","(7, 8)",10

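Note: this is risky if the frame has other object-dtype columns, such as string columns, because tuple would be applied to those values as well and would split each string into its characters.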
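One way to sidestep that risk is to select columns by their contents instead of by dtype. The sketch below is an assumption-laden illustration: the helper is_array_column and the string column name are hypothetical, and the check inspects every value, which costs one extra pass over each object column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "col1": [np.array([1, 2]), np.array([3, 4])],
    "name": ["ab", "cd"],  # hypothetical string column
    "col3": [9, 10],
})

# What the note warns about: tuple() splits a string into characters
assert tuple("ab") == ("a", "b")

def is_array_column(series):
    # True only for object columns whose values are all NumPy arrays
    if series.dtype != object:
        return False
    return bool(series.map(lambda v: isinstance(v, np.ndarray)).all())

array_columns = [c for c in df.columns if is_array_column(df[c])]

to_save = df.copy()
for c in array_columns:
    to_save[c] = to_save[c].map(lambda a: str(tuple(a.tolist())))

csv_text = to_save.to_csv(index=False)
print(csv_text)
```

Checking only the first row, as the question does, would be cheaper but assumes every column is homogeneous.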

【Discussion】:

【Solution 3】:

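import numpy as np, pandas as pd

data = {'col1': [np.array([1, 2]), np.array([3, 4])],
        'col2': [np.array([5, 6]), np.array([7, 8])],
        'col3': [9, 10]}

df = pd.DataFrame(data)
# Convert each array to a tuple so that it prints with parentheses
# (DataFrame.applymap is deprecated since pandas 2.1 in favor of DataFrame.map)
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: tuple(x))
# Wrap each tuple's string form in literal double quotes
df[['col1', 'col2']] = df[['col1', 'col2']].applymap(lambda x: ''' "{}" '''.format(x))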

         col1        col2  col3
0   "(1, 2)"    "(5, 6)"      9
1   "(3, 4)"    "(7, 8)"     10
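Whether the literal quotes are needed depends on how the frame is written out afterwards. As a small sketch, assuming the file is produced with to_csv and its default quoting, the writer already quotes fields that contain a comma, and quote characters that are part of the value get escaped by doubling. The two one-row frames below are made up for the comparison.

```python
import pandas as pd

# Same value, once as a plain string and once wrapped in literal quotes
plain = pd.DataFrame({"col1": ["(1, 2)"], "col3": [9]})
wrapped = pd.DataFrame({"col1": [' "(1, 2)" '], "col3": [9]})

plain_csv = plain.to_csv(index=False)
wrapped_csv = wrapped.to_csv(index=False)

print(plain_csv)    # the field is quoted by the CSV writer itself
print(wrapped_csv)  # the embedded quotes are escaped by doubling them
```

So for the CSV layout asked for in the question, the plain tuple string appears to be enough; the manual wrapping mainly matters when printing the frame.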

【Discussion】: