【Question title】: Finding nested columns in a pandas dataframe
【Posted】: 2020-07-26 11:56:55
【Question】:

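I have a large dataset in (compressed) JSON format with many columns. I am trying to convert it to Parquet for downstream processing. Some columns have a nested structure. For now I want to ignore that structure and write those columns out as (JSON) strings.

So for the columns I have already identified, I am doing: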

df[column] = df[column].astype(str)

However, I don't know which columns are nested and which are not. When I write to Parquet, I get this message:

<stack trace redacted> 

  File "pyarrow/_parquet.pyx", line 1375, in pyarrow._parquet.ParquetWriter.write_table
  File "pyarrow/error.pxi", line 78, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Nested column branch had multiple children: struct<coordinates: list<item: double>, type: string>

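This suggests that I failed to convert one of my columns from a nested object to a string. But which column is to blame? How can I find out?

When I print the .dtypes of my pandas dataframe, I cannot tell strings from nested values, because both show up as object.

Edit: the error does hint at the nested column by showing the struct details, but tracking it down that way is time-consuming. Also, it only reports the first error, which is annoying if you have several nested columns.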

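【Question comments】:

When you say nested columns, do you mean any column containing Python objects (list, dict, etc.)? And do you want to convert those to strings?
Some columns in your dataframe seem to contain C objects that pyarrow.parquet.write_table cannot handle. "Nested column" is a Parquet term and does not mean much for a pandas dataframe. Please define these terms clearly.
Maybe df.applymap(type), to get the type of every cell in the dataframe... df.applymap(type).eq(dict).any() returns True for each column that has a dict in any of its cells, so we can use it to filter the columns.
@ansev I have to do this with a streaming dataset from the Outlook API, whose data changes all the time and sometimes arrives with nested columns and sometimes without. Your approach is very similar to mine.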
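
The per-cell type check suggested in the comments could be sketched roughly as follows. This is a variant, not the commenter's exact code: it compares type names as strings and uses Series.map column by column instead of applymap. The sample dataframe and its column names (geo, tags, count, name) are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: column names and values are invented for this sketch.
df = pd.DataFrame({
    "geo": [{"type": "Point", "coordinates": [1.0, 2.0]}, None],
    "tags": [["a", "b"], ["c"]],
    "count": [1, 2],
    "name": ["x", "y"],
})

# Name of the Python type held in every cell, computed column by column.
cell_types = df.apply(lambda col: col.map(lambda value: type(value).__name__))

# True for each column that holds at least one dict or list in any row.
has_nested = cell_types.isin(["dict", "list"]).any()
nested_cols = has_nested[has_nested].index.tolist()
print(nested_cols)
```

Because every cell is inspected, this also catches a column whose first row is empty but whose later rows are nested, at the cost of a full pass over the data.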

Tags: python python-3.x pandas pyarrow

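【Solution 1】:

I ran into a similar problem with PySpark and streaming datasets, where some columns were nested and some were not.

Given that your dataframe might look like this: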

import pandas as pd

df = pd.DataFrame({'A' : [{1 : [1,5], 2 : [15,25], 3 : ['A','B']}],
                   'B' : [[[15,25,61],[44,22,87],['A','B',44]]],
                   'C' : [((15,25,87),(22,91))],
                   'D' : 15,
                   'E' : 'A'
                  })


print(df)

                                         A  \
0  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B                         C   D  E  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]  ((15, 25, 87), (22, 91))  15  A

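We can stack the dataframe and use apply with type to get each column's type, then turn the result into a dictionary. Note that head(1) means only the first row is inspected.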

df.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
out:
{'A': dict, 'B': list, 'C': tuple, 'D': int, 'E': str}

With this, we can write a function that returns a tuple of the nested and non-nested columns.

Function:

def find_types(dataframe):

    col_dict = dataframe.head(1).stack().apply(type).reset_index(0,drop=True).to_dict()
    unnested_columns = [k for (k,v) in col_dict.items() if v not in (dict,set,list,tuple)]
    nested_columns = list(set(col_dict.keys()) - set(unnested_columns))
    return nested_columns,unnested_columns

In action:

nested,unnested = find_types(df)

df[unnested]

   D  E
0  15  A

print(df[nested])

                          C                                        A  \
0  ((15, 25, 87), (22, 91))  {1: [1, 5], 2: [15, 25], 3: ['A', 'B']}   

                                          B  
0  [[15, 25, 61], [44, 22, 87], [A, B, 44]]
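
Since find_types only looks at the first row, a column whose first value is missing or not nested would be classified as unnested even if later rows hold dicts or lists. A minimal sketch of a variant that scans every row of the object-dtype columns and keeps the original column order is shown below. The name find_nested_columns, the NESTED_TYPES tuple, and the sample data are assumptions for illustration, not part of the answer above.

```python
import pandas as pd

# Assumption: these are the Python types that should count as "nested".
NESTED_TYPES = (dict, list, tuple, set)


def find_nested_columns(dataframe, nested_types=NESTED_TYPES):
    """Split columns into (nested, unnested), scanning all rows of object columns."""
    nested, unnested = [], []
    for col in dataframe.columns:
        series = dataframe[col]
        # Only object-dtype columns can hold dicts, lists, and similar containers;
        # any() stops at the first nested value it finds.
        is_nested = series.dtype == object and any(
            isinstance(value, nested_types) for value in series
        )
        (nested if is_nested else unnested).append(col)
    return nested, unnested


# Hypothetical data: column A is only nested from the second row onwards.
df = pd.DataFrame({
    "A": [None, {"k": [1, 5]}],
    "B": [[15, 25], [44, 22]],
    "D": [15, 16],
    "E": ["A", "B"],
})
nested, unnested = find_nested_columns(df)
print(nested, unnested)
```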

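【Comments】:

【Solution 2】:

Converting nested structures to strings

If I understand your question correctly, you want to serialize the nested Python objects (lists, dicts) in df to JSON strings and leave the other elements unchanged. It is best to write your own conversion function: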

import json

def json_serializer(obj):
    # isinstance needs a tuple of types; add any other types you consider nested
    if isinstance(obj, (list, dict)):
        return json.dumps(obj)
    return obj

# on pandas >= 2.1, DataFrame.map is the non-deprecated name for applymap
df = df.applymap(json_serializer)

If the dataframe is large, using astype(str) will be faster.

nested_cols = []
for c in df:
    # isinstance needs a tuple of types, not a list
    if any(isinstance(obj, (list, dict)) for obj in df[c]):
        nested_cols.append(c)

for c in nested_cols:
    df[c] = df[c].astype(str) # this converts every element in the column, regardless of its type

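Thanks to the short-circuit evaluation in any(...), this approach has a performance advantage: it returns as soon as it hits the first nested object in a column and does not waste time checking the rest. If any of the "Dtype Introspection" methods fits your data, using one of those will be faster still.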
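
One caveat worth illustrating: astype(str) stores the Python repr of each object, which is not the same as JSON (single quotes instead of double quotes, for example). If the strings need to be parsed as JSON later, applying json.dumps to the nested columns only may be preferable. The sketch below contrasts the two. The sample dataframe is invented for illustration.

```python
import json

import pandas as pd

# Hypothetical data: one nested column (A) and one plain integer column (D).
df = pd.DataFrame({
    "A": [{"x": [1, 2]}, {"x": [3]}],
    "D": [15, 16],
})

# astype(str) calls str() on each dict, giving the Python repr.
as_repr = df["A"].astype(str)

# json.dumps gives a string that json.loads can read back.
as_json = df["A"].map(json.dumps)

print(as_repr.iloc[0])  # {'x': [1, 2]}
print(as_json.iloc[0])  # {"x": [1, 2]}
```

The integer column D is left untouched in both cases, which matches the goal of converting only the nested columns.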

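Check the latest version of pyarrow

I assume these nested structures only need to be converted to strings because they cause an error in pyarrow.parquet.write_table. You may not need to convert them at all, because the issue with handling nested columns in pyarrow has reportedly been solved recently (March 29, 2020, version 0.17.0). Support may still be problematic, though, and is under active discussion.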

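【Comments】:

This does not answer the question. I am trying to find out which columns in the dataframe contain data such as dicts or lists rather than strings. I have no intention of dropping those values. As I showed in the question, I can convert those columns to strings, but I don't want to convert every column, because other columns hold integer or boolean values.
@DanielKats Could you at least provide a minimal reproducible example? I am really struggling to understand your question. What does your input look like, and what is the expected output?
I misunderstood your sentence "For now I want to ignore that structure...". Now I understand that you want to serialize those structures (lists, dicts) to JSON strings and leave the other columns unchanged.

【Solution 3】:

If you just want to find out which columns are the culprits, write a loop that writes one column at a time and records which ones fail...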

bad_cols = []
for i in range(df.shape[1]):
    try:
        df.iloc[:, [i]].to_parquet(...)
    except KeyboardInterrupt:
        raise
    except Exception:  # you may want to catch ArrowInvalid exceptions instead
        bad_cols.append(i)
print(bad_cols)

【Comments】:

【Solution 4】:

Using a generic utility function in pandas such as infer_dtype(), you can determine whether a column is nested.

from pandas.api.types import infer_dtype

for col in df.columns:
    if infer_dtype(df[col]) == 'mixed':
        # 'mixed' is the catchall for anything that is not otherwise specialized
        df[col] = df[col].astype('str')

If you are targeting a specific data type, see Dtype Introspection.
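
To get a feel for what infer_dtype reports, the sketch below runs it over a few hypothetical columns (the names and values are made up). Since 'mixed' is a catchall, it is not specific to nested data: as far as I can tell, a column that mixes strings and floats is also reported as 'mixed', so this check is best treated as a coarse filter rather than a precise test for nesting.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Hypothetical columns: two nested, two plain, and one non-nested mixture.
df = pd.DataFrame({
    "dicts": [{"a": 1}, {"b": 2}],
    "lists": [[1, 2], [3]],
    "text": ["x", "y"],
    "ints": [1, 2],
    "text_and_float": ["x", 1.5],
})

# Map each column name to the label infer_dtype assigns to it.
inferred = {col: infer_dtype(df[col]) for col in df.columns}
print(inferred)

# Columns the 'mixed' check would convert to strings.
mixed_cols = [col for col, kind in inferred.items() if kind == "mixed"]
print(mixed_cols)
```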

【Comments】: