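CSV from list of dictionaries with differing length and keys (answers)

【Question title】: CSV from list of dictionaries with differing length and keys
【Posted】: 2020-05-25 19:15:36
【Question】:

I have a list of dictionaries that I want to write to a csv file. The first dictionary has a different length and different keys from the dictionaries that follow.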

dict_list = [{"A": 1, "B": 2}, {"C": 3, "D": 4, "E": 5}, {"C": 6, "D": 7, "E": 8}, ...]

How can I write this to a csv file so that the file looks like this:

A B C D E
1 2 3 4 5
    6 7 8
    . . .

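【Comments】:

Do you really want 1,2,3,4,5 on a single line when they come from 2 different dictionaries?

Tags: python python-3.x csv dictionary data-conversion

【Solution 1】:

You can also do this using only the built-in functionality that ships with Python. My example below is similar to the one proposed by @Serge Ballesta. The code is as follows: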

import csv

# sample data
data = [{'A': 1, 'B': 2}, {'C': 3, 'D': 4, 'E': 5}, {'C': 6, 'D': 7, 'E': 8}]
# Collect the field names from the elements of **data** (they are dict objects)
# and store them in a **set** so that each name appears only once
fields = set()
for item in data:
    names = set(item.keys())
    fields = fields | names   # set union via the | operator

fields = list(fields)   # convert the set of field names into a list
# and sort it so that the columns come out in a stable order :)
fields.sort()

# Now let's write a function that returns a cleaned copy of the original data,
# in which all data items have the same field names.

def clean_data(origdata, fieldnames):
    """Return a new data set in which every item has the same fields.

    Parameters
    ----------
    origdata: list of dict
         original data which will be cleaned or harmonized according to the field names
    fieldnames: list of strings
         field names in the new data items

    Returns
    -------
    A new list of dict in which all dict items have the same keys
    (i.e. fieldnames). The original dicts are left unchanged.
    """
    newdata = []
    for dataitem in origdata:
        # Work on a copy so that the caller's dicts are not modified in place
        row = dict(dataitem)
        for key in fieldnames:
            # If **key** is missing from this item, add it with value ' '
            row.setdefault(key, ' ')
        newdata.append(row)

    return newdata


def main():
    """Test the above function and display the result"""
    newdata = clean_data(data, fields)

    # write the data to a csv file
    with open("data.csv", "w", newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fields)
        writer.writeheader()
        for row in newdata:
            writer.writerow(row)

    # Now let's load our newly written csv file and print the content
    # -- some fancy display formatting here: not needed but I like it. :)
    nfields = len(fields)
    fmt = " %s |" * nfields
    headInfo = fmt % tuple(fields)
    line = '-' * (len(headInfo) + 1)
    print(line)
    print("|" + headInfo)
    print(line)
    with open("data.csv", "r", newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for item in reader:
            row = [item[field] for field in fields]
            print("|" + fmt % tuple(row))

    print(line)



main()

The script above produces the following output:

---------------------
| A | B | C | D | E |
---------------------
| 1 | 2 |   |   |   |
|   |   | 3 | 4 | 5 |
|   |   | 6 | 7 | 8 |
---------------------
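
As a side note, the padding step performed by clean_data can probably be left to csv.DictWriter itself: its restval parameter is the value written for any field that is missing from a row. The following is a minimal sketch under that assumption. It uses the same sample data and writes to an in-memory io.StringIO buffer rather than data.csv, purely so the result can be inspected without touching the disk.

```python
import csv
import io

# Same sample data as in the answer above
data = [{'A': 1, 'B': 2}, {'C': 3, 'D': 4, 'E': 5}, {'C': 6, 'D': 7, 'E': 8}]

# Union of all keys, sorted for a stable column order
fields = sorted(set().union(*data))

# In-memory buffer used instead of a real file (assumption for illustration)
buffer = io.StringIO(newline='')

# restval is written for every field that a row does not contain
writer = csv.DictWriter(buffer, fieldnames=fields, restval='')
writer.writeheader()
writer.writerows(data)

csv_text = buffer.getvalue()
print(csv_text)
```

Passing restval=' ' instead should mimic the blank placeholder used by clean_data. The input dictionaries are not modified either way.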

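【Comments】:

【Solution 2】:

The problem is that you need the complete set of columns in order to write the header at the beginning of the file. Apart from that, csv.DictWriter is all you need: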

import csv

# optional: compute the fieldnames:
fieldnames = set()
for d in dict_list:
    fieldnames.update(d.keys())
fieldnames = sorted(fieldnames)    # sort the fieldnames...

# produce the csv file
with open("file.csv", "w", newline='') as fd:
    wr = csv.DictWriter(fd, fieldnames)
    wr.writeheader()
    wr.writerows(dict_list)

The resulting csv will look like this:

A,B,C,D,E
1,2,,,
,,3,4,5
,,6,7,8

If you really want to combine rows that have disjoint key sets, you can do this:

# produce the csv file
with open("file.csv", "w", newline='') as fd:
    wr = csv.DictWriter(fd, sorted(fieldnames))
    old = {k: k for k in wr.fieldnames}       # use old for the header line
    for row in dict_list:
        if old.keys() & row.keys():
            wr.writerow(old)                  # common fields: write old and start a new row
            old = dict(row)                   # copy, so dict_list itself is not modified
        else:
            old.update(row)                   # disjoint fields: just combine
    wr.writerow(old)                          # do not forget last row

You will get:

A,B,C,D,E
1,2,3,4,5
,,6,7,8

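【Comments】:

This is basically what I was looking for, and I appreciate its simplicity. However, the column order is messed up: the "A" and "B" columns appear between "C", "D" and "E". What exactly happens during fieldnames.update()?
@MaxJ.: You can sort the field names (see the edit in the first part of my answer). I also show how to combine rows with disjoint key sets.
I set the field names by adding the keys of the second dictionary to the keys of the first one. Your first solution probably fits my problem better than what I asked for (the second part). Since it works for the minimum requirement, I am marking it as the accepted solution.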
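
The column-order issue raised in these comments comes from the fact that a set does not keep its elements in any particular order. Besides sorting, one option is to collect the keys in a dict, which preserves insertion order in Python 3.7 and later, so columns appear in the order in which keys are first seen. A minimal sketch follows. The helper name ordered_fieldnames is hypothetical and not part of the answer, and the output goes to an in-memory buffer only for illustration.

```python
import csv
import io


def ordered_fieldnames(rows):
    """Return all keys found in rows, in order of first appearance."""
    seen = {}
    for row in rows:
        for key in row:
            # dict keeps insertion order, so the first sighting fixes the position
            seen.setdefault(key, None)
    return list(seen)


# Data from the question (without the elided trailing items)
dict_list = [{"A": 1, "B": 2}, {"C": 3, "D": 4, "E": 5}, {"C": 6, "D": 7, "E": 8}]

fieldnames = ordered_fieldnames(dict_list)

buffer = io.StringIO(newline='')
wr = csv.DictWriter(buffer, fieldnames)
wr.writeheader()
wr.writerows(dict_list)

csv_text = buffer.getvalue()
print(csv_text)
```

Unlike sorted(), this keeps whatever order the data itself suggests, which may matter when the column names are not meant to be alphabetical.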

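【Solution 3】:

Pandas can build a data frame from a list of dictionaries if you call pd.DataFrame() on the list. In the resulting data frame, each dictionary is one row and each key corresponds to one column. So the value for the 3rd key of the 7th dictionary (call it key3) ends up in the 7th row of the key3 column.

What this means for your problem: you first have to modify your dict_list so that it contains the merged dictionary, like this: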

import pandas as pd

# dict_list is the list from the question
dict_list.insert(2, dict(**dict_list[0], **dict_list[1]))
print(dict_list)

[{'A': 1, 'B': 2},
 {'C': 3, 'D': 4, 'E': 5},
 {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5},
 {'C': 6, 'D': 7, 'E': 8}]

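This inserts the combination of the first two dictionaries into your list at index 2. Why index 2? It lets you conveniently slice the list when converting it to a data frame, which gives you the desired output: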

df = pd.DataFrame(dict_list[2:])
print(df)

     A    B  C  D  E
0  1.0  2.0  3  4  5
1  NaN  NaN  6  7  8

For comparison, calling pd.DataFrame directly on the unmodified list gives you:

df_unmodified = pd.DataFrame(dict_list)
print(df_unmodified)

     A    B    C    D    E
0  1.0  2.0  NaN  NaN  NaN
1  NaN  NaN  3.0  4.0  5.0
2  NaN  NaN  6.0  7.0  8.0

After that, you can save the data frame to a csv file with df.to_csv().
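
To make that last step concrete, here is a minimal sketch, assuming pandas is installed. It builds the merged list without modifying dict_list and writes to an in-memory io.StringIO buffer instead of a file, purely for illustration. Passing index=False keeps the row index (0, 1, ...) out of the output, since the question's target file has no index column.

```python
import io

import pandas as pd

# Data from the question (without the elided trailing items)
dict_list = [{"A": 1, "B": 2}, {"C": 3, "D": 4, "E": 5}, {"C": 6, "D": 7, "E": 8}]

# Merge the first two dictionaries into one row, then keep the remaining rows
merged = {**dict_list[0], **dict_list[1]}
df = pd.DataFrame([merged] + dict_list[2:])

# In-memory buffer used instead of a real path (assumption for illustration)
buffer = io.StringIO()
df.to_csv(buffer, index=False)

csv_text = buffer.getvalue()
print(csv_text)
```

Because columns A and B contain missing values, pandas stores them as floats, so the first data row is likely to show 1.0 and 2.0 rather than 1 and 2. Missing values are written as empty fields by default.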

【Comments】:

This leads to the expected result. Thanks!