How to obtain a dictionary of dictionaries that stores aggregated values from a CSV file

【Question title】: How to obtain a dictionary of dictionaries that stores aggregated values from a CSV file
【Posted】: 2018-08-03 18:01:00
【Question description】:

I have a data file with the following content:

 Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
  Part#2
         A 30 30 30 10 10 20 20 20 10 10 10
         B 10 10 20 10 10 30 10 30 10 30 10
  Part#3
         A 10 20 10 30 10 20 10 20 10 20 10
         B 10 10 20 20 20 30 10 10 20 20 30

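From that I would like to build a dictionary of dictionaries that holds the aggregated counts for each letter, so it would look like this: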

dictionary = {'Part#1': {'A': {10: 6, 20: 3, 30: 2},
                         'B': {10: 6, 20: 2, 30: 3}},
              'Part#2': {'A': {10: 5, 20: 3, 30: 3},
                         'B': {10: 7, 20: 1, 30: 3}},
              'Part#3': {'A': {10: 6, 20: 4, 30: 1},
                         'B': {10: 4, 20: 5, 30: 2}}}

That way, if I want to display a single part, it would give me output like this:

dictionary['Part#1']

A
 10: 6
 20: 3
 30: 2

B
 10: 6
 20: 2
 30: 3

... and so on for the remaining parts in the file.
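For illustration, the display described above could be produced from such a nested dictionary roughly as follows. This is only a sketch: `format_part` is a hypothetical helper name, and the sample data is limited to the `Part#1` counts given in the question.

```python
# Sample data copied from the desired structure in the question (Part#1 only).
dictionary = {
    "Part#1": {
        "A": {10: 6, 20: 3, 30: 2},
        "B": {10: 6, 20: 2, 30: 3},
    },
}


def format_part(summary, part):
    """Return the counts of one part as text, one block per group."""
    blocks = []
    for group, counts in summary[part].items():
        lines = [group]
        # Sort the values so the output order does not depend on insertion order.
        for value in sorted(counts):
            lines.append(f" {value}: {counts[value]}")
        blocks.append("\n".join(lines))
    # Separate the groups with an empty line, as in the desired output.
    return "\n\n".join(blocks)


print(format_part(dictionary, "Part#1"))
```

Returning a string rather than printing inside the helper keeps it easy to reuse, for example when writing the summary to a file.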

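So far I have been able to parse the file from txt to csv and convert it into a dictionary, let's call it the outer dictionary. I have been testing several approaches to see what output I get, and so far this code comes closest to (but does not fully match) the structure I am looking for, which I described above.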

partitions_dict = df.head(5).to_dict(orient='list')

print(partitions_dict)

Output:

{0: ['A', 'B', 'A', 'B', 'A'], 1: ['10', '10', '10', '10', '10'], 2: [10, 10, 10, 10, 10], 3: [10, 10, 10, 10, 10], 4: [10, 10, 10, 10, 10], 5: [10, 10, 10, 10, 10], 6: [10, 10, 10, 10, 10], 7: [10, 10, 10, 10, 10]

The function I use to parse the file:

import csv
import os

import pandas as pd


def fileFormatConverter(txt_file):
    """ Receives a generated text file of partitions as a parameter
        and converts it into csv format.
        input: text file
        return: csv file """

    filename, ext = os.path.splitext(txt_file)
    csv_file = filename + ".csv"
    # Context managers make sure both files are closed and the output is flushed
    with open(txt_file, "r", newline="") as src, open(csv_file, "w", newline="") as dst:
        in_txt = csv.reader(src, delimiter=' ')
        out_csv = csv.writer(dst)
        out_csv.writerows(in_txt)
    return csv_file

# removes "Part#0" as a header from the dataframe
df_traces = pd.read_csv(fileFormatConverter("sample.txt"), skiprows=1, header=None)   #, error_bad_lines=False)
df_traces.head()

Output:

    0   1   2   3   4   5   6   7   8   9   ...     15  16  17  18  19  20  21  22  23  24
0   A,  10,     20,     10,     10,     30,     10,     20,     10,     30,     ...     20,     10,     10,     30,     10,     30,     10,     20,     30.0    NaN
1   Part#2  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
2   A,  30,     30,     30,     10,     10,     20,     20,     20,     10,     ...     20,     10,     10,     30,     10,     30,     10,     30,     10.0    NaN
3   Part#3  NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     ...     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN     NaN
4   A,  10,     20,     10,     30,     10,     20,     10,     20,     10,     ...     20,     20,     20,     30,     10,     10,     20,     20,     30.0    NaN

I used a function to change the headers so that the letters inside each partition are easier to work with:

def changeDFHeaders(df):

    df_transpose = df.T
    new_header = df_transpose.iloc[0]                       # stores the first row for the header
    df_transpose = df_transpose[1:]                         # take the data less the header row
    df_transpose.columns = new_header                       # set the header row as the df header
    return(df_transpose)


# The counter column serves as an index for the entire dataframe
#df_transpose['counter'] = range(len(df_transpose))      # adds the counter for rows column
#df_transpose.set_index('counter', inplace=True)
df_transpose_headers = changeDFHeaders(df_traces)
df_transpose_headers.infer_objects()

Output:

    A,  Part#2  A,  Part#3  A,
1   10,     NaN     30,     NaN     10,
2   20,     NaN     30,     NaN     20,
3   10,     NaN     30,     NaN     10,
4   10,     NaN     10,     NaN     30,
5   30,     NaN     10,     NaN     10,
6   10,     NaN     20,     NaN     20,
7   20,     NaN     20,     NaN     10,
8   10,     NaN     20,     NaN     20,
9   30,     NaN     10,     NaN     10,
10  10,     NaN     10,     NaN     20,
11  20,     NaN     10,     NaN     10,
12  B,  NaN     B,  NaN     B,
13  10,     NaN     10,     NaN     10,
14  10,     NaN     10,     NaN     10,
15  20,     NaN     20,     NaN     20,
16  10,     NaN     10,     NaN     20,
17  10,     NaN     10,     NaN     20,
18  30,     NaN     30,     NaN     30,
19  10,     NaN     10,     NaN     10,
20  30,     NaN     30,     NaN     10,
21  10,     NaN     10,     NaN     20,
22  20,     NaN     30,     NaN     20,
23  30  NaN     10  NaN     30
24  NaN     NaN     NaN     NaN     NaN

-- still not quite right...

If you look at this statement:

df = df_transpose_headers
partitions_dict = df.head(5).to_dict(orient='list')      

print(partitions_dict)

Output:

{'A,': ['10,', '20,', '10,', '30,', '10,'], 'Part#2': [nan, nan, nan, nan, nan], 'Part#3': [nan, nan, nan, nan, nan]}

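【Comments】:

I notice you have edited your question to clarify why this is not a duplicate: could you also edit it to include what you have tried so far to solve the problem? Please include all the relevant code you have.
@TemporalWolf Thanks for the suggestion!
I have voted to reopen, but I can't see how you get from the input given at the top of the question to the output in your code.
@TemporalWolf OK. I will add the functions so you can see what is being done. It is still not quite right, though.
Thank you for responding to the request for more information. To improve your question further, you may find the tips in How to Ask and minimal reproducible example very helpful.

Tags: python dictionary nested aggregate summary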

【Solution 1】:

I would avoid pandas, simply because I don't know it very well:

from collections import Counter

result = {}
part = ""
group = ""
for line in f:  # f being an open file
    sline = line.strip()
    if sline.startswith("Part"):
        part = sline
        result[part] = {}
        continue
    group = sline.split()[0]
    result[part][group] = Counter(sline.split()[1:])

The result takes the following form:

{'Part#1': {'A': Counter({'10': 6, '20': 3, '30': 2}), 'B': Counter({'10': 6, '30': 3, '20': 2})}, 
 'Part#2': {'A': Counter({'10': 5, '30': 3, '20': 3}), 'B': Counter({'10': 7, '30': 3, '20': 1})}, 
 'Part#3': {'A': Counter({'10': 6, '20': 4, '30': 1}), 'B': Counter({'20': 5, '10': 4, '30': 2})}}
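Note that the keys above are strings ('10'), whereas the structure asked for in the question uses numbers. A self-contained variant of the same idea, with integer keys and plain dicts, might look like the sketch below. The `count_by_part` name and the in-memory `SAMPLE` text are assumptions made for illustration, and the code assumes every data line comes after a "Part" header.

```python
import io
from collections import Counter

# First part of the sample file from the question, kept in memory for the demo.
SAMPLE = """\
 Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
"""


def count_by_part(lines):
    """Build {part: {group: {value: count}}} from an iterable of lines."""
    result = {}
    part = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines instead of failing on them
        if fields[0].startswith("Part"):
            part = fields[0]
            result[part] = {}
        else:
            group, values = fields[0], fields[1:]
            # int() turns '10' into 10 so the keys match the desired structure
            result[part][group] = dict(Counter(int(v) for v in values))
    return result


# An open file object can be passed in place of the StringIO wrapper.
summary = count_by_part(io.StringIO(SAMPLE))
print(summary)
```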

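If you are starting directly from a file with no line breaks, you can split on "Part" to find the records, and then use the index of "B" to separate the two groups of data: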

result = {}
sf = f.split("Part")[1:]  # f is the whole file as one string; drop the empty first piece
for line in sf:
    line = line.strip()  # remove leading and trailing whitespace
    sline = line.split()  # split on whitespace
    part = "Part%s" % sline[0]  # sline[0] is the suffix, e.g. "#1"
    b_index = sline.index("B")  # use the index of B to split the value lists
    result[part] = {}
    result[part][sline[1]] = Counter(sline[2:b_index])
    result[part]["B"] = Counter(sline[b_index + 1:])

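【Discussion】:

Thanks! I mainly use pandas to convert the file from txt to csv. The display is only there to check the output.
I have added another approach that should work directly from the file if it is one long line. If there are more than two types (A/B), it would need to be abstracted into a general slice.
TemporalWolf and @sehafoc, you have no idea how grateful I am for your help. I got it working with your approach (yes!). I do have a question, though, that came up while looking at this form. When summarizing this way, do I lose the original values and keep only the summary? If so, should I store the file in a 3D matrix each time instead of re-reading the file (would that be a better way to keep the original values)? I have started using matrices to hold these values, and displaying them has always been a problem!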
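On the question of losing the original values: a `Counter` only stores the counts, so the order of the raw values is indeed not recoverable from it. One possible arrangement, sketched here with the hypothetical names `raw` and `summarize`, is to keep the raw values in nested lists and derive the summary from them whenever it is needed, which avoids both re-reading the file and a separate 3D matrix.

```python
from collections import Counter

# Raw values for Part#1, copied from the sample file in the question.
raw = {
    "Part#1": {
        "A": [10, 20, 10, 10, 30, 10, 20, 10, 30, 10, 20],
        "B": [10, 10, 20, 10, 10, 30, 10, 30, 10, 20, 30],
    },
}


def summarize(raw_values):
    """Derive {part: {group: Counter}} without modifying the raw lists."""
    return {
        part: {group: Counter(values) for group, values in groups.items()}
        for part, groups in raw_values.items()
    }


summary = summarize(raw)
print(summary["Part#1"]["A"])   # aggregated counts
print(raw["Part#1"]["A"][:3])   # the original values are still available
```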

【Solution 2】:

The input file is:

  Part#1
         A 10 20 10 10 30 10 20 10 30 10 20
         B 10 10 20 10 10 30 10 30 10 20 30
  Part#2
         A 30 30 30 10 10 20 20 20 10 10 10
         B 10 10 20 10 10 30 10 30 10 30 10
  Part#3
         A 10 20 10 30 10 20 10 20 10 20 10
         B 10 10 20 20 20 30 10 10 20 20 30

This should work:

def parse_file(file_name):
    return_dict = dict()
    section = str()
    with open(file_name, "r") as source:
        for line in source.readlines():
            if "#" in line:
                section = line.strip()
                return_dict[section] = dict()
                continue
            tmp = line.strip().split()
            group = tmp.pop(0)
            return_dict[section][group] = dict()
            for item in tmp:
                if item in return_dict[section][group].keys():
                    return_dict[section][group][item] += 1
                else:
                    return_dict[section][group][item] = 1

    return return_dict

Output:

{'Part#1': {'A': {'10': 6, '20': 3, '30': 2},
            'B': {'10': 6, '20': 2, '30': 3}},
 'Part#2': {'A': {'10': 5, '20': 3, '30': 3},
            'B': {'10': 7, '20': 1, '30': 3}},
 'Part#3': {'A': {'10': 6, '20': 4, '30': 1},
            'B': {'10': 4, '20': 5, '30': 2}}}

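Honestly, I don't see why you want an intermediate stage; it seems that if you have to parse the file once to create the CSV, you could put the logic that builds the dict() in the same place. So I apologize if I have missed some subtlety in the question.

Edit: reworked the answer based on the comments saying that the input file is actually a single line.

So the input file is: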

Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30

The following modified code will work:

import string
from pprint import pprint

def parse_file2(file_name):
    return_dict = dict()
    section = None
    group = None
    with open(file_name, "r") as source:
        for line in source.readlines():
            tmp_line = line.strip().split()
            for token in tmp_line:
                if "#" in token:
                    section = token
                    return_dict[section] = dict()
                    continue
                elif token in string.ascii_uppercase:
                    group = token
                    return_dict[section][group] = dict()
                    continue
                if section and group:
                    if token in return_dict[section][group].keys():
                        return_dict[section][group][token] += 1
                    else:
                        return_dict[section][group][token] = 1

    return return_dict

if __name__ == "__main__":
    pprint(parse_file(file_name))
    pprint(parse_file2(file_name2))

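Note that this function is specific to the file format you described in the comments. If the file format is not what you said, it may blow up.

Based on the question, though, this should work.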
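Because `str.split()` with no arguments treats newlines like any other whitespace, the multi-line and single-line layouts can also be handled by one token-based loop. The sketch below is not the code from this answer; `parse_tokens` is a hypothetical name, and it assumes part headers start with "Part#", group labels are purely alphabetic, and values are non-negative integers. It raises `ValueError` on anything else instead of failing in a less obvious way.

```python
from collections import Counter


def parse_tokens(text):
    """Parse whitespace-separated tokens into {part: {group: Counter}}."""
    result = {}
    section = None
    group = None
    for token in text.split():  # newlines count as whitespace here
        if token.startswith("Part#"):
            section = token
            result[section] = {}
            group = None  # a new part starts with no active group
        elif token.isalpha():
            if section is None:
                raise ValueError(f"group {token!r} appears before any part")
            group = token
            result[section][group] = Counter()
        elif token.isdigit() and group is not None:
            result[section][group][int(token)] += 1
        else:
            raise ValueError(f"unexpected token {token!r}")
    return result


multi_line = " Part#1\n   A 10 20 10\n   B 30 30\n Part#2\n   A 20\n"
single_line = "Part#1 A 10 20 10 B 30 30 Part#2 A 20"
print(parse_tokens(multi_line) == parse_tokens(single_line))
```

For a file on disk, the text would come from reading the whole file with `read()`, which is reasonable for inputs of the size mentioned in the discussion (tens of thousands of entries).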

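Also, if you could simplify the question above to state just the actual file contents and the desired result, or simply "I have structure A and want to convert it into structure B", I will clean up all the history in this post and give a simpler answer as well.

Hope this helps! :)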

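【Discussion】:

Because, for some reason, if I do this directly the file is treated as one object (a single object for the whole file) with more than 50k entries, which is why it cannot be read directly as a text file. Doing it directly in one function did not work. I really appreciate your help, it is a huge help... thank you :)!
Hmm, does that mean your file has no newlines? What you describe is what you should get if you use read() instead of readlines(). Also, it is unclear whether your intermediate stage is correct. In your partition_dict output, all the lists look like they contain only 10s (apart from the first one, which contains the repeating A, B pattern). If you are looking for a way to convert the intermediate structure into the final one, it might remove some confusion to state just that. For example: "I have structure A and I want to transform it into structure B".
Here is the thing: I should not have shown it nicely formatted in the first place. The file actually presents the information like this: Part#1 A 10 20 10 10 30 10 20 10 30 10 20 B 10 10 20 10 10 30 10 30 10 20 30 Part#2 A 30 30 30 10 10 20 20 20 10 10 10 B 10 10 20 10 10 30 10 30 10 30 10 Part#3 A 10 20 10 30 10 20 10 20 10 20 10 B 10 10 20 20 20 30 10 10 20 20 30 So when the file is read as a .text file, the entire file is treated as a single argument (a huge one) and the spaces in between are not recognized. That is why I converted it to a csv file; it separates everything.
When I read it directly with your function, I do get two lines separated between A and B from the text file; when I parse it to csv with my function and then read the csv with your function, I get three lines, and those lines mess everything up, although the whitespace is not visible (at least in the file). And yes, the partitions are evenly distributed (10 per letter).
Ah, I see! That makes more sense now. Let me rework the answer on that basis.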