Pandas Data Summarization: Answers

【Question title】: Pandas Data Summarization
【Posted】: 2020-07-06 17:30:04
【Question description】:

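I have some loosely structured data, shown below. Note that the names in the first position are repeated across lines (this is important).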

('Alex', ['String1', 'String34'])
('Piper', ['String5', 'String64', 'String12'])
('Nicky', ['String3', 'String21', 'String42', 'String51'])
('Linda', ['String14'])
('Suzzane', ['String11', 'String36', 'String16'])
('Alex', ['String64', 'String34', 'String12', 'String5'])
('Linda', ['String3', 'String77'])
('Piper', ['String41', 'String64', 'String11', 'String34'])
('Suzzane', ['String12'])
('Nicky', ['String11',  'String51'])
('Alex', ['String77', 'String64', 'String3', 'String5'])
('Linda', ['String51'])
('Nicky', ['String77', 'String12', 'String34'])
('Suzzane', ['String51', 'String3'])
('Piper', ['String11', 'String64', 'String5'])

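If the data above is in a file named "output.txt", how can I import it and summarize it as shown below?

[Keep only the unique names and, for each name, collect only the unique strings from all of its repeated entries]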

('Alex', ['String1', 'String34', 'String64', 'String12', 'String5', 'String77', 'String3'])
('Piper', ['String5', 'String64', 'String12', 'String11', 'String41', 'String34'])
('Nicky', ['String3', 'String21', 'String42', 'String51', 'String11', 'String77', 'String12', 'String34'])
('Linda', ['String14', 'String3', 'String77', 'String51'])
('Suzzane', ['String11', 'String36', 'String16', 'String12', 'String51', 'String3'])

【Question discussion】:

Tags: python-3.x summarization

【Solution 1】:

You can load the data into a pandas DataFrame:

import pandas as pd

df = pd.DataFrame(data=[('Alex', ['String1', 'String34']),
('Alex', ['String64', 'String34', 'String12', 'String5']),
('Nicky', ['String11',  'String51']),
('Nicky', ['String77', 'String12', 'String34'])])
df = df.rename(columns={0:'name', 1:'strings'})

Then create a function that concatenates the lists held in a pandas column:

def concatenate(strings):
    strings_agg = []
    for string in strings:
        strings_agg.extend(string)
    return strings_agg

Finally, apply this function to the column:

df.groupby('name').apply(lambda x: concatenate(x['strings'])).to_frame()
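
The call above concatenates the lists for each name but leaves repeated strings in place (for example, 'String34' appears twice for Alex). Below is a minimal sketch of one way to also drop duplicates while keeping first-seen order. The helper name concatenate_unique is hypothetical and not part of the answer:

```python
import pandas as pd

# Small sample taken from the question's data.
rows = [
    ('Alex', ['String1', 'String34']),
    ('Alex', ['String64', 'String34', 'String12', 'String5']),
    ('Nicky', ['String11', 'String51']),
    ('Nicky', ['String77', 'String12', 'String34']),
]
df = pd.DataFrame(rows, columns=['name', 'strings'])


def concatenate_unique(lists):
    """Flatten a sequence of lists, keeping the first occurrence of each item."""
    merged = []
    for items in lists:
        merged.extend(items)
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(merged))


# sort=False keeps the names in order of first appearance
summary = df.groupby('name', sort=False)['strings'].apply(concatenate_unique)
print(summary.to_dict())
```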

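【Discussion】:

Dear @Franco Piccolo, thank you for your kind answer. I have a small doubt here: if the data is in a file such as "Output.txt", should I change the extension to ".csv" and import it into a DataFrame?
Yes, you can load the data from a csv. I don't think it will be that simple, though, because the lists will be read as text. Maybe you should ask a separate question about the import part.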
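
As the reply notes, reading the file as CSV would leave each list as plain text. Here is a sketch of one way around this, assuming every non-blank line of the file is a valid Python tuple literal as in the question: ast.literal_eval turns such text into real tuples and lists. The helper parse_lines is hypothetical, and the sample lines stand in for the contents of output.txt:

```python
import ast


def parse_lines(lines):
    """Turn text lines like "('Alex', ['String1'])" into (name, list) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        name, strings = ast.literal_eval(line)
        records.append((name, strings))
    return records


# Stand-in for the lines of output.txt (assumption: same layout as the question).
sample = [
    "('Alex', ['String1', 'String34'])\n",
    "('Linda', ['String14'])\n",
    "\n",
]
records = parse_lines(sample)
print(records)
```

With a real file, passing the open file object to parse_lines should give the same structure, which can then be handed to pd.DataFrame(records, columns=['name', 'strings']).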

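【Solution 2】:

I agree that pandas is a great library, but this kind of task can be done quite easily with Python's standard library¹. You can simply use a defaultdict of sets and parse each line with the regular-expression function re.finditer.

¹ This makes particular sense because neither your input nor your output is a pandas data type (pd.Series, pd.DataFrame, ..) or even a standard .csv / tabular format.

Code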

from collections import defaultdict
import re

dataset = defaultdict(set)

with open('output.txt') as f:
    for line in f:
        itr = re.finditer(r"'(\S+?)'", line)  # raw string keeps \S intact
        name = next(itr).groups()[0]
        strings = {x.groups()[0] for x in itr}
        dataset[name] |= strings

with open('results.txt', 'w') as f:
    for name, strings in dataset.items():
        print(f"('{name}', {list(strings)})", file=f)

Sample output

('Alex', ['String1', 'String5', 'String77', 'String64', 'String34', 'String12', 'String3'])
('Piper', ['String5', 'String11', 'String64', 'String34', 'String12', 'String41'])
('Nicky', ['String21', 'String77', 'String34', 'String11', 'String51', 'String3', 'String12', 'String42'])
('Linda', ['String77', 'String14', 'String51', 'String3'])
('Suzzane', ['String11', 'String36', 'String12', 'String16', 'String51', 'String3'])
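
The strings in the sample output are ordered differently from the output requested in the question, because a set does not keep insertion order. If first-seen order matters, one possible variant (an assumption, not part of the original answer) stores the strings as keys of a plain dict, which preserves insertion order in Python 3.7+. The function name summarize is hypothetical:

```python
import re


def summarize(lines):
    """Map each name to its unique strings in first-seen order."""
    dataset = {}
    for line in lines:
        tokens = re.findall(r"'(\S+?)'", line)
        if not tokens:  # skip lines without quoted tokens
            continue
        name, strings = tokens[0], tokens[1:]
        # The inner dict acts as an ordered set: existing keys keep their position.
        dataset.setdefault(name, {}).update(dict.fromkeys(strings))
    return {name: list(keys) for name, keys in dataset.items()}


lines = [
    "('Alex', ['String1', 'String34'])",
    "('Linda', ['String14'])",
    "('Alex', ['String64', 'String34', 'String12', 'String5'])",
]
result = summarize(lines)
for name, strings in result.items():
    print(f"('{name}', {strings})")
```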

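How the code works

The file is read line by line. A regular expression captures any run of non-whitespace characters (\S) between two single quotes ('), so the pattern is '(\S+?)'. The plus sign + means "match one or more characters", and ? makes the match non-greedy (as few characters as possible), so every string is parsed individually instead of as one long string containing everything.
re.finditer is used to find every match of the same pattern. It is used here instead of re.findall because re.findall builds a list, whereas re.finditer returns an iterator. (A small optimization: no list is created, since none is needed.)
Next, the name is captured by calling next() on itr, which returns the first element of the iterator.
Then groups() is called and the first item of the returned tuple is taken. This is how a group captured with parentheses (()) in the pattern is accessed.
The rest of the iterator itr holds only the strings, from which a Python set is built; a set guarantees unique elements. The syntax shown is a set comprehension.
On the following line, the resulting set is merged into the dataset variable, which is a defaultdict. A defaultdict is convenient because accessing a missing key automatically creates an entry of the default type; defaultdict(set) makes set the default type. The operation d[key] |= val has the same effect as d[key] = d[key] | val, where | produces the union of the new set and the set that may already be stored in dataset.
The last part simply writes the output line by line to results.txt. Converting strings to a list is optional; it is done so that the output resembles the format in the question.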
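
The pieces described above can be tried in isolation. This is a small sketch with made-up inline input: the compact string "'a','b'" is an assumption chosen to show the greedy versus non-greedy difference, which only appears when no whitespace separates the quoted tokens.

```python
import re
from collections import defaultdict

pattern = r"'(\S+?)'"

# Non-greedy vs. greedy matching on input without spaces between tokens.
compact = "'a','b'"
non_greedy = re.findall(pattern, compact)
greedy = re.findall(r"'(\S+)'", compact)
print(non_greedy)  # ['a', 'b']
print(greedy)      # ["a','b"]

# finditer + next() + set comprehension + defaultdict union, as in the answer.
dataset = defaultdict(set)
lines = [
    "('Alex', ['String1', 'String34'])",
    "('Alex', ['String64', 'String34'])",
]
for line in lines:
    itr = re.finditer(pattern, line)
    name = next(itr).group(1)                   # first match is the name
    dataset[name] |= {m.group(1) for m in itr}  # remaining matches are the strings

print(sorted(dataset['Alex']))  # ['String1', 'String34', 'String64']
```

Here group(1) is used as a shorthand for groups()[0]; both return the first captured group.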

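【Discussion】:

Wow.. thank you @np8.. It will take me a while to figure this out, because the code is so clean compared to mine.. but your explanation really helps.. Thanks a lot.. :)

【Solution 3】: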

import ast
import csv
import pandas as pd

#load data from the text file; it does not have to be a .csv, a .txt file works too
df = pd.read_csv(r"D:\test\output.txt", sep="/n", header=None, names=["data"], engine='python')

#convert each text line into a (name, list) tuple
df["data"] = df["data"].map(lambda x: ast.literal_eval(x))
#extract surename
df["surename"] = df["data"].map(lambda x: x[0])
#extract list of strings
df["strings"] = df["data"].map(lambda x: x[1])
#create 1 row for each string in the list of strings
df = df.explode("strings")
#remove duplicate entries
df = df.drop_duplicates(subset=["surename", "strings"], keep="first")
#group the data by surename to get a list of unique strings (unique because we removed duplicates, order will be kept)
df_result = df.groupby(["surename"]).aggregate({"strings":list}).reset_index()
#combine the extracted surename and the deduplicated list of strings into a tuple again
df_result["result"] = df_result.apply(lambda x: (x["surename"], x["strings"]), axis=1)

#output the data to a file of your choice
df_result[["result"]].to_csv(r"D:\test\result.txt",index=False, header=None, quoting=csv.QUOTE_NONE, escapechar = '')
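
The core of the steps above (explode, drop_duplicates, then grouping back into lists) can be tried without any file. Below is a minimal sketch on a few in-memory rows taken from the question, assuming a pandas version that provides DataFrame.explode (0.25 or later). The column names follow the answer; sort=False is an addition here that keeps the names in order of first appearance:

```python
import pandas as pd

df = pd.DataFrame({
    "surename": ["Alex", "Linda", "Alex"],
    "strings": [["String1", "String34"], ["String14"], ["String64", "String34"]],
})

# One row per (surename, string) pair.
long_df = df.explode("strings")
# Drop repeated pairs, keeping the first occurrence.
long_df = long_df.drop_duplicates(subset=["surename", "strings"], keep="first")
# Collect the remaining strings back into one list per surename.
result = long_df.groupby("surename", sort=False)["strings"].agg(list)
print(result.to_dict())
```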

【Discussion】:

【Solution 4】:

data = []
a_dict = {}
unique = []

#Assuming the file is named a.txt here.
#After opening the file, eval turns each line of text into a Python object
#(ast.literal_eval is a safer alternative if the file is not trusted).
#The list data then holds the whole file; every element of data is a tuple.
with open('a.txt','r') as file:
    for i in file.readlines():
        a = eval(i)
        data.append(a)

#Collect every unique name in the list unique, in order of first appearance.
for i in data:
    if i[0] not in unique:
        unique.append(i[0])


#For each name in unique, iterate over the whole list data and extend a_list
#with the strings of every tuple whose name matches.
#
#When the inner loop finishes, a_list is stored in a_dict under that name,
#and a fresh a_list is created for the next unique name.
for i in unique:
    a_list = []
    for j in data:
        if i==j[0]:
            a_list.extend(j[1])
    a_dict[i] = list(dict.fromkeys(a_list))  #drop duplicates, keep first-seen order
    
#This piece of code is to print out the data in a formatted way.
for i,j in a_dict.items():
    print("('{}', {})".format(i,j))
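
When the lists for one name are merged, repeated strings have to be dropped explicitly; converting a list to a tuple and back keeps them. Here is a short check, with dict.fromkeys shown as one order-preserving way to deduplicate (the sample values are taken from the question):

```python
merged = ['String1', 'String34', 'String64', 'String34', 'String12']

round_trip = list(tuple(merged))       # same items, duplicates kept
unique = list(dict.fromkeys(merged))   # first occurrence of each item kept

print(round_trip == merged)  # True
print(unique)                # ['String1', 'String34', 'String64', 'String12']
```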

【Discussion】: