Find duplicates in a column, return the unique item and list its corresponding values from another column in Python: answers

【Question title】: find duplicates in a column, return the unique item and list its corresponding values from another column in python
【Posted】: 2015-03-23 02:08:40
【Question】:

I want to remove the duplicates from column 1 and, for each unique item, return the list of associated values from column 2, using Python.

The input is

1 2
Jack London 'Son of the Wolf'
Jack London 'Chris Farrington'
Jack London 'The God of His Fathers'
Jack London 'Children of the Frost'
William Shakespeare  'Venus and Adonis' 
William Shakespeare 'The Rape of Lucrece'
Oscar Wilde 'Ravenna'
Oscar Wilde 'Poems'

and the output should be

1 2
Jack London 'Son of the Wolf, Chris Farrington, Able Seaman, The God of His Fathers,Children of the Frost'
William Shakespeare 'The Rape of Lucrece,Venus and Adonis' 
Oscar Wilde 'Ravenna,Poems'

where the second column contains all of the values associated with each item, combined. I tried the set() function on a dictionary

dic={'Jack London': 'Son of the Wolf', 'Jack London': 'Chris Farrington', 'Jack London': 'The God of His Fathers'}
set(dic)

but it only returns the first key of the dictionary

set(['Jack London'])

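【Comments on the question】:

How are your columns delimited?
@AdamSmith I don't think that matters; he isn't asking how to parse the input.
It's tempting to write the code to do this for you, but I don't think either of us would learn much from that. Here is an example that I think will help: docs.python.org/2/library/collections.html#defaultdict-examples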

Tags: python no-duplicates

【Solution 1】:

In Python, each dictionary key can hold only one value. But that value can be a collection of items:

>>> d = {'Jack London': ['Son of the Wolf', 'Chris Farrington']}
>>> d['Jack London']
['Son of the Wolf', 'Chris Farrington']

To build such a dictionary from a sequence of key-value pairs, you can do the following:

dct = {}
for author, title in items:
    if author not in dct:
        # Create a new entry for the author
        dct[author] = [title]
    else:
        # Add another item to the existing entry
        dct[author].append(title)

The loop body can be written more concisely like this:

dct = {}
for author, title in items:
    dct.setdefault(author, []).append(title)
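
The defaultdict example linked in the comments on the question does the same job as setdefault. A minimal sketch, assuming `items` is a list of (author, title) tuples; the sample data below is hypothetical and only borrows a few rows from the question:

```python
from collections import defaultdict

# Hypothetical sample data: (author, title) pairs taken from the question.
items = [
    ('Jack London', 'Son of the Wolf'),
    ('Jack London', 'Chris Farrington'),
    ('William Shakespeare', 'Venus and Adonis'),
    ('Oscar Wilde', 'Ravenna'),
    ('Oscar Wilde', 'Poems'),
]

# defaultdict(list) creates an empty list the first time a key is accessed,
# so no explicit "is the author already present?" check is needed.
dct = defaultdict(list)
for author, title in items:
    dct[author].append(title)

print(dict(dct))
```

Unlike the groupby approach, this does not depend on the rows being sorted by author.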

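【Discussion】:

【Solution 2】:

You should use itertools.groupby, since your list is already sorted (groupby only merges consecutive rows that share the same key).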

rows = [('1', '2'),
        ('Jack London', 'Son of the Wolf'),
        ('Jack London', 'Chris Farrington'),
        ('Jack London', 'The God of His Fathers'),
        ('Jack London', 'Children of the Frost'),
        ('William Shakespeare', 'Venus and Adonis'),
        ('William Shakespeare', 'The Rape of Lucrece'),
        ('Oscar Wilde', 'Ravenna'),
        ('Oscar Wilde', 'Poems')]
# I'm not sure how you parse your input, but assume you end up with rows like these

from itertools import groupby
from operator import itemgetter

grouped = groupby(rows, itemgetter(0))
result = {group:', '.join([value[1] for value in values]) for group, values in grouped}

This gives you the result:

In [1]: pprint(result)
{'1': '2',
 'Jack London': 'Son of the Wolf, Chris Farrington, The God of His Fathers, '
                'Children of the Frost',
 'Oscar Wilde': 'Ravenna, Poems',
 'William Shakespeare': 'Venus and Adonis, The Rape of Lucrece'}
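
One caveat worth illustrating: groupby starts a new group every time the key changes, so unsorted input silently loses values when the groups are collected into a dict. A small sketch with hypothetical, deliberately shuffled rows (the names `unsorted_result` and `sorted_result` are made up for this example):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical unsorted rows: rows for the same author are not adjacent.
rows = [
    ('Oscar Wilde', 'Ravenna'),
    ('Jack London', 'Son of the Wolf'),
    ('Oscar Wilde', 'Poems'),
    ('Jack London', 'Chris Farrington'),
]

# Without sorting, each author shows up as several one-row groups,
# and later groups overwrite earlier ones in the dict.
unsorted_result = {key: [title for _, title in group]
                   for key, group in groupby(rows, itemgetter(0))}

# Sorting by the grouping key first makes equal keys adjacent.
# sorted() is stable, so titles keep their original relative order.
sorted_rows = sorted(rows, key=itemgetter(0))
sorted_result = {key: [title for _, title in group]
                 for key, group in groupby(sorted_rows, itemgetter(0))}

print(unsorted_result)
print(sorted_result)
```

If the data cannot be assumed to be sorted, either sort it first as above or use the dictionary-based approach from the other answer.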

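【Discussion】:

I think the following result is closer to the intended spec: result = {group:[x[1:][0] for x in values] for group,values in grouped}
@JimDennis True. I should even do data = {group:[col[1] for col in values] for group,values in grouped}; result = "{} {}".format(row[0], ' '.join(row[1:]) for row in data)
Yes, technically he did say "the output should be"... but I assumed he is actually more interested in the resulting data structure than in the literal output. My suggestion, and the answer by 预兆 that I upvoted, are both based on that interpretation of his question rather than on the literal "output" requirement.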
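
For anyone who does want the literal text output: the one-liner in the comment above hands a generator to str.format, so it would not run as written. A hedged sketch of what it appears to be aiming for, assuming rows sorted by author and without the '1 2' header row (the sample rows and the name `lines` are assumptions for illustration):

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows, already sorted by author.
rows = [
    ('Jack London', 'Son of the Wolf'),
    ('Jack London', 'Chris Farrington'),
    ('Oscar Wilde', 'Ravenna'),
    ('Oscar Wilde', 'Poems'),
]

# Build one output line per author: the name, then the quoted, joined titles.
lines = [
    "{} '{}'".format(author, ', '.join(title for _, title in group))
    for author, group in groupby(rows, itemgetter(0))
]

print('\n'.join(lines))
```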