Python: merging dictionaries with adding values but conserving other fields

【Question title】: Python: merging dictionaries with adding values but conserving other fields
【Posted】: 2017-04-19 20:00:32
【Question】:

I have a text file in the following format:

word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency

...with 1 million entries.

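However, some word_forms contain an apostrophe (') and some do not, and I want to count them as instances of the same word. In other words, I want to merge these two lines: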

cup'board   cup     blabla  12
cupboard    cup     blabla2 10

into this (summing the frequencies):

cupboard    cup     blabla2  22

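I'm looking for a solution in Python 2.7. My first idea was to read the text file, store the words with an apostrophe and the words without one in two different dictionaries, then go through the dictionary of words with apostrophes and test whether each word is already in the dictionary without apostrophes: if it is, update the frequency; if not, simply add the line with the apostrophe removed. Here is my code: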

class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self,lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Keeps the lines of a file in memory for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe'''
    '''Works in a reasonable time'''
    '''This step can be done writing line by line, avoiding all storage in memory'''
    word_dict = {}
    word_dict_striped = {}

    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe   
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:

            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'","")
                    items[2] = items[2].replace("\+Apos", "")

                    g.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped.update({items[0] : Lemma(items)})
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write( "%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict.update({items[0] : Lemma(items)})

    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merge them by adding the count of their frequencies if there is a common key'''
    ''' Does not run in reasonable time on the whole list '''

    with open('word_compiled_dict.txt', 'wb') as f:

        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                f.write( "%s\t%s\t%s\t%s" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                word_dict.update({word : word_dict_striped[word]})

    print "Number of words: ",
    print(len(word_dict))

    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq

    return word_dict

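This solution works in reasonable time up to the point where the two dictionaries are stored, whether I write the two text files line by line to avoid any storage or keep them as dict objects in the program. But merging the two dictionaries never finishes!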

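The dictionary 'update' function works, but it overwrites one frequency count instead of adding the two. I have seen some solutions for merging dictionaries by summing with Counter: "Python: Elegantly merge dictionaries with sum() of values", "Merge and sum of two dictionaries", "How to sum dict elements", "How to merge two Python dictionaries in a single expression?", "Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?". But they only seem to work when the dictionary has the form (word, count), whereas I also want to carry the other fields along in the dictionary.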

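I'm open to all your ideas or to a reframing of the problem, since my goal is just to run this program once to get this merged list in a text file. Thanks in advance!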
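
A note on the merge step above that never finishes. One likely cause: in Python 2, dict.keys() builds a new list on every call, so "word in word_dict.keys()" is a linear scan repeated for every stripped word. Testing membership on the dictionary itself is a hash lookup. The sketch below only illustrates that change. It assumes a simplified Lemma that takes its four fields directly (a hypothetical variant of the class above) and leaves out the file output; it is written to run on both Python 2.7 and 3.

```python
class Lemma(object):
    """Simplified stand-in for the Lemma class in the question (assumption)."""
    def __init__(self, word_form, root, morph, freq):
        self.word_form = word_form
        self.root = root
        self.morph = morph
        self.freq = int(freq)


def merge_word_dict(word_dict, word_dict_striped):
    """Add frequencies for shared keys; keep the other fields of word_dict."""
    for word, lemma in word_dict_striped.items():
        if word in word_dict:  # hash lookup, no list of keys is built
            word_dict[word].freq += lemma.freq
        else:
            # No apostrophe-free entry exists, so adopt the stripped one.
            word_dict[word] = lemma
    return word_dict


plain = {"cupboard": Lemma("cupboard", "cup", "blabla2", 10)}
stripped = {"cupboard": Lemma("cupboard", "cup", "blabla", 12),
            "oclock": Lemma("oclock", "clock", "blabla3", 3)}
merged = merge_word_dict(plain, stripped)
print("%s %d" % (merged["cupboard"].morph, merged["cupboard"].freq))
```

The morphological field of the apostrophe-free entry is kept and only the frequency is summed, which matches the cupboard example in the question.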

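【Question discussion】:

Can't you simply remove all the apostrophes by replacing them with an empty string? Like this: word_form = items[0].replace("'", "")
But then I would have two lines with the same word, and their frequencies wouldn't be added, right?
Can at most two lines be combined for a given word, or possibly more? Do the lines to be combined have to be adjacent? If two lines are combined, is everything else (apart from the count) guaranteed to be the same?
Yes, at most two lines can be combined for a given word: there is only one version with an apostrophe and one without. But no, the lines to combine are not necessarily adjacent. And no, when two lines are merged the third column is actually different, but ideally the line without the apostrophe should be kept (as in the example).
Oh, one more thing: are there apostrophes anywhere other than in the first word? (i.e., could you just replace them all with an empty string, as Sven suggested?)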

Tags: python python-2.7 dictionary merge

【Solution 1】:

Here is something that does more or less what you want. Just change the file names at the top. It does not modify the original file.

input_file_name = "input.txt"
output_file_name = "output.txt"

def custom_comp(s1, s2):
    word1 = s1.split()[0]
    word2 = s2.split()[0]
    stripped1 = word1.translate(None, "'")
    stripped2 = word2.translate(None, "'")

    if stripped1 > stripped2:
        return 1
    elif stripped1 < stripped2:
        return -1
    else:
        if "'" in word1:
            return -1
        else:
            return 1

def get_word(line):
    return line.split()[0].translate(None, "'")

def get_num(line):
    return int(line.split()[-1])

print "Reading file and sorting..."

lines = []
with open(input_file_name, 'r') as f:
    for line in sorted(f, cmp=custom_comp):
        lines.append(line)

print "File read and sorted"

combined_lines = []

print "Combining entries..."

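i = 0
while i < len(lines):
    # Merge only when a next line exists and shares the same stripped word;
    # looping to len(lines) keeps an unmerged last line from being dropped.
    if i + 1 < len(lines) and get_word(lines[i]) == get_word(lines[i+1]):
        total = get_num(lines[i]) + get_num(lines[i+1])
        # The apostrophe-free line sorts second, so its fields are kept.
        new_parts = lines[i+1].split()
        new_parts[-1] = str(total)
        combined_lines.append(" ".join(new_parts))
        i += 2
    else:
        combined_lines.append(lines[i].strip())
        i += 1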

print "Entries combined"
print "Writing to file..."

with open(output_file_name, 'w+') as f:
    for line in combined_lines:
        f.write(line + "\n")

print "Finished"

It sorts the words and changes the spacing slightly. If that matters, let me know and we can adjust it.

The other thing is that it sorts the whole file. For only a million lines that probably won't take too long, but again, let me know if it is a problem.
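
If the reordering or the changed spacing does turn out to matter, a single pass with one dictionary avoids the sort altogether. The following is only a sketch under assumptions the question does not confirm: fields are tab-separated, only the first column needs its apostrophes removed, and the function name merge_lines is made up for illustration. It keeps the fields of the apostrophe-free line when one exists, sums every matching frequency, and preserves first-seen order. It is written to run on both Python 2.7 and 3.

```python
def merge_lines(lines):
    """Merge tab-separated lines whose first field differs only by apostrophes."""
    merged = {}   # stripped word -> [word, root, morph, freq]
    order = []    # stripped words in first-seen order
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        word, root, morph, freq = parts
        key = word.replace("'", "")
        if key not in merged:
            merged[key] = [key, root, morph, int(freq)]
            order.append(key)
        else:
            entry = merged[key]
            entry[3] += int(freq)
            if "'" not in word:
                # Prefer the fields of the apostrophe-free line.
                entry[1], entry[2] = root, morph
    return ["\t".join(str(field) for field in merged[k]) for k in order]


sample = [
    "cup'board\tcup\tblabla\t12\n",
    "table\ttable\tblabla3\t5\n",
    "cupboard\tcup\tblabla2\t10\n",
]
for out in merge_lines(sample):
    print(out)
```

Because a file object yields its lines one by one, an open file could be passed in place of the sample list. The dictionary still holds one entry per distinct word, so memory use grows with the vocabulary rather than with the number of lines.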

【Discussion】:

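Thank you very much for the answer in under a minute! I modified it a little so that entries without an apostrophe are still written out even when there is no apostrophe entry to merge with, and I realized I have to run the program several times, because in some cases there are more than two lines to merge (my fault, I didn't know there were any), but having a program that finishes changes everything!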
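
For the case mentioned in this comment, where more than two lines share a word, the sorted approach could merge a whole group in one run instead of one pair per run. Here is a minimal sketch using itertools.groupby, assuming whitespace-separated fields and no blank lines, as in the answer. The helper names strip_word and combine_sorted are hypothetical. When a group has no apostrophe-free line, which line's fields survive is an arbitrary choice here (the last one in the group), and the stripped word form is always written.

```python
from itertools import groupby


def strip_word(line):
    """Return the first field of a line with apostrophes removed."""
    return line.split()[0].replace("'", "")


def combine_sorted(lines):
    """Merge every run of lines that share the same apostrophe-free word."""
    # Sort by stripped word; within a group, lines with an apostrophe come
    # first so that an apostrophe-free line, if any, ends up last.
    ordered = sorted(
        lines, key=lambda ln: (strip_word(ln), "'" not in ln.split()[0]))
    combined = []
    for word, group in groupby(ordered, key=strip_word):
        rows = [g.split() for g in group]
        total = sum(int(parts[-1]) for parts in rows)
        kept = rows[-1]        # the apostrophe-free line when there is one
        kept[0] = word         # always write the stripped word form
        kept[-1] = str(total)
        combined.append(" ".join(kept))
    return combined


lines = ["cup'board cup blabla 12", "cupboard cup blabla2 10",
         "cupb'oard cup blabla4 1", "table table blabla3 5"]
print(combine_sorted(lines))
```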