【Posted】: 2017-04-19 20:00:32
【Question】:
I have a text file in the following format:
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
word_form root_form morphological_form frequency
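... with 1 million entries.
Some word_forms contain an apostrophe (') and some do not. I want to count them as instances of the same word, that is, I want to merge these two lines: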
cup'board cup blabla 12
cupboard cup blabla2 10
into this one (adding the frequencies):
cupboard cup blabla2 22
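I am looking for a way to do this in Python 2.7. My first idea was to read the text file and store the words with an apostrophe and the words without one in two different dictionaries. I would then go through the dictionary of words with an apostrophe and test whether each word is already in the dictionary without apostrophes: if it is, add the frequencies; if not, simply add the line with the apostrophe removed. Here is my code: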
class Lemma:
    """Creates a Lemma with the word form, the root, the morphological analysis and the frequency in the corpus"""
    def __init__(self, lop):
        self.word_form = lop[0]
        self.root = lop[1]
        self.morph = lop[2]
        self.freq = int(lop[3])

def Reader(filename):
    """Yields the lines of a file one at a time for a single reading, memory efficient"""
    with open(filename) as f:
        for line in f:
            yield line

def get_word_dict(filename):
    '''Separates the word list into two dictionaries, one for words without apostrophe and one for words with apostrophe.
    Works in a reasonable time.
    This step can be done writing line by line, avoiding all storage in memory.'''
    word_dict = {}
    word_dict_striped = {}
    # We store the lemmas in two dictionaries, word_dict for words without apostrophe, word_dict_striped for words with apostrophe
    with open('word_dict.txt', 'wb') as f:
        with open('word_dict_striped.txt', 'wb') as g:
            reader = Reader(filename)
            for line in reader:
                items = line.split("\t")
                word_form = items[0]
                if "'" in word_form:
                    # we remove the apostrophe in the word form and morphological analysis and add the lemma to the dictionary word_dict_striped
                    items[0] = word_form.replace("'", "")
                    # str.replace matches literally (it is not a regex), so the "+" needs no backslash
                    items[2] = items[2].replace("+Apos", "")
                    g.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict_striped[items[0]] = Lemma(items)
                else:
                    # we just add the lemma to the dictionary word_dict
                    f.write("%s\t%s\t%s\t%s" % (items[0], items[1], items[2], items[3]))
                    word_dict[items[0]] = Lemma(items)
    return word_dict, word_dict_striped

def merge_word_dict(word_dict, word_dict_striped):
    '''Takes two dictionaries and merges them by adding the count of their frequencies if there is a common key.
    Does not run in reasonable time on the whole list.'''
    with open('word_compiled_dict.txt', 'wb') as f:
        for word in word_dict_striped.keys():
            if word in word_dict.keys():
                word_dict[word].freq += word_dict_striped[word].freq
                # freq is an int here, so the trailing newline has to be written explicitly
                f.write("%s\t%s\t%s\t%s\n" % (word_dict[word].word_form, word_dict[word].root, word_dict[word].morph, word_dict[word].freq))
            else:
                # no apostrophe-free entry exists, so the stripped Lemma is stored under its key
                word_dict[word] = word_dict_striped[word]
    print "Number of words: ",
    print(len(word_dict))
    for x in word_dict:
        print x, word_dict[x].root, word_dict[x].morph, word_dict[x].freq
    return word_dict
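This solution runs in a reasonable time up to the point where the two dictionaries are stored, whether I write them line by line to two text files to avoid keeping anything in memory or keep them as dict objects in the program. But the merging of the two dictionaries never finishes!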
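One possible reason for the slow merge, offered as an assumption rather than a confirmed diagnosis: in Python 2.7, word_dict.keys() builds a new list on every call, so "word in word_dict.keys()" copies and scans all keys for each word, whereas "word in word_dict" is a single hash lookup. The sketch below contrasts the two on toy data. The helper names count_common_slow and count_common_fast are hypothetical, and list(...) is used so the slow path behaves the same way on Python 3.

```python
def count_common_slow(striped, plain):
    """Count shared keys by scanning a fresh list of keys for every word."""
    common = 0
    for word in striped:
        # list(...) copies every key, and `in` on a list is a linear scan
        if word in list(plain.keys()):
            common += 1
    return common


def count_common_fast(striped, plain):
    """Count shared keys with direct dict membership (a hash lookup)."""
    common = 0
    for word in striped:
        if word in plain:
            common += 1
    return common


# Toy data: only "cupboard" appears in both dictionaries.
plain = {"cupboard": 10, "cat": 3}
striped = {"cupboard": 12, "oclock": 5}
assert count_common_slow(striped, plain) == count_common_fast(striped, plain) == 1
```

Both functions return the same count. The difference is only in cost: with about a million keys on each side, the list-based test does on the order of a million comparisons per word.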
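The dictionary "update" method works, but it overwrites one frequency count instead of adding the two. I have seen some solutions that merge dictionaries by adding counts with Counter:
Python: Elegantly merge dictionaries with sum() of values
Merge and sum of two dictionaries
How to sum dict elements
How to merge two Python dictionaries in a single expression?
Is there any pythonic way to combine two dicts (adding values for keys that appear in both)?
But they only seem to work when the dictionaries have the form (word, count), whereas I also want to carry the other fields along in the dictionary.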
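A minimal sketch of how a Counter could still be used when extra fields have to be carried along: the Counter sums the frequencies, and a second dictionary keyed by the same apostrophe-free word form holds the other columns. The function name merge_entries and the tuple layout are assumptions made for illustration. The rule of keeping the fields of the apostrophe-free line follows the cupboard example in the question. A "+Apos" marker in the third column is left untouched here.

```python
from collections import Counter


def merge_entries(entries):
    """Merge (word_form, root, morph, freq) tuples whose word forms differ only by apostrophes.

    Returns a list of merged tuples sorted by word form.
    """
    freqs = Counter()  # stripped word form -> summed frequency
    fields = {}        # stripped word form -> (root, morph, came_from_apostrophe_form)
    for word_form, root, morph, freq in entries:
        has_apos = "'" in word_form
        key = word_form.replace("'", "")
        freqs[key] += int(freq)
        # Keep the fields of the apostrophe-free line when there is one;
        # otherwise fall back to the first apostrophe line seen.
        if key not in fields or (fields[key][2] and not has_apos):
            fields[key] = (root, morph, has_apos)
    return [(key, fields[key][0], fields[key][1], freqs[key]) for key in sorted(freqs)]


merged = merge_entries([
    ("cup'board", "cup", "blabla", 12),
    ("cat", "cat", "noun", 3),
    ("cupboard", "cup", "blabla2", 10),
])
assert merged == [("cat", "cat", "noun", 3), ("cupboard", "cup", "blabla2", 22)]
```

Because everything is keyed on the stripped word form in a single pass, the two lines do not need to be adjacent and no second dictionary has to be merged afterwards.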
I am open to any ideas or reformulations of the problem, since my goal is to run this program just once to get this merged list in a text file. Thank you in advance!
【Comments】:
-
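Can't you simply remove all the apostrophes by replacing them with an empty string? Like this:
word_form = items[0].replace("'", "")
-
But then I would have two lines with the same word, and their frequencies would not be added, right?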
-
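Can at most two lines be merged for a given word, or possibly more? Are the lines to be combined necessarily next to each other? If two lines are merged, is everything else (apart from the count) guaranteed to be the same?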
-
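Yes, for a given word at most two lines can be combined: there is only one version with an apostrophe and one without. But no, the lines to combine are not necessarily next to each other. And no, when two lines are merged the third column actually differs, but ideally the line without the apostrophe should be kept (as in the example).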
-
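Oh, one more thing: are there apostrophes anywhere other than in the first word? (That is, would it be fine to replace them all with an empty string, as Sven suggested?)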
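If apostrophes can also occur in the other columns, stripping could be limited to the first field. A small sketch, assuming tab-separated columns as in the question's line.split("\t"); the function name strip_apostrophes_in_word_form is hypothetical.

```python
def strip_apostrophes_in_word_form(line):
    """Remove apostrophes from the first tab-separated field only."""
    # partition splits at the first tab and keeps the rest of the line intact
    word_form, sep, rest = line.partition("\t")
    return word_form.replace("'", "") + sep + rest


line = "cup'board\tcup\tbla'bla\t12\n"
assert strip_apostrophes_in_word_form(line) == "cupboard\tcup\tbla'bla\t12\n"
```

Stripping alone still leaves two lines for the same word, so the frequencies would have to be summed in a separate step.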
Tags: python python-2.7 dictionary merge