【Question title】: Update dictionary value with next word in file?
【Posted】: 2016-03-03 12:27:00
【Question description】:

I want to read a file and create a dictionary in which each word is a key and the words that follow it are the value.

For example, if I have a file containing:

'Cake is cake okay.'

the resulting dictionary should contain:

{'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}

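So far, my code manages to do the opposite: it updates each dictionary value with the previous word in the file. I'm not sure how to change it so that it works as intended.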

import string

def create_dict(file):
    word_dict = {}
    prev_word = ''

    for line in file:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)

            if clean_word not in word_dict:
                word_dict[clean_word] = []

            # This appends the previous word, the opposite of what is wanted.
            word_dict[clean_word].append(prev_word)
            prev_word = clean_word

    return word_dict

Thanks in advance for any help!

Edit

Progress so far:

def create_dict(file):
    word_dict = {}
    next_word = ''

    for line in file:
        formatted_line = line.lower().split()

        for word in formatted_line:
            clean_word = word.strip(string.punctuation)

            if next_word != '':
                if next_word not in word_dict:
                    word_dict[next_word] = []

            if clean_word == '':
                clean_word.

            next_word = clean_word
    return word_dict
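
One way the unfinished loop above might be completed is sketched here. It keeps the same structure but remembers the previous word and appends the current word to that word's list. This is only an illustration under stated assumptions, not the asker's final code: skipping tokens that are pure punctuation and carrying the previous word across line breaks are both choices made for this sketch.

```python
import string

def create_dict(file):
    """Map each word to the list of words that directly follow it."""
    word_dict = {}
    prev_word = None  # there is no previous word before the first one

    for line in file:
        for word in line.lower().split():
            clean_word = word.strip(string.punctuation)
            if not clean_word:
                continue  # assumption: ignore tokens that were only punctuation

            # Make sure every word is a key, even if nothing follows it.
            word_dict.setdefault(clean_word, [])

            # Record the current word as a follower of the previous word.
            if prev_word is not None:
                word_dict[prev_word].append(clean_word)
            prev_word = clean_word

    return word_dict
```

Any iterable of lines works as input, so an open file object or a plain list of strings can be passed.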

【Question discussion】:

    Tags: string python-3.x dictionary


    【Solution 1】:

    You can use itertools.zip_longest() and dict.setdefault() to get a shorter solution:

    import io
    from itertools import zip_longest  # izip_longest in Python 2
    import string
    
    def create_dict(fobj):
        word_dict = {}
        punc = string.punctuation
        for line in fobj:
            clean_words = [word.strip(punc) for word in line.lower().split()]
            for word, next_word in zip_longest(clean_words, clean_words[1:]):
                words = word_dict.setdefault(word, [])
                if next_word is not None:
                    words.append(next_word)
        return word_dict
    

    Test it:

    >>> fobj = io.StringIO("""Cake is cake okay.""")
    >>> create_dict(fobj)
    {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []}
    

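    【Discussion】:

      【Solution 2】:

      Separate the code that produces words from a given file (splitting on whitespace, case folding, stripping punctuation, etc.) from the code that creates the bigram dictionary (the subject of this question):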

      #!/usr/bin/env python3
      from collections import defaultdict
      from itertools import tee
      
      def create_bigram_dict(words):
          a, b = tee(words) # itertools' pairwise recipe
          next(b)
          bigrams = defaultdict(list)
          for word, next_word in zip(a, b):  
              bigrams[word].append(next_word)
          bigrams[next_word] # last word may have no following words
          return bigrams
      

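      This uses the pairwise() recipe from the itertools documentation. The code needs a slight adjustment to support files with fewer than two words. If you need an exact dict, you can use return dict(bigrams) here. Example: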

      >>> create_bigram_dict('cake is cake okay'.split())
      defaultdict(<class 'list'>, {'cake': ['is', 'okay'], 'is': ['cake'], 'okay': []})
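
The adjustment for inputs with fewer than two words could look roughly like the sketch below. It keeps the tee-based pairing but swaps zip for itertools.zip_longest with a private sentinel, so that empty and single-word inputs raise no error. The names create_bigram_dict_safe and _END are hypothetical and not part of the original answer.

```python
from collections import defaultdict
from itertools import tee, zip_longest

_END = object()  # sentinel marking "no following word"

def create_bigram_dict_safe(words):
    """Bigram dictionary that tolerates empty and one-word inputs."""
    a, b = tee(words)
    next(b, None)  # advance b by one; the default avoids StopIteration
    bigrams = defaultdict(list)
    # b is one item shorter than a (or both are empty), so only the
    # final word is ever paired with the sentinel.
    for word, next_word in zip_longest(a, b, fillvalue=_END):
        followers = bigrams[word]  # creates the key if it is missing
        if next_word is not _END:
            followers.append(next_word)
    return dict(bigrams)  # plain dict, as suggested in the text above
```

Because the last word is handled inside the loop, there is no separate lookup after it that could fail when the loop body never runs.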
      

      To create the dictionary from a file, you could define get_words(file):

      #!/usr/bin/env python3
      import regex as re  # $ pip install regex
      
      def get_words(file):
          with file:
              for line in file:
                  words = line.casefold().split()
                  for w in words:
                      yield re.fullmatch(r'\p{P}*(.*?)\p{P}*', w).group(1)
      

      Usage: create_bigram_dict(get_words(open('filename')))


      The \p{P} regex is used to strip Unicode punctuation. The code may keep punctuation inside words, for example:

      >>> import regex as re
      >>> re.fullmatch(r'\p{P}*(.*?)\p{P}*', "doesn't.").group(1)
      "doesn't"
      

      Note: the trailing dot is gone, but the internal ' is kept. To remove all punctuation, you could use s = re.sub(r'\p{P}+', '', s):

      >>> re.sub(r'\p{P}+', '', "doesn't.")
      'doesnt'
      

      Note: the single quote is gone too.
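
If installing the third-party regex module is not an option, a rough standard-library approximation of the two operations above might look like this. It assumes that \p{P} corresponds to the Unicode general categories whose names start with "P", which unicodedata.category() reports. The helper names is_punct, strip_punct and remove_all_punct are hypothetical.

```python
import unicodedata

def is_punct(ch):
    """True if ch is in a Unicode punctuation category (Pc, Pd, Po, ...)."""
    return unicodedata.category(ch).startswith('P')

def strip_punct(word):
    """Remove leading and trailing punctuation, keep any inside the word."""
    start, end = 0, len(word)
    while start < end and is_punct(word[start]):
        start += 1
    while end > start and is_punct(word[end - 1]):
        end -= 1
    return word[start:end]

def remove_all_punct(s):
    """Remove every punctuation character, including those inside words."""
    return ''.join(ch for ch in s if not is_punct(ch))
```

Unlike string.punctuation, this covers non-ASCII marks such as « and », but it does not treat symbols such as $ or + as punctuation, because Unicode places them in symbol categories.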

      【Discussion】:
