[Question Title]: n-grams in python, four, five, six grams?
[Posted]: 2013-07-06 01:59:56
[Question]:

I'm looking for a way to split a text into n-grams. Normally I would do something like:

import nltk
from nltk import bigrams
string = "I really like python, it's pretty awesome."
string_bigrams = bigrams(string)
print(list(string_bigrams))  # bigrams() returns a generator in recent nltk versions

I know that nltk only offers bigrams and trigrams, but is there a way to split my text into four-grams, five-grams or even hundred-grams?

Thanks!

[Question Comments]:

  • Do you want the text split into n-sized groups of words or of characters? Can you give an example of what the output should look like?
  • I've never used nltk, but it looks like there is a function ingrams whose second parameter is the degree of the ngrams you want. Is THIS the version of nltk you are using? Even if not, here is the source. EDIT: there are ngrams and ingrams in there; ingrams is a generator.
  • There is also an answer under this thread that may be useful: stackoverflow.com/questions/7591258/fast-n-gram-calculation

Tags: python string nltk n-gram


[Solution 1]:

This is an old question, but if you want the n-grams as a list of substrings (rather than a list of lists or tuples) and don't want to import anything, the following code works fine and is easy to read:

def get_substrings(phrase, n):
    phrase = phrase.split()
    substrings = []
    for i in range(len(phrase)):
        if len(phrase[i:i+n]) == n:
            substrings.append(' '.join(phrase[i:i+n]))
    return substrings

You can use it, for example, like this to get all n-grams of a list of terms up to a words long:

a = 5
terms = [
    "An n-gram is a contiguous sequence of n items",
    "An n-gram of size 1 is referred to as a unigram",
]

for term in terms:
    for i in range(1, a+1):
        print(f"{i}-grams: {get_substrings(term, i)}")

This prints:

1-grams: ['An', 'n-gram', 'is', 'a', 'contiguous', 'sequence', 'of', 'n', 'items']
2-grams: ['An n-gram', 'n-gram is', 'is a', 'a contiguous', 'contiguous sequence', 'sequence of', 'of n', 'n items']
3-grams: ['An n-gram is', 'n-gram is a', 'is a contiguous', 'a contiguous sequence', 'contiguous sequence of', 'sequence of n', 'of n items']
4-grams: ['An n-gram is a', 'n-gram is a contiguous', 'is a contiguous sequence', 'a contiguous sequence of', 'contiguous sequence of n', 'sequence of n items']
5-grams: ['An n-gram is a contiguous', 'n-gram is a contiguous sequence', 'is a contiguous sequence of', 'a contiguous sequence of n', 'contiguous sequence of n items']
1-grams: ['An', 'n-gram', 'of', 'size', '1', 'is', 'referred', 'to', 'as', 'a', 'unigram']
2-grams: ['An n-gram', 'n-gram of', 'of size', 'size 1', '1 is', 'is referred', 'referred to', 'to as', 'as a', 'a unigram']
3-grams: ['An n-gram of', 'n-gram of size', 'of size 1', 'size 1 is', '1 is referred', 'is referred to', 'referred to as', 'to as a', 'as a unigram']
4-grams: ['An n-gram of size', 'n-gram of size 1', 'of size 1 is', 'size 1 is referred', '1 is referred to', 'is referred to as', 'referred to as a', 'to as a unigram']
5-grams: ['An n-gram of size 1', 'n-gram of size 1 is', 'of size 1 is referred', 'size 1 is referred to', '1 is referred to as', 'is referred to as a', 'referred to as a unigram']

[Comments]:

  • Could you add how this answer differs from the previous answers?

[Solution 2]:

Doing n-grams in python is easy, for example:

def n_gram(tokens, n):
    return [tokens[i:i+n] for i in range(len(tokens) - n + 1)]

If you then do:

s = "I really like python, it's pretty awesome."
n_gram(s.split(" "), 4)

you get:

[['I', 'really', 'like', 'python,'], 
['really', 'like', 'python,', "it's"], 
['like', 'python,', "it's", 'pretty'], 
['python,', "it's", 'pretty', 'awesome.']]

[Comments]:

[Solution 3]:

About seven years later, here is a more elegant answer using collections.deque:

import collections
import itertools

def ngrams(words, n):
    d = collections.deque(maxlen=n)
    d.extend(words[:n])
    words = words[n:]
    for window, word in zip(itertools.cycle((d,)), words):
        print(' '.join(window))
        d.append(word)
    print(' '.join(window))

words = ['I', 'am', 'become', 'death,', 'the', 'destroyer', 'of', 'worlds']

Output:

    In [236]: ngrams(words, 2)
    I am
    am become
    become death,
    death, the
    the destroyer
    destroyer of
    of worlds
    
    In [237]: ngrams(words, 3)
    I am become
    am become death,
    become death, the
    death, the destroyer
    the destroyer of
    destroyer of worlds
    
    In [238]: ngrams(words, 4)
    I am become death,
    am become death, the
    become death, the destroyer
    death, the destroyer of
    the destroyer of worlds
    
    In [239]: ngrams(words, 1)
    I
    am
    become
    death,
    the
    destroyer
    of
    worlds
    
    

[Comments]:

• The last ngram seems to be missing.
• @BjörnLindqvist: Thanks for the bug report. Now fixed :)

[Solution 4]:

If you want a pure-iterator solution for large strings with constant memory usage:

import re
import itertools
from typing import Iterable

def ngrams_iter(input: str, ngram_size: int, token_regex=r"[^\s]+") -> Iterable[str]:
    input_iters = [
        map(lambda m: m.group(0), re.finditer(token_regex, input))
        for n in range(ngram_size)
    ]
    # Skip the first words of each successive iterator
    for n in range(1, ngram_size):
        list(map(next, input_iters[n:]))

    output_iter = itertools.starmap(
        lambda *args: " ".join(args),
        zip(*input_iters)
    )
    return output_iter
    

Test:

input = "If you want a pure iterator solution for large strings with constant memory usage"
list(ngrams_iter(input, 5))
    

Output:

    ['If you want a pure',
     'you want a pure iterator',
     'want a pure iterator solution',
     'a pure iterator solution for',
     'pure iterator solution for large',
     'iterator solution for large strings',
     'solution for large strings with',
     'for large strings with constant',
     'large strings with constant memory',
     'strings with constant memory usage']
    

[Comments]:

[Solution 5]:

People have already answered pretty well for the case where you need bigrams or trigrams, but if you need every gram for the sentence, you can use nltk.util.everygrams:

      >>> from nltk.util import everygrams
      
      >>> message = "who let the dogs out"
      
      >>> msg_split = message.split()
      
      >>> list(everygrams(msg_split))
      [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out'), ('who', 'let', 'the'), ('let', 'the', 'dogs'), ('the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs'), ('let', 'the', 'dogs', 'out'), ('who', 'let', 'the', 'dogs', 'out')]
      

If you have a constraint on the maximum gram length, you can specify it with the max_len parameter:

      >>> list(everygrams(msg_split, max_len=2))
      [('who',), ('let',), ('the',), ('dogs',), ('out',), ('who', 'let'), ('let', 'the'), ('the', 'dogs'), ('dogs', 'out')]
      

You can modify the max_len parameter to achieve whatever gram you need: four-gram, five-gram, six-gram, even hundred-gram.

The previously mentioned solutions can be modified to implement the above, but this solution is much more straightforward.

For more reading, click here.

When you just need a specific gram, such as a bigram or trigram, you can use nltk.util.ngrams as mentioned in M.A.Hassan's answer.
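everygrams also accepts a min_len parameter (present in current nltk versions), so you can ask for only the longer grams, e.g. just the 4-grams and 5-grams; a small sketch:

```python
from nltk.util import everygrams

msg_split = "who let the dogs out".split()

# Only grams of length 4 and 5, skipping the shorter ones
grams = list(everygrams(msg_split, min_len=4, max_len=5))
print(grams)
```

With five tokens this yields the two 4-grams plus the single 5-gram.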

[Comments]:

[Solution 6]:

Excellent native-Python answers have been given by other users. But here is the nltk approach (just in case the OP gets penalized for reinventing what already exists in the nltk library).

There is an ngram module in nltk that people seldom use. It's not because ngrams are hard to read, but because training a model on ngrams with n > 3 results in a lot of data sparsity.

from nltk import ngrams

sentence = 'this is a foo bar sentences and i want to ngramize it'

n = 6
sixgrams = ngrams(sentence.split(), n)

for grams in sixgrams:
    print(grams)
        

[Comments]:

• Is there a way to use N-grams to check a whole document, e.g. a txt file? I'm not familiar with Python, so I don't know whether it can open a txt file and then use N-gram analysis to check through it.
• Can someone comment on how to test the accuracy of sixgrams?
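In response to the first comment above: a minimal sketch for counting word n-grams across a whole text file (the helper name file_ngram_counts and the whitespace tokenization are my own choices, not from any answer here):

```python
from collections import Counter

def file_ngram_counts(path, n):
    """Read an entire text file and count its word n-grams."""
    with open(path, encoding='utf-8') as f:
        tokens = f.read().split()
    # zip over n progressively offset views yields each n-token window
    return Counter(zip(*[tokens[i:] for i in range(n)]))
```

For example, file_ngram_counts('document.txt', 3) (with 'document.txt' a hypothetical file) returns a Counter mapping each trigram tuple to its frequency.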
[Solution 7]:

You can get all 4-grams to 6-grams using the code below, without any other package:

from itertools import chain

def get_m_2_ngrams(input_list, min_n, max_n):
    for s in chain(*[get_ngrams(input_list, k) for k in range(min_n, max_n+1)]):
        yield ' '.join(s)

def get_ngrams(input_list, n):
    return zip(*[input_list[i:] for i in range(n)])

if __name__ == '__main__':
    input_list = ['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
    for s in get_m_2_ngrams(input_list, 4, 6):
        print(s)
        

The output looks like:

        I am aware that
        am aware that nltk
        aware that nltk only
        that nltk only offers
        nltk only offers bigrams
        only offers bigrams and
        offers bigrams and trigrams
        bigrams and trigrams ,
        and trigrams , but
        trigrams , but is
        , but is there
        but is there a
        is there a way
        there a way to
        a way to split
        way to split my
        to split my text
        split my text in
        my text in four-grams
        text in four-grams ,
        in four-grams , five-grams
        four-grams , five-grams or
        , five-grams or even
        five-grams or even hundred-grams
        I am aware that nltk
        am aware that nltk only
        aware that nltk only offers
        that nltk only offers bigrams
        nltk only offers bigrams and
        only offers bigrams and trigrams
        offers bigrams and trigrams ,
        bigrams and trigrams , but
        and trigrams , but is
        trigrams , but is there
        , but is there a
        but is there a way
        is there a way to
        there a way to split
        a way to split my
        way to split my text
        to split my text in
        split my text in four-grams
        my text in four-grams ,
        text in four-grams , five-grams
        in four-grams , five-grams or
        four-grams , five-grams or even
        , five-grams or even hundred-grams
        I am aware that nltk only
        am aware that nltk only offers
        aware that nltk only offers bigrams
        that nltk only offers bigrams and
        nltk only offers bigrams and trigrams
        only offers bigrams and trigrams ,
        offers bigrams and trigrams , but
        bigrams and trigrams , but is
        and trigrams , but is there
        trigrams , but is there a
        , but is there a way
        but is there a way to
        is there a way to split
        there a way to split my
        a way to split my text
        way to split my text in
        to split my text in four-grams
        split my text in four-grams ,
        my text in four-grams , five-grams
        text in four-grams , five-grams or
        in four-grams , five-grams or even
        four-grams , five-grams or even hundred-grams
        

You can find more details in this blog.

[Comments]:

[Solution 8]:

If efficiency is an issue and you have to build multiple different n-grams (up to a hundred, as you say), but you want to use pure python, I would do:

from itertools import chain

def n_grams(seq, n=1):
    """Returns an iterator over the n-grams given a list of tokens"""
    shiftToken = lambda i: (el for j, el in enumerate(seq) if j >= i)
    shiftedTokens = (shiftToken(i) for i in range(n))
    tupleNGrams = zip(*shiftedTokens)
    return tupleNGrams  # if join in generator: (" ".join(i) for i in tupleNGrams)

def range_ngrams(listTokens, ngramRange=(1,2)):
    """Returns an iterator over all n-grams for n in range(ngramRange) given a list of tokens."""
    return chain(*(n_grams(listTokens, i) for i in range(*ngramRange)))
          

Usage:

>>> input_list = 'test the ngrams generator'.split()
>>> list(range_ngrams(input_list, ngramRange=(1,3)))
[('test',), ('the',), ('ngrams',), ('generator',), ('test', 'the'), ('the', 'ngrams'), ('ngrams', 'generator'), ('test', 'the', 'ngrams'), ('the', 'ngrams', 'generator')]
          

~Same speed as NLTK:

import nltk
%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list, n=5)
# 7.02 ms ± 79 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
n_grams(input_list, n=5)
# 7.01 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
nltk.ngrams(input_list, n=1)
nltk.ngrams(input_list, n=2)
nltk.ngrams(input_list, n=3)
nltk.ngrams(input_list, n=4)
nltk.ngrams(input_list, n=5)
# 7.32 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
input_list = 'test the ngrams interator vs nltk '*10**6
range_ngrams(input_list, ngramRange=(1,6))
# 7.13 ms ± 165 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
          

Reposted from my previous answer.

[Comments]:

[Solution 9]:

I'm surprised this hasn't shown up yet:

In [34]: sentence = "I really like python, it's pretty awesome.".split()

In [35]: N = 4

In [36]: grams = [sentence[i:i+N] for i in range(len(sentence)-N+1)]

In [37]: for gram in grams: print(gram)
['I', 'really', 'like', 'python,']
['really', 'like', 'python,', "it's"]
['like', 'python,', "it's", 'pretty']
['python,', "it's", 'pretty', 'awesome.']
            

[Comments]:

• This is exactly what the first answer does, minus the frequency counting and the tuple conversion.
• It is nicer to see it rewritten as a comprehension, though.
• @amirouche: good catch. Thanks for the bug report. It's fixed now.

[Solution 10]:

A more elegant way to build bigrams with python's built-in zip(). Just convert the original string into a list with split(), then pass the list once normally and once offset by one element.

string = "I really like python, it's pretty awesome."

def find_bigrams(s):
    input_list = s.split(" ")
    return zip(input_list, input_list[1:])

def find_ngrams(s, n):
    input_list = s.split(" ")
    return zip(*[input_list[i:] for i in range(n)])

list(find_bigrams(string))

[('I', 'really'), ('really', 'like'), ('like', 'python,'), ('python,', "it's"), ("it's", 'pretty'), ('pretty', 'awesome.')]
            

[Comments]:

[Solution 11]:

Using only nltk tools:

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

def get_ngrams(text, n):
    n_grams = ngrams(word_tokenize(text), n)
    return [' '.join(grams) for grams in n_grams]
              

Sample output:

get_ngrams('This is the simplest text i could think of', 3)

['This is the', 'is the simplest', 'the simplest text', 'simplest text i', 'text i could', 'i could think', 'could think of']

To keep the ngrams in array format, just remove ' '.join.

[Comments]:

[Solution 12]:

Nltk is great, but sometimes it is overhead for some projects:

import re

def tokenize(text, ngrams=1):
    text = re.sub(r'[\b\(\)\\\"\'\/\[\]\s+\,\.:\?;]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    tokens = text.split()
    return [tuple(tokens[i:i+ngrams]) for i in range(len(tokens)-ngrams+1)]
                

Usage example:

>> text = "This is an example text"
>> tokenize(text, 2)
[('This', 'is'), ('is', 'an'), ('an', 'example'), ('example', 'text')]
>> tokenize(text, 3)
[('This', 'is', 'an'), ('is', 'an', 'example'), ('an', 'example', 'text')]
                

[Comments]:

[Solution 13]:

You can use sklearn.feature_extraction.text.CountVectorizer:

import sklearn.feature_extraction.text  # FYI http://scikit-learn.org/stable/install.html
ngram_size = 4
string = ["I really like python, it's pretty awesome."]
vect = sklearn.feature_extraction.text.CountVectorizer(ngram_range=(ngram_size, ngram_size))
vect.fit(string)
# Note: in scikit-learn >= 1.0 use vect.get_feature_names_out() instead
print('{1}-grams: {0}'.format(vect.get_feature_names(), ngram_size))
                  

Output:

4-grams: [u'like python it pretty', u'python it pretty awesome', u'really like python it']

You can set ngram_size to any positive integer, i.e. you can split a text into four-grams, five-grams or even hundred-grams.

[Comments]:

[Solution 14]:

For four_grams it is already in NLTK; here is a piece of code that can help you toward it:

from nltk.collocations import *
import nltk
# You should tokenize your text
text = "I do not like green eggs and ham, I do not like them Sam I am!"
tokens = nltk.wordpunct_tokenize(text)
fourgrams = nltk.collocations.QuadgramCollocationFinder.from_words(tokens)
for fourgram, freq in fourgrams.ngram_fd.items():
    print(fourgram, freq)

Hope this helps.

[Comments]:

[Solution 15]:

Here is another simple way of doing n-grams:

>>> import nltk
>>> from nltk.util import ngrams
>>> text = "I am aware that nltk only offers bigrams and trigrams, but is there a way to split my text in four-grams, five-grams or even hundred-grams"
>>> tokenize = nltk.word_tokenize(text)
>>> tokenize
['I', 'am', 'aware', 'that', 'nltk', 'only', 'offers', 'bigrams', 'and', 'trigrams', ',', 'but', 'is', 'there', 'a', 'way', 'to', 'split', 'my', 'text', 'in', 'four-grams', ',', 'five-grams', 'or', 'even', 'hundred-grams']
>>> bigrams = ngrams(tokenize,2)
>>> bigrams
[('I', 'am'), ('am', 'aware'), ('aware', 'that'), ('that', 'nltk'), ('nltk', 'only'), ('only', 'offers'), ('offers', 'bigrams'), ('bigrams', 'and'), ('and', 'trigrams'), ('trigrams', ','), (',', 'but'), ('but', 'is'), ('is', 'there'), ('there', 'a'), ('a', 'way'), ('way', 'to'), ('to', 'split'), ('split', 'my'), ('my', 'text'), ('text', 'in'), ('in', 'four-grams'), ('four-grams', ','), (',', 'five-grams'), ('five-grams', 'or'), ('or', 'even'), ('even', 'hundred-grams')]
>>> trigrams=ngrams(tokenize,3)
>>> trigrams
[('I', 'am', 'aware'), ('am', 'aware', 'that'), ('aware', 'that', 'nltk'), ('that', 'nltk', 'only'), ('nltk', 'only', 'offers'), ('only', 'offers', 'bigrams'), ('offers', 'bigrams', 'and'), ('bigrams', 'and', 'trigrams'), ('and', 'trigrams', ','), ('trigrams', ',', 'but'), (',', 'but', 'is'), ('but', 'is', 'there'), ('is', 'there', 'a'), ('there', 'a', 'way'), ('a', 'way', 'to'), ('way', 'to', 'split'), ('to', 'split', 'my'), ('split', 'my', 'text'), ('my', 'text', 'in'), ('text', 'in', 'four-grams'), ('in', 'four-grams', ','), ('four-grams', ',', 'five-grams'), (',', 'five-grams', 'or'), ('five-grams', 'or', 'even'), ('or', 'even', 'hundred-grams')]
>>> fourgrams=ngrams(tokenize,4)
>>> fourgrams
[('I', 'am', 'aware', 'that'), ('am', 'aware', 'that', 'nltk'), ('aware', 'that', 'nltk', 'only'), ('that', 'nltk', 'only', 'offers'), ('nltk', 'only', 'offers', 'bigrams'), ('only', 'offers', 'bigrams', 'and'), ('offers', 'bigrams', 'and', 'trigrams'), ('bigrams', 'and', 'trigrams', ','), ('and', 'trigrams', ',', 'but'), ('trigrams', ',', 'but', 'is'), (',', 'but', 'is', 'there'), ('but', 'is', 'there', 'a'), ('is', 'there', 'a', 'way'), ('there', 'a', 'way', 'to'), ('a', 'way', 'to', 'split'), ('way', 'to', 'split', 'my'), ('to', 'split', 'my', 'text'), ('split', 'my', 'text', 'in'), ('my', 'text', 'in', 'four-grams'), ('text', 'in', 'four-grams', ','), ('in', 'four-grams', ',', 'five-grams'), ('four-grams', ',', 'five-grams', 'or'), (',', 'five-grams', 'or', 'even'), ('five-grams', 'or', 'even', 'hundred-grams')]
                      

[Comments]:

• You have to execute nltk.download('punkt') in order to use the nltk.word_tokenize() function. Also, to print the results you have to convert the generator objects (bigrams, trigrams and fourgrams) to lists using list().

[Solution 16]:

You can easily whip up your own function to do this using itertools:

from itertools import islice, tee
s = 'spam and eggs'
N = 3
trigrams = zip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))
list(trigrams)
# [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' '),
# ('m', ' ', 'a'), (' ', 'a', 'n'), ('a', 'n', 'd'),
# ('n', 'd', ' '), ('d', ' ', 'e'), (' ', 'e', 'g'),
# ('e', 'g', 'g'), ('g', 'g', 's')]
                      

[Comments]:

• Can you explain izip(*(islice(seq, index, None) for index, seq in enumerate(tee(s, N))))? I don't quite understand it.
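To answer the question above, here is the same expression unpacked step by step (using Python 3's zip in place of izip; the intermediate names are my own):

```python
from itertools import islice, tee

s = 'spam and eggs'
N = 3

# tee(s, N) makes N independent iterators over the characters of s
copies = tee(s, N)

# islice(seq, index, None) drops the first `index` characters,
# so copy 0 starts at s[0], copy 1 at s[1], copy 2 at s[2]
shifted = (islice(seq, index, None) for index, seq in enumerate(copies))

# zip then walks the shifted copies in lockstep, yielding
# (s[i], s[i+1], s[i+2]) for each position i
trigrams = list(zip(*shifted))
print(trigrams[:3])  # [('s', 'p', 'a'), ('p', 'a', 'm'), ('a', 'm', ' ')]
```

In other words, it is the same "zip of progressively offset copies" trick as in the other answers, only built from iterators instead of list slices.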
[Solution 17]:

I have never dealt with nltk, but did N-grams as part of a small class project. If you want to find the frequency of all N-grams occurring in a string, here is a way to do that. D gives you the histogram of your N-word groups.

D = dict()
N = 2  # choose your gram size
string = 'whatever string...'
strparts = string.split()
for i in range(len(strparts) - N + 1):  # the +1 includes the final N-gram
    gram = tuple(strparts[i:i+N])
    D[gram] = D.get(gram, 0) + 1
                      

[Comments]:

• collections.Counter(tuple(strparts[i:i+N]) for i in xrange(len(strparts)-N)) would work faster than the try-except.
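The Counter approach from the comment can be written out as a runnable sketch (the example string is illustrative; note the range bound len(strparts) - N + 1, which keeps the final N-gram):

```python
from collections import Counter

N = 2
string = 'the cat sat on the mat the cat'
strparts = string.split()

# One-pass histogram of all N-grams in the string
D = Counter(tuple(strparts[i:i+N]) for i in range(len(strparts) - N + 1))
print(D.most_common(1))  # [(('the', 'cat'), 2)]
```

Counter handles the missing-key case internally, so no try-except (or dict.get) is needed.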