【Question Title】: Best way to extract keywords from input NLP sentence
【Posted】: 2015-02-08 22:31:52
【Question】:

I'm working on a project that needs to extract important keywords from sentences. I've been using a rule-based system based on POS tags, but I keep running into ambiguous terms that it can't resolve. Is there a machine-learning classifier I could use to extract relevant keywords, trained on a set of example sentences?

【Question Discussion】:

Tags: python machine-learning nlp


【Solution 1】:

Check out RAKE: it's a pretty nice little Python library.

Edit: I also found a tutorial on how to get started with it.
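The answer doesn't name a concrete package, so here is a minimal sketch assuming the rake-nltk implementation (pip install rake-nltk); the package choice and the sample sentence are mine, not the answerer's:

    import nltk
    from rake_nltk import Rake

    nltk.download('stopwords')  # RAKE delimits candidate phrases at stopwords
    nltk.download('punkt')      # sentence tokenizer used to split the input

    r = Rake()
    r.extract_keywords_from_text(
        'I am working on a project that needs to extract important '
        'keywords from a sentence.'
    )
    # (score, phrase) pairs, highest-scoring first
    print(r.get_ranked_phrases_with_scores())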

【Discussion】:

• Thanks. Looks interesting. I'll definitely check it out!
• @DanielSvoboda: if it helped you, could you accept this answer? I'm currently grinding for a few extra reputation points. Thanks a lot.
【Solution 2】:

You can also try this multilingual RAKE implementation - it works with any language.

It can be installed with pip install multi-rake:

    from multi_rake import Rake

    text_en = (
        'Compatibility of systems of linear constraints over the set of '
        'natural numbers. Criteria of compatibility of a system of linear '
        'Diophantine equations, strict inequations, and nonstrict inequations '
        'are considered. Upper bounds for components of a minimal set of '
        'solutions and algorithms of construction of minimal generating sets '
        'of solutions for all types of systems are given. These criteria and '
        'the corresponding algorithms for constructing a minimal supporting '
        'set of solutions can be used in solving all the considered types of '
        'systems and systems of mixed types.'
    )

    rake = Rake()

    # apply() returns (keyword, score) pairs, highest-scoring first
    keywords = rake.apply(text_en)

    print(keywords[:10])

    #  ('minimal generating sets', 8.666666666666666),
    #  ('linear diophantine equations', 8.5),
    #  ('minimal supporting set', 7.666666666666666),
    #  ('minimal set', 4.666666666666666),
    #  ('linear constraints', 4.5),
    #  ('natural numbers', 4.0),
    #  ('strict inequations', 4.0),
    #  ('nonstrict inequations', 4.0),
    #  ('upper bounds', 4.0),
    #  ('mixed types', 3.666666666666667)
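
A usage note, hedged: per the multi-rake README the language is auto-detected by default, and there is reportedly a constructor parameter to force it; the parameter name below is an assumption, not verified against every release.

    # assumption from the multi-rake README: force German stopwords
    # instead of relying on language auto-detection
    rake_de = Rake(language_code='de')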
    

【Discussion】:

【Solution 3】:

We can also use gensim to extract keywords from a given text:

    # note: gensim.summarization was removed in gensim 4.0,
    # so this requires gensim < 4.0
    from gensim.summarization import keywords

    text_en = (
        'Compatibility of systems of linear constraints over the set of '
        'natural numbers. Criteria of compatibility of a system of linear '
        'Diophantine equations, strict inequations, and nonstrict inequations '
        'are considered. Upper bounds for components of a minimal set of '
        'solutions and algorithms of construction of minimal generating sets '
        'of solutions for all types of systems are given. These criteria and '
        'the corresponding algorithms for constructing a minimal supporting '
        'set of solutions can be used in solving all the considered types of '
        'systems and systems of mixed types.'
    )

    print(keywords(text_en, words=10, scores=True, lemmatize=True))
      

The output will be:

    [('numbers', 0.31009020729627595),
     ('types', 0.2612797117033426),
     ('upper', 0.26127971170334247),
     ('considered', 0.2539581373644024),
     ('minimal', 0.25089449987505835),
     ('sets', 0.2508944998750583),
     ('inequations', 0.25051980840329924),
     ('linear', 0.2505198084032991),
     ('strict', 0.23778663563992564),
     ('diophantine', 0.23778663563992555)]
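
Since gensim.summarization was removed in gensim 4.0, here is a hedged alternative: the standalone summa package (pip install summa) ships the TextRank keyword extractor that gensim's module was based on. The exact keyword arguments below are an assumption from summa's documented API:

    # assumption: summa's keywords() mirrors the words/scores
    # parameters of the old gensim.summarization version
    from summa import keywords as summa_keywords

    print(summa_keywords.keywords(text_en, words=10, scores=True))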
      

【Discussion】:

【Solution 4】:

Try TfidfVectorizer from sklearn:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        'This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?',
    ]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)
    # note: scikit-learn >= 1.0 renames this to get_feature_names_out()
    print(vectorizer.get_feature_names())
        

This gives the keywords in the corpus. You can also get each keyword's score, take the top n keywords, and so on - see the sketch after the output below.

Output:

    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
        

Stop words such as "is" and "the" appear in the output above because the corpus is very small. With a larger corpus you can get the most important keywords ranked by priority. Check out TfidfVectorizer for more information.
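
A minimal sketch of the "scores and top-n" part, building on the X and vectorizer defined above; the ranking-by-summed-tf-idf heuristic is my own illustration, not part of the original answer:

    import numpy as np

    # sum each term's tf-idf weight over all documents, then rank terms
    scores = np.asarray(X.sum(axis=0)).ravel()
    terms = vectorizer.get_feature_names()  # get_feature_names_out() on newer sklearn
    top_n = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)[:5]
    print(top_n)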

【Discussion】:

【Solution 5】:

If you are extracting important keywords from an entire corpus, this snippet may help you pull out words based on their idf values. We'll extract keywords from the alt.atheism category of the 20 Newsgroups dataset. Maybe not your go-to choice :)

    ## the code is commented throughout

    ## load dependencies
    import gensim
    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS
    from nltk.stem import WordNetLemmatizer, SnowballStemmer
    import nltk
    nltk.download('wordnet')
    from sklearn.feature_extraction.text import TfidfVectorizer

    ## our dataset
    from sklearn.datasets import fetch_20newsgroups
    newsgroups_train = fetch_20newsgroups(subset='train', shuffle=True,
                                          categories=['alt.atheism'])

    ## define a stemmer to use
    stemmer = SnowballStemmer('english')

    ## this dictionary will come in handy later on: it maps each stemmed
    ## token back to an original surface form
    stemmed_to_original = {}

    ## basic preprocessing functions
    def lemmatize_stemming(text):
        return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

    def preprocess(text):
        result = []
        for token in simple_preprocess(text):
            if token not in STOPWORDS and len(token) > 3:
                stemmed_token = lemmatize_stemming(token)
                stemmed_to_original[stemmed_token] = token
                result.append(stemmed_token)
        return result

    news_data = [preprocess(i) for i in newsgroups_train.data]

    ## note: the min_df and max_df parameters are really important for
    ## getting the most important keywords out of your corpus
    vectorizer = TfidfVectorizer(stop_words=STOPWORDS, min_df=20, max_df=0.72,
                                 tokenizer=lambda x: x, lowercase=False)
    vectorizer.fit_transform(news_data)

    ## get the idf value of every token used by the vectorizer and sort in
    ## ascending order: once stopwords and (really frequent / really rare)
    ## words have been filtered out by the parameters above, this kind of
    ## sorting surfaces the important keywords
    word_to_idf = {i: j for i, j in zip(vectorizer.get_feature_names(),
                                        vectorizer.idf_)}
    word_to_idf = sorted(word_to_idf.items(), key=lambda x: x[1], reverse=False)
    print(word_to_idf)
          

Let's print the top N results:

    for k, v in word_to_idf[:5]:
        print('{} ---> {} ----> {}'.format(k, stemmed_to_original[k], v))
          

Let's look at the top results.

If we had been more careful about stripping news headers and salutations, we could have avoided words like post, article, and host - but that's fine. A sketch of that cleanup follows the output below.

    post ---> posting ----> 1.4392949726265691
    articl ---> article ----> 1.4754236967150747
    host ---> host ----> 1.7035965964342865
    nntp ---> nntp ----> 1.7248288165400607
    think ---> think ----> 1.8287597393882924
    peopl ---> people ----> 1.887600239411226
    know ---> know ----> 1.994083719813676
    univers ---> universe ----> 1.994083719813676
    atheist ---> atheists ----> 2.011081296182247
    like ---> like ----> 2.016811970891232
    thing ---> things ----> 2.094462905121298
    time ---> time ----> 2.199133527685187
    mean ---> means ----> 2.2271073797275927
    believ ---> believe ----> 2.2705924916673315
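
A minimal sketch of that cleanup: fetch_20newsgroups accepts a remove parameter that strips headers, footers, and quoted replies at load time, so metadata words like 'posting', 'nntp', and 'host' never reach the vectorizer. Only the loading call changes; the rest of the pipeline above stays the same:

    from sklearn.datasets import fetch_20newsgroups

    # drop message metadata before any preprocessing
    newsgroups_train = fetch_20newsgroups(
        subset='train',
        categories=['alt.atheism'],
        remove=('headers', 'footers', 'quotes'),
    )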
          

【Discussion】:
