【Title】: Practical examples of NLTK use [closed]
【Posted】: 2010-10-06 07:16:40
【Question】:

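I am playing around with the Natural Language Toolkit (NLTK).

Its documentation (Book, HOWTO) is quite bulky, and the examples are sometimes slightly advanced.

Are there any good but basic examples of uses/applications of NLTK? I am thinking of something like the NLTK articles on the Stream Hacker blog.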

【Comments on the question】:

    Tags: python nlp nltk


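    【Answer 1】:

    Here is my own practical example, for the benefit of anyone else looking up this question (excuse the sample text, it was the first thing I found on Wikipedia):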

    import nltk
    import pprint
    
    tokenizer = None
    tagger = None
    
    def init_nltk():
        # Build the tokenizer and tagger once.
        # Requires the Brown corpus: nltk.download('brown')
        global tokenizer
        global tagger
        tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
        tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())
    
    def tag(text):
        global tokenizer
        global tagger
        # Initialise lazily on the first call.
        if not tokenizer:
            init_nltk()
        tokenized = tokenizer.tokenize(text)
        tagged = tagger.tag(tokenized)
        # Sort by tag; unknown words are tagged None and sort first.
        tagged.sort(key=lambda pair: pair[1] or '')
        return tagged
    
    def main():
        text = """Mr Blobby is a fictional character who featured on Noel
        Edmonds' Saturday night entertainment show Noel's House Party,
        which was often a ratings winner in the 1990s. Mr Blobby also
        appeared on the Jamie Rose show of 1997. He was designed as an
        outrageously over the top parody of a one-dimensional, mute novelty
        character, which ironically made him distinctive, absurd and popular.
        He was a large pink humanoid, covered with yellow spots, sporting a
        permanent toothy grin and jiggling eyes. He communicated by saying
        the word "blobby" in an electronically-altered voice, expressing
        his moods through tone of voice and repetition.
    
        There was a Mrs. Blobby, seen briefly in the video, and sold as a
        doll.
    
        However Mr Blobby actually started out as part of the 'Gotcha'
        feature during the show's second series (originally called 'Gotcha
        Oscars' until the threat of legal action from the Academy of Motion
        Picture Arts and Sciences[citation needed]), in which celebrities
        were caught out in a Candid Camera style prank. Celebrities such as
        dancer Wayne Sleep and rugby union player Will Carling would be
        enticed to take part in a fictitious children's programme based around
        their profession. Mr Blobby would clumsily take part in the activity,
        knocking over the set, causing mayhem and saying "blobby blobby
        blobby", until finally when the prank was revealed, the Blobby
        costume would be opened - revealing Noel inside. This was all the more
        surprising for the "victim" as during rehearsals Blobby would be
        played by an actor wearing only the arms and legs of the costume and
        speaking in a normal manner.[citation needed]"""
        tagged = tag(text)
        # Drop duplicate (word, tag) pairs, then sort by tag (None first).
        l = list(set(tagged))
        l.sort(key=lambda pair: pair[1] or '')
        pprint.pprint(l)
    
    if __name__ == '__main__':
        main()
    

    Output:

    [('rugby', None),
     ('Oscars', None),
     ('1990s', None),
     ('",', None),
     ('Candid', None),
     ('"', None),
     ('blobby', None),
     ('Edmonds', None),
     ('Mr', None),
     ('outrageously', None),
     ('.[', None),
     ('toothy', None),
     ('Celebrities', None),
     ('Gotcha', None),
     (']),', None),
     ('Jamie', None),
     ('humanoid', None),
     ('Blobby', None),
     ('Carling', None),
     ('enticed', None),
     ('programme', None),
     ('1997', None),
     ('s', None),
     ("'", "'"),
     ('[', '('),
     ('(', '('),
     (']', ')'),
     (',', ','),
     ('.', '.'),
     ('all', 'ABN'),
     ('the', 'AT'),
     ('an', 'AT'),
     ('a', 'AT'),
     ('be', 'BE'),
     ('were', 'BED'),
     ('was', 'BEDZ'),
     ('is', 'BEZ'),
     ('and', 'CC'),
     ('one', 'CD'),
     ('until', 'CS'),
     ('as', 'CS'),
     ('This', 'DT'),
     ('There', 'EX'),
     ('of', 'IN'),
     ('inside', 'IN'),
     ('from', 'IN'),
     ('around', 'IN'),
     ('with', 'IN'),
     ('through', 'IN'),
     ('-', 'IN'),
     ('on', 'IN'),
     ('in', 'IN'),
     ('by', 'IN'),
     ('during', 'IN'),
     ('over', 'IN'),
     ('for', 'IN'),
     ('distinctive', 'JJ'),
     ('permanent', 'JJ'),
     ('mute', 'JJ'),
     ('popular', 'JJ'),
     ('such', 'JJ'),
     ('fictional', 'JJ'),
     ('yellow', 'JJ'),
     ('pink', 'JJ'),
     ('fictitious', 'JJ'),
     ('normal', 'JJ'),
     ('dimensional', 'JJ'),
     ('legal', 'JJ'),
     ('large', 'JJ'),
     ('surprising', 'JJ'),
     ('absurd', 'JJ'),
     ('Will', 'MD'),
     ('would', 'MD'),
     ('style', 'NN'),
     ('threat', 'NN'),
     ('novelty', 'NN'),
     ('union', 'NN'),
     ('prank', 'NN'),
     ('winner', 'NN'),
     ('parody', 'NN'),
     ('player', 'NN'),
     ('actor', 'NN'),
     ('character', 'NN'),
     ('victim', 'NN'),
     ('costume', 'NN'),
     ('action', 'NN'),
     ('activity', 'NN'),
     ('dancer', 'NN'),
     ('grin', 'NN'),
     ('doll', 'NN'),
     ('top', 'NN'),
     ('mayhem', 'NN'),
     ('citation', 'NN'),
     ('part', 'NN'),
     ('repetition', 'NN'),
     ('manner', 'NN'),
     ('tone', 'NN'),
     ('Picture', 'NN'),
     ('entertainment', 'NN'),
     ('night', 'NN'),
     ('series', 'NN'),
     ('voice', 'NN'),
     ('Mrs', 'NN'),
     ('video', 'NN'),
     ('Motion', 'NN'),
     ('profession', 'NN'),
     ('feature', 'NN'),
     ('word', 'NN'),
     ('Academy', 'NN-TL'),
     ('Camera', 'NN-TL'),
     ('Party', 'NN-TL'),
     ('House', 'NN-TL'),
     ('eyes', 'NNS'),
     ('spots', 'NNS'),
     ('rehearsals', 'NNS'),
     ('ratings', 'NNS'),
     ('arms', 'NNS'),
     ('celebrities', 'NNS'),
     ('children', 'NNS'),
     ('moods', 'NNS'),
     ('legs', 'NNS'),
     ('Sciences', 'NNS-TL'),
     ('Arts', 'NNS-TL'),
     ('Wayne', 'NP'),
     ('Rose', 'NP'),
     ('Noel', 'NP'),
     ('Saturday', 'NR'),
     ('second', 'OD'),
     ('his', 'PP$'),
     ('their', 'PP$'),
     ('him', 'PPO'),
     ('He', 'PPS'),
     ('more', 'QL'),
     ('However', 'RB'),
     ('actually', 'RB'),
     ('also', 'RB'),
     ('clumsily', 'RB'),
     ('originally', 'RB'),
     ('only', 'RB'),
     ('often', 'RB'),
     ('ironically', 'RB'),
     ('briefly', 'RB'),
     ('finally', 'RB'),
     ('electronically', 'RB-HL'),
     ('out', 'RP'),
     ('to', 'TO'),
     ('show', 'VB'),
     ('Sleep', 'VB'),
     ('take', 'VB'),
     ('opened', 'VBD'),
     ('played', 'VBD'),
     ('caught', 'VBD'),
     ('appeared', 'VBD'),
     ('revealed', 'VBD'),
     ('started', 'VBD'),
     ('saying', 'VBG'),
     ('causing', 'VBG'),
     ('expressing', 'VBG'),
     ('knocking', 'VBG'),
     ('wearing', 'VBG'),
     ('speaking', 'VBG'),
     ('sporting', 'VBG'),
     ('revealing', 'VBG'),
     ('jiggling', 'VBG'),
     ('sold', 'VBN'),
     ('called', 'VBN'),
     ('made', 'VBN'),
     ('altered', 'VBN'),
     ('based', 'VBN'),
     ('designed', 'VBN'),
     ('covered', 'VBN'),
     ('communicated', 'VBN'),
     ('needed', 'VBN'),
     ('seen', 'VBN'),
     ('set', 'VBN'),
     ('featured', 'VBN'),
     ('which', 'WDT'),
     ('who', 'WPS'),
     ('when', 'WRB')]
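
Many words in the output above are tagged None. A UnigramTagger only knows words it saw during training, and with no backoff tagger it has nothing to fall back on. Below is a minimal sketch of adding a backoff. It assumes a tiny hand-made training set (`TRAIN_SENTS`, hypothetical) in place of the Brown corpus, so it runs without any corpus download. Falling back to 'NN' for every unknown word is a crude illustrative choice, not part of the original answer.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Tiny hand-made training set (hypothetical); the answer above trains on
# nltk.corpus.brown.tagged_sents() instead.
TRAIN_SENTS = [
    [('the', 'AT'), ('show', 'NN'), ('was', 'BEDZ'), ('popular', 'JJ')],
]

# No backoff: words never seen in training are tagged None.
plain_tagger = UnigramTagger(TRAIN_SENTS)

# With a backoff: unseen words fall through to DefaultTagger, which
# labels every token 'NN'.
backoff_tagger = UnigramTagger(TRAIN_SENTS, backoff=DefaultTagger('NN'))

tokens = ['the', 'humanoid', 'was', 'popular']
print(plain_tagger.tag(tokens))
print(backoff_tagger.tag(tokens))
```

The first call leaves 'humanoid' tagged None; the second tags it 'NN'. The same `backoff=` argument should work when training on the Brown corpus as in the answer.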
    

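    【Comments】:

    • What does this do? Can you add some description? Also, why use globals when you could have used the variables directly?
    • @avi It generates part-of-speech tags for the words (scroll down to see the full list). For example, ('called', 'VBN') says that called is a past participle verb. It looks like global was used so that the variables can be changed within function scope (so they don't have to be passed in every time the function is called).
    • +1 for Mr Blobby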
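
On the globals question in the comments: one possible alternative is to cache the expensive setup behind a function. This is a sketch rather than the answer author's code. `get_tools` is a hypothetical helper name, and `TRAIN_SENTS` is a tiny stand-in for the Brown corpus so the example runs without a download.

```python
from functools import lru_cache

from nltk.tag import UnigramTagger
from nltk.tokenize import RegexpTokenizer

# Hypothetical stand-in for nltk.corpus.brown.tagged_sents().
TRAIN_SENTS = [[('he', 'PPS'), ('was', 'BEDZ'), ('called', 'VBN')]]

@lru_cache(maxsize=None)
def get_tools():
    """Build the tokenizer and tagger on first use, then reuse them."""
    tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')
    tagger = UnigramTagger(TRAIN_SENTS)
    return tokenizer, tagger

def tag(text):
    """Tokenize and tag text, sorted by tag (untagged words first)."""
    tokenizer, tagger = get_tools()
    tagged = tagger.tag(tokenizer.tokenize(text))
    return sorted(tagged, key=lambda pair: pair[1] or '')

print(tag("he was called Blobby"))
```

The tagger is still trained only once, but no module-level variable is reassigned, so no `global` statement is needed.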
    【Answer 2】:

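    NLP in general is very useful, so you might want to broaden your search to general applications of text analytics. I used NLTK to aid MOSS 2010 by generating a file taxonomy from extracted concept maps. It worked really well. It doesn't take long before files start to cluster in useful ways.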

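    Often, to understand text analytics, you have to think at a tangent to the ways you are used to thinking. For example, text analytics is extremely useful for discovery. Most people, though, don't even know the difference between search and discovery. If you read up on those subjects, you will likely "discover" ways in which you might want to put NLTK to work.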

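    Also, consider your world view of text files without NLTK. You have a bunch of random-length strings separated by whitespace and punctuation. Some punctuation changes meaning depending on how it is used, such as the period (which is also a decimal point and a postfix marker for an abbreviation). With NLTK you get words and, more to the point, parts of speech. Now you have a handle on the content. Use NLTK to discover the concepts and actions in the document. Use NLTK to get at the "meaning" of the document. Meaning here refers to the essential relationships in the document.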
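
To make the "strings separated by whitespace and punctuation" point concrete, here is a small sketch comparing plain `str.split` with the regexp tokenizer pattern used in Answer 1. The sample sentence and the helper names (`naive_split`, `regexp_tokens`) are made up for illustration.

```python
from nltk.tokenize import RegexpTokenizer

def naive_split(text):
    """Whitespace-only splitting: punctuation stays glued to the words."""
    return text.split()

def regexp_tokens(text):
    """Split into runs of word characters and runs of punctuation."""
    tokenizer = RegexpTokenizer(r'\w+|[^\w\s]+')
    return tokenizer.tokenize(text)

sample = "Mr. Blobby paid 3.50 pounds."
print(naive_split(sample))
print(regexp_tokens(sample))
```

`naive_split` keeps 'Mr.', '3.50' and 'pounds.' as single strings, while the regexp pattern separates every period, including the decimal point in 3.50. Neither knows which role a given period plays, which is the ambiguity described above.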

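    It is a good thing to be curious about NLTK. Text analytics is set to break out in a big way in the next few years. Those who understand it will be better placed to take advantage of the new opportunities.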

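    【Comments】:

    • Could you post a link to a reference on MOSS 2010?
    • The best link I have is to a paper I wrote a few years ago. I am going to rebuild my web page this year to focus on my work data mining radio telescopes, but the paper should still be there for a while: nectarineimp.com/automated-folksonomy-whitepaper
    【Answer 3】: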

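    I'm the author of streamhacker.com (thanks for the mention, I get a fair amount of click traffic from this particular question). What specifically are you trying to do? NLTK has a lot of tools for doing various things, but it lacks clear information on what the tools are for and how best to use them. It is also oriented towards academic problems, so translating the pedagogical examples into practical solutions can be heavy going.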

    【Comments】:
