【Question Title】: Frequency Distribution Comparison Python
【Posted】: 2015-11-09 12:04:32
【Question】:

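I am using Python and NLTK to study some texts, and I want to compare the frequency distributions of parts of speech across different texts.

I can do this for one text: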

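from nltk import FreqDist, pos_tag, word_tokenize

# read the file inside a context manager so it is closed afterwards
with open('/Users/X.txt') as f:
    X_tagged = pos_tag(word_tokenize(f.read()))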

X_fd = FreqDist([tag for word, tag in X_tagged])
X_fd.plot(cumulative=True, title='Part of Speech Distribution in Corpus X')

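I tried adding another text but had no luck. I have the conditional frequency distribution example that compares the counts of three words across several texts, but I want the lines to represent four different texts, the y-axis to represent counts, and the x-axis to represent the different parts of speech. How can I compare texts Y and Z in the same plot?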

【Comments】:

    Tags: python nltk word-frequency frequency-distribution


    【Solution 1】:

    Here is an example using matplotlib:

    import matplotlib.pyplot as plt
    import numpy as np
    from nltk import FreqDist, word_tokenize
    
    # any tokenizer works here; word_tokenize needs NLTK's tokenizer data installed
    tokenizer = word_tokenize
    dist = {}
    dist["win"] = FreqDist(tokenizer("first text"))
    dist["draw"] = FreqDist(tokenizer("second text"))
    dist["lose"] = FreqDist(tokenizer("third text"))
    dist["mixed"] = FreqDist(tokenizer("fourth text"))
    
    # sorted list of 50 most common terms in one of the texts
    # (too many terms would be illegible in the graph)
    most_common = [item for item, _ in dist["mixed"].most_common(50)] 
    
    colors = ["green", "blue", "red", "turquoise"]
    
    # loop over the dictionary keys to plot each distribution
    for i, label in enumerate(dist):
        frequency = [dist[label][term] for term in most_common]
        color = colors[i]
        plt.plot(frequency, color=color, label=label)
    plt.gca().grid(True)
    plt.xticks(np.arange(0, len(most_common), 1), most_common, rotation=90)
    plt.xlabel("Most common terms")
    plt.ylabel("Frequency")
    plt.legend(loc="upper right")
    plt.show()
    

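    【Comments】:

      【Solution 2】:

      I figured it out, in case anyone is interested. You need to build the separate frequency distributions and feed them into a dictionary whose keys are shared by all the FreqDists and whose values are tuples holding each FreqDist's count for that key. Then plot the values for each FreqDist and set the keys as the x-axis tick labels, in the same order you pulled them out.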

      import matplotlib.pyplot as plt
      import numpy as np
      from nltk import FreqDist
      
      # 'win', 'draw', 'lose' and 'mixed' are already POS tagged
      # (lists of tuples such as ('the', 'DT'))
      win = FreqDist([tag for word, tag in win])
      draw = FreqDist([tag for word, tag in draw])
      lose = FreqDist([tag for word, tag in lose])
      mixed = FreqDist([tag for word, tag in mixed])
      
      # keys used for the x axis: every tag that occurs in 'win'
      # (a tag missing from another FreqDist simply counts as 0 there)
      POS = [item for item in win]
      
      results = {}
      for key in POS:
          # one key, tuple of counts for each FreqDist (in order)
          results[key] = (win[key], draw[key], lose[key], mixed[key])
      
      win_counts = [results[item][0] for item in results]
      draw_counts = [results[item][1] for item in results]
      lose_counts = [results[item][2] for item in results]
      mixed_counts = [results[item][3] for item in results]
      
      display = [item for item in results] # over-cautious, same as POS above
      
      plt.plot(win_counts, color='green', label="win")
      plt.plot(draw_counts, color='blue', label="draw")
      plt.plot(lose_counts, color='red', label="lose")
      plt.plot(mixed_counts, color='turquoise', label="mixed")
      plt.gca().grid(True)
      plt.xticks(np.arange(0, len(display), 1), display, rotation=45) # puts the keys on the x axis
      plt.xlabel("Parts of Speech")
      plt.ylabel("Counts per 10,000 tweets")
      plt.suptitle("Part of Speech Distribution across Pre-Win, Pre-Loss and Pre-Draw Corpora")
      plt.legend(loc="upper right")
      plt.show()
      

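      【Comments】:

        【Solution 3】:

        The FreqDist.plot() method is only a convenience method.

        You need to write the plotting logic yourself (using matplotlib) to put several frequency distributions in one plot.

        The source code of FreqDist's plotting function may be a good place to get started. matplotlib also has a good tutorial and beginner's guide.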
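
        As a rough illustration of what "writing the plotting logic yourself" could look like, here is a minimal sketch. The helper names align_counts and plot_distributions, their parameters, and the sample counts are made up for this example and are not part of NLTK or matplotlib. It uses collections.Counter as a stand-in, on the assumption that you are on NLTK 3, where FreqDist subclasses Counter and a missing key counts as 0.

```python
from collections import Counter


def align_counts(dists, top_n=None):
    """Return (keys, series): shared x-axis keys and one count list per label.

    `dists` maps a label to a Counter-like object (e.g. an NLTK 3 FreqDist).
    """
    total = Counter()
    for dist in dists.values():
        total.update(dist)
    # Order keys by combined count so the most frequent categories come first.
    keys = [key for key, _ in total.most_common(top_n)]
    # A key that is absent from a Counter yields 0, so all lists line up.
    series = {label: [dist[key] for key in keys] for label, dist in dists.items()}
    return keys, series


def plot_distributions(dists, top_n=None, xlabel="Parts of Speech", ylabel="Counts"):
    """Draw one line per distribution on a shared axis and return the Axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for drawing

    keys, series = align_counts(dists, top_n)
    fig, ax = plt.subplots()
    for label, counts in series.items():
        ax.plot(range(len(keys)), counts, label=label)
    ax.set_xticks(range(len(keys)))
    ax.set_xticklabels(keys, rotation=45)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    ax.grid(True)
    ax.legend(loc="upper right")
    return ax


# Made-up tag counts standing in for real FreqDist objects.
example = {
    "X": Counter({"NN": 5, "DT": 3, "VB": 2}),
    "Y": Counter({"NN": 4, "JJ": 1}),
}
keys, series = align_counts(example)
```

        With real data you would pass your own distributions, for example plot_distributions({"X": X_fd, "Y": Y_fd, "Z": Z_fd}), and then call matplotlib's plt.show(). Taking the union of keys, rather than the keys of a single text, keeps tags that occur in only some of the texts on the x-axis.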

        【Comments】:
