【Question title】:NLTK Tokenizer with Twitter API
【Posted】:2020-04-20 13:18:19
【Question】:

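I am trying to find the frequency distribution over a series of tweets, but the distribution is being computed separately for each tweet instead of over the whole set of tweets. How can I fix this?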

import tweepy
from tweepy import OAuthHandler
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk import FreqDist
from nltk.corpus import stopwords

consumer_key = 'x'
consumer_secret = 'x'
access_token = 'x'
access_secret = 'x'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.user_timeline,
        "johnnywalker",
        result_type = "recent",
        count = 50,
        include_entities = False,
        exclude_replies = True,
        include_rts = False,
        trim_user = True,
        lang = "en").items():

        croy = word_tokenize(tweet.text)
        ensw = stopwords.words('english')
        filterArr = [word for word in croy if word not in ensw]
        filterArr = [word for word in filterArr if len(word) > 7]
        fdist = FreqDist(filterArr)
        fdist.most_common(50)

【Comments】:

    Tags: python nltk twitterapi-python


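    【Solution 1】:

    You are computing the frequency distribution for each tweet individually. If you want the distribution over a series of tweets, do the computation outside the loop: collect the tokens from every tweet into one list, then build the FreqDist once.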

    tweet_tokenized = [] 
    for tweet in tweepy.Cursor(api.user_timeline,
            "johnnywalker",
            result_type = "recent",
            count = 50,
            include_entities = False,
            exclude_replies = True,
            include_rts = False,
            trim_user = True,
            lang = "en").items():
    
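            croy = word_tokenize(tweet.text)
            ensw = stopwords.words('english')
            # Drop stop words first, then keep only words longer than 7 characters.
            filterArr = [word for word in croy if word not in ensw]
            filterArr = [word for word in filterArr if len(word) > 7]
            # Accumulate the tokens from every tweet into one list.
            tweet_tokenized.extend(filterArr)
    # Build a single distribution over all tweets, after the loop has finished.
    fdist = FreqDist(tweet_tokenized)
    print(fdist.most_common(50))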
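
The same idea can be tried without Twitter credentials. The sketch below is only an illustration under stated assumptions: `sample_tweets` is made-up data standing in for the `tweet.text` values, `tokenize` is a simple regex stand-in for `word_tokenize`, and `collections.Counter` stands in for `FreqDist` (which is a `Counter` subclass, so `update` and `most_common` behave the same way). Instead of collecting all tokens in a list, it creates the counter once before the loop and updates it for each tweet.

```python
import re
from collections import Counter

# Hypothetical sample data standing in for tweet.text values from the API.
sample_tweets = [
    "Celebrate responsibly with friends tonight",
    "Friends celebrate milestones together, responsibly",
    "Something different entirely",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def count_long_words(texts, min_length=8):
    """Count words of at least min_length characters across all texts."""
    counts = Counter()  # created once, before the loop
    for text in texts:
        tokens = [w for w in tokenize(text) if len(w) >= min_length]
        counts.update(tokens)  # add this text's tokens to the running total
    return counts


totals = count_long_words(sample_tweets)
print(totals.most_common(3))
```

Creating the counter inside the loop would reset it for every tweet, which is the behavior described in the question. Creating it once and calling `update` gives the same totals as extending a list and counting at the end.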
    

    【Comments】:

    • This got me closer. Thank you.