【Question Title】: How to get only the text part from tweets extracted using Tweepy?
【Posted】: 2017-09-20 04:17:01
【Question】:

I am working on a research project along the lines of sentiment analysis. I extracted tweets from Twitter using Tweepy, and the data I get looks like this:

{"created_at":"Sat Apr 22 07:28:47 +0000 2017","id":855684794939842560,"id_str":"855684794939842560","text":"#PL | FIXTURES - 22 April 2017 \nWest Ham v Everton 16:00\nHull v Watford\nSwansea v Stoke \nBournemouth v Middlesbrough #CCFMSport","source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":256051042,"id_str":"256051042","name":"Ayanda Frances Felem","screen_name":"AyandaFelemZA","location":"Cape Town, South Africa","url":"http:\/\/ccfm.org.za","description":"Sports Producer\/Reporter for @RadioCCFm, Views are my own. ayanda@ccfm.org.za","protected":false,"verified":false,"followers_count":446,"friends_count":1648,"listed_count":23,"favourites_count":1625,"statuses_count":16110,"created_at":"Tue Feb 22 15:15:38 +0000 2011","utc_offset":7200,"time_zone":"Pretoria","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_tile":false,"profile_link_color":"DD2E44","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/850335374446665728\/BvVIo7oB_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/850335374446665728\/BvVIo7oB_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/256051042\/1491570881","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"PL","indices":[0,3]},{"text":"CCFMSport","indices":[117,127]}],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1492846127625"}

Now I want to extract only the tweet "text" from this file. I tried this:

import json

tweets_data_path = 'twitter_streaming.txt'
tweets_data = []
tweets_file = open(tweets_data_path, "r")

json_load = json.load(tweets_file)
texts = json_load['text']
coded = texts.encode('utf-8')
s = str(coded)
tweets_data.append(s[1:-2])
print(tweets_data)

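But I get an error message:

json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I tried to find the cause of this error but could not find anything specific.

What am I doing wrong? Is there a better way?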

【Comments】:

    Tags: python json python-3.x twitter twitter-streaming-api


    【Answer 1】:
    # Bind the JSON literals to their Python equivalents so the pasted tweet
    # evaluates as a dict literal (the data also contains `true`).
    null, false, true = None, False, True
    a = {"created_at":"Sat Apr 22 07:28:47 +0000 2017","id":855684794939842560,"id_str":"855684794939842560","text":"#PL | FIXTURES - 22 April 2017 \nWest Ham v Everton 16:00\nHull v Watford\nSwansea v Stoke \nBournemouth v Middlesbrough #CCFMSport","source":"\u003ca href=\"http:\/\/twitter.com\" rel=\"nofollow\"\u003eTwitter Web Client\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":256051042,"id_str":"256051042","name":"Ayanda Frances Felem","screen_name":"AyandaFelemZA","location":"Cape Town, South Africa","url":"http:\/\/ccfm.org.za","description":"Sports Producer\/Reporter for @RadioCCFm, Views are my own. ayanda@ccfm.org.za","protected":false,"verified":false,"followers_count":446,"friends_count":1648,"listed_count":23,"favourites_count":1625,"statuses_count":16110,"created_at":"Tue Feb 22 15:15:38 +0000 2011","utc_offset":7200,"time_zone":"Pretoria","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"000000","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme11\/bg.gif","profile_background_tile":false,"profile_link_color":"DD2E44","profile_sidebar_border_color":"000000","profile_sidebar_fill_color":"000000","profile_text_color":"000000","profile_use_background_image":false,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/850335374446665728\/BvVIo7oB_normal.jpg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/850335374446665728\/BvVIo7oB_normal.jpg","profile_banner_url":"https:\/\/pbs.twimg.com\/profile_banners\/256051042\/1491570881","default_profile":false,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[{"text":"PL","indices":[0,3]},{"text":"CCFMSport","indices":[117,127]}],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"en","timestamp_ms":"1492846127625"}
    print(a["text"])
    

    I simply used this code, and it returned the following output:

    #PL | FIXTURES - 22 April 2017 
    West Ham v Everton 16:00
    Hull v Watford
    Swansea v Stoke 
    Bournemouth v Middlesbrough #CCFMSport
    

    Although the question is not entirely clear, is this the text you are looking for?
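
An alternative to rebinding `null` and `false` as Python names is to keep the tweet as a JSON string and parse it with `json.loads`, which maps JSON `null`, `true`, and `false` to `None`, `True`, and `False`. A minimal sketch, assuming a shortened, hypothetical tweet string (`raw`) in place of the full record:

```python
import json

# Hypothetical, shortened stand-in for one tweet's raw JSON text.
# The doubled backslash keeps the JSON escape \n inside the string.
raw = ('{"text": "#PL | FIXTURES - 22 April 2017 \\nWest Ham v Everton 16:00", '
       '"truncated": false, "geo": null}')

tweet = json.loads(raw)   # parse the JSON text into a Python dict
text = tweet["text"]      # JSON escapes such as \n are decoded here
print(text)
```

This way the data does not need to be edited by hand before it can be read.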

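    【Comments】:

    • Yes, this is exactly what I was looking for, but can you tell me where you use this print a["text"]? Thanks a lot, friend.
    • OK, I see what you did, but for that I have to change every false to False and every null to None.
    • @Liam Yes, I have edited my answer so that it is more helpful to you!
    • @Liam Hope that helps! Thanks!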
    【Answer 2】:

    This code works as expected:

    import json
    
    tweets_data_path = 'twitter_data.txt'
    tweets_data = []
    
    # json.load expects the file to hold exactly one JSON document.
    with open(tweets_data_path, "r", encoding="utf-8") as tweets_file:
        json_load = json.load(tweets_file)
    
    texts = json_load['text']
    print(texts)
    

    If that is the expected output, the following part of the code is not needed:

    coded = texts.encode('utf-8')
    s = str(coded)
    tweets_data.append(s[1:-2])
    print(tweets_data)
    
    #output
    '''
    #PL | FIXTURES - 22 April 2017 
    West Ham v Everton 16:00
    Hull v Watford
    Swansea v Stoke 
    Bournemouth v Middlesbrough #CCFMSport
    None
    '''
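
`json.load` reads the whole file as a single JSON document, and `json.loads("")` raises exactly "Expecting value: line 1 column 1 (char 0)", so one possible cause of the reported error is a file that is empty or does not begin with a JSON value. The sketch below assumes the streaming dump stores one tweet per line, which the question does not confirm; it parses each line separately and skips blank or malformed ones. The helper name `extract_texts` and the sample lines are hypothetical.

```python
import json


def extract_texts(lines):
    """Return the "text" field of every tweet found in an iterable of lines."""
    texts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        # Keep only JSON objects that actually carry a "text" field.
        if isinstance(record, dict) and "text" in record:
            texts.append(record["text"])
    return texts


# Hypothetical sample standing in for the lines of the dump file.
sample = [
    '{"text": "first tweet", "lang": "en"}\n',
    '\n',
    '{"note": "a record without a text field"}\n',
    'not json\n',
    '{"text": "second tweet", "lang": "en"}\n',
]
print(extract_texts(sample))
```

For a real file, an open file object can be passed directly, since iterating over it yields lines: `with open('twitter_streaming.txt', encoding='utf-8') as f: texts = extract_texts(f)`.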
    
