【Title】: Python encoding issue when trying to parse JSON tweets
【Posted】: 2015-09-04 20:35:11
【Question】:

I am trying to parse the tweet text and username out of the JSON object returned by Twitter, using the following code:

import json
import time

from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

# c (a database cursor) and ckey, csecret, atoken, asecret (Twitter credentials)
# are assumed to be defined earlier in the script (not shown).

class listener(StreamListener):

    def on_data(self, data):
        all_data = json.loads(data)
        tweet = all_data["text"]
        username = all_data["user"]["screen_name"]

        c.execute("INSERT INTO tweets (tweet_time, username, tweet) VALUES (%s,%s,%s)",
                  (time.time(), username, tweet))
        print(username, tweet)
        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
twitterStream = Stream(auth, listener())
twitterStream.filter(track=["LeBron James"])

But I get the following error. How can I adjust the code to decode or encode the response correctly?

Traceback (most recent call last):
   File "C:/Users/sagars/PycharmProjects/YouTube NLP Lessons/Twitter Stream to DB.py", line 45, in <module>
    twitterStream.filter(track = ["LeBron James"])
  File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 428, in filter
    self._start(async)
  File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 346, in _start
    self._run()
  File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 286, in _run
    raise exception
  File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 255, in _run
    self._read_loop(resp)
  File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 309, in _read_loop
    self._data(next_status_obj)
  File "C:\Python34\lib\site-packages\tweepy\streaming.py", line 289, in _data
    if self.listener.on_data(data) is False:
  File "C:/Users/sagars/PycharmProjects/YouTube NLP Lessons/Twitter Stream to DB.py", line 36, in on_data
    print (username, tweet)
  File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-8: character maps to <undefined>

【Comments on the question】:

    Tags: python json twitter


    【Solution 1】:

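    Unfortunately, the text you get from Twitter contains characters (emoji, for example) that your Windows console encoding, cp1252, cannot represent, which is why print raises the charmap error. To work around it, encode the strings as UTF-8 before printing them: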

    tweet = all_data["text"].encode('utf-8')
    username = all_data["user"]["screen_name"].encode('utf-8')
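
To see why this sidesteps the error, here is a minimal sketch that reproduces the failing step without Twitter or a console. The sample string is made up for illustration; cp1252 is the codec named in the traceback.

```python
# Hypothetical sample text standing in for all_data["text"]; the emoji is U+1F600.
tweet = "LeBron \U0001F600"

# print() on a cp1252 console has to perform this encoding step, and it fails
# because cp1252 has no mapping for the emoji.
try:
    tweet.encode("cp1252")
    console_ok = True
except UnicodeEncodeError:
    console_ok = False

# Encoding to UTF-8 always succeeds and yields a bytes object. Printing bytes
# shows its repr, which contains only ASCII characters.
encoded = tweet.encode("utf-8")
print(console_ok)  # False
print(encoded)     # b'LeBron \xf0\x9f\x98\x80'

# The bytes still hold the full text: decoding returns the original string.
assert encoded.decode("utf-8") == tweet
```

Note that the values are now bytes rather than str, so anything downstream (the database insert included) receives bytes as well.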
    

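    As a result, emoji and other special characters in the tweets no longer print as themselves; they show up as byte escapes such as \xf0\x9f\x98\x80. If you really need that information for sentiment analysis (I discard it myself), you will need to install a package that ships a precompiled list and convert them accordingly.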
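
If you would rather keep the values as str (for the database insert, say) and only make the print call safe, one possible alternative is to escape just the characters the console cannot show. This is a sketch, not part of the original answer; the helper name to_console_safe is hypothetical.

```python
import sys


def to_console_safe(text, encoding=None):
    """Return text with characters the target encoding cannot represent
    replaced by backslash escapes (hypothetical helper for illustration)."""
    # Fall back to ASCII when stdout has no known encoding.
    encoding = encoding or sys.stdout.encoding or "ascii"
    # backslashreplace swaps each unencodable character for an escape such as
    # \U0001f600; decoding with the same codec turns the result back into str.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# cp1252 is passed explicitly here to mimic the console from the traceback.
print(to_console_safe("LeBron \U0001F600", "cp1252"))  # LeBron \U0001f600
```

In on_data this would be used as print(to_console_safe(username), to_console_safe(tweet)), leaving tweet and username themselves untouched.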

    【Comments】:
