UnicodeEncodeError while using spark-submit and BeautifulSoup

【Question title】: UnicodeEncodeError while using spark-submit and BeautifulSoup
【Posted】: 2019-07-26 20:19:14
【Question description】:

When I submit a job with spark-submit to Spark 1.6 (Hadoop 2.7), I keep getting a UnicodeEncodeError under Python 2.7. When I run the same code line by line in the pyspark shell, the error does not occur.

I am using BeautifulSoup to find all the ref tags and extract their text with this line of code:

[r.text for r in BeautifulSoup(line).findAll('ref') if r.text]

I have tried the following:

Setting export PYTHONIOENCODING="utf8"
Using r.text.encode('ascii', 'ignore')
Calling sys.setdefaultencoding('utf-8')

Can anyone tell me how to fix this? Here is the error stack:

  File "/hdata/dev/sdf1/hadoop/yarn/local/usercache/harshdee/appcache/application_1551632819863_0039/container_e36_1551632819863_0039_01_000004/pyspark.zip/pyspark/serializers.py", line 263, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/home/harshdee/get_data.py", line 63, in get_as_row
    return Row(citations=get_citations(line.content), id=line.id, title=line.title)
  File "/home/harshdee/get_data.py", line 47, in get_citations
    refs_in_line = [r.text for r in BeautifulSoup(line).findAll('ref') if r.text]
  File "/usr/lib/python2.7/site-packages/bs4/__init__.py", line 274, in __init__
    self._check_markup_is_url(markup)
  File "/usr/lib/python2.7/site-packages/bs4/__init__.py", line 336, in _check_markup_is_url
    ' that document to Beautiful Soup.' % decoded_markup
  File "/usr/lib64/python2.7/warnings.py", line 29, in _show_warning
    file.write(formatwarning(message, category, filename, lineno, line))
  File "/usr/lib64/python2.7/warnings.py", line 38, in formatwarning
    s =  "%s:%s: %s: %s\n" % (filename, lineno, category.__name__, message)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 21-28: ordinal not in range(128)
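
Reading the stack above, the exception is not raised by r.text itself. It is raised inside warnings.formatwarning, after BeautifulSoup's _check_markup_is_url issued a warning whose message includes the markup. One possible workaround, sketched below, is to suppress that warning around the parser call so the message is never formatted. This is only a sketch under assumptions: call_quietly and fake_parser are hypothetical names, fake_parser merely stands in for BeautifulSoup(line), and the example has not been run on Spark 1.6 executors.

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Call func with UserWarning (and its subclasses) ignored.

    Hypothetical helper: an ignored warning is dropped before it is
    formatted, so non-ASCII text in its message is never encoded.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", UserWarning)
        return func(*args, **kwargs)


def fake_parser(markup):
    # Hypothetical stand-in for BeautifulSoup(markup): it warns with a
    # message that embeds the markup, then returns the markup unchanged.
    warnings.warn(u'"%s" looks like a URL.' % markup, UserWarning)
    return markup


result = call_quietly(fake_parser, u"http://example.org/caf\u00e9")
```

In the real job, the equivalent call would presumably be call_quietly(BeautifulSoup, line) inside get_citations. The filter is restored when the with block exits, so other warnings in the job are unaffected.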

【Question discussion】:

Python 2.7 is quite old by now, but have you tried making everything unicode: ...findAll(u'ref')?
No, I haven't. I'll try it!
[r.text for r in BeautifulSoup(line).findAll(u'ref') if r.text]: I tried this, but it gave me the same error @SergeBallesta

Tags: python python-2.7 apache-spark hadoop beautifulsoup

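【Solution 1】:

I solved this myself. I think the problem was with how the string was represented.

To work around it, I used the repr function, which returns the printable representation of an object. In Python 2, calling it on a unicode string gives an ASCII-only string in which non-ASCII characters are escaped.

I applied this to the line variable.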
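
A minimal sketch of that idea follows. The helper name ascii_safe is hypothetical and does not come from the original code. Because repr only escapes non-ASCII characters in Python 2, the sketch falls back to ascii() on Python 3 so that it behaves the same way on either version.

```python
import sys


def ascii_safe(line):
    """Return an ASCII-only representation of line (hypothetical helper)."""
    if sys.version_info[0] == 2:
        # Python 2: repr() of a unicode string escapes non-ASCII characters.
        return repr(line)
    # Python 3: repr() keeps non-ASCII characters; ascii() escapes them.
    return ascii(line)


# Sample input assumed for illustration only.
safe = ascii_safe(u"<ref>caf\u00e9</ref>")
```

In get_citations this would presumably become BeautifulSoup(ascii_safe(line)). The trade-off is that the parsed text then contains escape sequences and the surrounding quote characters instead of the original characters, so it may need decoding again before it is stored.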

【Discussion】: