【Question Title】: How to avoid UnicodeEncodeError '\xf8' when scraping a table with BeautifulSoup [duplicate]
【Posted】: 2017-09-29 10:22:37
【Question】:

The script below raises `UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 118: ordinal not in range(128)`.

I can't find a good explanation for this.

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

results = {}

for page_num in range(0, 1000, 20):
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører' 

    html = urlopen(address)
    soup = BeautifulSoup(html, 'lxml')
    table = soup.find_all(class_='table-condensed')
    output = pd.read_html(str(table))[0]
    results[page_num] = output


df = pd.concat([v for v in results.values()], axis = 0)

【Comments】:

    Tags: python-3.x web-scraping beautifulsoup


    【Solution 1】:

    You are opening the URL with the standard library. Under the hood, `http.client` forces the request line to be encoded as ASCII, so a non-ASCII character such as ø raises a UnicodeEncodeError.

    Lines 1116-1117 of `http/client.py`:

        # Non-ASCII characters should have been eliminated earlier
        self._output(request.encode('ascii'))
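That restriction can be reproduced without any network access (a minimal sketch; the request line below is illustrative, not the exact one the question's script produces):

```python
# http.client ultimately runs request.encode('ascii'). A character like
# 'ø' (U+00F8) is outside the ASCII range, so the encode call fails with
# the same UnicodeEncodeError the question reports.
try:
    'GET /nyetableringer?industry=Entreprenører HTTP/1.1'.encode('ascii')
except UnicodeEncodeError as exc:
    print(exc)  # 'ascii' codec can't encode character '\xf8' ...
```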
    

    As an alternative to `urllib.request`, the third-party `requests` library is excellent; it handles the URL encoding for you.

    import requests

    # page_num comes from the same loop as in the question
    address = 'https://www.proff.no/nyetableringer?industryCode=p441&fromDate=22.01.2007&location=Nord-Norge&locationId=N&offset=' + str(page_num) + '&industry=Entreprenører'
    html = requests.get(address).text
    
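If you would rather keep using `urllib.request`, another option (a sketch not taken from the original answer) is to percent-encode the non-ASCII query value yourself with `urllib.parse.quote` before building the URL:

```python
from urllib.parse import quote

page_num = 0  # example value; loop over offsets as in the question

# Percent-encode the value that contains the non-ASCII 'ø'.
industry = quote('Entreprenører')  # -> 'Entrepren%C3%B8rer'

address = ('https://www.proff.no/nyetableringer?industryCode=p441'
           '&fromDate=22.01.2007&location=Nord-Norge&locationId=N'
           '&offset=' + str(page_num) + '&industry=' + industry)

# The address is now pure ASCII, so urllib.request.urlopen(address)
# no longer trips over request.encode('ascii').
print(address.encode('ascii').decode('ascii') == address)
```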

    【Discussion】:
