【Question title】: I am getting an error using urllib and bs4: "http.client.BadStatusLine:"
【Posted】: 2020-09-12 05:41:41
【Question】:

I have a file named "recognized.txt" that contains some text, like this:

Link to recognized.txt: https://drive.google.com/file/d/1yCQz6cQPDmcCOuXBOCAX4nvNoUqewE0y/view?usp=sharing

My code:

f = open('recognized.txt','r')
message = f.read()
message.replace(" ", "")
print(message)
f.close()


import bs4 as bs
import urllib.request
url = ('https://html.duckduckgo.com/html?q='+message)                                                              # no javascript



sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')
a = soup.body.b
print(a)

for i in soup.find_all('a', class_='result__snippet'):
    print(i.get_text(separator=' - ', strip=True))

When I run the code above, it gives me this error:

Traceback (most recent call last):
  File "D:\ocr\webparse.py", line 26, in <module>
    sauce = urllib.request.urlopen(url).read()
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 526, in open
    response = self._open(req, data)
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 544, in _open
    '_open', req)
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 504, in _call_chain
    result = func(*args)
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 1361, in https_open
    context=self._context, check_hostname=self._check_hostname)
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\urllib\request.py", line 1321, in do_open
    r = h.getresponse()
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\http\client.py", line 1331, in getresponse
    response.begin()
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\http\client.py", line 297, in begin
    version, status, reason = self._read_status()
  File "C:\Users\Praveen\AppData\Local\Programs\Python\Python36\lib\http\client.py", line 279, in _read_status
    raise BadStatusLine(line)
http.client.BadStatusLine: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">


  1. What does the error mean?

  2. Why does this error occur?

【Question comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    After running your code with your txt file, I managed to reproduce the problem. Here is what I did:

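    • Used strip() to remove the line breaks and spaces around your message (strip() only trims the start and end of the string)
    • Removed 'lxml' from the BeautifulSoup() call

    This seems to produce decent results.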

    import bs4 as bs
    import urllib.request
    
    with open('Downloads/recognized.txt') as f:
        message = f.read().strip()
    
    url = ('https://html.duckduckgo.com/html?q='+message)
    
    sauce = urllib.request.urlopen(url).read()
    soup = bs.BeautifulSoup(sauce)
    a = soup.body.b
    print(a)
    
    for i in soup.find_all('a', class_='result__snippet'):
        print(i.get_text(separator=' - ', strip=True))
    

    The printed output is as follows:

    <b>Dinosaur</b>
    Dinosaurs - are a diverse group of reptiles of the clade Dinosauria. They first appeared during the Triassic period, between 243 and 233.23 million years ago...
    ? - Dinosaur - . Quite the same Wikipedia. Just better. - Dinosaur - . From Wikipedia, the free encyclopedia.
    Мультфильм, триллер, приключения. Режиссер: Эрик Лейтон, Ральф Зондаг. В ролях: Элфри Вудард, Осси Дэвис, Макс Казелла и др. Путешествие трехтонного игуанодонта по имени Аладар...
    Перевод слова - dinosaur - , американское и британское произношение, транскрипция, словосочетания, примеры использования.
    

    The problem seems to be with your message variable. I cleaned it up so that it is a simple string with no line breaks. Now it works fine.
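On the two questions asked: http.client raises BadStatusLine when the first line of the server's reply is not a status line it can parse, and in the traceback that line is the start of an HTML page. A plausible reading, not confirmed here, is that the raw newline in the URL broke the request. Since strip() leaves whitespace inside the string alone, a slightly more defensive approach is to collapse all whitespace and URL-encode the query before sending it. The sketch below assumes that approach; the helper name `build_search_url` is hypothetical, and only the endpoint comes from the question.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the helper below is a hypothetical name.
SEARCH_ENDPOINT = "https://html.duckduckgo.com/html"


def build_search_url(raw_text: str) -> str:
    """Return a search URL whose query holds no raw whitespace or newlines."""
    # split() with no arguments splits on any run of whitespace (spaces,
    # tabs, newlines); joining with one space flattens multi-line text.
    query = " ".join(raw_text.split())
    # urlencode percent-encodes reserved characters and turns spaces into '+'.
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_search_url("Dinosaur\n"))
    print(build_search_url("  Tyrannosaurus\nrex  "))
```

The resulting string can be passed to urllib.request.urlopen() in place of the concatenated URL. For the input "Dinosaur\n" it ends in ?q=Dinosaur, with the trailing newline gone.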

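    【Comments】:

    • Yes, I added the link to the question, you can check it
    • I adjusted the code to your new message format. Basically, your message string was not being cleaned properly.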
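One likely reason the cleanup in the question had no effect: Python strings are immutable, so `message.replace(" ", "")` returns a new string and the original stays as it was unless the result is assigned. The snippet below illustrates this with a made-up sample value (the real file contents are not shown in the post), and also shows that removing spaces alone would not remove a newline.

```python
# Hypothetical sample; the actual contents of recognized.txt are not shown.
raw = "Dinosaur \n"

# str.replace returns a new string. Calling it without using the result
# leaves the original variable unchanged.
raw.replace(" ", "")
unchanged = raw

# Assigning the result keeps the change, but the newline is still present
# because only spaces were targeted.
no_spaces = raw.replace(" ", "")

# strip() removes leading and trailing whitespace, including the newline.
cleaned = raw.strip()

print(repr(unchanged), repr(no_spaces), repr(cleaned))
```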