Getting incorrect HTML content of a page using urllib and requests

[Question title]: Getting incorrect HTML content of a page using urllib and requests
[Posted]: 2021-10-14 05:55:52
[Question]:

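I used two methods to get the page source of an internal application link.

Method 1: using the Robot Framework keyword: ${html_page} =    Get Source
Method 2:
- using urllib: visit_url_content = urllib.request.urlopen(url).read().decode('utf-8'), and
- using requests: visit_url_content = requests.get(url, 'html.parser').text

After getting the page source, I use BeautifulSoup to extract all links with tag a and an "href" attribute: soup = BeautifulSoup(html_page, "html.parser")

With the first method I get about 20 links, but with the second method I only get 2. I need to handle this in Python, so I can't use the Robot Framework option. Any help on why this happens would be appreciated.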

[Comments]:

Why is this question at -1?

Tags: html python-3.x python-requests robotframework urllib

[Answer 1]:

It's a bit unclear what your code actually looks like, since you only posted a few code snippets. I assume it looks something like this:

import urllib.request
from bs4 import BeautifulSoup

URL = "your-url"

# Fetch the raw HTML exactly as the server returns it
html = urllib.request.urlopen(URL).read().decode('utf-8')

soup = BeautifulSoup(html, "html.parser")

# Print the href of every <a> tag that has one
for a in soup.find_all('a', href=True):
    print(a["href"])

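Based on StackOverflow: BeautifulSoup getting href

Does this code differ from yours in any way? Can you share the complete code of your crawler and the URL you are trying to crawl? Otherwise it is hard to figure out what the problem is.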
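
In the meantime, one way to narrow the problem down is to save the page source produced by each method and check which links are missing from the second one. The following is only a minimal sketch: it uses the standard library's html.parser so it runs without extra packages, and the names HrefCollector, extract_hrefs and missing_links, as well as the sample HTML strings, are made up for illustration and do not come from the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all <a href="..."> values in document order."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


def missing_links(browser_html, fetched_html):
    """Return hrefs found in browser_html but not in fetched_html."""
    fetched = set(extract_hrefs(fetched_html))
    return [h for h in extract_hrefs(browser_html) if h not in fetched]


# Hypothetical stand-ins for the two page sources being compared
browser_html = '<a href="/home">Home</a><a href="/reports">Reports</a><a name="x">no href</a>'
fetched_html = '<a href="/home">Home</a>'

print(missing_links(browser_html, fetched_html))
```

In practice, browser_html would be the string returned by the Robot Framework keyword and fetched_html the string returned by urllib.request.urlopen(url).read().decode('utf-8'). Knowing exactly which links are absent from the second source makes it easier to see where the two page sources diverge.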

[Comments]: