Beautifulsoup Returning Wrong href Value

【Question title】: Beautifulsoup Returning Wrong href Value
【Posted】: 2021-12-27 19:13:20
【Question】:

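I am using the following code to scrape a SERP for some SEO work, but when I try to read the href attribute I get incorrect results: it shows other weird URLs from the page, not the one I expect. What is wrong with my code?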

import html  # needed for html.unescape() below
import requests
from bs4 import BeautifulSoup

URL = "https://www.google.com/search?q=beautiful+soup&rlz=1C1GCEB_enIN922IN922&oq=beautiful+soup&aqs=chrome..69i57j69i60l3.2455j0j7&sourceid=chrome&ie=UTF-8"
r = requests.get(URL)
webPage = html.unescape(r.text) 

soup = BeautifulSoup(webPage, 'html.parser')
text =''
gresults = soup.findAll('h3') 

for result in gresults:
    print (result.text)
    links = result.parent.parent.find_all('a', href=True)
    for link in links:
        print(link.get('href'))

The output looks like this:

/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q

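【Comments】:

That's a strange way to find the information. Can't you search for the specific <div> by the class that contains your link?
I didn't quite understand your reply, but I'm not using classes, I'm relying only on tags. In this case I find the h3 tag, go up twice to the grandparent node, then look for <a> tags and retrieve the href attribute.

Tags: python beautifulsoup python-requests href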

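【Answer 1】:

1. This returns every <h3> element in the HTML, including text from sections such as "Related searches", "Videos", and "People also ask", which is not what you are looking for here.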

gresults = soup.findAll('h3')

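2. This way of searching is fine in some cases, but it isn't preferred here because you are doing it somewhat blindly, and if one of the .parent nodes (elements) disappears, it will raise an error.

Instead of doing all of that, call an appropriate CSS selector (more on that below). This kind of method chaining can also become unreadable when there are many parent nodes.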

result.parent.parent.find_all()
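
To see points 1 and 2 in action, here is a minimal sketch on made-up markup. The HTML, the `extras` id and the example.com URL are assumptions for illustration only; Google's real page structure is far more complex. It suggests one way the parent-chain walk can end up printing unexpected URLs: for a stray heading, going up two levels can reach the document root, and `find_all('a')` from there returns links that belong to other parts of the page.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup (not Google's actual HTML).
SAMPLE_HTML = """
<div id="search">
  <div class="result"><a href="https://example.com/docs"><h3>Docs</h3></a></div>
</div>
<div id="extras"><h3>Related searches</h3></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find_all('h3') matches every heading, including the non-result one.
all_headings = [h3.get_text() for h3 in soup.find_all("h3")]

# A selector scoped to links inside #search keeps only result headings.
result_headings = [h3.get_text() for h3 in soup.select("#search a h3")]

# Walking up twice from the stray heading reaches the document root,
# so find_all('a') picks up a link that belongs to a different block.
stray = soup.find("div", id="extras").h3
links_from_stray = [a["href"] for a in stray.parent.parent.find_all("a", href=True)]

print(all_headings)
print(result_headings)
print(links_from_stray)
```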

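3. get('href') works, but you get a URL like this because no user-agent was passed in the request headers, which is needed to "act" as a real user visit. When a user-agent is passed in the request headers, you get the proper URL (I don't know the exact explanation for this behavior).

If no user-agent is passed in the request headers when using the requests library, it defaults to python-requests, so Google or other search engines (websites) understand that it is a bot/script and might block the request, or the received HTML will be different from what you see in your browser. Check what's your user-agent. List of user-agents.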
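
A quick way to check this locally, as a sketch that makes no network request: `requests.utils.default_user_agent()` reports the value requests sends when no User-Agent is set. The `Session` part is only one possible way to apply a browser-like value to several requests; the browser string is the example used in this answer, not a recommended or guaranteed value.

```python
import requests

# The value requests sends when no User-Agent header is provided.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# Example browser-like string (an assumption; any current browser UA could be used).
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
)

# A Session keeps the header for every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": browser_ua})
print(session.headers["User-Agent"])
```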

Pass a user-agent in the request headers:

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('URL', headers=headers)
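
If a redirect-style href such as the one in the question still comes back, one possible fallback is to pull the target out of its `q` query parameter with the standard library. This is a sketch under the assumption that the wrapper always has the `/url?q=<target>&...` shape seen in the question; `unwrap_google_redirect` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import parse_qs, urlsplit


def unwrap_google_redirect(href):
    """Return the target of a '/url?q=...' href, or href unchanged otherwise."""
    parts = urlsplit(href)
    if parts.path != "/url":
        return href  # already a direct link
    targets = parse_qs(parts.query).get("q")
    return targets[0] if targets else href


# The href printed in the question.
wrapped = (
    "/url?q=https://www.crummy.com/software/BeautifulSoup/bs4/doc/"
    "&sa=U&ved=2ahUKEwjv6-q3tJ30AhX_r1YBHU9OAeMQFnoECAAQAg&usg=AOvVaw2Q"
)
print(unwrap_google_redirect(wrapped))
```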

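To make it work, you need to:

1. Find the container that holds all the needed data by calling a specific CSS selector (have a look at the SelectorGadget extension). CSS selectors reference.

Think of the container as a box with stuff inside, from which you grab items by specifying which item you want. In your case it would be (without using two for loops):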

# .yuRUbf -> container
for result in soup.select('.yuRUbf'):
    
    # .DKV0Md -> CSS selector for title which is located inside a container
    title = result.select_one('.DKV0Md').text

    # grab <a> and extract href attribute.
    # .get('href') equal to ['href']
    link = result.select_one('a')['href']
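
One caveat with the snippet above: select_one() returns None when nothing matches, so `.text` or `['href']` would raise an error on an incomplete container. Below is a hedged sketch that skips such containers. The sample HTML is made up; the class names `.yuRUbf` and `.DKV0Md` are the ones from the snippet above and may stop matching whenever Google changes its markup. `extract_results` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup reusing the class names from the snippet above.
SAMPLE_HTML = """
<div class="yuRUbf"><a href="https://example.com/a"><h3 class="DKV0Md">First</h3></a></div>
<div class="yuRUbf"><a href="https://example.com/b"></a></div>
"""


def extract_results(html_text):
    """Collect title/link pairs, skipping containers that lack either part."""
    soup = BeautifulSoup(html_text, "html.parser")
    results = []
    for container in soup.select(".yuRUbf"):
        title_tag = container.select_one(".DKV0Md")
        link_tag = container.select_one("a[href]")
        if title_tag is None or link_tag is None:
            continue  # incomplete container: nothing reliable to extract
        results.append({
            "title": title_tag.get_text(strip=True),
            "link": link_tag["href"],
        })
    return results


results = extract_results(SAMPLE_HTML)
print(results)
```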

Full code and example in the online IDE:

import requests
from bs4 import BeautifulSoup


headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582'
}

response = requests.get('https://www.google.com/search?q=beautiful+soup', headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')


# enumerate() -> adds a counter to an iterable and returns it
# https://www.programiz.com/python-programming/methods/built-in/enumerate
for index, result in enumerate(soup.select('.yuRUbf')):
    position = index + 1
    title = result.select_one('.DKV0Md').text
    link = result.select_one('a')['href']

    print(position, title, link, sep='\n')


# part of the output
'''
1
Beautiful Soup 4.9.0 documentation - Crummy
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2
Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
https://beautiful-soup-4.readthedocs.io/
3
BeautifulSoup4 - PyPI
https://pypi.org/project/beautifulsoup4/
'''

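Alternatively, you can achieve the same thing by using the Google Organic Results API from SerpApi. It is a paid API with a free plan.

The difference in your case is that it was created for such tasks. You don't have to figure out which CSS selectors to use, how to bypass blocks from Google or other search engines, or maintain the code over time (if something in the HTML changes). Instead, you can focus on the data you want to get. Check out the playground (requires login).

Code to integrate: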

import os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),  # YOUR API KEY
    "engine": "google",               # search engine
    "q": "Beautiful Soup",            # query
    "hl": "en"                        # language
    # other parameters
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    position = result["position"]          # website rank position
    title = result["title"]
    link = result["link"]

    print(position, title, link, sep="\n")


# part of the output
'''
1
Beautiful Soup 4.9.0 documentation - Crummy
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
2
Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...
https://beautiful-soup-4.readthedocs.io/
3
BeautifulSoup4 - PyPI
https://pypi.org/project/beautifulsoup4/
'''

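Disclaimer: I work for SerpApi.

P.S. I have a dedicated web scraping blog.

【Comments】:

This is a really good explanation, especially the user-agent part. One thing about SerpApi is that integrating it with AWS Lambda would be complicated, and I currently rely on AWS Lambda as the serverless technology for my project.
@AhmedOsama Glad you found it useful :-) Also, I really appreciate you mentioning the complicated integration process with AWS Lambda. I, we, took note of it.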

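【Answer 2】:

What happens?

Selecting only <h3> gives you a result set that also contains unwanted elements.
Moving up to the parent's parent is okay, but calling find_all() (do not use the older syntax findAll() in new code) is not necessary, and it will also give you <a> elements you may not want.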
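
A small sketch of that difference on made-up markup (the HTML structure and the example.com URLs are assumptions for illustration): from the same grandparent node, find_all() returns every link in the block, while find() returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical result block: a heading plus two links under one grandparent.
SAMPLE_HTML = """
<div class="block">
  <div class="title"><h3>Page title</h3></div>
  <a href="https://example.com/page">example.com</a>
  <a href="https://example.com/cached">Cached</a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
heading = soup.find("h3")
grandparent = heading.parent.parent  # h3 -> div.title -> div.block

# find_all() collects every matching link under the grandparent.
all_links = [a["href"] for a in grandparent.find_all("a", href=True)]

# find() stops at the first matching link.
first_link = grandparent.find("a", href=True).get("href")

print(all_links)
print(first_link)
```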

How to fix?

Select your target element more specifically, and then you can use:

result.parent.parent.find('a',href=True).get('href')

But I recommend going with the following example.

Example

from bs4 import BeautifulSoup
import requests

    
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'http://www.google.com/search?q=beautiful+soup'

r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')

data = []

# select every <h3> inside a link within #search;
# this assumes each <h3> is a direct child of its <a>
for h3 in soup.select('#search a h3'):
    data.append({
        'title': h3.text,
        'url': h3.parent['href'],
    })
data

Output

[{'title': 'Beautiful Soup 4.9.0 documentation - Crummy',
  'url': 'https://www.crummy.com/software/BeautifulSoup/bs4/doc/'},
 {'title': 'Beautiful Soup Tutorial: Web Scraping mit Python',
  'url': 'https://lerneprogrammieren.de/beautiful-soup-tutorial/'},
 {'title': 'Beautiful Soup 4 - Web Scraping mit Python | HelloCoding',
  'url': 'https://hellocoding.de/blog/coding-language/python/beautiful-soup-4'},
 {'title': 'Beautiful Soup - Wikipedia',
  'url': 'https://de.wikipedia.org/wiki/Beautiful_Soup'},
 {'title': 'Beautiful Soup (HTML parser) - Wikipedia',
  'url': 'https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)'},
 {'title': 'Beautiful Soup Documentation — Beautiful Soup 4.4.0 ...',
  'url': 'https://beautiful-soup-4.readthedocs.io/'},
 {'title': 'BeautifulSoup4 - PyPI',
  'url': 'https://pypi.org/project/beautifulsoup4/'},
 {'title': 'Web Scraping und Parsen von HTML in Python mit Beautiful ...',
  'url': 'https://www.twilio.com/blog/web-scraping-und-parsen-von-html-python-mit-beautiful-soup'}]

【Comments】:

Thanks a lot, this works like a charm, but I still don't know what was wrong with my approach and why get('href') didn't work properly.