BeautifulSoup returns empty brackets

【Question title】: BeautifulSoup returns empty brackets
【Posted】: 2021-07-21 18:50:03
【Question description】:

I am trying to use the bs4 library in Python to find out how many results a Google search returns, but when I do, it returns empty brackets.

Here is my code:

import requests
from bs4 import BeautifulSoup


url_page = 'https://www.google.com/search?q=covid&oq=covid&aqs=chrome.0.0i433l2j0i131i433j0i433j0i131i433l2j0j0i131i433j0i433j0i131i433.691j0j7&sourceid=chrome&ie=UTF-8'

page = requests.get(url_page).text
soup = BeautifulSoup(page, "lxml")

elTexto = soup.find_all(attrs ={'class': 'LHJvCe'})
print(elTexto)

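I have a Chrome extension that checks whether an HTML class is correct, and it gives me what I am looking for, so I assume the class name is not the problem... Maybe it has something to do with the format of the 'text' I am trying to get... Thanks!

【Comments】:

Google randomizes class names to prevent exactly what you are doing.

Tags: python google-chrome web-scraping beautifulsoup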

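【Solution 1】:

It is better to use the gsearch package for this task than to scrape the page by hand.

【Discussion】:

【Solution 2】:

Google does not randomize class names, contrary to what baduker mentioned. They may change some class names over time, but they do not randomize them.

One reason you get an empty result is that you did not specify an HTTP user-agent (sent as part of the request headers), so Google may block your request; sending headers can help avoid that. By default, requests identifies itself as python-requests/<version>, not as a browser. You can look up your own browser's user-agent string and send that. The headers will look like this: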

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR URL', headers=headers)
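
To see the difference the headers make without sending anything to Google, here is a minimal sketch that only prepares the request and inspects the User-Agent that would go out. The helper name outgoing_user_agent is made up for this illustration, and the query parameters are placeholders:

```python
import requests

# Browser-like user-agent string taken from the answer above.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
)


def outgoing_user_agent(headers=None):
    """Return the User-Agent requests would send (hypothetical helper).

    The request is only prepared, never sent, so no network access happens.
    """
    session = requests.Session()
    request = requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"q": "covid"},  # placeholder query
        headers=headers,
    )
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


default_ua = outgoing_user_agent()
custom_ua = outgoing_user_agent({"User-Agent": BROWSER_UA})

print(default_ua)  # starts with "python-requests/"
print(custom_ua)   # the browser-like string above
```

If the first line printed is what Google sees, a blocked or stripped-down response is a plausible outcome, which would explain why the class from the question is not found.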

Also, you do not need find_all()/findAll() or select(), because you only want a single element, not all of them. Use one of these instead:

find('ELEMENT NAME', class_='CLASS NAME')
select_one('.CSS_SELECTORs')

select()/select_one() are usually faster.

Code and example in the online IDE (note: the number of results differs from run to run. It just works this way.):
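
As a rough illustration of how these calls differ when nothing matches, here is a small sketch on a hand-written HTML snippet. The snippet is an assumption made up for this example (it only imitates the #result-stats element used further down), and html.parser is used so that no extra parser is needed:

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a results page; not real Google markup.
html = (
    '<div id="result-stats">About 104,000 results'
    "<nobr> (0.50 seconds)</nobr></div>"
)
soup = BeautifulSoup(html, "html.parser")

# find_all() always returns a list, so "no match" shows up as [].
all_matches = soup.find_all(attrs={"class": "LHJvCe"})

# find() and select_one() return a single element, or None if nothing matches.
by_find = soup.find("div", id="result-stats")
by_select = soup.select_one("#result-stats")
missing = soup.select_one(".LHJvCe")

print(all_matches)       # []
print(by_find.get_text())
print(missing)           # None
```

The empty brackets in the question are simply find_all() reporting that no element with that class exists in the HTML that was actually received.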

import requests
import lxml  # not used directly; importing it fails early if the 'lxml' parser is not installed
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "fus ro dah defenition",
  "gl": "us",
  "hl": "en"
}

response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')

number_of_results = soup.select_one('#result-stats nobr').previous_sibling
print(number_of_results)

# About 104,000 results
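
One caveat with the code above: if Google serves a different page (for example a consent or block page), select_one() returns None and .previous_sibling raises AttributeError. A defensive sketch, assuming the same '#result-stats nobr' structure; the function name extract_result_count and both sample strings are made up for this illustration:

```python
import re

from bs4 import BeautifulSoup


def extract_result_count(html):
    """Return the result count as an int, or None if it cannot be found.

    Hypothetical helper; assumes the '#result-stats nobr' layout
    used in the answer above.
    """
    soup = BeautifulSoup(html, "html.parser")
    nobr = soup.select_one("#result-stats nobr")
    if nobr is None or nobr.previous_sibling is None:
        return None
    # The text node just before <nobr>, e.g. "About 104,000 results".
    text = str(nobr.previous_sibling)
    digits = re.sub(r"\D", "", text)
    return int(digits) if digits else None


# Hand-written samples, not real Google responses.
ok_page = (
    '<div id="result-stats">About 104,000 results'
    "<nobr> (0.50 seconds)</nobr></div>"
)
blocked_page = "<html><body>Before you continue</body></html>"

print(extract_result_count(ok_page))
print(extract_result_count(blocked_page))
```

In real use you would pass response.text to the function and treat None as a sign that the request was blocked or that the page layout changed.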

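Alternatively, you can use the Google Organic Results API from SerpApi to achieve the same thing. The difference is that you do not have to figure out why something is not working; you just iterate over structured JSON and pick out the data you want.

Code: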

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "fus ro dah defenition",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

result = results["search_information"]['total_results']
print(result)

# 104000

Disclaimer: I work for SerpApi.

【Discussion】: