【Question title】: Scraping Google with BeautifulSoup
【Posted】: 2020-09-16 16:10:16
【Question】:

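I am trying to scrape the Google knowledge panel to retrieve the name of a drug if it does not appear in the Google search results. For example, if I search Google for "Buscopan", the resulting page looks like this:

What I am trying to do with the code shown is to retrieve the term "Scopolamina-N-butilbromuro" from the knowledge panel, but after inspecting the element I cannot actually retrieve it from the HTML. The code I wrote, together with the error message, is as follows: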

import requests 
from bs4 import BeautifulSoup

# URL

url = "https://www.google.com/search?client=safari&rls=en&q="+"buscopan"+"&ie=UTF-8&oe=UTF-8"

# Sending HTTP request 
req = requests.get(url) 
  
# Pulling HTTP data from internet 
sor = BeautifulSoup(req.text, "html.parser")  
   
temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text


print(temp)


---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-ef5599a1a1fc> in <module>
     13 # Finding temperature in Celsius
     14 #temp = sor.find("h2", class_='qrShPb').text
---> 15 temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
     16 
     17 

AttributeError: 'NoneType' object has no attribute 'text'

I don't know what I am doing wrong. I think the HTML I need to look at is the following:

<h2 class="qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe" data-local-attribute="d3bn" data-attrid="title" data-ved="2ahUKEwjujfLcgO7rAhWKjosKHSiBAFEQ3B0oATASegQIEBAL"></h2>

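Of course, the rest of the HTML is in the attached screenshot, but if you need a larger version, please do not hesitate to ask!

Any suggestions?

Thanks,

Federico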

【Comments】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

To get the correct results page from Google Search, specify the User-Agent HTTP header. For example:

import requests 
from bs4 import BeautifulSoup


params = {
    'q': 'buscopan',    # <-- change to your keyword
    'hl': 'it'          # <-- change to `en` for english results
}

headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.google.com/search'
soup = BeautifulSoup(requests.get(url, params=params, headers=headers).content, 'html.parser')

print(soup.select_one('h2[data-attrid="title"]').text)

Prints:

Scopolamina-N-butilbromuro
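
A hedged sketch of why the original `find` call raised `AttributeError`: `find` and `select_one` return `None` when nothing matches, so calling `.text` on the result fails. The example below parses a static snippet (the `<h2>` from the question with illustrative text filled in, not a live Google response) and wraps the lookup in a hypothetical helper, `extract_title`, that returns `None` instead of raising. Google's markup changes often, so the selector should be treated as an assumption that may stop matching.

```python
from bs4 import BeautifulSoup

# Static stand-in for a results page; the real markup may differ.
SAMPLE_HTML = """
<div>
  <h2 class="qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe" data-attrid="title">
    <span>Scopolamina-N-butilbromuro</span>
  </h2>
</div>
"""


def extract_title(html):
    """Return the knowledge panel title, or None if the element is absent."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.select_one('h2[data-attrid="title"]')
    if heading is None:
        # No matching element: report that instead of failing on `.text`.
        return None
    return heading.get_text(strip=True)


title_found = extract_title(SAMPLE_HTML)
title_missing = extract_title("<html><body><p>No panel here</p></body></html>")

print(title_found)
print(title_missing)
```

Selecting on the `data-attrid` attribute rather than the full list of generated class names is a design choice that is likely to be less brittle, although neither is guaranteed to remain stable.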

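【Comments】:

Thanks @Andrej Kesely. May I also ask what a header is? Sorry, I am not very familiar with BeautifulSoup. Thanks again!
@NutarelliFederico HTTP headers are part of the HTTP protocol (BeautifulSoup is just a parser). When requests makes an HTTP request, it sends several HTTP headers. One of them is User-Agent, which identifies the browser. For more about HTTP headers, see for example: developer.mozilla.org/en-US/docs/Web/HTTP/Headers
I see. Thanks a lot again! Very helpful comments.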
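
To illustrate the point about headers made in the comments, here is a minimal sketch that builds the same request twice without sending it and compares the User-Agent header in each case. It relies on `requests.Session.prepare_request`, which merges the session's default headers into the request; the variable names are illustrative only. By default requests identifies itself as `python-requests/<version>`, which is presumably why Google returned a different page to the code in the question.

```python
import requests

url = "https://www.google.com/search"
params = {"q": "buscopan", "hl": "it"}
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) "
        "Gecko/20100101 Firefox/80.0"
    )
}

session = requests.Session()

# Prepare (but do not send) both requests, so no network access is needed.
default_request = session.prepare_request(
    requests.Request("GET", url, params=params)
)
custom_request = session.prepare_request(
    requests.Request("GET", url, params=params, headers=browser_headers)
)

default_agent = default_request.headers["User-Agent"]
custom_agent = custom_request.headers["User-Agent"]

print(default_request.url)
print("default:", default_agent)
print("custom: ", custom_agent)
```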

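【Solution 2】:

As an alternative to Andrej Kesely's solution, you can use the third-party Google Knowledge Graph API from SerpApi. It is a paid API with a free plan. Check out the Playground to test it.

Code to integrate (a full example is available in the online IDE):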

from serpapi import GoogleSearch
import os

params = {
    "q": "Buscopan",
    "google_domain": "google.com",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

title = results['knowledge_graph']['title']
print(title)

Output:

Butylscopolamine

Part of the JSON knowledge graph output:

"knowledge_graph": {
  "title": "Butylscopolamine",
  "type": "Medication",
  "description": "Hyoscine butylbromide, also known as scopolamine butylbromide and sold under the brandname Buscopan among others, is an anticholinergic medication used to treat crampy abdominal pain, esophageal spasms, renal colic, and bladder spasms. It is also used to improve respiratory secretions at the end of life.",
  "source": {
    "name": "Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Hyoscine_butylbromide"
  },
  "formula": "C₂₁H₃₀BrNO₄",
  "molar_mass": "440.371 g/mol",
  "chem_spider_id": "16736107",
  "trade_name": "Buscopan, others",
  "pub_chem_cid": "6852391",
  "ch_ebi_id": "32123",
  "people_also_search_for": "Scopolamine, Metamizole, MORE"
}
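
The direct lookup `results['knowledge_graph']['title']` raises `KeyError` if the key is missing. As a hedged sketch, assuming a response may omit `knowledge_graph` when no panel is shown (an assumption the answer does not confirm), the nested value can be read defensively. The sample below is a trimmed copy of the JSON above, and `knowledge_graph_title` is a hypothetical helper name.

```python
import json

# Trimmed copy of the partial JSON shown above, wrapped in braces to be valid.
SAMPLE_RESPONSE = json.loads("""
{
  "knowledge_graph": {
    "title": "Butylscopolamine",
    "type": "Medication",
    "trade_name": "Buscopan, others"
  }
}
""")


def knowledge_graph_title(results):
    """Return the knowledge graph title, or None when the key is absent."""
    return results.get("knowledge_graph", {}).get("title")


title = knowledge_graph_title(SAMPLE_RESPONSE)
# A response without a knowledge graph (illustrative shape only).
missing = knowledge_graph_title({"organic_results": []})

print(title)
print(missing)
```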

Disclaimer: I work for SerpApi.

【Comments】: