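Retrieve the first search result from Google using Python (answers)

【Question title】: Retrieve the first search result from Google using Python
【Posted】: 2018-09-05 07:12:56
【Question description】:

I have been trying to get the first search result with the code below. It works in some cases, but in others the output it returns is incomplete.

Code: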

import requests
from bs4 import BeautifulSoup

research_later = "ABCD filetype:pdf"
goog_search = "http://google.com/search?q=" + research_later


r = requests.get(goog_search)

soup = BeautifulSoup(r.text, "html.parser")
print(soup.find('cite').text)

Output:

www.altogetherbetter.org.uk/.../5-assetbasedcommunitydevelopment.pdf

The expected output is:

http://www.altogetherbetter.org.uk/Data/Sites/1/5-assetbasedcommunitydevelopment.pdf

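【Question discussion】:

Tags: python-3.x beautifulsoup python-requests

【Solution 1】:

Here is the code I used to solve the problem. It downloads the file, which was my ultimate goal once I had found the web link.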

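    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Selenium 4 style: the driver path is passed through Service and the
    # options through options= (the older executable_path= and
    # chrome_options= arguments are no longer accepted)
    options = webdriver.ChromeOptions()
    service = Service(r'C:\chromedriver_win32\chromedriver.exe')
    driver = webdriver.Chrome(service=service, options=options)
    research_later = "ABCD filetype:pdf"

    driver.get("http://google.com/search?q=" + research_later)

    # Wait until the first result link is clickable, then click it.
    # click() returns None, so its result is not assigned to anything.
    first_result = "#rso > div > div > div:nth-child(1) > div > div > h3 > a"
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, first_result))
    ).click()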

【Discussion】:

【Solution 2】:

It seems the cite tag does not contain the whole link. You probably want to take it from the "a" tag instead. Try this:

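import re

# Match the absolute URL embedded in the href (http or https)
regex = re.compile(r'https?://.+')
# Escape "?" so it matches literally instead of making the "l" optional
elem = soup.find('a', attrs={'href': re.compile(r'/url\?')})['href']
print(regex.search(elem).group())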

This gives you the link, but you may need another regular expression to clean it up further.
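
As one possible way to do that cleanup without a second regular expression, here is a minimal sketch that assumes the href has the form `/url?q=<target>&...`. The helper name `extract_target` and the trailing `sa`/`ved` values in the sample are made up for illustration; only the target URL comes from the question.

```python
from urllib.parse import parse_qs, urlsplit


def extract_target(href):
    """Return the destination URL from a '/url?q=...' style href, or None."""
    query = urlsplit(href).query          # everything after the first "?"
    values = parse_qs(query).get("q")     # list of values for the "q" parameter
    return values[0] if values else None


# Hypothetical href; the tracking parameters after the URL are invented
sample_href = (
    "/url?q=http://www.altogetherbetter.org.uk/Data/Sites/1/"
    "5-assetbasedcommunitydevelopment.pdf&sa=U&ved=0ah"
)
print(extract_target(sample_href))
```

Parsing the query string this way drops the extra parameters that the greedy `.+` pattern would otherwise keep.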

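【Discussion】:

【Solution 3】:

Actually, neither selenium nor regex is needed.

This is the selector you are looking for to get the first link (have a look at the SelectorGadget Chrome extension, which gives you CSS selectors by clicking on elements in the browser):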

first_link = soup.select_one('.yuRUbf a')['href']
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors

Also, the next problem may be that no user-agent is specified, in which case Google eventually blocks the request and you receive completely different HTML.
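
When the returned HTML differs, `select_one` returns `None`, and indexing it with `['href']` raises a `TypeError`. A minimal sketch of a guarded lookup follows; the helper name `first_result_href` and both HTML snippets are hypothetical stand-ins, not real Google responses.

```python
from bs4 import BeautifulSoup


def first_result_href(html):
    """Return the href of the first '.yuRUbf a' link, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(".yuRUbf a")
    return link.get("href") if link is not None else None


# Hypothetical snippets standing in for the two kinds of response
expected_html = '<div class="yuRUbf"><a href="https://example.com/a.pdf">A</a></div>'
blocked_html = "<html><body><p>Unusual traffic detected</p></body></html>"

print(first_result_href(expected_html))
print(first_result_href(blocked_html))
```

Returning `None` lets the caller decide whether to retry, change the headers, or report that the page layout changed.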

Code and example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "ABCD filetype:pdf",   # query
  "gl": "us",                 # country to search from
  "hl": "en"                  # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

first_link = soup.select_one('.yuRUbf a')['href']
print(first_link)

# https://resources.depaul.edu/abcd-institute/resources/Documents/WhatisAssetBasedCommunityDevelopment.pdf
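
Passing the query through `params` means it gets URL-encoded, unlike the plain string concatenation in the question. As a rough illustration using only the standard library (the exact URL that requests builds may differ in small details), the same dictionary can be encoded like this:

```python
from urllib.parse import urlencode

params = {
    "q": "ABCD filetype:pdf",   # query
    "gl": "us",                 # country to search from
    "hl": "en",                 # language
}

# Spaces become "+" and ":" becomes "%3A" in the encoded query string
query_string = urlencode(params)
url = "https://www.google.com/search?" + query_string
print(url)
```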

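Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It is a paid API with a free plan.

The difference in your case is that you only need to iterate over structured JSON and pick out the data you want, rather than figuring out why something does not work as expected, and you do not have to maintain the parser over time.

Code to integrate: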

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "bABCD filetype:pdf",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

first_link = results['organic_results'][0]['link']
print(first_link)

# https://dohcoey14i4kf.cloudfront.net/sites/default/files/despiece_maquina_mc507_0.pdf
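
Indexing `results['organic_results'][0]` raises an error if the key is missing or the list is empty. A minimal sketch of a guarded version follows, assuming only the `organic_results` / `link` shape used above. The helper name `first_organic_link` and the sample dictionary are hypothetical.

```python
def first_organic_link(results):
    """Return the link of the first organic result, or None if there is none."""
    organic = results.get("organic_results") or []
    if not organic:
        return None
    return organic[0].get("link")


# Hypothetical response fragments for illustration only
sample = {"organic_results": [{"position": 1, "link": "https://example.com/first.pdf"}]}
empty = {"search_information": {}}

print(first_organic_link(sample))
print(first_organic_link(empty))
```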

Disclaimer: I work for SerpApi.

【Discussion】: