【Question】: Getting Google Search Result URLs from Search String or URL
【Posted】: 2020-03-19 09:11:03
【Description】:

I want to collect all of the search results and store them in a list or similar. After inspecting a Google results page, I found that every result technically sits inside an element with class g:

So, technically, extracting the URLs from the results page should be straightforward:

import urllib
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)

However, I get no output. Why?

Edit: even parsing a manually saved copy of the page doesn't help:

import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")

#soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)

【Comments】:

Tags: python web-scraping beautifulsoup


【Solution 1】:

In fact, if you print response.content and inspect the output, you will find that there is no HTML tag with class g. Those elements appear to be loaded dynamically, whereas BeautifulSoup only parses the static HTML it is given. That is why looking up tags with class g returns no elements.
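The point can be illustrated offline with a minimal sketch (the HTML snippet below is hypothetical, standing in for what requests actually receives): BeautifulSoup can only find what exists in the markup it parses, so if the fetched HTML lacks class g, find_all simply returns an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet resembling what requests receives: the results
# carry a different class than the browser-rendered page shows.
static_html = '<html><body><div class="x"><a href="https://example.com">hit</a></div></body></html>'

soup = BeautifulSoup(static_html, 'html.parser')
print(soup.find_all(class_='g'))   # [] - the class simply is not in the markup
print(soup.find_all(class_='x'))   # the same content is found under its real class
```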

【Comments】:

  • Yes, the reason it doesn't show up in the output is that Google renders the page with JavaScript after it loads. So the only way is to use selenium or dryscrape :) Otherwise pypi.org/project/google-search-results-serpwow
  • @Jishan check my answer :)
  • @Jishan Buddy, you made the same mistake again. response.content does not give you the full HTML page you see in the browser. Try saving the page from the browser and then opening it in your code; it will work fine.
【Solution 2】:
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import time

browser = webdriver.Firefox()
dork = 'cyber security'
browser.get(f"https://www.google.com/search?q={quote_plus(dork)}")
time.sleep(5)  # wait for the JavaScript-rendered results to load
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')

for item in soup.find_all('div', attrs={'class': 'r'}):
    for href in item.find_all('a'):
        print(href.get('href'))

【Comments】:

【Solution 3】:

The following approach should fetch a few of the result links from the landing page. You may need to strip out some links ending with dots. Scraping links out of Google search with requests really is hard work.

import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    res = requests.get(url.format(query.replace(" ", "+")), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ", "/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
    
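A side note on the query handling above: `query.replace(" ", "+")` only handles spaces, while urllib.parse covers full URL encoding. A minimal sketch (the query string is just an example):

```python
from urllib.parse import quote_plus, urlencode

query = 'cyber security & "zero trust"'

# quote_plus percent-encodes unsafe characters and turns spaces into '+'
print(quote_plus(query))  # cyber+security+%26+%22zero+trust%22

# urlencode builds a complete query string from a dict of parameters
print(urlencode({'q': query, 'hl': 'en'}))  # q=cyber+security+%26+%22zero+trust%22&hl=en
```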

【Comments】:

【Solution 4】:

You can always test by climbing up or down several elements with next_sibling/previous_sibling or next_element/previous_element. All of the results sit in a <div> element with the .tF2Cxc class.

Grabbing the URLs is straightforward:

1. Use a for loop combined with the bs4 .select() method, which takes a CSS selector as input.
2. Call the .select_one() method with the .yuRUbf CSS selector.
3. Grab the href attribute from the <a> tag.

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']
      

Code and example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'cyber security'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']  # or ('.yuRUbf a')['href']
    print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://digitalguardian.com/blog/what-cyber-security
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://en.wikipedia.org/wiki/Computer_security
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
https://staysafeonline.org/
'''
      
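Class names like .tF2Cxc change whenever Google ships a redesign. A somewhat more change-resistant sketch (the HTML fragment below is hypothetical, in the style of Google's no-JavaScript markup) targets the /url?q=... redirect links instead and unpacks them with urllib.parse:

```python
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

# Hypothetical fragment in the style of Google's no-JS result markup
html = '''
<div><a href="/url?q=https://en.wikipedia.org/wiki/Computer_security&sa=U">...</a></div>
<div><a href="/url?q=https://www.cisa.gov/cybersecurity&sa=U">...</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
links = []
for a in soup.select('a[href^="/url?q="]'):
    # /url?q=<target>&sa=... -> take the q parameter of the redirect URL
    target = parse_qs(urlparse(a['href']).query)['q'][0]
    links.append(target)

print(links)
```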

Alternatively, you can do the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),  # API key read from the environment
    "engine": "google",               # search engine
    "q": "cyber security",            # query
    "hl": "en",                       # language
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://digitalguardian.com/blog/what-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://staysafeonline.org/
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
'''
      

Disclaimer: I work for SerpApi.

【Comments】:
