【Question】: Getting Google Search Result URLs from Search String or URL
【Posted】: 2020-03-19 09:11:03
【Description】:

I want to collect all of the search results and store them in a list or similar. After inspecting a Google results page, I found that every result technically sits inside an element with class g:

So, technically, extracting the URLs from the results page should be straightforward:

import urllib
from bs4 import BeautifulSoup
import requests

text = 'cyber security'
text = urllib.parse.quote_plus(text)

url = 'https://google.com/search?q=' + text

response = requests.get(url)

soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)

However, I get no output. Why?

Edit: even parsing a manually saved copy of the page doesn't help:

import webbrowser

with open('output.html', 'wb') as f:
    f.write(response.content)
webbrowser.open('output.html')

url = "output.html"
page = open(url)
soup = BeautifulSoup(page.read(), features="lxml")

#soup = BeautifulSoup(response.content, 'lxml')
for result_divs in soup.find_all(class_='g'):
    links = [div.find('a') for div in result_divs]
    hrefs = [link.get('href') for link in links]
    print(hrefs)

【Comments】:

Tags: python web-scraping beautifulsoup


【Solution 1】:

In fact, if you print response.content and inspect the output, you will find that there is no HTML tag with class g. Those elements appear to be loaded dynamically, whereas BeautifulSoup only parses the static HTML it is given. That is why looking up tags with class g returns no elements.
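The point can be illustrated offline with a minimal sketch (the HTML snippet below is hypothetical, standing in for what requests actually receives): BeautifulSoup can only find what exists in the markup it parses, so if the fetched HTML lacks class g, find_all simply returns an empty list.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet resembling what requests receives: the results
# carry a different class than the browser-rendered page shows.
static_html = '<html><body><div class="x"><a href="https://example.com">hit</a></div></body></html>'

soup = BeautifulSoup(static_html, 'html.parser')
print(soup.find_all(class_='g'))   # [] - the class simply is not in the markup
print(soup.find_all(class_='x'))   # the same content is found under its real class
```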

【Comments】:

  • Yes, the reason it doesn't show up in the output is that Google renders the page with JavaScript after it loads. So the only way is to use selenium or dryscrape :) Otherwise pypi.org/project/google-search-results-serpwow
  • @Jishan check my answer :)
  • @Jishan Buddy, you made the same mistake again. response.content does not give you the full HTML page you see in the browser. Try saving the page from the browser and then opening it in your code; it will work fine.
【Solution 2】:
from selenium import webdriver
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
import time

browser = webdriver.Firefox()
dork = 'cyber security'
browser.get(f"https://www.google.com/search?q={quote_plus(dork)}")
time.sleep(5)  # wait for the JavaScript-rendered results to load
source = browser.page_source
soup = BeautifulSoup(source, 'html.parser')

for item in soup.find_all('div', attrs={'class': 'r'}):
    for href in item.find_all('a'):
        print(href.get('href'))

【Comments】:

【Solution 3】:

The following approach should fetch a few of the result links from the landing page. You may need to strip out some links ending with dots. Scraping links out of Google search with requests really is hard work.

import requests
from bs4 import BeautifulSoup

url = "http://www.google.com/search?q={}&hl=en"

def scrape_google_links(query):
    res = requests.get(url.format(query.replace(" ", "+")), headers={"User-Agent": "Mozilla/5.0"})
    soup = BeautifulSoup(res.text, "lxml")
    for result in soup.select(".kCrYT > a > .BNeawe:nth-of-type(2)"):
        print(result.text.replace(" › ", "/"))

if __name__ == '__main__':
    scrape_google_links('cyber security')
    
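A side note on the query handling above: `query.replace(" ", "+")` only handles spaces, while urllib.parse covers full URL encoding. A minimal sketch (the query string is just an example):

```python
from urllib.parse import quote_plus, urlencode

query = 'cyber security & "zero trust"'

# quote_plus percent-encodes unsafe characters and turns spaces into '+'
print(quote_plus(query))  # cyber+security+%26+%22zero+trust%22

# urlencode builds a complete query string from a dict of parameters
print(urlencode({'q': query, 'hl': 'en'}))  # q=cyber+security+%26+%22zero+trust%22&hl=en
```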

【Comments】:

【Solution 4】:

You can always test by climbing up or down several elements with next_sibling/previous_sibling or next_element/previous_element. All of the results sit in a <div> element with the .tF2Cxc class.

Grabbing the URLs is straightforward:

1. Use a for loop combined with the bs4 .select() method, which takes a CSS selector as input.
2. Call the .select_one() method with the .yuRUbf CSS selector.
3. Grab the href attribute from the <a> tag.

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']
      

Code and example in the online IDE:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'cyber security'}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf').a['href']  # or ('.yuRUbf a')['href']
    print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://digitalguardian.com/blog/what-cyber-security
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://en.wikipedia.org/wiki/Computer_security
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
https://staysafeonline.org/
'''
      
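Class names like .tF2Cxc change whenever Google ships a redesign. A somewhat more change-resistant sketch (the HTML fragment below is hypothetical, in the style of Google's no-JavaScript markup) targets the /url?q=... redirect links instead and unpacks them with urllib.parse:

```python
from urllib.parse import urlparse, parse_qs
from bs4 import BeautifulSoup

# Hypothetical fragment in the style of Google's no-JS result markup
html = '''
<div><a href="/url?q=https://en.wikipedia.org/wiki/Computer_security&sa=U">...</a></div>
<div><a href="/url?q=https://www.cisa.gov/cybersecurity&sa=U">...</a></div>
'''

soup = BeautifulSoup(html, 'html.parser')
links = []
for a in soup.select('a[href^="/url?q="]'):
    # /url?q=<target>&sa=... -> take the q parameter of the redirect URL
    target = parse_qs(urlparse(a['href']).query)['q'][0]
    links.append(target)

print(links)
```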

Alternatively, you can do the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "api_key": os.getenv("API_KEY"),  # API key read from the environment
    "engine": "google",               # search engine
    "q": "cyber security",            # query
    "hl": "en",                       # language
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    print(link)

# output:
'''
https://www.kaspersky.com/resource-center/definitions/what-is-cyber-security
https://digitalguardian.com/blog/what-cyber-security
https://en.wikipedia.org/wiki/Computer_security
https://www.cisco.com/c/en/us/products/security/what-is-cybersecurity.html
https://staysafeonline.org/
https://searchsecurity.techtarget.com/definition/cybersecurity
https://www.cisa.gov/cybersecurity
https://www.csoonline.com/article/3482001/what-is-cyber-security-types-careers-salary-and-certification.html
'''
      

Disclaimer: I work for SerpApi.

【Comments】:
