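【Question title】: Web Scraping HTML Not Same as Browser Result
【Posted】: 2019-06-02 15:15:11
【Question】:

For my project I need Google search results. I'm using Python's requests and BeautifulSoup. I get results, but they differ from what I see in the browser. I need exactly what the browser shows. I also tried urllib, but its output differs from the browser's as well. Can anyone help me solve this?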

import requests
import bs4

link = 'https://www.google.com/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip, deflate',
    'DNT': '1',
    'Connection': 'keep-alive',
    'Upgrade-Insecure-Requests': '1'
}
response = requests.get(link, headers=headers)
soup = bs4.BeautifulSoup(response.text, 'lxml')

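【Comments】:

  • What do you mean by "different from the browser result"?
  • Your browser normally stores cookies, which can make the results inconsistent; try comparing the response with what you see in incognito mode. Also, sites like Google personalize search results based on your location.
  • Yes. The results differ from the browser results. @ᴀʀᴍᴀɴ

Tags: python python-3.x beautifulsoup python-requests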


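【Answer 1】:

Most websites run JavaScript to update the page after it loads. Some of them also try to detect crawlers.

Use a headless browser for scraping instead.

As mentioned in the comments, some sites also use cookies. For example, Google's search results are as good as they are largely because they are tailored to the user.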
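
One rough way to judge whether a page depends on JavaScript, before reaching for a headless browser, is to check how much of the HTML returned to your script is inline script rather than visible text. The sketch below uses only the standard library; the helper name `script_ratio` and both sample pages are made up for illustration, and the ratio is a heuristic, not a reliable detector.

```python
from html.parser import HTMLParser


class _TextSplitter(HTMLParser):
    """Counts characters inside <script> tags versus ordinary text."""

    def __init__(self):
        super().__init__()
        self.current = None      # "script", "style", or None
        self.script_chars = 0
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        n = len(data.strip())
        if self.current == "script":
            self.script_chars += n
        elif self.current is None:   # ignore CSS inside <style>
            self.text_chars += n


def script_ratio(html):
    """Return the share of characters that sit inside <script> tags (0.0 to 1.0)."""
    parser = _TextSplitter()
    parser.feed(html)
    parser.close()
    total = parser.script_chars + parser.text_chars
    return parser.script_chars / total if total else 0.0


# Hypothetical sample pages: one fully rendered on the server, one filled in by JavaScript.
static_html = "<html><body><h1>Results</h1><p>Ten blue links</p></body></html>"
dynamic_html = ('<html><body><div id="app"></div>'
                "<script>render(fetchResults());</script></body></html>")

print(script_ratio(static_html))
print(script_ratio(dynamic_html))
```

A ratio close to 1 suggests that most of the content is produced in the browser, in which case a headless browser is more likely to be needed.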

【Comments】:

    【Answer 2】:

    This is not because of JavaScript. To get results close to what you see in the browser, you can pass additional query params to your request:

    params = {
      "q": "what is the best minecraft skin in 2021", # query 
      "gl": "uk",                                     # country to search from (United Kingdom)
      "hl": "en",                                     # language
      "google_domain": "google.com"                   # google domain
    }
    requests.get("YOUR_URL", params=params)
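
To see what these parameters do to the request, it can help to build the final URL by hand. The sketch below uses only the standard library; `build_search_url` is a hypothetical helper, and the result should be equivalent to what `requests` produces from the same `params` dict, since both form-encode the values.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(query, country="uk", language="en",
                     base="https://www.google.com/search"):
    """Append the search parameters to the base URL as a query string."""
    params = {
        "q": query,        # search query
        "gl": country,     # country to search from
        "hl": language,    # interface language
    }
    return f"{base}?{urlencode(params)}"


url = build_search_url("what is the best minecraft skin in 2021")
print(url)

# Parse the query string back to confirm what the server would receive.
print(parse_qs(urlsplit(url).query))
```

Opening the printed URL in an incognito window gives a fair comparison with what the script receives.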
    

    Code and example in the online IDE:

    from bs4 import BeautifulSoup
    import requests, json, lxml
    
    headers = {
        'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    params = {
      "q": "what is the best minecraft skin in 2021",
      "gl": "uk",
      "hl": "en"
    }
    
    html = requests.get("https://www.google.com/search", headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')
    
    data = []
    
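    # Each organic result sits in a .tF2Cxc container (Google's class names may change over time)
    for result in soup.select('.tF2Cxc'):
      title = result.select_one('.DKV0Md').text
      link = result.select_one('.yuRUbf a')['href']
      # Not every result has a snippet; select_one() returns None when nothing matches
      snippet_tag = result.select_one('#rso .lyLwlc')
      snippet = snippet_tag.text if snippet_tag else None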
    
      data.append({
          'title': title,
          'link': link,
          'snippet': snippet,
      })
    
    print(json.dumps(data, indent=2, ensure_ascii=False))
    
    -------
    '''
    [
      {
        "title": "The Ultimate Guide to Minecraft Skins in 2021 - CodaKid",
        "link": "https://codakid.com/minecraft-skins/",
        "snippet": null
      },
      {
        "title": "Search results for \"2021\" - Minecraft Skins",
        "link": "https://www.minecraftskins.com/search/skin/2021/1/",
        "snippet": "Red Dream Classic (4px arms) · Banan132. 1. 0. NewThealexx · NewThealexx. 1. 0. Zero Dragneel 2021 skin · reddwd123. 0. 0. contest // because i'm bored :(."
      },
      {
        "title": "Best Minecraft skins for 2021 | Rock Paper Shotgun",
        "link": "https://www.rockpapershotgun.com/best-minecraft-skins",
        "snippet": "12 Jul 2021 — Gamer Girl skin. This Gamer Girl Minecraft skin is one of the most popular skins out there. It's cute, it's well-designed, and with the grass ..."
      }
     # more ... 
    ]
    '''
    

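    Screenshot of what I see in my browser (same results):


    Alternatively, you can use the Google Organic Results API from SerpApi to do the same thing. It's a paid API with a free plan.

    The difference in your case is that this is handled for you: you don't have to work it out yourself or maintain a parser over time. All that's left is to iterate over the structured JSON and pick out the data you need. Check out the Playground.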

    import os
    from serpapi import GoogleSearch
    
    params = {
        "engine": "google",
        "q": "best minecraft skin in 2021",
        "hl": "en",
        "gl": "uk",
        "api_key": os.getenv("API_KEY"),
    }
    
    search = GoogleSearch(params)
    results = search.get_dict()
    
    for result in results["organic_results"]:
      # print(result['title'])
      # print(result['link'])
      print(f"Title: {result['title']}\nSummary: {result['snippet']}\nLink: {result['link']}\n")
    
    -------
    '''
    Title: Search results for "2021" - Minecraft Skins
    Summary: Red Dream Classic (4px arms) · Banan132. 1. 0. NewThealexx · NewThealexx. 1. 0. Zero Dragneel 2021 skin · reddwd123. 0. 0. contest // because i'm bored :(.
    Link: https://www.minecraftskins.com/search/skin/2021/1/
    
    Title: Best Minecraft skins for 2021 | Rock Paper Shotgun
    Summary: Gamer Girl skin. This Gamer Girl Minecraft skin is one of the most popular skins out there. It's cute, it's well-designed, and with the grass ...
    Link: https://www.rockpapershotgun.com/best-minecraft-skins
    
    Title: 5 best Minecraft skins in 2021 - Sportskeeda
    Summary: Five best Minecraft skins in 2021 · #5 Glowing Devil · #4 Glitching · #3 Rose and Bows · #2 Liquid Rainbow · #1 Save the bees.
    Link: https://www.sportskeeda.com/minecraft/5-best-minecraft-skins-2021
    
    Title: Best Minecraft skins | PC Gamer
    Summary: Cool skins; Girl skins; Anime Skins; Funny skins; Videogame skins; Skin sites. Best of Minecraft.
    Link: https://www.pcgamer.com/uk/the-best-minecraft-skins/
    
    # more ...
    '''
    

    Disclaimer: I work for SerpApi.

    【Comments】:
