【Question title】: Scraping Google search using BeautifulSoup
【Posted】: 2018-11-15 17:23:09
【Question】:

I want to scrape multiple pages of Google search results. So far I can only scrape the first page; how can I do this for multiple pages?

from bs4 import BeautifulSoup
import requests
import urllib.request
import re
from collections import Counter

def search(query):
    url = "http://www.google.com/search?q="+query

    text = []
    final_text = []

    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text,"html.parser")

    for desc in soup.find_all("span",{"class":"st"}):
        text.append(desc.text)

    for title in soup.find_all("h3",attrs={"class":"r"}):
        text.append(title.text)

    for string in text:
        string  = re.sub("[^A-Za-z ]","",string)
        final_text.append(string)

    count_text = ' '.join(final_text)
    res = Counter(count_text.split())

    keyword_Count = dict(sorted(res.items(), key=lambda x: (-x[1], x[0])))

    for x,y in keyword_Count.items():
        print(x ," : ",y)


search("girl")

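【Comments】:

  • Scrape the link that points to the next page, call requests.get(href_for_next_page), rinse and repeat.
  • I recommend reading the book Web Scraping with Python. I'm sure you can find a PDF somewhere online, but I would also buy it. Page 68 has good information on this topic. In any case, you should put your code in a loop and limit how many times it runs; otherwise you will run endless code and eat up server resources.
  • @Kamikaze_goldfish In Google's case you have to limit it not because it would crash the server, but because if you make endless requests Google simply blocks your IP for a few hours.
  • Yes, I should have clarified that. Most websites will blacklist you.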
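
The advice in these comments (follow the next-page link in a loop, but cap the number of iterations and pause between requests) could be sketched roughly as follows. `crawl`, `fetch_page`, `max_pages` and `delay_seconds` are hypothetical names introduced for illustration; `fetch_page` stands in for whatever function downloads one page and returns the next page's URL, or None when there is none.

```python
import time


def crawl(fetch_page, first_url, max_pages=5, delay_seconds=2.0):
    """Follow next-page links, stopping after max_pages requests.

    fetch_page(url) is a caller-supplied (hypothetical) function that
    downloads one page and returns the URL of the next page, or None.
    """
    visited = []
    url = first_url
    while url and len(visited) < max_pages:
        visited.append(url)
        next_url = fetch_page(url)
        # Pause only when another request will actually follow.
        if next_url and len(visited) < max_pages:
            time.sleep(delay_seconds)
        url = next_url
    return visited
```

The page cap guarantees the loop terminates even if the site keeps offering a next link, and the delay is only a courtesy; it does not guarantee that Google will not block the client.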

Tags: python search beautifulsoup scrape


【Answer 1】:
url = "http://www.google.com/search?q=" + query + "&start=" + str((page - 1) * 10)
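
The one-liner above leaves `page` undefined and does not URL-encode the query. A minimal sketch of the same idea, assuming (as the answer implies) that `start` is a zero-based result offset with 10 results per page; `build_search_url` and `results_per_page` are names made up here for illustration:

```python
from urllib.parse import urlencode


def build_search_url(query, page, results_per_page=10):
    """Return the search URL for a 1-based page number."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    # Page 1 -> start=0, page 2 -> start=10, and so on.
    params = {"q": query, "start": (page - 1) * results_per_page}
    return "https://www.google.com/search?" + urlencode(params)


# Print the URLs for the first three result pages.
for page in range(1, 4):
    print(build_search_url("coca cola", page))
```

Each URL can then be fetched and parsed the same way the question's `search` function handles the first page.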

【Comments】:

    【Answer 2】:

    As the comments above say, you need the next-page URL and have to put the code in a loop.

    def search(query):
        url = "https://www.google.com/search?hl=en&q=" + query
        while url:
            text = []
            ....
            ....
            for x,y in keyword_Count.items():
                print(x ," : ",y)
    
            # get next page url
            url = soup.find('a', id='pnnext')
            if url:
                url = 'https://www.google.com' + url['href']
            else:
                print('no next page, loop ended')
                break
    

    For soup.find('a', id='pnnext') to work, you may need to set a User-Agent for the request.
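
As a rough illustration of what setting a user agent means: the sketch below builds a request object carrying a browser-like `User-Agent` header, using the standard library's `urllib.request` (which the question already imports). The header string is only an example value and `make_request` is a hypothetical helper; with `requests`, the equivalent is passing the same dictionary as `headers=` to `requests.get`.

```python
import urllib.request

# Example browser-like value; any current browser's User-Agent string could be used.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36"
)


def make_request(url):
    """Build (but do not send) a request with a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


request = make_request("https://www.google.com/search?hl=en&q=python")
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))
# To actually fetch the page: urllib.request.urlopen(request).read()
```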

    【Comments】:

      【Answer 3】:

      The code below does real pagination by following the "Next" button link.

      from bs4 import BeautifulSoup
      import requests, urllib.parse
      # The 'lxml' parser only needs the lxml package installed; no import is required.
      
      def print_extracted_data_from_url(url):
      
          headers = {
              "User-Agent":
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
          }
          response = requests.get(url, headers=headers).text
      
          soup = BeautifulSoup(response, 'lxml')
      
          print(f'Current page: {int(soup.select_one(".YyVfkd").text)}')
          print(f'Current URL: {url}')
          print()
      
          for container in soup.find_all('div', class_='tF2Cxc'):
              head_text = container.find('h3', class_='LC20lb DKV0Md').text
              head_sum = container.find('div', class_='IsZvec').text
              head_link = container.a['href']
              print(head_text)
              print(head_sum)
              print(head_link)
              print()
      
          return soup.select_one('a#pnnext')
      
      
      def scrape():
          next_page_node = print_extracted_data_from_url(
              'https://www.google.com/search?hl=en-US&q=coca cola')
      
          while next_page_node is not None:
              next_page_url = urllib.parse.urljoin('https://www.google.com', next_page_node['href'])
      
              next_page_node = print_extracted_data_from_url(next_page_url)
      
      scrape()
      

      Partial output:

      Results via beautifulsoup
      
      Current page: 1
      Current URL: https://www.google.com/search?hl=en-US&q=coca cola
      
      The Coca-Cola Company: Refresh the World. Make a Difference
      We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.‎Careers · ‎Contact Us · ‎Jobs at Coca-Cola · ‎Our Company
      https://www.coca-colacompany.com/home
      
      Coca-Cola
      2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
      https://www.coca-cola.com/
      

      Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free plan.

      Code to integrate:

      import os
      from serpapi import GoogleSearch
      
      def scrape():
        
        params = {
          "engine": "google",
          "q": "coca cola",
          "api_key": os.getenv("API_KEY"),
        }
      
        search = GoogleSearch(params)
        results = search.get_dict()
      
        print(f"Current page: {results['serpapi_pagination']['current']}")
      
        for result in results["organic_results"]:
            print(f"Title: {result['title']}\nLink: {result['link']}\n")
      
        while 'next' in results['serpapi_pagination']:
            search.params_dict["start"] = results['serpapi_pagination']['current'] * 10
            results = search.get_dict()
      
            print(f"Current page: {results['serpapi_pagination']['current']}")
      
            for result in results["organic_results"]:
                print(f"Title: {result['title']}\nLink: {result['link']}\n")
      
      scrape()
      

      Partial output:

      Results from SerpApi
      
      Current page: 1
      Current URL: https://www.google.com/search?hl=en-US&q=coca cola
      
      The Coca-Cola Company: Refresh the World. Make a Difference
      We are here to refresh the world and make a difference. Learn more about the Coca-Cola Company, our brands, and how we strive to do business the right way.‎Careers · ‎Contact Us · ‎Jobs at Coca-Cola · ‎Our Company
      https://www.coca-colacompany.com/home
      
      Coca-Cola
      2021 The Coca-Cola Company, all rights reserved. COCA-COLA®, "TASTE THE FEELING", and the Contour Bottle are trademarks of The Coca-Cola Company.
      https://www.coca-cola.com/
      

      Disclaimer: I work for SerpApi.

      【Comments】:
