Failing to scrape a dynamic webpage using Selenium in Python

【Question title】: Failing to scrape dynamic webpage using selenium in python
【Posted】: 2021-02-18 09:23:27
【Question description】:

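I am trying to scrape all 5,000 companies from this page. The page is dynamic: more companies are loaded as I scroll down, and the URL changes while I scroll. However, I can only scrape 5 companies, so how can I scrape all 5,000? I tried Selenium, but it did not work. https://www.inc.com/profile/onetrust Note: I want to scrape all the information about each company, but for now I have selected only two fields.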

import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

options = Options()
driver = webdriver.Chrome(chrome_options=options)
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()

uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

Updated the code, but the page does not scroll at all. I also corrected some errors in the BeautifulSoup code.

import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'

driver = webdriver.Chrome()
driver.get(my_url)


def scroll_down(self):
    """A method for scrolling the page."""

    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")

    while True:

        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait to load the page.
        time.sleep(2)

        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")

        if new_height == last_height:

            break

        last_height = new_height


page_soup = soup(driver.page_source, "html.parser")

containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()


    print("rank :" + rank)
    print("Company_name :" + Company_name)

Thanks for reading!

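【Question comments】:

You can scroll to the end of the page, for example as shown here: stackoverflow.com/a/48851166/2776376. Alternatively, you can use the API of the page you are trying to scrape, e.g. inc.com/rest/companyprofile/leadcrunch/withlist
Thanks, I will try both. May I ask how you found the API for that page?
When you open the page in your browser, you can inspect the network calls it makes in the Network section of the developer tools.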
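
The scrolling approach from the first comment can be sketched as below. In the question's updated code, `scroll_down` is declared with a `self` parameter and is never called, which would explain why the page does not scroll. This version is a plain function that takes the driver as an argument. The name `scroll_to_bottom` and the `pause` and `max_rounds` defaults are illustrative assumptions, and whether scrolling the window actually triggers loading on this particular site is also an assumption.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is any object with an `execute_script` method, such as a
    Selenium WebDriver. `pause` and `max_rounds` are illustrative defaults.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the document.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to arrive
        rounds += 1
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so stop
        last_height = new_height
    return rounds


# Usage with Selenium (not executed here):
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://www.inc.com/profile/onetrust")
#   scroll_to_bottom(driver)       # the call that the question's code omits
#   html = driver.page_source      # parse this rendered HTML
#   driver.quit()
```

The `max_rounds` cap is there so the loop always terminates, even on a page that keeps growing.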

Tags: python selenium web-scraping

【Solution 1】:

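Try the approach below, which uses Python's requests library. It is simple, straightforward, reliable, and fast, and it needs less code to handle the requests. I got the API URL from the website itself by inspecting the Network section of Google Chrome's developer tools.

What exactly the script below does:

First, it takes the API URL and performs a GET request.
After fetching the data, it parses the JSON with json.loads.
Finally, it iterates over the list of companies and prints details for each one, for example rank, company name, social account links, and CEO name.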

import json
import requests
from urllib3.exceptions import InsecureRequestWarning

# verify=False below skips TLS certificate checks; silence the resulting warning.
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)


def scrap_inc_5000():
    URL = 'https://www.inc.com/rest/companyprofile/nuleaf-naturals/withlist'

    response = requests.get(URL, verify=False)
    result = json.loads(response.text)  # Parse result using JSON loads
    extracted_data = result['fullList']['listCompanies']
    for data in extracted_data:
        print('-' * 100)
        print('Rank : ', data['rank'])
        print('Company : ', data['company'])
        print('Icon : ', data['icon'])
        print('CEO Name : ', data['ifc_ceo_name'])
        print('Facebook Address : ', data['ifc_facebook_address'])
        print('File Location : ', data['ifc_filelocation'])
        print('Linkedin Address : ', data['ifc_linkedin_address'])
        print('Twitter Handle : ', data['ifc_twitter_handle'])
        print('Secondary Link : ', data['secondary_link'])
        print('-' * 100)


scrap_inc_5000()
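
If the rows are needed as data rather than printed output, the same idea can be split into a parsing helper and a fetch step. This is a minimal sketch, not the answer author's code: the function names `extract_companies` and `fetch_companies`, the `fields` default, and the `timeout` value are assumptions, while the `fullList`, `listCompanies`, `rank`, `company`, and `ifc_ceo_name` keys are taken from the script above. Unlike that script, this sketch leaves TLS certificate verification on.

```python
def extract_companies(payload, fields=("rank", "company", "ifc_ceo_name")):
    """Return one dict per company, with None for any field that is missing."""
    companies = payload.get("fullList", {}).get("listCompanies", [])
    return [{name: item.get(name) for name in fields} for item in companies]


def fetch_companies(url, timeout=30):
    """Download the JSON payload from `url` and extract the company rows."""
    import requests  # imported lazily so extract_companies works without it

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on HTTP errors
    return extract_companies(response.json())
```

Using `item.get(name)` instead of `item[name]` means a record that lacks a key yields `None` rather than raising `KeyError`.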

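【Discussion】:

Thank you very much. It works! I am after the companies' websites, though, and I see that the API JSON has no data for them, which is strange. Do you know why the data is on the webpage but not in the API response?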
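
The thread does not say why that field is empty. One way to check which fields the payload actually populates is to count non-empty values per key across the records. This is a hypothetical helper: the name `field_coverage` and the definition of "empty" used here are assumptions.

```python
from collections import Counter


def field_coverage(records):
    """Count, per key, how many records hold a non-empty value."""
    counts = Counter()
    for record in records:
        for key, value in record.items():
            populated = value not in (None, "", [], {})
            counts[key] += int(populated)  # adds 0 so empty keys still appear
    return dict(counts)
```

Called on the `listCompanies` list from the response, it returns a dict mapping each key to the number of records where that key has a value, so a key that is always empty shows up with a count of 0.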