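[Posted]: 2021-02-18 09:23:27
[Question]:
I am trying to scrape all 5,000 companies from this page. It is a dynamic page, and more companies are loaded as I scroll down. However, I can only scrape 5 companies, so how can I scrape all 5,000? The URL changes as I scroll down the page. I tried Selenium, but it did not work. https://www.inc.com/profile/onetrust Note: I want to scrape all of the information for each company, but for now I have selected two fields.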
import time
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

my_url = 'https://www.inc.com/profile/onetrust'

# Load the page in Chrome through Selenium and keep its rendered source.
options = Options()
driver = webdriver.Chrome(options=options)
driver.get(my_url)
time.sleep(3)
page = driver.page_source
driver.quit()

# Fetch the same URL again with urllib; this is the HTML that gets parsed below.
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()

page_soup = soup(page_html, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

# Print the rank and the name of every company container that was found.
for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)
I updated the code, but the page does not scroll at all. I also corrected some errors in the BeautifulSoup code.
import time
from bs4 import BeautifulSoup as soup
from selenium import webdriver

my_url = 'https://www.inc.com/profile/onetrust'
driver = webdriver.Chrome()
driver.get(my_url)

def scroll_down(self):
    """A method for scrolling the page."""
    # Get scroll height.
    last_height = self.driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom.
        self.driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait to load the page.
        time.sleep(2)
        # Calculate new scroll height and compare with last scroll height.
        new_height = self.driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height

page_soup = soup(driver.page_source, "html.parser")
containers = page_soup.find_all("div", class_="sc-prOVx cTseUq company-profile")
container = containers[0]

for container in containers:
    rank = container.h2.get_text()
    company_name_1 = container.find_all("h2", class_="sc-AxgMl LXebc h2")
    Company_name = company_name_1[0].get_text()
    print("rank :" + rank)
    print("Company_name :" + Company_name)
Thanks for reading!
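One likely reason the page never scrolls in the updated code is that scroll_down is only defined, never called, and it expects a self argument even though it is not a method of any class. Below is a minimal sketch of a standalone version that takes the driver as a parameter. The name scroll_to_bottom and the pause and max_rounds parameters are hypothetical, introduced here for illustration. It also assumes that loading more companies makes document.body.scrollHeight grow, which may not hold for this particular page.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=500):
    """Scroll until the page height stops growing; return the number of scrolls.

    `driver` is any object with an execute_script(script) method,
    such as a Selenium WebDriver. `max_rounds` (a hypothetical safety
    limit) stops the loop on pages that never stop growing.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        # Jump to the current bottom of the page.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        # Give the page time to load more content.
        time.sleep(pause)
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so assume the end was reached.
            break
        last_height = new_height
    return rounds
```

With the driver from the question, the call would go right after driver.get(my_url) and before driver.page_source is read, for example scroll_to_bottom(driver).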
[Comments]:
-
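You can scroll to the end of the page, for example as shown here: stackoverflow.com/a/48851166/2776376. Alternatively, you can use the API of the page you are trying to scrape, e.g. inc.com/rest/companyprofile/leadcrunch/withlist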
-
Thanks, I will try both. May I ask how you found the API for that page?
-
Open the page in your browser and inspect the network calls it makes in the Network section of the developer tools.
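The API suggestion from the comments can be sketched as follows. The endpoint path is the one quoted above. The https://www. prefix, the User-Agent header, and the helper names fetch_json and describe are assumptions made for illustration. The shape of the response is not documented in this thread, so the sketch only reports the top-level structure of whatever JSON comes back rather than guessing at field names.

```python
import json
from urllib.request import Request, urlopen

# Path taken from the comment; scheme and "www." host prefix are assumed.
API_URL = "https://www.inc.com/rest/companyprofile/leadcrunch/withlist"


def fetch_json(url, timeout=30):
    """Download a URL and decode the response body as JSON."""
    # Some sites reject the default urllib User-Agent; this header is an assumption.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def describe(payload):
    """Return a short description of the top-level shape of decoded JSON."""
    if isinstance(payload, dict):
        return "object with keys: " + ", ".join(sorted(payload))
    if isinstance(payload, list):
        return "array with %d items" % len(payload)
    return type(payload).__name__
```

Calling describe(fetch_json(API_URL)) is a quick way to see which keys the endpoint returns before writing any extraction code against it.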
Tags: python selenium web-scraping