Scraping pagination with the same pagination links [closed]

【Question title】: Scraping pagination with the same pagination links [closed]
【Posted】: 2020-01-09 05:30:05
【Question】:

I am trying to scrape stock information from this link: https://www.affarsvarlden.se/bors/kurslistor/stockholm-large/kurs/

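It works for the first 100 rows with requests in Python, but the remaining rows are hidden behind a pagination element. The question is how I can get those. The difficulty is that the link to the second page (which contains the remaining rows) is the same as the link to the first page, and when I watch the Network tab in the developer tools while switching between the two, I don't see any requests. Is there a way to do this with the requests module, or do I need to use something like selenium? I couldn't get the latter to work either.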

I would greatly appreciate any input.

Tags: python web-scraping pagination

【Solution 1】:

As far as I can tell, all of the data is already loaded into the page when it is requested. So you can try this:

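from bs4 import BeautifulSoup
import pandas as pd
import requests
import json

url = 'https://www.affarsvarlden.se/bors/kurslistor/stockholm-large/kurs/'
resp = requests.get(url)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')

df = None
for tag in soup.find_all('script'):
    # .string is the raw script body (None for empty or external scripts)
    content = tag.string or ''

    # skip every script except the one holding the embedded page state
    if '__INITIAL_STATE__' not in content:
        continue

    # the JSON payload starts at the first '{'
    index = content.find('{')
    data = json.loads(content[index:])
    # pd.json_normalize is the public API since pandas 1.0
    df = pd.json_normalize(data['stocklist']['stockholm-large/kurs/'], 'info')
    break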
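
As a dependency-free variant of the same idea, the sketch below pulls an embedded JSON state object out of an inline script using only the standard library. The `ScriptCollector` class, the `extract_state` helper, and the `SAMPLE_HTML` page with its `demo/list/` key are all made up for illustration; the real page's payload may be shaped differently.

```python
import json
from html.parser import HTMLParser


class ScriptCollector(HTMLParser):
    """Collect the raw text of every <script> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script bodies can arrive in several chunks, so append to the last entry.
        if self._in_script:
            self.scripts[-1] += data


def extract_state(html, marker="__INITIAL_STATE__"):
    """Return the JSON object assigned to `marker` in an inline script, or None."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    decoder = json.JSONDecoder()
    for script in collector.scripts:
        pos = script.find(marker)
        if pos == -1:
            continue
        start = script.find("{", pos)
        if start == -1:
            continue
        # raw_decode parses one JSON value and ignores whatever follows it.
        state, _end = decoder.raw_decode(script, start)
        return state
    return None


# Made-up page for illustration; the key names and row shape are assumptions.
SAMPLE_HTML = """
<html><body>
<table><tr><td>AAA</td></tr></table>
<script>var unrelated = 1;</script>
<script>window.__INITIAL_STATE__ = {"stocklist": {"demo/list/": {"info": [
  {"name": "AAA", "last": 10.5},
  {"name": "BBB", "last": 7.25},
  {"name": "CCC", "last": 3.0}
]}}};</script>
</body></html>
"""

state = extract_state(SAMPLE_HTML)
rows = state["stocklist"]["demo/list/"]["info"]
print([row["name"] for row in rows])
```

Because every row is already in the embedded object, there is no second request to make for page two. `raw_decode` is used here instead of `json.loads` so that text after the object, such as a trailing semicolon, does not break parsing.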

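【Comments】:

Thanks! This works perfectly (except that I had to move the if statement to after the content assignment).
Oh, sorry about that; I've edited it. Thanks. :)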

【Solution 2】:

You can do this with Selenium. The script below opens the web page and moves to the next page.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()

# navigate to the webpage
driver.get('https://www.affarsvarlden.se/bors/kurslistor/stockholm-large/kurs/')

# XPath of the "next page" button
# (find_element_by_xpath was removed in Selenium 4; find_element(By.XPATH, ...) replaces it)
next_button = driver.find_element(By.XPATH, '//*[@id="canvas"]/div[2]/div/div[2]/div/div/div[3]/div[2]/div/div/div[2]/ul/li[4]/a')

# Clicking the button throws an error the first time, so retry once
try:
    next_button.click()
except Exception:
    next_button.click()

Edit: you will need chromedriver.exe in your working directory to use the webdriver.
