【Question Title】:BeautifulSoup: Web scraping information after the submit button is clicked
【Posted】:2022-01-17 14:27:15
【Question Description】:

I'm relatively new to Python, and I'm currently trying to extract data from a website, but the information is only displayed after a submit button is clicked. The page is https://www.ccq.org/fr-CA/qualification-acces-industrie/bassins-main-oeuvre/etat-bassins-main-oeuvre

The button I have to click: button

When I inspect the site, I'm able to retrieve the URL of the information that gets included/displayed after the button click (via the Network tab in the browser's dev tools). Here is a preview of the information output returned by the button's URL: info output

What I'd like to know is whether it's possible to keep the information grouped by DIV elements, the way it is when I click the button on the site... Thanks!

Code:

import requests
from bs4 import BeautifulSoup
import re

URL = "https://www.ccq.org/fr-CA/qualification-acces-industrie/bassins-main-oeuvre/etat-bassins-main-oeuvre"
page = requests.get(URL)
soup = BeautifulSoup(page.content,features="html.parser")

btn4 = soup.find('button',{"id":"get-labourpools"})
btn4_click = btn4['onclick']

【Question Discussion】:

  • If you have to click a button, then BeautifulSoup is a poor choice. You need to look into selenium

Tags: python html beautifulsoup


【Solution 1】:

You can query an API endpoint to get the table data you're after.

Here's how:

import json

import requests

region_id = "01"
occupation_id = "110"
url = f"https://www.ccq.org/api/labourpools?regionId={region_id}&occupationId={occupation_id}"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:95.0) Gecko/20100101 Firefox/95.0",
    "X-Requested-With": "XMLHttpRequest",
}
data = requests.get(url, headers=headers).json()
print(json.dumps(data, indent=2))

Output:

[
  {
    "Id": "01",
    "Name": "Iles de la Madeleine",
    "Occupations": [
      {
        "Id": "110",
        "Name": "Briqueteur-ma\u00e7on",
        "Pool": {
          "IsOpen": true,
          "IsLessThan10": true,
          "IsLessThan30": true
        }
      }
    ],
    "EffectiveDate": "17 janvier 2022"
  }
]
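If the goal is to keep the information grouped the way the page presents it (region → occupation → pool status), the nested JSON above can be flattened into readable rows. A minimal sketch, using a sample entry copied from the response shown above (the `pool_rows` helper is my own, not part of the site's API):

```python
# Sample entry mirroring the API response shown above
sample = [
    {
        "Id": "01",
        "Name": "Iles de la Madeleine",
        "Occupations": [
            {
                "Id": "110",
                "Name": "Briqueteur-maçon",
                "Pool": {"IsOpen": True, "IsLessThan10": True, "IsLessThan30": True},
            }
        ],
        "EffectiveDate": "17 janvier 2022",
    }
]


def pool_rows(regions):
    """Flatten the nested response into (region, occupation, status) tuples."""
    rows = []
    for region in regions:
        for occupation in region["Occupations"]:
            status = "open" if occupation["Pool"]["IsOpen"] else "closed"
            rows.append((region["Name"], occupation["Name"], status))
    return rows


for region_name, occupation_name, status in pool_rows(sample):
    print(f"{region_name} | {occupation_name} | {status}")
```

The same helper works unchanged on the full list returned by the endpoint, since every region entry carries its own `Occupations` list.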

EDIT:

If you want to get all the tables for every region and occupation, you can build every possible API request URL and fetch the data.

Here's how:

import json

import requests
from bs4 import BeautifulSoup

base_url = "https://www.ccq.org/fr-CA/qualification-acces-industrie/bassins-main-oeuvre/etat-bassins-main-oeuvre"
api_url = "https://www.ccq.org/api/labourpools?"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:95.0) Gecko/20100101 Firefox/95.0",
    "X-Requested-With": "XMLHttpRequest",
}


def get_ids(id_value: str) -> list:
    return [
        i["value"] for i
        in soup.find("select", {"id": id_value}).find_all("option")[1:]
    ]


with requests.Session() as session:
    soup = BeautifulSoup(session.get(base_url, headers=headers).text, "lxml")
    region_ids = get_ids("dropdown-region")
    occupation_ids = get_ids("dropdown-occupation")

    all_query_urls = [
        f"{api_url}regionId={region_id}&occupationId={occupation_id}"
        for region_id in region_ids for occupation_id in occupation_ids
    ]

    for query_url in all_query_urls[:2]:  # remove [:2] to get all combinations
        data = session.get(query_url, headers=headers).json()
        print(json.dumps(data, indent=2))

This should output two entries:

[
  {
    "Id": "01",
    "Name": "Iles de la Madeleine",
    "Occupations": [
      {
        "Id": "110",
        "Name": "Briqueteur-ma\u00e7on",
        "Pool": {
          "IsOpen": true,
          "IsLessThan10": true,
          "IsLessThan30": true
        }
      }
    ],
    "EffectiveDate": "17 janvier 2022"
  }
]
[
  {
    "Id": "01",
    "Name": "Iles de la Madeleine",
    "Occupations": [
      {
        "Id": "130",
        "Name": "Calorifugeur",
        "Pool": {
          "IsOpen": true,
          "IsLessThan10": true,
          "IsLessThan30": true
        }
      }
    ],
    "EffectiveDate": "17 janvier 2022"
  }
]
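Once all the combinations have been fetched, you will probably want to persist the results rather than just print them. A sketch, using only the stdlib `csv` module; the column names and the two sample rows (taken from the two entries above) are my own choices, not anything defined by the site:

```python
import csv
import io

# Two rows mirroring the two sample entries printed above
rows = [
    ("01", "Iles de la Madeleine", "110", "Briqueteur-maçon", True),
    ("01", "Iles de la Madeleine", "130", "Calorifugeur", True),
]

# Write to an in-memory buffer here; swap in open("pools.csv", "w", newline="")
# to write an actual file.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["region_id", "region", "occupation_id", "occupation", "is_open"])
writer.writerows(rows)
print(buf.getvalue())
```

In the loop from the answer, you would append one such tuple per occupation of each returned region instead of hard-coding the rows.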

【Discussion】:

  • Thanks!!!!!!!!!
  • Some things are best left to an http library. Nice answer!