【Question Title】: How can I scrape career path job titles from this JavaScript page using Python?
【Posted】: 2024-04-24 17:45:01
【Question】:

How can I get the career path job titles from this JavaScript-rendered page using Python?

'https://www.dice.com/career-paths?title=PHP%2BDeveloper&location=San%2BDiego,%2BCalifornia,%2BUs,%2BCA&experience=0&sortBy=mostProbableTransition'

Here is my code snippet; the soup it returns doesn't contain any of the text data I need!

import requests
from bs4 import BeautifulSoup
import json
import re
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


# get BeautifulSoup object
def get_soup(url):
    """
    This function returns the BeautifulSoup object.

    Parameters:
        url: the link to get soup object for

    Returns:
        soup: BeautifulSoup object
    """
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup

# get selenium driver object
def get_selenium_driver():
    """
    This function returns the selenium driver object.

    Parameters:
        None

    Returns:
        driver: selenium driver object
    """
    options = webdriver.FirefoxOptions()
    options.add_argument('-headless')

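    # Selenium 4 removed the executable_path and firefox_options arguments;
    # pass options= instead (geckodriver must be discoverable, e.g. on PATH)
    driver = webdriver.Firefox(options=options)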

    return driver

# get soup obj using selenium
def get_soup_using_selenium(url):
    """
    Given the url of a page, this function returns the soup object.

    Parameters:
        url: the link to get soup object for

    Returns:
        soup: soup object
    """
    driver = get_selenium_driver()
    try:
        driver.get(url)
        # note: implicitly_wait only affects element lookups, not page_source
        driver.implicitly_wait(3)

        html = driver.page_source
        soup = BeautifulSoup(html, 'html.parser')
    finally:
        # quit() ends the whole browser session; close() only closes the window
        driver.quit()

    return soup




title = "PHP%2BDeveloper"
location = "San%2BDiego,%2BCalifornia,%2BUs,%2BCA"
years_of_experience = "0"
sort_by_filter = "mostProbableTransition"

url = "https://www.dice.com/career-paths?title={}&location={}&experience={}&sortBy={}".format(title, location, years_of_experience, sort_by_filter)
career_paths_page_soup = get_soup(url)

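【Comments】:

  • Post your code. What research have you done so far?
  • Please remember that we are not here to do the work for you; we help when you get stuck. So you should at least study and try first, and if that fails, we will help.
  • Thank you all, and I'm very sorry! Please check the code snippet!
  • The page is rendered by JavaScript, so requests won't help you in this case. However, since you have already written the Selenium code, you can call that function, career_paths_page_soup = get_soup_using_selenium(url). Please also mention which values you expect to get back from the page.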
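
To illustrate the point in the last comment, here is a minimal sketch using only the standard library. The HTML string is a made-up stand-in for what a JavaScript-rendered page might return to a plain HTTP client; it is not the real dice.com markup, and the class name `career-paths` and the file name `app.js` are hypothetical. The list container is present but empty, because the items would only be added once a browser runs the script.

```python
from html.parser import HTMLParser

# Hypothetical static response: the <ul> stays empty until a browser runs app.js.
STATIC_HTML = """
<html>
  <body>
    <div class="career-paths"><ul></ul></div>
    <script src="app.js"></script>
  </body>
</html>
"""


class ListItemCounter(HTMLParser):
    """Counts <li> start tags seen in the parsed markup."""

    def __init__(self):
        super().__init__()
        self.li_count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_count += 1


def count_list_items(html):
    """Return the number of <li> elements in an HTML string."""
    parser = ListItemCounter()
    parser.feed(html)
    parser.close()
    return parser.li_count


# The static markup has no <li> elements, so there is nothing to scrape yet.
print(count_list_items(STATIC_HTML))
```

A parser such as BeautifulSoup sees the same thing when it is fed `requests.get(url).text`: it only receives the markup as served, before any script has run.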

Tags: python selenium web-scraping beautifulsoup python-requests


【Solution 1】:

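As another user mentioned in the comments, requests won't work for you here. With Selenium, however, you can use WebDriverWait to make sure the page content has loaded before you scrape it, and use element.text to get the text of an element.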

The following code snippet prints the career path strings on the left side of the page:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# navigate to the page
# (get_selenium_driver() and url are defined in the question's code)
driver = get_selenium_driver()
driver.get(url)

# wait for loading indicator to be hidden
WebDriverWait(driver, 10).until(EC.invisibility_of_element((By.XPATH, "//*[contains(text(), 'Loading data')]")))

# wait for content to load
career_path_elements = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='abcd']/ul/li")))

# print out career paths
for element in career_path_elements:

    # get title attribute that usually contains career path text
    title = element.get_attribute("title")

    # sometimes career path is in span below this element
    if not title:

        # find the span and print its text
        # (find_element_by_xpath was removed in Selenium 4; use find_element with By.XPATH)
        span_element = element.find_element(By.XPATH, "span[not(contains(@class, 'currentJobHead'))]")
        print(span_element.text)

    # print title in other cases
    else:
        print(title)

This prints the following:

PHP Developer
Drupal Developer
Web Developer
Full Stack Developer
Back-End Developer
Full Stack PHP Developer
IT Director
Software Development Manager

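There are a few interesting things here. The main one is the JavaScript loading on the page: when the page is first opened, a "Loading data..." indicator appears. Before trying to locate any page content, we have to wait on EC.invisibility_of_element to make sure that indicator has disappeared.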
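
Conceptually, an explicit wait like this is a polling loop: check a condition, sleep briefly, and give up after a timeout. The sketch below shows that idea with the standard library only. It is not Selenium's actual implementation, and `wait_until` and `FakePage` are hypothetical names used purely for illustration. In Selenium itself, WebDriverWait polls every 0.5 seconds by default and raises TimeoutException when the timeout expires.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within {} seconds".format(timeout))
        time.sleep(poll_interval)


class FakePage:
    """Stand-in for a page whose loading indicator disappears after a few checks."""

    def __init__(self, checks_until_loaded):
        self.remaining = checks_until_loaded

    def loading_indicator_gone(self):
        self.remaining -= 1
        return self.remaining <= 0


page = FakePage(checks_until_loaded=3)
wait_until(page.loading_indicator_gone, timeout=2.0, poll_interval=0.01)
print("content is ready")
```

This is why the wait on the loading indicator comes first: until that condition holds, any attempt to locate the career path elements may run against a page that has not been filled in yet.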

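After that, we call WebDriverWait again, this time on the "career path" elements on the right side of the page. This WebDriverWait call returns a list of elements, which is stored in career_path_elements. We can iterate over this list to print the career path of each item.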

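Each career path element carries its career path text in the title attribute, so we call element.get_attribute("title") to get that text. There is one special case, however: for the "current job" item, the career path text sits in a span one level down. We handle the case where title is empty by locating that span with an XPath lookup relative to the element. This ensures we can print every career path item on the page.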
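
The title-or-span fallback can also be pulled into a small helper that returns the text instead of printing it. The sketch below is one possible shape, not part of the original answer: `career_path_text`, `FakeSpan`, and `FakeElement` are hypothetical names, and the fake classes only imitate the two WebElement methods the helper uses so that the logic can be tried without a browser.

```python
def career_path_text(element):
    """Return the title attribute, or the fallback span's text if title is empty."""
    title = element.get_attribute("title")
    if title:
        return title
    # "xpath" is the string value of Selenium's By.XPATH constant
    span = element.find_element("xpath", "span[not(contains(@class, 'currentJobHead'))]")
    return span.text


class FakeSpan:
    """Hypothetical stand-in for the nested span element."""

    def __init__(self, text):
        self.text = text


class FakeElement:
    """Hypothetical stand-in exposing the two WebElement methods used above."""

    def __init__(self, title, span_text=""):
        self._title = title
        self._span = FakeSpan(span_text)

    def get_attribute(self, name):
        return self._title if name == "title" else None

    def find_element(self, by, value):
        return self._span


elements = [FakeElement("", "PHP Developer"), FakeElement("Drupal Developer")]
print([career_path_text(e) for e in elements])
```

With a real driver, `[career_path_text(e) for e in career_path_elements]` would collect the job titles into a list for further processing.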

【Comments】:

  • Thank you very much, this is exactly what I needed!