【Question Title】: Scraping multiple posts from a web page
【Posted】: 2021-08-14 23:32:19
【Question Description】:

I am trying to scrape all of the jobs on the page, but without success. I have tried different approaches and none of them worked. After opening and scraping the first job, the script crashes. I don't know what I should do next to move on to the next job. Can anyone help me get this working? Thanks in advance. I had to shorten the code because the site would not let me post all of it (too much code).

# Part 1
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

df = pd.DataFrame(columns=["Title","Description",'Job-type','Skills'])

for i in range(25):
    driver.get('https://www.reed.co.uk/jobs/care-jobs?pageno='+ str(i))
    jobs = []
    driver.implicitly_wait(20)

    for job in driver.find_elements_by_xpath('//*[@id="content"]/div[1]/div[3]'):

        soup = BeautifulSoup(job.get_attribute('innerHTML'),'html.parser')
        element = WebDriverWait(driver, 50).until(
                     EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler")))
        element.click()
        try:
            title = soup.find("h3",class_="title").text.replace("\n","").strip()
            print(title)
        except:
            title = 'None'

        sum_div = job.find_element_by_css_selector('#jobSection42826858 > div.row > div > header > h3 > a')

        sum_div.click()
    
        driver.implicitly_wait(2)
        try:            
            job_desc = driver.find_element_by_css_selector('#content > div > div.col-xs-12.col-sm-12.col-md-12 > article > div > div.branded-job-details--container > div.branded-job--content > div.branded-job--description-container > div').text
            #print(job_desc)
        except:
            job_desc = 'None'  

        try:
            job_type = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[2]/div/div/div[3]/div[3]/span').text
            #print(job_type)
        except:
            job_type = 'None' 

        try:
            job_skills = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[6]/div[2]/ul').text
            #print(job_skills)
        except:
            job_skills = 'None'
        driver.back()
        driver.implicitly_wait(2)   

        df = df.append({'Title':title,"Description":job_desc,'Job-type':job_type,'Skills':job_skills},ignore_index=True)

df.to_csv(r"C:\Users\Desktop\Python\newreed.csv",index=False)            

【Question Comments】:

  • Why `driver.back()`? Is it really needed? At first glance it seems redundant. Do you have any debug information?
  • I only added `driver.back()` to get back to the main page; the problem is the same with or without it.

标签: javascript python selenium web-scraping beautifulsoup


【Solution 1】:

You should avoid using Selenium for this (it was not originally designed for web scraping). Instead, inspect the site with F12 -> Network -> the HTML or XHR tabs.

Here is my code:

import requests as rq
from bs4 import BeautifulSoup as bs

def processPageData(soup):
    articles = soup.find_all("article")
    resultats = {}
    for article in articles:
        # article ids look like "jobSection42826858"; strip the prefix
        # to keep just the numeric job id as the dictionary key
        job_id = article["id"][10:]
        entry = {}

        res1 = article.find("div", {"class": "metadata"})
        location = res1.find("li", {"class": "location"}).text.strip().split('\n')
        entry['location'] = list(map(str.strip, location))
        entry['salary'] = res1.find("li", {"class": "salary"}).text

        entry['description'] = article.find("div", {"class": "description"}).find("p").text
        entry['posted_by'] = article.find("div", {"class": "posted-by"}).text.strip()

        resultats[job_id] = entry

    return resultats

Iterating with the function above:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
           "Host": "www.reed.co.uk"}

resultats = {}
s = rq.session()  # one session reused across requests

for i in range(1, 10):
    url = "https://www.reed.co.uk/jobs/care-jobs?pageno=%d" % i
    resp = s.get(url, headers=headers)
    soup = bs(resp.text, "lxml")
    r = processPageData(soup)
    resultats.update(r)

This gives:

{'42826858': {'location': ['Horsham', 'West Sussex'],
  'salary': '£11.50 - £14.20 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

 '42827040': {'location': ['Redhill', 'Surrey'],
  'salary': '£11.00 - £13.00 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

....

Note 1: the keys of resultats are the job identifiers, which let you navigate to a job's page if you need more details.
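For instance, a detail-page URL could be built from such an identifier. The `/jobs/<id>` pattern below is an assumption for illustration; Reed detail URLs may also include a title slug, so verify the real pattern in the browser first:

```python
# Hypothetical helper: build a job-detail URL from a resultats key.
# The "/jobs/<id>" pattern is an assumption, not a confirmed Reed URL scheme.
def job_detail_url(job_id: str, base: str = "https://www.reed.co.uk") -> str:
    return "%s/jobs/%s" % (base, job_id)

print(job_detail_url("42826858"))  # https://www.reed.co.uk/jobs/42826858
```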

Note 2: I iterate over pages 1 to 10, but you can adapt the code to discover the maximum number of pages.
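One way to adapt it: job boards usually display a total result count somewhere on the page, from which the page count follows. This is a sketch under two assumptions that must be checked against the live page: the count appears as text like "1,234 jobs", and there are 25 results per page:

```python
import math
import re

# Sketch: derive the number of result pages from a "N jobs" count string.
# Both the count-text format and the 25-per-page figure are assumptions.
def max_pages(count_text: str, per_page: int = 25) -> int:
    match = re.search(r"[\d,]+", count_text)          # grab the first number, commas allowed
    total = int(match.group().replace(",", ""))       # "1,234" -> 1234
    return math.ceil(total / per_page)                # round up to a whole page

print(max_pages("1,234 jobs"))  # 50
```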

Note 3: (as general advice) try to understand the website's data model instead of brute-forcing it with Selenium used the wrong way.

Note 4: CSS selectors and XPath selectors are ugly; prefer cleaner selection by tag. (Personal opinion.)

【Comments】:

  • I forgot the title, but you can easily add it.
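Adding the title could look like the sketch below. The markup assumption (an `h3` with class `title` inside each `<article>`) comes from the question's own Selenium code, so verify it against the live page:

```python
from bs4 import BeautifulSoup

# Sketch: pull the job title out of an <article>, assuming the
# h3.title markup used in the question's code.
def extract_title(article) -> str:
    h3 = article.find("h3", {"class": "title"})
    return h3.text.strip() if h3 else "None"

# Minimal fixture mimicking the assumed markup:
html = '<article id="jobSection42826858"><h3 class="title"><a>Care Assistant</a></h3></article>'
article = BeautifulSoup(html, "html.parser").find("article")
print(extract_title(article))  # Care Assistant
```

Inside `processPageData`, the same call would fill a `'title'` field next to `'salary'` and `'location'`.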
【Solution 2】:

In my opinion, managing Chrome with Selenium is trickier than Firefox or Edge. If you don't specifically need Chrome, I would try the Firefox or Edge driver. I have had luck with Edge when Chrome gave me problems.

【Comments】:

  • I don't think the problem is with any particular driver; for me the issue is that I don't know how to make selenium scrape the next post.