【Question Title】: Scraping multiple posts from a web page
【Posted】: 2021-08-14 23:32:19
【Question Description】:

I am trying to scrape all of the jobs on the page, but without success. I have tried different approaches and none of them worked. After opening and scraping the first job, the script crashes. I don't know what I should do next to move on to the next job. Can anyone help me get this working? Thanks in advance. I had to shorten the code because the site would not let me post all of it (too much code).

# Part 1
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import pandas as pd

options = Options()
driver = webdriver.Chrome(ChromeDriverManager().install(), options=options)

df = pd.DataFrame(columns=["Title","Description",'Job-type','Skills'])

for i in range(25):
    driver.get('https://www.reed.co.uk/jobs/care-jobs?pageno='+ str(i))
    jobs = []
    driver.implicitly_wait(20)

    for job in driver.find_elements_by_xpath('//*[@id="content"]/div[1]/div[3]'):

        soup = BeautifulSoup(job.get_attribute('innerHTML'),'html.parser')
        element = WebDriverWait(driver, 50).until(
                     EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-accept-btn-handler")))
        element.click()
        try:
            title = soup.find("h3",class_="title").text.replace("\n","").strip()
            print(title)
        except:
            title = 'None'

        sum_div = job.find_element_by_css_selector('#jobSection42826858 > div.row > div > header > h3 > a')

        sum_div.click()
    
        driver.implicitly_wait(2)
        try:            
            job_desc = driver.find_element_by_css_selector('#content > div > div.col-xs-12.col-sm-12.col-md-12 > article > div > div.branded-job-details--container > div.branded-job--content > div.branded-job--description-container > div').text
            #print(job_desc)
        except:
            job_desc = 'None'  

        try:
            job_type = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[2]/div/div/div[3]/div[3]/span').text
            #print(job_type)
        except:
            job_type = 'None' 

        try:
            job_skills = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/article/div/div[2]/div[3]/div[6]/div[2]/ul').text
            #print(job_skills)
        except:
            job_skills = 'None'
        driver.back()
        driver.implicitly_wait(2)   

        df = df.append({'Title':title,"Description":job_desc,'Job-type':job_type,'Skills':job_skills},ignore_index=True)

df.to_csv(r"C:\Users\Desktop\Python\newreed.csv",index=False)            

【Question Comments】:

  • Why `driver.back()`? Is it really needed? At first glance it seems redundant. Do you have any debug information?
  • I only added `driver.back()` to get back to the main page; the problem is the same with or without it.

标签: javascript python selenium web-scraping beautifulsoup


【Solution 1】:

You should avoid using Selenium for this (it was not originally designed for web scraping). Instead, inspect the site with F12 -> Network -> the HTML or XHR tabs.

Here is my code:

import requests as rq
from bs4 import BeautifulSoup as bs

def processPageData(soup):
    articles = soup.find_all("article")
    resultats = {}
    for article in articles:
        # article ids look like "jobSection42826858"; strip the prefix
        # to keep just the numeric job id as the dictionary key
        job_id = article["id"][10:]
        entry = {}

        res1 = article.find("div", {"class": "metadata"})
        location = res1.find("li", {"class": "location"}).text.strip().split('\n')
        entry['location'] = list(map(str.strip, location))
        entry['salary'] = res1.find("li", {"class": "salary"}).text

        entry['description'] = article.find("div", {"class": "description"}).find("p").text
        entry['posted_by'] = article.find("div", {"class": "posted-by"}).text.strip()

        resultats[job_id] = entry

    return resultats

Iterating with the function above:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:90.0) Gecko/20100101 Firefox/90.0",
           "Host": "www.reed.co.uk"}

resultats = {}
s = rq.session()  # one session reused across requests

for i in range(1, 10):
    url = "https://www.reed.co.uk/jobs/care-jobs?pageno=%d" % i
    resp = s.get(url, headers=headers)
    soup = bs(resp.text, "lxml")
    r = processPageData(soup)
    resultats.update(r)

This gives:

{'42826858': {'location': ['Horsham', 'West Sussex'],
  'salary': '£11.50 - £14.20 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

 '42827040': {'location': ['Redhill', 'Surrey'],
  'salary': '£11.00 - £13.00 per hour',
  'description': 'Come and join the team as a Care Assistant and make the Alina Homecare difference. We are looking for kind and caring people who want to make a difference to the lives of others. If you have a caring attitude and willingness to make a difference, come...',
  'posted_by': 'Posted Today by Alina Homecare'},

....

Note 1: the keys of resultats are the job identifiers, which let you navigate to a job's page if you need more details.
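For instance, a detail-page URL could be built from such an identifier. The `/jobs/<id>` pattern below is an assumption for illustration; Reed detail URLs may also include a title slug, so verify the real pattern in the browser first:

```python
# Hypothetical helper: build a job-detail URL from a resultats key.
# The "/jobs/<id>" pattern is an assumption, not a confirmed Reed URL scheme.
def job_detail_url(job_id: str, base: str = "https://www.reed.co.uk") -> str:
    return "%s/jobs/%s" % (base, job_id)

print(job_detail_url("42826858"))  # https://www.reed.co.uk/jobs/42826858
```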

Note 2: I iterate over pages 1 to 10, but you can adapt the code to discover the maximum number of pages.
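One way to adapt it: job boards usually display a total result count somewhere on the page, from which the page count follows. This is a sketch under two assumptions that must be checked against the live page: the count appears as text like "1,234 jobs", and there are 25 results per page:

```python
import math
import re

# Sketch: derive the number of result pages from a "N jobs" count string.
# Both the count-text format and the 25-per-page figure are assumptions.
def max_pages(count_text: str, per_page: int = 25) -> int:
    match = re.search(r"[\d,]+", count_text)          # grab the first number, commas allowed
    total = int(match.group().replace(",", ""))       # "1,234" -> 1234
    return math.ceil(total / per_page)                # round up to a whole page

print(max_pages("1,234 jobs"))  # 50
```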

Note 3: (as general advice) try to understand the website's data model instead of brute-forcing it with Selenium used the wrong way.

Note 4: CSS selectors and XPath selectors are ugly; prefer cleaner selection by tag. (Personal opinion.)

【Comments】:

  • I forgot the title, but you can easily add it.
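Adding the title could look like the sketch below. The markup assumption (an `h3` with class `title` inside each `<article>`) comes from the question's own Selenium code, so verify it against the live page:

```python
from bs4 import BeautifulSoup

# Sketch: pull the job title out of an <article>, assuming the
# h3.title markup used in the question's code.
def extract_title(article) -> str:
    h3 = article.find("h3", {"class": "title"})
    return h3.text.strip() if h3 else "None"

# Minimal fixture mimicking the assumed markup:
html = '<article id="jobSection42826858"><h3 class="title"><a>Care Assistant</a></h3></article>'
article = BeautifulSoup(html, "html.parser").find("article")
print(extract_title(article))  # Care Assistant
```

Inside `processPageData`, the same call would fill a `'title'` field next to `'salary'` and `'location'`.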
【Solution 2】:

In my opinion, managing Chrome with Selenium is trickier than Firefox or Edge. If you don't specifically need Chrome, I would try the Firefox or Edge driver. I have had luck with Edge when Chrome gave me problems.

【Comments】:

  • I don't think the problem is with any particular driver; for me the issue is that I don't know how to make selenium scrape the next post.