【Title】: Indeed Job Scraper only for Postings with External Link
【Posted】: 2021-09-07 01:39:06
【Description】:

Currently using the Python scraper below to pull job title, company, salary, and description. I'm looking to take it a step further by filtering to only results where the application link is a company-website URL, rather than "Easily apply" postings that submit the application through Indeed. Is there a way to do this?
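One way to do this at the search-results level is to skip any card that carries Indeed's "Easily apply" label. A minimal sketch of that filter; the markup below is hand-written for illustration (the `slider_container` class comes from the scraper below, but the label markup is an assumption, not captured from Indeed):

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; the 'Easily apply' label structure is an
# assumption about how Indeed marks in-site applications.
sample_html = """
<div class="slider_container">
  <span title="API Developer">API Developer</span>
  <span class="iaLabel">Easily apply</span>
</div>
<div class="slider_container">
  <span title="Data Engineer">Data Engineer</span>
</div>
"""

def external_only(soup):
    """Return titles of cards that do NOT carry an 'Easily apply' label."""
    kept = []
    for card in soup.find_all('div', class_='slider_container'):
        if card.find('span', string='Easily apply'):
            continue  # application goes through Indeed itself -> skip
        kept.append(card.find('span', title=True)['title'])
    return kept

soup = BeautifulSoup(sample_html, 'html.parser')
print(external_only(soup))  # ['Data Engineer']
```

This only works if the label is present in the search-results markup; the accepted answer below takes the more robust route of visiting each job's detail page.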

import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    url = f'https://www.indeed.com/jobs?q=Software%20Engineer&l=Austin%2C%20TX&start={page}'
    r = requests.get(url, headers=headers) # r.status_code: 200 is OK, 404 is page not found
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

# <span title="API Developer"> API Developer </span>
def transform(soup):
    divs = soup.find_all('div', class_ = 'slider_container')
    for item in divs:
        # the job title span carries a title attribute, so a leading
        # 'new' label span no longer causes the job to be skipped
        title = item.find('span', title=True).text.strip()
        company = item.find('span', class_ = "companyName").text.strip()
        description = item.find('div', class_ = "job-snippet").text.strip().replace('\n', '')
        salary_tag = item.find('span', class_ = "salary-snippet")
        salary = salary_tag.text.strip() if salary_tag else ""
        
        job = {
                'title': title,
                'company': company,
                'salary': salary,
                'description': description
        }
        jobList.append(job)
#        print("Seeking a: "+title+" to join: "+company+" paying: "+salary+". Job description: "+description) 
    return

jobList = []

# go through multiple pages
for i in range(0, 100, 10): # pages 0-90, stepping in 10s
    print(f'Getting page {i}')
    c = extract(i)
    transform(c)

print(len(jobList))

df = pd.DataFrame(jobList)
print(df.head())
df.to_csv('jobs.csv')
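The pagination loop above works by advancing Indeed's `start` query parameter by 10 per page. A small sketch of building those URLs with the standard library instead of a hand-assembled string (the base path and parameter names are taken from the code above):

```python
from urllib.parse import urlencode

def build_url(query, location, start):
    """Build one paginated Indeed search URL; start advances 10 per page."""
    params = {'q': query, 'l': location, 'start': start}
    return 'https://www.indeed.com/jobs?' + urlencode(params)

urls = [build_url('Software Engineer', 'Austin, TX', s) for s in range(0, 30, 10)]
```

`urlencode` handles the percent-escaping that the original URL did by hand; when fetching many pages it is also polite to add a short `time.sleep` between requests.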

【Comments】:

  • Could you post an example link to a posting you'd like to filter out?
  • Sure, in the link below; it's a posting for a Data Scientist position that users can apply to directly on Indeed (information is filled in on Indeed, sent from Indeed, etc.). Ideally the scraper would filter out all such results and only scrape postings where the user has to go to the company's website to apply: indeed.com/…

Tags: python pandas web-scraping beautifulsoup python-requests


【Solution 1】:

Here's my approach:

Find the href from the <a> tag of each job card on the initial page, then send a request to each of those links and fetch the external job link from there (if the "Apply On Company Site" button is available).

Code snippet:

#function which gets external job links
def get_external_link(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    
    #if Apply On Company Site button is available, fetch the link
    external_job_link = soup.find('a', attrs={"referrerpolicy": "origin"})
    if external_job_link is not None:
        print(external_job_link['href'])

#add this piece of code to transform function
def transform(soup):
    cards=soup.find('div',class_='mosaic-provider-jobcards')
    links=cards.find_all("a", class_=lambda value: value and value.startswith("tapItem"))

    #for each job link in the page call get_external_links
    for link in links:
        get_external_link('https://www.indeed.com'+(link['href']))

Note: from the page source of that same detail-page request, you can also fetch the data you previously scraped from the main page, such as title, company, salary, and description.
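Following that note, the detail-page parse can produce the whole job record and do the external-link filtering in one pass. A sketch under hand-written markup; the `referrerpolicy="origin"` selector is from the answer's code, but the title/company class names below are assumptions and may differ on the live site:

```python
from bs4 import BeautifulSoup

# Hand-written sample of a job detail page; class names are assumptions.
detail_html = """
<div>
  <h1 class="jobsearch-JobInfoHeader-title">API Developer</h1>
  <div class="jobsearch-CompanyInfoContainer">Acme Corp</div>
  <a referrerpolicy="origin" href="https://careers.acme.example/api-dev">Apply On Company Site</a>
</div>
"""

def parse_detail(soup):
    """Return a job dict only when an external application link exists."""
    apply_link = soup.find('a', attrs={'referrerpolicy': 'origin'})
    if apply_link is None:
        return None  # 'Easily apply' posting: no external link, drop it
    return {
        'title': soup.find('h1').text.strip(),
        'company': soup.find('div', class_='jobsearch-CompanyInfoContainer').text.strip(),
        'external_url': apply_link['href'],
    }

job = parse_detail(BeautifulSoup(detail_html, 'html.parser'))
```

Jobs where `parse_detail` returns `None` are the through-Indeed postings the question wants to exclude; everything else can be appended to `jobList` as before.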

【Discussion】:
