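Selenium & Chrome in Practice: Dynamically Scraping 51job Job Listings

Selenium is a browser automation and testing tool that can simulate user actions such as typing, selecting, and submitting.

What the crawler does:

Enter python and select the locations Shanghai and Beijing ----> it scrapes python job listings for those 2 cities
Enter accounting (会计) and select the locations Guangzhou, Shenzhen, and Hangzhou ----> it scrapes accounting job listings for those 3 cities
The results are scraped dynamically according to the input

2. Page Analysis

Entering the keyword

How does Selenium simulate a user typing a keyword, choosing cities, and clicking the search button?

To simulate typing the keyword, first right-click the input box in Chrome and choose Inspect to view its markup.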

Use Selenium's find_element_by_id to locate the element with id = 'kwdselectid', then call send_keys('keyword') to simulate the user's input.

The code is:

textElement = browser.find_element_by_id('kwdselectid')
textElement.send_keys('python')

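Choosing cities

Simulating the city selection with Selenium (this is the hard part, and I hit plenty of pitfalls)

Clicking the city selector opens a pop-up dialog.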

Then select Beijing and Shanghai, right-click, choose Inspect, and look at the source.

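You can see that the value has changed to "北京+上海" (Beijing+Shanghai).

So could we use Selenium to find this tag and set its value attribute to "北京+上海" in order to select the cities?

Answer: no. After several attempts I found that what actually takes effect is the "010000,020000" below it. What is that? City codes. In other words, entering "北京+上海" really submits "010000,020000". Where do the city codes come from? They have to be scraped from 51job's city-selection pop-up, whose markup contains the code for each city.

Getting the city codes

getcity.py: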

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json


# Configure selenium to run Chrome in headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
# Pass the options when launching the browser
browser = webdriver.Chrome(options=chrome_options)
cookies = browser.get_cookies()
browser.delete_all_cookies()
browser.get('https://www.51job.com/')
browser.implicitly_wait(20)

# Find the city selector and simulate a click
button = browser.find_element_by_xpath("//div[@class='ush top_wrap']//div[@class='el on']/p\
[@class='addbut']//input[@id='work_position_input']").click()

# Select the city pop-up
browser.current_window_handle

# Start with an empty dictionary
dic = {}

# Find each city and its city code
find_city_elements = browser.find_elements_by_xpath("//div[@id='work_position_layer']//\
div[@id='work_position_click_center_right_list_000000']//tbody/tr/td")
for element in find_city_elements:
    number = element.find_element_by_xpath("./em").get_attribute("data-value")  # city code
    city = element.find_element_by_xpath("./em").text  # city name
    # Add the pair to the dictionary
    dic.setdefault(city, number)
print(dic)
# Write the mapping to a file
with open('city.txt', 'w', encoding='utf8') as f:
    f.write(json.dumps(dic, ensure_ascii=False))
browser.quit()

Output:

{'北京': '010000', '上海': '020000', '广州': '030200', '深圳': '040000', '武汉': '180200', '西安': '200200', '杭州': '080200', '南京': '070200', '成都': '090200', '重庆': '060000', '东莞': '030800', '大连': '230300', '沈阳': '230200', '苏州': '070300', '昆明': '250200', '长沙': '190200', '合肥': '150200', '宁波': '080300', '郑州': '170200', '天津': '050000', '青岛': '120300', '济南': '120200', '哈尔滨': '220200', '长春': '240200', '福州': '110200'}

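Use Selenium's find_element_by_xpath to locate the input that holds the city codes, read city.txt, replace each selected city with its code, and then have Selenium execute JavaScript to load the cities. The code is fairly long, so the complete version comes later.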
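
The name-to-code substitution can be sketched as follows. This is a minimal illustration, not the author's implementation: the helper names load_city_codes, cities_to_codes, and build_set_value_js are hypothetical, and so is the element id 'hidden_city_input' used in the generated JavaScript (the real id has to be read from the page). It assumes city.txt holds the JSON dictionary written by getcity.py.

```python
import json


def load_city_codes(path="city.txt"):
    """Load the {city name: city code} mapping written by getcity.py."""
    with open(path, encoding="utf8") as f:
        return json.load(f)


def cities_to_codes(cities, city_codes):
    """Turn ['北京', '上海'] into '010000,020000'; unknown names raise KeyError."""
    return ",".join(city_codes[name] for name in cities)


def build_set_value_js(element_id, value):
    """Build a JavaScript snippet that sets an input's value.

    json.dumps quotes both strings safely for use inside JavaScript.
    """
    return "document.getElementById({}).value = {};".format(
        json.dumps(element_id), json.dumps(value)
    )


# A two-entry sample of the mapping shown in the output above.
sample_codes = {"北京": "010000", "上海": "020000"}
codes = cities_to_codes(["北京", "上海"], sample_codes)
# 'hidden_city_input' is a placeholder id, not taken from the 51job page.
script = build_set_value_js("hidden_city_input", codes)
print(codes)
print(script)
```

The resulting string could then be handed to browser.execute_script(script), once the placeholder id is replaced with the one found on the page.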

Simulating the search click with Selenium

Use Selenium's find_element_by_xpath to locate the button, then call click() to simulate the user clicking search.

The code is:

browser.find_element_by_xpath("//div[@class='ush top_wrap']/button").click()

Everything above simulates the user's search. What follows are the data extraction rules.

First locate the total page count: 158 pages
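
Once the element holding the page count has been located, the number still has to be pulled out of its text. A minimal sketch, assuming the text looks something like "共158页，到第" (that wording is an assumption about the page, and parse_total_pages is a hypothetical helper):

```python
import re


def parse_total_pages(text):
    """Return the page count from text such as '共158页', or None if absent."""
    match = re.search(r"共\s*(\d+)\s*页", text)
    return int(match.group(1)) if match else None


total = parse_total_pages("共158页，到第")
print(total)
```

Returning None rather than raising lets the caller decide how to handle a page whose layout has changed.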

Find the link to each job's detail page:

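Finally, locate the data to scrape

Job title, salary, company name, posting information, benefits, responsibilities, requirements, office address, work location, and so on. In short, scrape whatever data you need.

This requires opening each job's detail link, for example: https://jobs.51job.com/shanghai-mhq/118338654.html?s=01&t=0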
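
When collecting detail links it can help to derive a stable key from each URL, for example to recognise a posting that has already been visited. A sketch using only the standard library, assuming every detail URL ends in "<job id>.html" like the example above; job_id_from_url is a hypothetical helper:

```python
import posixpath
from urllib.parse import urlparse


def job_id_from_url(url):
    """Extract the numeric job id from a detail URL, or None if the path does not match."""
    path = urlparse(url).path  # e.g. '/shanghai-mhq/118338654.html'
    name, ext = posixpath.splitext(posixpath.basename(path))
    return name if ext == ".html" and name.isdigit() else None


url = "https://jobs.51job.com/shanghai-mhq/118338654.html?s=01&t=0"
print(job_id_from_url(url))
```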

3. Complete Code

Code overview

Create a directory named 51cto-selenium with the following structure:

./
├── get51Job.py
├── getcity.py
└── mylog.py

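File descriptions:

getcity.py (run this first) fetches the city codes and generates a city.txt file

mylog.py the logging module, which records information during the crawl

get51Job.py the main crawler program, which contains:

Item class  defines the data to collect

GetJobInfo class  the main program class

getBrowser method     configures Selenium to use Chrome in headless mode, opens the target site, and returns the browser object

userInput method        simulates the user typing the keyword, choosing cities, and clicking search; returns the browser object

getUrl method               finds all URLs that match the rules and returns the urls list

spider method               extracts the details from each job URL and returns items

getresponsecontent method  takes a URL, opens the target page, and returns the HTML content

piplines method            processes all the data and saves it as 51job.txt

getPageNext method   finds the total page count, gets the URL of the next page, and saves the data until every page has been crawled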
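
To make the division of labour concrete, here is a rough sketch of what an Item class and a save step like piplines might look like. It is an illustration under assumptions, not the author's code: the field names, the save_items function, and the one-JSON-object-per-line file format are all hypothetical.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class Item:
    """One job posting; the field names are illustrative only."""
    job_name: str = ""
    salary: str = ""
    company: str = ""
    requirements: str = ""
    location: str = ""


def save_items(items, path="51job.txt"):
    """Append each item to the output file as one JSON object per line."""
    with open(path, "a", encoding="utf8") as f:
        for item in items:
            f.write(json.dumps(asdict(item), ensure_ascii=False) + "\n")


# Demo with a made-up item, written to a fresh temporary directory.
demo_items = [Item(job_name="python", location="上海")]
demo_path = os.path.join(tempfile.mkdtemp(), "51job_demo.txt")
save_items(demo_items, demo_path)
```

Opening the file in append mode fits a crawl that saves after every results page, since each call adds to what earlier pages wrote.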

getcity.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import json


# Configure selenium to run Chrome in headless mode
chrome_options = Options()
chrome_options.add_argument("--headless")
# Pass the options when launching the browser
browser = webdriver.Chrome(options=chrome_options)
cookies = browser.get_cookies()
browser.delete_all_cookies()
browser.get('https://www.51job.com/')
browser.implicitly_wait(20)

# Find the city selector and simulate a click
button = browser.find_element_by_xpath("//div[@class='ush top_wrap']//div[@class='el on']/p\
[@class='addbut']//input[@id='work_position_input']").click()

# Select the city pop-up
browser.current_window_handle

# Start with an empty dictionary
dic = {}

# Find each city and its city code
find_city_elements = browser.find_elements_by_xpath("//div[@id='work_position_layer']//\
div[@id='work_position_click_center_right_list_000000']//tbody/tr/td")
for element in find_city_elements:
    number = element.find_element_by_xpath("./em").get_attribute("data-value")  # city code
    city = element.find_element_by_xpath("./em").text  # city name
    # Add the pair to the dictionary
    dic.setdefault(city, number)
print(dic)
# Write the mapping to a file
with open('city.txt', 'w', encoding='utf8') as f:
    f.write(json.dumps(dic, ensure_ascii=False))
browser.quit()
