【Posted】: 2014-07-14 13:42:03
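【Question】:
I want to view and scrape the job listings on https://www.akzonobel.com/nl/careers/vacatures/. The country must be "Netherlands" and the job level must be "Entry level".
I am using HTTParty to send a POST request, but it keeps returning the initial 10 job listings. The correct response should contain 3 job listings.
This is the code I am using: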
require 'httparty'
require 'nokogiri'
@base_url = 'https://www.akzonobel.com'
url = "#{@base_url}/careers/vacatures/"
data = {
'ctl00$contentLeft$ctl01$ddlCountryExt' => 'NLD',
'ctl00$contentLeft$ctl01$ddlJobLevelExt' => 'ENTRY_LEVEL'
}
# Send the filter values as form data
response = HTTParty.post("#{@base_url}/nl/careers/vacatures/", :body => data)
# Parse the response body and print each job title
html = Nokogiri::HTML(response.body)
jobs = html.xpath('//h3//a')
jobs.each do |job|
  puts job.text
end
puts jobs.size
It returns:
Regional Demand Planner Nordeuropa (m,w)
Forecast Analyst - TiO2 Spend Area
PS Regional Manager APAC
Production leader
Engineering Administrator - Temporary
Procurement Manager EMEA
Business Analyst, Americas
HR Business Partner Supply Chain and R&D
AS Regional Manager
Business Information Manager
10
How do I send the required form data to the website so that I get the correct response?
Update:
I tried the following:
require 'httparty'
require 'nokogiri'
@base_url = 'https://www.akzonobel.com'
url = "#{@base_url}/careers/vacatures/"
data = {
'ctl00$contentLeft$ctl01$ddlCountryExt' => 'NLD',
'ctl00$contentLeft$ctl01$ddlJobLevelExt' => 'ENTRY_LEVEL',
'ctl00$contentLeft$ctl01$ddlContinentExt' => 1,
'ctl00$contentLeft$ctl01$ddlRegionEx' => 4,
'ctl00$contentLeft$ctl01$ddlJobFamilyEx' => 45,
'ctl00$contentLeft$ctl01$ddlBusinessUnitExt' => 22,
# Note: these two keys repeat keys set above, so the values here replace 'ENTRY_LEVEL' and 'NLD'
'ctl00$contentLeft$ctl01$ddlJobLevelExt' => 1,
'ctl00$contentLeft$ctl01$ddlCountryExt' => 1,
}
# Send the filter values as form data
response = HTTParty.post("#{@base_url}/nl/careers/vacatures/", :body => data)
# Parse the response body and print each job title
html = Nokogiri::HTML(response.body)
jobs = html.xpath('//h3//a')
jobs.each do |job|
  puts job.text
end
puts jobs.size
Unfortunately, the result is exactly the same.
Update 2:
Here is the updated code:
require 'httparty'
require 'nokogiri'
@base_url = 'https://www.akzonobel.com'
url = "#{@base_url}/careers/vacatures/"
data = {
'contentLeft_ctl01_ddlContinentExt' => 'C_EUROPE',
'contentLeft_ctl01_ddlCountryExt' => 'NLD',
'contentLeft_ctl01_ddlRegionExt' => 'Gelderland',
'contentLeft_ctl01_ddlRegionExt' => 'Limburg',
'contentLeft_ctl01_ddlRegionExt' => 'North Holland',
'contentLeft_ctl01_ddlRegionExt' => 'South Holland',
'contentLeft_ctl01_ddlJobFamilyExt' => 'General Management',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Integrated Supply Chain',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Sales & Marketing',
'contentLeft_ctl01_ddlJobFamilyExt' => 'RD&I',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Support',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Other',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Lvl2_General Management',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Manufacturing',
'contentLeft_ctl01_ddlJobFamilyExt' => 'HSE',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Engineering',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Procurement',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Distribution & Logistics',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Sales',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Marketing',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Lvl2_RD&I',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Finance',
'contentLeft_ctl01_ddlJobFamilyExt' => 'IM',
'contentLeft_ctl01_ddlJobFamilyExt' => 'HR',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Legal, IP & Compliance',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Facilities',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Lvl2_Other',
'contentLeft_ctl01_ddlJobFamilyExt' => '80200000',
'contentLeft_ctl01_ddlJobFamilyExt' => '80300000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81900000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81100000',
'contentLeft_ctl01_ddlJobFamilyExt' => '82000000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81200000',
'contentLeft_ctl01_ddlJobFamilyExt' => '80700000',
'contentLeft_ctl01_ddlJobFamilyExt' => '80400000',
'contentLeft_ctl01_ddlJobFamilyExt' => '80500000',
'contentLeft_ctl01_ddlJobFamilyExt' => '80800000',
'contentLeft_ctl01_ddlJobFamilyExt' => '80900000',
'contentLeft_ctl01_ddlJobFamilyExt' => '82100000',
'contentLeft_ctl01_ddlJobFamilyExt' => '82200000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81010000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81020000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81030000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81040000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81300000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81410000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81420000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81430000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81600000',
'contentLeft_ctl01_ddlJobFamilyExt' => '81700000',
'contentLeft_ctl01_ddlJobFamilyExt' => 'Lvl3_Other',
'contentLeft_ctl01_ddlBusinessUnitExt' => '52000100',
'contentLeft_ctl01_ddlBusinessUnitExt' => '52000200',
'contentLeft_ctl01_ddlBusinessUnitExt' => '52000300',
'contentLeft_ctl01_ddlBusinessUnitExt' => '52000900',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000010',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000013',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000020',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000022',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000026',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000033',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000038',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000041',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000054',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000055',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000056',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000061',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000063',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000100',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000300',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000900',
'contentLeft_ctl01_ddlBusinessUnitExt' => '53000901',
'contentLeft_ctl01_ddlBusinessUnitExt' => '51000000',
'contentLeft_ctl01_ddlJobLevelExt' => 'ENTRY_LEVEL'
}
# Send the filter values as form data
response = HTTParty.post("#{@base_url}/nl/careers/vacatures/", :body => data)
# Parse the response body and print each job title
html = Nokogiri::HTML(response.body)
jobs = html.xpath('//h3//a')
jobs.each do |job|
  puts job.text
end
puts jobs.size
This gives me the same result as before.
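One thing worth noting about the hash in Update 2: a Ruby hash literal keeps only the last value for a repeated key, so most of those entries never reach the server. The sketch below shows the collapse and one way to encode several values under the same field name using the standard library's `URI.encode_www_form`. The shortened field list is only for illustration, and whether this server expects repeated keys (or whether this alone fixes the filtering) is an assumption, not something confirmed for this site.

```ruby
require 'uri'

# A hash literal with a repeated key keeps only the last value
# (Ruby may also print a "duplicated key" warning for it).
collapsed = {
  'contentLeft_ctl01_ddlRegionExt' => 'Gelderland',
  'contentLeft_ctl01_ddlRegionExt' => 'Limburg'
}

# To keep several values for one field, store them in an array.
multi = {
  'contentLeft_ctl01_ddlCountryExt' => 'NLD',
  'contentLeft_ctl01_ddlRegionExt'  => ['Gelderland', 'Limburg']
}

# encode_www_form repeats the key once per array element.
body = URI.encode_www_form(multi)

puts collapsed.size # only one entry survived
puts body
```

The resulting string could be passed as `:body` to `HTTParty.post`, most likely together with a `Content-Type: application/x-www-form-urlencoded` header, since HTTParty sends a string body as given.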
【Discussion】:
- HTTParty is not the right tool for this kind of scraping. Unless JavaScript needs to be executed, I would use Mechanize.
Tags: ruby post http-post httparty open-uri