[Posted]: 2020-08-28 02:15:42
[Question]:
I'm learning Python and decided to do a web-scraping project using BeautifulSoup and Selenium.
Website: https://careers.amgen.com/ListJobs?
Goal: retrieve every variable related to a job posting. Variables identified: ID, job title, URL, city, state, zip, country, and the date the job was added.
Problem: I managed to extract the data from the first page of the table, but I cannot extract it from any of the other pages, even though I do use the control that goes to the next page.
Any help would be greatly appreciated.
Please find my code below.
```
import re
import os
import selenium
import pandas as pd
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import presence_of_element_located
from bs4 import BeautifulSoup

#driver = webdriver.Chrome(ChromeDriverManager().install())
browser = webdriver.Chrome("")  # path needed to execute chromedriver; check your own path
browser.get('https://careers.amgen.com/ListJobs?')
browser.implicitly_wait(100)

soup = BeautifulSoup(browser.page_source, 'html.parser')
code_soup = soup.find_all('tr', attrs={'role': 'row'})

# creating the data set
df = pd.DataFrame({'id': [],
                   'jobs': [],
                   'url': [],
                   'city': [],
                   'state': [],
                   'zip': [],
                   'country': [],
                   'added': []})

d = code_soup
next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')
for i in range(2, 12):  # catch error, out of bounds?
    df = df.append({'id': d[i].find_all("td", {"class": "DisplayJobId-cell"}),
                    "jobs": d[i].find_all("td", {"class": "JobTitle-cell"}),
                    "url": d[i].find("a").attrs['href'],
                    "city": d[i].find_all("td", {"class": "City-cell"}),
                    "state": d[i].find_all("td", {"class": "State-cell"}),
                    "zip": d[i].find_all("td", {"class": "Zip-cell"}),
                    "country": d[i].find_all("td", {"class": "Country-cell"}),
                    "added": d[i].find_all("td", {"class": "AddedOn-cell"})},
                   ignore_index=True)

df['url'] = 'https://careers.amgen.com/' + df['url'].astype(str)
df["company"] = "Amgen"
df

# iterate through the pages
next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')
for p in range(1, 7):  # go from page 1 to 6
    next_page.click()
    browser.implicitly_wait(20)
    print(p)
```
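As an aside, `find_all` returns a list of `Tag` objects, not the cell text, so the DataFrame columns above end up holding tag lists. Below is a minimal sketch of a row parser that pulls out plain text with `.get_text()`, exercised against a small hand-written HTML fragment (the class names are the ones from the code above; the sample values are made up for illustration):

```python
from bs4 import BeautifulSoup

# cell classes taken from the scraping code above
CELL_CLASSES = ["DisplayJobId", "JobTitle", "City", "State", "Zip", "Country", "AddedOn"]

def parse_rows(soup):
    """Extract one dict per <tr role="row">, using .get_text() for cell text."""
    records = []
    for row in soup.find_all("tr", attrs={"role": "row"}):
        link = row.find("a")
        if link is None:                 # skip header rows that carry no job link
            continue
        rec = {"url": link["href"]}
        for name in CELL_CLASSES:
            cell = row.find("td", {"class": f"{name}-cell"})
            rec[name] = cell.get_text(strip=True) if cell else None
        records.append(rec)
    return records

# tiny hand-made fragment mimicking one grid row (values are illustrative)
sample = """
<table><tr role="row">
  <td class="DisplayJobId-cell">R-1001</td>
  <td class="JobTitle-cell"><a href="/JobDetail?id=1">Scientist</a></td>
  <td class="City-cell">Thousand Oaks</td>
  <td class="State-cell">CA</td>
  <td class="Zip-cell">91320</td>
  <td class="Country-cell">US</td>
  <td class="AddedOn-cell">8/27/2020</td>
</tr></table>
"""
rows = parse_rows(BeautifulSoup(sample, "html.parser"))
print(rows[0]["DisplayJobId"], rows[0]["url"])  # R-1001 /JobDetail?id=1
```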
I tried multiple things; this is my latest attempt, and it did not work:
```
p = 0
next_page = browser.find_element_by_xpath('//*[@id="jobGrid0"]/div[2]/a[3]/span')
for p in range(1, 7):
    for i in range(2, 12):
        df1 = df.append({'id': d[i].find_all("td", {"class": "DisplayJobId-cell"}),
                         "jobs": d[i].find_all("td", {"class": "JobTitle-cell"}),
                         "url": d[i].find("a").attrs['href'],
                         "city": d[i].find_all("td", {"class": "City-cell"}),
                         "state": d[i].find_all("td", {"class": "State-cell"}),
                         "zip": d[i].find_all("td", {"class": "Zip-cell"}),
                         "country": d[i].find_all("td", {"class": "Country-cell"}),
                         "added": d[i].find_all("td", {"class": "AddedOn-cell"})},
                        ignore_index=True)
    p += 1
    next_page.click()
    print(p)
```
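For reference, the likely failure mode: after `next_page.click()` the grid re-renders, so both the saved `next_page` WebElement and the old `d` soup refer to the previous page. A hedged sketch of a loop that re-parses the page source and re-locates the button on every pass (not run against the live site here; the XPath is the one from the question, and `parse_rows` stands in for whatever row-extraction function you use):

```python
import time

def scrape_all_pages(browser, n_pages, parse_rows):
    """Collect rows from n_pages of the grid, re-finding the 'next' button
    each time so we never click a stale element."""
    from bs4 import BeautifulSoup

    records = []
    for page in range(n_pages):
        # re-parse the CURRENT page source before clicking away from it
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        records.extend(parse_rows(soup))
        if page < n_pages - 1:
            # selenium imported lazily so the sketch is readable without a driver
            from selenium.webdriver.common.by import By
            # re-locate the next-page arrow each iteration (XPath from the question)
            next_btn = browser.find_element(
                By.XPATH, '//*[@id="jobGrid0"]/div[2]/a[3]/span')
            next_btn.click()
            time.sleep(2)  # crude wait; an explicit WebDriverWait would be better
    return records
```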
[Comments]:
Tags: python selenium selenium-webdriver web-scraping beautifulsoup