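【Posted】: 2021-06-04 16:19:02
【Question】:
I want to scrape a Google Scholar page that has a "Show more" button. From an earlier question I understand that the extra entries are loaded by JavaScript rather than being present in the initial HTML, and that there are several ways to scrape such pages. I tried Selenium with the following code.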
from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
chrome_path = r"....path....."
# Note: `options` is never passed to the driver here, so the flags above have no effect.
driver = webdriver.Chrome(chrome_path)
driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
# Click the "Show more" button via its absolute XPath.
driver.find_element_by_xpath('/html/body/div/div[13]/div[2]/div/div[4]/form/div[2]/div/button/span/span[2]').click()
# The page source is read immediately after the click.
soup = BeautifulSoup(driver.page_source, 'html.parser')
papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
for paper in papers:
    title = paper.find('a', {'class': 'gsc_a_at'}).text
    author = paper.find('div', {'class': 'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
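The browser now clicks the "Show more" button and displays the whole page. However, I still only get the information for the first 20 papers, and I don't understand why. Please help!
Thanks!
【Discussion】: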
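One plausible explanation for the 20-paper limit is timing: the extra rows are fetched asynchronously after the click, so reading `driver.page_source` straight away may still return only the first batch. Below is a minimal sketch of one workaround, not a definitive fix: click the button, poll until the number of `tr.gsc_a_tr` rows grows, and repeat until the button is disabled. The helper name `load_all_rows`, the button id `gsc_bpf_more`, and the timeout values are assumptions that do not come from the question. The locator strings `"css selector"` and `"id"` are the values held by Selenium's `By.CSS_SELECTOR` and `By.ID` constants, used here so the helper needs no Selenium import of its own.

```python
import time

ROW_SELECTOR = "tr.gsc_a_tr"     # row class taken from the question's code
MORE_BUTTON_ID = "gsc_bpf_more"  # ASSUMPTION: id of the "Show more" button


def load_all_rows(driver, timeout=10.0, poll=0.25):
    """Click "Show more" until no further rows load; return the final row count.

    `driver` is expected to behave like a Selenium 4 WebDriver, i.e. to offer
    find_element(by, value) and find_elements(by, value).
    """
    count = len(driver.find_elements("css selector", ROW_SELECTOR))
    while True:
        # Look the button up again on every pass in case the page re-rendered it.
        button = driver.find_element("id", MORE_BUTTON_ID)
        if not button.is_enabled():
            return count  # button disabled: assume nothing is left to load
        button.click()

        # Poll until the click has produced additional rows, or give up.
        deadline = time.monotonic() + timeout
        while True:
            new_count = len(driver.find_elements("css selector", ROW_SELECTOR))
            if new_count > count:
                break
            if time.monotonic() > deadline:
                return count  # no new rows arrived within the timeout
            time.sleep(poll)
        count = new_count
```

In the question's script, this would be called as `load_all_rows(driver)` after `driver.get(...)` and before `driver.page_source` is handed to BeautifulSoup, in place of the single XPath click. Selenium's own `WebDriverWait` could replace the hand-written polling loop; the explicit loop is used here only to keep the sketch dependency-free.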
Tags: javascript python selenium web-scraping beautifulsoup