【Question Title】: Using Selenium to scrape a webpage with JavaScript
【Posted】: 2021-06-04 16:19:02
【Question Description】:

I want to scrape a Google Scholar profile page that has a "Show more" button. I learned from previous questions that the extra results are loaded by JavaScript rather than plain HTML, and that there are several ways to scrape such pages. I tried Selenium and came up with the following code.

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')
chrome_path = r"....path....."
driver = webdriver.Chrome(chrome_path)

driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")

driver.find_element_by_xpath('/html/body/div/div[13]/div[2]/div/div[4]/form/div[2]/div/button/span/span[2]').click()

soup = BeautifulSoup(driver.page_source,'html.parser')

papers = soup.find_all('tr',{'class':'gsc_a_tr'})

for paper in papers:
    title = paper.find('a',{'class':'gsc_a_at'}).text
    author = paper.find('div',{'class':'gs_gray'}).text
    journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]

    print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)

The browser now clicks the "Show more" button and displays the whole page. However, I still only get information for the first 20 papers. I don't understand why. Please help!

Thanks!

【Question Discussion】:

    Tags: javascript python selenium web-scraping beautifulsoup


    【Solution 1】:
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    
    options = Options()
    options.page_load_strategy = 'normal'
    driver = webdriver.Chrome(options=options)
    
    driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
    
    # Awkward method
    # Loading all available articles and then iterating over them
    for i in range(1, 3):
        driver.find_element_by_css_selector('#gsc_bpf_more').click()
        # waits until elements are loaded
        time.sleep(3)
    
    # Container where all data located
    for result in driver.find_elements_by_css_selector('#gsc_a_b .gsc_a_t'):
        title = result.find_element_by_css_selector('.gsc_a_at').text
        authors = result.find_element_by_css_selector('.gsc_a_at+ .gs_gray').text
        publication = result.find_element_by_css_selector('.gs_gray+ .gs_gray').text
        print(title)
        print(authors)
        print(publication)
        # just for separating purpose
        print()
    

    Partial output:

    Tax/subsidy policies in the presence of environmentally aware consumers
    S Bansal, S Gangopadhyay
    Journal of Environmental Economics and Management 45 (2), 333-355
    
    Choice and design of regulatory instruments in the presence of green consumers
    S Bansal
    Resource and Energy economics 30 (3), 345-368
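
    If the hard-coded range(1, 3) and time.sleep(3) feel fragile, an explicit wait can be used instead. Below is a minimal sketch, assuming the button id gsc_bpf_more and the row selectors from the code above, and assuming that Google Scholar marks the button as disabled once every publication has been loaded (that last detail is an assumption, not something stated in the answer):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Chrome()
    driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
    wait = WebDriverWait(driver, 10)

    while True:
        button = wait.until(EC.presence_of_element_located((By.ID, 'gsc_bpf_more')))
        if not button.is_enabled():
            # assumed: the button is disabled once all rows have been loaded
            break
        rows_before = len(driver.find_elements(By.CSS_SELECTOR, '#gsc_a_b .gsc_a_tr'))
        button.click()
        # wait until new rows have appeared or the button reports nothing is left
        wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, '#gsc_a_b .gsc_a_tr')) > rows_before
                   or not d.find_element(By.ID, 'gsc_bpf_more').is_enabled())

    print(len(driver.find_elements(By.CSS_SELECTOR, '#gsc_a_b .gsc_a_tr')), 'rows loaded')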
    

    【Discussion】:

      【Solution 2】:

      I believe your problem is that the new elements have not fully loaded yet by the time your program inspects the site. Try importing time and then sleeping for a few seconds, like this (I removed the headless option so that you can watch the program work):

      from selenium import webdriver
      import time
      from bs4 import BeautifulSoup
      
      options = webdriver.ChromeOptions()
      options.add_argument('--ignore-certificate-errors')
      options.add_argument('--incognito')
      
      driver = webdriver.Chrome()
      
      driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
      time.sleep(3)
      driver.find_element_by_id("gsc_bpf_more").click()
      time.sleep(4)
      soup = BeautifulSoup(driver.page_source, 'html.parser')
      
      papers = soup.find_all('tr', {'class': 'gsc_a_tr'})
      
      for paper in papers:
          title = paper.find('a', {'class': 'gsc_a_at'}).text
          author = paper.find('div', {'class': 'gs_gray'}).text
          journal = [a.text for a in paper.select("td:nth-child(1) > div:nth-child(3)")]
      
          print('Paper Title:', title, '\nAuthor:', author, '\nJournal:', journal)
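
      The same idea can be expressed with explicit waits instead of fixed sleeps, so the script proceeds as soon as the new rows are actually present. A minimal sketch using Selenium's standard WebDriverWait API; the 20-row threshold mirrors the default page size mentioned in the question and is an assumption:

      from selenium import webdriver
      from selenium.webdriver.common.by import By
      from selenium.webdriver.support.ui import WebDriverWait
      from selenium.webdriver.support import expected_conditions as EC
      from bs4 import BeautifulSoup

      driver = webdriver.Chrome()
      driver.get("https://scholar.google.com/citations?user=TBcgGIIAAAAJ&hl=en")
      wait = WebDriverWait(driver, 10)

      # wait until the "Show more" button is clickable, then click it
      wait.until(EC.element_to_be_clickable((By.ID, 'gsc_bpf_more'))).click()
      # wait until more than the initial 20 rows are in the DOM before parsing
      wait.until(lambda d: len(d.find_elements(By.CSS_SELECTOR, 'tr.gsc_a_tr')) > 20)

      soup = BeautifulSoup(driver.page_source, 'html.parser')
      for paper in soup.find_all('tr', {'class': 'gsc_a_tr'}):
          print(paper.find('a', {'class': 'gsc_a_at'}).text)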
      

      【Discussion】:
