Scraping Blog Post Titles with Selenium - Python

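【Question title】: Scraping Blog Post Titles with Selenium - Python
【Posted】: 2022-01-12 23:07:58
【Question】:

I'm trying to use Selenium with Python to scrape the blog post titles from the following URL: https://blog.coinbase.com/tagged/coinbase-pro. When I fetch the page source with Selenium, it does not contain the blog post titles, but Chrome's source does contain them when I right-click and choose "View page source". I'm using the following code: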

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.headless = True
driver = webdriver.Chrome(options=options)
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
pageSource = driver.page_source
print(pageSource)

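Any help would be greatly appreciated. Thanks.

【Comments】:

Do you want the 8 titles that have graf graf--h3 graf-after--figure graf--trailing graf--title as their class?
You may want to add a wait after driver.get so that the dynamically loaded content has time to appear. But since the titles are loaded dynamically, why not query the API directly?

Tags: python selenium web-scraping

【Solution 1】: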

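There are several ways to get all the titles from that page. The fastest and most efficient is to use requests.

Here is how to fetch the titles with requests: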

import re
import json
import time
import requests

link = 'https://medium.com/the-coinbase-blog/load-more'
params = {
    'sortBy': 'tagged',
    'tagSlug': 'coinbase-pro',
    'limit': 25,
    'to': int(time.time() * 1000),
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    s.headers['accept'] = 'application/json'
    s.headers['referer'] = 'https://blog.coinbase.com/tagged/coinbase-pro'

    while True:
        res = s.get(link, params=params)
        # Skip any text before the first "{" so that only the JSON object is parsed.
        container = json.loads(re.findall("[^{]+(.*)", res.text)[0])
        for k, v in container['payload']['references']['Post'].items():
            title = v['title']
            print(title)

        # Follow the paging cursor; stop once the response no longer has one.
        try:
            next_page = container['payload']['paging']['next']['to']
        except KeyError:
            break

        params['to'] = next_page
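
The parsing step above can be tried offline. The following is a minimal sketch, not part of the original answer: `parse_prefixed_json` and `extract_titles` are hypothetical helper names, and the sample body (including the `])}while(1);</x>` guard prefix) is made-up data that only imitates the shape the answer's code expects. It locates the first `{` with `str.find` instead of a regex, so it also tolerates a body that has no prefix at all.

```python
import json


def parse_prefixed_json(text):
    """Parse a JSON object that may be preceded by a non-JSON guard prefix."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no JSON object found in response body")
    return json.loads(text[start:])


def extract_titles(container):
    """Collect post titles from the payload -> references -> Post mapping."""
    posts = container.get("payload", {}).get("references", {}).get("Post", {})
    return [post["title"] for post in posts.values() if "title" in post]


# Made-up sample body, for illustration only.
sample = (
    '])}while(1);</x>'
    '{"payload": {"references": {"Post": {'
    '"a1": {"title": "First post"}, "b2": {"title": "Second post"}}}}}'
)

titles = extract_titles(parse_prefixed_json(sample))
print(titles)
```

In the loop above, `parse_prefixed_json(res.text)` could stand in for the `json.loads(re.findall(...)[0])` line, assuming the endpoint still returns a single JSON object per response.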

However, if you want to stick with Selenium, try the following:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC

def scroll_down_to_the_bottom():
    # Keep scrolling until the page height stops growing for 10 seconds.
    check_height = driver.execute_script("return document.body.scrollHeight;")
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        try:
            WebDriverWait(driver, 10).until(lambda driver: driver.execute_script("return document.body.scrollHeight;") > check_height)
            check_height = driver.execute_script("return document.body.scrollHeight;")
        except TimeoutException:
            break

with webdriver.Chrome() as driver:
    driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
    scroll_down_to_the_bottom()
    for item in WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".section-content h3.graf--title"))):
        print(item.text)

【Comments】:

【Solution 2】:

wait = WebDriverWait(driver, 30)
driver.get("https://blog.coinbase.com/tagged/coinbase-pro")
elements = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".graf.graf--h3.graf-after--figure.graf--trailing.graf--title")))
for elem in elements:
    print(elem.text)

If you want those 8 titles, you can fetch them by their CSS selector using a wait.

Imports:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC

Output:

Inverse Finance (INV), Liquity (LQTY), Polyswarm (NCT) and Propy (PRO) are launching on Coinbase Pro
Goldfinch Protocol (GFI) is launching on Coinbase Pro
Decentralized Social (DESO) is launching on Coinbase Pro
API3 (API3), Bluezelle (BLZ), Gods Unchained (GODS), Immutable X (IMX), Measurable Data Token (MDT) and Ribbon…
Circuits of Value (COVAL), IDEX (IDEX), Moss Carbon Credit (MCO2), Polkastarter (POLS), ShapeShift FOX Token (FOX)…
Voyager Token (VGX) is launching on Coinbase Pro
Alchemix (ALCX), Ethereum Name Service (ENS), Gala (GALA), mStable USD (MUSD) and Power Ledger (POWR) are launching…
Crypto.com Protocol (CRO) is launching on Coinbase Pro

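【Comments】:

This is exactly the output I'm looking for. When I add your code to mine, I get this error: File "/home/ubuntu/uniswap_api/env/lib/python3.7/site-packages/selenium/webdriver/support/wait.py", line 89, in until raise TimeoutException(message, screen, stacktrace) selenium.common.exceptions.TimeoutException: Message:
This is in headless mode.
Would you mind writing out the complete code? I'm still getting the error.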
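
For reference, a complete headless script along the lines requested in the last comment might look like the sketch below. It was not posted in the thread, and the cause of the timeout was never confirmed there. The sketch rests on two assumptions: that the small default window of headless Chrome keeps the titles from counting as visible, and that waiting for presence in the DOM rather than visibility is enough. `headless_chrome_arguments` and `fetch_titles` are hypothetical names, and `h3.graf--title` is a looser selector built from the class names quoted in the answers.

```python
def headless_chrome_arguments(width=1920, height=1080):
    """Return Chrome flags for a headless run with a desktop-sized window."""
    return [
        "--headless",
        f"--window-size={width},{height}",
    ]


def fetch_titles(url, selector="h3.graf--title", timeout=30):
    """Open url in headless Chrome and return the text of matching elements."""
    # Imported here so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)

    with webdriver.Chrome(options=options) as driver:
        driver.get(url)
        # presence_* only needs the elements to exist in the DOM,
        # whereas visibility_* also requires them to be displayed.
        elements = WebDriverWait(driver, timeout).until(
            EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
        )
        # textContent is read instead of .text because .text is empty
        # for elements that are not currently displayed.
        return [(el.get_attribute("textContent") or "").strip() for el in elements]
```

Calling `fetch_titles("https://blog.coinbase.com/tagged/coinbase-pro")` and printing each returned string would mirror the loop in Solution 2. If the wait still times out, the page may be serving different markup to headless browsers, in which case the requests approach from Solution 1 is likely the more dependable route.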