Scraping links with Selenium: answers

【Question title】: Scraping links with selenium
【Posted】: 2020-05-30 07:26:42
【Question description】:

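I am trying to scrape the links to the articles on a website. When the site first loads, it lists only 5 articles; a "load more" button then has to be clicked to show more of the list. The HTML source only contains the links to the first five articles.

I used Selenium with Python to click the "load more" button automatically until the page is fully loaded with the whole list of articles.

The question now is how to extract the links to all of those articles.

After fully loading the site with Selenium, I tried getting the HTML source with driver.page_source and printing it, but it still only has the links to the first 5 articles.

I want to get the links to all the articles that are loaded in the page after clicking the "load more" button.

Could someone please help with a solution?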

【Question comments】:

Can you provide a URL?

Tags: selenium web-scraping

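【Solution 1】:

The links may take some time to show up, so your code may be reading driver.page_source before the page has been updated. Instead, you can select the links with Selenium after an explicit wait, which ensures that the links added dynamically to the page have finished loading. Without a link to the page you are scraping it is hard to say exactly what you need, but (in Python) it should look something like this: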

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def condition(driver):
    """Return the matching elements once there are 10 or more of them.

    Return None otherwise, which makes WebDriverWait keep polling.
    """
    selector = 'a.my_class'  # Selects all <a> tags with the class "my_class"
    # find_elements_by_css_selector was removed in Selenium 4.3; use find_elements with By
    els = driver.find_elements(By.CSS_SELECTOR, selector)
    if len(els) >= 10:
        return els
    return None

# `driver` is assumed to be your existing WebDriver, after the "load more" clicks.
# until() returns the first truthy value from condition, or raises TimeoutException after 2 min:
links_elements = WebDriverWait(driver, timeout=120).until(condition)
# Getting the href attribute of the links
links_href = [link.get_attribute('href') for link in links_elements]

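In this code, you are:

1. Repeatedly looking up the elements you want until there are 10 or more of them. You can locate them by CSS selector (as in the example), by XPath, or by another locator method. The wait returns as soon as the condition returns a truthy object, which here is a list of Selenium elements; if that does not happen before the timeout, a TimeoutException is raised. See more on explicit waits in the documentation. You should write a condition that suits your case: if you are not sure how many links there will be in the end, expecting a fixed number of them is probably not a good idea.
2. Extracting what you want from the Selenium objects. To do that, call the appropriate method on each element of the list obtained in the previous step.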
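
If the final number of links is not known in advance, one possible alternative to a fixed count is to poll until the number of matches stops growing. The following is only a minimal sketch under assumptions, not a definitive implementation: the function name collect_hrefs_when_stable and its parameters are made up for illustration, the selector is a placeholder, and "stable for a few polls" is a heuristic that a slow page could defeat. It uses a plain polling loop rather than WebDriverWait, and it passes the locator strategy as the string "css selector", which is the value of Selenium's By.CSS_SELECTOR.

```python
import time

CSS_SELECTOR = "css selector"  # same string as selenium's By.CSS_SELECTOR


def collect_hrefs_when_stable(driver, selector, timeout=120.0, poll=0.5, stable_polls=3):
    """Return the hrefs of all matches once their count has stopped growing.

    The count must stay unchanged (and non-zero) for `stable_polls`
    consecutive polls. Raises TimeoutError if that never happens in time.
    """
    deadline = time.monotonic() + timeout
    last_count, unchanged = -1, 0
    while time.monotonic() < deadline:
        # Re-query on every poll so we never hold on to stale elements.
        elements = driver.find_elements(CSS_SELECTOR, selector)
        if elements and len(elements) == last_count:
            unchanged += 1
            if unchanged >= stable_polls:
                return [el.get_attribute("href") for el in elements]
        else:
            # Count changed (or nothing found yet): restart the stability counter.
            last_count, unchanged = len(elements), 0
        time.sleep(poll)
    raise TimeoutError(f"link count for {selector!r} did not settle in {timeout}s")
```

Hypothetical usage, once the "load more" clicks are done: links_href = collect_hrefs_when_stable(driver, 'a.my_class'). A larger stable_polls or poll value makes the check more conservative at the cost of a longer wait.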

【Comments】: