How to return X elements [Selenium]?

【Question title】: How to return X elements [Selenium]?
【Posted】: 2015-09-11 19:24:08
【Question】:

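A page loads 35,000 elements, and I am only interested in the first 10. Returning all of the elements makes scraping extremely slow. So far I have only managed to return either the first element, with:

driver.find_element_by

or all 35,000 elements, with:

driver.find_elements_by

Does anyone know a way to return only the first x elements found?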

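【Comments】:

Can you give us an example of the HTML that is returned? How do the first 10 elements differ in format from the rest? Or do you simply want the first 10 elements, whatever they are?

Tags: firefox selenium-webdriver web-scraping python-3.4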

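【Solution 1】:

Selenium does not provide a facility for returning only a slice of the results of a .find_elements... call. If you want to avoid having Selenium return every single element, a general solution is to perform the slicing on the browser side, in JavaScript. This answer shows that solution. If you want to use XPath to select the DOM nodes, you can adapt this answer to do so, or you can use the method from another answer I have submitted.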

from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.example.com")

# We add 35000 paragraphs with class `test` to the page so that we can
# later show how to get the first 10 paragraphs of this class. Each
# paragraph is uniquely numbered.
driver.execute_script("""
var html = [];
for (var i = 0; i < 35000; ++i) {
  html.push("<p class='test'>"+ i + "</p>");
}
document.body.innerHTML += html.join("");
""")

elements = driver.execute_script("""
return Array.prototype.slice.call(document.querySelectorAll("p.test"), 0, 10);
""")

# Verify that we got the first 10 elements by outputting the text they
# contain to the console. The loop here is for illustration purposes
# to show that the `elements` array contains what we want. In real
# code, if I wanted to process the text of the first 10 elements, I'd
# do what I show next.
for element in elements:
    print(element.text)

# A better way to get the text of the first 10 elements. This results
# in 1 round-trip between this script and the browser. The loop above
# would take 10 round-trips.
print(driver.execute_script("""
return Array.prototype.slice.call(document.querySelectorAll("p.test"), 0, 10)
           .map(function (x) { return x.textContent; });
"""))

driver.quit()

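The Array.prototype.slice.call rigmarole is needed because what document.querySelectorAll returns looks like an Array but is not actually an Array object. (It is a NodeList.) So it has no .slice method of its own, but you can pass it to Array's slice method.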
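The browser-side slice above can be wrapped in a small reusable helper. The following is only a sketch: the names first_n_by_css and _SLICE_SCRIPT are hypothetical and not part of Selenium. It assumes a driver object that exposes execute_script(script, *args), as Selenium's WebDriver does, with the extra arguments available in the script as arguments[0], arguments[1], and so on.

```python
# JavaScript run in the browser: arguments[0] is the CSS selector and
# arguments[1] is the number of elements to keep. The NodeList returned
# by querySelectorAll is sliced through Array.prototype.slice.
_SLICE_SCRIPT = """
return Array.prototype.slice.call(
    document.querySelectorAll(arguments[0]), 0, arguments[1]);
"""


def first_n_by_css(driver, selector, n):
    """Return at most the first n elements matching a CSS selector.

    `driver` is assumed to be a Selenium WebDriver (hypothetical helper;
    any object with execute_script(script, *args) will do). The slice is
    done in the browser, so only n elements are sent back to Python.
    """
    return driver.execute_script(_SLICE_SCRIPT, selector, n)
```

With the page built in the example above, first_n_by_css(driver, "p.test", 10) should be equivalent to the inline execute_script call that fills elements.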

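【Comments】:

【Solution 2】:

This is a markedly different approach, presented as a separate answer because some people will prefer this one over the other one I gave, or the other one over this one.

This one relies on using XPath to slice the results: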

from selenium import webdriver
driver = webdriver.Chrome()
driver.get("http://www.example.com")

# We add 35000 paragraphs with class `test` to the page so that we can
# later show how to get the first 10 paragraphs of this class. Each
# paragraph is uniquely numbered. These paragraphs are put into
# individual `div` to make sure they are not siblings of one
# another. (This prevents offering a naive XPath expression that would
# work only if they *are* siblings.)
driver.execute_script("""
var html = [];
for (var i = 0; i < 35000; ++i) {
  html.push("<div><p class='test'>"+ i + "</p></div>");
}
document.body.innerHTML += html.join("");
""")

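# On Selenium 4.3+, where the find_elements_by_* helpers were removed,
# use driver.find_elements("xpath", ...) with the same expression.
elements = driver.find_elements_by_xpath(
    "(//p[@class='test'])[position() < 11]")
for element in elements:
    print(element.text)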

driver.quit()

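Note that XPath uses 1-based indexes, so < 11 is indeed the right expression. The parentheses around the first part of the expression are absolutely necessary. With them, the [position() < 11] test checks the position each node has in the node-set produced by the parenthesized expression. Without them, the position test would check each node's position relative to its parent, which would match all nodes, because every <p> is in first position within its own <div>. (This is why I added those <div> elements above: to show this problem.)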
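The parenthesized form can be generated rather than typed by hand. This is a minimal sketch; first_n_xpath is a hypothetical helper name, not a Selenium function. It writes the test as position() <= n, which selects the same nodes as position() < n + 1 in the expression above.

```python
def first_n_xpath(xpath, n):
    """Wrap an XPath expression so it selects only its first n nodes.

    The parentheses make position() refer to the node's position in the
    whole result node-set rather than among its siblings.
    (Hypothetical helper, shown for illustration only.)
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return "({0})[position() <= {1}]".format(xpath, n)


print(first_n_xpath("//p[@class='test']", 10))
# prints: (//p[@class='test'])[position() <= 10]
```

The resulting string can be passed to the same XPath lookup used in the code above, for example driver.find_elements_by_xpath(first_n_xpath("//p[@class='test']", 10)).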

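I would use this solution if I were already using XPath to make the selection. Otherwise, if I were searching by CSS selector or by id, I would not convert the search to XPath just to perform the slice. I would use the other method I showed.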

【Comments】: