Scraping full dynamic HTML content using Java JSoup and Selenium

【Question title】: Scraping full dynamic HTML content using Java JSoup and Selenium
【Posted】: 2019-02-02 17:32:41
【Question】:

I am trying to scrape this website

https://www.dailystrength.org/search?query=aspirin&type=discussion

to build a dataset for my project (aspirin is used as a placeholder search term).

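I decided to write a crawler with Jsoup. The problem is that the posts are loaded dynamically through an Ajax request, and that request is triggered by the "show more" button.

[Image: This button causes the problems]

When the entire content is displayed, it should look like this, with the text "All messages loaded":

[Image: end result]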

import java.io.IOException;
import java.util.ArrayList;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.openqa.selenium.*;
import org.openqa.selenium.chrome.*;

/**
 *
 * @author Ahmed
 */
public class Crawler {

    // Note: Newsfeed_item is the asker's own class (not shown here) with the
    // fields replysLink, description and subject.
    public static void main(String[] args) {
        Document search_result;
        String[] requested = new String[]{"aspirin"/*, "Fentanyl"*/};
        ArrayList<Newsfeed_item> threads = new ArrayList<>();

        String query = "https://www.dailystrength.org/search?query=";

        try {
            for (int i = 0; i < requested.length; i++) {
                // Fetch the static search page; only the initially rendered posts are in it
                search_result = Jsoup.connect(query + requested[i] + "&type=discussion").get();

                Elements posts = search_result.getElementsByClass("newsfeed__item");
                for (Element item : posts) {

                    // Link to the discussion page that holds the full post
                    Elements link = item.getElementsByClass("newsfeed__btn-container posts__discuss-btn");

                    Newsfeed_item currentItem = new Newsfeed_item();
                    currentItem.replysLink = link.attr("abs:href");
                    if (currentItem.replysLink.isEmpty()) {
                        continue; // no link found; Jsoup.connect("") would throw
                    }
                    Document reply_result = Jsoup.connect(currentItem.replysLink).get();
                    Elements description = reply_result.getElementsByClass("posts__content");

                    currentItem.description = description.text();
                    currentItem.subject = requested[i];
                    System.out.println(currentItem);

                }
            }
        } catch (IOException ex) {
            Logger.getLogger(Crawler.class.getName()).log(Level.SEVERE, null, ex);
        }

    }
}

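This code only gives me the few posts that are displayed, not the hidden ones. I know that JSoup alone cannot handle this, so I looked for resources on using Selenium to display the full content and download it for crawling.

I could not find any resources, and the only code I found, which I tried in order to get a basic understanding, from

https://www.youtube.com/watch?v=g1IbI_qYsDg

gives me this error: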

Exception in thread "main" java.lang.IllegalStateException: The path to the driver executable must be set by the webdriver.gecko.driver system property; for more information, see https://github.com/mozilla/geckodriver. The latest version can be downloaded from https://github.com/mozilla/geckodriver/releases
    at com.google.common.base.Preconditions.checkState(Preconditions.java:847)
    at org.openqa.selenium.remote.service.DriverService.findExecutable(DriverService.java:134)
    at org.openqa.selenium.firefox.GeckoDriverService.access$100(GeckoDriverService.java:44)
    at org.openqa.selenium.firefox.GeckoDriverService$Builder.findDefaultExecutable(GeckoDriverService.java:167)
    at org.openqa.selenium.remote.service.DriverService$Builder.build(DriverService.java:355)
    at org.openqa.selenium.firefox.FirefoxDriver.toExecutor(FirefoxDriver.java:190)
    at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:147)
    at org.openqa.selenium.firefox.FirefoxDriver.<init>(FirefoxDriver.java:125)
    at SeleniumTest.main(SeleniumTest.java:14)
C:\Users\Ahmed\AppData\Local\NetBeans\Cache\8.2\executor-snippets\run.xml:53: Java returned: 1
BUILD FAILED (total time: 0 seconds)

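Is there any help, sample code, or an alternative approach? I only need to get the full page and then use the crawler I already have. Alternatively, I could write a completely new crawler, but I cannot find code for that and I keep running into errors.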
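
For reference, the exception above says that the `webdriver.gecko.driver` system property was not set before `FirefoxDriver` was created. A minimal sketch of setting it, assuming geckodriver has already been downloaded; the class name `GeckoDriverSetup`, the `configure` helper, and the default path are hypothetical, and the Selenium call is left as a comment so the sketch compiles without Selenium on the classpath:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class GeckoDriverSetup {
    // Property name taken from the exception message in the question
    static final String DRIVER_PROPERTY = "webdriver.gecko.driver";

    /** Checks that the driver file exists, then points Selenium at it. */
    static String configure(String driverPath) {
        Path path = Paths.get(driverPath).toAbsolutePath();
        if (!Files.isRegularFile(path)) {
            throw new IllegalArgumentException("geckodriver not found at " + path);
        }
        System.setProperty(DRIVER_PROPERTY, path.toString());
        return path.toString();
    }

    public static void main(String[] args) {
        // Hypothetical location; pass the real path as the first argument
        String driverPath = args.length > 0 ? args[0] : "C:\\tools\\geckodriver.exe";
        configure(driverPath);
        // With Selenium on the classpath, the driver can be created from here on:
        // WebDriver driver = new FirefoxDriver();
    }
}
```

The property has to be set before the driver constructor runs. If ChromeDriver is used instead (the crawler above imports the chrome package), the corresponding property is `webdriver.chrome.driver`.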

【Comments】:

Tags: java selenium-webdriver web-crawler jsoup

【Solution 1】:

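I would try to continue with this approach without Selenium. Using the web browser's debugger and its Network tab, you can see every request the browser sends.

It is useful to watch what happens when you click "show more". You can see that the next page is loaded from this URL: https://www.dailystrength.org/search/ajax?query=aspirin&type=discussion&page=2&_=1549130275261 You can get further pages by changing the page=2 parameter. Unfortunately, the result is JSON containing escaped HTML, so you will have to parse it with a JSON library, extract the HTML, and then parse that with Jsoup. The good news is that this JSON also contains a field "has_more":true, so you will know whether there is more content to load.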
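
The paging loop described above could be sketched roughly as follows. This is only an outline under assumptions: `AjaxPager`, `Page`, `pageUrl` and `collectFragments` are hypothetical names, the `_` parameter is left out on the assumption that it is just a cache-busting timestamp, and the actual download and JSON decoding are passed in as a function because the name of the JSON field holding the HTML is not given here.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

class AjaxPager {
    /** Decoded form of one AJAX response: the HTML fragment plus the "has_more" flag. */
    static final class Page {
        final String html;
        final boolean hasMore;

        Page(String html, boolean hasMore) {
            this.html = html;
            this.hasMore = hasMore;
        }
    }

    /** Builds the AJAX URL for one result page (without the "_" parameter). */
    static String pageUrl(String query, int page) {
        try {
            return "https://www.dailystrength.org/search/ajax?query="
                    + URLEncoder.encode(query, "UTF-8")
                    + "&type=discussion&page=" + page;
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    /**
     * Requests pages starting at firstPage until has_more is false or
     * maxPages pages have been fetched, and returns the HTML fragments.
     */
    static List<String> collectFragments(IntFunction<Page> fetchPage, int firstPage, int maxPages) {
        List<String> fragments = new ArrayList<>();
        for (int page = firstPage; page < firstPage + maxPages; page++) {
            Page result = fetchPage.apply(page);
            fragments.add(result.html);
            if (!result.hasMore) {
                break; // the server reported that nothing is left
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        // Stand-in fetcher for demonstration only. A real one might download
        // Jsoup.connect(pageUrl(q, page)).ignoreContentType(true).execute().body(),
        // decode the JSON with a JSON library, and return the HTML and has_more.
        IntFunction<Page> fake = page ->
                new Page("<div class=\"newsfeed__item\">page " + page + "</div>", page < 4);

        System.out.println(pageUrl("aspirin", 2));
        System.out.println(collectFragments(fake, 2, 50).size() + " fragments collected");
    }
}
```

Each returned fragment can then be handed to `Jsoup.parseBodyFragment(...)` and processed with the same class selectors as the existing crawler. The `maxPages` cap is a safety net in case `has_more` never turns false.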

【Comments】:

See this related post.