Application crawling using Selenium

【Question title】: Application crawling using Selenium
【Posted】: 2014-09-02 18:17:59
【Question】:

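I am trying to visit every link on a website and check its status (HTTP 200, 500, and so on). I am having trouble handling the new windows that some links open when clicked: a few links open in a new window, while others open in the same window. How can I detect a new window, switch to it, and then return to the main window? This is my code so far: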

import java.util.ArrayList;
import java.util.List;

import org.openqa.selenium.By;
import org.openqa.selenium.StaleElementReferenceException;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.htmlunit.HtmlUnitDriver;

public class TestLink {
    // list to save visited links (keyed by link text)
    static List<String> links = new ArrayList<String>();
    WebDriver driver;

    public TestLink(WebDriver driver) {
        this.driver = driver;
    }

    public void linkTest() {
        try {
            // loop over all the a elements in the page
            for (WebElement link : driver.findElements(By.tagName("a"))) {
                // Check if link is displayed and not previously visited
                if (link.isDisplayed()
                        && !links.contains(link.getText())) {
                    // add link to list of links already visited
                    links.add(link.getText());
                    System.out.println(link.getText());
                    // click on the link. This opens a new page
                    link.click();
                    // call linkTest on the new page
                    new TestLink(driver).linkTest();
                }
            }
            driver.navigate().back();
        } catch (StaleElementReferenceException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        WebDriver driver = new HtmlUnitDriver();
        driver.get("http://www.flipkart.com/");
        // start recursive linkTest
        new TestLink(driver).linkTest();
    }
}

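Edit

The following code works for a URL given as a string, but I want the status code of every link on the website. How can I construct the URL of each link dynamically?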

// Requires java.io.IOException and com.gargoylesoftware.htmlunit.WebClient (HtmlUnit).
public static int getResponseCode(String url) {
    if (url == null) {
        return 0; // no URL to request
    }
    try {
        WebClient client = new WebClient();
        // report 4xx/5xx codes instead of throwing on them
        client.getOptions().setThrowExceptionOnFailingStatusCode(false);
        return client.getPage(url).getWebResponse().getStatusCode();
    } catch (IOException ioe) {
        throw new RuntimeException(ioe);
    }
}

【Question comments】:

Tags: java selenium web-crawler

【Solution 1】:

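I don't think this is what Selenium is intended for, so I would discourage using it this way. You are better off with an HTTP library, or even a command-line utility such as cURL or wget, combined with an HTML parser. Besides, WebDriver does not expose any information about HTTP status codes. I would also ask: the status code of what? When you load a new page, dozens of different resources are usually downloaded, and each one has its own status code.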
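
As a rough illustration of the HTTP-library route, here is a minimal sketch that uses only the JDK. The class and method names (`LinkStatusSketch`, `extractLinks`, `statusOf`) and the example.com URLs are made up for this example, and the regular expression is only a crude stand-in for a real HTML parser. It also shows one way to build each link's absolute URL from its `href`, by resolving it against the URL of the page it was found on.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkStatusSketch {

    // Crude href matcher, used only to keep the sketch dependency-free;
    // a real HTML parser is more robust.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    /** Collects the distinct absolute http(s) URLs linked from the given HTML. */
    static Set<String> extractLinks(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(1).trim();
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash); // drop the fragment
            }
            if (href.isEmpty()) {
                continue; // pure in-page anchor such as "#top"
            }
            try {
                // resolve relative hrefs against the page they appeared on
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // skip hrefs that are not valid URIs
            }
        }
        return links;
    }

    /** Sends a HEAD request and returns the HTTP status code of the response. */
    static int statusOf(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Offline demo: the HTML would normally be the body of a fetched page.
        String html = "<a href=\"/cart\">Cart</a> <a href='item?id=1'>Item</a>";
        for (String link : extractLinks("http://example.com/shop/index.html", html)) {
            System.out.println(link); // pass each link to statusOf(link) for its status
        }
    }
}
```

Some servers answer HEAD differently from GET, so switching `statusOf` to a GET request may be necessary for some sites.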

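That said, if you still want to do this with Selenium, then yes: collect all visible links into a List<WebElement> and click each one. When a link opens in a new window, you need to record the parent window handle before clicking, wait for the new window handle after the click, switch to it, do whatever you need to do, close that window, and then start the next iteration of the loop. See the WebDriver documentation; you will probably be interested in getWindowHandles() and switchTo().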
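
The "wait for the new window handle" step comes down to comparing the set of handles before and after the click. Below is a minimal sketch of that comparison in plain Java; `WindowHandleSketch` and `findNewWindow` are hypothetical names, and the handle strings are placeholders. With Selenium, the two sets would come from `driver.getWindowHandles()`, as noted in the comments.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

class WindowHandleSketch {

    /**
     * Returns a handle that is present in `after` but not in `before`, if any.
     * An empty result means the click did not open a new window.
     */
    static Optional<String> findNewWindow(Set<String> before, Set<String> after) {
        Set<String> opened = new LinkedHashSet<>(after);
        opened.removeAll(before);
        return opened.stream().findFirst();
    }

    public static void main(String[] args) {
        // Placeholder handles; real ones are opaque strings produced by the driver.
        Set<String> before = new LinkedHashSet<>(Arrays.asList("main"));
        Set<String> after = new LinkedHashSet<>(Arrays.asList("main", "popup"));

        // With Selenium the surrounding flow would be roughly:
        //   String parent = driver.getWindowHandle();
        //   Set<String> before = driver.getWindowHandles();
        //   link.click();
        //   Set<String> after = driver.getWindowHandles();
        //   on a new handle h: driver.switchTo().window(h); ... driver.close();
        //   then driver.switchTo().window(parent);
        Optional<String> opened = findNewWindow(before, after);
        System.out.println(opened.isPresent()
                ? "new window: " + opened.get()
                : "link opened in the same window");
    }
}
```

If no new handle appears, the link loaded in the current window, and navigating back is enough to continue with the next link.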

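For the status codes, you need to route your traffic through a proxy. I would use BrowserMob. It can capture performance data in HAR format, which contains every request and response along with its status code, so you can do something like this: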

// `server` is the running BrowserMob proxy that the browser's traffic goes through
Har har = server.getHar();
if (har == null) return;
for (HarEntry entry : har.getLog().getEntries()) {
    int status = entry.getResponse().getStatus();
    if (status >= 400 && status < 600) {
        // 4xx or 5xx response: take action
    }
}

【Comments】:

Thanks @Erki M. Please take a look at my edited question and let me know if you have any ideas about it.