Jsoup crawler and HTTP error fetching URL

【Question title】: Jsoup crawler and HTTP error fetching URL
【Posted】: 2018-09-11 14:46:48
【Question】:

I am writing a crawler with Jsoup, and this is the HTTP error I get:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:760)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:757)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:706)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:299)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:288)
at testing.DefinitelyNotSpiderLeg.crawl(DefinitelyNotSpiderLeg.java:31)
at testing.DefinitelyNotSpider.search(DefinitelyNotSpider.java:33)
at testing.Test.main(Test.java:9)

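I have read all the similar questions about this error and applied their suggested fixes to my code, but I still get the same error when Jsoup connects to the URL.

This is the method I use to crawl: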

public boolean crawl(String url)
{
    try
    {
        Document htmlDocument = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.112 Safari/535.1")
                .referrer("http://www.google.com")
                .timeout(1000 * 5) // in milliseconds, so this means 5 seconds
                .get();

        Elements linksOnPage = htmlDocument.select("a[href]");

        for (Element link : linksOnPage)
        {
            // "abs:href" resolves the attribute value against the page's base URL
            String a = link.attr("abs:href");

            if (a.startsWith(url)) {
                this.links.add(a);
            }
        }

    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return true;
}

Any ideas?

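【Comments】:

I see the URL in the exception is mkyong.com/spring-boot/spring-boot-hibernate-search-example/…. Is that the URL being passed in?
Well, the exception is pretty clear: the server cannot find a resource at https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/. How are you using this method such that you end up with a call like this?
Since it is an https connection, you are dealing with SSL. stackoverflow.com/questions/7744075/…
I collect all the URLs from one web page, mkyong.com, and then I "crawl" each URL I collected. I guess one of those links is "mkyong.com/spring-boot/spring-boot-hibernate-search-example/…".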

Tags: java http web-crawler jsoup http-status-code-404

【Answer 1】:

It is because the URL is incorrect.

Your code is requesting the URL https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/

This is visible in the first line of the stack trace:

org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404, URL=https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/

That URL does not actually exist, hence the 404 :-)
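To find out which page a bad link like this came from, one option is to record, for every queued URL, the page it was discovered on, and to report that page when the fetch fails. The following is a minimal sketch of that idea, not the asker's crawler: `TracingCrawler`, `LinkFetcher`, and `maxPages` are hypothetical names introduced here, and the actual fetching is hidden behind the `LinkFetcher` interface so the example uses only the standard library.

```java
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TracingCrawler {

    /** Hypothetical abstraction: returns the absolute links found on a page. */
    interface LinkFetcher {
        List<String> fetchLinks(String url) throws IOException;
    }

    /**
     * Breadth-first crawl starting at seed. Returns a map from each URL that
     * failed to load to the page on which that URL was discovered.
     */
    static Map<String, String> crawl(String seed, LinkFetcher fetcher, int maxPages) {
        Map<String, String> foundOn = new HashMap<>();        // url -> referring page
        Map<String, String> failures = new LinkedHashMap<>(); // failed url -> referring page
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);

        int visited = 0;
        while (!queue.isEmpty() && visited < maxPages) {
            String url = queue.poll();
            visited++;
            try {
                for (String link : fetcher.fetchLinks(url)) {
                    if (seen.add(link)) {          // true only the first time we see this link
                        foundOn.put(link, url);
                        queue.add(link);
                    }
                }
            } catch (IOException e) {
                // Keep crawling, but remember where the broken link was found.
                failures.put(url, foundOn.getOrDefault(url, "(seed)"));
            }
        }
        return failures;
    }
}
```

In a Jsoup-based crawler, the `LinkFetcher` implementation would presumably wrap `Jsoup.connect(url).get()` and collect `abs:href` values. Jsoup's `HttpStatusException` is a subclass of `IOException`, so a 404 would land in the `catch` branch above and be reported together with its referring page.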

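【Comments】:

【Answer 2】:

The problem is not in your code. The problem is a link that exists in the web page you are parsing.

This is the original page, which contains further links. When you crawl it, you get all of those links back: https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/

If you inspect that page carefully, you will find a hyperlink labelled "official website".

The markup of that hyperlink is: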
<a href="“http://wildfly.org/downloads/“" target="“_blank”">official website</a>

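This href is what causes the problem: it has extra curly quotes inside the attribute value. Because the value starts with a quote character instead of a scheme, it is treated as a relative link and appended to the base URL, with the leading quote percent-encoded as %E2%80%9C. The result is: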
https://www.mkyong.com/spring-boot/spring-boot-hibernate-search-example/%E2%80%9Chttp:/wildfly.org/downloads/
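The reasoning above can be checked in isolation. The sketch below is a simplified illustration and not Jsoup's actual resolution code: `SchemeCheck` and `looksAbsolute` are hypothetical names, and the regular expression encodes only the RFC 3986 rule that a scheme must start with an ASCII letter. It also shows that the UTF-8 percent-encoding of the left curly quote (U+201C) is the `%E2%80%9C` seen in the failing URL. It assumes Java 10 or later for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

final class SchemeCheck {

    // RFC 3986 scheme: a letter, then letters, digits, "+", "-" or ".", then ":".
    private static final Pattern ABSOLUTE =
            Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]*:.*", Pattern.DOTALL);

    /** Returns true if the reference starts with a syntactically valid scheme. */
    static boolean looksAbsolute(String href) {
        return ABSOLUTE.matcher(href).matches();
    }

    public static void main(String[] args) {
        String good = "http://wildfly.org/downloads/";
        // The href from the page, wrapped in left curly quotes (U+201C).
        String bad = "\u201Chttp://wildfly.org/downloads/\u201C";

        System.out.println(looksAbsolute(good)); // true
        System.out.println(looksAbsolute(bad));  // false, so it is handled as a relative link
        System.out.println(URLEncoder.encode("\u201C", StandardCharsets.UTF_8)); // %E2%80%9C
    }
}
```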

This is the URL that Jsoup ends up requesting. To solve this while crawling, sanitize each link before following it: strip the stray characters and recover the intended URL, http://wildfly.org/downloads/, from the mangled one, or handle the failure in your code. Hope this gives you a better idea.
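The clean-up step described above could look roughly like the following. This is a sketch under stated assumptions, not a definitive fix: `HrefSanitizer`, `stripQuotes`, and `toHttpUri` are hypothetical names, the set of characters to strip is a guess at what typically leaks into hand-edited HTML, and only absolute http/https links are accepted.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

final class HrefSanitizer {

    // Straight and curly quote characters to strip from both ends (assumed set).
    private static final String QUOTES = "\"'\u201C\u201D\u2018\u2019";

    private static boolean isJunk(char c) {
        return Character.isWhitespace(c) || QUOTES.indexOf(c) >= 0;
    }

    /** Removes leading and trailing whitespace and quote characters. */
    static String stripQuotes(String raw) {
        int start = 0;
        int end = raw.length();
        while (start < end && isJunk(raw.charAt(start))) start++;
        while (end > start && isJunk(raw.charAt(end - 1))) end--;
        return raw.substring(start, end);
    }

    /** Returns the cleaned link if it is an absolute http(s) URL, otherwise empty. */
    static Optional<URI> toHttpUri(String rawHref) {
        String cleaned = stripQuotes(rawHref);
        try {
            URI uri = new URI(cleaned);
            String scheme = uri.getScheme();
            boolean http = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return (http && uri.getHost() != null) ? Optional.of(uri) : Optional.empty();
        } catch (URISyntaxException e) {
            return Optional.empty(); // still not a usable URL after cleaning
        }
    }
}
```

In the crawler from the question, this would presumably be applied to the raw attribute, `link.attr("href")`, before falling back to `link.attr("abs:href")` for links that really are relative. Checking the raw value first matters because, once the quoted href has been resolved against the base URL, the intended target is buried inside the path.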

【Comments】: