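【Posted】: 2018-03-14 22:21:36
【Question】:
I am learning how to scrape with HtmlUnit in Java 8, and I am trying to deploy an application to Google App Engine that will scrape certain websites from time to time. I am developing the application in Eclipse, and it works as expected when run locally, but after deploying to GAE my application can no longer connect to any website.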
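import com.gargoylesoftware.htmlunit.CookieManager;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Inside HelloAppEngine.doGet(...):
try (final WebClient webClient = new WebClient()) {
    // Fall back to port 80 when the parent CookieManager reports no port (-1).
    webClient.setCookieManager(new CookieManager() {
        @Override
        protected int getPort(final java.net.URL url) {
            final int r = super.getPort(url);
            return r != -1 ? r : 80;
        }
    });
    final HtmlPage page = webClient.getPage("https://www.google.com");
} catch (Exception e) {
    System.out.println(e.getMessage());
}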
The error occurs at "webClient.getPage(....)":
java.net.UnknownHostException: www.google.com
Partial stack trace:
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: java.lang.RuntimeException: java.net.UnknownHostException: www.recreation.gov
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at com.gargoylesoftware.htmlunit.UrlFetchWebConnection.getResponse(UrlFetchWebConnection.java:162)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1394)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1312)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:396)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:317)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:465)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:450)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at pack.HelloAppEngine.doGet(HelloAppEngine.java:49)
[s~permitseacherbpd/20180314t161057.408306947286449649].<stderr>: at javax.servlet.http.HttpServlet.service(HttpServlet.java:687)
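This error occurs for every website I try to access, and it is not unique to HtmlUnit, since I have run into it before in other projects. Why can't the application connect after being deployed to App Engine?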
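The stack trace shows the `UnknownHostException` wrapped in a `RuntimeException`, while the catch block in the question prints only `e.getMessage()`. A minimal sketch of a helper that walks the cause chain so the innermost exception can be logged; the class name `RootCause` and its method `of` are hypothetical and not part of HtmlUnit or the original code:

```java
import java.net.UnknownHostException;

// Hypothetical helper (not from the original post): finds the innermost
// cause of an exception, e.g. the UnknownHostException that HtmlUnit
// wrapped in a RuntimeException in the stack trace above.
final class RootCause {

    /** Follows getCause() until it reaches the last exception in the chain. */
    static Throwable of(Throwable t) {
        Throwable current = t;
        // Stop on null or on a self-referencing cause to avoid looping forever.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Rebuild the shape of the logged failure without touching the network.
        Exception wrapped =
                new RuntimeException(new UnknownHostException("www.recreation.gov"));
        Throwable root = of(wrapped);
        System.out.println(root.getClass().getName() + ": " + root.getMessage());
    }
}
```

In the catch block, logging `RootCause.of(e)` instead of `e.getMessage()` would make it easier to tell a name-resolution failure from other connection errors.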
【Comments】:
-
Which type of GAE container are you deploying to? Does it allow this behavior (initiating outgoing connections)?
-
HtmlUnit (HttpClient) does not use the JVM's proxy settings. You have to provide this information to HtmlUnit itself (see htmlunit.sourceforge.net/gettingStarted.html). Maybe that is the cause.
-
App Engine for Java 8 does not require a proxy. Using the existing java.net tools, I can use ProxySelector to verify that App Engine uses a DIRECT connection, and I can fetch content from https://www.google.com.
```
response.getWriter().append("\nProxies for google.com:");
for (Proxy proxy : ProxySelector.getDefault().select(URI.create("https://www.google.com"))) {
    response.getWriter().append("\n  " + proxy);
}
response.getWriter().append("\n" + URI.create("https://www.google.com").toURL().getContent());
```
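The proxy check in this comment depends on a servlet `response`. A standalone sketch of the same idea that can be run locally or adapted into a servlet is shown here; the class name `ProxyCheck` and the method `proxiesFor` are hypothetical, and only standard `java.net` calls are used. It performs no network request and only reports which proxies the JVM would select:

```java
import java.net.Proxy;
import java.net.ProxySelector;
import java.net.URI;
import java.util.Collections;
import java.util.List;

// Hypothetical standalone variant of the proxy check from the comment above.
final class ProxyCheck {

    /** Returns the proxies the JVM would use for the given absolute URI. */
    static List<Proxy> proxiesFor(String uri) {
        ProxySelector selector = ProxySelector.getDefault();
        if (selector == null) {
            // No selector installed: java.net falls back to a direct connection.
            return Collections.singletonList(Proxy.NO_PROXY);
        }
        // The URI must be absolute (scheme and host), e.g. https://www.google.com.
        return selector.select(URI.create(uri));
    }

    public static void main(String[] args) {
        String target = args.length > 0 ? args[0] : "https://www.google.com";
        System.out.println("Proxies for " + target + ":");
        for (Proxy proxy : proxiesFor(target)) {
            System.out.println("  " + proxy);
        }
    }
}
```

Note that this only reflects the JVM's `java.net` configuration. As the earlier comment points out, HtmlUnit may need its proxy settings supplied separately, so a direct result here does not by itself rule out a connection problem inside HtmlUnit.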
Tags: java eclipse google-app-engine htmlunit