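【Posted】: 2017-01-19 05:18:49
【Question】:
First of all, I should say that I don't know JavaScript at all. I am trying to simulate a click on a hyperlink on a Bloomberg page. I want to get the list of news items (hyperlinks), then simply iterate over the list and get each article's title and text. Here is my code: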
public List<String> getBloomNewsHtmlUnit() throws IOException {
    String searchString = "Apple";
    List<String> bloombergNewsAll = new ArrayList<>();
    WebClient webclient = new WebClient(BrowserVersion.BEST_SUPPORTED);
    HtmlPage mainpage = webclient.getPage("http://www.bloomberg.com/search?query=" + searchString);
    HtmlAnchor pageanchor = mainpage.getFirstByXPath("//*[@id=\"content\"]/div/section/section[2]/section[1]/div[2]/div[2]/article/div[1]/h1/a");
    webclient.waitForBackgroundJavaScript(50000);
    webclient.getOptions().setThrowExceptionOnScriptError(false);
    webclient.getOptions().setThrowExceptionOnFailingStatusCode(false);
    webclient.setCssErrorHandler(new SilentCssErrorHandler());
    mainpage = pageanchor.click();
    System.out.println("Main page: " + mainpage.asText());
    return bloombergNewsAll;
}
This is the exception:
Sep 11, 2016 9:49:34 AM com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
SEVERE: runtimeError: message=[An invalid or illegal selector was specified (selector: '*,:x' error: Invalid selector: :x).] sourceName=[https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js] line=[153] lineSource=[null] lineOffset=[0]
Exception in thread "main" java.lang.RuntimeException: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:284)
at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:519)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:386)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:304)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:451)
at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:436)
at com.jsoup.test.BloombergTest.getBloomNewsHtmlUnit(BloombergTest.java:71)
at com.jsoup.test.BloombergTest.main(BloombergTest.java:37)
Caused by: com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:921)
at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:803)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.execute(JavaScriptEngine.java:779)
at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:975)
at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:352)
at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:238)
at com.gargoylesoftware.htmlunit.html.HtmlPage.initialize(HtmlPage.java:277)
... 7 more
Caused by: net.sourceforge.htmlunit.corejs.javascript.EcmaError: TypeError: Cannot call method "split" of undefined (https://assets.bwbx.io/business/public/javascripts/application-6e1529c288.js#79)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3915)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.constructError(ScriptRuntime.java:3899)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError(ScriptRuntime.java:3924)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.typeError2(ScriptRuntime.java:3940)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.undefCallError(ScriptRuntime.java:3956)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThisHelper(ScriptRuntime.java:2390)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.getPropFunctionAndThis(ScriptRuntime.java:2384)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1342)
at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105)
at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413)
at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252)
at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264)
at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.exec(InterpretedFunction.java:115)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$3.doRun(JavaScriptEngine.java:794)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:906)
... 15 more
Java Result: 1
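Even if I run only the first 4 lines of the code (without referencing HtmlAnchor), the same error occurs. I have read several bug reports online about this error, but none of the suggested solutions seems to apply to my case: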
htmlunit : An invalid or illegal selector was specified
Following the Stack Overflow question above, I applied the suggested waitForBackgroundJavaScript to the webclient, but that did not solve the problem.
JavaScript Exception in HtmlUnit when clicking at google result page
Following this question, I tried adding:
JavaScriptEngine engine = webclient.getJavaScriptEngine();
engine.holdPosponedActions();
to the code, but the error persists.
https://sourceforge.net/p/htmlunit/bugs/1744/
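In the bug report above, the suggested fix is to reassign the main page from the result of the select query. In my case, I am trying to reassign the page from the result of click(). My code does not get that far: the error is thrown as soon as I try to obtain the HtmlPage.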
https://sourceforge.net/p/htmlunit/bugs/1661/
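This report suggests simply ignoring the warning, but in my case I get an exception (not just a warning), which prevents the desired output.
I first tried scraping with Jsoup. That worked well, but compared with what I see when I inspect the page in Chrome, Jsoup returned some incorrect links between the article text that are not on the original page. I suspect a JS or Ajax call modifies the page DOM. That is why I chose HtmlUnit.
I would appreciate it if someone could tell me what I am doing wrong to get this error and how to fix it. Also, if anyone thinks what I want can be achieved with Jsoup alone, please let me know (I have just read that Jsoup does not support dynamic changes to the DOM, so it may not work on its own). Thanks in advance!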
【Comments】:
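-
Not directly related to your question, but why are you setting SilentCssErrorHandler? Most likely you don't need CSS at all, so you can disable it: webClient.getOptions().setCssEnabled(false);
-
Are you sure your XPath is correct? Try logging the value of pageanchor, e.g. System.err.println(pageanchor.asXml());
-
Thanks for the helpful tips, Mr. Smith. I removed SilentCssErrorHandler. It seems pageanchor cannot be logged: the exception occurs at the getPage statement, before the pageanchor statement. Apart from the stack trace, the application outputs nothing. In fact, if I remove every other line and just call getPage, I get exactly the same exception. Does this indicate some JS library conflict between HtmlUnit and the page? In that case, wouldn't the exception always occur regardless of whether the pageanchor XPath is correct?
-
The XPath is correct (for the second headline). Yes, HtmlUnit's engine is limited (and Rhino is also fairly slow), so this is a conflict between HtmlUnit and the JS used on the page. My usual approach: open the page in a browser with JS disabled. If all the needed content is there, I use jsoup; otherwise I try HtmlUnit. If HtmlUnit fails, I use PhantomJS, although it is not pure Java.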
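The fallback order described in the last comment (jsoup first, then HtmlUnit, then PhantomJS) could be sketched roughly as below. This is only an illustration of the control flow, not anyone's actual implementation: FetchFallback, Strategy, Result and fetchWithFallback are hypothetical names, and the three strategies in main are stand-in lambdas that do no network access and call none of the real libraries.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Tries scraping strategies in order, from cheapest to most capable (hypothetical helper). */
final class FetchFallback {

    /** A named strategy: maps a URL to extracted content, or empty if the content was not found. */
    static final class Strategy {
        final String name;
        final Function<String, Optional<String>> fetch;

        Strategy(String name, Function<String, Optional<String>> fetch) {
            this.name = name;
            this.fetch = fetch;
        }
    }

    /** Records which strategy succeeded and what it returned. */
    static final class Result {
        final String strategy;
        final String content;

        Result(String strategy, String content) {
            this.strategy = strategy;
            this.content = content;
        }
    }

    /** Returns the first successful result, or empty if every strategy fails. */
    static Optional<Result> fetchWithFallback(String url, List<Strategy> strategies) {
        for (Strategy s : strategies) {
            Optional<String> content;
            try {
                content = s.fetch.apply(url);
            } catch (RuntimeException e) {
                // A tool that throws (for example on a script error) counts as a failure.
                content = Optional.empty();
            }
            if (content.isPresent()) {
                return Optional.of(new Result(s.name, content.get()));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-ins only: a real version would call jsoup, HtmlUnit and PhantomJS here.
        List<Strategy> chain = Arrays.asList(
            new Strategy("jsoup", url -> Optional.empty()),
            new Strategy("HtmlUnit", url -> { throw new RuntimeException("script error"); }),
            new Strategy("PhantomJS", url -> Optional.of("<rendered article text>")));

        fetchWithFallback("http://www.bloomberg.com/search?query=Apple", chain)
            .ifPresent(r -> System.out.println(r.strategy + ": " + r.content));
        // prints "PhantomJS: <rendered article text>"
    }
}
```

Treating an exception the same as "content not found" keeps the chain moving when a tool chokes on the page's scripts, which is the situation described in the question.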
Tags: javascript jquery html jsoup htmlunit