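【Posted】:2017-07-06 11:14:23
【Question】:
I have successfully connected Nutch 1.12 to Solr 6.5 and crawled sites that do not require authentication. When I try to crawl a site that requires authentication, I cannot get any further. Can anyone help me get past this?
Error: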
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: user-login
at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:485)
at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:180)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:261)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:295)
Caused by: java.lang.IllegalArgumentException: No form exists: user-login
at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:183)
at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:483)
httpclient-auth.xml:
<auth-configuration>
  <credentials authMethod="formAuth"
               loginUrl="<url>"
               loginFormId="user-login"
               loginRedirect="true">
    <loginPostData>
      <field name="name" value="*<name>*"/>
      <field name="pass" value="*<password>*"/>
      <field name="op" value="Log in"/>
    </loginPostData>
  </credentials>
</auth-configuration>
I have searched through several links but could not resolve it.
Thanks.
【Comments】:
- Add the following property to $NUTCH_HOME/conf/nutch-site.xml (ignore this if it is already there), then reply:
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-elastic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to include.</description>
  </property>
- Check your error log for details!
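
The exception in the question is thrown from HttpFormAuthentication.getLoginFormParams with the message "No form exists: user-login", which suggests that the page fetched from loginUrl contains no form matching the configured loginFormId. One way to check this outside Nutch is to list the forms the login page actually contains. The following is only a rough sketch under stated assumptions: it assumes loginFormId is compared against a form's id (or name) attribute, and SAMPLE is hypothetical markup, not the real site's login page. The helper names list_forms and has_form are made up for illustration.

```python
from html.parser import HTMLParser


class FormLister(HTMLParser):
    """Collect the id, name and action attributes of every <form> tag."""

    def __init__(self):
        super().__init__()
        self.forms = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "form":
            a = dict(attrs)
            self.forms.append(
                {"id": a.get("id"), "name": a.get("name"), "action": a.get("action")}
            )


def list_forms(html):
    """Return one dict per <form> element found in the HTML text."""
    parser = FormLister()
    parser.feed(html)
    return parser.forms


def has_form(html, form_id):
    """True if some form's id or name is exactly form_id (assumed matching rule)."""
    return any(form_id in (f["id"], f["name"]) for f in list_forms(html))


# Hypothetical login page markup, for illustration only.
SAMPLE = """
<html><body>
<form id="user-login-form" action="/user/login" method="post">
  <input name="name"><input name="pass">
  <input type="submit" name="op" value="Log in">
</form>
</body></html>
"""

if __name__ == "__main__":
    print(list_forms(SAMPLE))              # forms present on the page
    print(has_form(SAMPLE, "user-login"))  # does the configured id match?
```

To try this against the real site, save the HTML returned by the login URL to a file and pass its contents to list_forms. If the configured value does not appear among the listed ids or names, changing loginFormId in httpclient-auth.xml to one that does would be the first thing to try.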
Tags: authentication solr web-crawler nutch