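HtmlUnit is an open-source Java tool for analyzing web pages: once it has loaded a page, you can use the HtmlUnit API to inspect the page's content. It simulates a running browser and has been called an open-source Java implementation of a browser. Because it is headless (it has no user interface), it also runs fast.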

2. Download: http://sourceforge.net/projects/htmlunit/?source=directory

3. Fetching a page

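  The first problem a web crawler faces is how to fetch a page. Fetching is simpler than it may seem: with the open-source HtmlUnit package, four main lines of code are enough.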

import java.io.IOException;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Main {

    public static void main(String[] args) throws IOException {
        // WebClient is the headless browser; one instance can load many pages.
        final WebClient webClient = new WebClient();
        try {
            // Fetch the URL and parse the response into an HTML DOM.
            final HtmlPage page = webClient.getPage("http://www.baidu.com");
            // Print the visible text of the page, without markup.
            System.out.println(page.asText());
        } finally {
            // Release windows and background JavaScript threads, even if getPage fails.
            // Newer HtmlUnit releases replace closeAllWindows() with close().
            webClient.closeAllWindows();
        }
    }
}

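Output (the log lines are warnings from HtmlUnit's JavaScript and CSS handling; the page text is the last five lines):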

二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
严重: runtimeError: message=[An invalid or illegal selector was specified (selector: ':checked' error: Invalid selector: *:checked).] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[14] lineSource=[null] lineOffset=[0]
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
严重: runtimeError: message=[An invalid or illegal selector was specified (selector: ':enabled' error: Invalid selector: *:enabled).] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[14] lineSource=[null] lineOffset=[0]
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.javascript.StrictErrorReporter runtimeError
严重: runtimeError: message=[The data necessary to complete this operation is not yet available.] sourceName=[http://s1.bdstatic.com/r/www/cache/static/jquery/jquery-1.10.2.min_f2fb5194.js] line=[10] lineSource=[null] lineOffset=[0]
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [1:81] Error in expression. (Invalid token ";". Was expecting one of: <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <DIMENSION>, <URI>, <FUNCTION>, "-".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [1:143] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
警告: CSS warning: 'http://www.baidu.com/' [1:143] Ignoring the following declarations in this rule.
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [1:339] Error in expression. (Invalid token ";". Was expecting one of: <S>, <NUMBER>, "inherit", <IDENT>, <STRING>, <PLUS>, <HASH>, <EMS>, <EXS>, <LENGTH_PX>, <LENGTH_CM>, <LENGTH_MM>, <LENGTH_IN>, <LENGTH_PT>, <LENGTH_PC>, <ANGLE_DEG>, <ANGLE_RAD>, <ANGLE_GRAD>, <TIME_MS>, <TIME_S>, <FREQ_HZ>, <FREQ_KHZ>, <PERCENTAGE>, <DIMENSION>, <URI>, <FUNCTION>, "-".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [2:204] Error in declaration. (Invalid token "normal". Was expecting one of: <S>, ":".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [2:970] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
警告: CSS warning: 'http://www.baidu.com/' [2:970] Ignoring the following declarations in this rule.
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [4:856] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
警告: CSS warning: 'http://www.baidu.com/' [4:856] Ignoring the following declarations in this rule.
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [4:1016] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
警告: CSS warning: 'http://www.baidu.com/' [4:1016] Ignoring the following declarations in this rule.
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [5:68] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
警告: CSS warning: 'http://www.baidu.com/' [5:68] Ignoring the following declarations in this rule.
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [6:751] Error in style rule. (Invalid token "*". Was expecting one of: <EOF>, <S>, <IDENT>, "}", ";".)
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler warning
警告: CSS warning: 'http://www.baidu.com/' [6:751] Ignoring the following declarations in this rule.
二月 03, 2015 11:46:02 上午 com.gargoylesoftware.htmlunit.DefaultCssErrorHandler error
警告: CSS error: 'http://www.baidu.com/' [8:127] Error in expression; ':' found after identifier "progid".
二月 03, 2015 11:46:03 上午 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
警告: Obsolete content type encountered: 'text/javascript'.
二月 03, 2015 11:46:03 上午 com.gargoylesoftware.htmlunit.IncorrectnessListenerImpl notify
警告: Obsolete content type encountered: 'text/javascript'.
百度一下,你就知道
百度一下
新闻hao123地图视频贴吧登录设置更多产品
把百度设为主页关于百度About Baidu
©2015 Baidu 使用百度前必读 京ICP证030173号
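
Most of the output above is logging noise, not page content. The timestamp-plus-level layout suggests the messages are emitted through java.util.logging, so one way to hide them is to turn off the logger for the HtmlUnit package before creating the WebClient. The following is a minimal sketch under that assumption; the class name QuietHtmlUnitLogging and the method silence() are made up for illustration, and if your project routes logging through a different backend (for example log4j), the level has to be set there instead.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper, not part of HtmlUnit: it only touches java.util.logging.
final class QuietHtmlUnitLogging {

    // Keep a strong reference: java.util.logging holds named loggers weakly,
    // so a level set on an otherwise unreferenced logger could be lost after GC.
    private static final Logger HTMLUNIT_LOGGER =
            Logger.getLogger("com.gargoylesoftware.htmlunit");

    /** Turns off java.util.logging output for the HtmlUnit package and its sub-packages. */
    static Logger silence() {
        HTMLUNIT_LOGGER.setLevel(Level.OFF);
        return HTMLUNIT_LOGGER;
    }

    public static void main(String[] args) {
        Logger parent = silence();
        // Loggers named after classes in the package inherit the parent's level.
        Logger child = Logger.getLogger(
                "com.gargoylesoftware.htmlunit.DefaultCssErrorHandler");
        System.out.println("parent level: " + parent.getLevel());
        System.out.println("child logs SEVERE: " + child.isLoggable(Level.SEVERE));
    }
}
```

Calling QuietHtmlUnitLogging.silence() as the first statement of main in the example above should leave only the page text on standard output. This hides the warnings rather than fixing their cause; if the crawler does not need styles or scripts, HtmlUnit's WebClientOptions (reached through webClient.getOptions()) also offers setCssEnabled(false) and setJavaScriptEnabled(false), which avoid parsing that content in the first place.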