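【Posted】:2016-05-12 03:41:48
【Question】:
I want to download the source code of a web page. I tried the URL approach, i.e. URL url=new URL("http://a.html");
and the Jsoup approach, but neither gives me exactly the data that appears in the actual page source. For example: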
<input type="image"
name="ctl00$dtlAlbums$ctl00$imbAlbumImage"
id="ctl00_dtlAlbums_ctl00_imbAlbumImage"
title="Independence Day Celebr..."
border="0"
onmouseover="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','0','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');"
onmouseout="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','1','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');"
src="Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG"
alt="Independence Day Celebr..."
style="height:79px;width:148px;border-width:0px;"
/>
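In this tag, the jsoup code does not detect the last attribute, "style". If I download the page with the URL approach instead, the style attribute is turned into border=""/>.
Can anyone tell me how to download the exact source code of a web page? My code is: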
URL url=new URL("http://www.apcob.org/");
InputStream is = url.openStream(); // throws an IOException
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
File fileDir = new File(contextpath+"\\extractedtxt.txt");
Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
while ((line = br.readLine()) != null)
{
// System.out.println("line\n "+line);
fw.write("\n"+line);
}
InputStream in = new FileInputStream(new File(contextpath+"extractedtxt.txt"));
String baseUrl="http://www.apcob.org/";
Document doc=Jsoup.parse(in,"UTF-8",baseUrl);
System.out.println(doc);
The second approach I tried is:
Document doc = Jsoup.connect(url_of_currentpage).get();
I want to do this in Java, and the site where this problem occurs is "http://www.apcob.org/".
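
One possible cause of the missing tail in the first approach, assuming the code runs as posted: fw is a BufferedWriter that is never flushed or closed, so whatever is still in its buffer when the file is reopened may not have reached the disk yet. Reading line by line and writing "\n"+line also changes the line breaks. The sketch below avoids both by copying the response bytes straight to a file. The class name RawPageDownloader and its two methods are hypothetical and not part of the original post.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper for illustration; not part of the original post. */
final class RawPageDownloader {

    private RawPageDownloader() {}

    /** Copies every byte of the stream to target unchanged and returns the byte count. */
    static long saveExact(InputStream in, Path target) throws IOException {
        // Files.copy writes the bytes as received: no decoding, no line splitting.
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /** Opens pageUrl and stores the response body in target. */
    static long download(String pageUrl, Path target) throws IOException {
        // try-with-resources closes the network stream even if the copy fails.
        try (InputStream in = new URL(pageUrl).openStream()) {
            return saveExact(in, target);
        }
    }
}
```

A call such as RawPageDownloader.download("http://www.apcob.org/", Paths.get("extractedtxt.txt")) would leave the served bytes in the file, which can then be compared with the browser's "view source". Note that printing a parsed jsoup Document re-serializes the markup, so that output is not expected to match the original text character for character even when the download itself is complete.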
【Discussion】:
Tags: javascript java html css jsoup