【Posted at】: 2014-02-28 22:02:44
【Problem description】:
I have written a crawler in Java with the help of jSoup, and I am looking for suggestions to make it better and bug-free. In particular, please help me understand the final for-each loop, `for (Element link : questions)`: does it collect all the links on the page and then crawl each of them, or does it only find the first link and crawl into that one?
Thanks in advance.
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class crawler_html {
    public static db_Connection db = new db_Connection();

    public crawler_html(String url) throws SQLException, IOException {
        // db.runSql2("TRUNCATE Record;");
        processPage(url);
    }

    public static void processPage(String url) throws SQLException, IOException {
        // Check whether the given URL is already in the database
        String sql = "select * from crawler where URL = '" + url + "'";
        ResultSet rs = db.runSql(sql);
        if (rs.next()) {
            System.out.println("URL Found");
            // URL was already crawled; decide here what to do next
        } else {
            System.out.println("Store the URL to database");
            // Store the URL in the database to avoid parsing it again
            sql = "INSERT INTO crawler (URL) VALUES ('" + url + "')";
            PreparedStatement stmt = db.conn.prepareStatement(sql);
            if (stmt != null) {
                stmt.execute();
                System.out.println("Executed well");
            }

            // Fetch and parse the page; timeout(0) means wait indefinitely
            Document doc = Jsoup.connect(url)
                    .userAgent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1453.110 Safari/537.36")
                    .timeout(0)
                    .ignoreHttpErrors(true)
                    .followRedirects(true)
                    .execute()
                    .parse();

            // Collect every link on the page and recursively process each one
            Elements questions = doc.select("a[href]");
            for (Element link : questions) {
                db.stmtclose();
                processPage(link.attr("abs:href"));
            }
        }
    }
}
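To answer the question about the loop in miniature: `doc.select("a[href]")` returns every anchor element on the page, and the for-each body runs once per link, recursing into each one in document order, so the crawl proceeds depth-first rather than stopping at the first link. Below is a minimal sketch of that behavior using only the standard library, with an in-memory map standing in for fetched pages and a `HashSet` playing the role of the database check; all names here (`CrawlOrderSketch`, `PAGES`, etc.) are illustrative and not part of the original code:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CrawlOrderSketch {
    // Simulated "web": each URL maps to the links found on its page
    static final Map<String, List<String>> PAGES = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("d"),
            "c", List.of(),
            "d", List.of());

    static final Set<String> visited = new HashSet<>();
    static final StringBuilder order = new StringBuilder();

    // Mirrors processPage: skip known URLs, otherwise record and recurse
    static void processPage(String url) {
        if (!visited.add(url)) {
            return; // already in the "database"
        }
        order.append(url);
        // The for-each visits EVERY link, not just the first one;
        // each iteration recurses fully before the next link is handled.
        for (String link : PAGES.getOrDefault(url, List.of())) {
            processPage(link);
        }
    }

    public static void main(String[] args) {
        processPage("a");
        System.out.println(order); // depth-first order: abdc
    }
}
```

The same reasoning applies to the real crawler: because the recursive call to `processPage` happens inside the loop, each branch of the link graph is fully explored before the next sibling link is visited, and the database lookup is what prevents the recursion from revisiting pages.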
【Discussion】:
Tags: java html jsoup web-crawler