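【Posted】:2012-03-09 22:09:29
【Question】:
I am building a simple web scraper that needs to fetch the same page a few hundred times. The page contains a dynamic attribute that should change on every request. I have built a multithreaded class based on HttpClient to handle the requests, and I use an ExecutorService to create a thread pool and run the threads. The problem is that the dynamic attribute sometimes does not change between requests, and I end up with the same value on 3 or 4 consecutive threads. I have read a lot about HttpClient, but I cannot find where this problem comes from. Could it be related to caching, or something similar?
Update: here is the code executed in each thread: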
HttpContext localContext = new BasicHttpContext();
HttpParams params = new BasicHttpParams();
HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1);
HttpProtocolParams.setContentCharset(params,
        HTTP.DEFAULT_CONTENT_CHARSET);
HttpProtocolParams.setUseExpectContinue(params, true);
ClientConnectionManager connman = new ThreadSafeClientConnManager();
DefaultHttpClient httpclient = new DefaultHttpClient(connman, params);
HttpHost proxy = new HttpHost(inc_proxy, Integer.valueOf(inc_port));
httpclient.getParams().setParameter(ConnRoutePNames.DEFAULT_PROXY,
        proxy);
HttpGet httpGet = new HttpGet(url);
httpGet.setHeader("User-Agent",
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)");
String iden = null;
int timeoutConnection = 10000;
HttpConnectionParams.setConnectionTimeout(httpGet.getParams(),
        timeoutConnection);
try {
    HttpResponse response = httpclient.execute(httpGet, localContext);
    HttpEntity entity = response.getEntity();
    if (entity != null) {
        InputStream instream = entity.getContent();
        String result = convertStreamToString(instream);
        // System.out.printf("Resultado\n %s",result +"\n");
        instream.close();
        iden = StringUtils
                .substringBetween(result,
                        "<input name=\"iden\" value=\"",
                        "\" type=\"hidden\"/>");
        System.out.printf("IDEN:%s\n", iden);
        EntityUtils.consume(entity);
    }
} catch (ClientProtocolException e) {
    // TODO Auto-generated catch block
    System.out.println("Excepção CP");
} catch (IOException e) {
    // TODO Auto-generated catch block
    System.out.println("Excepção IO");
}
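One way to rule out threads overwriting each other's values is to have every pooled task return its own result instead of writing to shared state. The following is only a minimal sketch of that pattern using the standard library. The class name `PerTaskResults` and the method `fetchIden` are hypothetical; `fetchIden` is a stand-in for the real HTTP request and parsing, not part of HttpClient.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskResults {

    // Hypothetical stand-in for the real fetch-and-parse step.
    // It uses only its argument and local variables, so no state is shared.
    static String fetchIden(int requestNo) {
        return "iden-" + requestNo;
    }

    // Runs `requests` tasks on a pool of `threads` workers and collects
    // each task's own return value through its Future.
    static List<String> fetchAll(int requests, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < requests; i++) {
                final int requestNo = i;
                Callable<String> task = () -> fetchIden(requestNo);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task finishes
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchAll(5, 3));
    }
}
```

Because the futures are read in submission order, the returned list lines up with the request numbers regardless of which worker thread ran each task. If duplicate values still appear with this structure, the repetition is more likely coming from the server or the proxy than from the threading code.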
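【Comments】:
-
It could be cached on the server side.
-
You may be writing thread-unsafe code, so that whenever you download data the old result is overwritten by the new one. It is hard to tell without the code.
-
I have updated the question with the code.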
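If a cache on the server or on the proxy is the cause, one common test is to make every request URL unique so that a cache keyed on the full URL cannot reuse an earlier response. The sketch below shows only the URL handling. The class name `CacheBuster` and the parameter name `_nocache` are hypothetical, and it assumes the URL has no `#fragment` and that the server ignores unknown query parameters.

```java
import java.util.concurrent.atomic.AtomicLong;

public class CacheBuster {

    // Thread-safe counter, so concurrent callers never receive the same value.
    private static final AtomicLong COUNTER = new AtomicLong();

    // Appends a unique "_nocache" query parameter (hypothetical name) to the URL.
    // Assumes the URL contains no "#fragment" part.
    static String withCacheBuster(String url) {
        String separator = url.contains("?") ? "&" : "?";
        return url + separator + "_nocache=" + COUNTER.incrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(withCacheBuster("http://example.com/page"));
        System.out.println(withCacheBuster("http://example.com/page?x=1"));
    }
}
```

The result would be passed to `new HttpGet(...)` in place of the plain `url`. Sending a `Cache-Control: no-cache` request header through `httpGet.setHeader` is another option, although whether an intermediate cache honors it depends on that cache.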
Tags: java multithreading http httpclient apache-httpcomponents