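【Posted】: 2011-08-03 23:05:05
【Question】:
I'm trying to scrape a web page with C#, but after the page loads it runs some JavaScript that adds more elements to the DOM, and those are the elements I need. A standard scraper only grabs the HTML the server returns at load time, so it never sees the DOM changes made by JavaScript. How can I add something that waits a second or two and then grabs the resulting source?
Here is my current code: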
// Namespaces this method needs:
using System;
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

private string ScrapeWebpage(string url, DateTime? updateDate)
{
    // Create the request and advertise support for HTTP compression.
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    request.Pipelined = true;
    request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate");
    if (updateDate != null)
        request.IfModifiedSince = updateDate.Value;

    // Get the response. The using blocks dispose the response, the
    // streams, and the reader even if an exception is thrown.
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    {
        Stream responseStream = response.GetResponseStream();
        string contentEncoding = (response.ContentEncoding ?? string.Empty).ToLowerInvariant();
        if (contentEncoding.Contains("gzip"))
            responseStream = new GZipStream(responseStream, CompressionMode.Decompress);
        else if (contentEncoding.Contains("deflate"))
            responseStream = new DeflateStream(responseStream, CompressionMode.Decompress);

        // Read the HTML exactly as the server sent it; no JavaScript runs here.
        using (StreamReader reader = new StreamReader(responseStream, Encoding.Default))
        {
            return reader.ReadToEnd();
        }
    }
}
Here is an example URL:
http://www.realtor.com/realestateandhomes-search/geneva_ny#listingType-any/pg-4
When the page first loads it says 134 listings were found, and about a second later it says 187 properties were found.
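One detail about this URL that may matter: everything after the `#` is a fragment, which a browser keeps on the client side and does not send to the server, so a plain HTTP request cannot ask for `pg-4` this way. Below is a minimal sketch using `System.Uri` to show which part of the URL is actually requested. The class `FragmentDemo` and its method names are hypothetical and exist only for illustration.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class FragmentDemo
{
    // Returns the scheme, host, path, and query of the URL: the part
    // that an HTTP request actually identifies on the server.
    public static string SentToServer(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Query);
    }

    // Returns the fragment (including the leading '#'), which stays
    // on the client and is typically read by the page's JavaScript.
    public static string ClientOnlyPart(string url)
    {
        return new Uri(url).Fragment;
    }
}
```

Calling `FragmentDemo.SentToServer` with the example URL drops the `#listingType-any/pg-4` part, which is consistent with the scraper receiving only the initial listing count: the paging state is presumably applied by script after the page loads.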
【Discussion】:
Tags: c# c#-4.0 screen-scraping web-scraping