【Posted】:2019-04-23 14:30:42
【Question】:
The site I am scraping for educational purposes is paginated.
My code scrapes the first page just fine...
But how do I scrape
?page=2
?page=3
?page=4
?page=5
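and so on?
I should point out that I have already looked for a solution, but I can't find anything that clearly answers what I need to know.
Current code: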
// @nuget: HtmlAgilityPack
using System;
using System.Data;
using System.Data.SqlClient;
using System.Net;
using HtmlAgilityPack;

public class Program
{
    public static void Main()
    {
        ServicePointManager.Expect100Continue = true;
        // Ssl3 is omitted here: it is obsolete and insecure, and newer runtimes may reject it.
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls
            | SecurityProtocolType.Tls11
            | SecurityProtocolType.Tls12;

        HtmlWeb web = new HtmlWeb();
        // Only the first page is loaded; ?page=2, ?page=3, ... are never requested.
        HtmlDocument html = web.Load("https://www.g2crowd.com/products/google-analytics/reviews");

        // var divNodes = html.DocumentNode.SelectNodes("//div[@class='mb-2 border-bottom']");
        var divNodes = html.DocumentNode.SelectNodes(@"//div[@itemprop='reviewBody']");

        // SelectNodes returns null when nothing matches.
        if (divNodes != null)
        {
            foreach (var tag in divNodes)
            {
                string review = tag.InnerText;

                // Put each question heading on its own line for readability.
                review = review.Replace("What do you like best?", "What do you like best?\n");
                review = review.Replace("What do you dislike?", "\nWhat do you dislike?\n");
                review = review.Replace("Recommendations to others considering the product", "\n\nRecommendations to others considering the product\n");
                review = review.Replace("What business problems are you solving with the product? What benefits have you realized?", "\n\nWhat business problems are you solving with the product? What benefits have you realized?\n");

                Console.WriteLine(review);
                Console.WriteLine("\n------------------------------- Review found. Adding to Database -------------------------------\n");

                // Prepare the text for storage (the database insert itself is not shown).
                review = review.Replace("'", "");
                review = review.Replace("\n", "<br />");
            }
        }
    }
}
【Comments】:
-
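How would you instinctively approach it? You probably already have the answer... There is no silver bullet here: either try the next page, or search the page for clues to see whether you can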
-
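My guess is to follow the link to the next page, or somehow code it so that once page=1 is done it moves on to page=2? I'm very new to C#, and I have a hard time turning my ideas into code. In the past, a nudge from SO has helped me learn a lot! A bit stumped!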
-
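It depends on whether you are writing a crawler. If you are, the links should be followable; if you just want to fetch the collection again, simply request the link. There is not much more to add. Maybe someone else can chime in.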
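
The "move on to page=2 once page=1 is done" idea from the comments could be sketched roughly like this. It is only a minimal sketch, not a confirmed solution for this site: `PagedScraper`, `BuildPageUrl`, `ExtractReviews`, `ScrapeAllPages` and the `maxPages` cap are hypothetical names introduced for illustration, and it assumes that the site accepts `?page=N` as shown in the question and that a page with no `reviewBody` nodes means the last page has been passed.

```csharp
// @nuget: HtmlAgilityPack
using System;
using System.Collections.Generic;
using System.Threading;
using HtmlAgilityPack;

public static class PagedScraper
{
    // Hypothetical helper: page 1 is the bare URL, later pages get ?page=N
    // (assumes the base URL has no query string of its own).
    public static string BuildPageUrl(string baseUrl, int page)
    {
        return page <= 1 ? baseUrl : baseUrl + "?page=" + page;
    }

    // Collects the text of every review node in an already loaded document.
    public static List<string> ExtractReviews(HtmlDocument html)
    {
        var reviews = new List<string>();
        var nodes = html.DocumentNode.SelectNodes("//div[@itemprop='reviewBody']");
        if (nodes == null)
        {
            return reviews; // SelectNodes returns null when nothing matches
        }
        foreach (var node in nodes)
        {
            reviews.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return reviews;
    }

    // Requests page 1, 2, 3, ... until a page has no reviews or maxPages is reached.
    public static List<string> ScrapeAllPages(string baseUrl, int maxPages)
    {
        var web = new HtmlWeb();
        var all = new List<string>();
        for (int page = 1; page <= maxPages; page++)
        {
            HtmlDocument html = web.Load(BuildPageUrl(baseUrl, page));
            List<string> reviews = ExtractReviews(html);
            if (reviews.Count == 0)
            {
                break; // assumption: an empty page means there are no more pages
            }
            all.AddRange(reviews);
            Thread.Sleep(1000); // short pause between requests to go easy on the server
        }
        return all;
    }
}
```

It could then be called from `Main` with something like `PagedScraper.ScrapeAllPages("https://www.g2crowd.com/products/google-analytics/reviews", 5)`, with the existing formatting code applied to each returned string. Following the "next" link, as the comments also suggest, would work similarly, but the XPath for that link depends on the site's markup, which is not shown here.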
Tags: c# .net web-scraping pagination