Web Scraper not returning the correct URL

【Question title】: Web Scraper not returning the correct URL
【Posted】: 2020-02-10 21:16:46
【Question】:

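I'm developing a web scraper console application in C#, and I can't retrieve the post links from Hacker News.

I can retrieve everything except the links. When I try to get a link, I get the vote button link instead of the post link, even though I believe I'm selecting the right tag.

My code: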

var postsHTML = htmlDocument.DocumentNode.Descendants("table")
    .Where(node => node.GetAttributeValue("class", "")
    .Equals("itemlist")).ToList();

var postList = postsHTML[0].Descendants("tr")
    .Where(node => node.GetAttributeValue("class", "")
    .Equals("athing")).ToList();

Then, in my foreach loop, I go through each element of the list and retrieve the title and link like this:

foreach (var post in postList)
{

    Console.WriteLine("Title: " + post.Descendants("a")
        .Where(node => node.GetAttributeValue("class", "")
        .Equals("storylink")).FirstOrDefault().InnerText);


    Console.WriteLine("URI: " + post.Descendants("a").FirstOrDefault()
        .GetAttributeValue("href", ""));

    Console.WriteLine();

}

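This returns the title correctly, but the URI comes out as:

Title: Jules Verne's most famous books were part of a 54-volume masterpiece

URI: vote?id=22292003&how=up&goto=news

The link I want returned is the post link:

URI: http://www.openculture.com/2020/02/jules-vernes-voyages-extraordinaires.html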

Tags: c# .net web web-scraping console-application

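【Solution 1】:

Your second selector picks the wrong tag. post.Descendants("a").FirstOrDefault() returns the first anchor in the row, which is the vote button. The title and the URL are on the same tag, so reuse the storylink selector for both.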

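// Select the story anchor once; it carries both the title text and the href.
var storyLink = post.Descendants("a")
    .FirstOrDefault(node => node.GetAttributeValue("class", "")
    .Equals("storylink"));

// Inside the foreach loop: skip rows that have no story anchor.
if (storyLink == null) continue;

Console.WriteLine("Title: " + storyLink.InnerText);
Console.WriteLine("URI: " + storyLink.GetAttributeValue("href", ""));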
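
To try this without hitting the live site, here is a minimal, self-contained sketch. It assumes the parser in the question is HtmlAgilityPack (the NuGet package whose API matches the calls above). The sample HTML is hand-written to mimic the row structure described in the question, and the names StoryLinkDemo and ExtractStories are hypothetical, introduced only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // assumed: the HtmlAgilityPack NuGet package

public static class StoryLinkDemo
{
    // Hypothetical helper: returns (title, url) pairs for every "athing" row.
    public static List<(string Title, string Url)> ExtractStories(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var stories = new List<(string Title, string Url)>();
        var rows = doc.DocumentNode.Descendants("tr")
            .Where(tr => tr.GetAttributeValue("class", "") == "athing");

        foreach (var row in rows)
        {
            // Filter by class so the vote anchor, which comes first, is ignored.
            var storyLink = row.Descendants("a")
                .FirstOrDefault(a => a.GetAttributeValue("class", "") == "storylink");
            if (storyLink == null) continue; // row without a story anchor

            stories.Add((storyLink.InnerText,
                         storyLink.GetAttributeValue("href", "")));
        }
        return stories;
    }

    public static void Main()
    {
        // Hand-written sample shaped like the markup described in the question.
        const string sampleHtml = @"
<table class=""itemlist"">
  <tr class=""athing"">
    <td><a href=""vote?id=22292003"">vote</a></td>
    <td><a class=""storylink"" href=""http://www.openculture.com/2020/02/jules-vernes-voyages-extraordinaires.html"">Jules Verne</a></td>
  </tr>
</table>";

        foreach (var (title, url) in ExtractStories(sampleHtml))
        {
            Console.WriteLine("Title: " + title);
            Console.WriteLine("URI: " + url);
        }
    }
}
```

Selecting the anchor by its class rather than by position keeps the lookup from depending on where the vote button sits in the row. The class name storylink is taken from the question; if the site changes its markup, the selector has to change with it.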
