How to get the value of a tag that has no class or id in Html Agility Pack?

【Question title】: How to get the value of a tag that has no class or id in Html Agility Pack?
【Posted】: 2020-03-08 01:11:40
【Question】:

I'm trying to get the text value of this tag:

<a href="item?id=22513425">67&nbsp;comments</a>

So I'm trying to get "67" out of it. However, the tag has no class or id defined.

This is how far I've gotten:

        IEnumerable<HtmlNode> commentsNode = htmlDoc.DocumentNode.Descendants(0).Where(n => n.HasClass("subtext"));

        var storyComments = commentsNode.Select(n =>
            n.SelectSingleNode("//a[3]")).ToList();

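Annoyingly, this only gives me "comments".

I can't use the href id: there are many of these items, so I can't hardcode the href.

How can I extract the number as well?

【Comments】:

Tags: html-agility-pack

【Solution 1】:

Just use the @href attribute and a dedicated string function: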

substring-before(//a[@href="item?id=22513425"],"comments")

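This returns 67.

EDIT: Since you can't hardcode the full content of @href, you could use starts-with instead. These are XPath 1.0 solutions.

Shortest form (+ the text has to contain "comments"):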

substring-before(//a[starts-with(@href,"item?") and text()[contains(.,"comments")]],"c")

More restrictive (+ the text has to end with "comments"):

substring-before(//a[starts-with(@href,"item?")][substring(., string-length(.) - string-length('comments')+1) = 'comments'],"c")
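
Because `substring-before(...)` yields a string rather than a node, it has to be evaluated as an expression instead of being passed to a node-selecting call. Below is a minimal sketch of that idea using only the .NET base class library (`System.Xml.XPath`) on a small, well-formed fragment. The fragment, the class name `XPathCommentCount`, and the helper `EvaluateString` are made up for illustration; the sketch does not use HtmlAgilityPack itself, and it writes `&nbsp;` as `&#160;` because plain XML does not define that entity. The text before "c" still ends with the non-breaking space, so the result is trimmed.

```csharp
using System;
using System.IO;
using System.Xml.XPath;

public static class XPathCommentCount
{
    // Evaluates an XPath 1.0 expression that produces a string and returns
    // the result with surrounding whitespace (including U+00A0) trimmed.
    public static string EvaluateString(string xml, string xpath)
    {
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();
        var result = (string)nav.Evaluate(xpath);
        return result.Trim();
    }

    public static void Main()
    {
        // Hypothetical stand-in for one "subtext" cell of the page.
        const string xml =
            "<td class=\"subtext\">" +
            "<a href=\"user?id=someone\">someone</a>" +
            "<a href=\"item?id=22513425\">67&#160;comments</a>" +
            "</td>";

        // Same expression as the shortest form above, with single quotes
        // so it fits inside a C# string literal.
        const string xpath =
            "substring-before(//a[starts-with(@href,'item?') " +
            "and text()[contains(.,'comments')]],'c')";

        Console.WriteLine(EvaluateString(xml, xpath));
    }
}
```

With real HTML loaded through HtmlAgilityPack, the same expression would presumably be evaluated through an `XPathNavigator` obtained from the document, but that path is not exercised in this sketch.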

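【Comments】:

Sorry, I should have mentioned that there are multiple items with different hrefs, so I can't hardcode the href id.

【Solution 2】:

I'm using the ScrapySharp NuGet package, which is what the example below relies on. (HtmlAgilityPack may well offer the same functionality on its own; I've simply been used to ScrapySharp for years.)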

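    // Requires: using System; using System.Linq;
    //           using HtmlAgilityPack; using ScrapySharp.Extensions;
    var doc = new HtmlDocument();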
    doc.Load(@"C:\desktop\anchor.html"); //I created an html file with your <a> element as the body
    var anchor = doc.DocumentNode.CssSelect("a").FirstOrDefault();
    if (anchor == null) return;

    var digits = anchor.InnerText.ToCharArray().Where(c => Char.IsDigit(c));

    Console.WriteLine($"anchor text: {anchor.InnerText} - digits only: {new string(digits.ToArray())}");

Output:

【Comments】:
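
As a variation on the digit filter in Solution 2, the number can also be read only from the start of the link text, which avoids picking up digits that appear later in the string. The following is a minimal sketch using just `System.Text.RegularExpressions`; the class `CommentCountParser` and the method `ParseLeadingCount` are hypothetical names, and the input strings are assumed to be the anchor's text in either entity-encoded or decoded form.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CommentCountParser
{
    // Matches a run of ASCII digits at the start of the text,
    // allowing optional leading whitespace.
    private static readonly Regex LeadingDigits = new Regex(@"^\s*([0-9]+)");

    // Returns the leading number, or null when the text does not start
    // with one (or the digits do not fit in an int).
    public static int? ParseLeadingCount(string linkText)
    {
        if (string.IsNullOrEmpty(linkText)) return null;
        Match m = LeadingDigits.Match(linkText);
        return m.Success && int.TryParse(m.Groups[1].Value, out int n)
            ? n
            : (int?)null;
    }

    public static void Main()
    {
        // Entity-encoded and decoded forms of the same link text.
        Console.WriteLine(ParseLeadingCount("67&nbsp;comments"));
        Console.WriteLine(ParseLeadingCount("67\u00a0comments"));
        // Text without a leading number yields null.
        Console.WriteLine(ParseLeadingCount("comments").HasValue);
    }
}
```

Returning a nullable value lets the caller distinguish a link with no number from a link whose count is 0.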