Cannot extract <link> element using HtmlAgilityPack and XPath

【Question title】: Cannot extract <link> element using HtmlAgilityPack and XPath
【Posted】: 2015-08-05 10:59:00
【Question】:

I am using the Html Agility Pack to select text data out of an RSS XML feed. For every other node type (title, pubDate, guid, etc.) I can select the inner text using XPath conventions, but querying "//link", or indeed "item/link", returns empty strings.

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    //Create a new document.
    var document = new HtmlDocument();
    //Populate the document with an rss file.
    document.LoadHtml(rssSource);
    //Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("item/link");
    //If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

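Does anyone know why this element behaves differently from the others?

Addendum: running "item/link" returns zero nodes. Running "//link" returns the correct number of nodes, but their inner text has zero length.

With the test data below, "//name" returns one record containing "Fred", but "//link" returns one record containing an empty string.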

<site><link>Hello World</link><name>Fred</name></site>

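I am sure it is because of the word "link". If I change it to "linkz", it works perfectly.

The following workaround works fine. However, I would like to understand why searching on "//link" does not work the way it does for other elements.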

public static IEnumerable<string> ExtractAllLinks(string rssSource)
{
    rssSource = rssSource.Replace("<link>", "<link-renamed>");
    rssSource = rssSource.Replace("</link>", "</link-renamed>");
    //Create a new document.
    var document = new HtmlDocument();
    //Populate the document with an rss file.
    document.LoadHtml(rssSource);
    //Select out all of the required nodes.
    var itemNodes = document.DocumentNode.SelectNodes("//link-renamed");
    //If zero nodes were found, return an empty list, otherwise return the content of those nodes.
    return itemNodes == null ? new List<string>() : itemNodes.Select(itemNode => itemNode.InnerText).ToList();
}

【Comments】:

Tags: c# xpath rss html-agility-pack

【Solution 1】:

If you print DocumentNode.OuterHtml, you will see the problem:

var html = @"<site><link>Hello World</link><name>Fred</name></site>";
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);

Output:

<site><link>Hello World<name>Fred</name></site>

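link happens to be one of several special tags* that HAP treats as self-closing. The element is therefore closed immediately, the text that follows it is not parsed as its content, and InnerText comes back empty. You can change this behavior by modifying ElementsFlags before parsing the HTML, for example: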

var html = @"<site><link>Hello World</link><name>Fred</name></site>";
HtmlNode.ElementsFlags.Remove("link");  //remove link from list of special tags
var doc = new HtmlDocument();
doc.LoadHtml(html);
Console.WriteLine(doc.DocumentNode.OuterHtml);
var links = doc.DocumentNode.SelectNodes("//link");
foreach (HtmlNode link in links)
{
    Console.WriteLine(link.InnerText);
}


Output:

<site><link>Hello World</link><name>Fred</name></site>
Hello World
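
As a rough sketch, the same fix could be folded back into the ExtractAllLinks method from the question. The class name RssLinkExtractor and the Main method are illustrative assumptions, not part of the original post. Note that HtmlNode.ElementsFlags is a static dictionary, so removing "link" affects every document parsed afterwards in the same process.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical wrapper class; only ExtractAllLinks comes from the question.
public static class RssLinkExtractor
{
    public static IEnumerable<string> ExtractAllLinks(string rssSource)
    {
        // Stop HAP from treating <link> as a self-closing tag.
        // ElementsFlags is static, so this change is process-wide.
        // Removing a key that is already gone is harmless.
        HtmlNode.ElementsFlags.Remove("link");

        var document = new HtmlDocument();
        document.LoadHtml(rssSource);

        // "//link" matches link elements at any depth.
        // SelectNodes returns null when nothing matches.
        var linkNodes = document.DocumentNode.SelectNodes("//link");
        return linkNodes == null
            ? new List<string>()
            : linkNodes.Select(node => node.InnerText).ToList();
    }

    public static void Main()
    {
        // Same test data as in the question.
        var rss = "<site><link>Hello World</link><name>Fred</name></site>";
        foreach (var link in ExtractAllLinks(rss))
        {
            Console.WriteLine(link);
        }
    }
}
```

With the test data from the question this should print Hello World, in line with the output shown above. For a real feed, an expression such as "//item/link" would probably be a better choice than "item/link", because a relative path evaluated on DocumentNode only matches item elements that are direct children of the document node.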

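*) The complete list of special tags other than link that are included in the ElementsFlags dictionary by default can be seen in the source code of HtmlNode.cs. Some of the most popular among them are <meta>, <img>, <frame>, <input>, <form>, <option>, etc.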
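
Instead of reading HtmlNode.cs, the default entries could also be inspected at runtime. This is a minimal sketch, assuming the installed HAP version exposes HtmlNode.ElementsFlags as a public static dictionary (as the code above relies on). The class name SpecialTagLister is made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper class for illustration only.
public static class SpecialTagLister
{
    public static void Main()
    {
        // Each entry maps a tag name to the HtmlElementFlag value(s)
        // that control how HAP parses that tag.
        foreach (var entry in HtmlNode.ElementsFlags)
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

The exact entries and flag values depend on the HAP version in use, so the printed list may differ from what the footnote names.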

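【Discussion】:

Thanks! I had a feeling it had something to do with the word being reserved or indeed "special", but Google was not my friend today. Marked as accepted.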