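Parsing HTML using HTMLAgilityPack: answers

【Question title】: Parsing HTML using HTMLAgilityPack
【Posted】: 2026-01-26 00:45:01
【Question description】:

I'm trying to parse the following HTML using the HTML Agility Pack.

This is a snippet of the whole file returned by the code: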

<div class="story-body fnt-13 p20-b user-gen">
    <p>text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <div  class="gallery clr bdr aln-c js-no-shadow mod  cld">
        <div>
            <ol>
                <li class="fader-item aln-c ">
                    <div class="imageWrap m10-b">
                       &#8203;<img class="http://www.domain.com/picture.png| " src="http://www.domain.com/picture.png" alt="alt text" />
                    </div>
                    <p class="caption">caption text</p>
                </li>
            </ol>
        </div>
    </div >
    <p>text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
    <p>text here text here text text here text here text text here text here text text here text here text text here text here text </p>
</div>

I obtained this snippet using the following code (I know it's messy):

string url = "http://www.domain.com/story.html";
var webGet = new HtmlWeb();
var document = webGet.Load(url);

var links = document.DocumentNode
        .Descendants("div")
        .Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen")) //
        .SelectMany(div => div.Descendants("p"))
        .ToList();
int cn = links.Count;

HtmlAgilityPack.HtmlNodeCollection tl = document.DocumentNode.SelectNodes("/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/div[1]/div[2]/div[1]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    textBox1.AppendText(node.InnerText.Trim());
    textBox1.AppendText(System.Environment.NewLine);
}

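The code loops over each p and (for now) appends it to a text box. Everything works correctly except for the div tag with the class gallery clr bdr aln-c js-no-shadow mod cld. For that piece of HTML, the result contains the &#8203; and the caption text.

What is the best way to omit it from the results?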

【Question discussion】:

Psst... "So two questions, what's the best way to omit that from the results?" That's one question; what's the other?
I don't know what you're talking about... :p

Tags: c# html-agility-pack

【Solution 1】:

XPath is your friend. Try this, and forget that crappy xlinq syntax :-)

HtmlNodeCollection tl = document.DocumentNode.SelectNodes("//p[not(@*)]");
foreach (HtmlAgilityPack.HtmlNode node in tl)
{
    Console.WriteLine(node.InnerText.Trim());
}

This expression selects all P nodes that have no attributes set. For other examples, see here: XPath Syntax
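As a minimal sketch of the expression above, the following runs `//p[not(@*)]` against a small in-memory document instead of a downloaded page. The sample markup, the class name `AttributeFreeParagraphs`, and the method name `Extract` are hypothetical and only serve the illustration. It also guards against `SelectNodes` returning null, which is what Html Agility Pack does when nothing matches.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class AttributeFreeParagraphs
{
    // Returns the trimmed text of every <p> element that carries no attributes.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//p[not(@*)]");
        if (nodes == null)
            return result;

        foreach (HtmlNode node in nodes)
            result.Add(node.InnerText.Trim());

        return result;
    }

    static void Main()
    {
        // Hypothetical sample markup, loosely modelled on the question's snippet.
        const string html =
            "<div class='story-body'>" +
            "<p>first paragraph</p>" +
            "<div class='gallery'><p class='caption'>caption text</p></div>" +
            "<p>second paragraph</p>" +
            "</div>";

        foreach (string text in Extract(html))
            Console.WriteLine(text);
    }
}
```

The caption paragraph is skipped only because it has a `class` attribute. A story paragraph that carried any attribute would be skipped too, so this filter suits markup like the question's but is not a general rule.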

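【Discussion】:

Thanks, that works a treat. I'll look into XPath too, as it looks like a much better solution!
It does work, but it also includes other P nodes on the page. What I posted at the top is only a snippet.
Just add other filters to the expression (the part between the [ and ] characters).
Is it possible with XPath to get a collection of nodes from a specific div with a specific class? i.e.
Sure. SelectNodes("//div[@class='story']") will get all div elements, starting from the root, that have a 'class' attribute with the value 'story'.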
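Combining the two suggestions above (a class filter on the div, plus a step that selects only its paragraphs) might look like the sketch below. The XPath `//div[contains(@class,'story-body')]/p` is an assumption about what suits the question's markup, not something taken from the answers. The single `/` before `p` selects direct children only, so the caption paragraph nested inside the gallery div is not returned. The class and method names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class StoryParagraphs
{
    // Returns the text of <p> elements that are direct children of the story div.
    public static List<string> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<string>();

        // contains() is a plain substring test on the class attribute, so it would
        // also match a class such as "story-body-wide". Use @class='...' with the
        // full attribute value if an exact match is required.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//div[contains(@class,'story-body')]/p");
        if (nodes == null)
            return result; // SelectNodes yields null when there is no match

        foreach (HtmlNode node in nodes)
            result.Add(node.InnerText.Trim());

        return result;
    }

    static void Main()
    {
        // Hypothetical sample markup, loosely modelled on the question's snippet.
        const string html =
            "<p>unrelated paragraph elsewhere on the page</p>" +
            "<div class='story-body fnt-13 p20-b user-gen'>" +
            "<p>first paragraph</p>" +
            "<div class='gallery'><p class='caption'>caption text</p></div>" +
            "<p>second paragraph</p>" +
            "</div>";

        foreach (string text in Extract(html))
            Console.WriteLine(text);
    }
}
```

Note that `@class='story'` in the comment above is an exact comparison against the whole attribute value, so for the question's markup it would have to be written as `@class='story-body fnt-13 p20-b user-gen'`.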

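【Solution 2】:

It's not quite clear what you're asking. I think you're asking how to get only the direct children of a particular div. If that's the case, use ChildNodes rather than Descendants. That is: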

.SelectMany(div => div.ChildNodes.Where(n => n.Name == "p"))

The problem is that Descendants performs a fully recursive traversal of the document tree.
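To make the difference concrete, here is a small sketch that counts `p` elements both ways on a hypothetical document shaped like the question's snippet. The class name `TraversalComparison`, its method names, and the sample markup are illustrative only.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class TraversalComparison
{
    // Loads the markup and returns its first <div> element (null if there is none).
    static HtmlNode FirstDiv(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.SelectSingleNode("//div");
    }

    // Counts every <p> anywhere below the first div (recursive traversal).
    public static int CountDescendantParagraphs(string html)
    {
        HtmlNode div = FirstDiv(html);
        return div == null ? 0 : div.Descendants("p").Count();
    }

    // Counts only the <p> elements that are direct children of the first div.
    public static int CountChildParagraphs(string html)
    {
        HtmlNode div = FirstDiv(html);
        return div == null ? 0 : div.ChildNodes.Count(n => n.Name == "p");
    }

    static void Main()
    {
        // Hypothetical sample markup: two story paragraphs plus one nested caption.
        const string html =
            "<div class='story-body'>" +
            "<p>first paragraph</p>" +
            "<div class='gallery'><p class='caption'>caption text</p></div>" +
            "<p>second paragraph</p>" +
            "</div>";

        Console.WriteLine(CountDescendantParagraphs(html)); // caption included
        Console.WriteLine(CountChildParagraphs(html));      // caption excluded
    }
}
```

`HtmlNode` also exposes `Elements("p")`, which returns the direct child elements with a given name and could replace the `ChildNodes` filter used here.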

【Discussion】:

Using XPath would be easier: //p
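That would include <p class="caption">caption text</p>. I'm trying not to include anything from line 4 to line 15 (the div), just the other <p>s.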
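@Nathan: No, I don't think those would be included. ChildNodes gets only the direct children of a particular node. If you replace the SelectMany in your LINQ expression with my SelectMany, I think you'll find it works as advertised. My expression uses Where because there is no ChildNodes overload that lets you specify a type (i.e. you can't say ChildNodes("p")).
OK, I think I see what you mean. Something like the following? var links = document.DocumentNode.Descendants("div").Where(div => div.GetAttributeValue("class", "").Contains("story-body fnt-13 p20-b user-gen")).SelectMany(div => div.ChildNodes.Where(n => n.Name == "p")).ToList();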