C# Parse HTML Webpage: Answers

【Question title】: C# Parse HTML Webpage
【Posted】: 2016-07-02 15:52:02
【Question】:

How can I parse a complete HTML page in C#?

A small example:

<html>
 <head></head>
 <body>
  <div class="wrapper">
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
   <div class="row">
    <div>Value1</div>
    <div>Value2</div>
   </div>
  </div>
 </body>
</html>

I can't rely on the page's class names to identify the containers, because they are variable.

Now I want to save these values.

My current code:

WebBrowser wb = (WebBrowser)sender;

var doc = wb.Document as HTMLDocument;

IHTMLElementCollection nodes = doc.getElementsByTagName("div");

foreach(IHTMLElement elem in nodes)
{
    var div = (HTMLDivElement)elem;

    if(div.className != null && div.className.Contains("t_row"))
    {
        //BREAKPOINT
        var inner = div.document as HTMLDocument;
        IHTMLElementCollection innerNode = inner.getElementsByTagName("div");

        log(div.innerText);
    }
}

Everything works up to the breakpoint, but I don't know how to continue from there.

【Comments】:

Since your HTML page is not well-formed, you should consider using HTML Agility Pack to parse it.
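A minimal sketch of what that suggestion could look like, assuming the HtmlAgilityPack NuGet package is installed. Because the class names are variable, it selects elements by their position in the tree (a div inside a div directly under body) instead of by class. The names RowExtractor and ExtractRows are hypothetical and chosen only for illustration, and the XPath assumes the page really has the structure shown in the question.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET

public static class RowExtractor // hypothetical helper name
{
    // Returns one list of cell texts per row, selecting by structure
    // (body > div > div > div) rather than by class name.
    public static List<List<string>> ExtractRows(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var rows = new List<List<string>>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection rowNodes = doc.DocumentNode.SelectNodes("//body/div/div");
        if (rowNodes == null)
        {
            return rows;
        }

        foreach (HtmlNode row in rowNodes)
        {
            // Only the direct <div> children of the row are treated as values.
            List<string> cells = row.Elements("div")
                .Select(cell => cell.InnerText.Trim())
                .ToList();
            rows.Add(cells);
        }

        return rows;
    }
}
```

For the sample markup in the question, this should yield four rows, each holding the texts of its two inner divs. If the real page nests the wrapper more deeply, the XPath would need to be adjusted accordingly.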

Tags: c# wpf

【Solution 1】:

You can use WebsiteParser to extract the data. Its usage is similar to that of a parsing library. For your example HTML, it would look like this:

IEnumerable<WrapperItem> items = WebContentParser.ParseList<WrapperItem>(html);

// ...

[ListSelector(".wrapper", ChildSelector = ".row")]
class WrapperItem
{
    [Selector("div:nth-child(1)")]
    public string Value1 { get; set; }

    [Selector("div:nth-child(2)")]
    public string Value2 { get; set; }
}

To download a site's HTML, you can use WebClient:

// WebClient lives in the System.Net namespace.
WebClient client = new WebClient();
string html = client.DownloadString("https://example.com");
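WebClient is marked obsolete in .NET 6 and later, so on newer runtimes the same download step can be sketched with HttpClient instead. This is only an illustration: the names PageDownloader and DownloadHtmlAsync are hypothetical, and error handling is limited to failing on non-success status codes.

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public static class PageDownloader // hypothetical helper name
{
    // A single shared HttpClient instance is reused across calls.
    private static readonly HttpClient Client = new HttpClient();

    // Downloads the body of the given URL as a string.
    public static async Task<string> DownloadHtmlAsync(string url)
    {
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            // Throws HttpRequestException for non-2xx status codes.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

From an async method, the result can be obtained with `string html = await PageDownloader.DownloadHtmlAsync("https://example.com");` and then passed to the parser.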
