【Question title】: Parse HTML Data Using HTMLAgilityPack
【Posted】: 2017-04-22 05:54:19
【Question】:

I want to create a table that displays the following, but I have run into some problems:

    public class Book
    {
        public HtmlAttribute Href{ get; set; }
        public string Title{ get; set; }
        public string Author{ get; set; }
        public string Characters{ get; set; }
    }

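This is the page I want to parse. I need the href value, the link text, the description, and the list of characters (which is sometimes missing):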

    <div id=title> 
        <li>
            <h3><a href="www.harrypotter.com">Harry Potter</a></h3>
            <div>Harry James Potter is the title character of J. K. Rowling's Harry Potter series. </div>
            <ul>
                <li>Harry Potter</li>
                <li>Hermione Granger</li>
                <li>Ron Weasley</li>
            </ul>
        </li>

        <li>
            <h3><a href="www.littleprince.com">Little Prince</a></h3>
            <div>A little girl lives in a very grown-up world with her mother, who tries to prepare her for it.  </div>
        </li>
    </div>

This is my code to parse it and put the results in a list:

    List<Book> BookList= new List<Book>();
    var titleNode = doc.DocumentNode.SelectNodes("//*[@id=\"title\"]//li//h3");
    var descNode = doc.DocumentNode.SelectNodes("//*[@id=\"title\"]//li//div");
    var authorNode = doc.DocumentNode.SelectNodes("//*[@id=\"title\"]//li//ul");

    var title = titleNode.Select(node => node.InnerText).ToList();
    var desc = descNode.Select(node => node.InnerText).ToList();
    var characters= authorNode.Select(node => node.InnerText).ToList();

    for (int i = 0; i < title.Count; ++i)
    {
        var list= new Book();
        list.Title= title[i];
        list.Author= desc[i];
        list.Characters = characters[i];
        BookList.Add(list);
    }

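My questions are: 1) How do I get the href value and add it to the list? 2) Some items have no characters tag in the HTML; how can I still build the list without getting a NullReferenceException? Note: I cannot make any changes to the HTML.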

【Comments】:

    Tags: c# parsing html-parsing html-agility-pack


    【Answer 1】:

    I solved your problem without HTMLAgilityPack; I am using System.Xml instead.

    Note: you should add some unique value to identify the main li elements. Here I added the class 'Main'.

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Xml;
    
    namespace Test
    {
    public class Book
    {
        public string Href { get; set; }
        public string Title { get; set; }
        public string Author { get; set; }
        public string Characters { get; set; }
    }
    
    class Program
    {
        static void Main(string[] args)
        {
            string str="<div id='title'><li class='Main'><h3><a href='www.harrypotter.com'>Harry Potter</a></h3><div>Harry James Potter is the title character of J. K. Rowling's Harry Potter series. </div>";
            str += "<ul><li>Harry Potter</li><li>Hermione Granger</li><li>Ron Weasley</li></ul></li><li class='Main'><h3><a href='www.littleprince.com'>Little Prince</a></h3><div>A little girl lives in a very grown-up world with her mother, who tries to prepare her for it.  </div></li></div>";
    
            XmlDocument doc = new XmlDocument();
            doc.LoadXml(str);
    
            XmlNodeList xnList= doc.SelectNodes("//*[@id=\"title\"]//li[@class=\"Main\"]");
    
            List<Book> BookList=new List<Book>();
    
            for (int i = 0; i < xnList.Count; i++)
            {
                XmlNode TitleNode = xnList[i].SelectSingleNode("h3");
                XmlNode DescNode = xnList[i].SelectSingleNode("div");
                XmlNode AuthorNode = xnList[i].SelectSingleNode("ul");
    
                Book list = new Book();
                if(TitleNode!=null)
                    list.Title=TitleNode.InnerText;
                else
                    list.Title="";
    
                if (DescNode != null)
                    list.Author = DescNode.InnerText;
                else
                    list.Author = string.Empty;
    
                if (AuthorNode != null)
                    list.Characters = AuthorNode.InnerText;
                else
                    list.Characters = string.Empty;
    
                if (TitleNode != null && TitleNode.ChildNodes.Count>0)
                {
                    XmlNode HrefNode = TitleNode.ChildNodes[0];
                    if (HrefNode != null && HrefNode.Attributes.Count > 0 && HrefNode.Attributes["href"] != null)
                        list.Href = HrefNode.Attributes["href"].Value;
                    else
                        list.Href = string.Empty;
                }
                else
                {
                    list.Href = string.Empty;
                }
    
                BookList.Add(list);
            }
        }
    }
    }
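
Since the asker cannot change the HTML, the `class='Main'` marker may not be needed. The XPath child axis (`/li` rather than `//li`) selects only the `li` elements directly under the element with `id="title"`, so the nested character items are skipped. Below is a minimal sketch of that idea with System.Xml, assuming the markup happens to be well-formed XML (`LoadXml` throws otherwise). The names `TopLevelItems` and `Extract` are hypothetical and not part of the original answer.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class TopLevelItems
{
    // Returns (href, title, characters) for each <li> directly under id="title".
    public static List<(string Href, string Title, List<string> Characters)> Extract(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml); // throws XmlException if the input is not well-formed

        var result = new List<(string Href, string Title, List<string> Characters)>();
        // "/li" (child axis) matches only direct children, not the nested <ul><li> items
        foreach (XmlNode li in doc.SelectNodes("//*[@id='title']/li"))
        {
            XmlNode a = li.SelectSingleNode("h3/a");

            var characters = new List<string>();
            // XmlNode.SelectNodes yields an empty list when there is no <ul>
            foreach (XmlNode c in li.SelectNodes("ul/li"))
                characters.Add(c.InnerText);

            string href = a?.Attributes?["href"]?.Value ?? string.Empty;
            result.Add((href, a?.InnerText ?? string.Empty, characters));
        }
        return result;
    }

    public static void Main()
    {
        string xml = "<div id='title'>" +
            "<li><h3><a href='www.harrypotter.com'>Harry Potter</a></h3><div>d</div>" +
            "<ul><li>Harry Potter</li><li>Hermione Granger</li></ul></li>" +
            "<li><h3><a href='www.littleprince.com'>Little Prince</a></h3><div>d</div></li>" +
            "</div>";
        foreach (var b in Extract(xml))
            Console.WriteLine($"{b.Href} | {b.Title} | {string.Join(", ", b.Characters)}");
    }
}
```

An item without a `ul` simply gets an empty character list, so no null check is needed at that step. Real-world HTML is often not valid XML, in which case an HTML parser is the safer choice.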
    

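    【Comments】:

    • I can't modify the HTML, because it comes from the website I want to parse.
    【Answer 2】:

    This is what I would do. Let me know if you have questions so I can help.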

            // get all <li> elements whose parent is the element with id "title"
            var lis = doc.DocumentNode.Descendants("li").Where(n => n.ParentNode.Id.Equals("title"));
            foreach (var li in lis)
            {
                // get title and href; ?. yields null instead of throwing when a node is missing
                var h3 = li.Descendants("h3").FirstOrDefault();
                var title = h3?.InnerText;
                // the <a> is a descendant of the <h3>; href is an HtmlAttribute (use href?.Value for the string)
                var href = h3?.Descendants("a").FirstOrDefault()?.Attributes["href"];
                var desc = li.Descendants("div").FirstOrDefault()?.InnerHtml;
                // the <ul> of characters is optional, so fall back to an empty sequence
                var characters = li.Descendants("ul").FirstOrDefault()?.Descendants("li")
                                 ?? Enumerable.Empty<HtmlNode>();
                foreach (var character in characters)
                {
                    var val = character.InnerText;
                }
            }
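
To tie the loop above back to the original question (filling a list of books, including the href, without a NullReferenceException when the character list is absent), here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced. `BookParser` and `Parse` are hypothetical names, and this `Book` deliberately differs from the question's class: `Href` is a plain string, the description goes into `Description` rather than `Author`, and `Characters` is a list.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be referenced

public class Book
{
    public string Href { get; set; }
    public string Title { get; set; }
    public string Description { get; set; }
    public List<string> Characters { get; set; }
}

public static class BookParser
{
    public static List<Book> Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var books = new List<Book>();
        // HtmlNode.SelectNodes returns null (not an empty collection) when nothing matches
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes("//*[@id='title']/li");
        if (items == null) return books;

        foreach (HtmlNode li in items)
        {
            HtmlNode link = li.SelectSingleNode("./h3/a");
            HtmlNode desc = li.SelectSingleNode("./div");
            HtmlNodeCollection chars = li.SelectNodes("./ul/li"); // null when there is no <ul>

            books.Add(new Book
            {
                Href = link?.GetAttributeValue("href", string.Empty) ?? string.Empty,
                Title = HtmlEntity.DeEntitize(link?.InnerText ?? string.Empty).Trim(),
                Description = HtmlEntity.DeEntitize(desc?.InnerText ?? string.Empty).Trim(),
                // one entry per character; empty list when the <ul> is missing
                Characters = chars?.Select(c => c.InnerText.Trim()).ToList() ?? new List<string>()
            });
        }
        return books;
    }
}
```

Handling each top-level `li` as one unit keeps title, description, and characters aligned. The three parallel lists in the question drift out of step as soon as one item lacks a `ul`.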
    

    【Comments】:
