Trying to extract data from a webpage using HtmlAgilityPack

【Question title】: Trying to extract data from a webpage using HtmlAgilityPack
【Posted】: 2014-06-20 07:12:06
【Question】:

I am trying to extract a single piece of data from
http://www.dsebd.org/displayCompany.php?name=NBL
I have marked the required field in the attached image. XPath: /html/body/table[2]/tbody/tr/td[2]/table/tbody/tr[3]/td1/p1/table1/tbody/tr/td1/table/tbody/tr[2]/td[2]/font

Error: an exception is thrown and no data is found with that XPath. "An unhandled exception of type 'System.Net.WebException' occurred in HtmlAgilityPack.dll"

Source code:

static void Main(string[] args)
    {
        /************************************************************************/
        string tickerid = "Bse_Prc_tick";
        HtmlAgilityPack.HtmlDocument doc = new   HtmlWeb().Load(@"http://www.dsebd.org/displayCompany.php?name=NBL", "GET");

        if (doc != null)
        {
            // Fetch the stock price from the Web page
            string stockprice = doc.DocumentNode.SelectSingleNode(string.Format("./html/body/table[2]/tbody/tr/td[2]/table/tbody/tr[3]/td1/p1/table1/tbody/tr/td1/table/tbody/tr[2]/td[2]/font", tickerid)).InnerText;
            Console.WriteLine(stockprice);
        }
        Console.WriteLine("ReadKey Starts........");
        Console.ReadKey();
}

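【Comments】:

Are you sure the XPath is correct? Chrome's F12 tools show a different path for the field you marked.
I got the XPath from a Chrome extension called "XPath Helper". It shouldn't be wrong. Anyway, I'm checking it. Hopefully I can find the right one. @PTwr

Tags: c# web html-agility-pack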

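【Solution 1】:

Well, I checked. The XPath we were using is completely incorrect. The real fun begins when you try to find out where the error is.

Just look at the source of the page you are working with: besides many errors that get in the way of XPath, it even contains more than one html tag...

Chrome's dev tools and the tool you used work on the DOM tree as corrected by the browser (everything packed into a single html node, some tbody elements added, and so on).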
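To make the tbody point concrete, here is a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced; the HTML fragment, the class name `TbodyDemo`, and the helper `TextAt` are made up for illustration and are not taken from the real page. It only shows why a path copied from the browser can fail against raw source, and it does not repair the broken markup of the actual site.

```csharp
using System;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

public static class TbodyDemo
{
    // Hypothetical fragment written the way raw source often looks: no <tbody>.
    public const string SampleHtml =
        "<html><body><table><tr><td>Last Trade:</td><td>12.30</td></tr></table></body></html>";

    // Returns the trimmed text of the first node matching the XPath,
    // or null when nothing matches (SelectSingleNode returns null in that case).
    public static string TextAt(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText.Trim();
    }

    public static void Main()
    {
        // Path as a browser tool would report it, including the tbody the browser inserted.
        Console.WriteLine(TextAt(SampleHtml, "/html/body/table/tbody/tr/td[2]") ?? "(no match)");

        // Same path written against the raw source, without tbody.
        Console.WriteLine(TextAt(SampleHtml, "/html/body/table/tr/td[2]") ?? "(no match)");
    }
}
```

Checking the result for null before reading `InnerText` also avoids the NullReferenceException that the code in the question would hit whenever the path matches nothing.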

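Because the HTML structure is simply broken, HtmlAgilityPack's parsing of it comes out broken as well.

In this case you can either use RegExp or just search the source for known elements (which is much faster, but less flexible).

For example: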

...
using System.Net; // required for WebClient
...
        class Program
        {
            //entry point of console app
            static void Main(string[] args)
            {
                // url to download
                // "var" means I am too lazy to write "string" and let compiler decide typing
                var url = @"http://www.dsebd.org/displayCompany.php?name=NBL";

                // creating the object in a using block calls Dispose() on it as soon as the block ends, instead of leaving cleanup to the garbage collector at some later point
                using (WebClient client = new WebClient()) // WebClient class inherits IDisposable
                {

                    // simply download result to string, in this case it will be html code
                    string htmlCode = client.DownloadString(url);
                    // cut the html at the position of "Last Trade:" and keep everything from there on
                    // searching from the beginning of a string is easier/faster than searching in the middle
                    htmlCode = htmlCode.Substring(
                        htmlCode.IndexOf("Last Trade:")
                        );
                    // select from .. to .. and then remove leading and trailing whitespace characters
                    htmlCode = htmlCode.Substring("2\">", "</font></td>").Trim();
                    Console.WriteLine(htmlCode);
                }
                Console.ReadLine();
            }
        }
        // http://stackoverflow.com/a/17253735/3147740 <- copied from here
        // this is an extension class that adds the overloaded Substring() used in this code; it does what its comments say
        public static class StringExtensions
        {
            /// <summary>
            /// takes a substring between two anchor strings (or the end of the string if that anchor is null)
            /// </summary>
            /// <param name="this">a string</param>
            /// <param name="from">an optional string to search after</param>
            /// <param name="until">an optional string to search before</param>
            /// <param name="comparison">an optional comparison for the search</param>
            /// <returns>a substring based on the search</returns>
            public static string Substring(this string @this, string from = null, string until = null, StringComparison comparison = StringComparison.InvariantCulture)
            {
                var fromLength = (from ?? string.Empty).Length;
                var startIndex = !string.IsNullOrEmpty(from)
                    ? @this.IndexOf(from, comparison) + fromLength
                    : 0;

                if (startIndex < fromLength) { throw new ArgumentException("from: Failed to find an instance of the first anchor"); }

                var endIndex = !string.IsNullOrEmpty(until)
                ? @this.IndexOf(until, startIndex, comparison)
                : @this.Length;

                if (endIndex < 0) { throw new ArgumentException("until: Failed to find an instance of the last anchor"); }

                var subString = @this.Substring(startIndex, endIndex - startIndex);
                return subString;
            }
        }
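For comparison, the RegExp route mentioned above might look like the sketch below. It reuses the same anchors as the string search ("Last Trade:", `2">` and `</font></td>`), so it rests on the same assumption about the page markup. The sample string, the class name `LastTradeRegex`, and the method `Extract` are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LastTradeRegex
{
    // Hypothetical fragment shaped like the anchors used in the string search above.
    public const string Sample =
        "<td>Last Trade:</td><td><font size=\"2\"> 12.30 </font></td>";

    // Singleline lets '.' match newlines, since the label and the value
    // may sit on different lines in the real source.
    private static readonly Regex Pattern = new Regex(
        @"Last Trade:.*?2"">\s*(?<value>.*?)\s*</font></td>",
        RegexOptions.Singleline);

    // Returns the captured value, or null when the pattern does not match.
    public static string Extract(string html)
    {
        Match m = Pattern.Match(html);
        return m.Success ? m.Groups["value"].Value : null;
    }

    public static void Main()
    {
        Console.WriteLine(Extract(Sample) ?? "(no match)");
    }
}
```

Unlike the Substring version, a missing anchor here yields null rather than an exception, which may be easier to handle if the page layout changes.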

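【Comments】:

@Leon: I "fixed" the problem with the XPath, see the edited post.
Works perfectly. Thank you for your valuable time. Forgive my ignorance, but your code is a bit complicated for me, as I am a new learner. I find XPath a little simpler. Thanks anyway. It works, and I will learn the approach. @PTwr
@Leon Then I'll add some comments.

【Solution 2】: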

Wrap your code in a try-catch to get more information about the exception.
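A minimal sketch of that suggestion follows. It uses WebClient as a stand-in for the HtmlWeb().Load call in the question, on the assumption that both surface network failures as System.Net.WebException, which is the type named in the error message. The class name `FetchDiagnostics` and the method `Describe` are hypothetical.

```csharp
using System;
using System.Net;

public static class FetchDiagnostics
{
    // Runs the download and returns either a short success note
    // or a readable description of the WebException that was thrown.
    public static string Describe(Func<string> download)
    {
        try
        {
            string body = download();
            return "OK: " + body.Length + " characters";
        }
        catch (WebException ex)
        {
            // Status separates timeouts, DNS failures and protocol errors;
            // for protocol errors the response carries the HTTP status code.
            var http = ex.Response as HttpWebResponse;
            string code = http != null ? " HTTP " + (int)http.StatusCode : "";
            return "WebException (" + ex.Status + code + "): " + ex.Message;
        }
    }

    public static void Main()
    {
        using (var client = new WebClient())
        {
            Console.WriteLine(Describe(
                () => client.DownloadString("http://www.dsebd.org/displayCompany.php?name=NBL")));
        }
    }
}
```

The printed status should indicate whether the failure is a connectivity problem or an error returned by the server, which is a separate issue from the XPath not matching.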

【Comments】: