【Question title】: C# HtmlAgilityPack: startIndex cannot be larger than length of string
【Posted】: 2014-02-21 19:19:15
【Question description】:

I am trying to do something like this:

var document = htmlWeb.Load(searchUrl);
var hotels = document.DocumentNode.Descendants("div")
             .Where(x => x.Attributes.Contains("class") &&
             x.Attributes["class"].Value.Contains("listing-content"));

int count = 1;
foreach (var hotel in hotels)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.Load(hotel.InnerText);      
    if (htmlDoc.DocumentNode != null)
    {
        var anchors = htmlDoc.DocumentNode.Descendants("div")
                    .Where(x => x.Attributes.Contains("class") &&
                    x.Attributes["class"].Value.Contains("srp-business-name")); // Error Occurring in here //
        foreach (var anchor in anchors)
        {
            Console.WriteLine(anchor.InnerHtml);
        }
    }
}

I get results like this:

<a href="http://ad.doubleclick.net/clk;234504055;58257942;j?http://www.marriott.com/NYCMQ" class="url  mip-link" data-analytics="{&quot;click_id&quot;:1601,&quot;rank&quot;:1,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;target&quot;:&quot;name&quot;,&quot;supermedia&quot;:true}" rel="nofollow">New York Marriott Marquis</a>
<a href="http://www.yellowpages.com/new-york-ny/mip/new-york-marriott-marquis-468349733?lid=1000372156461" class="no-tracks hidden url" data-analytics="{&quot;click_id&quot;:1601,&quot;rank&quot;:1,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;target&quot;:&quot;name&quot;,&quot;supermedia&quot;:true}" rel="nofollow"></a>
<span class="external-link">
<img height="15" src="/images/sprites/search/icon-link-external.png" width="16">
</span>

<a href="http://www.yellowpages.com/new-york-ny/mip/courtyard-by-marriott-new-york-manhattan-times-square-south-2198956?lid=178101818" class="url redbold mip-link" data-analytics="{&quot;click_id&quot;:1600,&quot;rank&quot;:2,&quot;act&quot;:1,&quot;FL&quot;:&quot;list&quot;,&quot;target&quot;:&quot;name&quot;,&quot;supermedia&quot;:&quot;&quot;}">Courtyard by Marriott New York Manhattan/Times Square South</a>

and so on.

Now I want the InnerHtml of the anchor tags such as the one with class="url redbold mip-link". So I am doing this:

var document = htmlWeb.Load(searchUrl);
var hotels = document.DocumentNode.Descendants("div")
             .Where(x => x.Attributes.Contains("class") &&
             x.Attributes["class"].Value.Contains("listing-content"));

int count = 1;
foreach (var hotel in hotels)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.OptionFixNestedTags = true;
    htmlDoc.Load(hotel.InnerText);      
    if (htmlDoc.DocumentNode != null)
    {
        var anchors = htmlDoc.DocumentNode.Descendants("div")
                    .Where(x => x.Attributes.Contains("class") &&
                    x.Attributes["class"].Value.Contains("srp-business-name"));
        foreach (var anchor in anchors)
        {
            htmlDoc.LoadHtml(anchor.InnerHtml);
            var hoteltags = htmlDoc.DocumentNode.SelectNodes("//a");
            foreach (var tag in hoteltags)
            {
                if (!string.IsNullOrEmpty(tag.InnerHtml) || !string.IsNullOrWhiteSpace(tag.InnerHtml))
                {
                    Console.WriteLine(tag.InnerHtml);
                }
            }

        }
    }
}

I get the first result correctly, i.e. New York Marriott Marquis, but on the second result an error occurs: startIndex cannot be larger than length of string. What am I doing wrong?

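【Comments】:

  • On which line does the exception occur?
  • I strongly believe this code will not produce the exception you mention.
  • Keith Payne: Yes, I am getting this error. I have updated my question, and a comment in the code now marks where the error occurs.
  • Sudhakar Tillapudi: I have updated my question; a comment in the code marks where the error occurs.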

Tags: c# html-parsing html-agility-pack


【Solution 1】:

You are using the same DOM object for every operation:

foreach (var hotel in hotels)
{
    HtmlDocument htmlDoc = new HtmlDocument();

Afterwards, you use that same object to load the anchor tag:

foreach (var anchor in anchors)
        {
            htmlDoc.LoadHtml(anchor.InnerHtml);
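
One plausible reason the message is specifically "startIndex cannot be larger than length of string" is that parsed nodes may keep only offsets into the document text and resolve attribute values on demand. I have not checked this against the HtmlAgilityPack source, so treat the following as a toy model under that assumption. `ToyDocument`, `ToyAttribute` and `StaleOffsetDemo` are hypothetical names and not part of any library.

```csharp
using System;

// Toy model only; these types are hypothetical and are not HtmlAgilityPack's code.
public sealed class ToyDocument
{
    public string Text { get; private set; } = "";

    // Replaces the whole text, as reloading a document would.
    public void LoadHtml(string html) { Text = html; }
}

public sealed class ToyAttribute
{
    private readonly ToyDocument _owner;
    private readonly int _start;
    private readonly int _length;

    public ToyAttribute(ToyDocument owner, int start, int length)
    {
        _owner = owner;
        _start = start;
        _length = length;
    }

    // Resolved lazily against whatever text the owner holds right now.
    public string Value
    {
        get { return _owner.Text.Substring(_start, _length); }
    }
}

public static class StaleOffsetDemo
{
    // Returns true if reading the attribute after a reload throws.
    public static bool ThrowsAfterReload()
    {
        const string className = "srp-business-name";
        string html = "<div class=\"" + className + "\"><a>Hotel</a></div>";

        var doc = new ToyDocument();
        doc.LoadHtml(html);
        int start = html.IndexOf(className, StringComparison.Ordinal);
        var attr = new ToyAttribute(doc, start, className.Length);

        Console.WriteLine(attr.Value); // offsets still match the text

        doc.LoadHtml("<a>x</a>"); // much shorter text, so the offsets are stale
        try
        {
            Console.WriteLine(attr.Value);
            return false;
        }
        catch (ArgumentOutOfRangeException ex)
        {
            Console.WriteLine(ex.Message);
            return true;
        }
    }
}
```

In the question's code the `anchors` query is evaluated lazily, so the `Where` lambda runs again for the next node only after `htmlDoc.LoadHtml(anchor.InnerHtml)` has already replaced the text. That would match the first result printing correctly and the second one failing.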

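Reloading `htmlDoc` inside the loop replaces the document that the lazily evaluated `anchors` query is still walking. Use a separate document in the second loop and it should work as expected: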

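foreach (var anchor in anchors)
{
    // A separate document, so htmlDoc (still being enumerated) is left untouched.
    var htmlDocAnchor = new HtmlDocument();
    htmlDocAnchor.LoadHtml(anchor.InnerHtml);
    // ... then run SelectNodes("//a") on htmlDocAnchor instead of htmlDoc
}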
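
For reference, here is a minimal end-to-end sketch of the same idea that runs against an inline HTML string instead of a live page. The `HotelNames` class, the `Extract` method and the `ClassContains` helper are hypothetical names introduced for illustration. It assumes the HtmlAgilityPack package is referenced, and it uses `LoadHtml` with `InnerHtml` because `Load(string)` expects a file path rather than markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class HotelNames
{
    // Hypothetical helper: true if the node's class attribute contains the fragment.
    private static bool ClassContains(HtmlNode node, string fragment)
    {
        return node.GetAttributeValue("class", "").Contains(fragment);
    }

    // Collects the non-empty link texts inside each "srp-business-name" div
    // that sits inside a "listing-content" div.
    public static List<string> Extract(string html)
    {
        var document = new HtmlDocument();
        document.LoadHtml(html);

        var names = new List<string>();
        var hotels = document.DocumentNode.Descendants("div")
            .Where(d => ClassContains(d, "listing-content"));

        foreach (var hotel in hotels)
        {
            // One document per hotel, parsed from the hotel's markup.
            var hotelDoc = new HtmlDocument();
            hotelDoc.OptionFixNestedTags = true;
            hotelDoc.LoadHtml(hotel.InnerHtml);

            var nameDivs = hotelDoc.DocumentNode.Descendants("div")
                .Where(d => ClassContains(d, "srp-business-name"));

            foreach (var nameDiv in nameDivs)
            {
                // A separate document again, so hotelDoc is never reloaded
                // while nameDivs is still being enumerated.
                var anchorDoc = new HtmlDocument();
                anchorDoc.LoadHtml(nameDiv.InnerHtml);

                // Descendants yields an empty sequence when nothing matches,
                // whereas SelectNodes("//a") returns null in that case.
                foreach (var link in anchorDoc.DocumentNode.Descendants("a"))
                {
                    var text = link.InnerText.Trim();
                    if (text.Length > 0)
                    {
                        names.Add(text);
                    }
                }
            }
        }

        return names;
    }
}
```

Calling `HotelNames.Extract(html)` on markup shaped like the output in the question should return one entry per non-empty anchor. The extra documents are arguably unnecessary, since the same `Descendants` calls could be made directly on `hotel` and `nameDiv`, but the sketch keeps the structure of the original code.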
