【Question Title】: XPath retrieving <a> href, text, and <span>
【Posted】: 2017-04-21 01:28:44
【Question】:

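I'm currently scraping several websites and storing the retrieved information in a database for later use. I'm using HtmlAgilityPack and have already done this successfully for a few sites, but for some reason this one is giving me trouble. I'm still new to XPath syntax, so I may well have made a mistake.

This is what the markup I'm trying to extract from looks like: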

<form ... id="_subcat_ids_">
  <input ....>
  <ul ...>
    <li ....>
      <input .....>
      <a class="facet-seleection multiselect-facets "
      .... href="INeedThisHref#1">
      Text I Need                          //need to retrieve this text between the <a></a>
      <span class="subtle-note">(2)</span> //I Need that number from inside the span
      </a>
    </li>
    <li ....>
      <input .....>
      <a class="facet-seleection multiselect-facets "
      .... href="INeedThisHref#2">
      Text I Need #2                        //need to retrieve this text between the <a></a>
      <span class="subtle-note">(6)</span> //I Need that number from inside the span
      </a>
    </li>

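Each <li> represents one item on the page, but I'm only interested in what is inside each <a></a>. I want to retrieve the href value from the <a>, then the text between its opening and closing tags, and then the text inside the <span>. I omitted the contents of the other tags because they don't help to identify each item uniquely; the class on the <a> is the only thing the items share, and they all sit inside the form with id="_subcat_ids_".

Here is my code: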

try
{
  string fullUrl = "...";
  HtmlWeb web = new HtmlWeb();
  ServicePointManager.SecurityProtocol = SecurityProtocolType.Ssl3 | SecurityProtocolType.Tls | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls12;
  HtmlDocument html = web.Load(fullUrl);

  foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']")) //this gets me into the form 
  {
    foreach (HtmlNode node2 in node.SelectNodes(".//a[@class='facet-selection  multiselect-facets ']")) //this should get me into the <a> tags, but it throws 'object reference not set to an instance of an object'
    {
      //get the href
      string tempHref = node2.GetAttributeValue("href", string.Empty);
      //get the text between <a>
      string tempCat = node2.InnerText.Trim();
      //get the text between <span>
      string tempNum = node2.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
    }
  }
}
catch (Exception ex)
{
  Console.Write("\nError: " + ex.ToString());
}

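The first foreach loop runs without error, but the line containing the second foreach throws 'object reference not set to an instance of an object'. As I mentioned, I'm still new to this syntax. I've used this approach with great success on another site, but I'm having trouble with this one. Any tips would be greatly appreciated.

【Question Comments】:

  • Check that the details you posted are accurate: your XPath expression and the HTML sample contain several typos and inconsistencies, e.g. seleection vs. selection, the spaces in the class name...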
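
The whitespace problem raised in the comment can be sidestepped by matching a single class token instead of comparing the whole attribute string. The following is only a minimal sketch: the markup string is hypothetical (modelled on the sample in the question, with the class spelled "facet-selection" and a made-up href), and a <div> wrapper is used instead of the real form. It contrasts an exact comparison on @class with a token match built from the standard XPath 1.0 functions contains, concat and normalize-space.

```csharp
using System;
using HtmlAgilityPack;

class ClassMatchSketch
{
    static void Main()
    {
        // Hypothetical markup: one space between the class names plus a trailing space.
        const string markup =
            "<div><a class=\"facet-selection multiselect-facets \" href=\"/cat/1\">" +
            "Shoes <span class=\"subtle-note\">(2)</span></a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        // Exact string comparison: the literal below has TWO spaces between the
        // class names, so it does not equal the attribute value above.
        HtmlNodeCollection exact = doc.DocumentNode.SelectNodes(
            "//a[@class='facet-selection  multiselect-facets ']");
        Console.WriteLine(exact == null
            ? "exact match: no nodes (SelectNodes returned null)"
            : "exact match: " + exact.Count + " node(s)");

        // Token match: normalize the whitespace in @class, pad it with spaces,
        // then look for one space-delimited class name.
        HtmlNodeCollection byToken = doc.DocumentNode.SelectNodes(
            "//a[contains(concat(' ', normalize-space(@class), ' '), ' multiselect-facets ')]");
        Console.WriteLine(byToken == null
            ? "token match: no nodes (SelectNodes returned null)"
            : "token match: " + byToken.Count + " node(s)");
    }
}
```

Note that SelectNodes returns null rather than an empty collection when nothing matches, so iterating over its result directly in a foreach is one way to end up with the 'object reference not set to an instance of an object' error.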

Tags: c# html xpath html-agility-pack


【Solution 1】:

I figured it out; here is the code:

foreach (HtmlNode node in html.DocumentNode.SelectNodes("//form[@id='_subcat_ids_']"))
{
  //get the categories, store in list
  foreach (HtmlNode node2 in node.SelectNodes("..//a[@class='facet-selection  multiselect-facets ']//text()[normalize-space() and not(ancestor::span)]"))
  {
    string tempCat = node2.InnerText.Trim();
    categoryList.Add(tempCat);
    Console.Write("\nCategory: " + tempCat);           
  }
  foreach (HtmlNode node3 in node.SelectNodes("..//a[@class='facet-selection  multiselect-facets ']"))
  {
    //get href for each category, store in list
    string tempHref = node3.GetAttributeValue("href", string.Empty);
    LinkCatList.Add(tempHref);
    Console.Write("\nhref: " + tempHref);
    //get the number of items in the category, stripping the surrounding parentheses
    string tempNum = node3.SelectSingleNode(".//span[@class='subtle-note']").InnerText.Trim();
    tempNum = tempNum.Replace("(", "").Replace(")", "");
    Console.Write("\nNumber of items: " + tempNum + "\n\n");
  }
}

Works like a charm.
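
A possible explanation for why "..//a" finds the links while ".//a" does not, which I cannot confirm for this particular page: some HtmlAgilityPack releases register <form> as an empty element in HtmlNode.ElementsFlags, so the form's contents are parsed as its siblings rather than its children. The sketch below assumes that is the cause and removes the flag before parsing. The markup, hrefs and category names are hypothetical. It also reads the href, text and count in a single loop with null checks, so the three values for one item stay together.

```csharp
using System;
using HtmlAgilityPack;

class FacetScrapeSketch
{
    static void Main()
    {
        // Assumption: the installed HtmlAgilityPack version treats <form> as empty.
        // Removing the entry is harmless if the flag is not set that way.
        HtmlNode.ElementsFlags.Remove("form");

        // Hypothetical markup modelled on the sample in the question.
        const string markup =
            "<form id=\"_subcat_ids_\"><ul>" +
            "<li><a class=\"facet-selection multiselect-facets \" href=\"/cat/1\">" +
            "Shoes <span class=\"subtle-note\">(2)</span></a></li>" +
            "<li><a class=\"facet-selection multiselect-facets \" href=\"/cat/2\">" +
            "Hats <span class=\"subtle-note\">(6)</span></a></li>" +
            "</ul></form>";

        var doc = new HtmlDocument();
        doc.LoadHtml(markup);

        // Match on one class name so stray whitespace in @class does not matter.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes(
            "//form[@id='_subcat_ids_']//a[contains(@class, 'multiselect-facets')]");

        // SelectNodes returns null when nothing matches, so guard before looping.
        if (anchors == null)
        {
            Console.WriteLine("No matching anchors found.");
            return;
        }

        foreach (HtmlNode a in anchors)
        {
            string href = a.GetAttributeValue("href", string.Empty);

            // First non-blank text node directly under <a>, i.e. not the <span> text.
            HtmlNode textNode = a.SelectSingleNode("text()[normalize-space()]");
            string category = textNode == null
                ? string.Empty
                : HtmlEntity.DeEntitize(textNode.InnerText).Trim();

            // Item count from the <span>, with the surrounding parentheses removed.
            HtmlNode span = a.SelectSingleNode(".//span[@class='subtle-note']");
            string count = span == null
                ? string.Empty
                : span.InnerText.Trim().Trim('(', ')');

            Console.WriteLine("Category: " + category + ", href: " + href + ", items: " + count);
        }
    }
}
```

Collecting the three values per <a> in one pass avoids the risk of the category list and the href list drifting out of step if one of the XPath queries skips an item.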

【Comments】:
