[Question title]: xpath and htmlagilitypack iterate through like nodes
[Posted]: 2013-07-21 13:25:30
[Question]:

The HTML I am scraping is below. It contains one original post and 2 replies:

<div class="share_buttons noprint">...</div>

<strong>Dan</strong> Says:<br/>
<span class="small soft"><time datetime="2009-10-05T02:27:38Z">Sun, Oct 04 '09, 7:27 PM</time></span>
<div class="quote_top">&nbsp;</div>
<div class="quote_item">Hello all, this is my original post.<br/></div>

<form class="action_heading noprint">
<strong>Page</strong> 
...
</form>

<div class="post_number" id="r_140626">1</div>
<strong>AnnieMae</strong> Says:<br/>
<span class="small soft"><time datetime="2009-10-05T02:30:27Z">Sun, Oct 04 '09, 7:30 PM</time></span>
<div class="quote_top clear_float">&nbsp;</div>
<div class="quote_item">What do you think of it?<br/></div>

<div class="post_number" id="r_140627">2</div>
<strong>Thomas77</strong> Says:<br/>
<span class="small soft"><time datetime="2009-10-05T02:32:32Z">Sun, Oct 04 '09, 7:32 PM</time></span>
<div class="quote_top clear_float">&nbsp;</div>
<div class="quote_item">Not really sure, can't see this pic?<br/>
</div>

So I have already figured out how to get the original post...

'get AUTHOR and DATE of original post
Dim divOriginalPostAuthor As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::strong")
Dim divOriginalPostDate As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::span/time")

Dim strDate As String = divOriginalPostDate.InnerText.Trim
strDate = strDate.Remove(0, InStr(strDate, ", ")).Trim
strDate = Replace(strDate, "'", "20") 'expand the two-digit year: '09 -> 2009
Dim strAuthor As String = (divOriginalPostAuthor.InnerText).Trim
dtPosted = CDate(strDate)
divOriginalPostText = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::div[@class='quote_item']")
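
As an aside to the string manipulation above: the `time` element in the sample HTML already carries an ISO 8601 timestamp in its `datetime` attribute, so the display text may not need to be cleaned up at all. A minimal sketch, assuming that attribute is always present and well formed; `ParseTimeAttribute` is a hypothetical helper name, not something from the original post:

```vbnet
Imports System
Imports System.Globalization

Module TimestampParsing
    ' Hypothetical helper: converts the value of a <time datetime="..."> attribute
    ' (for example "2009-10-05T02:27:38Z") into a UTC DateTime.
    Public Function ParseTimeAttribute(value As String) As DateTime
        ' AdjustToUniversal keeps the result in UTC instead of converting to local time.
        Return DateTime.Parse(value, CultureInfo.InvariantCulture, DateTimeStyles.AdjustToUniversal)
    End Function

    Sub Main()
        Dim posted As DateTime = ParseTimeAttribute("2009-10-05T02:27:38Z")
        Console.WriteLine(posted.Year & "-" & posted.Month & "-" & posted.Day)
    End Sub
End Module
```

With HtmlAgilityPack the attribute value could be read from the node selected above with `divOriginalPostDate.GetAttributeValue("datetime", "")` and passed to this helper.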

Now I am just trying to figure out how to get the replies... I was thinking of getting the current line position like this:

Dim currentNodePosition As Integer = threadDoc.DocumentNode.SelectSingleNode("//form[@class='action_heading noprint']").Line

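and then using it to walk through the replies as I increment the line position. What makes this tricky for me is that the replies have no "container" HTML element that I could select them by... Any ideas?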

[Comments]:

    Tags: asp.net vb.net xpath html-agility-pack


    [Answer 1]:

    Just for the record, I figured this out myself and wanted to post the answer for anyone who needs it in the future.

    'then get thread replies
    Dim nodesPostNumber As HtmlNodeCollection = threadDoc.DocumentNode.SelectNodes("//form[@class='action_heading noprint']/following-sibling::div[contains(@id, 'r_')]")
    Dim replies As New List(Of ThreadReply)
    
    'SelectNodes returns Nothing (not an empty collection) when nothing matches
    If nodesPostNumber IsNot Nothing Then
        Dim intNumberOfReplies As Integer = nodesPostNumber.Count
        For i As Integer = 1 To intNumberOfReplies
            'the i-th date span after the form belongs to the i-th reply
            Dim nodeReplyDate As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//form[@class='action_heading noprint']/following-sibling::span[@class='small soft' and position()=" + i.ToString + "]")
            Dim strXPathForDate As String = nodeReplyDate.XPath
            Dim strReplyText As String = threadDoc.DocumentNode.SelectSingleNode(strXPathForDate + "/following-sibling::div[@class='quote_item']").InnerHtml
            'cut off the trailing noprint block, but only if this reply has one
            Dim intNoPrint As Integer = InStr(strReplyText, "<div class=""noprint""")
            If intNoPrint > 0 Then strReplyText = Left(strReplyText, intNoPrint - 1)
            '[1] selects the nearest preceding <strong>, i.e. this reply's author
            Dim strReplyAuthor As String = threadDoc.DocumentNode.SelectSingleNode(strXPathForDate + "/preceding-sibling::strong[1]").InnerText
            Dim strReplyDate As String = nodeReplyDate.InnerText.Trim
            strReplyDate = strReplyDate.Remove(0, InStr(strReplyDate, ", ")).Trim
            strReplyDate = Replace(strReplyDate, "'", "20")
            strReplyDate = Replace(strReplyDate, "via mobile", "").Trim
            Dim thisReply As New ThreadReply With {.Author = strReplyAuthor, .DatePosted = strReplyDate, .ThreadID = thisThread.ThreadID, .Text = strReplyText}
            replies.Add(thisReply)
        Next
    End If
    

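    So the idea is to "grab" a node that belongs to one reply and use it again in XPath, which ensures you only pick up content that comes after the node you grabbed. I did this with HtmlNode.XPath, which gives you the XPath string of any given HtmlAgilityPack.HtmlNode, and then appended "/following-sibling".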
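
The same "use a grabbed node as the starting point" idea can also be expressed without building XPath strings from HtmlNode.XPath, because HtmlAgilityPack evaluates a relative XPath from the node it is called on. The sketch below is an alternative, not the method used in the answer. It assumes every reply is introduced by a `div` with class `post_number`, as in the sample HTML; `ReplyInfo` and `ExtractReplies` are hypothetical names standing in for the `ThreadReply` class, which the post does not show.

```vbnet
Imports System
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module ReplyScraper
    ' Hypothetical stand-in for the ThreadReply class from the answer.
    Public Class ReplyInfo
        Public Property Author As String
        Public Property PostedUtc As String
        Public Property Text As String
    End Class

    Public Function ExtractReplies(html As String) As List(Of ReplyInfo)
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        Dim result As New List(Of ReplyInfo)
        ' Each reply starts with <div class="post_number" id="r_...">.
        Dim markers As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='post_number']")
        If markers Is Nothing Then Return result

        For Each marker As HtmlNode In markers
            ' Relative paths start at the marker; [1] picks the nearest following match.
            Dim author As HtmlNode = marker.SelectSingleNode("following-sibling::strong[1]")
            Dim stamp As HtmlNode = marker.SelectSingleNode("following-sibling::span[1]/time")
            Dim body As HtmlNode = marker.SelectSingleNode("following-sibling::div[@class='quote_item'][1]")
            ' Skip markers whose reply is incomplete rather than failing on Nothing.
            If author Is Nothing OrElse stamp Is Nothing OrElse body Is Nothing Then Continue For

            result.Add(New ReplyInfo With {
                .Author = author.InnerText.Trim(),
                .PostedUtc = stamp.GetAttributeValue("datetime", ""),
                .Text = HtmlEntity.DeEntitize(body.InnerText).Trim()
            })
        Next
        Return result
    End Function
End Module
```

Anchoring on the reply marker keeps the author, date, and text of one reply together, so no counter has to stay in step with the number of date spans on the page.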

    [Comments]:
