【Question title】: Get HTML content with variable tags and extract innerText with VBA for Excel
【Posted】: 2021-07-12 19:03:42
【Question】:

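I only want the number between "ca." and "m²" in the text line. How can I do this in VBA, so that I don't need extra string formulas in Excel?

A further problem is that the inner text in the HTML is sometimes inside a tr.td.p tag, sometimes directly inside a tr.td tag (no p), and sometimes inside a tr.td.b tag. In that last case, "Description" is replaced by "Appointment" in the corresponding td tag.

Is there VBA code that uses querySelectorAll to check for these cases and extract the value? Something like: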

myString = html.querySelectorAll("tr td").item(x).innerText

If InStr(1, myString, "Description", vbTextCompare) > 0 Then
    'NEED VBA CODE, value must be the number in the innerText of td.p or td
ElseIf InStr(1, myString, "Appointment", vbTextCompare) > 0 Then
    'NEED VBA CODE, value must be the last word of the innerText in td.b
End If

These are 3 different snippets of the same attribute for different items:

<tr>
<td valign="top" align="left">Description:</td>
<td valign="top" align="left">
<p>
textA textB textC ca. 140 m².
</p>
</td>
</tr>

<tr>
<td valign="top" align="left">Description:</td>
<td valign="top" align="left">
textA textB textC ca. 85 m².
</td>
</tr>

<tr>
<td valign="top" align="left">Appointment</td>
<td valign="top" align="left">
<b>
textA textB textC canceled!
</b>
</td>
</tr>

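【Comments】:

  • 1. If you can identify the right tr tag, it does not matter whether the text sits directly in the td tag or in a p tag: getting the innerText of the tr tag is enough. Other tags inside the tr tag are ignored. 2. Use Split() on the innerText and take the second-to-last element. Then you have what you want. docs.microsoft.com/de-de/office/vba/language/reference/… 3. If you need more information, it is always the same: please post the URL in question.
  • @Zwenn: Thanks! How do I identify the tr tags when they have no ID, name, etc.? They are all parents of td tags and children of table.tbody, and their number varies every time.
  • Only you know the whole HTML. My approach would be to get the right table tag and extract all of its tr tags. But I don't know whether there is a way to identify the right table tag. Please see point 3 of my comment.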
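
The Split() suggestion from the first comment could look roughly like the sketch below. The helper names WordFromEnd and ValueFromRow are hypothetical and not from the thread. The sketch assumes the words are separated by ordinary spaces or line breaks (non-breaking spaces are not handled) and that the number is always the second-to-last word of a "Description" row.

```vba
Option Explicit

'Hypothetical helper: returns the word that sits `offset` positions
'before the last word (offset 0 = last word, 1 = second-to-last).
Public Function WordFromEnd(ByVal s As String, ByVal offset As Long) As String
    Dim words() As String

    'innerText may contain line breaks and runs of spaces; normalise them
    s = Replace(Replace(s, vbCr, " "), vbLf, " ")
    Do While InStr(s, "  ") > 0
        s = Replace(s, "  ", " ")
    Loop
    s = Trim$(s)

    words = Split(s, " ")
    'Split("") yields an array with UBound = -1, so this also covers empty input
    If UBound(words) - offset >= 0 Then WordFromEnd = words(UBound(words) - offset)
End Function

'Hypothetical dispatcher for the two cases described in the question.
Public Function ValueFromRow(ByVal label As String, ByVal valueText As String) As String
    If InStr(1, label, "Description", vbTextCompare) > 0 Then
        ValueFromRow = WordFromEnd(valueText, 1)     'number before "m²."
    ElseIf InStr(1, label, "Appointment", vbTextCompare) > 0 Then
        ValueFromRow = WordFromEnd(valueText, 0)     'last word
    End If
End Function
```

For a table row, this would be called with the innerText of the first and second cell. Because innerText already flattens nested p or b tags, the three snippets from the question need no separate handling. Note that the last word keeps its punctuation ("canceled!").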

Tags: html excel vba web-scraping css-selectors


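【Solution 1】:

You can extract the links to the detail documents from the response to the POST request, then visit each link with Internet Explorer, making sure to supply the correct Referer header, and then use a regular expression to pull out the measurement.

TODO: The code really needs refactoring, because a lot happens in the main sub. Ideally, each sub/function should do roughly one thing.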

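Option Explicit

'Requires references (VBE > Tools > References):
'  Microsoft HTML Object Library   (MSHTML)
'  Microsoft Internet Controls     (SHDocVw)
'  Microsoft Scripting Runtime     (Scripting.Dictionary)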

Public Sub GetDataZvgPort()
    Const URL = "https://www.zvg-portal.de/index.php?button=Suchen"
    Dim html As MSHTML.HTMLDocument, xhr As Object

    Set html = New MSHTML.HTMLDocument
    Set xhr = CreateObject("MSXML2.ServerXMLHTTP.6.0")

    With xhr
        .Open "POST", URL, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
        .send "land_abk=ni&ger_name=Peine&order_by=2&ger_id=P2411"
        html.body.innerHTML = .responseText
    End With

    Dim table As MSHTML.HTMLTable, r As Long, c As Long, headers(), row As MSHTML.HTMLTableRow
    Dim results() As Variant, html2 As MSHTML.HTMLDocument

    headers = Array("Aktenzeichen", "Amtsgericht", "Objekt/Lage", "Verkehrswert in €", "Termin", "Pdf-Link", "Addit Info Link", "m²")

    ReDim results(1 To 100, 1 To UBound(headers) + 1)

    Set table = html.querySelector("table")
    Set html2 = New MSHTML.HTMLDocument

    Dim lastRow As Boolean

    For Each row In table.Rows
        lastRow = False
        Dim header As String

        html2.body.innerHTML = row.innerHTML
        header = Trim$(row.Children(0).innerText)

        If header = "Aktenzeichen" Then          'start of new block. Assumes all blocks have this
            r = r + 1
            Dim dict As Scripting.Dictionary: Set dict = GetBlankDictionary(headers)
            On Error Resume Next
            dict("Addit Info Link") = Replace$(html2.querySelector("a").href, "about:", "https://www.zvg-portal.de/")
            On Error GoTo 0
        End If

        If dict.Exists(header) Then dict(header) = Trim$(row.Children(1).innerText)

        If (header = vbNullString And html2.querySelectorAll("a").Length > 0) Then
            dict("Pdf-Link") = Replace$(html2.querySelector("a").href, "about:blank", "https://www.zvg-portal.de/index.php")
            lastRow = True
        ElseIf header = "Termin" Then
            If row.NextSibling.NodeType = 1 Then lastRow = True
        End If

        If lastRow Then
            populateArrayFromDict dict, results, r
        End If
    Next

    results = Application.Transpose(results)
    ReDim Preserve results(1 To UBound(headers) + 1, 1 To r)
    results = Application.Transpose(results)
    
    Dim re As Object
    
    Set re = CreateObject("VBScript.RegExp")
    
    With re
        .Global = False
        .MultiLine = False
        .IgnoreCase = True
        .Pattern = "\s([0-9.]+)\sm²"
    End With

    Dim ie As SHDocVw.InternetExplorer
    
    Set ie = New SHDocVw.InternetExplorer
    
    With ie
        .Visible = True
        
        For r = LBound(results, 1) To UBound(results, 1)
            
            If results(r, 7) <> vbNullString Then
                
                .Navigate2 results(r, 7), headers:="Referer: " & URL
                
                While .Busy Or .readyState <> READYSTATE_COMPLETE: DoEvents: Wend
 
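                'Skip pages that have no #anzeige element or no "<number> m²" match
                Dim anzeige As Object, matches As Object
                Set anzeige = .document.querySelector("#anzeige")
                If Not anzeige Is Nothing Then
                    Set matches = re.Execute(anzeige.innerHTML)
                    If matches.Count > 0 Then results(r, 8) = matches(0).SubMatches(0)
                End If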
   
            End If
            
        Next
        
        .Quit
        
    End With
    
    With ActiveSheet
        .Cells(1, 1).Resize(1, UBound(headers) + 1) = headers
        .Cells(2, 1).Resize(UBound(results, 1), UBound(results, 2)) = results
    End With

End Sub

Public Sub populateArrayFromDict(ByVal dict As Scripting.Dictionary, ByRef results() As Variant, ByVal r As Long)
    Dim key As Variant, c As Long

    For Each key In dict.Keys
        c = c + 1
        results(r, c) = Replace$(dict(key), " (Detailansicht)", vbNullString)
    Next

End Sub

Public Function GetBlankDictionary(ByRef headers() As Variant) As Scripting.Dictionary
    Dim dict As Scripting.Dictionary, i As Long

    Set dict = New Scripting.Dictionary

    For i = LBound(headers) To UBound(headers)
        dict(headers(i)) = vbNullString
    Next

    Set GetBlankDictionary = dict
End Function
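
As a small step towards the refactoring mentioned in the TODO, the regular-expression part could be isolated in its own function. The following is only a sketch: the name ExtractSquareMetres is hypothetical, and it reuses the pattern from the code above unchanged, with each part of the pattern commented.

```vba
Option Explicit

'Hypothetical helper: returns the number in front of "m²", or "" if none is found.
'Late bound, so no reference to the VBScript regular expression library is needed.
Public Function ExtractSquareMetres(ByVal s As String) As String
    Dim re As Object, matches As Object

    Set re = CreateObject("VBScript.RegExp")
    With re
        .Global = False          'only the first match is of interest
        .IgnoreCase = True
        '\s         one whitespace character before the number
        '([0-9.]+)  capture group: one or more digits or dots
        '\s         one whitespace character after the number
        'm²         the literal unit
        .Pattern = "\s([0-9.]+)\sm²"
    End With

    Set matches = re.Execute(s)
    If matches.Count > 0 Then ExtractSquareMetres = matches(0).SubMatches(0)
End Function
```

With the sample text from the question, ExtractSquareMetres("textA textB textC ca. 140 m².") returns "140". Because the character class only contains digits and dots, a value written with a decimal comma such as "85,5 m²" would not match. Whether that matters depends on how the site formats its numbers.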

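【Comments】:

  • Nice. A bit more complicated than I thought ;-)
  • I still think there is a simpler way. Python doesn't need any browser automation. I can't quite work out what additional information (besides the Referer) has to be sent with the GET request so that the information can be retrieved without an error at a later stage.
  • I was so proud when I learned to add .item(x) to querySelectorAll. How was I supposed to come up with .Pattern = "\s([0-9.]+)\sm²"??? It almost reads (or feels) like e=mc² lol
  • I don't think that is the cause. The . is included in the character set. See regex101.com/r/WAJqAh/1. Please provide the exact string that fails via pastebin.com, or give instructions on how to get that result on the website.
  • Change the flag to Global = True; you then need to loop over the matches. Dim matches As Object, match As Object: Set matches = re.Execute(.......) and then For Each match In matches. While looping, store the extracted match values in an array. Finally, join the array with "=" & Join(arr, "+"). You can Dim arr() alongside Dim matches, and then ReDim arr(1 To matches.Count) on the line after Set matches. stackoverflow.com/questions/22542834/…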
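
The steps in the last comment could be put together roughly as follows. This is a sketch, not the answerer's code: the function name SumFormulaFromMatches is hypothetical, and it assumes that every "<number> m²" occurrence in the text should be added up.

```vba
Option Explicit

'Hypothetical helper: collects every number in front of "m²" and returns
'a formula string such as "=140+20", or "" if nothing matches.
Public Function SumFormulaFromMatches(ByVal s As String) As String
    Dim re As Object, matches As Object, match As Object
    Dim arr() As String, i As Long

    Set re = CreateObject("VBScript.RegExp")
    re.Global = True                     'return all matches, not just the first
    re.Pattern = "\s([0-9.]+)\sm²"

    Set matches = re.Execute(s)
    If matches.Count = 0 Then Exit Function

    ReDim arr(1 To matches.Count)
    For Each match In matches
        i = i + 1
        arr(i) = match.SubMatches(0)     'captured number only
    Next

    SumFormulaFromMatches = "=" & Join(arr, "+")
End Function
```

For the input "Haus ca. 140 m², Garage ca. 20 m²" this yields "=140+20". If the site uses a dot as thousands separator (for example "1.250"), the values would probably need normalising before the string is written to a cell as a formula.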