Excel VBA网页抓取数据表答案

【问题标题】：Excel VBA web scraping for data tableExcel VBA网页抓取数据表
【发布时间】：2018-02-01 21:51:49
【问题描述】：

我正在尝试从名册资源中获取数据，这是一个网页示例 (https://www.rosterresource.com/mlb-arizona-diamondbacks)。至少，我想获得“预计的“首选”首发阵容”并将该数据导入我的电子表格。然后，我会为名册资源中的每个 MLB 团队执行此操作，以创建一个包含每个团队和每个团队的预计阵容的工作表。

我尝试了一些“getElementById”和“getElementsByClassName”的方法，但是我很难获得我想要的数据，因为这似乎只是网页上的一个非常大的表格。

任何能让我了解获取数据的正确方向的见解都会非常有帮助。

【问题讨论】：

标签： html excel web-scraping vba

【解决方案1】：

Matt：您可以使用 PowerQuery（在 Excel 或 PowerBI 中）来执行此操作……即使数据未存储在 HTML 表中（这里就是这种情况）。 https://datachant.com/2017/03/30/web-scraping-power-bi-excel-power-query/有一个很好的教程

我目前正面临自己的网络抓取挑战，但如果您决定使用 PowerQuery 并遇到困难，请大声喊叫，我会看看是否可以提供进一步帮助。

【讨论】：

【解决方案2】：

如果您浏览网页https://www.rosterresource.com/mlb-arizona-diamondbacks 并从表格的上下文菜单中选择检查元素，您将在浏览器开发人员工具中看到整个表格位于一个框架内：

<iframe id="pageswitcher-content" frameborder="0" marginheight="0" marginwidth="0" src="https://docs.google.com/spreadsheets/d/e/2PACX-1vSe6YBd7UW_ijhVHdRsM132Z3aUXUIzGuHcoqqdsr_nUXIYHbvRDFY0XCwGVndXJnWRaWVYhbeDbo5W/pubhtml/sheet?headers=false&amp;gid=1569103012" style="display: block; width: 100%; height: 100%;"></iframe>

所以实际上您需要从该 Google 电子表格文档中检索数据。这可以通过 XHR 和 Regex 来完成，如下代码所示：

Option Explicit

Sub Test()

    Dim sContent As String
    Dim i As Long
    Dim j As Long
    Dim k As Long
    Dim aTables()
    Dim aHeader() As String
    Dim aRows() As String

    ' Retrieve HTML content via XHR
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.rosterresource.com/mlb-arizona-diamondbacks", False
        .Send
        sContent = .ResponseText
    End With
    ' Cut all before iframe URL
    sContent = Split(sContent, "<iframe src=""", 2)(1)
    ' Cut all after ? sign within URL
    sContent = Split(sContent, "?", 2)(0)
    ' Download google spreadsheet by extracted URL
    ' e. g. https://docs.google.com/spreadsheets/d/e/2PACX-1vQngsjnOpqkD8FQIOLn4cFayZTe4dl5VJZLNjMzji2Iq0dVXan7nj20Pq6oKnVS_HFla9e5GUtCyYl_/pubhtml
    ' e. g. https://docs.google.com/spreadsheets/d/e/2PACX-1vSe6YBd7UW_ijhVHdRsM132Z3aUXUIzGuHcoqqdsr_nUXIYHbvRDFY0XCwGVndXJnWRaWVYhbeDbo5W/pubhtml
    ' Retrieve HTML content via XHR
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", sContent, False
        .Send
        sContent = .ResponseText
    End With
    ' Parse with RegEx
    With CreateObject("VBScript.RegExp")
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        ' Process all tables within iframe content
        .Pattern = "<table\b[\s\S]*?>([\s\S]*?)</table>"
        With .Execute(sContent)
            ReDim aTables(0 To .Count - 1)
            For i = 0 To .Count - 1
                aTables(i) = .Item(i).SubMatches(0)
            Next
        End With
        For k = 0 To UBound(aTables)
            ' Minor HTML simplification
            sContent = aTables(k)
            ' Remove all tags except table formatting
            .Pattern = "<(?!/td|/tr|/th|(?:td|tr|th)\b)[^>]*>|\r|\n|\t"
            sContent = .Replace(sContent, "")
            ' Remove tags attributes
            .Pattern = "<(\w+)\b[^>]+>"
            sContent = .Replace(sContent, "<$1>")
            ' Replace th with td
            .Pattern = "<(/?)th>"
            sContent = .Replace(sContent, "<$1td>")
            ' Replace HTML entities &name; &#number; with chars
            .Pattern = "&(?:\w+|#\d+);"
            .Global = False
            Do
                With .Execute(sContent)
                    If .Count = 0 Then Exit Do
                    sContent = Replace(sContent, .Item(0), DecodeHTMLEntities(.Item(0)))
                End With
            Loop
            .Global = True
           ' Extract rows
            .Pattern = "<tr>((?:<td>.*?</td>)+)</tr>"
            With .Execute(sContent)
                ReDim aRows(0 To .Count - 1, 0)
                For i = 0 To .Count - 1
                    aRows(i, 0) = .Item(i).SubMatches(0)
                Next
            End With
            ' Extract cells
            .Pattern = "<td>(.*?)</td>"
            For i = 0 To UBound(aRows, 1)
                With .Execute(aRows(i, 0))
                    For j = 0 To .Count - 1
                        If UBound(aRows, 2) < j Then ReDim Preserve aRows(UBound(aRows, 1), j)
                        aRows(i, j) = Trim(.Item(j).SubMatches(0))
                        DoEvents
                    Next
                End With
            Next
            aTables(k) = aRows
        Next
    End With
    ' Output
    With ThisWorkbook
        ' Remove all existing worksheets
        Application.DisplayAlerts = False
        .Sheets.Add , .Sheets(.Sheets.Count)
        Do While .Sheets.Count > 1
            .Sheets(1).Delete
        Loop
        Application.DisplayAlerts = True
        ' Output each table to separate worksheet
        For k = 0 To UBound(aTables)
            If .Sheets.Count < (k + 1) Then .Sheets.Add , .Sheets(.Sheets.Count)
            With .Sheets(k + 1)
                .Cells.Delete
                Output2DArray .Cells(1, 1), aTables(k)
                .Columns.AutoFit
            End With
        Next
    End With

End Sub

Function DecodeHTMLEntities(sText As String) As String

    Static oHtmlfile As Object
    Static oDiv As Object

    If oHtmlfile Is Nothing Then
        Set oHtmlfile = CreateObject("htmlfile")
        oHtmlfile.Open
        Set oDiv = oHtmlfile.createElement("div")
    End If
    oDiv.innerHTML = sText
    DecodeHTMLEntities = oDiv.innerText

End Function

Sub Output2DArray(oDstRng As Range, aCells As Variant)

    With oDstRng
        .Parent.Select
        With .Resize( _
                UBound(aCells, 1) - LBound(aCells, 1) + 1, _
                UBound(aCells, 2) - LBound(aCells, 2) + 1)
            .NumberFormat = "@"
            .Value = aCells
        End With
    End With

End Sub

通常不建议将 RegEx 用于 HTML 解析，因此 there is disclaimer。在这种情况下处理的数据非常简单，这就是使用 RegEx 对其进行解析的原因。关于正则表达式：introduction（尤其是syntax）、introduction JS、VB flavor。简化使 HTML 代码在某种程度上适合解析。 BTW there is one more answer 使用相同的方法。

【讨论】：