【问题标题】:How can I extract data from a website and fill an excel sheet using VBA?如何从网站中提取数据并使用 VBA 填写 Excel 表?
【发布时间】:2019-04-27 13:35:49
【问题描述】:

我想从 betexplorer.com 提取数据。我想从以下 URL 中提取两个不同的数据:

https://www.betexplorer.com/soccer/s...eague-1/stats/

我想提取已播放的比赛和剩余的比赛 我想提取主队进球和客队进球(每场比赛)

我有执行此操作的代码,如下所示:

Option Explicit

Sub GetSoccerStats()


'Set a reference (VBE > Tools > References) to the following libraries:
'   1) Microsoft XML, v6.0
'   2) Microsoft HTML Object Library

Dim xmlReq As New MSXML2.XMLHTTP60
Dim objDoc As New MSHTML.HTMLDocument
Dim objTable As MSHTML.htmlTable
Dim objTableRow As MSHTML.htmlTableRow
Dim strURL As String
Dim strResp As String
Dim strText As String
Dim rw As Long

strURL = "https://www.betexplorer.com/soccer/south-korea/k-league-1/stats/"

With xmlReq
    .Open "GET", strURL, False
    .send
    If .Status <> 200 Then
        MsgBox "Error " & .Status & ":  " & .statusText
        Exit Sub
    End If
    strResp = .responseText
End With

Worksheets.Add

objDoc.body.innerHTML = strResp

Set objTable = objDoc.getElementsByClassName("table-main leaguestats")(0)

If Not objTable Is Nothing Then
    rw = 1
    For Each objTableRow In objTable.Rows
        strText = objTableRow.Cells(0).innerText
        Select Case strText
            Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                Cells(rw, "a").Value = objTableRow.Cells(0).innerText
                Cells(rw, "b").Value = objTableRow.Cells(1).innerText
                Cells(rw, "c").Value = objTableRow.Cells(2).innerText
                rw = rw + 1
        End Select
    Next objTableRow
    Columns("a").AutoFit
End If

Set xmlReq = Nothing
Set objDoc = Nothing
Set objTable = Nothing
Set objTableRow = Nothing


End Sub

此代码有效,但我想更进一步。

我实际上想为同一站点上的许多不同 URL 运行此宏。我已经创建了一个工作表,其中包含足球联赛列表(在行中),列包含数据。

您可以在此处找到该文件: https://www.dropbox.com/s/77sol24sty75w5z/Avg%20Goals.xlsm?dl=0

这是一个文件,我将在其中将联赛添加到行中。是否可以调整提取数据的代码,以便它可以填充工作表中的列?我不需要像这段代码那样提取数据的名称(剩余比赛、主场进球、客场进球等),我只需要数字。提取的数字必须根据工作表填充列(因此每一行都包含每个联赛的数据。如您所见,有几个联赛,因此需要遍历每一行,然后使用相应的 URL行。

您会注意到有一列包含单词 CURRENT。这是为了表明它应该使用当前 URL 列中的 URL。如果我将值更改为 LAST,我希望它使用 Last URL 列中的 URL。

对于每个联赛,如果我使用 CURRENT 或 LAST,它会有所不同。

这是预期输出的图片:

非常感谢任何帮助。

【问题讨论】:

  • 对预期输出(包括问题)的小模拟可能有助于显示前几行输出。这可以作为图像插入。
  • 当前和最后一个网址有什么区别?你有每个例子吗?
  • 我们应该能够在不参考外部文件的情况下回答这个问题。也就是说:您的保管箱文件是公开的吗?我看到 404 未找到
  • 我已经更新了链接并添加了一张照片

标签: excel vba web-scraping data-extraction


【解决方案1】:

与您的代码保持一致,这将在 M:T 列中输出这些项目的数据。我有一个辅助函数GetLinks,它会根据 K 列中的值生成要使用的最终 url 数组:

inputArray = GetLinks(inputArray)

这个数组被循环并发出 xhr 请求以获取信息。所有结果信息都存储在一个数组中,results,该数组在最后写入工作表。

我一直在使用数组,因为您不想继续阅读工作表;这是一项昂贵的操作,会减慢您的代码速度。出于同样的原因,如果出现 200,我将消息和 url 打印到即时窗口,以免降低代码速度。您实际上有一个日志,然后您可以在最后查看。

检索到的结果从 M 列写出,但由于数据在数组中,您可以轻松地写出到您想要的任何位置;只需将用于粘贴的起始单元格从M4 更改为您想要的最左上角的单元格。您现有的列中没有百分比,因此我可以放心地假设您希望写出的数据位于新列中(甚至可能在不同的工作表中)。

Option Explicit   
Public Sub GetSoccerStats()
    Dim xmlReq As New MSXML2.XMLHTTP60, response As String
    Dim objDoc As New MSHTML.HTMLDocument, text As String
    Dim lastRow As Long, dataSheet As Worksheet, inputArray(), i As Long

    Set dataSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")

    With dataSheet
        lastRow = .Cells(.Rows.Count, "B").End(xlUp).Row
    End With

    inputArray = dataSheet.Range("J4:L" & lastRow).Value
    inputArray = GetLinks(inputArray)

    Dim results(), r As Long, c As Long
    ReDim results(1 To UBound(inputArray, 1), 1 To 8)

    With xmlReq

        For i = LBound(inputArray, 1) To UBound(inputArray, 1)
            r = r + 1
            .Open "GET", inputArray(i, 4), False
            .send
            If .Status <> 200 Then
                Debug.Print inputArray(i, 4), vbTab, "Error " & .Status & ":  " & .statusText
            Else
                response = .responseText
                objDoc.body.innerHTML = response

                Dim objTable As MSHTML.HTMLTable, objTableRow As MSHTML.HTMLTableRow

                Set objTable = objDoc.getElementsByClassName("table-main leaguestats")(0)

                If Not objTable Is Nothing Then
                    c = 1
                    For Each objTableRow In objTable.Rows
                        text = objTableRow.Cells(0).innerText
                        Select Case text
                        Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                            results(r, c) = objTableRow.Cells(1).innerText
                            results(r, c + 1) = objTableRow.Cells(2).innerText
                            c = c + 2
                        End Select
                    Next objTableRow
                End If
            End If
            Set objTable = Nothing
        Next
    End With
    dataSheet.Range("M4").Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

Public Function GetLinks(ByRef inputArray As Variant) As Variant
    Dim i As Long
    ReDim Preserve inputArray(1 To UBound(inputArray, 1), 1 To UBound(inputArray, 2) + 1)

    For i = LBound(inputArray, 1) To UBound(inputArray, 1)
        inputArray(i, 4) = IIf(inputArray(i, 1) = "CURRENT", inputArray(i, 2), inputArray(i, 3))
    Next
    GetLinks = inputArray
End Function

文件布局:


鉴于大量请求导致阻塞这里是IE版本:

'VBE > Tools > References:
'1: Microsoft HTML Object library  2: Microsoft Internet Controls
Public Sub GetSoccerStats()
    Dim ie As Object, t As Date
    Dim objDoc As New MSHTML.HTMLDocument, text As String
    Dim lastRow As Long, dataSheet As Worksheet, inputArray(), i As Long

    Const MAX_WAIT_SEC As Long = 10

    Set dataSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")
    Set ie = CreateObject("InternetExplorer.Application")
    With dataSheet
        lastRow = .Cells(.Rows.Count, "B").End(xlUp).Row
    End With

    inputArray = dataSheet.Range("C4:E" & lastRow).Value
    inputArray = GetLinks(inputArray)

    Dim results(), r As Long, c As Long
    ReDim results(1 To UBound(inputArray, 1), 1 To 8)

    With ie
        .Visible = True
        For i = LBound(inputArray, 1) To UBound(inputArray, 1)
            r = r + 1
            .navigate2 inputArray(i, 4)

            While .Busy Or .readyState < 4: DoEvents: Wend

            Dim objTable As MSHTML.HTMLTable, objTableRow As MSHTML.HTMLTableRow
            t = timer
            Do
                DoEvents
                On Error Resume Next
                Set objTable = .document.getElementsByClassName("table-main leaguestats")(0)
                On Error GoTo 0
                If Timer - t > MAX_WAIT_SEC Then Exit Do
            Loop While objTable Is Nothing

            If Not objTable Is Nothing Then
                c = 1
                For Each objTableRow In objTable.Rows
                    text = objTableRow.Cells(0).innerText
                    Select Case text
                    Case "Matches played", "Matches remaining", "Home goals", "Away goals"
                        results(r, c) = objTableRow.Cells(1).innerText
                        results(r, c + 1) = objTableRow.Cells(2).innerText
                        c = c + 2
                    End Select
                Next objTableRow
            End If
            Set objTable = Nothing
        Next
        .Quit
    End With
    dataSheet.Range("F4").Resize(UBound(results, 1), UBound(results, 2)) = results
End Sub

【讨论】:

  • 嗨,这很好用,所以我更新了我的联赛列表(现在有 70 多个)当我运行宏时,我收到以下错误:运行时错误'-2147024891 (80070005) : 访问被拒绝。在调试中,错误位于 xmlreq 的 .send
  • 该网站可能会因请求过多而过快阻止您。您可能需要在循环期间引入等待。您想使用同一个网站并为我上传所有联赛的文件吗?
  • 我认为这可能是时间问题。这是文件dropbox.com/s/77sol24sty75w5z/Avg%20Goals.xlsm?dl=0的链接
  • 同意。如果它适用于单个/少数请求但不适用于很多请求,则很可能您受到限制/阻止
  • 是基于IP的拦截。即使更改了 IP,它也会产生另一层保护。我的猜测是这种抓取是不允许的。
【解决方案2】:

也许这样的事情可能会起作用:

Option Explicit

Private Sub GetSoccerStats()

    'Set a reference (VBE > Tools > References) to the following libraries:
    '   1) Microsoft XML, v6.0
    '   2) Microsoft HTML Object Library

    Dim sourceSheet As Worksheet
    Set sourceSheet = ThisWorkbook.Worksheets("AVG GOAL DATA")

    Dim firstRowToFetchDataFor As Long
    firstRowToFetchDataFor = sourceSheet.Cells(sourceSheet.Rows.Count, "C").End(xlUp).Row + 1 ' Assumes a row needs pulling if the value in column C is blank.

    Dim lastRowToFetchDataFor As Long
    lastRowToFetchDataFor = sourceSheet.Cells(sourceSheet.Rows.Count, "B").End(xlUp).Row

    Dim xmlReq As MSXML2.XMLHTTP60
    Set xmlReq = New MSXML2.XMLHTTP60

    Dim htmlDoc As MSHTML.HTMLDocument
    Set htmlDoc = New MSHTML.HTMLDocument

    Dim rowIndex As Long
    For rowIndex = firstRowToFetchDataFor To lastRowToFetchDataFor

        Dim URL As String
        Select Case LCase$(sourceSheet.Cells(rowIndex, "J"))
            Case "current"
                URL = sourceSheet.Cells(rowIndex, "K")
            Case "last"
                URL = sourceSheet.Cells(rowIndex, "L")
            Case Else
                MsgBox "Expected 'current' or 'last', instead got '" & sourceSheet.Cells(rowIndex, "J") & "' in cell '" & sourceSheet.Cells(rowIndex, "J").Address(False, False) & "'.", vbCritical
                Application.Goto sourceSheet.Cells(rowIndex, "J")
                Exit Sub
        End Select

        With xmlReq
            .Open "GET", URL, False
            .send
            If .Status <> 200 Then
                MsgBox "Request returned HTTP " & .Status & ":" & vbNewLine & vbNewLine & .statusText, vbCritical
                Exit Sub
            End If
            htmlDoc.body.innerHTML = .responseText
        End With

        Dim htmlTableExtracted As MSHTML.HTMLTable
        On Error Resume Next
        Set htmlTableExtracted = htmlDoc.getElementsByClassName("table-main leaguestats")(0)
        On Error GoTo 0

        If Not (htmlTableExtracted Is Nothing) Then
            Dim tableRow As MSHTML.HTMLTableRow
            For Each tableRow In htmlTableExtracted.Rows
                Select Case LCase$(tableRow.Cells(0).innerText)
                    Case "matches played"
                        sourceSheet.Cells(rowIndex, "G") = tableRow.Cells(1).innerText
                    Case "matches remaining"
                        sourceSheet.Cells(rowIndex, "H") = tableRow.Cells(1).innerText
                    Case "home goals"
                        sourceSheet.Cells(rowIndex, "C") = tableRow.Cells(2).innerText
                    Case "away goals"
                        sourceSheet.Cells(rowIndex, "E") = tableRow.Cells(2).innerText
                End Select
            Next tableRow

            Set htmlTableExtracted = Nothing ' Prevent this iteration's result having effects on succeeding iterations
        End If
    Next rowIndex
End Sub

我可能错了,但E 栏不应该包含“客场进球”吗?我假设“A SCR AVG”中的“A”代表“Away”(因为“H SCR AVG”中的“H”似乎代表“Home”)。所以我将“客场进球”写到E 列,尽管屏幕截图表明它们应该写到B 列(或者我可能没有正确阅读)。

【讨论】:

  • 由于某种原因,当我将它粘贴到 VBA 模块时,我看不到要运行的宏
猜你喜欢
  • 1970-01-01
  • 1970-01-01
  • 2021-10-29
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
相关资源
最近更新 更多