【Question title】: VBA Excel pulling new webpage data after clicking on "submit"
【Posted】: 2014-07-11 15:18:50
【Question】:

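I am trying to pull some information from a website that serves oil well data by API number (an API number is the unique identifier assigned to every well in the US).

URL: http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1

Example API number: 1708300502

The problem is that once I reach the second page, IE.document.getElementsByTagName("body")(0).innerText still returns data from the initial page. How do I get the updated page's data?

The end goal is to reach page 2, click on "30570" via IE.document.getElementsByTagName("a")(0).Click, and then read the final page 3. I just can't figure out how to read the updated page :(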

Option Explicit

Sub sonris_WellData()
   Dim IE As InternetExplorer
   Set IE = CreateObject("InternetExplorer.Application")
   IE.Visible = True

   Dim i As Integer

   'Open SONRIS website
   Application.StatusBar = "Opening Website"
   IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
   Do While IE.readyState <> 4: DoEvents: Loop
   Application.Wait Now() + TimeValue("00:00:01")
   Application.StatusBar = False

   IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
   IE.document.forms(0).submit

   ' Wait until the next page opens
   Application.StatusBar = "Opening Website"
   Do While IE.readyState <> 4: DoEvents: Loop
   Application.Wait Now() + TimeValue("00:00:01")
   Application.StatusBar = False

   ' Read the page - this is where the issue occurs, MsgBox keeps returning text from the very 1st page
   MsgBox IE.document.getElementsByTagName("body")(0).innerText

   IE.Quit
End Sub

【Comments】:

    Tags: html excel vba web-scraping web-crawler


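    【Solution 1】:

    This seems to work. Instead of DoEvents, use the WinAPI Sleep function. I also added a call to Sleep after the form is submitted.

    More and more often we see sites whose content is served dynamically by JavaScript and the like. In those cases the browser may report READYSTATE_COMPLETE, or report that it is not Busy, even though the page has not yet rendered the "new" results.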

    Option Explicit
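    ' 64-bit Office (VBA7) requires PtrSafe on API declarations
    #If VBA7 Then
        Public Declare PtrSafe Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
    #Else
        Public Declare Sub Sleep Lib "kernel32" (ByVal dwMilliseconds As Long)
    #End If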
    Sub sonris_WellData()
       Dim IE As Object 'InternetExplorer
       Set IE = CreateObject("InternetExplorer.Application")
       IE.Visible = True
    
       Dim i As Integer
    
       'Open SONRIS website
       Application.StatusBar = "Opening Website"
       IE.navigate "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
       Do While IE.readyState <> 4
           Sleep 1000
       Loop
    
       Application.StatusBar = False
    
       IE.document.forms(0).p_apinum.Value = "1708300502" 'plug-in API
       IE.document.forms(0).submit
    
       Sleep 1000
    
       ' Wait until the next page opens
       Application.StatusBar = "Opening Website"
       Do While IE.readyState <> 4
        Sleep 1000
       Loop
    
       Application.StatusBar = False
    
       ' Read the page - this is where the issue occurs, MsgBox keeps returning text from the very 1st page
       MsgBox IE.document.getElementsByTagName("body")(0).innerText
    
       IE.Quit
    End Sub
    

    You could try a slightly longer Sleep after the .submit.

    Also, I noticed that the URL changes after you submit, so you could also try changing the second wait loop to:

    Do While IE.LocationURL = "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1"
        Sleep 1000
    Loop
    

    This should make Excel.Application wait until the URL changes.
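    As written, that loop never exits if the URL never changes. A minimal sketch of the same wait with a timeout is shown here; the helper name WaitForUrlChange and the 30-second default are assumptions for illustration, not part of the original answer.

```vba
Option Explicit

' Hypothetical helper: waits until IE.LocationURL differs from oldUrl.
' Returns True if the URL changed, False if the timeout expired first.
Public Function WaitForUrlChange(ByVal IE As Object, ByVal oldUrl As String, _
                                 Optional ByVal timeoutSeconds As Long = 30) As Boolean
    Dim deadline As Date
    deadline = DateAdd("s", timeoutSeconds, Now)

    Do While StrComp(IE.LocationURL, oldUrl, vbTextCompare) = 0
        If Now > deadline Then Exit Function   ' give up; result stays False
        DoEvents
        Application.Wait Now + TimeValue("00:00:01")
    Loop

    WaitForUrlChange = True
End Function
```

    It could be called right after .submit, for example: If Not WaitForUrlChange(IE, "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1") Then Debug.Print "URL did not change". The False branch makes it visible when the IE object keeps reporting the old page.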

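    Alternatively, you may have better luck with XMLHTTPRequest (there are plenty of examples on SO and elsewhere on the internet). It lets you send a request just as a browser would, without actually using a web browser. You can then simply parse the returned text as HTML or XML. I would use the Microsoft XML, v6.0 library reference for this.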
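    A minimal sketch of that idea follows. It uses late binding, so no library reference is needed. It assumes that the form posts a field named p_apinum to cart_con_wellapi2; both names appear elsewhere in this thread but are not verified against the live site. The procedure name is hypothetical.

```vba
Option Explicit

' Sketch: send the API number as a form POST without opening a browser.
Public Sub PostApiNumberSketch()
    ' Assumed target: the page the browser lands on after submitting the form
    Const TARGET_URL As String = _
        "http://sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi2"

    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP.6.0")

    http.Open "POST", TARGET_URL, False          ' False = synchronous
    http.setRequestHeader "Content-Type", "application/x-www-form-urlencoded"
    http.send "p_apinum=1708300502"              ' assumed form field name

    If http.Status = 200 Then
        ' Print the start of the response; parse responseText as needed
        Debug.Print Left$(http.responseText, 500)
    Else
        Debug.Print "HTTP " & http.Status & " " & http.statusText
    End If
End Sub
```

    Because the request is synchronous, there is no readyState polling and no dependency on what the IE window reports.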

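    【Comments】:

    • Still no luck. I can see the page load, so a delay is not the issue. I will try XMLHTTPRequest.
    • Oh, sorry, please try the second suggestion, Do While IE.Location = "http://..."; I used the wrong operator there.
    • Even after I wait 10-15 seconds, IE.LocationURL still returns the first link (sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi1), even though the browser has been on the second link for more than 5 seconds: sonlite.dnr.state.la.us/sundown/cart_prod/cart_con_wellapi2
    • Try the second suggestion... the loop based on IE.Location
    • Nothing happened. I rewrote the script so that it writes IE.LocationURL on a new line every 5 seconds. The result is that it keeps pulling data from the old IE page, not the new one :(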
    【Solution 2】:

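    POST request:

    ① Enter the well API number

    I inspected the web page where you make the selection you mentioned. Using Fiddler to inspect the network traffic, I noticed that the initial request is handled by a POST request when you submit the API number.


    ② POST request:

    The POST body has one parameter: p_apinum is the key, and the associated value is the original well API number.

    Using this information, I formulated the POST request directly, which avoids your first landing page altogether.


    ③ Clicking the hyperlink:

    Next, I looked at the element you want to click. Its HTML contains a relative hyperlink.

    I use a helper function to parse the page HTML for this relative link and build the absolute path: GetNextURL(page.body.innerHTML)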
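    To make the parsing step concrete, here is a small sketch of the same Split/Replace logic applied to a made-up anchor tag. The function name ExtractFirstHref and the link next_page?id=30570 are hypothetical. The real page's link will differ.

```vba
Option Explicit

' Same approach as GetNextURL: take the text after the first href=,
' cut it at the next ">", then strip quotes and any "about:" prefix
' (a prefix MSHTML can add to relative links in a detached document).
Public Function ExtractFirstHref(ByVal inputString As String) As String
    Dim afterHref As String
    afterHref = Split(inputString, "href=")(1)
    ExtractFirstHref = Replace$(Replace$(Split(afterHref, ">")(0), _
                                Chr$(34), vbNullString), "about:", vbNullString)
End Function

Public Sub ExtractFirstHrefDemo()
    ' Hypothetical HTML fragment, for illustration only
    Const SAMPLE As String = "<td><a href=""next_page?id=30570"">30570</a></td>"
    Debug.Print ExtractFirstHref(SAMPLE)   ' prints next_page?id=30570
End Sub
```

    Note that this picks up the first href in the string and raises a "subscript out of range" error if the HTML contains no href= at all, so it relies on the wanted link being the first one on the page.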


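    ④ Making the new request:

    I reuse my HTTP request function GetPage to send a second request, this time with an empty body, and grab all the tables from the returned HTML document via page.getElementsByTagName("table").


    ⑤ Writing the tables to the Excel sheet:

    I loop over all the tables on the page, using the helper AddHeaders to write out each table's headers and WriteTables to write the current table to the sheet.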


    [Screenshot: sample page content]


    [Screenshot: sample code output]


    VBA:

    Option Explicit
    Public Sub GetWellInfo()
        Dim ws As Worksheet, page As HTMLDocument, targetTable As HTMLTable, apiNumbers(), currNumber As Long
        Const PARAM1 As String = "p_apinum"
        Const BASESTRING As String = "http://sonlite.dnr.state.la.us/sundown/cart_prod/"
        apiNumbers = Array(1708300502, 1708300503)
        
        Application.ScreenUpdating = False
        Set ws = ThisWorkbook.Worksheets("Sheet1")
        
        With ws
            .Cells.ClearContents
            For currNumber = LBound(apiNumbers) To UBound(apiNumbers)
                Set page = GetPage(BASESTRING & "cart_con_wellapi2", apiNumbers(currNumber), PARAM1)
                Set page = GetPage(BASESTRING & GetNextURL(page.body.innerHTML))
                Dim allTables As Object
                Set allTables = page.getElementsByTagName("table")
             
                For Each targetTable In allTables
                    AddHeaders targetTable, GetLastRow(ws, 1) + 2, ws
                    WriteTables targetTable, GetLastRow(ws, 1), ws
                Next targetTable
            
            Next currNumber
        End With
        Application.ScreenUpdating = True
    End Sub
    
    Public Function GetPage(ByVal url As String, Optional ByVal apiNumber As Long, Optional ByVal paramN As String = vbNullString) As HTMLDocument
        Dim objHTTP As Object, html As New HTMLDocument
      
        Set objHTTP = CreateObject("WinHttp.WinHttpRequest.5.1")
    
        Dim sBody As String
        If Not paramN = vbNullString Then sBody = paramN & "=" & apiNumber
        With objHTTP
            .SetTimeouts 10000, 10000, 10000, 10000
            .Open "POST", url, False
            .setRequestHeader "User-Agent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)"
            .setRequestHeader "Content-type", "application/x-www-form-urlencoded"
            On Error Resume Next
            .send (sBody)
            If Err.Number = 0 Then
                If .Status = 200 Then
                    html.body.innerHTML = .responseText
                    Set GetPage = html
                Else
                    Debug.Print "HTTP " & .Status & " " & .statusText
                    Exit Function
                End If
            Else
                Debug.Print "Error " & Err.Number & " " & Err.Source & " " & Err.Description
                Exit Function
            End If
            On Error GoTo 0
        End With
     
    End Function
    
    Public Function GetNextURL(ByVal inputString As String) As String
        GetNextURL = Replace$(Replace$(Split(Split(inputString, "href=")(1), ">")(0), Chr$(34), vbNullString), "about:", vbNullString)
    End Function
    
    Public Sub AddHeaders(ByVal hTable As Object, ByVal startRow As Long, ByVal ws As Worksheet)
        Dim headers As Object, header As Object, columnCounter As Long
        Set headers = hTable.getElementsByTagName("th")
        For Each header In headers
            columnCounter = columnCounter + 1
            ws.Cells(startRow, columnCounter) = header.innerText
        Next header
    End Sub
    
    Public Sub WriteTables(ByVal hTable As HTMLTable, Optional ByVal startRow As Long = 1, Optional ByRef ws As Worksheet)
        If ws Is Nothing Then Set ws = ActiveSheet
        
        Dim tRow As Object, tCell As Object, tr As Object, td As Object, r As Long, c As Long
        r = startRow: c = 1 ' Cells is 1-based, so the first column is 1
        With ws ' write to the sheet that was passed in, not whichever sheet is active
            Set tRow = hTable.getElementsByTagName("tr")
            For Each tr In tRow
                Set tCell = tr.getElementsByTagName("td")
                For Each td In tCell
                    .Cells(r, c).Value = td.innerText
                    c = c + 1
                Next td
                r = r + 1:  c = 1
            Next tr
        End With
    End Sub
    
    Public Function GetLastRow(ByVal ws As Worksheet, Optional ByVal columnNumber As Long = 1) As Long
        With ws
            GetLastRow = .Cells(.Rows.Count, columnNumber).End(xlUp).Row
        End With
    End Function
    

    References:

    VBE > Tools > References > Microsoft HTML Object Library.

    【Comments】:
