【Question title】: Can't fetch some information from a webpage using XHR
【Posted】: 2018-12-31 19:05:09
【Question description】:

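I'm trying to fetch part of the information from a webpage using an XMLHTTP request. When I run my script, it throws the error Object Variable Or With---. However, when I try the same thing using IE, I get the content like magic.

The most important thing to note is that the content I want to scrape is neither JavaScript-encrypted nor dynamically generated, so I should be able to get it using XHR. Where am I going wrong?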

Here goes the website link

Using IE (working):

Sub GetText()
    Const Url As String = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
    Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object

    With IE
        .Visible = False
        .navigate Url
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With
    
    Set post = HTML.querySelectorAll(".section_content")(2)
    MsgBox post.innerText
End Sub

Using XHR (not working):

Sub GetText()
    Const Url As String = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
    Dim Http As New XMLHTTP60, HTML As New HTMLDocument, post As Object

    With Http
        .Open "GET", Url, False
        .send
        HTML.body.innerHTML = .responseText
    End With
    
    Set post = HTML.querySelectorAll(".section_content")(2)
    MsgBox post.innerText
End Sub

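The selector I defined above is flawless.

I could have pasted the relevant HTML elements here, but they are wrapped in comments. However, I have provided the link to the site above.

To be clearer: the portion of text I'm interested in appears on that webpage exactly as shown in the image.

My question: how can I fetch that block of text using XHR?

【Discussion】: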

  • I got the exact same error with the IE version. Try: For Each post In HTML.getElementsByClassName("section_content"): MsgBox post.innerText: Next
  • You get the same error with IE because your version of IE doesn't support .querySelector(), @peakpeak.

Tags: vba excel web-scraping xmlhttprequest


【Solution 1】:

Using the comment's position (the comment is the last child of its wrapper element):

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
    With html
        .body.innerHTML = sResponse
        html.body.innerHTML = html.querySelector("#all_9711922514").LastChild.Data
        Debug.Print html.querySelector("#div_9711922514").innerText
    End With
End Sub

Using nodeType (8 identifies a comment node):

Option Explicit    
Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument, ele As Object
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

    With html
        .body.innerHTML = sResponse
        For Each ele In html.querySelector("#all_9711922514").Children
            If ele.NodeType = 8 Then
                html.body.innerHTML = ele.Data
                Debug.Print html.querySelector("#div_9711922514").innerText
                Exit For
            End If
        Next
    End With
End Sub
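
The two variants above reach the comment through the DOM and read its Data. For comparison, here is a string-only sketch of the same idea that runs without any network access or library references. It is not the method used in this answer: the helper name FirstCommentBody, its startAt parameter, and the sample markup are all assumptions made for illustration.

```vba
Option Explicit

' Hypothetical helper: returns the text between the first "<!--" found at or
' after startAt and the next "-->". Returns an empty string when startAt is
' invalid or when no complete comment is found.
Public Function FirstCommentBody(ByVal source As String, _
                                 Optional ByVal startAt As Long = 1) As String
    Dim startPos As Long, endPos As Long

    If startAt < 1 Then Exit Function               ' e.g. a failed InStr lookup

    startPos = InStr(startAt, source, "<!--")
    If startPos = 0 Then Exit Function              ' no comment opener

    startPos = startPos + Len("<!--")               ' skip past the opener
    endPos = InStr(startPos, source, "-->")
    If endPos = 0 Then Exit Function                ' comment never closes

    FirstCommentBody = Mid$(source, startPos, endPos - startPos)
End Function

Public Sub DemoFirstCommentBody()
    ' Assumed sample markup that mimics the page: the inner div sits in a comment.
    Const SAMPLE As String = "<div id=""all_x""><!--<div id=""div_x"">hidden text</div>--></div>"
    Debug.Print FirstCommentBody(SAMPLE)
End Sub
```

On the real response you would probably pass the position of the wrapper's id as startAt, for example InStr(1, sResponse, "all_9711922514"), and then assign the returned string to html.body.innerHTML as the answer does.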

Using a regular expression:

Option Explicit

Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

    With html
        .body.innerHTML = sResponse
        Dim s As String
        s = .querySelector("div[id=all_1786105919]").outerHTML
        s = regexRemove(s, "<([^>]+)>")
        ' Restore the degree sign and drop the trailing comment terminator
        Debug.Print Replace$(Replace$(s, "&deg;", "°"), "-->", vbNullString)
    End With
End Sub

Public Function regexRemove(ByVal s As String, ByVal pattern As String) As String
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With

    If regex.test(s) Then
        regexRemove = regex.Replace(s, vbNullString)
    Else
        regexRemove = s
    End If
End Function

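【Discussion】:

  • Your solution should be accepted right away. It's a very detailed one.
  • I like the simplicity of yours more. I was heading in that direction, but you beat me to it! So I had to look for other ways to deliver the result :sighs:
【Solution 2】:

The solution is simple and straightforward. All you need to do is use the Replace() function to strip the comment markers ("<!--" and "-->") from responseText, then assign the result to Html.body.innerHTML so that the previously commented-out markup is parsed as proper HTML. The rest is as usual.

Here is how to get the content: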

Sub GetTextFromComment()
    Const URL As String = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
    Dim Http As New XMLHTTP60, Html As New HTMLDocument, post As Object

    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = Replace(Replace(.responseText, "<!--", ""), "-->", "")
    End With
    Set post = Html.querySelectorAll(".section_content")(2)
    MsgBox post.innerText
End Sub
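
To see the effect of the nested Replace calls without hitting the site, here is a minimal offline sketch. The helper name StripCommentMarkers and the sample markup are assumptions made for illustration; only the Replace technique itself comes from the answer above.

```vba
Option Explicit

' Hypothetical helper: removes HTML comment delimiters so that markup hidden
' inside comments becomes ordinary markup once the string is parsed.
Public Function StripCommentMarkers(ByVal source As String) As String
    StripCommentMarkers = Replace(Replace(source, "<!--", vbNullString), "-->", vbNullString)
End Function

Public Sub DemoStripCommentMarkers()
    ' Assumed sample markup that mimics the page: the div is wrapped in a comment.
    Const SAMPLE As String = "<div id=""wrap""><!--<div class=""section_content"">hidden text</div>--></div>"

    ' Prints: <div id="wrap"><div class="section_content">hidden text</div></div>
    Debug.Print StripCommentMarkers(SAMPLE)
End Sub
```

One thing to keep in mind: this exposes every commented-out fragment in the response, including any that were commented out on purpose, so an index such as (2) depends on how many .section_content elements the page contains once all comments are unwrapped.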

【Discussion】:

  • Love this solution.
  • Sweet and simple.