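Can't fetch some information from a webpage using xhr

【Question title】: Can't fetch some information from a webpage using xhr
【Posted】: 2018-12-31 19:05:09
【Question description】:

I'm trying to fetch some information from a webpage using an xmlhttp request. When I execute my script, it throws the error Object Variable Or With---. However, when I try the same thing using IE, I get the content like magic.

The most important thing to note is that the content I wish to scrape is neither javascript encrypted nor dynamically generated. So, I should be able to get it using xhr. Where am I going wrong?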

Here goes the website link

Using IE (working):

Sub GetText()
    Const Url As String = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
    Dim IE As New InternetExplorer, HTML As HTMLDocument, post As Object

    With IE
        .Visible = False
        .navigate Url
        While .Busy = True Or .readyState < 4: DoEvents: Wend
        Set HTML = .document
    End With
    
    Set post = HTML.querySelectorAll(".section_content")(2)
    MsgBox post.innerText
End Sub

Using XHR (not working):

Sub GetText()
    Const Url As String = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
    Dim Http As New XMLHTTP60, HTML As New HTMLDocument, post As Object

    With Http
        .Open "GET", Url, False
        .send
        HTML.body.innerHTML = .responseText
    End With
    
    Set post = HTML.querySelectorAll(".section_content")(2)
    MsgBox post.innerText
End Sub

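The selector I've defined above is flawless.

I could have pasted the relevant html elements here, but they are contained within comments. However, I've provided the link to the website above.

To be clearer: the portion of text I'm interested in looks exactly like the one below on that webpage.

My question: how can I get the text block above (as shown in the image) using XHR?

【Question discussion】:

I get exactly the same error with the IE version. Try For Each post In HTML.getElementsByClassName("section_content"): MsgBox post.innerText: Next
You get the same error using IE because your version of IE doesn't support .querySelector() @peakpeak.

Tags: vba excel web-scraping xmlhttprequest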

【Solution 1】:

Using the position of the comment:

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))
    With html
        .body.innerHTML = sResponse
        html.body.innerHTML = html.querySelector("#all_9711922514").LastChild.Data
        Debug.Print html.querySelector("#div_9711922514").innerText
    End With
End Sub

Using nodeType:

Option Explicit    
Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument, ele As Object
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

    With html
        .body.innerHTML = sResponse
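        ' ChildNodes also enumerates comment nodes (NodeType 8); Children is meant for element nodes only
        For Each ele In html.querySelector("#all_9711922514").ChildNodes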
            If ele.NodeType = 8 Then
                html.body.innerHTML = ele.Data
                Debug.Print html.querySelector("#div_9711922514").innerText
                Exit For
            End If
        Next
    End With
End Sub
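
Both approaches above locate the comment through the DOM. As a rough sketch of the same idea on the raw response string, the comment body can also be cut out with InStr and Mid$ before any parsing. The function name CommentBodyAfter, the demo sub, and the sample ids all_demo and div_demo are hypothetical, made up for illustration; the real page uses its own ids.

```vba
Option Explicit

' Hypothetical helper (not from the original answer): returns the text of the
' first HTML comment that starts at or after the first occurrence of marker.
' Returns an empty string when the marker or a complete comment is missing.
Public Function CommentBodyAfter(ByVal source As String, ByVal marker As String) As String
    Dim markerPos As Long, startPos As Long, endPos As Long

    markerPos = InStr(1, source, marker)
    If markerPos = 0 Then Exit Function

    startPos = InStr(markerPos, source, "<!--")
    If startPos = 0 Then Exit Function
    startPos = startPos + Len("<!--")          ' skip past the opening delimiter

    endPos = InStr(startPos, source, "-->")
    If endPos = 0 Then Exit Function

    CommentBodyAfter = Mid$(source, startPos, endPos - startPos)
End Function

Public Sub DemoCommentBodyAfter()
    ' Assumed sample markup that mimics a block hidden inside a comment
    Dim sample As String
    sample = "<div id=""all_demo""><!--<div id=""div_demo"">hidden text</div>--></div>"
    Debug.Print CommentBodyAfter(sample, "all_demo")
End Sub
```

The returned string can then be assigned to html.body.innerHTML and queried as in the code above. Checking for an empty result first avoids running a selector against a document that does not contain the block.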

Using regular expressions:

Option Explicit

Public Sub GetInfo()
    Dim sResponse As String, html As New HTMLDocument

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml", False
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    sResponse = Mid$(sResponse, InStr(1, sResponse, "<!DOCTYPE "))

    With html
        .body.innerHTML = sResponse
        Dim s As String
        s = .querySelector("div[id=all_1786105919]").outerHTML
        s = regexRemove(s, "<([^>]+)>")
        ' Convert the degree entity back to its symbol and drop the leftover comment terminator
        Debug.Print Replace$(Replace$(s, "&deg;", "°"), "-->", vbNullString)
    End With
End Sub

Public Function regexRemove(ByVal s As String, ByVal pattern As String) As String
    Dim regex As Object
    Set regex = CreateObject("VBScript.RegExp")
    With regex
        .Global = True
        .MultiLine = True
        .IgnoreCase = False
        .pattern = pattern
    End With

    If regex.test(s) Then
        regexRemove = regex.Replace(s, vbNullString)
    Else
        regexRemove = s
    End If
End Function

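Output:

【Discussion】:

Your solution should be accepted right away. It is a very detailed one.
I like the simplicity of yours more. I was heading in that direction, but you beat me to it! So I had to find other ways of delivering the result :sighs: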

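【Solution 2】:

The solution is simple and straightforward. All you need to do is use the Replace() function to kick the comment signs out of responseText, then pass the result through Html.body.innerHTML so that it becomes proper html content. The rest is as usual.

This is how you can get the content: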

Sub GetTextFromComment()
    Const URL As String = "https://www.baseball-reference.com/boxes/ANA/ANA201806180.shtml"
    Dim Http As New XMLHTTP60, Html As New HTMLDocument, post As Object

    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = Replace(Replace(.responseText, "<!--", ""), "-->", "")
    End With
    Set post = Html.querySelectorAll(".section_content")(2)
    MsgBox post.innerText
End Sub
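
The nested Replace call is the whole trick, and it can be tried offline on a literal string. Below is a minimal sketch of that; the names StripCommentMarkers and DemoStripCommentMarkers and the sample markup are hypothetical, introduced only for illustration.

```vba
Option Explicit

' Hypothetical helper: removes the HTML comment delimiters so that markup
' hidden inside <!-- ... --> becomes ordinary markup once it is parsed.
Public Function StripCommentMarkers(ByVal source As String) As String
    StripCommentMarkers = Replace(Replace(source, "<!--", vbNullString), "-->", vbNullString)
End Function

Public Sub DemoStripCommentMarkers()
    ' Assumed sample markup with a block wrapped in a comment
    Dim sample As String
    sample = "<div id=""outer""><!--<div class=""section_content"">hidden text</div>--></div>"
    Debug.Print StripCommentMarkers(sample)
End Sub
```

Note that this exposes every commented-out block in the response, not only the one of interest, so indexes such as (2) refer to positions in the uncommented document. Testing post Is Nothing before reading post.innerText is a cheap guard in case the page layout changes.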

【Discussion】:

Love this solution.
Sweet and simple.