[Question title]: XHR request response text has unexpected character set
[Posted]: 2018-07-22 16:54:49
[Question description]:

I was looking at @OmegaStripes's answer to the question "How to get a particular InnerText from a specific class?", which uses the Split function with specified delimiter strings to extract an href from the .responseBody.

I then tried to replicate it to extract the following href

"https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2018/02/New-AmbSYS-to-2018-Jan.csv"

from NHS England's Ambulance Quality Indicators page.

HTML snippet:

<main class="main group" role="main">
        <div class="page-content" id="main-content">
            <header>
                <h1>Ambulance Quality Indicators</h1>
            </header>
            <article class="rich-text">
               <p></p>
              <p></p>
              <p></p>
               <p></p>
              <p></p>
              <p><strong>CSV Data</strong><br>
These files have the same data as other published spreadsheets, but without any formatting:<br>
                <a href="https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2018/02/New-AmbSYS-to-2018-Jan.csv" class="csv-link" onclick="ga('send', 'event', 'Downloads', 'CSV', 'https://www.england.nhs.uk/statistics/wp-content/uploads/sites/2/2018/02/New-AmbSYS-to-2018-Jan.csv');">New Systems Indicators August 2017 to January 2018 (CSV, 23KB)</a><br>
            </article>
    </div>
</main>

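Problem:

I get response text like the following:

Sample response text:

From some quick research, looking at the references, I am guessing this might be an encoding issue?

I tried setting a request header with .setRequestHeader: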

 .setRequestHeader "Content-Type", _
     "application/x-www-form-urlencoded; charset=UTF-8"

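This had no effect on the output.

To be honest, I am not sure how to resolve this.

Any suggestions on how I can get the expected response text, i.e. text from which I can parse the href of interest?

Context:

This is part of a larger piece of work in which:

1) I want to grab that CSV link (its name changes every month) without a browser window popping up

2) Download the target file content

3) Write the binary file out using ADODB.Stream.

@OmegaStripes outlined this process in an answer to my question "Return focus to ThisWorkbook.Activesheet after XMLHTTP60 file download". I am currently trying to understand and implement that suggestion.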
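
For orientation, steps 2) and 3) above might look roughly like the sketch below. It is not the code from the referenced answer: the procedure name SaveBinaryFromUrl and its parameters are hypothetical, and it assumes the file URL has already been found and that a plain synchronous GET is enough to fetch it.

```vba
Option Explicit

' Hypothetical helper: downloads fileUrl and writes the raw bytes to savePath.
Public Sub SaveBinaryFromUrl(ByVal fileUrl As String, ByVal savePath As String)
    Const adTypeBinary As Long = 1           ' ADODB StreamTypeEnum
    Const adSaveCreateOverWrite As Long = 2  ' ADODB SaveOptionsEnum

    Dim http As Object
    Set http = CreateObject("MSXML2.XMLHTTP")
    http.Open "GET", fileUrl, False          ' False = synchronous request
    http.send

    If http.Status <> 200 Then
        Err.Raise vbObjectError + 513, "SaveBinaryFromUrl", _
                  "Download failed with HTTP status " & http.Status
    End If

    ' Write the untouched byte array to disk; no text decoding is involved here.
    With CreateObject("ADODB.Stream")
        .Type = adTypeBinary
        .Open
        .Write http.responseBody
        .SaveToFile savePath, adSaveCreateOverWrite
        .Close
    End With
End Sub
```

Because the stream stays in binary mode, the encoding question raised in this post only matters for the page that has to be parsed, not for the CSV file being saved.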

Code:

Option Explicit

Public Const url As String = "https://www.england.nhs.uk/statistics/statistical-work-areas/ambulance-quality-indicators/"
Public aBody As String

Sub Testing()

    ' Download via XHR
    With CreateObject("MSXML2.XMLHTTP")

        .Open "GET", url, False
        .setRequestHeader "Content-Type", "application/x-www-form-urlencoded; charset=utf-8"
        .send
        ' Get binary response content
        aBody = .responseBody

    End With

    ActiveSheet.Range("A1") = aBody

End Sub

References:

1) XMLHTTP and Special Characters (eg, accents)

2) setRequestHeader Method (IXMLHTTPRequest)

3) VBA HTML Scraping - '.innertext' from complex table

4) Msxml2.ServerXMLHTTP and UTF-8 charset issues

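[Question discussion]:

  • The response headers do not specify an encoding, which is probably why MSXML2.XMLHTTP does not decode the body correctly. Use ADODB.Stream with stream.CharSet = "UTF-8". For example: stackoverflow.com/questions/26624736/convert-binary-to-string/…
  • @FlorentB. Thank you. I will take a look. I don't know whether it is my ignorance, but this seems to be for a later stage. Would this somehow tie in with the response text? I first have to identify the file URL from the XHR.
  • Note that .responseBody returns a byte array encoded as UTF-8. You are converting it to a String (UTF-16 encoded), which is why you get all those foreign characters. If the CSV file contains only ASCII characters, use .responseText; if not, use ADODB.Stream to convert .responseBody.

Tags: html vba web-scraping encoding xmlhttprequest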


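[Solution 1]:

So, thanks to @FlorentB for this solution, and a nod to @OmegaStripes for the suggestions.

As suggested, the problem was indeed that .responseBody returns a byte array encoded as UTF-8. As pointed out, I was converting it to a String (UTF-16 encoded), hence all those foreign characters.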
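
To make the cause concrete, here is a small illustrative sketch (not part of the original code; the procedure name is made up). It assumes four single-byte characters standing in for the start of a UTF-8 body and shows that assigning a byte array straight to a String makes VBA read every two bytes as one UTF-16 code unit.

```vba
Option Explicit

' Hypothetical demo: why assigning raw bytes to a String garbles the text.
Public Sub DemoByteArrayToString()
    Dim rawBytes() As Byte
    Dim garbled As String

    ' Four single-byte characters, as they would arrive in an ASCII/UTF-8 body.
    rawBytes = StrConv("href", vbFromUnicode)

    ' VBA copies the bytes unchanged and treats each pair as one UTF-16 code unit.
    garbled = rawBytes

    Debug.Print UBound(rawBytes) - LBound(rawBytes) + 1  ' 4 bytes received
    Debug.Print Len(garbled)                             ' 2 characters in the String
End Sub
```

The String ends up with half as many characters as there were bytes, and none of them is the original text, which matches the garbled output described in the question.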

I used @Tomalak's function BytesToString, with minor alterations, to handle the conversion to a string.

Code:

Option Explicit

Public Const url As String = "https://www.england.nhs.uk/statistics/statistical-work-areas/ambulance-quality-indicators/"
Public aBody As String 'this is causing the conversion
Const adTypeBinary As Byte = 1
Const adTypeText As Byte = 2
Const adModeReadWrite As Byte = 3
Public Const strPath As String = "C:\Users\User\Desktop\testXMLHTTPOutput"

Public Sub Testing()
    ' Download the page via XHR
    With CreateObject("MSXML2.XMLHTTP")

        .Open "GET", url, False
        .send
        ' Decode the binary response (a UTF-8 byte array) into a VBA String
        aBody = BytesToString(.responseBody, "UTF-8")

    End With

    ' Write the decoded text to disk for inspection
    ' Note: CreateTextFile writes ANSI by default; pass True as the third argument for Unicode output
    Dim fso As Object  'late binding
    Set fso = CreateObject("Scripting.FileSystemObject")
    Dim oFile As Object
    Set oFile = fso.CreateTextFile(strPath)
    oFile.WriteLine aBody
    oFile.Close
    Set fso = Nothing
    Set oFile = Nothing

End Sub
' BytesToString relies on ADODB.Stream with stream.CharSet = "UTF-8"
' http://msdn.microsoft.com/en-us/library/windows/desktop/ms675032%28v=vs.85%29.aspx


Public Function BytesToString(ByVal bytes As Variant, ByVal charset As String) As String

    With CreateObject("ADODB.Stream")
        .Mode = adModeReadWrite
        .Type = adTypeBinary
        .Open
        .Write bytes
        .Position = 0
        .Type = adTypeText
        .charset = charset
        BytesToString = .ReadText
    End With
End Function
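
With the decoded text in aBody, the href from the question can then be located. The following is only a sketch under assumptions: the function name ExtractCsvHref is hypothetical, it uses InStr and InStrRev rather than the Split approach mentioned in the question, and it assumes the link carries class="csv-link" with the href attribute placed before it in the same tag, as in the HTML snippet above.

```vba
Option Explicit

' Hypothetical helper: returns the href of the first link marked class="csv-link",
' or an empty string when no such link is found.
Public Function ExtractCsvHref(ByVal html As String) As String
    Const CLASS_MARKER As String = "class=""csv-link"""
    Const HREF_OPEN As String = "href="""

    Dim markerPos As Long
    Dim startPos As Long
    Dim endPos As Long

    ' Find the class attribute that identifies the CSV link.
    markerPos = InStr(1, html, CLASS_MARKER, vbTextCompare)
    If markerPos = 0 Then Exit Function

    ' Search backwards from the marker for the nearest href=" opener.
    startPos = InStrRev(html, HREF_OPEN, markerPos, vbTextCompare)
    If startPos = 0 Then Exit Function
    startPos = startPos + Len(HREF_OPEN)

    ' The URL ends at the next double quote.
    endPos = InStr(startPos, html, """")
    If endPos = 0 Then Exit Function

    ExtractCsvHref = Mid$(html, startPos, endPos - startPos)
End Function
```

A call such as ExtractCsvHref(aBody) after Testing has run would be the intended use. If the page layout changes so that class comes before href, the backward search would need adjusting.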

Other useful links here:

Save text file UTF-8 encoded with VBA
