【问题标题】:Web scraping using VBA使用 VBA 进行网页抓取
【发布时间】:2019-03-15 04:42:29
【问题描述】:

我想从这个URL中提取数据。

我想从 10 张名片中的每一张中提取职务、手机联系电话和地址。

这是我尝试但没有成功的一些代码。

Public Sub GetValueFromBrowser()
    On Error Resume Next
    Dim Sn As Integer
    Dim ie As Object
    Dim url As String
    Dim Doc As HTMLDocument
    Dim element As IHTMLElement
    Dim elements As IHTMLElementCollection

    For Sn = 1 To 1

        url = Sheets("Infos").Range("C" & Sn).Value

        Set ie = CreateObject("InternetExplorer.Application")

        With ie
            .Visible = 0
            .navigate url
            While .Busy Or .readyState <> 4
                DoEvents
            Wend
        End With    

        Set Doc = ie.document
        Set elements = Doc.getElementsByClassName(" col-sm-5 col-xs-8 store-details sp-detail paddingR0")

        Dim count As Long
        Dim erow As Long
        count = 0
        For Each element In elements
            If element.className = "lng_cont_name" Then
                erow = Sheet1.Cells(Rows.count, 1).End(xlUp).Offset(1, 0).Row
                Cells(erow, 1) = Doc.getElementsByClassName("Store-Name")(count).innerText
                Cells(erow, 2) = Doc.getElementsByClassName("cont_fl_addr")(count).innerText
                count = count + 1
            End If
        Next element

        If Val(Left(Sn, 2)) = 99 Then
            ActiveWorkbook.Save
        End If

    Next Sn
End Sub

【问题讨论】:

  • 到目前为止你写过代码吗?
  • 感谢先生重播
  • 谢谢先生,这是我尝试但现在可以使用的一些代码
  • 是的,我写了一些 VBA 宏
  • 请使用edit 提供的sn-p 工具包含HTML。不作为图像。如果可能,请包括您的代码和 URL。

标签: html vba web-scraping screen-scraping


【解决方案1】:

电话号码并不容易,因为我认为它们是故意难以抓取的。我找到了一种从 CSS 伪 ::before 元素内容中破译值的方法。地址和标题是简单的 CSS 选择。


我已经在 python here 中编写了一个更干净的脚本。


那么,代码的各个部分是如何工作的呢?

标题:

Set titles = .querySelectorAll(".jcn [title]")

我将标题定位为具有title 属性和父jcn 类属性的元素。 "." 表示类选择器,"[]" 表示属性选择器,而两者之间的" " 是后代组合器。

documentquerySelectorAll 方法返回页面上所有匹配元素(即 10 个标题)的 nodeList


地址:

Set addresses = .querySelectorAll(".desk-add.jaddt")

地址的目标是其类属性desk-add jaddt。由于不允许使用复合类名称,因此必须额外添加一个 "." 来替换名称中的空格。


电话号码(通过破译storesTextToDecipher中的内容):

Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")

这就是魔法发生的地方。这些数字不能直接通过 DOM 获得,因为它们是伪元素内容。

如果您检查相关的 HTML,您会发现一系列伪 ::before elements。 VBA 没有公开应用伪选择器来尝试在页面的 CSS 中获取此信息的机制。

实际上你看到的是一系列 span 元素,每个元素都有一个以 mobilesv 开头的类属性。这些元素位于 col-sm-5 col-xs-8 store-details sp-detail paddingR0 类的单个父元素中(再次注意复合类名称)。

我最初收集所有父元素的nodeList

返回元素示例:

这些父元素中的每一个都包含构成电话号码字符串字符的类名(以mobilesv 开头)元素。有些字符是字符串中的数字,有些则代表+()-。注意:类名中icon- 之后的 2|3 个字母字符串,例如dc, fe.

例如页面上的第一个搜索结果,对于电话号码中的初始号码9

这个伪元素 /telephone 字符的实际 CSS 内容可以在 CSS 样式中观察到:

注意类名和伪元素选择器之前:.icon-ji:before 内容为\9d010

长话短说....您可以提取icon-之后的2或3个字母,即ji,以及\9d0之后的数字字符串,即10,并使用这些两位信息来破译电话号码。此信息在响应中可用:

在左边看到与电话字符串的类名相关联的相同的 2/3 字母字符串,在右边看到内容说明。稍微算一下,右边的数字比该班级的电话号码大 1,如网页图像所示。我只是创建了一个字典,然后通过解析 html 的这一部分将 2/3 字母缩写映射到电话号码。

当循环遍历storesTextToDecipher 时,我使用该字典从类名中匹配的 2/3 字母缩写中破译实际电话号码。


VBA:

Option Explicit
Public Sub GetInfo()
    Dim sResponse As String, html As HTMLDocument
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3", False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With
  
    Dim cipherKey As String, cipherDict As Object
    Set cipherDict = CreateObject("Scripting.Dictionary")
    cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
    cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
    
    Dim arr() As String, tempArr() As String, i As Long, j As Long
    arr = Split(cipherKey, """}.icon-")
    For i = LBound(arr) To UBound(arr)
        tempArr = Split(arr(i), Chr$(32))
        cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
    Next

    html.body.innerHTML = sResponse
    Dim titles As Object, addresses As Object, storesTextToDecipher As Object
    With html
        Set titles = .querySelectorAll(".jcn [title]")
        Set addresses = .querySelectorAll(".desk-add.jaddt")
        Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
    End With
    
    For i = 0 To titles.Length - 1
        Debug.Print "title: " & titles.item(i).innerText
        Debug.Print "address: " & addresses.item(i).innerText
        Debug.Print GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
    Next
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
    Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
    Set html2 = New HTMLDocument
    html2.body.innerHTML = storeInfo.innerHTML
    Set elems = html2.querySelectorAll("b span")
    For j = 0 To elems.Length - 1
        On Error Resume Next
        If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
            telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
        End If
        On Error GoTo 0
    Next
   GetStoreNumber = telNumber
End Function

示例输出:


编辑:所有页面结果

由于您现在想要超过 10 个,因此以下使用预期的页面结果计数(NUMBER_RESULTS_ON_PAGE) 从页面收集信息。它滚动页面直到找到预期的电话号码数量(应该是唯一的),或MAX_WAIT_SEC 被击中。这意味着您可以避免无限循环,并且如果您期望不同的数字,可以设置您的预期结果计数。它确实依赖于每个商店都有列出的电话号码 - 这似乎是一个相当合理的假设。

Option Explicit
Public Sub GetInfo()
    Dim IE As New InternetExplorer, resultCountDict As Object, cipherDict As Object, t As Date
    Const MAX_WAIT_SEC As Long = 300 'wait 5 minutes max before exiting loop to get all results
    Const NUMBER_RESULTS_ON_PAGE As Long = 80
    Const URL = "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3"
    
    Dim titles As Object, addresses As Object, storesTextToDecipher As Object
    
    Application.ScreenUpdating = True
    
    Set resultCountDict = CreateObject("Scripting.Dictionary")
    Set cipherDict = GetCipherDict(URL)
    
    With IE
        .Visible = True
        .Navigate2 URL
       
        While .Busy Or .readyState < 4: DoEvents: Wend
        
        With .document
            t = Timer
            Do
                DoEvents
                Set titles = .querySelectorAll(".jcn [title]")
                Set addresses = .querySelectorAll(".desk-add.jaddt")
                Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
                Dim telNumber As String, i As Long
                       
                For i = 0 To titles.Length - 1
                    telNumber = GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
                    If Not resultCountDict.Exists(telNumber) Then
                        resultCountDict.Add telNumber, Array(titles.item(i).innerText, addresses.item(i).innerText, telNumber)
                    End If
                Next
            
                .parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
                
                While IE.Busy Or IE.readyState < 4: DoEvents: Wend
                If Timer - t > MAX_WAIT_SEC Then Exit Do
            Loop Until resultCountDict.Count = NUMBER_RESULTS_ON_PAGE

        End With
        .Quit
    End With
    
    Dim key As Variant, rowCounter As Long
    rowCounter = 1
    With ThisWorkbook.Worksheets("Sheet1")
        For Each key In resultCountDict.keys
            .Cells(rowCounter, 1).Resize(1, 3) = resultCountDict(key)
            rowCounter = rowCounter + 1
        Next
    End With
    Application.ScreenUpdating = True
End Sub

Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
    Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
    Set html2 = New HTMLDocument
    html2.body.innerHTML = storeInfo.innerHTML
    Set elems = html2.querySelectorAll("b span")
    For j = 0 To elems.Length - 1
        On Error Resume Next
        If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
            telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
        End If
        On Error GoTo 0
    Next
   GetStoreNumber = telNumber
End Function

Public Function GetCipherDict(ByVal URL As String) As Object
    Dim sResponse As String, html As HTMLDocument
    Set html = New HTMLDocument
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", URL, False
        .setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
        .send
        sResponse = StrConv(.responseBody, vbUnicode)
    End With

    Dim cipherKey As String, cipherDict As Object
    Set cipherDict = CreateObject("Scripting.Dictionary")
    cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
    cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))

    Dim arr() As String, tempArr() As String, i As Long, j As Long
    arr = Split(cipherKey, """}.icon-")
    For i = LBound(arr) To UBound(arr)
        tempArr = Split(arr(i), Chr$(32))
        cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
    Next
    Set GetCipherDict = cipherDict
End Function

编辑:

顶部出现多个数字的版本(请注意,如果您发出太多请求或太快,服务器将为您提供随机页面):

Option Explicit

Public Sub GetDetails()
    Dim re As Object, decodeDict As Object, i As Long
    Dim html As MSHTML.htmlDocument, responseText As String, keys(), values()
    
    Set decodeDict = CreateObject("Scripting.Dictionary")
    Set re = CreateObject("vbscript.regexp")
    Set html = New MSHTML.htmlDocument
    
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.justdial.com/chengalpattu/Oasis-Pharma-Near-Saraswathi-Children-School-Revathypuram-Urapakkam/9999PXX44-XX44-181123145524-X8G7_BZDET", False
        .setRequestHeader "User-Agent", "Mozilla/4.0"
        .send
        responseText = .responseText
        html.body.innerHTML = responseText
    End With
    
    keys = GetMatches(re, responseText, "-(\w+):before")

    If UBound(keys) = 0 Then Exit Sub
    
    values = GetMatches(re, responseText, "9d0(\d+)", True)
   
    For i = LBound(values) To UBound(values)
        decodeDict(keys(i)) = values(i)
    Next
    
    Dim itemsToDecode()
    
    decodeDict(keys(UBound(keys))) = "+"

    itemsToDecode = GetValuesToDecode(html)
    
    PrintNumbers re, html, itemsToDecode, decodeDict
End Sub

Public Function GetMatches(ByVal re As Object, ByVal inputString As String, ByVal sPattern As String, Optional ByVal numeric = False, Optional ByVal spanSearch = False) As Variant
    Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long

    With re
        .Global = True
        .MultiLine = True
        .IgnoreCase = True
        .Pattern = sPattern

        If .Test(inputString) Then
            Set matches = .Execute(inputString)
            ReDim arrMatches(0 To matches.Count - 1)
            For Each iMatch In matches
                If numeric Then
                    arrMatches(i) = iMatch.SubMatches.item(0) - 1
                Else
                    If spanSearch Then
                        arrMatches(i) = iMatch
                    Else
                        arrMatches(i) = iMatch.SubMatches.item(0)
                    End If
                End If
                i = i + 1
            Next iMatch
        Else
            ReDim arrMatches(0)
            arrMatches(0) = vbNullString
        End If
    End With
    GetMatches = arrMatches
End Function

Public Function GetValuesToDecode(ByVal html As MSHTML.htmlDocument) As Variant
    Dim i As Long, elements As Object, results(), Class As String

    Set elements = html.querySelectorAll(".telCntct span[class*='icon']")
    
    ReDim results(elements.Length - 1)
    For i = 0 To elements.Length - 1
        Class = elements.item(i).className
        results(i) = Right$(Class, Len(Class) - InStrRev(Class, "-"))
    Next
    GetValuesToDecode = results
End Function

Public Sub PrintNumbers(ByVal re As Object, ByVal html As htmlDocument, ByVal itemsToDecode As Variant, ByVal decodeDict As Object)
    Dim output As String, i As Long

    For i = LBound(itemsToDecode) To UBound(itemsToDecode)
        output = output & decodeDict(itemsToDecode(i))
    Next
    
    Dim htmlToSearch As String, groups As Variant, startPos As Long, oldStartPos As Long
    
    htmlToSearch = html.querySelector(".telCntct").outerHTML

    groups = GetMatches(re, htmlToSearch, "mobilesv|,", False, True)
 
    startPos = 1
    
    Dim totalNumbers As Long
    
    For i = LBound(groups) To UBound(groups)
        If InStr(groups(i), ",") > 0 Then
            totalNumbers = totalNumbers + 1
            Debug.Print Mid$(output, startPos, IIf(startPos = 1, i, i - startPos))
            startPos = i + 1
        End If
    Next
    If totalNumbers = 1 Then Debug.Print Right$(output, Len(output) - startPos - 1)
End Sub

【讨论】:

  • 请问 - “If-Modified-Since”标题是什么?它是否强制使用最新版本的页面并忽略缓存或其他内容?
  • 应该是这个想法。它解决了一个 OP 的缓存问题,这是我在这个网站上找到的一个建议。但是,据我了解,服务器仍然可以提供缓存的响应。我不是这方面的专家。我一直在与其他知情人士进行讨论,但尚未找到一种方法来确保不提供缓存结果。显然,服务器可以忽略你发送给它的缓存指令,这取决于它的配置。如果我可以创建一个证明缓存的 MCVE,我会发布一个问题,但我自己无法设置测试用例:-(
  • 太棒了,我前一阵子遇到了这个问题,这太令人沮丧了——我还看到“If-Modified-Since”做了一些不相关的 PHP 工作,我认为从未将两者联系在一起,直到在您的 GET 请求中看到这一点 - 期待尝试这个,谢谢!
  • 谢谢你,先生,我想要的完美......再次感谢
  • 亲爱的先生,我尝试使用此代码...它的工作完美,但有些数字没有得到..几乎 70% 的数字没有得到,而它在网页上可用....什么可以是完成 100 % 工作的可能方法,请先生帮帮我
猜你喜欢
  • 2019-04-03
  • 2017-11-04
  • 2019-12-23
  • 2016-09-12
  • 1970-01-01
  • 2020-11-30
  • 2019-07-30
  • 2018-10-23
相关资源
最近更新 更多