电话号码并不容易,因为我认为它们是故意难以抓取的。我找到了一种从 CSS 伪 ::before 元素内容中破译值的方法。地址和标题是简单的 CSS 选择。
我已经在 python here 中编写了一个更干净的脚本。
那么,代码的各个部分是如何工作的呢?
标题:
Set titles = .querySelectorAll(".jcn [title]")
我将标题定位为具有title 属性和父jcn 类属性的元素。 "." 表示类选择器,"[]" 表示属性选择器,而两者之间的" " 是后代组合器。
document 的querySelectorAll 方法返回页面上所有匹配元素(即 10 个标题)的 nodeList。
地址:
Set addresses = .querySelectorAll(".desk-add.jaddt")
地址的目标是其类属性desk-add jaddt。由于不允许使用复合类名称,因此必须额外添加一个 "." 来替换名称中的空格。
电话号码(通过破译storesTextToDecipher中的内容):
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
这就是魔法发生的地方。这些数字不能直接通过 DOM 获得,因为它们是伪元素内容。
如果您检查相关的 HTML,您会发现一系列伪 ::before elements。 VBA 没有公开应用伪选择器来尝试在页面的 CSS 中获取此信息的机制。
实际上你看到的是一系列 span 元素,每个元素都有一个以 mobilesv 开头的类属性。这些元素位于 col-sm-5 col-xs-8 store-details sp-detail paddingR0 类的单个父元素中(再次注意复合类名称)。
我最初收集所有父元素的nodeList。
返回元素示例:
这些父元素中的每一个都包含构成电话号码字符串字符的类名(以mobilesv 开头)元素。有些字符是字符串中的数字,有些则代表+()-。注意:类名中icon- 之后的 2|3 个字母字符串,例如dc, fe.
例如页面上的第一个搜索结果,对于电话号码中的初始号码9:
这个伪元素 /telephone 字符的实际 CSS 内容可以在 CSS 样式中观察到:
注意类名和伪元素选择器之前:.icon-ji:before
内容为\9d010。
长话短说....您可以提取icon-之后的2或3个字母,即ji,以及\9d0之后的数字字符串,即10,并使用这些两位信息来破译电话号码。此信息在响应中可用:
在左边看到与电话字符串的类名相关联的相同的 2/3 字母字符串,在右边看到内容说明。稍微算一下,右边的数字比该班级的电话号码大 1,如网页图像所示。我只是创建了一个字典,然后通过解析 html 的这一部分将 2/3 字母缩写映射到电话号码。
当循环遍历storesTextToDecipher 时,我使用该字典从类名中匹配的 2/3 字母缩写中破译实际电话号码。
VBA:
Option Explicit
Public Sub GetInfo()
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3", False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
html.body.innerHTML = sResponse
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
With html
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
End With
For i = 0 To titles.Length - 1
Debug.Print "title: " & titles.item(i).innerText
Debug.Print "address: " & addresses.item(i).innerText
Debug.Print GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
Next
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
示例输出:
编辑:所有页面结果
由于您现在想要超过 10 个,因此以下使用预期的页面结果计数(NUMBER_RESULTS_ON_PAGE) 从页面收集信息。它滚动页面直到找到预期的电话号码数量(应该是唯一的),或MAX_WAIT_SEC 被击中。这意味着您可以避免无限循环,并且如果您期望不同的数字,可以设置您的预期结果计数。它确实依赖于每个商店都有列出的电话号码 - 这似乎是一个相当合理的假设。
Option Explicit
Public Sub GetInfo()
Dim IE As New InternetExplorer, resultCountDict As Object, cipherDict As Object, t As Date
Const MAX_WAIT_SEC As Long = 300 'wait 5 minutes max before exiting loop to get all results
Const NUMBER_RESULTS_ON_PAGE As Long = 80
Const URL = "https://www.justdial.com/Ahmedabad/Beauty-Parlours/page-3"
Dim titles As Object, addresses As Object, storesTextToDecipher As Object
Application.ScreenUpdating = True
Set resultCountDict = CreateObject("Scripting.Dictionary")
Set cipherDict = GetCipherDict(URL)
With IE
.Visible = True
.Navigate2 URL
While .Busy Or .readyState < 4: DoEvents: Wend
With .document
t = Timer
Do
DoEvents
Set titles = .querySelectorAll(".jcn [title]")
Set addresses = .querySelectorAll(".desk-add.jaddt")
Set storesTextToDecipher = .querySelectorAll(".col-sm-5.col-xs-8.store-details.sp-detail.paddingR0")
Dim telNumber As String, i As Long
For i = 0 To titles.Length - 1
telNumber = GetStoreNumber(storesTextToDecipher.item(i), cipherDict)
If Not resultCountDict.Exists(telNumber) Then
resultCountDict.Add telNumber, Array(titles.item(i).innerText, addresses.item(i).innerText, telNumber)
End If
Next
.parentWindow.execScript "window.scrollBy(0, window.innerHeight);", "javascript"
While IE.Busy Or IE.readyState < 4: DoEvents: Wend
If Timer - t > MAX_WAIT_SEC Then Exit Do
Loop Until resultCountDict.Count = NUMBER_RESULTS_ON_PAGE
End With
.Quit
End With
Dim key As Variant, rowCounter As Long
rowCounter = 1
With ThisWorkbook.Worksheets("Sheet1")
For Each key In resultCountDict.keys
.Cells(rowCounter, 1).Resize(1, 3) = resultCountDict(key)
rowCounter = rowCounter + 1
Next
End With
Application.ScreenUpdating = True
End Sub
Public Function GetStoreNumber(ByVal storeInfo As Object, cipherDict As Object) As String
Dim html2 As HTMLDocument, elems As Object, telNumber As String, j As Long
Set html2 = New HTMLDocument
html2.body.innerHTML = storeInfo.innerHTML
Set elems = html2.querySelectorAll("b span")
For j = 0 To elems.Length - 1
On Error Resume Next
If cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString)) < 10 Then
telNumber = telNumber & cipherDict(Replace$(elems.item(j).className, "mobilesv icon-", vbNullString))
End If
On Error GoTo 0
Next
GetStoreNumber = telNumber
End Function
Public Function GetCipherDict(ByVal URL As String) As Object
Dim sResponse As String, html As HTMLDocument
Set html = New HTMLDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", URL, False
.setRequestHeader "If-Modified-Since", "Sat, 1 Jan 2000 00:00:00 GMT"
.send
sResponse = StrConv(.responseBody, vbUnicode)
End With
Dim cipherKey As String, cipherDict As Object
Set cipherDict = CreateObject("Scripting.Dictionary")
cipherKey = Split(Split(sResponse, "smoothing:grayscale}.icon-")(1), ".mobilesv")(0)
cipherKey = Replace$(cipherKey, ":before{content:""\9d", Chr$(32))
Dim arr() As String, tempArr() As String, i As Long, j As Long
arr = Split(cipherKey, """}.icon-")
For i = LBound(arr) To UBound(arr)
tempArr = Split(arr(i), Chr$(32))
cipherDict(tempArr(0)) = Replace$(tempArr(1), """}", vbNullString) - 1 'needs adjustment
Next
Set GetCipherDict = cipherDict
End Function
编辑:
顶部出现多个数字的版本(请注意,如果您发出太多请求或太快,服务器将为您提供随机页面):
Option Explicit
Public Sub GetDetails()
Dim re As Object, decodeDict As Object, i As Long
Dim html As MSHTML.htmlDocument, responseText As String, keys(), values()
Set decodeDict = CreateObject("Scripting.Dictionary")
Set re = CreateObject("vbscript.regexp")
Set html = New MSHTML.htmlDocument
With CreateObject("MSXML2.XMLHTTP")
.Open "GET", "https://www.justdial.com/chengalpattu/Oasis-Pharma-Near-Saraswathi-Children-School-Revathypuram-Urapakkam/9999PXX44-XX44-181123145524-X8G7_BZDET", False
.setRequestHeader "User-Agent", "Mozilla/4.0"
.send
responseText = .responseText
html.body.innerHTML = responseText
End With
keys = GetMatches(re, responseText, "-(\w+):before")
If UBound(keys) = 0 Then Exit Sub
values = GetMatches(re, responseText, "9d0(\d+)", True)
For i = LBound(values) To UBound(values)
decodeDict(keys(i)) = values(i)
Next
Dim itemsToDecode()
decodeDict(keys(UBound(keys))) = "+"
itemsToDecode = GetValuesToDecode(html)
PrintNumbers re, html, itemsToDecode, decodeDict
End Sub
Public Function GetMatches(ByVal re As Object, ByVal inputString As String, ByVal sPattern As String, Optional ByVal numeric = False, Optional ByVal spanSearch = False) As Variant
Dim matches As Object, iMatch As Object, s As String, arrMatches(), i As Long
With re
.Global = True
.MultiLine = True
.IgnoreCase = True
.Pattern = sPattern
If .Test(inputString) Then
Set matches = .Execute(inputString)
ReDim arrMatches(0 To matches.Count - 1)
For Each iMatch In matches
If numeric Then
arrMatches(i) = iMatch.SubMatches.item(0) - 1
Else
If spanSearch Then
arrMatches(i) = iMatch
Else
arrMatches(i) = iMatch.SubMatches.item(0)
End If
End If
i = i + 1
Next iMatch
Else
ReDim arrMatches(0)
arrMatches(0) = vbNullString
End If
End With
GetMatches = arrMatches
End Function
Public Function GetValuesToDecode(ByVal html As MSHTML.htmlDocument) As Variant
Dim i As Long, elements As Object, results(), Class As String
Set elements = html.querySelectorAll(".telCntct span[class*='icon']")
ReDim results(elements.Length - 1)
For i = 0 To elements.Length - 1
Class = elements.item(i).className
results(i) = Right$(Class, Len(Class) - InStrRev(Class, "-"))
Next
GetValuesToDecode = results
End Function
Public Sub PrintNumbers(ByVal re As Object, ByVal html As htmlDocument, ByVal itemsToDecode As Variant, ByVal decodeDict As Object)
Dim output As String, i As Long
For i = LBound(itemsToDecode) To UBound(itemsToDecode)
output = output & decodeDict(itemsToDecode(i))
Next
Dim htmlToSearch As String, groups As Variant, startPos As Long, oldStartPos As Long
htmlToSearch = html.querySelector(".telCntct").outerHTML
groups = GetMatches(re, htmlToSearch, "mobilesv|,", False, True)
startPos = 1
Dim totalNumbers As Long
For i = LBound(groups) To UBound(groups)
If InStr(groups(i), ",") > 0 Then
totalNumbers = totalNumbers + 1
Debug.Print Mid$(output, startPos, IIf(startPos = 1, i, i - startPos))
startPos = i + 1
End If
Next
If totalNumbers = 1 Then Debug.Print Right$(output, Len(output) - startPos - 1)
End Sub