【发布时间】:2020-03-06 16:08:51
【问题描述】:
我有一段代码在这里已经存在了一段时间,其中包含不同类型的问题。这越来越接近它的最终版本。但是现在我遇到了一个问题,即代码中有错误,部分代码运行不正确。
我们的想法是浏览链接并获取 PDF 文件。链接存储在sLinks 中,请参阅注释“检查链接是否存储在 sLinks 中”。代码继续前进,文件存储在C:\temp\ 中,但是在文件夹中有 12 个 PDF 之后,我收到错误,调试器指向 xHttp.Open "GET", sLink。
我浏览了 PDF 文件,看起来所有文件都已下载……因为有几页上的内容相同,而且至少两页上有一份政策 PDF。这就是为什么有 17 个链接和 12 个文件的原因。无论如何它为什么会抛出错误?
可能是什么问题?
这是我的代码:
Sub DownloadFiles()
Dim xHttp As Object: Set xHttp = CreateObject("Microsoft.XMLHTTP")
Dim hDoc As MSHTML.HTMLDocument
Dim Anchors As Object
Dim Anchor As Variant
Dim sPath As String
Dim wholeURL As String
Dim internet As InternetExplorer
Dim internetdata As HTMLDocument
Dim internetlink As Object
Dim internetinnerlink As Object
Dim arrLinks As Variant
Dim sLink As String
Dim iLinkCount As Integer
Dim iCounter As Integer
Dim sLinks As String
Set internet = CreateObject("InternetExplorer.Application")
internet.Visible = False
internet.navigate ("https://www.webpage.com/products/waste-water/")
Do While internet.Busy
DoEvents
Loop
Do Until internet.readyState = READYSTATE_COMPLETE
DoEvents
Loop
Set internetdata = internet.document
Set internetlink = internetdata.getElementsByTagName("a")
i = 1
For Each internetinnerlink In internetlink
If Left$(internetinnerlink, 36) = "https://www.webpage.com/product/" Then
sLinks = sLinks & internetinnerlink.href & vbCrLf
i = i + 1
Else
End If
ThisWorkbook.Worksheets("Sheet1").range("B1").Value = sLinks ' Check that links are stored in sLinks
Next internetinnerlink
wholeURL = "https://www.webpage.com/"
sPath = "C:\temp\"
arrLinks = Split(sLinks, vbCrLf)
iLinkCount = UBound(arrLinks) + 1
For iCounter = 1 To iLinkCount
sLink = arrLinks(iCounter - 1)
'Get the directory listing
xHttp.Open "GET", sLink ' DEBUGGER IS POINTING HERE
xHttp.send
'Wait for the page to load
Do Until xHttp.readyState = 4
DoEvents
Loop
'Put the page in an HTML document
Set hDoc = New MSHTML.HTMLDocument
hDoc.body.innerHTML = xHttp.responseText
'Loop through the hyperlinks on the directory listing
Set Anchors = hDoc.getElementsByTagName("a")
For Each Anchor In Anchors
'test the pathname to see if it matches your pattern
If Anchor.pathname Like "*.pdf" Then
xHttp.Open "GET", wholeURL & Anchor.pathname, False
xHttp.send
With CreateObject("Adodb.Stream")
.Type = 1
.Open
.write xHttp.responseBody
.SaveToFile sPath & getName(wholeURL & Anchor.pathname), 2 '//overwrite
End With
End If
Next
Next
End Sub
从链接中建立文件名的功能:
Function getName(pf As String) As String
getName = Split(pf, "/")(UBound(Split(pf, "/")))
End Function
编辑:
我已经解决了第一个问题。 arrLinks = Split(p_sLinks, vbCrLf) 应更改为 arrLinks = Split(sLinks, vbCrLf)。现在我面临另一个问题。
编辑到www.webpage.com的链接
【问题讨论】:
标签: excel vba web-scraping web-crawler