simple_html_dom to crawl entire website [closed]

【Question title】: simple_html_dom to crawl entire website [closed]
【Posted】: 2014-06-07 13:06:05
【Question description】:

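I want to crawl an entire website. I am using Simple_html_dom for parsing, but the problem is that it takes only one web page link at a time. I just want to give it the starting (main page) link and have it crawl and parse all of that site's pages automatically. Any suggestions?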

【Question discussion】:

Tags: parsing simple-html-dom web-crawler

【Solution 1】:

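While parsing the DOM of that single page, store all the links (within the same domain) in an array. Then, when parsing is finished, check whether the array is empty. If it isn't, take the first link and repeat the same process.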

Something like this (a code sample written in Python-like syntax, but you can easily adapt it to PHP; mine has gotten rusty).

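from collections import deque

def crawl_dom(url):
    """Download the url, parse the DOM, and return the same-domain hyperlinks found."""
    return []  # placeholder: plug in your own download and parsing code here

referenced_links = deque(['your_initial_page.html'])  # pages still to crawl
visited = set()  # pages already crawled, so circular links don't loop forever

while referenced_links:  # while the queue isn't empty...
    url = referenced_links.popleft()  # take the first link and remove it from the queue
    if url in visited:
        continue
    visited.add(url)
    referenced_links.extend(crawl_dom(url))  # queue the links found on that page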
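For anyone who wants to see the queue idea end to end, here is a minimal sketch in plain Python using only the standard library. It is not simple_html_dom and not necessarily how the answerer would write it: the names `LinkCollector`, `same_domain_links`, `fetch`, `crawl`, and the `max_pages` cap are all made up for illustration, and "same domain" is assumed to mean an exact match on the host part of the URL.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(base_url, html):
    """Return absolute, fragment-free links in html that share base_url's host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(base_url).netloc
    links = []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.append(absolute)
    return links


def fetch(url):
    """Illustrative downloader (an assumption): returns the page body as text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, fetch_page=fetch, max_pages=100):
    """Breadth-first crawl from start_url; returns the URLs in visit order."""
    queue = deque([start_url])
    visited = set()
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that fail to download
        order.append(url)
        # Per-page parsing (the simple_html_dom step in PHP) would go here.
        for link in same_domain_links(url, html):
            if link not in visited:
                queue.append(link)
    return order
```

Calling `crawl("http://example.com/")` would walk the site breadth-first, stopping after `max_pages` pages. Passing your own `fetch_page` function lets you try the logic on canned HTML without touching the network. The same structure should carry over to PHP, with simple_html_dom doing the link extraction.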

【Discussion】: