Saving an offline copy of the Google Python tutorial using Python: answers

【Question title】: Saving offline copy of Google Python tutorial using python.
【Posted】: 2026-01-12 04:45:01
【Question description】:

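I am trying to write Python code that saves an offline copy of the "Google Python tutorial", so that I can read the files even when I am not connected to the internet. For this I import the following libraries: urllib, re, BeautifulSoup and os. The idea is to identify all the URLs under the navigation path (class gc-toc), then loop over each URL and save the HTML file locally. The code is below.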

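My questions are:

The downloaded HTML files still try to load their CSS and JS files from the online location. How can I download those files programmatically as well?

At the moment the whole program looks cumbersome. Can you suggest ways to improve it? For example, I would like to avoid re and use BeautifulSoup to extract the links under the 'gc-toc' class.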

import urllib
import re
from BeautifulSoup import *
import os

# The URL from which the tags are to be scraped
url = 'https://developers.google.com/edu/python/'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)

# The scraped tags contain relative paths, so they must be joined with the base URL before downloading
base_url = 'https://developers.google.com'
# Save path (raw string so the backslash is never read as an escape sequence)
save_path = r'D:\My Local Directory'
urllist = list()

# Retrieve the nav element(s) with class gc-toc; the anchors inside are extracted below
tags = soup.findAll('nav',{'class':'gc-toc'})

for tag in re.findall('a href="(.+?)" title="',str(tags)):
    urllist.append(tag)

print 'The number of links extracted is', len(urllist)
print '----------Printing Urls---------------'
for url in urllist:
    full_url = urllib.basejoin(base_url, url)

    # Skip YouTube links (find() > 0 would miss a match at index 0)
    if 'youtube' in url: continue

    #Open the webpage and read html
    print 'Opening webpage file: ', full_url
    response = urllib.urlopen(full_url)
    response_html = response.read()

    # Save the HTML file offline
    print 'saving html file as ', url.split('/')[-1] +'.htm'

    # Binary mode keeps the downloaded bytes unchanged (text mode on Windows rewrites newlines)
    output_file = open(os.path.join(save_path, url.split('/')[-1] +'.htm'),'wb')
    output_file.write(response_html)
    output_file.close()
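
On the second question (dropping the regular expression), one possible shape is sketched below. It is written for Python 3 and uses only the standard library's html.parser rather than BeautifulSoup, so that it runs without extra packages. The class name TocLinkParser, the helper extract_toc_links and the sample HTML (including its paths) are illustrative assumptions, not the real markup of the tutorial page.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TocLinkParser(HTMLParser):
    """Collect href values of <a> tags nested inside <nav class="gc-toc">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.nav_depth = 0  # greater than 0 while inside the target <nav>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "nav":
            classes = (attrs.get("class") or "").split()
            # Enter the target nav, or track a nav nested inside it.
            if self.nav_depth or "gc-toc" in classes:
                self.nav_depth += 1
        elif tag == "a" and self.nav_depth and attrs.get("href"):
            # Resolve relative paths against the base URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "nav" and self.nav_depth:
            self.nav_depth -= 1


def extract_toc_links(html, base_url):
    parser = TocLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical sample markup, used only to exercise the parser offline.
sample = """
<nav class="gc-toc"><ul>
  <li><a href="/edu/python/set-up" title="Set up">Set up</a></li>
  <li><a href="/edu/python/strings" title="Strings">Strings</a></li>
</ul></nav>
<a href="/elsewhere">not in the table of contents</a>
"""
links = extract_toc_links(sample, "https://developers.google.com")
print(links)
```

If BeautifulSoup 4 is installed, its CSS selector support (soup.select('nav.gc-toc a[href]')) should pick out the same anchors in a single call.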

【Question comments】:

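wget --no-parent --mirror -p --html-extension --convert-links -e robots=off -P . https://developers.google.com/edu/python/ I know this doesn't help with your Python code, but it works. Doing this in Python is actually not a trivial job, because you have to fetch all the URLs, including images and JavaScript, and you have to rewrite the HTML to point to the downloaded files.
Thanks. How do you do this on Windows?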
gnuwin32.sourceforge.net/packages/wget.htm
@jgritty looks like your man ;) Good luck.
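
The first comment above names the two extra steps a pure-Python approach needs: collecting every asset URL and rewriting the HTML to point at local copies. A minimal Python 3 sketch of that idea follows, using only the standard library. The names AssetParser, localize_assets and the "assets" directory are hypothetical, as is the sample page. The rewrite is deliberately naive: it replaces double-quoted attribute values as plain strings, treats every link tag as an asset, and does not handle two assets that share a file name.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Attribute that carries the asset URL for each tag of interest.
ASSET_ATTRS = {"link": "href", "script": "src", "img": "src"}


class AssetParser(HTMLParser):
    """Record the URL of every link, script and image reference, in order."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        wanted = ASSET_ATTRS.get(tag)
        value = dict(attrs).get(wanted) if wanted else None
        if value and value not in self.assets:
            self.assets.append(value)


def localize_assets(html, page_url, asset_dir="assets"):
    """Return (rewritten_html, {absolute_url: local_path})."""
    parser = AssetParser()
    parser.feed(html)
    downloads = {}
    for raw in parser.assets:
        absolute = urljoin(page_url, raw)
        name = os.path.basename(urlparse(absolute).path) or "index"
        local = asset_dir + "/" + name
        downloads[absolute] = local
        # Naive rewrite: swap the double-quoted value as it appears in the source.
        # Values containing HTML entities will not match, since the parser
        # hands back the unescaped form.
        html = html.replace('"%s"' % raw, '"%s"' % local)
    return html, downloads


# Hypothetical page fragment, used only to exercise the function offline.
page = (
    '<link rel="stylesheet" href="/css/site.css">'
    '<script src="https://example.com/js/app.js"></script>'
    '<img src="img/logo.png">'
)
new_html, downloads = localize_assets(
    page, "https://developers.google.com/edu/python/"
)
print(new_html)
print(downloads)
```

Each key of downloads would then be fetched and written to the matching local path, and new_html saved in place of the original page. The wget command in the comment does all of this, including the link conversion, in one step.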

Tags: python python-2.7 beautifulsoup

【Solution 1】:

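You probably don't really want to use Python for this. If all you want is the HTML of a page, you can use wget: wget http://my.url fetches a page's HTML, if that is all you need. Alternatively, with the excellent requests API you can do something like this.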

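import requests
url = 'https://developers.google.com/edu/python/'  # the page from the question
with open('page', 'wb') as f:  # write the raw bytes to avoid encoding errors
    f.write(requests.get(url).content)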
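
For readers who would rather not install requests, the same save step can be sketched with the Python 3 standard library alone. The helper name save_page and the ".htm" suffix (borrowed from the question's code) are assumptions for illustration. The demo reads a temporary local file through a file:// URL so that it runs without a network connection; an https:// URL would go through the same urlopen call.

```python
import os
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def save_page(url, save_dir):
    """Download url and write the raw bytes into save_dir; return the local path."""
    # Use the last path segment as the file name, as the question's code does.
    name = os.path.basename(urlparse(url).path.rstrip("/")) or "index"
    target = os.path.join(save_dir, name + ".htm")
    with urlopen(url) as response, open(target, "wb") as out:
        out.write(response.read())
    return target


# Offline demo: serve a throwaway file through a file:// URL.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "strings")
    source.write_bytes(b"<html>demo</html>")
    saved = save_page(source.as_uri(), tmp)
    saved_name = os.path.basename(saved)
    content = Path(saved).read_bytes()

print(saved_name, content)
```

Writing bytes in "wb" mode sidesteps any guess about the page's text encoding, which matters if the saved copy is later opened in a browser.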

【Comments】: