Python-Wikipedia Automated Downloader: Answers

【Question title】: Python-Wikipedia Automated Downloader
【Posted】: 2011-03-11 21:20:46
【Question description】:

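[Using Python 3.1] Does anyone know how to make a Python 3 application that lets the user write a text file containing multiple words separated by commas? The program should read the file and download the Wikipedia page for each requested item. For example, if they typed hello,python-3,chicken, it would go to Wikipedia and download http://www.wikipedia.com/wiki/hello, http://www.wikip... Does anyone think they can do this?

When I say "download", I mean download the text; the images do not matter.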

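【Question discussion】:

This sounds like homework to me. If you expect some help, please put in some effort and show us some code.
I do know how to make it, yes. Show me yours and I'll show you mine.

Tags: python download python-3.x wikipedia pywikibot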

【Solution 1】:

Look up urllib.request.
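
A minimal sketch of where that lookup leads, assuming a current Python 3 and that a plain HTTP GET on the article URL is enough. The function name download_page and the User-Agent string are illustrative choices, not part of the question or of any Wikipedia requirement.

```python
import urllib.request

# Assumed, illustrative User-Agent; some servers reject requests without one.
USER_AGENT = "wiki-downloader-example/0.1"


def download_page(url, encoding="utf-8"):
    """Fetch `url` and return the response body decoded as text."""
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    # urlopen returns a file-like response object that can be used as a context manager.
    with urllib.request.urlopen(request) as response:
        return response.read().decode(encoding)
```

Calling download_page("https://en.wikipedia.org/wiki/Hello") would return the HTML of that article as a string, which can then be written to a file.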

【Discussion】:

【Solution 2】:

You have described exactly how to make such a program. So what is the question?

You read the file, split it on commas, and download the URLs. Done!
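
The first two of those steps could look roughly like this. It is only a sketch: the helper names parse_keywords, read_keywords and build_urls are made up for illustration, and BASE_URL is an assumed article prefix rather than the address quoted in the question.

```python
import urllib.parse

# Assumed base URL for English Wikipedia articles.
BASE_URL = "https://en.wikipedia.org/wiki/"


def parse_keywords(text):
    """Split comma-separated text into keywords, trimming whitespace and dropping blanks."""
    return [word.strip() for word in text.split(",") if word.strip()]


def read_keywords(path):
    """Read a text file and return the comma-separated keywords it contains."""
    with open(path, encoding="utf-8") as handle:
        return parse_keywords(handle.read())


def build_urls(keywords):
    """Turn each keyword into an article URL, percent-encoding unsafe characters."""
    return [BASE_URL + urllib.parse.quote(word.replace(" ", "_")) for word in keywords]


urls = build_urls(parse_keywords("hello,python-3,chicken"))
print(urls)
```

Each resulting URL can then be passed to whatever download routine is used, for example one built on urllib.request.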

【Discussion】:

I know how to do the rest, such as reading the text file... but I don't know how to download the pages.

【Solution 3】:

Check the following code. It downloads the HTML without the images, but you can get their URLs from the XML that is being parsed.

from time import sleep
import os
import urllib.parse
import urllib.request
from xml.dom import minidom, Node

# Ported to Python 3, which the question asks for (urllib2 and the print statement are Python 2 only).
API_URL = "https://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search="
USER_AGENT = "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 6.0)"


def child_elements(parent, name):
    # Return the direct child elements of `parent` that have the given tag name.
    return [node for node in parent.childNodes
            if node.nodeType == Node.ELEMENT_NODE and node.nodeName == name]


def first_text(parent, name):
    # Return the text of the first `name` child that holds a single text node, or None.
    for node in child_elements(parent, name):
        if len(node.childNodes) == 1:
            return node.childNodes[0].data
    return None


def fetch(url):
    # Send a browser-like User-Agent header and return the raw response bytes.
    request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(request) as response:
        return response.read()


def main():
    # example.txt holds one keyword per line; blank lines are skipped.
    with open("example.txt", "r", encoding="utf-8") as key_file:
        keywords = [line.strip() for line in key_file if line.strip()]

    print("Total keywords: %d" % len(keywords))
    # Create the output directory up front so that open() below cannot fail on it.
    os.makedirs("Html", exist_ok=True)

    for keyword in keywords:
        # Percent-encode the keyword so spaces and non-ASCII text survive in the query.
        url = API_URL + urllib.parse.quote(keyword)
        xmldoc = minidom.parseString(fetch(url))
        sections = child_elements(xmldoc.documentElement, "Section")
        items = child_elements(sections[0], "Item") if sections else []

        if not items:
            print("No results found for " + keyword)
        else:
            print("\nResults found for " + keyword + ":\n")
            for item in items:
                title = first_text(item, "Text")
                if title is not None:
                    print(title)

            # Save the HTML of the first match only.
            title = first_text(items[0], "Text")
            page_url = first_text(items[0], "Url")
            if title is not None and page_url is not None:
                # Replace "/" so that the title is a valid file name.
                file_name = os.path.join("Html", title.replace("/", "_") + ".html")
                with open(file_name, "wb") as out_file:
                    out_file.write(fetch(page_url))

        # Pause between keywords so that the requests are spaced out.
        print("Sleeping")
        sleep(2)


if __name__ == "__main__":
    main()

【Discussion】:

You shouldn't answer homework questions with code, especially when the asker shows little code and a lot of "ideas".