【Title】: Call one spider from another spider in a web crawler made using Scrapy
【Posted】: 2013-04-22 11:02:46
【Description】:

I want to follow all the links on a web page that lead to PDF files, and store those PDF files on my system.

from scrapy.spider import BaseSpider
from bs4 import BeautifulSoup


class spider_a(BaseSpider):
    name = "Colleges"
    # allowed_domains entries should be bare domains, without the scheme
    allowed_domains = ["abc.org"]
    start_urls = [
        "http://www.abc.org/appwebsite.html",
        "http://www.abc.org/misappengineering.htm",
    ]

    def parse(self, response):
        soup = BeautifulSoup(response.body)
        for link in soup.find_all('a'):
            download_link = link.get('href')
            # link.get('href') returns None for anchors without an href
            if download_link and '.pdf' in download_link:
                pdf_url = "http://www.abc.org/" + download_link
                print pdf_url

With the code above, I can find the links on the pages where the PDF files live.
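As an aside, concatenating the href onto a hard-coded prefix breaks for absolute links and for pages in subdirectories; the standard library's `urljoin` resolves an href against the URL of the page it came from. A minimal sketch (the URLs are made-up placeholders):

```python
# Resolve scraped hrefs against the page they came from, instead of
# concatenating them onto a hard-coded prefix.
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2, as in the code above

page_url = "http://www.abc.org/appwebsite.html"

# Relative href: resolved against the page's directory
print(urljoin(page_url, "downloads/an.pdf"))
# -> http://www.abc.org/downloads/an.pdf

# Absolute href: left untouched instead of being double-prefixed
print(urljoin(page_url, "http://www.abc.org/other.pdf"))
# -> http://www.abc.org/other.pdf
```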

from scrapy.spider import BaseSpider

class FileSpider(BaseSpider):
    name = "fspider"
    allowed_domains = ["www.aicte-india.org"]
    start_urls = [
        "http://www.abc.org/downloads/approved_institut_websites/an.pdf#toolbar=0&zoom=85"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-1]
        # use a context manager so the file handle is closed promptly
        with open(filename, 'wb') as f:
            f.write(response.body)

With this code, I can save the body of the pages listed in start_urls.
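One wrinkle with `response.url.split("/")[-1]`: the start URL above carries a `#toolbar=0&zoom=85` fragment, so the derived filename would be `an.pdf#toolbar=0&zoom=85` rather than `an.pdf`. Parsing the URL first and taking the basename of its path avoids that; a stdlib-only sketch (the helper name is my own):

```python
import posixpath
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def filename_from_url(url):
    # urlparse splits off the query string and fragment,
    # so only the real path contributes to the filename
    path = urlparse(url).path
    return posixpath.basename(path)

url = "http://www.abc.org/downloads/approved_institut_websites/an.pdf#toolbar=0&zoom=85"
print(filename_from_url(url))  # -> an.pdf
```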

Is there a way to combine these two spiders, so that I can save the PDFs just by running my crawler?

【Comments】:

    Tags: python beautifulsoup scrapy web-crawler


    【Solution 1】:

    Why do you need two spiders?

    from urlparse import urljoin
    from scrapy.spider import BaseSpider
    from scrapy.http import Request
    from scrapy.selector import HtmlXPathSelector
    
    class spider_a(BaseSpider):
        ...
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            # .extract() turns the matched selectors into plain strings
            for href in hxs.select('//a/@href[contains(.,".pdf")]').extract():
                yield Request(urljoin(response.url, href),
                        callback=self.save_file)
    
        def save_file(self, response):
            filename = response.url.split("/")[-1]
            with open(filename, 'wb') as f:
                f.write(response.body)
    

    【Discussion】:

    • Hi @steven, thanks for your help, but I'm getting the following error: exceptions.AttributeError: 'HtmlXPathSelector' object has no attribute 'find'
    • That's because you need to use select, not find... if you're using Scrapy, you don't need Beautiful Soup.
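    For what it's worth, the ".pdf" link filtering itself needs nothing heavier than the standard library either; a hypothetical sketch with the built-in HTML parser (the HTML snippet is invented for illustration):

    ```python
    try:
        from html.parser import HTMLParser  # Python 3
    except ImportError:
        from HTMLParser import HTMLParser   # Python 2

    class PdfLinkParser(HTMLParser):
        """Collect href attributes of <a> tags that point at .pdf files."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.pdf_links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                href = dict(attrs).get('href')
                # skip anchors without an href, keep only .pdf targets
                if href and '.pdf' in href:
                    self.pdf_links.append(href)

    parser = PdfLinkParser()
    parser.feed('<a href="downloads/an.pdf">An</a>'
                '<a href="index.html">Home</a>'
                '<a>no href</a>')
    print(parser.pdf_links)  # -> ['downloads/an.pdf']
    ```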