【Question Title】: Scraping within a URL using Scrapy
【Posted】: 2013-05-26 23:36:37
【Question Description】:

I'm trying to scrape Craigslist with Scrapy and have successfully fetched the URLs, but now I want to extract data from the pages behind those URLs. Here is the code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from craigslist.items import CraigslistItem

class craigslist_spider(BaseSpider):
    name = "craigslist_unique"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
        "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
        "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
    ]


def parse(self, response):
   hxs = HtmlXPathSelector(response)
   sites = hxs.select("//span[@class='pl']")
   items = []
   for site in sites:
       item = CraigslistItem()
       item['title'] = site.select('a/text()').extract()
       item['link'] = site.select('a/@href').extract()
   #item['desc'] = site.select('text()').extract()
       items.append(item)
   hxs = HtmlXPathSelector(response)
   #print title, link        
   return items

I'm new to Scrapy and can't figure out how to actually follow a URL (href), fetch the data from that URL's page, and do this for all the URLs.

【Comments】:

  • Since you are crawling, use CrawlSpider. Read the docs for some examples.

Tags: python web-scraping scrapy


【Solution 1】:

The responses for start_urls are received one by one in the parse method.

If you only want information from the start_urls responses, your code is almost OK. However, your parse method must be inside your craigslist_spider class, not outside of it.

def parse(self, response):
   hxs = HtmlXPathSelector(response)
   sites = hxs.select("//span[@class='pl']")
   items = []
   for site in sites:
       item = CraigslistItem()
       item['title'] = site.select('a/text()').extract()
       item['link'] = site.select('a/@href').extract()
       items.append(item)
   #print title, link
   return items

What if you want half of the information from the start_urls pages and the other half from the pages behind the anchors that appear in those responses?

# These imports are needed for the snippet below:
from urlparse import urljoin   # on Python 3: from urllib.parse import urljoin
from scrapy.http import Request

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select("//span[@class='pl']")
    for site in sites:
        item = CraigslistItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        if item['link']:
            # extract() returns a list of strings; take the first href
            link = item['link'][0]
            if 'http://' not in link:
                link = urljoin(response.url, link)
            item['link'] = link
            yield Request(item['link'],
                          meta={'item': item},
                          callback=self.anchor_page)


def anchor_page(self, response):
    hxs = HtmlXPathSelector(response)
    # Receive the item built in the parse method, passed via Request meta
    old_item = response.request.meta['item']
    # Parse some more values and place them in old_item, e.g.:
    old_item['bla_bla'] = hxs.select("bla bla").extract()
    yield old_item

You just need to yield a Request in the parse method and send your old item along in the Request's meta.

Then, in anchor_page, extract the old_item, add the new values to it, and simply yield it.
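As a side note, the urljoin call in the parse method resolves the relative hrefs Craigslist returns (e.g. /sby/sof/3824966457.html) against the page URL before a Request is built from them. A minimal stdlib illustration (Python 3 import shown; the 2013-era code used `from urlparse import urljoin`):

```python
from urllib.parse import urljoin

# A search-results page URL and a relative posting href, as in the question
page_url = "http://sfbay.craigslist.org/search/sof?query=&srchType=A&addFour=part-time"
href = "/sby/sof/3824966457.html"

# urljoin resolves the relative path against the page URL,
# producing an absolute URL suitable for a new Request
absolute = urljoin(page_url, href)
print(absolute)  # http://sfbay.craigslist.org/sby/sof/3824966457.html
```

Without this step, Scrapy would be handed a scheme-less URL like "/sby/sof/...", which it cannot fetch.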

【Discussion】:

    【Solution 2】:

    Your XPaths have a problem: they should be relative. Here is the code:

    from scrapy.item import Item, Field
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    
    
    class CraigslistItem(Item):
        title = Field()
        link = Field()
    
    
    class CraigslistSpider(BaseSpider):
        name = "craigslist_unique"
        allowed_domains = ["craigslist.org"]
        start_urls = [
            "http://sfbay.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time",
            "http://newyork.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addThree=internship",
            "http://seattle.craigslist.org/search/sof?zoomToPosting=&query=&srchType=A&addFour=part-time"
        ]
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select("//span[@class='pl']")
            items = []
            for site in sites:
                item = CraigslistItem()
                item['title'] = site.select('.//a/text()').extract()[0]
                item['link'] = site.select('.//a/@href').extract()[0]
                items.append(item)
            return items
    

    If you run it via:

    scrapy runspider spider.py -o output.json
    

    you will see in output.json:

    {"link": "/sby/sof/3824966457.html", "title": "HR Admin/Tech Recruiter"}
    {"link": "/eby/sof/3824932209.html", "title": "Entry Level Web Developer"}
    {"link": "/sfc/sof/3824500262.html", "title": "Sr. Ruby on Rails Contractor @ Funded Startup"}
    ...
    

    Hope that helps.

    【Discussion】:
