【Question Title】: How to scrape all of the data from the website?
【Posted】: 2017-05-13 02:16:18
【Question Description】:

My code only gives me data for 44 links instead of 102. Can someone tell me why it extracts this way? Your help is much appreciated. How can I extract it correctly?

import scrapy
class ProjectItem(scrapy.Item):
    title = scrapy.Field()
    owned = scrapy.Field()
    Revenue2014 = scrapy.Field()
    Revenue2015 = scrapy.Field()
    Website = scrapy.Field()
    Rank = scrapy.Field()
    Employees = scrapy.Field()
    headquarters = scrapy.Field() 
    FoundedYear = scrapy.Field()

class ProjectSpider(scrapy.Spider):

    name = "cin100"
    allowed_domains = ['cincinnati.com']
    start_urls = ['http://www.cincinnati.com/story/money/2016/11/26/see-which-companies-16-deloitte-100/94441104/']

    def parse(self, response):
        # get selector for all 100 companies
        sel_companies = response.xpath('//p[contains(.,"click or tap here.")]/following-sibling::p/a')

        # create request for every single company detail page from href
        for sel_companie in sel_companies:
            href = sel_companie.xpath('./@href').extract_first()
            url = response.urljoin(href)
            request = scrapy.Request(url, callback=self.parse_company_detail)
            yield request

    def parse_company_detail(self, response):
        # On detail page create item
        item = ProjectItem()
        # get detail information with specific XPath statements
        # e.g. title is the first paragraph
        item['title'] = response.xpath('//div[@role="main"]/p[1]//text()').extract_first().rsplit('-')[1]
        # e.g. family owned has a label we can select
        item['owned'] = response.xpath('//div[@role="main"]/p[contains(.,"Family owned")]/text()').extract_first()
        item['Revenue2014'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"2014")]/text()').extract_first().rsplit('$')[1]
        item['Revenue2015'] = '$' + response.xpath('//div[@role="main"]/p[contains(.,"$")]/text()').extract_first().rsplit('$')[1]
        item['Website'] = response.xpath('//div[@role="main"]/p/a[contains(.,"www.")]/@href').extract_first()
        item['Rank'] = response.xpath('//div[@role="main"]/p[contains(.,"rank")]/text()').extract_first()
        item['Employees'] = response.xpath('//div[@role="main"]/p[contains(.,"Employ")]/text()').extract_first()
        item['headquarters'] = response.xpath('//div[@role="main"]/p[10]//text()').extract()
        item['FoundedYear'] = response.xpath('//div[@role="main"]/p[contains(.,"founded")]/text()').extract()
        # Finally: yield the item
        yield item

【Question Discussion】:

    Tags: web-scraping beautifulsoup scrapy


    【Solution 1】:

    Look closely at scrapy's output and you will see that, starting after a few dozen requests, they get redirected like this:

    DEBUG: Redirecting (302) to <GET http://www.cincinnati.com/get-access/?return=http%3A%2F%2Fwww.cincinnati.com%2Fstory%2Fmoney%2F2016%2F11%2F27%2Ffrischs-restaurants%2F94430718%2F> from <GET http://www.cincinnati.com/story/money/2016/11/27/frischs-restaurants/94430718/>
    

    The page that is requested displays: We hope you have enjoyed your free access.

    So it looks like they only offer limited access to anonymous users. You probably need to register for their service to get full access to the data.
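    To confirm how many requests hit the paywall, you can inspect the redirect targets yourself. Below is a minimal, hypothetical sketch (the helper names are mine, not from the question): based on the `Location` URL shown above, the paywall redirect goes to `/get-access/` and stashes the original story URL in a `return` query parameter.

    ```python
    from urllib.parse import parse_qs, urlparse

    def is_paywall_redirect(location):
        """Return True if a redirect Location header points at the
        /get-access registration page instead of a story page."""
        return urlparse(location).path.startswith("/get-access")

    def original_story_url(location):
        """Recover the story URL stashed in the 'return' query
        parameter of a /get-access redirect, or None if absent."""
        params = parse_qs(urlparse(location).query)
        return params.get("return", [None])[0]
    ```

    With such helpers you could log every detail-page request that got bounced to the registration page, rather than silently losing it.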

    【Discussion】:

      【Solution 2】:

      There are a few potential problems with your xpaths:

      1. Having your xpath look for text on the page is generally a bad idea. Text can change from one minute to the next; the layout and html structure have a much longer lifetime.

      2. Using 'following-siblings' is also a last-resort xpath feature that is very susceptible to slight changes on the website.

      What I would do instead:

      # iterate all paragraphs within the article:
      for para in response.xpath("//*[@itemprop='articleBody']/p"):
          url = para.xpath("./a/@href").extract()
          # ... etc
      

      By the way, len(response.xpath("//*[@itemprop='articleBody']/p")) gives me the expected 102.

      You may have to filter the urls to remove non-company urls, such as the one anchored with "click or tap here".
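      That filtering step could be sketched as a small helper (a hypothetical name; it assumes the navigation anchor's text contains "click or tap", as in the article):

      ```python
      def filter_company_links(anchors):
          """Given (anchor_text, href) pairs extracted from the article
          paragraphs, keep only hrefs that look like company links,
          dropping navigation anchors such as "click or tap here"
          and paragraphs without a link."""
          return [
              href
              for text, href in anchors
              if href and "click or tap" not in (text or "").lower()
          ]
      ```

      In the spider you would build the (text, href) pairs from each paragraph's `./a/text()` and `./a/@href`, then yield a request per surviving href.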

      【Discussion】:
