【Question Title】: Unable to scrape data using Scrapy shell - Python
【Posted】: 2023-03-04 17:04:01
【Question】:

I am running the Scrapy shell against the URL http://www.yelp.com/search?find_desc=&find_loc=60089

I need to extract the business names and their URLs from that page. For example, I need to scrape the following entries:

  1. Firewood Kabob Mediterranean Grill
  2. Lou Malnati's Pizzeria
  3. Hakuya Sushi
  4. Nails & Spa Studio, etc.

I used the command

hxs.select('//span[@class="indexed-biz-name"]/a/text()').extract()
to extract the data.

I have tried many approaches, but I only get other data that is unrelated to that page.

Please send me working code as soon as possible.

【Comments】:

    Tags: python shell web-scraping scrapy


    【Solution 1】:

    Your expression works:

    paul@wheezy:~$ scrapy shell "http://www.yelp.com/search?find_desc=&find_loc=60089"
    2014-01-29 22:48:22+0100 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
    2014-01-29 22:48:22+0100 [scrapy] INFO: Optional features available: ssl, http11, boto, django
    2014-01-29 22:48:22+0100 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
    2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-01-29 22:48:22+0100 [scrapy] INFO: Enabled item pipelines: 
    2014-01-29 22:48:22+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-01-29 22:48:22+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-01-29 22:48:22+0100 [default] INFO: Spider opened
    2014-01-29 22:48:24+0100 [default] DEBUG: Crawled (200) <GET http://www.yelp.com/search?find_desc=&find_loc=60089> (referer: None)
    [s] Available Scrapy objects:
    [s]   item       {}
    [s]   request    <GET http://www.yelp.com/search?find_desc=&find_loc=60089>
    [s]   response   <200 http://www.yelp.com/search?find_desc=&find_loc=60089>
    [s]   sel        <Selector xpath=None data=u'<html xmlns:fb="http://www.facebook.com/'>
    [s]   settings   <CrawlerSettings module=None>
    [s]   spider     <Spider 'default' at 0x3ba6b50>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    
    In [1]: sel.xpath('//span[@class="indexed-biz-name"]/a/text()').extract()
    Out[1]: 
    [u'Firewood Kabob Mediterranean Grill',
     u"Lou Malnati's Pizzeria",
     u'Hakuya Sushi',
     u'Nails & Spa Studio',
     u'Wooil Korean Restaurant',
     u"Grande Jake's Fresh Mexican Grill",
     u'Hanabi Japanese Restaurant',
     u'India House',
     u'Deerfields Bakery',
     u'Wiener Take All']
    
    In [2]: 
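
The question also asks for the URLs, which the session above does not show. In the Scrapy shell the analogous expression would likely be `sel.xpath('//span[@class="indexed-biz-name"]/a/@href').extract()`. The sketch below illustrates the same idea offline. It is not Scrapy: it uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath, and it runs on a small hypothetical markup sample (the `href` values and the `extract_businesses` helper are invented for illustration), not on the real Yelp page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup. The real page is much larger, is not
# guaranteed to be well-formed XML, and its structure may have changed.
SAMPLE = """
<div>
  <span class="indexed-biz-name">1. <a href="/biz/firewood-kabob">Firewood Kabob Mediterranean Grill</a></span>
  <span class="indexed-biz-name">2. <a href="/biz/lou-malnatis">Lou Malnati's Pizzeria</a></span>
  <span class="other"><a href="/ads">Sponsored link</a></span>
</div>
"""


def extract_businesses(markup):
    """Return (name, url) pairs for links inside indexed-biz-name spans."""
    root = ET.fromstring(markup.strip())
    # ElementTree cannot select text() or @href in the path itself, so the
    # path selects the <a> elements and the values are read from each one.
    links = root.findall(".//span[@class='indexed-biz-name']/a")
    return [(a.text, a.get("href")) for a in links]


businesses = extract_businesses(SAMPLE)
for name, url in businesses:
    print(name, url)
```

Selecting the `<a>` element once and reading both its text and its `href` keeps each name paired with its own link, which two separate `text()` and `@href` queries would not guarantee if some entry lacked one of them.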
    

    【Comments】:

    • Thanks, but why does it not work on my machine? Is my IP blocked?