【Posted】: 2013-03-24 12:52:10
【Question】:
I'm new to Python and I'm trying to scrape data from the Yellow Pages. I can scrape it, but the result I get is a mess.
This is the result I get:
2013-03-24 20:26:47+0800 [scrapy] INFO: Scrapy 0.14.4 started (bot: eyp)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, MemoryUsage, SpiderState
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware,DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Enabled item pipelines:
2013-03-24 20:26:47+0800 [eyp] INFO: Spider opened
2013-03-24 20:26:47+0800 [eyp] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-03-24 20:26:47+0800 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
How can I get a clean result? I only want the name, address, phone number, and link.
By the way, this is the code I'm using:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from eyp.items import EypItem


class EypSpider(BaseSpider):
    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # One <li> per listing in the results list
        titles = hxs.select('//ol[@class="result"]/li')
        items = []
        for title in titles:
            item = EypItem()
            # Relative XPaths: text of every <p> and href of every <a> under this <li>
            item['title'] = title.select(".//p/text()").extract()
            item['link'] = title.select(".//a/@href").extract()
            items.append(item)
        return items
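【Comments】:
-
In item['title'], you seem to be selecting every <p> element inside the <li>. Shouldn't you select what you want more precisely? If you want to scrape name, phone number, address, link, should your item really have only title and link? Shouldn't you also select the link you want more precisely, rather than every link, as you did with <p>? Don't you think you should study the basic manual before asking for help?
-
Those are 3 problems I can see here.
-
Read up on Item Loaders.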
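To make the points in the comments concrete, here is a rough sketch of selecting each field with its own relative path and cleaning the text before storing it. It is not the asker's spider and does not use Scrapy: it uses the standard library's xml.etree.ElementTree so that it runs on its own. Only the <ol class="result"><li> structure comes from the question. The sample markup, the class names "name", "address", and "phone", and the helpers first_text and parse_listings are all made up for illustration, so the real page will need different paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for one results page. Only the
# <ol class="result"><li> structure comes from the question; the
# class names "name", "address", and "phone" are assumptions.
SAMPLE = """
<html><body>
<ol class="result">
  <li>
    <a class="name" href="/biz/acme"> Acme Trading </a>
    <p class="address"> 12 Example St,
       Makati </p>
    <p class="phone"> (02) 555-0100 </p>
  </li>
  <li>
    <a class="name" href="/biz/bolt">Bolt Hardware</a>
    <p class="address">34 Sample Ave, Pasig</p>
  </li>
</ol>
</body></html>
"""


def first_text(node, path):
    """Return the whitespace-normalized text of the first match, or None."""
    found = node.find(path)
    if found is None or found.text is None:
        return None
    return " ".join(found.text.split())


def parse_listings(markup):
    """Build one dict per <li>, selecting each field with its own path."""
    root = ET.fromstring(markup)
    listings = []
    for li in root.findall(".//ol[@class='result']/li"):
        link = li.find("./a[@class='name']")
        listings.append({
            "name": first_text(li, "./a[@class='name']"),
            "address": first_text(li, "./p[@class='address']"),
            "phone": first_text(li, "./p[@class='phone']"),
            "link": link.get("href") if link is not None else None,
        })
    return listings


if __name__ == "__main__":
    for row in parse_listings(SAMPLE):
        print(row)
```

In the spider itself, the same idea would presumably mean one title.select(...) call per field, with matching name, address, phone, and link fields declared on EypItem. The strip-and-join step in first_text is the kind of cleanup that Item Loader processors are meant to handle.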
Tags: python-2.7 web-scraping scrapy scrape scraper