【Question title】: Scrapy returning a null output when extracting an element from a table using xpath
【Posted】: 2015-05-13 15:08:34
【Question】:

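I have been trying to scrape this website, which has details of oil wells in Colorado: https://cogcc.state.co.us/cogis/FacilityDetail.asp?facid=12307555&type=WELL

Scrapy crawls the site and returns the URL when I scrape it, but when I try to extract an element inside a table using its XPath (the county of the oil well), all I get is an empty output, i.e. [].

This happens for any element I try to access on the page.

Here is my spider: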

import scrapy
import json
class coloradoSpider(scrapy.Spider):
    name = "colorado"
    allowed_domains = ["cogcc.state.co.us"]
    start_urls = ["https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year=All"]
    def parse(self, response):
        url = response.url
        response.selector.remove_namespaces()
        variable = (response.xpath("/html/body/blockquote/font/font/table/tbody/tr[3]/th[3]").extract())
        print url, variable

Here is the output:

2015-05-13 20:14:54+0530 [scrapy] INFO: Scrapy 0.24.6 started (bot: tutorial)
2015-05-13 20:14:54+0530 [scrapy] INFO: Optional features available: ssl, http11
2015-05-13 20:14:54+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE'
: 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutor
ial'}
2015-05-13 20:14:54+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetCons
ole, CloseSpider, WebService, CoreStats, SpiderState
2015-05-13 20:14:55+0530 [scrapy] INFO: Enabled downloader middlewares: HttpAuth
Middleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, Def
aultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, Redirec
tMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-05-13 20:14:55+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMid
dleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddlew
are
2015-05-13 20:14:56+0530 [scrapy] INFO: Enabled item pipelines:
2015-05-13 20:14:56+0530 [colorado] INFO: Spider opened
2015-05-13 20:14:56+0530 [colorado] INFO: Crawled 0 pages (at 0 pages/min), scra
ped 0 items (at 0 items/min)
2015-05-13 20:14:56+0530 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6
023
2015-05-13 20:14:56+0530 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2015-05-13 20:15:02+0530 [colorado] DEBUG: Crawled (200) <GET https://cogcc.stat
e.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=07555&APIWB=00&Year
=All> (referer: None)
https://cogcc.state.co.us/cogis/ProductionWellMonthly.asp?APICounty=123&APISeq=0
7555&APIWB=00&Year=All []
2015-05-13 20:15:02+0530 [colorado] INFO: Closing spider (finished)
2015-05-13 20:15:02+0530 [colorado] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 292,
         'downloader/request_count': 1,
         'downloader/request_method_count/GET': 1,
         'downloader/response_bytes': 366770,
         'downloader/response_count': 1,
         'downloader/response_status_count/200': 1,
         'finish_reason': 'finished',
         'finish_time': datetime.datetime(2015, 5, 13, 14, 45, 2, 349000),
         'log_count/DEBUG': 3,
         'log_count/INFO': 7,
         'response_received_count': 1,
         'scheduler/dequeued': 1,
         'scheduler/dequeued/memory': 1,
         'scheduler/enqueued': 1,
         'scheduler/enqueued/memory': 1,
         'start_time': datetime.datetime(2015, 5, 13, 14, 44, 56, 77000)}
2015-05-13 20:15:02+0530 [colorado] INFO: Spider closed (finished)

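If I go back up a few nodes in the XPath, I do get an output, in which Scrapy returns the table as HTML.

Thanks!

【Comments】:

  • What exactly are you trying to get from the website, e.g. J SAND?

Tags: python xpath web-scraping web-crawler scrapy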


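【Solution 1】:

This looks like an XPath problem. The site's developers probably omitted tbody from the markup, but browsers insert it automatically when the page is viewed, so an XPath copied from the browser does not match the HTML that Scrapy receives. You can find more information here.

So if you need the county value from the given page (WELD #123), a possible XPath would be: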

In [20]: response.xpath('/html/body/font/table/tr[6]/td[2]//text()').extract()
Out[20]: [u'WELD                               #123']
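
To illustrate why the tbody step makes the query come back empty, here is a minimal sketch. It is not Scrapy code: it uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies, and the markup in RAW_HTML is a hypothetical, simplified stand-in for the real page. The same reasoning should apply to response.xpath, since Scrapy queries the raw HTML rather than the browser's DOM.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (an assumption, not the real page):
# the table has no <tbody>, as in the raw HTML a server sends.
RAW_HTML = """
<html><body><font><table>
  <tr><td>County:</td><td>WELD #123</td></tr>
</table></font></body></html>
"""

root = ET.fromstring(RAW_HTML)  # root is the <html> element

# Path as it might be copied from a browser inspector, with tbody included.
with_tbody = root.findall("./body/font/table/tbody/tr/td[2]")

# Same path with the tbody step removed, matching the raw markup.
without_tbody = root.findall("./body/font/table/tr/td[2]")

print(with_tbody)                          # []
print([td.text for td in without_tbody])   # ['WELD #123']
```

The first query finds nothing because no tbody element exists in the source, which matches the empty list in the question.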

【Comments】:

    【Solution 2】:

    This looks like an XPath problem. You can try this:

    //blockquote/font/font/table//tr/td[3]//text()

    I don't think you need the tbody tag.
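
If you copy XPaths from a browser often, a small helper can drop the tbody steps before you pass the expression to response.xpath. This is only a sketch: strip_tbody is a hypothetical name, not part of Scrapy, and it assumes the raw HTML really has no tbody elements. If the page source does contain them, stripping would break the query, and a descendant step such as table//tr is the safer choice.

```python
import re


def strip_tbody(xpath):
    """Remove every tbody location step (with optional predicate) from an XPath string."""
    # Matches "/tbody" or "/tbody[...]" when followed by "/" or the end of the string.
    return re.sub(r"/tbody(\[[^\]]*\])?(?=/|$)", "", xpath)


# The XPath from the question, as copied from the browser.
browser_xpath = "/html/body/blockquote/font/font/table/tbody/tr[3]/th[3]"

print(strip_tbody(browser_xpath))
# /html/body/blockquote/font/font/table/tr[3]/th[3]
```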

    【Comments】:
