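【Posted】:2020-04-06 17:05:45
【Problem description】:
I am trying to extract the text of each headline from a news web page with a simple Scrapy spider in Python. Part of the page's HTML is shown below: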
<div _ngcontent-c17="" class="col-md-8"><h2 _ngcontent-c17="" class="cormorant">Notícias</h2>
<ul _ngcontent-c17="" class="list-unstyled lista-noticias"><!----><!---->
<li _ngcontent-c17="" class="noticia hvr-shadow py-3 d-block"><!---->
<div _ngcontent-c17="" class="container-noticia"><div _ngcontent-c17="" class="data pr-3"><span _ngcontent-c17="" class="dia cormorant">02</span><span _ngcontent-c17="" class="mes">Abril</span><span _ngcontent-c17="" class="hora cormorant">14:25</span></div><div _ngcontent-c17="" class="texto pl-3"><div _ngcontent-c17="" class="assunto"></div><!----><a _ngcontent-c17="" bcblink="" class="d-block" href="/detalhenoticia/434/noticia">
<h4 _ngcontent-c17="" class="cormorant">CMN autoriza o BC a conceder empréstimos mediante emissão de Letra Financeira Garantida e a firmar acordo de swap com o Federal Reserve</h4>
So I want to extract the text inside each h4. To do that, I used the following Python code:
from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.loader import ItemLoader


class Pregunta(Item):
    titulo = Field()
    id = Field()


class BcbSpider(Spider):
    name = "bcb_noticias"
    start_urls = ['https://www.bcb.gov.br/noticias']

    def parse(self, response):
        sel = Selector(response)
        preguntas = sel.xpath('//ul[@class="list-unstyled lista-noticias"]/li')
        for i, elem in enumerate(preguntas):
            item = ItemLoader(Pregunta(), elem)
            item.add_xpath('titulo', './/h4[@class="cormorant"]/text()')
            item.add_value('id', i)
            yield item.load_item()
When I run the spider from PowerShell there are no errors, but it does not scrape anything.
Part of the log output is shown below:
2020-04-06 11:21:25 [scrapy.core.engine] INFO: Spider opened
2020-04-06 11:21:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-06 11:21:25 [scrapy.extensions.telnet] INFO: Telnet console listening on (IP number)
2020-04-06 11:21:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bcb.gov.br/noticias> (referer: None)
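The same code works on other web pages. I don't know whether my XPath is wrong (I have tried writing it several ways) or whether something else is the problem.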
【Discussion】: