Extracting text from the web with Python and Scrapy: answers

【Question title】: Extract text from web with python and scrapy
【Posted】: 2020-04-06 17:05:45
【Question】:

I am trying to extract the text of each headline from a news web page using a simple Scrapy spider in Python. Part of the page's HTML is shown below:

<div _ngcontent-c17="" class="col-md-8"><h2 _ngcontent-c17="" class="cormorant">Notícias</h2>
<ul _ngcontent-c17="" class="list-unstyled lista-noticias"><!----><!---->
<li _ngcontent-c17="" class="noticia hvr-shadow py-3 d-block"><!---->
<div _ngcontent-c17="" class="container-noticia"><div _ngcontent-c17="" class="data pr-3"><span _ngcontent-c17="" class="dia cormorant">02</span><span _ngcontent-c17="" class="mes">Abril</span><span _ngcontent-c17="" class="hora cormorant">14:25</span></div><div _ngcontent-c17="" class="texto pl-3"><div _ngcontent-c17="" class="assunto"></div><!----><a _ngcontent-c17="" bcblink="" class="d-block" href="/detalhenoticia/434/noticia">
<h4 _ngcontent-c17="" class="cormorant">CMN autoriza o BC a conceder empréstimos mediante emissão de Letra Financeira Garantida e a firmar acordo de swap com o Federal Reserve</h4>

So I want to extract the text inside each h4. To do that, I used the following Python code:

from scrapy.item import Field
from scrapy.item import Item
from scrapy.spiders import Spider
from scrapy.selector import Selector
from scrapy.loader import ItemLoader


class Pregunta(Item): 
    titulo = Field()
    id = Field() 

class BcbSpider(Spider): 
    name = "bcb_noticias" 
    start_urls = ['https://www.bcb.gov.br/noticias']

    def parse(self, response):
        sel = Selector(response)
        preguntas = sel.xpath('//ul[@class="list-unstyled lista-noticias"]/li')

        for i, elem in enumerate(preguntas):
            item = ItemLoader(Pregunta(), elem)
            item.add_xpath('titulo', './/h4[@class="cormorant"]/text()')
            item.add_value('id', i)
            yield item.load_item()

When I run the code in PowerShell there are no errors, but it does not scrape anything.

Part of the log output is shown below:

2020-04-06 11:21:25 [scrapy.core.engine] INFO: Spider opened
2020-04-06 11:21:25 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-04-06 11:21:25 [scrapy.extensions.telnet] INFO: Telnet console listening on (IP number)
2020-04-06 11:21:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.bcb.gov.br/noticias> (referer: None)

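The same code works on other web pages. I don't know whether my XPath is wrong (I have tried writing it in several different ways) or whether something else is the problem.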

【Question comments】:

Tags: python xpath scrapy

【Solution 1】:

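The site renders its content dynamically, so the news list is not present in the HTML that Scrapy downloads. You need Selenium or a similar tool to scrape the rendered page. Alternatively, you can download the JSON that contains the content you are looking for directly from this URL: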

https://www.bcb.gov.br/api/servico/sitebcb/noticias?listsite=conteudo/home-ptbr&listname=Notícias

and parse it with whatever tool you prefer.
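
As a rough illustration of the JSON route, here is a minimal sketch that uses only the Python standard library. The endpoint and its two query parameters come from the URL above. Everything else is an assumption: the helper names (`build_url`, `fetch_json`, `extract_titles`) are made up for this example, the `User-Agent` header is just a precaution, and the key names `conteudo` and `titulo` are guesses about the response layout that should be checked against the JSON you actually receive.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the answer above.
API_BASE = "https://www.bcb.gov.br/api/servico/sitebcb/noticias"


def build_url(listsite="conteudo/home-ptbr", listname="Notícias"):
    """Build the API URL, percent-encoding the non-ASCII 'í' in the list name."""
    query = urlencode({"listsite": listsite, "listname": listname}, safe="/")
    return f"{API_BASE}?{query}"


def fetch_json(url, timeout=10):
    """Download the document at `url` and decode it as JSON."""
    # ASSUMPTION: a browser-like User-Agent is sent as a precaution only.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_titles(payload, list_key="conteudo", title_key="titulo"):
    """Return a list of {'id', 'titulo'} dicts, mirroring the Pregunta item.

    ASSUMPTION: `list_key` and `title_key` are hypothetical key names;
    inspect the real response and pass the keys it actually uses.
    """
    # Accept either {"<list_key>": [...]} or a bare list of entries.
    entries = payload.get(list_key, []) if isinstance(payload, dict) else payload
    return [
        {"id": i, "titulo": entry[title_key]}
        for i, entry in enumerate(entries)
        if isinstance(entry, dict) and entry.get(title_key)
    ]
```

With these helpers, `extract_titles(fetch_json(build_url()))` should return the headlines, provided the assumed key names match the real response. If they do not, print the decoded JSON once and adjust `list_key` and `title_key` accordingly.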

【Comments】: