【Question title】: Scrapy: How to export JSON from a script
【Posted】: 2020-02-23 14:34:57
【Question description】:

I built a web crawler with Scrapy, but I'm having trouble with the phone number because it sits inside a script tag. The script is:

<script data-n-head="true" type="application/ld+json">{"@context":"http://schema.org","@type":"LocalBusiness","name":"Clínica Dental Reina Victoria 23","description":".TU CLÍNICA DENTAL DE REFERENCIA EN MADRID","logo":"https://estaticos.qdq.com/CMS/directory/logos/c/l/clinica-dental-reina-victoria.png","image":"https://estaticos.qdq.com/coverphotos/098/535/ed1c5ffcf38241f8b83a1808af51a615.jpg","url":"https://www.clinicadental-reinavictoria.es/","hasMap":"https://www.google.com/maps/search/?api=1&query=40.4469174,-3.7087934","telephone":"+34915340309","address":{"@type":"PostalAddress","streetAddress":"Av. Reina Victoria 23","addressLocality":"MADRID","addressRegion":"Madrid","postalCode":"28003"}}</script>

This script differs from page to page, but only the phone number changes.

I extract the script with XPath:

data = response.xpath('/html/head/script[3]').extract()
decoded = json.loads(data.telephone("utf-8"))
ml_item['datos'] = decoded['telephone']

I think I need a custom pipeline to extract the phone number.

In pipelines.py I added the JsonWriter line:

ITEM_PIPELINES = {'mercado.pipelines.MercadoPipeline': 500,
                    'mercado.pipelines.MercadoImagenesPipeline': 600,
                    'mercado.pipelines.JsonWriterPipeline': 800, }

But I still need to add some code to pipelines.py that defines JsonWriterPipeline. The console returns this error:

raise NameError("Module '%s' doesn't define any object named '%s'" % (module, name))
NameError: Module 'mercado.pipelines' doesn't define any object named 'JsonWriterPipeline'
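The error says that mercado/pipelines.py contains no class named JsonWriterPipeline, so the entry in ITEM_PIPELINES has nothing to load. A minimal sketch of such a class follows, loosely modeled on the JSON-lines writer pattern; the output file name items.jl and the optional path argument are assumptions, not part of the question:

```python
import json


class JsonWriterPipeline:
    """Write every scraped item as one JSON object per line."""

    def __init__(self, path="items.jl"):
        # "items.jl" is an assumed output file name, not taken from the question.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # dict(item) works for plain dicts and for scrapy.Item objects.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item
```

Note that a pipeline only stores items; the phone number still has to be put into the item inside the spider callback.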

I save all the numbers in a CSV file together with other information such as the name, website, and so on.

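【Comments】:

  • If you already have the JavaScript text content, why not use a regular expression to find the phone number string? Even if the script changes, I guess the phone number always comes right after the "telephone":" string.
  • I don't know how to do that :( I'm a Python beginner. How would I do it?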

Tags: python json scrapy


【Solution 1】:

If you have already scraped the contents of the script tag, it is simple:

import re

script = '{"@context":"http://schema.org","@type":"LocalBusiness","name":"Clínica Dental Reina Victoria 23","description":".TU CLÍNICA DENTAL DE REFERENCIA EN MADRID","logo":"https://estaticos.qdq.com/CMS/directory/logos/c/l/clinica-dental-reina-victoria.png","image":"https://estaticos.qdq.com/coverphotos/098/535/ed1c5ffcf38241f8b83a1808af51a615.jpg","url":"https://www.clinicadental-reinavictoria.es/","hasMap":"https://www.google.com/maps/search/?api=1&query=40.4469174,-3.7087934","telephone":"+34915340309","address":{"@type":"PostalAddress","streetAddress":"Av. Reina Victoria 23","addressLocality":"MADRID","addressRegion":"Madrid","postalCode":"28003"}}'

# Capture the value between "telephone":" and the "address" key that follows it
phone_number = re.search(r'"telephone":"(.*?)","address"', script).group(1)

print(phone_number)
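The pattern above relies on "address" being the key that follows "telephone". As a hedged variant, the sketch below matches only the "telephone" key itself, tolerates whitespace around the colon, and returns None when no number is present; the helper name extract_phone is hypothetical:

```python
import re

# Matches "telephone": "<value>" with optional whitespace around the colon.
TELEPHONE_RE = re.compile(r'"telephone"\s*:\s*"([^"]*)"')


def extract_phone(script_text):
    """Return the telephone value from the script text, or None if absent."""
    match = TELEPHONE_RE.search(script_text)
    return match.group(1) if match else None


print(extract_phone('{"name":"X","telephone":"+34915340309","address":{}}'))
print(extract_phone('{"name":"X"}'))
```

Returning None instead of raising lets the spider keep the rest of the item when a page has no phone number.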

【Comments】:

【Solution 2】:

The simplest and quickest option, and the one I prefer, is this:

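import json

# Parse the JSON-LD script that mentions "LocalBusiness", then read the phone number
data = json.loads(response.css('script:contains("LocalBusiness") ::text').re_first('(.*)'))
phone_number = data['telephone']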
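For comparison, the attempt in the question fails because extract() returns a list of strings, and each string still includes the surrounding script tags, so it cannot be passed to json.loads directly. A minimal sketch of a more defensive version, assuming the page may carry several JSON-LD blocks; the helper name phone_from_ld_json and the CSS selector in the comment are assumptions:

```python
import json


def phone_from_ld_json(script_texts):
    """Return the telephone of the first LocalBusiness JSON-LD block, or None."""
    for text in script_texts:
        try:
            data = json.loads(text)
        except (TypeError, ValueError):
            # Skip empty or malformed script bodies.
            continue
        if isinstance(data, dict) and data.get("@type") == "LocalBusiness":
            return data.get("telephone")
    return None


# Inside a spider callback this might be used as (selector is an assumption):
#   texts = response.css('script[type="application/ld+json"]::text').getall()
#   ml_item['datos'] = phone_from_ld_json(texts)
sample = ['{"@type":"WebSite"}', '{"@type":"LocalBusiness","telephone":"+34915340309"}']
print(phone_from_ld_json(sample))
```

Selecting by the type attribute rather than by position (script[3]) should keep working if the page adds or reorders script tags.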
    

【Comments】:
