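【Question title】: Scrape multiple links from a JSON file
【Posted】: 2022-01-08 11:51:15
【Question description】:

I am trying to scrape multiple links that I scraped earlier and saved in a JSON file.

So far this works, but I don't want to scrape just the one hard-coded URL; I want to scrape the URLs stored in my JSON file.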

import scrapy
import json

class RatingSpider(scrapy.Spider):
    name = "rating"

    def start_requests(self):
        urls = [
            'https://www.darkpattern.games/game/3478/0/ragnarok-m-eternal-love-rom-.html'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)
            
    def parse(self, response):
        for rating in response.css('div.score_box'):
            yield {
                'reported': rating.css('div.score_heading *::text').extract()
                
            }

The JSON file looks like this:

[
  {
    "title": [
      "\n\t\t\t\t\t\t",
      "Ragnarok M: Eternal Love(ROM)",
      "\n\t\t\t\t\t\t",
      "\t\t\t\t\t\t",
      "The classic adventure returns",
      "\n\t\t\t\t\t"
    ],
    "link": [
      "/game/3478/0/ragnarok-m-eternal-love-rom-.html"
    ],
    "rating": [
      "\n\t\t\t\t\t\t",
      "\n\t\t\t\t\t\t",
      "-4.68",
      "\n\t\t\t\t\t"
    ]
  }
]

Any suggestions on how to do this?

【Comments】:

    Tags: python json scrapy web-crawler


    【Solution 1】:

    I don't see where your example reads from the JSON file. You need to do something like this:

    with open("your json file", "r") as f:
        jsonlist = json.load(f)

    for item in jsonlist:
        # each "link" value is a list, so take its first element
        url = item["link"][0]
        # do something with url: run a request, store it in a list, etc.

    Also, your sample JSON contains a relative URL, so I assume the rest of the file is the same and the base URL is https://www.darkpattern.games. You would need to join the base URL and each relative URL before running the requests.
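
As a rough sketch of the reading step, assuming the file has the same shape as the sample in the question (a list of objects whose "link" value is a list), the links could be collected like this. The function name `read_links` is hypothetical, and skipping entries without a link is my own assumption about how to handle incomplete records:

```python
import json


def read_links(json_path):
    """Return the first "link" value of every entry in the JSON file."""
    with open(json_path, "r", encoding="utf-8") as f:
        entries = json.load(f)

    links = []
    for entry in entries:
        # Each "link" field is a list; skip entries where it is missing or empty.
        link_list = entry.get("link") or []
        if link_list:
            links.append(link_list[0])
    return links
```

The returned values are still relative paths such as /game/3478/0/ragnarok-m-eternal-love-rom-.html, so they have to be turned into absolute URLs before they are passed to scrapy.Request.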
    

    【Discussion】:

    • Thanks, I'll work on it.
    • How do I combine the absolute base link with the relative links from the JSON file?
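
One possible way to join them, offered as a sketch rather than the answerer's own method, is `urljoin` from Python's standard library. The base URL is the one named in the answer; the helper name `to_absolute` is hypothetical:

```python
from urllib.parse import urljoin

BASE_URL = "https://www.darkpattern.games"  # base URL taken from the answer above


def to_absolute(relative_link, base_url=BASE_URL):
    """Resolve a relative link from the JSON file against the site's base URL."""
    return urljoin(base_url, relative_link)


print(to_absolute("/game/3478/0/ragnarok-m-eternal-love-rom-.html"))
# prints https://www.darkpattern.games/game/3478/0/ragnarok-m-eternal-love-rom-.html
```

Inside `start_requests`, each relative link read from the JSON file could then be passed through this helper, for example `yield scrapy.Request(url=to_absolute(link), callback=self.parse)`.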