[Posted at]: 2020-10-14 21:20:58
[Problem description]:
I have a list of Instagram accounts in a .txt file.
This is the URL I have to scrape: https://brandfollowers.io/kol/all-post?uid=$INSTAGRAM$&page_num=$PAGENUMBER$ (note that I put $INSTAGRAM$ and $PAGENUMBER$ where the variables need to change).
For example: https://brandfollowers.io/kol/all-post?uid=philipppleinofficial&page_num=1
I am new to this, but I did manage to get all the items on page 1 for every Instagram account in the list. However, I cannot loop over all the pages for each account.
Could you give me some advice? I am very new to this topic.
This is what I have so far:
# -*- coding: utf-8 -*-
import scrapy
import json


class ContenidoSpider(scrapy.Spider):
    name = 'BACKUP_contenido'
    allowed_domains = ['brandfollowers.io']
    start_urls = ['http://brandfollowers.io/']
    base_url = 'http://brandfollowers.io/kol/all-post?uid='

    def parse(self, response):
        # Read the profile URLs and reduce them to bare usernames;
        # strip() removes the trailing newline that readlines() keeps.
        with open('list.txt', 'r') as f:
            instagrams = [
                line.replace('https://www.instagram.com/', '').strip()
                for line in f
            ]
        for instagram in instagrams:
            posts_url = self.base_url + instagram
            yield scrapy.Request(posts_url, callback=self.parse_json)

    def parse_json(self, response):
        json_response = json.loads(response.text)
        posts = json_response["data"]["models"]
        # Iterate over the posts actually returned instead of assuming
        # a fixed page size of 6, which raises IndexError on short pages.
        for post in posts:
            yield {
                'BRAND': post["author"]["platform_unique_id"],
                'DATE': post["platform_create_time"],
                'COMMENTS': post["comment_count"],
                'LIKES': post["like_count"],
                'ENGAGEMENT RATE': post["share_count"],
                'DESCRIPTION': post["description"],
                'URL': post["post_url"],
                'PICTURE LINK': post["picture_link"],
            }
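One common way to walk through all pages (a sketch only, not tested against this site): carry the uid and the current page number along with each request, and after parsing a page, request the next one until the API stops returning posts. The helpers below are plain functions so they can be tested on their own; `has_more_posts` assumes the API signals the last page with an empty `models` list, which is a guess about this API, not a documented fact.

```python
from urllib.parse import urlencode

BASE = "https://brandfollowers.io/kol/all-post"


def build_post_url(uid, page_num):
    """Build the paginated API URL for one Instagram account."""
    return f"{BASE}?{urlencode({'uid': uid, 'page_num': page_num})}"


def has_more_posts(json_response):
    """Assumption: the last page is signalled by an empty 'models' list."""
    return bool(json_response.get("data", {}).get("models"))


# Inside parse_json, after yielding the items of the current page,
# the spider would continue with something like:
#
#     if has_more_posts(json_response):
#         uid = response.meta["uid"]
#         next_page = response.meta["page_num"] + 1
#         yield scrapy.Request(
#             build_post_url(uid, next_page),
#             callback=self.parse_json,
#             meta={"uid": uid, "page_num": next_page},
#         )
#
# where parse() seeds the first request with
# meta={"uid": instagram, "page_num": 1}.
```

The request/response `meta` dict is Scrapy's standard way to pass state between callbacks, so each response knows which account and page it belongs to.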
[Discussion]:
Tags: python json scrapy web-crawler