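[Posted]: 2022-01-22 09:40:33
[Question]:
I have a list of brand numbers that are used in a web page URL. I turned the URL into an f-string and insert the brand number where it belongs. Each page returns a unique ID (a cursor) that is needed to load the next page. I am trying to fetch the next pages while keeping each ID matched to the brand number it belongs to.
Here is some sample code: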
from io import StringIO

import requests
import pandas as pd
from bs4 import BeautifulSoup

brands = [989,1344,474,1237,886,1,328,2188]
testid = {}

# First request per brand: store the initial cursor if more pages exist
for b in brands:
    url = f'https://webapi.depop.com/api/v2/search/products/?brands={b}&itemsPerPage=24&country=gb&currency=GBP&sort=relevance'
    payload={}
    headers = {}
    response = requests.request("GET", url, headers=headers, data=payload)
    test= pd.read_json(StringIO(response.text), lines=True)
    for m in test['meta'].items():
        if m[1]['hasMore'] == True:
            testid[str(b)]= [m[1]['cursor']]
        else:
            continue

# Follow the cursors for each brand, using the last cursor collected so far
for br in testid.keys():
    while True:
        html = f'https://webapi.depop.com/api/v2/search/products/?brands={br}&cursor={testid[str(br)][-1]}&itemsPerPage=24&country=gb&currency=GBP&sort=relevance'
        r = requests.request("GET",html, headers=headers, data=payload)
        read_id = pd.read_json(StringIO(r.text), lines=True)
        for m in read_id['meta'].items():
            try:
                testid[str(br)].append(m[1]['cursor'])
            except:
                continue
This is the output it produces:
{'989': ['MnwyNHwxNjQwMDMwODcw']}
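However, it replaces the value originally stored under the brand number and leaves only the last value collected. It should keep a list instead and produce something like this: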
{'989': ['MnwyNHwxNjQwMDI4Mzk1', ...],
'1344': ['MnwyNHwxNjQwMDI4Mzk2', ...],
'474': ['MnwyNHwxNjQwMDI4Mzk3', ...],
'1237': ['MnwyNHwxNjQwMDI4Mzk3', ...],
'886': ['MnwyNHwxNjQwMDI4Mzk4', ...],
'1': ['MnwyNHwxNjQwMDI4Mzk4', ...],
'328': ['MnwyNHwxNjQwMDI4Mzk5', ...],
where the three dots ... stand for the additional ID values collected from the pages for that brand number. How can I get output like this?
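For illustration, here is a minimal sketch of one way the cursor-following loop described above could be structured so that it ends and keeps one list per brand. It is not the asker's code: `collect_cursors`, `fetch_meta`, `max_pages`, and the fake data are hypothetical names introduced here. The only thing taken from the question is the assumption that each response has a `meta` object with `hasMore` and `cursor` fields. The HTTP call is injected as a function so the loop can be run without touching the network.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical contract: fetch_meta(brand, cursor) returns the "meta" dict of one page.
FetchMeta = Callable[[int, Optional[str]], dict]


def collect_cursors(brands: Iterable[int], fetch_meta: FetchMeta,
                    max_pages: int = 100) -> Dict[str, List[str]]:
    """Collect every cursor seen for each brand, keyed by the brand number as a string."""
    cursors: Dict[str, List[str]] = {}
    for brand in brands:
        seen: List[str] = []
        cursor: Optional[str] = None        # None means "request the first page"
        for _ in range(max_pages):          # hard cap so the loop always ends
            meta = fetch_meta(brand, cursor)
            if not meta.get("hasMore"):
                break                       # last page: nothing further to follow
            cursor = meta.get("cursor")
            if cursor is None or cursor in seen:
                break                       # missing or repeated cursor: stop
            seen.append(cursor)
        if seen:
            cursors[str(brand)] = seen
    return cursors


# Made-up data standing in for the API: brand -> cursors its pages hand out, in order.
FAKE_PAGES = {989: ["c1", "c2"], 1344: ["d1"], 474: []}


def fake_fetch_meta(brand: int, cursor: Optional[str]) -> dict:
    chain = FAKE_PAGES[brand]
    index = 0 if cursor is None else chain.index(cursor) + 1
    if index < len(chain):
        return {"hasMore": True, "cursor": chain[index]}
    return {"hasMore": False}


result = collect_cursors([989, 1344, 474], fake_fetch_meta)
print(result)  # {'989': ['c1', 'c2'], '1344': ['d1']}
```

With a real endpoint, `fetch_meta` would presumably wrap the request from the question and return the `meta` part of the decoded JSON; that part is left out here because the exact response shape is not shown in the post.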
[Comments]:
- You might want to change testid = {} to testid = collections.defaultdict(list), and then you can do testid[str(b)].append([m[1]['cursor']])
- @JonSG That made no difference to my output.
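To make the suggestion in the comment concrete, the sketch below contrasts plain assignment with `collections.defaultdict(list)` plus `append`. The cursor values are made up. Note that appending `[cursor]`, as written in the comment, would store one-element lists inside the list, which is probably not the intended shape.

```python
from collections import defaultdict

# Plain assignment replaces whatever was stored under the key.
overwritten = {}
for cursor in ["a", "b", "c"]:          # made-up cursor values
    overwritten["989"] = [cursor]

# defaultdict(list) creates the empty list on first access, so append accumulates.
accumulated = defaultdict(list)
nested = defaultdict(list)
for cursor in ["a", "b", "c"]:
    accumulated["989"].append(cursor)   # flat list of strings
    nested["989"].append([cursor])      # appending a list nests it

print(overwritten)        # {'989': ['c']}
print(dict(accumulated))  # {'989': ['a', 'b', 'c']}
print(dict(nested))       # {'989': [['a'], ['b'], ['c']]}
```

This only changes how values are stored. It would not by itself stop a loop that never exits, which may be why the asker saw no difference.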
Tags: python loops web-scraping beautifulsoup while-loop