【Posted】: 2021-06-29 15:52:54
【Question】:
I am writing a Python parser for this site: https://www.kinopoisk.ru/lists/series-top250/
import csv

import requests
from bs4 import BeautifulSoup

CSV = 'genres.csv'
URL = 'https://www.kinopoisk.ru/lists/series-top250/?page=1&tab=all'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0', 'accept': '*/*'}


def get_html(url, params=None):
    # Fetch the page and return the whole response object.
    r = requests.get(url, headers=HEADERS, params=params)
    return r


def get_content(html):
    # Collect the genre text of every series block on the page.
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_='selection-film-item-meta selection-film-item-meta_theme_desktop')
    genres = []
    for item in items:
        additional = item.find_all('span', {'class': 'selection-film-item-meta__meta-additional-item'})
        genres.append(
            {
                'genre': additional[1].get_text(strip=True)
            }
        )
    return genres


def save_genres(items, path):
    # Write one CSV row per scraped item.
    with open(path, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=',')
        writer.writerow(['genre'])
        for item in items:
            writer.writerow([item['genre']])


def parser():
    html = get_html(URL)
    if html.status_code == 200:
        genres = []
        for page in range(1, 6):
            html = get_html(URL, params={'page': page})
            genres.extend(get_content(html.text))
        save_genres(genres, CSV)
    else:
        print('Non_available')


parser()
This section of the site has 5 pages of ratings: https://www.kinopoisk.ru/lists/series-top250/?page=1&tab=all ... https://www.kinopoisk.ru/lists/series-top250/?page=5&tab=all
I wrote a for loop that changes the page number so that every page gets parsed:
for page in range(1, 6):
    html = get_html(URL, params={'page': page})
    genres.extend(get_content(html.text))
But only page 1 gets parsed. Could you tell me what I am doing wrong?
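One possible cause, offered as a guess rather than a confirmed diagnosis: `URL` already carries `page=1` in its query string, so passing `params={'page': page}` may add a second `page` key instead of replacing the first, and the server may keep answering with page 1. Below is a minimal sketch that builds each page URL with exactly one `page` parameter. `BASE_URL` and `build_page_url` are hypothetical names, not part of the original script, and only the standard library is used so the result can be checked without network access.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical constant: the list URL without any query string.
BASE_URL = 'https://www.kinopoisk.ru/lists/series-top250/'


def build_page_url(base_url, page, tab='all'):
    """Return base_url with exactly one `page` and one `tab` parameter."""
    scheme, netloc, path, query, fragment = urlsplit(base_url)
    # Keep existing parameters except the two that are set explicitly below.
    kept = [(k, v) for k, v in parse_qsl(query) if k not in ('page', 'tab')]
    kept += [('page', str(page)), ('tab', tab)]
    return urlunsplit((scheme, netloc, path, urlencode(kept), fragment))


# One URL per rating page, 1 through 5.
page_urls = [build_page_url(BASE_URL, page) for page in range(1, 6)]
```

With `requests`, the same idea would be to call `get_html(BASE_URL, params={'page': page, 'tab': 'all'})`, keeping the query string out of the constant. If the pages still come back identical after that, the cause lies elsewhere and this sketch does not address it.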
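Also, when I save the results to CSV, a single row can contain more than one word (genre name). I don't know how to make sure each row holds exactly one value, which I need for aggregate analysis.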
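To get one genre per row, one option is to split each scraped string before writing it. The sketch below assumes the genre text is a comma-separated string such as `'drama, crime'`; the real separator on the page is not confirmed here. `split_genres` and `write_one_genre_per_row` are hypothetical helpers, and the sample data is made up for illustration.

```python
import csv
import io


def split_genres(text, sep=','):
    """Split 'a, b' into ['a', 'b'], dropping empty pieces."""
    return [part.strip() for part in text.split(sep) if part.strip()]


def write_one_genre_per_row(items, file):
    """Write a 'genre' header, then one row per individual genre."""
    writer = csv.writer(file, delimiter=',')
    writer.writerow(['genre'])
    for item in items:
        for genre in split_genres(item['genre']):
            writer.writerow([genre])


# Made-up sample in the same shape as the dicts built in get_content().
sample = [{'genre': 'drama, crime'}, {'genre': 'comedy'}]
buffer = io.StringIO()
write_one_genre_per_row(sample, buffer)
rows = buffer.getvalue().splitlines()
```

In the original `save_genres`, the same inner loop could replace the single `writer.writerow([item['genre']])` call, passing the opened file instead of the in-memory buffer.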
Thanks!
【Comments】:
Tags: python csv parsing beautifulsoup