【Question Title】: Parsing in python for all pages of the site section
【Posted】: 2021-06-29 15:52:54
【Question Description】:

I'm writing a Python parser for this site: https://www.kinopoisk.ru/lists/series-top250/

import requests
from bs4 import BeautifulSoup
import csv

CSV = 'genres.csv'
URL = 'https://www.kinopoisk.ru/lists/series-top250/?page=1&tab=all'
HEADERS = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0', 'accept': '*/*'}


def get_html(url, params = None):
    r = requests.get(url, headers=HEADERS, params=params)
    return r


def get_content(html):
    soup = BeautifulSoup(html, 'html.parser')
    items = soup.find_all('div', class_='selection-film-item-meta selection-film-item-meta_theme_desktop')

    genres = []
    for item in items:
        additional = item.find_all('span', {'class':'selection-film-item-meta__meta-additional-item'})
        genres.append(
           {
                'genre': additional[1].get_text(strip = True)
           }
        )
    return genres


def save_genres(items, path):
    with open(path, 'w', newline='') as file:
        writer = csv.writer(file, delimiter=',')
        writer.writerow(['genre'])
        for item in items:
            writer.writerow([item['genre']])


def parser():
    html = get_html(URL)
    if html.status_code == 200:
        genres = []
        for page in range(1, 6):
            html = get_html(URL, params = {'page': page})
            genres.extend(get_content(html.text))
            save_genres(genres, CSV)
        pass
    else:
        print('Non_available')


parser()

The site section has 5 pages of the rating: https://www.kinopoisk.ru/lists/series-top250/?page=1&tab=all ... https://www.kinopoisk.ru/lists/series-top250/?page=5&tab=all

I made a for loop to parse all the pages by changing the page number:

for page in range(1, 6):
    html = get_html(URL, params = {'page': page})
    genres.extend(get_content(html.text))

But only page 1 gets parsed. Please tell me, what am I doing wrong?

Also, when I save the results to CSV, a single row can contain more than one word (genre name), and I don't know how to make sure each row holds exactly one value for aggregate analysis.

Thank you!

【Question Discussion】:

    Tags: python csv parsing beautifulsoup


    【Solution 1】:

    Remove the parameters from the URL (everything after, and including, the ?).
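    This matters because when the base URL already carries a query string, requests merges params into it instead of replacing it. A minimal sketch showing the URL the original code actually requested (illustrative only, not part of the fix):

    import requests

    # The original base URL already contains ?page=1&tab=all.
    url = "https://www.kinopoisk.ru/lists/series-top250/?page=1&tab=all"

    # requests appends `params` to the existing query string rather than
    # overwriting it, so page=1 stays first and page=2 lands after it.
    prepared = requests.Request("GET", url, params={"page": 2}).prepare()
    print(prepared.url)
    # https://www.kinopoisk.ru/lists/series-top250/?page=1&tab=all&page=2

    The server evidently resolves the duplicate page parameter to the first value, which is why every request returned page 1. The corrected script below passes page and tab only through a PARAMS dict: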

    import requests
    from bs4 import BeautifulSoup
    import csv
    
    CSV = "genres.csv"
    URL = "https://www.kinopoisk.ru/lists/series-top250/"
    HEADERS = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0",
        "accept": "*/*",
    }
    PARAMS = {"page": 1, "tab": "all"}
    
    
    def get_html(url, params=None):
        r = requests.get(url, headers=HEADERS, params=params)
        return r
    
    
    def get_content(html):
        soup = BeautifulSoup(html, "html.parser")
        items = soup.find_all(
            "div",
            class_="selection-film-item-meta selection-film-item-meta_theme_desktop",
        )
    
        genres = []
        for item in items:
            additional = item.find_all(
                "span", {"class": "selection-film-item-meta__meta-additional-item"}
            )
            genres.append({"genre": additional[1].get_text(strip=True)})
        return genres
    
    
    def save_genres(items, path):
        with open(path, "w", newline="") as file:
            writer = csv.writer(file, delimiter=",")
            writer.writerow(["genre"])
            for item in items:
                writer.writerow([item["genre"]])
    
    
    def parser():
        genres = []
        for page in range(1, 6):
            print("Parsing page {}...".format(page))
            PARAMS["page"] = page
            html = get_html(URL, PARAMS)
            if html.status_code == 200:
                genres.extend(get_content(html.text))
            else:
                print("Non_available")
        save_genres(genres, CSV)
    
    
    parser()
    

    This creates genres.csv.

    【Discussion】:

    • Thank you very much! Could you also help with splitting all the words into one column?
    • @Daderk86 You could try genres.extend(chain.from_iterable([g.split(",") for g in get_content(html.text)])) (after doing from itertools import chain first). But since I'm getting a captcha page, I can't test it.
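
    Note that get_content returns a list of dicts, not plain strings, so the snippet in the comment would fail on g.split. A minimal, untested sketch that splits the "genre" value instead, assuming genres are comma-separated as the comment implies (split_genres is a hypothetical helper, not from the original answer):

    from itertools import chain

    def split_genres(items):
        # Turn rows like {"genre": "drama, crime"} into one genre per row.
        return [
            {"genre": g.strip()}
            for g in chain.from_iterable(item["genre"].split(",") for item in items)
        ]

    It would be called as save_genres(split_genres(genres), CSV) before writing the file.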