【Question Title】: Web Scraping Dynamic Pages - Adjusting the Code
【Posted】: 2020-03-09 05:12:25
【Problem Description】:

αԋɱҽԃ αмєяιcαη helped me build this code for scraping reviews from a page where the reviews are loaded dynamically. I then tried to adjust it so that it would scrape not just the review bodies but also the reviewers' names, dates, and ratings, and save the extracted data to an Excel file. I did not manage to get that working. Could someone help me adjust the code correctly?

Here is the code from αԋɱҽԃ αмєяιcαη:

import requests
from bs4 import BeautifulSoup
import math


def PageNum():
    # Read the total review count from the "Show more reviews (N)" link on
    # the static page, then derive the number of AJAX pages to request
    # (the endpoint serves 3 reviews per page).
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    if num % 3 == 0:
        return (num // 3) + 1  # integer division, so range() gets an int
    else:
        return math.ceil(num / 3) + 2


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'  # mark the request as AJAX
    }
    with requests.Session() as req:
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            # The response is a JSON-escaped HTML fragment, so the class is
            # matched with its escaped quotes intact and the text is trimmed
            # of leftover escape sequences by slicing.
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                print(com.text[5:com.text.find(r"\n", 3)])


Main()
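
A note on the odd-looking selector: the get_user_reviews endpoint appears to return a JSON-escaped HTML fragment rather than plain HTML, which is why the class is matched as r'\"comment-body\"' with the backslash-escaped quotes intact. Assuming the response body really is one JSON-encoded string of HTML, it could also be unescaped first so that ordinary class names match again. A minimal sketch with a hard-coded fragment:

import json
from bs4 import BeautifulSoup

# a JSON-encoded HTML fragment, shaped like what the endpoint appears to return
raw = r'"<div class=\"comment-body\">Great box!<\/div>"'
html = json.loads(raw)  # unescape the JSON string into real HTML
soup = BeautifulSoup(html, 'html.parser')
print(soup.find("div", class_="comment-body").text)  # prints: Great box!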

Here is my adjusted code, but it raises errors that I haven't been able to resolve:

import requests
from bs4 import BeautifulSoup
import math
import pandas as pd

df = pd.DataFrame()

def PageNum():
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    if num % 3 == 0:
        return (num / 3) + 1
    else:
        return math.ceil(num / 3) + 2


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    with requests.Session() as req:
        for item in range(1, num):
            names = []
            headers = []  # NOTE: shadows the AJAX headers dict defined above
            bodies = []
            ratings = []
            published = []
            updated = []
            reported = []
            dateElements = []
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            for com in soup.findAll("div", class_=r'\"user-review\"'):
                # NOTE: 'article' is never defined anywhere, so the next line
                # raises NameError; the loop variable 'com' was intended
                names.append(article.find('div', attrs={'class': 'name'}).text.strip())
                try:
                    bodies.append(article.find('div', attrs={'class': 'comment-body'}).text.strip())
                except:
                    bodies.append('NA')

                try:
                    ratings.append(article.find('meta', attrs={'itemprop': 'ratingValue'})['content'])
                except:
                    ratings.append('NA')
                dateElements.append(article.find('div', attrs={'class': 'comment-date'}).text.strip())
                print(com.text[5:com.text.find(r"\n", 3)])

            temp_df = pd.DataFrame(
                {'User Name': names, 'Body': bodies, 'Rating': ratings, 'Published Date': dateElements})
            # NOTE: assigning to the module-level 'df' here without a
            # 'global df' declaration raises UnboundLocalError
            df = df.append(temp_df, sort=False).reset_index(drop=True)

Main()

df.to_csv('Allure10.csv', index=False, encoding='utf-8')
print('excel done')
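
For reference, three separate errors are visible in this version before any parsing problem comes into play: the loop references an undefined name article (the loop variable com was intended), headers = [] shadows the AJAX headers dict so req.get() is given an empty list, and df is assigned inside Main() without a global declaration. A minimal sketch isolating those three fixes; the 'example user' row is a stand-in for scraped values, and pd.concat replaces DataFrame.append, which was removed in pandas 2.0:

import pandas as pd

df = pd.DataFrame()


def Main():
    # 1) declare df global; otherwise the assignment at the bottom raises
    #    UnboundLocalError
    global df
    # 2) keep the AJAX headers dict under its own name instead of shadowing
    #    it with one of the per-page result lists
    ajax_headers = {'X-Requested-With': 'XMLHttpRequest'}
    # 3) in the real scraping loop, use the loop variable 'com' everywhere
    #    the adjusted code wrote 'article'
    names = ['example user']  # stand-in for the scraped values
    temp_df = pd.DataFrame({'User Name': names})
    df = pd.concat([df, temp_df], sort=False).reset_index(drop=True)


Main()
print(df)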

【Question Comments】:

  • Where do you define `article`?

Tags: python pandas web-scraping beautifulsoup element


【Solution 1】:
import requests
from bs4 import BeautifulSoup
import math
import csv


def PageNum():
    # Derive the number of AJAX pages from the total review count shown in
    # the "Show more reviews (N)" link (3 reviews per page).
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    if num % 3 == 0:
        return (num // 3) + 1  # integer division, so range() gets an int
    else:
        return math.ceil(num / 3) + 2


def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'  # mark the request as AJAX
    }
    with requests.Session() as req:
        names = []
        dates = []
        comments = []
        rating = []
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            # The response is a JSON-escaped HTML fragment, so class names
            # are matched with their escaped quotes intact and the leftover
            # escape sequences are trimmed off by slicing.
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                comments.append(com.text[5:com.text.find(r"\n", 3)])
            for name in soup.findAll("div", class_=r'\"name\"'):
                names.append(name.text[:name.text.find(r"<\/div>", 1)])
            for date in soup.findAll("div", class_=r'\"comment-date\"'):
                dates.append(date.text[:date.text.find(r"<\/div>", 1)])
            for rate in soup.findAll("meta", itemprop=r'\"ratingValue\"'):
                rating.append(rate.get("content")[2:-3])
    # Pair the four lists row-by-row for the CSV writer.
    return zip(names, dates, rating, comments)


def Save():
    data = Main()
    with open("oka.csv", 'w', newline="", encoding="UTF-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Name", "Dates", "Rating", "Comments"])
        writer.writerows(data)


Save()
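
One property of the zip() at the end of Main() is worth noting: it pairs the four lists positionally and stops at the shortest one, so if a page ever yields, say, a comment without a matching rating, rows silently shift or drop instead of raising an error. A two-line illustration:

# zip() stops at the shortest input; the extra 'b' is silently dropped
print(list(zip(["a", "b"], [1])))  # -> [('a', 1)]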

Output: check-online

【Discussion】:

  • How do I fix this? Traceback (most recent call last): File "C:/Users/Sara Jitkresorn/PycharmProjects/untitled/venv/SubsAddict.py", line 53, in <module> Save() File "C:/Users/Sara Jitkresorn/PycharmProjects/untitled/venv/SubsAddict.py", line 50, in Save writer.writerows(data) File "C:\Users\Sara Jitkresorn\AppData\Local\Programs\Python\Python37\lib\encodings\cp874.py", line 19, in encode return codecs.charmap_encode(input,self.errors,encoding_table)[0] UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f922' in position 530: character maps to <undefined>
  • @SaraJitkresorn I believe you are using Windows with VSCode?
  • I'm using Windows. Sorry, I don't know what VSCode is.
  • @SaraJitkresorn How do you run your code? Which interpreter are you using?
  • I'm running the code through PyCharm, with Python 3.7 as the interpreter.
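
On the UnicodeEncodeError in the thread above: the traceback shows cp874 (the Windows Thai code page) doing the encoding, which suggests the local copy of Save() opened the output file without encoding="UTF-8". A minimal sketch of the fix; the emoji stands in for the '\U0001f922' character from the traceback, and the filename is just an example:

import csv

rows = [["Name", "Comments"], ["user1", "Made me sick \U0001f922"]]

# Without an explicit encoding, open() falls back to the locale code page
# (cp874 on a Thai-locale Windows machine), which cannot represent the emoji
# and raises UnicodeEncodeError. An explicit UTF-8 encoding avoids that:
with open("reviews_example.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)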