Scraping Amazon reviews, cannot exclude paid reviews

【Question title】: Scraping Amazon reviews, cannot exclude paid reviews
【Posted】: 2020-01-26 12:14:11
【Question description】:

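I am trying to collect the number of stars each reviewer gave a product. I noticed that some reviewers are "Vine Voices", i.e. paid reviewers. They rarely give 4 stars and mostly give 5, so I want to exclude them.

If a review carries the "a-color-success a-text-bold" class, I label it "Paid"; otherwise I label it "Not-paid".

I cannot seem to get any "Paid" label appended to the vine variable. Why is that?

Only reviews written by a Vine Voice carry this tag; the other reviews do not.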

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

rating_list = [] 
date_list = []
vine = []

for num in range(1,12):
    url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)

    r = requests.get(url, headers = headers)

    soup = BeautifulSoup(r.content, 'lxml')

    for ratings in soup.find_all("div", attrs={"data-hook": "review"}):     
        submission_date = ratings.find("span", {'data-hook':'review-date'}).text
        rating = ratings.find('i', attrs={"data-hook": "review-star-rating"}).text
        paid = ratings.find("span", attrs={"class": "a-color-success a-text-bold"})

        if paid in ratings:
             vine.append("Paid")
        else:
            vine.append("Not-paid")

            date_list.append(submission_date)
            rating_list.append(rating)

            data = {'Rating':rating_list, 'Date':date_list, "Paid":vine}
        time.sleep(2)

df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)

This is what I get so far. Reviews 2 and 3 are from Vine Voices, yet they are marked Not-paid when they should be Paid.

0    5.0 out of 5 stars 2019-09-18  Not-paid
1    4.0 out of 5 stars 2019-09-13  Not-paid
2    5.0 out of 5 stars 2019-09-12  Not-paid
3    5.0 out of 5 stars 2019-09-11  Not-paid
4    5.0 out of 5 stars 2019-09-10  Not-paid
...

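【Question discussion】:

Tags: python html web-scraping beautifulsoup

【Solution 1】:

You are comparing an element with an element, which is why it always falls into the else branch. I changed the code to compare text with text, and it works fine. Check the code below.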

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

rating_list = []
date_list = []
vine = []

for num in range(1,12):
    url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)

    r = requests.get(url, headers = headers)

    soup = BeautifulSoup(r.content, 'lxml')

    for ratings in soup.find_all("div", attrs={"data-hook": "review"}):
        submission_date = ratings.find("span", {'data-hook':'review-date'}).text
        rating = ratings.find('i', attrs={"data-hook": "review-star-rating"}).text
        paid = ratings.find("span", attrs={"class": "a-color-success a-text-bold"})
        if paid:

         if paid.text in ratings.text:
             vine.append("Paid")
             date_list.append(submission_date)
             rating_list.append(rating)

             data = {'Rating': rating_list, 'Date': date_list, "Paid": vine}
        else:
            vine.append("Not-paid")

            date_list.append(submission_date)
            rating_list.append(rating)

            data = {'Rating':rating_list, 'Date':date_list, "Paid":vine}
        time.sleep(2)

df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)

Output:

          Date      Paid              Rating
0   2019-09-18  Not-paid  5.0 out of 5 stars
1   2019-09-13  Not-paid  4.0 out of 5 stars
2   2019-09-12      Paid  5.0 out of 5 stars
3   2019-09-11      Paid  5.0 out of 5 stars
4   2019-09-10  Not-paid  5.0 out of 5 stars
5   2019-09-10  Not-paid  2.0 out of 5 stars
6   2019-09-10      Paid  5.0 out of 5 stars
7   2019-09-09      Paid  5.0 out of 5 stars
8   2019-09-09  Not-paid  2.0 out of 5 stars
9   2019-09-08      Paid  5.0 out of 5 stars
10  2019-09-05      Paid  5.0 out of 5 stars
11  2019-09-01  Not-paid  2.0 out of 5 stars
12  2019-08-31      Paid  5.0 out of 5 stars
13  2019-08-25      Paid  5.0 out of 5 stars
14  2019-08-24  Not-paid  4.0 out of 5 stars
15  2019-08-22  Not-paid  5.0 out of 5 stars
16  2019-08-21      Paid  5.0 out of 5 stars
17  2019-08-20  Not-paid  5.0 out of 5 stars
18  2019-08-20      Paid  5.0 out of 5 stars
19  2019-08-18      Paid  5.0 out of 5 stars
20  2019-08-17  Not-paid  5.0 out of 5 stars
21  2019-08-17  Not-paid  5.0 out of 5 stars
22  2019-08-14  Not-paid  4.0 out of 5 stars
23  2019-08-12      Paid  5.0 out of 5 stars
24  2019-08-05      Paid  5.0 out of 5 stars
25  2019-08-05      Paid  4.0 out of 5 stars
26  2019-08-04      Paid  5.0 out of 5 stars
27  2019-08-04      Paid  4.0 out of 5 stars
29  2019-08-03      Paid  5.0 out of 5 stars
28  2019-08-03      Paid  4.0 out of 5 stars
..         ...       ...                 ...
80  2019-07-08      Paid  5.0 out of 5 stars
81  2019-07-08      Paid  5.0 out of 5 stars
82  2019-07-08      Paid  5.0 out of 5 stars
85  2019-07-07      Paid  5.0 out of 5 stars
83  2019-07-07      Paid  5.0 out of 5 stars
84  2019-07-07      Paid  5.0 out of 5 stars
87  2019-07-06      Paid  5.0 out of 5 stars
86  2019-07-06      Paid  4.0 out of 5 stars
88  2019-07-05  Not-paid  4.0 out of 5 stars
89  2019-07-05      Paid  5.0 out of 5 stars
90  2019-07-05      Paid  5.0 out of 5 stars
91  2019-07-05      Paid  5.0 out of 5 stars
92  2019-07-04      Paid  5.0 out of 5 stars
93  2019-07-04      Paid  4.0 out of 5 stars
94  2019-07-04      Paid  5.0 out of 5 stars
95  2019-07-04      Paid  5.0 out of 5 stars
96  2019-07-04      Paid  5.0 out of 5 stars
98  2019-07-03  Not-paid  3.0 out of 5 stars
97  2019-07-03      Paid  5.0 out of 5 stars
99  2019-07-01      Paid  5.0 out of 5 stars
100 2019-07-01      Paid  3.0 out of 5 stars
101 2019-07-01      Paid  5.0 out of 5 stars
102 2019-06-30      Paid  5.0 out of 5 stars
103 2019-06-29      Paid  5.0 out of 5 stars
104 2019-06-29      Paid  5.0 out of 5 stars
105 2019-06-28  Not-paid  1.0 out of 5 stars
106 2019-06-27      Paid  4.0 out of 5 stars
107 2019-06-27      Paid  5.0 out of 5 stars
108 2019-06-26      Paid  5.0 out of 5 stars
109 2019-06-26      Paid  5.0 out of 5 stars

[110 rows x 3 columns]
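
To make the failure mode concrete, here is a minimal offline sketch. It assumes a simplified, hypothetical piece of markup (the real Amazon page is more deeply nested) and uses the built-in `html.parser` so no network access or lxml is needed. In Beautiful Soup, `x in tag` checks only the tag's direct children, and `find()` returns `None` when nothing matches, so the original `paid in ratings` test is False in both cases, while a plain `None` check tells the two kinds of review apart.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup for illustration only.
html = """
<div data-hook="review" id="r1">
  <div class="a-row"><span class="a-color-success a-text-bold">Vine Customer Review of Free Product</span></div>
</div>
<div data-hook="review" id="r2">
  <div class="a-row"><span>Verified Purchase</span></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

direct_child_hits = []  # result of the original `paid in ratings` test
labels = []             # result of testing whether find() matched at all

for review in soup.find_all("div", attrs={"data-hook": "review"}):
    paid = review.find("span", attrs={"class": "a-color-success a-text-bold"})
    # `in` looks at review.contents (direct children) only; the badge is a
    # grandchild in r1 and None in r2, so this is False both times.
    direct_child_hits.append(paid in review)
    labels.append("Paid" if paid is not None else "Not-paid")

print(direct_child_hits)
print(labels)
```

Once `paid is not None` is established, comparing `paid.text` with `ratings.text` adds nothing, because the text of a descendant is always contained in the text of its ancestor.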

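【Discussion】:

It does not seem to work. Also, what is going on with the `if paid` indentation?
If the element exists under the parent tag, the inner if checks the item's text value; otherwise execution goes to the else. That is all `if paid:` is about. I did run the code, though, and it gave me both the Not-paid and the Paid options.
Very strange. I don't know why mine is misbehaving. With your code I only get 20 reviews, all of them Not-paid, whereas my own code gives me 110 reviews.
You added the output and it looks correct. I still get my 20 rows. Any idea why we get such different results?
I got it working. Thanks for your help. I think the problem was the keyboard language: when I typed your code in by hand, it worked!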

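【Solution 2】:

I think a better approach (with bs4 4.7.1+) is to use :has and :not to do the exclusion up front. Then you don't need an exclusion field/flag. Below, I print the reviewer names as a visual check (you will see that the paid reviewers' names do not appear). I also adjusted your loop to work properly and used a Session for efficiency. I also use shorter, more robust selectors.

CSS selectors are faster than find, so I would probably change the find lines to: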

submission_date = review.select_one('[data-hook=review-date]').text
rating = review.select_one('[data-hook=review-star-rating]').text
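
As a quick sanity check, the sketch below (on hypothetical, simplified markup) confirms that `find` and `select_one` return the very same element, and times both with `timeit`. Relative speed may depend on the bs4/soupsieve version and on the document, so it is worth measuring rather than assuming; no timing result is claimed here.

```python
from timeit import timeit

from bs4 import BeautifulSoup

# Hypothetical, simplified review markup for illustration only.
html = (
    '<div data-hook="review">'
    '<span data-hook="review-date">September 12, 2019</span>'
    '<i data-hook="review-star-rating">5.0 out of 5 stars</i>'
    '</div>'
)
review = BeautifulSoup(html, "html.parser").div

via_find = review.find("span", {"data-hook": "review-date"})
via_select = review.select_one("[data-hook=review-date]")
same_element = via_find is via_select  # both point at the same Tag object

# Time 1000 lookups of each style; the numbers vary by machine and version.
find_seconds = timeit(
    lambda: review.find("span", {"data-hook": "review-date"}), number=1000
)
select_seconds = timeit(
    lambda: review.select_one("[data-hook=review-date]"), number=1000
)
print(same_element, find_seconds, select_seconds)
```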

Py

import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0'}
rating_list = [] 
date_list = []

with requests.Session() as s:   
    for num in range(1,12):
        url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)
        r = s.get(url, headers = headers)
        soup = BeautifulSoup(r.content, 'lxml')

        for review in soup.select('.review:not(:has(.a-color-success))'):   
            submission_date = review.select_one('[data-hook=review-date]').text
            rating = review.select_one('[data-hook=review-star-rating]').text
            date_list.append(submission_date)
            rating_list.append(rating)
            print(review.select_one('.a-profile-name').text) #check 
    data = {'Rating':rating_list, 'Date':date_list}

df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)
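
The selector can be tried offline before pointing it at the live site. The sketch below assumes hypothetical, simplified markup with made-up reviewer names; it only illustrates how `:not(:has(...))` drops every review that contains an element with the `a-color-success` class, and how the rating text can be turned into a number.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; reviewer names are invented.
html = """
<div class="a-section review" data-hook="review">
  <span class="a-profile-name">Alice</span>
  <span class="a-color-success a-text-bold">Vine Customer Review of Free Product</span>
  <i data-hook="review-star-rating">5.0 out of 5 stars</i>
</div>
<div class="a-section review" data-hook="review">
  <span class="a-profile-name">Bob</span>
  <i data-hook="review-star-rating">2.0 out of 5 stars</i>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep only reviews that have no descendant with class a-color-success.
kept = soup.select(".review:not(:has(.a-color-success))")

names = [r.select_one(".a-profile-name").text for r in kept]
# "2.0 out of 5 stars" -> 2.0
ratings = [
    float(r.select_one("[data-hook=review-star-rating]").text.split()[0])
    for r in kept
]
print(names, ratings)
```

Because the exclusion happens in the selector, no "Paid" column has to be built and filtered afterwards.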

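【Discussion】:

I really admire this pseudo-selector :has().
I have created a fascinating script in VBA that rotates proxies fetched automatically from a free proxy site. The script can get around any kind of ban except a captcha that appears up front. As long as the content is static and the site is not Google, it can scrape almost any website. I will send it to you someday.
@SIM I look forward to seeing it. Glad to see you using :has as well.
Looks interesting. I will test this tomorrow and report back!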