【Question title】: Scraping Amazon reviews, cannot exclude paid reviews
【Posted】: 2020-01-26 12:14:11
【Question】:

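I am trying to collect the number of stars each reviewer gave a product. I noticed that some reviewers are "Vine Voices", or paid reviewers. They seldom give 4 stars and mostly give 5, so I want to exclude them.

If a review contains an element with the class "a-color-success a-text-bold", I label it "Paid"; otherwise I label it "Not-paid".

I can't seem to get any "Paid" label appended to the vine variable. How come?

Only reviews written by a Vine Voice carry that class; the other reviews do not.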

import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

rating_list = [] 
date_list = []
vine = []

for num in range(1,12):
    url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)

    r = requests.get(url, headers = headers)

    soup = BeautifulSoup(r.content, 'lxml')

    for ratings in soup.find_all("div", attrs={"data-hook": "review"}):     
        submission_date = ratings.find("span", {'data-hook':'review-date'}).text
        rating = ratings.find('i', attrs={"data-hook": "review-star-rating"}).text
        paid = ratings.find("span", attrs={"class": "a-color-success a-text-bold"})

        if paid in ratings:
             vine.append("Paid")
        else:
            vine.append("Not-paid")

            date_list.append(submission_date)
            rating_list.append(rating)

            data = {'Rating':rating_list, 'Date':date_list, "Paid":vine}
        time.sleep(2)

df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)

This is what I get so far. Reviews 2 and 3 are from Vine Voices, but they are labeled Not-paid when they should be Paid.

0    5.0 out of 5 stars 2019-09-18  Not-paid
1    4.0 out of 5 stars 2019-09-13  Not-paid
2    5.0 out of 5 stars 2019-09-12  Not-paid
3    5.0 out of 5 stars 2019-09-11  Not-paid
4    5.0 out of 5 stars 2019-09-10  Not-paid
...

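【Comments】:

    Tags: python html web-scraping beautifulsoup


    【Solution 1】:

    You are comparing an element with an element, which is why it always falls through to the else branch. I changed the code to compare text with text, and it works fine. See the code below.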

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    import time
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
    
    rating_list = []
    date_list = []
    vine = []
    
    for num in range(1,12):
        url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)
    
        r = requests.get(url, headers = headers)
    
        soup = BeautifulSoup(r.content, 'lxml')
    
        for ratings in soup.find_all("div", attrs={"data-hook": "review"}):
            submission_date = ratings.find("span", {'data-hook':'review-date'}).text
            rating = ratings.find('i', attrs={"data-hook": "review-star-rating"}).text
            paid = ratings.find("span", attrs={"class": "a-color-success a-text-bold"})
            if paid:
    
             if paid.text in ratings.text:
                 vine.append("Paid")
                 date_list.append(submission_date)
                 rating_list.append(rating)
    
                 data = {'Rating': rating_list, 'Date': date_list, "Paid": vine}
            else:
                vine.append("Not-paid")
    
                date_list.append(submission_date)
                rating_list.append(rating)
    
                data = {'Rating':rating_list, 'Date':date_list, "Paid":vine}
            time.sleep(2)
    
    df = pd.DataFrame(data)
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.sort_values(by="Date", ascending=False)
    print(df)
    

    Output:

              Date      Paid              Rating
    0   2019-09-18  Not-paid  5.0 out of 5 stars
    1   2019-09-13  Not-paid  4.0 out of 5 stars
    2   2019-09-12      Paid  5.0 out of 5 stars
    3   2019-09-11      Paid  5.0 out of 5 stars
    4   2019-09-10  Not-paid  5.0 out of 5 stars
    5   2019-09-10  Not-paid  2.0 out of 5 stars
    6   2019-09-10      Paid  5.0 out of 5 stars
    7   2019-09-09      Paid  5.0 out of 5 stars
    8   2019-09-09  Not-paid  2.0 out of 5 stars
    9   2019-09-08      Paid  5.0 out of 5 stars
    10  2019-09-05      Paid  5.0 out of 5 stars
    11  2019-09-01  Not-paid  2.0 out of 5 stars
    12  2019-08-31      Paid  5.0 out of 5 stars
    13  2019-08-25      Paid  5.0 out of 5 stars
    14  2019-08-24  Not-paid  4.0 out of 5 stars
    15  2019-08-22  Not-paid  5.0 out of 5 stars
    16  2019-08-21      Paid  5.0 out of 5 stars
    17  2019-08-20  Not-paid  5.0 out of 5 stars
    18  2019-08-20      Paid  5.0 out of 5 stars
    19  2019-08-18      Paid  5.0 out of 5 stars
    20  2019-08-17  Not-paid  5.0 out of 5 stars
    21  2019-08-17  Not-paid  5.0 out of 5 stars
    22  2019-08-14  Not-paid  4.0 out of 5 stars
    23  2019-08-12      Paid  5.0 out of 5 stars
    24  2019-08-05      Paid  5.0 out of 5 stars
    25  2019-08-05      Paid  4.0 out of 5 stars
    26  2019-08-04      Paid  5.0 out of 5 stars
    27  2019-08-04      Paid  4.0 out of 5 stars
    29  2019-08-03      Paid  5.0 out of 5 stars
    28  2019-08-03      Paid  4.0 out of 5 stars
    ..         ...       ...                 ...
    80  2019-07-08      Paid  5.0 out of 5 stars
    81  2019-07-08      Paid  5.0 out of 5 stars
    82  2019-07-08      Paid  5.0 out of 5 stars
    85  2019-07-07      Paid  5.0 out of 5 stars
    83  2019-07-07      Paid  5.0 out of 5 stars
    84  2019-07-07      Paid  5.0 out of 5 stars
    87  2019-07-06      Paid  5.0 out of 5 stars
    86  2019-07-06      Paid  4.0 out of 5 stars
    88  2019-07-05  Not-paid  4.0 out of 5 stars
    89  2019-07-05      Paid  5.0 out of 5 stars
    90  2019-07-05      Paid  5.0 out of 5 stars
    91  2019-07-05      Paid  5.0 out of 5 stars
    92  2019-07-04      Paid  5.0 out of 5 stars
    93  2019-07-04      Paid  4.0 out of 5 stars
    94  2019-07-04      Paid  5.0 out of 5 stars
    95  2019-07-04      Paid  5.0 out of 5 stars
    96  2019-07-04      Paid  5.0 out of 5 stars
    98  2019-07-03  Not-paid  3.0 out of 5 stars
    97  2019-07-03      Paid  5.0 out of 5 stars
    99  2019-07-01      Paid  5.0 out of 5 stars
    100 2019-07-01      Paid  3.0 out of 5 stars
    101 2019-07-01      Paid  5.0 out of 5 stars
    102 2019-06-30      Paid  5.0 out of 5 stars
    103 2019-06-29      Paid  5.0 out of 5 stars
    104 2019-06-29      Paid  5.0 out of 5 stars
    105 2019-06-28  Not-paid  1.0 out of 5 stars
    106 2019-06-27      Paid  4.0 out of 5 stars
    107 2019-06-27      Paid  5.0 out of 5 stars
    108 2019-06-26      Paid  5.0 out of 5 stars
    109 2019-06-26      Paid  5.0 out of 5 stars
    
    [110 rows x 3 columns]
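
To see why the original `if paid in ratings:` test never fired, here is a minimal sketch on made-up markup. The HTML, the `id` values, and the badge text are assumptions for illustration, not Amazon's real page. In Beautiful Soup, `x in tag` only checks the tag's direct children, so a badge nested one level deeper is not found, and a missing badge (`None`) is not found either. Testing the result of `find()` against `None` avoids both problems.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: the badge span is nested inside another div.
html = """
<div data-hook="review" id="r1">
  <div class="a-row"><span class="a-color-success a-text-bold">Vine Customer Review of Free Product</span></div>
</div>
<div data-hook="review" id="r2">
  <div class="a-row"><span>Verified Purchase</span></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
direct_child_checks = []
labels = []

for review in soup.find_all("div", attrs={"data-hook": "review"}):
    badge = review.find("span", attrs={"class": "a-color-success a-text-bold"})
    # `badge in review` looks only at review's direct children.
    direct_child_checks.append(badge in review)
    # find() returns None when there is no badge, so this is the reliable test.
    labels.append("Paid" if badge is not None else "Not-paid")

print(direct_child_checks)
print(labels)
```

With this check, the nested `if paid.text in ratings.text:` from the answer is not needed, because any element returned by `ratings.find(...)` is by definition inside `ratings`.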
    

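    【Comments】:

    • That doesn't seem to work for me. I also noticed something odd about the indentation of "if paid"?
    • If the element exists under the parent tag, the inner if checks the item's text value; otherwise execution goes to else. That is all "if paid:" is about. I ran the code and it gives me both Not-paid and Paid labels.
    • Very strange. I don't know why mine is misbehaving. I also only get 20 reviews, all Not-paid, whereas my own code gives me 110 reviews.
    • You added the output, and it looks correct. I still get my 20 rows. No idea why our results are so different.
    • I got it working. Thanks for your help. I think the problem was the keyboard language: when I typed your code in by hand, it worked!
    【Solution 2】: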

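    I think a better approach (with bs4 4.7.1+) is to use :has and :not to do the exclusion up front. Then you don't need an exclusion field/flag. Below, I print the reviewer names as a visual check (you will see that the paid reviewers' names do not appear). I also adjusted your loop so it works properly and used a Session for efficiency. I also use shorter, more robust selectors.

    CSS selectors are generally faster than find, so I would probably change the find lines to: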

    submission_date = review.select_one('[data-hook=review-date]').text
    rating = review.select_one('[data-hook=review-star-rating]').text
    

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    
    headers = {'User-Agent': 'Mozilla/5.0'}
    rating_list = [] 
    date_list = []
    
    with requests.Session() as s:   
        for num in range(1,12):
            url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)
            r = s.get(url, headers = headers)
            soup = BeautifulSoup(r.content, 'lxml')
    
            for review in soup.select('.review:not(:has(.a-color-success))'):   
                submission_date = review.select_one('[data-hook=review-date]').text
                rating = review.select_one('[data-hook=review-star-rating]').text
                date_list.append(submission_date)
                rating_list.append(rating)
                print(review.select_one('.a-profile-name').text) #check 
        data = {'Rating':rating_list, 'Date':date_list}
    
    df = pd.DataFrame(data)
    df["Date"] = pd.to_datetime(df["Date"])
    df = df.sort_values(by="Date", ascending=False)
    print(df)
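
The `:not(:has(...))` filter can be tried offline before pointing it at the live site. The sketch below runs the same selector on made-up markup; the HTML and the reviewer names are assumptions for illustration. It also assumes, as the answer does, that `.a-color-success` appears only on the Vine badge inside a review.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: only the first review has the badge.
html = """
<div class="review" id="r1">
  <span class="a-profile-name">Alice</span>
  <div><span class="a-color-success a-text-bold">Vine Customer Review of Free Product</span></div>
</div>
<div class="review" id="r2">
  <span class="a-profile-name">Bob</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Keep only .review elements that have no descendant with class a-color-success.
kept = [
    review.select_one(".a-profile-name").text
    for review in soup.select(".review:not(:has(.a-color-success))")
]
print(kept)
```

Because `:has()` matches descendants at any depth, it does not matter how deeply the badge is nested inside the review.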
    

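    【Comments】:

    • I adore this pseudo-selector :has()
    • I have created a fascinating script in VBA that rotates proxies fetched automatically from free proxy sites. The script can get around any kind of ban except a captcha that appears right at the start. It can scrape almost any site as long as the content is static and the site is not Google. I'll send it to you someday.
    • @SIM I look forward to seeing it. Nice to see you using :has as well.
    • Looks interesting. I'll test this tomorrow and report back!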