【Posted】: 2020-01-26 12:14:11
【Question】:
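I'm trying to collect the star rating that each reviewer gave a product. I noticed that some reviewers are "Vine Voices", i.e. paid reviewers. They rarely give 4 stars and mostly give 5 stars, so I want to exclude them.
If a review contains a span with the class "a-color-success a-text-bold", I label it "Paid"; otherwise I label it "Not-paid".
I can't seem to get any "Paid" label appended to the vine list. Why is that?
Only reviews written by a Vine Voice have that span; the other reviews don't.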
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}

rating_list = []
date_list = []
vine = []

for num in range(1,12):
    url = "https://www.amazon.com/Jabra-Wireless-Noise-Canceling-Headphones-Built/product-reviews/B07RS8B5HV/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber={}&sortBy=recent".format(num)
    r = requests.get(url, headers = headers)
    soup = BeautifulSoup(r.content, 'lxml')
    for ratings in soup.find_all("div", attrs={"data-hook": "review"}):
        submission_date = ratings.find("span", {'data-hook':'review-date'}).text
        rating = ratings.find('i', attrs={"data-hook": "review-star-rating"}).text
        paid = ratings.find("span", attrs={"class": "a-color-success a-text-bold"})
        if paid in ratings:
            vine.append("Paid")
        else:
            vine.append("Not-paid")
        date_list.append(submission_date)
        rating_list.append(rating)
    data = {'Rating':rating_list, 'Date':date_list, "Paid":vine}
    time.sleep(2)

df = pd.DataFrame(data)
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values(by="Date", ascending=False)
print(df)
This is what I get so far. Reviews 2 and 3 are from Vine Voices, but they are labeled Not-paid when they should be Paid.
0 5.0 out of 5 stars 2019-09-18 Not-paid
1 4.0 out of 5 stars 2019-09-13 Not-paid
2 5.0 out of 5 stars 2019-09-12 Not-paid
3 5.0 out of 5 stars 2019-09-11 Not-paid
4 5.0 out of 5 stars 2019-09-10 Not-paid
...
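
A likely cause, sketched under assumptions: `ratings.find(...)` returns either a Tag or `None`, and as far as I can tell `paid in ratings` only compares `paid` against the direct children of the review `div`. The Vine span sits deeper in the tree, so the test is false either way. Checking the result of the lookup against `None` avoids that. The HTML below is a hypothetical, simplified stand-in for two review blocks (not Amazon's real markup), and `label_reviews` is an illustrative helper name, not something from the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for two review blocks.
# The first one carries the Vine badge span, the second does not.
SAMPLE_HTML = """
<div data-hook="review">
  <div class="a-row">
    <i data-hook="review-star-rating"><span>5.0 out of 5 stars</span></i>
    <span class="a-color-success a-text-bold">Vine Customer Review of Free Product</span>
  </div>
</div>
<div data-hook="review">
  <div class="a-row">
    <i data-hook="review-star-rating"><span>4.0 out of 5 stars</span></i>
  </div>
</div>
"""


def label_reviews(html):
    """Return 'Paid' or 'Not-paid' for each review block in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    labels = []
    for review in soup.find_all("div", attrs={"data-hook": "review"}):
        # select_one searches all descendants and returns None when nothing matches.
        badge = review.select_one("span.a-color-success.a-text-bold")
        # Test the lookup result itself instead of using `badge in review`.
        labels.append("Paid" if badge is not None else "Not-paid")
    return labels


print(label_reviews(SAMPLE_HTML))  # ['Paid', 'Not-paid']
```

In the original loop, the equivalent change would be replacing `if paid in ratings:` with `if paid is not None:`. Whether the class names still match the live page is something to verify separately.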
【Comments】:
Tags: python html web-scraping beautifulsoup