【Posted】: 2020-09-11 07:08:02
【Question】:
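I'm trying to scrape some data from Yelp pages. However, some values are missing from the results, and the missing data changes on every run (for example, two records are missing on the first run and one on the second). Does anyone know why this happens? Thanks!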
import time

import pandas as pd
import requests as r
from bs4 import BeautifulSoup

# data_rev (a DataFrame with a 'url' column) and pages (the start offsets)
# are defined elsewhere and are not shown in the question.
review_listings = []
cols2 = ['restaurant name', 'username', 'ratings', 'review.text']
copy = 0
for url in data_rev['url']:  # Each url has 20 so start
    start = time.time()
    for p in pages:
        url_review = url + "&start={}".format(p)
        page = r.get(url_review)
        soup = BeautifulSoup(page.content, 'html.parser')
        res_name = soup.find("h1", {"class": "lemon--h1__373c0__2ZHSL heading--h1__373c0___56D3 undefined heading--inline__373c0__1jeAh"}).text
        tables = soup.find_all('li', {'class': 'lemon--li__373c0__1r9wz margin-b3__373c0__q1DuY padding-b3__373c0__342DA border--bottom__373c0__3qNtD border-color--default__373c0__3-ifU'})
        if len(tables) == 0:
            # No review items matched on this page: print the URL and stop paging.
            print(url_review)
            break
        else:
            for table in tables:
                # name, ratings, username:
                username = table.find("span", {"class": "lemon--span__373c0__3997G text__373c0__2Kxyz fs-block text-color--blue-dark__373c0__1jX7S text-align--left__373c0__2XGa- text-weight--bold__373c0__1elNz"}).a.text
                ratings = table.find("span", {"class": "lemon--span__373c0__3997G display--inline__373c0__3JqBP border-color--default__373c0__3-ifU"}).div.get("aria-label")
                text = table.find("span", {"class": "lemon--span__373c0__3997G raw__373c0__3rKqk"}).text
                review_listings.append([res_name, username, ratings, text])
    # Rebuild the DataFrame and report how many reviews this URL added.
    rev_df = pd.DataFrame.from_records(review_listings, columns=cols2)
    size_df = len(rev_df)
    print("review sizes are =>", size_df - copy)
    print(res_name)
    copy = size_df
    end = time.time()
    print(end - start)
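The question does not establish why the gaps vary between runs. One possibility (an assumption, not something confirmed here) is that some requests come back with an error status or a different page, so the `li` lookup finds nothing and the loop hits `break`. A minimal sketch for checking this is to inspect the status code before parsing and retry a few times. The helper name `fetch_with_retry` and its parameters are hypothetical; `get` stands for any callable that returns an object with a `status_code` attribute, such as `requests.get`.

```python
import time


def fetch_with_retry(get, url, retries=3, delay=1.0):
    """Call get(url) until it returns HTTP 200 or the attempts run out.

    `get` is any callable returning an object with a `status_code`
    attribute (for example `requests.get`). Hypothetical helper.
    """
    last_status = None
    for attempt in range(1, retries + 1):
        response = get(url)
        last_status = response.status_code
        if last_status == 200:
            return response
        # Log the failure so missing pages can be traced back to a URL.
        print(f"attempt {attempt}: {url} returned {last_status}")
        # Wait a little longer after each failed attempt.
        time.sleep(delay * attempt)
    raise RuntimeError(f"{url} failed after {retries} attempts (last status {last_status})")
```

In the loop above this would replace the bare request, for example `page = fetch_with_retry(r.get, url_review)`. If the log shows non-200 responses on the runs where data goes missing, the problem is on the request side and not in the parsing.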
【Comments】:
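- Can you share some example URLs?
- @user14245642 Use Selenium instead of BeautifulSoup.
- @ZarakiKenpachi I considered that too, but I have to scrape thousands of records, so Selenium would take far too long.
- The classes you are using are most likely dynamic, to prevent scraping.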
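If the class names really do change, as the last comment suggests, exact matches on the full class string will silently stop matching. A minimal sketch of a more tolerant lookup is shown below. It assumes the readable prefixes (`margin-b3__`, `raw__`) and the unhashed `fs-block` class stay stable while the hash suffixes vary, which is not verified against the live site. The sample markup and the function name `parse_reviews` are hypothetical. Missing fields become `None` and do not raise `AttributeError`.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup modelled on the class names in the question.
# The second item uses different hash suffixes and has no review text.
SAMPLE_HTML = """
<ul>
  <li class="lemon--li__373c0__1r9wz margin-b3__373c0__q1DuY">
    <span class="lemon--span__373c0__3997G fs-block"><a href="#">Alice</a></span>
    <span class="lemon--span__373c0__3997G display--inline__373c0__3JqBP"><div aria-label="5 star rating"></div></span>
    <span class="lemon--span__373c0__3997G raw__373c0__3rKqk">Great food.</span>
  </li>
  <li class="lemon--li__abc12__XyZ12 margin-b3__abc12__QwErT">
    <span class="lemon--span__abc12__AAAAA fs-block"><a href="#">Bob</a></span>
    <span class="lemon--span__abc12__AAAAA display--inline__abc12__BBBBB"><div aria-label="3 star rating"></div></span>
  </li>
</ul>
"""


def parse_reviews(html):
    """Extract reviews by matching class-name prefixes, not full class strings."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # A regex passed to class_ is tested against each individual class token.
    for item in soup.find_all("li", class_=re.compile(r"^margin-b3__")):
        user = item.find("span", class_="fs-block")
        rating = item.find("div", attrs={"aria-label": True})
        text = item.find("span", class_=re.compile(r"^raw__"))
        rows.append({
            "username": user.a.get_text(strip=True) if user and user.a else None,
            "ratings": rating["aria-label"] if rating else None,
            "review.text": text.get_text(strip=True) if text else None,
        })
    return rows
```

Calling `parse_reviews(SAMPLE_HTML)` yields one dictionary per `li`, with `None` wherever an element was not found, so gaps become visible in the resulting data.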
Tags: python web-scraping missing-data review