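【Question title】: Why is my web scraping code not extracting data as it should?
【Posted】: 2019-11-01 02:32:33
【Question description】:

I am trying to scrape data from an online shopping site. My code runs without any errors, but the data is not written to the CSV file as expected. Where is my code going wrong?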

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome("/usr/bin/chromedriver")

products=[] #List to store name of the product
prices=[] #List to store price of the product
ratings=[] #List to store rating of the product
driver.get("https://www.flipkart.com/lenovo-core-i3-6th-gen-4-gb-1-tb-hdd-windows-10-home-ip-320e-laptop/p/itmf3s32ghxrkrhf?pid=COMEWM7FTAQ9EHRF&srno=b_1_2&otracker=browse&lid=LSTCOMEWM7FTAQ9EHRFBL70ZV&fm=organic&iid=90098c10-e53b-49dc-9359-ff04338c0c4e.COMEWM7FTAQ9EHRF.SEARCH&ssid=2d6xzladk00000001572540087124")

content = driver.page_source
soup = BeautifulSoup(content)
for a in soup.findAll('a',href=True, attrs={'class':'_29OxBi'}):
    name = a.find('div', attrs={'class':'_35KyD6'})
    price = a.find('div', attrs={'class':'_1vC4OE _3qQ9m1'})
    rating= a.find('div', attrs={'class':'hGSR34'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)

df = pd.DataFrame({'Product Name':products,'Price':prices,'Rating':ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')

I expect the code to return data such as the name, price, and rating of the product shown on the site.

【Comments】:

  • What output did you get?

Tags: python pandas selenium web-scraping


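【Solution 1】:

On Flipkart, the product data is loaded dynamically: it is embedded in a script tag and rendered when the browser executes the page's JavaScript. You can pull that data out with a regular expression and parse it with a JSON parser to get the information you need, using only requests and avoiding the overhead of Selenium.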

import json
import re

import requests

# The product data is assigned to window.__INITIAL_STATE__ inside a script tag.
p = re.compile(r'window\.__INITIAL_STATE__ = (.*);')
r = requests.get('https://www.flipkart.com/lenovo-core-i3-6th-gen-4-gb-1-tb-hdd-windows-10-home-ip-320e-laptop/p/itmf3s32ghxrkrhf?pid=COMEWM7FTAQ9EHRF&srno=b_1_2&otracker=browse&lid=LSTCOMEWM7FTAQ9EHRFBL70ZV&fm=organic&iid=90098c10-e53b-49dc-9359-ff04338c0c4e.COMEWM7FTAQ9EHRF.SEARCH&ssid=2d6xzladk00000001572540087124')
# Parse the captured JSON and drill down to the product widget data.
data = json.loads(p.findall(r.text)[0])['pageDataV4']['page']['data']['10002'][1]['widget']['data']

##data sections:
# data.keys()

##pricing info:
# data['pricing']['value'].keys()
# data['pricing']['value']['mrp'].keys()

##rating info:
# data['ratingsAndReviews']['value']['rating']

price = data['pricing']['value']['mrp']['currency'] + str(data['pricing']['value']['mrp']['value'])
title = ' '.join(reversed([v for k,v in data['titleComponent']['value'].items() if k in ['title', 'subtitle']]))
average_rating = data['ratingsAndReviews']['value']['rating']['average']
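
Since the question was about getting the values into a CSV file, here is a minimal offline sketch of the same regex-plus-JSON idea, ending with CSV output. It does not hit Flipkart: SAMPLE_HTML and its "product" keys are made up for illustration, and the real page uses the much deeper nesting shown above, which may also change over time. The helper names extract_state and rows_to_csv are hypothetical too.

```python
import csv
import io
import json
import re

# Hypothetical page source: a tiny stand-in for the real response body.
# The "product" structure below is an assumption, not Flipkart's layout.
SAMPLE_HTML = """
<html><body>
<script>window.__INITIAL_STATE__ = {"product": {"title": "Example Laptop", "price": 25990, "rating": 4.2}};</script>
</body></html>
"""

# Same pattern as in the answer above.
STATE_PATTERN = re.compile(r'window\.__INITIAL_STATE__ = (.*);')
FIELDS = ["Product Name", "Price", "Rating"]


def extract_state(html):
    """Return the embedded state as a dict, or None if missing or malformed."""
    match = STATE_PATTERN.search(html)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


state = extract_state(SAMPLE_HTML)
rows = []
if state is not None:
    product = state.get("product", {})
    rows.append({
        "Product Name": product.get("title"),
        "Price": product.get("price"),
        "Rating": product.get("rating"),
    })

csv_text = rows_to_csv(rows, FIELDS)
print(csv_text)
```

Returning None when the pattern is not found makes the failure visible, instead of silently producing a CSV that contains only the header row, which is what happens when a selector or pattern matches nothing. To write a real file, open products.csv with newline='' and pass the file object to csv.DictWriter in place of the StringIO buffer.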

【Discussion】:
