【Question title】: Beautiful Soup find_all() not returning all elements
【Posted】: 2021-06-04 06:39:25
【Question】:

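I'm trying to scrape this website using bs4. By inspecting a particular car ad tile, I worked out what I need to scrape to get the title and the link to the car's page.

I'm using the find_all() function from the bs4 library, but it doesn't return the required information for all of the cars. It returns only about 21 of them, while the website clearly lists about 2410 cars.

Relevant code: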

from bs4 import BeautifulSoup as bs
from urllib.request import Request, urlopen 
import re
import requests

url = 'https://www.cardekho.com/used-cars+in+bangalore'
req = Request(url , headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()

page_soup = bs(webpage,"html.parser")

tags = page_soup.find_all("div","title")

print(len(tags))

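How can I get the information for all the cars on the page?

P.S. One thing I should point out: the cars are not all displayed at once. More cars load as you scroll down. Could that be the reason? I'm not sure.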

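【Comments】:

  • "More cars load as you scroll down." -- It looks like JavaScript is used to load the additional content. You would need something like Selenium to execute the JavaScript.
  • When you inspect the site, you will see that, on scroll, it loads new cars in JSON format from an API endpoint. You can loop over the pages in the API URL to get the additional cars.
  • Hi @RJAdriaansen, thanks for commenting on this post. I'm a beginner at scraping, so I'm not sure I understand what you mean. Could you give me an example? I'd really appreciate it.
  • @JustinEzequiel
  • @vishalsingh I have posted an answer to help you get started.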

Tags: python python-3.x web-scraping beautifulsoup


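【Answer 1】:

OK, I've written some sample code to show you how it can be done. Although the site has a convenient API we can use, the first page is not available through the API; instead it is embedded in a script tag in the HTML, so extracting it takes some extra processing. After that, it is just a matter of fetching the JSON data from the API, parsing it into a Python dictionary and appending the car entries to a list. You can find the API URL by watching the network activity in Chrome or Firefox while scrolling the site.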

from bs4 import BeautifulSoup
import re
import json
from subprocess import check_output
import requests
import time
from tqdm import tqdm #tqdm is just to implement a progress bar, https://pypi.org/project/tqdm/

cars = [] #create empty list to which we will append the car dicts from the json data

url = 'https://www.cardekho.com/used-cars+in+bangalore'
r = requests.get(url , headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(r.content.decode('utf-8'),"html.parser")
s = soup.find('script', {"type":"application/ld+json"}).next_sibling #find the section with the json data. It looks for a script tag with the application/ld+json type and takes the next tag, which is the one with the data we need (see the page source)

js = 'window = {};\n'+s.text.strip()+';\nprocess.stdout.write(JSON.stringify(window.__INITIAL_STATE__));' #strip the text from unnecessary strings and load the json as python dict, taken from: https://stackoverflow.com/questions/54991571/extract-json-from-html-script-tag-with-beautifulsoup-in-python/54992015#54992015
with open('temp.js','w') as f: # save the string to a javascript file
    f.write(js)

data_site = json.loads(check_output(['node','temp.js'])) #execute the file with node (Node.js must be installed and on the PATH), which prints the json data that is then parsed with json.loads.
for i in data_site['items']: #iterate over the dict and append all cars to the empty list 'cars'
  cars.append(i)

for page in tqdm(range(20, data_site['total_count'], 20)): #'pagefrom' in the api call is 20, 40, 60, etc. so create a range and loop it
  r = requests.get(f"https://www.cardekho.com/api/v1/usedcar/search?&cityId=105&connectoid=&lang_code=en&regionId=0&searchstring=used-cars%2Bin%2Bbangalore&pagefrom={page}&sortby=updated_date&sortorder=asc&mink=0&maxk=200000&dealer_id=&regCityNames=&regStateNames=", headers={'User-Agent': 'Mozilla/5.0'})
  data = r.json()

  for i in data['data']['cars']: #iterate over the dict and append all cars to the empty list 'cars'
    cars.append(i)

  time.sleep(5) #wait a few seconds to avoid overloading the site
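
If you want to reuse or test the paging logic without hitting the site, the loop above can be factored roughly as follows. This is only a sketch: page_offsets, collect_cars and fetch_page are hypothetical names that are not part of the code above, and fetch_page stands in for whatever function performs the request and returns the list of car dicts for one offset.

```python
from typing import Callable, Dict, Iterator, List


def page_offsets(total_count: int, page_size: int = 20) -> Iterator[int]:
    """Yield the 'pagefrom' offsets after the first page: 20, 40, 60, ..."""
    return iter(range(page_size, total_count, page_size))


def collect_cars(first_page: List[Dict],
                 total_count: int,
                 fetch_page: Callable[[int], List[Dict]],
                 page_size: int = 20) -> List[Dict]:
    """Combine the embedded first page with the later pages from the API.

    fetch_page is a hypothetical callable: it receives an offset and
    returns the car dicts for that page, for example by wrapping
    requests.get(...).json()['data']['cars'].
    """
    cars = list(first_page)
    for offset in page_offsets(total_count, page_size):
        batch = fetch_page(offset)
        if not batch:  # stop early if the API has no more results
            break
        cars.extend(batch)
    return cars
```

Passing the request function in as an argument keeps the waiting and the headers in one place, and lets you try the loop with a fake function first.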

This leaves cars as a list of dictionaries. The car name is in the vid key and the URL is in the vlink key. You can load it into a pandas DataFrame to explore the data:

import pandas as pd
df = pd.DataFrame(cars)

df.head() will output (I have omitted the image columns for readability):

loc myear bt ft km it pi pn pu dvn ic ucid sid ip oem model vid city vlink p_numeric webp_image position pageNo centralVariantId isExpiredModel modelId isGenuine is_ftc seller_location utype views tmGaadiStore cls
0 Koramangala 2014 SUV Diesel 30,000 0 https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2206305_1614944913.jpg 9.9 Lakh Mahindra XUV500 W6 2WD 13 3019084 9509A09F1673FE2566DF59EC54AAC05B 1 Mahindra Mahindra XUV500 Mahindra XUV500 2011-2015 W6 2WD Bangalore /used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm 990000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2206305_1614944913.webp 1 1 3822 True 570 0 0 {'address': 'BDA Complex, 100 Feet Rd, 3rd Block, Koramangala 3 Block, Koramangala, Bengaluru, Karnataka 560034, Bangalore', 'lat': 12.931, 'lng': 77.6228} Dealer 235 False
1 Marathahalli Colony 2017 SUV Petrol 30,000 0 https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2203506_1614754307.jpeg 7.85 Lakh Ford Ecosport 1.5 Petrol Trend BSIV 14 3015331 2C0E4C4E543D4792C1C3186B361F718B 1 Ford Ford Ecosport Ford Ecosport 2015-2021 1.5 Petrol Trend BSIV Bangalore /used-car-details/used-Ford-Ecosport-2015-2021-1.5-Petrol-Trend-BSIV-cars-Bangalore_2C0E4C4E543D4792C1C3186B361F718B.htm 785000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2203506_1614754307.webp 2 1 6086 True 175 0 0 {'address': '2, Varthur Rd, Ayyappa Layout, Chandra Layout, Marathahalli, Bengaluru, Karnataka 560037, Marathahalli Colony, Bangalore', 'lat': 12.956727624875453, 'lng': 77.70174980163576} Dealer 495 False
2 Yelahanka 2020 SUV Diesel 13,969 0 https://images10.gaadicdn.com/usedcar_image/320x240/usedcar_11_276591614316705_1614316747.jpg 41 Lakh Toyota Fortuner 2.8 4WD AT 12 3007934 BBC13FB62DF6840097AA62DDEA05BB04 1 Toyota Toyota Fortuner Toyota Fortuner 2016-2021 2.8 4WD AT Bangalore /used-car-details/used-Toyota-Fortuner-2016-2021-2.8-4WD-AT-cars-Bangalore_BBC13FB62DF6840097AA62DDEA05BB04.htm 4100000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/usedcar_11_276591614316705_1614316747.webp 3 1 7618 True 364 0 0 {'address': 'Sonnappanahalli Kempegowda Intl Airport Road Jala Uttarahalli Hobli, Yelahanka, Bangalore, Karnataka 560064', 'lat': 13.1518821, 'lng': 77.6220694} Dealer 516 False
3 Byatarayanapura 2017 Sedans Diesel 18,000 0 https://images10.gaadicdn.com/usedcar_image/320x240/used_car_2202297_1615013237.jpg 35 Lakh Mercedes-Benz E-Class E250 CDI Avantgarde 15 3013606 4553943A967049D873712AFFA5F65A56 1 Mercedes-Benz Mercedes-Benz E-Class Mercedes-Benz E-Class 2009-2012 E250 CDI Avantgarde Bangalore /used-car-details/used-Mercedes-Benz-E-Class-2009-2012-E250-CDI-Avantgarde-cars-Bangalore_4553943A967049D873712AFFA5F65A56.htm 3500000 https://images10.gaadicdn.com/usedcar_image/320x240webp/2021/used_car_2202297_1615013237.webp 4 1 4611 True 674 0 0 {'address': 'NO 19, Near Traffic Signal, Byatanarayanapura, International Airport Road, Byatarayanapura, Bangalore, Karnataka 560085', 'lat': 13.0669588, 'lng': 77.5928756} Dealer 414 False
4 nan 2015 Sedans Diesel 80,000 0 https://stimg.cardekho.com/pwa/img/noimage.svg 12.5 Lakh Skoda Octavia Elegance 2.0 TDI AT 1 3002709 156E5F2317C0A3A3BF8C03FFC35D404C 1 Skoda Skoda Octavia Skoda Octavia 2013-2017 Elegance 2.0 TDI AT Bangalore /used-car-details/used-Skoda-Octavia-2013-2017-Elegance-2.0-TDI-AT-cars-Bangalore_156E5F2317C0A3A3BF8C03FFC35D404C.htm 1250000 5 1 3092 True 947 0 0 {'lat': 0, 'lng': 0} Individual 332 False

Alternatively, if you want the dictionaries in seller_location expanded into columns, you can load the data with df = pd.json_normalize(cars).
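
To give a rough idea of what that flattening does, here is a pure-Python sketch. It is not how pandas implements it; flatten is a hypothetical helper that only mimics the default behaviour of naming nested keys with a dot separator.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], parent: str = "", sep: str = ".") -> Dict[str, Any]:
    """Turn nested dicts into a single-level dict with dotted key names."""
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            # Recurse into nested dicts such as seller_location
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


car = {
    "vid": "Mahindra XUV500 2011-2015 W6 2WD",
    "seller_location": {"lat": 12.931, "lng": 77.6228},
}
print(flatten(car))
# {'vid': 'Mahindra XUV500 2011-2015 W6 2WD', 'seller_location.lat': 12.931, 'seller_location.lng': 77.6228}
```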

You can save all the data to a CSV file: df.to_csv('output.csv')
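
If all you need is the title and the link that the question asked for, something like the following should do. It is a minimal sketch that assumes vlink is a path relative to the site root (it starts with a slash in the output above); BASE_URL and title_and_link are names made up for this example.

```python
from typing import Dict, Tuple
from urllib.parse import urljoin

# Assumption: vlink values are relative to the site root
BASE_URL = "https://www.cardekho.com"


def title_and_link(car: Dict) -> Tuple[str, str]:
    """Return (title, absolute URL) for one car dict from the cars list."""
    title = car.get("vid", "")
    link = urljoin(BASE_URL, car.get("vlink", ""))
    return title, link


sample = {
    "vid": "Mahindra XUV500 2011-2015 W6 2WD",
    "vlink": "/used-car-details/used-Mahindra-XUV500-2011-2015-W6-2WD-cars-Bangalore_9509A09F1673FE2566DF59EC54AAC05B.htm",
}
print(title_and_link(sample))
```

With the full list you would call it as [title_and_link(car) for car in cars].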

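【Comments】:

  • What an effort, sir! Thank you very much for your help. I just have a question about some of the code. I don't think I understood the following lines.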
  • s = soup.find('script', {"type":"application/ld+json"}).next_sibling | data_site = json.loads(check_output(['node','temp.js']))
  • r = requests.get(f"cardekho.com/api/v1/usedcar/search? .... , headers={'User-Agent': 'Mozilla/5.0'}) .
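  • I have added some extra explanation. check_output is a hack for extracting the JSON embedded in the page (see here). From then on, as you scroll the site, the data is loaded from the API URL.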
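
If you would rather not depend on Node.js for the check_output step, one possible alternative is to parse the embedded object directly with the standard json module. This is a sketch under a strong assumption: that the value assigned to window.__INITIAL_STATE__ is strict JSON (no functions, undefined or other JavaScript-only syntax). extract_initial_state is a hypothetical helper and has not been run against the live page.

```python
import json

MARKER = "window.__INITIAL_STATE__"


def extract_initial_state(script_text: str, marker: str = MARKER) -> dict:
    """Parse the JSON object assigned to `marker` inside a script's text.

    Assumes the assigned value is strict JSON; raises ValueError if the
    marker or an opening brace is missing, or if the value is not valid JSON.
    """
    start = script_text.index(marker) + len(marker)
    brace = script_text.index("{", start)  # first '{' after the marker
    # raw_decode parses one JSON value and ignores whatever follows it
    state, _end = json.JSONDecoder().raw_decode(script_text, brace)
    return state


sample = 'window.__INITIAL_STATE__ = {"items": [{"vid": "A"}], "total_count": 2410}; window.foo = 1;'
state = extract_initial_state(sample)
print(state["total_count"], len(state["items"]))
```

In the answer's code you would pass s.text to this function in place of writing temp.js and calling node.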