【Question Title】: How to extract data in columns from page using soup
【Posted】: 2026-01-13 02:40:01
【Question Description】:

Trying to capture the data shown in the bullet points.

Link: https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/

The data needs to be extracted here using XPath.

Data to extract:

    4 Door Sedan

    4 Cylinder, 1.8 Litre

    Constantly Variable Transmission, Front Wheel Drive

    Petrol - Unleaded ULP

    6.4 L/100km 

Tried this:

import requests
from lxml import html
from bs4 import BeautifulSoup


cars = []

urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/']

for url in urls:
    car_data = {}
    headers = {'User-Agent': 'Mozilla/5.0'}
    page = requests.get(url, headers=headers)
    tree = html.fromstring(page.content)
    # xpath() returns a list of elements; take the first match's text
    # rather than storing the element object itself.
    nodes = tree.xpath('/html/body/div[1]/div[2]/div/div[1]/div[1]/div[4]/div/div')
    if nodes:
        car_data["namings"] = nodes[0].text_content().strip()
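For reference, a relative XPath keyed to the spec list's markup is usually less brittle than an absolute `/html/body/...` path, which breaks on any layout change. A minimal sketch against an inline sample fragment (the `micro-spec` class name and the `dd` layout are assumptions about the real page, which may change):

```python
from lxml import html

# Inline sample standing in for the live page, so the sketch runs offline.
# The "micro-spec" class name is an assumption about the real markup.
sample = """
<div class="micro-spec"><div class="columns">
  <dd>4 Door Sedan</dd>
  <dd>4 Cylinder, 1.8 Litre</dd>
</div></div>
"""
tree = html.fromstring(sample)
# Relative XPath: find the spec container by class, then all <dd> values.
values = [dd.text_content().strip()
          for dd in tree.xpath('//div[@class="micro-spec"]//dd')]
print(values)
```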


【Question Discussion】:

    Tags: python beautifulsoup request python-requests


    【Solution 1】:

    You have already imported BeautifulSoup, so why not use a CSS class selector?

    import requests
    from bs4 import BeautifulSoup as bs
    
    r = requests.get('https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/', headers = {'User-Agent':'Mozilla/5.0'})
    soup = bs(r.content, 'lxml')
    info = [i.text.strip() for i in soup.select('.dgi-')]
    

    You can also print it like this:

    for i in soup.select('.dgi-'):
        print(i.text.strip())
    

    【Discussion】:

    • What if I need to split the output into 5 parts? Like doors = 4 Door Sedan, body = 4 Cylinder, 1.8 Litre
    • Slice the output. It is a list.
    • I added an edit so you can see how to print without the list comprehension.
    • How do I assign each element of the list to a variable? I tried doing it like this: info = [i.text.strip() for i in soup.select('.dgi-')] car_data['0']=info.split(" ")[0] car_data['1']=info.split(" ")[1] car_data['2']=info.split(" ")[2] car_data['3']=info.split(" ")[3]
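The split approach in the comment above fails because `info` is already a list, not a string; the elements can be indexed directly, or zipped with labels into a dict. A minimal sketch (the key names are assumptions chosen for illustration; the values are hard-coded from the page):

```python
# Values as scraped from the page; labels are illustrative assumptions.
info = ['4 Door Sedan',
        '4 Cylinder, 1.8 Litre',
        'Constantly Variable Transmission, Front Wheel Drive',
        'Petrol - Unleaded ULP',
        '6.4 L/100km']
labels = ['doors', 'engine', 'transmission', 'fuel', 'consumption']

# Pair each label with the value at the same position.
car_data = dict(zip(labels, info))
print(car_data['doors'])  # 4 Door Sedan
```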
    【Solution 2】:
    • find_all() - returns a collection of matching elements.
    • strip() - Python's built-in string method that removes all leading and trailing whitespace.

    For example:

    import requests
    from bs4 import BeautifulSoup
    
    cars = []
    urls = ['https://www.redbook.com.au/cars/details/2019-honda-civic-50-years-edition-auto-my19/SPOT-ITM-524208/']
    
    for url in urls:
        car_data=[]
        headers = {'User-Agent':'Mozilla/5.0'}
        page = (requests.get(url, headers=headers))
        soup = BeautifulSoup(page.content,'lxml')
        car_obj = soup.find("div",{'class':'r-center-pane'}).find("div",\
                        {'class':'micro-spec'}).find("div",{'class':'columns'}).find_all("dd")
    
        for x in car_obj:
            text = x.text.strip()
            if text != "":
                car_data.append(text)
        cars.append(car_data)
    
    print(cars)
    

    Output:

    [['4 Door Sedan', '4 Cylinder, 1.8 Litre', 'Constantly Variable Transmission,
    Front Wheel Drive', 'Petrol - Unleaded ULP', '6.4 L/100km']]
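Since `find_all("dd")` collects only the values, the matching `<dt>` labels can be zipped in to build a labelled dict instead of a plain list. A minimal sketch against an inline sample (the exact dt/dd markup of the live page is an assumption; verify it against the site):

```python
from bs4 import BeautifulSoup

# Inline sample mimicking a dt/dd spec list, so the sketch runs offline.
sample = """
<dl>
  <dt>Body</dt><dd>4 Door Sedan</dd>
  <dt>Engine</dt><dd>4 Cylinder, 1.8 Litre</dd>
  <dt>Fuel</dt><dd>Petrol - Unleaded ULP</dd>
</dl>
"""
soup = BeautifulSoup(sample, 'html.parser')
# Pair each <dt> label with the <dd> value that follows it.
spec = {dt.text.strip(): dd.text.strip()
        for dt, dd in zip(soup.find_all('dt'), soup.find_all('dd'))}
print(spec['Body'])  # 4 Door Sedan
```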
    

    【Discussion】: