【Question title】: Beautifulsoup extract <li> and <ul> tags and write results to CSV
【Posted】: 2018-01-25 03:53:27
【Question description】:

I am trying to extract the text inside every row ('li') from the following:

<ul id="tco_detail_data">
        <li>
            <ul class="list-title">
                <li class="first"> </li>
                <li>Year 1</li>
                <li>Year 2</li>
                <li>Year 3</li>
                <li>Year 4</li>
                <li>Year 5</li>
                <li class="last">5 Yr Total</li>
            </ul>
        </li>
        <hr class="loose-dotted" />
        <li class="first">
            <ul class="first">
                <li class="first">Depreciation</li>
                <li>$5,390</li>
                <li>$1,658</li>
                <li>$1,459</li>
                <li>$1,293</li>
                <li>$1,161</li>
                <li class="last">$10,961</li>
            </ul>
        </li>
        <hr class="loose-dotted" />
        <li>
            <ul>
                <li class="first">Taxes &amp; Fees</li>
                <li>$1,424</li>
                <li>$61</li>
                <li>$61</li>
                <li>$61</li>
                <li>$61</li>
                <li class="last">$1,668</li>
            </ul>
        </li>
        <hr class="loose-dotted" />
        <li>
            <ul>
                <li class="first">Financing</li>
                <li>$1,022</li>
                <li>$817</li>
                <li>$603</li>
                <li>$375</li>
                <li>$135</li>
                <li class="last">$2,952</li>
            </ul>

To get to this point, I used the following:

import requests
from bs4 import BeautifulSoup
import csv
page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/')
soup = BeautifulSoup(page.content, 'html.parser')
data = soup.find_all("ul", {"id": "tco_detail_data"})

Now, to extract all the rows under class="first", I used:

details = soup.find_all("li", {"class":"first"})

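However, it only picks up the first parent li tag and the child li tags beneath it. How can I repeat the process to select every li class "first" section and write the results to CSV? I would appreciate any guidance.

【Question discussion】:

    Tags: python python-3.x web-scraping beautifulsoup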


    【Solution 1】:

    Here is an approach similar to the previous answer; it gives you the table from the web page as a nested list (i.e. [[table row], [table row], ...]):

    data = soup.find_all("ul", {"id": "tco_detail_data"})
    
    # get all list elements
    lis = data[0].find_all('li')
    
    # add a helper lambda, just for readability
    find_ul = lambda x: x.find_all('ul')
    uls = [find_ul(elem) for elem in lis if find_ul(elem) != []]
    
    # use a nested list comprehension to iterate over the <ul> tags
    # and extract text from each <li> into sublists
    # (on Python 3, keep the text as str: encoding to bytes would make
    # csv.writer emit b'...' literals)
    text = [[li.text for li in ul[0].find_all('li')] for ul in uls]
    
    # [
    #   ['\xa0', 'Year 1', 'Year 2', 'Year 3', 'Year 4', 'Year 5', '5 Yr Total'],
    #   ['Depreciation', '$4,853', '$1,658', '$1,459', '$1,293', '$1,161', '$10,424'],
    #   ['Taxes & Fees', '$2,057', '$21', '$66', '$21', '$66', '$2,231'],
    #   ['Financing', '$1,026', '$821', '$605', '$376', '$136', '$2,964'],
    #   ['Fuel', '$1,606', '$1,654', '$1,704', '$1,755', '$1,808', '$8,527'],
    #   ['Insurance', '$764', '$791', '$818', '$847', '$877', '$4,097'],
    #   ['Maintenance', '$230', '$601', '$385', '$1,653', '$1,504', '$4,373'],
    #   ['Repairs', '$0', '$0', '$109', '$257', '$374', '$740'],
    #   ['Tax Credit', '$0', '', '', '', '', '$0'],
    #   ['True Cost to Own ®', '$10,536', '$5,546', '$5,146', '$6,202', '$5,926', '$33,356']
    # ]
    
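    # write "text" list to csv
    # newline='' stops the csv module from emitting blank lines on Windows
    with open('ford_escape_2017.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerows(text)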
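
As a possible alternative to the helper lambda, the same nested list can probably be built with a CSS child selector. The sketch below is only an illustration: it parses a trimmed copy of the markup from the question instead of fetching the live page, `extract_rows` is a hypothetical helper name, and it writes to an in-memory buffer rather than a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# A trimmed copy of the markup from the question, so no network access is needed.
SAMPLE_HTML = """
<ul id="tco_detail_data">
  <li>
    <ul class="list-title">
      <li class="first">&nbsp;</li>
      <li>Year 1</li>
      <li class="last">5 Yr Total</li>
    </ul>
  </li>
  <hr class="loose-dotted" />
  <li class="first">
    <ul class="first">
      <li class="first">Depreciation</li>
      <li>$5,390</li>
      <li class="last">$10,961</li>
    </ul>
  </li>
</ul>
"""


def extract_rows(html):
    """Return one list of cell strings per inner <ul> (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # "#tco_detail_data > li > ul" matches only the inner <ul> of each outer <li>.
    for ul in soup.select("#tco_detail_data > li > ul"):
        # recursive=False keeps only the direct <li> children of that <ul>.
        cells = ul.find_all("li", recursive=False)
        rows.append([cell.get_text(strip=True) for cell in cells])
    return rows


rows = extract_rows(SAMPLE_HTML)

# csv.writer quotes fields that contain commas, such as "$5,390".
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

Because `get_text(strip=True)` trims surrounding whitespace, the non-breaking space in the header's first cell becomes an empty string, which leaves a blank top-left cell in the CSV.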
    

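    【Discussion】:

    • Even though I didn't provide sample output, this is exactly what I was looking for. Both the approach Ali provided and the one you provided work well. Thanks for your help!
    • How can I write this to CSV so that it looks like a table rather than a list?
    • @CarlosH The easiest way is still to store the results in a list and then write that list to CSV; please see my updated answer.
    【Solution 2】:

    I am not sure my output is what you had in mind, since you did not provide sample output.

    Code: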

    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/').text
    soup = BeautifulSoup(page, 'html.parser')
    uls = soup.find_all('ul', id='tco_detail_data')
    for ul in uls:
        newsoup = BeautifulSoup(str(ul), 'html.parser')
        lis = newsoup.find_all('li')
        for li in lis:
            print(li.text)
    

    Output:

    Year 1
    Year 2
    Year 3
    Year 4
    Year 5
    5 Yr Total
    
    
     
    Year 1
    Year 2
    Year 3
    Year 4
    Year 5
    5 Yr Total
    
    
    Depreciation
    $5,219
    $1,658
    $1,459
    $1,293
    $1,161
    $10,790
    
    
    Depreciation
    $5,219
    $1,658
    $1,459
    $1,293
    $1,161
    $10,790
    
    
    Taxes & Fees
    $2,257
    $195
    $184
    $175
    $166
    $2,977
    
    
    Taxes & Fees
    $2,257
    $195
    $184
    $175
    $166
    $2,977
    
    
    Financing
    $1,051
    $842
    $620
    $386
    $139
    $3,038
    
    
    Financing
    $1,051
    $842
    $620
    $386
    $139
    $3,038
    
    
    Fuel
    $1,906
    $1,963
    $2,022
    $2,083
    $2,146
    $10,120
    
    
    Fuel
    $1,906
    $1,963
    $2,022
    $2,083
    $2,146
    $10,120
    
    
    Insurance
    $1,160
    $1,201
    $1,243
    $1,286
    $1,331
    $6,221
    
    
    Insurance
    $1,160
    $1,201
    $1,243
    $1,286
    $1,331
    $6,221
    
    
    Maintenance
    $274
    $716
    $447
    $1,849
    $1,637
    $4,923
    
    
    Maintenance
    $274
    $716
    $447
    $1,849
    $1,637
    $4,923
    
    
    Repairs
    $0
    $0
    $134
    $318
    $465
    $917
    
    
    Repairs
    $0
    $0
    $134
    $318
    $465
    $917
    
    
    Tax Credit
    $0
    
    
    
    
    $0
    
    
    Tax Credit
    $0
    
    
    
    
    $0
    
    
    True Cost to Own ®
    $11,867
    $6,575
    $6,109
    $7,390
    $7,045
    $38,986
    
    
    True Cost to Own ®
    $11,867
    $6,575
    $6,109
    $7,390
    $7,045
    $38,986
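
Each row shows up twice in the output above, most likely because `find_all('li')` searches recursively: it returns the outer `<li>` (whose text is the whole row) and then each inner `<li>` again. A minimal sketch of the difference, using a one-row sample written for illustration rather than the live page:

```python
from bs4 import BeautifulSoup

# One row in the same shape as the question's markup (illustrative sample).
SAMPLE_HTML = (
    '<ul id="tco_detail_data"><li><ul>'
    '<li class="first">Financing</li><li>$1,022</li><li class="last">$2,952</li>'
    '</ul></li></ul>'
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
outer = soup.find("ul", id="tco_detail_data")

# Recursive search: the outer <li> plus the three inner ones.
all_items = outer.find_all("li")

# Direct children only: just the outer <li>.
direct_items = outer.find_all("li", recursive=False)

# Keep only the <li> tags that do not wrap another <ul>, i.e. the actual cells.
leaf_items = [li for li in all_items if li.find("ul") is None]

print(len(all_items), len(direct_items))
print([li.get_text() for li in leaf_items])
```

Filtering down to the leaf items (or selecting the inner `<ul>` tags first) is one way to avoid printing every row twice.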
    

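    To save the results to a CSV file, I used cmaher's answer, since it helps with creating the CSV file. My code only gives you the text between all the li tags. Note that I used a pipe rather than a comma as the delimiter in the CSV file, because the data contains commas.

    Code: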

    import requests
    from bs4 import BeautifulSoup
    
    page = requests.get('https://www.edmunds.com/ford/escape/2017/cost-to-own/').text
    soup = BeautifulSoup(page, 'html.parser')
    data = soup.find_all("ul", {"id": "tco_detail_data"})
    lis = data[0].find_all('li')
    find_ul = lambda x: x.find_all('ul')
    uls = [find_ul(elem) for elem in lis if find_ul(elem) != []]
    text = [[li for li in ul[0].find_all('li')] for ul in uls]
    with open('csvfile.csv', 'w') as file:
        for lis in text:
            temp = ''
            for li in lis:
                temp += li.text + '|'
            temp += '\n'
            file.write(temp)
    

    Output:

     |Year 1|Year 2|Year 3|Year 4|Year 5|5 Yr Total|
    Depreciation|$5,219|$1,658|$1,459|$1,293|$1,161|$10,790|
    Taxes & Fees|$2,257|$195|$184|$175|$166|$2,977|
    Financing|$1,051|$842|$620|$386|$139|$3,038|
    Fuel|$1,906|$1,963|$2,022|$2,083|$2,146|$10,120|
    Insurance|$1,160|$1,201|$1,243|$1,286|$1,331|$6,221|
    Maintenance|$274|$716|$447|$1,849|$1,637|$4,923|
    Repairs|$0|$0|$134|$318|$465|$917|
    Tax Credit|$0|||||$0|
    True Cost to Own ®|$11,867|$6,575|$6,109|$7,390|$7,045|$38,986|
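
The commas in the data do not strictly force a different delimiter: Python's `csv` module quotes any field that contains the delimiter. The sketch below uses two rows copied from the output above and in-memory buffers (an assumption made so that it runs without the website) to show both the default comma dialect and a pipe-delimited variant without the trailing separator.

```python
import csv
import io

# Two rows copied from the output above; the values contain commas.
rows = [
    ["", "Year 1", "Year 2", "Year 3", "Year 4", "Year 5", "5 Yr Total"],
    ["Depreciation", "$5,219", "$1,658", "$1,459", "$1,293", "$1,161", "$10,790"],
]

# Default dialect: fields containing the delimiter are wrapped in quotes.
comma_buffer = io.StringIO()
csv.writer(comma_buffer).writerows(rows)

# Same rows with a pipe delimiter; no trailing separator is written.
pipe_buffer = io.StringIO()
csv.writer(pipe_buffer, delimiter="|").writerows(rows)

# Reading the comma version back recovers the original cells.
comma_buffer.seek(0)
round_trip = list(csv.reader(comma_buffer))

print(comma_buffer.getvalue())
print(pipe_buffer.getvalue())
```

Spreadsheet programs generally open the comma-delimited version as a table with the years as column headers, which seems to be what the follow-up comments ask for.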
    

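    【Discussion】:

    • I think I forgot to mention this; I was too focused on what you had just done. I figured that if I saved it as CSV it would look like a list. How can I get the output as a string so that the data is listed in columns underneath, with the years as headers?
    • I just added a solution for retrieving the results as CSV. @CarlosH please edit your question to include the requirement of getting the results in CSV format.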