【Question title】: Extract data from HTML using Python
【Posted】: 2020-01-01 13:03:05
【Question】:

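I have the following sample HTML block. From each block, I want to extract the "alt" and "Author" using Python and BeautifulSoup. I have already parsed the HTML with BeautifulSoup. Can anyone help with writing the script?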

    <div class="row m-0">
        <div class="col-12 d-flex flex-column justify-content-center text-center wow fadeIn" data-wow-delay="0.2s">
            <h5 class="text-white alt-font font-weight-400 letter-spacing-1 margin-10px-bottom">INSPIRATIONAL QUOTES</h5>
            <span class="text-white-2 opacity8 alt-font mb-0 padding-20px-bottom">Find the perfect quote... and Pass It On®</span>

            <form class="search-box2 margin-30px-bottom" action="/inspirational-quotes" method="get">
                <div class="input-group add-on width-75 mx-auto sm-width-100">
                    <input name="q" type="text" value='' placeholder="Search our collection of inspiring quotes..." class="form-control" />
                    <div class="input-group-append">
                        <button type="submit" class="btn btn-default"><i class="ti-search text-small m-0"></i></button>
                    </div>
                </div>
            </form>
        </div>
    </div>

    <div class='row' id='all_quotes'>
        <div class="col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top">

    <a href="/inspirational-quotes/7848-i-say-to-myself-that-i-shall-try-to-make-my"><img alt="I say to myself that I shall try to make my life like an open fireplace, so that people may be warmed and cheered by it and so go out themselves to warm and cheer. #&lt;Author:0x00007fde720f6b28&gt;" class="margin-10px-bottom shadow" src="https://assets.passiton.com/quotes/quote_artwork/7848/medium/20191231_tuesday_quote.jpg?1577388768" width="310" height="310" /></a>
    <h5 class='value_on_red'><a href="/inspirational-quotes/7848-i-say-to-myself-that-i-shall-try-to-make-my">CHEER</a></h5>

    <a href="/inspirational-quotes/7849-the-unselfish-effort-to-bring-cheer-to-others"><img alt="The unselfish effort to bring cheer to others will be the beginning of a happier life for ourselves. #&lt;Author:0x00007fde721154d8&gt;" class="margin-10px-bottom shadow" src="https://assets.passiton.com/quotes/quote_artwork/7849/medium/20191230_monday_quote.jpg?1577388731" width="310" height="310" /></a>
    <h5 class='value_on_red'><a href="/inspirational-quotes/7849-the-unselfish-effort-to-bring-cheer-to-others">CHEER</a></h5>

    <a href="/inspirational-quotes/8027-there-is-no-mistaking-love-it-is-the-common"><img alt="There is no mistaking love. It is the common fiber of life, the flame that heats our soul, energizes our spirit and supplies passion to our lives. #&lt;Author:0x00007fde7213df28&gt;" class="margin-10px-bottom shadow" src="https://assets.passiton.com/quotes/quote_artwork/8027/medium/20191226_thursday_quote.jpg?1576706550" width="310" height="310" /></a>
    <h5 class='value_on_red'><a href="/inspirational-quotes/8027-there-is-no-mistaking-love-it-is-the-common">LOVE</a></h5>

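【Comments】:

  • Please include the exact HTML block in the question rather than a link.
  • Devesh - I have copied in the source code; let me know if that helps.
  • Where is your own Python coding attempt?

Tags: html python-3.x web-scraping beautifulsoup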


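【Solution 1】:

This should do it: the Python code searches your HTML for img tags, and it also works when the HTML contains multiple img tags. The Author part is found by splitting the alt string in two (I use the # character as the separator). I hope this helps.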

    import requests
    from bs4 import BeautifulSoup

    url = "http://values.com/inspirational-quotes"
    r = requests.get(url).text
    soup = BeautifulSoup(r, 'html.parser')
    table = soup.find_all('img')

    for image in table:
        # Use .get() so an <img> without an alt attribute does not raise KeyError
        alt_table = image.get('alt', '').split('#')
        # Check the length to prevent an IndexError if no Author is found
        if len(alt_table) > 1:
            alt = alt_table[0]
            author = alt_table[1]
            print("Alt: '{}'\nAuthor: '{}'\n".format(alt, author))
        else:
            alt = alt_table[0]
            print("Only found alt. Alt: '{}'\n".format(alt))
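The same idea can be tried offline, without fetching the page. The sketch below is only an illustration: the function name `extract_quotes`, the ` #<Author:` separator, and the two extra `<img>` tags are assumptions made for this example, and the first `<img>` is copied from the question. Note that in the sample HTML the text after `#` looks like an object placeholder (`<Author:0x...>`) rather than an author's name, so that marker is all that can be recovered from `alt`.

```python
from bs4 import BeautifulSoup

# First <img> is copied from the question; the other two are hypothetical
# additions to show what happens without an author marker or without alt.
SAMPLE_HTML = """
<div class='row' id='all_quotes'>
  <a href="/inspirational-quotes/7849-the-unselfish-effort-to-bring-cheer-to-others"><img alt="The unselfish effort to bring cheer to others will be the beginning of a happier life for ourselves. #&lt;Author:0x00007fde721154d8&gt;" class="margin-10px-bottom shadow" width="310" height="310" /></a>
  <img alt="Logo" src="logo.png" />
  <img src="no-alt.png" />
</div>
"""


def extract_quotes(html):
    """Return (quote, author_marker) pairs for every <img> that has an alt."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # alt=True keeps only <img> tags that actually carry an alt attribute.
    for img in soup.find_all("img", alt=True):
        # partition() never raises; sep is '' when the separator is absent.
        quote, sep, author = img["alt"].partition(" #<Author:")
        results.append((quote.strip(), author.rstrip(">") if sep else None))
    return results


for quote, author in extract_quotes(SAMPLE_HTML):
    print(f"Quote: {quote!r}\nAuthor marker: {author!r}\n")
```

Using `partition` instead of `split` plus an index means a missing separator yields `None` for the author marker rather than an `IndexError`.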

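【Comments】:

  • BeautifulSoup is the way to go.
  • Thanks, sp4c. It works, but only partly. I get a value for alt = alt_author_list[0], but when I run author = alt_author_list[1] it raises an "index out of range" error.
  • OK, that is strange to me. What do you get for alt_author_list?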
  • URL = "http://www.values.com/inspirational-quotes"
    r = requests.get(URL)
    soup = BeautifulSoup(r.content,'html5lib')
    table = soup.findAll('img')
    for image in table:
        alt_table = image.attrs['alt'].split('#&amp;lt')
        alt = alt_table[0]
        author = alt_table[1]
        print('Alt: \'{}\'\nAuthor: \'{}\'\n'.format(alt,author))
  • OK, when I actually run your code I do get IndexError: list index out of range.
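
One likely cause of the IndexError in the comments, sketched under the assumption that the page source matches the HTML in the question: BeautifulSoup decodes character entities in attribute values, so the parsed `alt` contains `#<Author`, not the entity text that appears in the raw source. Splitting on an entity string such as `#&lt` therefore finds nothing and returns a one-element list, and indexing `[1]` fails. The variable names below are illustrative only.

```python
from bs4 import BeautifulSoup

# One <img> from the question, reduced to its alt attribute.
RAW = ('<img alt="There is no mistaking love. It is the common fiber of life, '
       'the flame that heats our soul, energizes our spirit and supplies '
       'passion to our lives. #&lt;Author:0x00007fde7213df28&gt;" />')

alt = BeautifulSoup(RAW, "html.parser").img["alt"]

by_entity = alt.split("#&lt")  # separator as written in the page source
by_char = alt.split("#<")      # separator as it appears after parsing

print(len(by_entity))  # 1 -> by_entity[1] would raise IndexError
print(len(by_char))    # 2 -> by_char[1] holds the Author marker
```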