【Question Title】: Get text between span with BeautifulSoup
【Posted】: 2019-03-09 06:01:53
【Question】:

I am trying to scrape various websites using BeautifulSoup in Python. Suppose I have the following HTML excerpt:

<div class="member_biography">
<h3>Biography</h3>
<span class="sub_heading">District:</span> AnyState - At Large<br/>
<span class="sub_heading">Political Highlights:</span> AnyTown City Council, 19XX-XX<br/>
<span class="sub_heading">Born:</span> June X, 19XX; AnyTown, Calif.<br/>
<span class="sub_heading">Residence:</span> Some Town<br/>
<span class="sub_heading">Religion:</span> Episcopalian<br/>
<span class="sub_heading">Family:</span> Wife, Some Name; two children<br/>
<span class="sub_heading">Education:</span> Some State College, A.A. 19XX; Some Other State College, B.A. 19XX<br/>
<span class="sub_heading">Elected:</span> 19XX<br/>
</div>

I need the result in the following format:

District:              AnyState - At Large
Political Highlights:  AnyTown City Council, 19XX-XX
Born:                  June X, 19XX; AnyTown, Calif.
Residence:             Some Town
Religion:              Episcopalian
Family:                Wife, Some Name; two children
Education:             Some State College, A.A. 19XX; Some Other State College, B.A. 19XX
Elected:               19XX

However, so far I have only managed to get the following:

District:
Political Highlights:
Born:
Residence:
Religion:
Family:
Education:
Elected:

using this code:

import urllib.request
import sys
from bs4 import BeautifulSoup

def main(url):
    fp = urllib.request.urlopen(url)
    site_bytearray = fp.read()
    fp.close()

    #bs_data = BeautifulSoup(site_str,features="html.parser")
    bs_data = BeautifulSoup(site_bytearray,'lxml')
    tmplist = bs_data.find_all('span',{'class':'sub_heading'})
    for item in tmplist:
        print(item.text)
    sys.exit(0)

if __name__ == "__main__":
    main(sys.argv[1])

In short, how do I extract "District:" and "AnyState - At Large" from <span class="sub_heading">District:</span> AnyState - At Large<br/> and accumulate the results in a list for further processing?

【Question Comments】:

    Tags: python beautifulsoup lxml


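    【Solution 1】:

    Replace your print statement with one that also prints item.next_sibling, the text node that follows each span: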

    Python 3.6+:

    print(f'{item.text:<25} {item.next_sibling}') 
    

    Python 3 - 3.5:

    print('{:<25} {}'.format(item.text, item.next_sibling))
    

    Output:

    District:                  AnyState - At Large
    Political Highlights:      AnyTown City Council, 19XX-XX
    Born:                      June X, 19XX; AnyTown, Calif.
    Residence:                 Some Town
    Religion:                  Episcopalian
    Family:                    Wife, Some Name; two children
    Education:                 Some State College, A.A. 19XX; Some Other State College, B.A. 19XX
    Elected:                   19XX
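
The question also asks for the results to be collected in a list. A minimal sketch of that, assuming the value is always the bare text node directly after each span (the same assumption `next_sibling` relies on above). The name `extract_pairs`, the shortened sample HTML, and the choice of `html.parser` instead of `lxml` are illustrative, not part of the original answer:

```python
from bs4 import BeautifulSoup

# Shortened copy of the markup from the question.
HTML = """
<div class="member_biography">
<h3>Biography</h3>
<span class="sub_heading">District:</span> AnyState - At Large<br/>
<span class="sub_heading">Born:</span> June X, 19XX; AnyTown, Calif.<br/>
<span class="sub_heading">Elected:</span> 19XX<br/>
</div>
"""


def extract_pairs(html):
    """Return a list of (label, value) tuples, one per sub_heading span."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for span in soup.find_all("span", class_="sub_heading"):
        sibling = span.next_sibling
        # next_sibling may be a text node, another tag, or None;
        # only a text node (a str subclass) carries the value we want.
        value = sibling.strip() if isinstance(sibling, str) else ""
        pairs.append((span.get_text(strip=True), value))
    return pairs


pairs = extract_pairs(HTML)
for label, value in pairs:
    # "<25" left-aligns the label and pads it to 25 characters.
    print(f"{label:<25}{value}")
```

Stripping the value removes the leading space that the raw text node carries, so the padding width alone controls the column alignment.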
    

    【Comments】:

    • Can you point me to something that explains this? For example: the {: in the python3 example
    • Nvm, I figured it out. Thanks
    【Solution 2】:

    Have you tried using getText()? It always seems to work for me.
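
One caveat: calling `get_text()` on the span itself returns only the label, exactly like `item.text` in the question. It helps only when called on the enclosing element. A rough sketch of that reading of the suggestion, assuming each field sits on its own line in the HTML source (as in the question's excerpt) and that the label contains no colon other than the trailing one. The sample HTML and the `fields` dictionary are illustrative:

```python
from bs4 import BeautifulSoup

# Two fields from the question's markup, one per source line.
HTML = (
    '<div class="member_biography"><h3>Biography</h3>\n'
    '<span class="sub_heading">Residence:</span> Some Town<br/>\n'
    '<span class="sub_heading">Religion:</span> Episcopalian<br/>\n'
    '</div>'
)

soup = BeautifulSoup(HTML, "html.parser")

# On the span, get_text() yields the label only.
span_text = soup.find("span", class_="sub_heading").get_text()

# On the surrounding div, it yields labels and values together.
div = soup.find("div", class_="member_biography")
fields = {}
for line in div.get_text().splitlines():
    # Split at the first colon; lines without one (e.g. the heading) are skipped.
    label, sep, value = line.partition(":")
    if sep:
        fields[label.strip()] = value.strip()

print(span_text)
print(fields)
```

This approach depends on the line breaks in the source document, so the `next_sibling` technique from Solution 1 is likely the more robust choice when the markup is minified.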

    【Comments】:
