【Question title】: Use BeautifulSoup to parse the string from within <span class="foobar">text_I_want</span>
【Posted】: 2021-08-09 22:46:54
【Question】:

I am trying to parse a line in the following format:

<span class="foobar">text_I_want</span>

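How can I access only "text_I_want"?

Perhaps I should be doing something differently at an earlier step of parsing with BeautifulSoup. Originally, I had the following: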

<div class="commit_item">
<span class="commit_id"><a href="/commit/944bd962177fd1444b2e6282ec808402bb9e3fa6/">944bd962177f</a></span>
<span class="commit_summary">
<span class="commit_subject">mm/memory-failure: make sure wait for page writeback in memory_failure</span>
<span class="commit_date">2021-08-02</span>
<span class="commit_author">Rafael Aquini</span>
</span>
<span class="commit_link">
<a class="tree_link" href="/commit/e8675d291ac007e1c636870db880f837a9ea112a/"><img alt="" class="tree_icon" src="/static/gitrepo/tux.svg"/> <span class="tree_name">linux</span></a>
</span>
</div>

To parse this, I did the following:

for commit in soup.find_all('div', {"class": "commit_item"}):
    print(commit)
    url = commit.find('span', {"class": "commit_id"})
    subject = commit.find('span', {"class": "commit_subject"})
    date = commit.find('span', {"class": "commit_date"})
    author = commit.find('span', {"class": "commit_author"})
    commit_link = commit.find('span', {"class": "commit_link"})

However, I am now struggling to get only the text inside each of these tags.

【Comments】:

    Tags: python html parsing beautifulsoup python-requests


    【Solution 1】:
    from bs4 import BeautifulSoup
    from pprint import pp
    
    html = '''<div class="commit_item">
    <span class="commit_id"><a href="/commit/944bd962177fd1444b2e6282ec808402bb9e3fa6/">944bd962177f</a></span>
    <span class="commit_summary">
    <span class="commit_subject">mm/memory-failure: make sure wait for page writeback in memory_failure</span>
    <span class="commit_date">2021-08-02</span>
    <span class="commit_author">Rafael Aquini</span>
    </span>
    <span class="commit_link">
    <a class="tree_link" href="/commit/e8675d291ac007e1c636870db880f837a9ea112a/"><img alt="" class="tree_icon" src="/static/gitrepo/tux.svg"/> <span class="tree_name">linux</span></a>
    </span>
    </div>
    '''
    
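    # The 'lxml' parser needs the lxml package; 'html.parser' is the built-in alternative.
    soup = BeautifulSoup(html, 'lxml')
    
    # First element with class "commit_item".
    goal = soup.select_one('.commit_item')
    # All non-empty text fragments in document order:
    # [commit id, subject, date, author, tree name]
    data = list(goal.stripped_strings)
    goal = {
        'Url': goal.a['href'],  # href of the first <a> inside the div
        'Subject': data[1],
        'Author': data[-2],
        'Date': data[-3]
    }
    
    pp(goal)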
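
The answer above picks fields by their position in `stripped_strings`, which depends on the order of the text fragments. As a complement, here is a minimal sketch that looks up each `<span>` by class and reads its text with `get_text(strip=True)`, which is also the direct way to get "text_I_want" from the one-line example in the question. The helper name `span_text`, the `commits` list, and the use of the built-in `html.parser` are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# The one-line case from the question: read the text inside the span.
one_line = BeautifulSoup('<span class="foobar">text_I_want</span>', 'html.parser')
text_i_want = one_line.find('span', class_='foobar').get_text(strip=True)

html = '''<div class="commit_item">
<span class="commit_id"><a href="/commit/944bd962177fd1444b2e6282ec808402bb9e3fa6/">944bd962177f</a></span>
<span class="commit_summary">
<span class="commit_subject">mm/memory-failure: make sure wait for page writeback in memory_failure</span>
<span class="commit_date">2021-08-02</span>
<span class="commit_author">Rafael Aquini</span>
</span>
</div>'''

soup = BeautifulSoup(html, 'html.parser')


def span_text(parent, css_class):
    """Hypothetical helper: stripped text of the first matching <span>, or None."""
    tag = parent.find('span', class_=css_class)
    return tag.get_text(strip=True) if tag is not None else None


commits = []
for commit in soup.find_all('div', class_='commit_item'):
    # The commit URL lives on the <a> inside the commit_id span.
    link = commit.find('span', class_='commit_id').a
    commits.append({
        'url': link['href'],
        'subject': span_text(commit, 'commit_subject'),
        'date': span_text(commit, 'commit_date'),
        'author': span_text(commit, 'commit_author'),
    })

print(text_i_want)
print(commits)
```

Looking fields up by class keeps working if a span is missing or the order changes, at the cost of one `find` call per field.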
    

    【Discussion】:
