【Question title】: Parsing BeautifulSoup html tag
【Posted】: 2014-12-20 15:20:07
【Question description】:

I need to parse an HTML file with BeautifulSoup. The HTML looks like this:

    <div class="entry_container">

       <div class="entry lang_en-gb" id="turn-over_1">   
          <span class="inline">
             <h1 class="hwd">turn over</h1>
          </span>
          <div class="hom" id="turn-over_1.1">
             <span class="gramGrp"><span class="pos">intransitive verb</span></span>
             <div class="sense"><span class="bold">1 </span><span class="gramGrp"><span class="colloc"><span>[</span>person<span>]</span></span></span><span class="lbl"><span> (</span>in bed<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span class="cit" id="turn-over_1.2"><span>;   </span></span></div>

             <div class="sense"><span> <br/></span><span class="bold">2 </span><span class="gramGrp"><span class="colloc"><span>[</span>car<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">se retourner</span></span><span>, </span><span class="cit lang_fr"><span class="quote">faire un tonneau</span></span><span class="cit" id="turn-over_1.3"><span>;   </span></span></div>

             <div class="sense"><span> <br/></span><span class="bold">3 </span><span class="lbl"><span>(= </span>switch TV channels<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">changer de chaîne</span></span><span class="cit" id="turn-over_1.4"><span>;   </span></span></div>

          </div>

          <div class="hom" id="turn-over_1.5">
             <span> <br/>▶ </span><span class="gramGrp"><span class="pos">transitive verb</span></span>
             <div class="sense">
                <span class="bold">1 </span>
                <div class="sense"><span class="bold">   a </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>object<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">retourner</span></span><span class="cit" id="turn-over_1.6"><span>;   </span></span></div>

                <div class="sense"><span class="bold">   b </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>page<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">tourner</span></span></div>

                <div class="sense"><span class="bold">   c </span><span class="gramGrp"><span class="colloc"><span>[</span><span>+ </span>tape<span>]</span></span></span><span> </span><span class="cit lang_fr"><span class="quote">changer de face</span></span><span class="cit" id="turn-over_1.7"><span>;   </span></span></div>

             </div>

             <div class="sense"><span> <br/></span><span class="bold">2 </span><span class="lbl"><span>(= </span>hand over<span>)</span></span><span> </span><span class="cit lang_fr"><span class="quote">remettre</span></span><span class="cit" id="turn-over_1.8"><span>;   </span></span><span class="cit" id="turn-over_1.9"><span>;   </span></span></div>

          </div>      
       </div>

    </div>

For each div class="hom", I need to retrieve the part of speech (span class="pos") and the senses (each <div class="sense">).

The parsed result might look like this:

So far, I have tried the following code:

    for gramGrp in entryContentHTML.find_all('div',attrs={"class":u"hom"}):
      for pos in gramGrp.find('span',attrs={"class":u"gramGrp"}).find('span',attrs={"class":u"pos"}):
        print pos

But the output is:

    intransitive verb
    intransitive verb
    transitive verb

【Question discussion】:

    Tags: python beautifulsoup


    【Solution 1】:

    You will have to tidy up the output, but this gets you what you need:

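    from bs4 import BeautifulSoup

    # Name the parser explicitly; html.parser ships with Python.
    soup = BeautifulSoup(html, "html.parser")

    # For each div.hom, strip every line of its text and drop the ";" separators.
    res = ["\n".join(s.strip() for s in x.text.splitlines()).replace(";", "")
           for x in soup.find_all("div", {"class": "hom"})]
    print("\n".join(res))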
    
    
    intransitive verb
    1 [person] (in bed) se retourner
    2 [car] se retourner, faire un tonneau
    3 (= switch TV channels) changer de chaîne
    
    ▶ transitive verb
    
    1
    a [+ object] retourner
    b [+ page] tourner
    c [+ tape] changer de face
    
    2 (= hand over) remettre
    

    【Discussion】:
