【Question title】: How to get the outer <li> tag text together with the inner <li> or other tag text using BeautifulSoup in Python
【Posted】: 2020-01-31 19:26:07
【Question description】:

I only want to output the text of the outer li tags.

  from bs4 import BeautifulSoup

  html = BeautifulSoup("""

      <ul>

            <li><a href="#">B2B Marketing</a>
                   <ul>
                        <li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
                        <li><b>Inbound AI </b>Enrich inbound leads</a></li>
                   </ul>
           </li>

           <li>Marketing Data Analysis
                   <ul>
                        <li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
                   </ul>
          </li>

          <li class="drop-down"><a href="#">Enrichment API</a>
          </li>


      </ul>

      """)

  print([i.text.strip() for i in html.findAll('li')])

The output is the entire text of the HTML content.

['B2B Marketing\n\n Campagin \nInbound AI Enrich inbound leads', 'Campagin', 'Inbound AI Enrich inbound leads', 'Marketing Data Analysis\n          \nEvent 360 AI', 'Event 360 AI', 'Enrichment API\n\nAPI  Technographics, Firmographics, Intent data', 'API  Technographics, Firmographics, Intent data']

But the output should be:

  [
   'B2B Marketing : Campagin, Enrich inbound leads',
   'Marketing Data Analysis : Event 360 AI',
   'Enrichment API'
  ]

Please help me solve this problem.

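【Comments】:

  • You say you are only interested in the text of the outer li elements, but the output you request also depends on the contents of the li elements in the nested lists.

Tags: python web-scraping beautifulsoup python-requests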


【Solution 1】:

How about this?

from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<ul>
            <li><a href="#">B2B Marketing</a>
                   <ul>
                        <li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
                        <li><b>Inbound AI </b>Enrich inbound leads</a></li>
                   </ul>
           </li>
           <li>Marketing Data Analysis
                   <ul>
                        <li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
                   </ul>
          </li>
          <li class="drop-down"><a href="#">Enrichment API</a>
          </li>
      </ul>
'''
doc = SimplifiedDoc(html)
lis = doc.ul.lis
out = []
for li in lis:
  if li.b and li.b.nextText():
    li.removeElement('b')
  name = li.firstText() if li.firstText() else li.a.text
  tmp = ''
  for l in li.lis:
    tmp += l.text+','
  if tmp:
    out.append(name+':'+tmp[0:-1])
  else:
    out.append(name)
print(out)

Result:

['B2B Marketing:Campagin,Enrich inbound leads', 'Marketing Data Analysis:Event 360 AI', 'Enrichment API']
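
Since the question asks about BeautifulSoup, roughly the same result can also be sketched with bs4 alone. This is only a minimal sketch, not the method used above: the helper names `item_label` and `summarize` are made up for illustration, the stray `</a>` in the sample HTML is dropped, and the rule "ignore `<b>` text when the item also has other text" is an assumption inferred from the requested output.

```python
from bs4 import BeautifulSoup

# Sample markup from the question (stray </a> removed).
HTML = """
<ul>
  <li><a href="#">B2B Marketing</a>
    <ul>
      <li><a href="offerings/b2bmarketing/outboundai.php"> Campagin </a></li>
      <li><b>Inbound AI </b>Enrich inbound leads</li>
    </ul>
  </li>
  <li>Marketing Data Analysis
    <ul>
      <li><a href="offerings/marketingdataanalysis/event360ai.php"><b>Event 360 AI </b></a></li>
    </ul>
  </li>
  <li class="drop-down"><a href="#">Enrichment API</a>
  </li>
</ul>
"""


def item_label(li):
    """Return the text belonging to this <li> itself, skipping nested lists.

    Assumption: <b> text is treated as a prefix and dropped whenever the
    item has other text; it is kept when it is the only text available.
    """
    own_list = li.find_parent("ul")
    all_text, non_bold_text = [], []
    for s in li.strings:
        # Strings whose nearest <ul> ancestor differs belong to a nested list.
        if s.find_parent("ul") is not own_list:
            continue
        text = s.strip()
        if not text:
            continue
        all_text.append(text)
        if s.find_parent("b") is None:
            non_bold_text.append(text)
    return " ".join(non_bold_text or all_text)


def summarize(html):
    """Build 'outer : inner, inner' strings for each top-level <li>."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    # recursive=False keeps only the direct <li> children of the outer <ul>.
    for li in soup.ul.find_all("li", recursive=False):
        name = item_label(li)
        nested = li.find("ul")
        children = (
            [item_label(child) for child in nested.find_all("li", recursive=False)]
            if nested
            else []
        )
        result.append(f"{name} : {', '.join(children)}" if children else name)
    return result


print(summarize(HTML))
```

The key difference from the code in the question is `recursive=False`: `find_all('li')` on its own also returns the nested items, which is why every inner entry appeared a second time in the original output.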

【Comments】:

  • In the output list, for 'Marketing Data Analysis: Event 360 AI', "Event 360 AI" is not there.
  • Your requirements are really special, but I see what you need :)