【Question Title】: How to get HTML text inside a tag using BeautifulSoup
【Posted】: 2021-12-23 00:31:44
【Question Description】:

How can I use BeautifulSoup to extract data from the sample HTML below?

<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>

I tried .findall and .get_text, but I could not extract the text value from the htmlText element.

Expected output:

some thing ORget exact data from here

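【Question Discussion】:

    Tags: python html python-3.x beautifulsoup


    【Solution 1】:

    You can use BeautifulSoup twice: first to extract the htmlText element, then to parse its contents. The first pass decodes the escaped entities into a plain string, and the second pass turns that string into tags. For example: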

    from bs4 import BeautifulSoup
    import lxml  # optional: BeautifulSoup loads lxml itself; this import only fails fast if lxml is missing
    
    html = """
    <Tag1>
        <message code="able to extract text from here"/>
        <text value="able to extract text that is here"/>
        <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
    </Tag1>
    """
    soup = BeautifulSoup(html, "lxml")
    
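    # the lxml HTML parser lowercases tag names, hence "tag1" and .htmltext
    for tag1 in soup.find_all("tag1"):
        # first pass: the escaped markup comes back as a plain string
        cdata_html = tag1.htmltext.text
        # second pass: parse that string so <p> becomes a real tag
        cdata_soup = BeautifulSoup(cdata_html, "lxml")
        
        print(cdata_soup.p.text)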
    

    This displays:

    some thing ORget exact data from here
    

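    Note: lxml also needs to be installed with pip install lxml. BeautifulSoup imports it automatically when "lxml" is named as the parser, so an explicit import is not required.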
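
As a hedged variation on the note above: if installing lxml is not an option, the same two-pass idea can be sketched with the "html.parser" backend that ships with Python. The helper names strip_cdata and extract_html_text are made up for this illustration, and the CDATA wrapper is removed by hand on the assumption that parsers may differ in how they treat <![CDATA[ inside HTML.

```python
from bs4 import BeautifulSoup

sample = """
<Tag1>
    <message code="able to extract text from here"/>
    <text value="able to extract text that is here"/>
    <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
</Tag1>
"""

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def strip_cdata(text):
    """Remove a surrounding CDATA wrapper if present (hypothetical helper)."""
    text = text.strip()
    if text.startswith(CDATA_OPEN) and text.endswith(CDATA_CLOSE):
        return text[len(CDATA_OPEN):-len(CDATA_CLOSE)]
    return text


def extract_html_text(document, parser="html.parser"):
    """Return the visible text of every <htmlText> element (hypothetical helper)."""
    soup = BeautifulSoup(document, parser)
    results = []
    # Tag names are lowercased by the HTML parsers, so search for "htmltext".
    for node in soup.find_all("htmltext"):
        inner_markup = strip_cdata(node.get_text())        # first pass: plain string
        inner_soup = BeautifulSoup(inner_markup, parser)   # second pass: real tags
        results.append(inner_soup.get_text())
    return results


print(extract_html_text(sample))  # ['some thing ORget exact data from here']
```

Passing parser="lxml" to extract_html_text should behave the same way when lxml is installed; the default is only chosen here to avoid the extra dependency.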

    【Discussion】:

      【Solution 2】:

      Here are the steps you need to follow:

      # firstly, select all "htmlText" elements
      soup.select("htmlText")
      
      
      # secondly, iterate over all of them
      for result in soup.select("htmlText"):
          pass  # further code goes here
      
      
      # thirdly, use another BeautifulSoup() object to parse the data
      # otherwise you can't access <p>, <lite> elements data
      # since they are unreachable to first BeautifulSoup() object
      for result in soup.select("htmlText"):
          final = BeautifulSoup(result.text, "lxml")
      
      
      # fourthly, grab all <p> elements AND their .text -> "p.text"
      for result in soup.select("htmlText"):
          final = BeautifulSoup(result.text, "lxml").p.text
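
To see why the second parse in the steps above is needed, it may help to look at what the first pass hands back. The sketch below uses only the standard library (html.unescape and html.parser.HTMLParser) rather than BeautifulSoup, so it is an illustration of the idea and not the answer's method; TextCollector is a hypothetical helper class.

```python
import html
from html.parser import HTMLParser

# Raw content of <htmlText>, copied from the sample document.
raw = "&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;"

# Equivalent of the first pass: decoding the entities yields a string, not a tree.
decoded = html.unescape(raw)
print(decoded)  # <![CDATA[<p>some thing <lite>OR</lite>get exact data from here</p>]]>

# Remove the CDATA wrapper by hand so only the inner markup remains.
prefix, suffix = "<![CDATA[", "]]>"
inner = decoded
if inner.startswith(prefix) and inner.endswith(suffix):
    inner = inner[len(prefix):-len(suffix)]


class TextCollector(HTMLParser):
    """Collects text nodes while ignoring the tags around them."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Equivalent of the second pass: the string is parsed as markup.
collector = TextCollector()
collector.feed(inner)
collector.close()
text = "".join(collector.chunks)
print(text)  # some thing ORget exact data from here
```

Until the decoded string is parsed again, <p> and <lite> are just characters inside a text node, which is why the first BeautifulSoup object cannot reach them.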
      

      Code and an example in the online IDE (use whichever variant is most readable):

      from bs4 import BeautifulSoup
      import lxml  # optional: BeautifulSoup loads lxml itself; this import only fails fast if lxml is missing
      
      html = """
      <Tag1>
          <message code="able to extract text from here"/>
          <text value="able to extract text that is here"/>
          <htmlText>&lt;![CDATA[&lt;p&gt;some thing &lt;lite&gt;OR&lt;/lite&gt;get exact data from here&lt;/p&gt;]]&gt;</htmlText>
      </Tag1>
      """
      
      soup = BeautifulSoup(html, "lxml")
      
      
      # BeautifulSoup inside BeautifulSoup
      unreadable_soup = BeautifulSoup(BeautifulSoup(html, "lxml").select_one('htmlText').text, "lxml").p.text
      print(unreadable_soup)
      
      
      example_1 = BeautifulSoup(soup.select_one('htmlText').text, "lxml").p.text
      print(example_1)
      
      
      # without hardcoded list slices: loop over every match
      for result in soup.select("htmlText"):
          example_2 = BeautifulSoup(result.text, "lxml").p.text
          print(example_2)
      
      
      # or one liner
      example_3 = ''.join([BeautifulSoup(result.text, "lxml").p.text for result in soup.select("htmlText")])
      print(example_3)
      
      
      # output
      '''
      some thing ORget exact data from here
      some thing ORget exact data from here
      some thing ORget exact data from here
      some thing ORget exact data from here
      '''
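
One caveat with chaining .p.text as in the variants above: if an htmlText element contains no <p>, .p is None and the attribute access raises AttributeError. A minimal defensive sketch follows, assuming the "html.parser" backend and a hypothetical helper named paragraph_texts; the three-element sample document is invented for the illustration.

```python
from bs4 import BeautifulSoup

# Invented sample: the second <htmlText> deliberately has no <p> element.
sample = """
<root>
  <htmlText>&lt;![CDATA[&lt;p&gt;first&lt;/p&gt;]]&gt;</htmlText>
  <htmlText>&lt;![CDATA[no paragraph here]]&gt;</htmlText>
  <htmlText>&lt;![CDATA[&lt;p&gt;second &lt;b&gt;bold&lt;/b&gt;&lt;/p&gt;]]&gt;</htmlText>
</root>
"""


def paragraph_texts(document, parser="html.parser"):
    """Return the text of the first <p> in each <htmlText>, skipping ones without a <p>."""
    soup = BeautifulSoup(document, parser)
    texts = []
    for node in soup.find_all("htmltext"):
        # Strip the CDATA markers so the parser only sees the inner markup.
        inner = node.get_text().replace("<![CDATA[", "").replace("]]>", "")
        paragraph = BeautifulSoup(inner, parser).p
        if paragraph is not None:  # guard against a missing <p>
            texts.append(paragraph.get_text())
    return texts


print(paragraph_texts(sample))  # ['first', 'second bold']
```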
      

      【Discussion】:
