【Question title】: Extracting text from span class tag with BeautifulSoup
【Posted】: 2016-10-18 20:20:42
【Question】:

I am trying to extract some text from span tags with a particular class on a website.

Here is a snippet of the HTML:

<h1>2 Some address</h1>
                </div>
                <div id="smi-summary-items">
                    <div id="smi-price-string">&euro;230,000</div>
                    <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
                    <!-- Text_Link_Full_Ad_Unit -->
                    <div id='dfp-text_link_full_ad_unit' class='sale'>
                        <script type='text/javascript'>
                            googletag.cmd.push(function()
                                {
                                    googletag.display('dfp-text_link_full_ad_unit');
                                }
                            );
                        </script>
                    </div>

I want to extract the text "3 Beds" and "2 Baths".

I have tried a few solutions, but mostly I get errors or empty results.

Can anyone suggest a solution?

【Comments】:

    Tags: html web-scraping beautifulsoup html-parsing


    【Solution 1】:

    As far as I understand, you can simply filter the desired elements by class:

    [item.get_text() for item in soup.select("span.header_text")]
    

    Complete working example:

    from bs4 import BeautifulSoup
    
    data = """
    <div id="smi-summary-items">
        <div id="smi-price-string">&euro;230,000</div>
        <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
        <!-- Text_Link_Full_Ad_Unit -->
        <div id='dfp-text_link_full_ad_unit' class='sale'>
            <script type='text/javascript'>
                googletag.cmd.push(function()
                    {
                        googletag.display('dfp-text_link_full_ad_unit');
                    }
                );
            </script>
        </div>"""
    soup = BeautifulSoup(data, "html.parser")
    print([item.get_text(strip=True) for item in soup.select("span.header_text")])
    

    Output:

    ['Detached House', '3 Beds', '2 Baths']
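
The selector above returns all three `header_text` spans, while the question asks only for "3 Beds" and "2 Baths". A minimal sketch of one way to narrow the result, assuming the listing always uses the wording "N Bed(s)" / "N Bath(s)" (that pattern is an assumption based on the snippet, not something the site guarantees):

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question.
html = """
<div id="smi-summary-items">
    <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# All header_text spans, with surrounding whitespace stripped.
texts = [span.get_text(strip=True) for span in soup.select("span.header_text")]

# Assumed pattern: a number followed by "Bed(s)" or "Bath(s)".
pattern = re.compile(r"\d+\s+(?:Bed|Bath)s?")
wanted = [text for text in texts if pattern.fullmatch(text)]

print(wanted)  # ['3 Beds', '2 Baths']
```

If the property type is always the first span, slicing the list (`texts[1:]`) would also work, but matching on the text is less sensitive to the order of the spans.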
    

    【Comments】:

      【Solution 2】:

      The following code extracts the text from the span elements with class header_text:

      >>> from bs4 import BeautifulSoup
      >>> content = """<h1>2 Some address</h1>
      ...                 </div>
      ...                 <div id="smi-summary-items">
      ...                     <div id="smi-price-string">&euro;230,000</div>
      ...                     <span class="header_text"> Detached House</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">3 Beds</span><span class="bar">&nbsp;|&nbsp;</span><span class="header_text">2 Baths</span>
      ...                     <!-- Text_Link_Full_Ad_Unit -->
      ...                     <div id='dfp-text_link_full_ad_unit' class='sale'>
      ...                         <script type='text/javascript'>
      ...                             googletag.cmd.push(function()
      ...                                 {
      ...                                     googletag.display('dfp-text_link_full_ad_unit');
      ...                                 }
      ...                             );
      ...                         </script>
      ...                     </div>"""

      >>> soup = BeautifulSoup(content, "html.parser")
      >>> req = soup.find_all("span", {"class":"header_text"})
      >>> print(req)
      [<span class="header_text"> Detached House</span>, <span class="header_text">3 Beds</span>, <span class="header_text">2 Baths</span>]
      >>> x23 = []
      >>> for i in req:
      ...     x23.append(i.get_text())
      ...
      >>> print(x23)
      [' Detached House', '3 Beds', '2 Baths']
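
For comparison, the same `find_all` lookup can be written more compactly. This is only a sketch of an equivalent form: it uses BeautifulSoup's `class_` keyword instead of the attribute dictionary, and `get_text(strip=True)` to drop the leading space seen in `' Detached House'`. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Only the relevant line of the HTML from the question.
content = (
    '<span class="header_text"> Detached House</span>'
    '<span class="bar">&nbsp;|&nbsp;</span>'
    '<span class="header_text">3 Beds</span>'
    '<span class="bar">&nbsp;|&nbsp;</span>'
    '<span class="header_text">2 Baths</span>'
)

soup = BeautifulSoup(content, "html.parser")

# class_ (with trailing underscore) filters on the CSS class,
# since "class" is a reserved word in Python.
spans = soup.find_all("span", class_="header_text")

# Build the list in one step and strip surrounding whitespace.
texts = [span.get_text(strip=True) for span in spans]

print(texts)  # ['Detached House', '3 Beds', '2 Baths']
```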
      

      【Comments】:
