【Question Title】:Remove all style, scripts, and HTML tags from an HTML page
【Posted】:2015-08-14 10:26:08
【Question】:

Here is what I have so far:

from bs4 import BeautifulSoup

def cleanme(html):
    soup = BeautifulSoup(html) # create a new bs4 object from the html data loaded
    for script in soup(["script"]): 
        script.extract()
    text = soup.get_text()
    return text
testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"

cleaned = cleanme(testhtml)
print (cleaned)

It is removing the scripts, but nothing else.

【Comments】:

  • What is your expected output?

Tags: python html beautifulsoup


【Solution 1】:

It looks like you almost have it. You also need to remove the HTML tags and the CSS styling code. Here is my solution (I updated the function):

from bs4 import BeautifulSoup

def cleanMe(html):
    soup = BeautifulSoup(html, "html.parser") # create a new bs4 object from the html data loaded
    for script in soup(["script", "style"]): # remove all javascript and stylesheet code
        script.extract()
    # get text
    text = soup.get_text()
    # break into lines and remove leading and trailing space on each
    lines = (line.strip() for line in text.splitlines())
    # break multi-headlines into a line each
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # drop blank lines
    text = '\n'.join(chunk for chunk in chunks if chunk)
    return text
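The whitespace-handling half of the function can be tried on its own. This is a minimal standard-library sketch of the same lines/chunks pipeline; the sample string is made up for illustration:

```python
# Sketch of the whitespace-normalization steps used above,
# applied to plain text (no BeautifulSoup required).
text = "  THIS IS AN EXAMPLE \n\nI need this  captured\n"

# break into lines and strip leading/trailing space on each
lines = (line.strip() for line in text.splitlines())
# break multi-headlines (phrases separated by two spaces) into chunks
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
# drop blank lines and rejoin
cleaned = "\n".join(chunk for chunk in chunks if chunk)

print(cleaned)
# THIS IS AN EXAMPLE
# I need this
# captured
```

The generator expressions keep the pipeline lazy; nothing is materialized until the final join.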

【Comments】:

  • @Anu This works for me: relist = re.split("window.fbAsyncInit+", texttotest) print(relist[0]) — you can see that the regex split works fine; for the texttotest variable I used exactly your sample text.
【Solution 2】:

You can use decompose to remove the tags from the document completely, and the stripped_strings generator to retrieve the tag contents.

from bs4 import BeautifulSoup

def clean_me(html):
    soup = BeautifulSoup(html, "html.parser")
    for s in soup(['script', 'style']):
        s.decompose()
    return ' '.join(soup.stripped_strings)

>>> clean_me(testhtml) 
'THIS IS AN EXAMPLE I need this text captured And this'

【Comments】:

【Solution 3】:

Removes the specified tags and comments in a clean way. Thanks to Kim Hyesung for this code.

from bs4 import BeautifulSoup
from bs4 import Comment

def cleanMe(html):
    soup = BeautifulSoup(html, "html5lib")
    # remove the unwanted tags entirely
    for tag in soup.find_all(['script', 'style', 'meta', 'noscript']):
        tag.extract()
    # remove HTML comments
    for comment in soup.find_all(text=lambda text: isinstance(text, Comment)):
        comment.extract()
    return soup
    

【Comments】:

【Solution 4】:

Use lxml instead:

# Requirements: pip install lxml

import lxml.html.clean


def cleanme(content):
    cleaner = lxml.html.clean.Cleaner(
        allow_tags=[''],
        remove_unknown_tags=False,
        style=True,
    )
    html = lxml.html.document_fromstring(content)
    html_clean = cleaner.clean_html(html)
    return html_clean.text_content().strip()

testhtml = "<!DOCTYPE HTML>\n<head>\n<title>THIS IS AN EXAMPLE </title><style>.call {font-family:Arial;}</style><script>getit</script><body>I need this text captured<h1>And this</h1></body>"
cleaned = cleanme(testhtml)
print(cleaned)
      

【Comments】:

【Solution 5】:

If you want a quick and dirty solution, you can use:

re.sub(r'<[^>]*?>', '', value)
        

This creates an equivalent of PHP's strip_tags. Is that what you want?
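A minimal sketch of that approach, wrapped in a hypothetical strip_tags helper. Note the main caveat: the contents of script and style elements survive, because only the tags themselves are deleted:

```python
import re

def strip_tags(value):
    # Naive tag stripper: delete anything that looks like <...>.
    # Hypothetical helper name, chosen to mirror PHP's strip_tags.
    return re.sub(r'<[^>]*?>', '', value)

print(strip_tags("<h1>Hello</h1> <b>world</b>"))
# Hello world

# Caveat: the text inside <style> (and <script>) is kept,
# since the regex removes only the tags, not their contents:
print(strip_tags("<style>.call{color:red}</style>ok"))
# .call{color:red}ok
```

For documents where styles and scripts matter, one of the parser-based solutions above is the safer choice.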

【Comments】:

【Solution 6】:

Another implementation in addition to Styvane's answer. If you need to extract a lot of text, take a look at selectolax; it is much faster than lxml.

Code and example in the online IDE:

from bs4 import BeautifulSoup

def clean_me(html):
    soup = BeautifulSoup(html, 'lxml')

    body = soup.body
    if body is None:
        return None

    # removing everything besides text
    for tag in body.select('script'):
        tag.decompose()
    for tag in body.select('style'):
        tag.decompose()

    plain_text = body.get_text(separator='\n').strip()
    print(plain_text)

clean_me(testhtml)
          

【Comments】:
