【Question title】: Python: extract data from HTML tags [duplicate]
【Posted】: 2018-05-07 00:14:24
【Question】:

I want to extract the paragraph text inside HTML tags using Python:

 <p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">

 Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

 </span></p>

My code is:

from HTMLParser import HTMLParser
from bs4 import BeautifulSoup

x = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>"""

p1 = HTMLParser()
p1.unescape(x)
bdy_soup = BeautifulSoup(p1.unescape(x)).get_text(separator=";")
print(bdy_soup)

This code returns nothing. Please help me get this working; any help would be greatly appreciated.

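【Question comments】:

  • Are you reading from an HTML page or from a text file?
  • @prakash-palnati --- I'm reading from a SQL table.
  • @s.s You can use BeautifulSoup to extract exactly the data you need. First run import html, then html.unescape(x).
  • @manoj jadhav Could you explain the code?
  • @s.s See my post.

Tags: python html python-3.x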


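【Solution 1】:
  1. Use html.unescape to convert the HTML character entities (such as &lt; and &quot;) back into the characters they stand for
  2. Use bs4.BeautifulSoup(html_content).text to extract the content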

>>> x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

>>> import html
>>> xx = html.unescape(x)
>>> xx
'<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>'

>>> import bs4
>>> bs4.BeautifulSoup(xx, "html.parser").text
' Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. '
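
Since the asker mentions reading these values from a SQL table, the two steps above could be wrapped in a small helper and applied to each value. This is only a sketch: the name extract_text and the sample rows are made up for illustration, and it assumes beautifulsoup4 is installed.

```python
import html
from bs4 import BeautifulSoup


def extract_text(escaped_html):
    """Unescape entity-encoded HTML, then return its text with outer whitespace trimmed."""
    markup = html.unescape(escaped_html)         # step 1: &lt;p&gt; becomes <p>
    soup = BeautifulSoup(markup, "html.parser")  # step 2: parse with the built-in parser
    return soup.get_text().strip()


# Hypothetical rows, standing in for values fetched from a database
rows = [
    "&lt;p&gt;&lt;span&gt; First paragraph. &lt;/span&gt;&lt;/p&gt;",
    "&lt;p style=&quot;text-align: justify;&quot;&gt;Second paragraph.&lt;/p&gt;",
]
texts = [extract_text(row) for row in rows]
print(texts)
```

Passing "html.parser" explicitly keeps the result independent of which optional parsers happen to be installed.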

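【Comments】:

  • This doesn't work. Could you explain the code?
  • I've edited it. @s.s
  • I've edited my question, please take a look.
  • Thanks for your help; I've posted my own answer.
【Solution 2】: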

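You can do it like this. Install beautifulsoup4 first; the unescape helper comes from the standard library (html.unescape on Python 3, HTMLParser().unescape on Python 2).

from html import unescape  # Python 2: from HTMLParser import HTMLParser, then HTMLParser().unescape
from bs4 import BeautifulSoup

# Entity-escaped HTML; adjacent string literals are joined into one string
p = ("&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span "
     "style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;")

# Turn the entities back into real markup, then strip the tags
bdy_soup = BeautifulSoup(unescape(p), "html.parser").get_text(separator="\n")
print(bdy_soup)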
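
If installing third-party packages is not an option, roughly the same result can be had from the standard library alone. The sketch below swaps BeautifulSoup for a small html.parser.HTMLParser subclass; TextCollector and text_from_escaped are illustrative names, not part of any library.

```python
import html
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text found between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for each run of text between tags; skip whitespace-only runs
        if data.strip():
            self.chunks.append(data.strip())


def text_from_escaped(escaped_html, separator="\n"):
    collector = TextCollector()
    collector.feed(html.unescape(escaped_html))  # unescape first, then parse
    collector.close()                            # flush any buffered text
    return separator.join(collector.chunks)


p = ("&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span "
     "style=&quot;font-size: small;&quot;&gt; Irrespective of the kind of small "
     "business you own, using traditional sales and marketing tactics can prove "
     "to be expensive. &lt;/span&gt;&lt;/p&gt;")
print(text_from_escaped(p))
```

This handles simple fragments like the one in the question; for messy real-world pages BeautifulSoup is the more forgiving choice.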

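【Comments】:

  • p = { the text in "<p ......... gt;</p>" } raises an error
  • @s.s What is the exact input? Can you post the complete snippet?
  • Exactly this === <p style="text-align: justify;><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;"> Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. </span></p>
  • Your code runs but produces no output... What do I need to print?
  • Please add print(bdy_soup). What do you get in bdy_soup?
【Solution 3】: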

You can use a regular expression to extract the data between a pair of HTML tags:

r'<title[^>]*>([^<]+)</title>'
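
As a rough illustration, the same pattern shape can be applied to the span from the question once the entities are unescaped. The helper name text_between and the shortened sample string are assumptions made for this sketch. Note the limitation it demonstrates: because the capture group is [^<]+, the pattern only matches elements that contain plain text and no nested tags.

```python
import html
import re


def text_between(tag, markup):
    """Return the text of every <tag>...</tag> element that contains no nested tags."""
    # Same shape as the <title> pattern, with the tag name made a parameter
    pattern = r'<{0}[^>]*>([^<]+)</{0}>'.format(re.escape(tag))
    return [match.strip() for match in re.findall(pattern, markup)]


escaped = ("&lt;p&gt;&lt;span style=&quot;font-size: small;&quot;&gt; Irrespective of "
           "the kind of small business you own. &lt;/span&gt;&lt;/p&gt;")
markup = html.unescape(escaped)

print(text_between("span", markup))  # the span holds plain text, so it matches
print(text_between("p", markup))     # empty list: <p> wraps a nested <span>
```

For anything beyond simple, flat fragments, an HTML parser is usually the safer tool.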

【Comments】:

【Solution 4】:
The code worked after installing the lxml parser. Thank you, everyone, for your help.

import html
import bs4

# Entity-escaped HTML, as read from the SQL table
x = """&lt;p style=&quot;text-align: justify;&quot;&gt;&lt;span style=&quot;font-size: small; font-family: lato, arial, helvetica, sans-serif;&quot;&gt; Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive. &lt;/span&gt;&lt;/p&gt;"""

p1 = html.unescape(x)  # decode &lt;, &gt; and &quot; into real markup
bdy_soup = bs4.BeautifulSoup(p1, "lxml").get_text(separator="\n")  # requires the lxml package
print(bdy_soup)
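
Because this fix depends on lxml being installed, one option is to fall back to the parser that ships with Python when it is missing. The following is a sketch under that assumption; make_soup is a hypothetical helper name and the sample string is shortened.

```python
import html
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Parse with the preferred parser, falling back to the built-in html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser library is not installed
        return BeautifulSoup(markup, "html.parser")


x = "&lt;p&gt;&lt;span&gt; Irrespective of the kind of small business you own. &lt;/span&gt;&lt;/p&gt;"
soup = make_soup(html.unescape(x))
text = soup.get_text(separator="\n").strip()
print(text)
```

The two parsers can build slightly different trees for malformed markup, but for a well-formed fragment like this one the extracted text is the same.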
    

【Comments】:
