【Question title】: Scraping badly coded html
【Posted】: 2020-03-26 03:15:43
【Question】:

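I scraped a website made up of hundreds of pages of poorly organized HTML. I used BeautifulSoup to capture the full contents of a div on each page. An excerpt of the resulting list: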

mylist = [['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/30/2019<br/>09:00:00 AM<br/>12/31/2019<br/>09:00:00 AM<br/>92112<br/>Initiate<br/>Capacity Constraint<br/>12/29/2019<br/>03:02:38 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/30/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/29/2019<br/>09:00:00 AM<br/>12/30/2019<br/>09:00:00 AM<br/>92086<br/>Initiate<br/>Capacity Constraint<br/>12/28/2019<br/>02:55:39 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/29/2019<br/></div>'],
['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/28/2019<br/>09:00:00 AM<br/>12/29/2019<br/>09:00:00 AM<br/>92074<br/>Initiate<br/>Capacity Constraint<br/>12/27/2019<br/>03:14:16 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/28/2019<br/></div>']]

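How can I capture the content between the <br/> tags, including a blank when there is nothing between them?

I should add that the output should be a list of lists, where each piece separated by a <br/> tag is one item in the list. For example: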

[['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/30/2019", "09:00:00 AM", "12/31/2019", "09:00:00 AM", "92112", "Initiate", "Capacity Constraint", "12/29/2019", "03:02:38 PM", "No response required", "AGT Pipeline Conditions for 12/30/2019"'],
 ['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/29/2019", "09:00:00 AM", "12/30/2019", "09:00:00 AM", "92086", "Initiate", "Capacity Constraint", "12/28/2019", "02:55:39 PM", "No response required", "AGT Pipeline Conditions for 12/29/2019"'],
 ['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/28/2019", "09:00:00 AM", "12/29/2019", "09:00:00 AM", "92074", "Initiate", "Capacity Constraint", "12/27/2019", "03:14:16 PM", "No response required", "AGT Pipeline Conditions for 12/28/2019"']]

【Comments】:

  • I'm not sure I understand. Could you give a small example that demonstrates what you want, rather than all the text in mylist?

Tags: python web-scraping beautifulsoup


【Answer 1】:

Usually, when you call select on a BeautifulSoup object, you get back a list of Tags.
You can then call select/getText again on each of those Tags.
For example:

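SEP = '(--*--SEP--*--)'  # sentinel string that should never occur in the page text
# soup is assumed to be the BeautifulSoup object built from the scraped page
mylist = soup.select('div')
# join each div's text nodes with SEP, split them apart again, drop whitespace-only pieces
between_br = [[j for j in i.getText(SEP).split(SEP) if not j.isspace()] for i in mylist]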
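
The snippet above discards whitespace-only pieces, so the blank slots the question asks about are lost. One possible alternative is to split the raw markup on the <br/> tag itself. The following is only a minimal sketch using the standard library: `split_on_br` is a hypothetical helper name, and it assumes each div holds nothing but text and <br/> tags, as in the sample data. A regular expression is not a general HTML parser, so this would likely break on nested markup.

```python
import html
import re

# Matches <br>, <br/> and <br /> in any letter case.
BR = re.compile(r"<br\s*/?>", re.IGNORECASE)
# Captures whatever sits inside the first <div ...>...</div> pair.
DIV = re.compile(r"<div[^>]*>(.*?)</div>", re.IGNORECASE | re.DOTALL)

def split_on_br(fragment):
    """Return the text pieces between <br/> tags, keeping empty ones."""
    match = DIV.search(fragment)
    inner = match.group(1) if match else fragment
    # Decode entities such as &amp; and trim surrounding whitespace.
    pieces = [html.unescape(p).strip() for p in BR.split(inner)]
    # The sample markup ends with <br/>, which leaves one empty trailer.
    if pieces and pieces[-1] == "":
        pieces.pop()
    return pieces

sample = ('<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC'
          '<br/> <br/><br/>No response required<br/></div>')
print(split_on_br(sample))
# ['006951446', 'Algonquin Gas Transmission, LLC', '', '', 'No response required']
```

Applied to the question's data, this would be called once per inner list, for example `[split_on_br(item[0]) for item in mylist]`.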

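【Comments】:

  • Hi, I get an error that says: SelectorSyntaxError: Invalid character '/' position 2 line 1: br/
  • Fixed. Is this the output you want? paste.ubuntu.com/p/F5x4nmbN2z
  • It was too large to paste into a comment, and the formatting would have been terrible. Sorry :(
【Answer 2】: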
from bs4 import BeautifulSoup
mylist = [['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/30/2019<br/>09:00:00 AM<br/>12/31/2019<br/>09:00:00 AM<br/>92112<br/>Initiate<br/>Capacity Constraint<br/>12/29/2019<br/>03:02:38 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/30/2019<br/></div>'],
          ['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/29/2019<br/>09:00:00 AM<br/>12/30/2019<br/>09:00:00 AM<br/>92086<br/>Initiate<br/>Capacity Constraint<br/>12/28/2019<br/>02:55:39 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/29/2019<br/></div>'],
          ['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/28/2019<br/>09:00:00 AM<br/>12/29/2019<br/>09:00:00 AM<br/>92074<br/>Initiate<br/>Capacity Constraint<br/>12/27/2019<br/>03:14:16 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/28/2019<br/></div>']]

for item in mylist:
    soup = BeautifulSoup(*item, 'html.parser')
    print(*[a.get_text(strip=True, separator="|").split("|") for a in soup])

Output:

['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/30/2019', '09:00:00 AM', '12/31/2019', '09:00:00 AM', '92112', 'Initiate', 'Capacity Constraint', '12/29/2019', '03:02:38 PM', 'No response required', 'AGT Pipeline Conditions for 12/30/2019']
['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/29/2019', '09:00:00 AM', '12/30/2019', '09:00:00 AM', '92086', 'Initiate', 'Capacity Constraint', '12/28/2019', '02:55:39 PM', 'No response required', 'AGT Pipeline Conditions for 12/29/2019']
['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/28/2019', '09:00:00 AM', '12/29/2019', '09:00:00 AM', '92074', 'Initiate', 'Capacity Constraint', '12/27/2019', '03:14:16 PM', 'No response required', 'AGT Pipeline Conditions for 12/28/2019']

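【Comments】:

  • This is really close. But I need each item in the list to be a separate item in the list. Like: ``` [['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/30/2019", "09:00:00 AM", "12/31/2019", "09:00:00 AM", "92112", "Initiate", "Capacity Constraint", "12/29/2019", "03:02:38 PM", "No response required", "AGT Pipeline Conditions for 12/30/2019"'], ['"006951446", "Algonquin Gas Transmission, LLC", "Critical notice", "12/29/2019", "09:00:00 AM", "12/30/2019", "09:00:00 AM", "92086", "Initiate", "Capacity Constraint"']] ```
  • Thanks! I noticed that just before the "No response required" part there are some <br/> tags with nothing inside them. Across the whole data set they sometimes contain numbers. How can I capture them as empty items when they are empty, and as data when there is data?
  • @SMJune You have received an answer based on your example. Good luck.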
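
On the follow-up about <br/> slots that are sometimes empty and sometimes hold numbers: one way to keep every slot is to treat each <br/> as a terminator and record whatever text came before it, even when that is nothing. Below is a rough sketch built on the standard library's `html.parser` rather than BeautifulSoup. The names `BrSplitter` and `rows_from` are made up for illustration, and the sketch assumes every value is followed by a <br/>, as in the sample data.

```python
from html.parser import HTMLParser

class BrSplitter(HTMLParser):
    """Collect the text that precedes each <br> tag, empty or not."""

    def __init__(self):
        super().__init__()
        self.items = []     # one entry per <br> encountered
        self._buffer = []   # text seen since the previous <br>

    def handle_starttag(self, tag, attrs):
        # Self-closing <br/> is also routed here by the default
        # handle_startendtag implementation.
        if tag == "br":
            self.items.append("".join(self._buffer).strip())
            self._buffer = []

    def handle_data(self, data):
        self._buffer.append(data)

def rows_from(mylist):
    """Turn a list of one-element lists of HTML into a list of lists."""
    rows = []
    for (fragment,) in mylist:
        parser = BrSplitter()
        parser.feed(fragment)
        parser.close()
        # Text after the final <br> (none in the sample data) is ignored.
        rows.append(parser.items)
    return rows

demo = [['<div id="headingData">A<br/>B<br/> <br/><br/>C<br/> <br/>42<br/>D<br/></div>']]
print(rows_from(demo))
# [['A', 'B', '', '', 'C', '', '42', 'D']]
```

Because every <br/> yields exactly one entry, each field stays at a fixed index whether or not it is filled in, which should make it easier to load the rows into a table later.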
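【Answer 3】:

Without seeing the rest of your code it may be hard to give an exact answer, but bs4 is a good package. You should be able to keep using the bs4 package and comb through the HTML with a combination of BeautifulSoup methods (e.g. find/find_all/select).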

See this answer for help on br tags

【Comments】:

    【Answer 4】:

    A solution using the SimplifiedDoc library.

    from simplified_scrapy import SimplifiedDoc,req,utils
    mylist = [['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/30/2019<br/>09:00:00 AM<br/>12/31/2019<br/>09:00:00 AM<br/>92112<br/>Initiate<br/>Capacity Constraint<br/>12/29/2019<br/>03:02:38 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/30/2019<br/></div>'],
              ['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/29/2019<br/>09:00:00 AM<br/>12/30/2019<br/>09:00:00 AM<br/>92086<br/>Initiate<br/>Capacity Constraint<br/>12/28/2019<br/>02:55:39 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/29/2019<br/></div>'],
              ['<div id="headingData">006951446<br/>Algonquin Gas Transmission, LLC<br/>Critical notice<br/>12/28/2019<br/>09:00:00 AM<br/>12/29/2019<br/>09:00:00 AM<br/>92074<br/>Initiate<br/>Capacity Constraint<br/>12/27/2019<br/>03:14:16 PM<br/> <br/><br/>No response required<br/> <br/> <br/>AGT Pipeline Conditions for 12/28/2019<br/></div>']]
    values = []
    # First way
    for item in mylist:
      doc = SimplifiedDoc(item[0])
      tmp = doc.selects('br').nextText()
      tmp.insert(0,doc.div.firstText())
      values.append(tmp)
    values = []
    # Second way
    for item in mylist:
      doc = SimplifiedDoc(item[0])
      brs = doc.selects('br')
      tmp = [br.previousText() for br in brs]
      values.append(tmp)
    print(values)
    

    Result:

    [['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/30/2019', '09:00:00 AM', '12/31/2019', '09:00:00 AM', '92112', 'Initiate', 'Capacity Constraint', '12/29/2019', '03:02:38 PM', '', '', 'No response required', '', '', 'AGT Pipeline Conditions for 12/30/2019'], ['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/29/2019', '09:00:00 AM', '12/30/2019', '09:00:00 AM', '92086', 'Initiate', 'Capacity Constraint', '12/28/2019', '02:55:39 PM', '', '', 'No response required', '', '', 'AGT Pipeline Conditions for 12/29/2019'], ['006951446', 'Algonquin Gas Transmission, LLC', 'Critical notice', '12/28/2019', '09:00:00 AM', '12/29/2019', '09:00:00 AM', '92074', 'Initiate', 'Capacity Constraint', '12/27/2019', '03:14:16 PM', '', '', 'No response required', '', '', 'AGT Pipeline Conditions for 12/28/2019']]
    

    【Comments】:
