【Question title】: Splitting text based on whitespace
【Posted】: 2018-04-24 05:01:49
【Question】:
            No time. Not today.
                (slides in last bullets)
            Ten, eleven, twelve... or bust.
                (chambers a shell into each
                 gun, looks up)
            Right here!

The cab SCREECHES to a stop on the shoulder of the highest
FREEWAY in a massive INTERCHANGE of freeways. Dopinder halts
the meter and hands Deadpool his CARD.

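My goal is to parse the text above so that the dialogue is separated from the description. My file contains many instances like this one. The output should be two separate strings, x and y, where x = "No time. Not today. ... Right here!" and y = "The cab SCREECHES ... his CARD."

How can I do this with regular-expression matching? Or is there a better way to approach the problem? I am using Python.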

【Comments】:

  • Try BeautifulSoup?

Tags: python html regex parsing split


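【Solution 1】:

Use BeautifulSoup to parse the page content. Extracting content by the tags you need is easier that way, and parsing HTML with regular expressions is generally not a good idea.

Demo: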

from bs4 import BeautifulSoup
s = """<b>                          DEADPOOL (CONT'D) </b>                Little help?

    The cabbie grabs Deadpool's hand and pulls him through to the
    front. Deadpool's head rests upside down on the bench seat
    as he maneuvers his legs through. The cabbie turns the
    helping hand into a HANDSHAKE, then turns down the Juice.

<b>                            CABBIE </b>"""

soup = BeautifulSoup(s, "html.parser")
print(soup.text)

Output:

【Comments】:

  • Hello, thanks for your reply. I have updated my question to make it clearer.
【Solution 2】:

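It looks like you mistyped the string "Little help?" as "little Help?". Also, the x and y you want to extract are strings in the same block, separated by a blank line (\n\n).

You can try this: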

ss="""<b>                          DEADPOOL (CONT'D) </b>                Little help?

The cabbie grabs Deadpool's hand and pulls him through to the
front. Deadpool's head rests upside down on the bench seat
as he maneuvers his legs through. The cabbie turns the
helping hand into a HANDSHAKE, then turns down the Juice.

<b>                            CABBIE </b>"""
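import re

# Capture the text between a '>' and the next '<', i.e. the text outside the tags.
regx = re.compile(r'(?s)(?<=>)[^<>]*(?=<)')
lst = [m.strip() for m in regx.findall(ss)]
# lst[1] is the text between </b> and the next <b>; split it on the blank line.
xy = [m.strip() for m in re.split(r'\n{2}', lst[1])]
for i in xy:
    print(i + "\n")     # x = xy[0], y = xy[1]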

The output is:

Little help?
The cabbie grabs Deadpool's hand and pulls him through to the
front. Deadpool's head rests upside down on the bench seat
as he maneuvers his legs through. The cabbie turns the
helping hand into a HANDSHAKE, then turns down the Juice.

Edited to address your updated question.

ss="""copy&paste_Your_Input_string_Here"""
xy=[m.strip() for m in re.split(r'\n{2}',ss)]
for i in xy: print(i +"\n")     # x=xy[0], y=xy[1]
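
For the indented screenplay excerpt in the updated question, splitting on the blank line leaves the original line breaks and indentation inside each string. The following is a minimal sketch, not a definitive solution, that also collapses that whitespace. The helper name `split_blocks` and the `drop_parentheticals` option are hypothetical, and the sketch assumes that the dialogue and the description are always separated by at least one blank line.

```python
import re

text = """\
            No time. Not today.
                (slides in last bullets)
            Ten, eleven, twelve... or bust.
                (chambers a shell into each
                 gun, looks up)
            Right here!

The cab SCREECHES to a stop on the shoulder of the highest
FREEWAY in a massive INTERCHANGE of freeways. Dopinder halts
the meter and hands Deadpool his CARD.
"""


def split_blocks(raw, drop_parentheticals=False):
    """Split raw text on blank lines and collapse whitespace in each block.

    Hypothetical helper: assumes blocks are separated by a blank line.
    """
    # A "blank line" here is two newlines with only whitespace between them.
    blocks = re.split(r'\n\s*\n', raw.strip())
    cleaned = []
    for block in blocks:
        if drop_parentheticals:
            # Remove stage directions such as "(slides in last bullets)".
            block = re.sub(r'\([^)]*\)', ' ', block)
        # str.split() with no arguments splits on any run of whitespace,
        # so joining with a single space flattens indentation and newlines.
        cleaned.append(' '.join(block.split()))
    return cleaned


x, y = split_blocks(text)
x_spoken, _ = split_blocks(text, drop_parentheticals=True)
print(x)
print(x_spoken)
print(y)
```

With `drop_parentheticals=True` only the spoken lines remain in the dialogue string. If a file uses indentation alone, without blank lines, to mark dialogue, this sketch would need a different splitting rule.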

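【Comments】:

  • Hey, thank you, but my question is a bit different from what you understood. I have updated the question. Thanks again!
  • I have edited my answer for your second question. Please try again, thanks :-)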