【Question title】: Text Segmentation Based On Direct Sentence [duplicate]
【Posted】: 2019-02-07 14:38:52
【Question】:

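Suppose I have a docx file like this:

When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?". My father sat beside me, hugging my shoulders with both of his arms. I said "I Would.". My father replied "That is my boy!"

I want to split the docx by its direct sentences (the sentences that contain quoted speech), like this:

sent1 : He said, "Son when you grow up would you be the savior of the broken?"

sent2 : I said "I Would."

sent3 : My father replied "That is my boy!"

I tried using a regex. The result looks like this: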

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?

".

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would.

".

My father replied "That is my boy!

Regex code:

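import re

# Match a run of non-terminator characters followed by one terminator (., ! or ?).
SENTENCE_REGEX = re.compile(r'[^!?.]+[!?.]')

# Assumes the file holds plain text; a real .docx is a zip archive and needs a docx reader.
with open('text.docx', 'r') as f:
    text = f.read()

def parse_sentences(text):
    return [x.lstrip() for x in SENTENCE_REGEX.findall(text)]

def print_sentences(sentences):
    print("\n\n".join(sentences))

if __name__ == "__main__":
    print_sentences(parse_sentences(text))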
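
A note on the file handling above: a .docx file is a zip archive of XML parts, so opening it in text mode does not yield the document text. The following is a minimal sketch, assuming the body text lives in the standard `word/document.xml` part, that extracts paragraph text with the standard library only. The function name `read_docx_text` is made up for illustration; a dedicated library such as python-docx would be the more robust choice.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w:p (paragraph) and w:t (text) elements.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_docx_text(path):
    """Return the document text, one line per paragraph (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text may be spread over several runs; join them.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string could then be passed to `parse_sentences` in place of the raw file handle.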

【Comments】:

  • "I tried using regex." With what code?

Tags: python regex text-segmentation


【Solution 1】:
import re

txt = '''When I was a young boy my father took me into the city to see a marching band. He said, "Son when you grow up would you be the savior of the broken?" My father sat beside me, hugging my shoulders with both of his arms. I said "I Would." My father replied "That is my boy!"'''

# A sentence ends with ., ? or !, optionally followed by a closing quote, then whitespace.
pttrn = re.compile(r'([.?!])([\'"])?\s')

# Keep the punctuation and the quote; replace the whitespace with a blank line.
new = pttrn.sub(r'\1\2\n\n', txt)

print(new)

Output:

When I was a young boy my father took me into the city to see a marching band.

He said, "Son when you grow up would you be the savior of the broken?"

My father sat beside me, hugging my shoulders with both of his arms.

I said "I Would."

My father replied "That is my boy!"

PS: As far as I know, endings such as ?". and .". and !". are not allowed in English.
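
If the input does contain such endings, as the text in the question does, one option is to widen the pattern so that a stray period after the closing quote is treated as part of the sentence end. This is a sketch under that assumption, not a general sentence splitter: the names `SENTENCE_END` and `split_sentences` are made up for illustration, and abbreviations such as "Mr." would still be split incorrectly.

```python
import re

# Terminator (., ? or !), optional closing quote, optional stray period, then whitespace.
SENTENCE_END = re.compile(r'([.?!]["\']?\.?)\s+')


def split_sentences(text):
    """Split text into sentences, tolerating endings like ?". and .". (hypothetical helper)."""
    separated = SENTENCE_END.sub(r'\1\n\n', text)
    return [s for s in separated.split('\n\n') if s]


if __name__ == "__main__":
    question_text = (
        'When I was a young boy my father took me into the city to see a marching band. '
        'He said, "Son when you grow up would you be the savior of the broken?". '
        'My father sat beside me, hugging my shoulders with both of his arms. '
        'I said "I Would.". My father replied "That is my boy!"'
    )
    for sentence in split_sentences(question_text):
        print(sentence)
```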

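【Discussion】:

  • How do I group them by direct sentence?
  • @SyafiqurRahman lst = output.slip("\n\n")
  • Slip? Do you mean split? Because when I tried it, there was no attribute slip.
  • @SyafiqurRahman Yes, I meant split. Sorry for the typo. You can split a string into a list of strings on any delimiter.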
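
The split step from this discussion can be sketched as follows. It reuses the pattern from the solution, splits the result on the blank-line delimiter, and then keeps only the sentences that contain a double quote. Treating "contains a double quote" as the test for a direct sentence is an assumption made here, and the function name `direct_speech_sentences` is made up for illustration.

```python
import re

# Same pattern as in the solution: terminator, optional closing quote, one whitespace character.
SENTENCE_END = re.compile(r'([.?!])([\'"])?\s')


def direct_speech_sentences(text):
    """Return only the sentences that contain quoted speech (hypothetical helper)."""
    separated = SENTENCE_END.sub(r'\1\2\n\n', text)
    sentences = separated.split('\n\n')
    # Assumption: a sentence counts as direct speech if it contains a double quote.
    return [s for s in sentences if '"' in s]


if __name__ == "__main__":
    txt = (
        'When I was a young boy my father took me into the city to see a marching band. '
        'He said, "Son when you grow up would you be the savior of the broken?" '
        'My father sat beside me, hugging my shoulders with both of his arms. '
        'I said "I Would." My father replied "That is my boy!"'
    )
    for number, sentence in enumerate(direct_speech_sentences(txt), start=1):
        print(f"sent{number} : {sentence}")
```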