【Question title】: How to sanitise a block of text in Python 3 with no external modules?
【Posted】: 2019-03-21 10:32:00
【Question】:

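I was recently set a HackerRank challenge, and I couldn't properly clean the tags out of a block of text in Python 3 without breaking the text.

Two sample inputs were provided (below), and the challenge was to clean them into safe, plain blocks of text. The time for the challenge has passed, but I'm puzzled about how I got something this simple wrong. Any help on how I should have approached it would be much appreciated.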

Test input one

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
var y=window.prompt("Hello")
window.alert(y)
</script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Test input two

In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work.  The full details of your in-text references, <script language="JavaScript">
document.write("Page. Last update:" + document.lastModified); </script>When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. 
The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Suggested output for test 1

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

Suggested output for test 2

  In-text references or citations are used to acknowledge the work or ideas of others. They are placed next to the text that you have paraphrased or quoted, enabling the reader to differentiate between your writing and other people’s work. The full details of your in-text references, When quoting directly from the source include the page number if available and place quotation marks around the quote, e.g. The World Health Organisation defines driver distraction ‘as when some kind of triggering event external to the driver results in the driver shifting attention away from the driving task’.

Thanks in advance!

Edit (using @YakovDan's sanitize function). Code:

def sanitize(inp_str):

    ignore_flag =False
    close_tag_count = 0


    out_str =""
    for c in inp_str:
        if not ignore_flag:
           if c == '<':
               close_tag_count=2
               ignore_flag=True
           else:
               out_str+=c
        else:
            if c == '>':
                close_tag_count-=1

            if close_tag_count == 0:
                ignore_flag=False


    return out_str

inp=input()
print(sanitize(inp))

Input:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. <script>
 var y=window.prompt("Hello")
 window.alert(y)
 </script>Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.

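Output:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy.

What the output should be:

It is a long established fact that a reader will be distracted by the readable content of a page when looking at its layout. The point of using Lorem Ipsum is that it has a more-or-less normal distribution of letters, as opposed to using 'Content here, content here', making it look like readable English. Many desktop publishing packages and web page editors now use Lorem Ipsum as their default model text, and a search for 'lorem ipsum' will uncover many web sites still in their infancy. Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage.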

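【Question discussion】:

  • Please clarify what needs to be done. Can you provide a sample output? Can you explain what you have tried so far? If I understand correctly, you have some text mixed with <script> tags and you need to remove the tags?
  • Works fine for me. Can you provide a test case?
  • @YakovDan Thanks again for the reply! I have edited the main post with the code, the input, the output, and what I think the output should be. The problem is that after removing the <script> tag, it also seems to delete the rest of the text that follows it, which is perfectly fine and not malicious.
  • I can't reproduce the problem. The same code runs fine on my side. Can you add the code you use to call the function?
  • @YakovDan Thanks for getting back to me. You can see how I'm running it here; if you paste the input from the main post, you should get the same output I do - repl.it/repls/FormalStiffPipelining

Tags: python python-3.x sanitization input-sanitization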


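【Solution 1】:

In general, regular expressions are the wrong tool for parsing HTML tags, but they will do for this job because the tags are simple. The approach fails on irregular input (a tag with no closing tag, and so on).

That said, for these two examples you can use this regex: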

<.*?>.*?<\s*?\/.*?>

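Implemented in Python:

import re

# Short stand-in for one of the long test inputs above.
s = 'still in their infancy. <script>\nwindow.alert(1)\n</script>Contrary to popular belief.'
# re.DOTALL lets '.' match newlines, so a multi-line script body is removed too.
r = re.sub(r'<.*?>.*?<\s*?\/.*?>', '', s, flags=re.DOTALL)
print(r)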

It gives the expected result (too long to copy here!).
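To make the caveat about irregular input concrete, here is a minimal sketch that runs the same pattern on three short strings. The helper name `strip_tag_pairs` and the sample strings are made up for illustration; only the pattern comes from the answer.

```python
import re

# Pattern from the answer above, compiled once as a raw string.
TAG_PAIR = re.compile(r'<.*?>.*?<\s*?\/.*?>', flags=re.DOTALL)

def strip_tag_pairs(text):
    """Remove every '<open>...</close>' span, including the text between."""
    return TAG_PAIR.sub('', text)

# Well-formed pair: the script tags and the script body are removed.
print(repr(strip_tag_pairs('before <script>\nalert(1)\n</script>after')))

# Unclosed tag: nothing matches, so the tag is left in place.
print(repr(strip_tag_pairs('before <script> after')))

# A tag with no closing partner (<br>) gets paired with the next closing
# tag, so the ordinary text in between is removed as well.
print(repr(strip_tag_pairs('a <br> b <i>c</i> d')))
```

The last case is the one to watch for: the pattern has no notion of which closing tag belongs to which opening tag, so it is only safe when every tag in the input is a simple, properly closed pair.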

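【Discussion】:

  • Thanks! That does work. The challenge was to do it without regular expressions (I should have said), but this would solve the problem perfectly.
  • @JamesOdo Sorry, but once nested tags and other complications have to be considered, the problem becomes too long for me to write from scratch. You are effectively asking for a full [X]HTML parser, which is hard to implement! If nested tags are not a requirement, you can implement an FSM whose state is either "inside a tag" or "not inside a tag". Then, as you iterate over the characters, you have two decisions to make: do I change my state, and do I add this character to the output? That's it. Hopefully you can manage the implementation yourself, and then the work will be your own :)
  • @JamesOdo Note that your state will probably need two parts, not just "am I inside a tag", because you need to account for when you are actually within a tag's content (e.g. in a "p" or "<script>" tag). This can be done in a separate variable.
  • Thanks again for the reply. It seems odd that this isn't as simple without regular expressions as I thought, given that the test was fairly short and didn't let me use that module. Thanks for the advice, though; now that I have more time I'll try it myself.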
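The two-part state machine described in the comments above could be sketched roughly as follows. This is one possible reading of that suggestion, not code from the thread: the function name `strip_script_blocks` is hypothetical, and the sketch assumes that tags are not nested and that the script body itself contains no `<` or `>` characters.

```python
def strip_script_blocks(text):
    """Drop all tags, plus everything between <script ...> and </script>."""
    out = []           # characters kept so far
    in_tag = False     # state part 1: between '<' and '>'
    in_script = False  # state part 2: between <script ...> and </script>
    tag = []           # characters of the tag currently being read

    for ch in text:
        if in_tag:
            if ch == '>':
                in_tag = False
                # First word of the tag, e.g. 'script' or '/script'.
                words = ''.join(tag).lower().split()
                name = words[0] if words else ''
                if name == 'script':
                    in_script = True
                elif name == '/script':
                    in_script = False
            else:
                tag.append(ch)
        elif ch == '<':
            in_tag = True
            tag = []
        elif not in_script:
            # Ordinary text outside any script block is kept.
            out.append(ch)
    return ''.join(out)


sample = 'infancy. <script>\nvar y=window.prompt("Hello")\nwindow.alert(y)\n</script>Contrary'
print(strip_script_blocks(sample))
```

Each character triggers the two decisions mentioned in the comment: whether the state changes, and whether the character goes to the output. Unlike a plain "skip everything between a tag pair" approach, this version keeps the text inside non-script tags such as `<b>...</b>`, which may or may not be what the challenge wanted.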
【Solution 2】:

Here is a way to do it without regular expressions.

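def sanitize(inp_str):
    ignore_flag = False    # True while skipping a tag pair and its contents
    close_tag_count = 0    # number of '>' still expected before output resumes

    out_str = ""
    for c in inp_str:
        if not ignore_flag:
            if c == '<':
                # Assume every '<' opens a pair like <tag>...</tag>,
                # so two '>' characters must pass before text resumes.
                close_tag_count = 2
                ignore_flag = True
            else:
                out_str += c
        else:
            if c == '>':
                close_tag_count -= 1

            if close_tag_count == 0:
                ignore_flag = False

    return out_str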

That should do it (depending on the assumptions about the tags).
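One possible explanation for the truncated output reported in the question (the thread does not confirm it) is the way the function is called rather than the function itself: `input()` returns a single line, and the test input contains newlines inside the script block. The sketch below uses `io.StringIO` as a stand-in for standard input; the helper names `first_line` and `whole_text` are made up for illustration.

```python
import io

def first_line(stream):
    """Mimic input(): return one line without its trailing newline."""
    return stream.readline().rstrip('\n')

def whole_text(stream):
    """Return everything on the stream, newlines included."""
    return stream.read()

# Shortened stand-in for test input one, with the same line breaks.
sample = 'infancy. <script>\nwindow.alert(1)\n</script>Contrary to popular belief.'

print(repr(first_line(io.StringIO(sample))))  # stops at the end of the first line
print(repr(whole_text(io.StringIO(sample))))  # keeps all three lines

# In the real script, sys.stdin.read() would take the place of input():
#     import sys
#     print(sanitize(sys.stdin.read()))
```

If only the first line reaches `sanitize`, the string ends right after the opening `<script>` tag, so everything following the script block is never seen. That would match the output shown in the question.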

【Discussion】:

  • I may have spoken a little too soon; although this clears all