【Question title】: Ruby's truncate unsanitizes MS Word code
【Posted】: 2011-03-12 22:29:15
【Question】:

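I'm curious whether anyone else has noticed this. I have a WYSIWYG editor that users occasionally paste into from Word. There is a Word sanitizer, but not everyone is a genius.

If I parse that text elsewhere, it comes out fine. But if I truncate it, the MS Word markup shows up.

Does anyone know why truncate unsanitizes this, or how to sanitize and truncate at the same time?

Update:

Here is a sample of the MS Word markup that appears after I truncate: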

≪! [If Gte Mso 9]>≪Xml>  ≪Br /> ≪O:Office Document Settings>  ≪Br /> ≪O:Allow Png/>  ≪Br /> ≪/O:Office Document Settings>  ≪Br />≪/Xml>≪![Endif] >≪! [If Gte Mso 9]>≪Xml>  ≪Br /> ≪W:Word Document>  ≪Br /> ≪W:Zoom>0≪/W:Zoom>  ≪Br /> ≪W:Track Moves>False≪/W:Track Moves>  ≪Br /> ≪W:Track Formatting/>  ≪Br /> ≪W:Punctuation Kerning/>  ≪Br /> ≪W:Drawing Grid Horizontal Spacing>18 Pt≪/W:Drawing Grid Horizontal Spacing>  ≪Br /> ≪W:Drawing Grid Vertical Spacing>18 Pt≪/W:Drawing Grid Vertical Spacing>  ≪Br /> ≪W:Display Horizontal Drawing Grid Every>0≪/W:Display Horizontal Drawing Grid Every>  ≪Br /> ≪W:Display Vertical Drawing Grid Every>0≪/W:Display Vertical Drawing Grid Every>  ≪Br /> ≪W:Validate Against Schemas/>  ≪Br /> ≪W:Save If Xml Invalid>False≪/W:Save If Xml Invalid>  ≪Br /> ≪W:Ignore Mixed Content>False≪/W:Ignore Mixed Content>  ≪Br /> ≪W:Always Show Placeholder Text>False≪/W:Always Show Placeholder Text>  ≪Br /> ≪W:Compatibility>  ≪Br /> ≪W:Break Wrapped Tables/>  ≪Br /> ≪W:Dont Grow Autofit/>  ≪Br /> ≪W:Dont Autofit Constrained Tables/>  ≪Br /> ≪W:Dont Vert Align In Txbx/>  ≪Br /> ≪/W:Compatibility>  ≪Br /> ≪/W:Word Document>  ≪Br />≪/Xml>≪![Endif] >≪! [If Gte Mso 9]>≪Xml>  ≪Br /> ≪W:Latent Styles Def Locked State="False" Latent Style Count="276">  ≪Br /> ≪/W:Latent Styles>  ≪Br />≪/Xml>≪![Endif] >  ≪! 
{Cke Protected}%3 C!%2 D%2 D%7 Bcke Protected%7 D%253 C!%252 D%252 D%257 Bcke Protected%257 D%25253 C!%25252 D%25252 D%25257 Bcke Protected%25257 D%2525253 C!%2525252 D%2525252 D%2525257 Bcke Protected%2525257 D%252525253 C!%252525252 D%252525252 D%252525257 Bcke Protected%252525257 D%25252525253 C!%25252525252 D%25252525252 D%25252525257 Bcke Protected%25252525257 D%2525252525253 C!%2525252525252 D%2525252525252 D%2525252525250 A%25252525252520%2525252525252 F*%25252525252520 Font%25252525252520 Definitions%25252525252520*%2525252525252 F%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Times%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%252525252525200%252525252525205%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%2525252525253 B%2525252525250 A%25252525252509mso Font Charset%2525252525253 A0%2525252525253 B%2525252525250 A%25252525252509mso Generic Font Family%2525252525253 Aauto%2525252525253 B%2525252525250 A%25252525252509mso Font Pitch%2525252525253 Avariable%2525252525253 B%2525252525250 A%25252525252509mso Font Signature%2525252525253 A3%252525252525200%252525252525200%252525252525200%252525252525201%252525252525200%2525252525253 B%2525252525257 D%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Verdana%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%2525252525252011%252525252525206%252525252525204%25

The whole content is roughly 600 characters long. Here are the first 200 or so:

“Excellent” – The New York Times            

“4 Stars”  - The Star-Ledger                                                                       

“Best Romantic Restaurant” – Suburban Essex

“Best View” – OpenTable



In December 1986, the Knowles opened Highlawn after months of restoration to the former open-air “casino” which had, along with the now-prosperous park, been neglected for several years.

Here is the custom sanitizer I put together with help from Stack Overflow:

def sanitized_text(text)
  sanitized = text.gsub(/≪[^>]*>/, '')
end

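The problem with this sanitizer is that it returns a blank string once I truncate to 125 characters. When I extend the limit to 600 characters, I get a single line, which is another MS Word conditional statement.

Update: Here is the code that produces the MS Word content.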

 = truncate(organization.about_us, 125)

Note that when I just put this:

 = organization.about_us

the result is fine, but of course it isn't truncated.

I should also add that this is Ruby 1.8.7 / Rails 2.3.5.

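【Comments】:

  • Posting your sanitizer code and a failing test case would go a long way toward getting this answered.
  • Can you post some code? I wonder whether you're assuming something is edited in place when it isn't. Just a wild guess.
  • Can you post a minimal sample that shows what works and what is broken?
  • Please show the exact code that fails; for example, if this is in a view, paste the relevant part. Is it "" that fails while "" is fine?
  • I've appended to my answer below, because I can't reproduce this problem with the data you provided. It's a real mystery.

Tags: ruby-on-rails ruby ms-word sanitization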


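【Solution 1】:

Truncating HTML is always troublesome, because you can end up splitting tags and entities. Without proper UTF-8 handling, you also risk cutting a multi-byte character in half.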
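To make the two failure modes concrete, here is a minimal plain-Ruby sketch. The sample strings and the `drop_partial_entity` helper are made up for illustration and are not part of Rails. It also assumes a Ruby recent enough to have `String#byteslice`, not the 1.8.7 mentioned in the question.

```ruby
# Hypothetical sample: escaped text as it might be stored after a paste.
text = "Fish &amp; chips"

# A plain character cut can land in the middle of the "&amp;" entity.
naive_cut = text[0, 8]

# One possible repair: drop a trailing entity that never reached its ';'.
# (Hypothetical helper; it would also trim a bare '&' followed by word
# characters at the cut point, which is acceptable for a sketch.)
def drop_partial_entity(fragment)
  fragment.sub(/&[\w#]*\z/, '')
end

safe_cut = drop_partial_entity(naive_cut)

# Cutting by bytes can split a multi-byte UTF-8 character
# (e-acute, written here as \u00e9, takes 2 bytes).
word = "caf\u00e9"
byte_cut = word.byteslice(0, 4)

puts naive_cut.inspect        # the cut ends inside the entity
puts safe_cut.inspect         # the dangling entity has been removed
puts byte_cut.valid_encoding? # the byte cut is no longer valid UTF-8
```

Cutting by characters rather than bytes avoids the second problem. The first one needs either a cleanup step like the one above or truncation that happens before the text is escaped.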

Another thing to watch out for is an overly greedy regular expression:

def sanitized_text(text)
  sanitized = text.gsub(/≪[^>]*?>/, '')
end

*? captures the smallest possible match, whereas * captures the largest.

For example:

<A><B>

If you end up with the wrong expression, the whole string can be matched as a single tag instead of two separate ones.
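The difference is easy to check in plain Ruby. This sketch uses a literal `<` and a made-up sample string rather than the escaped form from the question, so treat the patterns as illustrations and not as the asker's exact regex.

```ruby
# Hypothetical sample with some text between two tags.
html = "<A>kept text<B>"

# Greedy: .* runs on to the LAST '>', swallowing the text in between.
greedy = html.gsub(/<.*>/, '')

# Lazy: .*? stops at the FIRST '>', so each tag is removed separately.
lazy = html.gsub(/<.*?>/, '')

# Negated class: [^>]* can never cross a '>', greedy or not.
negated      = html.gsub(/<[^>]*>/, '')
negated_lazy = html.gsub(/<[^>]*?>/, '')

puts greedy.inspect
puts lazy.inspect
puts negated.inspect
puts negated_lazy.inspect
```

With a negated character class the `?` makes no difference to the result, so greediness only becomes a concern with patterns built on `.*`.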

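Edit: I tried to reproduce this but had no luck.

In this example, which takes the text you pasted and sanitizes it, everything appears to work correctly.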

# app/controllers/example_controller.rb
class ExampleController < ApplicationController
  def index
    @text = '&Lt;! [If Gte Mso 9]>&Lt;Xml>  &Lt;Br /> &Lt;O:Office Document Settings>  &Lt;Br /> &Lt;O:Allow Png/>  &Lt;Br /> &Lt;/O:Office Document Settings>  &Lt;Br />&Lt;/Xml>&Lt;![Endif] >&Lt;! [If Gte Mso 9]>&Lt;Xml>  &Lt;Br /> &Lt;W:Word Document>  &Lt;Br /> &Lt;W:Zoom>0&Lt;/W:Zoom>  &Lt;Br /> &Lt;W:Track Moves>False&Lt;/W:Track Moves>  &Lt;Br /> &Lt;W:Track Formatting/>  &Lt;Br /> &Lt;W:Punctuation Kerning/>  &Lt;Br /> &Lt;W:Drawing Grid Horizontal Spacing>18 Pt&Lt;/W:Drawing Grid Horizontal Spacing>  &Lt;Br /> &Lt;W:Drawing Grid Vertical Spacing>18 Pt&Lt;/W:Drawing Grid Vertical Spacing>  &Lt;Br /> &Lt;W:Display Horizontal Drawing Grid Every>0&Lt;/W:Display Horizontal Drawing Grid Every>  &Lt;Br /> &Lt;W:Display Vertical Drawing Grid Every>0&Lt;/W:Display Vertical Drawing Grid Every>  &Lt;Br /> &Lt;W:Validate Against Schemas/>  &Lt;Br /> &Lt;W:Save If Xml Invalid>False&Lt;/W:Save If Xml Invalid>  &Lt;Br /> &Lt;W:Ignore Mixed Content>False&Lt;/W:Ignore Mixed Content>  &Lt;Br /> &Lt;W:Always Show Placeholder Text>False&Lt;/W:Always Show Placeholder Text>  &Lt;Br /> &Lt;W:Compatibility>  &Lt;Br /> &Lt;W:Break Wrapped Tables/>  &Lt;Br /> &Lt;W:Dont Grow Autofit/>  &Lt;Br /> &Lt;W:Dont Autofit Constrained Tables/>  &Lt;Br /> &Lt;W:Dont Vert Align In Txbx/>  &Lt;Br /> &Lt;/W:Compatibility>  &Lt;Br /> &Lt;/W:Word Document>  &Lt;Br />&Lt;/Xml>&Lt;![Endif] >&Lt;! [If Gte Mso 9]>&Lt;Xml>  &Lt;Br /> &Lt;W:Latent Styles Def Locked State="False" Latent Style Count="276">  &Lt;Br /> &Lt;/W:Latent Styles>  &Lt;Br />&Lt;/Xml>&Lt;![Endif] >  &Lt;! 
{Cke Protected}%3 C!%2 D%2 D%7 Bcke Protected%7 D%253 C!%252 D%252 D%257 Bcke Protected%257 D%25253 C!%25252 D%25252 D%25257 Bcke Protected%25257 D%2525253 C!%2525252 D%2525252 D%2525257 Bcke Protected%2525257 D%252525253 C!%252525252 D%252525252 D%252525257 Bcke Protected%252525257 D%25252525253 C!%25252525252 D%25252525252 D%25252525257 Bcke Protected%25252525257 D%2525252525253 C!%2525252525252 D%2525252525252 D%2525252525250 A%25252525252520%2525252525252 F*%25252525252520 Font%25252525252520 Definitions%25252525252520*%2525252525252 F%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Times%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%252525252525200%252525252525205%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%252525252525200%2525252525253 B%2525252525250 A%25252525252509mso Font Charset%2525252525253 A0%2525252525253 B%2525252525250 A%25252525252509mso Generic Font Family%2525252525253 Aauto%2525252525253 B%2525252525250 A%25252525252509mso Font Pitch%2525252525253 Avariable%2525252525253 B%2525252525250 A%25252525252509mso Font Signature%2525252525253 A3%252525252525200%252525252525200%252525252525200%252525252525201%252525252525200%2525252525253 B%2525252525257 D%2525252525250 A%25252525252540font Face%2525252525250 A%25252525252509%2525252525257 Bfont Family%2525252525253 A Verdana%2525252525253 B%2525252525250 A%25252525252509panose 1%2525252525253 A2%2525252525252011%252525252525206%252525252525204%2'
  end
end

# app/helpers/example_helper.rb
module ExampleHelper
  def sanitized_text(text)
    text.gsub(/&Lt;[^>]*>/, '')
  end
end

The view itself is just what you have:

<!-- app/views/example/index.html.erb -->
<body>
  <strong>Original</strong>
  <div>
    <%= sanitized_text(@text) %>
  </div>
  <strong>Truncated</strong>
  <div>
    <%= truncate(sanitized_text(@text), :length => 125) %>
  </div>
  <strong>Truncated With Deprecated Option</strong>
  <div>
    <%= truncate(sanitized_text(@text), 125) %>
  </div>
</body>

This was tested with WEBrick on OS X, using Ruby 1.8.7p174 and Rails 2.3.5.
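The order of the two operations can also be checked outside Rails. The sketch below is not a diagnosis of the original problem. `simple_truncate` is a hypothetical stand-in for the Rails `truncate` helper (it is not Rails's implementation), the sample string is made up, and the regex assumes the markup is stored with an escaped `&lt;` and a literal `>`.

```ruby
# Strip anything that looks like an escaped tag, e.g. "&lt;xml>".
# Assumption: '<' is stored as "&lt;" while '>' is stored literally.
def sanitized_text(text)
  text.gsub(/&lt;[^>]*>/i, '')
end

# Hypothetical stand-in for a truncate helper: the returned string,
# including the omission marker, is at most `length` characters long.
def simple_truncate(text, length, omission = '...')
  return text if text.length <= length
  text[0, length - omission.length] + omission
end

# Made-up sample: Word conditional markup followed by real content.
raw = '&lt;![if gte mso 9]>&lt;xml>&lt;w:TrackFormatting/>&lt;/xml>&lt;![endif]>' +
      'In December 1986, the Knowles opened Highlawn.'

# Sanitizing first leaves only real content to be cut.
clean_then_cut = simple_truncate(sanitized_text(raw), 20)

# Cutting first can leave a tag without its closing '>', which the
# sanitizer's pattern can no longer match.
cut_then_clean = sanitized_text(simple_truncate(raw, 20))

puts clean_then_cut
puts cut_then_clean
```

In this sketch, sanitizing before truncating yields the start of the real text, while truncating first leaves a fragment of the conditional comment behind.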

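【Comments】:

  • While what you say is true, the original expression should never be too greedy: it matches zero or more non-'>' characters followed by a '>', so it always terminates at the first '>'.
  • Ah, you're right on that count, because of that particular setup. People usually slap .* on as if it were going out of style and then wonder why their stuff doesn't work.
  • Thank you for your efforts. I really mean it. When I run the method as-is, it eliminates everything. Perhaps it's because of FD's point. At this point the only thing I can think of is to have CKEditor automatically clean all text whenever anything is pasted in.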