【Question title】: Invalid byte sequence using HTML sanitizer
【Posted】: 2025-12-06 01:00:01
【Question description】:

I'm using Rails' HTML::FullSanitizer in the Rails console and ran into this error:

h = HTML::FullSanitizer.new
html = "Something with invalid characters \x80 and tags ī."
h.sanitize html

ArgumentError: invalid byte sequence in UTF-8
from /Users/benaluan/.rbenv/versions/1.9.3-p385/lib/ruby/gems/1.9.1/gems/actionpack-3.2.12/lib/action_controller/vendor/html-scanner/html/sanitizer.rb:37:in `sanitize'

What I tried was re-encoding the html before sanitizing it:

html = html.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

It works, but it also removes the ī character. Has anyone run into the same problem?

【Discussion】:

    Tags: ruby-on-rails ruby encoding utf-8 html-sanitizing


    【Solution 1】:

    Read this article, which describes exactly your problem: http://www.spacevatican.org/2012/7/7/stripping-invalid-utf-8/

    The solution code from the article:

    html = html.force_encoding('UTF-8').
          encode('UTF-16', :invalid => :replace, :replace => '').
          encode('UTF-8')
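
    For readers on Ruby 2.1 or later (an assumption; the stack trace in the question shows Ruby 1.9.3, where this method does not exist), `String#scrub` may be a simpler way to get the same result. The sketch below also shows why transcoding from 'binary' dropped ī: every byte above 0x7F is undefined in that conversion, including the two valid bytes that make up ī. The helper name `strip_invalid_utf8` is made up for illustration and is not part of Rails or the linked article.

```ruby
# Sketch only: assumes Ruby 2.1+ (String#scrub) and a UTF-8 source file.
raw = "Something with invalid characters \x80 and tags ī."

# Hypothetical helper (not part of Rails or the linked article).
# Tags the bytes as UTF-8, then drops only the bytes that are not valid UTF-8.
def strip_invalid_utf8(str)
  str.dup.force_encoding(Encoding::UTF_8).scrub('')
end

clean = strip_invalid_utf8(raw)

# The approach from the question, for comparison: treating the input as
# 'binary' makes every byte above 0x7F undefined, so ī is removed as well.
lossy = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

puts clean.valid_encoding?   # checks that the stray \x80 byte is gone
puts clean.include?('ī')     # checks that ī survived the cleanup
puts lossy.include?('ī')     # checks whether ī survived the 'binary' transcode
```

    The cleaned string can then be passed to the sanitizer as before. On Ruby 1.9.3, the UTF-16 round trip shown above remains the option to use.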
    

    【Discussion】:

    • Cool! Thank you. Accepted.