Same strings, different encoding, but not equal in Ruby

【Question title】: Same strings, different encoding, but not equal in Ruby
【Posted】: 2023-08-03 16:11:01
【Question】:

Can anyone explain what is happening in this code?

s1 = "\x20".force_encoding 'UTF-8'
s2 = "\x20".force_encoding 'ASCII-8BIT'
puts "s1 == s2: #{s1 == s2}"

s3 = "\xAB".force_encoding 'UTF-8'
s4 = "\xAB".force_encoding 'ASCII-8BIT'
puts "s3 == s4: #{s3 == s4}"

In Ruby 2.0.0p353 this prints:

s1 == s2: true
s3 == s4: false

I don't understand why s3 and s4 are not equal when s1 and s2 are. 0xAB is the ASCII code for '½', and as far as I know it can be represented in both ASCII-8BIT and UTF-8.

【Comments】:

0xAB is also not ½ as a UTF-8 character code. I found this: "\xAB".force_encoding('CP850').encode('UTF-8') gives ½. See en.wikipedia.org/wiki/Code_page_850. Some other MS-DOS-based extensions may have this mapping too.
I don't know where you got the idea that this is the ASCII code for 1/2. It is actually the left-pointing double angle quotation mark (left-pointing guillemet). Did you mean \xBD?
Thanks @NeilSlater, that makes sense!
0xAB is not ASCII, and [0xAB] is not a valid UTF-8 string.

Tags: ruby utf-8 character-encoding ascii-8bit

【Solution 1】:

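\xAB in UTF-8 is not the same thing as \xAB in an ASCII-8BIT code page. UTF-8 is a multibyte encoding: bytes from \x80 to \xFF only ever appear as parts of multibyte sequences that encode code points from U+0080 upward, so a lone \xAB is not a valid UTF-8 string. In practice, String#== only treats two strings with different encodings as equal when their bytes match and both contain nothing but 7-bit ASCII, which is why s1 == s2 is true (\x20 is plain ASCII) while s3 == s4 is false.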
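To make that concrete, here is a minimal sketch that inspects the four strings from the question. The helper name encoding_facts is hypothetical and only used for illustration; String#b (available since Ruby 2.0) is used as a shorthand for a copy tagged ASCII-8BIT.

```ruby
# Hypothetical helper (not from the original post): collects the string
# properties that matter when comparing across encodings.
def encoding_facts(str)
  {
    bytes: str.bytes,
    ascii_only: str.ascii_only?,
    valid: str.valid_encoding?
  }
end

s1 = "\x20".dup.force_encoding(Encoding::UTF_8)
s2 = "\x20".b   # copy of the string tagged ASCII-8BIT
s3 = "\xAB".dup.force_encoding(Encoding::UTF_8)
s4 = "\xAB".b

[s1, s2, s3, s4].each { |s| p encoding_facts(s) }

puts "s1 == s2: #{s1 == s2}"                          # ASCII-only on both sides
puts "s3 == s4: #{s3 == s4}"                          # non-ASCII, encodings differ
puts "s3.bytes == s4.bytes: #{s3.bytes == s4.bytes}"  # raw bytes are identical
puts "s3.b == s4: #{s3.b == s4}"                      # same encoding tag on both sides
```

If the goal is to compare raw bytes regardless of the encoding tag, comparing str.bytes or str.b on both sides avoids the encoding check altogether.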

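ASCII-8BIT, on the other hand, is not a specific character encoding. It can be seen as a class of ASCII-based encodings, and Ruby aliases it as the BINARY encoding. Bytes from \x80 to \xFF have no defined characters in it, so they cannot be transcoded to any other encoding. In that sense it is an abstraction over ASCII-based code pages.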

So if you try to convert from ASCII-8BIT to UTF-8, you get a conversion exception:

Encoding::UndefinedConversionError: "\xC9" from ASCII-8BIT to UTF-8
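The error can be reproduced with a sketch like the following. The byte \xC9 is taken from the message above; the variable names are illustrative. The second half shows one possible way to avoid the exception, by asking encode to substitute a replacement character, which loses the original byte.

```ruby
binary = "\xC9".b   # a single non-ASCII byte tagged ASCII-8BIT

begin
  binary.encode(Encoding::UTF_8)
  raised = nil
rescue Encoding::UndefinedConversionError => e
  raised = e
  puts "#{e.class}: #{e.message}"
end

# ASCII-only binary data transcodes without trouble.
ascii = "abc".b.encode(Encoding::UTF_8)
puts ascii

# Lossy alternative: replace bytes that have no defined conversion.
replaced = binary.encode(Encoding::UTF_8, undef: :replace)
p replaced   # the replacement is U+FFFD for Unicode targets
```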

However, you can handle the ½ symbol correctly in an 8-bit encoding by explicitly specifying the ISO-8859-1 or CP1252 code page and using the byte \xBD, like this:

"\xBD".force_encoding('ISO-8859-1').encode('UTF-8')
# => "½"
"\xBD".force_encoding('CP1252').encode('UTF-8')
# => "½"
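Building on this, one way to compare strings that arrive in different encodings is to transcode them to a common encoding first. The sketch below assumes the 8-bit data is known to be ISO-8859-1 (or CP850 for the last line, as in the comment thread); that knowledge has to come from outside, since the bytes alone do not say which code page they belong to.

```ruby
latin1_half = "\xBD".dup.force_encoding(Encoding::ISO_8859_1)
utf8_half   = "\u00BD"   # "½" in UTF-8

puts latin1_half == utf8_half                           # false: encodings and bytes differ
puts latin1_half.encode(Encoding::UTF_8) == utf8_half   # true once both are UTF-8

# The byte from the question, 0xAB, maps to different characters
# depending on which code page is assumed.
as_latin1 = "\xAB".dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
as_cp850  = "\xAB".dup.force_encoding('CP850').encode(Encoding::UTF_8)

puts as_latin1 == "\u00AB"   # left-pointing guillemet in ISO-8859-1
puts as_cp850 == "\u00BD"    # "½" in code page 850
```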

【Comments】: