Convert "php unicode" to character: answers

【Question title】: Convert "php unicode" to character
【Posted】: 2011-05-14 19:06:41
【Question】:

How can I convert so-called "php unicode" (link to php unicode) to normal characters in Java? Example: \xEF\xBC\xA1 -> A. Is there a built-in method in the JDK, or should I use a regex for this conversion?

【Question comments】:

Is your input in string form (\xNN) or in binary form?

Tags: java php

【Answer 1】:

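You first need to get the bytes from the string into a byte array without changing them, and then decode that byte array as a UTF-8 string.

The simplest way to get the string into a byte array is to encode it with ISO-8859-1, which maps every character with a Unicode value below 256 to a byte with the same value (or its negative equivalent, since Java bytes are signed):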

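import java.nio.charset.StandardCharsets;

String phpUnicode = "\u00EF\u00BC\u00A1";
// ISO-8859-1 maps each char below U+0100 to the byte with the same ordinal value
byte[] bytes = phpUnicode.getBytes(StandardCharsets.ISO_8859_1);
String javaString = new String(bytes, StandardCharsets.UTF_8);
System.out.println(javaString);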
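
The round trip above can be sketched as a self-contained class. This is only an illustration: the class name `Latin1RoundTrip` and the helper `reinterpretAsUtf8` are made-up names, and the sketch assumes every character in the input is below U+0100, as in the example string.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {
    // Hypothetical helper: turns chars below U+0100 into raw bytes,
    // then decodes those bytes as UTF-8.
    static String reinterpretAsUtf8(String latin1Chars) {
        byte[] raw = latin1Chars.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String phpUnicode = "\u00EF\u00BC\u00A1";
        byte[] raw = phpUnicode.getBytes(StandardCharsets.ISO_8859_1);
        // Java bytes are signed, so 0xEF prints as -17; masking with 0xFF
        // recovers the unsigned value.
        for (byte b : raw) {
            System.out.printf("%d -> 0x%02X%n", b, b & 0xFF);
        }
        String decoded = reinterpretAsUtf8(phpUnicode);
        // Three input chars collapse into one decoded char.
        System.out.printf("U+%04X%n", (int) decoded.charAt(0));
    }
}
```

Running `main` prints `-17 -> 0xEF`, `-68 -> 0xBC`, `-95 -> 0xA1` and then `U+FF21`, which shows both the "negative equivalent" mapping and the decoded code point.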

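Edit
The above converts UTF-8 to Unicode characters. If you want to convert the result to a reasonable ASCII equivalent, there is no standard way to do that; but see this question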

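Edit
I had assumed your string contained characters with the same ordinal values as the UTF-8 sequence, but you indicate that your string actually contains escape sequences, as in: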

String phpUnicode = "\\xEF\\xBC\\xA1";

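The JDK has no built-in method for converting a string like this, so you will need to roll your own with a regex. Since we ultimately want to turn a UTF-8 byte sequence into a string, we need to build up a byte array, using perhaps: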

Pattern oneChar = Pattern.compile("\\\\x([0-9A-F]{2})|(.)", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher matcher = oneChar.matcher(phpUnicode);
ByteArrayOutputStream bytes = new ByteArrayOutputStream();

while (matcher.find()) {
    int ch;
    if (matcher.group(1) == null) {
        ch = matcher.group(2).charAt(0);
    }
    else {
        ch = Integer.parseInt(matcher.group(1), 16);
    }
    bytes.write((int) ch);
}
String javaString = new String(bytes.toByteArray(), "UTF-8");
System.out.println(javaString);

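This produces a UTF-8 stream by converting the \xAB sequences, and that UTF-8 stream is then converted to a Java string. It is important to note that any character that is not part of an escape sequence is converted to a byte equal to the low 8 bits of its Unicode value. This works for ASCII, but can cause transcoding problems for non-ASCII characters.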
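
One possible way around the low-8-bits caveat is to encode the literal (non-escaped) text as UTF-8 before writing it to the byte stream. The sketch below is a variation on the code above, not the original answer's method; the class name `PhpEscapeDecoder` and its methods are hypothetical, and it assumes that only `\xNN` escapes need handling.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PhpEscapeDecoder {
    // Matches one \xNN escape and captures the two hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Matcher m = ESCAPE.matcher(input);
        int last = 0;
        while (m.find()) {
            // Text between escapes is written as its UTF-8 encoding,
            // so non-ASCII literals are not truncated to 8 bits.
            writeLiteral(bytes, input.substring(last, m.start()));
            bytes.write(Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        writeLiteral(bytes, input.substring(last));
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    private static void writeLiteral(ByteArrayOutputStream out, String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        out.write(utf8, 0, utf8.length);
    }

    public static void main(String[] args) {
        String decoded = decode("caf\u00E9 \\xEF\\xBC\\xA1");
        for (char ch : decoded.toCharArray()) {
            System.out.printf("U+%04X%n", (int) ch);
        }
    }
}
```

With this approach a literal character such as U+00E9 survives unchanged, while the escaped bytes are still decoded together as one UTF-8 sequence.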

@McDowell:
The sequence:

String phpUnicode = "\u00EF\u00BC\u00A1";
byte[] bytes = phpUnicode.getBytes("ISO-8859-1");

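creates a byte array containing as many bytes as the original string has characters, and for every character with a Unicode value below 256, the same numeric value is stored in the byte array.

The character FULLWIDTH LATIN CAPITAL LETTER A (U+FF21) is not present in the original string, so the fact that it is not in ISO-8859-1 is irrelevant.

I know that transcoding errors can occur when converting characters to bytes, which is why I said ISO-8859-1 will only "map every character with a Unicode value below 256 to a byte with the same value".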

【Comments】:

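Nice, but I need to convert a \xNN\xNN string to a Unicode string. I wrote a regex that captures the NN characters, but how do I create a Unicode string from NN? E.g. I have NN and I need "\u0NN" (string concatenation doesn't work here)
Java strings are UTF-16; trying to represent UTF-8 in them ("\u00EF\u00BC\u00A1") will only lead to transcoding errors. In any case, the character FULLWIDTH LATIN CAPITAL LETTER A does not exist in ISO-8859-1.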

【Answer 2】:

The character in question is U+FF21 (FULLWIDTH LATIN CAPITAL LETTER A). The PHP form (\xEF\xBC\xA1) is a UTF-8 encoded octet sequence.
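
To check the relationship between U+FF21 and the three octets, one can go in the opposite direction and encode the character to UTF-8. The following is a small sketch; the class name `FullwidthAEncoding` and the helper `toPhpEscapes` are illustrative names only.

```java
import java.nio.charset.StandardCharsets;

public class FullwidthAEncoding {
    // Hypothetical helper: renders each UTF-8 byte of the text as a \xNN escape.
    static String toPhpEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF because Java bytes are signed.
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toPhpEscapes("\uFF21"));
    }
}
```

Running `main` prints `\xEF\xBC\xA1`, matching the PHP form in the question.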

To decode this sequence into a Java string (which is always UTF-16), you can use the following code:

// \xEF\xBC\xA1
byte[] utf8 = { (byte) 0xEF, (byte) 0xBC, (byte) 0xA1 };
String utf16 = new String(utf8, Charset.forName("UTF-8"));

// print the char as hex   
for(char ch : utf16.toCharArray()) {
    System.out.format("%02x%n", (int) ch);
}

If you want to decode the data from a string literal, you can use code of this form:

public static void main(String[] args) {
  String utf16 = transformString("This is \\xEF\\xBC\\xA1 string");
  for (char ch : utf16.toCharArray()) {
    System.out.format("%s %02x%n", ch, (int) ch);
  }
}

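// \p{XDigit} accepts only hex digits, so Integer.parseInt below cannot fail on input such as \xZZ
private static final Pattern SEQ 
                           = Pattern.compile("(\\\\x\\p{XDigit}\\p{XDigit})+");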

private static String transformString(String encoded) {
  StringBuilder decoded = new StringBuilder();
  Matcher matcher = SEQ.matcher(encoded);
  int last = 0;
  while (matcher.find()) {
    decoded.append(encoded.substring(last, matcher.start()));
    byte[] utf8 = toByteArray(encoded.substring(matcher.start(), matcher.end()));
    decoded.append(new String(utf8, Charset.forName("UTF-8")));
    last = matcher.end();
  }
  return decoded.append(encoded.substring(last, encoded.length())).toString();
}

private static byte[] toByteArray(String hexSequence) {
  byte[] utf8 = new byte[hexSequence.length() / 4];
  for (int i = 0; i < utf8.length; i++) {
    int offset = i * 4;
    String hex = hexSequence.substring(offset + 2, offset + 4);
    utf8[i] = (byte) Integer.parseInt(hex, 16);
  }
  return utf8;
}
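
Note that `new String(bytes, charset)` silently substitutes U+FFFD for malformed input, so a truncated escape sequence would go unnoticed. If stricter behaviour is wanted, a `CharsetDecoder` configured to report errors is one option. The sketch below is an optional variation, not part of the answer above; `StrictUtf8` and `decodeStrict` are hypothetical names, and wrapping the error in `IllegalArgumentException` is just one possible choice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Decodes UTF-8 bytes and fails on malformed input instead of
    // substituting the replacement character.
    static String decodeStrict(byte[] utf8) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf8))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] complete = { (byte) 0xEF, (byte) 0xBC, (byte) 0xA1 };
        System.out.printf("U+%04X%n", (int) decodeStrict(complete).charAt(0));

        byte[] truncated = { (byte) 0xEF, (byte) 0xBC };
        try {
            decodeStrict(truncated);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

In `transformString`, such a helper could replace the `new String(utf8, Charset.forName("UTF-8"))` call if malformed sequences should raise an error rather than decode to U+FFFD.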

【Comments】: