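【Posted】: 2020-09-19 08:45:11
【Question】:
Say I have the 😈 (devil) emoji.
As 4-byte UTF-8 it is represented like this, with each byte written as a \u escape: \u00f0\u009f\u0098\u0088
In Java, however, it only prints correctly when written like this: \ud83d\ude08
How do I convert from the first form to the second?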
Update 2
MNEMO's answer is much simpler and answers my question, so it is better to use his solution.
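MNEMO's answer is not reproduced in this post, so the following is only a sketch of one simple approach, not necessarily his. It assumes the input string holds one UTF-8 byte per char (every char in the range 0x00-0xFF), re-encodes it as ISO-8859-1 to recover the raw bytes, and then decodes those bytes as UTF-8. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: repairs a string whose chars are really UTF-8 bytes.
class Utf8Redecode {
    // Assumes every char is in 0x00-0xFF; anything larger would become '?'.
    static String redecode(String misdecoded) {
        // ISO-8859-1 maps chars 0x00-0xFF one-to-one onto bytes 0x00-0xFF.
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        // Interpret the recovered bytes as UTF-8.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String misdecoded = "\u00f0\u009f\u0098\u0088";
        String fixed = redecode(misdecoded);
        System.out.println(fixed.equals("\ud83d\ude08")); // true
    }
}
```

Pure ASCII input passes through unchanged, because ASCII bytes mean the same thing in ISO-8859-1 and UTF-8.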
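Update
Thanks to Basil Bourque for the write-up. Very interesting.
I found a good reference here: https://github.com/pRizz/Unicode-Converter/blob/master/conversionfunctions.js (especially the convertUTF82Char() function).
For anyone who wanders here in the future, this is what it looks like in Java: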
public static String fromCharCode(int n) {
    // n is expected to be a single UTF-16 code unit (0x0000-0xFFFF)
    char c = (char) n;
    return Character.toString(c);
}

public static String decToChar(int n) {
    // Converts a Unicode code point to a String.
    // Code points above 0xFFFF are split into a UTF-16 surrogate pair.
    // n: int, the code point to be converted
    String result = "";
    if (n <= 0xFFFF) {
        result += fromCharCode(n);
    } else if (n <= 0x10FFFF) {
        n -= 0x10000;
        result += fromCharCode(0xD800 | (n >> 10)) + fromCharCode(0xDC00 | (n & 0x3FF));
    } else {
        result += "dec2char error: Code point out of range: " + decToHex(n);
    }
    return result;
}

public static String decToHex(int n) {
    return Integer.toHexString(n).toUpperCase();
}

public static String convertUTF8_toChar(String str) {
    // Converts to characters a sequence of whitespace-separated hex numbers
    // representing bytes in UTF-8, e.g. "f0 9f 98 88".
    // str: String, the sequence to be converted
    String outputString = "";
    int counter = 0; // continuation bytes still expected
    int n = 0;       // code point accumulated so far
    // remove leading and trailing whitespace
    str = str.trim();
    if (str.length() == 0) {
        return "";
    }
    // split on runs of whitespace (Java regexes take no /.../ delimiters)
    String[] listArray = str.split("\\s+");
    for (int i = 0; i < listArray.length; i++) {
        int b = Integer.parseInt(listArray[i], 16);
        switch (counter) {
            case 0:
                if (0 <= b && b <= 0x7F) { // 0xxxxxxx
                    outputString += decToChar(b);
                } else if (0xC0 <= b && b <= 0xDF) { // 110xxxxx
                    counter = 1;
                    n = b & 0x1F;
                } else if (0xE0 <= b && b <= 0xEF) { // 1110xxxx
                    counter = 2;
                    n = b & 0xF;
                } else if (0xF0 <= b && b <= 0xF7) { // 11110xxx
                    counter = 3;
                    n = b & 0x7;
                } else {
                    outputString += "convertUTF82Char: error1 " + decToHex(b) + "! ";
                }
                break;
            case 1: // last continuation byte: emit the finished code point
                if (b < 0x80 || b > 0xBF) {
                    outputString += "convertUTF82Char: error2 " + decToHex(b) + "! ";
                }
                counter--;
                outputString += decToChar((n << 6) | (b - 0x80));
                n = 0;
                break;
            case 2:
            case 3: // intermediate continuation byte: accumulate 6 more bits
                if (b < 0x80 || b > 0xBF) {
                    outputString += "convertUTF82Char: error3 " + decToHex(b) + "! ";
                }
                n = (n << 6) | (b - 0x80);
                counter--;
                break;
        }
    }
    // strip the trailing space left behind by an error message
    return outputString.replaceAll(" $", "");
}
It is almost a one-to-one copy, but it achieves my goal.
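The surrogate-pair arithmetic in decToChar() above can be cross-checked against the standard library. The sketch below (class and method names are hypothetical) repeats the manual split for the devil emoji's code point, U+1F608, and compares it with Character.toChars(), which performs the same conversion.

```java
import java.util.Arrays;

// Hypothetical demo class comparing the manual split with Character.toChars().
class SurrogateSketch {
    // Splits a supplementary code point (0x10000-0x10FFFF) into a surrogate pair.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException(
                "not a supplementary code point: " + Integer.toHexString(codePoint));
        }
        int offset = codePoint - 0x10000;              // 20 significant bits
        char high = (char) (0xD800 | (offset >> 10));  // top 10 bits
        char low = (char) (0xDC00 | (offset & 0x3FF)); // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        int devil = 0x1F608;
        char[] manual = toSurrogates(devil);
        char[] library = Character.toChars(devil);
        System.out.printf("%04x %04x%n", (int) manual[0], (int) manual[1]); // d83d de08
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

In new code, new String(Character.toChars(codePoint)) is probably the simpler choice, since it also handles code points at or below 0xFFFF.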
【Comments】:
-
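If you want to solve this problem, I recommend learning more about character encodings and the Unicode system. 4-byte UTF-8 is a sequence of bytes, not the Unicode code point itself.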
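To make the distinction in this comment concrete, here is a small sketch (the class and helper names are hypothetical) that prints the three layers for the same emoji: its single Unicode code point, its four UTF-8 bytes, and its two UTF-16 code units in a Java String.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one character seen at three different layers.
class EncodingLayers {
    // Formats bytes as lowercase two-digit hex values separated by spaces.
    static String hexBytes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String devil = new String(Character.toChars(0x1F608));
        // Code point: the abstract Unicode number.
        System.out.println(Integer.toHexString(devil.codePointAt(0))); // 1f608
        // UTF-8: four bytes when serialized.
        System.out.println(hexBytes(devil.getBytes(StandardCharsets.UTF_8))); // f0 9f 98 88
        // UTF-16: two chars (a surrogate pair) inside a Java String.
        System.out.println(devil.length()); // 2
        System.out.println(devil.codePointCount(0, devil.length())); // 1
    }
}
```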
Tags: java utf-8 byte emoji utf-16