Creating a Unicode character from its number: answers

【Question title】: Creating Unicode character from its number
【Posted】: 2011-07-31 22:49:55
【Question】:

I'd like to display a Unicode character in Java. If I do this, it works just fine:

String symbol = "\u2202";

symbol is equal to "∂". That's what I want.

The problem is that I know the Unicode number and need to create the Unicode symbol from it. I tried (to me) the obvious thing:

int c = 2202;
String symbol =  "\\u" + c;

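However, in this case, symbol is equal to "\u2202". That's not what I want.

How can I construct the symbol if I know its Unicode number (but only at run time; I can't hard-code it as in the first example)?

【Comments】:

Remove the first backslash so that it escapes the Unicode sequence instead of escaping the backslash. Using "\\" tells Java that you want to print "\", rather than use it as the start of a Unicode escape sequence. If you remove the first one, it will escape the Unicode sequence rather than the second backslash. At least, it will as far as I know.
You can simply convert the int to a char by casting: char ch = (char)c;. You can create a string like this: String symbol = "" + (char)c;. This kind of cast should be the simplest approach when adding the character to an existing string. Example: String text = "You typed the following character: " + (char)c;

Tags: java string unicode character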

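【Solution 1】:

If you want to get a UTF-16 encoded code unit as a char, you can parse the integer and cast it, as others have suggested.

If you want to support all code points, use Character.toChars(int). This handles the cases where a code point cannot fit in a single char value.

The documentation says:

Converts the specified character (Unicode code point) to its UTF-16 representation stored in a char array. If the specified code point is a BMP (Basic Multilingual Plane or Plane 0) value, the resulting char array has the same value as codePoint. If the specified code point is a supplementary code point, the resulting char array has the corresponding surrogate pair.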
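As a minimal sketch of the behaviour described in the documentation quoted above, the following compares a BMP code point with a supplementary one. The class name ToCharsDemo and the helper fromCodePoint are made up for illustration; only Character.toChars(int) and the String constructor come from the JDK.

```java
public class ToCharsDemo {
    // Hypothetical helper: builds a String from any valid Unicode code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String bmp = fromCodePoint(0x2202);            // inside the BMP: one char
        String supplementary = fromCodePoint(0x1F495); // outside the BMP: surrogate pair

        System.out.println(bmp.length());            // 1
        System.out.println(supplementary.length());  // 2
        System.out.println(Character.isHighSurrogate(supplementary.charAt(0))); // true
    }
}
```

The String length counts UTF-16 code units, which is why the supplementary character reports 2.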

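【Discussion】:

While this is a more general solution, and in many cases you should use it over the accepted answer, the accepted answer is a closer fit to the specific problem Paul asked about.
First, thanks! In Scala, I still can't parse characters that are bigger than a char. scala> "👨‍🎨".map(_.toInt).flatMap((i: Int) => Character.toChars(i)).map(_.toHexString) gives res11: scala.collection.immutable.IndexedSeq[String] = Vector(f468, 200d, f3a8). This emoji, "man singer", is addressed with the three code points U+1f468, U+200d and U+1f3a8. The most significant digit is lost. I can add it back with a bitwise OR (stackoverflow.com/a/2220476/1007926), but I don't know how to determine which of the parsed characters have been truncated. Thanks!
@JochemKuijpers I do not agree that "the accepted answer is a closer fit to the specific problem". The OP explicitly asked "How can I construct the symbol if I know its Unicode number...?", and the accepted answer cannot work if the "Unicode number" is outside the BMP. For example, the accepted answer fails for the valid code point 0x1040C because it is in the SMP. It's a poor answer and should be corrected or removed.
@skomisa The OP's scenario is limited to the representation of hexadecimal Unicode escape sequences. If you have a character that should be encoded as a surrogate pair, then this is reflected in those escape sequences, so it still works in the end. As I said, this is a more general solution and you should use it.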
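The truncation described in the Scala comment above comes from mapping over UTF-16 code units (char values) rather than code points. A sketch of the same check in Java, assuming Java 8 or later for String.codePoints(); the class name CodePointsDemo and the method hexCodePoints are illustrative only:

```java
import java.util.List;
import java.util.stream.Collectors;

public class CodePointsDemo {
    // Returns the code points of s as lower-case hex strings.
    static List<String> hexCodePoints(String s) {
        return s.codePoints()
                .mapToObj(Integer::toHexString)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Build the three-code-point sequence from the comment without
        // depending on the encoding of the source file.
        String emoji = new StringBuilder()
                .appendCodePoint(0x1F468)
                .appendCodePoint(0x200D)
                .appendCodePoint(0x1F3A8)
                .toString();

        System.out.println(emoji.length());        // 5 (UTF-16 code units)
        System.out.println(hexCodePoints(emoji));  // [1f468, 200d, 1f3a8]
    }
}
```

Iterating by code point keeps the leading digit because each surrogate pair is combined before it is converted to hex.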

【Solution 2】:

Just cast your int to a char. You can convert that to a String using Character.toString():

String s = Character.toString((char)c);

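EDIT:

Just remember that the escape sequences in Java source code (the \u bits) are in HEX, so if you're trying to reproduce an escape sequence, you'll need something like int c = 0x2202.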
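To make the hexadecimal point concrete, here is a small sketch contrasting the decimal literal 2202 with 0x2202, and showing how digits that arrive as text at run time can be parsed with radix 16. The class name HexVersusDecimal and the helper fromHexDigits are assumptions for illustration.

```java
public class HexVersusDecimal {
    // Hypothetical helper: reads the digits as hexadecimal, the way a
    // Unicode escape in Java source would, and returns the character.
    static String fromHexDigits(String hexDigits) {
        int codePoint = Integer.parseInt(hexDigits, 16);
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int decimal = 2202;   // equals 0x089A, a different code point
        int hex = 0x2202;     // equals 8706, the partial differential sign

        System.out.println(Integer.toHexString(decimal)); // 89a
        System.out.println(hex);                          // 8706
        System.out.println(fromHexDigits("2202").equals(String.valueOf((char) hex))); // true
    }
}
```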

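【Discussion】:

This is just giving me a square box, ࢚. It's not giving me "∂".
Danger, Will Robinson! Don't forget that Unicode code points will not necessarily fit in a char. So you need to be absolutely sure ahead of time that your value of c is smaller than 0x10000, or else this approach will break horribly.
@NickHartley Sorry, I don't follow --- did you misread 0x10000 as 10000?
That's why I said 'below'! I need to emphasise that although Java chars only go up to 0xffff, Unicode code points go up to 0x10ffff. The Unicode standard was changed after Java was designed. These days Java chars technically hold UTF-16 code units, not Unicode code points, and forgetting this will cause horrible breakage when your application encounters an exotic script.
@DavidGiven Thanks for "Java chars go up to 0xFFFF". I didn't know that.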
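The failure mode warned about in this discussion can be sketched as follows: a cast to char silently keeps only the low 16 bits. The class name CastTruncationDemo and the helper safeToString are hypothetical; Character.isBmpCodePoint is available from Java 7.

```java
public class CastTruncationDemo {
    // Hypothetical helper: casts only when the value fits in one char,
    // and falls back to Character.toChars otherwise.
    static String safeToString(int codePoint) {
        if (Character.isBmpCodePoint(codePoint)) {
            return String.valueOf((char) codePoint);
        }
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int codePoint = 0x1F495;
        char truncated = (char) codePoint; // keeps only the low 16 bits

        System.out.println(Integer.toHexString(truncated)); // f495
        System.out.println(Integer.toHexString(safeToString(codePoint).codePointAt(0))); // 1f495
    }
}
```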

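【Solution 3】:

The other answers here either only support Unicode up to U+FFFF (the answers dealing with just one instance of char) or don't tell you how to get to the actual symbol (the answers that stop at Character.toChars() or use an incorrect method after that), so I'm adding my answer here, too.

To support supplementary code points as well, this is what needs to be done: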

// this character:
// http://www.isthisthingon.org/unicode/index.php?page=1F&subpage=4&glyph=1F495
// using code points here, not U+n notation
// for equivalence with U+n, below would be 0xnnnn
int codePoint = 128149;
// converting to char[] pair
char[] charPair = Character.toChars(codePoint);
// and to String, containing the character we want
String symbol = new String(charPair);

// we now have symbol with the desired character as the first item
// confirm that we indeed have character with code point 128149
System.out.println("First code point: " + symbol.codePointAt(0));

I also did a quick test of which conversion methods work and which don't:

int codePoint = 128149;
char[] charPair = Character.toChars(codePoint);

System.out.println(new String(charPair, 0, 2).codePointAt(0)); // 128149, worked
System.out.println(charPair.toString().codePointAt(0));        // 91, didn't work
System.out.println(new String(charPair).codePointAt(0));       // 128149, worked
System.out.println(String.valueOf(codePoint).codePointAt(0));  // 49, didn't work
System.out.println(new String(new int[] {codePoint}, 0, 1).codePointAt(0));
                                                               // 128149, worked

--

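Note: as @Axel mentioned in the comments, with Java 11 there is Character.toString(int codePoint), which is arguably best suited for the job.

【Discussion】:

Why doesn't it work as a one-liner? new String(Character.toChars(121849)); breaks in the Eclipse console, but the three-line version works.
@Noumenon I can't reproduce the issue; it works just as well for me.
Kudos for going further. For the str4 assignment, shouldn't code be codePoint?
@skomisa Yes. Fixed.
Since Java 11 we now have Character.toString(int codePoint).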

【Solution 4】:

This one worked fine for me.

  String cc2 = "2202";
  String text2 = String.valueOf(Character.toChars(Integer.parseInt(cc2, 16)));

Now text2 will contain ∂.

【Discussion】:

【Solution 5】:

Remember that char is an integral type, and thus can be given an integer value as well as a char constant.

char c = 0x2202;//aka 8706 in decimal. \u codepoints are in hex.
String s = String.valueOf(c);

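【Discussion】:

This is just giving me a square box, ࢚. It's not giving me "∂".
That's because 2202 isn't the int you were looking for. You were looking for 0x2202. My fault. In any case, if you have the int of the code point you're looking for, you can just cast it to a char and use it (to construct a String if you wish).

【Solution 6】:

String st = "2202";
int cp = Integer.parseInt(st, 16); // parse st as a hexadecimal number
char[] c = Character.toChars(cp);  // one char, or a surrogate pair
System.out.println(c);             // displays the character corresponding to '\u2202'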

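【Discussion】:

While this post may answer the question, it needs an explanation of what you are doing, to improve the quality and readability of the answer.
Thanks, it really helped me! It works fine and is easier than the other solutions here (really, Java people love to overcomplicate things).

【Solution 7】:

Although this is an old question, there is a very easy way to do this in Java 11, which was released today: you can use a new overload of Character.toString():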

public static String toString(int codePoint)

Returns a String object representing the specified character (Unicode code point). The result is a string of length 1 or 2, consisting solely of the specified codePoint.

Parameters:
codePoint - the codePoint to be converted

Returns:
the string representation of the specified codePoint

Throws:
IllegalArgumentException - if the specified codePoint is not a valid Unicode code point.

Since:
11

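Since this method supports any Unicode code point, the length of the returned String is not necessarily 1.

The code needed for the example given in the question is simply: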

    int codePoint = '\u2202';
    String s = Character.toString(codePoint); // <<< Requires JDK 11 !!!
    System.out.println(s); // Prints ∂

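There are several advantages to this approach:

It works for any Unicode code point rather than just those that can be handled using a char.
It's concise, and it's easy to understand what the code is doing.
It returns the value as a String rather than a char[], which is often what you want. The answer posted by McDowell is appropriate if you want the code point returned as a char[].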
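A short sketch of the three cases the quoted Javadoc describes: a result of length 1, a result of length 2, and an invalid code point. It assumes JDK 11 or later; the class name ToStringCodePointDemo and the helper toStringOrNull are invented for this example.

```java
public class ToStringCodePointDemo {
    // Hypothetical helper: returns null instead of throwing for invalid input.
    static String toStringOrNull(int codePoint) {
        try {
            return Character.toString(codePoint); // requires JDK 11 or later
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toStringOrNull(0x2202).length());   // 1
        System.out.println(toStringOrNull(0x1F495).length());  // 2
        System.out.println(toStringOrNull(0x110000));          // null
    }
}
```

On older JDKs, new String(Character.toChars(codePoint)) produces the same String.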

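【Discussion】:

A little additional clarification on this, since this answer did not make it immediately obvious to me how to create the codePoint variable. The syntax here should be: int codePoint = 0x2202; Then: String s = Character.toString(codePoint); // <<< Requires JDK 11 !!! Or as a one-liner: System.out.println(Character.toString(0x2202)); // Prints ∂ Hope this helps someone else use this feature of JDK 11.

【Solution 8】:

This is how you do it: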

int cc = 2202; // the decimal digits "2202" are re-read as hexadecimal below
char ccc = (char) Integer.parseInt(String.valueOf(cc), 16); // 0x2202
final String text = String.valueOf(ccc);

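This solution is by Arne Vajhøj.

【Discussion】:

Are you saying that this works? If so, it's because you're reinterpreting two thousand, two hundred and two as 0x2202, which, of course, is not the same thing at all.
Oh, but hang on! Unicode values (the \u escape sequences in Java source) ARE hex! So this is right. You've just misled everyone by saying int c = 2202, which is wrong! A better solution than this is to simply say int c = 0x2202, which saves you from going via a String, etc.
+1 @dty: There's absolutely no call for the middle char ccc... line. Just use int cc = 0x2202; and then final String text=String.valueOf(cc);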
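One caveat about the last comment above: String.valueOf(int) formats the number as decimal digits rather than producing the character, so a cast (or Character.toChars) is still needed. A small sketch, with the class name ValueOfPitfall chosen only for illustration:

```java
public class ValueOfPitfall {
    public static void main(String[] args) {
        int cc = 0x2202;

        String digits = String.valueOf(cc);         // decimal text of the number
        String symbol = String.valueOf((char) cc);  // the character itself

        System.out.println(digits);                           // 8706
        System.out.println(symbol.length());                  // 1
        System.out.println(symbol.codePointAt(0) == 0x2202);  // true
    }
}
```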

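【Solution 9】:

The code below writes the 4 Unicode characters (given as decimals) for the word "be" in Japanese. Yes, the verb "be" in Japanese has 4 characters! The character values are in decimal and have been read into a String[] array, using split for instance. If you have octal or hex, parseInt takes a radix as well.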

// pseudo code
// 1. init the String[] containing the 4 code points in decimal :: intsInStrs
// 2. allocate room for up to two chars per code point :: c2s
// 3. using Integer.parseInt (with a radix or not) get the right int value
// 4. write it at the next free position in the char array
// 5. convert the filled part of c2s[] to a String
// 6. print

String[] intsInStrs = {"12354", "12426", "12414", "12377"}; // 1.
char[] c2s = new char[intsInStrs.length * 2];  // 2. a code point needs at most 2 chars

int len = 0; // number of chars written so far
for (String intString : intsInStrs) {
    // 3 + 4. toChars returns how many chars it wrote: 1 for BMP, 2 for supplementary
    len += Character.toChars(Integer.parseInt(intString), c2s, len);
}

String symbols = new String(c2s, 0, len);  // 5. skip the unused tail of the array
System.out.println("\nLooooonger code point: " + symbols); // 6.
// I tested it in Eclipse and Java 7 and it works.  Enjoy

【Discussion】:

【Solution 10】:

Here is a block that prints the Unicode characters from \u00c0 to \u00ff:

char[] ca = {'\u00c0'};
for (int i = 0; i < 4; i++) {
    for (int j = 0; j < 16; j++) {
        String sc = new String(ca);
        System.out.print(sc + " ");
        ca[0]++;
    }
    System.out.println();
}

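【Discussion】:

【Solution 11】:

Unfortunately, removing one backslash as mentioned in the first comment (newbiedoodle) doesn't lead to a good result. Most (if not all) IDEs report a syntax error. The reason is that the Java escaped Unicode format expects the syntax "\uXXXX", where XXXX are 4 hexadecimal digits, and they are mandatory. Attempts to assemble this string from pieces fail. Of course, "\u" is not the same as "\\u". The first syntax means an escaped 'u'; the second means an escaped backslash (which is a backslash) followed by 'u'.

It is strange that the Apache pages present a utility that does exactly this, but in reality it is an escape mimic utility. Apache has some utilities of its own (I didn't test them) that do this work for you. It may still not be what you want: Apache Escape Unicode utilities. But this utility 1 offers a good approach to the solution, in combination with the approach described above (MeraNaamJoker).

My solution is to create this escaped mimic string and then convert it back to Unicode (to avoid the restriction of the real escaped Unicode format). I used it for copying text, so in the uencode method it may be better to use '\\u' rather than '\\\\u'. Try it.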

  /**
   * Converts a character to the mimic unicode format, i.e. '\\u0020'.
   * 
   * This format is the Java source code format.
   * 
   *   CharUtils.unicodeEscaped(' ') = "\\u0020"
   *   CharUtils.unicodeEscaped('A') = "\\u0041"
   * 
   * @param ch  the character to convert
   * @return the character as a mimic escaped unicode string
   */
  public static String unicodeEscaped(char ch) {
    // a local variable cannot be declared static, so only final is used here
    final String charEsc = "\\u";

    // zero-pad the hexadecimal value to exactly 4 digits
    return charEsc + String.format("%04x", (int) ch);
  }

  /**
   * Converts the string from UTF8 to the mimic unicode format, i.e. '\\u0020'.
   * Notice: I cannot use the real unicode format, because it is translated to
   * the character immediately, at compile time and when the editor (i.e. NetBeans)
   * checks it. Instead of the real unicode format I am using the mimic unicode
   * format '\\u0020' as a string, but it doesn't give the same results, of course.
   * 
   * This format is the Java source code format.
   * 
   *   CharUtils.unicodeEscaped(' ') = "\\u0020"
   *   CharUtils.unicodeEscaped('A') = "\\u0041"
   * 
   * @param nationalString  the UTF8 string to convert
   * @return the string in Java mimic escaped unicode format
   */
  public String encodeStr(String nationalString) {
    StringBuilder convertedString = new StringBuilder();

    // escape every UTF-16 code unit of the input in turn
    for (int i = 0; i < nationalString.length(); i++) {
      convertedString.append(unicodeEscaped(nationalString.charAt(i)));
    }
    return convertedString.toString();
  }

  /**
   * Converts the string from the mimic unicode format, i.e. '\\u0020', back to UTF8.
   * 
   * This format is the Java source code format.
   * 
   *   CharUtils.unicodeEscaped(' ') = "\\u0020"
   *   CharUtils.unicodeEscaped('A') = "\\u0041"
   * 
   * @param escapedString  the string in Java mimic escaped unicode format
   * @return the decoded UTF8 string
   */
  public String uencodeStr(String escapedString) {
    StringBuilder convertedString = new StringBuilder();

    // the regex "\\\\u" matches a literal backslash followed by 'u'
    String[] arrStr = escapedString.split("\\\\u");
    // index 0 holds whatever precedes the first escape, so start at 1
    for (int i = 1; i < arrStr.length; i++) {
      String str = arrStr[i];
      if (!str.isEmpty()) {
        // every piece must consist of hex digits only
        int codePoint = Integer.parseInt(str, 16);
        convertedString.append(Character.toChars(codePoint));
      }
    }
    return convertedString.toString();
  }
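The decoder above expects its input to consist of escapes only; a piece such as "2202x" would make Integer.parseInt throw a NumberFormatException. As a hedged alternative, the sketch below decodes escapes embedded in ordinary text with a regular expression. The class name UnescapeSketch and the method unescape are illustrative and not part of the answer, and each escape is assumed to be exactly four hex digits, i.e. one UTF-16 code unit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeSketch {
    // A literal backslash, then 'u', then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Hypothetical helper: replaces every mimic escape with its character
    // and copies all other text through unchanged.
    static String unescape(String text) {
        Matcher m = ESCAPE.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(text, last, m.start());                    // text before the escape
            out.append((char) Integer.parseInt(m.group(1), 16));  // the decoded code unit
            last = m.end();
        }
        out.append(text, last, text.length());                    // trailing text
        return out.toString();
    }

    public static void main(String[] args) {
        String escaped = "d/dx is written \\u2202/\\u2202x";
        System.out.println(unescape(escaped));
    }
}
```

Because surrogate halves are decoded one code unit at a time and appended in order, a supplementary character written as two consecutive escapes is reassembled as well.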

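【Discussion】:

【Solution 12】:

char c=(char)0x2202; String s=""+c;

【Discussion】:

【Solution 13】:

(ANSWER IS IN DOT NET 4.5; in Java a similar method must exist)

I am from West Bengal in India. As I understand it, your problem is... You want to produce something like 'অ' (a letter of the Bengali alphabet), which has Unicode HEX: 0X0985.

Now, if you know this value for your language, how do you produce that language-specific Unicode symbol?

In Dot Net it is as simple as this: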

int c = 0X0985;
string x = Char.ConvertFromUtf32(c);

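Now x is your answer. But this is a HEX by HEX conversion, and sentence-to-sentence conversion is work for researchers :P

【Discussion】:

The question is indeed about Java. I don't see how a .NET answer is related here.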