How to extract emoji and alphabet characters from a string

【Question title】: How to extract emoji and alphabet characters from the string
【Posted】: 2018-12-13 12:10:42
【Question description】:

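I want to extract the emoji and alphabet characters from a string into a collection. The string can contain any kind of emoji (activity, family, flag, or animal symbols) as well as alphabet characters. When I read the string from an EditText, it looks like "AB????C????D?????????????????????????E????️‍ ????‍????". I tried the code below, but the resulting array is not what I expected. Can anyone suggest what I need to do to get the expected array?

I tried this code in Eclipse; please correct me if I am wrong.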

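import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CodePoints {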

    public static void main(String []args){
        List<String> list = new ArrayList<>();
        for(int codePoint : codePoints("AB????C????D????‍????‍????‍????E????️‍????‍????")) {
            list.add(String.valueOf(Character.toChars(codePoint)));
        }

        System.out.println(Arrays.toString(list.toArray()));
    }

    public static Iterable<Integer> codePoints(final String string) {
     return new Iterable<Integer>() {
       public Iterator<Integer> iterator() {
         return new Iterator<Integer>() {
           int nextIndex = 0;
           public boolean hasNext() {
             return nextIndex < string.length();
           }
           public Integer next() {
             int result = string.codePointAt(nextIndex);
             nextIndex += Character.charCount(result);
             return result;
           }
           public void remove() {
             throw new UnsupportedOperationException();
           }
         };
       }
     };
   }
}

Output:
[A, B, ????, C, ????, D, ????, ‍, ????, ‍, ????, ‍, ????, E, ??? ?, ️, ‍, ????, ‍, ????]

Expected:
[A, B, ????, C, ????, D, ??????‍????‍??????‍????, E, ??????️‍???? ‍, ????]

【Question comments】:

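It looks like you want to split the string rather than split and filter it (to me, "extract" implies filtering). Take a look at BreakIterator to make sure you do not split between combining characters.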
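
As a rough illustration of the BreakIterator suggestion, here is a minimal sketch that splits a string at user-perceived character boundaries with `java.text.BreakIterator.getCharacterInstance()`. The class and method names (`BreakIteratorSplit`, `splitCharacters`) are made up for this example. How the iterator treats emoji ZWJ sequences depends on the Unicode support of the runtime (JDK version, or Android's implementation), so check the result on your target platform before relying on it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question or answer.
class BreakIteratorSplit {

    // Splits text at the character boundaries reported by BreakIterator,
    // so a base letter and its combining mark stay in one element.
    static List<String> splitCharacters(String text) {
        List<String> parts = new ArrayList<>();
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            parts.add(text.substring(start, end));
        }
        return parts;
    }

    public static void main(String[] args) {
        // "e" followed by U+0301 (combining acute accent) is kept together.
        System.out.println(splitCharacters("Ae\u0301B"));
    }
}
```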

Tags: java android utf-8 character emoji

【Solution 1】:

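The problem is that your string contains invisible characters.
They are:
Unicode character 'ZERO WIDTH JOINER' (U+200D)
Unicode character 'VARIATION SELECTOR-16' (U+FE0F)
Other similar characters include:
Unicode character 'SOFT HYPHEN' (U+00AD)
...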
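
To make the invisible characters visible, a small sketch like the following can print every code point of a string in `U+XXXX` form. The class name `CodePointDump` and the sample string (a man, woman, girl family sequence written with escapes) are assumptions chosen for illustration; they are not taken from the question's input.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for inspecting a string, not part of the original answer.
class CodePointDump {

    // Returns each code point of the text formatted as "U+XXXX".
    static List<String> dump(String text) {
        return text.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample input (assumed): U+1F468 ZWJ U+1F469 ZWJ U+1F467.
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67";
        // The two U+200D entries in the output are the zero width joiners.
        System.out.println(dump(family));
    }
}
```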

Java strings are encoded as UTF-16; see: https://en.wikipedia.org/wiki/UTF-16
https://docs.oracle.com/javase/7/docs/api/java/lang/String.html

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
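
The quoted passage can be checked with a short sketch that compares the number of `char` code units with the number of code points. The class name `Utf16Lengths` and the sample string are assumptions for illustration only.

```java
// Hypothetical demo class, not part of the original answer.
class Utf16Lengths {

    // Number of UTF-16 code units (what String.length() reports).
    static int codeUnits(String text) {
        return text.length();
    }

    // Number of Unicode code points (a surrogate pair counts once).
    static int codePoints(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "A" followed by U+1F600, which needs a surrogate pair in UTF-16.
        String text = "A\uD83D\uDE00";
        System.out.println(codeUnits(text));  // 3 code units
        System.out.println(codePoints(text)); // 2 code points
    }
}
```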

Here is one way to iterate over the individual Unicode characters (code points) in a string.

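// Splits a string into code points: a surrogate pair becomes one two-char string,
// any other char (including an unpaired surrogate) becomes a one-char string.
public static List<String> getUnicodeCharacters(String str) {
    List<String> result = new ArrayList<>();
    char[] charArray = str.toCharArray();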
    for (int i = 0; i < charArray.length; ) {
        if (Character.isHighSurrogate(charArray[i])
                && (i + 1) < charArray.length
                && Character.isLowSurrogate(charArray[i + 1])) {
            result.add(new String(new char[]{charArray[i], charArray[i + 1]}));
            i += 2;
        } else {
            result.add(new String(new char[]{charArray[i]}));
            i++;
        }
    }
    return result;
}

@Test
void getUnicodeCharacters() {
    String str = "AB?C?D?‍?‍?‍?E?️‍?‍?";
    System.out.println(str.codePointCount(0, str.length()));
    for (String unicodeCharacter : UTF_16.getUnicodeCharacters(str)) {
        if ("\u200D".equals(unicodeCharacter)
                || "\uFE0F".equals(unicodeCharacter))
            continue;
        System.out.println(unicodeCharacter);
    }
}
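
The test above drops the joiner and the variation selector, which prints each emoji part separately. If the goal is the grouping the question expects, where a joined sequence stays together as one element, one possible approach is to keep those characters and use them to merge neighbours. The following is only a sketch under simplifying assumptions: the class name `EmojiClusters` is made up, and it handles only U+200D and U+FE0F, not skin tone modifiers, flag pairs, or other combining sequences.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the answer author's method.
class EmojiClusters {
    private static final int ZWJ = 0x200D;   // ZERO WIDTH JOINER
    private static final int VS16 = 0xFE0F;  // VARIATION SELECTOR-16

    // Groups code points so that ZWJ-joined sequences and trailing
    // variation selectors stay in the same list element.
    static List<String> split(String text) {
        List<String> clusters = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean joinNext = false; // true right after a ZWJ
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean attaches = (cp == ZWJ || cp == VS16);
            // Start a new cluster unless this code point attaches to the previous one.
            if (current.length() > 0 && !attaches && !joinNext) {
                clusters.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            joinNext = (cp == ZWJ);
        }
        if (current.length() > 0) {
            clusters.add(current.toString());
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Sample input (assumed): letters, U+1F600, and a family ZWJ sequence.
        String text = "AB\uD83D\uDE00C\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67E";
        System.out.println(split(text).size()); // 6 elements
    }
}
```

Whether a grouped sequence renders as a single glyph still depends on the font of the device, so the list elements may look different from one platform to another.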

【Discussion】: