Latin Character Inbetween String: answers

【Question title】: Latin Character Inbetween String
【Posted】: 2015-08-25 01:49:03
【Question description】:

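I have a program that reads a file containing Latin words, for example "\xed". These Latin words can appear anywhere in any line, so I need the program to parse these characters. Is there a library that can do this?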

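【Discussion】:

What do you mean by "in between"?
I didn't know \xed was a Latin word. What does it mean?
@Andreas, I just found out it should be parsed as \u00ed, which is "Latin small letter i with acute"
@LJNielsenDk, for example I have K\xedng, which should be Kíng
Do you mean it contains the string "K\xedng", or the bytes 4B ED 6E 67 ("Kíng")?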
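
If the file really contains the literal four characters `\xed` (backslash, x, e, d), changing the charset will not help and the escapes have to be unescaped in code. A minimal sketch of that case, assuming each `\xHH` escape stands for the Unicode code point U+00HH (as \xed to \u00ed above suggests); `HexEscapes` and `unescape` are hypothetical names, not part of any library:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HexEscapes {
    // Matches a literal backslash, 'x', and exactly two hex digits, e.g. \xed
    private static final Pattern HEX_ESCAPE =
            Pattern.compile("\\\\x([0-9A-Fa-f]{2})");

    // Replaces every \xHH escape with the character U+00HH (assumption:
    // the escapes denote Latin-1 code points, not raw UTF-8 bytes).
    static String unescape(String text) {
        Matcher m = HEX_ESCAPE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '\' and '$' in the result
            m.appendReplacement(out,
                    Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The Java literal "K\\xedng" is the six characters K \ x e d n g
        System.out.println(unescape("K\\xedng"));
    }
}
```

If the escapes were instead meant as raw UTF-8 bytes, the values would have to be collected into a byte array and decoded together, which this sketch does not do.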

【Solution 1】:

The simple approach I often use is an InputStreamReader that decodes the file as UTF-8. For example:

        // Requires: import java.io.*; import java.nio.charset.StandardCharsets;
        File fileDir = new File("c:/temp/sample.txt");

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(
                        new FileInputStream(fileDir), StandardCharsets.UTF_8))) {

            String str;

            // Read the file line by line, decoding the bytes as UTF-8
            while ((str = in.readLine()) != null) {
                System.out.println(str);
            }
        } catch (IOException e) {
            // Covers a missing file as well as read errors
            System.out.println(e.getMessage());
        }

【Discussion】:

【Solution 2】:

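If you mean the text is stored as bytes and one of those bytes has the hex value ED, then how that byte is interpreted depends on your code page.

Java stores all Strings internally as UTF-16. This means a code page conversion is almost always applied when reading and writing files (UTF-16 is not a common file encoding).

By default, Java uses the platform default charset. If that is not the right one, you must specify the Charset to use.

As an example of the problem, the byte ED is:

ISO-8859-1: í (Unicode 00ED), US Windows
Windows-1251: н (Unicode 043D), Russian
Code page 437: φ (Unicode 03C6), US Windows command line (Win 7)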
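
The list above can be checked with a small sketch that decodes the single byte ED under each charset. `ByteEdDemo` and `decode` are hypothetical names; only ISO-8859-1 is guaranteed to exist on every Java platform, so the other two charsets are assumed to be installed and are skipped when they are not:

```java
import java.nio.charset.Charset;

final class ByteEdDemo {
    // Decodes one byte using the named charset
    static String decode(byte b, String charsetName) {
        return new String(new byte[] { b }, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte ed = (byte) 0xED;
        String[] names = { "ISO-8859-1", "windows-1251", "IBM437" };
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                // Extended charsets may be missing from a trimmed-down runtime
                System.out.println(name + ": not available");
                continue;
            }
            String s = decode(ed, name);
            // Print the code point rather than the glyph, so the result
            // does not depend on the console encoding
            System.out.printf("%s -> U+%04X%n", name, (int) s.charAt(0));
        }
    }
}
```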

To control the code page, read the file like this:

File file = new File("C:\\path\\to\\file.txt");
try (BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), "ISO-8859-1"))) {
    String line;
    while ((line = in.readLine()) != null) {
        // process line here
    }
}

Or use the newer Path API:

Path path = Paths.get("C:\\path\\to\\file.txt");
try (BufferedReader in = Files.newBufferedReader(path, Charset.forName("ISO-8859-1"))) {
    String line;
    while ((line = in.readLine()) != null) {
        // process line here
    }
}

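【Discussion】:

I guess if the whole file were in ISO-8859, your approach would work perfectly! However, my file is a mix of ISO-8859 and UTF-8.
@chj You must be kidding. UTF-8 uses bytes 80-FF, and so does ISO-8859-1. How are you supposed to know whether a byte in that range is one or the other?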
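
As the last comment points out, there is no reliable way to tell the two encodings apart. One common heuristic, shown here only as a sketch and not as something proposed in this thread, is to try a strict UTF-8 decode on each chunk (for example each line, read as raw bytes) and fall back to ISO-8859-1 when the bytes are not valid UTF-8. `MixedDecoder` is a hypothetical name:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class MixedDecoder {
    // Tries strict UTF-8 first; on malformed input, falls back to ISO-8859-1.
    // This is a guess, not a detection: ISO-8859-1 text whose bytes happen
    // to form valid UTF-8 will be decoded as UTF-8.
    static String decode(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to a character, so this never fails
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { 0x4B, (byte) 0xED, 0x6E, 0x67 };            // 4B ED 6E 67
        byte[] utf8 = { 0x4B, (byte) 0xC3, (byte) 0xAD, 0x6E, 0x67 }; // 4B C3 AD 6E 67
        // Both byte sequences should decode to the same four-character string
        System.out.println(decode(latin1).equals(decode(utf8)));
    }
}
```

The fallback works for the bytes 4B ED 6E 67 because ED starts a multi-byte UTF-8 sequence and 6E is not a valid continuation byte, so the strict decoder rejects the input.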