InputStreamReader buffering issue: answers

【Question title】: InputStreamReader buffering issue
【Posted】: 2011-02-07 13:24:54
【Question】:

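Unfortunately, I am reading data from a file that uses two character encodings.

There is a header and a body. The header is always ASCII and defines the character set the body is encoded in.

The header is not fixed length and must be run through a parser to determine its content and length.

The file may also be very large, so I need to avoid bringing the whole content into memory.

So I start with a single InputStream. I initially wrap it in an InputStreamReader with ASCII, decode the header, and extract the character set for the body. All good.

Then I create a new InputStreamReader with the correct character set, put it over the same InputStream, and start trying to read the body.

Unfortunately, and the javadoc confirms this, InputStreamReader may choose to read ahead for efficiency. So reading the header chews up some or all of the body.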
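The read-ahead described above can be observed with a small check. This is only a sketch: `ReadAheadDemo` and `bytesConsumedForOneChar` are names made up for illustration, and the exact number of bytes pulled is implementation-specific, so no particular value is claimed here.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original question.
final class ReadAheadDemo {
    /** Returns how many bytes the reader pulled from the stream to deliver one char. */
    static int bytesConsumedForOneChar(byte[] data) throws IOException {
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(in, StandardCharsets.US_ASCII);
        reader.read();                        // ask for a single character only
        return data.length - in.available();  // bytes that are no longer in the stream
    }
}
```

If the returned count is larger than 1, the reader has taken bytes that a second reader on the same stream will never see.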

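Does anyone have suggestions for working around this? Would creating a CharsetDecoder manually and feeding it one byte at a time be a good idea (possibly wrapped in a custom Reader implementation)?

Thanks in advance.

EDIT: My final solution was to write an InputStreamReader with no buffering, so that I can parse the header without chewing up part of the body. Although this is not very efficient, I wrap the raw InputStream in a BufferedInputStream, so it won't be a problem.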

import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// An InputStreamReader that only consumes as many bytes as is necessary.
// It does not do any read-ahead.
// NOTE: CharsetDecoder.decode( ByteBuffer ) is a complete decoding operation, so a
// lone first byte of a multi-byte sequence is reported as malformed input. As
// written, this class therefore only works for charsets where one byte is one char.
public class InputStreamReaderUnbuffered extends Reader
{
    private final CharsetDecoder charsetDecoder;
    private final InputStream inputStream;
    private final ByteBuffer byteBuffer = ByteBuffer.allocate( 1 );

    public InputStreamReaderUnbuffered( InputStream inputStream, Charset charset )
    {
        this.inputStream = inputStream;
        charsetDecoder = charset.newDecoder();
    }

    @Override
    public int read() throws IOException
    {
        boolean middleOfReading = false;

        while ( true )
        {
            int b = inputStream.read();

            if ( b == -1 )
            {
                if ( middleOfReading )
                    throw new IOException( "Unexpected end of stream, byte truncated" );

                return -1;
            }

            // Hand exactly one byte to the decoder.
            byteBuffer.clear();
            byteBuffer.put( (byte)b );
            byteBuffer.flip();

            CharBuffer charBuffer = charsetDecoder.decode( byteBuffer );

            // although this is theoretically possible this would violate the unbuffered nature
            // of this class so we throw an exception
            if ( charBuffer.length() > 1 )
                throw new IOException( "Decoded multiple characters from one byte!" );

            if ( charBuffer.length() == 1 )
                return charBuffer.get();

            middleOfReading = true;
        }
    }

    @Override
    public int read( char[] cbuf, int off, int len ) throws IOException
    {
        for ( int i = 0; i < len; i++ )
        {
            int ch = read();

            if ( ch == -1 )
                return i == 0 ? -1 : i;

            // Honour the caller's offset (the original wrote to cbuf[ i ]).
            cbuf[ off + i ] = (char)ch;
        }

        return len;
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}

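【Comments on the question】:

Maybe I'm wrong, but until now I thought a file could only have one encoding at a time.
@Roman: You can do whatever you want with files; they're just sequences of bytes. So you can write out a bunch of bytes meant to be interpreted as ASCII, then more bytes meant to be interpreted as UTF-16, and even more meant to be interpreted as UTF-32. I'm not saying it's a good idea, although the OP's use case is certainly reasonable (after all, you have to have some way of indicating which encoding a file uses).
@Mike Q - Nice idea with InputStreamReaderUnbuffered. I'd suggest posting it as a separate answer; it deserves the attention :)
Regarding the InputStreamReaderUnbuffered solution: if the byte buffer size is 1, how do you consume 2 bytes that are part of a single character?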
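One possible answer to the last comment, as a minimal sketch rather than the original poster's code: keep the pending bytes of the current character in a slightly larger buffer and use the three-argument `CharsetDecoder.decode(in, out, endOfInput)` with `endOfInput = false`, which reports an incomplete sequence as underflow instead of as an error. The class name `ByteAtATimeReader` and the buffer sizes are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;

// Hypothetical sketch: pulls one byte at a time and stops as soon as a character is complete.
final class ByteAtATimeReader extends Reader {
    private final InputStream in;
    private final CharsetDecoder decoder;
    // Pending bytes of the character being assembled (kept in "write" mode between calls).
    private final ByteBuffer bytes = ByteBuffer.allocate(16);
    // Decoded chars not yet returned; two slots so a surrogate pair fits (kept in "read" mode).
    private final CharBuffer chars = CharBuffer.allocate(2);

    ByteAtATimeReader(InputStream in, Charset charset) {
        this.in = in;
        this.decoder = charset.newDecoder(); // reports malformed input by default
        chars.flip();                        // start with nothing to return
    }

    @Override
    public int read() throws IOException {
        while (!chars.hasRemaining()) {
            int b = in.read();
            if (b == -1) {
                if (bytes.position() > 0)
                    throw new IOException("Unexpected end of stream inside a character");
                return -1;
            }
            bytes.put((byte) b);
            bytes.flip();
            chars.clear();
            // endOfInput = false: an incomplete sequence yields UNDERFLOW and stays in 'bytes'.
            CoderResult result = decoder.decode(bytes, chars, false);
            if (result.isError())
                result.throwException();
            if (result.isOverflow())
                throw new IOException("More than two chars decoded from one byte sequence");
            bytes.compact(); // keep any undecoded bytes for the next round
            chars.flip();
        }
        return chars.get();
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = 0;
        while (n < len) {
            int ch = read();
            if (ch == -1)
                return n == 0 ? -1 : n;
            cbuf[off + n++] = (char) ch;
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

Because a new byte is requested only when no decoded char is waiting, the underlying stream is never advanced past the last byte of the character just returned. The sketch assumes an encoding in which one byte sequence yields at most two chars, which holds for UTF-8 and UTF-16.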

Tags: java buffer character-encoding decode inputstreamreader

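【Solution 1】:

Why not use 2 InputStreams? One for reading the header and another for the body.

The second InputStream should skip the header bytes.

【Comments】:

Thanks, I think I'll have to do this.
How do you know how much to skip? You need to read the header to know where it ends. Once you start reading the header with an InputStreamReader, it can chew up bytes from the body.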

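【Solution 2】:

Here is the pseudocode.

1. Use the InputStream, but do not wrap a Reader around it.
2. Read the bytes that contain the header and store them in a ByteArrayOutputStream.
3. Create a ByteArrayInputStream from the ByteArrayOutputStream and decode the header, this time wrapping the ByteArrayInputStream in a Reader with the ASCII charset.
4. Compute the length of the non-ASCII input, and read that number of bytes into another ByteArrayOutputStream.
5. Create another ByteArrayInputStream from the second ByteArrayOutputStream and wrap it in a Reader using the charset from the header.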
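The steps above can be sketched roughly as follows. The header format is an assumption made purely for illustration (ASCII lines ending in `\n`, one of them `charset=<name>`, closed by an empty line); the real format in the question needs its own parser. `HeaderBodySplitter` and `openBody` are hypothetical names. Steps 4 and 5 are swapped for wrapping the remaining stream directly, so the body is not copied into memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; the header layout is an assumption, not the asker's format.
final class HeaderBodySplitter {
    /** Reads the header from the raw stream, then returns a Reader over the body. */
    static Reader openBody(InputStream in) throws IOException {
        // Steps 1-2: no Reader yet; copy raw header bytes up to the blank line.
        ByteArrayOutputStream headerBytes = new ByteArrayOutputStream();
        boolean terminated = false;
        int previous = -1;
        int b;
        while ((b = in.read()) != -1) {
            headerBytes.write(b);
            if (b == '\n' && previous == '\n') {
                terminated = true;
                break;
            }
            previous = b;
        }
        if (!terminated)
            throw new IOException("Stream ended before the header was complete");

        // Step 3: decode only the collected header bytes as ASCII.
        String header = new String(headerBytes.toByteArray(), StandardCharsets.US_ASCII);
        Charset bodyCharset = null;
        for (String line : header.split("\n")) {
            if (line.startsWith("charset="))
                bodyCharset = Charset.forName(line.substring("charset=".length()).trim());
        }
        if (bodyCharset == null)
            throw new IOException("Header does not name a charset");

        // The stream now sits on the first body byte, so a Reader can take over.
        return new InputStreamReader(in, bodyCharset);
    }
}
```

Reading single bytes from an unbuffered file stream is slow, so in practice the raw stream would probably be wrapped in a BufferedInputStream first; that is safe here because the same buffered stream is then handed to the body Reader.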

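【Comments】:

Thanks for the suggestion. Unfortunately the header is not of fixed length, in either bytes or characters, so I really do need to parse it through a charset decoder to work out its structure and its length. I also need to avoid reading the whole content into an internal buffer.

【Solution 3】:

I suggest re-reading the stream from the start with a new InputStreamReader. Perhaps assume that InputStream.mark is supported.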
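A minimal sketch of this idea, assuming the caller has already called `mark` on the stream before parsing the header and knows the header length in bytes (for an ASCII header this equals the number of chars read). `MarkResetBody` and `rewindToBody` are hypothetical names.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical sketch of "mark, parse header, reset, skip header, read body".
final class MarkResetBody {
    /** Rewinds to the mark, skips the header, and returns a Reader over the body. */
    static Reader rewindToBody(InputStream in, long headerLength, Charset bodyCharset)
            throws IOException {
        if (!in.markSupported())
            throw new IOException("Stream does not support mark/reset");
        in.reset(); // back to where mark() was called, undoing any read-ahead

        // skip() may skip fewer bytes than requested, so loop until done.
        long remaining = headerLength;
        while (remaining > 0) {
            long skipped = in.skip(remaining);
            if (skipped <= 0) {
                if (in.read() == -1)
                    throw new EOFException("Stream ended inside the header");
                skipped = 1; // the single byte just read
            }
            remaining -= skipped;
        }
        return new InputStreamReader(in, bodyCharset);
    }
}
```

The read limit passed to `mark` has to cover the header plus whatever the header reader pulls ahead, and that amount is not specified by the API, so a generous limit is an assumption the caller has to make. With a BufferedInputStream this also means that many bytes may be held in memory.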

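【Comments】:

【Solution 4】:

My first thought is to close the stream and reopen it, use InputStream#skip to skip past the header, and then give the stream to a new InputStreamReader.

If you really don't want to reopen the file, you could use file descriptors to get more than one stream on the file, although you may have to use channels to have multiple positions within the file (you can't assume you can reset the position with reset, since it may not be supported).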
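A rough sketch of the reopen-and-skip variant using a channel, which has an explicit position instead of relying on `skip` or `reset`. It assumes the header length in bytes is already known from parsing the header on a first stream; `ChannelBody` and `openBodyAt` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: second handle on the file, positioned at the body.
final class ChannelBody {
    /** Opens the file again and returns a Reader that starts at the first body byte. */
    static Reader openBodyAt(Path file, long headerLength, Charset bodyCharset)
            throws IOException {
        FileChannel channel = FileChannel.open(file, StandardOpenOption.READ);
        try {
            channel.position(headerLength); // jump straight past the header
        } catch (IOException e) {
            channel.close();
            throw e;
        }
        // -1 asks for the implementation's default buffer size.
        // Closing the returned Reader also closes the channel.
        return Channels.newReader(channel, bodyCharset.newDecoder(), -1);
    }
}
```

Unlike InputStreamReader, a decoder obtained from `newDecoder()` reports malformed input with an exception rather than substituting a replacement character, which may or may not be what the body parser wants.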

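【Comments】:

If you create multiple FileInputStreams with the same FileDescriptor, they will behave as if they were the same stream.
@Tom: Yes, I was assuming he'd use them in series, not in parallel, and that he'd reset the position between using one and the other. But you can't assume you can reset the position... (I don't think they'd behave as though they were the same stream; I think it would be worse than that: they'd just share the actual file position. In theory, data caching in the individual instances could get really, really confusing if you tried to use them in parallel.)

【Solution 5】:

Simpler:

As you said, your header is always ASCII. So read the header directly from the InputStream, and when you're done, create the Reader with the correct encoding and read from it.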

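// Needs: java.io.IOException, java.io.InputStream, java.io.InputStreamReader,
//        java.io.Reader, java.nio.charset.Charset
private Reader reader;
private InputStream stream;
private boolean headerFullyRead; // set by the header-parsing code below
private Charset encoding;        // body charset extracted from the header

public void read() throws IOException {
    int c;
    while (!headerFullyRead && (c = stream.read()) != -1) {
        // Parse the ASCII header one byte at a time here. Once it is complete,
        // set `encoding` and `headerFullyRead` (format-specific, not shown).
    }
    if (!headerFullyRead) {
        throw new IOException("Stream ended before the header was complete");
    }
    // Only now wrap the stream: it is positioned on the first body byte.
    reader = new InputStreamReader(stream, encoding);
    while ((c = reader.read()) != -1) {
        // Handle rest of file
    }
}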

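【Comments】:

Thanks. In the end I went with another solution: writing an InputStreamReaderUnbuffered, which does the same job as InputStreamReader but has no internal buffer, so you never read too much. See my edit.

【Solution 6】:

If you wrap the InputStream and limit every read to 1 byte at a time, it seems to disable the buffering inside InputStreamReader.

This way we don't have to rewrite the InputStreamReader logic.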

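import java.io.IOException;
import java.io.InputStream;

// Wraps a stream so that every bulk read hands back at most one byte.
public class OneByteReadInputStream extends InputStream
{
    private final InputStream inputStream;

    public OneByteReadInputStream(InputStream inputStream)
    {
        this.inputStream = inputStream;
    }

    @Override
    public int read() throws IOException
    {
        return inputStream.read();
    }

    @Override
    public int read(byte[] b, int off, int len) throws IOException
    {
        // InputStream.read(byte[], int, int) calls read() in a loop, so capping
        // len at 1 returns a single byte (and still returns 0 when len is 0).
        return super.read(b, off, Math.min(len, 1));
    }

    @Override
    public void close() throws IOException
    {
        inputStream.close();
    }
}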

Construction:

new InputStreamReader(new OneByteReadInputStream(inputStream));
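Since this relies on how InputStreamReader happens to behave rather than on anything its documentation promises, it may be worth checking on the JDK actually in use. The sketch below is one way to do that; `OneByteCheck`, `OneByteStream`, and `bytesConsumedForOneChar` are hypothetical names, and the nested stream repeats the idea of OneByteReadInputStream so the example is self-contained. It deliberately leaves `available()` at the InputStream default of 0, on the assumption (not a documented guarantee) that reporting more available bytes could encourage the reader to fetch further.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical check harness; the result depends on the JDK's InputStreamReader internals.
final class OneByteCheck {
    /** Same idea as OneByteReadInputStream: never return more than one byte per read. */
    static final class OneByteStream extends InputStream {
        private final InputStream in;

        OneByteStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            return super.read(b, off, Math.min(len, 1));
        }
    }

    /** Returns how many raw bytes were consumed to deliver a single char. */
    static int bytesConsumedForOneChar(byte[] data, Charset charset) throws IOException {
        ByteArrayInputStream raw = new ByteArrayInputStream(data);
        Reader reader = new InputStreamReader(new OneByteStream(raw), charset);
        reader.read();
        return data.length - raw.available();
    }
}
```

For ASCII input, a result of 1 would indicate that no body bytes were taken beyond the character requested; any larger value means the trick does not hold on that runtime.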

【Comments】: