Read a fixed number of bytes when reading a file using StreamReader

【Question title】: Read a fixed number of bytes when reading file using StreamReader
【Posted】: 2018-09-25 14:37:50
【Question】:

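I'm using a port of the Mozilla character set detector to determine a file's encoding, and then using that encoding to construct a StreamReader. So far, so good.

However, the file format I'm reading is an odd one, and from time to time some bytes have to be skipped. That is, a text file, in one encoding or another, will have some raw bytes embedded in it.

I'd like to read the stream as text until I hit some text indicating that a byte stream follows, then read the byte stream, then resume reading as text. What is the best way to do this (a balance of simplicity and performance)?

I can't rely on seeking in the FileStream underlying the StreamReader (and then discarding the data buffered in the latter), because I don't know how many bytes were consumed in reading the characters up to that point. I might abandon StreamReader and switch to a bespoke class that uses parallel byte and char arrays, fills the latter from the former with a decoder, and tracks the position in the byte array each time a character is read by using the encoding to calculate how many bytes that character took. Yuck.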
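The byte-tracking idea in the last paragraph can be sketched as follows. This is only an illustration, not code from the post: the class and method names are hypothetical, and it assumes well-formed input without a byte-order mark, so that re-encoding the decoded characters reproduces the original bytes. It also assumes the requested character count does not split a surrogate pair.

```csharp
using System;
using System.Text;

public static class BytePositionSketch
{
    // Returns how many bytes of `bytes` are occupied by the first
    // `charCount` decoded chars. Assumption: the input is well formed and
    // has no BOM, so encoding the chars again yields the same byte lengths.
    public static int BytesUsedByChars(Encoding encoding, byte[] bytes, int charCount)
    {
        char[] chars = encoding.GetChars(bytes);

        if (charCount < 0 || charCount > chars.Length)
        {
            throw new ArgumentOutOfRangeException(nameof(charCount));
        }

        // Re-encode only the chars consumed so far to learn the byte offset.
        return encoding.GetByteCount(chars, 0, charCount);
    }
}
```

Calling `BytePositionSketch.BytesUsedByChars(Encoding.UTF8, bytes, 3)` on the UTF-8 bytes of "aé€:" gives the offset of the ':' byte, since the three preceding characters take 1, 2 and 3 bytes.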

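To clarify further, the file has this format:

[encoded chars][embedded bytes indicator + len][len bytes][encoded chars]...

There may be zero, one, or many blocks of embedded bytes, and the blocks of encoded chars may be of any length.

So, for example: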

ABC:123:DEF:456:$0099[0x00,0x01,0x02,... x 99]GHI:789:JKL:...

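There are no line delimiters. I may have any number of fields (ABC, 123, ...) separated by some character (a colon in this case). These fields may be in various code pages, including UTF-8 (so not guaranteed to be single-byte). When I hit a $, I know that the next 4 bytes contain a length (call it n), that the following n bytes are to be read raw, and that the byte after them starts another text field (GHI).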
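For a file small enough to hold in memory, the format just described can be parsed directly from a byte array. The sketch below is illustrative only: the names are hypothetical, the length is assumed to be a 4-byte little-endian integer (the post does not say how the 4 bytes are encoded), and the byte scan is only safe when ':' and '$' cannot appear inside an encoded character. That holds for UTF-8 and for single-byte ASCII-compatible code pages, but not for UTF-16.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class OddFormatSketch
{
    // Splits `data` into items: a string for each non-empty text field and
    // a byte[] for each embedded block. Assumptions: ':' separates fields
    // and '$' is followed by a 4-byte little-endian length.
    public static List<object> Parse(byte[] data, Encoding encoding)
    {
        var items = new List<object>();
        int fieldStart = 0;
        int i = 0;

        while (i < data.Length)
        {
            byte b = data[i];

            if (b != (byte)':' && b != (byte)'$')
            {
                i++;
                continue;
            }

            // A marker ends the current field; empty fields are skipped.
            if (i > fieldStart)
            {
                items.Add(encoding.GetString(data, fieldStart, i - fieldStart));
            }

            i++; // consume the marker

            if (b == (byte)'$')
            {
                if (data.Length - i < 4)
                {
                    throw new FormatException("Truncated length");
                }

                int n = data[i] | data[i + 1] << 8 | data[i + 2] << 16 | data[i + 3] << 24;
                i += 4;

                if (n < 0 || n > data.Length - i)
                {
                    throw new FormatException("Embedded block runs past the end of the data");
                }

                // Copy the raw block without interpreting its bytes.
                byte[] raw = new byte[n];
                Array.Copy(data, i, raw, 0, n);
                items.Add(raw);
                i += n;
            }

            fieldStart = i;
        }

        if (data.Length > fieldStart)
        {
            items.Add(encoding.GetString(data, fieldStart, data.Length - fieldStart));
        }

        return items;
    }
}
```

Because the raw block is skipped by length rather than scanned, any ':' or '$' bytes inside it are left alone.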

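【Comments】:

I'm having trouble working out what the question is. Could you make it a bit clearer, or add a question mark?
StreamReader is not a particularly complicated animal. You could implement your own version quite easily. There's no need for any parallel buffers; just do what most readers do: keep a byte[] buffer and transfer to/from it as appropriate. Not nasty at all.
@glenebob - The problem lies in "transfer to it as appropriate". Because I mostly want to parse text, I need an Encoding or Decoder instance, for the detected encoding, to convert my byte buffer into a char buffer. If I take x bytes and convert them to chars, I have a problem when I need to start reading a fixed number of bytes partway through reading from the char buffer.
@RonIdaho: Why not simply read the bytes as bytes, convert them to strings using the appropriate encoding (e.g. UTF-8), and skip bytes where you have to?
How do you know when the characters reach the end of a given string? What is the nature of the "embedded bytes indicator + len"? It sounds as though the reader may need to walk through the buffer byte by byte, constructing string values until it reaches a terminator byte. That is exactly what StreamReader.ReadLine() does. It's pretty simple.

Tags: c# .net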

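【Solution 1】:

Proof of concept. This class works with UTF-16 string data and the ':' delimiter, per the OP. It expects the binary length as a 4-byte little-endian binary integer. It should be easy to adapt to the more specific details of your (odd) file format. For example, any Decoder class should drop into ReadString() and "just work".

To use it, construct it with a Stream. For each individual data element, call ReportNextData(), which tells you what type of data comes next, then call the appropriate Read*() method. For binary data, call ReadBinaryLength() and then ReadBinaryData().

Note that ReadBinaryData() follows the stream contract: it is not guaranteed to return as many bytes as you asked for, so you may need to call it several times. However, if you request more bytes than remain in the binary block, it throws EndOfStreamException.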
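Since a single call may return fewer bytes than requested, callers usually wrap it in a loop. A minimal sketch of such a loop follows. The helper name is hypothetical and not part of the answer. It takes a delegate with the same shape as Stream.Read, so it should work with either a plain Stream or the reader described here.

```csharp
using System;
using System.IO;

public static class ReadLoopSketch
{
    // Calls `read` repeatedly until exactly `count` bytes are in `buffer`.
    // `read` has the shape of Stream.Read: (buffer, offset, count) => bytesRead.
    public static void ReadExactly(Func<byte[], int, int, int> read, byte[] buffer, int offset, int count)
    {
        int total = 0;

        while (total < count)
        {
            int bytesRead = read(buffer, offset + total, count - total);

            if (bytesRead == 0)
            {
                // The source ended before the requested bytes arrived.
                throw new EndOfStreamException();
            }

            total += bytesRead;
        }
    }
}
```

With the class from this answer, the call would look like `ReadLoopSketch.ReadExactly(reader.ReadBinaryData, block, 0, length)`, where `length` is the value returned by ReadBinaryLength().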

I tested it with this data (in hex): 410042004300240A0000000102030405060708090024050000000504030201580059005A003A310032003300

Which is: ABC$[10][1234567890]$[5][54321]XYZ:123

Scan the data like this:
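To feed that hex string to the reader, it first has to be turned into bytes. The helper below is a hypothetical sketch (the answer does not show how the test stream was built). It uses Convert.ToByte with base 16, which is also available on older .NET Framework versions.

```csharp
using System;

public static class HexSketch
{
    // Converts a string of hex digit pairs ("4100...") into a byte array.
    public static byte[] HexToBytes(string hex)
    {
        if (hex.Length % 2 != 0)
        {
            throw new ArgumentException("Hex string must have an even length", nameof(hex));
        }

        byte[] bytes = new byte[hex.Length / 2];

        for (int i = 0; i < bytes.Length; i++)
        {
            // Each pair of hex digits encodes one byte.
            bytes[i] = Convert.ToByte(hex.Substring(i * 2, 2), 16);
        }

        return bytes;
    }
}
```

The resulting array can be wrapped in a MemoryStream and passed to the OddFileReader constructor.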

OddFileReader.NextData nextData;

while ((nextData = reader.ReportNextData()) != OddFileReader.NextData.Eof)
{
    // Call appropriate Read*() here.
}

using System;
using System.IO;
using System.Text;

public class OddFileReader : IDisposable
{
    public enum NextData
    {
        Unknown,
        Eof,
        String,
        BinaryLength,
        BinaryData
    }

    private Stream source;
    private byte[] byteBuffer;
    private int bufferOffset;
    private int bufferEnd;
    private NextData nextData;
    private int binaryOffset;
    private int binaryEnd;
    private char[] characterBuffer;

    public OddFileReader(Stream source)
    {
        this.source = source;
    }

    public NextData ReportNextData()
    {
        if (nextData != NextData.Unknown)
        {
            return nextData;
        }

        if (!PopulateBufferIfNeeded(1))
        {
            return (nextData = NextData.Eof);
        }

        if (byteBuffer[bufferOffset] == '$')
        {
            return (nextData = NextData.BinaryLength);
        }
        else
        {
            return (nextData = NextData.String);
        }
    }

    public string ReadString()
    {
        ReportNextData();

        if (nextData == NextData.Eof)
        {
            throw new EndOfStreamException();
        }
        else if (nextData != NextData.String)
        {
            throw new InvalidOperationException("Attempt to read non-string data as string");
        }

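        if (characterBuffer == null)
        {
            // Two chars, because one input byte can complete a surrogate pair.
            characterBuffer = new char[2];
        }

        StringBuilder stringBuilder = new StringBuilder();
        Decoder decoder = Encoding.Unicode.GetDecoder();

        while (nextData == NextData.String)
        {
            byte b = byteBuffer[bufferOffset];

            if (b == '$')
            {
                // Start of a binary block; leave the marker for ReadBinaryLength().
                nextData = NextData.BinaryLength;

                break;
            }
            else if (b == ':')
            {
                // Field delimiter: consume it and end the current string.
                nextData = NextData.Unknown;
                bufferOffset++;

                break;
            }
            else
            {
                // The decoder is stateful; it returns 0 until a full character has arrived.
                int charCount = decoder.GetChars(byteBuffer, bufferOffset++, 1, characterBuffer, 0);

                stringBuilder.Append(characterBuffer, 0, charCount);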

                if (bufferOffset == bufferEnd && !PopulateBufferIfNeeded(1))
                {
                    nextData = NextData.Eof;

                    break;
                }
            }
        }

        return stringBuilder.ToString();
    }

    public int ReadBinaryLength()
    {
        ReportNextData();

        if (nextData == NextData.Eof)
        {
            throw new EndOfStreamException();
        }
        else if (nextData != NextData.BinaryLength)
        {
            throw new InvalidOperationException("Attempt to read non-binary-length data as binary length");
        }

        bufferOffset++;

        if (!PopulateBufferIfNeeded(sizeof(Int32)))
        {
            nextData = NextData.Eof;

            throw new EndOfStreamException();
        }

        // BitConverter uses the platform's byte order (little-endian on typical x86/ARM systems).
        binaryEnd = BitConverter.ToInt32(byteBuffer, bufferOffset);
        binaryOffset = 0;
        bufferOffset += sizeof(Int32);
        nextData = NextData.BinaryData;

        return binaryEnd;
    }

    public int ReadBinaryData(byte[] buffer, int offset, int count)
    {
        ReportNextData();

        if (nextData == NextData.Eof)
        {
            throw new EndOfStreamException();
        }
        else if (nextData != NextData.BinaryData)
        {
            throw new InvalidOperationException("Attempt to read non-binary data as binary data");
        }

        if (count > binaryEnd - binaryOffset)
        {
            throw new EndOfStreamException();
        }

        int bytesRead;

        if (bufferOffset < bufferEnd)
        {
            bytesRead = Math.Min(count, bufferEnd - bufferOffset);

            Array.Copy(byteBuffer, bufferOffset, buffer, offset, bytesRead);
            bufferOffset += bytesRead;
        }
        else if (count < byteBuffer.Length)
        {
            if (!PopulateBufferIfNeeded(1))
            {
                throw new EndOfStreamException();
            }

            bytesRead = Math.Min(count, bufferEnd - bufferOffset);

            Array.Copy(byteBuffer, bufferOffset, buffer, offset, bytesRead);
            bufferOffset += bytesRead;
        }
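        else
        {
            // Large request with an empty buffer: read straight from the source.
            bytesRead = source.Read(buffer, offset, count);

            if (bytesRead == 0)
            {
                // The source ended in the middle of the binary block.
                throw new EndOfStreamException();
            }
        }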

        binaryOffset += bytesRead;

        if (binaryOffset == binaryEnd)
        {
            nextData = NextData.Unknown;
        }

        return bytesRead;
    }

    private bool PopulateBufferIfNeeded(int minimumBytes)
    {
        if (byteBuffer == null)
        {
            byteBuffer = new byte[8192];
        }

        if (bufferEnd - bufferOffset < minimumBytes)
        {
            int shiftCount = bufferEnd - bufferOffset;

            if (shiftCount > 0)
            {
                Array.Copy(byteBuffer, bufferOffset, byteBuffer, 0, shiftCount);
            }

            bufferOffset = 0;
            bufferEnd = shiftCount;

            while (bufferEnd - bufferOffset < minimumBytes)
            {
                int bytesRead = source.Read(byteBuffer, bufferEnd, byteBuffer.Length - bufferEnd);

                if (bytesRead == 0)
                {
                    return false;
                }

                bufferEnd += bytesRead;
            }
        }

        return true;
    }

    public void Dispose()
    {
        Stream source = this.source;

        this.source = null;

        if (source != null)
        {
            source.Dispose();
        }
    }
}
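The claim that another Decoder can be dropped into ReadString() rests on decoders being stateful: fed one byte at a time, they hold partial sequences and emit characters only once a sequence is complete. The sketch below illustrates that with hypothetical names. It is not part of the answer. It also shows why the output buffer needs room for two chars: in UTF-8, the last byte of a 4-byte sequence yields a surrogate pair.

```csharp
using System;
using System.Text;

public static class DecoderSketch
{
    // Feeds `bytes` to a stateful decoder one byte at a time and returns
    // the decoded text.
    public static string DecodeByteByByte(Encoding encoding, byte[] bytes)
    {
        Decoder decoder = encoding.GetDecoder();
        char[] chars = new char[2]; // room for a surrogate pair
        var result = new StringBuilder();

        for (int i = 0; i < bytes.Length; i++)
        {
            // Returns 0 while a multi-byte sequence is still incomplete.
            int charCount = decoder.GetChars(bytes, i, 1, chars, 0);

            result.Append(chars, 0, charCount);
        }

        return result.ToString();
    }
}
```

Note that with a multi-byte encoding such as UTF-8 the '$' and ':' checks on raw bytes remain safe, because those byte values never occur inside a multi-byte UTF-8 sequence. With UTF-16 they can match one half of a two-byte code unit.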

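【Comments】:

This is better than my horrid parallel-buffer solution. I think I can simplify it a bit, too. Thanks. I'd add a +1 for the class name as well.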