【Question Title】: How can I make reverse scanning of a binary file faster?
【Posted】: 2012-03-23 10:10:13
【Question Description】:

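I have a binary file specification that describes a packetized data structure. Each data packet has a two-byte sync pattern, so the beginning of a packet can be found by scanning with a BinaryReader and FileStream combination: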

while(!reader.EndOfFile)
{
    // Check for sync pattern.
    if (reader.ReadUInt16() != 0xEB25)
    {
        // Move to next byte.
        reader.BaseStream.Seek(-1, SeekOrigin.Current);
        continue;
    }

    // If we got here, a sync pattern was found.
}

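This process works perfectly fine in the forward direction, but similar code scanning in the reverse direction is at least two orders of magnitude slower: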

while(!reader.BeginningOfFile)
{
    // Check for sync pattern.
    if (reader.ReadUInt16() != 0xEB25)
    {
        // Move to previous byte.
        reader.BaseStream.Seek(-3, SeekOrigin.Current);
        continue;
    }

    // If we got here, a sync pattern was found.
}

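I've tried a few workarounds, such as moving back an arbitrary amount (currently 1 megabyte) and scanning forward, but it's becoming clear that what I really need is a BinaryReader or FileStream modified to have adequate performance characteristics when reading in both the forward and reverse directions.

I already have a FastFileStream which improves forward read performance by subclassing an ordinary FileStream and caching the Position and Length properties (it also provides the BeginningOfFile and EndOfFile properties). That is what drives the reader variable in the code above.

Is there something similar I could do to improve reverse reading performance, perhaps by incorporating a MemoryStream as a buffer?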

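【Comments】:

  • "This process works perfectly fine in the forward direction." It is also bad: it reads two bytes and, if they are not EB25, goes back one byte.
  • @L.B: The actual code is far more optimized than that... In the forward direction, I check for 0x25 first (honoring little endian), and then for 0xEB. The code I posted here is simplified for clarity. Trust me, that's not how I'm checking bytes; the reverse slowdown is occurring because file systems are not designed to work backwards like this.
  • Then I would try memory-mapped files.
  • OK, you could go with a FastForwardReverseFileStream. Note: using a memory-mapped file doesn't mean you load the entire file contents into memory.
  • Why not use a BufferedStream instead of a FileStream? Moving backward or forward makes no difference, because a BufferedStream reads the file in chunks, and scanning a chunk in reverse or forward order won't differ much. Even with a plain FileStream, you can read a chunk into a byte buffer and scan it in reverse order with a simple for loop that decrements the index. FileStream uses a 4 KB buffer anyway, but it is optimized for forward reading.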
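
The chunked approach suggested in the last comment can be sketched roughly as follows. This is an illustration rather than code from the thread: `ReverseSyncScanner`, `FindSyncPatternsBackward`, and the `chunkSize` parameter are hypothetical names, and the sketch only locates candidate sync patterns (it does not validate packet headers). Each chunk is read forward with a single seek and then scanned in memory with a decrementing index.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ReverseSyncScanner
{
    // Hypothetical helper: yields the offsets of every 0x25 0xEB byte pair
    // (the little-endian encoding of 0xEB25), from the end of the stream
    // toward the start.
    public static IEnumerable<long> FindSyncPatternsBackward(Stream stream, int chunkSize)
    {
        if (chunkSize < 2) throw new ArgumentOutOfRangeException("chunkSize");

        byte[] chunk = new byte[chunkSize];
        long chunkEnd = stream.Length; // exclusive end of the next chunk to read

        while (chunkEnd > 1)
        {
            long chunkStart = Math.Max(0, chunkEnd - chunkSize);
            int count = (int)(chunkEnd - chunkStart);

            // One seek and one forward read per chunk, instead of one seek per byte.
            stream.Position = chunkStart;
            int filled = 0;
            while (filled < count)
            {
                int n = stream.Read(chunk, filled, count - filled);
                if (n == 0) throw new EndOfStreamException();
                filled += n;
            }

            // Scan the in-memory chunk backward.
            for (int i = count - 2; i >= 0; i--)
            {
                if (chunk[i] == 0x25 && chunk[i + 1] == 0xEB)
                {
                    yield return chunkStart + i;
                }
            }

            // Overlap the next chunk by one byte so that a pattern straddling
            // the chunk boundary is still found (and is reported only once).
            chunkEnd = chunkStart + 1;
        }
    }
}
```

A caller would pass the open file stream and a chunk size such as 128 * 1024, then stop enumerating at the first offset that passes whatever header validation it applies.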

Tags: c# .net-3.5 buffer filestream reverse


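【Solution 1】:

EDIT: Okay, I've got some code. Well, quite a lot of code. It allows you to scan forwards and backwards for packet headers.

I make no guarantee that it has no bugs, and you'll definitely want to tweak the buffer size to see how it performs... but given the same file you sent me, it at least shows the same packet header locations when scanning forwards and backwards :)

Before the code, I would still suggest that if you possibly can, scanning through the file once and saving an index of packet information for later use would probably be a better approach.

Anyway, here's the code (with no tests other than the sample program):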

PacketHeader.cs:

using System;

namespace Chapter10Reader
{
    public sealed class PacketHeader
    {
        private readonly long filePosition;
        private readonly ushort channelId;
        private readonly uint packetLength;
        private readonly uint dataLength;
        private readonly byte dataTypeVersion;
        private readonly byte sequenceNumber;
        private readonly byte packetFlags;
        private readonly byte dataType;
        private readonly ulong relativeTimeCounter;

        public long FilePosition { get { return filePosition; } }
        public ushort ChannelId { get { return channelId; } }
        public uint PacketLength { get { return packetLength; } }
        public uint DataLength { get { return dataLength; } }
        public byte DataTypeVersion { get { return dataTypeVersion; } }
        public byte SequenceNumber { get { return sequenceNumber; } }
        public byte PacketFlags { get { return packetFlags; } }
        public byte DataType { get { return dataType; } }
        public ulong RelativeTimeCounter { get { return relativeTimeCounter; } }

        public PacketHeader(ushort channelId, uint packetLength, uint dataLength, byte dataTypeVersion,
            byte sequenceNumber, byte packetFlags, byte dataType, ulong relativeTimeCounter, long filePosition)
        {
            this.channelId = channelId;
            this.packetLength = packetLength;
            this.dataLength = dataLength;
            this.dataTypeVersion = dataTypeVersion;
            this.sequenceNumber = sequenceNumber;
            this.packetFlags = packetFlags;
            this.dataType = dataType;
            this.relativeTimeCounter = relativeTimeCounter;
            this.filePosition = filePosition;
        }

        internal static PacketHeader Parse(byte[] data, int index, long filePosition)
        {
            if (index + 24 > data.Length)
            {
                throw new ArgumentException("Packet header must be 24 bytes long; not enough data");
            }
            ushort syncPattern = BitConverter.ToUInt16(data, index + 0);
            if (syncPattern != 0xeb25)
            {
                throw new ArgumentException("Packet header must start with the sync pattern");
            }
            ushort channelId = BitConverter.ToUInt16(data, index + 2);
            uint packetLength = BitConverter.ToUInt32(data, index + 4);
            uint dataLength = BitConverter.ToUInt32(data, index + 8);
            byte dataTypeVersion = data[index + 12];
            byte sequenceNumber = data[index + 13];
            byte packetFlags = data[index + 14];
            byte dataType = data[index + 15];
            // TODO: Validate this...
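            // The shift is parenthesized because '+' binds tighter than '<<' in C#;
            // without the parentheses the whole sum would be shifted.
            ulong relativeTimeCounter =
                (ulong)BitConverter.ToUInt32(data, index + 16) |
                ((ulong)BitConverter.ToUInt16(data, index + 20) << 32);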
            // Assume we've already validated the checksum...
            return new PacketHeader(channelId, packetLength, dataLength, dataTypeVersion, sequenceNumber,
                packetFlags, dataType, relativeTimeCounter, filePosition);
        }

        /// <summary>
        /// Checks a packet header's checksum to see whether this *looks* like a packet header.
        /// </summary>
        internal static bool CheckPacketHeaderChecksum(byte[] data, int index)
        {
            if (index + 24 > data.Length)
            {
                throw new ArgumentException("Packet header must be 24 bytes long; not enough data");
            }
            ushort computed = 0;
            for (int i = 0; i < 11; i++)
            {
                computed += BitConverter.ToUInt16(data, index + i * 2);
            }
            return computed == BitConverter.ToUInt16(data, index + 22);
        }
    }
}

PacketScanner.cs:

using System;
using System.Diagnostics;
using System.IO;

namespace Chapter10Reader
{
    public sealed class PacketScanner : IDisposable
    {
        // 128K buffer... tweak this.
        private const int BufferSize = 1024 * 128;

        /// <summary>
        /// Where in the file does the buffer start?
        /// </summary>
        private long bufferStart;

        /// <summary>
        /// Where in the file does the buffer end (exclusive)?
        /// </summary>
        private long bufferEnd;

        /// <summary>
        /// Where are we in the file, logically?
        /// </summary>
        private long logicalPosition;

        // Probably cached by FileStream, but we use it a lot, so let's
        // not risk it...
        private readonly long fileLength;

        private readonly FileStream stream;
        private readonly byte[] buffer = new byte[BufferSize];        

        private PacketScanner(FileStream stream)
        {
            this.stream = stream;
            this.fileLength = stream.Length;
        }

        public void MoveToEnd()
        {
            logicalPosition = fileLength;
            bufferStart = -1; // Invalidate buffer
            bufferEnd = -1;
        }

        public void MoveToBeforeStart()
        {
            logicalPosition = -1;
            bufferStart = -1;
            bufferEnd = -1;
        }

        private byte this[long position]
        {
            get 
            {
                if (position < bufferStart || position >= bufferEnd)
                {
                    FillBuffer(position);
                }
                return buffer[position - bufferStart];
            }
        }

        /// <summary>
        /// Fill the buffer to include the given position.
        /// If the position is earlier than the buffer, assume we're reading backwards
        /// and make position one before the end of the buffer.
        /// If the position is later than the buffer, assume we're reading forwards
        /// and make position the start of the buffer.
        /// If the buffer is invalid, make position the start of the buffer.
        /// </summary>
        private void FillBuffer(long position)
        {
            long newStart;
            if (position > bufferStart)
            {
                newStart = position;
            }
            else
            {
                // Keep position *and position + 1* to avoid swapping back and forth too much
                newStart = Math.Max(0, position - buffer.Length + 2);
            }
            // Read from newStart until the buffer is full or the end of the file is reached.
            int bytesRead;
            int index = 0;
            stream.Position = newStart;
            while ((bytesRead = stream.Read(buffer, index, buffer.Length - index)) > 0)
            {
                index += bytesRead;
            }
            bufferStart = newStart;
            bufferEnd = bufferStart + index;
        }

        /// <summary>
        /// Make sure the buffer contains the given positions.
        /// 
        /// </summary>
        private void FillBuffer(long start, long end)
        {
            if (end - start > buffer.Length)
            {
                throw new ArgumentException("Buffer not big enough!");
            }
            if (end > fileLength)
            {
                throw new ArgumentException("Beyond end of file");
            }
            // Nothing to do.
            if (start >= bufferStart && end < bufferEnd)
            {
                return;
            }
            // TODO: Optimize this more to use whatever bits we've actually got.
            // (We're optimized for "we've got the start, get the end" but not the other way round.)
            if (start >= bufferStart)
            {
                // We've got the start, but not the end. Just shift things enough and read the end...
                int shiftAmount = (int) (end - bufferEnd);
                Buffer.BlockCopy(buffer, shiftAmount, buffer, 0, (int) (bufferEnd - bufferStart - shiftAmount));
                stream.Position = bufferEnd;
                int bytesRead;
                int index = (int)(bufferEnd - bufferStart - shiftAmount);
                while ((bytesRead = stream.Read(buffer, index, buffer.Length - index)) > 0)
                {
                    index += bytesRead;
                }
                bufferStart += shiftAmount;
                bufferEnd = bufferStart + index;
                return;
            }

            // Just fill the buffer starting from start...
            bufferStart = -1;
            bufferEnd = -1;
            FillBuffer(start);
        }

        /// <summary>
        /// Returns the header of the next packet, or null 
        /// if we've reached the end of the file.
        /// </summary>
        public PacketHeader NextHeader()
        {
            for (long tryPosition = logicalPosition + 1; tryPosition < fileLength - 23; tryPosition++)
            {
                if (this[tryPosition] == 0x25 && this[tryPosition + 1] == 0xEB)
                {
                    FillBuffer(tryPosition, tryPosition + 24);
                    int bufferPosition = (int) (tryPosition - bufferStart);
                    if (PacketHeader.CheckPacketHeaderChecksum(buffer, bufferPosition))
                    {
                        logicalPosition = tryPosition;
                        return PacketHeader.Parse(buffer, bufferPosition, tryPosition);
                    }
                }
            }
            logicalPosition = fileLength;
            return null;
        }

        /// <summary>
        /// Returns the header of the previous packet, or null 
        /// if we've reached the start of the file.
        /// </summary>
        public PacketHeader PreviousHeader()
        {
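            // A header needs 24 bytes, so never start closer than that to the end of the file
            // (otherwise FillBuffer(start, end) throws "Beyond end of file").
            for (long tryPosition = Math.Min(logicalPosition - 1, fileLength - 24); tryPosition >= 0; tryPosition--)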
            {
                if (this[tryPosition + 1] == 0xEB && this[tryPosition] == 0x25)
                {
                    FillBuffer(tryPosition, tryPosition + 24);
                    int bufferPosition = (int)(tryPosition - bufferStart);
                    if (PacketHeader.CheckPacketHeaderChecksum(buffer, bufferPosition))
                    {
                        logicalPosition = tryPosition;
                        return PacketHeader.Parse(buffer, bufferPosition, tryPosition);
                    }
                }
            }
            logicalPosition = -1;
            return null;
        }

        public static PacketScanner OpenFile(string filename)
        {
            return new PacketScanner(File.OpenRead(filename));
        }

        public void Dispose()
        {
            stream.Dispose();
        }
    }
}

Program.cs (for testing):

using System;
using System.Collections.Generic;
using System.Linq;

namespace Chapter10Reader
{
    class Program
    {
        static void Main(string[] args)
        {
            string filename = "test.ch10";

            Console.WriteLine("Forwards:");
            List<long> positionsForward = new List<long>();
            using (PacketScanner scanner = PacketScanner.OpenFile(filename))
            {
                scanner.MoveToBeforeStart();
                PacketHeader header;
                while ((header = scanner.NextHeader()) != null)
                {
                    Console.WriteLine("Found header at {0}", header.FilePosition);
                    positionsForward.Add(header.FilePosition);
                }
            }
            Console.WriteLine();
            Console.WriteLine("Backwards:");
            List<long> positionsBackward = new List<long>();
            using (PacketScanner scanner = PacketScanner.OpenFile(filename))
            {
                scanner.MoveToEnd();
                PacketHeader header;
                while ((header = scanner.PreviousHeader()) != null)
                {
                    positionsBackward.Add(header.FilePosition);
                }
            }
            positionsBackward.Reverse();
            foreach (var position in positionsBackward)
            {
                Console.WriteLine("Found header at {0}", position);
            }

            Console.WriteLine("Same? {0}", positionsForward.SequenceEqual(positionsBackward));
        }
    }
}

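【Comments】:

  • There is a packet header, and it has a field that specifies the size of the packet. I may read the packet or, if it isn't the packet I want, use the size to skip to the next one. That works fine for forward scanning unless the file is corrupted, in which case I have to resume scanning for sync patterns. Packets can be up to half a megabyte in size. If you're interested, a detailed description of the packets is available in this document, starting at section 10.6.
  • @RobertHarvey: Thanks, I'll have a look when I get home tonight and try to write some code. Do you have a sample data file I can play with?
  • The optimization is necessary because of the Next/Previous functionality in the UI. Skipping back one packet isn't really a problem; the delay is perceptible but not severe. Skipping back over many packets is not so imperceptible; packets have a channel number assigned to them, and I may want the previous packet on the same channel, so I may have to go back a hundred or more packets to find it.
  • Sample data is here: irig106.org/sample_data. I'll get you the login and password.
  • @RobertHarvey: It strikes me that the bytes 0xEB25 could occur anywhere in arbitrary data. If you're scanning backwards, what's to stop you from getting a false positive and treating some random bit of data as a packet header? The header checksum helps guard against that to some extent, but it isn't foolproof... Do you really need to scan backwards, rather than scanning forwards once, remembering all the headers, and then just seeking to the packet you want?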
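
The "scan forwards once and remember all the headers" idea from the last comment, combined with the need to find the previous packet on the same channel, might look something like the sketch below. `PacketIndex`, `PacketIndexEntry`, `Add`, and `FindPrevious` are hypothetical names introduced for illustration; in practice the entries would be filled from the `FilePosition` and `ChannelId` of each header returned by a forward scan.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical index entry: where a packet starts and which channel it belongs to.
public struct PacketIndexEntry
{
    public readonly long FilePosition;
    public readonly ushort ChannelId;

    public PacketIndexEntry(long filePosition, ushort channelId)
    {
        FilePosition = filePosition;
        ChannelId = channelId;
    }
}

// Hypothetical in-memory index built during a single forward scan.
public sealed class PacketIndex
{
    private readonly List<PacketIndexEntry> entries = new List<PacketIndexEntry>();

    // Entries must be added in ascending file order, as a forward scan produces them.
    public void Add(long filePosition, ushort channelId)
    {
        if (entries.Count > 0 && filePosition <= entries[entries.Count - 1].FilePosition)
        {
            throw new ArgumentException("Entries must be added in ascending file order");
        }
        entries.Add(new PacketIndexEntry(filePosition, channelId));
    }

    // Returns the start of the last packet on channelId that begins before
    // 'position', or -1 if there is none.
    public long FindPrevious(long position, ushort channelId)
    {
        // Binary search for the first entry whose position is >= 'position'.
        int lo = 0, hi = entries.Count;
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (entries[mid].FilePosition < position) lo = mid + 1;
            else hi = mid;
        }

        // Walk backward over the in-memory entries; no file I/O is involved.
        for (int i = lo - 1; i >= 0; i--)
        {
            if (entries[i].ChannelId == channelId) return entries[i].FilePosition;
        }
        return -1;
    }
}
```

The backward walk is linear in the number of skipped packets, which is cheap in memory; keeping one sorted list per channel would make the lookup a pure binary search if that ever matters.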
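【Solution 2】:

L.B mentioned using a memory-mapped file in a comment; you may be impressed with the performance.

Please try something like the following: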

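// Requires .NET 4.0 or later and: using System; using System.IO; using System.IO.MemoryMappedFiles;
var memoryMapName = Path.GetFileName(fileToRead);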

using (var mapStream = new FileStream(fileToRead, FileMode.Open))
{
    using (var myMap = MemoryMappedFile.CreateFromFile(
                            mapStream, 
                            memoryMapName, mapStream.Length,
                            MemoryMappedFileAccess.Read, null, 
                            HandleInheritability.None, false))
    {                    
        // Walk fixed-size windows from the end of the file toward the start.
        const long WindowSize = 1024 * 1024;
        long windowEnd = mapStream.Length; // exclusive end of the current window

        while (windowEnd > 1)
        {
            long windowStart = Math.Max(windowEnd - WindowSize, 0);
            long viewSize = windowEnd - windowStart;

            using (var fileMap = myMap.CreateViewAccessor(windowStart,
                                 viewSize, MemoryMappedFileAccess.Read))
            {
                // The last two-byte value in the window starts at viewSize - 2.
                for (long readAt = viewSize - 2; readAt >= 0; readAt--)
                {
                    ushort int16Read = fileMap.ReadUInt16(readAt);
                    //0xEB25  <--check int16Read here
                }
            }

            // Overlap consecutive windows by one byte so that a sync pattern
            // straddling a window boundary is not missed.
            windowEnd = windowStart + 1;
        }
    }
}

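【Comments】:

  • I went ahead and upvoted your answer because I think it's a good one, but note that the question has a .NET 3.5 tag, and I believe MemoryMappedFile is only available in .NET 4.0. :)
  • To be fair, you can always P/Invoke the file-mapping API in 3.5 as well, so if this approach helps, you could still use it with some minor changes to the function calls.
  • @RobertHarvey You can do this in 3.5, but you have to wrap the Win32 functions.
  • @RobertHarvey Ahh, yes... I did see the 3.5 tag, but hoped it wasn't a hard requirement. I did some testing with memory-mapped files, and even reading forward was much faster than using a FileStream. Sorry this doesn't work for you.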