【Question title】: Improving an algorithm for substring search when reading ZIP files
【Posted】: 2025-11-27 17:05:01
【Question】:

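I have a ZIP reader library, and I read a ZIP file by first locating the EOCD (End of Central Directory) record, which is the standard "from the tail" approach. I have to look for a pattern that is roughly this: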

4byte_magic_number, fixed_n_bytes, 2_bytes_of_comment_size, comment

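The byte size of comment is given in 2_bytes_of_comment_size. Scanning for the magic number alone is not enough, because I eagerly read a large chunk of the file's tail (basically the maximum size of a ZIP EOCD record) and then look for this pattern inside it, so the same four bytes may also occur elsewhere in that chunk.

So far I have come up with this: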

def locate_eocd_signature(in_str)
  # We have to scan from the _very_ tail. We read the very minimum size
  # the EOCD record can have (up to and including the comment size), using
  # a sliding window. Once our end offset matches the comment size we found our
  # EOCD marker.
  eocd_signature_int = 0x06054b50
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  end_location = minimum_record_size * -1
  loop do
    # If the window is nil, we have rolled off the start of the string, nothing to do here.
    # We use negative values because if we used positive slice indices
    # we would have to detect the rollover ourselves
    break unless window = in_str[end_location, minimum_record_size]

    window_location = in_str.bytesize + end_location
    unpacked = window.unpack(unpack_pattern)

    # If we found the signature, pick up the comment size, and check if the size of the window
    # plus that comment size is where we are in the string. If we are - bingo.
    if unpacked[0] == 0x06054b50 && comment_size = unpacked[-1] 
      assumed_eocd_location = in_str.bytesize - comment_size - minimum_record_size
      # if the comment size is where we should be at - we found our EOCD
      return assumed_eocd_location if assumed_eocd_location == window_location
    end

    end_location -= 1 # Shift the window back, by one byte, and try again.
  end
end

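But it just screams "ugly" at me. Is there a better way to do this kind of thing? Is there a pack specifier meaning "all binary bytes until the end of the string" that I am not aware of? Then I could append it to the end of the pack specifier... I am a bit at a loss here.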

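【Comments on the question】:

  • Maybe you could use a regular expression, but if you want to avoid ugliness that may be the wrong way to go. One way to clean this up is to move the constants into actual constants and encapsulate the code in a class or module. Also use your constants rather than sprinkling the same magic numbers through your code.
  • I use constants plenty in the module this comes from :-), but point taken. In this case a regular expression actually seems like a passable application...
  • Be careful not to regex yourself to death. It can happen to the best of us.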

Tags: ruby algorithm zip substring


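【Solution 1】:

In the end I went with the following optimization. First, I wrote a method that finds all indices of a given substring within a string, since the stdlib has no built-in for this.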

def all_indices_of_substr_in_str(of_substring, in_string)
  last_i = 0
  found_at_indices = []
  while last_i = in_string.index(of_substring, last_i)
    found_at_indices << last_i
    last_i += of_substring.bytesize
  end
  found_at_indices
end
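
One caveat about the helper above: String#index reports character offsets, which coincide with byte offsets only when the string is binary (ASCII-8BIT) or otherwise single-byte. Below is a minimal sketch of a variant that always yields byte offsets by searching binary copies. The name all_byte_indices_of is hypothetical, and it assumes the caller may pass a buffer tagged with a multibyte encoding such as UTF-8.

```ruby
# Hypothetical variant of the helper above. It searches binary copies of both
# strings, so the offsets String#index returns are byte offsets no matter
# which encoding the input buffer was tagged with.
def all_byte_indices_of(substring, string)
  needle = substring.b    # String#b returns an ASCII-8BIT copy
  haystack = string.b
  found = []
  pos = 0
  while (pos = haystack.index(needle, pos))
    found << pos
    pos += needle.bytesize # skip past this match, as the original helper does
  end
  found
end

signature = [0x06054b50].pack('V')               # the bytes "PK\x05\x06"
buffer = "\u00e9" + signature + "xx" + signature # starts with a 2-byte UTF-8 char
p all_byte_indices_of(signature, buffer)
```

With a UTF-8 tagged buffer like this one, calling index on the original string would count the leading two-byte character as one position, so the results would no longer line up with bytesize arithmetic.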

Then we use it to "latch onto" the offsets in the buffer where the signature is found.

def locate_eocd_signature(in_str)
  eocd_signature = 0x06054b50
  eocd_signature_str = [eocd_signature].pack('V')
  unpack_pattern = 'VvvvvVVv'
  minimum_record_size = 22
  str_size = in_str.bytesize
  indices = all_indices_of_substr_in_str(eocd_signature_str, in_str)
  indices.each do |check_at|
    maybe_record = in_str[check_at..str_size]
    # If the record is smaller than the minimum - we will never recover anything
    break if maybe_record.bytesize < minimum_record_size
    # Now we check if the record ends with the combination
    # of the comment size and an arbitrary byte string of that size.
    # If it does - we found our match
    *_unused, comment_size = maybe_record.unpack(unpack_pattern)
    if (maybe_record.bytesize - minimum_record_size) == comment_size
      return check_at # Found the EOCD marker location
    end
  end
  # If we haven't caught anything, return nil deliberately instead of returning the last statement
  nil
end
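
On the pack-specifier part of the question: String#unpack does accept `a*`, which takes all remaining bytes as a string, so the comment can be unpacked together with the fixed fields and its length compared directly with the declared comment size. The following is only a sketch of that idea, not the code the library ended up with; the constant names and locate_eocd_with_rest are hypothetical.

```ruby
# Hypothetical names throughout; a sketch, not the library's actual code.
EOCD_SIGNATURE = [0x06054b50].pack('V') # "PK\x05\x06"
EOCD_MIN_SIZE  = 22                     # fixed EOCD fields, without the comment

def locate_eocd_with_rest(buffer)
  buf = buffer.b # binary copy, so index and byteslice agree on offsets
  pos = 0
  while (pos = buf.index(EOCD_SIGNATURE, pos))
    candidate = buf.byteslice(pos, buf.bytesize - pos)
    # Later candidates are even shorter, so no match is possible any more.
    break if candidate.bytesize < EOCD_MIN_SIZE
    # 'a*' swallows everything after the 2-byte comment size field.
    *_fixed, comment_size, comment = candidate.unpack('VvvvvVVva*')
    return pos if comment.bytesize == comment_size
    pos += 1
  end
  nil
end

comment = "hello " + EOCD_SIGNATURE + " world" # comment embedding the signature
eocd = [0x06054b50, 0, 0, 1, 1, 100, 200, comment.bytesize].pack('VvvvvVVv') + comment
tail = EOCD_SIGNATURE + ("x" * 30) + eocd      # a false hit precedes the real record
p locate_eocd_with_rest(tail)
```

The false hit at offset 0 is rejected because the comment size it declares does not equal the number of bytes that follow it, while the real record at offset 34 is consistent and gets returned.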

【Discussion】: