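The big advantage of having lines of equal length is that you do not need to look for newlines to know where each line starts. With a file size of ~40GB containing ~1.8M lines, you have a line length of ~20KB/line. If you want to sample 10K lines, you have ~4MB between sampled lines. That is almost certainly three orders of magnitude larger than the size of a block on your disk, so seeking to the next read location is much more efficient than reading every byte in the file.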
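The estimates above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 4 KiB block size is an assumption (a common filesystem default), not a figure from the question.

```python
# Rough figures from the question
file_size = 40 * 10**9      # ~40GB
line_count = 1_800_000      # ~1.8M lines
sample_count = 10_000       # lines to sample

# Assumed disk block size (4 KiB is a common default; check your filesystem)
block_size = 4096

bytes_per_line = file_size / line_count     # average line length in bytes
gap_bytes = file_size / sample_count        # average distance between samples
gap_in_blocks = gap_bytes / block_size      # how many blocks each seek skips

print('Bytes per line: {:.0f}'.format(bytes_per_line))
print('Bytes between samples: {:.0f}'.format(gap_bytes))
print('Blocks skipped per sample: {:.0f}'.format(gap_in_blocks))
```

Each seek skips on the order of a thousand blocks, which is why reading only the sampled lines should beat a full sequential scan.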
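Seeking will also work for files with unequal line lengths (e.g., non-ASCII characters in UTF-8 encoding), but it requires a small modification to the method. If your lines are unequal, you can seek to an estimated location, then scan forward to the start of the next line. This is still quite efficient because you skip ~4MB for every ~20KB you need to read. Your sampling uniformity will be compromised slightly, since you are selecting byte locations instead of line locations, and you will not know for sure which line number you are reading.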
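You can implement your solution directly with the Python code that generates your line numbers. Here is a sample of how to deal with lines that all have the same number of bytes (usually ASCII encoding):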

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000

file_size = getsize(file_name)
with open(file_name) as file:
    # Read the first line to get the length
    file.readline()
    line_size = file.tell()
    # You don't have to seek(0) here: if line #0 is selected,
    # the seek will happen regardless later.

    # Assuming you are 100% sure all lines are equal, this might
    # discard the last line if it doesn't have a trailing newline.
    # If that bothers you, use `round(file_size / line_size)`
    line_count = file_size // line_size

    # This is just a trivial example of how to generate the line numbers.
    # If it doesn't work for you, just use the method you already have.
    # By the way, this will just error out (ValueError) if you try to
    # select more lines than there are in the file, which is ideal
    selection_indices = random.sample(range(line_count), selection_count)
    selection_indices.sort()

    # Now skip to each line before reading it:
    prev_index = 0
    for line_index in selection_indices:
        # Conveniently, seek() is relative to the start of the file
        # by default, not to the current position
        if line_index != prev_index + 1:
            file.seek(line_index * line_size)
        print('Line #{}: {}'.format(line_index, file.readline()), end='')
        # Small optimization to avoid seeking consecutive lines.
        # Might be unnecessary since seek probably already does
        # something like that for you
        prev_index = line_index

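If you are willing to sacrifice a (very) small amount of uniformity in the distribution of line numbers, you can easily apply a similar technique to files with unequal line lengths. You just generate random byte offsets and skip to the next full line after each offset. The following implementation assumes that you know for a fact that no line exceeds 40KB in length. You would have to do something like this if your CSV contained non-ASCII Unicode characters encoded in UTF-8, because even if the lines all contained the same number of characters, they would contain different numbers of bytes. In this case, you have to open the file in binary mode, since otherwise you might run into decoding errors when you skip to a random byte that happens to fall mid-character: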

import random
from os.path import getsize

# Input file path
file_name = 'file.csv'
# How many lines you want to select
selection_count = 10000
# An upper bound on the line size in bytes, not chars
# This serves two purposes:
#   1. It determines the margin to use from the end of the file
#   2. It determines the closest two offsets are allowed to be and
#      still be 100% guaranteed to be in different lines
max_line_bytes = 40000

file_size = getsize(file_name)

# make_offsets is a function that returns `selection_count` monotonically
# increasing unique samples, at least `max_line_bytes` apart from each
# other, in the range [0, file_size - margin). Implementation not provided.
selection_offsets = make_offsets(selection_count, file_size, max_line_bytes)

with open(file_name, 'rb') as file:
    for offset in selection_offsets:
        # Skip to each offset
        file.seek(offset)
        # Read out the rest of the current (probably partial) line
        file.readline()
        # Print the next line. You don't know the number.
        # You also have to decode it yourself.
        print(file.readline().decode('utf-8'), end='')

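
Since `make_offsets` is left unimplemented above, here is one possible sketch of it. It is a hypothetical helper, not a library function. It assumes that a margin of exactly `max_line_bytes` from the end of the file is enough, which holds as long as no line (including its newline) is longer than that. The trick is to sample from a range shortened by the total size of the mandatory gaps, then spread the sorted samples back out.

```python
import random

def make_offsets(count, file_size, max_line_bytes):
    """Return `count` sorted byte offsets in [0, file_size - max_line_bytes),
    each at least `max_line_bytes` away from its neighbors.

    Hypothetical helper: one way to meet the stated requirements.
    """
    # Assumed margin: keeps every offset before the start of the last line
    upper = file_size - max_line_bytes
    # Space that remains once the mandatory gaps are set aside
    free = upper - (count - 1) * max_line_bytes
    if free <= 0:
        raise ValueError('file is too small for this many samples')
    # Unique sorted samples from the shortened range;
    # random.sample raises ValueError if count > free
    base = sorted(random.sample(range(free), count))
    # Shifting the i-th sample by i gaps enforces the minimum spacing
    return [b + i * max_line_bytes for i, b in enumerate(base)]
```

Because every offset is followed by a skipped partial line, the first line of the file can never be selected with this scheme, and lines that follow long lines are slightly more likely to be picked.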
All code here is Python 3.