How can I read from a file with Python from a specific location to a specific location? (Answers)

【Question title】: How can I read from a file with Python from a specific location to a specific location?
【Posted】: 2020-04-14 03:35:41
【Question】:

Currently, I am doing:

    source_noise = np.fromfile('data/noise/' + source + '_16k.dat', sep='\n')
    source_noise_start = np.random.randint(
        0, len(source_noise) - len(audio_array))
    source_noise = source_noise[source_noise_start:
                                source_noise_start + len(audio_array)]

My file looks like:

  -5.3302745e+02
  -5.3985005e+02
  -5.8963920e+02
  -6.5875741e+02
  -5.7371864e+02
  -2.0796765e+02
   2.8152341e+02
   6.5398089e+02
   8.6053581e+02

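... and so on.

This requires me to read the whole file, when I only want to read part of it. Is there a way to do this in Python that is faster than what I am doing now?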

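【Comments】:

You can use the regular open function and then call file.readline() in a loop to iterate over only the first n lines. However, this does require you to parse the data yourself, since it is returned only as text.
I am happy to parse it myself, since it is just one number per line. But I don't necessarily want the first lines.
Is performance really an issue?
Yes. I am doing this thousands of times across thousands of files, so I need it to be as fast as possible.
Please state the number of files, the number of samples in each file, and the number of samples you want to read from each file.

Tags: python numpy file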

【Solution 1】:

You can use the seek method to move around inside the file and read from a specific position.

File data -> "hello world"

start_read = 6

with open("filename", 'rb') as file:
    file.seek(start_read)
    output = file.read(5)
    print(output)

# displays b'world' (a bytes object, because the file is opened in binary mode)
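
Applied to the question's data, the same seek idea might look like the sketch below. It rests on an assumption that the question does not confirm: that every line has exactly the same width (the sample lines look like they were written with a "%16.7e" format, which gives 17 bytes per line including the newline). The helper name read_fixed_width and the record_size parameter are hypothetical, introduced only for illustration.

```python
import os
import tempfile

import numpy as np


def read_fixed_width(path, start_line, count, record_size):
    """Read `count` values starting at line `start_line` (0-based).

    ASSUMPTION: every line is exactly `record_size` bytes, newline included.
    If that does not hold, the seek lands in the middle of a number.
    """
    with open(path, "rb") as f:
        f.seek(start_line * record_size)     # jump straight to the first wanted line
        chunk = f.read(count * record_size)  # read only the bytes that are needed
    # bytes.split() splits on any whitespace; float() accepts ASCII bytes
    return np.array([float(tok) for tok in chunk.split()])


# Demo on a throwaway file: "%16.7e" plus "\n" is 17 bytes per line.
with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
    for value in range(10):
        tmp.write(("%16.7e\n" % float(value)).encode("ascii"))
    demo_path = tmp.name

window = read_fixed_width(demo_path, start_line=3, count=2, record_size=17)
os.remove(demo_path)
```

Under that assumption, the start line from np.random.randint can be turned into a byte offset directly, so only len(audio_array) lines are read and parsed.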

【Discussion】:

Is this faster?
Can I move to a specific line?

【Solution 2】:

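Your file consists of lines, so seek() on its own is of little use, because it offsets into the file by bytes. That means you have to position the read very carefully if you want correct results; otherwise you end up with a missing - sign or missing decimal digits, or the text is cut somewhere in the middle of a number.

Not to mention quirks such as switching between scientific notation (eN) and plain floats, which can also happen if you dump the wrong thing to the file.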
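
One common way around the cut-in-the-middle problem is to seek to an arbitrary byte offset, discard the rest of the line the offset landed in, and only then start parsing. The sketch below is an illustration of that idea, not something from the original answer; the function name read_after_offset is hypothetical, and the approach never returns the first line of the file.

```python
import os
import tempfile


def read_after_offset(path, offset, count):
    """Return up to `count` floats from the lines that start after byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        f.readline()  # throw away the (probably partial) line the offset landed in
        values = []
        for _ in range(count):
            line = f.readline()
            if not line.strip():  # end of file (or a blank line)
                break
            values.append(float(line.strip()))
    return values


# Demo on a throwaway file with five 5-byte lines.
with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
    tmp.write(b"10.0\n20.0\n30.0\n40.0\n50.0\n")
    demo_path = tmp.name

# A bare seek + read cuts through the middle of "20.0" and "30.0".
with open(demo_path, "rb") as f:
    f.seek(7)
    naive = f.read(5)

aligned = read_after_offset(demo_path, offset=7, count=2)
os.remove(demo_path)
```

For a random window, offset could be drawn from range(os.path.getsize(path)); near the end of the file the function may return fewer than count values, so the caller has to handle that case.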

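Now, as for reading, Python lets you use readlines(hint=-1):

"hint can be specified to control the number of lines read: no more lines will be read if the total size (in bytes/characters) of all lines so far exceeds hint."

Therefore: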

test.txt

Console

>>> with open("test.txt") as f:
...     print(f.readlines(5))
...     print(f.readlines(9))
... 
['123\n', '456\n']
['789\n', '012\n', '345\n']

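I haven't measured it, but if you don't want to handle the lines yourself and don't want to shoot yourself in the foot, this is probably the fastest option in pure Python; seek() may well end up slower because of a suboptimal parsing solution on your side.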
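
Since hint limits by size rather than by line number, selecting lines start to start + count needs a different tool. Two possibilities, sketched here as unmeasured suggestions rather than as part of the original answer, are itertools.islice over the file object and np.loadtxt with its skiprows and max_rows parameters. The helper name read_line_range is hypothetical.

```python
import os
import tempfile
from itertools import islice

import numpy as np


def read_line_range(path, start, count):
    """Parse `count` lines starting at line `start` (0-based) into a float array."""
    with open(path) as f:
        # islice skips the first `start` lines without keeping them in memory
        # and stops iterating as soon as `count` lines have been yielded.
        return np.array([float(line) for line in islice(f, start, start + count)])


# Demo on a throwaway file holding the values 0.0 .. 9.0, one per line.
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.writelines("%16.7e\n" % value for value in range(10))
    demo_path = tmp.name

via_islice = read_line_range(demo_path, start=3, count=2)
via_loadtxt = np.loadtxt(demo_path, skiprows=3, max_rows=2)
os.remove(demo_path)
```

Both variants still have to scan past the first start lines to find the line breaks; what they avoid is parsing and storing the numbers that are not wanted.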

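I'm a little confused by "...from a specific location to a specific location?". If no parsing is intended, the solution could also be just a bash script or something similar, but you have to know the number of lines in the file (an alternative to the readlines(hint=-1) function):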

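with open(file) as inp:
    with open(file2, "w") as out:  # the output file must be opened for writing
        for idx in range(num_of_lines):  # num_of_lines must be known in advance
            line = inp.readline()  # read the next full line (no size argument)
            if not some_logic(line):
                continue
            out.write(line)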

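Note: the with blocks are nested only to avoid the overhead of first reading the whole file and then checking and writing it elsewhere.

Still, you are using numpy, which is only a small step away from Cython or a C/C++ library. That means you can skip the Python overhead and read the file directly with Cython or C.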

mmap, mmap vs ifstream vs fread.
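
Python's standard library also has an mmap module, so memory mapping can be tried without leaving Python. The sketch below is an untimed illustration: it maps the file, aligns an arbitrary byte offset to the next line start with find(), and parses a few lines from there. The function name read_lines_mmap is hypothetical.

```python
import mmap
import os
import tempfile


def read_lines_mmap(path, offset, count):
    """Return up to `count` floats from the lines that start after byte `offset`."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        newline = mm.find(b"\n", offset)  # end of the line the offset landed in
        if newline == -1:
            return []
        mm.seek(newline + 1)              # first byte of the next full line
        values = []
        for _ in range(count):
            line = mm.readline().strip()
            if not line:                  # end of file (or a blank line)
                break
            values.append(float(line))
        return values


# Demo on a throwaway file with five 5-byte lines.
with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
    tmp.write(b"10.0\n20.0\n30.0\n40.0\n50.0\n")
    demo_path = tmp.name

mapped = read_lines_mmap(demo_path, offset=7, count=2)
os.remove(demo_path)
```

Whether this beats a plain seek() plus readline() depends on file size and access pattern, so it is worth timing both on the real data.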

Here is an article that actually measures:

Python code (readline()),
Cython (just a dummy compile),
C (cimport from stdio.h to use getline() (can't find a C reference :/)),
C++ (apparently mislabeled as C in the plot)

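This seems to be the most efficient code; with some cleanup and some lines removed, it should give you an idea in case you want to try mmap or some other fancy way of reading. I haven't measured it myself, though:

Dependencies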

apt install build-essential  # gcc, etc
pip install cython

setup.py

from setuptools import setup  # distutils was removed from the standard library in Python 3.12
from Cython.Build import cythonize

setup(
    name="test",
    ext_modules = cythonize("test.pyx")
)

test.pyx

from libc.stdio cimport *
from libc.stdlib cimport free

cdef extern from "stdio.h":
    FILE *fopen(const char *, const char *)
    int fclose(FILE *)
    ssize_t getline(char **, size_t *, FILE *)

def read_file(filename):
    filename_byte_string = filename.encode("UTF-8")
    cdef char* fname = filename_byte_string

    cdef FILE* cfile
    cfile = fopen(fname, "rb")
    if cfile == NULL:
        raise FileNotFoundError(2, "No such file or directory: '%s'" % filename)

    cdef char * line = NULL
    cdef size_t l = 0
    cdef ssize_t read
    cdef list result = []

    while True:
        read = getline(&line, &l, cfile)
        if read == -1:
            break
        result.append(line)  # copies the C string into a Python bytes object

    free(line)  # getline() allocates the buffer; release it to avoid a leak
    fclose(cfile)
    return result

Shell

pip install --editable .

Console

from test import read_file
lines = read_file(file)

【Discussion】: