Reading, parsing and storing .txt file contents in Torch tensors efficiently: answers

【Question title】: Reading, parsing and storing .txt files contents in Torch tensors efficiently
【Posted】: 2017-03-01 07:16:11
【Question】:

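I have a large number of .txt files (possibly around 10 million), each with the same number of rows and columns. They are actually single-channel images whose pixel values are separated by spaces. This is the code I wrote to do the job, but it is slow. I would like to know whether anyone can suggest a more optimized or more efficient approach: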

require 'torch'

local f = assert(io.open(txtFilePath, 'r'))
local tempTensor = torch.Tensor(1, 64, 64):fill(0)
local i = 1  -- current row index
for line in f:lines() do
    local l = line:split(' ')  -- split the row into pixel strings
    for key, val in ipairs(l) do
        tempTensor[{1, i, key}] = tonumber(val)
    end
    i = i + 1
end
f:close()

【Question comments】:

Tags: optimization machine-learning lua torch

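【Solution 1】:

In short: change the source files if you can.

The only thing I can suggest is to use binary data instead of txt as the data source. Your slow steps are f:lines(), line:split(' ') and tonumber(val). All of them work on strings.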

As far as I understand, you have files like this:

0 10 20
11 18 22
....

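So convert your source to a binary file like this:

...

where <18> is the single byte with decimal value 18 (0x12 in hex), <20> is the byte with decimal value 20 (0x14 in hex), and so on.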
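
If the files can be converted once up front, the conversion step itself is small. The following is only a sketch under assumptions that are not in the original post: every pixel is an integer in 0..255 (so one byte per pixel is enough), and the helper name text_to_binary and its two path arguments are hypothetical.

```lua
-- Convert one whitespace-separated text image into a raw byte file.
-- ASSUMPTION: every pixel is an integer in 0..255 (one byte per pixel).
-- text_to_binary is a hypothetical helper, not part of Lua or Torch.
local function text_to_binary(txt_path, bin_path)
  -- Read the whole text file in one call.
  local fin = assert(io.open(txt_path, 'r'))
  local text = fin:read('*a')
  fin:close()

  -- Turn every whitespace-separated token into a single byte.
  local bytes = {}
  for token in text:gmatch('%S+') do
    local v = assert(tonumber(token), 'not a number: ' .. token)
    assert(v >= 0 and v <= 255 and v % 1 == 0,
           'pixel does not fit in one byte: ' .. token)
    bytes[#bytes + 1] = string.char(v)
  end

  -- Write all bytes at once; 'wb' avoids newline translation on Windows.
  local fout = assert(io.open(bin_path, 'wb'))
  fout:write(table.concat(bytes))
  fout:close()
  return #bytes  -- number of pixels written
end
```

Row boundaries are not stored in the output, which is acceptable here only because every file is stated to have the same number of rows and columns.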

Reading:

local fid = assert(io.open(sup_filename, "rb"))
while true do
  local bytes = fid:read(1)        -- read one byte as a 1-character string
  if bytes == nil then break end   -- EOF
  local st = bytes:byte()          -- numeric value (0..255) of that byte
  print(st)
end
fid:close()

See https://www.lua.org/pil/21.2.2.html for details; this should speed things up considerably.
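
Reading with fid:read(1) still costs one function call per pixel. A possible variation is to read the file in a single call and decode the bytes afterwards. This is a minimal sketch, assuming the file holds exactly rows * cols bytes in row-major order; read_binary_image is a hypothetical helper name.

```lua
-- Read a raw byte image in one call and decode it into a flat table.
-- ASSUMPTION: the file holds exactly rows * cols bytes, one per pixel,
-- stored row by row. read_binary_image is a hypothetical helper.
local function read_binary_image(bin_path, rows, cols)
  local fid = assert(io.open(bin_path, 'rb'))
  local data = fid:read('*a')  -- whole file as one Lua string
  fid:close()
  assert(#data == rows * cols, 'unexpected file size')

  -- Decode each byte into its numeric value (0..255).
  local pixels = {}
  for i = 1, #data do
    pixels[i] = data:byte(i)
  end
  return pixels
end
```

For the 64x64 images in the question this would be called as read_binary_image(path, 64, 64). The flat table could then be passed to torch.Tensor and reshaped, but that step and any speed gain are untested here.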

Using regular expressions (instead of :split() and lines()) might help you, but I doubt it.
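
For reference, the pattern-based variant mentioned above might look like the sketch below. Lua offers its own patterns rather than full regular expressions, so the sketch uses string.gmatch. parse_text_image is a hypothetical helper name, and whether this is actually faster than lines() plus split() has not been measured here.

```lua
-- Parse a whitespace-separated text image with one read and one gmatch pass.
-- Returns a flat, row-major table of numbers.
-- parse_text_image is a hypothetical helper, not part of Lua or Torch.
local function parse_text_image(txt_path)
  local f = assert(io.open(txt_path, 'r'))
  local text = f:read('*a')  -- whole file as one string
  f:close()

  local pixels, n = {}, 0
  -- '%S+' matches each run of non-whitespace characters, so spaces and
  -- newlines are both treated as separators.
  for token in text:gmatch('%S+') do
    n = n + 1
    pixels[n] = tonumber(token)
  end
  return pixels
end
```

Because newlines are treated like any other separator, the row and column counts have to be known in advance, which matches the fixed image size described in the question.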

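【Comments】:

Thanks, but for some reasons I cannot change the file format to binary. For now, I need to find a faster way to read the original files.
That's not possible, imo. io is fairly fast. An SSD might help.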