Converting string to int is too slow: answers

【Question title】: Converting string to int is too slow
【Posted】: 2012-12-01 03:13:55
【Question】:

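I have a program that reads 50000 lines with 3 strings on each line. It then does other things. The part that reads the file and converts the strings to numbers takes 80% of the total running time.

My code snippet is as follows: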

import time
file = open ('E:/temp/edges_big.txt').readlines()
start_time = time.time()
for line in file[1:]:
    label1, label2, edge = line.strip().split()
    # label1 = int(label1); label2 = int(label2); edge = float(edge)
    # Rest of the loop deleted
print ('processing file took ', time.time() - start_time, "seconds")

The above takes about 0.84 seconds. Now, when I uncomment the line

label1 = int(label1);label2 = int(label2);edge = float(edge)

the running time increases to about 3.42 seconds.

The input file has the format: str1 str2 str3 on each line

Are int() and float() really that slow? How can I optimize this?

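【Question comments】:

I don't see what difference between the two lines would cause such a large difference in run time; can you clarify?
This is odd. On my machine, two int() calls and one float() call take about 1.7 us in total. Times 50000, that is 85 ms. That makes yours about 30 times slower than mine. That doesn't sound right.
To echo what Tim said, can you state clearly which two versions you are comparing? Right now you have the conversion in the code but the append() commented out. Then you suggest that the time changes when you add the conversion. Either I'm completely misunderstanding this, or there are clearly some typos.
If I were you, I would look at how much time each of the three conversions takes. Also, I would come up with a small, self-contained, runnable test case that demonstrates the slowness and that we can experiment with.
Which Python 3.x version? Here is what I see from a quick trial: 2.7, 3.2 and 3.3 all run in 0.033 without the conversion. With the conversion I get: 2.7 - 0.125s; 3.1 - 0.162s; 3.2 - 0.155s; 3.3 - 0.10s. For 3.1 and 3.2 that is a slowdown of 5, and 0.84 x 5 ~ 4s.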
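
The per-conversion measurement suggested in the comments can be tried with the standard `timeit` module. This is only a minimal sketch, assuming Python 3.6 or later; the sample tokens and the helper name `per_call_seconds` are made up for illustration and are not taken from the question's data.

```python
import timeit

# Sample tokens shaped like one input line; the values are made up.
label1, label2, edge = "150", "952", "0.355243621018"


def per_call_seconds(stmt, number=100_000):
    """Return the best average time per execution of stmt, in seconds."""
    env = {"label1": label1, "label2": label2, "edge": edge}
    best = min(timeit.repeat(stmt, globals=env, repeat=3, number=number))
    return best / number


timings = {
    "int(label1)": per_call_seconds("int(label1)"),
    "float(edge)": per_call_seconds("float(edge)"),
    "all three": per_call_seconds("int(label1); int(label2); float(edge)"),
}

for name, seconds in timings.items():
    # Scale to 50000 lines to compare against the loop in the question.
    per_file_ms = seconds * 50000 * 1e3
    print(f"{name:12s} {seconds * 1e6:8.3f} us/call {per_file_ms:8.2f} ms per 50000 lines")
```

If the "all three" figure scaled to 50000 lines is far below the extra seconds seen in the full program, the conversions themselves are unlikely to be the whole story.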

Tags: python performance python-3.x

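【Solution 1】:

I can't reproduce this at all.

I generated a file of 50000 lines, each containing three random numbers (two ints, one float) separated by spaces.

Then I ran your script on that file. On my three-year-old computer, the original script finished in 0.05 seconds, and the script with the line uncommented took 0.15 seconds. Of course the string-to-int/float conversion takes longer, but certainly not in the range of seconds. Unless your target machine is a toaster running embedded Windows CE.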
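
A self-contained way to repeat this kind of check might look like the sketch below. It is not the script used for the numbers above: the file name `edges_sample.txt`, the header line, the value ranges, and the helper names are all assumptions made for illustration. It writes a temporary 50000-line file and times the question's loop with and without the conversions.

```python
import os
import random
import tempfile
import time


def write_sample(path, n=50000, seed=0):
    """Write a header line plus n lines of 'int int float' (made-up data)."""
    rng = random.Random(seed)
    with open(path, "w") as f:
        f.write("label1 label2 edge\n")  # the question's loop skips the first line
        for _ in range(n):
            f.write(f"{rng.randint(0, 1024)} {rng.randint(0, 1024)} {rng.random()}\n")


def time_loop(path, convert):
    """Time the question's loop, with or without the int()/float() calls."""
    with open(path) as f:
        lines = f.readlines()
    start = time.perf_counter()
    count = 0
    for line in lines[1:]:
        label1, label2, edge = line.strip().split()
        if convert:
            label1 = int(label1); label2 = int(label2); edge = float(edge)
        count += 1
    return count, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "edges_sample.txt")
    write_sample(path)
    n_plain, t_plain = time_loop(path, convert=False)
    n_conv, t_conv = time_loop(path, convert=True)

print(f"without conversion: {t_plain:.3f} s for {n_plain} lines")
print(f"with conversion:    {t_conv:.3f} s for {n_conv} lines")
```

As in the question, the timer starts only after `readlines()` has returned, so disk access is excluded from both figures.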

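【Comments】:

Tim, it's actually a new machine, Core i7, 64-bit. I'll try a different file and update.

【Solution 2】:

If the file is in the OS cache, then parsing the file takes milliseconds on my machine: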

name                                 time ratio comment
read_read                        145 usec  1.00 big.txt
read_readtxt                    2.07 msec 14.29 big.txt
read_readlines                  4.94 msec 34.11 big.txt
read_james_otigo                29.3 msec 201.88 big.txt
read_james_otigo_with_int_float 82.9 msec 571.70 big.txt
read_map_local                  93.1 msec 642.23 big.txt
read_map                        95.6 msec 659.57 big.txt
read_numpy_loadtxt               321 msec 2213.66 big.txt

where the read_*() functions are:

def read_read(filename):
    with open(filename, 'rb') as file:
        data = file.read()

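def read_readtxt(filename):
    # Originally 'rU'; the 'U' flag was removed in Python 3.11, and
    # universal newlines are already the default for text mode in Python 3
    with open(filename, 'r') as file:
        text = file.read()

def read_readlines(filename):
    # Originally 'rU'; see read_readtxt above
    with open(filename, 'r') as file:
        lines = file.readlines()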

def read_james_otigo(filename):
    file = open (filename).readlines()
    for line in file[1:]:
        label1, label2, edge = line.strip().split()

def read_james_otigo_with_int_float(filename):
    file = open (filename).readlines()
    for line in file[1:]:
        label1, label2, edge = line.strip().split()
        label1 = int(label1); label2 = int(label2); edge = float(edge)

def read_map(filename):
    with open(filename) as file:
        L = [(int(l1), int(l2), float(edge))
             for line in file
             for l1, l2, edge in [line.split()] if line.strip()]

def read_map_local(filename, _i=int, _f=float):
    with open(filename) as file:
        L = [(_i(l1), _i(l2), _f(edge))
             for line in file
             for l1, l2, edge in [line.split()] if line.strip()]

import numpy as np

def read_numpy_loadtxt(filename):
    # Use the filename argument instead of the hard-coded 'big.txt'
    a = np.loadtxt(filename, dtype=[('label1', 'i'),
                                    ('label2', 'i'),
                                    ('edge', 'f')])

big.txt was generated using:

#!/usr/bin/env python
import numpy as np

n = 50000
a = np.random.random_integers(low=0, high=1<<10, size=2*n).reshape(-1, 2)
np.savetxt('big.txt', np.c_[a, np.random.rand(n)], fmt='%i %i %s')

It produces 50000 lines:

150 952 0.355243621018
582 98 0.227592557278
478 409 0.546382780254
46 879 0.177980983303
...

To reproduce the results, download the code and run:

# write big.txt
python generate-file.py
# run benchmark
python read-array.py
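
The benchmark script itself is not shown here, so the following is only a rough sketch of how a time/ratio table like the one above could be produced with `timeit`. The helpers `best_time`, `format_time` and `report`, and the two stand-in workloads, are hypothetical names for illustration; in a real run the read_*() functions and the file name would take their place.

```python
import timeit


def best_time(func, *args, repeat=5, number=10):
    """Best per-call time in seconds over several repeats."""
    timer = timeit.Timer(lambda: func(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


def format_time(seconds):
    """Format seconds with the usec/msec/sec units used in the table."""
    for unit, scale in (("sec", 1.0), ("msec", 1e-3), ("usec", 1e-6)):
        if seconds >= scale:
            return f"{seconds / scale:.3g} {unit}"
    return f"{seconds / 1e-9:.3g} nsec"


def report(funcs, *args):
    """Return rows (name, seconds, ratio to fastest), fastest first."""
    results = sorted((best_time(f, *args), f.__name__) for f in funcs)
    fastest = results[0][0]
    return [(name, t, t / fastest) for t, name in results]


# Stand-in workloads (assumptions); replace with the read_*() functions.
def parse_only(tokens):
    for a, b, c in tokens:
        pass


def parse_and_convert(tokens):
    for a, b, c in tokens:
        int(a); int(b); float(c)


tokens = [("150", "952", "0.355243621018")] * 5000
rows = report([parse_and_convert, parse_only], tokens)
for name, seconds, ratio in rows:
    print(f"{name:20s} {format_time(seconds):>10s} {ratio:8.2f}")
```

Taking the minimum over several repeats is a common way to reduce noise from other processes; the ratio column is simply each time divided by the fastest one.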

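【Comments】:

I'm surprised that the numpy loadtxt version is so much slower than the others. Any idea what's going on?
@EdwardLoper: No idea. What results do you get on your machine?
I get the same results as you: loadtxt is much slower. That surprised me, because I thought the whole point of loadtxt was that it is written in C, so it can be fast. On further investigation, though, I may have been mistaken: it may be written in Python after all, in which case I would certainly expect it to be slower.
Here is its definition: projects.scipy.org/numpy/browser/trunk/numpy/lib/… so now it makes sense to me that it is slow. I guess it is meant more for convenience than for loading data quickly.
Edward, I get the following encoding error: SyntaxError: Non-ASCII character '\xb5' in file E:\reporttime.py on line 37, but no encoding declared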

【Solution 3】:

I am able to get almost the same timings as yours. I think the problem was in my code that measured the timings:

read_james_otigo                  40 msec big.txt
read_james_otigo_with_int_float  116 msec big.txt
read_map                         134 msec big.txt
read_map_local                   131 msec big.txt
read_numpy_loadtxt               400 msec big.txt
read_read                        488 usec big.txt
read_readlines                  9.24 msec big.txt
read_readtxt                    4.36 msec big.txt

name                                 time ratio comment
read_read                        488 usec  1.00 big.txt
read_readtxt                    4.36 msec  8.95 big.txt
read_readlines                  9.24 msec 18.95 big.txt
read_james_otigo                  40 msec 82.13 big.txt
read_james_otigo_with_int_float  116 msec 238.64 big.txt
read_map_local                   131 msec 268.05 big.txt
read_map                         134 msec 274.87 big.txt
read_numpy_loadtxt               400 msec 819.42 big.txt


read_james_otigo                39.4 msec big.txt
read_readtxt                    4.37 msec big.txt
read_readlines                  9.21 msec big.txt
read_map_local                   131 msec big.txt
read_james_otigo_with_int_float  116 msec big.txt
read_map                         134 msec big.txt
read_read                        487 usec big.txt
read_numpy_loadtxt               398 msec big.txt

name                                 time ratio comment
read_read                        487 usec  1.00 big.txt
read_readtxt                    4.37 msec  8.96 big.txt
read_readlines                  9.21 msec 18.90 big.txt
read_james_otigo                39.4 msec 80.81 big.txt
read_james_otigo_with_int_float  116 msec 238.51 big.txt
read_map_local                   131 msec 268.84 big.txt
read_map                         134 msec 275.11 big.txt
read_numpy_loadtxt               398 msec 816.71 big.txt

【Comments】: