Fast way to convert strings into lists of ints in a Pandas column? (Answers)

[Question title]: Fast way to convert strings into lists of ints in a Pandas column?
[Posted]: 2016-01-31 06:44:27
[Question]:

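I'm trying to compute the Hamming distance between all strings in a column of a large dataframe. I have over 100,000 rows in this column, so with all pairwise combinations that is 10x10^9 comparisons. These strings are short DNA sequences. I would like to quickly convert every string in the column to a list of integers, where a unique integer represents each character in the string. E.g.

"ACGTACA" -> [0, 1, 2, 3, 0, 1, 0]

I would then use scipy.spatial.distance.pdist to quickly and efficiently compute the Hamming distance between all of these. Is there a fast way to do this in Pandas?

I have tried using apply, but it is pretty slow: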

mapping = {"A":0, "C":1, "G":2, "T":3}
df.apply(lambda x: np.array([mapping[char] for char in x]))

get_dummies and other Categorical operations don't apply because they operate at the per-row level, not within a row.

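[Comments]:

"pretty slow" calls for some dummy data and a benchmark :)
Can you show your dataframe?
Hamming distance is based on element-wise equality or inequality, so translating ['A', 'C', 'G', 'T'] to [0, 1, 2, 3] should be unnecessary.

Tags: python numpy pandas scipy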

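[Solution 1]:

Since Hamming distance doesn't care about differences in magnitude, I can get roughly a 40-60% speedup on a made-up dataset just by replacing df.apply(lambda x: np.array([mapping[char] for char in x])) with df.apply(lambda x: map(ord, x)).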
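A note for Python 3 readers: there map() returns a lazy iterator, so the ord version needs a list() around it to yield actual integers. The sketch below shows that, plus a vectorized variant that avoids the per-row Python loop by reinterpreting fixed-width ASCII bytes as uint8. The toy Series `s` and the names `raw` and `codes` are made up for illustration, and the vectorized part assumes all sequences are ASCII and of equal length.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical); the byte-view trick assumes equal-length ASCII strings.
s = pd.Series(["ACGTACA", "ACGTACC", "TCGTACA"])

# Python 3: wrap map() in list() to materialize the code points.
as_lists = s.apply(lambda seq: list(map(ord, seq)))

# Vectorized variant: build fixed-width byte strings, then view each byte as uint8.
raw = np.array(s.tolist(), dtype="S")             # dtype becomes S7 here
codes = raw.view(np.uint8).reshape(len(raw), -1)  # shape (n_rows, seq_len)

print(codes)
```

The resulting 2-D array holds the same values as ord would give, one column per position in the sequence.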

[Comments]:

[Solution 2]:

I haven't tested the performance of this, but you could also try something like

atest = "ACGTACA"
alist = atest.replace('A', '3.').replace('C', '2.').replace('G', '1.').replace('T', '0.').split('.')
anumlist = [int(x) for x in alist if x.isdigit()]

Result:

[3, 2, 1, 0, 3, 2, 3]

EDIT: OK, so testing it with atest = "ACTACA"*100000 takes a while :/ Maybe not the best idea...

EDIT 5: Another improvement:

import datetime
import numpy as np

class Test(object):
    def __init__(self):
        self.mapping = {'A' : 0, 'C' : 1, 'G' : 2, 'T' : 3}

    def char2num(self, astring):
        return [self.mapping[c] for c in astring]

def main():
    now = datetime.datetime.now()
    atest = "AGTCAGTCATG"*10000000
    t = Test()
    alist = t.char2num(atest)
    testme = np.array(alist)
    # print() calls so the script also runs on Python 3
    print(testme, len(testme))
    print(datetime.datetime.now() - now)

if __name__ == "__main__":
    main()

This takes about 16 seconds for 110,000,000 characters and keeps your processor busy rather than your memory:

[0 2 3 ..., 0 3 2] 110000000
0:00:16.866659
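As a possible alternative to the per-character list comprehension, the same A/C/G/T mapping can be applied through a NumPy lookup table indexed by byte value. This is only a sketch, not something benchmarked in this thread: the function name `char2num_lut` and the choice of -1 for unknown characters are assumptions made for illustration, and the input is assumed to be ASCII.

```python
import numpy as np

# 256-entry lookup table: byte value -> code; -1 marks characters outside the mapping.
MAPPING = {"A": 0, "C": 1, "G": 2, "T": 3}
LUT = np.full(256, -1, dtype=np.int8)
for base, code in MAPPING.items():
    LUT[ord(base)] = code


def char2num_lut(astring):
    """Translate an ASCII string into an int8 array using fancy indexing."""
    raw = np.frombuffer(astring.encode("ascii"), dtype=np.uint8)
    return LUT[raw]


print(char2num_lut("AGTCAGTCATG"))
```

The translation happens in a single indexing operation, so no Python-level loop runs over the characters.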

[Comments]:

[Solution 3]:

Create your test data

In [39]: pd.options.display.max_rows=12

In [40]: N = 100000

In [41]: chars = np.array(list('ABCDEF'))

In [42]: s = pd.Series(np.random.choice(chars, size=4 * np.prod(N)).view('S4'))

In [45]: s
Out[45]: 
0        BEBC
1        BEEC
2        FEFA
3        BBDA
4        CCBB
5        CABE
         ... 
99994    EEBC
99995    FFBD
99996    ACFB
99997    FDBE
99998    BDAB
99999    CCFD
dtype: object

The strings don't actually have to be the same length for the way we are doing this.

In [43]: maxlen = s.str.len().max()

In [44]: result = pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)

In [47]: result
Out[47]: 
       0  1  2  3
0      1  4  1  2
1      1  4  4  2
2      5  4  5  0
3      1  1  3  0
4      2  2  1  1
5      2  0  1  4
...   .. .. .. ..
99994  4  4  1  2
99995  5  5  1  3
99996  0  2  5  1
99997  5  3  1  4
99998  1  3  0  1
99999  2  2  5  3

[100000 rows x 4 columns]

So you can factorize against the same set of categories (i.e. the codes are meaningful across columns).

And it's pretty fast

In [46]: %timeit pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
10 loops, best of 3: 118 ms per loop
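Newer pandas releases no longer accept a `categories=` keyword in `astype('category', ...)`. A minimal sketch of what I believe is the equivalent spelling, using `pd.CategoricalDtype`, is below. The three-row Series is just the first three rows of the sample output above, and `cat_type` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

chars = list("ABCDEF")
s = pd.Series(["BEBC", "BEEC", "FEFA"])  # first three rows of the sample data

# One shared dtype so every position is coded against the same categories.
cat_type = pd.CategoricalDtype(categories=chars)

maxlen = s.str.len().max()
result = pd.concat(
    [s.str[i].astype(cat_type).cat.codes for i in range(maxlen)],
    axis=1,
)

print(result)
```

Because every column is cast to the same dtype, a given letter gets the same code in every position.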

[Comments]:

[Solution 4]:

Using ord or a dict-based lookup that maps exactly A->0, C->1, etc. doesn't seem to make much difference:

import pandas as pd
import numpy as np

bases = ['A', 'C', 'T', 'G']

rowlen = 4
nrows = 1000000

dna = pd.Series(np.random.choice(bases, nrows * rowlen).view('S%i' % rowlen))

lookup = dict(zip(bases, range(4)))

%timeit dna.apply(lambda row: map(lookup.get, row))
# 1 loops, best of 3: 785 ms per loop

%timeit dna.apply(lambda row: map(ord, row))
# 1 loops, best of 3: 713 ms per loop

Jeff's solution (the category-codes approach) is also not far off in terms of performance:

%timeit pd.concat([dna.str[i].astype('category', categories=bases).cat.codes for i in range(rowlen)], axis=1)
# 1 loops, best of 3: 1.03 s per loop

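One major advantage of this approach over mapping the rows to lists of ints is that the category codes can then be viewed as a single (nrows, rowlen) int8 array via the .values attribute, which can be passed directly to pdist.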
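To make that last step concrete, here is a small sketch of feeding the code matrix to `scipy.spatial.distance.pdist`. The three sequences are made-up toy data, the `pd.CategoricalDtype` spelling is used because recent pandas versions reject `astype('category', categories=...)`, and the names `codes`, `frac` and `counts` are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform

bases = ["A", "C", "T", "G"]
rowlen = 4
dna = pd.Series(["ACGT", "ACGA", "TCGA"])  # toy data, all of length rowlen

cat_type = pd.CategoricalDtype(categories=bases)
codes = pd.concat(
    [dna.str[i].astype(cat_type).cat.codes for i in range(rowlen)],
    axis=1,
).to_numpy()  # shape (nrows, rowlen)

# pdist's "hamming" metric returns the *fraction* of differing positions,
# so multiply by the row length to get mismatch counts.
frac = pdist(codes, metric="hamming")
counts = np.rint(frac * rowlen).astype(int)

print(squareform(counts))  # full symmetric matrix, for small inputs only
```

Keep in mind that pdist returns one float64 per pair, so for 100,000 rows the condensed result alone has about 5x10^9 entries (roughly 40 GB); working in chunks may be needed at that scale.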

[Comments]: