Python: Compute all possible pairwise distances of a list (DTW)

【Question title】: Python: Compute all possible pairwise distances of a list (DTW)
【Posted】: 2016-11-16 13:30:34
【Question】:

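I have a list of items like this: T=[T_0, T_1, ..., T_N], where each T_i is itself a time series. I want to find the pairwise distance (via DTW) for every possible pair.

For example, if T=[T_0, T_1, T_2] and I have a DTW function f, I want to compute f(T_0, T_1), f(T_0, T_2), f(T_1, T_2).

Note that T_i actually looks like ( id of i, [ time series values ] ).

My code snippet looks like this: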

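# Imports the snippet presumably relies on (not shown in the original post).
from collections import defaultdict

from fastdtw import fastdtw
from scipy.spatial.distance import euclidean
from tqdm import tqdm

# T and distance_threshold are defined elsewhere.
cluster = defaultdict(list)       # id -> series within the threshold
donotcluster = defaultdict(list)  # id -> series already found to be too far
for i, lst1 in tqdm(enumerate(T)):
    for lst2 in tqdm(T):
        if lst2 in cluster[lst1[0]] or lst2 in donotcluster[lst1[0]]:
            pass  # this pair was already handled from the other side
        else:
            distance, path = fastdtw(lst1[1], lst2[1], dist=euclidean)
            if distance <= distance_threshold:
                cluster[lst1[0]] += [lst2]
                cluster[lst2[0]] += [lst1]
            else:
                donotcluster[lst1[0]] += [lst2]
                donotcluster[lst2[0]] += [lst1]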

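I currently have about 20,000 time series and this takes far too long (roughly 5 days). I am using the Python library fastdtw. Is there a more optimized library? Or simply a better/faster way to compute all possible distances? Since the distance is symmetric, if I have already computed f(T_33, T_41), I don't need to compute, e.g., f(T_41, T_33).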
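To put the symmetry observation into numbers, here is a quick count of how many pairs a loop over T x T visits versus how many unordered pairs actually exist. It assumes exactly 20,000 series (the question only says "about"), and `n_series` is a name made up for this illustration.

```python
import math

n_series = 20_000  # assumption: the question says "about 20,000"

# A nested loop over T x T visits every ordered pair, including (i, i).
ordered_pairs = n_series * n_series

# Because f(a, b) == f(b, a), only unordered pairs of distinct series matter.
unordered_pairs = math.comb(n_series, 2)

print(ordered_pairs)    # 400000000
print(unordered_pairs)  # 199990000
```

So exploiting symmetry roughly halves the number of pairs to consider; the remaining cost is dominated by the DTW calls themselves.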

【Comments】:

Tags: python euclidean-distance

【Solution 1】:

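I would suggest keeping a set of all the pairs you have processed so far; remember that a set has constant-time lookup. Beyond that, you should look for approaches that don't extend lists so often (that nasty += you are doing), because it can be quite expensive. I don't know enough about the implementation of your application to comment on that, though. If you provide more information, I might be able to find a way to get rid of some of the += you don't need. One idea (for efficiency) is to append each list to a list of lists, and then flatten it at the end of the script using something like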

[i for x in cluster[lst[0]] for i in x]
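A minimal sketch of the append-then-flatten idea, on made-up data: each batch is appended as a single element, and the nested list is flattened once at the end. The name `batches` and its contents are hypothetical; `itertools.chain.from_iterable` is the standard-library equivalent of the comprehension.

```python
from itertools import chain

# Hypothetical data: each inner list is one batch collected during the loop.
batches = []
batches.append(["T_1", "T_2"])
batches.append(["T_3"])
batches.append(["T_4", "T_5"])

# Flatten once at the end, with the same comprehension shape as above.
flat = [item for batch in batches for item in batch]

# Standard-library equivalent of the comprehension.
flat_chain = list(chain.from_iterable(batches))

print(flat)  # ['T_1', 'T_2', 'T_3', 'T_4', 'T_5']
```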

I modified your code as follows:

from collections import defaultdict

from fastdtw import fastdtw
from scipy.spatial.distance import euclidean
from tqdm import tqdm

def hashPair(a, b):  # added this; defined before the loop that calls it
    # Order-independent key: (a, b) and (b, a) map to the same tuple.
    return (max(a, b), min(a, b))

cluster = defaultdict(list)
donotcluster = defaultdict(list)
seen = set()  # added this
for i, lst1 in tqdm(enumerate(T)):
    for lst2 in tqdm(T):
        # Key on the ids (lst1[0], lst2[0]); the value lists are not hashable.
        key = hashPair(lst1[0], lst2[0])
        if key not in seen and lst2 not in cluster[lst1[0]] and lst2 not in donotcluster[lst1[0]]:  # changed around your condition
            seen.add(key)  # added this
            distance, path = fastdtw(lst1[1], lst2[1], dist=euclidean)
            if distance <= distance_threshold:
                cluster[lst1[0]] += [lst2]
                cluster[lst2[0]] += [lst1]
            else:
                donotcluster[lst1[0]] += [lst2]
                donotcluster[lst2[0]] += [lst1]
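The "set of finished pairs" idea can also be seen in isolation. The sketch below uses made-up integer ids and a counter in place of the DTW call; `pair_key` is a hypothetical helper, and skipping a series paired with itself is an assumption of this sketch, not something the answer states.

```python
def pair_key(id_a, id_b):
    """Return the same key for (a, b) and (b, a)."""
    return (id_a, id_b) if id_a <= id_b else (id_b, id_a)

seen = set()       # keys of pairs that were already processed
calls = 0          # counts how often the expensive step would run
ids = [33, 41, 7]  # made-up ids

for a in ids:
    for b in ids:
        if a == b:
            continue  # assumption: a series is not compared with itself
        key = pair_key(a, b)
        if key in seen:
            continue  # the symmetric pair was already handled
        seen.add(key)
        calls += 1    # this is where the DTW call would go

print(calls)  # 3
```

The set lookup is constant time on average, whereas `lst2 in cluster[...]` scans a list and compares whole time series.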

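【Comments】:

I only need a dictionary that gives, for each id, the other ids whose distance is below some threshold. I keep the donotcluster dictionary so that I don't have to compute the distance again if the same symmetric pair was already found (and shown not to meet the threshold).
My code does the check with a set rather than a dict, which is more efficient :)

【Solution 2】:

I can't answer your question about whether there is a more optimized DTW library, but you can use itertools to get the combinations you want without repeats: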

import itertools

for combination in itertools.combinations(T, 2):
    f(combination[0], combination[1])

Here are example combinations:

('T_1', 'T_2')
('T_1', 'T_3')
('T_1', 'T_4')
('T_1', 'T_5')
('T_2', 'T_3')
('T_2', 'T_4')
('T_2', 'T_5')
('T_3', 'T_4')
('T_3', 'T_5')
('T_4', 'T_5')

【Comments】: