【Question Title】: How do you optimize this code for nn prediction?
【Posted】: 2017-01-30 05:38:43
【Question Description】:

How would you optimize this code? It is currently slow for the amount of data it has to loop through. The code performs 1-nearest-neighbour classification: it predicts the label of training_element from p_data_set.

#               [x] ,           [[x1],[x2],[x3]],    [l1, l2, l3]
def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, distance.euclidean(training_element, p))

    minIndex = np.argmin(temp)
    return p_label_set[minIndex]

【Question Discussion】:

  • What are the shapes of the inputs involved?
  • (100L), (40,000, 100L), (40,000)

Tags: python performance numpy scipy nearest-neighbor


【Solution 1】:

You can use distance.cdist to get the distances temp directly and then use .argmin() to get the min-index, like so -

minIndex = distance.cdist(training_element[None],p_data_set).argmin()

Here is another approach using np.einsum -

subs = p_data_set - training_element
minIndex =  np.einsum('ij,ij->i',subs,subs).argmin()
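As a quick sanity check (a minimal sketch on random data with the shapes from the question's comments), the einsum expression computes squared Euclidean distances, and since sqrt is monotonic the argmin matches cdist's:

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(0)
p_data_set = rng.random((1000, 100))
training_element = rng.random(100)

# einsum gives squared Euclidean distances; sqrt is monotonic,
# so the argmin is the same as for the true distances
subs = p_data_set - training_element
sq_dists = np.einsum('ij,ij->i', subs, subs)

# reference distances from cdist ([None] turns the 1D vector into a (1, 100) row)
ref = distance.cdist(training_element[None], p_data_set)[0]

assert sq_dists.argmin() == ref.argmin()
```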

Runtime tests

Well, I was thinking cKDTree would easily beat cdist, but I guess training_element being a 1D array isn't too heavy for cdist, and I am seeing it beat cKDTree by a 10x+ margin!

Here are the timing results -

In [422]: # Setup arrays
     ...: p_data_set = np.random.randint(0,9,(40000,100))
     ...: training_element = np.random.randint(0,9,(100,))
     ...: 

In [423]: def tree_based(p_data_set,training_element): #@ali_m's soln
     ...:     tree = cKDTree(p_data_set)
     ...:     dist, idx = tree.query(training_element, k=1)
     ...:     return idx
     ...: 
     ...: def einsum_based(p_data_set,training_element):    
     ...:     subs = p_data_set - training_element
     ...:     return np.einsum('ij,ij->i',subs,subs).argmin()
     ...: 

In [424]: %timeit tree_based(p_data_set,training_element)
1 loops, best of 3: 210 ms per loop

In [425]: %timeit einsum_based(p_data_set,training_element)
100 loops, best of 3: 17.3 ms per loop

In [426]: %timeit distance.cdist(training_element[None],p_data_set).argmin()
100 loops, best of 3: 14.8 ms per loop

【Discussion】:

    【Solution 2】:

    Python can be a pretty fast language if used properly. Here is my suggestion (faster_prediction):

    import numpy as np
    import time
    
    def euclidean(a,b):
        return np.linalg.norm(a-b)
    
    def prediction(training_element, p_data_set, p_label_set):
        temp = np.array([], dtype=float)
        for p in p_data_set:
            temp = np.append(temp, euclidean(training_element, p))
    
        minIndex = np.argmin(temp)
        return p_label_set[minIndex]
    
    def faster_prediction(training_element, p_data_set, p_label_set):    
        temp = np.tile(training_element, (p_data_set.shape[0],1))
        temp = np.sqrt(np.sum( (temp - p_data_set)**2 , 1))    
    
        minIndex = np.argmin(temp)
        return p_label_set[minIndex]   
    
    
    training_element = [1,2,3]
    p_data_set = np.random.rand(100000, 3)*10
    p_label_set = np.r_[0:p_data_set.shape[0]]
    
    
    t1 = time.time()
    result_1 = prediction(training_element, p_data_set, p_label_set)
    t2 = time.time()
    
    t3 = time.time()
    result_2 = faster_prediction(training_element, p_data_set, p_label_set)
    t4 = time.time()
    
    
    print("Execution time 1:", t2 - t1, "value:", result_1)
    print("Execution time 2:", t4 - t3, "value:", result_2)
    print("Speed up:", (t2 - t1) / (t4 - t3))
    

    I get the following results on a fairly old laptop:

    Execution time 1: 21.6033108234 value:  9819
    Execution time 2: 0.0176379680634 value:  9819
    Speed up:  1224.81857013
    

    Which makes me think I must have made some silly mistake :)
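As an aside, NumPy broadcasting makes the np.tile copy in faster_prediction unnecessary: subtracting a (ndims,) vector from an (nsamples, ndims) array already applies row-wise. A minimal sketch of the same computation without the tile:

```python
import numpy as np

p_data_set = np.random.rand(100000, 3) * 10
training_element = np.array([1, 2, 3])

# (100000, 3) - (3,) broadcasts the vector across every row,
# so no explicitly tiled copy is materialised
dists = np.sqrt(np.sum((p_data_set - training_element) ** 2, axis=1))
minIndex = np.argmin(dists)
```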

    In case of really huge data, where memory might become an issue, I suggest using Cython, or implementing the function in C++ and wrapping it in Python.
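Before reaching for Cython or C++, one pure-NumPy way to keep memory bounded is to process p_data_set in chunks, so only a (chunk_size, ndims) temporary ever exists. This is my own sketch, not from the answer; the chunked_prediction name and the chunk_size parameter are hypothetical:

```python
import numpy as np

def chunked_prediction(training_element, p_data_set, p_label_set,
                       chunk_size=10000):
    # Hypothetical 1-NN lookup that only ever materialises a
    # (chunk_size, ndims) temporary, keeping peak memory bounded.
    best_dist = np.inf
    best_idx = 0
    for start in range(0, p_data_set.shape[0], chunk_size):
        chunk = p_data_set[start:start + chunk_size]
        # squared distances preserve the argmin, so the sqrt is skipped
        d = np.sum((chunk - training_element) ** 2, axis=1)
        i = int(d.argmin())
        if d[i] < best_dist:
            best_dist = d[i]
            best_idx = start + i
    return p_label_set[best_idx]
```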

    【Discussion】:

    • If many searches against the same data set are needed, I would recommend KDTree, as mentioned by ali_m above.
    【Solution 3】:

    Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree:

    from scipy.spatial import cKDTree
    
    # I assume that p_data_set is (nsamples, ndims)
    tree = cKDTree(p_data_set)
    
    # training_elements is also assumed to be (nsamples, ndims)
    dist, idx = tree.query(training_elements, k=1)
    
    predicted_labels = p_label_set[idx]
    

    【Discussion】:
