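How to reduce the processing time of reading a file using numpy: answers

【Question title】: How to reduce the processing time of reading a file using numpy
【Posted】: 2017-12-01 06:14:39
【Question description】:

I want to read a file, compare some values, find the indices of duplicate rows, and delete the duplicates. I do this in a while loop, and it takes a long time to process, about 76 seconds. Here is my code: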

import numpy as np

Source = np.empty(shape=[0, 7])
Source = CalData  # CalData is the log file data
CalTab = np.empty(shape=[0, 7])
Source = Source[Source[:, 4].argsort()]  # sort by azimuth
while Source.size >= 1:
    # rows with the same AZ (column 4) and EL (column 5) as the first row
    temp = np.logical_and(Source[:, 4] == Source[0, 4], Source[:, 5] == Source[0, 5])
    selarrayindex = np.argwhere(temp)  # find indexes
    selarray = Source[temp]
    # keep the row with the largest value in column 6
    CalTab = np.append(CalTab, [selarray[selarray[:, 6].argsort()][-1]], axis=0)
    Source = np.delete(Source, selarrayindex, axis=0)  # delete other rows with similar AZ, EL

The while loop is what takes most of the time. Is there any other approach, either without numpy (plain Python) or with more efficient numpy? Please help!

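【Question comments】:

Have you tried looking into pandas or a similar library?
No, @kshikama. I only want to use numpy or plain Python (for example, finding the columns with file operations).
You need to reduce your question to a minimal reproducible example. What does your algorithm start with (CalData), and what do you want to get (CalTab)? What format, shape, size, etc.? Right now all I see in your code is empty arrays of shape (0,7), which doesn't make much sense. The dtype of the arrays is especially important, because it determines how the operations should be done in numpy.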

Tags: python file numpy

【Solution 1】:

In any case, I think this should improve your timings:

import numpy as np

def lex_pick(Source):
    # indices to sort by columns 4, then 5, then 6
    idx = np.lexsort((Source[:, 6], Source[:, 5], Source[:, 4]))
    prev, nxt = Source[idx[:-1], 5], Source[idx[1:], 5]
    if np.issubdtype(Source.dtype, np.floating):
        # if dtype = float
        changed = np.logical_not(np.isclose(prev, nxt))
    else:
        # if dtype = int or string
        changed = prev != nxt
    # `mask` is `True` in rows before where column 5 changes
    mask = np.r_[changed, True]
    return Source[idx[mask], 6]
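
A minimal sketch of a variant that follows the question's loop more closely, assuming `Source` is a 2-D numeric array with seven columns and that exact equality (as in the question's `==` test) is the intended grouping rule. The name `pick_max_rows` is made up for illustration. Unlike `lex_pick`, it marks a group boundary when either column 4 or column 5 changes, and it returns whole rows instead of only column 6:

```python
import numpy as np

def pick_max_rows(source):
    """For each distinct (column 4, column 5) pair, keep the row with the largest column 6."""
    if source.shape[0] == 0:
        return source.copy()
    # lexsort uses the LAST key as the primary one: sort by column 4, then 5, then 6.
    idx = np.lexsort((source[:, 6], source[:, 5], source[:, 4]))
    s = source[idx]
    # A row closes its group when the next row differs in column 4 or column 5.
    # Exact comparison is an assumption; swap in np.isclose for noisy float data.
    changed = (s[:-1, 4] != s[1:, 4]) | (s[:-1, 5] != s[1:, 5])
    # The final row always closes the last group.
    last_in_group = np.r_[changed, True]
    return s[last_in_group]
```

The rows come back ordered by column 4, then column 5. The array is sorted once and never resized, which avoids the new arrays that `np.append` and `np.delete` allocate on every pass of the loop.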

【Discussion】: