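【Posted】: 2021-03-13 15:28:36
【Question】:
First: I have multiple text files in different directories, with three different extensions. All of them have the year in their path. Example: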
file20200102
Second: I have a list of patterns. A single txt file holds 2,500 patterns. They are just one-word names, for example:
asd-ddds223
bower3300
...
The text files (379 of them) contain data that looks like this:
XXXXXXXX bower3300 YYYYYYYY
...
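The file sizes vary widely: one may be 20 KB, another 300 MB.
I need to find every line that contains a pattern and write it to a new output file. There could be up to 2,500 x 40,000 matches (because YYYYYY is a user ID).
I wrote a solution that works. The main function looks like this: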
import gzip
import lzma

def create_SNP_list(target_str):
    filesList_g = filesList()  # helper defined elsewhere; returns the list of file paths
    found = []
    target = target_str
    print("Searching for: {} patterns".format(len(target_str)))
    print("Supported extensions: 'txt, csv, xz, gz'. May work with other text files, but they are not supported.")
    # Extension block: collect the distinct extensions that were found
    ext = []
    for l in filesList_g:
        ext.append(l.rsplit('.', 1)[1])
    ext = list(set(ext))
    print("Found file extensions", ext)
    #print("List of Files", filesList_g)
    counter_current = 1
    counter_stop = len(filesList_g)
    for file_path in filesList_g:
        for t in target:
            #print("Current t is {} and current target is: {}".format(t, target))
            #print('{}/{}'.format(counter_current, counter_stop))
            counter_current += 1
            try:
                # Every file is re-opened and re-read once per pattern
                if '.xz' in file_path:
                    with lzma.open(file_path, mode='rt') as src:
                        for line in src:
                            if t in line:
                                line = line.rstrip("\n") + ",{}\n".format(file_path)
                                found.append(line)
                elif '.gz' in file_path:
                    with gzip.open(file_path, 'rt') as src:
                        for line in src:
                            if t in line:
                                line = line.rstrip("\n") + ",{}\n".format(file_path)
                                found.append(line)
                elif '.txt' in file_path or '.csv' in file_path:
                    with open(file_path, 'r', encoding='UTF-8') as src:
                        for line in src:
                            if t in line:
                                line = line.rstrip("\n") + ",{}\n".format(file_path)
                                found.append(line)
                #print("current Found table is = {}".format(found))
            except Exception:
                print("Something is wrong with open file:", file_path + '. File will be omitted.')
                try:
                    # Record the path of the file that could not be read
                    with open('logs.txt', 'a') as f:
                        f.write("%s\n" % file_path)
                except OSError as errorCommunicate:
                    print(errorCommunicate)
    print("len of found list = ", len(found))
    return found
Here, filesList_g is a list of paths to the 379 files, and target is the list of 2,500 patterns, so t is a single pattern, the current one.
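The three branches in the loop differ only in how the file is opened, so they could probably be collapsed into one helper. A minimal sketch follows; `open_text` is a hypothetical name that is not in the original code. It picks the opener from the file suffix rather than a substring test, and it assumes UTF-8 for all three kinds of file, which the original only does for plain text:

```python
import gzip
import lzma
from pathlib import Path


def open_text(file_path):
    """Open a plain, gzip- or xz-compressed file in text mode.

    Hypothetical helper, not part of the original script.
    Assumes every file is UTF-8 encoded.
    """
    suffix = Path(file_path).suffix.lower()
    if suffix == ".xz":
        return lzma.open(file_path, mode="rt", encoding="utf-8")
    if suffix == ".gz":
        return gzip.open(file_path, mode="rt", encoding="utf-8")
    # Anything else (.txt, .csv, ...) is treated as plain text.
    return open(file_path, mode="r", encoding="utf-8")
```

With such a helper, the body of the loop would need only one `with open_text(file_path) as src:` block instead of three.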
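The script works fine... I mean, it runs... but finding one pattern takes about 6 minutes. So... that is 6 x 2,500 = 15,000 minutes, roughly ten days. Far too long :)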
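Most of that time probably goes into re-reading (and re-decompressing) every file once per pattern. One possible way around it is to read each file a single time and test each line against all patterns at once. The sketch below is only an illustration under assumptions: it assumes a pattern always appears as a whole whitespace-separated field, as in the sample line, which is stricter than the substring test `t in line`. The name `find_matches` and its `opener` parameter are hypothetical; `opener` defaults to the built-in `open`, and `gzip.open` or `lzma.open` could be passed for compressed files.

```python
def find_matches(file_paths, patterns, out_path, opener=open):
    """Scan every file once and write matching lines to out_path.

    Hypothetical sketch: assumes each pattern occurs as a whole
    whitespace-separated field, which is stricter than `t in line`.
    A line is written once even if several patterns match it.
    Returns the number of lines written.
    """
    wanted = set(patterns)  # constant-time membership test per field
    written = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for file_path in file_paths:
            with opener(file_path, "rt", encoding="utf-8") as src:
                for line in src:
                    # One check per line, however many patterns there are.
                    if not wanted.isdisjoint(line.split()):
                        out.write(line.rstrip("\n") + ",{}\n".format(file_path))
                        written += 1
    return written
```

Writing matches as they are found, rather than appending them to a list, also avoids holding up to 2,500 x 40,000 lines in memory.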
Maybe some of you know how to speed this up? Any hint would be great! :)
【Comments】:
Tags: python string algorithm search