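This alternative is more verbose than the others, but it may be more efficient depending on a couple of factors (see the note at the end). Unless you are processing a large number of files with a large number of configuration items, I wouldn't even consider using it over some of the other suggestions, but if performance is an issue, this algorithm might help.
Start with a dictionary from configuration strings to sets of files (call it c2f) and a dictionary from files to sets of configuration strings (f2c). Both can be built as you glob the files.
To be clear, c2f is a dictionary where the keys are strings and the values are sets of files. f2c is a dictionary where the keys are files and the values are sets of strings.
Loop over the file keys of f2c and, for each file, pick one of its data items. Use c2f to find all files that contain that item. Those are the only files you need to compare.
Here's working code: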
# This structure simulates the file system and its contents.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"],
}

# Build the dictionaries (this is O(n) over the lines of configuration data).
f2c = dict()
c2f = dict()
for file, data in cfg_data.items():
    data_set = set()
    for item in data:
        data_set.add(item)
        if item not in c2f:
            c2f[item] = set()
        c2f[item].add(file)
    f2c[file] = data_set

# Build the results as a list of (data set, list of files) pairs.
results = []

# Track the files that have already been processed.
processed = set()

for file, data in f2c.items():
    if file in processed:
        continue

    equivalence_list = []

    # Get one item from data, preferably the one used by the fewest files.
    item = None
    item_files = 0
    for i in data:
        if item is None or len(c2f[i]) < item_files:
            item = i
            item_files = len(c2f[i])

    # All files with the same data as this file must contain the chosen item,
    # so only those files need to be compared. A file with no items appears
    # in no c2f entry, so fall back to scanning every file in that case.
    candidates = c2f[item] if item is not None else f2c
    for other_file in candidates:
        other_data = f2c[other_file]
        if other_data == data:
            equivalence_list.append(other_file)
            # No need to visit these files again.
            processed.add(other_file)

    results.append((data, equivalence_list))

# Display the results.
for data, files in results:
    print(data, ':', files)
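The code above simulates the file system with a dictionary. With real files, the two dictionaries could be built while globbing, roughly as in the sketch below. The helper name build_indexes and the one-item-per-line file layout are assumptions made for illustration; adjust the parsing to whatever your configuration files actually look like.

```python
import glob
from collections import defaultdict


def build_indexes(pattern):
    """Return (f2c, c2f) for every file matching the glob pattern.

    Assumption: each non-blank line of a file is one configuration item.
    """
    f2c = {}                 # file path -> set of configuration strings
    c2f = defaultdict(set)   # configuration string -> set of file paths
    for path in glob.glob(pattern):
        with open(path) as handle:
            # Strip whitespace and skip blank lines.
            items = {line.strip() for line in handle if line.strip()}
        f2c[path] = items
        for item in items:
            c2f[item].add(path)
    return f2c, c2f
```

Calling something like build_indexes("configs/*.txt") (a hypothetical path) would give dictionaries that can replace the f2c and c2f built from cfg_data above. Note that looking up a missing key in a defaultdict creates an empty entry, so use `item in c2f` when you only want to test membership.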
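A note on computational complexity: this is technically O((K log N)*(L log M)), where N is the number of files, M is the number of unique configuration items, K (… is the number of groups, and L (…; this should be efficient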