Grouping items by match set: answers

【Question title】: Grouping items by match set
【Posted】: 2011-09-21 08:23:51
【Question】:

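I am trying to parse a large number of config files and sort the results into separate groups based on their content, but I don't know how to approach it. For example, say I have the following data in 4 files: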

config1.txt
ntp 1.1.1.1
ntp 2.2.2.2

config2.txt
ntp 1.1.1.1

config3.txt
ntp 2.2.2.2
ntp 1.1.1.1

config4.txt
ntp 2.2.2.2

The result would be:

Unique data sets: 3
Set 1 (1.1.1.1, 2.2.2.2): config1.txt, config3.txt
Set 2 (1.1.1.1): config2.txt
Set 3 (2.2.2.2): config4.txt

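I understand how to glob the directory of files, loop over the glob results, open each file in turn, and match each line with a regular expression. The part I don't understand is how to store those results and compare each file against a set of results, so that files match when they contain the same entries even if the entries appear in a different order. Any help would be appreciated.

Thanks!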

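【Comments】:

"I understand how to glob the directory of files, loop over the glob results, open each file in turn, and match each line with a regular expression." Show us that code and we'll be happy to tell you how to do the rest. Hint: use a dictionary.

Tags: python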

【Solution 1】:

filenames = [ r'config1.txt',
              r'config2.txt',
              r'config3.txt',
              r'config4.txt' ]
results = {}
for filename in filenames:
    with open(filename, 'r') as f:
        contents = ( line.split()[1] for line in f )
        key = frozenset(contents)
        results.setdefault(key, []).append(filename)

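【Discussion】:

I prefer defaultdict(list) over dict.setdefault.
I probably should do that too, but I have a habit of importing as little as possible, and it is a hard one for me to break.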
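
For comparison, here is a minimal end-to-end sketch of the defaultdict(list) variant, combined with the glob-and-regex steps the question describes. The function name group_configs, the regular expression, and the "ntp <address>" line format are assumptions based on the sample data, not something the original posters wrote.

```python
import glob
import os
import re
from collections import defaultdict

# Assumed line format, taken from the question's sample data: "ntp <address>".
NTP_LINE = re.compile(r"^\s*ntp\s+(\S+)")


def group_configs(directory, pattern="*.txt"):
    """Map each distinct set of NTP servers to the files that contain it."""
    groups = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path) as f:
            matches = (NTP_LINE.match(line) for line in f)
            # frozenset ignores order and duplicates, and it is hashable,
            # so it can serve directly as the dictionary key.
            servers = frozenset(m.group(1) for m in matches if m)
        groups[servers].append(os.path.basename(path))
    return dict(groups)
```

Lines that do not match the pattern are skipped, so blank lines and unrelated settings do not affect the grouping.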

【Solution 2】:

from collections import defaultdict

#Load the data.
paths = ["config1.txt", "config2.txt", "config3.txt", "config4.txt"]
files = {}

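for path in paths:
    with open(path) as file:
        # Assumption: each non-blank line looks like "ntp 1.1.1.1", as in
        # the question, so the second field is the value to keep.
        data = [line.split()[1] for line in file if line.strip()]
    files[path] = frozenset(data)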

#Example data.
files = {
    "config1.txt": frozenset(["1.1.1.1", "2.2.2.2"]),
    "config2.txt": frozenset(["1.1.1.1"]),
    "config3.txt": frozenset(["2.2.2.2", "1.1.1.1"]),
    "config4.txt": frozenset(["2.2.2.2"]),
}

sets = defaultdict(list)

for key, value in files.items():
    sets[value].append(key)

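Note that you need to use frozensets: they are immutable and therefore hashable, so they can be used as dictionary keys. Since the contents never change once a file has been read, immutability costs you nothing here.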
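
A small sketch of why this matters, using made-up values that mirror the question's sample data: two frozensets with the same members compare and hash equal regardless of order or duplicates, while a plain set is rejected as a dictionary key.

```python
a = frozenset(["1.1.1.1", "2.2.2.2"])
b = frozenset(["2.2.2.2", "1.1.1.1", "1.1.1.1"])  # other order, one duplicate

# Equal members give equal frozensets with equal hashes.
same_key = (a == b) and (hash(a) == hash(b))

# Both files therefore land under a single dictionary key.
groups = {}
groups.setdefault(a, []).append("config1.txt")
groups.setdefault(b, []).append("config3.txt")

# A mutable set cannot be used as a key: Python raises TypeError.
try:
    {set(["1.1.1.1"]): "config2.txt"}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False
```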

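【Discussion】:

Lean and mean, I like it. I think it is O(N*M), where N is the number of files and M is the average number of config items per file.

【Solution 3】:

This alternative is more verbose than the others, but it may be more efficient depending on a few factors (see the note at the end). Unless you are processing a large number of files with a large number of config items, I would not even consider using it over some of the other suggestions, but if performance is a concern this algorithm may help.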

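Start with a dictionary from config strings to sets of files (call it c2f) and another from files to sets of config strings (f2c). Both can be built as you glob the files.

To be clear, c2f is a dictionary whose keys are strings and whose values are sets of files. f2c is a dictionary whose keys are files and whose values are sets of strings.

Iterate over the file keys of f2c and pick one data item. Use c2f to find all the files that contain that item. Those are the only files you need to compare.

Here is working code: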

# This structure simulates the file system and its contents.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"]
}

# Build the dictionaries (this is O(n) over the lines of configuration data).
f2c = {}
c2f = {}

for file, data in cfg_data.items():
    data_set = set()
    for item in data:
        data_set.add(item)
        if item not in c2f:
            c2f[item] = set()

        c2f[item].add(file)
    f2c[file] = data_set

# Build the results as a list of (data, files) pairs.
results = []

# Track the processed files.
processed = set()

for file, data in f2c.items():
    if file in processed:
        continue

    equivalence_list = []

    # Get one item from data, preferably the one used by the fewest files.
    item = None
    item_files = 0
    for i in data:
        if item is None or len(c2f[i]) < item_files:
            item = i
            item_files = len(c2f[i])

    # All files with the same data as this file must contain that item, so
    # only those files need checking. A file with no items has nothing to
    # look up in c2f, so it is compared against every file instead.
    candidates = c2f[item] if item is not None else f2c
    for other_file in candidates:
        other_data = f2c[other_file]
        if other_data == data:
            equivalence_list.append(other_file)
            # No need to visit these files again.
            processed.add(other_file)

    results.append((data, equivalence_list))

# Display the results.
for data, files in results:
    print(data, ':', files)

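A note on computational complexity: this is technically O((K log N)*(L log M)), where N is the number of files, M is the number of unique configuration items, K is the number of groups, and L [...]. This should be efficient [...].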

【Discussion】:

【Solution 4】:

I would approach it like this:

First, build a dictionary like this:

{(1.1.1.1) : (file1, file2, file3), (2.2.2.2) : (file1, file3, file4) }

Then loop over the files to build the sets:

{(file1) : ((1.1.1.1), (2.2.2.2)), etc }

Compare the values of the sets:

if val(file1) == val(file3):
    Set1 = {(1.1.1.1), (2.2.2.2) : (file1, file3), etc }

This may not be the fastest or most elegant solution, but it should work.
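
The three steps above could be sketched roughly as follows. This is one possible reading of the pseudocode, not the answerer's own code; the sample data and the names item_to_files, file_to_items, and groups are made up for illustration.

```python
# Hypothetical sample data mirroring the question.
cfg_data = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1"],
    "config4.txt": ["2.2.2.2"],
}

# Step 1: map each item to the set of files that contain it.
item_to_files = {}
for name, items in cfg_data.items():
    for item in items:
        item_to_files.setdefault(item, set()).add(name)

# Step 2: invert that mapping to get each file's set of items.
file_to_items = {}
for item, names in item_to_files.items():
    for name in names:
        file_to_items.setdefault(name, set()).add(item)

# Step 3: files whose item sets compare equal share one group.
groups = {}
for name in sorted(file_to_items):
    groups.setdefault(frozenset(file_to_items[name]), []).append(name)
```

One caveat of this reading: a file with no items never appears in item_to_files, so it drops out in step 2 and would need separate handling.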

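【Discussion】:

【Solution 5】:

You need a dictionary that maps file contents to file names. So you have to read each file, sort the entries, build a tuple from them, and use that tuple as the key.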

If a file can contain duplicate entries, read the contents into a set first.
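
A minimal sketch of that idea, assuming the entries have already been extracted from each file. The helper name content_key and the sample data are illustrative, not from the original answer.

```python
def content_key(entries):
    """Build an order-independent, duplicate-free, hashable key."""
    return tuple(sorted(set(entries)))


# Hypothetical parsed contents; config3.txt repeats an entry on purpose.
sample = {
    "config1.txt": ["1.1.1.1", "2.2.2.2"],
    "config2.txt": ["1.1.1.1"],
    "config3.txt": ["2.2.2.2", "1.1.1.1", "2.2.2.2"],
    "config4.txt": ["2.2.2.2"],
}

groups = {}
for name, entries in sample.items():
    groups.setdefault(content_key(entries), []).append(name)
```

The sorted tuple plays the same role as a frozenset key. It costs a sort per file but keeps the key in a readable, deterministic order.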

【Discussion】: