Filtering/Grouping a list of dictionaries by a list of keys

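【Question title】: Filtering/Grouping a list of dictionaries by a list of keys
【Posted】: 2017-04-09 05:45:42
【Question】:

I need to define a function group_dictionary that takes a list of dictionaries and returns the dictionaries that share the same value for every key in a list of keys. "Lonely" dictionaries are removed.

Here is an example: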

my_list=[
    {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z},
    {'id':'id3', 'key2': value_u, 'key3': value_v},
    {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t},
    {'id':'id4', 'key1':value_w, 'key2':value_s, 'key3':value_v}
]

group_dictionary(my_list, list_of_keys=['key1', 'key3'])

#result: the only dictionaries that have key1 AND key3 in common are:
[
    {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z, 'group':0},
    {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t, 'group':0}
]

group_dictionary(my_list, list_of_keys=['key3'])

#result the dictionaries that have key3 in common are divided in two groups
#of different values: group 0 has value_z and group1 has value_v

[
    {'id':'id1', 'key1':value_x, 'key2': value_y, 'key3':value_z, 'group':0},
    {'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t, 'group':0},
    {'id':'id3', 'key2': value_u, 'key3': value_v, 'group':1},
    {'id':'id4', 'key1':value_w, 'key2':value_s, 'key3':value_v, 'group':1}
]

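As you can see:

The function creates a key labelled 'group', an integer starting at 0. This key is assigned to every dictionary in a "group" (by group I mean the dictionaries whose values match exactly for every key in the list of keys).
The function removes dictionaries that have no group.
The existing dataset I'm working with contains a unique id for each dictionary. This may be useful for writing the function.
A missing key prevents a dictionary from being a candidate.

I'm worried about runtime; the real list contains on average 80,000 dictionaries, each with 35 keys. The complexity of the algorithm could be n² (80,000²). Any optimization of the code is welcome.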

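【Comments】:

It is: my_list[3]={'id':'id2', 'key1':value_x, 'key3':value_z, 'key4': value_t}. In the first output it is the second element. 'group':0 has been added.
Your algorithm definitely doesn't need to be O(n^2); just maintain a data structure that lets you uniquely identify the group ID from a set of key/value pairs.
@DSM OK, sorry. They both have key1 and key3, but with different values.

Tags: python list dictionary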

【Solution 1】:

I believe this will work. It's written in Python 3 and I haven't optimized it, but it may be a good starting point if it isn't fast enough.

list_of_dicts = [
{'id':'id1', 'key1':'value_x', 'key2': 'value_y', 'key3':'value_z'},
{'id':'id3', 'key2' :'value_u', 'key3': 'value_v'},
{'id':'id2', 'key1':'value_x', 'key3':'value_z', 'key4': 'value_t'},
{'id':'id4', 'key1':'value_w', 'key2':'value_s', 'key3':'value_v'}
]

# Since we can't have objects as keys, make the values we're looking for into a string, and use that as the key.
def make_value_key(d, list_of_keys):
    res = ""
    for k in list_of_keys:
        res += str(d[k]) 
    return res

def group_dictionary(list_of_dicts, list_of_keys):
    group_vals = {}
    current_max_group = 0
    dicts_to_remove = []
    for i,d in enumerate(list_of_dicts):
        # If dict doesn't have all keys mark for removal.
        if not all(k in d for k in list_of_keys):
            dicts_to_remove.append(i)
        else:
            value_key = make_value_key(d, list_of_keys)
            # If value key exists assign group otherwise make new group.
            if value_key in group_vals:
                d['group'] = group_vals[value_key]
            else:
                group_vals[value_key] = current_max_group
                d['group'] = current_max_group
                current_max_group += 1

    list_of_dicts = [i for j, i in enumerate(list_of_dicts) if j not in dicts_to_remove]
    return list_of_dicts

list_of_keys=['key1','key3']

print(group_dictionary(list_of_dicts, list_of_keys))
print()
list_of_keys=['key3']

print(group_dictionary(list_of_dicts, list_of_keys))

Output:

[{'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'}, 
{'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'}, 
{'key3': 'value_v', 'key1': 'value_w', 'group': 1, 'key2': 'value_s', 'id': 'id4'}]

[{'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'}, 
{'group': 1, 'key3': 'value_v', 'key2': 'value_u', 'id': 'id3'}, 
{'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'}, 
{'key3': 'value_v', 'key1': 'value_w', 'group': 1, 'key2': 'value_s', 'id': 'id4'}]
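
One thing to watch in make_value_key: the values are concatenated without a separator, so two different value combinations can produce the same string. Tuples are hashable and can be used as dictionary keys directly, which avoids the collision. A minimal sketch, assuming every value is hashable; make_tuple_key is a hypothetical name, not part of the answer above:

```python
def make_tuple_key(d, list_of_keys):
    """Return the values for list_of_keys as a tuple, or None if a key is missing."""
    try:
        return tuple(d[k] for k in list_of_keys)
    except KeyError:
        # A missing key means the dictionary cannot belong to any group.
        return None


# With plain string concatenation both of these would yield "abc".
left = {'key1': 'ab', 'key3': 'c'}
right = {'key1': 'a', 'key3': 'bc'}
keys = ['key1', 'key3']

print(make_tuple_key(left, keys))           # ('ab', 'c')
print(make_tuple_key(right, keys))          # ('a', 'bc')
print(make_tuple_key({'key3': 'c'}, keys))  # None
```

Returning None for a missing key also keeps a legitimate empty value distinct from "no key at all".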

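Optimization 1:

Instead of iterating over all the keys to check that they exist, we can fail while building the value key and return an empty string, which marks the dictionary for removal: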

def make_value_key(d, list_of_keys):
    res = ""
    for k in list_of_keys:
        if k not in d:
            return ""
        res += str(d[k]) 
    return res

def group_dictionary(list_of_dicts, list_of_keys):
    group_vals = {}
    current_max_group = 0
    dicts_to_remove = []
    for i,d in enumerate(list_of_dicts):
        value_key = make_value_key(d, list_of_keys)
        if value_key == "":
            dicts_to_remove.append(i)
            continue
        if value_key in group_vals:
            d['group'] = group_vals[value_key]

        else:
            group_vals[value_key] = current_max_group
            d['group'] = current_max_group
            current_max_group += 1

    list_of_dicts = [i for j, i in enumerate(list_of_dicts) if j not in dicts_to_remove]
    return list_of_dicts

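Groups must have more than one member:

This uses a second dictionary to keep track of groups that have only one member so far: it stores the index of each group's first member and drops the entry once a second member is seen. Whatever remains at the end belongs to a group of size 1 and is marked for removal.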

def make_value_key(d, list_of_keys):
    res = ""
    for k in list_of_keys:
        if k not in d:
            return ""
        res += str(d[k]) 
    return res

def group_dictionary(list_of_dicts, list_of_keys):
    group_vals = {}
    group_count = {}
    current_max_group = 0
    indices_to_remove = []
    for i,d in enumerate(list_of_dicts):
        value_key = make_value_key(d, list_of_keys)
        if value_key == "":
            indices_to_remove.append(i)
            continue
        if value_key in group_vals:
            d['group'] = group_vals[value_key]
            # Second group member seen, remove from count dict. 
            group_count.pop(d['group'], None)
        else:
            group_vals[value_key] = current_max_group
            d['group'] = current_max_group
            # First time seen, add to count dict.
            group_count[current_max_group] = i
            current_max_group += 1

    indices_to_remove.extend(group_count.values())
    return [i for j, i in enumerate(list_of_dicts) if j not in indices_to_remove]

Output:

[{'key2': 'value_y', 'group': 0, 'id': 'id1', 'key1': 'value_x', 'key3': 'value_z'}, 
{'key4': 'value_t', 'group': 0, 'id': 'id2', 'key1': 'value_x', 'key3': 'value_z'}]

[{'key2': 'value_y', 'group': 0, 'id': 'id1', 'key1': 'value_x', 'key3': 'value_z'}, {'group': 1, 'id': 'id3', 'key2': 'value_u', 'key3': 'value_v'}, {'key4': 'value_t', 'group': 0, 'id': 'id2', 'key1': 'value_x', 'key3': 'value_z'}, {'key2': 'value_s', 'group': 1, 'id': 'id4', 'key1': 'value_w', 'key3': 'value_v'}]
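
In the version above a group number is handed out before it is known whether the group will survive, so removing a single-member group can leave a gap in the numbering, and the input dictionaries are modified in place. As an alternative, here is a two-pass sketch that buckets first and numbers afterwards. It assumes hashable values and Python 3.7+ (for ordered dictionaries); group_dictionary_two_pass is a hypothetical name. Note that the result is ordered by group rather than by original position.

```python
from collections import defaultdict


def group_dictionary_two_pass(list_of_dicts, list_of_keys):
    # Pass 1: bucket dictionaries by the tuple of their values for list_of_keys.
    buckets = defaultdict(list)
    for d in list_of_dicts:
        if all(k in d for k in list_of_keys):
            buckets[tuple(d[k] for k in list_of_keys)].append(d)

    # Pass 2: keep buckets with at least two members and number them 0, 1, 2, ...
    result = []
    group_id = 0
    for members in buckets.values():
        if len(members) < 2:
            continue  # a "lonely" dictionary, dropped
        for d in members:
            result.append({**d, 'group': group_id})  # copy, input stays untouched
        group_id += 1
    return result


sample = [
    {'id': 'id1', 'key1': 'value_x', 'key2': 'value_y', 'key3': 'value_z'},
    {'id': 'id3', 'key2': 'value_u', 'key3': 'value_v'},
    {'id': 'id2', 'key1': 'value_x', 'key3': 'value_z', 'key4': 'value_t'},
    {'id': 'id4', 'key1': 'value_w', 'key2': 'value_s', 'key3': 'value_v'},
]

for d in group_dictionary_two_pass(sample, ['key1', 'key3']):
    print(d['id'], d['group'])
```

Both passes touch each dictionary once, so the work grows linearly with the number of dictionaries times the number of keys in list_of_keys.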

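Optimization 2:

You can go from O(n^2) (looping over the list of dictionaries once to compute and once to remove) to O(n*m log m) (looping over the list of dictionaries once, then over the sorted removal indices):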

def make_value_key(d, list_of_keys):
    res = ""
    for k in list_of_keys:
        if k not in d:
            return ""
        res += str(d[k]) 
    return res

def group_dictionary(list_of_dicts, list_of_keys):
    group_vals = {}
    group_count = {}
    current_max_group = 0
    indices_to_remove = []
    for i,d in enumerate(list_of_dicts):
        value_key = make_value_key(d, list_of_keys)
        if value_key == "":
            indices_to_remove.append(i)
            continue
        if value_key in group_vals:
            d['group'] = group_vals[value_key]
            # Second group member seen, remove from count dict. 
            group_count.pop(d['group'], None)
        else:
            group_vals[value_key] = current_max_group
            d['group'] = current_max_group
            # First time seen, add to count dict.
            group_count[current_max_group] = i
            current_max_group += 1

    indices_to_remove.extend(group_count.values())
    for index in sorted(indices_to_remove, reverse=True):
        del list_of_dicts[index]

    return list_of_dicts
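
A note on the removal step: deleting from the middle of a Python list shifts every later element, so each del is itself linear in the list length, and testing membership in a list of indices is a linear scan as well. Converting the indices to a set and filtering once keeps the removal to a single pass. A small sketch; drop_indices is a hypothetical helper, not part of the answer above:

```python
def drop_indices(items, indices_to_remove):
    """Return a new list without the items at the given positions."""
    doomed = set(indices_to_remove)  # average O(1) membership test
    return [item for index, item in enumerate(items) if index not in doomed]


print(drop_indices(['a', 'b', 'c', 'd'], [3, 1]))  # ['a', 'c']
```

Unlike the del loop, this returns a new list and leaves the original list object unchanged, so the caller has to use the return value.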

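【Comments】:

Thanks, but this is not the expected behavior. The 'id4' dictionary does have the keys key1 and key3, but it is the only one with those values, so it should disappear. The first output should be just: [ {'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'}, {'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'} ] The second output is correct, but only because I didn't include a dictionary with a "lonely" key3; it would have been put in group 2, which I don't want.
Ah, so a group needs to have more than one member?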
Updated my answer; it keeps the first optimization but loses the second.
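Yes, a group needs to have more than one member. There can't be a unique dictionary with a 'group' value of 1.
Yes! This is exactly the expected result. Thank you very much. You forgot to include make_value_key(d, list_of_keys) in the last piece of code.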

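【Solution 2】:

This is fairly simple. First, you need some way to easily serialize the relevant data in a dict. I'll use this (very simple) approach, but depending on the complexity of your data you may need to come up with something more robust: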

def serialize(d, keys):
    # str() lets non-string values (ints, floats, ...) be joined as well
    return ','.join(str(d[key]) for key in keys)

Then you simply store all of these serialized values in a list. The index of a value in the list is the ID of its group.

def group_dictionary(dicts, keys):
    groups = []
    result = []

    for d in dicts:
        # skip over dictionaries that don't have all keys
        if any(key not in d for key in keys):
            continue

        # get the serialized data
        serialized_data = serialize(d, keys)

        # if we've encountered a new set of data, create a new group!
        if serialized_data not in groups:
            groups.append(serialized_data)

        # augment the dictionary with the group id
        d['group'] = groups.index(serialized_data)

        # and add it to the list of returned dictionaries
        result.append(d)

    return result
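
With a list, both `serialized_data not in groups` and `groups.index(...)` scan the list, so the cost per dictionary grows with the number of groups. A dict that maps each combination of values to its group ID gives the same numbering with an average constant-time lookup. A sketch under the assumption that the values are hashable; group_dictionary_with_lookup is a hypothetical name. Like the function above, it modifies the dictionaries in place and does not remove single-member groups.

```python
def group_dictionary_with_lookup(dicts, keys):
    group_ids = {}  # tuple of values -> group id
    result = []

    for d in dicts:
        # skip over dictionaries that don't have all keys
        if any(key not in d for key in keys):
            continue

        signature = tuple(d[key] for key in keys)
        # setdefault stores the next free id the first time a signature is seen
        # and returns the stored id on every later occurrence
        d['group'] = group_ids.setdefault(signature, len(group_ids))
        result.append(d)

    return result


rows = [{'id': 'a', 'k': 1}, {'id': 'b', 'k': 2}, {'id': 'c', 'k': 1}, {'id': 'd'}]
print([(d['id'], d['group']) for d in group_dictionary_with_lookup(rows, ['k'])])
# [('a', 0), ('b', 1), ('c', 0)]
```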

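【Comments】:

Same remark as for @Darkstarone. The 'id4' dictionary does have the keys key1 and key3, but it is the only one with those values, so it should disappear. The first output should be just: [ {'key3': 'value_z', 'key1': 'value_x', 'group': 0, 'key2': 'value_y', 'id': 'id1'}, {'key3': 'value_z', 'key1': 'value_x', 'key4': 'value_t', 'group': 0, 'id': 'id2'} ] The second output is correct, but only because I didn't include a dictionary with a "lonely" key3; it would have been put in group 2, which I don't want.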