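Merge and sum of two dictionaries: answers

【Question title】: Merge and sum of two dictionaries
【Posted】: 2018-10-23 14:16:36
【Question】:

I have the dictionary below, and I want to add it to another dictionary whose elements are not necessarily distinct, merging the results. Is there a built-in function for this, or do I need to write my own?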

{
  '6d6e7bf221ae24e07ab90bba4452267b05db7824cd3fd1ea94b2c9a8': 6,
  '7c4a462a6ed4a3070b6d78d97c90ac230330603d24a58cafa79caf42': 7,
  '9c37bdc9f4750dd7ee2b558d6c06400c921f4d74aabd02ed5b4ddb38': 9,
  'd3abb28d5776aef6b728920b5d7ff86fa3a71521a06538d2ad59375a': 15,
  '2ca9e1f9cbcd76a5ce1772f9b59995fd32cbcffa8a3b01b5c9c8afc2': 11
}

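The number of elements in the dictionary is also unknown.

Where the merge encounters two identical keys, their values should be summed rather than overwritten.

【Comments】:

Tags: python dictionary

【Solution 1】:

You didn't say exactly how you want to merge, so take your pick: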

x = {'both1': 1, 'both2': 2, 'only_x': 100}
y = {'both1': 10, 'both2': 20, 'only_y': 200}

print({k: x.get(k, 0) + y.get(k, 0) for k in set(x)})
print({k: x.get(k, 0) + y.get(k, 0) for k in set(x) & set(y)})
print({k: x.get(k, 0) + y.get(k, 0) for k in set(x) | set(y)})

Result (key order may vary, since sets are unordered):

{'both2': 22, 'only_x': 100, 'both1': 11}
{'both2': 22, 'both1': 11}
{'only_y': 200, 'both2': 22, 'both1': 11, 'only_x': 100}

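【Discussion】:

If we have n dictionaries, how do we do this?
I like this approach. But in my case, for the same dictionaries as above, I am trying to take the difference instead, i.e. x - y. diff = {k: x.get(k, 0) - y.get(k, 0) for k in set(x) | set(y)} print(diff) This gives me: {'only_y': -200, 'both2': -18, 'only_x': 100, 'both1': -9} I am concerned about the only_y value above, because it becomes -200 instead of staying 200. Even though you have already answered the actual question, could you suggest a better way to avoid negating the values of keys that exist in only one dictionary?
@Panchu: how about sub = lambda a, b: a if b is None else b if a is None else a - b and then {k: sub(x.get(k), y.get(k)) for ... etc.
@georg I was using two for-in loops to do the same thing, so your third option fits my needs perfectly, since it sums all matching keys and still keeps the non-matching ones. What are these kinds of expressions called in Python, where we do away with the loop? Just a kind of map? Thanks.
@tymac: these are dict comprehensions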
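The subtraction suggested in the comments above can be written out in full. This is a minimal sketch, assuming the x and y from the answer; sub is spelled as a named function instead of the lambda, and diff is only an illustrative name.

```python
x = {'both1': 1, 'both2': 2, 'only_x': 100}
y = {'both1': 10, 'both2': 20, 'only_y': 200}


def sub(a, b):
    """Return a - b, but pass a value through unchanged if the other side is missing."""
    if b is None:
        return a
    if a is None:
        return b
    return a - b


# dict.get() returns None for a missing key, which sub() treats as "absent".
diff = {k: sub(x.get(k), y.get(k)) for k in set(x) | set(y)}
print(diff)
```

Here only_y keeps its value of 200 because sub never negates a value whose counterpart is missing. One caveat: a key that is present with the value None would be treated as absent.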

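【Solution 2】:

You can use collections.Counter() to perform +, -, & and | (the last two being intersection and union).

We can do the following (note: only positive count values are kept in the resulting dictionary):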

from collections import Counter

x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }

z = dict(Counter(x)+Counter(y))

print(z)
[out]:
{'both2': 22, 'only_x': 100, 'both1': 11, 'only_y': 200}

To handle cases where the result may be zero or negative, use Counter.update() for addition and Counter.subtract() for subtraction:

x = {'both1':0, 'both2':2, 'only_x': 100 }
y = {'both1':0, 'both2': -20, 'only_y':200 }
xx = Counter(x)
yy = Counter(y)
xx.update(yy)
dict(xx)
[out]:
{'both2': -18, 'only_x': 100, 'both1': 0, 'only_y': 200}
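The answer names Counter.subtract() but only demonstrates update(). The sketch below, which reuses the same x and y, shows subtract() next to the - operator so the difference in how zero and negative results are treated is visible; kept and dropped are illustrative names.

```python
from collections import Counter

x = {'both1': 0, 'both2': 2, 'only_x': 100}
y = {'both1': 0, 'both2': -20, 'only_y': 200}

# subtract() works in place and keeps zero and negative results.
diff = Counter(x)
diff.subtract(y)
kept = dict(diff)

# The - operator builds a new Counter and discards every non-positive result.
dropped = dict(Counter(x) - Counter(y))

print(kept)
print(dropped)
```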

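【Discussion】:

【Solution 3】:

A supplementary note building on the answers from georg, NPE, Scott and Havok.

I was trying to do this on collections of 2 or more dictionaries and was interested in seeing how long each approach took. Because I wanted to do this on any number of dictionaries, I had to change some of the answers slightly. If anyone has a better suggestion for them, feel free to edit.

Here is my test method. I updated it recently to include tests with much larger dictionaries, and again to include the newer methods from Havok and Scott:

First, I used the following data: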

import random

x = {'xy1': 1, 'xy2': 2, 'xyz': 3, 'only_x': 100}
y = {'xy1': 10, 'xy2': 20, 'xyz': 30, 'only_y': 200}
z = {'xyz': 300, 'only_z': 300}

small_tests = [x, y, z]

# 200,000 random 8 letter keys
keys = [''.join(random.choice("abcdefghijklmnopqrstuvwxyz") for _ in range(8)) for _ in range(200000)]

a, b, c = {}, {}, {}

# 50/50 chance of a value being assigned to each dictionary, some keys will be missed but meh
for key in keys:
    if random.getrandbits(1):
        a[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        b[key] = random.randint(0, 1000)
    if random.getrandbits(1):
        c[key] = random.randint(0, 1000)

large_tests = [a, b, c]

print("a:", len(a), "b:", len(b), "c:", len(c))
#: a: 100069 b: 100385 c: 99989

Now each method:

from collections import defaultdict, Counter
from functools import reduce

def georg_method(tests):
    return {k: sum(t.get(k, 0) for t in tests) for k in set.union(*[set(t) for t in tests])}

def georg_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return {k: tests[0].get(k, 0) + tests[1].get(k, 0) + tests[2].get(k, 0) for k in set.union(*[set(t) for t in tests])}

def npe_method(tests):
    ret = defaultdict(int)
    for d in tests:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)

# Note: There is a bug with scott's method. See below for details.
# Scott included a similar version using counters that is fixed
# See the scott_update_method below
def scott_method(tests):
    return dict(sum((Counter(t) for t in tests), Counter()))

def scott_method_nosum(tests):
    # If you know you will have exactly 3 dicts
    return dict(Counter(tests[0]) + Counter(tests[1]) + Counter(tests[2]))

def scott_update_method(tests):
    ret = Counter()
    for test in tests:
        ret.update(test)
    return dict(ret)

def scott_update_method_static(tests):
    # If you know you will have exactly 3 dicts
    xx = Counter(tests[0])
    yy = Counter(tests[1])
    zz = Counter(tests[2])
    xx.update(yy)
    xx.update(zz)
    return dict(xx)

def havok_method(tests):
    def reducer(accumulator, element):
        for key, value in element.items():
            accumulator[key] = accumulator.get(key, 0) + value
        return accumulator
    return reduce(reducer, tests, {})

methods = {
    "georg_method": georg_method, "georg_method_nosum": georg_method_nosum,
    "npe_method": npe_method,
    "scott_method": scott_method, "scott_method_nosum": scott_method_nosum,
    "scott_update_method": scott_update_method, "scott_update_method_static": scott_update_method_static,
    "havok_method": havok_method
}

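I also wrote a quick function to find any differences between the results. Unfortunately, that is when I found the problem in Scott's method: any key whose values total 0 is left out of the result entirely, because of how Counter() behaves when adding.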
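The author's comparison function is not shown, so the following is only a sketch of what such a check might look like. find_differences, counter_sum, defaultdict_sum and the tiny tests input are all hypothetical names and data, chosen so that one key totals exactly 0.

```python
from collections import Counter, defaultdict

_MISSING = object()  # sentinel for "key absent from this result"


def counter_sum(dicts):
    # Counter addition drops keys whose total is zero or negative.
    return dict(sum((Counter(d) for d in dicts), Counter()))


def defaultdict_sum(dicts):
    ret = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)


def find_differences(results):
    """Return the sorted keys on which the named result dicts disagree."""
    all_keys = set().union(*results.values())
    differing = []
    for key in all_keys:
        values = [r.get(key, _MISSING) for r in results.values()]
        if any(v != values[0] for v in values[1:]):
            differing.append(key)
    return sorted(differing)


tests = [{'a': 1, 'b': 5}, {'a': -1, 'b': 2}]  # 'a' totals 0
results = {
    "counter_sum": counter_sum(tests),
    "defaultdict_sum": defaultdict_sum(tests),
}
print(results)
print(find_differences(results))
```

The key 'a' is reported as a difference: the defaultdict version keeps it with value 0, while the Counter addition omits it.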

Test setup:

MacBook Pro (15-inch, Late 2016), 2.9 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3 RAM, running macOS Mojave version 10.14.5
Python 3.6.5 via IPython 6.1.0

Finally, the results:

Results: small tests

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10000 method(small_tests)
#: Method: georg_method
#: 7.81 µs ± 321 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: georg_method_nosum
#: 4.6 µs ± 48.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: npe_method
#: 3.2 µs ± 24.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method
#: 24.9 µs ± 326 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_method_nosum
#: 18.9 µs ± 64.8 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method
#: 9.1 µs ± 90.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: scott_update_method_static
#: 14.4 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#: Method: havok_method
#: 3.09 µs ± 47.9 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Results: large tests

Naturally, these could not be run for anywhere near as many loops

for name, method in methods.items():
    print("Method:", name)
    %timeit -n10 method(large_tests)
#: Method: georg_method
#: 347 ms ± 20 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: georg_method_nosum
#: 280 ms ± 4.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: npe_method
#: 119 ms ± 11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method
#: 324 ms ± 16.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_method_nosum
#: 289 ms ± 14.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method
#: 123 ms ± 1.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: scott_update_method_static
#: 136 ms ± 3.19 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#: Method: havok_method
#: 103 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Conclusion

╔═══════════════════════════╦═══════╦═════════════════════════════╗
║                           ║       ║    Best of Time Per Loop    ║
║         Algorithm         ║  By   ╠══════════════╦══════════════╣
║                           ║       ║  small_tests ║  large_tests ║
╠═══════════════════════════╬═══════╬══════════════╬══════════════╣
║ functools reduce          ║ Havok ║       3.1 µs ║   103,000 µs ║
║ defaultdict sum           ║ NPE   ║       3.2 µs ║   119,000 µs ║
║ Counter().update loop     ║ Scott ║       9.1 µs ║   123,000 µs ║
║ Counter().update static   ║ Scott ║      14.4 µs ║   136,000 µs ║
║ set unions without sum()  ║ georg ║       4.6 µs ║   280,000 µs ║
║ set unions with sum()     ║ georg ║       7.8 µs ║   347,000 µs ║
║ Counter() without sum()   ║ Scott ║      18.9 µs ║   289,000 µs ║
║ Counter() with sum()      ║ Scott ║      24.9 µs ║   324,000 µs ║
╚═══════════════════════════╩═══════╩══════════════╩══════════════╝

Important: YMMV.
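Since the timings above rely on the IPython %timeit magic, here is a rough sketch of how one method could be re-timed with only the standard library. It assumes the small test data and npe_method from this answer; loops, runs and per_loop_us are illustrative names, and the numbers it prints will differ from machine to machine.

```python
import timeit
from collections import defaultdict


def npe_method(tests):
    ret = defaultdict(int)
    for d in tests:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)


x = {'xy1': 1, 'xy2': 2, 'xyz': 3, 'only_x': 100}
y = {'xy1': 10, 'xy2': 20, 'xyz': 30, 'only_y': 200}
z = {'xyz': 300, 'only_z': 300}
small_tests = [x, y, z]

loops, runs = 10000, 7  # mirrors "%timeit -n10000" with its default 7 runs

# timeit.repeat returns one total time (in seconds) per run.
timings = timeit.repeat(lambda: npe_method(small_tests), number=loops, repeat=runs)
per_loop_us = [t / loops * 1e6 for t in timings]

print(f"best: {min(per_loop_us):.2f} µs per loop ({runs} runs, {loops} loops each)")
```

Note that %timeit reports a mean and standard deviation, whereas this sketch prints the fastest run, so the two figures are not directly comparable.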

【Discussion】:

【Solution 4】:

You can use defaultdict for this:

from collections import defaultdict

def dsum(*dicts):
    ret = defaultdict(int)
    for d in dicts:
        for k, v in d.items():
            ret[k] += v
    return dict(ret)

x = {'both1':1, 'both2':2, 'only_x': 100 }
y = {'both1':10, 'both2': 20, 'only_y':200 }

print(dsum(x, y))

This produces

{'both1': 11, 'both2': 22, 'only_x': 100, 'only_y': 200}

【Discussion】:

【Solution 5】:

Another option, using the reduce function. This allows summing over an arbitrary collection of dictionaries:

from functools import reduce

collection = [
    {'a': 1, 'b': 1},
    {'a': 2, 'b': 2},
    {'a': 3, 'b': 3},
    {'a': 4, 'b': 4, 'c': 1},
    {'a': 5, 'b': 5, 'c': 1},
    {'a': 6, 'b': 6, 'c': 1},
    {'a': 7, 'b': 7},
    {'a': 8, 'b': 8},
    {'a': 9, 'b': 9},
]


def reducer(accumulator, element):
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator


total = reduce(reducer, collection, {})


assert total['a'] == sum(d.get('a', 0) for d in collection)
assert total['b'] == sum(d.get('b', 0) for d in collection)
assert total['c'] == sum(d.get('c', 0) for d in collection)

print(total)

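Output:

{'a': 45, 'b': 45, 'c': 3}

Advantages:

Simple, clear, Pythonic.
Schema-less, as long as all values are summable (that is, they support + with 0 and with each other).
O(n) time in the total number of key-value pairs, with no extra memory beyond the accumulator dict itself, which grows with the number of distinct keys.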
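To illustrate the schema-less point, here is a small sketch that runs the same reducer over values of mixed numeric types. The collection below is made up for the illustration; any value type that can be added to 0 should behave the same way.

```python
from fractions import Fraction
from functools import reduce


def reducer(accumulator, element):
    for key, value in element.items():
        accumulator[key] = accumulator.get(key, 0) + value
    return accumulator


# Ints, floats and Fractions can all be added to the initial 0.
collection = [
    {'a': 1, 'b': Fraction(1, 2)},
    {'a': 2.5, 'b': Fraction(1, 3)},
    {'c': Fraction(3, 4)},
]

total = reduce(reducer, collection, {})
print(total)
```

Values such as lists or strings would fail here with a TypeError, because 0 + value is not defined for them.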

【Discussion】:

【Solution 6】:

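from functools import reduce  # reduce is not a builtin in Python 3

d1 = {'apples': 2, 'banana': 1}
d2 = {'apples': 3, 'banana': 2}
# Fold each (key, value) pair of d2 into a copy of d1, summing on shared keys.
# dict.update() returns None, so "or d" hands the accumulator back to reduce.
merged = reduce(
    lambda d, i: (
        d.update(((i[0], d.get(i[0], 0) + i[1]),)) or d
    ),
    d2.items(),  # was d2.iteritems(), which exists only in Python 2
    d1.copy(),
)

A replacement for dict.update() that builds a new dict is also simple (note that it overwrites values on shared keys rather than summing them):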

merged = dict(d1, **d2)

【Discussion】:

I like this tip: merged = dict(d1, **d2)

【Solution 7】:

class dict_merge(dict):
    def __add__(self, other):
        result = dict_merge({})
        # Keys in self: sum with other's value where the key is shared.
        for key in self.keys():
            if key in other.keys():
                result[key] = self[key] + other[key]
            else:
                result[key] = self[key]
        # Keys only in other: copy them over unchanged.
        for key in other.keys():
            if key in self.keys():
                pass
            else:
                result[key] = other[key]
        return result


a = dict_merge({"a":2, "b":3, "d":4})
b = dict_merge({"a":1, "b":2})
c = dict_merge({"a":5, "b":6, "c":5})
d = dict_merge({"a":8, "b":6, "e":5})

print((a + b + c +d))


>>> {'a': 16, 'b': 17, 'd': 4, 'c': 5, 'e': 5}

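This is operator overloading. With __add__, we define how the + operator works for our dict_merge, which inherits from Python's built-in dict. You can go on to define other operators in the same class in a similar way to make it more flexible, for example * with __mul__ for multiplication, / with __truediv__ (__div__ in Python 2) for division, or even % with __mod__ for modulo, replacing the + in self[key] + other[key] with the corresponding operator, should you ever need such a merge. I have only tested this as shown, without the other operators, but I don't foresee problems with them. Just learn by trying.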
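One possible way to add further operators without repeating the loop is to route them through a shared helper. This is only a sketch: DictMerge and _combine are hypothetical names, and using the operator module is my substitution for the hand-written loops in the answer.

```python
import operator


class DictMerge(dict):
    def _combine(self, other, op):
        """Apply op to values of shared keys; copy unshared keys unchanged."""
        result = DictMerge(self)  # start from a shallow copy of self
        for key, value in other.items():
            result[key] = op(self[key], value) if key in self else value
        return result

    def __add__(self, other):
        return self._combine(other, operator.add)

    def __mul__(self, other):
        return self._combine(other, operator.mul)


left = DictMerge({"a": 2, "b": 3})
right = DictMerge({"a": 5, "c": 4})

added = left + right
multiplied = left * right
print(added)
print(multiplied)
```

Keys that exist on only one side are carried over as they are, which matches the behaviour of the + implementation in the answer.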

【Discussion】:

【Solution 8】:

A fairly simple approach:

from collections import Counter
from functools import reduce

data = [
  {'x': 10, 'y': 1, 'z': 100},
  {'x': 20, 'y': 2, 'z': 200},
  {'a': 10, 'z': 300}
]

result = dict(reduce(lambda x, y: Counter(x) + Counter(y), data))

【Discussion】:

【Solution 9】:

If you want to create a new dict, the way | would, use:

>>> dict({'a': 1,'c': 2}, **{'c': 1})
{'a': 1, 'c': 1}

【Discussion】:

He wants c to equal 3.
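The objection in the comment can be made concrete with a short sketch that puts the overwriting merge next to a summing one. It reuses the two dictionaries from this answer; overwritten and summed are illustrative names.

```python
d1 = {'a': 1, 'c': 2}
d2 = {'c': 1}

# dict(d1, **d2) lets d2 win on shared keys, so 'c' becomes 1.
overwritten = dict(d1, **d2)

# Summing on shared keys gives the result the question asks for.
summed = {k: d1.get(k, 0) + d2.get(k, 0) for k in d1.keys() | d2.keys()}

print(overwritten)
print(summed)
```

Also note that the dict(d1, **d2) form only works when every key of d2 is a string, since the keys are passed as keyword arguments.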

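【Solution 10】:

Scott's approach using collections.Counter is nice, but it has the drawback of not being usable with sum (at least without an explicit start value); also, having to take care of negative or zero values feels a bit counter-intuitive to me when you simply want to add values component-wise.

So I think it might be a good idea to write a custom class for this. That was also John Mutuma's idea. However, I want to add my own solution:

I created a class that behaves very much like dict, basically passing all member calls to the underlying _data in the __getattr__ method. The only two differences are:

It has a DEFAULT_VALUE (similar to collections.defaultdict) that is used as the value for keys that do not exist.
It implements an __add__() method which (together with the __radd__() method) takes care of adding the dictionaries component-wise.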

from typing import Union, Any


class AddableDict:
    DEFAULT_VALUE = 0

    def __init__(self, data: dict) -> None:
        self._data = data

    def __getattr__(self, attr: str) -> Any:
        return getattr(self._data, attr)

    def __getitem__(self, item) -> Any:
        try:
            return self._data[item]
        except KeyError:
            return self.DEFAULT_VALUE

    def __repr__(self):
        return self._data.__repr__()

    def __add__(self, other) -> "AddableDict":
        return AddableDict({
            key: self[key] + other[key]
            for key in set(self.keys()) | set(other.keys())
        })

    def __radd__(
        self, other: Union[int, "AddableDict"]
    ) -> "AddableDict":
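        if other == 0:
            return self
        # Any other left operand is unsupported; let Python raise TypeError
        # rather than silently returning None.
        return NotImplemented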

This way we can add two of these objects, and also sum iterables of them:

>>> alpha = AddableDict({"a": 1})
>>> beta = AddableDict({"a": 10, "b": 5})
>>> alpha + beta
{'a': 11, 'b': 5}

>>> sum([beta]*10)
{'a': 100, 'b': 50}

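In my opinion, the advantage of this solution is that it gives the developer a simple and understandable interface to use. Of course, you could also inherit from dict instead of using composition.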
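The inheritance variant mentioned in the last sentence might look roughly like the sketch below. AddableDictSub is a hypothetical name, and relying on dict's __missing__ hook for the default value is my assumption about how one could do it, not the author's code.

```python
class AddableDictSub(dict):
    DEFAULT_VALUE = 0

    def __missing__(self, key):
        # Called by dict.__getitem__ when the key is absent.
        return self.DEFAULT_VALUE

    def __add__(self, other):
        return AddableDictSub({
            key: self[key] + other.get(key, self.DEFAULT_VALUE)
            for key in set(self) | set(other)
        })

    def __radd__(self, other):
        # sum() starts from the int 0, so 0 + obj ends up here.
        if other == 0:
            return self
        return NotImplemented


alpha = AddableDictSub({"a": 1})
beta = AddableDictSub({"a": 10, "b": 5})

combined = alpha + beta
summed = sum([beta] * 10)
print(combined)
print(summed)
```

Because the class is a real dict, methods such as keys() and items() come for free and no __getattr__ forwarding is needed.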

【Discussion】: