【Question title】: Merge std::unordered_map iteratively
【Posted】: 2018-10-22 15:17:24
【Question】:

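I have a list of nodes, and each node decomposes into further nodes. For example:

Node0 = w01 * Node1 + w02 * Node2 + w03 * Node3
Node1 = w12 * Node2 + w14 * Node4

So we have Node0 = (w01*w12 + w02) * Node2 + w03 * Node3 + w01*w14 * Node4.

My C++ code below performs this aggregation/decomposition/merging for a given set of weight decompositions. However, I feel there is a lot of room for optimization. To give just one example, I loop over the keys of topWeights and collect them into topNodeNames, which seems terribly inefficient.

Are there any STL algorithms that could help me speed this up and possibly avoid unnecessary copying?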

#include <string>
#include <unordered_map>
#include <vector>

template<class T, class U> using umap = std::unordered_map<T, U>;


umap<std::string, double> getWeights(const std::string& nodeName, const umap<std::string, umap<std::string, double>>& weightTrees)
{
    const auto it = weightTrees.find(nodeName);
    if (it == weightTrees.end())
        return umap<std::string, double>();

    umap<std::string, double> topWeights = it->second;
    std::vector<std::string> topNodeNames;

    for (const auto& kv : topWeights)
        topNodeNames.push_back(kv.first);

    for (const std::string& topNodeName : topNodeNames)
    {
        umap<std::string, double> subWeights = getWeights(topNodeName, weightTrees);
        if (subWeights.size() > 0)
        {
            const double topWeight = topWeights[topNodeName];
            topWeights.erase(topNodeName);
            for (const auto& subWeight : subWeights)
            {
                const auto it = topWeights.find(subWeight.first);
                if (it == topWeights.end())
                    topWeights[subWeight.first] = topWeight * subWeight.second;
                else
                    it->second += topWeight * subWeight.second;
            }
        }
    }

    return topWeights;
}


int main()
{
    umap<std::string, umap<std::string, double>> weightTrees = {{ "Node0", {{ "Node1",0.5 },{ "Node2",0.3 },{ "Node3",0.2 }} },
                                                                { "Node1", {{ "Node2",0.1 },{ "Node4",0.9 }} }};

    umap<std::string, double> w = getWeights("Node0", weightTrees); // gives {Node2: 0.35, Node3: 0.20, Node4: 0.45}
}

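【Comments】:

What happens in the case of circular dependencies (I assume there are none)? Does your real use case have many common nodes between different branches?
Circular dependencies should never happen (in a real-world case). But I agree it would be good to have some kind of safety check for that. As for question 2, it can indeed vary.
There is no limit on the number of layers. A node can decompose into another, which decomposes into another, which decomposes into two, and so on.
Do the nodes already have an order (i.e. NodeN never depends on NodeK with K < N)? Edit: Yes, I misunderstood the specification when I asked the previous question.
@MaxLanghof No, nodes can also be called ABC, Node23, Server10, TestNode, etc.

Tags: c++ algorithm merge c++17 unordered-map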

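【Solution 1】:

The main problem is that you recurse from every node into every child node, which is generally highly redundant. One way to avoid this would be to introduce an order on the node names, where "higher" nodes depend only on "lower" nodes, and then compute them in reverse order (so that for each node you already know all child weights exactly). However, I don't think there is a std algorithm that will find this order for you, because you cannot cheaply determine node dependencies transitively ("Does node X depend on node Y? If not directly, we might have to search the whole tree...").

So you could go the dynamic programming route and store nodes that are already fully computed somewhere. Or, even better, you could flatten the whole tree down to leaf-only weights as you traverse it. As long as you keep the flattening consistent throughout the recursion, this is actually quite elegant in recursive form: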

#include <iostream>
#include <string>
#include <unordered_map>

using NodeWeights = std::unordered_map<std::string, double>;
using NonLeaves = std::unordered_map<std::string, NodeWeights>;

// Modifies the tree so that the given root has no non-leaf children.
void flattenTree(const std::string& root, NonLeaves& toFlatten)
{
    auto rootIt = toFlatten.find(root);
    if (rootIt == toFlatten.end())
        return;

    NodeWeights& rootWeights = rootIt->second;

    NodeWeights leafOnlyWeights;

    for (auto kvp : rootWeights)
    {
        const std::string& childRoot = kvp.first;
        double childWeight = kvp.second;

        std::cout << "Checking child " << childRoot << std::endl;

        // If the graph is indeed acyclic, then the root kvp here is untouched
        // by this call (and thus references to it are not invalidated).
        flattenTree(childRoot, toFlatten);

        auto childIt = toFlatten.find(childRoot);

        // The child is a leaf: add its weight directly. Use += because an
        // earlier non-leaf sibling may already have contributed to this leaf.
        if (childIt == toFlatten.end())
        {
            leafOnlyWeights[childRoot] += childWeight;
            continue;
        }

        // Child is still not a leaf (but all its children are now leaves):
        // Redistribute its weight among our other child weights.
        const NodeWeights& leafWeights = childIt->second;
        for (auto leafKvp : leafWeights)
            leafOnlyWeights[leafKvp.first] += childWeight * leafKvp.second;
    }

    rootWeights = leafOnlyWeights;
}

int main()
{
    NonLeaves weightTrees = {{ "Node0", {{ "Node1",0.5 },{ "Node2",0.3 },{ "Node3",0.2 }} },
                             { "Node1", {{ "Node2",0.1 },{ "Node4",0.9 }} }};

    auto flattenedTree = weightTrees;
    flattenTree("Node0", flattenedTree);

    NodeWeights w = flattenedTree["Node0"]; // Should give {Node2: 0.35, Node3: 0.20, Node4: 0.45}

    for (auto kvp : w)
      std::cout << kvp.first << ": " << kvp.second << std::endl;
}


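Since each node is flattened at most once, you won't run into the exponential runtime that your original algorithm has.

【Comments】:

This is an interesting approach. Correct me if I'm wrong, but it seems I have to copy the whole tree via auto flattenedTree = weightTrees; every time I do a decomposition? What if weightTrees is huge? Or if the decomposition is trivial (or very simple), do I still have to copy the whole tree first?
This is just a simple implementation of the idea. If you don't need to keep the original, this works better. If you need to keep it but want to avoid the deep copy up front, you could implement it so that the extra storage holds only the leaf-only representations as you find them, rather than starting from the full tree.
Let me think about it. Thanks!
I also just realized there is still a bug in there, due to the way unordered map erasure interacts with the added elements. Let me fix it. Edit: Fixed.
Take a look at the fixed version I mentioned earlier: you could basically store all computed leafOnlyWeights instead of modifying the original (and check them before the original). But this can also grow enormously, since you have to keep all the leaf-only representations. I doubt it's possible to escape the memory vs. speed trade-off here.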
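
The cached variant described in the last comment might look roughly like the sketch below. It is not the answer's code: the function name leafWeights and the separate cache parameter are assumptions made for illustration. The original tree is only read, each non-leaf node is computed at most once, and the graph is assumed to be acyclic (a cycle would recurse forever).

```cpp
#include <string>
#include <unordered_map>
#include <utility>

using NodeWeights = std::unordered_map<std::string, double>;
using NonLeaves = std::unordered_map<std::string, NodeWeights>;

// Hypothetical helper: returns the leaf-only weights of `root`.
// `tree` is never modified; results are stored in `cache` and reused.
// Assumes the dependency graph is acyclic.
const NodeWeights& leafWeights(const std::string& root,
                               const NonLeaves& tree,
                               NonLeaves& cache)
{
    // Check the cache before looking at the original tree.
    if (auto cached = cache.find(root); cached != cache.end())
        return cached->second;

    NodeWeights result;
    if (auto rootIt = tree.find(root); rootIt != tree.end())
    {
        for (const auto& [child, weight] : rootIt->second)
        {
            // Leaf child: its weight is carried over unchanged.
            if (tree.find(child) == tree.end())
            {
                result[child] += weight;
                continue;
            }

            // Non-leaf child: distribute its weight over its leaves.
            for (const auto& [leaf, leafWeight] : leafWeights(child, tree, cache))
                result[leaf] += weight * leafWeight;
        }
    }

    // References to unordered_map elements stay valid across later inserts,
    // so returning a reference into the cache is safe.
    return cache.emplace(root, std::move(result)).first->second;
}
```

Calling leafWeights("Node0", weightTrees, cache) with an initially empty cache leaves weightTrees untouched. The cache then holds one leaf-only map per non-leaf node visited, which is the memory cost the comment refers to.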

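【Solution 2】:

I would suggest doing a topological sort first, followed by a dynamic programming algorithm. Standard versions of topological sort using Kahn's algorithm take O(V+E) time. (If that link goes stale, you can use Google to find another.) In your case, V is the number of nodes and E is the number of terms appearing across all expressions.

If the sort fails, then you have found a circular dependency. Finding it this way is better than having your code blow up.

Once you have such a sort, going through it from end to start with DP is very straightforward.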
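
As a rough illustration of this idea, the sketch below combines a Kahn-style ordering with the DP step in a single pass. The function name flattenAll and the std::optional return type are assumptions for this example, not part of the answer. A node becomes "ready" once all of its non-leaf children have been resolved. If some nodes never become ready, there is a cycle and the function returns std::nullopt.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using NodeWeights = std::unordered_map<std::string, double>;
using NonLeaves = std::unordered_map<std::string, NodeWeights>;

// Hypothetical helper: leaf-only weights for every non-leaf node,
// or std::nullopt if the decompositions contain a cycle.
std::optional<NonLeaves> flattenAll(const NonLeaves& tree)
{
    // pending[n]: number of non-leaf children of n not yet resolved.
    std::unordered_map<std::string, std::size_t> pending;
    // parentsOf[c]: non-leaf nodes whose decomposition mentions c.
    std::unordered_map<std::string, std::vector<std::string>> parentsOf;
    std::vector<std::string> ready;

    for (const auto& [parent, children] : tree)
    {
        std::size_t count = 0;
        for (const auto& [child, weight] : children)
        {
            if (tree.count(child) != 0)
            {
                ++count;
                parentsOf[child].push_back(parent);
            }
        }
        pending[parent] = count;
        if (count == 0)
            ready.push_back(parent);
    }

    NonLeaves flat;
    while (!ready.empty())
    {
        std::string node = std::move(ready.back());
        ready.pop_back();

        // DP step: every non-leaf child of `node` is already in `flat`.
        NodeWeights& result = flat[node];
        for (const auto& [child, weight] : tree.at(node))
        {
            if (tree.count(child) == 0)
            {
                result[child] += weight; // leaf child
                continue;
            }
            for (const auto& [leaf, leafWeight] : flat.at(child))
                result[leaf] += weight * leafWeight;
        }

        // Resolving `node` may make some of its parents ready.
        for (const std::string& parent : parentsOf[node])
            if (--pending[parent] == 0)
                ready.push_back(parent);
    }

    // Nodes that never became ready are part of (or depend on) a cycle.
    if (flat.size() != tree.size())
        return std::nullopt;
    return flat;
}
```

The O(V+E) bound applies to the ordering part. Merging the leaf maps adds work proportional to the sizes of the flattened maps.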

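Also, if you really care about performance, one of your performance constraints is that every operation is done using string comparisons. Throwing lots of strings around is easy and convenient, which is why scripting languages do it all the time. However, it is also slow. In the past I have found it worthwhile to create a lookup structure that turns strings into indexes before entering performance-critical code, and then pass around some type of int instead of strings. Then use the lookup at the end to turn the result back into strings.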

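【Comments】:

Could you explain your last point in more detail? Doesn't unordered_map use an internal hash structure to store/look up keys? Do you have an example of what the setup of such a "standard" lookup structure looks like (in C++)?
@Phil-ZXX In C, a string is an array of bytes, and every string operation requires looping over them. In C++, a string is a more complex data structure, and a string operation may need to follow a pointer to where the data is actually stored and then loop over it. By contrast, comparing integers is a built-in CPU operation. No pointers. No variable-length loops. That makes it fast. A hash lookup on a string requires a variable-length loop over the string to compute the hash, then a similar loop to check whether the entry found in the hash table is the same string. This is slower than doing a hash lookup on an integer.
As an implementation, you can simply create a vector of nodes and a lookup map from string to index, then use those indexes in your code. However, this is not very maintainable for complex code. In the past I have solved this by creating a small class that looks up object instances from a std::string, with a convenience method that takes a string and returns a pointer to the unique object representing that string, and then using only pointers everywhere else. Now the type system keeps me from confusing what goes where.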
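
A minimal sketch of the simpler variant mentioned above (a vector of names plus a string-to-index lookup map) could look like this. The class name NameTable and its member names are made up for illustration. It hands out plain indexes, not the pointer-to-unique-object design the commenter ended up with.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: assigns each distinct node name a small integer id,
// so performance-critical code can work with ids instead of strings.
class NameTable
{
public:
    using Id = std::size_t;

    // Returns the id of `name`, assigning the next free id on first use.
    Id intern(const std::string& name)
    {
        auto [it, inserted] = ids_.try_emplace(name, names_.size());
        if (inserted)
            names_.push_back(name);
        return it->second;
    }

    // Converts an id back to its name, e.g. when reporting results.
    const std::string& name(Id id) const
    {
        assert(id < names_.size());
        return names_[id];
    }

    std::size_t size() const { return names_.size(); }

private:
    std::unordered_map<std::string, Id> ids_; // name -> id
    std::vector<std::string> names_;          // id -> name
};
```

With ids in hand, the weights could be stored as, for example, std::vector<std::vector<std::pair<NameTable::Id, double>>> indexed by node id, and the names are only looked up again when printing the final result. Wrapping the id in its own small struct would restore some of the type safety the commenter describes.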