Finding the K unique largest elements in an unsorted array of pairs (answers)

【Question title】: Finding the K UNIQUE largest elements in an unsorted array of pairs
【Posted】: 2020-01-15 18:52:54
【Question】:

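Here is the scenario. I have a very large unsorted array called gallery, which contains pairs of a template (std::vector<uint8_t>) and its associated ID (std::string).

I have a function that is given a template and must return the IDs of the k most similar templates in my gallery (I use cosine similarity to generate the similarity score between templates).

I considered using a heap, as discussed in this post. The problem is that the gallery can contain several different templates belonging to a single ID, and my function must return k unique IDs.

For context, I am building a facial recognition application. My gallery can hold several different templates for one person (the person enrolled with several different images, so several templates belong to their ID). The search function should return the k people most similar to the provided template, so the same ID must not be returned more than once.

I am hoping there is an efficient algorithm for doing this in C++.

Edit: code snippet for my proposed heap solution (it does not handle duplicates correctly):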

    // Min-heap on (score, ID): top() is the lowest score currently kept.
    using Candidate = std::pair<double, std::string>;
    std::priority_queue<Candidate, std::vector<Candidate>, std::greater<>> queue;

    for (const auto& templPair : m_gallery) {
        try {
            const double similarity = computeSimilarityScore(templPair.templ, idTemplateDeserial);

            if (queue.size() < candidateListLength) {
                queue.emplace(similarity, templPair.id);
            } else if (queue.top().first < similarity) {
                // Replace the current worst candidate with the better one.
                queue.pop();
                queue.emplace(similarity, templPair.id);
            }
        } catch (...) {
            std::cout << "Unable to compute similarity\n";
            continue;
        }
    }
    // The candidateListLength entries with the highest scores are now in queue
    // (the same ID can appear more than once).

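Here is an example that will hopefully help. For simplicity, I assume the similarity scores have already been computed for the templates.

Template 1: similarity score: 0.4, ID: Cyrus

Template 2: similarity score: 0.5, ID: James

Template 3: similarity score: 0.9, ID: Bob

Template 4: similarity score: 0.8, ID: Cyrus

Template 5: similarity score: 0.7, ID: Vanessa

Template 6: similarity score: 0.3, ID: Ariana

Getting the IDs of the top 3 scoring templates should return [Bob, Cyrus, Vanessa]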

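【Comments】:

Use a max-heap and, instead of discarding the IDs you pop from the top, put them into a std::set and continue until the set's size() is k?
So if I put the IDs in a set, it will tell me whether an ID is already in the max-heap, which is good. However, I would also need to modify the score stored for a given ID in the queue (assuming the new similarity score is greater than the one already in the queue).
I don't quite understand. In your question you say you have pairs of a value and an ID. You have a function that describes the similarity between two values, which can be used to order the elements. You say you want to retrieve the k unique IDs corresponding to the most similar values. Where do you mention that you need to change any data? Anyway, you could use a std::map instead of a std::set, with the IDs as keys and pointers to your pairs as values, assuming I understood you correctly. Could you provide an example input and output?
In that case the output is still the same: [Bob, Cyrus, Vanessa] (in that order)
Then I believe my first comment proposes a correct solution. If nobody gives you a satisfying answer within ~24 hours, I will try to come up with my own.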
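
The idea from the first comment can be sketched roughly as follows. This is only an illustration under assumptions: the function name `topKUniqueFromHeap` is hypothetical, and the scores are assumed to be precomputed as (score, ID) pairs, as in the question's example.

```cpp
#include <cstddef>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the original post): returns up to k distinct
// IDs, best score first, by popping a max-heap and skipping IDs already seen.
std::vector<std::string> topKUniqueFromHeap(
    const std::vector<std::pair<double, std::string>>& scored, std::size_t k) {
    // std::priority_queue is a max-heap by default, so top() is the best score.
    std::priority_queue<std::pair<double, std::string>> heap(scored.begin(), scored.end());

    std::set<std::string> seen;
    std::vector<std::string> result;
    while (!heap.empty() && result.size() < k) {
        // insert().second is false when this ID has already been returned.
        if (seen.insert(heap.top().second).second) {
            result.push_back(heap.top().second);
        }
        heap.pop();
    }
    return result;
}
```

Note that this sketch keeps every score in the heap, so its memory use grows with the gallery size; for a very large gallery a bounded structure may be preferable.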

Tags: c++ arrays algorithm optimization max

【Solution 1】:

I implemented the outline from Maras's answer. It seems to do the job.

#include <cstddef>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

int main() {
    const std::size_t K = 3;  // number of unique IDs to return

    std::vector<std::pair<double, std::string>> data {
        {0.4, "Cyrus"},
        {0.5, "James"},
        {0.9, "Bob"},
        {0.8, "Cyrus"},
        {0.7, "Vanessa"},
        {0.3, "Ariana"},
    };

    std::set<std::pair<double, std::string>> mySet;
    std::map<std::string, double> myMap;

    for (const auto& pair: data) {
        if (myMap.find( pair.second ) == myMap.end()) {
            // The ID is unique
            if (mySet.size() < K) {
                // The size of the set is less than the size of search candidates
                // Add the result to the map and the set
                mySet.insert(pair);
                myMap[pair.second] = pair.first;
            } else {
                // Check to see if the current score is larger than the worst performer in the set
                auto worstPairPtr = mySet.begin();

                if (pair.first > (*worstPairPtr).first) {
                    // The contender performed better than the worst in the set
                    // Remove the worst item from the map first, while the iterator is
                    // still valid, then erase it from the set and add the contender
                    myMap.erase(worstPairPtr->second);
                    mySet.erase(worstPairPtr);
                    mySet.insert(pair);
                    myMap[pair.second] = pair.first;
                }
            }

        } else {
            // The ID already exists
            // Compare the contender score to the score of the existing ID.
            // If the contender score is better, replace the existing item score with the new score
            // Remove the old item from the set
            if (pair.first > myMap[pair.second]) {
                mySet.erase({myMap[pair.second], pair.second});
                mySet.insert(pair);
                myMap[pair.second] = pair.first;
            }

        }
    }

    for (auto it = mySet.rbegin(); it != mySet.rend(); ++it) {
        std::cout << (*it).second << std::endl;
    }

}

The output is

Bob
Cyrus
Vanessa

【Comments】:

【Solution 2】:

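Use a std::set (a balanced BST) instead of a heap. It also keeps the elements in order and lets you find the largest and smallest inserted elements. In addition, its insert function automatically detects duplicates and ignores them, so every element in it is always unique. The complexity is exactly the same (although it will be a bit slower because of the larger constant factor).

Edit: I may not have understood the question correctly. As far as I can see, you can have multiple elements with different values that should still be treated as duplicates.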
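
A small sketch of why a plain set of pairs is not enough here: the set compares the whole (score, ID) pair, so two entries with the same ID but different scores are both kept. The function name `countDistinctPairs` is made up for this illustration.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>

// Hypothetical demo (not from the original post): counts how many entries a
// std::set of (score, ID) pairs keeps when the same ID arrives with different scores.
std::size_t countDistinctPairs() {
    std::set<std::pair<double, std::string>> s;
    s.emplace(0.4, "Cyrus");
    s.emplace(0.8, "Cyrus");  // same ID, different score: a different element
    s.emplace(0.8, "Cyrus");  // exact duplicate of the previous pair: ignored
    return s.size();          // both Cyrus entries remain
}
```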

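What I would do:

Create a set of (template value, ID) pairs.
Create a map whose keys are IDs and whose values are the template values currently stored in the set for those IDs.
When you want to add a new template:
- If its ID is in the map, you have found a duplicate. If its value is worse than the value paired with that ID in the map, do nothing; otherwise remove the pair (old value, ID) from the set, insert (new value, ID), and update the value in the map to the new value.
- If it is not in the map, just add it to both the map and the set.
When the set holds too many items, just remove the worst one from both the set and the map.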
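
The steps above could be wrapped in a small reusable type, roughly like this. It is a minimal sketch rather than a definitive implementation: the class name `TopKUnique` and its members are hypothetical, and "template value" is taken to mean the similarity score.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical class (not from the original post): keeps the best score per ID
// for at most k IDs, following the set + map outline.
class TopKUnique {
public:
    explicit TopKUnique(std::size_t k) : k_(k) {}

    void add(double score, const std::string& id) {
        auto it = best_.find(id);
        if (it != best_.end()) {
            // Duplicate ID: keep only the better score.
            if (score <= it->second) return;
            ranked_.erase(std::make_pair(it->second, id));
            it->second = score;
        } else {
            best_.emplace(id, score);
        }
        ranked_.emplace(score, id);

        // Too many items: drop the worst one from both containers.
        if (ranked_.size() > k_) {
            auto worst = ranked_.begin();
            best_.erase(worst->second);  // use the iterator before erasing it
            ranked_.erase(worst);
        }
    }

    // IDs ordered from best to worst score.
    std::vector<std::string> ids() const {
        std::vector<std::string> out;
        for (auto it = ranked_.rbegin(); it != ranked_.rend(); ++it) {
            out.push_back(it->second);
        }
        return out;
    }

private:
    std::size_t k_;
    std::set<std::pair<double, std::string>> ranked_;  // (score, ID), ascending
    std::map<std::string, double> best_;               // ID -> score held in ranked_
};
```

Because at most k + 1 entries are held at any time, each `add` costs O(log k) plus the string comparisons, and the gallery only needs to be scanned once.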

【Comments】: