【Question title】: How to get elements by an array field and filter the array field to return only the matching elements?
【Posted】: 2020-04-24 13:45:20
【Question description】:

I have a dictionary containing thousands of elements like this:

{
'_id': ObjectId('5e9cd87f8b5ab6d445edab5f'), 
'id': 'XXX-YYY-ZZZ', 
'Published': datetime.datetime(2020, 2, 25, 18, 15), 
'summary': 'Some information', 
'subelements': [
    "Apple", "Car", "Glass"     // Thousands more
]
}

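The "subelements" field may contain thousands of strings.

Given a set of subelements, I want to find all elements that contain at least one matching subelement. The hard part is that I also want to filter "subelements" so that it contains only the values I'm looking for.

For example, searching for "Apple" should return: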

{
'_id': ObjectId('5e9cd87f8b5ab6d445edab5f'), 
'id': 'XXX-YYY-ZZZ', 
'Published': datetime.datetime(2020, 2, 25, 18, 15), 
'summary': 'Some information', 
'subelements': [
    "Apple"
]
}
// other matches ...

And searching for "Apple", "Car", and, say, "Book" should return:

{
'_id': ObjectId('5e9cd87f8b5ab6d445edab5f'), 
'id': 'XXX-YYY-ZZZ', 
'Published': datetime.datetime(2020, 2, 25, 18, 15), 
'summary': 'Some information', 
'subelements': [
    "Apple", "Car"
]
}
// other matches

Edit: some actual elements from my case. I'm working with a CVE database and want to use a single query to find the CVEs for multiple CPEs:

{
    "id": "CVE-1999-0001",
    "assigner": "cve@mitre.org",
    "Published": {
        "$date": {
            "$numberLong": "946530000000"
        }
    },
    "Modified": {
        "$date": {
            "$numberLong": "1292475600000"
        }
    },
    "summary": "ip_input.c in BSD-derived TCP/IP implementations allows remote attackers to cause a denial of service (crash or hang) via crafted packets.",
    "access": {
        "authentication": "NONE",
        "complexity": "LOW",
        "vector": "NETWORK"
    },
    "impact": {
        "availability": "PARTIAL",
        "confidentiality": "NONE",
        "integrity": "NONE"
    },
    "cvss": {
        "$numberDouble": "5"
    },
    "cvss-time": {
        "$date": {
            "$numberLong": "1292475600000"
        }
    },
    "cvss-vector": "AV:N/AC:L/Au:N/C:N/I:N/A:P",
    "references": ["http://www.openbsd.org/errata23.html#tcpfix", "http://www.osvdb.org/5707"],
    "vulnerable_configuration": ["cpe:2.3:o:bsdi:bsd_os:3.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.0:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.1.5.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.2:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.0:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.0.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.0.5:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.5:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.6:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.6.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.7:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.7.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.2:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.3:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.4:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.5:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.6:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.8:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:3.0:*:*:*:*:*:*:*", "cpe:2.3:o:openbsd:openbsd:2.3:*:*:*:*:*:*:*", "cpe:2.3:o:openbsd:openbsd:2.4:*:*:*:*:*:*:*"],
    "vulnerable_product": ["cpe:2.3:o:bsdi:bsd_os:3.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.0:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.1.5.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:1.2:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.0:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.0.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.0.5:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.5:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.6:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.6.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.7:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.1.7.1:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.2:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.3:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.4:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.5:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.6:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:2.2.8:*:*:*:*:*:*:*", "cpe:2.3:o:freebsd:freebsd:3.0:*:*:*:*:*:*:*", "cpe:2.3:o:openbsd:openbsd:2.3:*:*:*:*:*:*:*", "cpe:2.3:o:openbsd:openbsd:2.4:*:*:*:*:*:*:*"],
    "cwe": "CWE-20",
    "vulnerable_configuration_cpe_2_2": []
}
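
For reference, a single query of this kind could be expressed as a MongoDB aggregation pipeline. The following is only a minimal sketch under stated assumptions: the helper names `build_pipeline` and `find_matching` are hypothetical, the array to filter is assumed to be `vulnerable_configuration` (the post does not say which array is meant), and `collection` is assumed to be a pymongo `Collection`. It uses a `$match` stage with `$in` to keep documents with at least one match, then `$addFields` with `$filter` to trim the array to the matching values.

```python
def build_pipeline(wanted, field="vulnerable_configuration"):
    """Build an aggregation pipeline (hypothetical helper).

    Keeps documents whose `field` array shares at least one value with
    `wanted`, and trims `field` down to the matching values only.
    """
    wanted = list(wanted)
    return [
        # Stage 1: keep documents whose array contains at least one wanted value
        {"$match": {field: {"$in": wanted}}},
        # Stage 2: overwrite the array with only the wanted values it contains
        {
            "$addFields": {
                field: {
                    "$filter": {
                        "input": "$" + field,
                        "as": "item",
                        "cond": {"$in": ["$$item", wanted]},
                    }
                }
            }
        },
    ]


def find_matching(collection, wanted, field="vulnerable_configuration"):
    # Assumption: `collection` is a pymongo Collection; aggregate() returns a cursor
    return list(collection.aggregate(build_pipeline(wanted, field)))
```

For the simplified example at the top of the post, the same sketch would be called as `find_matching(collection, ["Apple", "Car", "Book"], field="subelements")`. The `$match` stage comes first so that an index on the array field, if one exists, can be used before any per-document filtering happens.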

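【Comments】:

  • Could you add a larger sample so we can better understand your data and the problem? What do you mean by a dictionary "containing thousands of such elements"?
  • Do you want to do this as a mongo query or as part of Python processing?
  • @Gabip As part of the mongo query
  • @DaniMesejo I've just updated the post
  • Which of the following arrays do you want to filter: references, vulnerable_configuration, or vulnerable_product?

Tags: python arrays mongodb filter pymongo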


【Solution 1】:

You can compare them as sets, i.e. compare one set against another instead of checking every element of the list:

import pandas as pd
import datetime

data = [{
# '_id': str(ObjectId('5e9cd87f8b5ab6d445edab5f')), 
'id': 'XXX-YYY-ZZZ', 
'Published': datetime.datetime(2020, 2, 25, 18, 15), 
'summary': 'Some information', 
'subelements': [
    "Apple", "Car"
]
},
{
# '_id': str(ObjectId('5e9cd87f8b5ab6d445edab5f')), 
'id': 'XXX-YYY-ZZZ', 
'Published': datetime.datetime(2020, 2, 25, 18, 15), 
'summary': 'Some information', 
'subelements': [
    "Apple", "Car","Onion"
]
},
{
# '_id': str(ObjectId('5e9cd87f8b5ab6d445edab5f')), 
'id': 'XXX-YYY-ZZZ', 
'Published': datetime.datetime(2020, 2, 25, 18, 15), 
'summary': 'Some information', 
'subelements': [
    "Apple", "Car","Watch"
]
}
]

df = pd.DataFrame(data)

# Convert each subelements list to a set so rows can be compared set-to-set
df['set'] = df['subelements'].apply(set)
# Keep only the rows whose subelements are exactly {'Car', 'Apple'}
print(df[df['set'] == set(['Car', 'Apple'])])
            id           Published  ...   subelements           set
0  XXX-YYY-ZZZ 2020-02-25 18:15:00  ...  [Apple, Car]  {Car, Apple}
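
Note that the equality test above keeps only rows whose subelements are exactly the requested set. For the behavior described in the question (at least one match, with the array trimmed to the matching values), a membership filter may be closer. The following is a minimal sketch, not a definitive implementation: the ids "A", "B", "C", the sample rows, and the `matched` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample rows, for illustration only
data = [
    {"id": "A", "subelements": ["Apple", "Car"]},
    {"id": "B", "subelements": ["Onion", "Watch"]},
    {"id": "C", "subelements": ["Glass", "Car", "Watch"]},
]
df = pd.DataFrame(data)

# Values being searched for; a set gives fast membership checks
wanted = {"Apple", "Car", "Book"}

# Per row, keep only the subelements that are in `wanted` (original order preserved)
df["matched"] = df["subelements"].apply(
    lambda items: [x for x in items if x in wanted]
)

# Keep only the rows that have at least one matching subelement
result = df[df["matched"].map(len) > 0]
print(result[["id", "matched"]])
```

This runs in Python after the documents have been loaded, so it does not reduce what is fetched from the database.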
