【Question title】: MongoDB collection to pandas DataFrame
【Posted】: 2021-12-01 13:24:18
【Question】:

My MongoDB documents have the structure shown below, and some of the factors are NaN.

  _id :ObjectId("5feddb959297bb2625db1450")
factors: Array 
   0:Object
     factorId:"C24"
     Index:0
     weight:1
   1:Object
     factorId:"C25"
     Index:1
     weight:1
   2:Object
     factorId:"C26"
     Index:2
     weight:1
name:"Growth Led Momentum"

I would like to use pymongo and pandas to convert this into a pandas DataFrame like the one below.

|name                   | factorId | Index | weight|
----------------------------------------------------
|Growth Led Momentum    | C24      | 0     | 0     |
----------------------------------------------------
|Growth Led Momentum    | C25      | 1     | 0     |
----------------------------------------------------
|Growth Led Momentum    | C26      | 2     | 0     |
----------------------------------------------------

Thanks.

【Question comments】:

    Tags: python pandas mongodb dataframe pymongo


    【Solution 1】:

    Update

    I cracked open good ol' Python to give this a try. The code below runs as expected against my test collection.

    from pymongo import MongoClient
    import pandas as pd
    
    uri = "mongodb://<your_mongo_uri>:27017"
    database_name = "<your_database_name>"
    collection_name = "<your_collection_name>"
    
    mongo_client = MongoClient(uri)
    database = mongo_client[database_name]
    collection = database[collection_name]
    
    # I used this code to insert a doc into a test collection
    # before querying (just in case you wanted to know).
    """
    data = {
        "_id": 1,
        "name": "Growth Lead Momentum",
        "factors": [
            {
                "factorId": "C24",
                "index": 0,
                "weight": 1
            },
            {
                "factorId": "D74",
                "index": 7,
                "weight": 9
            }
        ]
    }
    
    insert_result = collection.insert_one(data)
    print(insert_result)
    """
    
    # This is the query that
    # answers your question
    
    results = collection.aggregate([
      {
        "$unwind": "$factors"
      },
      {
        "$project": {
          "_id": 1, # Change to 0 if you wish to ignore "_id" field.
          "name": 1,
          "factorId": "$factors.factorId",
          "index": "$factors.index",
          "weight": "$factors.weight"
        }
      }
    ])
    
    # This is how we turn the results into a DataFrame.
    # We can simply pass `list(results)` into `DataFrame(..)`,
    # due to how our query works.
    
    results_as_dataframe = pd.DataFrame(list(results))
    print(results_as_dataframe)
    

    Which outputs:

       _id                  name factorId  index  weight
    0    1  Growth Lead Momentum      C24      0       1
    1    1  Growth Lead Momentum      D74      7       9
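
The question mentions that some factors are NaN, and its documents spell the subfield `Index` rather than `index`. Below is a minimal sketch of a pipeline builder that covers both cases. The helper name `build_flatten_pipeline` and its parameters are made up for illustration and are not part of pymongo. `preserveNullAndEmptyArrays` is a standard `$unwind` option that keeps documents whose array is missing, null, or empty instead of dropping them.

```python
def build_flatten_pipeline(array_field, subfields, top_level_fields=("name",),
                           keep_id=True, keep_empty=True):
    """Build an aggregation pipeline that yields one row per array element.

    Hypothetical helper for illustration; it is not part of pymongo.
    """
    unwind_stage = {
        "$unwind": {
            "path": f"${array_field}",
            # Keep documents whose array is missing, null, or empty.
            "preserveNullAndEmptyArrays": keep_empty,
        }
    }
    projection = {"_id": 1 if keep_id else 0}
    for field in top_level_fields:
        projection[field] = 1
    for subfield in subfields:
        # Field names are case-sensitive: "Index" and "index" are different.
        projection[subfield] = f"${array_field}.{subfield}"
    return [unwind_stage, {"$project": projection}]


pipeline = build_flatten_pipeline(
    "factors", ["factorId", "Index", "weight"], keep_id=False
)
print(pipeline)
```

The resulting list can be passed to `collection.aggregate(pipeline)` and then to `pd.DataFrame(list(...))` as in the script above. Rows that come from documents without factors should then show NaN in the factor columns.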
    

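    Original answer

    You can use an aggregation pipeline to unwind `factors` and then project the fields you want.

    Something like this should do the trick.

    Live demo here.

    Database structure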

    [
      {
        "_id": 1,
        "name": "Growth Lead Momentum",
        "factors": [
          {
            factorId: "C24",
            index: 0,
            weight: 1
          },
          {
            factorId: "D74",
            index: 7,
            weight: 9
          }
        ]
      }
    ]
    

    Query

    db.collection.aggregate([
      {
        $unwind: "$factors"
      },
      {
        $project: {
          _id: 1,
          name: 1,
          factorId: "$factors.factorId",
          index: "$factors.index",
          weight: "$factors.weight"
        }
      }
    ])
    

    Result

    (.csv friendly)

    [
      {
        "_id": 1,
        "factorId": "C24",
        "index": 0,
        "name": "Growth Lead Momentum",
        "weight": 1
      },
      {
        "_id": 1,
        "factorId": "D74",
        "index": 7,
        "name": "Growth Lead Momentum",
        "weight": 9
      }
    ]
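
To see what the two stages do without a running server, here is a rough pure-Python imitation of `$unwind` followed by `$project`, applied to the sample document above. The function name `unwind_and_project` is made up for illustration, and it only mimics the simple case shown here, not MongoDB's full semantics.

```python
def unwind_and_project(documents, array_field, subfields, top_level_fields):
    """Emit one flat dict per element of `array_field` in each document."""
    rows = []
    for document in documents:
        # Like the default $unwind, skip documents with a missing or empty array.
        for element in document.get(array_field) or []:
            row = {
                field: document[field]
                for field in top_level_fields
                if field in document
            }
            for subfield in subfields:
                if subfield in element:
                    row[subfield] = element[subfield]
            rows.append(row)
    return rows


documents = [
    {
        "_id": 1,
        "name": "Growth Lead Momentum",
        "factors": [
            {"factorId": "C24", "index": 0, "weight": 1},
            {"factorId": "D74", "index": 7, "weight": 9},
        ],
    }
]

rows = unwind_and_project(
    documents, "factors", ["factorId", "index", "weight"], ["_id", "name"]
)
for row in rows:
    print(row)
```

Each element of `factors` becomes its own row, and the parent's `_id` and `name` are copied onto every row, which is why the result converts cleanly to a table.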
    

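    【Comments】:

    • @Kalindu I ended up adding a working Python script to my answer. Cheers!
    【Solution 2】:

    Great answer from Matt. If you would rather do the reshaping in pandas:

    Use this after you have retrieved the documents from the database: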

    import pandas as pd
    # `data` is assumed to be a list of documents whose `factors` field is a list of dicts.
    df = pd.json_normalize(data)
    df = df['factors'].explode().apply(pd.Series).join(df).drop(columns=['factors'])
    

    Output:

      factorId  Index  weight                 name
    0      C24      0       1  Growth Led Momentum
    0      C25      1       1  Growth Led Momentum
    0      C26      2       1  Growth Led Momentum
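
A possibly shorter route to the same table is to let `pd.json_normalize` do the unnesting through its `record_path` and `meta` parameters. This is a sketch under the assumption that every document has a `factors` list; the sample data is copied from the question, and documents that lack the key may need a default value first.

```python
import pandas as pd

# Sample document taken from the question (without "_id").
data = [
    {
        "name": "Growth Led Momentum",
        "factors": [
            {"factorId": "C24", "Index": 0, "weight": 1},
            {"factorId": "C25", "Index": 1, "weight": 1},
            {"factorId": "C26", "Index": 2, "weight": 1},
        ],
    }
]

# One row per element of "factors"; "name" is repeated on each row.
df = pd.json_normalize(data, record_path="factors", meta=["name"])

# Reorder the columns to match the layout asked for in the question.
df = df[["name", "factorId", "Index", "weight"]]
print(df)
```

Unlike the `explode` chain, this produces a fresh 0..n-1 row index rather than repeating the parent row's index.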
    

    【Comments】:

    • Actually, see my updated answer: pandas plus pymongo makes this even easier!