CSV to MongoDB using a mongoose schema

【Question title】: CSV to Mongo using mongoose schema
【Posted】: 2016-02-15 06:42:27
【Question】:

I am trying to add a CSV file to my mongodb collection (via mongoose) while checking for matches at each level of my schema.

So for a given schema personSchema with a nested schema carSchema:

repairSchema = {
  date: Date,
  description: String
}
carSchema = {
  make: String,
  model: String
}
personSchema = {
  first_name: String,
  last_name: String,
  car: [carSchema]
}

And an object that I am mapping the CSV data to:

mappingObject = {
  first_name : 0,
  last_name: 1,
  car : {
    make: 2,
    model: 3,
    repair: {
      date: 4,
      description: 5
    }
  }
}
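
As an illustration of how a mapping object like this can be applied to one parsed CSV row, here is a minimal sketch. The helper name applyMapping and the sample row are hypothetical and not part of the question; the sketch assumes the row has already been split into an array of strings and that every numeric leaf in the mapping is a column index.

```javascript
// Hypothetical helper: walk the mapping and replace each column index
// with the value found at that index in the parsed CSV row.
function applyMapping(mapping, row) {
  const result = {};
  for (const key of Object.keys(mapping)) {
    const target = mapping[key];
    if (typeof target === 'number') {
      result[key] = row[target];               // leaf: copy the column value
    } else {
      result[key] = applyMapping(target, row); // nested level: recurse
    }
  }
  return result;
}

const mappingObject = {
  first_name: 0,
  last_name: 1,
  car: { make: 2, model: 3, repair: { date: 4, description: 5 } }
};

// Sample row (made up for illustration).
const row = ['adam', 'snider', 'honda', 'civic', '2016-02-01', 'oil change'];
const saveObj = applyMapping(mappingObject, row);
console.log(JSON.stringify(saveObj));
```

Because the helper recurses on every non-numeric value, it should work unchanged for mappings nested more than two levels deep.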

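I want to check my collection for a match, then check each nested schema for a match, or create the whole document, as appropriate.

Desired flow:

I need to check whether a person document matching first_name and last_name exists in my collection.

If such a person document exists, check whether it contains a matching car.make and car.model.

If such a car document exists, check whether it contains a matching car.repair.date and car.repair.description.

If such a repair document exists, do nothing: the record is an exact match for an existing one.

If no such repair document exists, push this repair onto the repairs of the matching car and person.

If no such car document exists, push this car onto the cars of the matching person.

If no such person document exists, create the document.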
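
The flow above can be sketched against a plain in-memory array, which makes the three levels of "match or push" easier to see before any database calls are involved. Everything here is hypothetical (the mergeRecord name, the people array, and the assumption that car and repair are stored as arrays); it only models the logic and is not mongoose code.

```javascript
// Hypothetical in-memory model of the desired flow (no database involved).
// Assumes each person has a `car` array and each car has a `repair` array.
function mergeRecord(people, rec) {
  const car = rec.car;
  const repair = car.repair;

  const person = people.find(p =>
    p.first_name === rec.first_name && p.last_name === rec.last_name);
  if (!person) {
    // No such person: create the whole document.
    people.push({
      first_name: rec.first_name,
      last_name: rec.last_name,
      car: [{ make: car.make, model: car.model, repair: [repair] }]
    });
    return 'created person';
  }

  const existingCar = person.car.find(c =>
    c.make === car.make && c.model === car.model);
  if (!existingCar) {
    // Person exists but not this car: push the car (with its repair).
    person.car.push({ make: car.make, model: car.model, repair: [repair] });
    return 'pushed car';
  }

  const hasRepair = existingCar.repair.some(r =>
    r.date === repair.date && r.description === repair.description);
  if (!hasRepair) {
    // Car exists but not this repair: push the repair.
    existingCar.repair.push(repair);
    return 'pushed repair';
  }
  return 'exact match'; // nothing to do
}

const people = [];
const rec = {
  first_name: 'adam', last_name: 'snider',
  car: { make: 'honda', model: 'civic',
         repair: { date: '2016-02-01', description: 'oil change' } }
};
console.log(mergeRecord(people, rec)); // created person
console.log(mergeRecord(people, rec)); // exact match
```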

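The kicker

The same function will be used across many schemas, which may be nested many levels deep (the current database has one schema that is 7 levels deep), so it has to be fairly abstract. I can already get the data into the structure I need as a JavaScript object, so I just need to get from that object into the collection as described.

It also has to be synchronous, because multiple records in the CSV may refer to the same person, and asynchronous creation could mean the same person gets created twice.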
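
One way to get this ordering without truly blocking is to handle rows one at a time and wait for each row's database work to finish before starting the next. The sketch below shows the idea with a stand-in asynchronous store. The names fakeFind, fakeCreate, findOrCreate and processSequentially are hypothetical, and the setTimeout delay only simulates database latency.

```javascript
// Stand-in for a database (assumption): an array plus an artificial delay.
const created = []; // every "insert" the fake database performed
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

async function fakeFind(name) { await delay(5); return created.includes(name); }
async function fakeCreate(name) { await delay(5); created.push(name); }

// Find and create are separate async steps, as they would be with a real database.
async function findOrCreate(name) {
  if (!(await fakeFind(name))) {
    await fakeCreate(name);
  }
}

// The next row starts only after the previous row has completely finished,
// so a duplicate name is always seen by the later find.
async function processSequentially(rows) {
  for (const name of rows) {
    await findOrCreate(name);
  }
}

processSequentially(['adam snider', 'adam snider', 'jane doe'])
  .then(() => console.log(created)); // [ 'adam snider', 'jane doe' ]
```

Starting all rows at once instead (for example with Promise.all over rows.map(findOrCreate)) would let both "adam snider" lookups run before either create, which is the duplicate-creation problem described above.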

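Current solution

I go through each line of the CSV, map the data to my mappingObject, and then step through each level of the object in JavaScript, using find to check whether the non-object key-value pairs match, then push/create or recurse as appropriate. This absolutely works, but it is painfully slow for documents this large.

Here is my full recursive function, which works:

saveObj is the object I have mapped the CSV to, matching my schema.

findPrevObj is initially false. path and topKey are both initially "".

lr is the line reader object; lr.resume simply moves on to the next line.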

var findOrSave = function(saveObj, findPrevObj, path, topKey){
    //the object used to search the collection
    var findObj = {};

    //if this is a nested schema, we need the previous schema search to match as well
    if (findPrevObj){
        for (var key in findPrevObj){
            findObj[key] = findPrevObj[key];
        }
    }

    //go through all the saveObj, compiling the findObj from string fields
    for (var key in saveObj){
        if (saveObj.hasOwnProperty(key) && typeof saveObj[key] === "string"){
            findObj[path+key] = saveObj[key]
        }
    }


    //search the DB for this record
    ThisCollection.find(findObj).exec(function(e, doc){

        //this level at least exists
        if (doc.length){

            //go through all the deeper levels in our saveObj
            //(count them once, outside the loop, so the check below sees the total)
            var i = 0;
            for (var key in saveObj){
                //deeper levels are the nested objects, not the string fields
                if (saveObj.hasOwnProperty(key) && typeof saveObj[key] === "object"){
                    i += 1;
                    findOrSave(saveObj[key], findObj, path+key+".", path+key);
                }
            }

            //if there were no deeper levels (basically, full record exists)
            if (!i){
                lr.resume();
            }

        //this level doesn't exist, add new record or push to array
            } else {

                if (findPrevObj){

                    var toPush = {};
                    toPush[topKey] = saveObj;

                    ThisCollection.findOneAndUpdate(
                        findPrevObj,
                        {$push: toPush},
                        {safe: true, upsert: true},
                        function(err, doc) {
                            lr.resume();
                        }
                    )   
                } else {
                    // console.log("\r\rTrying to save: \r", saveObj, "\r\r\r");
                    ThisCollection.create(saveObj, function(e, doc){
                        lr.resume();
                    });
                }
            }
    });
}
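
To make the query-building step of the function above easier to follow, here it is isolated as a pure function. The name buildFindObj and the sample data are hypothetical; the logic mirrors the first two loops of findOrSave.

```javascript
// Hypothetical extraction of the query-building step from findOrSave:
// start from the parent level's query, then add this level's string fields
// under their dotted path.
function buildFindObj(saveObj, findPrevObj, path) {
  const findObj = Object.assign({}, findPrevObj || {});
  for (const key of Object.keys(saveObj)) {
    if (typeof saveObj[key] === 'string') {
      findObj[path + key] = saveObj[key];
    }
  }
  return findObj;
}

const person = {
  first_name: 'adam',
  last_name: 'snider',
  car: { make: 'honda', model: 'civic' }
};
const top = buildFindObj(person, false, '');
const nested = buildFindObj(person.car, top, 'car.');
console.log(JSON.stringify(nested));
// {"first_name":"adam","last_name":"snider","car.make":"honda","car.model":"civic"}
```

One caveat worth keeping in mind: when car is an array, separate dotted conditions such as 'car.make' and 'car.model' can be satisfied by two different array elements. Requiring both on the same element takes an $elemMatch condition instead.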

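【Comments】:

Can you elaborate on what exactly you are trying to do here? My understanding of the first part is that you have a CSV with columns in the format first_name,last_name,car_make,car_model, and you want to iterate over it, creating a person for each row. If that's the case, I don't understand why you need to do Person.find... is there some sort of uniqueness constraint here?
I think you need to rewrite everything after "I currently have this working"; it is not clear what your question is... Also, could you be more specific about your data structure? What do these nested levels look like?
Use this module, npmjs.com/package/csvtojson, to convert all the data into JSON, and then restructure your data according to the schema?
@jtmarmon I will update for clarity, but person.find is there to check whether a person with a matching first and last name exists. If they do exist, I check each car for a match; if that car already exists, there is no reason to add this record. If the car does not exist, I push it onto the matching person's car array. If no person matches, I save the whole new record.
@aishwatsingh I tried that module, and it is great for parsing the CSV file and getting the data into the structure I want, but that is not the problem. I cannot get mongo/mongoose to check existing data for partial matches (e.g. match the person, then look for the car if matched, otherwise create a new record). It creates an entirely new record every time.

Tags: node.js mongodb csv mongoose schema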

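【Solution 1】:

"I will update for clarity, but person.find is there to check whether a person with a matching first and last name exists. If they do exist, I check each car for a match; if that car already exists, there is no reason to add this record. If the car does not exist, I push it onto the matching person's car array. If no person matches, I save the whole new record."

Ah, what you want is an update with upsert:

Replace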

Person.find({first_name: "adam", last_name: "snider"}).exec(function(e, d){
  //matched? check {first_name: "adam", last_name: "snider", car.make: "honda", car.model: "civic"}

  //no match? create this record (or push to array if this is a nested array) 

});

with

Person.update(
    {first_name: "adam", last_name: "snider"}, 
    {$push: {car: {make: 'whatever', model: 'whatever2'}}}, 
    {upsert: true}
)

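If a match is found, it will push this subdocument onto the car field, creating the field if it does not exist yet: {make: 'whatever', model: 'whatever2'}.

If no match is found, it will create a new document like this:

{first_name: "adam", last_name: "snider", car: [{make: 'whatever', model: 'whatever2'}]}

This halves your total number of database round trips. However, for even more efficiency you can use an orderedBulkOperation. This results in a single round trip to the database.

This is what it would look like (using ES6 here for brevity... not required):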

const bulk = Person.collection.initializeOrderedBulkOp();
lr.on('line', function(line) {
  const [first_name, last_name, make, model, repair_date, repair_description] = line.split(',');
  // Ensure user exists. Bulk updates are queued through find(); the bulk object has no update() method.
  bulk.find({first_name, last_name}).upsert().updateOne({$set: {first_name, last_name}});

  // Find a user with the existing make and model. This makes sure that if the car IS there, it matches the proper document structure
  bulk.find({first_name, last_name, 'car.make': make, 'car.model': model})
    .updateOne({$set: {'car.$.repair.date': repair_date, 'car.$.repair.description': repair_description}});

  // Now, if the car wasn't there, let's add it to the set. This will not push if we just updated because it should match exactly now.
  bulk.find({first_name, last_name})
    .updateOne({$addToSet: {car: {make, model, repair: {date: repair_date, description: repair_description}}}});
});

// Nothing is sent until execute() is called.
// Assumption: lr emits 'end' once the last line has been read.
lr.on('end', function() {
  bulk.execute().catch(function(err) { console.error(err); });
});
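
For the deeper nesting raised in this thread, one possible variation is to build three conditional updates per row and send them together. This is only a sketch and has not been run against a live database. buildOps is a hypothetical helper, it assumes that car and car.repair are both stored as arrays, and it uses $elemMatch so that make and model have to match on the same array element. Only the operation objects are built here, in the shape accepted by bulkWrite.

```javascript
// Hypothetical helper: build bulkWrite operations for one mapped CSV record.
// Assumes `car` and `car.repair` are arrays in the stored documents.
function buildOps(rec) {
  const { first_name, last_name } = rec;
  const { make, model } = rec.car;
  const repair = rec.car.repair;
  const who = { first_name, last_name };

  return [
    // 1. Create the person if missing. On insert, the equality fields of the
    //    filter are copied into the new document.
    { updateOne: {
        filter: who,
        update: { $setOnInsert: { car: [] } },
        upsert: true } },

    // 2. Push the car only when no array element has this make and model.
    { updateOne: {
        filter: Object.assign({}, who, {
          car: { $not: { $elemMatch: { make, model } } } }),
        update: { $push: { car: { make, model, repair: [] } } } } },

    // 3. Push the repair only when that car does not already contain it.
    //    The positional `$` refers to the car element matched by $elemMatch.
    { updateOne: {
        filter: Object.assign({}, who, {
          car: { $elemMatch: {
            make, model,
            repair: { $not: { $elemMatch: repair } } } } }),
        update: { $push: { 'car.$.repair': repair } } } }
  ];
}

const ops = buildOps({
  first_name: 'adam', last_name: 'snider',
  car: { make: 'honda', model: 'civic',
         repair: { date: '2016-02-01', description: 'oil change' } }
});
console.log(ops.length); // 3
// With a mongoose model these could then be sent in order, e.g. Person.bulkWrite(ops).
```

Each extra level of nesting would add one more conditional push, so the number of operations grows with depth, but the operations for many rows can be concatenated into one array and sent in a single call.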

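【Comments】:

I was looking at this feature (update + upsert) earlier. How does this work with a deeper set of nested schemas? For example, what if car already exists but has a repairs array linked to another schema that needs to be pushed to? If there were an example going one level deeper, I could probably adapt it to my more abstract use case.
I don't fully understand your question, but if the car field holds an array of objects, the upsert will simply push the make and model into that same array, regardless of the array's contents
The first paragraph of that section reads like gibberish to me... Could you provide some more concrete information on what these structures look like and how they affect the data import? Regarding asynchronicity, this is an orderedBulkOperation, which means it runs in the order in which you queue the operations
I understand now. You can include multiple conditions in the find, e.g. .find({first_name, last_name, 'car.model': car_model, 'car.make': car_make})
So I would still need to run a query for each level of nested document, but at least it cuts out one round trip. I posted my current code above.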