Working with large dataset imports in Ruby/Rails

【Question title】: Working with large dataset imports in Ruby/Rails
【Posted】: 2017-08-17 09:06:49
【Question】:

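I'm currently working on a Ruby/Rails project that imports invoices into the database. I'm trying to make the process as efficient as possible, because right now it is far too slow.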

For an import batch of 100,000 rows, processing and saving every record to the database takes roughly 2.5 to 3 hours.

//// Ruby code

class DeleteImportStrategy
  def pre_process(merchant_prefix, channel_import)
    # The channel is needed to identify invoices, so that an import from another
    # channel cannot collide with this one if both use the same merchant_prefix.
    Jzbackend::Invoice.where(merchant_prefix: merchant_prefix, channel: channel_import.channel).delete_all
    # Get rid of all previous import batches, which become empty after the delete above.
    Jzbackend::Import.where.not(id: channel_import.id).where(channel: channel_import.channel).destroy_all
  end

  # Builds and saves one Invoice per CSV row (one INSERT per row).
  def process_row(row, channel_import)
    debt_claim = Jzbackend::Invoice.new
    debt_claim.import = channel_import
    debt_claim.status = 'pending'
    debt_claim.channel = channel_import.channel
    debt_claim.merchant_prefix = row[0]
    debt_claim.debt_claim_number = row[1]
    debt_claim.amount = Monetize.parse(row[2])
    debt_claim.print_date = row[3]
    debt_claim.first_name = row.try(:[], 4)
    debt_claim.last_name = row.try(:[], 5)
    debt_claim.address = row.try(:[], 6)
    debt_claim.postal_code = row.try(:[], 7)
    debt_claim.city = row.try(:[], 8)
    debt_claim.save
  end
end

////

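So for every import batch that arrives as a CSV, I delete the previous batches and then import the new one by reading each row and inserting it into the new Import as an Invoice record. Still, 2.5-3 hours for 100,000 entries seems excessive. How can I optimize this process? I'm sure this approach is far from efficient.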

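EDIT: It has been a long time since I posted this, but for the record, I ended up using the activerecord-import library, which has worked well ever since. Note, however, that on PostgreSQL its :on_duplicate_key_update feature is only available in v9.5+.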
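As a rough illustration of the approach mentioned in the edit, here is a minimal sketch of feeding rows to activerecord-import in slices. The helper name `import_in_batches`, the batch size and the `path` variable are hypothetical and not from the question; the only library call relied on is the gem's `Model.import(records, options)`.

```ruby
# Hypothetical helper (not from the question): turns rows into unsaved records
# via the given block and bulk-inserts them one slice at a time.
# `model` is expected to respond to `import(records, options)`, as an
# ActiveRecord model does once the activerecord-import gem is loaded.
def import_in_batches(model, rows, batch_size: 1_000)
  imported = 0
  rows.each_slice(batch_size) do |slice|
    records = slice.map { |row| yield(row) } # the block builds one unsaved record
    # validate: false skips per-record validations; drop it if you need them.
    model.import(records, validate: false)
    imported += records.size
  end
  imported
end

# Intended usage inside the Rails app (assumes the question's models and a
# hypothetical `path` to the CSV file):
#
#   import_in_batches(Jzbackend::Invoice, CSV.foreach(path)) do |row|
#     Jzbackend::Invoice.new(merchant_prefix: row[0], debt_claim_number: row[1],
#                            status: 'pending', import: channel_import,
#                            channel: channel_import.channel)
#   end
```

Building the records through a block keeps the per-row mapping from process_row in one place, while the database only sees one multi-row INSERT per slice.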

【Question comments】:

Tags: ruby-on-rails ruby postgresql activerecord benchmarking

【Answer 1】:

First rule of bulk imports: batch, batch, batch.

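You're saving each row separately, which incurs HUGE overhead. Say the insert itself takes 1 ms, but the round trip to the database takes 5 ms. That is 6 ms per record in total. For 1000 records, that adds up to 6000 ms, or 6 seconds.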

Now suppose you use a bulk insert, where you send the data for multiple rows in a single statement. It looks like this:

INSERT INTO users (name, age)
VALUES ('Joe', 20), ('Moe', 22), ('Bob', 33), ...
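A statement of this shape can be assembled from Ruby rows along the following lines. This is only a sketch: `sql_literal` and `multi_row_insert` are hypothetical helper names, and the hand-rolled quoting is for illustration. Real code should use the database adapter's own quoting or a library that builds the statement for you.

```ruby
# Quote a Ruby value as a SQL literal. Illustration only: real code should
# rely on the database adapter's quoting instead of this helper.
def sql_literal(value)
  case value
  when nil     then 'NULL'
  when Numeric then value.to_s
  else "'#{value.to_s.gsub("'", "''")}'" # double any embedded single quotes
  end
end

# Build one INSERT statement that carries every row in `rows`.
def multi_row_insert(table, columns, rows)
  tuples = rows.map { |row| "(#{row.map { |v| sql_literal(v) }.join(', ')})" }
  "INSERT INTO #{table} (#{columns.join(', ')}) VALUES #{tuples.join(', ')}"
end

rows = [['Joe', 20], ['Moe', 22], ['Bob', 33]]
puts multi_row_insert('users', %w[name age], rows)
# prints: INSERT INTO users (name, age) VALUES ('Joe', 20), ('Moe', 22), ('Bob', 33)
```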

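Let's assume you send the data for 1000 rows in this one request. The request itself takes 1000 ms (in reality it will likely be considerably quicker, since there is less overhead for parsing the query, preparing the execution plan, and so on). The total time is 1000 ms + 5 ms. That is about a 6x reduction at the very least! (In real projects of mine, I have observed 100x to 200x reductions.)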
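The arithmetic above can be written out as a small model. The 1 ms insert cost and 5 ms round trip are the answer's illustrative figures, not measurements, and the method names are made up for this sketch.

```ruby
INSERT_MS     = 1 # assumed cost of inserting one row
ROUND_TRIP_MS = 5 # assumed cost of one round trip to the database

# One statement (and therefore one round trip) per row.
def per_row_ms(rows)
  rows * (INSERT_MS + ROUND_TRIP_MS)
end

# One statement per batch: the insert cost stays the same, but the
# round-trip cost is paid once per batch instead of once per row.
def batched_ms(rows, batch_size)
  batches = (rows.to_f / batch_size).ceil
  rows * INSERT_MS + batches * ROUND_TRIP_MS
end

puts per_row_ms(1_000)        # prints 6000
puts batched_ms(1_000, 1_000) # prints 1005
```

Under this model the saving comes entirely from the round trips, which is why it understates the real-world gain the answer reports.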

【Comments】:

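Thanks! What about deleting the previous records in my pre_process method above? Should I handle it the same way?
@SahilGadimbayli: Yes, delete those in bulk too. A single DELETE FROM users will do the job, but it might block other work for too long. If that's the case, delete in batches.
@SahilGadimbayli: Yes, it should be about the same. It must be something else you did.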
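The batched delete suggested in the comments could be structured roughly as below. This is a sketch under assumptions: `delete_in_batches` is a hypothetical helper, and the block stands in for whatever bounded DELETE the application runs (it must return the number of rows it removed). On Rails 5 or later, ActiveRecord's `in_batches(of: n).delete_all` on a relation is a built-in way to get a similar effect.

```ruby
# Hypothetical helper: repeatedly asks the block to delete at most
# `batch_size` rows and stops once a batch comes back short.
# The block must return the number of rows it actually deleted.
def delete_in_batches(batch_size: 10_000)
  total = 0
  loop do
    deleted = yield(batch_size)
    total += deleted
    break if deleted < batch_size # a short batch means nothing is left
  end
  total
end

# Stand-in for the database so the sketch runs on its own:
remaining = 25
total = delete_in_batches(batch_size: 10) do |limit|
  deleted = [limit, remaining].min
  remaining -= deleted
  deleted
end
puts total # prints 25
```

Each iteration is a short statement of its own, so locks are held only briefly and other work can proceed between batches.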