Reading a CSV file with multi-line strings in Scala (answers)

【Question title】: Reading CSV file with multi line strings in Scala
【Posted】: 2019-09-16 15:05:00
【Question description】:

I have a CSV file that I want to read line by line. The problem is that some cell values are quoted and contain line breaks.

Here is a sample CSV:

Product,Description,Price
Product A,This is Product A,20
Product B,"This is much better
than Product A",200

The standard getLines() function cannot handle this.

Source.fromFile(inputFile).getLines()  // will split at every line break, regardless if quoted or not

With getLines (and each line split on commas) the result looks like this:

Array("Product", "Description", "Price")
Array("Product A", "This is Product A", "20")
Array("Product B", "\"This is much better")
Array("than Product A\"", "200")

But it should look like this:

Array("Product", "Description", "Price")
Array("Product A", "This is Product A", "20")
Array("Product B", "\"This is much better\nthan Product A\"", "200")

I tried reading the whole file at once and splitting it with a regex similar to the one in this post: https://stackoverflow.com/a/31193505

file.mkString.split("""\n(?=(?:[^"]*"[^"]*")*[^"]*$)""")

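The regex works fine, but I get a stack overflow exception because the file is too large to be processed in memory all at once. I tried it with a smaller version of the file and it worked.

As stated in the post, foldLeft() could help with bigger files. But I am not sure how it should work when iterating over every Char of the string, since I would have to pass along all of these at once...

the Char of the current iteration
the line currently being built
and the list of lines created so far

Maybe writing my own tail-recursive version of getLines would work, but I am not sure whether there is a more practical solution than processing the input char by char.

Do you see any other functional-style solution to this problem?

Thanks and regards, Felix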

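【Question discussion】:

Tags: scala csv line-breaks

【Solution 1】:

The simplest answer is to find an external library that does this for you!

If that is not an option for you, the foldLeft solution is, IMO, the best functional-style approach. Here is a simple version: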

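  import scala.io.Source

  val lines = Source.fromFile(inputFile).getLines()

  // State: (completed records, partial record whose quoted value is still open)
  lines.foldLeft[(Seq[String], String)]((Nil, "")) {
    case ((accumulatedLines, accumulatedString), newLine) => {
      val isInAnOpenString = accumulatedString.nonEmpty
      val lineHasOddQuotes = newLine.count(_ == '"') % 2 == 1
      (isInAnOpenString, lineHasOddQuotes) match {
        // odd quote count closes the open value; re-insert the line break getLines() removed
        case (true, true) => (accumulatedLines :+ (accumulatedString + "\n" + newLine)) -> ""
        // still inside the open value: keep accumulating
        case (true, false) => accumulatedLines -> (accumulatedString + "\n" + newLine)
        // odd quote count opens a value that continues on the next line
        case (false, true) => accumulatedLines -> newLine
        // ordinary single-line record
        case (false, false) => (accumulatedLines :+ newLine) -> ""
      }
    }
  }._1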

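Note that this version does not handle many special cases, such as a row that contains several multi-line values, but it should give you a good starting point.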
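One way to cover the case of several multi-line values in a single row is to fold over characters instead of lines and track whether the current position is inside quotes. The following is only a minimal sketch under some assumptions: `splitRecords` is a hypothetical helper name, the input uses `\n` line endings, and quotes inside values are either absent or doubled (`""`), so that every quote character can simply toggle the state.

```scala
// Hypothetical helper, not part of the answer above: splits raw CSV text into
// records, breaking only on newlines that occur outside double quotes.
def splitRecords(chars: Iterator[Char]): Vector[String] = {
  // State: (completed records, record being built, are we inside quotes?)
  val (done, current, _) =
    chars.foldLeft((Vector.empty[String], new StringBuilder, false)) {
      // every quote character toggles the inside-quotes flag
      case ((records, sb, inQuotes), '"') =>
        (records, sb.append('"'), !inQuotes)
      // a newline outside quotes ends the current record
      case ((records, sb, false), '\n') =>
        (records :+ sb.toString, new StringBuilder, false)
      // any other char, including a newline inside quotes, is kept
      case ((records, sb, inQuotes), c) =>
        (records, sb.append(c), inQuotes)
    }
  // keep a trailing record that is not terminated by a newline
  if (current.nonEmpty) done :+ current.toString else done
}
```

Since a `scala.io.Source` is itself an `Iterator[Char]`, this could be called as `splitRecords(Source.fromFile(inputFile))`. The mutable `StringBuilder` in the fold state is a pragmatic choice to avoid repeated string concatenation; the completed records are still collected in memory.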

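The main idea is to foldLeft over pretty much everything you need to keep in memory, and to change your state progressively from there.

As you can see, you can put as much logic as you need inside a foldLeft. In this case I added extra booleans and a nested match to improve readability.

So my advice is: foldLeft, and don't panic!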

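【Discussion】:

Thanks @C4stor, that looks great - I really like your suggestion. I had not really used the "->" operator before (afaik it is only used for Maps). But after seeing the power of foldLeft in your example, I modified it a bit to read char by char, so that it also works with several quoted line breaks within one row.
-> is used to form tuples, so a->b is equivalent to (a,b). It is useful for avoiding piles of parentheses everywhere.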
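The equivalence mentioned in the comment can be checked directly. This is a small illustrative sketch (the value names are made up for the example); `->` is provided by `Predef` through `ArrowAssoc` and is not specific to `Map`.

```scala
// `->` builds a plain Tuple2, exactly like the parenthesised form
val viaArrow: (Int, String) = 1 -> "one"
val viaParens: (Int, String) = (1, "one")
assert(viaArrow == viaParens)

// The same operator works for any pair, e.g. a fold state
val state: (Seq[String], String) = Seq("done") -> "pending"

// `->` binds tighter than `:+`, so the left side needs parentheses
val appended: (Seq[String], String) = (Seq("a") :+ "b") -> ""
```

Without the parentheses in the last line, `Seq("a") :+ "b" -> ""` would append the tuple `("b", "")` to the sequence, which is why the answer wraps the `:+` expressions.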

【Solution 2】:

I wonder whether the new (Scala 2.13) unfold() could be put to good use here.

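// "file" has been opened
val lines = Iterator.unfold(file.getLines()) { itr =>
  Option.when(itr.hasNext) {
    val sb = new StringBuilder(itr.next())
    // an odd number of quotes means a quoted value is still open: pull in the next line
    while (itr.hasNext && sb.count(_ == '"') % 2 > 0)
      sb.append("\\n" + itr.next())  // literal backslash-n, so each record prints on one line
    (sb.toString, itr)
  }
}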

Now you can iterate over the contents as needed.

lines.foreach(println)
//Product,Description,Price
//Product A,This is Product A,20
//Product B,"This is much better\nthan Product A",200
//Product C,a "third rate" product,5

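Note that this is very simple: it just counts all quote marks, looking for an even number. It does not treat an escaped quote \" any differently, but it should not be too hard to use a regex so that only unescaped quotes are counted.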
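As a rough sketch of that regex idea, the quote count could skip quotes that directly follow a backslash. `countUnescapedQuotes` is a hypothetical helper name, and the sketch assumes backslash escaping only; it would miscount a quote preceded by an escaped backslash (`\\"`). CSV files that escape quotes by doubling them (`""`) need no special handling, because a doubled quote does not change the parity.

```scala
// Regex: a double quote that is NOT immediately preceded by a backslash
// (negative lookbehind for '\', then a literal '"')
val unescapedQuote = "(?<!\\\\)\"".r

// Hypothetical helper: number of unescaped double quotes in a line
def countUnescapedQuotes(line: String): Int =
  unescapedQuote.findAllIn(line).size

// A quoted value is still open when the count is odd
def hasOpenQuote(line: String): Boolean =
  countUnescapedQuotes(line) % 2 == 1
```

In the unfold() version, `hasOpenQuote(sb.toString)` could then replace the `sb.count(_ == '"') % 2 > 0` check.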

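Since we are using iterators, this should be memory efficient and handle files of any size, as long as there is no stray unmatched quote that causes the rest of the file to be read in as a single line of text.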
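On Scala versions before 2.13, where Iterator.unfold is not available, the same lazy grouping could be sketched as a hand-written iterator. `RecordIterator` is a hypothetical name, and unlike the snippet above it joins lines with a real line break rather than a literal `\n`; it shares the same limitation of only looking at quote parity.

```scala
// Hypothetical wrapper: lazily merges physical lines into logical CSV records.
class RecordIterator(lines: Iterator[String]) extends Iterator[String] {
  def hasNext: Boolean = lines.hasNext

  def next(): String = {
    val first = lines.next()
    val sb = new StringBuilder(first)
    // an odd number of quotes means a quoted value is still open
    var open = first.count(_ == '"') % 2 == 1
    while (open && lines.hasNext) {
      val more = lines.next()
      sb.append('\n').append(more)
      // another odd-quote line closes the open value
      if (more.count(_ == '"') % 2 == 1) open = false
    }
    sb.toString
  }
}
```

It would be used as `new RecordIterator(source.getLines())`, where `source` is an opened `scala.io.Source` that the caller closes after iterating. Only one record is buffered at a time, and the quote count is taken per line rather than over the whole buffer.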

【Discussion】:

【Solution 3】:

You can use a third-party library for this, such as opencsv

Maven repo -> https://mvnrepository.com/artifact/au.com.bytecode/opencsv/2.4

Code examples -> https://www.programcreek.com/java-api-examples/au.com.bytecode.opencsv.CSVReader

【Discussion】: