Apache CSV parser is not working on tab-delimited data with quotation marks

【Question title】: Apache CSV parser is not working on tab delimited data with quotation marks
【Posted】: 2017-02-28 22:22:23
【Question description】:

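I want to parse a Google ebook transaction report. I opened it in Notepad++ to see exactly what the field and record separators are. It is a tab-delimited file, and every header field and data field is enclosed in quotation marks. The first two lines of the CSV file are: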

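"Transaction Date" "Id" "Product" "Type" "Preorder" "Qty" "Primary ISBN" "Imprint Name" "Title" "Author" "Original List Price Currency" "Original List Price" "List Price Currency" "List Price [tax inclusive]" "List Price [tax exclusive]" "Country of Sale" "Publisher Revenue %" "Publisher Revenue" "Payment Currency" "Payment Amount" "Currency Conversion Rate"
"2016. 09. 01." "ID:1166315449551685" "Single Purchase" "Sale" "None" "1" "9789633780664" "Book and Walk Kft" "Bánk bán" "József Katona" "HUF" "0,00" "HUF" "0,00 " "0,00" "HU" "52,0%" "0,00" "" "" ""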

I use the following code to parse the CSV file:

private List<Sales> parseCsv(File csv) {
    Calendar max = Calendar.getInstance();
    Calendar current = Calendar.getInstance();
    boolean firstRound = true;

    List<Sales> sales = new ArrayList<>();
    Sales currentRecord;
    Reader in;
    try {
        in = new FileReader(csv);
        Iterable<CSVRecord> records;

        try {

            records = CSVFormat.TDF.withQuote('\"').withFirstRecordAsHeader().parse(in);
            for (CSVRecord record : records) {
                currentRecord = new Sales();
                currentRecord.setAuthor(record.get("Author"));
                currentRecord.setTitle(record.get("Title"));
                currentRecord.setPublisher(record.get("Imprint Name"));
                currentRecord.setIsbn(record.get("Primary ISBN"));
                currentRecord.setChannel("Google");
                currentRecord.setBookId(record.get("Id"));
                currentRecord.setCountry(record.get("Country of Sale"));
                currentRecord.setUnits(Integer.parseInt(record.get("Qty")));
                currentRecord.setUnitPrice(Float.parseFloat(record.get("List Price [tax exclusive]")));

                Date transDate;
                try {
                    transDate = sourceDateFormat.parse(record.get("Transaction Date"));
                    if (firstRound) {
                        max.setTime(transDate);
                        firstRound = false; // only the first record should seed max
                    }
                    current.setTime(transDate);
                    if (current.after(max)) {
                        max.setTime(current.getTime());
                    }
                    currentRecord.setDatum(transDate);
                } catch (ParseException e) {
                    // TODO Auto-generated catch block
                    LOG.log(Level.SEVERE,"Nem megfeelő formátumú a dátum a {0} file-ban",csv.getAbsolutePath());
                }

                currentRecord.setCurrencyCustomer(record.get("List Price Currency"));
                currentRecord.setCurrencyProceeds(record.get("Payment Amount"));
                currentRecord.setCurrencyProceeds(record.get("Payment Currency"));
                sales.add(currentRecord);
            }
            LOG.log(Level.INFO, "Daily sales transactions of {0} were successfully parsed from ",
                    csv.getAbsolutePath());
            return sales;
        } catch (IOException e1) {
            // TODO Auto-generated catch block
            LOG.log(Level.SEVERE, "Valami nem stimmel a {0} file szerkezetével",csv.getAbsolutePath());
        }
    } catch (FileNotFoundException e1) {
        // TODO Auto-generated catch block
        LOG.log(Level.SEVERE,"A {0} file-t nem találom.",csv.getAbsolutePath());
    }
    return null;
}

When I debug the parsing, I can see that record.get("Author") throws a runtime exception:

java.lang.IllegalArgumentException: Mapping for Author not found, expected one of [��"

Obviously I do have a column named Author. Any idea what went wrong?

【Comments】:

Try supplying the charset when parsing: .parse(in, StandardCharsets.UTF_8)

Tags: java apache-commons-csv

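【Solution 1】:

Converting this into a unit test and running it against the current commons-csv version, 1.4, works fine for me, so:

Check that you are on the latest version of commons-csv
Make sure the file really contains TABs, and not, for some reason, spaces around the Author entry
Specify the file's actual encoding when calling parse() so that non-ASCII characters are handled correctly (thanks to the comments from @tonakai)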
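
One way to check the second and third points without relying on how an editor renders the file is to look at its first raw bytes. The sketch below is only an illustration using the JDK; FirstBytes, hex and bomName are hypothetical names, not part of commons-csv. In the dump a tab shows up as 09, and a leading FF FE suggests a UTF-16 little-endian byte order mark (UTF-32 marks are not checked here).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical diagnostic helper (not part of commons-csv): shows the first bytes of a file in hex.
final class FirstBytes {

    /** Formats at most `limit` leading bytes as space-separated two-digit hex values. */
    static String hex(byte[] bytes, int limit) {
        StringBuilder sb = new StringBuilder();
        int n = Math.min(limit, bytes.length);
        for (int i = 0; i < n; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    /** Names the byte order mark at the start of the data, or returns "none". */
    static String bomName(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        return "none";
    }

    public static void main(String[] args) throws IOException {
        // Reads the whole file for simplicity, which is fine for a small report
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("BOM: " + bomName(bytes));
        System.out.println(hex(bytes, 32));
    }
}
```

Run it with the report's path as the only argument. If every second byte is 00, the file is most likely UTF-16 rather than an 8-bit encoding.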

The following unit test works with commons-csv 1.4:

private final static String DATA = "\"Transaction Date\"\t\"Id\"\t\"Product\"\t\"Type\"\t\"Preorder\"\t\"Qty\"\t\"Primary ISBN\"\t\"Imprint Name\"\t\"Title\"\t\"Author\"\t\"Original List Price Currency\"\t\"Original List Price\"\t\"List Price Currency\"\t\"List Price [tax inclusive]\"\t\"List Price [tax exclusive]\"\t\"Country of Sale\"\t\"Publisher Revenue %\"\t\"Publisher Revenue\"\t\"Payment Currency\"\t\"Payment Amount\"\t\"Currency Conversion Rate\"\n" +
        "\"2016. 09. 01.\"\t\"ID:1166315449551685\"\t\"Single Purchase\"\t\"Sale\"\t\"None\"\t\"1\"\t\"9789633780664\"\t\"Book and Walk Kft\"\t\"Bánk bán\"\t\"József Katona\"\t\"HUF\"\t\"0,00\"\t\"HUF\"\t\"0,00\"\t\"0,00\"\t\"HU\"\t\"52,0%\"\t\"0,00\"\t\"\"\t\"\"\t\"\"";

@Test
public void parseCsv() throws IOException {
    final CSVFormat format = CSVFormat.TDF.withQuote('\"').withFirstRecordAsHeader();
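    // CSVFormat.parse returns a CSVParser, which is itself an Iterable<CSVRecord>
    CSVParser records = format.parse(new StringReader(DATA));

    // The header names are read from the first record, so ask the parser for them rather than the format
    System.out.println("Headers: " + records.getHeaderMap().keySet());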

    for (CSVRecord record : records) {
        assertNotNull(record.get("Author"));
        assertNotNull(record.get("Title"));
        assertNotNull(record.get("Imprint Name"));
        assertNotNull(record.get("Primary ISBN"));
        assertNotNull(record.get("Id"));
        assertNotNull(record.get("Country of Sale"));
        assertNotNull(record.get("Qty"));
        assertNotNull(record.get("List Price [tax exclusive]"));

        assertNotNull(record.get("Transaction Date"));

        assertNotNull(record.get("List Price Currency"));
        assertNotNull(record.get("Payment Amount"));
        assertNotNull(record.get("Payment Currency"));

        System.out.println("Record: " + record.toString());
    }
}

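【Comments】:

I checked, and I am using version 1.4. What is strange is that when I fetch my Maven dependencies in my Eclipse IDE and dig into commons-csv-1.4.jar, I see that the parse method has three different signatures, but my IDE's autocomplete offers only the basic one.
That may be because you are passing in a Reader; the other signatures do not take a Reader. If you want to use a Reader, try building it without FileReader, as described in docs.oracle.com/javase/7/docs/api/java/io/FileReader.html: use an InputStreamReader over a FileInputStream.
I am using version 1.4, and I also checked that all fields are separated by tabs. I checked that with Excel and Notepad++. I also applied @tonakai's comment to handle non-ASCII characters correctly: CSVFormat csvFormat = CSVFormat.TDF.withQuote('\"').withFirstRecordAsHeader(); records = CSVParser.parse(csv, StandardCharsets.UTF_8, csvFormat); I still get the same error as before.
Then you need to narrow the problem down. First run the unit test I posted and see whether it works, then reduce your non-working code so that it becomes more and more like the unit test. That way you will see at which step it breaks.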
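
The suggestion to build the Reader from an InputStreamReader over a FileInputStream can be sketched as follows. In the Java versions of that time FileReader gave no way to choose the charset and always used the platform default, whereas InputStreamReader lets the caller name one. The names CharsetReaders, open and firstLine are made up for this illustration; the Reader returned by open is the kind of object that would be handed to parse().

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

// Hypothetical helper, for illustration only
final class CharsetReaders {

    /** Opens the file with an explicitly chosen charset instead of the platform default. */
    static BufferedReader open(File file, Charset charset) throws IOException {
        return new BufferedReader(new InputStreamReader(new FileInputStream(file), charset));
    }

    /** Reads just the first line, which is where the header names live. */
    static String firstLine(File file, Charset charset) throws IOException {
        try (BufferedReader reader = open(file, charset)) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-16LE sample (without a byte order mark) to a temporary file
        File tmp = File.createTempFile("report", ".tsv");
        tmp.deleteOnExit();
        String header = "\"Title\"\t\"Author\"";
        String sample = header + "\n\"B\u00E1nk b\u00E1n\"\t\"J\u00F3zsef Katona\"\n";
        Files.write(tmp.toPath(), sample.getBytes(StandardCharsets.UTF_16LE));

        // Decoding with the matching charset recovers the header line
        System.out.println(firstLine(tmp, StandardCharsets.UTF_16LE).equals(header));
        // Decoding with a mismatched charset leaves NUL characters between the letters,
        // so a lookup by header name could not succeed
        System.out.println(firstLine(tmp, StandardCharsets.UTF_8).equals(header));
    }
}
```

The point of the design is that the charset becomes a visible parameter, which makes it easy to try UTF-8, UTF-16LE and so on while narrowing the problem down.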

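【Solution 2】:

It turned out that the encoding was the root of the problem. Following @tonakai's comment, I started analyzing the encoding of the Google CSV report. It is UTF-16 Little Endian. Since my file contains a byte order mark, I had to use BOMInputStream and refactor my code a bit.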

Reader r = newReader(csv);
CSVParser csvParser= CSVFormat.TDF.withFirstRecordAsHeader().withQuoteMode(QuoteMode.ALL).parse(r);

.....

private InputStreamReader newReader(final File csv) throws FileNotFoundException {
    // BOMInputStream (Apache Commons IO) detects the UTF-16LE byte order mark and skips it
    return new InputStreamReader(new BOMInputStream(new FileInputStream(csv), ByteOrderMark.UTF_16LE), StandardCharsets.UTF_16LE);
}

It works now.
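
To see why the byte order mark matters, the following JDK-only sketch decodes the same bytes three ways. It is an illustration, not the code used above: BomDemo and its methods are hypothetical names, and it assumes the report starts with the bytes FF FE followed by UTF-16LE text. Decoded as UTF-8, the two mark bytes are invalid and become two U+FFFD replacement characters, which would resemble the two unreadable characters in front of the quote in the exception message if the original FileReader defaulted to UTF-8 (an assumption). Decoded as UTF-16LE, the mark survives as an invisible U+FEFF in front of the first header name. Java's generic UTF-16 charset consumes the mark, much as BOMInputStream does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of commons-csv or commons-io
final class BomDemo {

    /** Encodes text as UTF-16LE preceded by a byte order mark (bytes FF FE). */
    static byte[] utf16leWithBom(String text) {
        return ("\uFEFF" + text).getBytes(StandardCharsets.UTF_16LE);
    }

    /** Decodes the bytes with the given charset and returns the first tab-separated field. */
    static String firstField(byte[] bytes, Charset charset) {
        return new String(bytes, charset).split("\t", 2)[0];
    }

    public static void main(String[] args) {
        byte[] bytes = utf16leWithBom("\"Transaction Date\"\t\"Id\"");
        String expected = "\"Transaction Date\"";

        // UTF-16LE keeps the mark as a leading U+FEFF, so the field differs from the expected name
        System.out.println(firstField(bytes, StandardCharsets.UTF_16LE).equals(expected));
        // UTF-16 uses the mark to pick the byte order and then drops it
        System.out.println(firstField(bytes, StandardCharsets.UTF_16).equals(expected));
        // UTF-8 cannot decode FF or FE and substitutes a replacement character for each
        System.out.println(firstField(bytes, StandardCharsets.UTF_8).startsWith("\uFFFD\uFFFD\""));
    }
}
```

For a file that is known to carry a UTF-16 mark, passing StandardCharsets.UTF_16 to InputStreamReader is a possible JDK-only alternative; BOMInputStream is the more general tool when the encoding has to be detected from the mark.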

【Comments】: