Parsing a .csv file using Java 8 Stream: answers

【Question title】: Parsing .csv file using Java 8 Stream
【Posted】: 2018-04-04 21:22:22
【Question description】:

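I have a .csv file containing data on over 500 companies. Each row in the file refers to one particular company's data set. I need to parse this file and extract data from each row in order to call 4 different web services.

The first line of the .csv file contains the column names. I am trying to write a method that takes a String parameter, which corresponds to a column title in the .csv file.

Based on this parameter, I want the method to parse the file using Java 8's Stream functionality and return a list of the data taken from that column for each row/company.

I feel like I am making this more complicated than it needs to be, but I cannot think of a more efficient way to achieve my goal.

Any thoughts or ideas would be greatly appreciated.

Searching Stack Overflow, I found the following post, which is similar but not quite the same: Parsing a CSV file for a unique row using the new Java 8 Streams API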

    public static List<String> getData(String titleToSearchFor) throws IOException{
    Path path = Paths.get("arbitoryPath");
    int titleIndex;
    String retrievedData = null;
    List<String> listOfData = null;

    if(Files.exists(path)){ 
        try(Stream<String> lines = Files.lines(path)){
            List<String> columns = lines
                    .findFirst()
                    .map((line) -> Arrays.asList(line.split(",")))
                    .get();

            titleIndex = columns.indexOf(titleToSearchFor);

            List<List<String>> values = lines
                    .skip(1)
                    .map(line -> Arrays.asList(line.split(",")))
                    .filter(list -> list.get(titleIndex) != null)
                    .collect(Collectors.toList());

            String[] line = (String[]) values.stream().flatMap(l -> l.stream()).collect(Collectors.collectingAndThen(
                    Collectors.toList(), 
                    list -> list.toArray()));
            String value = line[titleIndex];
            if(value != null && value.trim().length() > 0){
                retrievedData = value;
            }
            listOfData.add(retrievedData);
        }
    }
    return listOfTitles;
}

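Thanks

【Comments】:

Your code has a lot of problems; did you compile it?
Yes, I compiled it in Eclipse and got no compile errors. I don't currently have access to the csv file, so I haven't been able to test it properly yet.

Tags: java csv java-8 java-stream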

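【Solution 1】:

You shouldn't reinvent the wheel; use a common CSV parser library instead. For example, you can just use Apache Commons CSV.

It will handle a lot of things for you and is much more readable. There is also OpenCSV, which is even more powerful and comes with annotation-based mapping to data classes.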

try (Reader reader = Files.newBufferedReader(Paths.get("file.csv"));
     CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT
             .withFirstRecordAsHeader())) {
    for (CSVRecord csvRecord : csvParser) {
        // Access a value by its column header
        String name = csvRecord.get("MyColumn");
        // (..)
    }
}

Edit: In any case, if you really want to do it yourself, take a look at this example.
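
To illustrate the kind of detail a parser library takes care of, here is a small sketch of what a plain `split(",")` does with a quoted field and with trailing empty fields. The class name and the sample rows are made up for the demonstration; they are not taken from the question's file.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the name and the sample rows are illustrative only.
final class NaiveSplitDemo {

    // Splits a CSV row the naive way, exactly as line.split(",") would.
    static List<String> naiveSplit(String row) {
        return Arrays.asList(row.split(","));
    }

    public static void main(String[] args) {
        // Three logical fields; the second is quoted and contains a comma.
        String quoted = "IBM,\"Armonk, NY\",140";
        System.out.println(naiveSplit(quoted).size()); // 4, not 3

        // Four logical fields, the last two empty; split drops trailing empties.
        String trailing = "a,b,,";
        System.out.println(naiveSplit(trailing).size()); // 2, not 4
    }
}
```

If the input is known to contain neither quoted commas nor empty trailing columns, the naive split may be good enough; otherwise a real parser is the safer choice.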

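【Comments】:

Always. Not reinventing the wheel is a must! +1
Completely agree, I shouldn't even try to reinvent it, since better people than me have already done it. Unfortunately, the project I'm working on doesn't actually allow me to import external libraries, and I'm limited to the pre-installed ones. I didn't know Apache had a library for csv files; that will come in handy in the future. Thanks for the info :)
If you need speed, take a look at this CSV parser comparison. univocity-parsers handles edge cases better than the other libraries.
Does it use streaming? What if I have very large files?
@MdFaraz The CSVParser of Apache Commons CSV implements Iterable<CSVRecord>, which can be adapted to the Stream API. Keywords for further research: iterable to stream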
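
The "iterable to stream" hint can be sketched with the JDK's `StreamSupport`. To keep the example free of external libraries, a plain `List` stands in for the `CSVParser` here; the helper name `stream` and the sample values are assumptions for illustration. Whether records are actually read lazily depends on the iterator behind the `Iterable`, which this sketch does not verify for `CSVParser`.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

// Hypothetical helper class; not part of Apache Commons CSV.
final class IterableToStream {

    // Wraps any Iterable in a sequential Stream via its spliterator.
    static <T> Stream<T> stream(Iterable<T> iterable) {
        return StreamSupport.stream(iterable.spliterator(), false);
    }

    public static void main(String[] args) {
        // Stand-in for a CSVParser: a List is just one kind of Iterable.
        Iterable<String> records = Arrays.asList("IBM", "", "Burger King");

        List<String> nonEmpty = stream(records)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());

        System.out.println(nonEmpty); // [IBM, Burger King]
    }
}
```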

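【Solution 2】:

I managed to shorten your snippet a bit.

If I understood correctly, you need all the values of a particular column. The name of that column is given.

The idea is the same, but I improved the reading from the file (it is read once) and removed code duplication (such as line.split(",")) and the unnecessary wrapping in a List (Collectors.toList()).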

// read lines once
List<String[]> lines = lines(path).map(l -> l.split(","))
                                  .collect(toList());

// find the title index
int titleIndex = lines.stream()
                      .findFirst()
                      .map(header -> asList(header).indexOf(titleToSearchFor))
                      .orElse(-1);

// collect needed values
return lines.stream()
            .skip(1)
            .map(row -> row[titleIndex])
            .collect(toList());

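I have 2 tips unrelated to the question:

1. You have hard-coded a URI; it is better to move the value to a constant or add a method parameter.
2. If you check the opposite condition !Files.exists(path) and throw an exception, you can move the main part out of the if clause.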
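
The two tips could be applied roughly as follows. This is only a sketch: the class name `ColumnReader`, the choice of `NoSuchFileException` and `IllegalArgumentException`, and the extra checks for an empty file or an unknown column are assumptions, not part of the answer above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical class name; only getData and titleToSearchFor come from the thread.
final class ColumnReader {

    // Tip 1: the path is a method parameter instead of a hard-coded value.
    static List<String> getData(Path path, String titleToSearchFor) throws IOException {
        // Tip 2: check the opposite condition and fail fast,
        // so the main logic no longer sits inside an if block.
        if (!Files.exists(path)) {
            throw new NoSuchFileException(path.toString());
        }

        // Read the file once and split every line.
        List<String[]> rows = Files.readAllLines(path).stream()
                .map(line -> line.split(","))
                .collect(Collectors.toList());
        if (rows.isEmpty()) {
            throw new IllegalArgumentException("Empty file: " + path);
        }

        int titleIndex = Arrays.asList(rows.get(0)).indexOf(titleToSearchFor);
        if (titleIndex < 0) {
            throw new IllegalArgumentException("Unknown column: " + titleToSearchFor);
        }

        // Assumes every data row has at least titleIndex + 1 fields.
        return rows.stream()
                .skip(1)
                .map(row -> row[titleIndex])
                .collect(Collectors.toList());
    }
}
```

A call such as `ColumnReader.getData(Paths.get("file.csv"), "MyColumn")` would then return the values of that column, with the file location decided by the caller.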

【Comments】:

【Solution 3】:

As usual, you should use Jackson! Check out the docs

If you want Jackson to use the first line as the header information:

public class CsvExample {
    public static void main(String[] args) throws IOException {
        String csv = "name,age\nIBM,140\nBurger King,76";
        CsvSchema bootstrapSchema = CsvSchema.emptySchema().withHeader();
        ObjectMapper mapper = new CsvMapper();
        MappingIterator<Map<String, String>> it = mapper.readerFor(Map.class).with(bootstrapSchema).readValues(csv);
        List<Map<String, String>> maps = it.readAll();
    }
}

Or you can define the schema as a Java object:

public class CsvExample {
    private static class Pojo {
        private final String name;
        private final int age;

        @JsonCreator
        public Pojo(@JsonProperty("name") String name, @JsonProperty("age") int age) {
            this.name = name;
            this.age = age;
        }

        @JsonProperty("name")
        public String getName() {
            return name;
        }

        @JsonProperty("age")
        public int getAge() {
            return age;
        }
    }

    public static void main(String[] args) throws IOException {
        String csv = "name,age\nIBM,140\nBurger King,76";
        CsvSchema bootstrapSchema = CsvSchema.emptySchema().withHeader();
        ObjectMapper mapper = new CsvMapper();
        MappingIterator<Pojo> it = mapper.readerFor(Pojo.class).with(bootstrapSchema).readValues(csv);
        List<Pojo> pojos = it.readAll();
    }
}

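【Comments】:

Unfortunately, I'm working on a project that only allows me to import a limited set of libraries, so I was hoping to do this using just the Java JDK, but this is great for other projects. Thanks

【Solution 4】:

1) You cannot invoke more than one terminal operation on a Stream.
But you invoke two of them: findFirst() to retrieve the column names, and then collect() to collect the row values. The second terminal operation invoked on the Stream will throw an exception.

2) Instead of Stream<String> lines = Files.lines(path), which reads all the lines as a Stream, you should use Files.readAllLines(), which returns a List of String, because you go over the lines twice.
Use the first element to retrieve the column names, and use the whole list to retrieve the value of each line that matches the criteria.

3) You split the retrieval into multiple small steps that you can shorten into a single stream pipeline, which iterates over all the lines, keeps only those that match the condition, and collects them.

It would give something like: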

public static List<String> getData(String titleToSearchFor) throws IOException {
    Path path = Paths.get("arbitoryPath");

    if (Files.exists(path)) {
        List<String> lines = Files.readAllLines(path);

        List<String> columns = Arrays.asList(lines.get(0)
                                                  .split(","));

        int titleIndex = columns.indexOf(titleToSearchFor);

        List<String> values = lines.stream()
                                   .skip(1)
                                   .map(line -> Arrays.asList(line.split(",")))
                                   .map(list -> list.get(titleIndex))
                                   .filter(Objects::nonNull)
                                   .filter(s -> s.trim()
                                                 .length() > 0)
                                   .collect(Collectors.toList());

        return values;
    }

    return new ArrayList<>();

}
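
Point 1 can be reproduced in isolation with an in-memory Stream. The class and method names below are made up for the demonstration, and the three sample lines stand in for the contents of a file.

```java
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical demo class illustrating that a Stream can only be consumed once.
final class StreamReuseDemo {

    // Returns true if reusing the Stream after findFirst() fails.
    static boolean secondTerminalOperationFails() {
        Stream<String> lines = Stream.of("name,age", "IBM,140", "Burger King,76");

        // First terminal operation: consumes the stream.
        lines.findFirst();

        try {
            // Reusing the same stream, as the question's code does.
            lines.skip(1).collect(Collectors.toList());
            return false;
        } catch (IllegalStateException e) {
            // "stream has already been operated upon or closed"
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(secondTerminalOperationFails()); // true
    }
}
```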

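【Comments】:

1 - Of course, that was a silly mistake on my part. Cheers. 2 - I considered this, but this function will be reused for other csv files that may contain 1000 entries, so I was worried about an OutOfMemoryError. 3 - That's another good implementation choice if using readAllLines. Thanks!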
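
For the memory concern raised in this comment, one possible JDK-only alternative is to read the header with `BufferedReader.readLine()` and then stream the remaining lines with `BufferedReader.lines()`, so the lines are processed one at a time and the stream is consumed only once. This is a sketch under assumptions: the class name is hypothetical, fields are split naively on commas, and rows that are too short for the requested column are skipped. The returned list still grows with the number of rows.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical class name; a sketch, not the answer author's implementation.
final class StreamingColumnReader {

    static List<String> getData(Path path, String titleToSearchFor) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path)) {
            // Read the header line directly, without touching a Stream.
            String header = reader.readLine();
            if (header == null) {
                return new ArrayList<>();
            }

            int titleIndex = Arrays.asList(header.split(",")).indexOf(titleToSearchFor);
            if (titleIndex < 0) {
                return new ArrayList<>();
            }

            // lines() continues after the header and reads lazily.
            return reader.lines()
                    .map(line -> line.split(","))
                    .filter(fields -> fields.length > titleIndex) // skip short rows
                    .map(fields -> fields[titleIndex])
                    .filter(value -> !value.trim().isEmpty())
                    .collect(Collectors.toList());
        }
    }
}
```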