Read multiple .csv files and assign column headers: answers

【Question title】: Read multiple .csv files and assign column headers
【Posted】: 2021-09-19 17:16:45
【Question description】:

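I have 70 .csv files of 10 GB each. They were all split from one original file, so only the first file has a real header; in every later file, the line that gets read as column names is simply the data row that follows the last row of the preceding file.

I want to read several of these csv files and assign the first file's column names to each of the following files. I tried vroom and readr, but they return the wrong number of columns. data.table::fread seems to be the only one that works, but it cannot read multiple files in one call unless it is wrapped in a loop.

This is how I tried to read multiple files: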

lapply(files[1:2] ,fread( select=c("family", "species", "occurrenceStatus", "individualCount", "decimalLatitude", "decimalLongitude", "eventDate", "year")))

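But I get this error:

Error in fread(select = c("family", "species", "occurrenceStatus", "individualCount", : Input is empty or only contains BOM or terminal control characters

It works when I remove the select argument, but I only want 8 of the 50 columns. I could drop the extra columns afterwards, perhaps inside a function, but with more than 5 files that takes too much time and memory.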
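
A likely cause of this error is that `fread(select = ...)` is evaluated immediately, with no input file, before `lapply` ever runs. `lapply` expects a function as its second argument, not the result of a call. Below is a minimal sketch of the usual fix. The demo file and the shortened column list are hypothetical stand-ins for the real data.

```r
library(data.table)

# Hypothetical demo file standing in for the real 10 GB splits
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(family  = c("Felidae", "Canidae"),
                  species = c("Felis catus", "Canis lupus"),
                  year    = c(2020L, 2021L),
                  extra   = c(1L, 2L)), tmp)
files <- c(tmp, tmp)

# Shortened, illustrative column list
cols <- c("family", "species", "year")

# Pass a function to lapply so that fread is called once per file path
dt_list <- lapply(files, function(f) fread(f, select = cols))

# Equivalent shorthand: extra arguments after the function are forwarded to it
dt_list2 <- lapply(files, fread, select = cols)
```

Either form calls `fread` once per file with the same `select` vector. Neither addresses the header mismatch between the first file and the rest.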

I also tried:

species <- rbindlist(Map(fread, file = files[1:2],
              select = c("family", "species", "occurrenceStatus", "individualCount", "decimalLatitude", "decimalLongitude", "eventDate", "year")))

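Which gives this error:

Error in rbindlist(Map(fread, file = flt, select = c("family", "species", : Class attribute on column 1 of item 7 does not match with column 1 of item 1.

I assume this is because the files after the first one have different column names, as described above. Any ideas on how to solve this efficiently?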
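
Another factor that probably contributes here is that `Map` pairs its arguments element by element and recycles the shorter one. With 2 files and 8 column names, `fread` would be called 8 times with a single column name each, which would be consistent with the error mentioning "item 7". The sketch below shows the recycling with a toy function instead of `fread`; the file names are hypothetical and are never opened.

```r
# Hypothetical inputs; the files are not read, only their names are used
demo_files <- c("a.csv", "b.csv")
demo_cols  <- c("family", "species", "year", "eventDate")

# Element-wise pairing: the 2 file names are recycled against 4 column names
paired <- Map(function(file, select) paste(file, select),
              file = demo_files, select = demo_cols)
length(paired)  # 4 calls, each with one column name

# MoreArgs hands the whole vector to every call instead
whole <- Map(function(file, select) length(select),
             file = demo_files,
             MoreArgs = list(select = demo_cols))
length(whole)   # 2 calls, each receiving all 4 column names

# Applied to the question, this would look like:
# rbindlist(Map(fread, file = files[1:2], MoreArgs = list(select = cols)))
```

This only fixes how `select` is passed. The later files still lack a header row, so selecting by name would fail for them regardless.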

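【Comments】:

So the first file has the correct column names and none of the other files have column names at all? Do you want to read only the first 8 columns from each file?
@RonakShah The first one has the correct column names; in all the others, the column names are actually data values. So those "column names" could be moved into the first row and the header replaced with the column names from the first file. It is not the first 8 columns but the 8 specific ones I selected, whose positions are select(8, 10, 19, 20, 22, 23, 30, 33)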

Tags: r dataframe csv

【Solution 1】:

You can try this:

library(data.table)

#define the column names in the data
cols <- c("family", "species", "occurrenceStatus", "individualCount", 
          "decimalLatitude", "decimalLongitude", "eventDate", "year")
#define the column numbers
col_num <- c(8, 10, 19, 20, 22, 23, 30, 33)
#Read the 1st file with correct column names
data <- fread(files[1], select = col_num)
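#Read the remaining files, keeping only the selected columns, and
#overwrite their column names using setNames.
#Note: header = TRUE makes fread treat each file's first line as a header,
#so that line is not kept as data; use header = FALSE to keep it as a row.
#combine the data together with rbindlist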
result <- rbindlist(lapply(files[-1], function(x) setNames(
            fread(x, select = col_num, header = TRUE), cols)), fill = TRUE)
#Add 1st dataset to rest of them
result <- rbind(data, result)

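However, if every file is as large as 10 GB, I doubt this will run without giving you memory errors.

【Comments】:

I managed to refine it while working on a smaller subset of two files, and it seems to work! Though you may need to change the header argument to header = TRUE, otherwise the values of the first row will have col_num
As for the memory issue, I will just have to filter further by hand and run 3-5 files at a time.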
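
The "3-5 files at a time" idea can be sketched as a loop that reads one batch, filters it, appends it to a file on disk, and frees it before reading the next batch. This is only an illustration under stated assumptions: the demo chunks, `batch_size`, `out_file`, and the `year >= 2000` filter are all hypothetical, and `header = FALSE` is used because the demo chunks have no header row.

```r
library(data.table)

# Hypothetical demo: three small headerless chunks standing in for the real splits
chunk_files <- vapply(1:3, function(i) {
  f <- tempfile(fileext = ".csv")
  fwrite(data.table(c("Felidae", "Canidae"), c(2000L + i, 1990L)),
         f, col.names = FALSE)
  f
}, character(1))

cols       <- c("family", "year")  # names to assign (8 names in the question)
col_num    <- c(1, 2)              # positions to read; the question uses c(8, 10, ...)
batch_size <- 2                    # files per batch (assumed; the comment suggests 3-5)
out_file   <- tempfile(fileext = ".csv")  # hypothetical output path

# Split the file list into consecutive batches
batches <- split(chunk_files, ceiling(seq_along(chunk_files) / batch_size))

for (batch in batches) {
  part <- rbindlist(lapply(batch, function(f) {
    dt <- fread(f, select = col_num, header = FALSE)
    setnames(dt, cols)             # assign the first file's column names
    dt[year >= 2000]               # hypothetical filter to shrink the data early
  }))
  # The first write creates the file with a header; later writes append rows only
  fwrite(part, out_file, append = file.exists(out_file))
  rm(part)                         # release the batch before reading the next one
}

result <- fread(out_file)
```

Only one batch is held in memory at a time, so peak memory depends on the batch size and on how much the filter removes rather than on the total number of files.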