R: reading certain column values from multiple csv files (answers)

【Question title】: R reading certain column values from multiple csv's
【Posted】: 2021-10-02 01:29:24
【Question description】:

I have multiple csv files in my folder that follow this syntax: "Sales-"Month"-"Year"

For example:

Sales-APR-2019.csv 
Sales-APR-2020.csv 
Sales-MAR-2019.csv 
Sales-DEC-2019.csv

My task in R is to extract certain products for the whole of 2019. I set up my function as follows:

myfiles = list.files( pattern="SALES-EXTRACT-...-2019-NEW.csv", full.names=TRUE)
file <- ldply(myfiles, read_csv)

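Here is the problem: the files are large, so I don't want to load them all into R. If I have the articles I need, for example 1, 2, 3, 4 and 5, how can I specify that only the rows where the column value equals one of those articles are read?

Finally, I want to omit the first row when reading each csv. For a single file this would be read as: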

file <- read.csv("SALES--APR-2019.csv",header = TRUE)[-1,]

When reading all the files, where in the code can I specify [-1,]?

【Question comments】:

You could also preprocess the files with findstr (Windows) or grep (Linux) and read the result with data.table::fread(). See here: stackoverflow.com/questions/55568068/…
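
The preprocessing idea from this comment might look roughly like the sketch below. It is only an illustration under several assumptions: a Unix-like system with grep on the PATH, the data.table package installed, and a comma-separated file whose first column holds the article number. The demo file and the column names article and qty are hypothetical, not taken from the question.

```r
library(data.table)

# Hypothetical demo file; the column names `article` and `qty` are assumptions.
demo <- tempfile(fileext = ".csv")
writeLines(c("article,qty", "1,10", "7,20", "3,30", "12,40"), demo)

# grep keeps only lines whose first field is a single digit from 1 to 5,
# so non-matching rows are discarded before R parses anything.
# The header line does not match either, so column names are given by hand.
cmd <- sprintf("grep -E '^[1-5],' %s", shQuote(demo))
sales <- fread(cmd = cmd, header = FALSE, col.names = c("article", "qty"))
```

On Windows the grep call would have to be replaced by an equivalent findstr command, and the pattern would need adjusting if the article column is not the first field.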

Tags: r dataframe csv tidyverse

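【Solution 1】:

The vroom package offers a "tidy" way to select or drop columns by name during import. Documentation: https://www.tidyverse.org/blog/2019/05/vroom-1-0-0/#column-selection

Column selection (col_select)

The vroom argument "col_select" makes selecting which columns to keep (or omit) more straightforward. The interface of col_select is the same as dplyr::select().

Select columns by name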

data <- vroom("flights.tsv", col_select = c(year, flight, tailnum))
#> Observations: 336,776
#> Variables: 3
#> chr [1]: tailnum
#> dbl [2]: year, flight
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

Drop columns by name

data <- vroom("flights.tsv", col_select = c(-dep_time, -air_time:-time_hour))
#> Observations: 336,776
#> Variables: 13
#> chr [4]: carrier, tailnum, origin, dest
#> dbl [9]: year, month, day, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr...
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

Use the selection helpers

data <- vroom("flights.tsv", col_select = ends_with("time"))
#> Observations: 336,776
#> Variables: 5
#> dbl [5]: dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
#> 
#> Call `spec()` for a copy-pastable column specification
#> Specify the column types with `col_types` to quiet this message

Load multiple files, select specific columns, and skip the first row:

library(vroom)
# `etc` stands in for any further column names to keep
files <- fs::dir_ls(glob = "SALES*2019.csv")
data <- vroom(files, col_select = c(article_1, article_2, article_3, etc), skip = 1)

【Comments】:

【Solution 2】:

If the article information is stored in a column named article in each file, you can try:

library(tidyverse)

keep_articles <- 1:5
myfiles <- list.files(pattern = "SALES-.*-2019\\.csv$", full.names = TRUE)

data <- map_df(myfiles, ~read_csv(.x) %>% 
                  slice(-1) %>%
                  filter(article %in% keep_articles), .id = 'file')

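data will be a combined data frame built from all the csv files, keeping only the rows whose article is in keep_articles. An additional file column is also created to distinguish rows from different files (since myfiles is unnamed, it holds the position of each file in myfiles).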
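
Note that read_csv() still loads each whole file before the filter runs. If memory is the main concern, one possible variation is to read each file in chunks and filter every chunk as it arrives. The following is a minimal sketch, assuming the readr and dplyr packages; the helper read_filtered, the demo file and the qty column are hypothetical names introduced only for illustration, and dropping the first row is not covered here.

```r
library(readr)
library(dplyr)

keep_articles <- 1:5

# Hypothetical demo file; `article` and `qty` are assumed column names.
demo <- tempfile(fileext = ".csv")
writeLines(c("article,qty", "1,10", "7,20", "3,30", "12,40"), demo)

# Read one file chunk by chunk and keep only the matching rows of each chunk.
# DataFrameCallback row-binds whatever the callback returns for every chunk.
read_filtered <- function(path, chunk_size = 10000) {
  read_csv_chunked(
    path,
    callback = DataFrameCallback$new(
      function(chunk, pos) filter(chunk, article %in% keep_articles)
    ),
    chunk_size = chunk_size,
    col_types = cols()  # guess column types without printing the spec
  )
}

# A tiny chunk size is used here only to exercise the chunking on the demo file.
sales <- read_filtered(demo, chunk_size = 2)
```

For several files, something like bind_rows(lapply(myfiles, read_filtered)) could then combine the filtered results, so that only the matching rows of each file are kept in memory at the end.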

【Comments】:
