【Question Title】: Reading a messy csv using readLines until a certain row / cell value
【Posted】: 2026-01-04 03:10:02
【Question Description】:

I'm working with a messy csv file that I'm trying to load. readLines seems to do the job if I hard-code the line number:

readLines(file_path, n = 31)

What I need is to make the n (or skip) argument a variable, so that my function is more robust.

I need n to correspond to:

  1. a cell containing a specific string, e.g. Data,
  2. an empty line

Not both at once; I will use separate calls.

What are the potential options for achieving this? I can think of which, is.na and grep, but I don't know how to use them in this particular case.

I know how to clean the file after reading it in, but I'd like to avoid that step (and read only part of the file, if possible).

Can you think of a solution?

My data is the output of an ETG-4000 fNIRS machine.

Here is the whole file:

messy_data <- c("Header,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "File Version,1.08,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Patient Information,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"ID,someID,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Name,someName,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Comment,someComment,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Age,23,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Sex,Male,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Analyze Information,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"AnalyzeMode,Continuous,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Pre Time[s],20,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Post Time[s],20,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Recovery Time[s],40,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Base Time[s],20,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Fitting Degree,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"HPF[Hz],No Filter,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"LPF[Hz],No Filter,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Moving Average[s],5,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Measure Information,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Date,17/12/2016 12:15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Mode,3x3,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Wave[nm],695,830,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Wave Length,CH1(699.2),CH1(828.2),CH2(697.2),CH2(826.7),CH3(699.2),CH3(828.2),CH4(697.5),CH4(827.8),CH5(697.2),CH5(826.7),CH6(697.5),CH6(827.8),CH7(697.5),CH7(827.8),CH8(698.8),CH8(828.7),CH9(697.5),CH9(827.8),CH10(698.7),CH10(830.2),CH11(698.8),CH11(828.7),CH12(698.7),CH12(830.2),CH13(698.3),CH13(825.7),CH14(697.5),CH14(826.6),CH15(698.3),CH15(825.7),CH16(699.1),CH16(825.9),CH17(697.5),CH17(826.6),CH18(699.1),CH18(825.9),CH19(699.1),CH19(825.9),CH20(698.7),CH20(825.2),CH21(699.1),CH21(825.9),CH22(697.7),CH22(825.7),CH23(698.7),CH23(825.2),CH24(697.7),CH24(825.7)", 
"Analog Gain,6.980392,6.980392,6.980392,6.980392,24.235294,24.235294,6.980392,6.980392,18.745098,18.745098,24.235294,24.235294,18.745098,18.745098,24.235294,24.235294,531.764706,531.764706,18.745098,18.745098,531.764706,531.764706,531.764706,531.764706,42.823529,42.823529,42.823529,42.823529,34.352941,34.352941,42.823529,42.823529,8.54902,8.54902,34.352941,34.352941,8.54902,8.54902,34.352941,34.352941,6.039216,6.039216,8.54902,8.54902,6.039216,6.039216,6.039216,6.039216", 
"Digital Gain,7.67,4.19,7,4.41,7.48,3.02,9.94,5.87,5.05,2.62,8.09,3.83,9.9,5.47,55.48,19.09,9.47,3.27,46.93,19.65,18.88,5.08,41.32,10.19,1.54,0.57,0.39,0.16,1.46,0.37,0.11,0.06,1.2,0.52,0.24,0.08,0.26,0.18,0.27,0.07,0.11,0.06,0.08,0.07,1.17,0.44,0.27,0.21", 
"Sampling Period[s],0.1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"StimType,STIM,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Stim Time[s],,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"A,45,B,100,C,15,D,15,E,15,F,15,G,15,H,15,I,15,J,15,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Repeat Count,1,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Exception Ch,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", ",,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", 
"Data,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", "Probe1(Total),CH1,CH2,CH3,CH4,CH5,CH6,CH7,CH8,CH9,CH10,CH11,CH12,CH13,CH14,CH15,CH16,CH17,CH18,CH19,CH20,CH21,CH22,CH23,CH24,Mark,Time,BodyMovement,RemovalMark,PreScan,,,,,,,,,,,,,,,,,,,"
)
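
For reference, once the lines are already in a character vector like messy_data, the two cut-offs I'm after could be located with something like:

grep("^Data", messy_data)[1]  # first row whose leading cell is "Data"
grep("^,*$", messy_data)[1]   # first "empty" row (here: a run of commas)

But the point is to find them without having to read the whole file first.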

【Comments】:

  • It is always good to include a reproducible example. For now, here is an example from a similar question: *.com/questions/37663246/…
  • I think this question is generic enough that it doesn't need a sample csv. I will attach the dput() output later. The question you linked doesn't really answer mine: it just loads the whole file with readLines and then filters it.

Tags: r csv import read.csv


【Solution 1】:

I suspect this is most likely a bad idea, since it is more likely to slow the process down than to speed it up. That said, I can see there might be a benefit if you have a very large file, a large part of which can be avoided this way.

library( readr )
line <- 0L
input <- "start"                  # dummy value so the first test passes
while( !grepl( "Data", input ) && input != "" ) {
    line <- line + 1L
    # read exactly one line, skipping everything before it
    input <- read_lines( file, skip = line - 1L, n_max = 1L )
}
line

We read one line at a time. For each line, we test for either the text "Data" or a blank line. If either condition is met, we stop reading, leaving line as the number of the row that triggered the stop, so everything before it is what you want. That way you can call something like:

df <- read_lines( file, n_max = line - 1L )
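
One caveat about the sample data (my observation, not tested on the full file): the "blank" rows there are runs of commas rather than true empty strings, so the input != "" test may never fire. A hedged variant of the loop that treats comma-only rows as blank, with a guard in case the end of the file is reached first:

line <- 0L
input <- "start"
while( length( input ) == 1L &&     # stop cleanly at end of file
       !grepl( "Data", input ) &&   # stop on the "Data" marker
       !grepl( "^,*$", input ) ) {  # comma-only rows count as blank
    line <- line + 1L
    input <- read_lines( file, skip = line - 1L, n_max = 1L )
}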

Update: per @konvas's suggestion, here is an option to test and read at the same time.

read_with_condition <- function( file, lines.guess = 100L ) {
    line <- 1L
    # pre-allocate the output vector to avoid growing it in the loop
    output <- vector( mode = "character", length = lines.guess )
    input <- "start"              # dummy value so the first test passes
    while( !grepl( "Data", input ) && input != "" ) {
        input <- readr::read_lines( file, skip = line - 1L, n_max = 1L )
        output[line] <- input
        line <- line + 1L
    }
    # discard any unwanted space in the output vector
    # this will also discard the last line to be read in (which failed the test)
    output <- output[ seq_len( line - 2L ) ]
    cat( paste0( "Stopped reading at line ", line - 1L, ".\n" ) )
    return( output )
}

new <- read_with_condition( file, lines.guess = 100L )

So here we test the input condition while also writing each input line to an object. You can use lines.guess to pre-allocate space in the output vector (a good guess should speed things up; be generous rather than conservative here), and any excess space is cleaned up at the end. Note that this is a function, so the last line, new <- ..., shows how to call it.
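
As a quick sanity check, the function can be tried on the dput() data from the question; read_lines needs a file on disk, so write it to a temporary one first:

tmp <- tempfile( fileext = ".csv" )
writeLines( messy_data, tmp )
new <- read_with_condition( tmp, lines.guess = 100L )
# should stop at the "Data" marker row: the comma-only rows above it are
# not empty strings, so they don't trigger the blank-line test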

【Comments】:

  • The first part, finding line, is perfect. However, using read_csv() leaves df with only two columns. That is correct for the first 22 lines, but the second half of my file has more columns (it's messy, I warned you ;) ). I will use line with read_lines instead of read_csv. Thanks.
【Solution 2】:

readr comes with the function read_lines_chunked, which helps with reading large files, but it has no option to exit once a condition is met.

I can see three possibilities for achieving your goal:

1) Read the whole file, then keep only the required lines. I realise this may not be an option for you, otherwise you wouldn't have posted the question :)

lines <- readr::read_lines(file_path)
lines <- lines[seq(1, grep("Data", lines)[1] - 1)]
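
The same one-liner works for the empty-line condition. Note that in this file the "blank" rows are comma-only strings, so it is safer to match those than to compare against "":

lines <- readr::read_lines(file_path)
first_blank <- grep("^,*$", lines)[1]    # matches "" or a run of commas
lines <- lines[seq_len(first_blank - 1)]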

2) Read the file a first time to find n, then read again using that value. One way is @rosscova's answer, another is to use an external tool such as GNU grep, and a third is to use read_lines_chunked from readr, for example:

# read one line per chunk; the callback aborts with an "error" whose
# message encodes the number of lines before the "Data" row
n <- tryCatch(
    readr::read_lines_chunked(
        file = file_path,
        callback = readr::DataFrameCallback$new(
            function(x, pos) {
                if (grepl("Data", x)) stop(pos - 1)
            }
        ),
        chunk_size = 1
    ),
    error = function(e) as.numeric(e$message)
)
lines <- readLines(file_path, n = n)
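
For completeness, here is a sketch of the external-tool route mentioned above, assuming a Unix-alike with grep on the PATH (-m 1 stops at the first match, -n prefixes it with its line number):

hit <- system2("grep", c("-n", "-m", "1", "Data", shQuote(file_path)),
               stdout = TRUE)
n <- as.integer(sub(":.*", "", hit)) - 1L  # lines before the "Data" row
lines <- readLines(file_path, n = n)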

3) Go through the file only once, saving each line until the condition is met. For this you can either modify @rosscova's script accordingly (saving "input" into a variable) or use read_lines_chunked again:

lines <- character(1e6) # pre-allocate some space, depending on how
                        # many lines you are expecting to get

# Define a callback function to read a line and save it; if it meets
# the condition, it breaks by throwing an error
cb <- function(x, pos) {
    if (grepl("Data", x)) {
        # condition met, save only lines up to the current one and break
        lines <<- lines[seq(pos - 1)]
        stop(paste("Stopped reading on line", pos))
    }
    lines[[pos]] <<- x # condition not met yet, save the current line
}

# now call the above in read_lines_chunked
# need to wrap in tryCatch to handle the error
tryCatch(
    readr::read_lines_chunked(
        file = file_path,
        callback = readr::DataFrameCallback$new(cb),
        chunk_size = 1
    ),
    error = identity
)

In general this involves some bad practice, including the use of <<-, so use with care!
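
If the global assignment is the main concern, one option (a sketch, not tested on the real file) is to wrap the whole thing in a function, so that <<- only reaches the enclosing function's frame rather than the global environment:

read_until_data <- function(file_path, guess = 1e4L) {
    acc <- character(guess)  # pre-allocated accumulator, local to this call
    n <- 0L
    cb <- function(x, pos) {
        if (grepl("Data", x)) stop("done")  # break out of the chunked read
        n <<- n + 1L       # <<- now modifies the function's own frame
        acc[n] <<- x
    }
    tryCatch(
        readr::read_lines_chunked(
            file = file_path,
            callback = readr::DataFrameCallback$new(cb),
            chunk_size = 1
        ),
        error = identity
    )
    acc[seq_len(n)]
}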

All of the above can also be done with data.table::fread, which should be faster than readr.
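
For example, assuming n (the number of lines before the "Data" row, as found above) is already known, something like this should work, with fill = TRUE padding the ragged rows:

dt <- data.table::fread(file_path, nrows = n, header = FALSE,
                        sep = ",", fill = TRUE)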

Method 1 will certainly be the fastest for small files.

It would be great if you could benchmark some of these on your large file and let us know which is fastest!

【Comments】:

  • "You can modify @rosscova's script accordingly": good idea to write to an object alongside testing the input condition. I'll add a version to my answer. I'd still rather not get <<- and tryCatch involved, though.