Extract tables from a single file

【Question title】: Extract tables from a single file
【Posted】: 2017-01-07 15:11:00
【Question description】:

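I am trying to extract multiple tables from a single file in R. The file contains tables that all have the same number of variables but a varying number of records. I want to extract only the tables (the numbers) and write each one to a separate file. Between the tables there are 4 lines (a blank line, Run: nr, the variable names, the units) that I want to get rid of. An alternative that breaks the file at each blank line would also be a fine solution for me, but I have not managed to do that either. Below is a sample of the file; my real file contains many runs (tables), each with more than 30 variables and 150-300 records. Many thanks for your help!

Example: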

data <- readLines(textConnection("
              + MODEL OUTPUT
              +
              + Run: 1
              + V1  V2 V3
              +        mm
              + 20  2  2.0
              + 21  2  1.5
              + 22  2  3.5
              +
              + Run: 2
              + V1 V2 V3
              +       mm
              + 1  1  1.5
              + 2  1  2.5
              +
              + Run: 3
              + V1 V2 V3
              +       mm
              + 11  5  1.5
              + 12  5  2.5                                
              + 13  5  1.0
              + 14  5  4.5"))

【Comments on the question】:

I think "How do I read a text file into R when each record is a paragraph and some records have 4 fields and others have 6" should get you going.
Also related: "extract data between a pattern from a text file in R" and "R convert unstructured csv file to a data frame"
There are several on the readLines - cumsum - split theme. Pick your dupe ;) Good luck!
L <- lapply(split(data, cumsum(data == ""))[-1], function(x) read.table(text = x[-c(1, 2, 4)], header = TRUE)); names(L) <- grep("Run", data, value = TRUE)
Do you have any control over how this output is created?
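
The readLines - cumsum - split idea from the comments can be sketched as follows. This is only a sketch under two assumptions that the thread does not confirm: that the real file does not contain the literal "+" characters shown in the sample, and that every block has exactly the layout blank line, Run line, header, units. The names clean_lines, block_id, blocks and runs are made up for illustration.

```r
# Assumed clean input: the sample from the question without the "+" prefixes.
clean_lines <- c(
  "MODEL OUTPUT",
  "",
  "Run: 1",
  "V1  V2 V3",
  "       mm",
  "20  2  2.0",
  "21  2  1.5",
  "22  2  3.5",
  "",
  "Run: 2",
  "V1 V2 V3",
  "      mm",
  "1  1  1.5",
  "2  1  2.5",
  "",
  "Run: 3",
  "V1 V2 V3",
  "      mm",
  "11  5  1.5",
  "12  5  2.5",
  "13  5  1.0",
  "14  5  4.5"
)

# Every blank line starts a new block; block 0 is the "MODEL OUTPUT" preamble.
block_id <- cumsum(trimws(clean_lines) == "")
blocks <- split(clean_lines, block_id)[-1]

# Inside a block: line 1 is blank, line 2 is "Run: n", line 4 is the units line.
# Dropping those three leaves the header plus the data rows.
runs <- lapply(blocks, function(x) {
  read.table(text = x[-c(1, 2, 4)], header = TRUE)
})

# Label each table with its "Run: n" line.
names(runs) <- grep("^Run:", clean_lines, value = TRUE)
```

With this input, runs is a named list of three data frames, so runs[["Run: 2"]] returns the second table. Using trimws() before the comparison is a small departure from the one-liner above: it also treats whitespace-only lines as breaks.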

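Tags: r

【Solution 1】:

I'll answer as long as this isn't marked as a duplicate. You can handle this with some preprocessing, then cumsum to enumerate the contiguous sections of text, and finally read.table to read those sections as tables.

Create the sample data: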

file_text <- readLines(textConnection("
    + MODEL OUTPUT
    +
      + Run: 1
    + V1  V2 V3
    +        mm
    + 20  2  2.0
    + 21  2  1.5
    + 22  2  3.5
    +
      + Run: 2
    + V1 V2 V3
    +       mm
    + 1  1  1.5
    + 2  1  2.5
    +
      + Run: 3
    + V1 V2 V3
    +       mm
    + 11  5  1.5
    + 12  5  2.5                                
    + 13  5  1.0
    + 14  5  4.5"))

Preprocessing: drop the MODEL OUTPUT line and the Run: lines, strip the leading +, and drop the lines that contain only mm.

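# drop the title line and the "Run: n" lines
file_text = file_text[!grepl('MODEL OUTPUT', file_text)]
file_text = file_text[!grepl('Run: \\d+', file_text)]
# strip the leading "+" (sub is vectorized, so no sapply is needed)
file_text = sub("^\\s*\\+", "", file_text)
# drop the units lines, which contain only "mm"
file_text = file_text[!grepl('^\\s*mm\\s*$', file_text)]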

Identify the blank lines, treat them as the breaks between sections, then number the lines by section.

is_break = unname(sapply(file_text, function(x) trimws(x) == ""))
section_id = unname(cumsum(is_break))
section_id
#  [1] 1 2 2 2 2 2 3 3 3 3 4 4 4 4 4 4
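
As a toy illustration of why this numbering works (the input toy_lines is hypothetical, not taken from the question): each TRUE in the break vector bumps the running total by one, so every line inherits the number of the most recent blank line before it.

```r
# Hypothetical toy input: blank strings mark the start of each section.
toy_lines <- c("", "a", "b", "", "c", "", "", "d")

# TRUE wherever a line is empty after trimming whitespace.
is_break <- trimws(toy_lines) == ""

# The running count of breaks seen so far is the section number of each line.
section_id <- cumsum(is_break)
section_id
# [1] 1 1 1 2 2 3 4 4

# Grouping by that id gives one character vector per section,
# each starting with its own blank line.
sections <- split(toy_lines, section_id)
```

Section 3 here holds nothing but its blank line, which is the situation produced by two consecutive breaks.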

Finally, split the file text into sections and read each one as a table:

tabs = lapply(unique(section_id), function(i) {
  # the first line of a section will always be empty
  section_lines = file_text[section_id == i][-1]
  if (length(section_lines)) {
    # there's a section of text
    read.table(text = section_lines, header = TRUE)
  } else {
    # there were two consecutive section breaks, so after the first break
    # there's an empty 'section'
    NA
  }
})

The result is a list of data.frames. Deal with the missing tables now or earlier, however you prefer.

tabs
# [[1]]
# [1] NA
# 
# [[2]]
#   V1 V2  V3
# 1 20  2 2.0
# 2 21  2 1.5
# 3 22  2 3.5
# 
# [[3]]
#   V1 V2  V3
# 1  1  1 1.5
# 2  2  1 2.5
# 
# [[4]]
#   V1 V2  V3
# 1 11  5 1.5
# 2 12  5 2.5
# 3 13  5 1.0
# 4 14  5 4.5
#
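
The question also asks for each table to end up in its own file, which the answer does not show. One possible way to finish, sketched under assumptions: the small tabs list below is a hand-built stand-in for the real result, and the output directory and the run_1.txt, run_2.txt naming scheme are invented for illustration.

```r
# Stand-in for the `tabs` list: one NA placeholder plus two tables.
tabs <- list(
  NA,
  data.frame(V1 = 20:22, V2 = 2L, V3 = c(2.0, 1.5, 3.5)),
  data.frame(V1 = 1:2, V2 = 1L, V3 = c(1.5, 2.5))
)

# Keep only the elements that are real tables, dropping the NA placeholder.
tables <- Filter(is.data.frame, tabs)

# Hypothetical output location and file names, one per table.
out_dir <- tempdir()
out_files <- file.path(out_dir, sprintf("run_%d.txt", seq_along(tables)))

# Write each table as whitespace-separated text with a header row.
for (i in seq_along(tables)) {
  write.table(tables[[i]], out_files[i], row.names = FALSE, quote = FALSE)
}
```

Each file can then be read back with read.table(out_files[i], header = TRUE). Replace tempdir() with a real directory to keep the files after the R session ends.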

【Discussion】:

Your suggested solution worked, and I learned a new trick. Thanks!