How to read the first four columns from a file with a different number of columns on each row into a data frame: answers

【Question title】: How to read first four columns from a file with different number of columns on each row into a data frame
【Posted】: 2021-09-09 08:12:19
【Question】:

I have a text file whose first lines look like this:

 3  a         1       4   6   2
 3  a         1       4   6   2
 4  a         1       4   6   8   2
 4  a         1       4   6   8   2
 3  a         1       4   6   2
 3  a         1       4   6   2
 3  a         1       4   6   2
 3  a         1       4   6   2
 3  a         1       4   6   2
 3  a         1       4   6   2
 5  a         1       4   8  10   2   6
 5  a         2       6   8  10   2   4
 5  a         1       4   8  10   2   6
 5  a         1       4   8  10   2   6
 5  a         2       6   8  10   2   4

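I only want to read the first four columns of each row and save them into a data frame.

I have tried several pieces of code; the latest one is: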

library(data.table)

nudos<-fread("caliz.txt",select=c(1:4),fill=TRUE)

It keeps giving this error message:

Stopped early on line 119. Expected 11 fields but found 13. Consider fill=TRUE and comment.char=. First discarded non-empty line: >

Thanks!

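【Comments】:

Can you set fill=TRUE and then discard the extra rows?
Well, since the error message says the problem occurs on line 119, the first 10 lines do not help pin down the exact issue. Could you share the text of line 119, or the complete text file?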
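
The idea in the first comment (fill the short rows, then discard what is not needed) could be sketched in base R rather than data.table. This is only a sketch under assumptions: the three sample lines and the temporary file are hypothetical stand-ins for the asker's caliz.txt, and whether it works on the real file depends on what line 119 actually contains.

```r
# Hypothetical sample standing in for the asker's file (not the real caliz.txt)
sample_lines <- c(
  " 3  a         1       4   6   2",
  " 4  a         1       4   6   8   2",
  " 5  a         1       4   8  10   2   6"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# Find the widest row; without explicit col.names, read.table decides
# the number of columns from the first five lines only
n_cols <- max(count.fields(path), na.rm = TRUE)

# fill = TRUE pads shorter rows with NA; afterwards keep columns 1 to 4
wide <- read.table(path, fill = TRUE, header = FALSE,
                   col.names = paste0("V", seq_len(n_cols)),
                   stringsAsFactors = FALSE)
first_four <- wide[, 1:4]
```

Passing col.names sized to the widest row is the design choice that matters here: it stops read.table from settling on a column count that is too small for later rows.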

Tags: r dataframe multiple-columns

【Solution 1】:

Here is a base R solution. It reads the file with readLines and parses it with a series of *apply loops.

# read the file as text lines
txt <- readLines("test.txt")
# split by one or more spaces
txt <- strsplit(txt, " +")
# keep only the non-empty strings (a leading space yields an empty first element)
txt <- lapply(txt, function(x) x[nzchar(x)])
# blank lines are now empty vectors, remove them
txt <- txt[lengths(txt) > 0]
# now extract the first 4 elements of each vector
txt <- lapply(txt, '[', 1:4)
# and rbind to data.frame
df1 <- do.call(rbind.data.frame, txt)
names(df1) <- paste0("V", 1:4)

head(df1)
#  V1 V2 V3 V4
#1  3  a  1  4
#2  3  a  1  4
#3  4  a  1  4
#4  4  a  1  4
#5  3  a  1  4
#6  3  a  1  4
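
The same readLines-and-split idea can be wrapped in a small function so it is easy to reuse and test. The sketch below is a variation, not the answerer's code: read_first_fields is a hypothetical helper name, it splits on any whitespace instead of spaces only, and the sample lines are made up to match the shape of the question's data.

```r
# Hypothetical helper (not from the answer): return the first n
# whitespace-separated fields of every non-blank line as character columns
read_first_fields <- function(lines, n = 4) {
  lines <- trimws(lines)          # drop leading and trailing whitespace
  lines <- lines[nzchar(lines)]   # drop blank lines
  if (length(lines) == 0) stop("no non-blank lines to parse")
  parts <- strsplit(lines, "[[:space:]]+")
  # x[seq_len(n)] pads rows that have fewer than n fields with NA
  mat <- do.call(rbind, lapply(parts, function(x) x[seq_len(n)]))
  df <- as.data.frame(mat, stringsAsFactors = FALSE)
  names(df) <- paste0("V", seq_len(n))
  df
}

# Made-up sample lines shaped like the question's data
sample_lines <- c(" 3  a         1       4   6   2",
                  " 4  a         1       4   6   8   2",
                  "",
                  " 5  a         2       6   8  10   2   4")
res <- read_first_fields(sample_lines)
```

With a real file, the call would be something like read_first_fields(readLines("test.txt")). All columns come back as character.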

【Comments】:

Hi Rui. I am getting an error message: "unexpected input" in "(x)".
@SergioEnriqueYarzaAcuña See whether the error goes away after the edit.

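【Solution 2】:

Your table does not appear to be well formed. Even if you only want to select the first 4 columns, R reads all of the columns and cannot handle rows with more or fewer elements. You have to split the lines and pick out the values manually: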

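# read the file as text lines and split each line on single spaces
lin = readLines("test.txt")
cells = strsplit(lin, " ")
data = c()
for (line in cells) {
  found = 0
  cell = 1
  # collect the first four non-empty cells of the line
  # (assumes every line has at least four non-empty cells)
  while (found < 4) {
    value = line[[cell]]
    if (nchar(value) > 0) {
      found = found + 1
      data = c(data, value)
    }
    cell = cell + 1
  }
}

# four values per line, so fill the matrix row by row
df = as.data.frame(matrix(data, ncol = 4, byrow = TRUE))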

This results in the data frame:

> df
   V1 V2 V3 V4
1   3  a  1  4
2   3  a  1  4
3   4  a  1  4
4   4  a  1  4
5   3  a  1  4
6   3  a  1  4
7   3  a  1  4
8   3  a  1  4
9   3  a  1  4
10  3  a  1  4
11  5  a  1  4
12  5  a  2  6
13  5  a  1  4
14  5  a  1  4
15  5  a  2  6

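You can now change the class of some columns (e.g. df[,1] = as.integer(df[,1])), because at this point they are all character. You will probably want numeric values, but that is up to you.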
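
As an alternative to converting columns one at a time, base R's type.convert can pick a type for every column in one call. This is a minimal sketch, assuming a small hand-built data frame that only mimics the all-character result above; it is not the data from the question.

```r
# Hypothetical all-character data frame shaped like the result above
df <- data.frame(V1 = c("3", "4", "5"),
                 V2 = c("a", "a", "a"),
                 V3 = c("1", "1", "2"),
                 V4 = c("4", "4", "6"),
                 stringsAsFactors = FALSE)

# Let R choose a type per column; as.is = TRUE keeps text columns
# as character instead of turning them into factors
converted <- type.convert(df, as.is = TRUE)
str(converted)
```

Columns that contain only whole numbers become integer, while a column such as V2 stays character.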

【Comments】: