【Posted】: 2019-10-29 08:55:26
【Question】:
I'm used to loading straightforward CSVs with column headers and a single table into R. I have a large CSV file with the following structure:
+-----------+---------+--------+---------+--------+---------+
| file_name | | | | | |
+-----------+---------+--------+---------+--------+---------+
| table1 | | | | | |
+-----------+---------+--------+---------+--------+---------+
| Var1 | Var2 | Var3 | Var4 | Var5 | Var6 |
+-----------+---------+--------+---------+--------+---------+
| 198824 | 198824 | 198824 | 198824 | 198824 | 198824 |
+-----------+---------+--------+---------+--------+---------+
| 123 | 1234 | 1242 | 124 | 1241 | 1232 |
+-----------+---------+--------+---------+--------+---------+
| | | | | | |
+-----------+---------+--------+---------+--------+---------+
| | | | | | |
+-----------+---------+--------+---------+--------+---------+
| file_name | | | | | |
+-----------+---------+--------+---------+--------+---------+
| table2 | | | | | |
+-----------+---------+--------+---------+--------+---------+
| Var1 | Var2 | Var3 | Var4 | Var5 | Var6 |
+-----------+---------+--------+---------+--------+---------+
| x | x | x | x | x | x |
+-----------+---------+--------+---------+--------+---------+
| y | y | y | y | y | y |
+-----------+---------+--------+---------+--------+---------+
| z | z | z | z | z | z |
+-----------+---------+--------+---------+--------+---------+
| | | | | | |
+-----------+---------+--------+---------+--------+---------+
| | | | | | |
+-----------+---------+--------+---------+--------+---------+
| file_name | | | | | |
+-----------+---------+--------+---------+--------+---------+
| table3 | | | | | |
+-----------+---------+--------+---------+--------+---------+
| Var1 | Var2 | Var3 | Var4 | Var5 | Var6 |
+-----------+---------+--------+---------+--------+---------+
| 532523 | 25235 | 532523 | 25235 | 532523 | 25235 |
+-----------+---------+--------+---------+--------+---------+
| 25332 | 5325235 | 25332 | 5325235 | 25332 | 5325235 |
+-----------+---------+--------+---------+--------+---------+
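The data is not completely unstructured, since it follows this pattern:
Row 1 contains only the file name: file_name
Row 2 contains the table label: table1, table2, table3, and so on.
Then comes the actual table itself: 6 columns, Var1 to Var6, with the data below them.
Then there are 2 empty rows, after which the next group starts again with file_name, followed by the next table number and its table.
All tables in the CSV follow this pattern, but I can't even load the file into R. Loading it directly with read.csv() gives the following: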
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
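Is it possible to load this into a single data frame in R, turning the table number into a column, with Var1-Var6 plus the table number as the column headers?
i.e.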
+--------+---------+--------+---------+--------+---------+--------------+
| Var1 | Var2 | Var3 | Var4 | Var5 | Var6 | table_number |
+--------+---------+--------+---------+--------+---------+--------------+
| 198824 | 198824 | 198824 | 198824 | 198824 | 198824 | table1 |
| 123 | 1234 | 1242 | 124 | 1241 | 1232 | table1 |
| x | x | x | x | x | x | table2 |
| y | y | y | y | y | y | table2 |
| z | z | z | z | z | z | table2 |
| 532523 | 25235 | 532523 | 25235 | 532523 | 25235 | table3 |
| 25332 | 5325235 | 25332 | 5325235 | 25332 | 5325235 | table3 |
+--------+---------+--------+---------+--------+---------+--------------+
Note that the number of rows differs from table to table (table1, table2, etc.).
The CSV file contains about 200 tables in total and exceeds Excel's row limit (around 9MM rows, I think).
Using Brian's code, here are the first few lines:
> lines_all
[1] "name,,,,," "table1,,,,," "Var1,Var2,Var3,Var4,Var5,Var6" "321,54312,321,54654,3564,54321"
[5] "45,54,4564,54,87,456" ",,,,," ",,,,," "name,,,,,"
[9] "table2,,,,," "Var1,Var2,Var3,Var4,Var5,Var6" "ssvf,afs,fasf,afsaf,zxvz,zvx" "saf,zvx,zz,z,zxvz,zxvzxv"
[13] "zxvsaf,wr,wrw,afsaf,asf,af" ",,,,," ",,,,," "name,,,,,"
[17] "table3,,,,," "Var1,Var2,Var3,Var4,Var5,Var6" "1,2,3,4,5,6" "7,8,9,10,11,12"
[21] "13,14,15,16,17,18" "19,20,21,22,23,24"
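For reference, here is a minimal sketch of the kind of reshaping I'm after, not a tested solution for the full file. It assumes the file has already been read with readLines() into lines_all as above, that the separator rows contain nothing but commas, and that no real data row is entirely empty. The sample vector is a shortened copy of the output above, and the helper name parse_block is made up for illustration.

```r
# Shortened sample mirroring the lines_all output above.
lines_all <- c(
  "name,,,,,", "table1,,,,,", "Var1,Var2,Var3,Var4,Var5,Var6",
  "321,54312,321,54654,3564,54321", "45,54,4564,54,87,456",
  ",,,,,", ",,,,,",
  "name,,,,,", "table2,,,,,", "Var1,Var2,Var3,Var4,Var5,Var6",
  "ssvf,afs,fasf,afsaf,zxvz,zvx", "saf,zvx,zz,z,zxvz,zxvzxv",
  "zxvsaf,wr,wrw,afsaf,asf,af"
)

# Separator rows hold only commas and whitespace (assumption).
is_blank <- grepl("^[,[:space:]]*$", lines_all)

# A block starts at a non-blank line that is first or follows a blank line.
starts <- !is_blank & c(TRUE, head(is_blank, -1))
block_id <- cumsum(starts)

# Group the non-blank lines by block.
blocks <- split(lines_all[!is_blank], block_id[!is_blank])

# Hypothetical helper: turn one block of raw lines into a data frame.
parse_block <- function(block) {
  # Line 1 is the file-name row, line 2 the table label, line 3 the header.
  table_number <- sub(",.*$", "", block[2])
  # Read everything as character, since some tables hold text and others numbers.
  df <- read.csv(text = block[-(1:2)], colClasses = "character",
                 stringsAsFactors = FALSE)
  df$table_number <- rep(table_number, nrow(df))
  df
}

result <- do.call(rbind, lapply(blocks, parse_block))
rownames(result) <- NULL
print(result)
```

Every column is kept as character here because the tables mix numeric and text values in the same column positions; converting individual tables back to numeric would be a separate step.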
【Comments】:
Tags: r