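【Posted】: 2015-07-18 01:09:22
【Question】:
I need to read CSV files in C++. The first line of each file contains all the column headers, and the remaining lines contain floating-point data (examples below; the files have been shortened).
Some of the files cause problems. I am using the following code: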
#include <iostream>
#include <fstream>
#include <string>
// Compiled and tested with Clang++ on Ubuntu 14.04
int main(int argc, char** argv) {
    std::ifstream in;
    in.open(argv[1]);
    if(!in.is_open()) {
        std::cerr << "Cannot open file: " << argv[1] << "\n";
        return 1;
    }
    std::string buff;
    std::getline(in, buff);
    while(!in.eof()) {
        std::cout << buff << "\n";
        getline(in, buff);
    }
    in.close();
    return 0;
}
For most files this works fine, reading one line per iteration. Example of a "good" file:
Time,Smile,AU04,AU02,AU15,Trackerfail,AU18,AU09,negAU12,AU10,Expressive,Unilateral_LAU12,Unilateral_RAU12,AU14,Unilateral_LAU14,Unilateral_RAU14,AU05,AU17,AU26,Forward,Backward
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0
0.3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,16.667,0.0
58.3,50.0,0.0,0.0,0.0,33.333,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62.4,33.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0
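Some files go haywire: the eof bit is set right after the first getline. After that first read, buff contains part of the first line and part of the last line. Example of a "bad" file: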
Time,Smile,AU04,AU02,AU15,Trackerfail,AU18,AU09,negAU12,AU10,Occlusion,Expressive,Unilateral_LAU12,Unilateral_RAU12,AU14,Unilateral_LAU14,Unilateral_RAU14,AU05,Au17,AU57,AU58
0,0,0,0,0,16.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0.3,0,0,0,0,33.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1.3,0,0,0,0,16.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
57.9,66.667,0,0,0,66.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
60.3,33.333,0,0,0,66.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
And the contents of buff after a single call to getline:
Time,Smile,AU04,AU02,AU15,Trackerfail,AU18,AU09,negAU12,AU10,Occlusion,Expressive,Unilateral_LAU12,Unilateral_RAU12,AU14,Unilateral_LAU14,Unilateral_RA60.3,33.333,0,0,0,66.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
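As you can see, the first line is mixed together with the last line. I have no idea what is going wrong. Every line ends with \n, and the file ends with an empty \n.
So my question is: why does getline jump to the end of the file and mix the first and last lines together for some files, while other files work fine?
Edit: I need to convert a large dataset into a new, more consistent format. The current format is full of inconsistencies (0 versus 0.0, or AU17 versus Au17). Still, those formatting issues should not affect simply reading the file, right?
Edit 2:
cat -v -e -t on a good file: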
Time,Smile,AU04,AU02,AU15,Trackerfail,AU18,AU09,negAU12,AU10,Expressive,Unilateral_LAU12,Unilateral_RAU12,AU14,Unilateral_LAU14,AU05,AU17,AU26,Forward,Backward^M$
0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,66.667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0^M$
0.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,33.333,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0^M$
etc...
cat -v -e -t on a bad file:
Time,Smile,AU04,AU02,AU15,Trackerfail,AU18,AU09,negAU12,AU10,Occlusion,Expressive,Unilateral_LAU12,Unilateral_RAU12,AU14,Unilateral_LAU14,Unilateral_RAU14,AU05,Au17,AU57,AU58^M0,0,0,0,0,16.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M0.3,0,0,0,0,33.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M1.3,0,0,0,0,16.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M1.4,0,0,0,0,33.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M1.8,0,0,0,0,50,0,0,0,0,0,0,0,0,0,0,0,0,0,25,0^M2.8,0,0,0,0,50,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M3,0,0,0,0,33.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M31,0,0,0,0,33.333,0,0,0,0,25,0,0,0,0,0,0,0,0,0,0^M31.1,0,0,0,0,50,0,0,0,0,50,0,0,0,0,0,0,0,0,0,0^M31.2,0,0,0,0,66.667,0,0,0,0,50,0,0,0,0,0,0,0,0,0,0^M31.4,0,0,33.333,0,66.667,0,0,0,0,50,0,0,0,0,0,0,0,0,0,0^M31.5,0,0,33.333,0,66.667,0,0,0,0,50,25,0,0,0,0,0,0,0,0,0^M32,0,0,33.333,0,66.667,0,0,0,0,50,25,0,0,0,0,0,0,0,0,25^M32.1,0,0,33.333,0,83.333,0,0,0,0,50,25,0,0,0,0,0,0,0,0,25^M32.2,0,0,33.333,0,83.333,0,0,0,0,25,25,0,0,0,0,0,0,0,0,25^M32.4,0,0,33.333,0,83.333,0,0,0,0,25,0,0,0,0,0,0,0,0,0,25^M32.7,0,0,33.333,0,83.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,25^M33,0,0,33.333,0,83.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M33.5,0,0,0,0,83.333,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M33.9,0,0,0,0,66.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M55,33.333,0,0,0,66.667,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0^M55.2,66.667,0,0,0,66.667,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0^M55.8,100,0,0,0,66.667,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0^M56.8,100,0,0,0,66.667,0,0,0,0,0,25,0,0,0,0,0,0,0,0,25^M57.4,66.667,0,0,0,66.667,0,0,0,0,0,25,0,0,0,0,0,0,0,0,25^M57.8,66.667,0,0,0,66.667,0,0,0,0,0,25,0,0,0,0,0,0,0,0,0^M57.9,66.667,0,0,0,66.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0^M60.3,33.333,0,0,0,66.667,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
They look very different. How can I fix this?
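【Comments】:
- Unrelated to your problem, but instead of two std::getline calls, why not simply rely on std::getline (like almost every other stream function) returning the actual stream reference, and write e.g. while (std::getline(...))?
- As for your problem, are you sure nothing else is wrong with the file? You only check for end-of-file in the loop, not for other errors. Are there no hidden (non-printable) characters in the bad files? Is it always the same files that cause problems?
- @JoachimPileborg The reason the code snippet is written this way is that I wanted to quickly check the different bits: eof, fail, bad. In my application code I do it the way you suggest. Edit: it never gets inside the loop, because the eof bit is set immediately; checking the fail bit instead results in a single iteration. And yes, it is always the same files, which makes this very curious! I checked for \r and \c characters, but there are none in the file.
- What is the output of cat -v -e -t <your_file>? I suspect @Joachim Pileborg is right.
- Are you on Windows? If there is a CTRL-Z (ASCII 0x1a) in the text, it will also act as end-of-file. Have you checked the file in a hex editor?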