
【Question title】: Increasing the speed of reading a csv file C++
【Posted】: 2022-01-10 12:53:05
【Question description】:

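I wrote this code to read and filter my CSV file. It works the way I want for small files. But I just tried a file with 200k lines, and it takes about 4 minutes, which is too long for my use case.

After some testing and after fixing a few really silly things, I got the time down to 3 minutes. I found that about half of the time is spent reading the file and the other half generating the result vector.

Is there any way to improve the speed of my program? Especially the part that reads from the CSV? I am out of ideas right now. I would appreciate any help.

EDIT: The filter selects data by a time frame, or by a time frame plus a filter word in a specific column, and outputs the data to a result vector of strings.

My CSV file looks like this:

The header is: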

ID;Timestamp;ObjectID;UserID;Area;Description;Comment;Checksum

The data is:

523;19.05.2021 12:15;####;admin;global;Parameter changed to xxx; Comment;x3J2j4

std::ifstream input_file(strComplPath, std::ios::in);

int counter = 0;
while (std::getline(input_file, record))
{
    istringstream line(record);
    while (std::getline(line, record, delimiter))
    {
        record.erase(remove(record.begin(), record.end(), '\"'), record.end());
        items.push_back(record);
        //cout << record;
    }

    csv_contents[counter] = items;
    items.clear();
    ++counter;
}
 

for (int i = 0; i < csv_contents.size(); i++) {
    string regexline = csv_contents[i][1];
    string endtime = time_upper_bound;
    string starttime = time_lower_bound;
    bool checkline = false;
    bool isInRange = false, isLater = false, isEarlier = false;

    // Check for faulty Data and replace it with an empty string 
    for (int oo = 0; oo < 8; oo++) {
        if (csv_contents[i][oo].rfind("#", 0) == 0) {
            csv_contents[i][oo] = "";
        }
    }

    if ((regex_search(starttime, m, timestampformat) && regex_search(endtime, m, timestampformat))) {
        filtertimeboth = true;
    }
    else if (regex_search(starttime, m, timestampformat)) {
        filterfromstart = true;
    }
    else if (regex_search(endtime, m, timestampformat)) {
        filtertoend = true;
    }
}

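【Comments】:

Welcome to SO. You should try to describe your problem more clearly and share a minimal reproducible example, rather than just dumping all your code and expecting us to read through it and work out what you are trying to do [especially when the code is this long :-) ]. Take a look at minimal example.
Have you tried running valgrind on it to see which functions take up most of the time? Just from looking at the code, I suspect regex_search is slow.
Your code is also missing isDateInRange, isDateLater and isDateEarlier. One problem with putting almost everything in one big function is that it becomes complicated to isolate and debug/enhance specific parts of the program. I suggest that you create a class to hold the fields of one record in the CSV file.
It would also help if you gave actual names to all the fields in the CSV file. That would make it easier to understand. Also, please describe what the filter is supposed to do.
csv_contents[counter] = items; -> csv_contents[counter] = std::move(items); would avoid a copy, and in the same way: items.push_back(std::move(record));.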
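
The std::move suggestion from the last comment could look roughly like the sketch below. It is only an illustration, not the asker's actual program: split_line and read_all are hypothetical helper names, and the rows are appended with push_back instead of being assigned by index.

```cpp
#include <algorithm>
#include <istream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split one line at `delimiter` and strip double quotes.
std::vector<std::string> split_line(const std::string& line, char delimiter) {
    std::vector<std::string> items;
    items.reserve(8);  // assumption: 8 columns, as in the sample header
    std::istringstream iss(line);
    std::string field;
    while (std::getline(iss, field, delimiter)) {
        field.erase(std::remove(field.begin(), field.end(), '"'), field.end());
        // Move instead of copy; std::getline overwrites `field` on the next pass.
        items.push_back(std::move(field));
    }
    return items;
}

// Hypothetical helper: read every line of `in` into a vector of rows.
std::vector<std::vector<std::string>> read_all(std::istream& in, char delimiter) {
    std::vector<std::vector<std::string>> rows;
    std::string line;
    while (std::getline(in, line)) {
        // The returned temporary is moved into `rows`, not copied.
        rows.push_back(split_line(line, delimiter));
    }
    return rows;
}
```

Whether this alone gives a noticeable speed-up depends on the data; profiling, as suggested in the comments, is the way to find out.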

Tags: c++ csv io

【Solution 1】:

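Ted has already given an answer. I worked on a solution at the same time, so let me show it as well.

I created test data with 500k records, and on my machine all of the parsing completes within 3 seconds.

Additionally, I created classes.

The speed can be improved by using std::move, by increasing the input buffer size, and by using reserve for the std::vector.

Please see the other solution below. I omitted the filtering, since Ted has already shown it.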

#include <iostream>
#include <fstream>
#include <iomanip>
#include <string>
#include <ctime>
#include <vector>
#include <chrono>
#include <sstream>
#include <algorithm>
#include <iterator>

constexpr size_t MaxLines = 600'000u;
constexpr size_t NumberOfLines = 500'000u;
const std::string fileName{ "test.csv" };

// Dummy routine for writing a test file
void createFile() {
    if (std::ofstream ofs{ fileName }; ofs) {
        std::time_t ttt = 0;
        for (size_t k = 0; k < NumberOfLines; ++k) {
            std::time_t time = static_cast<time_t>(ttt);
            ttt += 1000;
            ofs << k << ';'
#pragma warning(suppress : 4996)
                << std::put_time(std::localtime(&time), "%d.%m.%Y  %H:%M") << ';'
                << k << ';'
                << "UserID" << k << ';'
                << "Area" << k << ';'
                << "Description" << k << ';'
                << "Comment" << k << ';'
                << "Checksum" << k << '\n';
        }
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for writing\n\n";
}


// We will create a bigger input buffer for our stream
constexpr size_t ifStreamBufferSize = 100'000u;
static char buffer[ifStreamBufferSize];


// Object oriented Model. Class for one record
struct Record {

    // Data
    long id{};
    std::tm time{};
    long objectId{};
    std::string userId{};
    std::string area{};
    std::string description{};
    std::string comment{};
    std::string checkSum{};

    // Methods
    // Extractor operator
    friend std::istream& operator >> (std::istream& is, Record& r) {

        // Read one complete line
        if (std::string line; std::getline(is, line)) {

            // Here we will store the parts of the line after the split
            std::vector<std::string> parts{};

            // Convert line to istringstream for further extraction of line parts
            std::istringstream iss{ line };

            // One part of a line
            std::string part{};

            // Split
            while (std::getline(iss, part, ';')) {

                // Check for error: a part starting with '#' marks faulty data
                if (part[0] == '#') {
                    is.setstate(std::ios::failbit);
                    break;
                }
                // add part
                parts.push_back(std::move(part));
            }
            // If all was OK
            if (is) {
                // If we have enough parts
                if (parts.size() == 8) {

                    // Convert parts to target data in record
                    r.id = std::strtol(parts[0].c_str(), nullptr, 10);

                    std::istringstream ss{parts[1]};
                    ss >> std::get_time(& r.time, "%d.%m.%Y  %H:%M");
                    if (ss.fail()) 
                        is.setstate(std::ios::failbit);

                    r.objectId = std::strtol(parts[2].c_str(), nullptr, 10);

                    r.userId = std::move(parts[3]);

                    r.area = std::move(parts[4]);

                    r.description = std::move(parts[5]);

                    r.comment = std::move(parts[6]);

                    r.checkSum = std::move(parts[7]);
                }
                else is.setstate(std::ios::failbit);
            }
        }
        return is;
    }
    // Simple inserter function
    friend std::ostream& operator << (std::ostream& os, const Record& r) {
        return os << r.id << "   "
#pragma warning(suppress : 4996)
            << std::put_time(&r.time, "%d.%m.%Y  %H:%M") << "   "  
            << r.objectId << "   " << r.userId << "   " << r.area << "   " << r.description << "   " << r.comment << "   " << r.checkSum;
    }
};

// Data will hold all records
struct Data {

    // Data part
    std::vector<Record> records{};

    // Constructor will reserve space to avoid reallocation
    Data() { records.reserve(MaxLines); }

    // Simple extractor. Will call Record's extractor
    friend std::istream& operator >> (std::istream& is, Data& d) {

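        // Set bigger file buffer. This is a time saver.
        // Note: the effect of pubsetbuf on an already opened file stream is
        // implementation-defined; some standard libraries ignore the call here.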
        is.rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
        std::copy(std::istream_iterator<Record>(is), {}, std::back_inserter(d.records));
        return is;
    }
    // Simple inserter
    friend std::ostream& operator << (std::ostream& os, const Data& d) {
        std::copy(d.records.begin(), d.records.end(), std::ostream_iterator<Record>(os, "\n"));
        return os;
    }

};

int main() {
    // createFile();

    auto start = std::chrono::system_clock::now();
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);

    if (std::ifstream ifs{ fileName }; ifs) {

        Data data;

        // Start time measurement
        start = std::chrono::system_clock::now();

        // Read and parse complete data
        ifs >> data;

        // End of time measurement
        elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);
        std::cout << "\nReading and splitting. Duration: " << elapsed.count() << " ms\n";

        // Some debug output
        std::cout << "\n\nNumber of read records:  " << data.records.size() << "\n\n";
        for (size_t k{}; k < std::min<size_t>(10u, data.records.size()); ++k)
            std::cout << data.records[k] << '\n';
    }
    else std::cerr << "\n*** Error: Could not open '" << fileName << "' for reading\n\n";
}

And yes, I used "ctime".
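
One caveat about the larger stream buffer: the standard leaves the effect of pubsetbuf on a file stream largely implementation-defined, and libstdc++, for example, only honors it while no file is open. A hedged sketch of the more portable order (set the buffer first, then open) is shown below. count_lines_buffered and the 1 MiB default size are assumptions made for this illustration.

```cpp
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical helper: count the lines of a file using a caller-sized buffer.
std::size_t count_lines_buffered(const std::string& path,
                                 std::size_t buffer_size = 1u << 20) {
    // The buffer is declared before the stream so that it outlives it.
    std::vector<char> buffer(buffer_size);

    std::ifstream in;
    // Install the buffer BEFORE opening the file.
    in.rdbuf()->pubsetbuf(buffer.data(),
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path);

    std::size_t lines = 0;
    std::string line;
    while (std::getline(in, line)) ++lines;  // yields 0 if the open failed
    return lines;
}
```

How much a larger buffer helps depends on the platform and the storage, so it is worth measuring with and without it.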

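【Comments】:

To you, this is no rocket science. I really appreciate both of your solutions. I learned a lot of new things from them. Thank you for the time you spent on me!

【Solution 2】: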

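I'm not sure exactly what the bottleneck in your program is (I copied your code from an earlier version of the question), but you have a lot of regexes and you mix reading records with post-processing. I suggest that you create a class to hold one of these records, called record, overload operator>> for record, and then use std::copy_if from the file with a filter that you can design separately from the reading. Do the post-processing after you have read the records that passed the filter.

I made a small test, and reading 200k records from an old spinning disk while filtering takes 2 seconds. I only used time_lower_bound and time_upper_bound for filtering. Additional checks will of course make it a little slower, but it should not take minutes.

Example: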

#include <algorithm>
#include <chrono>
#include <ctime>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <vector>

// the suggested class to hold a record
struct record {
    int ID;
    std::chrono::system_clock::time_point Timestamp;
    std::string ObjectID;
    std::string UserID;
    std::string Area;
    std::string Description;
    std::string Comment;
    std::string Checksum;
};

// A free function to read a time_point from an `istream`:
std::chrono::system_clock::time_point to_tp(std::istream& is, const char* fmt) {
    std::chrono::system_clock::time_point tp{};
    // C++20:
    // std::chrono::from_stream(is, tp, fmt, nullptr, nullptr);

    // C++11 to C++17 version:
    std::tm tmtp{};
    tmtp.tm_isdst = -1;
    if(is >> std::get_time(&tmtp, fmt)) {
        tp = std::chrono::system_clock::from_time_t(std::mktime(&tmtp));
    }
    return tp;
}

// The operator>> overload to read one `record` from an `istream`:
std::istream& operator>>(std::istream& is, record& r) {
    is >> r.ID;
    r.Timestamp = to_tp(is, ";%d.%m.%Y %H:%M;"); // using the helper function above
    std::getline(is, r.ObjectID, ';');
    std::getline(is, r.UserID, ';');
    std::getline(is, r.Area, ';');
    std::getline(is, r.Description, ';');
    std::getline(is, r.Comment, ';');
    std::getline(is, r.Checksum);
    return is;
}

// An operator<< overload to print one `record`:
std::ostream& operator<<(std::ostream& os, const record& r) {
    std::ostringstream oss;
    oss << r.ID;
    { // I only made a C++11 to C++17 version for this one:
        std::time_t time = std::chrono::system_clock::to_time_t(r.Timestamp);
        std::tm ts = *std::localtime(&time);
        oss << ';' << ts.tm_mday << '.' << ts.tm_mon + 1 << '.'
            << ts.tm_year + 1900 << ' ' << ts.tm_hour << ':' << ts.tm_min << ';';
    }
    oss << r.ObjectID << ';' << r.UserID << ';' << r.Area << ';'
        << r.Description << ';' << r.Comment << ';' << r.Checksum << '\n';
    return os << oss.str();
}

// The reading and filtering part of `main` would then look like this:
int main() { // not "void main()"
    std::istringstream time_lower_bound_s("20.05.2019 16:40:00");
    std::istringstream time_upper_bound_s("20.05.2021 09:40:00");

    // Your time boundaries as `std::chrono::system_clock::time_point`s - 
    // again using the `to_tp` helper function:
    auto time_lower_bound = to_tp(time_lower_bound_s, "%d.%m.%Y %H:%M:%S");
    auto time_upper_bound = to_tp(time_upper_bound_s, "%d.%m.%Y %H:%M:%S");

    // Verify that the boundaries were parsed ok:
    if(time_lower_bound == std::chrono::system_clock::time_point{} ||
       time_upper_bound == std::chrono::system_clock::time_point{}) {
        std::cerr << "failed to parse boundaries\n";
        return 1;
    }

    std::ifstream is("data"); // whatever your file is called
    if(is) {
        std::vector<record> recs; // a vector with all the records

        // create your filter
        auto filter = [&time_lower_bound, &time_upper_bound](const record& r) {
            // Only copy those `record`s within the set boundaries.
            // You can add additional conditions here too.
            return r.Timestamp >= time_lower_bound &&
                   r.Timestamp <= time_upper_bound;
        };

        // Copy those records that pass the filter:
        std::copy_if(std::istream_iterator<record>(is),
                     std::istream_iterator<record>{}, std::back_inserter(recs),
                     filter);

        // .. post process `recs` here ...

        // print result
        for(auto& r : recs) std::cout << r;
    }
}
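
The question also mentions filtering by a filter word in a specific column. As a hedged sketch of how such an extra condition might be added to the filter, the block below builds a predicate from the time bounds and an optional word. log_record is a trimmed, hypothetical stand-in for the record class, make_filter is a made-up name, and searching the Description column is an assumption. It needs C++14 for the return type deduction and the init-capture.

```cpp
#include <chrono>
#include <string>
#include <utility>

using TimePoint = std::chrono::system_clock::time_point;

// Hypothetical, trimmed record: only the fields this filter looks at.
struct log_record {
    TimePoint Timestamp;
    std::string Description;
};

// Builds a predicate usable with std::copy_if.
// An empty `word` means "filter by time frame only".
auto make_filter(TimePoint lower, TimePoint upper, std::string word) {
    return [lower, upper, word = std::move(word)](const log_record& r) {
        // Reject records outside the inclusive time frame first (cheap test).
        if (r.Timestamp < lower || r.Timestamp > upper) return false;
        // Then require the filter word, if one was given.
        return word.empty() || r.Description.find(word) != std::string::npos;
    };
}
```

The result of make_filter can be passed as the last argument of std::copy_if in place of the lambda shown in main.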

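【Comments】:

Thank you very much for your help! I just tried your solution. I can compile it, but I can't get any output into the recs vector. I can see that there is input coming from the ifstream. Now I wonder how the copying actually works with ifstreams, given that the inputs are the begin/end iterators, the back inserter, and the condition for copy_if. But I can't get it to copy anything, even with a plain copy.
@DoItWithFlow You're welcome! "... I can compile ..." - Did you need to make changes to it for it to compile? clang++ -Weverything -Wno-c++98-compat -Wno-missing-prototypes -Wno-padded compiles it cleanly for me. std::copy_if doesn't know that it's a stream. That's what the std::istream_iterators take care of. If you'd like, I can take a look at your implementation of my suggestion to see if I can spot the problem. Put it on pastebin.com or similar and I'll check it out.
I didn't need to make any changes for it to compile. This pastebin isn't my final implementation, since the final one has a lot of custom data types to comply with the SPS controller standard (nothing to worry about). pastebin.com/wbMdyQpC
@DoItWithFlow Every change you make to the types in the record class requires a corresponding change in the operator>> overload. Since you changed int ID to std::string RecordID;, you need to change is >> RecordID; to std::getline(is, r.RecordID, ';');, which also affects the format string on the next line: r.Timestamp = to_tp(is, "%d.%m.%Y %H:%M;"); (the initial ; removed). I noticed that you added " around the timestamp in the format string, but those aren't in the sample data you provided, so with these two changes, your std::copy copies all records.
Oh, okay. I got it working... I'm sorry for these basic questions. But I had to find out that the ODK-Server I'm using doesn't work with your solution for some reason (unfortunately there is very little documentation on the ODK-Server). I got it working completely for console output, but as soon as I implement the dll, the ODK-Server gives me an execution exception. I'll have to dig into this, or else find a different approach... I hope you have a nice day.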
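
For reference, the change described in the comments above (a std::string ID instead of an int) could look roughly like this. It is a sketch under assumptions: the asker's pastebin code is not reproduced here, string_id_record is a hypothetical name for the modified class, and to_tp is the C++11 to C++17 helper from the answer.

```cpp
#include <chrono>
#include <ctime>
#include <iomanip>
#include <istream>
#include <string>

// Hypothetical variant of `record` where the ID is kept as a string.
struct string_id_record {
    std::string RecordID;
    std::chrono::system_clock::time_point Timestamp;
    std::string ObjectID, UserID, Area, Description, Comment, Checksum;
};

// Same helper as in the answer (C++11 to C++17 version).
std::chrono::system_clock::time_point to_tp(std::istream& is, const char* fmt) {
    std::chrono::system_clock::time_point tp{};
    std::tm tmtp{};
    tmtp.tm_isdst = -1;
    if (is >> std::get_time(&tmtp, fmt)) {
        tp = std::chrono::system_clock::from_time_t(std::mktime(&tmtp));
    }
    return tp;
}

std::istream& operator>>(std::istream& is, string_id_record& r) {
    // std::getline consumes the ';' after the ID ...
    std::getline(is, r.RecordID, ';');
    // ... so the format string must no longer start with ';'.
    r.Timestamp = to_tp(is, "%d.%m.%Y %H:%M;");
    std::getline(is, r.ObjectID, ';');
    std::getline(is, r.UserID, ';');
    std::getline(is, r.Area, ';');
    std::getline(is, r.Description, ';');
    std::getline(is, r.Comment, ';');
    std::getline(is, r.Checksum);
    return is;
}
```

The general point is that every delimiter must be consumed exactly once, either by std::getline or by the format string, never by both.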