Do serialization with C++ fstream

【Question title】: Do serialization with C++ fstream
【Posted】: 2016-11-30 10:03:39
【Question】:

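I am trying to do serialization with fstream. The stream syntax is "index length data index length data ...", with no separators, for example 11c22cc33ccc. When reading the file back, the input stream reads the whole "11" as the index.

The index is within [1, INT_MAX]. The length is limited to 516.

Can I do this without a delimiter (e.g. "@" or "#") between the index and the length?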

#include <fstream>
#include <iostream>

int main() {
  std::ofstream ofs;
  ofs.open("myfile.txt", std::ofstream::out | std::ofstream::trunc);
  for(int i = 1; i <= 10; ++i) {
    ofs << i; // for index
    ofs << i; // for length
    for (int j = 0; j < i; ++j) ofs << 'c';
  }
  ofs.close();
  std::ifstream ifs;
  ifs.open("myfile.txt", std::ifstream::in);
  for (int i = 0; !ifs.eof() && ifs.good(); ++i) {
    int index = 0, length = 0;
    ifs >> index;   // consumes every consecutive digit, so "11" is read as index 11
    ifs >> length;
    std::cout << "index is " << index << ", length is " << length << std::endl;
    // Jump to the next entry
    ifs.seekg(length, std::ios_base::cur);
  }
}

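【Comments】:

Only if your index is a single digit. You need a fixed-length index and a fixed-length length field, padded with zeros.
No, the index is not a single digit. It can be any integer greater than 0.
If IndexLength contains 1234, what are the values of Index and Length: 1 and 234, 12 and 34, or 123 and 4?
@DavidThomas, how do I handle the fixed length? With setw()?
@pepero "Because I want to read/write it to a file": I don't see what that has to do with simply reading the data into a string, parsing it, and writing whatever you want back to the file.

Tags: c++ serialization io stream std

【Solution 1】: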

标签： c++ serialization io stream std

【解决方案1】：

Yes, if you use a fixed-size format. With 10 characters for the index and 3 characters for the length, your example would be encoded as:
"         1  1c         2  2cc         3  3ccc".

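A minimal sketch of that fixed-width layout, using std::setw as asked about in the comments. The names writeRecord, readField and readRecord are hypothetical, and the widths 10 and 3 are simply the ones quoted above; the reader pulls exactly that many characters per field instead of relying on operator>> to find the end of a number.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Field widths from the answer: 10 characters hold any index up to
// INT_MAX (2147483647) and 3 characters hold any length up to 516.
constexpr int kIndexWidth = 10;
constexpr int kLengthWidth = 3;

// Writes one record as <index:10><length:3><data>, space-padded on the left.
void writeRecord(std::ostream& out, int index, const std::string& data) {
    assert(index >= 1 && data.size() <= 999);  // must fit the fixed widths
    out << std::setw(kIndexWidth) << index
        << std::setw(kLengthWidth) << data.size()
        << data;
}

// Reads exactly `width` characters and parses them as a decimal number.
// Returns false if the stream ends early or the field is not a number.
bool readField(std::istream& in, int width, int& value) {
    std::string field(static_cast<std::size_t>(width), '\0');
    if (!in.read(&field[0], width)) return false;
    std::istringstream parser(field);
    return static_cast<bool>(parser >> value);
}

// Reads one record; returns false at end of input or on a malformed record.
bool readRecord(std::istream& in, int& index, std::string& data) {
    int length = 0;
    if (!readField(in, kIndexWidth, index)) return false;
    if (!readField(in, kLengthWidth, length) || length < 0) return false;
    data.assign(static_cast<std::size_t>(length), '\0');
    if (length == 0) return true;
    return static_cast<bool>(in.read(&data[0], length));
}
```

Because the data is read by length rather than with operator>>, it may contain digits or spaces without confusing the parser.
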
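You also mention fstream, but it looks like you are after text (human-readable) serialization rather than binary serialization. If that is the case but you don't need a truly human-readable form, you can tag the first byte of the length with a spare bit. Digits in ASCII are encoded as the values 0x30 to 0x39, so you can, for example, set the 0x40 bit without destroying the digit itself. Your example would then look like this: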
1qc2rcc3sccc (q = 0x71 = 0x40|0x31 = 0x40|'1')

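For longer values it would look like: 113q00123456789 ... ARGH, I wanted to serialize the 10-character string "0123456789", and look what happened: I get length 100 instead of 10 (or worse, 100123456789 if you don't limit it). So both the start and the end of the length have to be tainted somehow, perhaps using bit 0x80 to mark the end of the length.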
1\361c2\362cc3\363ccc (\361 = 0xF1 = 0x40|0x80|0x31 = 0x40|0x80|'1')

Second attempt at the long value:
113q°0123456789 (index 113, length 10, data "0123456789", q = 0x40|'1', ° = 0x80|'0').
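
The tagging scheme described above could be sketched as follows. This is only an illustration under stated assumptions: writeTagged and readTagged are hypothetical names, the index is written as plain decimal digits, the first length digit gets 0x40 and the last gets 0x80 (both on the same byte for a one-digit length), and overflow of very long digit runs is not checked.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

constexpr unsigned char kLengthStart = 0x40;  // set on the first length digit
constexpr unsigned char kLengthEnd = 0x80;    // set on the last length digit

// Writes <index digits><tagged length digits><data> with no separators.
void writeTagged(std::ostream& out, unsigned index, const std::string& data) {
    assert(index >= 1);  // the question restricts the index to [1, INT_MAX]
    std::string length = std::to_string(data.size());
    length.front() = static_cast<char>(
        static_cast<unsigned char>(length.front()) | kLengthStart);
    length.back() = static_cast<char>(
        static_cast<unsigned char>(length.back()) | kLengthEnd);
    out << index << length << data;
}

// Reads one record; returns false at end of input or on a malformed header.
bool readTagged(std::istream& in, unsigned& index, std::string& data) {
    index = 0;
    std::size_t length = 0;
    bool haveIndex = false;
    bool inLength = false;
    char raw = 0;
    while (in.get(raw)) {
        const unsigned char byte = static_cast<unsigned char>(raw);
        const unsigned digit = byte & 0x0Fu;
        // Every header byte must be an ASCII digit, possibly with tag bits set.
        if ((byte & 0x30u) != 0x30u || digit > 9) return false;
        if (!inLength && (byte & kLengthStart)) {
            if (!haveIndex) return false;  // length started before any index digit
            inLength = true;
        }
        if (inLength) {
            length = length * 10 + digit;
            if (byte & kLengthEnd) {  // last length digit: the data follows
                data.assign(length, '\0');
                return length == 0 || static_cast<bool>(in.read(
                    &data[0], static_cast<std::streamsize>(length)));
            }
        } else {
            if (byte & kLengthEnd) return false;  // end tag without a start tag
            index = index * 10 + digit;
            haveIndex = true;
        }
    }
    return false;  // stream ended before a complete header was read
}
```

With these assumptions, index 1 with data "c" is written as the three bytes '1', 0xF1, 'c', and index 113 with data "0123456789" starts with 113q followed by the byte 0xB0.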

Don't you want a binary form instead? It would be shorter.

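By the way, if you don't mind tainting the values but want to stay within 7-bit ASCII, you can tag the end of the index and the end of the length (rather than the start and end of the length), and then 0x40 alone is enough. So 11c becomes qqc, and 113 10 0123456789 becomes 11s1p0123456789.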
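
A rough sketch of this 7-bit variant, where the last digit of both the index and the length carries the 0x40 bit. The function names are hypothetical, the 516 limit in the assertion comes from the question, and overflow of long digit runs is not checked.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

constexpr unsigned char kEndTag = 0x40;  // marks the last digit of a number

// Returns the decimal digits of `value` with 0x40 OR-ed into the last one,
// e.g. 113 -> "11s" because '3' | 0x40 == 's'.
std::string tagLastDigit(unsigned value) {
    std::string digits = std::to_string(value);
    digits.back() = static_cast<char>(digits.back() | kEndTag);
    return digits;
}

// Writes <tagged index><tagged length><data>; the header stays 7-bit ASCII.
void writeAscii(std::ostream& out, unsigned index, const std::string& data) {
    assert(index >= 1 && data.size() <= 516);  // limits stated in the question
    out << tagLastDigit(index)
        << tagLastDigit(static_cast<unsigned>(data.size()))
        << data;
}

// Reads digits up to and including the one tagged with 0x40.
bool readTaggedNumber(std::istream& in, unsigned& value) {
    value = 0;
    char raw = 0;
    while (in.get(raw)) {
        const unsigned char byte = static_cast<unsigned char>(raw);
        const unsigned digit = byte & 0x0Fu;
        // Accept only '0'..'9' (0x30..0x39) or the tagged forms 'p'..'y' (0x70..0x79).
        if ((byte & 0xB0u) != 0x30u || digit > 9) return false;
        value = value * 10 + digit;
        if (byte & kEndTag) return true;  // the tagged digit closes the number
    }
    return false;  // stream ended in the middle of a number
}

// Reads one record; returns false at end of input or on a malformed header.
bool readAscii(std::istream& in, unsigned& index, std::string& data) {
    unsigned length = 0;
    if (!readTaggedNumber(in, index) || !readTaggedNumber(in, length)) return false;
    data.assign(length, '\0');
    return length == 0 || static_cast<bool>(in.read(&data[0], length));
}
```

The data bytes themselves are never inspected, so they may contain digits or tagged-looking characters without ambiguity.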

Binary write/read with a platform-independent byte order (i.e. a file written on a little-endian platform will also work on a big-endian one):

#include <iostream>
#include <cstddef>
#include <cstdint>
#include <vector>

/**
 * Writes index+length+data in binary form to "out" stream.
 * 
 * Returns number of bytes written to out stream.
 * 
 * Does no data validation (the variable types are only limits for input data).
 * 
 * writeData and readData work in an endianness-agnostic way,
 * so a file saved on a big-endian platform is restored correctly on a little-endian one.
 **/
size_t writeData(std::ostream & out,
        const uint32_t index, const uint16_t length, const uint8_t *data) {
    // Write index and length byte by byte (least significant first), independent of host endianness.
    out.put((char)((index>>0)&0xFF));
    out.put((char)((index>>8)&0xFF));
    out.put((char)((index>>16)&0xFF));
    out.put((char)((index>>24)&0xFF));
    out.put((char)((length>>0)&0xFF));
    out.put((char)((length>>8)&0xFF));
    // If any data, write them to stream
    if (0 < length) out.write(reinterpret_cast<const char *>(data), length);
    return 4 + 2 + length;
}

/**
 * Reads data from the "in" stream into the variables index, length and data.
 * 
 * If "in" doesn't contain enough bytes for the index+length header, zero index/length is returned.
 * 
 * If "in" contains the full header but the data is shorter than length,
 * the shorter data is returned with a "repaired" length (the actual count, not the stored one).
 **/
void readData(std::istream & in,
        uint32_t & index, uint16_t & length, std::vector<uint8_t> & data) {
    // clear current values in index, length, data
    index = length = 0; data.clear();
    // read index+length header from stream
    uint8_t buffer[6];
    in.read(reinterpret_cast<char *>(buffer), 6);
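    if (6 != in.gcount()) return;   // header data (index+length) not found
    // Reassemble the bytes into index/length numbers in host endianness.
    // Cast before shifting: uint8_t promotes to int, and shifting a byte >= 0x80 by 24 overflows int.
    index = (static_cast<uint32_t>(buffer[0]) << 0) | (static_cast<uint32_t>(buffer[1]) << 8)
          | (static_cast<uint32_t>(buffer[2]) << 16) | (static_cast<uint32_t>(buffer[3]) << 24);
    length = static_cast<uint16_t>(buffer[4] | (buffer[5] << 8));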
    if (0 == length) return;    // zero length, nothing more to read
    // Read the binary data of expected length
    data.resize(length);  // reserve memory for read
    in.read(reinterpret_cast<char *>(data.data()), length);
    if (length != in.gcount()) {    // data read didn't have expected length, damaged file?
        // TODO you may want to handle damaged data in other way, like returning index 0
        // This code will simply accept shorter data, and "repair" length
        length = static_cast<uint16_t>(in.gcount());  // gcount() < length here, so it fits
        data.resize(length);
    }
}

To see it in action, you can try it on cpp.sh.
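
The byte-by-byte shifts in writeData and readData can also be factored into two small helpers and checked with an in-memory stream. This is a sketch rather than part of the answer's code: putLittleEndian and getLittleEndian are hypothetical names, and the least-significant-byte-first order simply mirrors what writeData does.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Writes the low `bytes` bytes of `value`, least significant byte first,
// so the output does not depend on the host's byte order.
void putLittleEndian(std::ostream& out, std::uint32_t value, int bytes) {
    assert(bytes >= 1 && bytes <= 4);
    for (int i = 0; i < bytes; ++i) {
        out.put(static_cast<char>((value >> (8 * i)) & 0xFFu));
    }
}

// Reads `bytes` bytes and reassembles them into a number.
// Returns false if the stream ends before all bytes are read.
bool getLittleEndian(std::istream& in, int bytes, std::uint32_t& value) {
    assert(bytes >= 1 && bytes <= 4);
    value = 0;
    for (int i = 0; i < bytes; ++i) {
        char raw = 0;
        if (!in.get(raw)) return false;
        // Widen to uint32_t before shifting to avoid overflowing a signed int.
        value |= static_cast<std::uint32_t>(static_cast<unsigned char>(raw)) << (8 * i);
    }
    return true;
}
```

Writing a 4-byte index and a 2-byte length this way produces the same 6-byte header layout that readData expects.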

【Comments】:

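I'm a bit curious why you don't simply add a delimiter (a space)... then you would have 1 1 c2 2 cc3 3 ccc. Just make sure you read the content data by its full length, in case there are strings containing spaces, e.g. 56 11 hello world.
Yes, I just wanted to save some space. It doesn't have to be a text file. A binary format is fine. It seems these
@pepero: by the way, this answer is "mocking" you a bit. What you are asking for doesn't make much sense. If you want to save space, just add the 7zip library to your project and run the stream through compression (although from the question it feels like that may be a bit complicated for you, but how much effort you put into learning it is a different story). I would go for binary serialization if the output doesn't need to be human readable and the index and length are usually longer than 1-2 chars (a 32-bit int is 4 bytes long). So this answer just shows that it can be done without delimiters.
@pepero I added a binary example. Running it through compression is still a good idea to reduce the size further. If you are curious what the 0x40 tagging code would look like, I can write the interesting parts of it too, but only if you think it would teach you something and you will study it (writing it just for show, without a good reason, is a waste of time).
I'm wondering how to mark the end of the length. Since the data part that follows can contain anything, the length probably has to be fixed. For example, length 1~9: 0xF1~0xF9; length 10~99: 0x71 0xF0 ~ 0x79 0xF9; length 100~999: 0x71 0x70 0xF0 ~ 0x79 0x79 0xF9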