Initialize std::string and std::wstring from the same hard-coded string literals: answers

【Question title】: Initialize std::string and std::wstring from the same hard coded string literals
【Posted】: 2018-03-21 19:27:32
【Question】:

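While writing some unit tests, I stumbled upon a scenario that has already managed to annoy me a few times.

I need to generate some strings to test a JSON writer object. Since the writer supports both UTF-16 and UTF-8 input, I want to test it with both.

Consider the following test: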

#include <map>
#include <string>

// Tag types that select the source encoding.
class UTF8;
class UTF16;

template < typename String, typename SourceEncoding >
void writeJson(std::map<String, String> & data)
{
    // Write to file
}

void generateStringData(std::map<std::string, std::string> & data)
{
    data.emplace("Lorem", "Lorem Ipsum is simply dummy text of the printing and typesetting industry.");
    data.emplace("Ipsum", "Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book");
    data.emplace("Contrary", "Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old");
}

void generateStringData(std::map<std::wstring, std::wstring> & data)
{
    data.emplace(L"Lorem", L"Lorem Ipsum is simply dummy text of the printing and typesetting industry.");
    data.emplace(L"Ipsum", L"Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book");
    data.emplace(L"Contrary", L"Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old");
}

template < typename String, typename SourceEncoding >
void testWriter() {
    std::map<String, String> data;
    generateStringData(data);
    writeJson<String, SourceEncoding>(data);
}

int main() {
    testWriter<std::string, UTF8>();
    testWriter<std::wstring, UTF16>();
}

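I managed to wrap everything up nicely, except for the duplicated generateStringData() functions. I am wondering whether the two generateStringData() overloads can be combined into a single one.

I know I could generate the strings in UTF-8 with one function and then use another function to convert them to UTF-16, but I am trying to find out whether there is another way.

What have I considered/tried?

Using _T(), TCHAR, or #ifdef UNICODE won't help, since I need both flavors on the same platform that supports Unicode (e.g. Win >= 7).
Initializing a std::wstring from something that is not an L"" literal won't work, since it expects wchar_t.
Initializing character by character won't work either, since that also requires L''.
Using ""s won't work, since the return type depends on the charT type.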

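【Comments】:

UTF-8 and UTF-16 strings do not contain the same bytes for a given non-ASCII text. Your test cases contain only 7-bit ASCII, so they are useless.
@manni66, these are just dummy values; that is not the point of this question...
UTF-8 is recommended for reading and writing web-related data. If you must use UTF-16, write the file line by line in UTF-8 and run a final UTF-16 conversion over the whole file (but this will fail if the file is larger than 2 GB).
@BarmakShemirani, this is not web-related data. It is information obtained from Windows, where some APIs use wchar_t by default.
The preprocessor is the only feasible way, I'm afraid. I had a similar issue once.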

Tags: c++ string unicode

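【Answer 1】:

The short answer is no: you cannot merge the two generateStringData() implementations.

One needs to output char data and the other needs to output wchar_t data. You can use #define macros to reduce duplication of the common string literals in your code, but you would still need the L prefix in the wchar_t implementation, and preferably the u8 prefix in the char implementation (to ensure the data is actually UTF-8 rather than compiler-defined), so at runtime you still end up with separate strings in memory.

Even if you were to use templates to try to merge the two implementations, you would ultimately need template specialization to separate the two output types.

You are better off either keeping the overloads you already have (perhaps using #defines to reduce the duplication in code), or performing a UTF conversion at runtime (which you want to avoid). In the latter case, you can reduce the overhead for the test run by performing the conversions once at app startup and caching the results for reuse.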
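One possible way to use #defines as suggested here is an X-macro, sketched below. The macro names TEST_ENTRIES, EMPLACE_NARROW, and EMPLACE_WIDE are made up for illustration and are not part of the original answer. Each entry is written once as a narrow literal, and the wide overload pastes an L prefix onto it with the ## operator. The sketch leaves out the u8 prefix, because in C++20 a u8 literal has type const char8_t[] and no longer initializes a std::string directly.

```cpp
#include <map>
#include <string>

// Hypothetical X-macro: every test entry is listed exactly once.
#define TEST_ENTRIES(X)                                        \
    X("Lorem", "Lorem Ipsum is simply dummy text ...")         \
    X("Ipsum", "Ipsum has been the industry's standard ...")

// Emits the literals unchanged (narrow char literals).
#define EMPLACE_NARROW(key, value) data.emplace(key, value);
// Pastes an L prefix onto each literal, producing wchar_t literals.
#define EMPLACE_WIDE(key, value) data.emplace(L##key, L##value);

void generateStringData(std::map<std::string, std::string>& data) {
    TEST_ENTRIES(EMPLACE_NARROW)
}

void generateStringData(std::map<std::wstring, std::wstring>& data) {
    TEST_ENTRIES(EMPLACE_WIDE)
}
```

The two overloads still exist and the executable still holds separate narrow and wide literals, as the answer points out; only the duplication in the source text goes away.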

【Comments】:

【Answer 2】:

If you only need plain ASCII encoded as chars and wchar_ts, you can do it with a function template (no specialization needed):

#include <cstring>   // std::strlen
#include <iostream>
#include <map>
#include <string>
#include <utility>

template <typename StringType>
void generateStringData(std::map<StringType, StringType> &data) {
  static const std::pair<const char *, const char *> entries[] = {
    { "Lorem", "Lorem Ipsum is simply dummy text ..."},
    { "Ipsum", "Ipsum has been the industry's standard ..."}
  };
  for (const auto &entry : entries) {
    data.emplace(StringType(entry.first, entry.first + std::strlen(entry.first)),
                 StringType(entry.second, entry.second + std::strlen(entry.second)));
  }
}

int main() {
  std::map<std::string, std::string> ansi;
  generateStringData(ansi);
  std::map<std::wstring, std::wstring> wide;
  generateStringData(wide);

  std::cout << ansi["Lorem"] << std::endl;
  std::wcout << wide[L"Lorem"] << std::endl;
  return 0;
}

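This only works because the wchar_t version of any ASCII character is just the ASCII value widened to 16 bits (or to whatever width wchar_t has on the platform). If you have "interesting" characters in the source strings, this will not actually convert them to proper UTF-16.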
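To make the caveat concrete, here is a small sketch of the same per-character copy isolated into a helper. The name widenByCopy is hypothetical and only serves the illustration.

```cpp
#include <string>

// Hypothetical helper: the same element-by-element copy the template relies on.
// Each char is converted to wchar_t individually; no decoding takes place.
std::wstring widenByCopy(const std::string& narrow) {
    return std::wstring(narrow.begin(), narrow.end());
}
```

For "abc" the result equals L"abc". For the UTF-8 encoding of U+00E9, the two bytes "\xC3\xA9", the result has two elements, whereas the correct UTF-16 form is the single code unit 0x00E9.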

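Also note that you will almost certainly end up with four copies of the strings in memory: two copies of the ASCII source strings in the executable (from the two instantiations of the function template), plus the char and wchar_t copies on the heap.

But that is probably no worse than the preprocessor version. With the preprocessor, you will likely end up with both the char and wchar_t versions in the executable, plus the char and wchar_t copies on the heap.

What the preprocessor approach can do is get you around that big "if" at the top of this answer; with the preprocessor, you can use non-ASCII characters.

[Implementation note: originally these assignments used std::begin(entry.first) and std::end(entry.first), but that included the string terminator as part of the string itself.]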
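The terminator issue from the implementation note can be reproduced with a sketch like the following, assuming the literal is held as a char array (std::begin and std::end need an array, not a plain pointer). Both helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstring>
#include <iterator>
#include <string>

// Hypothetical helper: copies the whole array, including the trailing '\0'.
template <std::size_t N>
std::string fromArrayRange(const char (&literal)[N]) {
    return std::string(std::begin(literal), std::end(literal));
}

// Hypothetical helper: stops before the terminator, as the answer's code does.
template <std::size_t N>
std::string fromStrlenRange(const char (&literal)[N]) {
    return std::string(literal, literal + std::strlen(literal));
}
```

With the literal "abc", the first helper returns a string of size 4 whose last element is '\0', while the second returns a string of size 3.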

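【Comments】:

This surprised me... Before reading this, I thought a const char* literal could never be used to initialize a wstring...
@Edityouprofile: It does not actually initialize the wstring from a const char *. It copies individual character values from a range of chars and relies on implicit conversion to turn them into wchar_ts (or whatever the target character type is). For true 7-bit ASCII values, the implicit conversion is correct.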
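Since the implicit conversion is only correct for 7-bit ASCII, a test helper could refuse anything else. The sketch below is one way to do that; widenAscii is a hypothetical name, and throwing std::invalid_argument is an arbitrary choice made for the example.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: copies chars into the target string type,
// but rejects any byte outside the 7-bit ASCII range first.
template <typename StringType>
StringType widenAscii(const std::string& source) {
    for (unsigned char c : source) {
        if (c > 0x7F) {
            throw std::invalid_argument("widenAscii: non-ASCII input");
        }
    }
    // Per-character implicit conversion, safe here because every value is <= 0x7F.
    return StringType(source.begin(), source.end());
}
```

Calling widenAscii<std::wstring>("Lorem") gives L"Lorem", and passing a string that contains UTF-8 multi-byte sequences throws instead of silently producing wrong code units.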