How can I handle special characters in a C++ program? Answers

【Question title】: How can I handle special characters in a C++ program?
【Posted】: 2014-11-14 10:57:30
【Question description】:

I am trying to read several text files from a folder and store the start position of each word. I am using Boost to strip punctuation from the text.

I run into a problem when a word contains special characters such as Õ, Ø, or æ. In that case I get an error message: "Expression: (unsigned)(c+1)".

Here is the code of the application:

#include "stdafx.h"
#include <iostream>
#include <fstream>
#include<iterator>
#include<string>
#include "/../dirent.h/dirent.h"
#include <boost/tokenizer.hpp>

using namespace std;
using namespace boost;

int main() {

    DIR*     dir;
    dirent*  pdir;

    dir = opendir("D:/../dataset/"); 

    int number_of_words=0;
    int text_length = 30;
    char filename[300];
    int i=0;
    while (pdir = readdir(dir)) 
    {
        string fileString;

        cout<<"-------------------------------------------"<<endl;
        cout<<"Name of text file: "<<pdir->d_name << endl;
        strcpy(filename, "D:/.../dataset/");
        strcat(filename, pdir->d_name);
        ifstream file(filename);
        std::istream_iterator<std::string> beg(file), end;

        number_of_words = distance(beg,end);

        //cout<<"Number of words in file: "<<number_of_words<<endl;
        ifstream files(filename);
         //char output[200];

         if (file.is_open()) 
         {

             string output;

             while (!files.eof())
             {

                    files >> output;
                    fileString += " ";
                    fileString += output;
                    //cout<<output<<endl;

             }
             //cout<<fileString<<endl;
             cout<<"Number of characters: "<<fileString.size()<<endl;
             cout<<"-------------------------------------------"<<endl;


            string fileStringTokenized;
            tokenizer<>tok (fileString);

            int indice_cuvant_curent = 0;
            int index = 0;
            vector<int> myvector;

            for(tokenizer<>::iterator beg=tok.begin(); beg!=tok.end(); ++beg)
            {
                string currentWord;
                currentWord = *beg;

                myvector.push_back(index);
                index+=currentWord.size();
                //cout<<index<<"\t";

                //cout<<*beg<<endl;
                fileStringTokenized += *beg;
            }

         }
         file.close();
    }
    closedir(dir);
    return 0;
}

Why does this problem occur, and how can I fix it?

【Comments on the question】:

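Use Unicode? Create a minimal example? I would gladly copy and paste it onto my machine and put an example together, but I don't have time to strip down your code and do your work for you.
Maybe try std::wstring (a wide string). Also: don't use while (!files.eof()) like that; use while (files >> output) instead. See here.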
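The read-loop advice in the comment above can be sketched as follows. This is a minimal illustration rather than the poster's code: the helper name read_words is hypothetical, and it takes any std::istream so it works with both files and in-memory streams.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: collects whitespace-separated words from a stream.
// The extraction itself is the loop condition, so the body only runs when
// a word was actually read. A `while (!in.eof())` loop would instead run
// once more after the last word whenever trailing whitespace follows it,
// appending that last word a second time.
std::vector<std::string> read_words(std::istream& in)
{
    std::vector<std::string> words;
    std::string word;
    while (in >> word) {
        words.push_back(word);
    }
    return words;
}
```

For instance, passing a std::istringstream that holds "alpha beta\n" yields exactly two words, whereas the eof-based loop from the question would append "beta" twice.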

Tags: c++ string boost

【Solution 1】:

Something like this should work:

#include <iostream>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

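// Alias declarations (`using X = ...;`) require C++11. On older compilers
// such as Visual Studio 2010, write `typedef std::wstring String;` instead.
using String = std::wstring;
// Tokenizer over wide characters: separator type, iterator type, token type.
using Tokenizer = boost::tokenizer< boost::char_delimiters_separator<String::value_type>, String::const_iterator, String>;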
int main()
{
    String str(L"Õ, Ø, æ");
    Tokenizer tok (str);

    for(Tokenizer::iterator beg=tok.begin(); beg!=tok.end(); ++beg)
    {
        std::wcout << (*beg) << L'\n';
    }
}

It uses a tokenizer that operates on wide characters.
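Since the question also asks for the start position of each word, here is a minimal sketch of the same wide-string idea without Boost. It is not the answerer's method: the helper name word_starts and the default delimiter set are assumptions chosen for illustration, and it uses only std::wstring search functions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the offset (in wchar_t units) at which each
// word begins. A "word" is a maximal run of characters not in `delims`.
// The default delimiter set is an assumption; extend it as needed.
std::vector<std::size_t> word_starts(const std::wstring& text,
                                     const std::wstring& delims = L" \t\r\n.,;:!?")
{
    std::vector<std::size_t> starts;
    std::size_t pos = text.find_first_not_of(delims);
    while (pos != std::wstring::npos) {
        starts.push_back(pos);                             // a word begins here
        std::size_t end = text.find_first_of(delims, pos); // first delimiter after it
        if (end == std::wstring::npos) {
            break;                                         // word runs to the end
        }
        pos = text.find_first_not_of(delims, end);         // skip the delimiter run
    }
    return starts;
}
```

Because the delimiters are compared as plain wide characters, no character classification function is involved, so characters such as Õ, Ø, and æ are simply treated as part of a word. The offsets refer to positions in the original string, delimiters included.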

【Comments】:

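I get errors when compiling: identifier "String" is undefined; identifier "Tokenizer" is undefined.
Are you using C++11?
Is there such a thing?
I am using Visual Studio 2010.
I modified the program a bit so that I no longer use the String and Tokenizer aliases directly, and now it works. Thank you very much!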

【Solution 2】:

Use UTF-16 strings; that will help you solve the problem.

【Comments】:

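Better yet, convert whatever comes in to UTF-8, and convert back to whatever the system expects when printing. You can then keep using the ordinary string functions (except for counting characters). UTF-16 is the worst choice: it combines the disadvantages of UTF-32 and UTF-8.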
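The UTF-8 suggestion in this comment can be sketched as below. The helper names utf8_length and strip_ascii_punct are hypothetical, and the input is assumed to be valid UTF-8 (nothing is validated). The cast to unsigned char reflects an assumption about the original error: the assertion text in the question is consistent with a negative char value reaching a character classification function such as std::ispunct, which is undefined behavior, but the source does not show where the assertion fires.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a UTF-8 string by counting
// every byte that is not a continuation byte (bit pattern 10xxxxxx).
// Assumes valid UTF-8 input; no validation is performed.
std::size_t utf8_length(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: replaces ASCII punctuation with spaces and leaves
// the bytes of multi-byte sequences untouched. Each char is converted to
// unsigned char first, so std::ispunct never receives a negative value.
std::string strip_ascii_punct(std::string s)
{
    for (std::size_t i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if (byte < 0x80 && std::ispunct(byte)) {
            s[i] = ' ';
        }
    }
    return s;
}
```

For the UTF-8 encoding of "Õ, Ø, æ", size() reports 10 bytes while utf8_length reports 7 code points, which is the "except for counting characters" caveat in practice.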