Conversion from iso-8859-15 (Latin9) to UTF-8?

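【Question title】: Conversion from iso-8859-15 (Latin9) to UTF-8?
【Posted】: 2012-06-30 18:18:25
【Question description】:

I need to convert some strings encoded in the Latin9 character set to UTF-8. I cannot use iconv because it is not included in my embedded system. Do you know of any code available for this?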

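【Question discussion】:

The best way to find out whether it will work is to read iconv's requirements and your embedded system's capabilities, and cross-check them. You have not even mentioned which embedded system and which compiler you use, so we cannot tell you much.
By the way, since this question is one of the first search results: if you are working in a JavaScript environment, I highly recommend the iso-8859-15 package written by Mathias Bynens (npmjs.com/package/iso-8859-15).

Tags: c linux utf-8 transcoding latin9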

【Solution 1】:

Code points 1 to 127 are encoded identically in Latin-9 (ISO-8859-15) and UTF-8.

Code point 164 in Latin-9 is U+20AC, which is \xe2\x82\xac = 226 130 172 in UTF-8.
Code point 166 in Latin-9 is U+0160, which is \xc5\xa0 = 197 160 in UTF-8.
Code point 168 in Latin-9 is U+0161, which is \xc5\xa1 = 197 161 in UTF-8.
Code point 180 in Latin-9 is U+017D, which is \xc5\xbd = 197 189 in UTF-8.
Code point 184 in Latin-9 is U+017E, which is \xc5\xbe = 197 190 in UTF-8.
Code point 188 in Latin-9 is U+0152, which is \xc5\x92 = 197 146 in UTF-8.
Code point 189 in Latin-9 is U+0153, which is \xc5\x93 = 197 147 in UTF-8.
Code point 190 in Latin-9 is U+0178, which is \xc5\xb8 = 197 184 in UTF-8.

All other code points 128 .. 191 in Latin-9 (those not listed above) map to \xc2\x80 .. \xc2\xbf = 194 128 .. 194 191 in UTF-8.

Code points 192 .. 255 in Latin-9 all map to \xc3\x80 .. \xc3\xbf = 195 128 .. 195 191 in UTF-8.

This means that Latin-9 code points 1..127 are one byte long in UTF-8, code point 164 is three bytes long, and the rest (128..163 and 165..255) are two bytes long.
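The byte sequences listed above follow from applying the standard UTF-8 encoding rules to each Unicode code point. As a minimal sketch (the names latin9_to_codepoint() and utf8_encode_bmp() are hypothetical and not part of the answer's code), the mapping can be split into two small steps:

```c
#include <stddef.h>

/* Map one Latin-9 (ISO-8859-15) byte to its Unicode code point.
 * Only eight bytes differ from their own numeric value. */
unsigned int latin9_to_codepoint(unsigned char c)
{
    switch (c) {
    case 164: return 0x20AC;
    case 166: return 0x0160;
    case 168: return 0x0161;
    case 180: return 0x017D;
    case 184: return 0x017E;
    case 188: return 0x0152;
    case 189: return 0x0153;
    case 190: return 0x0178;
    default:  return c;
    }
}

/* Encode a code point below U+10000 as UTF-8 into out[] (at least 3 bytes).
 * Returns the number of bytes written: 1, 2 or 3. */
size_t utf8_encode_bmp(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

Feeding byte 164 through both functions yields the three bytes 226 130 172, and byte 192 yields 195 128, matching the list above.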

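If you first scan the Latin-9 input string, you can determine the length of the resulting UTF-8 string. If you want or need to (you are working on an embedded system, after all), you can then do the conversion in place, working backwards from the end towards the beginning, provided the buffer is large enough to hold the result.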
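The in-place idea can be sketched as follows. This is only an illustration under stated assumptions: the function name latin9_to_utf8_inplace() and the capacity parameter (total size of the buffer in bytes) are hypothetical, and the caller is assumed to own a writable buffer with spare room after the string.

```c
#include <stddef.h>

/* Convert the NUL-terminated Latin-9 string in buf to UTF-8 in place.
 * Returns the UTF-8 length (excluding the NUL), or (size_t)-1 if the
 * result would not fit into capacity bytes (buf is then left unchanged). */
size_t latin9_to_utf8_inplace(char *buf, size_t capacity)
{
    unsigned char *p = (unsigned char *)buf;
    size_t in_len = 0, out_len = 0;
    size_t s, d;

    /* Pass 1: measure the input and the UTF-8 output. */
    while (p[in_len]) {
        unsigned char c = p[in_len++];
        out_len += (c < 128) ? 1 : (c == 164) ? 3 : 2;
    }
    if (out_len + 1 > capacity)
        return (size_t)-1;

    /* Pass 2: walk backwards. The write index d never drops below the
     * read index s, so unread input is never overwritten. */
    s = in_len;
    d = out_len;
    p[d] = 0;
    while (s > 0) {
        unsigned int cp = p[--s];

        switch (cp) {
        case 164: cp = 0x20AC; break;
        case 166: cp = 0x0160; break;
        case 168: cp = 0x0161; break;
        case 180: cp = 0x017D; break;
        case 184: cp = 0x017E; break;
        case 188: cp = 0x0152; break;
        case 189: cp = 0x0153; break;
        case 190: cp = 0x0178; break;
        default: break;
        }

        if (cp < 0x80) {
            p[--d] = (unsigned char)cp;
        } else if (cp < 0x800) {
            p[--d] = (unsigned char)(0x80 | (cp & 0x3F));
            p[--d] = (unsigned char)(0xC0 | (cp >> 6));
        } else {
            p[--d] = (unsigned char)(0x80 | (cp & 0x3F));
            p[--d] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            p[--d] = (unsigned char)(0xE0 | (cp >> 12));
        }
    }
    return out_len;
}
```

Working backwards is what makes this safe: every Latin-9 byte expands to at least one UTF-8 byte, so the output for the remaining prefix always ends at or after the byte being read.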

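Edit:

Here are two functions you can use for the conversion. They return a dynamically allocated copy, which you need to free() after use. They return NULL only when an error occurs (out of memory, errno == ENOMEM). If given NULL or an empty string to convert, the functions return a dynamically allocated empty string.

In other words, you should always call free() on the pointer returned by these functions when you are done with it. (free(NULL) is allowed and does nothing.)

latin9_to_utf8() has been verified to produce exactly the same output as iconv, provided the input contains no zero bytes. The function uses standard C strings, i.e. a zero byte marks the end of the string.

utf8_to_latin9() has been verified to produce exactly the same output as iconv if the input contains only Unicode code points that are also in ISO-8859-15, and no zero bytes. When given arbitrary UTF-8 strings, the function maps eight Latin-1 code points to their Latin-9 equivalents, for example the currency sign to the euro; iconv either ignores them or treats them as errors.

This utf8_to_latin9() behaviour means that the functions are suitable for both Latin 1->UTF-8->Latin 1 and Latin 9->UTF-8->Latin9 round trips.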

#include <stdlib.h>     /* for malloc(), realloc() and free() */
#include <string.h>     /* for memset() */
#include <errno.h>      /* for errno */

/* Create a dynamically allocated copy of string,
 * changing the encoding from ISO-8859-15 to UTF-8.
*/
char *latin9_to_utf8(const char *const string)
{
    char   *result;
    size_t  n = 0;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s)
            if (*s < 128) {
                s++;
                n += 1;
            } else
            if (*s == 164) {
                s++;
                n += 3;
            } else {
                s++;
                n += 2;
            }
    }

    /* Allocate n+1 (to n+7) bytes for the converted string. */
    result = malloc((n | 7) + 1);
    if (!result) {
        errno = ENOMEM;
        return NULL;
    }

    /* Clear the tail of the string, setting the trailing NUL. */
    memset(result + (n | 7) - 7, 0, 8);

    if (n) {
        const unsigned char  *s = (const unsigned char *)string;
        unsigned char        *d = (unsigned char *)result;

        while (*s)
            if (*s < 128) {
                *(d++) = *(s++);
            } else
            if (*s < 192) switch (*s) {
                case 164: *(d++) = 226; *(d++) = 130; *(d++) = 172; s++; break;
                case 166: *(d++) = 197; *(d++) = 160; s++; break;
                case 168: *(d++) = 197; *(d++) = 161; s++; break;
                case 180: *(d++) = 197; *(d++) = 189; s++; break;
                case 184: *(d++) = 197; *(d++) = 190; s++; break;
                case 188: *(d++) = 197; *(d++) = 146; s++; break;
                case 189: *(d++) = 197; *(d++) = 147; s++; break;
                case 190: *(d++) = 197; *(d++) = 184; s++; break;
                default:  *(d++) = 194; *(d++) = *(s++); break;
            } else {
                *(d++) = 195;
                *(d++) = *(s++) - 64;
            }
    }

    /* Done. Remember to free() the resulting string when no longer needed. */
    return result;
}

/* Create a dynamically allocated copy of string,
 * changing the encoding from UTF-8 to ISO-8859-15.
 * Unsupported code points are ignored.
*/
char *utf8_to_latin9(const char *const string)
{
    size_t         size = 0;
    size_t         used = 0;
    unsigned char *result = NULL;

    if (string) {
        const unsigned char  *s = (const unsigned char *)string;

        while (*s) {

            if (used >= size) {
                void *const old = result;

                size = (used | 255) + 257;
                result = realloc(result, size);
                if (!result) {
                    if (old)
                        free(old);
                    errno = ENOMEM;
                    return NULL;
                }
            }

            if (*s < 128) {
                result[used++] = *(s++);
                continue;

            } else
            if (s[0] == 226 && s[1] == 130 && s[2] == 172) {
                result[used++] = 164;
                s += 3;
                continue;

            } else
            if (s[0] == 194 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1];
                s += 2;
                continue;

            } else
            if (s[0] == 195 && s[1] >= 128 && s[1] <= 191) {
                result[used++] = s[1] + 64;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 160) {
                result[used++] = 166;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 161) {
                result[used++] = 168;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 189) {
                result[used++] = 180;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 190) {
                result[used++] = 184;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 146) {
                result[used++] = 188;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 147) {
                result[used++] = 189;
                s += 2;
                continue;

            } else
            if (s[0] == 197 && s[1] == 184) {
                result[used++] = 190;
                s += 2;
                continue;

            }

            if (s[0] >= 192 && s[0] < 224 &&
                s[1] >= 128 && s[1] < 192) {
                s += 2;
                continue;
            } else
            if (s[0] >= 224 && s[0] < 240 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192) {
                s += 3;
                continue;
            } else
            if (s[0] >= 240 && s[0] < 248 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192) {
                s += 4;
                continue;
            } else
            if (s[0] >= 248 && s[0] < 252 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192) {
                s += 5;
                continue;
            } else
            if (s[0] >= 252 && s[0] < 254 &&
                s[1] >= 128 && s[1] < 192 &&
                s[2] >= 128 && s[2] < 192 &&
                s[3] >= 128 && s[3] < 192 &&
                s[4] >= 128 && s[4] < 192 &&
                s[5] >= 128 && s[5] < 192) {
                s += 6;
                continue;
            }

            s++;
        }
    }

    {
        void *const old = result;

        size = (used | 7) + 1;

        result = realloc(result, size);
        if (!result) {
            if (old)
                free(old);
            errno = ENOMEM;
            return NULL;
        }

        memset(result + used, 0, size - used);
    }

    return (char *)result;
}

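Although iconv() is generally the right solution for character set conversions, the two functions above are certainly useful in embedded or otherwise constrained environments.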
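The lenient treatment of the eight Latin-1 code points described earlier can be isolated in a small helper. This is a sketch only: codepoint_to_latin9() and its strict flag are hypothetical names introduced for illustration, not part of the answer's code. With strict set to 0 it mirrors the behaviour of utf8_to_latin9(); with strict set to 1 it rejects code points that do not exist in ISO-8859-15.

```c
/* Map a Unicode code point to a Latin-9 byte value (0..255),
 * or return -1 if there is no such byte.
 * strict == 0: the eight Latin-1 code points that Latin-9 replaced
 *              (e.g. U+00A4 CURRENCY SIGN) pass through unchanged,
 *              which is what makes the Latin-1 round trip work.
 * strict != 0: those eight code points are rejected. */
int codepoint_to_latin9(unsigned int cp, int strict)
{
    switch (cp) {
    case 0x20AC: return 164;
    case 0x0160: return 166;
    case 0x0161: return 168;
    case 0x017D: return 180;
    case 0x017E: return 184;
    case 0x0152: return 188;
    case 0x0153: return 189;
    case 0x0178: return 190;
    case 0xA4: case 0xA6: case 0xA8: case 0xB4:
    case 0xB8: case 0xBC: case 0xBD: case 0xBE:
        return strict ? -1 : (int)cp;
    default:
        return (cp < 256) ? (int)cp : -1;
    }
}
```

In lenient mode both U+20AC and U+00A4 end up as byte 164, so a Latin-1 string converted to UTF-8 and back comes out byte for byte identical, even though the intermediate UTF-8 text is only correct for Latin-9.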

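【Discussion】:

Thank you very much for these two very useful functions!! By the way, did you really mean that they can also be used for ISO-8859-1? For example, what happens if you try to convert to UTF-8 an 8859-1 character that differs between 8859-1 and 8859-15, such as "¦" (0xA6 in 8859-1) or "¼" (0xBC in 8859-1)?
@cesss: No, I did not. I wrote that the functions work for Latin1-UTF8-Latin1 and Latin9-UTF8-Latin9 round trips. That means that converting a Latin1/Latin9 string to UTF8 and back with these functions always gives you the original Latin1/Latin9 string. The conversion itself is only correct for Latin9. If you want to change that, you need to edit the code of both functions for the Latin1 code points. (I suggest you rename the edited copies to latin1_to_utf8() and utf8_to_latin1() to avoid confusion.)
Thanks a lot! I will look at the tables at unicode.org and use them in my version of the functions.
@cesss: Latin 1 (ISO 8859-1) and Unicode share code points 0-127 and 160-255; code points 128-159 are undefined in Latin 1. That is, to get Latin1 versions of the above functions, you only need to remove code.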
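As an illustration of the "only remove code" remark, here is a minimal sketch of a Latin-1 variant. The name latin1_to_utf8() follows the suggestion in the comment above, but the caller-supplied buffer interface (dst, dstsize) is an assumption made to keep the example short; it is not how the answer's functions work.

```c
#include <stddef.h>

/* Convert a NUL-terminated ISO-8859-1 string to UTF-8 in dst.
 * Returns the number of bytes written (excluding the NUL),
 * or (size_t)-1 if dst (dstsize bytes) is too small.
 * No special cases: every byte value is its own code point. */
size_t latin1_to_utf8(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    size_t used = 0;

    for (; *s; s++) {
        size_t need = (*s < 128) ? 1 : 2;

        /* Keep one byte in reserve for the terminating NUL. */
        if (used + need + 1 > dstsize)
            return (size_t)-1;

        if (*s < 128) {
            d[used++] = *s;
        } else {
            d[used++] = (unsigned char)(0xC0 | (*s >> 6));
            d[used++] = (unsigned char)(0x80 | (*s & 0x3F));
        }
    }

    if (used + 1 > dstsize)
        return (size_t)-1;
    d[used] = 0;
    return used;
}
```

With this variant, the 8859-1 bytes 0xA6 and 0xBC from the question above become \xc2\xa6 and \xc2\xbc, i.e. U+00A6 and U+00BC.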
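@cesss: No, it is just a habit of mine. (If you know your string buffers are padded with NULs to an eight-byte boundary, you can do some neat optimizations, that's all.) You can use n+1 in the malloc if, in the first function, you replace the memset() with a statement that sets the terminating NUL. In the second function, replace size = (used | 7) + 1; with size = used + 1; (and optionally replace the memset() with result[used] = '\0';).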

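【Solution 2】:

It should be relatively easy to create a conversion table from the 128-255 latin9 codes to UTF-8 byte sequences. You can even use iconv to do this. Alternatively, you can create a file containing the 128-255 latin9 codes and convert it to UTF-8 with a suitable text editor. You can then use that data to build the conversion table.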
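A table-driven converter along these lines might look like the sketch below. Note the swap: instead of pasting in a table generated by iconv or a text editor, as this answer suggests, the table here is filled in at startup from the eight Latin-9 exceptions. All names (utf8_seq, utf8_len, latin9_table_init(), latin9_table_convert()) are hypothetical.

```c
#include <stddef.h>

/* utf8_seq[i] holds the UTF-8 bytes for Latin-9 byte 128+i;
 * utf8_len[i] says how many of them are used (2 or 3). */
static unsigned char utf8_seq[128][3];
static unsigned char utf8_len[128];

/* Fill the table once, before the first conversion. */
void latin9_table_init(void)
{
    static const struct { unsigned char byte; unsigned short cp; } special[8] = {
        {164, 0x20AC}, {166, 0x0160}, {168, 0x0161}, {180, 0x017D},
        {184, 0x017E}, {188, 0x0152}, {189, 0x0153}, {190, 0x0178}
    };
    unsigned int i, k;

    for (i = 0; i < 128; i++) {
        unsigned int cp = 128 + i;      /* default: byte value is the code point */

        for (k = 0; k < 8; k++)
            if (special[k].byte == 128 + i)
                cp = special[k].cp;

        if (cp < 0x800) {
            utf8_seq[i][0] = (unsigned char)(0xC0 | (cp >> 6));
            utf8_seq[i][1] = (unsigned char)(0x80 | (cp & 0x3F));
            utf8_len[i] = 2;
        } else {
            utf8_seq[i][0] = (unsigned char)(0xE0 | (cp >> 12));
            utf8_seq[i][1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            utf8_seq[i][2] = (unsigned char)(0x80 | (cp & 0x3F));
            utf8_len[i] = 3;
        }
    }
}

/* Convert src (Latin-9) into dst (dstsize bytes) using the table.
 * Returns bytes written excluding the NUL, or (size_t)-1 if dst is too small. */
size_t latin9_table_convert(const char *src, char *dst, size_t dstsize)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t used = 0;

    for (; *s; s++) {
        if (*s < 128) {
            if (used + 2 > dstsize)
                return (size_t)-1;
            dst[used++] = (char)*s;
        } else {
            const unsigned char *seq = utf8_seq[*s - 128];
            size_t len = utf8_len[*s - 128], j;

            if (used + len + 1 > dstsize)
                return (size_t)-1;
            for (j = 0; j < len; j++)
                dst[used++] = (char)seq[j];
        }
    }

    if (used + 1 > dstsize)
        return (size_t)-1;
    dst[used] = '\0';
    return used;
}
```

On a system with very little RAM, the same 128 entries could instead be generated offline and stored as a constant array, which is closer to what the answer describes.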

【Discussion】: