Algorithm Challenge: Arbitrary in-place base conversion for lossless string compression (answers)

【Question title】: Algorithm Challenge: Arbitrary in-place base conversion for lossless string compression
【Posted】: 2015-08-20 13:27:06
【Question description】:

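It might help to start with a real-world example. Say I'm writing a web app backed by MongoDB, so my records have a long hex primary key, which makes the URL for viewing a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. URLs can draw on many more characters than the 16 hex digits. A 24-digit hex number has just under 8 x 10^28 (16^24) possible values, but if you limit yourself to the 62 characters matched by the regex class [a-zA-Z0-9] (a YouTube video ID uses more), you can get past 8 x 10^28 in only 17 characters.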
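The 17-character figure can be checked with a short calculation. The following is a minimal sketch; the helper name `minDigits` is made up for illustration and does not come from the question.

```javascript
// Smallest k such that destBase^k >= sourceBase^sourceLength, i.e. the number
// of destination digits needed to cover every source string of that length.
// (minDigits is a hypothetical helper name.)
function minDigits(sourceBase, sourceLength, destBase) {
  // sourceBase^len <= destBase^k  <=>  k >= len * ln(sourceBase) / ln(destBase)
  // Caveat: this uses floating point, so it may be off by one when the
  // exact ratio is an integer (for example 16 -> 64).
  return Math.ceil(sourceLength * Math.log(sourceBase) / Math.log(destBase));
}

console.log(minDigits(16, 24, 62)); // 17
```

A particular ID may still need fewer characters than this worst case if its numeric value happens to be small.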

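I would like an algorithm that converts any string restricted to a particular alphabet of characters into a string over any other alphabet of characters, where the value of each character c can be taken to be alphabet.indexOf(c).

Something of the form: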

convert(value, sourceAlphabet, destinationAlphabet)

Assuming:

all parameters are strings
every character in value exists in sourceAlphabet
every character in sourceAlphabet and in destinationAlphabet is unique

The simplest example:

var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";

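But I would also like it to work for converting War & Peace from the Russian alphabet plus some punctuation to the entire Unicode character set and back again, losslessly.

Is this possible?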

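The only way I was ever taught to do base conversions, in Comp Sci 101, was to first convert to a base-10 integer by summing digit * base^position and then do the reverse to convert to the target base. That method is insufficient for converting very long strings, because the integers get too big.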
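To make the limitation concrete, here is a minimal sketch of that textbook method using an ordinary JavaScript Number. The function name `naiveConvert` and the `base62` alphabet are assumptions added for illustration. It handles the short example but silently loses precision once the value exceeds 2^53.

```javascript
// Textbook conversion through a single native Number.
// (naiveConvert is a hypothetical name, not from the question.)
function naiveConvert(value, sourceAlphabet, destinationAlphabet) {
  const sourceBase = sourceAlphabet.length;
  const destBase = destinationAlphabet.length;

  // Horner's rule: equivalent to summing digit * base^position.
  let n = 0;
  for (const ch of value) {
    n = n * sourceBase + sourceAlphabet.indexOf(ch);
  }

  // Peel off destination digits, least significant first.
  let out = "";
  do {
    out = destinationAlphabet[n % destBase] + out;
    n = Math.floor(n / destBase);
  } while (n > 0);
  return out;
}

const hex = "0123456789abcdef";
const base10 = "0123456789";
const base62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

console.log(naiveConvert("12245589", base10, hex)); // "bada55"

// A 96-bit ID does not survive the round trip: a double keeps only 53 bits.
const id = "55c460d8e2d6e59da89d08d0";
const roundTrip = naiveConvert(naiveConvert(id, hex, base62), base62, hex);
console.log(roundTrip === id); // false
```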

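It certainly feels intuitive that a base conversion could be done in place as you step through the string (probably backwards, to maintain the standard significant-digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.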
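One well-known way to carry a remainder along the string without any big-integer library is schoolbook long division on the array of digits. The sketch below is offered as an illustration only, not as the approach of any answer on this page. It is not in-place or streaming, it runs in quadratic time, and it drops leading zero digits; the `base62` alphabet is an assumption.

```javascript
// Base conversion by repeated long division on an array of digits.
// No intermediate value reaches sourceBase * destBase, so plain Numbers
// are enough as long as that product stays below 2^53.
function convert(value, sourceAlphabet, destinationAlphabet) {
  const sourceBase = sourceAlphabet.length;
  const destBase = destinationAlphabet.length;

  // Digits of the input, most significant first.
  let digits = Array.from(value, (ch) => sourceAlphabet.indexOf(ch));
  const out = [];

  while (digits.length > 0) {
    // Divide the whole number by destBase one source digit at a time,
    // carrying the remainder to the right as in division on paper.
    const quotient = [];
    let remainder = 0;
    for (const digit of digits) {
      const acc = remainder * sourceBase + digit;
      const q = Math.floor(acc / destBase);
      remainder = acc % destBase;
      if (quotient.length > 0 || q > 0) quotient.push(q); // skip leading zeros
    }
    out.push(destinationAlphabet[remainder]); // next least significant digit
    digits = quotient;
  }
  return out.reverse().join("");
}

const hex = "0123456789abcdef";
const base10 = "0123456789";
const base62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

console.log(convert("12245589", base10, hex)); // "bada55"

const id = "55c460d8e2d6e59da89d08d0";
console.log(convert(convert(id, hex, base62), base62, hex) === id); // true
```

Because leading zeros carry no numeric value, a caller that needs them preserved would have to record the original length separately.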

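That's where you come in. Are you smart enough?

Perhaps this is a solved problem, done on paper by some 18th-century mathematician, implemented in LISP on punch cards in 1970, and set as the first homework assignment in Cryptography 101, but my searches have borne no fruit.

I'd prefer a solution in JavaScript with a functional style, but any language or style will do, as long as you're not cheating with some big-integer library. Bonus points for efficiency, of course.

Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.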

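【Question discussion】:

The title says "in-place". I don't think that is necessarily possible when moving to an alphabet with fewer characters than the original.
Not that it is really possible at all; otherwise arithmetic decoding would be a lot easier.
Sure. I think that's a fine simplification.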
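Something like this is already done, in a special form, in encryption. A byte can take 2^8 or 256 different values, but fewer than half of those represent a cleanly printable character that doesn't look like a bad car crash when printed. So Base64 defines a character set of 64 "letters" and splits the bit string into 6-bit chunks instead of the 8 bits of a byte. You could do something similar by hand by splitting into 5-bit chunks and using, for example, the letters a-z and the digits 0-5. Your challenge is more general than these special forms, but I think it is possible.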
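The 5-bit variant mentioned in this comment can be sketched as follows. The alphabet ordering (a-z, then 0-5), the function name `encode5bit`, and the choice to pad the last chunk with zero bits are all assumptions made for illustration; this is not a standard encoding such as Base32.

```javascript
// Hypothetical 32-letter alphabet: a-z followed by 0-5 (26 + 6 = 32 = 2^5).
const ALPHABET32 = "abcdefghijklmnopqrstuvwxyz012345";

// Split a byte array into 5-bit chunks and map each chunk to one letter.
function encode5bit(bytes) {
  let buffer = 0; // pending bits, never more than 12 at a time
  let have = 0;   // number of valid bits in buffer
  let out = "";
  for (const byte of bytes) {
    buffer = (buffer << 8) | byte;
    have += 8;
    while (have >= 5) {
      out += ALPHABET32[(buffer >> (have - 5)) & 31];
      have -= 5;
    }
    buffer &= (1 << have) - 1; // keep only the unconsumed bits
  }
  // Pad the final partial chunk with zero bits on the right.
  if (have > 0) out += ALPHABET32[(buffer << (5 - have)) & 31];
  return out;
}

console.log(encode5bit([0x00, 0x44, 0x32, 0x14, 0xc7])); // "abcdefgh"
```

Five bytes are exactly eight 5-bit chunks, which is why that input needs no padding.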
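Looking at it from another angle that covers every case, consider that the computer's native alphabet contains 2 "letters", which we usually call 0 and 1. Any alphabet you can think of, if it can be represented on a computer at all, can be converted completely into this native alphabet. So if you have 2 such alphabets, you can always convert one into the other by going through 0 and 1. In such a conversion, the last letter may be restricted to a subset of the alphabet it belongs to, because there are not enough bits left for a full letter in that position.

Tags: algorithm math encoding cryptography compression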

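【Solution 1】:

Here is a solution in C that is very fast, using bit-shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. Converting to and from strings with restricted character ranges is left to the user. As for the "in-place" in the question title, the source and destination vectors may overlap, but only if the source alphabet is not larger than the destination alphabet.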

/*
  recode version 1.0, 22 August 2015

  Copyright (C) 2015 Mark Adler

  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.

  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:

  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

  Mark Adler
  madler@alumni.caltech.edu
*/

/* Recode a vector from one alphabet to another using intermediate
   variable-length bit codes. */

/* The approach is to use a Huffman code over equiprobable alphabets in two
   directions.  First to encode the source alphabet to a string of bits, and
   second to encode the string of bits to the destination alphabet. This will
   be reasonably close to the efficiency of base-encoding with arbitrary
   precision arithmetic. */

#include <stddef.h>     // size_t
#include <limits.h>     // UINT_MAX, ULLONG_MAX

#if UINT_MAX == ULLONG_MAX
#  error recode() assumes that long long has more bits than int
#endif

/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
   code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
   *dlen returns the length of the result, which will not exceed the value of
   *dlen when called.  If the original *dlen is not large enough to hold the
   full result, then recode() will return non-zero to indicate failure.
   Otherwise recode() will return 0.  recode() will also return non-zero if
   either of the smax or dmax parameters are less than one.  The non-zero
   return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
   and 3 if any of the elements of source are greater than smax.

   Using this same operation on the result with smax and dmax reversed reverses
   the operation, restoring the original vector.  However there may be more
   symbols returned than the original, so the number of symbols expected needs
   to be known for decoding.  (An end symbol could be appended to the source
   alphabet to include the length in the coding, but then encoding and decoding
   would no longer be symmetric, and the coding efficiency would be reduced.
   This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax)
{
    // compute sbits and scut, with which we will recode the source with
    // sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
    if (smax < 1)
        return 2;
    unsigned sbits = 0;
    unsigned scut = 1;          // 2**sbits
    while (scut && scut <= smax) {
        scut <<= 1;
        sbits++;
    }
    scut -= smax + 1;

    // same thing for dbits and dcut
    if (dmax < 1)
        return 2;
    unsigned dbits = 0;
    unsigned dcut = 1;          // 2**dbits
    while (dcut && dcut <= dmax) {
        dcut <<= 1;
        dbits++;
    }
    dcut -= dmax + 1;

    // recode a base smax+1 vector to a base dmax+1 vector using an
    // intermediate bit vector (a sliding window of that bit vector is kept in
    // a bit buffer)
    unsigned long long buf = 0;     // bit buffer
    unsigned have = 0;              // number of bits in bit buffer
    size_t i = 0, n = 0;            // source and dest indices
    unsigned sym;                   // symbol being encoded
    for (;;) {
        // encode enough of source into bits to encode that to dest
        while (have < dbits && i < slen) {
            sym = source[i++];
            if (sym > smax) {
                *dlen = n;
                return 3;
            }
            if (sym < scut) {
                buf = (buf << (sbits - 1)) + sym;
                have += sbits - 1;
            }
            else {
                buf = (buf << sbits) + sym + scut;
                have += sbits;
            }
        }

        // if not enough bits to assure one symbol, then break out to a special
        // case for coding the final symbol
        if (have < dbits)
            break;

        // encode one symbol to dest
        if (n == *dlen)
            return 1;
        sym = buf >> (have - dbits + 1);
        if (sym < dcut) {
            dest[n++] = sym;
            have -= dbits - 1;
        }
        else {
            sym = buf >> (have - dbits);
            dest[n++] = sym - dcut;
            have -= dbits;
        }
        buf &= ((unsigned long long)1 << have) - 1;
    }

    // if any bits are left in the bit buffer, encode one last symbol to dest
    if (have) {
        if (n == *dlen)
            return 1;
        sym = buf;
        sym <<= dbits - 1 - have;
        if (sym >= dcut)
            sym = (sym << 1) - dcut;
        dest[n++] = sym;
    }

    // return recoded vector
    *dlen = n;
    return 0;
}

/* Test recode(). */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>

// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
    unsigned bits = 0;
    unsigned long long mask = 1;
    while (mask <= max) {
        mask <<= 1;
        bits++;
    }
    mask--;
    unsigned long long ran = 0;
    unsigned have = 0;
    size_t n = 0;
    while (n < len) {
        while (have < bits) {
            ran = (ran << 31) + random();
            have += 31;
        }
        if ((ran & mask) <= max)
            vec[n++] = ran & mask;
        ran >>= bits;
        have -= bits;
    }
}

// Get a valid number from str and assign it to var
#define NUM(var, str) \
    do { \
        char *end; \
        unsigned long val = strtoul(str, &end, 0); \
        var = val; \
        if (*end || var != val) { \
            fprintf(stderr, \
                    "invalid or out of range numeric argument: %s\n", str); \
            return 1; \
        } \
    } while (0)

/* "bet n m len count" generates count test vectors of length len, where each
   entry is in the range 0..n.  Each vector is recoded to another vector using
   only symbols in the range 0..m.  That vector is recoded back to a vector
   using only symbols in 0..n, and that result is compared with the original
   random vector.  Report on the average ratio of input and output symbols, as
   compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
    // get sizes of alphabets and length of test vector, compute maximum sizes
    // of recoded vectors
    unsigned smax, dmax, runs;
    size_t slen, dsize, bsize;
    if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
    NUM(smax, argv[1]);
    NUM(dmax, argv[2]);
    NUM(slen, argv[3]);
    NUM(runs, argv[4]);
    dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
    bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));

    // generate random test vectors, encode, decode, and compare
    srandomdev();
    unsigned source[slen], dest[dsize], back[bsize];
    unsigned mis = 0, i;
    unsigned long long dtot = 0;
    int ret;
    for (i = 0; i < runs; i++) {
        ranvec(source, slen, smax);
        size_t dlen = dsize;
        ret = recode(dest, &dlen, dmax, source, slen, smax);
        if (ret) {
            fprintf(stderr, "encode error %d\n", ret);
            break;
        }
        dtot += dlen;
        size_t blen = bsize;
        ret = recode(back, &blen, smax, dest, dlen, dmax);
        if (ret) {
            fprintf(stderr, "decode error %d\n", ret);
            break;
        }
        if (blen < slen || memcmp(source, back, slen))  // blen > slen is ok
            mis++;
    }
    if (mis)
        fprintf(stderr, "%u/%u mismatches!\n", mis, i);
    if (ret == 0)
        printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
               dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
    return 0;
}

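【Discussion】:

Could you describe in words an example of how your algorithm handles something very simple, such as converting a three-digit base-3 string to base 4?
In base 3, the symbols are encoded as the bit strings 0, 10, and 11. So, for example, the vector 0, 1, 2, 1 becomes 0101110. Base 4 simply takes that two bits at a time, encoding it as 1, 1, 3, 0. The last symbol is a special case: only one bit is left, so it is shifted up to make two bits. For this short case you still end up with four symbols. For long vectors going from base 3 to base 4, the number of output symbols is 83.3% of the number of input symbols. If this were done with infinite-precision arithmetic, that ratio would be 79.3%.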
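The worked example in this comment can be reproduced with a small JavaScript sketch. It is a simplified illustration of the idea, not a port of the C function above: it builds the intermediate bits as a string instead of using a 64-bit buffer, does no input validation, and has no error codes. The names `codeFor`, `toBits`, and `recodeSketch` are made up for this sketch.

```javascript
// Variable-length code for the alphabet 0..max: symbols below `cut` take
// bits-1 bits, the others take `bits` bits (with their value offset by cut).
function codeFor(max) {
  let bits = 0;
  let pow = 1; // 2^bits
  while (pow <= max) { pow *= 2; bits++; }
  return { bits, cut: pow - (max + 1) };
}

function toBits(value, width) {
  return width === 0 ? "" : value.toString(2).padStart(width, "0");
}

function recodeSketch(source, smax, dmax) {
  const s = codeFor(smax);
  const d = codeFor(dmax);

  // Step 1: source symbols -> bit string.
  let bitString = "";
  for (const sym of source) {
    bitString += sym < s.cut ? toBits(sym, s.bits - 1) : toBits(sym + s.cut, s.bits);
  }

  // Step 2: bit string -> destination symbols.
  const dest = [];
  let pos = 0;
  const read = (n) => {
    let v = 0;
    for (let k = 0; k < n; k++) v = v * 2 + Number(bitString[pos++]);
    return v;
  };
  while (bitString.length - pos >= d.bits) {
    const v = read(d.bits - 1);
    if (v < d.cut) dest.push(v);             // short code
    else dest.push(v * 2 + read(1) - d.cut); // long code needs one more bit
  }

  // Leftover bits: pad with zeros on the right to form one last symbol.
  if (pos < bitString.length) {
    const left = bitString.length - pos;
    let v = read(left) * 2 ** (d.bits - 1 - left);
    if (v >= d.cut) v = v * 2 - d.cut;
    dest.push(v);
  }
  return dest;
}

console.log(recodeSketch([0, 1, 2, 1], 2, 3)); // [ 1, 1, 3, 0 ]
console.log(recodeSketch([1, 1, 3, 0], 3, 2)); // [ 0, 1, 2, 1, 0 ]
```

Decoding returns one extra trailing symbol here, which is why the expected length of the original vector has to be known, as the answer states.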

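【Solution 2】:

As has been pointed out in some other answers, try not to think of summing digit * base^position as converting to base ten. Think of it instead as directing the computer to generate, in its own terms, a representation of the quantity that the number stands for (for most computers this is probably closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.

By rejecting "big integer" implementations and asking for letter-by-letter conversion, you are at the same time arguing that the numerical/alphabetical representation of a quantity is not actually what it is, namely that each position represents a quantity of the form digit * base^position. If the nine-millionth character of War and Peace really does represent what you are asking to convert it from, then the computer will at some point need to generate a representation of Д * 33^9000000.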
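For a sense of scale, the size of such a representation can be estimated with logarithms. This is a rough floating-point estimate under the answer's assumption of a 33-letter alphabet; `bitsForPower` is a hypothetical helper name.

```javascript
// Approximate number of bits needed to write base^exponent in binary:
// floor(exponent * log2(base)) + 1, computed in floating point.
function bitsForPower(base, exponent) {
  return Math.floor(exponent * Math.log2(base)) + 1;
}

const bits = bitsForPower(33, 9000000);
console.log(bits);                                 // roughly 45.4 million bits
console.log((bits / 8 / 1e6).toFixed(1) + " MB");  // a few megabytes
```

So the full quantity is large but finite, on the order of the size of the text itself.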

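【Discussion】:

【Solution 3】:

I don't think any solution can work in general. If there is no integer e such that n^e == m (with n the source base and m the target base), then for any given MAX_INT there is no way to compute the value of the target base's digit at a position p once n^p > MAX_INT.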

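You can avoid this in the case where n^e == m for some e, because the problem is then recursively feasible (the first e digits in base n can be summed and converted into the first digit in base m, then chopped off, and the process repeated).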
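The power-of-base special case described here can be sketched as follows. The function name `convertByGrouping` and the left-padding with the zero digit are assumptions made for illustration; the sketch only covers the direction from base n to base m = n^e.

```javascript
// Conversion when destBase == sourceBase^e: each group of e source digits
// maps to exactly one destination digit, so no large numbers ever appear.
function convertByGrouping(value, sourceAlphabet, destinationAlphabet) {
  const n = sourceAlphabet.length;
  const m = destinationAlphabet.length;
  if (n < 2 || m < 2) throw new Error("alphabets need at least two characters");

  // Find e with n^e == m, or give up.
  let e = 0;
  for (let p = 1; p < m; p *= n) e++;
  if (n ** e !== m) throw new Error("destination base is not a power of the source base");

  // Left-pad with the zero digit so the length is a multiple of e.
  const padded = sourceAlphabet[0].repeat((e - (value.length % e)) % e) + value;

  let out = "";
  for (let i = 0; i < padded.length; i += e) {
    let group = 0;
    for (const ch of padded.slice(i, i + e)) {
      group = group * n + sourceAlphabet.indexOf(ch);
    }
    out += destinationAlphabet[group];
  }
  return out;
}

const HEX = "0123456789abcdef";
console.log(convertByGrouping("101110101101101001010101", "01", HEX)); // "bada55"
```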

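If you don't have this useful property, then eventually you will have to take some part of the original number and work in terms of n^p, and n^p will be greater than MAX_INT, which means it is impossible.

【Discussion】: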