How to split a string into chunks of a particular byte size?

【Question title】: How to split a string into chunks of a particular byte size?
【Posted】: 2019-11-25 20:59:03
【Question description】:

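I am interacting with an API that accepts strings with a maximum size of 5KB.

I want to take a string that may be larger than 5KB and break it into chunks that are each smaller than 5KB.

I then intend to pass each smaller-than-5kb-string to the API endpoint and perform further actions once all requests have finished, perhaps using something like: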

await Promise.all([get_thing_from_api(string_1), get_thing_from_api(string_2), get_thing_from_api(string_3)])

I have read that a character in a string can take between 1 and 4 bytes.

So we can calculate the length of a string in bytes using:

// in Node, string is UTF-8    
Buffer.byteLength("here is some text"); 

// in Javascript  
new Blob(["here is some text"]).size

Sources:
https://stackoverflow.com/a/56026151
https://stackoverflow.com/a/52254083
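
To make the difference between character count and byte count concrete, here is a small sketch. The helper name `utf8Length` is made up for this illustration; it relies only on `TextEncoder`, which is available in browsers and in recent Node versions.

```javascript
// Hypothetical helper: UTF-8 byte length of a string.
function utf8Length(s) {
  return new TextEncoder().encode(s).length;
}

// String.prototype.length counts UTF-16 code units, not bytes.
for (const s of ["a", "ф", "€", "😀"]) {
  console.log(JSON.stringify(s), "length:", s.length, "bytes:", utf8Length(s));
}
// "a"  length: 1 bytes: 1
// "ф"  length: 1 bytes: 2
// "€"  length: 1 bytes: 3
// "😀" length: 2 bytes: 4
```

Note that the emoji has a `length` of 2 because it is stored as a surrogate pair, which is why neither `length` nor a fixed bytes-per-character factor gives an exact byte size.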

My searches on this return results about splitting a string into pieces of a particular character length rather than byte length, for example:

var my_string = "1234 5 678905";

console.log(my_string.match(/.{1,2}/g));
// ["12", "34", " 5", " 6", "78", "90", "5"]

Sources:
https://stackoverflow.com/a/7033662
https://stackoverflow.com/a/6259543
https://gist.github.com/hendriklammers/5231994

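Question

Is there a way to split a string into strings of a particular byte length?

I could either:

assume the string only contains 1 byte per character
allow for the "worst case" of 4 bytes per character

but I would prefer a more accurate solution.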
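
For comparison, the "worst case" option could look roughly like the sketch below. The function name `chunkWorstCase` is made up for this illustration. It assumes every code point needs 4 bytes, so chunks are guaranteed to fit but are usually far smaller than necessary, and words are split arbitrarily.

```javascript
// Hypothetical sketch of the "4 bytes per character" worst-case approach.
function chunkWorstCase(s, maxBytes) {
  const perChunk = Math.floor(maxBytes / 4); // code points per chunk
  if (perChunk < 1) throw new RangeError("maxBytes must be at least 4");
  // Array.from iterates by code point, so surrogate pairs stay together.
  const codePoints = Array.from(s);
  const result = [];
  for (let i = 0; i < codePoints.length; i += perChunk) {
    result.push(codePoints.slice(i, i + perChunk).join(""));
  }
  return result;
}

console.log(chunkWorstCase("Hey! ф€😀", 8));
// -> [ 'He', 'y!', ' ф', '€😀' ]
```

In this example only the last chunk (7 bytes) comes close to the 8-byte limit; the others use 2 or 3 bytes, which illustrates why a byte-accurate approach is preferable.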

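If both Node and plain JavaScript solutions exist, I would love to know about them.

Edit

This approach to calculating byteLength might help: it iterates over the characters in the string, gets their character codes and increments byteLength accordingly: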

function byteLength(str) {
  // returns the byte length of an utf8 string
  var s = str.length;
  for (var i=str.length-1; i>=0; i--) {
    var code = str.charCodeAt(i);
    if (code > 0x7f && code <= 0x7ff) s++;
    else if (code > 0x7ff && code <= 0xffff) s+=2;
    if (code >= 0xDC00 && code <= 0xDFFF) i--; //trail surrogate
  }
  return s;
}

Source: https://stackoverflow.com/a/23329386

This led me to some interesting experiments with the underlying data structures of Buffer:

var buf = Buffer.from('Hey! ф');
// <Buffer 48 65 79 21 20 d1 84>  
buf.length // 7
buf.toString().charCodeAt(0) // 72
buf.toString().charCodeAt(5) // 1092  
buf.toString().charCodeAt(6) // NaN    
buf[0] // 72
for (let i = 0; i < buf.length; i++) {
  console.log(buf[i]);
}
// 72 101 121 33 32 209 132 undefined
buf.slice(0,5).toString() // 'Hey! '
buf.slice(0,6).toString() // 'Hey! �'
buf.slice(0,7).toString() // 'Hey! ф'

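But as @trincot pointed out in the comments, what is the proper way to handle multibyte characters? And how can I make sure chunks are split on spaces (so as not to "break apart" a word)?

More about Buffer: https://nodejs.org/api/buffer.html#buffer_buffer

Edit

In case it helps anyone else understand the brilliant logic in the accepted answer, the snippet below is a heavily commented version I made so that I could understand it better.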

/**
 * Takes a string and returns an array of substrings that are smaller than maxBytes.  
 *
 * This is an overly commented version of the non-generator version of the accepted answer, 
 * in case it helps anyone understand its (brilliant) logic.  
 *
 * Both plain js and node variations are shown below - simply un/comment out your preference  
 * 
 * @param  {string} s - the string to be chunked
 * @param  {number} maxBytes - the maximum size of a chunk, in bytes
 * @return {string[]} - an array of strings less than maxBytes (except in extreme edge cases)
 */
function chunk(s, maxBytes) {
  // for plain js  
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder("utf-8").encode(s);
  // for node
  // let buf = Buffer.from(s);
  const result = [];
  var counter = 0;
  while (buf.length) {
    console.log("=============== BEG LOOP " + counter + " ===============");
    console.log("result is now:");
    console.log(result);
    console.log("buf is now:");
    // for plain js
    console.log(decoder.decode(buf));
    // for node  
    // console.log(buf.toString());
    /* get index of the last space character in the first chunk, 
    searching backwards from the maxBytes + 1 index */
    let i = buf.lastIndexOf(32, maxBytes + 1);
    console.log("i is: " + i);
    /* if no space is found in the first chunk,
    get index of the first space character in the whole string,
    searching forwards from 0 - in edge cases where characters
    between spaces exceeds maxBytes, eg chunk("123456789x 1", 9),
    the chunk will exceed maxBytes */
    if (i < 0) i = buf.indexOf(32, maxBytes);
    console.log("at first condition, i is: " + i);
    /* if there's no space at all, take the whole string,
    again an edge case like chunk("123456789x", 9) will exceed maxBytes*/
    if (i < 0) i = buf.length;
    console.log("at second condition, i is: " + i);
    // this is a safe cut-off point; never half-way a multi-byte
    // because the index is always the index of a space    
    console.log("pushing buf.slice from 0 to " + i + " into result array");
    // for plain js
    result.push(decoder.decode(buf.slice(0, i)));
    // for node
    // result.push(buf.slice(0, i).toString());
    console.log("buf.slicing with value: " + (i + 1));
    // slice the string from the index + 1 forwards  
    // it won't erroneously slice out a value after i, because i is a space  
    buf = buf.slice(i + 1); // skip space (if any)
    console.log("=============== END LOOP " + counter + " ===============");
    counter++;
  }
  return result;
}

console.log(chunk("Hey there! € 100 to pay", 12));

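【Question comments】:

Is it allowed for a split to happen in the middle of a multibyte character?
Good question. It is for translating text to speech, producing a single audio file (if the text is under 5kb) or multiple audio files (if the text is over 5kb), so I suppose the condition needs to say something like "break chunks at instances of a space character".
I love the layout and the edits of your question!
There is the iter-ops module, which has flexible split and page operators.

Tags: javascript node.js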

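【Solution 1】:

Using Buffer does indeed seem the right direction. Given that:

the Buffer prototype has indexOf and lastIndexOf methods, and
32 is the ASCII code of a space, and
32 can never occur as part of a multibyte character, since all the bytes that make up a multibyte sequence always have the most significant bit set.

...you can proceed as follows: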

function chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    const result = [];
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take the whole string
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        result.push(buf.slice(0, i).toString());
        buf = buf.slice(i+1); // Skip space (if any)
    }
    return result;
}

console.log(chunk("Hey there! € 100 to pay", 12)); 
// -> [ 'Hey there!', '€ 100 to', 'pay' ]
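
The observation about the most significant bit can also be used when there is no space to cut at. The following is only a sketch, not part of the answer above: the names `safeCut` and `hardChunk` are made up, and it assumes valid UTF-8 and a limit of at least 4 bytes. It backs up from the limit while the byte at the cut position is a continuation byte (bit pattern `10xxxxxx`), so a character is never split.

```javascript
// Every byte of a multibyte UTF-8 sequence has its top bit set:
console.log(new TextEncoder().encode("ф€😀").every(b => b >= 0x80)); // true

// Hypothetical helper: largest cut index <= maxBytes that does not split a character.
function safeCut(bytes, maxBytes) {
  let i = Math.min(maxBytes, bytes.length);
  // If the byte right after the cut is a continuation byte, the cut would
  // fall inside a character, so move the cut back.
  while (i > 0 && i < bytes.length && (bytes[i] & 0xc0) === 0x80) i--;
  return i;
}

// Hypothetical generator: hard split by bytes, ignoring word boundaries.
function* hardChunk(s, maxBytes) {
  if (maxBytes < 4) throw new RangeError("maxBytes must be at least 4");
  const decoder = new TextDecoder("utf-8");
  let bytes = new TextEncoder().encode(s);
  while (bytes.length) {
    const i = safeCut(bytes, maxBytes);
    yield decoder.decode(bytes.subarray(0, i));
    bytes = bytes.subarray(i);
  }
}

console.log([...hardChunk("Hey! ф", 6)]); // -> [ 'Hey! ', 'ф' ]
```

This could serve as a fallback for a single "word" that is longer than the limit, at the cost of splitting that word.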

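You could consider extending this to also look for TAB, LF, or CR as split characters. If you do, and your input text can contain CRLF sequences, you would also need to detect those so that you do not end up with orphaned CR or LF characters in the chunks.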
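
One possible way to sketch that extension is shown below. This is an illustration under my own assumptions, not the answer's code: the name `chunkOnWhitespace` is made up, the backward search starts at `maxBytes` rather than `maxBytes + 1`, a remainder that already fits is emitted whole, and whitespace on both sides of a cut is dropped so that a CRLF pair never leaves a stray CR or LF.

```javascript
const WS = new Set([9, 10, 13, 32]); // TAB, LF, CR, space

// Hypothetical generator: split on any of the whitespace bytes above.
function* chunkOnWhitespace(s, maxBytes) {
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder().encode(s);
  while (buf.length) {
    let i = -1;
    if (buf.length <= maxBytes) {
      i = buf.length; // the remainder fits as it is
    } else {
      // Last whitespace byte at index <= maxBytes
      for (let j = maxBytes; j >= 0; j--) {
        if (WS.has(buf[j])) { i = j; break; }
      }
      // Otherwise the first whitespace byte after the limit
      if (i < 0) i = buf.findIndex((b, j) => j > maxBytes && WS.has(b));
      // Otherwise take everything
      if (i < 0) i = buf.length;
    }
    // Drop whitespace on both sides of the cut (covers CRLF)
    let end = i;
    while (end > 0 && WS.has(buf[end - 1])) end--;
    let next = i;
    while (next < buf.length && WS.has(buf[next])) next++;
    if (end > 0) yield decoder.decode(buf.subarray(0, end));
    buf = buf.subarray(next);
  }
}

console.log([...chunkOnWhitespace("Hey there!\r\n€ 100\tto pay", 12)]);
// -> [ 'Hey there!', '€ 100\tto', 'pay' ]
```

Whitespace inside a chunk (the TAB in the second chunk here) is kept; only the whitespace at a cut point is removed.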

You could turn the above function into a generator, so that you control when processing starts for getting the next chunk:

function * chunk(s, maxBytes) {
    let buf = Buffer.from(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield buf.slice(0, i).toString();
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);

Browser

Buffer is specific to Node. Browsers, however, implement TextEncoder and TextDecoder, which leads to similar code:

function * chunk(s, maxBytes) {
    const decoder = new TextDecoder("utf-8");
    let buf = new TextEncoder("utf-8").encode(s);
    while (buf.length) {
        let i = buf.lastIndexOf(32, maxBytes+1);
        // If no space found, try forward search
        if (i < 0) i = buf.indexOf(32, maxBytes);
        // If there's no space at all, take all
        if (i < 0) i = buf.length;
        // This is a safe cut-off point; never half-way a multi-byte
        yield decoder.decode(buf.slice(0, i));
        buf = buf.slice(i+1); // Skip space (if any)
    }
}

for (let s of chunk("Hey there! € 100 to pay", 12)) console.log(s);
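
To tie this back to the question, the chunks could then be sent with `Promise.all`. The sketch below repeats the generator so that it runs on its own, and it stubs `get_thing_from_api` (the name used in the question) with a placeholder, since the real API is not shown; `processAll` is a made-up name.

```javascript
// Same logic as the generator above, repeated so this sketch is self-contained.
function* chunk(s, maxBytes) {
  const decoder = new TextDecoder("utf-8");
  let buf = new TextEncoder().encode(s);
  while (buf.length) {
    let i = buf.lastIndexOf(32, maxBytes + 1);
    if (i < 0) i = buf.indexOf(32, maxBytes);
    if (i < 0) i = buf.length;
    yield decoder.decode(buf.slice(0, i));
    buf = buf.slice(i + 1);
  }
}

// Placeholder for the real API call, which is an assumption here.
async function get_thing_from_api(text) {
  return { text, bytes: new TextEncoder().encode(text).length };
}

// Hypothetical wrapper: chunk the input, then run all requests concurrently.
async function processAll(s, maxBytes) {
  const pieces = [...chunk(s, maxBytes)];
  // Promise.all resolves with results in the same order as the chunks.
  return Promise.all(pieces.map(piece => get_thing_from_api(piece)));
}

processAll("Hey there! € 100 to pay", 12).then(results => console.log(results));
```

If the API is rate limited, sending everything at once may not be appropriate; in that case the generator form makes it easy to await each request before pulling the next chunk.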

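【Comments】:

Brilliant. I was playing with lastIndexOf(32) but could not work out how to dynamically create chunks that end and begin at the last instance of a space and push them to an array. Just curious, why is the "If no space found, try forward search" line needed? Doesn't the previous line achieve the same goal, i.e. it gets the last instance of a space character (even if there is only one instance, near the beginning of the string)? And doesn't the next line, "If there's no space at all, take all", have the same condition, i.e. if (i < 0), and override the value assigned to i?
The edit just removed a step, replacing the slice with the second argument of lastIndexOf, but it boils down to the same logic. Suppose maxBytes is 1000. Then the lastIndexOf part looks for the last space that occurs within the first 1001 characters. If that returns -1, it means no chunk can be produced that way. The second-best option is to look for the first space after position 1000. That results in a slightly larger chunk, but there is no other way. When that fails too (-1), we can only treat all the characters as a single chunk.
Ah, I see, thanks very much for the clarification. Just to test an extreme edge case, I tested with a 14,002-character string in which there were no spaces in the first 5000 characters and maxBytes was 5000, and the behavior was as you described. Thanks again.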
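
The three outcomes described in this exchange can be observed with a small probe. The helper name `cutIndex` is made up; it only repeats the three index lookups from the answer and reports which one produced the cut.

```javascript
// Hypothetical probe: which of the three lookups decides the cut position?
function cutIndex(s, maxBytes) {
  const buf = new TextEncoder().encode(s);
  let i = buf.lastIndexOf(32, maxBytes + 1);
  if (i >= 0) return { i, how: "backward" };
  i = buf.indexOf(32, maxBytes);
  if (i >= 0) return { i, how: "forward" };
  return { i: buf.length, how: "whole" };
}

console.log(cutIndex("ab cd efgh", 5));  // { i: 5, how: 'backward' }
console.log(cutIndex("abcdefgh ij", 5)); // { i: 8, how: 'forward' }
console.log(cutIndex("abcdefgh", 5));    // { i: 8, how: 'whole' }

// A space exactly at index maxBytes + 1 gives a chunk of maxBytes + 1 bytes:
console.log(cutIndex("ab def gh", 5));   // { i: 6, how: 'backward' }
```

The last case is my own observation rather than something raised in the thread: because the backward search starts at `maxBytes + 1`, the chunk "ab def" is 6 bytes against a limit of 5, even though cutting at the earlier space would have fit. If the limit is strict, starting the backward search at `maxBytes` instead may be worth considering.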