How to transform HTML entities via io.Reader

【Question title】: How to transform HTML entities via io.Reader
【Posted】: 2020-02-25 07:01:06
【Question】:

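My Go program makes HTTP requests whose response bodies are large JSON documents in which the ampersand character & is encoded as &amp; inside strings (possibly due to some Microsoft platform quirk?). My program needs to convert those entities back to the & character in a way that is compatible with json.Decoder.

An example response might look like this: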

{"name":"A&amp;B","comment":"foo&amp;bar"}

The corresponding object would be:

pkg.Object{Name:"A&B", Comment:"foo&bar"}

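The documents come in various shapes, so converting the HTML entities after decoding is not feasible. Ideally, it would be done by wrapping the response body reader in another reader that performs the transformation.

Is there an easy way to wrap http.Response.Body in some io.ReadCloser that replaces all instances of &amp; with & (or, in the general case, any string X with string Y)?

I suspect x/text/transform can do this, but I don't immediately see how. In particular, I am worried about the edge case where an entity spans batches of bytes: for example, one batch ends with &am and the next batch starts with p;. Is there a library or idiom that handles this situation gracefully?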

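【Comments】:

If the pkg.Object struct declaration is under your control, an alternative to using transform might be to define a new type, e.g. type MyString string, implement the UnmarshalJSON method on the MyString type, and finally redeclare the string members of the pkg.Object struct as MyString. With this approach, the &amp; -> & conversion happens inside UnmarshalJSON.
@ShangjianDing Sure, but the challenge is that there will be dozens of object models, many of them with dozens of fields, and making sure they are all of type "mystring" rather than "string" would be cumbersome.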
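
For reference, the alternative suggested in the first comment could look roughly like the sketch below. It is only an illustration: the Object struct, its JSON tags, and the decodeObject helper are assumptions made up for the example, and the asker has already explained why this approach does not scale to dozens of models.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// MyString is the type name suggested in the comment; it turns &amp; back
// into & while the JSON string is being decoded.
type MyString string

// UnmarshalJSON decodes the JSON string normally, then replaces the entity.
func (s *MyString) UnmarshalJSON(data []byte) error {
	var raw string
	if err := json.Unmarshal(data, &raw); err != nil {
		return err
	}
	*s = MyString(strings.ReplaceAll(raw, "&amp;", "&"))
	return nil
}

// Object is a hypothetical stand-in for pkg.Object.
type Object struct {
	Name    MyString `json:"name"`
	Comment MyString `json:"comment"`
}

// decodeObject is a hypothetical helper used to exercise the type.
func decodeObject(doc string) (Object, error) {
	var o Object
	err := json.Unmarshal([]byte(doc), &o)
	return o, err
}

func main() {
	o, err := decodeObject(`{"name":"A&amp;B","comment":"foo&amp;bar"}`)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```

Every string field has to be redeclared as MyString for this to work, which is exactly the maintenance cost the reply objects to.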

Tags: go streaming transformation html-entities

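【Solution 1】:

If you don't want to rely on an external package such as transform.Reader, you can write a custom io.Reader wrapper.

The following handles the edge case where the find sequence may span two Read() calls: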

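import (
    "bytes"
    "io"
)

// fixer wraps a reader and replaces every occurrence of fnd with rpl.
// It assumes len(rpl) <= len(fnd), so the result always fits in the caller's
// buffer, and that the buffer passed to Read is longer than fnd.
type fixer struct {
    r        io.Reader // source reader
    fnd, rpl []byte    // find & replace sequences
    partial  int       // length of the partial find match held back by the previous Read()
}

// Read satisfies the io.Reader interface
func (f *fixer) Read(b []byte) (int, error) {
    off := f.partial
    if off > 0 {
        copy(b, f.fnd[:off]) // restore the partial match held back by the previous `Read`
    }

    n, err := f.r.Read(b[off:])
    n += off
    f.partial = 0 // the held-back bytes are now part of b[:n]

    if err != io.EOF {
        // no need to check for a partial match at EOF, as that is the last Read!
        f.partial = partialFind(b[:n], f.fnd)
        n -= f.partial // lop off any partial bytes
    }

    fixb := bytes.ReplaceAll(b[:n], f.fnd, f.rpl)

    return copy(b, fixb), err // preserve err as it may be io.EOF etc.
}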

Along with this helper (which could probably use some optimization):

// returns number of matched bytes, if byte-slice ends in a partial-match
func partialFind(b, find []byte) int {
    for n := len(find) - 1; n > 0; n-- {
        if bytes.HasSuffix(b, find[:n]) {
            return n
        }
    }
    return 0 // no match
}

Working playground example.

Note: to test the edge-case logic, you can use a narrowReader to guarantee short Reads and force a match to be split across Reads, like this: validation playground example

【Comments】:

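【Solution 2】:

You need to create a transform.Transformer that replaces your characters.

So we need one that transforms an old []byte into a new []byte while preserving all other data. An implementation could look like this: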

type simpleTransformer struct {
    Old, New []byte
}

// Transform transforms `t.Old` bytes to `t.New` bytes.
// The current implementation assumes that len(t.Old) >= len(t.New), but it also seems to work when len(t.Old) < len(t.New) (this has not been tested extensively)
func (t *simpleTransformer) Transform(dst, src []byte, atEOF bool) (nDst, nSrc int, err error) {
    // Get the position of the first occurrence of `t.Old` so we can replace it
    var ci = bytes.Index(src[nSrc:], t.Old)

    // Loop over the slice until we can't find any occurrences of `t.Old`
    // also make sure we don't run into index out of range panics
    for ci != -1 && nSrc < len(src) {
        // Copy source data before `nSrc+ci` that doesn't need transformation
        copied := copy(dst[nDst:nDst+ci], src[nSrc:nSrc+ci])
        nDst += copied
        nSrc += copied

        // Copy new data with transformation to `dst`
        nDst += copy(dst[nDst:nDst+len(t.New)], t.New)

        // Skip the rest of old bytes in the next iteration
        nSrc += len(t.Old)

        // search for the next occurrence of `t.Old`
        ci = bytes.Index(src[nSrc:], t.Old)
    }

    // Mark the rest of data as not completely processed if it contains a start element of `t.Old`
    // (e.g. if the end is `&amp` and we're looking for `&amp;`)
    // This data will not yet be copied to `dst` so we can work with it again
    // If it is at the end (`atEOF`), we don't need to do the check anymore as the string might just end with `&amp` 
    if bytes.Contains(src[nSrc:], t.Old[0:1]) && !atEOF {
        err = transform.ErrShortSrc
        return
    }

    // Copy rest of data that doesn't need any transformations
    // The for loop processed everything except this last chunk
    copied := copy(dst[nDst:], src[nSrc:])
    nDst += copied
    nSrc += copied

    return nDst, nSrc, err
}

// Reset satisfies the transform.Transformer interface; there is no state to reset
func (t *simpleTransformer) Reset() {}

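The implementation has to make sure it handles characters that are split across multiple calls of the Transform method, which is why it returns transform.ErrShortSrc to tell the transform.Reader that it needs more information about the next bytes.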
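
To make the hold-back idea concrete, here is a small sketch that uses only the standard library. It is not the answer's code: errShortSrc is a local stand-in for transform.ErrShortSrc, replaceChunk merely mimics the shape of Transform, and replaceStream imitates (in a simplified way) how a transform.Reader carries unconsumed source bytes into the next call. As one possible refinement, it holds back only a trailing prefix of the pattern rather than everything after the first '&'. It assumes the pattern does not overlap itself (true for &amp;) and that the replacement is no longer than the pattern.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// errShortSrc is a local stand-in for transform.ErrShortSrc, so that this
// sketch compiles without golang.org/x/text.
var errShortSrc = errors.New("need more source bytes")

// replaceChunk has the same shape as Transformer.Transform. It assumes
// len(dst) >= len(src) and len(repl) <= len(old), so dst cannot overflow.
func replaceChunk(dst, src, old, repl []byte, atEOF bool) (nDst, nSrc int, err error) {
	end := len(src)
	if !atEOF {
		// Hold back only a trailing prefix of old (e.g. "&am"); those
		// bytes are reported as unconsumed and seen again next call.
		for k := len(old) - 1; k > 0; k-- {
			if bytes.HasSuffix(src, old[:k]) {
				end -= k
				err = errShortSrc
				break
			}
		}
	}
	nDst = copy(dst, bytes.ReplaceAll(src[:end], old, repl))
	return nDst, end, err
}

// replaceStream feeds chunks through replaceChunk, prepending whatever the
// previous call left unconsumed. The last chunk is treated as EOF.
func replaceStream(chunks []string, old, repl string) string {
	var out, pending []byte
	for i, c := range chunks {
		pending = append(pending, c...)
		dst := make([]byte, len(pending))
		atEOF := i == len(chunks)-1
		nDst, nSrc, _ := replaceChunk(dst, pending, []byte(old), []byte(repl), atEOF)
		out = append(out, dst[:nDst]...)
		pending = pending[nSrc:]
	}
	return string(out)
}

func main() {
	// The entity is split as "&am" + "p;" across the two chunks.
	fmt.Println(replaceStream([]string{`{"name":"A&am`, `p;B"}`}, "&amp;", "&"))
}
```

Holding back only the trailing partial match means a lone & in the middle of a chunk does not stall the stream; whether that matters depends on the data being processed.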

It can now be used to replace characters in a stream:

var input = strings.NewReader(`{"name":"A&amp;B","comment":"foo&amp;bar"}`)
r := transform.NewReader(input, &simpleTransformer{[]byte(`&amp;`), []byte(`&`)})
io.Copy(os.Stdout, r) // Instead of io.Copy, use the JSON decoder to read from `r`

Output:

{"name":"A&B","comment":"foo&bar"}

You can also see this in action on the Go Playground.

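【Comments】:

Thanks for your answer! What happens if an entity is split between chunks of the "src" byte slice? For example, say a response body yields two batches of bytes to the transformer, the first ending with &am and the second starting with p;? Does the transform reader handle this case implicitly?
I was trying to come up with a solution earlier - this very edge case had me struggling too. Doing a byte-pattern replacement is easy, but the stream is read in N-byte chunks, and there is no guarantee that a match won't land on a boundary.
@colminator Right, transforming from one byte slice to another is easy, but the boundary-condition problem is what I was hoping had been solved elsewhere.
@maerics I updated my answer; I didn't even think of that edge case at first. It is no longer a "simple" transformer, but it seems to work correctly when handling overlaps at boundaries.
It looks like the SpanningTransformer interface is designed to handle the boundary edge case.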