How can I eliminate array bounds checking on this loop vectorization? (Answers)

[Question title]: How can I eliminate array bound checking on this loop vectorization?
[Posted]: 2014-07-13 03:10:31
[Question description]:

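My task is to split a varbinary(8000) column from a database table on the binary literal 0x0, over many runs. However, the delimiter might change, so I'd like to keep it variable. I'd like to use SQLCLR to do this quickly as a streaming table-valued function. I know my strings will always be at least a few thousand bytes each.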

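Edit: I've updated my algorithm to avoid the nastiness of the unrolled inner loop. But it is extremely hard to convince the CLR to make the right choices about register allocation. It would be great if there were an easy way to convince the CLR that j and i are really the same thing; instead it does some really dumb things. It would also be nice to optimize the first-pass loop, but C# does not allow a goto that jumps into a loop.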

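I decided to adapt a 64-bit implementation of the C function memchr. Basically, instead of scanning and comparing one byte at a time, I use some bit twiddling to scan 8 bytes at a time. For reference, Array.IndexOf<Byte> does a 4-byte scan to find a single answer; I just want to keep going after each match. A few things to note: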

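Memory pressure is a very real problem in SQLCLR functions. String.Split is out because it allocates a lot of memory up front, which I'd really like to avoid. It also works on UCS-2 strings, which would require me to convert my ASCII strings to Unicode strings and so have my data treated as a LOB data type on return. (SqlChars/SqlString can only return 4000 bytes before being turned into a LOB type.)
I'd like to stream. Another reason to avoid String.Split is that it does all of its work at once, creating a lot of memory pressure. On input with lots of delimiters, pure T-SQL methods will start beating it.
I'd like to keep it "safe", so everything is managed code. There seems to be a very large penalty in the security check otherwise.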
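The 8-bytes-at-a-time scan mentioned above rests on the well-known "does this word contain a zero byte" test. As a minimal sketch (the class and method names here are illustrative, not taken from the question's code), the test can be isolated like this. It assumes byte 0 is the least significant byte of the word, which matches memory order only on a little-endian machine.

```csharp
using System;

static class ZeroByteScan
{
    // Broadcasts one byte into all eight lanes of a 64-bit word.
    public static ulong Broadcast(byte b)
    {
        return b * 0x0101010101010101UL;
    }

    // Returns the index (0..7) of the lowest byte in `word` equal to
    // `delimiter`, or -1 if no byte matches.
    public static int IndexOfByte(ulong word, byte delimiter)
    {
        unchecked
        {
            // After the XOR, every byte equal to the delimiter becomes 0x00.
            ulong x = word ^ Broadcast(delimiter);
            // Sets the high bit of each lane that was zero. Lanes above a
            // true match can be flagged falsely because of borrows, so only
            // the lowest flagged lane is trusted.
            ulong flags = (x - 0x0101010101010101UL) & ~x & 0x8080808080808080UL;
            if (flags == 0) return -1;

            int index = 0;
            while ((flags & 0x80) == 0)
            {
                flags >>= 8;
                index++;
            }
            return index;
        }
    }
}
```

The question's code replaces the final while loop with a three-step binary search over the flag word; the linear version is used here only because it is easier to read.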

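Buffer.BlockCopy is really fast, and paying that cost once up front seems better than constantly paying the cost of BitConverter. It is also still cheaper than, say, converting my input to a string and holding on to that reference.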
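To make the trade-off concrete, here is a small sketch of the two ways of getting 64-bit words out of a byte array. PackBytes, ToWords and ReadWord are hypothetical names chosen for illustration; the relative cost of the two approaches is the question author's observation and is not measured here.

```csharp
using System;

static class PackBytes
{
    // One-time cost: copy the input into a zero-padded ulong[] so that the
    // scan can later read whole words with a plain array index.
    public static ulong[] ToWords(byte[] bytes)
    {
        var words = new ulong[(bytes.Length + 7) / 8];
        // Buffer.BlockCopy offsets and counts are in bytes, not elements.
        Buffer.BlockCopy(bytes, 0, words, 0, bytes.Length);
        return words;
    }

    // Per-read cost: one BitConverter call, with its own argument checks,
    // every time a word is needed. Requires 8 readable bytes at the offset.
    public static ulong ReadWord(byte[] bytes, int wordIndex)
    {
        return BitConverter.ToUInt64(bytes, wordIndex * 8);
    }
}
```

Both paths use the machine's native byte order, so ToWords(bytes)[k] and ReadWord(bytes, k) agree for every complete word.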

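The code is quite fast, but it seems I am paying for quite a few bounds checks, both in the initial loop and in the critical section when I find a match. As a result, on input with lots of delimiters I tend to lose to a simpler C# enumerator that just does a byte comparison.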
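The simpler byte-comparison enumerator is not shown in the question. As an assumption about what such a baseline might look like (SimpleSplitter is a hypothetical name), a byte-at-a-time splitter with the same segment semantics as the code below could be sketched as:

```csharp
using System;
using System.Collections.Generic;

static class SimpleSplitter
{
    // Yields each run of bytes between occurrences of `delimiter`.
    // Adjacent delimiters yield empty segments; a trailing delimiter does
    // not produce a final empty segment.
    public static IEnumerable<byte[]> Split(byte[] bytes, byte delimiter)
    {
        int start = 0;
        for (int i = 0; i < bytes.Length; i++)
        {
            if (bytes[i] != delimiter) continue;

            var item = new byte[i - start];
            Buffer.BlockCopy(bytes, start, item, 0, item.Length);
            start = i + 1;
            yield return item;
        }

        // Whatever follows the last delimiter is the final segment.
        if (start < bytes.Length)
        {
            var tail = new byte[bytes.Length - start];
            Buffer.BlockCopy(bytes, start, tail, 0, tail.Length);
            yield return tail;
        }
    }
}
```

Because it is an iterator, it streams one segment at a time, which fits the memory-pressure concerns listed above.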

Here is my code:

using System;
using System.Collections;

class SplitBytesEnumeratorA : IEnumerator
{
    // Fields
    private readonly byte[] _bytes;
    private readonly ulong[] _longs;
    private readonly ulong _comparer;
    private readonly Record _record = new Record();
    private int _start;
    private readonly int _length;

    // Methods
    internal SplitBytesEnumeratorA(byte[] bytes, byte delimiter)
    {
        this._bytes = bytes;
        this._length = bytes.Length;
        // we do this so that we can avoid a spillover scan near the end.
        // in unsafe implementation this would be dangerous as we potentially
        // will be reading more bytes than we should.

        this._longs = new ulong[(_length + 7) / 8];
        Buffer.BlockCopy(bytes, 0, _longs, 0, _length);
        var c = (((ulong)delimiter << 8) + (ulong)delimiter);
        c = (c << 16) + c;
        // comparer is now 8 copies of the original delimiter.
        c |= (c << 32);
        this._comparer = c;
    }

    public bool MoveNext()
    {
        if (this._start >= this._length) return false;
        int i = this._start;
        var longs = this._longs;
        var comparer = this._comparer;
        var record = this._record;
        record.id++;
        // handle the case where start is not divisible by eight.
        for (; (i & 7) != 0; i++)
        {
            if (i == _length || _bytes[i] == (comparer & 0xFF))
            {
                record.item = new byte[(i - _start)];
                Buffer.BlockCopy(_bytes, _start, record.item, 0, i - _start);
                _start = i + 1;
                return true;
            }
        }

        // Main loop. We crawl the array 8 bytes at a time.
        for (int j = i / 8; j < longs.Length; j++)
        {
            ulong t1 = longs[j];
            unchecked
            {
                t1 ^= comparer;
                ulong t2 = (t1 - 0x0101010101010101) & ~t1;
                if ((t2 & 0x8080808080808080) != 0)
                {
                    i = j * 8;
                    // Make every case 3 comparisons instead of n. Potentially better.
                    // This is an unrolled binary search.
                    if ((t2 & 0x80808080) == 0)
                    {
                        i += 4;
                        t2 >>= 32;
                    }

                    if ((t2 & 0x8080) == 0)
                    {
                        i += 2;
                        t2 >>= 16;
                    }

                    if ((t2 & 0x80) == 0)
                    {
                        i++;
                    }
                    record.item = new byte[(i - _start)];
                    // Improve cache locality by not switching collections.
                    Buffer.BlockCopy(longs, _start, record.item, 0, i - _start);
                    _start = i + 1;
                    return true;
                }
            }
            // No match found in this word; the loop advances by 8 bytes.
        }
        // No matches left. Let's return the remaining buffer.
        record.item = new byte[(_length - _start)];
        Buffer.BlockCopy(longs, _start, record.item, 0, (_length - _start));
        _start = _bytes.Length;
        return true;
    }

    void IEnumerator.Reset()
    {
        throw new NotImplementedException();
    }

    public object Current
    {
        get
        {
            return this._record;
        }
    }
}

// We use a class to avoid boxing.
class Record
{
    internal int id;
    internal byte[] item;
}

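[Question discussion]:

Possible duplicate of "Array bounds check efficiency in .net 4 and above".
I don't know whether it is exactly the same. When I run the x64 release build, the code emitted for the critical loop crawl performs 8 bounds checks for the byte scan (the interesting bit being if (bytes[i] == _delim)). In fact it generates quite a lot of code to do the comparison, which is a bit worrying. I wonder whether it is worth revisiting the byte crawl.
The .NET JIT is not a very good optimizing compiler; it eliminates bounds checks only in the primitive case of a 0-to-length array traversal (and very simple variants). You're out of luck. (Or use unsafe code, but I have seen that inhibit other optimizations and make things slower.)
You're getting into "fast, safe, cheap: pick 2 of 3" territory here. Measure your actual performance and decide whether it is really unacceptable. The cost of speeding it up may not be justified.
Certainly correct. I decided to gather more statistics and build a more realistic baseline to find out where my algorithm really chokes (as can be expected, I/O is the overwhelmingly dominant part of the equation). I'm now fighting over hundreds of milliseconds, but it is bothersome that something that should be faster isn't. Full repro and test code for SQL Server 2008+ here: gist.github.com/mburbea/e72151af503873d82d6f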
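To illustrate the comment above about the JIT only eliminating bounds checks for plain 0-to-length traversals, here is a minimal sketch of the two loop shapes. Which checks a given JIT version actually removes is an assumption based on that comment; the only reliable way to know is to inspect the generated machine code. BoundsCheckPatterns and its methods are hypothetical names.

```csharp
static class BoundsCheckPatterns
{
    // Canonical shape: the index runs from 0 to data.Length on a local array
    // reference. This is the pattern the comment says the JIT can prove safe.
    public static int CountCanonical(byte[] data, byte value)
    {
        int count = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == value) count++;
        }
        return count;
    }

    // Same work, but the start and the limit are arbitrary variables. The
    // JIT has no simple proof that i stays in range, so each data[i] access
    // is likely to keep its range check.
    public static int CountFrom(byte[] data, int start, int limit, byte value)
    {
        int count = 0;
        for (int i = start; i < limit; i++)
        {
            if (data[i] == value) count++;
        }
        return count;
    }
}
```

The question's first loop (starting at _start and stopping on (i & 7) != 0) and its main loop (starting at i / 8) both resemble the second shape, which is consistent with the bounds checks the author observed.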

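Tags: c# arrays performance clr sqlclr

[Solution 1]:

Thinking outside the box, have you considered converting the string to XML and using XQuery to do the split?

For example, you could pass in the delimiter and do something like this (air code):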

DECLARE @xml as xml
DECLARE @str as varchar(max)
SET @str = (SELECT CAST(t.YourBinaryColumn AS varchar(max)) FROM [tableName] t)
SET @xml = cast(('<X>'+replace(@str,@delimiter,'</X><X>')+'</X>') as xml)

This converts the binary value to a string and replaces each delimiter with XML tags. Then:

SELECT N.value('.', 'varchar(10)') as value FROM @xml.nodes('X') as T(N)

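will return the individual "elements", i.e. the data between each occurrence of the delimiter.

Maybe this idea is useful as it stands, or as a catalyst you can build on.

[Discussion]:

This approach performs much worse than a CLR splitter or even a tally-table splitter. See this post: sqlblog.com/blogs/paul_white/archive/2012/09/05/… Plus, it only handles one row at a time. I want to be able to process a range of rows at once.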