Cython strings support: answers

【Question title】: Cython strings support
【Posted】: 2019-01-17 16:54:58
【Question】:

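I am trying to optimize some code. I have managed to optimize most of my project with Numpy and Numba, but some string-processing code remains that I cannot optimize with those tools. I would therefore like to try optimizing this part with Cython.

The code below takes a run-length-encoded string (a character, optionally followed by a number saying how many times that character repeats) and expands it. It then converts the expanded string into an array of 0s and 1s, using a dictionary lookup that maps each character to a sequence of 0s and 1s.

Is it possible to use Cython to optimize this code?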

import numpy as np
import re

vector_list = ["A22gA5BA35QA17gACA3QA7gA9IAAgEIA3wA3gCAAME@EACRHAQAAQBACIRAADQAIA3wAQEE}rm@QfpT}/Mp-.n?",
                "A64IA13CA5RA13wAABA5EAECA5EA4CEgEAABGCAAgAyAABolBCA3WA4GADkBOA?QQgCIECmth.n?"]


_base64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz@}]^+-*/?,._"
_bin2base64 = {"{:06b}".format(i): base64char for i, base64char in enumerate(_base64chars)}
_base642bin = {v: k for k, v in _bin2base64.items()}

_n_vector_ranks_only = np.arange(1023,-1,-1)


def _decompress_get(data):
    for match in re.finditer(r"(?P<char>.)((?P<count>\d+))?", data):
        if not match.group("count"): yield match.group("char")
        else: yield match.group("char") * int(match.group("count"))


def _n_apply_weights(vector):
    return np.multiply(vector, _n_vector_ranks_only)

def n_decompress(compressed_vector):
    decompressed_b64 = "".join(_decompress_get(compressed_vector))
    vectorized = "".join(_base642bin[c] for c in decompressed_b64)[:-2]
    as_binary = np.fromiter(vectorized, int)
    return as_binary


def test(x, y):
    if len(x) != 1024:
        x = n_decompress(x)
    vector_a = _n_apply_weights(x)
    if len(y) != 1024:
        y = n_decompress(y)
    vector_b = _n_apply_weights(y)
    maxPQ = np.sum(np.maximum(vector_a, vector_b))
    return np.sum(np.minimum(vector_a, vector_b))/maxPQ

v1 = vector_list[0]
v2= vector_list[1]
print(test(v1, v2))

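【Comments】:

The relevant code should be included in the question itself (not hosted off-site).
The code is 862 characters, which is too long.
I've edited the code into the question. The main reason for asking this is that we want questions to be useful to others in the future, and links can easily be removed.
OK, I'll delete my comments containing the code, thanks.
I've tried to make the question a bit clearer... check the edit to see whether you agree.

Tags: python numpy cython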

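【Solution 1】:

The second part of the problem (the dictionary lookup) can be sped up quite well with Numpy alone. I have replaced the dictionary lookup with indexing into a Numpy array.

I generate the Numpy array once at the start. One trick is to realise that ord converts a character into the underlying number that represents it. For an ASCII string this is always between 0 and 127: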

_base642bin_array = np.zeros((128,),dtype=np.uint8)
for i in range(len(_base64chars)):
    _base642bin_array[ord(_base64chars[i])] = i

In the n_decompress function I use built-in Numpy functions to do the conversion into 1s and 0s.

def n_decompress2(compressed_vector):
    # encode is for Python 3: str -> bytes
    decompressed_b64 = "".join(_decompress_get(compressed_vector)).encode()
    # byte string into the underlying numeric data
    decompressed_b64 = np.frombuffer(decompressed_b64, dtype=np.uint8)
    # conversion done by numpy indexing rather than dictionary lookup
    vectorized = _base642bin_array[decompressed_b64]
    # convert to a 2D array of 1s and 0s
    as_binary = np.unpackbits(vectorized[:,np.newaxis],axis=1)
    # remove the two digits you don't care about (always 0) from binary array
    as_binary = as_binary[:,2:]
    # reshape to 1D (and chop off two at the end)
    return as_binary.ravel()[:-2]
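
To check that the Numpy version produces the same bits as the dictionary version, a self-contained comparison along these lines may help. It is a sketch under a few assumptions: `n_decompress_reference` is a renamed copy of the question's `n_decompress` that converts each bit with `int()` explicitly, `np.frombuffer` stands in for the deprecated binary mode of `np.fromstring`, and `sample` is a made-up short input.

```python
import re
import numpy as np

_base64chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz@}]^+-*/?,._"
_base642bin = {c: "{:06b}".format(i) for i, c in enumerate(_base64chars)}

_base642bin_array = np.zeros((128,), dtype=np.uint8)
for i, c in enumerate(_base64chars):
    _base642bin_array[ord(c)] = i


def _decompress_get(data):
    # Unchanged logic from the question: expand "<char><optional count>" runs
    for match in re.finditer(r"(?P<char>.)((?P<count>\d+))?", data):
        if not match.group("count"):
            yield match.group("char")
        else:
            yield match.group("char") * int(match.group("count"))


def n_decompress_reference(compressed_vector):
    # Dictionary-based version (hypothetical name), used only as a reference
    decompressed_b64 = "".join(_decompress_get(compressed_vector))
    bits = "".join(_base642bin[c] for c in decompressed_b64)[:-2]
    return np.array([int(b) for b in bits], dtype=np.uint8)


def n_decompress2(compressed_vector):
    raw = "".join(_decompress_get(compressed_vector)).encode("ascii")
    codes = np.frombuffer(raw, dtype=np.uint8)
    values = _base642bin_array[codes]
    # 8 bits per value, most significant first; drop the two leading zeros
    as_binary = np.unpackbits(values[:, np.newaxis], axis=1)[:, 2:]
    return as_binary.ravel()[:-2]


sample = "A3B}2_"  # hypothetical input, expands to "AAAB}}_"
expected = n_decompress_reference(sample)
actual = n_decompress2(sample)
print(np.array_equal(expected, actual))  # True
```

Note that `n_decompress2` returns a `uint8` array rather than the default integer type, so a later multiplication by large weights may need an explicit cast to a wider type first.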

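This gives me a speedup of about 2.4x over your version (note that I haven't changed _decompress_get at all, so both timings include your _decompress_get), just from using Numpy (no Cython/Numba, and I suspect they wouldn't help much here). I think the main advantage is that indexing into an array with numbers is faster than a dictionary lookup.

_decompress_get could probably be improved with Cython, but that is a significantly harder problem...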
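
Short of Cython, one option that stays within Numpy is to let `np.repeat` do the run-length expansion. The sketch below is not part of the original answer: `decompress_codes` and `_token` are hypothetical names, and whether this is actually faster than the generator has not been measured here.

```python
import re
import numpy as np

# Same tokenisation as the question's regex: one character, then optional digits
_token = re.compile(r"(.)(\d*)")


def decompress_codes(data):
    """Expand a run-length-encoded string into an array of ASCII codes."""
    pairs = _token.findall(data)  # list of (char, count_string) tuples
    chars = np.frombuffer(
        "".join(c for c, _ in pairs).encode("ascii"), dtype=np.uint8
    )
    # A missing count means the character appears once
    counts = np.array([int(n) if n else 1 for _, n in pairs], dtype=np.intp)
    return np.repeat(chars, counts)


codes = decompress_codes("A3B}2_")  # hypothetical input
print(codes.tobytes())  # b'AAAB}}_'
```

Because the result is already a `uint8` array of ASCII codes, it could be used directly as the index into `_base642bin_array`, skipping the intermediate `"".join(...)` and `.encode()` steps.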

【Comments】:

I tried the code using a list of 1,000,000 vectors as a test, and the run went from 372 seconds down to 209 seconds. Very impressive, thanks. I was asking about Cython to improve the part of the test that should only involve numbers, so I wrote something like this:

cdef np.int_t[:] _n_apply_weights(np.int_t[:] vector):
    return np.multiply(vector, _n_vector_ranks_only)

cdef np.int_t[:] n_decompress(compressed_vector):
    # cdef char* decompressed_b64, vectorized
    decompressed_b64 = "".join(_decompress_get(compressed_vector))
    vectorized = "".join(_base642bin[c] for c in decompressed_b64)[:-2]
    cdef np.int_t[:] as_binary = np.fromiter(vectorized, int)
    return as_binary

cdef double cy_test(x=vector_list[0], y=vector_list[1]):
    cdef np.int_t[:] ix, iy, vector_a, vector_b
    if len(x) != 1024:
        ix = n_decompress(x)
    vector_a = _n_apply_weights(ix)
    if len(y) != 1024:
        iy = n_decompress(y)
    vector_b = _n_apply_weights(iy)
    cdef double maxPQ = np.sum(np.maximum(vector_a, vector_b))
    return np.sum(np.minimum(vector_a, vector_b))/maxPQ

But this doesn't seem to help; it takes more time than the Numpy version. I was reading this: notes-on-cython.readthedocs.io/en/latest/std_dev.html
This isn't the kind of code Cython is good at optimizing. All you are doing is calling Numpy functions that are already written in C, so all you have added is some unnecessary Cython type checking. In addition, you have turned n_decompress into a memory-management disaster (your char*s are only valid for as long as the variable they point to is valid, but that is a short-lived temporary).