【Question title】: strpbrk() in Python
【Posted】: 2019-04-24 04:01:58
【Question description】:

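In some Python code I'm writing, I need to count the occurrences of any of a set of characters in a string. In other words, I need to count the total occurrences of the characters [c1, c2, c3,...,cn] in a string.

In C, a function called strpbrk() can be used to do this, often using special instructions on x86 processors to make it faster.

In Python, I have written the following code, but it is the slowest part of my application.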

haystack = <query string>
gc_characters = 0
for c in ['c', 'C', 'g', 'G']:
    gc_characters += haystack.count(c)

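Is there a faster way?

【Comments】:

I assume haystack is fairly large here?
Do you want the number of occurrences of each character c1, c2, c3... etc.?
@DeveshKumarSingh The total count across all of the characters.

Tags: python python-3.x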

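【Solution 1】:

I may have gone a little overboard here, but the tl;dr is that the original code is actually faster than (EDIT: macOS's) strpbrk(), though some strpbrk() implementations may be faster!

str.count() uses this bundle of strange and beautiful magic in its guts - no wonder it's fast.

The full code is available at https://github.com/akx/so55822235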

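Python code

These approaches are all written in pure Python, including the OP's original.

from collections import Counter  # needed by gc_characters_counter below


def gc_characters_original(haystack):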
    gc_characters = 0
    for c in ("c", "C", "g", "G"):
        gc_characters += haystack.count(c)
    return gc_characters


def gc_characters_counter(haystack):
    counter = Counter(haystack)
    return sum(counter.get(c, 0) for c in ["c", "C", "g", "G"])


def gc_characters_manual(haystack):
    gc_characters = 0
    for x in haystack:
        if x in ("c", "C", "g", "G"):
            gc_characters += 1
    return gc_characters


def gc_characters_iters(haystack):
    gc_characters = haystack.count("c") + haystack.count("C") + haystack.count("g") + haystack.count("G")
    return gc_characters

Cython extension wrapping `strpbrk()`

from libc.string cimport strpbrk

cdef int _count(char* s, char *key):
    assert s is not NULL, "byte string value is NULL"
    cdef int n = 0
    cdef char* pch = strpbrk (s, key)
    while pch is not NULL:
        n += 1
        pch = strpbrk (pch + 1, key)
    return n

def count(s, key):
    return _count(s, key)

...

def gc_characters_cython(haystack_bytes):
    return charcount_cython.count(haystack_bytes, b"cCgG")
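
For readers who do not know C, the counting loop above can be pictured in pure Python. This is only a rough sketch of the idea: strpbrk_index and count_any are hypothetical helper names (not part of the linked repository), and the sketch returns an index where the C function returns a pointer.

```python
def strpbrk_index(s: bytes, accept: bytes, start: int = 0) -> int:
    """Index of the first byte at or after `start` that occurs in `accept`, or -1."""
    accept_set = set(accept)
    for i in range(start, len(s)):
        if s[i] in accept_set:
            return i
    return -1


def count_any(s: bytes, accept: bytes) -> int:
    """Count bytes of `s` that occur in `accept` by searching repeatedly."""
    n = 0
    i = strpbrk_index(s, accept)
    while i != -1:
        n += 1
        # Resume the search just past the previous hit, like strpbrk(pch + 1, key).
        i = strpbrk_index(s, accept, i + 1)
    return n


print(count_any(b"GATTACAgc", b"cCgG"))
```

Each call only finds the next match, which is why the wrapper has to call strpbrk() once per hit to get a total.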

Hand-made C extension wrapping `strpbrk()`

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <string.h>

static unsigned int count(const char *str, const char *key) {
  unsigned int n = 0;
  char *pch = strpbrk(str, key);
  while (pch != NULL) {
    n++;
    pch = strpbrk(pch + 1, key);
  }
  return n;
}

static PyObject *charcount_count(PyObject *self, PyObject *args) {
  const char *str, *key;
  Py_ssize_t strl, keyl;

  if (!PyArg_ParseTuple(args, "s#s#", &str, &strl, &key, &keyl)) {
    PyErr_SetString(PyExc_RuntimeError, "invalid arguments");
    return NULL;
  }
  int n = count(str, key);
  return PyLong_FromLong(n);
}

static PyMethodDef CharCountMethods[] = {
    {"count", charcount_count, METH_VARARGS,
     "Count the total occurrences of any s2 characters in s1"},
    {NULL, NULL, 0, NULL},
};

static struct PyModuleDef spammodule = {PyModuleDef_HEAD_INIT, "charcount",
                                        NULL, -1, CharCountMethods};

PyMODINIT_FUNC PyInit_charcount(void) { return PyModule_Create(&spammodule); }

...

def gc_characters_cext_b(haystack_bytes):
    return charcount.count(haystack_bytes, b"cCgG")


def gc_characters_cext_u(haystack):
    return charcount.count(haystack, "cCgG")

Measurements

On my Mac, counting cCgG in a string of one million random letters, i.e.

import random
import string
import timeit

haystack = "".join(random.choice(string.ascii_letters) for x in range(1_000_000))
haystack_bytes = haystack.encode()
print("original", timeit.timeit(lambda: gc_characters_original(haystack), number=100))
print("unrolled", timeit.timeit(lambda: gc_characters_iters(haystack), number=100))
print("cython", timeit.timeit(lambda: gc_characters_cython(haystack_bytes), number=100))
print("c extension, bytes", timeit.timeit(lambda: gc_characters_cext_b(haystack_bytes), number=100))
print("c extension, unicode", timeit.timeit(lambda: gc_characters_cext_u(haystack), number=100))
print("manual loop", timeit.timeit(lambda: gc_characters_manual(haystack), number=100))
print("counter", timeit.timeit(lambda: gc_characters_counter(haystack), number=100))

yields the following results (total seconds for 100 calls):

original               0.34033612700000004
unrolled               0.33661798900000006
cython                 0.6542106270000001
c extension, bytes     0.46668797900000003
c extension, unicode   0.4761082090000004
manual loop           11.625232557
counter                7.0389275090000005

So, unless the strpbrk() implementation in my Mac's libc is horribly underpowered (EDIT: which it is), it's best to just use .count().
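
If you want to repeat this kind of comparison on your own data, a small timing helper may be handy. The sketch below is an assumption about how such a harness could look, not the code from the repository; bench is a hypothetical name and the haystack is shortened to 100,000 letters to keep the run quick.

```python
import random
import string
import timeit


def bench(name, fn, number=20):
    """Time `fn` for `number` calls and print total seconds and calls per second."""
    seconds = timeit.timeit(fn, number=number)
    rate = number / seconds if seconds else float("inf")
    print(f"{name:<30} | {seconds:f} sec / {number} iter | {rate:.0f} iter/s")
    return seconds


haystack = "".join(random.choice(string.ascii_letters) for _ in range(100_000))

# The OP's approach: one str.count() call per character of interest.
bench("original", lambda: sum(haystack.count(c) for c in "cCgG"))
# A per-character Python loop, for contrast.
bench("manual loop", lambda: sum(1 for x in haystack if x in "cCgG"))
```

Absolute numbers will vary with the machine, the Python version and the libc in use, so only compare timings taken in the same run.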

EDIT

I added glibc's strcspn()/strpbrk(), and it's startlingly faster than the naïve version of strpbrk() shipped with macOS:

original                       0.329256
unrolled                       0.333872
cython                         0.433299
c extension, bytes             0.432552
c extension, unicode           0.437332
c extension glibc, bytes       0.169704 <-- new
c extension glibc, unicode     0.158153 <-- new

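glibc also has SSE2 and SSE4 versions of these functions, which could be faster still.

EDIT 2

I came back to this one more time, since I had an epiphany about how the clever lookup table in glibc's strcspn() could be used for character counting: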

size_t fastcharcount(const char *str, const char *haystack) {
  size_t count = 0;

  // Prepare lookup table.
  // It will contain 1 for all characters in the haystack.
  unsigned char table[256] = {0};
  unsigned char *ts = (unsigned char *)haystack;
  while(*ts) table[*ts++] = 1;

  unsigned char *s = (unsigned char *)str;
  #define CHECK_CHAR(i) { if(!s[i]) break; count += table[s[i]]; }
  for(;;) {
    CHECK_CHAR(0);
    CHECK_CHAR(1);
    CHECK_CHAR(2);
    CHECK_CHAR(3);
    s += 4;
  }
  #undef CHECK_CHAR
  return count;
}

The results are very impressive: it outperforms the glibc implementation about 4-fold and the original Python implementation about 8.5-fold.

original                       | 6.463880 sec / 2000 iter | 309 iter/s
unrolled                       | 6.378582 sec / 2000 iter | 313 iter/s
cython libc                    | 8.443358 sec / 2000 iter | 236 iter/s
cython glibc                   | 2.936697 sec / 2000 iter | 681 iter/s
cython fast                    | 0.766082 sec / 2000 iter | 2610 iter/s
c extension, bytes             | 8.373438 sec / 2000 iter | 238 iter/s
c extension, unicode           | 8.394805 sec / 2000 iter | 238 iter/s
c extension glib, bytes        | 2.988184 sec / 2000 iter | 669 iter/s
c extension glib, unicode      | 2.992429 sec / 2000 iter | 668 iter/s
c extension fast, bytes        | 0.754072 sec / 2000 iter | 2652 iter/s
c extension fast, unicode      | 0.762074 sec / 2000 iter | 2624 iter/s
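
The lookup-table trick in fastcharcount() can be shown in a few lines of Python. This is for illustration only: count_with_table is a hypothetical name, and given the "manual loop" timing above, a per-byte loop in Python should be expected to run far slower than the C version.

```python
def count_with_table(data: bytes, wanted: bytes) -> int:
    """Count bytes of `data` that occur in `wanted` using a 256-entry table."""
    # table[b] is 1 if byte value b is one we are counting, else 0.
    table = bytearray(256)
    for b in wanted:
        table[b] = 1
    # Every byte costs one table lookup, however many characters are wanted.
    return sum(table[b] for b in data)


print(count_with_table("GATTACAgc".encode(), b"cCgG"))
```

The point of the table is that the cost per input byte does not grow with the size of the character set, whereas calling .count() once per character scans the string once for each of them.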

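【Comments】:

This frankly just makes me impressed with .count().
@modesitt Yeah. It looks like quite a bit of optimization has gone into it... github.com/python/cpython/blob/master/Objects/stringlib/…
Very cool collection of comparisons here. I think the maximum possible speed would come from some implementation using _mm_cmpistrX(), but that may not be true, depending on the string implementation Python uses internally...
@leecbaker @modesitt It turns out macOS's strpbrk() is much slower than glibc's!
I added one more, even faster implementation.

【Solution 2】:

.count will iterate over haystack each time you call it - but it is heavily optimized compared with the alternatives I suggest here. It depends on how many characters you have in your real case. You could try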

from collections import Counter

cnt = Counter(haystack)
gc_characters = sum(cnt.get(e, 0) for e in ['c', 'C', 'g', 'G'])

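because this will iterate over the string once and store a count for every character that occurs. It might be marginally faster to look only for the characters you care about, and to keep those characters in a set for a quicker __contains__.

gc_chars = {'c', 'C', 'g', 'G'}
counts = {e: 0 for e in gc_chars}

# Walk the haystack once, counting only the characters of interest.
for c in haystack:
    if c in gc_chars:
        counts[c] += 1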

gc_characters = sum(counts.values())

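If you provide more details about the composition of haystack and how often this is called, we could try to help you more.

Caching

Another idea is that if haystack is often the same, you could perhaps keep an in-memory cache of the answer

from functools import lru_cache

@lru_cache(maxsize=128)
def haystack_metric(haystack):
    return sum(haystack.count(c) for c in ['c', 'C', 'g', 'G'])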

(with whichever implementation you choose). You could also explore ctypes - but I have little experience with it.
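
For what the ctypes route might look like, here is a minimal sketch. It assumes a POSIX system (Linux or macOS) where ctypes.CDLL(None) exposes the C library's symbols; count_with_strpbrk is a hypothetical name. It has not been benchmarked here, and since every hit costs a Python-to-C call it may well be slower than plain .count().

```python
import ctypes

# Assumption: on POSIX, CDLL(None) gives access to the already-loaded libc.
libc = ctypes.CDLL(None)
libc.strpbrk.restype = ctypes.c_void_p  # returned as an int address, or None for NULL
libc.strpbrk.argtypes = [ctypes.c_void_p, ctypes.c_char_p]


def count_with_strpbrk(data: bytes, accept: bytes) -> int:
    """Count bytes of `data` found in `accept` by calling libc's strpbrk() repeatedly."""
    buf = ctypes.create_string_buffer(data)  # NUL-terminated copy that we keep alive
    n = 0
    pos = libc.strpbrk(ctypes.addressof(buf), accept)
    while pos:
        n += 1
        # Continue one byte past the previous hit; the NUL terminator ends the scan.
        pos = libc.strpbrk(pos + 1, accept)
    return n


print(count_with_strpbrk(b"GATTACAgc", b"cCgG"))
```

As with any C string function, the scan stops at the first NUL byte, so this only suits data without embedded zero bytes.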

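【Comments】:

I tried that as well - using Counter is much slower than the OP's original idea.
Yes, unless he is looking for many more characters than the 4 supplied, I think Counter will remain slower for quite a while.
Why implement a custom cache decorator when functools already provides lru_cache?
That cache decorator would be better replaced with functools.lru_cache (especially since it dangerously ignores **kwargs!).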
