Cython to interface between Python and a C library with a char array of unknown size

【Question title】: Cython to interface between python and c-library with unknown size char array
【Posted】: 2020-08-19 14:48:47
【Question】:

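I have a C library that reads binary data from a file, converts it, and stores everything in one big char* so that the data can be returned to whatever calls it. This works fine in C, but with Python/Cython I run into problems allocating the memory.

The library prototype is: int readWrapper(struct options opt, char *lineOut);

My pyx file: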

from libc.string cimport strcpy, memset
from libc.stdlib cimport malloc, free
from libc.stdio cimport printf

cdef extern from "reader.h":
    struct options:
        int debug;
        char *filename; 

    options opt
    int readWrapper(options opt, char *lineOut);

def pyreader(file, date, debug=0):
    import logging
    cdef options options
    
    # Get the filename
    options.filename = <char *>malloc(len(file) * sizeof(char)) 
    options.debug = debug
    # Size of array
    outSize = 50000

    cdef char *line_output = <char *> malloc(outSize * sizeof(char))
    memset(line_output, 1, outSize)
    line_output[outSize] = 0

    # Call reader
    return_val = readWrapper(options, line_output)

    # Create dataframe
    from io import StringIO
    data = StringIO(line_output.decode('UTF-8', 'strict'))
    df = pd.read_csv(data, delim_whitespace=True, header=None)
    # Free memory
    free(line_output)
    return df

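This works fine as long as line_output stays within outSize. However, some files are larger, so how can I do this dynamically?

Edit following DavidW's suggestion

The reader wrapper looks like this: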

int readWrapper(struct options opt, char **lineOut)
{
    *lineOut = NULL;

    // Open file for reading
    FILE *fp = fopen(opt.filename, "r");

    // Check for valid fp
    if (fp == NULL)
    {
        printf("file pointer is null, aborting\n");
        return (EXIT_FAILURE);
    }

    // Allocate memory
    size_t ARRAY_SIZE = 5000;
    char *outLine = malloc(ARRAY_SIZE * sizeof (char));
    if (outLine == NULL)
    {
        fprintf(stderr, "Memory allocation failed!");
        fclose(fp);
        return (EXIT_FAILURE);
    }
    outLine[0] = '\0'; // valid empty string even if nothing is read

    // Create line and multi lines object
    char line[255];
    size_t numWritten = 0;
    size_t memIncrease = 10000;

    // Loop until end of file (assumes reader() always leaves a
    // NUL-terminated string in line)
    while (!feof(fp))
    {
        // Read part of file
        reader(fp, opt, line);
        size_t num2Write = strlen(line);
        if (ARRAY_SIZE < (numWritten + num2Write + 1))
        {   // Won't fit so enlarge outLine
            ARRAY_SIZE += memIncrease;
            // Use a temporary so the old block is not lost if realloc fails
            char *tmp = realloc(outLine, (sizeof *outLine * ARRAY_SIZE));
            if (tmp == NULL)
            {
                fprintf(stderr, "Memory re-allocation failed!");
                free(outLine);
                fclose(fp);
                return (EXIT_FAILURE);
            }
            outLine = tmp;
        }
        // Append the line whether or not the buffer had to grow
        sprintf(outLine + numWritten, "%s", line);
        numWritten += num2Write;
    } // data block loop
    *lineOut = outLine;

    fclose(fp);
    return (EXIT_SUCCESS);
}

The new pyx file:

from libc.string cimport strcpy, memset
from libc.stdlib cimport malloc, free
from libc.stdio cimport printf

cdef extern from "reader.h":
    struct options:
        int debug;
        char *filename; 

    options opt
    int readWrapper(options opt, char **lineOut);

def pyreader(file, date, debug=0):
    import logging
    cdef options options
    
    # Get the filename
    options.filename = <char *>malloc(len(file) * sizeof(char)) 
    options.debug = debug

    cdef char *line_output = NULL

    # Call reader
    return_val = readWrapper(options, &line_output)

    # Create dataframe
    from io import StringIO
    data = StringIO(line_output.decode('UTF-8', 'strict'))
    df = pd.read_csv(data, delim_whitespace=True, header=None)
    
    # Free memory
    free(line_output)
    free(options.filename)
    return df

This now works well, but using any printf or fprintf(stdout, ...) statement in the wrapper (C) or in the Python (pyx) part results in

Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe

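when running python3 test.py | head. Without head, no error is shown.

Finally, the suggestion regarding the file name and its allocation did not work for me either. Using options.filename = file compiles, but at run time it raises TypeError: expected bytes, str found. Interestingly, the BrokenPipeError only appears when I run the Python code that calls the wrapper as python3 test.py | head; without the pipe and head it is absent. So it is not a big problem, but I would like to understand what causes it.

Edit after some searching on the BrokenPipeError

The BrokenPipeError occurs with head but not with tail. An explanation of this "error" can be found here: https://stackoverflow.com/a/30091579/2885280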
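
For output written from the Python side, one way to avoid the message is to catch BrokenPipeError while writing and, when the stream is the real stdout, point its file descriptor at os.devnull before the interpreter flushes it on exit. What follows is only a minimal sketch along the lines of the SIGPIPE note in the Python signal documentation. emit_lines is a hypothetical helper that is not part of the code in this question, and it probably does not help with text written by printf on the C side, since that goes through C's own stdio buffer rather than sys.stdout.

```python
import os
import sys


def emit_lines(lines, stream=None):
    """Write lines to stream and stop quietly if the reader closed the pipe.

    Returns the number of lines that were written before the pipe broke.
    (Hypothetical helper, for illustration only.)
    """
    stream = sys.stdout if stream is None else stream
    written = 0
    try:
        for line in lines:
            stream.write(line + "\n")
            written += 1
        # Flush inside the try block so a broken pipe is reported here
        # and not later, during interpreter shutdown.
        stream.flush()
    except BrokenPipeError:
        # The reader (for example head) has already exited.
        if stream is sys.stdout:
            # Redirect the remaining output to devnull so the final
            # flush at exit does not raise a second BrokenPipeError.
            devnull = os.open(os.devnull, os.O_WRONLY)
            os.dup2(devnull, sys.stdout.fileno())
    return written


if __name__ == "__main__":
    emit_lines(str(i) for i in range(100000))
```

Saved as, say, test.py, this could be run as python3 test.py | head to check that the message no longer appears.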

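Solution pyx file:

The final reader.pyx file, used together with the readWrapper.c shown earlier. Memory allocation is handled by C, and the memory is freed by the pyx code (at the end).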

from libc.stdlib cimport free

cdef extern from "reader.h":
    struct options:
        int debug;
        char *filename; 
        char *DAY;

    options opt
    int readWrapper(options opt, char **lineOut);

def pyreader(file, date, debug=0):
    import logging
    import sys
    import errno
    import pandas as pd
    # Init return values
    a = pd.DataFrame()
    cdef options options
    cdef char *line_output = NULL

    # logging
    logging.basicConfig(stream=sys.stdout, 
                        format='%(asctime)s:%(process)d:%(filename)s:%(lineno)s:pyreader: %(message)s',
                        datefmt='%Y%m%d_%H.%M.%S',
                        level=logging.DEBUG if debug > 0 else logging.INFO)
    
    try:        
        # Check inputs
        if file is None:
            raise Exception("No valid filename provided")
        if date is None:
            raise Exception("No valid date provided")

        # Get the filename
        file_enc = file.encode("ascii")
        options.filename = file_enc
        # Get date
        day_enc = date.encode('ascii')
        options.DAY = day_enc
        
        try:
            # Call reader
            return_val = readWrapper(options, &line_output)

            if (return_val > 0):
                logging.error("pyreadASTERIX2 failed with exitcode {}".format(return_val))
                return a
        except Exception:
            logging.exception("Error occurred")
            free(line_output)
            return a

        from io import StringIO
        try:
            data = StringIO(line_output.decode('UTF-8', 'strict'))
            logging.debug("return_val: {} and size: {}".format(return_val, len(line_output.decode('UTF-8', 'strict'))))


            a = pd.read_csv(data, delim_whitespace=True, header=None, dtype={'id':str})
            if a.empty:
                logging.error("failed to load {} not enough data to construct DataFrame".format(file))
                return a 
            logging.debug("converted data into pd")
        except Exception as e:
            logging.exception("Exception occurred while loading: {} into DataFrame".format(file))
            return a
        finally:
            free(line_output)
        
        logging.debug("Size of df: {}".format(len(a)))
        # Success, return DataFrame
        return  a
    except Exception:
        logging.exception("pyreader returned with an exception:")
        return a

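【Comments】:

line_output[outSize] = 0 - this writes past the end of the array. Also, you have a few memory leaks.
The basic problem is that your C library has a completely broken interface - it writes to a C array without knowing the size of the array. There is no way to fix that short of abandoning this library.
Luckily I am in control of the library. The library is also used by an executable that prints the data to stdout. I also want the same data available in a pandas DataFrame, hence this pyx routine and the use of this char array and StringIO. Unfortunately, because of the binary nature of the files, there is no way of knowing in advance how much data will be printed or stored in the DataFrame. Any ideas on how to solve this, or on using something other than this char *? The unpacked data consists of lines, each containing (depending on the options) 20 to 25 "columns".
I would make readWrapper responsible for allocating the memory. Either return the array, or hand it back through a pointer-to-pointer argument.

Tags: python memory dynamic cython cythonize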

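【Solution 1】:

You have two basic options:

1. Work out how to calculate the size in advance.

 size = calculateSize(...)  # for example, by pre-reading the file
 line_output = <char*>malloc(size)
 return_val = readWrapper(options, line_output)

2. Make readWrapper responsible for allocating the memory. There are two commonly used patterns in C:

a. Return a pointer (perhaps using NULL to indicate an error):

char* readWrapper(options opt)

b. Pass a pointer-to-pointer and change it: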

// C 
int readWrapper(options opt, char** str_out) {
    // work out the length
    *str_out = malloc(length);
    // etc
}

# Cython
cdef char* line_out
return_value = readWrapper(options, &line_out)

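You need to make sure that every string you allocate is cleaned up. You still have a memory leak for options.filename. For options.filename you are probably better off getting a pointer to the contents of file through Cython. This is valid for as long as file exists, so you do not need to allocate anything:

options.filename = file

Just make sure that options does not outlive file (i.e. it is not stored anywhere in C for later use).

In general,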

something = malloc(...)
try:
    # code
finally:
    free(something)

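is a good pattern for ensuring clean-up.

【Comments】:

I went with option 2b and tried a few things. Unfortunately readWrapper is proprietary and I am not allowed to share it here, but it works as long as I remove all printf statements. fprintf(stderr, "string here") works fine, but with stdout I get BrokenPipeErrors that I really do not understand. As for the memory leaks, I free those as well, so I will edit my OP. Using your suggestion results in: TypeError: expected bytes, str found, which I do not understand either. Thanks so far!
I guess the encoding error is at options.filename = file? That is because Python 3 strings are Unicode, so they do not convert to C strings in an obvious way. You probably want file_enc = file.encode("ascii"); options.filename = file_enc. Obviously, ascii might not be the right encoding... I am afraid I have no clue about the BrokenPipeErrors.
Using options.filename = file.encode('ascii') gives a compile error: Storing unsafe C derivative of temporary Python reference, but using the intermediate file_enc you suggested works fine. Is this an issue with Cython?
@pdj That is because options.filename is just a pointer into the inside of the Python bytes object. You need to keep a reference to the Python object for the whole lifetime of options. With options.filename = file.encode('ascii'), the return value of file.encode is only kept alive briefly, so Cython (correctly) gives you an error.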
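
On the remark that ascii might not be the right encoding: a minimal sketch, assuming the file name should be encoded the way the operating system expects, is to use os.fsencode from the standard library. to_c_path is a hypothetical helper name, not something from the thread. In the pyx file the returned bytes object would still have to be bound to a local variable (like file_enc above) before it is assigned to options.filename, for the lifetime reason just described.

```python
import os


def to_c_path(path):
    """Encode a file name to bytes using the filesystem encoding.

    Accepts str, bytes or os.PathLike; bytes are returned unchanged.
    (Hypothetical helper, for illustration only.)
    """
    return os.fsencode(path)


# Keep the bytes object in a named variable so that it stays alive
# while any C pointer into its buffer is still in use.
file_enc = to_c_path("data/input.bin")
print(type(file_enc).__name__, file_enc)
```

The same call also works for names with non-ASCII characters, where encode("ascii") would raise UnicodeEncodeError.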
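I found out that the pipe problem is a result of how head works: it closes the pipe "prematurely", whereas tail, for example, does not break the pipe. See the answer here: stackoverflow.com/a/30091579/2885280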