【Question Title】: Given a column with string data, create a dataframe with the ASCII equivalent of each character in the string
【Posted】: 2019-07-30 22:00:10
【Question】:

I'm trying to convert a list of strings to their ASCII values and put each character in its own column of a dataframe. I have 30M such strings, and the code I'm running hits memory problems.

For example: strings = ['a','asd',1234,'ewq']

The desired dataframe:

     0      1      2     3
0   97    0.0    0.0   0.0
1   97  115.0  100.0   0.0
2   49   50.0   51.0  52.0
3  101  119.0  113.0   0.0

What I've tried: pd.DataFrame([[ord(chr) for chr in list(str(rec))] for rec in strings]).fillna(0)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py", line 435, in __init__
    arrays, columns = to_arrays(data, columns, dtype=dtype)
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 404, in to_arrays
    dtype=dtype)
  File "/root/anaconda3/lib/python3.7/site-packages/pandas/core/internals/construction.py", line 434, in _list_to_arrays
    content = list(lib.to_object_array(data).T)
  File "pandas/_libs/lib.pyx", line 2269, in pandas._libs.lib.to_object_array
MemoryError

Not sure if it's relevant, but strings is actually a column taken from another dataframe with .values.

Also, the longest string is almost 255 characters. I know 30M x 1000 is a big number. Is there any way around this?

【Comments】:

  • 30M is a big list; have you considered saving it to a txt file in chunks?

Tags: python pandas numpy dataframe ascii


【Solution 1】:

Have you tried explicitly setting the dtype to uint8 and then processing the data in chunks? From your sample code, I'd guess you are implicitly getting float64 (pandas' default once NaN padding is involved), which needs eight times the memory.
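To see why the dtype matters, here is a back-of-the-envelope sketch at the question's scale (assuming 30M rows padded out to the stated 255-character maximum):

```python
# Rough memory footprint for 30M strings padded to 255 characters,
# comparing the default float64 (8 bytes/value) against uint8 (1 byte/value).
rows, cols = 30_000_000, 255
float64_gb = rows * cols * 8 / 1e9
uint8_gb = rows * cols * 1 / 1e9
print(f"float64: {float64_gb:.1f} GB, uint8: {uint8_gb:.2f} GB")
# → float64: 61.2 GB, uint8: 7.65 GB
```

Even at one byte per character, the full dense table needs several gigabytes, which is why chunking is also part of the answer below.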

E.g. if you write the result to a csv file and your strings fit into memory, you could try the following code:

def prepare_list(string, n, default):
    size= len(string)
    res= [ord(char) for char in string[:n]]
    if size < n:
        res+= [default] * (n - size)
    return res

chunk_size= 10000 # number of strings to be processed per step
max_len= 4        # maximum number of columns (=characters per string)
column_names= [str(i+1) for i in range(max_len)] # column names used for the columns
with open('output.csv', 'wt') as fp:
    while string_list:
        df= pd.DataFrame([prepare_list(s, max_len, 0) for s in string_list[:chunk_size]], dtype='uint8', columns=column_names)
        df.to_csv(fp, header=fp.tell() == 0, index=False)
        string_list= string_list[chunk_size:]

When you read a csv created like this, you need to be careful to set the dtype to uint8 again to avoid the same problem, and to make sure reading the file doesn't turn the first column into the index. E.g. like this:

pd.read_csv('output.csv', dtype='uint8', index_col=False)
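A minimal end-to-end sketch of the chunked approach above, using the question's sample data and an in-memory buffer in place of 'output.csv' (note: prepare_list here adds a str() call so the integer 1234 in the sample works too, and chunk_size is shrunk to 2 for the demo):

```python
import io
import pandas as pd

def prepare_list(string, n, default=0):
    # ord values of the first n characters, zero-padded to width n
    chars = [ord(c) for c in str(string)[:n]]
    return chars + [default] * (n - len(chars))

strings = ['a', 'asd', 1234, 'ewq']
max_len = 4
buf = io.StringIO()
for start in range(0, len(strings), 2):  # chunk_size = 2 for the demo
    chunk = strings[start:start + 2]
    df = pd.DataFrame([prepare_list(s, max_len) for s in chunk],
                      dtype='uint8',
                      columns=[str(i) for i in range(max_len)])
    # write the header only once, at the start of the buffer
    df.to_csv(buf, header=buf.tell() == 0, index=False)

buf.seek(0)
result = pd.read_csv(buf, dtype='uint8')
```

Because `header=buf.tell() == 0` only emits column names for the first chunk, the concatenated csv parses back into a single uint8 dataframe matching the desired output.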

【Comments】:

    【Solution 2】:

    This uses the pandas sparse data type, but I only figured out how to apply it to the whole dataframe after building it. Note: I assume all entries are strings, not a mix of integers and strings.

    import pandas as pd
    import numpy as np
    strings = ['a','asd','1234','ewq']
    stringsSeries = pd.Series(strings)
    
    # Find padding length
    maxlen = max(len(s) for s in strings)
    
    # Use 8 bit integer with pandas sparse data type, compressing zeros
    dt = pd.SparseDtype(np.int8, 0)
    
    # Create the sparse dataframe from a pandas Series for each integer ord value, padded with zeros
    # NOTE: This compresses the dataframe after creation. I couldn't find the right way to compress
    # each series as the dataframe is built
    
    sdf = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0))))).astype(dt)
    print(f"Memory used: {sdf.info()}")
    
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 4 entries, 0 to 3
    # Data columns (total 4 columns):
    # 0    4 non-null Sparse[int8, 0]
    # 1    4 non-null Sparse[int8, 0]
    # 2    4 non-null Sparse[int8, 0]
    # 3    4 non-null Sparse[int8, 0]
    # dtypes: Sparse[int8, 0](4)
    # memory usage: 135.0 bytes
    # Memory used: None
    
    # The original uncompressed size
    df = stringsSeries.apply(lambda s: pd.Series((ord(c) for c in s.ljust(maxlen,chr(0)))))
    print(f"Memory used: {df.info()}")
    
    # <class 'pandas.core.frame.DataFrame'>
    # RangeIndex: 4 entries, 0 to 3
    # Data columns (total 4 columns):
    # 0    4 non-null int64
    # 1    4 non-null int64
    # 2    4 non-null int64
    # 3    4 non-null int64
    # dtypes: int64(4)
    # memory usage: 208.0 bytes
    # Memory used: None
    
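For completeness, here is a hypothetical inverse transform (not part of either answer): recovering the original strings from the zero-padded ord-value dataframe, on the assumption that 0 is only ever used as padding:

```python
import pandas as pd

# The desired dataframe from the question, as plain ord values
df = pd.DataFrame([[97, 0, 0, 0],
                   [97, 115, 100, 0],
                   [49, 50, 51, 52],
                   [101, 119, 113, 0]])

# Drop the zero padding and map each remaining value back to its character
recovered = df.apply(lambda row: ''.join(chr(v) for v in row if v != 0), axis=1)
print(recovered.tolist())  # → ['a', 'asd', '1234', 'ewq']
```

This assumption breaks if a string could legitimately contain NUL (ord 0) characters, but for printable text like the sample it round-trips cleanly.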

    【Comments】:

    • As in @jottbe's answer, uint8 is a better choice of data type than int8