Optimising a Python loop over large data: answers

【Question title】: Optimisation for large data when using a loop in Python
【Posted】: 2021-12-02 09:49:49
【Question】:

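I have a DataFrame with 15,000 rows of binary data, each a string of 365 characters. I convert each binary string into 365 daily records, starting on 13 December 2020.

Because the data is so large, my program runs very slowly. Is there a way to optimise it?

Sample data: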

ID	Nature	Binary
1122	M	1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100

Output:

ID	Nature	Date	Code
1122	M	13/12/2020	1
1122	M	14/12/2020	0
1122	M	..........	...
1122	M	11/12/2021	0

Code:

import pandas as pd

# First day of the series (13 December 2020, as in the expected output)
start_date = '2020-12-13'

# Expand the first row into one record per day
table_ = pd.DataFrame({'ID': df.id[0], 'Nature': df.Nature[0], 'Date': pd.date_range(start_date, periods=len(df.binairy[0]), freq='D'), 'Code': list(df.binairy[0])})

# Expand each remaining row and append it to the result
for i in range(1, len(df)):
    table_i = pd.DataFrame({'ID': df.id[i], 'Nature': df.Nature[i], 'Date': pd.date_range(start_date, periods=len(df.binairy[i]), freq='D'), 'Code': list(df.binairy[i])})

    table_ = pd.concat([table_, table_i], ignore_index=True)

table_

【Question discussion】:

Tags: python pandas database

【Solution 1】:

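The best way to cut the computation time is to parallelise the work: if you have multiple cores and/or threads (I assume you are running in a CPU-based environment), use the multiprocessing library.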
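A minimal sketch of that idea, assuming the column names used in the question's code (`id`, `Nature`, `binairy`). The helper names `expand_chunk` and `expand_parallel` and the default of 4 workers are hypothetical choices for illustration, not part of the original answer. Each worker expands a block of rows and concatenates once per block rather than once per row.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd

START_DATE = "2020-12-13"  # start date stated in the question


def expand_chunk(chunk: pd.DataFrame) -> pd.DataFrame:
    """Expand every row of `chunk` into one record per character of its binary string."""
    frames = [
        pd.DataFrame({
            "ID": row.id,
            "Nature": row.Nature,
            "Date": pd.date_range(START_DATE, periods=len(row.binairy), freq="D"),
            "Code": list(row.binairy),
        })
        for row in chunk.itertuples(index=False)
    ]
    # Concatenate once per chunk instead of once per row
    return pd.concat(frames, ignore_index=True)


def expand_parallel(df: pd.DataFrame, n_workers: int = 4) -> pd.DataFrame:
    """Split `df` into contiguous row blocks and expand them in separate processes."""
    bounds = np.linspace(0, len(df), n_workers + 1, dtype=int)
    # Skip empty blocks, which appear when there are fewer rows than workers
    chunks = [df.iloc[a:b] for a, b in zip(bounds[:-1], bounds[1:]) if b > a]
    with mp.Pool(n_workers) as pool:
        parts = pool.map(expand_chunk, chunks)
    return pd.concat(parts, ignore_index=True)
```

On platforms that start workers by spawning a fresh interpreter (Windows, macOS), call `expand_parallel(df)` from under an `if __name__ == "__main__":` guard. Process start-up and the cost of sending the blocks to the workers may eat into the gain, so it is worth timing against a single-process run.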

【Discussion】:

【Solution 2】:

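Do you have to process the data in a DataFrame, or could you load it into a database?

You can use a numbers table to split the string of 1s and 0s into rows with dates. For this implementation I borrowed the numbers-table generator from this answer on SO, so the following assumes you have already defined those views.

Create a table to hold your source data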

create table sample_data (
    id int,
    nature char(1),
    bin_str varchar(365)
);

For testing, I loaded 2,500 rows by duplicating a single row

insert sample_data(id, nature, bin_str) values (1,'M','1001100100100010010001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010011001001100100110010011001000000100110011011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100100110010011001001100110110010000001001100100110010011001001100');

insert sample_data(id, nature, bin_str)
select n, nature, bin_str
from sample_data s join generator_4k g
where g.n>1 and g.n<=2500;

Then split the binary string and add the dates

select id,
       nature,
       -- n starts at 1, so subtract 1 to put the first character on the start date
       date_add('2020-12-13', INTERVAL n - 1 DAY) date,
       substring(bin_str, n, 1)                   code
from generator_4k
         join sample_data
where generator_4k.n > 0 and generator_4k.n <= length(bin_str)
order by id, n;

id	nature	date	code
1	M	2020-12-13	1
1	M	2020-12-14	0
1	M	2020-12-15	0
.	.	..........	.
1	M	2021-12-11	0

My local machine takes a few seconds to process the 2,500 rows, so depending on how slow your existing solution is, YMMV.
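If the data has to stay in a DataFrame, the same numbers-table idea can be sketched in pandas: repeat each row once per character, number the characters within each string, and turn those numbers into date offsets. This is only an illustration under assumptions: the column names (`id`, `Nature`, `binairy`) come from the question's code, and `expand_binary` and `START_DATE` are hypothetical names.

```python
import numpy as np
import pandas as pd

START_DATE = pd.Timestamp("2020-12-13")  # start date from the question


def expand_binary(df: pd.DataFrame) -> pd.DataFrame:
    """Turn each binary string into one (ID, Nature, Date, Code) row per character."""
    lengths = df["binairy"].str.len().to_numpy()
    # Index of the first character of each string in the flattened character list
    row_starts = np.repeat(np.cumsum(lengths) - lengths, lengths)
    # Position of every character inside its own string: 0, 1, ..., len - 1
    offsets = np.arange(lengths.sum()) - row_starts
    return pd.DataFrame({
        "ID": np.repeat(df["id"].to_numpy(), lengths),
        "Nature": np.repeat(df["Nature"].to_numpy(), lengths),
        "Date": START_DATE + pd.to_timedelta(offsets, unit="D"),
        "Code": list("".join(df["binairy"])),
    })
```

Calling `expand_binary(df)` builds the whole result in one pass, with no Python-level loop over rows and no repeated `pd.concat`.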

【Discussion】: