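The simplest approach is to preprocess your data into the right format:
a list of dictionaries, one per row, where the keys are your column names and the values are that row's values.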
data = [
    dict(id='cqug90j', var1=0, var2=1),
    dict(id='cqug90k', var1=7, var2=10)
    ...
    ...
]
Then you can use pd.DataFrame.from_dict(data).
Even for millions of values, this should only take a few seconds.
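As a minimal sketch of that target format, the two rows shown above can be turned into a DataFrame directly. The variable names `rows` and `df` are only illustrative; passing a list of dictionaries through `pd.DataFrame.from_dict` with the default orientation hands the data to the regular `pd.DataFrame` constructor, which accepts a list of dicts.

```python
import pandas as pd

# Two rows in the target format: one dictionary per row,
# keys are column names, values are that row's values.
rows = [
    dict(id='cqug90j', var1=0, var2=1),
    dict(id='cqug90k', var1=7, var2=10),
]

# With the default orient='columns', from_dict passes the data
# straight to the DataFrame constructor.
df = pd.DataFrame.from_dict(rows)

print(df)
```

Calling `pd.DataFrame(rows)` should give the same result, since that constructor is where the list ends up anyway.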
Example
import time
from itertools import product
import numpy as np
import pandas as pd

def generate_data(size=4_000_000):
    # Build test data shaped like {id: [var1, var2]}, one dictionary per item.
    data = []
    iterator = product('abcdefghijklmnopqrstuvwxyz', repeat=6)
    start_time = time.perf_counter()
    while len(data) < size:
        data.append({''.join(next(iterator)): [np.random.randint(-256, 256), np.random.randint(-256, 256)]})
    print(f"Generated: {len(data):,d} items in {time.perf_counter() - start_time:5.2f}s")
    return data
Generating the test data takes about 30 seconds on my laptop.
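To see where the ids come from, here is a small sketch of the key generation on its own. `first_keys` is a hypothetical helper, not part of the code above; it only shows that `itertools.product` with `repeat=6` yields six-letter tuples in lexicographic order, and that 26**6 combinations are more than enough for 4,000,000 items.

```python
from itertools import islice, product


def first_keys(count, alphabet='abcdefghijklmnopqrstuvwxyz', length=6):
    """Return the first `count` ids that product() yields, joined into strings."""
    combos = product(alphabet, repeat=length)
    return [''.join(letters) for letters in islice(combos, count)]


# The last letter varies fastest, so the ids count up like an odometer.
print(first_keys(3))  # ['aaaaaa', 'aaaaab', 'aaaaac']

# Number of distinct six-letter ids available.
print(26 ** 6)  # 308915776
```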
def reprocess(data):
    start_time = time.perf_counter()
    data = [dict(id=key, var1=var1, var2=var2) for dictionary in data for key, (var1, var2) in dictionary.items()]
    print(f"Reprocessed: {len(data):,d} items in {time.perf_counter() - start_time:5.2f}s")
    return data
The interesting part is:
data = [dict(id=key, var1=var1, var2=var2) for dictionary in data for key, (var1, var2) in dictionary.items()]
This is a list comprehension, equivalent to:
reprocessed = []
for dictionary in data:
    for key, (var1, var2) in dictionary.items():
        reprocessed.append(dict(id=key, var1=var1, var2=var2))
data = reprocessed
This takes about 2 seconds.
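A tiny worked example may make the reshaping easier to follow. The input below is just two hand-written items in the `{id: [var1, var2]}` shape, and the names `raw`, `flat_comprehension` and `flat_loop` are illustrative only. The loop version writes into a separate list, so it never iterates over the list it is filling.

```python
# Two items in the original shape: {id: [var1, var2]}.
raw = [
    {'aaaaaa': [173, -191]},
    {'aaaaab': [238, -60]},
]

# Comprehension form: one flat dictionary per (key, values) pair.
flat_comprehension = [
    dict(id=key, var1=var1, var2=var2)
    for dictionary in raw
    for key, (var1, var2) in dictionary.items()
]

# Equivalent explicit loops, appending to a separate result list.
flat_loop = []
for dictionary in raw:
    for key, (var1, var2) in dictionary.items():
        flat_loop.append(dict(id=key, var1=var1, var2=var2))

print(flat_loop[0])  # {'id': 'aaaaaa', 'var1': 173, 'var2': -191}
print(flat_comprehension == flat_loop)  # True
```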
def generate_dataframe(data):
    start_time = time.perf_counter()
    df = pd.DataFrame.from_dict(data)
    print(f"Generate df: {len(df):,d} items in {time.perf_counter() - start_time:5.2f}s")
    return df
This takes about 5 seconds on my machine.
Running the full code
if __name__ == '__main__':
    data = generate_data(size=4_000_000)
    data = reprocess(data)
    df = generate_dataframe(data)
    print(f"\n{df.head()}", end="\n\n")
Output:
Generated: 4,000,000 items in 30.75s
Reprocessed: 4,000,000 items in 1.47s
Generate df: 4,000,000 items in 3.70s
       id  var1  var2
0  aaaaaa   173  -191
1  aaaaab   238   -60
2  aaaaac   -59   -25
3  aaaaad  -225   236
4  aaaaae   137   -18
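Conclusion
Turning 4 million items into a DataFrame takes about 6 seconds in total (reprocessing plus DataFrame construction, not counting the time to generate the test data). I'm not sure whether you need it to be faster, but I think this is a good start.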