【Title】: MemoryError merging two dataframes with pandas and dask---how can I do this?
【Posted】: 2017-04-07 20:20:21
【Question】:

I have two dataframes in pandas. I want to merge them, but I keep running into a MemoryError. What workaround can I use?

The setup is as follows:

import pandas as pd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
print(df1.shape) # output: (4757076, 4)
print(df2.shape) # output: (428764, 45)


df1.head()

   column1  begin    end  category
0   class1  10001  10468     third
1   class1  10469  11447     third
2   class1  11505  11675    fourth
3   class2  15265  15355   seventh
4   class2  15798  15849    second


df2.head()

   column1  begin   ....
0   class1  10524   ....
1   class1  10541   ....
2   class1  10549   ....
3   class1  10565   ...
4   class1  10596   ...

I just want to merge these two DataFrames on "column1". However, this always causes a MemoryError.

Let's first try this in pandas, on a system with roughly 2 TB of RAM and hundreds of threads:

import pandas as pd
df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")
merged = pd.merge(df1, df2, on="column1", how="outer", suffixes=("","_repeated"))

Here's the error I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 39, in merge
    return op.get_result()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 217, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 353, in _get_join_info
    sort=self.sort, how=self.how)
  File "/nfs/sw/python/python-3.5.1/lib/python3.5/site-packages/pandas/tools/merge.py", line 559, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/src/join.pyx", line 160, in pandas.algos.full_outer_join (pandas/algos.c:61256)
MemoryError

That didn't work. Let's try with dask:


import pandas as pd
import dask.dataframe as dd
from numpy import nan


ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)

Here's the error I get:

Traceback (most recent call last):
  File "repeat_finder.py", line 15, in <module>
    merged = dd.merge(ddf1, ddf2,on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
  File "/path/python3.5/site-packages/dask/base.py", line 78, in compute
    return compute(self, **kwargs)[0]
  File "/path/python3.5/site-packages/dask/base.py", line 178, in compute
    results = get(dsk, keys, **kwargs)
  File "/path/python3.5/site-packages/dask/threaded.py", line 69, in get
    **kwargs)
  File "/path/python3.5/site-packages/dask/async.py", line 502, in get_async
    raise(remote_exception(res, tb))
dask.async.MemoryError: 

Traceback
---------
  File "/path/python3.5/site-packages/dask/async.py", line 268, in execute_task
    result = _execute_task(task, data)
  File "/path/python3.5/site-packages/dask/async.py", line 249, in _execute_task
    return func(*args2)
  File "/path/python3.5/site-packages/dask/dataframe/methods.py", line 221, in merge
    suffixes=suffixes, indicator=indicator)
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 59, in merge
    return op.get_result()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 503, in get_result
    join_index, left_indexer, right_indexer = self._get_join_info()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 667, in _get_join_info
    right_indexer) = self._get_join_indexers()
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 647, in _get_join_indexers
    how=self.how)
  File "/path/python3.5/site-packages/pandas/tools/merge.py", line 876, in _get_join_indexers
    return join_func(lkey, rkey, count, **kwargs)
  File "pandas/src/join.pyx", line 226, in pandas._join.full_outer_join (pandas/src/join.c:11286)
  File "pandas/src/join.pyx", line 231, in pandas._join._get_result_indexer (pandas/src/join.c:11474)
  File "path/python3.5/site-packages/pandas/core/algorithms.py", line 1072, in take_nd
    out = np.empty(out_shape, dtype=dtype, order='F')

How can I get this to work, even if it is shamelessly inefficient?

EDIT: In response to the suggestion to merge on two columns/indexes, I don't think I can do that. Here is the code I'm trying to run:

import pandas as pd
import dask.dataframe as dd

df1 = pd.read_csv("first1.csv")
df2 = pd.read_csv("second2.csv")

ddf1 = dd.from_pandas(df1, npartitions=2)
ddf2 = dd.from_pandas(df2, npartitions=2)

merged = dd.merge(ddf1, ddf2, on="column1", how="outer", suffixes=("","_repeat")).compute(num_workers=60)
merged = merged[(ddf1.column1 == row.column1) & (ddf2.begin >= ddf1.begin) & (ddf2.begin <= ddf1.end)]
merged = dd.merge(ddf2, merged, on = ["column1"]).compute(num_workers=60)
merged.to_csv("output.csv", index=False)

【Comments】:

  • "Roughly 2 TB of RAM and hundreds of threads" -- wow. First off, are you on linux? If so, check the ulimit and/or rlimit for the task.
  • @BrianCain Good idea. Still--how would I do that? :) These dataframes aren't that
  • OK... having seen your edit, your approach seems wrong, IMHO. Please explain what you intend to do. It looks like you want to clip merged down to a specific set of rows. What is in rows? I think you can solve this in a simpler way.

Tags: python pandas merge out-of-memory dask


【Solution 1】:

You cannot merge the two dataframes on column1 alone, because column1 is not a unique identifier for each row in either dataframe. Try:

merged = pd.merge(df1, df2, on=["column1", "begin"], how="outer", suffixes=("","_repeated"))

If you also have an end column in df2, you may want to try:

merged = pd.merge(df1, df2, on=["column1", "begin", "end"], how="outer", suffixes=("","_repeated"))
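The effect of the extra key columns can be seen on toy frames shaped like the question's data (the values below are made up for illustration): with a non-unique key, the merge pairs every matching left row with every matching right row, so row counts multiply per key.

```python
import pandas as pd

# Toy frames mimicking the question's layout (hypothetical values):
# "class1" appears 3 times on the left and 4 times on the right.
left = pd.DataFrame({"column1": ["class1"] * 3 + ["class2"],
                     "begin": [10001, 10469, 11505, 15265]})
right = pd.DataFrame({"column1": ["class1"] * 4,
                      "begin": [10001, 10469, 10549, 10565]})

# Merging on the non-unique key alone pairs every left "class1" row
# with every right "class1" row: 3 * 4 = 12 rows, plus 1 unmatched.
wide = pd.merge(left, right, on="column1", how="outer",
                suffixes=("", "_repeated"))
print(len(wide))    # 13

# Adding "begin" to the key makes the matches (nearly) one-to-one:
# 2 matched pairs plus 2 left-only and 2 right-only rows.
narrow = pd.merge(left, right, on=["column1", "begin"], how="outer",
                  suffixes=("", "_repeated"))
print(len(narrow))  # 6
```

At the scale in the question (4.7M and 0.4M rows sharing a handful of key values), this multiplication is what makes the single-key merge blow up.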

【Discussion】:

  • This doesn't answer the OP's question. The OP wants an outer join on "column1" and is getting a MemoryError. "column1" not being unique is irrelevant to the merge or to the MemoryError. The OP probably hasn't been allocated enough resources for the task on the server.
  • From my own experience, I've run into similar "MemoryError" problems when merging dataframes. column1 not being unique may only be harmless when the data size isn't too large. Given the sample dataframes posted in the question, merging on column1 alone can make the merged dataframe blow up dramatically in size, which may cause the memory error. I think merging on multiple columns, rather than column1 alone, may be more reasonable in this case.
  • Right, and the OP is on a 2TB RAM system... The frames the OP is working with would produce at most a 5185840 x 49 frame. That's nothing compared to 2 TB. My guess is that with a plain OS the data could be merged on a 4GB machine. Easily on an 8 GB machine...
  • I see. It may also depend on the IDE being used, which may have its own limits. So I was just suggesting: why not try merging the data on multiple columns.
  • @mikeqfu "You cannot merge the two dataframes on column1 alone, because column1 is not a unique identifier for each row in either dataframe." I still don't quite understand this. Why is "column1" not "a unique identifier for each row in either dataframe"? What difference does it make whether this column is unique or not?