【问题标题】:How to create a single dask array from multiple files?如何从多个文件创建单个 dask 数组?
【发布时间】:2020-02-16 18:17:57
【问题描述】:

我正在尝试从多个文件创建一个 dask array。我正在使用dask.array.Array 类来做到这一点。考虑以下代码 sn-p,其中我生成 100 随机整数 array 大小为 (3, 10, 10) 并将它们中的每一个保存在单独的 npy 文件中。然后我尝试创建一个 dask array 将所有这些数组组合成一个形状为 (3, 100, 100)dask array

import numpy as np
from itertools import product
from dask import array as da
from dask.base import tokenize

names = list()
for i in range(100):
    arr = np.random.randint(0, 9, (3, 10, 10))
    fn = 'data/array_{}.npy'.format(i)
    np.save(fn, arr)
    names.append('Array-{}'.format(tokenize(fn)))

indices = list(product(range(10), range(10)))

dsk = {
    (name, 0, *index): (np.load, name)
    for name, index in zip(names, indices)
}
namex = 'Combined_Array'
dtype=int
shape = (3, 100, 100)
chunks = (3, 10, 10)
d = da.Array(dsk, namex, chunks, dtype, shape)

不幸的是它在normalize_chunks方法中抛出了一个错误:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-008559464c9e> in <module>
----> 1 d = da.Array(dsk, namex, chunks, dtype, shape)

~/.conda/envs/Py3Dev/lib/python3.7/site-packages/dask/array/core.py in __new__(cls, dask, name, chunks, dtype, meta, shape)
   1026         else:
   1027             dt = None
-> 1028         self._chunks = normalize_chunks(chunks, shape, dtype=dt)
   1029         if self._chunks is None:
   1030             raise ValueError(CHUNKS_NONE_ERROR_MESSAGE)

~/.conda/envs/Py3Dev/lib/python3.7/site-packages/dask/array/core.py in normalize_chunks(chunks, shape, limit, dtype, previous_chunks)
   2481             )
   2482 
-> 2483     return tuple(tuple(int(x) if not math.isnan(x) else x for x in c) for c in chunks)
   2484 
   2485 

~/.conda/envs/Py3Dev/lib/python3.7/site-packages/dask/array/core.py in <genexpr>(.0)
   2481             )
   2482 
-> 2483     return tuple(tuple(int(x) if not math.isnan(x) else x for x in c) for c in chunks)
   2484 
   2485 

TypeError: 'int' object is not iterable

我在这里做错了吗?

【问题讨论】:

    标签: python arrays dask dask-distributed


    【解决方案1】:

    将参数作为命名参数传递解决了该问题,因为此示例中未包含参数并且这些参数的顺序已更改。代码中还有其他错误。正确的代码是:

    import numpy as np
    from itertools import product
    from dask import array as da
    
    names = list()
    for i in range(100):
        arr = np.random.randint(0, 9, (3, 10, 10))
        fn = 'data/array_{}.npy'.format(i)
        np.save(fn, arr)
        names.append(fn)
    
    indices = list(product(range(10), range(10)))
    namex = 'Combined_Array'
    dsk = {
        (namex, 0, *index): (np.load, name)
        for name, index in zip(names, indices)
    }
    
    dtype=int
    shape = (3, 100, 100)
    chunks = (3, 10, 10)
    
    d = da.Array(dask=dsk, name=namex, chunks=chunks, dtype=dtype, shape=shape)
    

    详细讨论请参见相关github issue

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2017-08-20
      • 1970-01-01
      • 2021-05-14
      • 2021-01-26
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 2023-03-19
      相关资源
      最近更新 更多