How to achieve numpy indexing with an xarray Dataset: answers

【Question title】: How to achieve numpy indexing with xarray Dataset
【Posted】: 2021-04-22 19:00:48
【Question description】:

I know the x and y indices of a 2D array (numpy indexing).

According to the documentation, xarray uses e.g. Fortran-style indexing.

So when I pass, for example,

ind_x = [1, 2]
ind_y = [3, 4]

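I expect 2 values, one for each of the index pairs (1,3) and (2,4), but xarray returns a 2x2 matrix.

Now I would like to know how to achieve numpy-like indexing with xarray.

Note: I want to avoid loading the whole data into memory, so using the .values API is not part of the solution I am looking for.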
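
The behaviour described above can be reproduced on a small synthetic array. The sketch below uses made-up data and the hypothetical dimension names `y` and `x` (neither comes from the question); it only illustrates that passing two lists to xarray selects every combination of the indices, whereas numpy pairs them up element by element.

```python
import numpy as np
import xarray as xr

# Synthetic 5x6 array; the dimension names "y" and "x" are assumptions.
arr = np.arange(30).reshape(5, 6)
da = xr.DataArray(arr, dims=("y", "x"))

ind_x = [1, 2]
ind_y = [3, 4]

# numpy pairs the lists element by element: (y=3, x=1) and (y=4, x=2).
numpy_result = arr[ind_y, ind_x]

# xarray treats each list independently (orthogonal indexing): a 2x2 block.
xarray_result = da.isel(y=ind_y, x=ind_x)

print(numpy_result.shape)   # (2,)
print(xarray_result.shape)  # (2, 2)
```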

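【Comments on the question】:

Can you be a bit more specific? If you don't want to "load the whole data into memory", I assume you are using a dask-backed xarray?
As far as I understand, xarray only loads the header of the netCDF file into memory. Dask is used to apply functions to large datasets, since I think it only processes chunks of the data.
I have updated my answer for the dask use case. Of course, the performance depends heavily on how the data itself is stored and on how the dask chunking is set up for it.

Tags: python numpy python-xarray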

【Solution 1】:

You can access the underlying numpy array and index it directly:

import xarray as xr

x = xr.tutorial.load_dataset("air_temperature")

ind_x = [1, 2]
ind_y = [3, 4]

print(x.air.data[0, ind_y, ind_x].shape)
# (2,)

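Edit:

Assuming your data is in a dask-backed xarray object and you don't want to load all of it into memory, you need to use the vindex object of the dask array that underlies the xarray data: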

import xarray as xr

ind_x = [1, 2]
ind_y = [3, 4]

# chunking converts the variables to dask arrays
x = xr.tutorial.load_dataset("air_temperature").chunk({"time": 1})

# vindex performs pointwise (numpy-style) indexing on the dask array
extract = x.air.data.vindex[0, ind_y, ind_x]

print(extract.shape)
# (2,)

print(extract.compute())
# [267.1 274.1]

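【Comments】:

Sorry, I have updated the question. This solution works well for small batches of data, but I want to access thousands of files and avoid loading the data into memory. So I need to use the indexing tools of xarray itself.
Looks good. But do you know whether xarray loads all the data into memory when xarray+dask is not used? Because the netcdf4 library does: annefou.github.io/metos_python/07-LargeFiles
If possible, it will only load the data needed for the analysis (e.g. if the underlying data type allows it; I don't know how netCDF handles this). That is why the "best practice" for these workflows is to load the data as dask arrays, reduce it, and only compute at the very end. Of course there are exceptions (e.g. repeated tasks on the same data) where you may want to persist the data beforehand.
Using the netcdf4 library is 6 times faster than using xarray+Dask.
@dl.meteo It depends on the use case, but xarray+Dask may not be super fast out of the box. Configured properly, however, it can be very fast and, more importantly, very scalable. But that is beyond the scope of this question. You asked "How to achieve numpy indexing with xarray Dataset", and I have pointed out two ways. If this answers your question, please consider accepting my solution.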
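
For completeness, xarray itself also supports pointwise selection when the indexers are DataArray objects that share a common dimension (the xarray documentation calls this vectorized indexing). A minimal sketch on synthetic data follows. The array, the dimension names `lat`/`lon` and the new dimension name `points` are assumptions for illustration, and whether the selection stays lazy depends on how the data was opened.

```python
import numpy as np
import xarray as xr

# Synthetic (time, lat, lon) array standing in for a real variable.
data = np.arange(2 * 5 * 6).reshape(2, 5, 6)
da = xr.DataArray(data, dims=("time", "lat", "lon"))

# Indexers that share the (hypothetical) dimension "points" are paired up.
ind_y = xr.DataArray([3, 4], dims="points")
ind_x = xr.DataArray([1, 2], dims="points")

points = da.isel(time=0, lat=ind_y, lon=ind_x)

print(points.dims)    # ('points',)
print(points.values)  # [19 26]
```

The result has a single `points` dimension with one value per index pair, which matches what `data[0, [3, 4], [1, 2]]` returns in plain numpy.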

【Solution 2】:

To take speed into account, I ran a test with different methods.

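from pathlib import Path
from typing import List

import numpy as np
import xarray
from netCDF4 import Dataset


def method_1(file_paths: List[Path], indices) -> List[np.ndarray]:
    # netCDF4 library. Note: netCDF4 variables index integer sequences
    # orthogonally, so the result may differ from the numpy-style methods below.
    data = []
    for file in file_paths:
        d = Dataset(file, 'r')
        data.append(d.variables['hrv'][indices])
        d.close()
    return data


def method_2(file_paths: List[Path], indices) -> List[np.ndarray]:
    # xarray with the .values API: loads the whole variable, then indexes it
    data = []
    for file in file_paths:
        data.append(xarray.open_dataset(file, engine='h5netcdf').hrv.values[indices])
    return data


def method_3(file_paths: List[Path], indices) -> List[np.ndarray]:
    # xarray + dask: pointwise indexing via vindex, one .compute() per file
    data = []
    for file in file_paths:
        data.append(xarray.open_mfdataset([file], engine='h5netcdf').hrv.data.vindex[indices].compute())
    return data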

In [1]: len(file_paths)
Out[1]: 4813

Results:

method_1 (using the netcdf4 library): 101.9s
method_2 (using xarray and the values API): 591.4s
method_3 (using xarray+dask): 688.7s

I guess xarray+dask takes a long time in the .compute step.

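【Comments】:

If you want to use dask, I strongly recommend checking out the docs. In your specific example you call .compute inside a loop, which is generally bad practice, so I am not surprised that dask performs below expectations.
That is why I previously tried to open all files with open_mfdataset, but the documentation (xarray.pydata.org/en/stable/generated/…) does not explain any further how to stack 2D arrays along a new axis. I know that I have to define the concat_dims parameter, but there is no information on how.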
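
One way to address both comments is to stack the per-file arrays along a new dimension and select the points once, so that only a single compute is needed. The sketch below shows the idea in memory with `xr.concat`; for files, `xr.open_mfdataset` accepts `combine="nested"` together with `concat_dim` to do the same stacking. The variable name `hrv` is taken from the benchmark above, while the dimension names `y`/`x`, the new dimension `file` and the helper `extract_points` are assumptions, and the helper has not been run against the original files.

```python
import numpy as np
import xarray as xr


def extract_points(file_paths, ind_y, ind_x, engine="h5netcdf"):
    """Hypothetical helper: stack 2D files along a new "file" dimension."""
    ds = xr.open_mfdataset(
        file_paths, engine=engine, combine="nested", concat_dim="file"
    )
    # Assumes the variable "hrv" has dimensions named "y" and "x".
    picked = ds.hrv.isel(
        y=xr.DataArray(ind_y, dims="points"),
        x=xr.DataArray(ind_x, dims="points"),
    )
    return picked.compute()  # a single compute for all files


# In-memory illustration of the same stacking with synthetic 2D fields.
fields = [
    xr.Dataset({"hrv": (("y", "x"), np.full((5, 6), i) + np.arange(6))})
    for i in range(3)
]
stacked = xr.concat(fields, dim="file")

picked = stacked.hrv.isel(
    y=xr.DataArray([3, 4], dims="points"),
    x=xr.DataArray([1, 2], dims="points"),
)
print(picked.shape)  # (3, 2)
```

Each row of `picked` holds the selected points of one file. How this compares in speed with the netcdf4 loop would still have to be measured on the real data.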