How to set the coordinates of the output of xarray.assign?

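【Question title】: How to set the coordinates of the output of xarray.assign?
【Posted】: 2021-12-23 11:24:48
【Question】:

I have been trying to create two new variables based on the latitude coordinate of the data points in an xarray Dataset. However, I only seem to be able to assign new coordinates. The dataset looks like this: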

<xarray.Dataset>
Dimensions:  (lon: 360, lat: 180, time: 412)
Coordinates:
  * lon      (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 355.5 356.5 357.5 358.5 359.5
  * lat      (lat) float64 -89.5 -88.5 -87.5 -86.5 -85.5 ... 86.5 87.5 88.5 89.5
  * time     (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Data variables:
    evapr    (time, lat, lon) float32 ...
    lhtfl    (time, lat, lon) float32 ...
...

This is what I have tried so far:

def get_latitude_band(latitude):
    return np.select(
        condlist=
        [abs(latitude) < 23.45,
         abs(latitude) < 35,
         abs(latitude) < 66.55],
        choicelist=
        ["tropical",
         "sub_tropical",
         "temperate"],
        
        default="frigid"
    )

def get_hemisphere(latitude):
    return np.select(
        [latitude > 0, latitude <=0],
        ["north", "south"]
    )

    
mhw_data = mhw_data \
    .assign(climate_zone=get_latitude_band(mhw_data.lat)) \
    .assign(hemisphere=get_hemisphere(mhw_data.lat)) \
    .reset_index(["hemisphere", "climate_zone"]) \
    .reset_coords()
            
print(mhw_data)

This gets me close:

<xarray.Dataset>
Dimensions:        (lon: 360, lat: 180, time: 412, hemisphere: 180, climate_zone: 180)
Coordinates:
  * lon            (lon) float64 0.5 1.5 2.5 3.5 4.5 ... 356.5 357.5 358.5 359.5
  * lat            (lat) float64 -89.5 -88.5 -87.5 -86.5 ... 86.5 87.5 88.5 89.5
  * time           (time) datetime64[ns] 1981-09-01 1981-10-01 ... 2015-12-01
Dimensions without coordinates: hemisphere, climate_zone
Data variables:
    evapr          (time, lat, lon) float32 ...
    lhtfl          (time, lat, lon) float32 ...
    ...
    hemisphere_    (hemisphere) object 'south' 'south' ... 'north' 'north'
    climate_zone_  (climate_zone) object 'frigid' 'frigid' ... 'frigid' 'frigid'
...

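However, I want to stack the Dataset and convert it to a DataFrame. I can't, and I think that is because the new variables hemisphere_ and climate_zone_ do not have the time, lat and lon coordinates: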

stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T

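This raises a KeyError on "lon".

So my question is: how do I assign new variables to an xarray Dataset so that they keep the original time, lat and lon coordinates?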

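【Question comments】:

Tags: python pandas python-xarray

【Solution 1】:

To assign a new variable or coordinate, xarray needs to know the names of its dimensions. There are many ways to define a DataArray or coordinate, but the one closest to what you are currently doing is to provide a tuple of (dim_names, array):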

mhw_data = mhw_data.assign_coords(
    climate_zone=(('lat', ), get_latitude_band(mhw_data.lat)),
    hemisphere=(('lat', ), get_hemisphere(mhw_data.lat)),
)

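Here I am using assign_coords, which defines climate_zone and hemisphere as non-dimension coordinates. You can think of these as additional metadata about the latitudes and about your data, but they are not data variables in their own right. This also means they are preserved when you send individual arrays to pandas.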
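To make the (dim_names, array) form concrete, here is a minimal sketch on a tiny synthetic dataset. The dataset `ds` and its values are hypothetical stand-ins for mhw_data, and get_latitude_band is a condensed copy of the function from the question.

```python
import numpy as np
import xarray as xr


def get_latitude_band(latitude):
    # Condensed copy of the function from the question.
    return np.select(
        [abs(latitude) < 23.45, abs(latitude) < 35, abs(latitude) < 66.55],
        ["tropical", "sub_tropical", "temperate"],
        default="frigid",
    )


# Hypothetical stand-in for mhw_data: 2 time steps, 4 latitudes, 3 longitudes.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"), np.zeros((2, 4, 3), dtype="float32"))},
    coords={
        "time": np.array(["2000-01-01", "2000-02-01"], dtype="datetime64[ns]"),
        "lat": [-80.0, -30.0, 10.0, 50.0],
        "lon": [0.5, 1.5, 2.5],
    },
)

# The tuple tells xarray that the labels vary along "lat" only.
ds = ds.assign_coords(
    climate_zone=(("lat",), get_latitude_band(ds.lat.values)),
)

print(ds.climate_zone.dims)
print("climate_zone" in ds.coords, "climate_zone" in ds.data_vars)
```

The new entry shows up under Coordinates without a leading asterisk, which is how xarray marks a non-dimension coordinate.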

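As for stacking, converting to pandas stacks automatically. The following returns a DataFrame with the variables and non-dimension coordinates as columns and the dimensions as a MultiIndex: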

stacked = mhw_data.to_dataframe()
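As a rough illustration of what to_dataframe returns, the sketch below uses a small made-up dataset (`ds`, with placeholder integer time steps) that already carries a hemisphere coordinate along lat. It is a stand-in for mhw_data, not the real data.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for mhw_data with one non-dimension coordinate.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"),
               np.arange(12, dtype="float32").reshape(2, 3, 2))},
    coords={
        "time": [0, 1],  # placeholder integer time steps
        "lat": [-45.0, 0.0, 45.0],
        "lon": [0.5, 1.5],
        "hemisphere": (("lat",), ["south", "south", "north"]),
    },
)

df = ds.to_dataframe()

print(df.index.names)    # the three dimensions form the MultiIndex
print(list(df.columns))  # data variable plus the non-dimension coordinate
print(len(df))           # one row per (time, lat, lon) combination
```

Each hemisphere label is repeated for every lon and time that shares its lat, so the column lines up row by row with evapr.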

Alternatively, if you want a Series indexed by (lat, lon, time) for just one of these coordinates, you can always use expand_dims:

(
    mhw_data.climate_zone
    .expand_dims(lon=mhw_data.lon, time=mhw_data.time)
    .to_series()
)
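The same pattern can be tried on a toy dataset to see the resulting index. In this sketch `ds` is a hypothetical stand-in for mhw_data, and the reorder_levels and sort_index calls are an optional addition for getting the levels into (lat, lon, time) order.

```python
import numpy as np
import xarray as xr

# Hypothetical stand-in for mhw_data with a climate_zone coordinate on lat.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"), np.zeros((2, 3, 2), dtype="float32"))},
    coords={
        "time": [0, 1],  # placeholder integer time steps
        "lat": [-70.0, 10.0, 50.0],
        "lon": [0.5, 1.5],
        "climate_zone": (("lat",), ["frigid", "tropical", "temperate"]),
    },
)

series = (
    ds.climate_zone
    # Add the missing dimensions, labelled with the dataset's own coordinates.
    .expand_dims(lon=ds.lon.values, time=ds.time.values)
    .to_series()
    # Optional: put the index levels in (lat, lon, time) order.
    .reorder_levels(["lat", "lon", "time"])
    .sort_index()
)

print(series.index.names)
print(len(series))
```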

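【Comments】:

Unfortunately, this still assigns these variables (climate_zone and hemisphere) to only one coordinate (lat). So the error persists when stacking on any dimension other than "lat". What I really need is to add a new variable, not a coordinate. The coordinates indexing these variables should be the same as for the rest: "lat", "lon" and "time". I will say that this is a cleaner way to get the same result as all those chained assignments.
See my edit: you can just use .to_pandas() directly, or use expand_dims to do what you want.
Hey, that almost worked. Your edit leads to an error, but the error message tells you how to fix it: cannot convert Datasets with 3 dimensions ... Please use Dataset.to_dataframe() instead. With that fix I will definitely switch the accepted answer from mine to yours.
Ah, thanks for catching and reporting that! I wasn't aware of that about to_pandas ;)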

【Solution 2】:

Here are two possible solutions I worked out for myself:

First, stack the xarray data into a pandas DataFrame and then create the new columns:

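import pandas as pd
import swifter  # registers the .swifter accessor on pandas objects
from tqdm import tqdm

df = None
variables = list(mhw_data.data_vars)

# Stack each data variable into a long Series and collect them column by column.
for var in tqdm(variables):
    stacked = mhw_data[var].stack(dim=["lon", "lat", "time"]).to_pandas().T
    if df is None:
        df = stacked
    else:
        df = pd.concat([df, stacked], axis=1)

df.reset_index(inplace=True)
df.columns = list(mhw_data.variables)

# Derive the new columns from the latitude of each row.
df["climate_zone"] = df["lat"].swifter.apply(get_latitude_band)
df["hemisphere"] = df["lat"].swifter.apply(get_hemisphere)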

Second, create a new xarray.DataArray for each variable you want to add, and then add them to the Dataset:

import numpy as np
import xarray as xr

# Calculate the climate zone and hemisphere from latitude.
# Reshaping to (lat, 1) lets the labels broadcast against (lon, lat, time).
latitudes = mhw_data.lat.values.reshape(-1, 1)

zones = get_latitude_band(latitudes)
hemispheres = get_hemisphere(latitudes)

# Take advantage of NumPy broadcasting to line the data up with the dataset's shape.
# np.broadcast_to is used because the labels are strings, which cannot be added
# to an array of zeros.
dims = tuple(mhw_data.sizes)  # ("lon", "lat", "time") for this dataset
shape = tuple(mhw_data.sizes.values())

zones = np.broadcast_to(zones, shape)
hemispheres = np.broadcast_to(hemispheres, shape)

# Finally, create two new DataArrays and assign them as variables in the dataset.
zone_xarray = xr.DataArray(data=zones, coords=mhw_data.coords, dims=dims)
hemi_xarray = xr.DataArray(data=hemispheres, coords=mhw_data.coords, dims=dims)

mhw_data["zone"] = zone_xarray
mhw_data["hemisphere"] = hemi_xarray

# ... then call the code that stacks and converts to pandas (shown in method 1) ...

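My hunch is that method 1 is faster and more memory efficient, because no repeated values need to be broadcast into a large 3-dimensional array. However, I have not tested this.

Also, my hunch is that there is a less cumbersome, xarray-native way to achieve the same thing, but I could not find one.

One thing is certain: method 1 is much more concise, because there is no need to create intermediate arrays or reshape the data.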
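One candidate for such an xarray-native route, offered as a sketch rather than as something taken from either answer, is to build a 1-D DataArray along lat and let DataArray.broadcast_like expand it by dimension name. The toy dataset `ds` below is a hypothetical stand-in for mhw_data, and get_latitude_band is a condensed copy of the function from the question.

```python
import numpy as np
import xarray as xr


def get_latitude_band(latitude):
    # Condensed copy of the function from the question.
    return np.select(
        [abs(latitude) < 23.45, abs(latitude) < 35, abs(latitude) < 66.55],
        ["tropical", "sub_tropical", "temperate"],
        default="frigid",
    )


# Hypothetical stand-in for mhw_data.
ds = xr.Dataset(
    {"evapr": (("time", "lat", "lon"), np.zeros((2, 3, 2), dtype="float32"))},
    coords={
        "time": [0, 1],  # placeholder integer time steps
        "lat": [-70.0, 10.0, 50.0],
        "lon": [0.5, 1.5],
    },
)

# 1-D labels along lat, carrying the dataset's own lat coordinate.
zone_1d = xr.DataArray(
    get_latitude_band(ds.lat.values), dims="lat", coords={"lat": ds.lat}
)

# Broadcast by dimension name, then match the dimension order of evapr.
zone_3d = zone_1d.broadcast_like(ds["evapr"]).transpose(*ds["evapr"].dims)

# Assigning a DataArray with named dims adds it as a data variable.
ds["climate_zone"] = zone_3d

print(ds["climate_zone"].dims)
print(ds["climate_zone"].shape)
```

Because the broadcast is done by name, there is no need to reshape the latitudes by hand or to know the order of the dataset's dimensions in advance.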

【Comments】:

Forget my ugly workaround; Michael has made some edits to his answer. Thanks, Michael!