【Posted】: 2019-04-04 02:31:10
【Question】:
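I am trying to implement a parallelized function for GeoPandas that takes a single vector dataset (i.e. a Shapefile containing MultiPolygon geometries) and converts it into a regular grid of cells with user-defined cell sizes in x and y.
Because this function can cause serious memory problems (e.g. when the requested spatial resolution is very fine), I would like to know whether the data can be saved iteratively to a given target file. That way, as each parallel process runs the "GRID" function, the same process could save its data iteratively in append mode, and I believe the memory problem would go away.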
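To make the append-mode idea concrete, here is a minimal single-process sketch. It is only an illustration under some assumptions: the names `iter_cells` and `append_cells` are hypothetical, cells are represented by their bounding-box coordinates rather than Shapely polygons, and the target is a plain text file rather than a Shapefile. The point is that a generator produces one cell at a time and the file is opened in append mode, so no complete grid is ever held in memory.

```python
import math


def iter_cells(bounds, dx=100, dy=100):
    """Lazily yield (xmin, ymin, xmax, ymax) for each grid cell covering `bounds`."""
    xmin, ymin, xmax, ymax = bounds
    x0, y0 = math.floor(xmin), math.floor(ymin)
    # Number of columns and rows needed to cover the bounding box
    nx = math.ceil((math.ceil(xmax) - x0) / dx)
    ny = math.ceil((math.ceil(ymax) - y0) / dy)
    for i in range(nx):
        for j in range(ny):
            x, y = x0 + i * dx, y0 + j * dy
            yield (x, y, x + dx, y + dy)


def append_cells(path, cells):
    """Append one 'xmin,ymin,xmax,ymax' line per cell; return how many were written."""
    count = 0
    with open(path, "a") as fh:  # append mode: earlier batches are kept
        for cell in cells:
            fh.write(",".join(str(v) for v in cell) + "\n")
            count += 1
    return count
```

Calling `append_cells("grid.csv", iter_cells(polygon.bounds, dx, dy))` once per polygon would grow the file batch by batch. As far as I know, recent GeoPandas versions also accept `mode="a"` in `to_file` for some drivers, which would be the closer equivalent for real vector formats.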
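Here is my "SHP_to_GRID_Function". Note that the code below still requires all of the data generated by multiprocessing to be held in memory at once (so it is bound to run out of memory for large datasets).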
import pandas as pd
import numpy as np
import geopandas as gpd
from shapely.geometry import Polygon
from multiprocessing import Pool
import os
from functools import partial

def info(title):
    """Print the module name and the process ids of the calling process."""
    print(title)
    print('module name:', __name__)
    print('parent process:', os.getppid())
    print('process id:', os.getpid())
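
def parallelize_df(gdf, func, n_cores, dx=100, dy=100, verbose=False):
    geometries = gdf.loc[:, 'geometry'].values
    # Fix dx, dy and verbose; each geometry is then passed as the last positional argument
    func_partial = partial(func, dx, dy, verbose)
    pool = Pool(processes=n_cores)
    results = pool.map(func_partial, geometries)  # the whole result is held in memory here
    pool.close()
    pool.join()
    # results is a list of lists (one per geometry) of unequal length,
    # so flatten it explicitly instead of relying on np.array(...).ravel()
    cells = [cell for sublist in results for cell in sublist]
    print(len(results), len(cells))
    GRID = gpd.GeoSeries(cells)
    print("GRID well created")
    return GRID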

def generate_grid_from_Poligon(dx=100, dy=100, verbose=False, polygon=None):
    if verbose:
        info('function parallelize_df')
    xmin, ymin, xmax, ymax = polygon.bounds
    # dx is the cell size along x (columns), dy the cell size along y (rows)
    cols = list(np.arange(int(np.floor(xmin)), int(np.ceil(xmax)), dx))
    rows = list(np.arange(int(np.floor(ymin)), int(np.ceil(ymax)), dy))
    rows.reverse()  # iterate from the top row down
    subpolygons = []
    for x in cols:
        for y in rows:
            # Each cell extends up and to the right of (x, y), so the grid covers the bounds
            subpolygons.append(Polygon([(x, y), (x + dx, y), (x + dx, y + dy), (x, y + dy)]))
    return subpolygons

def main(GDF, n_cores='standard', dx=100, dy=100, verbose=False):
    """
    GDF: geodataframe
    n_cores: 'standard' or a positive int; sets the number of processes used by multiprocessing
    dx: cell size along the x coordinate of the grid
    dy: cell size along the y coordinate of the grid
    """
    if isinstance(n_cores, str):
        import multiprocessing
        # Leave one core free, but always use at least one process
        N_cores = max(multiprocessing.cpu_count() - 1, 1)
    elif isinstance(n_cores, int):
        N_cores = n_cores
    else:
        raise TypeError("n_cores must be 'standard' or a positive int")
    GRID_GDF = parallelize_df(GDF, generate_grid_from_Poligon, n_cores=N_cores, dx=dx, dy=dy, verbose=verbose)
    return GRID_GDF
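
For comparison, this is roughly the streaming variant I am after, written as a dependency-free sketch rather than a working GeoPandas solution. The assumptions: `cells_as_wkt`, `stream_to_csv` and `write_grid` are hypothetical names, each polygon is reduced to its bounds tuple, cells are emitted as WKT strings (which `shapely.wkt.loads` can parse later), and only the parent process writes to the file, so the workers never compete for it. `Pool.imap` is used instead of `Pool.map` because it hands results back one at a time instead of building the full list first.

```python
import csv
import math
from multiprocessing import Pool


def cells_as_wkt(task):
    """Worker: return the grid cells covering one bounding box as WKT strings."""
    (xmin, ymin, xmax, ymax), dx, dy = task
    x0, y0 = math.floor(xmin), math.floor(ymin)
    nx = math.ceil((math.ceil(xmax) - x0) / dx)
    ny = math.ceil((math.ceil(ymax) - y0) / dy)
    cells = []
    for i in range(nx):
        for j in range(ny):
            x, y = x0 + i * dx, y0 + j * dy
            cells.append(
                f"POLYGON (({x} {y}, {x + dx} {y}, {x + dx} {y + dy}, "
                f"{x} {y + dy}, {x} {y}))"
            )
    return cells


def stream_to_csv(results, path):
    """Write each batch of WKT cells as it arrives; return the total cell count."""
    total = 0
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["source_index", "wkt"])
        for idx, cells in enumerate(results):
            writer.writerows((idx, wkt) for wkt in cells)
            total += len(cells)
    return total


def write_grid(bounds_list, path, dx=100, dy=100, n_cores=2):
    """Grid every bounding box in parallel and stream the cells to `path`."""
    tasks = [(bounds, dx, dy) for bounds in bounds_list]
    with Pool(processes=n_cores) as pool:
        # imap yields results lazily and in input order, so source_index
        # matches the position of each geometry in bounds_list
        return stream_to_csv(pool.imap(cells_as_wkt, tasks), path)
```

With a GeoDataFrame this might be called as `write_grid([g.bounds for g in GDF.geometry], "grid.csv", dx, dy)`, inside an `if __name__ == "__main__":` guard. It still has limits: a single very large polygon produces its whole cell list inside one worker, and batches that finish before the parent has written the previous ones are buffered, so this reduces peak memory rather than bounding it strictly.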
Thank you for your time,
Sincerely,
菲利普·利尔
【Comments】:
Tags: python-3.x python-multiprocessing geopandas