[Posted]: 2021-11-09 10:25:06
[Question]:
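I am following the TFX on Cloud AI Platform Pipelines tutorial to implement a Kubeflow-orchestrated pipeline on Google Cloud. The main difference is that I am trying to implement an object detection solution instead of the taxi application the tutorial proposes.
For this reason I created (locally) an image dataset labeled with labelImg and converted it to a .tfrecord file using this script, then uploaded the result to a GS bucket. I then followed the TFX tutorial to create the GKE cluster (the default one, with this configuration) and the Jupyter Notebook needed to run the code, importing the same template.
The main difference is in the first component of the pipeline, where I replaced the CSVExampleGen component with an ImportExampleGen one: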
def create_pipeline(
    pipeline_name: Text,
    pipeline_root: Text,
    data_path: Text,
    # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
    # query: Text,
    preprocessing_fn: Text,
    run_fn: Text,
    train_args: tfx.proto.TrainArgs,
    eval_args: tfx.proto.EvalArgs,
    eval_accuracy_threshold: float,
    serving_model_dir: Text,
    metadata_connection_config: Optional[
        metadata_store_pb2.ConnectionConfig] = None,
    beam_pipeline_args: Optional[List[Text]] = None,
    ai_platform_training_args: Optional[Dict[Text, Text]] = None,
    ai_platform_serving_args: Optional[Dict[Text, Any]] = None,
) -> tfx.dsl.Pipeline:
    """Implements the chicago taxi pipeline with TFX."""
    components = []
    # Brings data into the pipeline or otherwise joins/converts training data.
    example_gen = tfx.components.ImportExampleGen(input_base=data_path)
    # TODO(step 7): (Optional) Uncomment here to use BigQuery as a data source.
    # example_gen = tfx.extensions.google_cloud_big_query.BigQueryExampleGen(
    #     query=query)
    components.append(example_gen)
No other components are added to the pipeline, and the data path points to the folder on the bucket that contains the .tfrecord file:
DATA_PATH = 'gs://(project bucket)/(dataset folder)'
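One way to rule out a malformed input file is to inspect a local copy of the .tfrecord before uploading it. The following is a minimal sketch, assuming an uncompressed TFRecord file (each record framed as an 8-byte little-endian length, a 4-byte CRC of the length, the payload, and a 4-byte CRC of the payload). The function names are hypothetical, and the CRCs are skipped rather than verified.

```python
import struct
from typing import BinaryIO, Dict, Iterator


def iter_record_sizes(stream: BinaryIO) -> Iterator[int]:
    """Yield the payload size of each record in an uncompressed TFRecord stream.

    Assumption: standard TFRecord framing. CRC fields are read but not checked.
    """
    while True:
        header = stream.read(12)  # 8-byte length + 4-byte CRC of the length
        if not header:
            return  # clean end of file
        if len(header) < 12:
            raise ValueError("truncated record header")
        (length,) = struct.unpack("<Q", header[:8])
        payload_and_crc = stream.read(length + 4)  # payload + 4-byte CRC
        if len(payload_and_crc) < length + 4:
            raise ValueError("truncated record payload")
        yield length


def summarize(path: str) -> Dict[str, int]:
    """Return record count, total payload bytes and largest record size."""
    with open(path, "rb") as f:
        sizes = list(iter_record_sizes(f))
    return {
        "records": len(sizes),
        "total_bytes": sum(sizes),
        "largest_record": max(sizes, default=0),
    }
```

Calling `summarize()` on a local copy of the file (the path is whatever the conversion script produced) shows whether the record count matches the number of labeled images and whether any single record is unexpectedly large.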
This is the runner code (essentially the same as in the TFX tutorial):
def run():
    """Define a kubeflow pipeline."""
    # Metadata config. The defaults work with the installation of
    # KF Pipelines using Kubeflow. If installing KF Pipelines using the
    # lightweight deployment option, you may need to override the defaults.
    # If you use Kubeflow, metadata will be written to MySQL database inside
    # Kubeflow cluster.
    metadata_config = (
        tfx.orchestration.experimental.get_default_kubeflow_metadata_config())
    runner_config = tfx.orchestration.experimental.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=metadata_config,
        tfx_image=configs.PIPELINE_IMAGE)
    pod_labels = {
        'add-pod-env': 'true',
        tfx.orchestration.experimental.LABEL_KFP_SDK_ENV: 'tfx-template'
    }
    tfx.orchestration.experimental.KubeflowDagRunner(
        config=runner_config, pod_labels_to_attach=pod_labels
    ).run(
        pipeline.create_pipeline(
            pipeline_name=configs.PIPELINE_NAME,
            pipeline_root=PIPELINE_ROOT,
            data_path=DATA_PATH,
            # TODO(step 7): (Optional) Uncomment below to use BigQueryExampleGen.
            # query=configs.BIG_QUERY_QUERY,
            preprocessing_fn=configs.PREPROCESSING_FN,
            run_fn=configs.RUN_FN,
            train_args=tfx.proto.TrainArgs(num_steps=configs.TRAIN_NUM_STEPS),
            eval_args=tfx.proto.EvalArgs(num_steps=configs.EVAL_NUM_STEPS),
            eval_accuracy_threshold=configs.EVAL_ACCURACY_THRESHOLD,
            serving_model_dir=SERVING_MODEL_DIR,
            # TODO(step 7): (Optional) Uncomment below to use provide GCP related
            #               config for BigQuery with Beam DirectRunner.
            # beam_pipeline_args=configs
            # .BIG_QUERY_WITH_DIRECT_RUNNER_BEAM_PIPELINE_ARGS,
            # TODO(step 8): (Optional) Uncomment below to use Dataflow.
            # beam_pipeline_args=configs.DATAFLOW_BEAM_PIPELINE_ARGS,
            # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
            # ai_platform_training_args=configs.GCP_AI_PLATFORM_TRAINING_ARGS,
            # TODO(step 9): (Optional) Uncomment below to use Cloud AI Platform.
            # ai_platform_serving_args=configs.GCP_AI_PLATFORM_SERVING_ARGS,
        ))


if __name__ == '__main__':
    logging.set_verbosity(logging.INFO)
    run()
The pipeline is then created and a run is started with the following commands from the Notebook:
!tfx pipeline create --pipeline-path=kubeflow_runner.py --endpoint={ENDPOINT} --build-image
!tfx run create --pipeline-name={PIPELINE_NAME} --endpoint={ENDPOINT}
The problem is that, while the example pipeline runs without issues, this pipeline always fails: the pod on the GKE cluster exits with code 137 (OOMKilled).
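For context, exit code 137 follows the common "128 + signal number" convention: 137 - 128 = 9, which is SIGKILL, the signal the Linux OOM killer sends, and Kubernetes reports the reason as OOMKilled when a container is killed for exceeding its memory limit. A small sketch of that decoding (the helper name `describe_exit_code` is hypothetical; it assumes a POSIX system):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))  # on Linux: terminated by SIGKILL
```

The decoding only says that the process was killed, not why; the OOMKilled reason comes from the pod status that Kubernetes records.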
This is a snapshot of the cluster workload status and this is a full log dump of the run that crashes.
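I have already tried reducing the dataset size (the whole .tfrecord is now about 6 MB) and splitting it locally into two sets (validation and training), since the crash seems to happen when the component is supposed to split the dataset, but neither change made any difference.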
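As background on the splitting step: when no split configuration is given, ExampleGen components by default divide the input into train and eval splits in a 2:1 ratio using hash buckets. The sketch below only illustrates the idea of hash-bucket splitting. It is not TFX's implementation: the SHA-256 hash is a stand-in chosen for this example, and the function names are hypothetical.

```python
import hashlib
from typing import Dict, Iterable, List, Sequence, Tuple

# Assumed default layout: 2 buckets for train, 1 for eval.
DEFAULT_BUCKETS: Tuple[Tuple[str, int], ...] = (("train", 2), ("eval", 1))


def assign_split(record: bytes,
                 buckets: Sequence[Tuple[str, int]] = DEFAULT_BUCKETS) -> str:
    """Deterministically map a serialized record to a split name."""
    total = sum(count for _, count in buckets)
    digest = hashlib.sha256(record).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    upper = 0
    for name, count in buckets:
        upper += count
        if bucket < upper:
            return name
    raise AssertionError("bucket index out of range")


def split_records(records: Iterable[bytes],
                  buckets: Sequence[Tuple[str, int]] = DEFAULT_BUCKETS
                  ) -> Dict[str, List[bytes]]:
    """Group records by the split that their hash bucket falls into."""
    result: Dict[str, List[bytes]] = {name: [] for name, _ in buckets}
    for record in records:
        result[assign_split(record, buckets)].append(record)
    return result
```

Because the assignment depends only on each record's bytes, a split of this kind is reproducible across runs and, in principle, does not require holding the whole dataset in memory.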
Do you have any idea why it runs out of memory, and what steps I could take to solve this?
Thank you very much.
[Discussion]:
Tags: google-cloud-platform google-kubernetes-engine kubeflow-pipelines tfx google-cloud-ai-platform-pipelines