使用 AugmentedManifest 在 SageMaker 中进行语义分割失败答案

【问题标题】：Semantic Segmentation failing in SageMaker using AugmentedManifest使用 AugmentedManifest 在 SageMaker 中进行语义分割失败
【发布时间】：2021-08-17 03:52:23
【问题描述】：

我正在使用增强清单，所有标签都使用 mTurk 完成，并且我正在尝试使用这些文件训练模型。

我有一个 Jupyter Notebook、Python 3.7 和 TensorFlow 2。

首先，我进行一些基本的初始化并配置清单文件位置。

import boto3
import re
import sagemaker
from sagemaker import get_execution_role
import time
from time import gmtime, strftime
import json

role = get_execution_role()
sess = sagemaker.Session()
s3 = boto3.resource("s3")

training_image = sagemaker.amazon.amazon_estimator.image_uris.retrieve(
    "semantic-segmentation", boto3.Session().region_name
)

augmented_manifest_filename_train = (
    "output.manifest"
)
bucket_name = "<private>"  

s3_output_path = "s3://{}/output".format(bucket_name)
s3_train_data_path = "s3://{}/output/trees-and-houses/manifests/output/{}".format(
    bucket_name, augmented_manifest_filename_train)

augmented_manifest_s3_key = s3_train_data_path.split(bucket_name)[1][1:]
s3_obj = s3.Object(bucket_name, augmented_manifest_s3_key)
augmented_manifest = s3_obj.get()["Body"].read().decode("utf-8")
augmented_manifest_lines = augmented_manifest.split("\n")
num_training_samples = len(augmented_manifest_lines)

这一切都很好。我可以打印我的清单文件并查看其属性。然后，我配置作业：

# Create unique job name
job_name_prefix = "groundtruth-augmented-manifest-demo"
timestamp = time.strftime("-%Y-%m-%d-%H-%M-%S", time.gmtime())
job_name = job_name_prefix + timestamp
s3_output_location = "s3://{}/training_outputs/".format(bucket_name)

并创建估计器和超参数

# Create a model object set to using "Pipe" mode.
model = sagemaker.estimator.Estimator(training_image,
                                      role,
                                      instance_count=1,
                                      instance_type='ml.p3.8xlarge',
                                      volume_size = 50,
                                      max_run = 360000,
                                      input_mode = 'Pipe',
                                      output_path=s3_output_location,
                                      job_name=job_name,
                                      sagemaker_session=sess)

model.set_hyperparameters(
    backbone="resnet-101",
    algorithm="psp",
    use_pretrained_model="False", 
    crop_size=240,
    num_classes=3,
    epochs=10,
    base_size=540,
    learning_rate=0.0001,
    optimizer="rmsprop",
    lr_scheduler="poly",
    mini_batch_size=4,
    early_stopping=True,
    early_stopping_patience=2,
    early_stopping_min_epochs=10,
    num_training_samples=num_training_samples
)

最后，由于我的文件很大，我使用“管道”训练输入。

# Create a train data channel with S3_data_type as 'AugmentedManifestFile' and attribute names.
train_data = sagemaker.inputs.TrainingInput(s3_data= s3_train_data_path,
                                        distribution='FullyReplicated',
                                        content_type='application/x-recordio',
                                        s3_data_type='AugmentedManifestFile',
                                        compression='Gzip',
                                        attribute_names=attribute_names,
                                        input_mode='Pipe',
                                        record_wrapping='RecordIO') 
data_channels = {'train': train_data }

最后，我尝试训练我的模型，就像 AWS 的示例一样。由于我使用的是增强清单，因此我不需要验证通道。

# Train a model.
model.fit(inputs=data_channels, logs=True, wait=True)

但是，开始训练时出现以下错误：

UnexpectedStatusException: Error for Training job semantic-segmentation-2021-05-28-23-53-46-966: Failed. Reason: ClientError: Unable to initialize the algorithm. Failed to validate input data configuration. (caused by ValidationError)

Caused by: 'validation' is a required property

Failed validating 'required' in schema['allOf'][2]:
    {'required': ['validation']}

On instance:
    {'train': {'ContentType': 'application/x-recordio',
               'RecordWrapperType': 'RecordIO',
               'S3DistributionType': 'FullyReplicated',
               'TrainingInputMode': 'Pipe'}}

【问题讨论】：

标签： tensorflow amazon-sagemaker semantic-segmentation

【解决方案1】：

您可以在增强清单上使用较低级别的 API 来运行您的训练作业：

# Create unique job name
import time
role = sagemaker.get_execution_role()
nn_job_name_prefix = "labeljob-86-chain-augmented-manifest"
timestamp = time.strftime("-%Y-%m-%d-%H-%M", time.gmtime())
nn_job_name = nn_job_name_prefix + timestamp

training_params = {
    "AlgorithmSpecification": {"TrainingImage": training_image, "TrainingInputMode": "Pipe"},
    "RoleArn": role,
    "OutputDataConfig": {"S3OutputPath": "s3://{}/{}/output/".format(bucket, s3_prefix)},
    "ResourceConfig": {"InstanceCount": 1, "InstanceType": "ml.p3.2xlarge", "VolumeSizeInGB": 50},
    "TrainingJobName": nn_job_name,
    "HyperParameters": {
        "backbone" : "resnet-50",  # Other option is resnet-101
        "algorithm" : "deeplab", #fcn, psp, deeplab
        "use_pretrained_model" : "True",  # Use the pre-trained model.
        "crop_size" : "240",  # Size of image random crop.
        "num_classes" : "2",  # ------------------ IMPORTANT! number of classes (starting at 0)
        "epochs" : "45",  # Number of epochs to run. (small for testing)
        "learning_rate" : "0.001", 
        "optimizer" : "adam",  # 'sgd', 'adam', 'rmsprop', 'nag', 'adagrad'.
        "lr_scheduler" : "poly",  # Other options include 'cosine' and 'step'.
        "mini_batch_size" : "16",  # small mini batch size for this data set size.
        "validation_mini_batch_size" : "4",
        "early_stopping" : "True",  # Turn on early stopping. If OFF, other early stopping parameters are ignored.
        "early_stopping_patience" : "2",  # Tolerate these many epochs if the mIoU doens't increase.
        "early_stopping_min_epochs" : "5",  # No matter what, run these many number of epochs.
        "num_training_samples": str(num_training_samples),  # --------------------------------------IMPORTANT!

    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 86400},
    "InputDataConfig": [
        {
            "ChannelName": "train",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": "s3://{}/{}/{}".format(bucket, s3_prefix, "train.manifest"),
                    "S3DataDistributionType": "FullyReplicated",
                    "AttributeNames": attribute_names,
                }
            },
            "ContentType": "application/x-recordio",
            "RecordWrapperType": "RecordIO",
            "CompressionType": "None",
        },
        {
            "ChannelName": "validation",
            "DataSource": {
                "S3DataSource": {
                    "S3DataType": "AugmentedManifestFile",
                    "S3Uri": "s3://{}/{}/{}".format(bucket, s3_prefix, "validation.manifest"),
                    "S3DataDistributionType": "FullyReplicated",
                    "AttributeNames": attribute_names,
                }
            },
            "ContentType": "application/x-recordio",
            "RecordWrapperType": "RecordIO",
            "CompressionType": "None",
        },
    ],
}

print("Training job name: {}".format(nn_job_name))
print(
    "\nInput Data Location: {}".format(
        training_params["InputDataConfig"][0]["DataSource"]["S3DataSource"]
    )
)

【讨论】：