[Posted]: 2013-03-02 21:39:53
[Problem Description]:
As part of a large-scale migration to take advantage of Azure VM support, I need to move roughly 4.2 million images from the North Central US region to West US (for those unaware, North Central US doesn't support VMs). The images all live in a single container, split across roughly 119,000 directories.
I'm using the following, built on the Copy Blob API:
public static void CopyBlobDirectory(
    CloudBlobDirectory srcDirectory,
    CloudBlobContainer destContainer)
{
    // get the SAS token to use for all blobs
    string blobToken = srcDirectory.Container.GetSharedAccessSignature(
        new SharedAccessBlobPolicy
        {
            Permissions = SharedAccessBlobPermissions.Read |
                          SharedAccessBlobPermissions.Write,
            SharedAccessExpiryTime = DateTime.UtcNow + TimeSpan.FromDays(14)
        });

    var srcBlobList = srcDirectory.ListBlobs(
        useFlatBlobListing: true,
        blobListingDetails: BlobListingDetails.None).ToList();

    foreach (var src in srcBlobList)
    {
        var srcBlob = src as ICloudBlob;

        // Create appropriate destination blob type to match the source blob
        ICloudBlob destBlob;
        if (srcBlob.Properties.BlobType == BlobType.BlockBlob)
            destBlob = destContainer.GetBlockBlobReference(srcBlob.Name);
        else
            destBlob = destContainer.GetPageBlobReference(srcBlob.Name);

        // copy using the source blob URI + SAS token
        destBlob.BeginStartCopyFromBlob(new Uri(srcBlob.Uri.AbsoluteUri + blobToken), null, null);
    }
}
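Since each copy request here is a small HTTP call, client-side System.Net defaults are a common bottleneck for workloads like this. A minimal sketch of settings worth applying once at startup, before any storage requests are issued (the connection limit of 100 is illustrative, not tuned for this workload):

```csharp
using System.Net;

public static class StorageTuning
{
    public static void Apply()
    {
        // The default is 2 concurrent connections per host,
        // which serializes most of the copy requests.
        ServicePointManager.DefaultConnectionLimit = 100;

        // Skip the extra 100-Continue round trip on every request.
        ServicePointManager.Expect100Continue = false;

        // Nagle batching adds latency to many small requests.
        ServicePointManager.UseNagleAlgorithm = false;
    }
}
```

These only affect the client issuing the copy commands; the server-side copy itself is unaffected.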
The problem is that it's slow. Wow, is it slow. At the rate the copy commands are being issued, it will take roughly four days to copy everything. I'm not sure what the bottleneck is (client-side connection limits, rate limiting on the Azure side, multithreading, etc.).
So I'm wondering what my options are. Is there any way to speed this up, or am I just stuck with a job that will take four days to complete?
Edit: how I'm distributing the work to copy everything
//set up tracing
InitTracer();

//grab a set of photos to benchmark this
var photos = PhotoHelper.GetAllPhotos().Take(500).ToList();

//account to copy from
var from = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
    "oldAccount",
    "oldAccountKey");
var fromAcct = new CloudStorageAccount(from, true);
var fromClient = fromAcct.CreateCloudBlobClient();
var fromContainer = fromClient.GetContainerReference("userphotos");

//account to copy to
var to = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(
    "newAccount",
    "newAccountKey");
var toAcct = new CloudStorageAccount(to, true);
var toClient = toAcct.CreateCloudBlobClient();

Trace.WriteLine("Starting Copy: " + DateTime.UtcNow.ToString());

//enumerate sub directories, then move them to blob storage
//note: it doesn't care how high I set the Parallelism to,
//console output indicates it won't run more than five or so at a time
var plo = new ParallelOptions { MaxDegreeOfParallelism = 10 };
Parallel.ForEach(photos, plo, (info) =>
{
    CloudBlobDirectory fromDir = fromContainer.GetDirectoryReference(info.BuildingId.ToString());

    var toContainer = toClient.GetContainerReference(info.Id.ToString());
    toContainer.CreateIfNotExists();

    Trace.WriteLine(info.BuildingId + ": Starting copy, " + info.Photos.Length + " photos...");
    BlobHelper.CopyBlobDirectory(fromDir, toContainer, info);

    //this monitors the container, so I can restart any failed
    //copies if something goes wrong
    BlobHelper.MonitorCopy(toContainer);
});

Trace.WriteLine("Done: " + DateTime.UtcNow.ToString());
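One possible explanation for the "five or so at a time" ceiling: Parallel.ForEach runs its body on ThreadPool threads, and once the pool's minimum is exhausted it injects new threads only gradually (on the order of one every 500 ms) while existing ones block, as they do here in CreateIfNotExists() and MonitorCopy(). A sketch of raising the pool's floor before the loop, assuming the ThreadPool really is the limiter (50 is an illustrative value, not a tuned one):

```csharp
using System;
using System.Threading;

// Raise the ThreadPool minimum so Parallel.ForEach can actually reach
// the requested MaxDegreeOfParallelism despite blocking calls.
int workers, iocp;
ThreadPool.GetMinThreads(out workers, out iocp);
ThreadPool.SetMinThreads(Math.Max(workers, 50), Math.Max(iocp, 50));
```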
[Comments]:
-
Are you using a lot of threads to do this? Most of the time is just spent copying. I think you could parallelize it heavily, perhaps with a bunch of worker roles on Azure.
-
I had the same thought; originally I ran it synchronously. Some testing showed that would take nearly two weeks, so I rewrote it to use BeginStartCopyFromBlob() and wrapped the calls to CopyBlobDirectory() in a Parallel.ForEach. However, the Parallel framework won't run more than about five jobs at a time for me (even when I set a higher degree of parallelism); I'm not sure how to force it to run more.
-
Could you spawn a large number of threads as in stackoverflow.com/questions/5041153/…, say 1,000 per worker instance, and then spin up a few dozen worker roles?
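The suggestion above can be sketched with dedicated threads, which sidestep the ThreadPool's slow thread-injection entirely. This is not the question's code: the partition count of 32 is illustrative, and CopyOnePhotoDirectory is a hypothetical helper wrapping the per-item body from the edited question.

```csharp
using System.Linq;
using System.Threading;

// Split the work list into fixed partitions, one dedicated thread each.
var partitions = photos
    .Select((p, i) => new { p, i })
    .GroupBy(x => x.i % 32)                    // 32 threads, illustrative
    .Select(g => g.Select(x => x.p).ToList())
    .ToList();

var threads = partitions.Select(batch => new Thread(() =>
{
    foreach (var info in batch)
        CopyOnePhotoDirectory(info);           // hypothetical helper: the
                                               // per-item copy + monitor body
})).ToList();

threads.ForEach(t => t.Start());
threads.ForEach(t => t.Join());
```

Scaling this out across many worker roles, as the comment proposes, would multiply the number of copy commands in flight at once.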
-
I've edited the post to show how I'm managing all the copy work.
-
Would you say it takes roughly 500 ms to start each copy?
Tags: c# azure parallel-processing azure-blob-storage parallel.foreach