Create a zip file on S3 from files on S3 in Java: answers

【Question title】: Create a zip file on S3 from files on S3 in Java
【Posted】: 2019-11-12 18:29:33
【Question description】:

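I have a lot of files on S3 that I need to zip and then serve the zip via S3. Currently I zip them from a stream into a local file and then upload that file again. This takes up a lot of disk space, because each file is around 3-10 MB and I have to zip up to 100,000 files, so a single zip can reach 1 TB or more. So I would like a solution along the lines of: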

Create a zip file on S3 from files on S3 using Lambda Node

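There it seems the zip is created directly on S3 without taking up local disk space. But I am just not clever enough to transfer that solution to Java. I am also finding conflicting information about the Java AWS SDK, saying that they planned to change the stream behavior in 2017.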

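Not sure if this will help, but here is what I have been doing so far (Upload is my local model that holds the S3 information). I have removed logging and other things for better readability. I think the download itself takes up no disk space, because I "pipe" the InputStream directly into the zip. But as I said, I would also like to avoid the local zip file and create it directly on S3. That would probably require creating the ZipOutputStream with S3 as the target instead of a FileOutputStream, and I am not sure how that can be done.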

public File zipUploadsToNewTemp(List<Upload> uploads) {
    byte[] buffer = new byte[1024];
    File tempZipFile;
    try {
      tempZipFile = File.createTempFile(UUID.randomUUID().toString(), ".zip");
    } catch (Exception e) {
      throw new ApiException(e, BaseErrorCode.FILE_ERROR, "Could not create Zip file");
    }
    // try-with-resources closes both streams, so no explicit close() is needed
    try (
        FileOutputStream fileOutputStream = new FileOutputStream(tempZipFile);
        ZipOutputStream zipOutputStream = new ZipOutputStream(fileOutputStream)) {

      for (Upload upload : uploads) {
        // close each S3 stream even if writing to the zip fails
        try (InputStream inputStream = getStreamFromS3(upload)) {
          ZipEntry zipEntry = new ZipEntry(upload.getFileName());
          zipOutputStream.putNextEntry(zipEntry);
          writeStreamToZip(buffer, zipOutputStream, inputStream);
          zipOutputStream.closeEntry();
        }
      }
      return tempZipFile;
    } catch (IOException e) {
      logError(type, e);
      if (tempZipFile.exists()) {
        FileUtils.delete(tempZipFile);
      }
      throw new ApiException(e, BaseErrorCode.IO_ERROR,
          "Error zipping files: " + e.getMessage());
    }
}

  // I am not even sure, but I think this takes up memory and not disk space
private InputStream getStreamFromS3(Upload upload) {
    try {
      String filename = upload.getId() + "." + upload.getFileType();
      InputStream inputStream = s3FileService
          .getObject(upload.getBucketName(), filename, upload.getPath());
      return inputStream;
    } catch (ApiException e) {
      throw e;
    } catch (Exception e) {
      logError(type, e);
      throw new ApiException(e, BaseErrorCode.UNKOWN_ERROR,
          "Unknown error communicating with S3 for file: " + upload.getFileName());
    }
}


private void writeStreamToZip(byte[] buffer, ZipOutputStream zipOutputStream,
      InputStream inputStream) {
    try {
      int len;
      while ((len = inputStream.read(buffer)) > 0) {
        zipOutputStream.write(buffer, 0, len);
      }
    } catch (IOException e) {
      throw new ApiException(e, BaseErrorCode.IO_ERROR, "Could not write stream to zip");
    }
}

Finally, the upload source. The InputStream is created from the temp zip file.

public PutObjectResult upload(InputStream inputStream, String bucketName, String filename, String folder) {
    String uploadKey = StringUtils.isEmpty(folder) ? "" : (folder + "/");
    uploadKey += filename;

    ObjectMetadata metaData = new ObjectMetadata();

    byte[] bytes;
    try {
      bytes = IOUtils.toByteArray(inputStream);
    } catch (IOException e) {
      throw new ApiException(e, BaseErrorCode.IO_ERROR, e.getMessage());
    }
    metaData.setContentLength(bytes.length);
    ByteArrayInputStream byteArrayInputStream = new ByteArrayInputStream(bytes);

    PutObjectRequest putObjectRequest = new PutObjectRequest(bucketPrefix + bucketName, uploadKey, byteArrayInputStream, metaData);
    putObjectRequest.setCannedAcl(CannedAccessControlList.PublicRead);

    try {
      return getS3Client().putObject(putObjectRequest);
    } catch (SdkClientException se) {
      throw s3Exception(se);
    } finally {
      IOUtils.closeQuietly(inputStream);
    }
  }

I just found a question similar to what I need, also without answers:

Upload ZipOutputStream to S3 without saving zip file (large) temporary to disk using AWS S3 Java

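【Comments】:

Why would it take up disk space? Why save the downloaded bytes to disk in the first place? If you don't do that, it won't take up disk space. How about posting what you have tried, so we can explain how to do it better?
I didn't want to overcomplicate the question. It is quite a lot of source code, and I probably can't improve it the way I want. I felt it would be better to start from scratch.
Then start from scratch. And don't write the downloaded objects to a file, so it won't take up any disk space.
I would recommend using an Amazon EC2 instance (as low as 1c/hour, or you could even use a Spot Instance to get it at a lower price). Write a script to loop through the files, then download, zip, upload. If the EC2 instance is in the same region as Amazon S3, there is no data transfer charge.
I have added my current source code for zipping and uploading.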

Tags: java amazon-web-services amazon-s3 java-stream aws-sdk

【Solution 1】:

You can get an input stream from the S3 data, then zip that batch of bytes and stream it back to S3:

        // Length of the finished zip in bytes. It must be assigned before use, and it is
        // generally not known until the whole stream has been processed.
        long numBytes;
        PipedOutputStream os = new PipedOutputStream();
        PipedInputStream is = new PipedInputStream(os);
        ObjectMetadata meta = new ObjectMetadata();
        meta.setContentLength(numBytes);

        new Thread(() -> {
            // Producer thread: closing the ZipOutputStream also closes os,
            // which signals end of stream to the reader
            try (ZipOutputStream zipOutputStream = new ZipOutputStream(os);
                 S3ObjectInputStream objectContent = amazonS3Client.getObject("myBucket", "myKey").getObjectContent()) {
                ZipEntry zipEntry = new ZipEntry("myKey");
                zipOutputStream.putNextEntry(zipEntry);

                byte[] bytes = new byte[1024];
                int length;
                while ((length = objectContent.read(bytes)) >= 0) {
                    zipOutputStream.write(bytes, 0, length);
                }
                zipOutputStream.closeEntry();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }).start();
        // "myKey.zip" is an illustrative destination key; reusing "myKey" would overwrite the source object
        amazonS3Client.putObject("myBucket", "myKey.zip", is, meta);
        is.close();  // always close your streams
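
The piped-stream pattern in this answer can be tried without S3. The following is a minimal sketch under stated assumptions: `PipedZipSketch` and `zipToPipe` are illustrative names that do not come from the answer, and plain `InputStream`s stand in for the S3 object streams. It only shows how a producer thread zips into a pipe while the caller reads the compressed bytes from the other end.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class PipedZipSketch {

    /**
     * Starts a background thread that zips the given sources (entry name -> content)
     * and returns the read end of the pipe carrying the zip bytes.
     */
    static InputStream zipToPipe(Map<String, InputStream> sources) throws IOException {
        PipedInputStream in = new PipedInputStream(64 * 1024); // pipe buffer size is an arbitrary choice
        PipedOutputStream out = new PipedOutputStream(in);

        Thread producer = new Thread(() -> {
            // Closing the ZipOutputStream writes the central directory and closes the pipe,
            // which is how the reader sees end of stream.
            try (ZipOutputStream zip = new ZipOutputStream(out)) {
                byte[] buffer = new byte[8192];
                for (Map.Entry<String, InputStream> source : sources.entrySet()) {
                    zip.putNextEntry(new ZipEntry(source.getKey()));
                    try (InputStream content = source.getValue()) {
                        int n;
                        while ((n = content.read(buffer)) != -1) {
                            zip.write(buffer, 0, n);
                        }
                    }
                    zip.closeEntry();
                }
            } catch (IOException e) {
                // Sketch only: the reader just sees a truncated stream in this case.
                e.printStackTrace();
            }
        });
        producer.start();
        return in;
    }
}
```

The returned stream is what would be handed to the uploader. A failure on the producer thread ends the stream early here, so a real implementation would need to report that failure to the consumer instead of uploading a truncated archive.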

【Comments】:

How do you know the size of numBytes before processing?

【Solution 2】:
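
One common way around the unknown length is to avoid a single `putObject` and send the zip as a multipart upload, where each part is uploaded with its own length as soon as it is full. The sketch below is not from the answer and makes assumptions: `PartSink` and `PartBufferingOutputStream` are hypothetical names, and the sink stands in for the real upload call (in the AWS SDK for Java 1.x that would sit between `initiateMultipartUpload` and `completeMultipartUpload`, calling `uploadPart` per part). It only shows the buffering logic that turns a stream of unknown size into numbered parts.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

/** Hypothetical callback that receives one finished part; not part of any AWS SDK. */
interface PartSink {
    void uploadPart(int partNumber, byte[] data, boolean isLast) throws IOException;
}

/** Buffers written bytes and emits them as parts, so the total length is never needed up front. */
final class PartBufferingOutputStream extends OutputStream {
    private final int partSize;
    private final PartSink sink;
    private final byte[] buffer;
    private int filled = 0;
    private int nextPartNumber = 1; // S3 part numbers start at 1
    private boolean closed = false;

    PartBufferingOutputStream(int partSize, PartSink sink) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        this.partSize = partSize;
        this.sink = sink;
        this.buffer = new byte[partSize];
    }

    @Override
    public void write(int b) throws IOException {
        if (filled == partSize) {
            flushPart(false);
        }
        buffer[filled++] = (byte) b;
    }

    @Override
    public void write(byte[] src, int off, int len) throws IOException {
        while (len > 0) {
            // A full buffer is only flushed once more data arrives,
            // so the final part can always be flagged as last in close().
            if (filled == partSize) {
                flushPart(false);
            }
            int n = Math.min(len, partSize - filled);
            System.arraycopy(src, off, buffer, filled, n);
            filled += n;
            off += n;
            len -= n;
        }
    }

    @Override
    public void close() throws IOException {
        if (closed) {
            return;
        }
        closed = true;
        flushPart(true); // whatever is left becomes the last part
    }

    private void flushPart(boolean isLast) throws IOException {
        sink.uploadPart(nextPartNumber++, Arrays.copyOf(buffer, filled), isLast);
        filled = 0;
    }
}
```

A `ZipOutputStream` would wrap this stream in place of the `FileOutputStream` or pipe. S3 requires every part except the last to be at least 5 MiB and allows at most 10,000 parts per upload, so an archive of about 1 TB needs parts of roughly 100 MB or more.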

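I would recommend using an Amazon EC2 instance (as low as 1c/hour, or you could even use a Spot Instance to get it at a lower price). Smaller instance types cost less but have limited bandwidth, so adjust the size to get the performance you want.

Then write a script to loop through the files and:

Download
Zip
Upload
Delete the local files

All the zip magic happens on local disk. No need to use streams. Just use the Amazon S3 download_file() and upload_file() calls.

If the EC2 instance is in the same region as Amazon S3, there is no data transfer charge.

【Comments】: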
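
Since the question is about Java, the loop described above could be sketched as follows. This is an illustration under assumptions, not the answerer's script: `ObjectStore` is a hypothetical interface standing in for the S3 download and upload calls, and `LocalDiskZipper` is an illustrative name. The sketch deletes each downloaded file as soon as it has been added to the zip, so the disk only needs room for the archive plus one source file.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Hypothetical stand-in for the S3 download/upload calls; not a real SDK type. */
interface ObjectStore {
    void download(String key, Path target) throws IOException;

    void upload(Path source, String key) throws IOException;
}

final class LocalDiskZipper {

    /** Downloads each key into workDir, adds it to one zip, uploads the zip, and cleans up. */
    static void zipViaLocalDisk(ObjectStore store, List<String> keys, String zipKey, Path workDir)
            throws IOException {
        Path zipFile = Files.createTempFile(workDir, "bundle", ".zip");
        try {
            try (OutputStream fileOut = Files.newOutputStream(zipFile);
                 ZipOutputStream zip = new ZipOutputStream(fileOut)) {
                for (String key : keys) {
                    Path local = Files.createTempFile(workDir, "source", ".tmp");
                    try {
                        store.download(key, local);          // 1. download
                        zip.putNextEntry(new ZipEntry(key)); // 2. zip
                        Files.copy(local, zip);
                        zip.closeEntry();
                    } finally {
                        Files.deleteIfExists(local);         // 4. delete the local copy right away
                    }
                }
            }
            store.upload(zipFile, zipKey);                   // 3. upload the finished archive
        } finally {
            Files.deleteIfExists(zipFile);
        }
    }
}
```

The work directory would sit on a volume large enough for the finished archive, which is the trade-off this answer accepts in exchange for not dealing with streams.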