AWS Glue - using Crawlers or not

【Question title】: AWS Glue - using Crawlers or not
【Posted】: 2018-11-22 09:48:32
【Question】:

There are two ways to run a job on Parquet data stored in an S3 bucket:

Create a crawler to build the schema table, then use glueContext.create_dynamic_frame.from_catalog(dbname, tablename) in the Glue job to create a dynamic frame.
Read directly from S3 with glueContext.create_dynamic_frame.from_options("s3", {"paths": [full_s3_path] }, format="parquet").
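
For reference, the two options above can be sketched as small helper functions. This is a minimal sketch, not a definitive implementation: the function names are hypothetical, and `glue_context` is assumed to be a `GlueContext` that the Glue job has already created.

```python
def read_via_catalog(glue_context, database, table_name):
    """Option 1: read through the Data Catalog table that a crawler created."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
    )


def read_direct_from_s3(glue_context, s3_path):
    """Option 2: read the Parquet files straight from S3, with no catalog table."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="parquet",
    )
```

Both helpers return a dynamic frame; the only difference is whether the schema and location come from the Data Catalog or from the S3 path itself.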

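Since my data schema does not change over time, is there any advantage (performance-wise or otherwise) to using a crawler? Why would I need a crawler in this case?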

【Comments】:

【Solution 1】:

If your data is not partitioned, or you do not want to use the predicate-pushdown feature, then you do not need to run a crawler.

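However, if the data is partitioned and you want to load only part of it using predicate pushdown, you need to register new partitions in the Data Catalog, and a crawler is one of the simplest ways to do that (although there are alternatives).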
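
As an illustration of the partitioned case, here is a minimal sketch under stated assumptions. It uses the `push_down_predicate` argument of `from_catalog` to load only selected partitions, and shows one possible alternative to a crawler: registering a partition through the Glue API (boto3 `get_table` and `create_partition`). The helper names, the partition columns `year` and `month`, and the S3 location are hypothetical; `glue_context` and `glue_client` are assumed to be a `GlueContext` and a `boto3.client("glue")` created elsewhere.

```python
import copy


def build_partition_predicate(partitions):
    """Build a predicate such as "year == '2018' and month == '11'".

    Values are inserted as-is, so this sketch assumes they contain no quotes.
    """
    return " and ".join(f"{col} == '{val}'" for col, val in partitions.items())


def read_partitions(glue_context, database, table_name, partitions):
    """Load only the matching partitions of a catalog table."""
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        push_down_predicate=build_partition_predicate(partitions),
    )


def register_partition(glue_client, database, table_name, values, location):
    """Register one new partition without running a crawler.

    Reuses the table's storage descriptor and only changes the location.
    `values` must follow the order of the table's partition keys.
    """
    table = glue_client.get_table(DatabaseName=database, Name=table_name)["Table"]
    descriptor = copy.deepcopy(table["StorageDescriptor"])
    descriptor["Location"] = location
    glue_client.create_partition(
        DatabaseName=database,
        TableName=table_name,
        PartitionInput={"Values": values, "StorageDescriptor": descriptor},
    )
```

For example, `read_partitions(glue_context, "mydb", "mytable", {"year": "2018", "month": "11"})` would only read files under partitions that are already registered in the Data Catalog and match the predicate, which is why new partitions have to be registered first.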

【Comments】: