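【Posted】: 2021-03-09 10:33:43
【Question】:
I need to harvest table and column names from the AWS Glue crawler metadata catalog. I am using boto3, but I keep getting exactly 100 tables even though the database contains more. Setting NextToken does not help. Any help would be appreciated.
The desired result looks like this: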
lst = [table_one.col_one, table_one.col_two, table_two.col_one....table_n.col_n]
def harvest_aws_crawler():
    glue = boto3.client('glue', region_name='')
    response = glue.get_tables(DatabaseName='', NextToken = '')

    #response syntax:
    #https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/glue.html#Glue.Client.get_tables
    crawler_list_tables = []

    for tables in response['TableList']:
        while (response.get('NextToken') is not None):
            crawler_list_tables.append(tables['Name'])
            break
    print(len(crawler_list_tables))

harvest_aws_crawler()
Updated code; I still need the result as tablename + columnname:
def harvest_aws_crawler():
    glue = boto3.client('glue', region_name='')
    next_token = ""

    #response syntax:
    #https://boto3.amazonaws.com/v1/documentation/api/1.9.42/reference/services/glue.html#Glue.Client.get_tables
    response = glue.get_tables(DatabaseName='', NextToken = next_token)

    tables_from_crawler = []
    while True:
        table_list = response['TableList']
        for table_dict in table_list:
            table_name = table_dict['Name']

            #append table_name+column_name
            for columns in table_name['StorageDescriptor']['Columns']:
                tables_from_crawler.append(table_name + '.' + columns['Name'])

            #tables_from_crawler.append(table_name)
        next_token = response.get('NextToken')
        if next_token is None:
            break
    print(tables_from_crawler)

harvest_aws_crawler()
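For reference, the loop above calls get_tables only once, so the NextToken it reads is never sent back to the service. It also indexes table_name (a string) instead of table_dict. The following is a minimal sketch of a manual pagination loop, not a definitive fix. The function name harvest_with_next_token and the optional glue parameter are hypothetical, introduced here so that a stub client can be passed in for testing. It assumes credentials and a default region are already configured, and that tables lacking a StorageDescriptor should simply be skipped.

```python
def harvest_with_next_token(database_name, glue=None):
    """Collect 'table.column' names by following NextToken by hand.

    `glue` is an optional, hypothetical hook for injecting a client;
    when omitted, a real boto3 Glue client is created.
    """
    if glue is None:
        import boto3  # imported lazily so a stub client needs no AWS setup
        glue = boto3.client("glue")

    names = []
    # Omit NextToken entirely on the first call.
    kwargs = {"DatabaseName": database_name}
    while True:
        # Issue a new request on every iteration.
        response = glue.get_tables(**kwargs)
        for table in response["TableList"]:
            # Assumption: a table without column metadata contributes nothing.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            for column in columns:
                names.append(table["Name"] + "." + column["Name"])

        next_token = response.get("NextToken")
        if not next_token:
            break  # no further pages
        # Send the returned token back with the next request.
        kwargs["NextToken"] = next_token
    return names
```

The key difference from the code above is that get_tables sits inside the while loop, so each returned token actually drives a new request.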
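【Discussion】:
- You are not passing the NextToken to the next request, so you always get the first 100. Alternatively, use the paginator for get_tables: boto3.amazonaws.com/v1/documentation/api/latest/reference/…
- @luk2302 - Thanks, but I still got 100 tables when using get_paginator.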
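The paginator suggested in the first comment can be sketched as follows. This is a hedged illustration, not the commenter's own code. The function name harvest_table_columns and the optional glue parameter are hypothetical, and credentials and a default region are assumed to be configured. One possible reason for still seeing 100 tables with get_paginator is reading only a single page of results, so the sketch iterates over every page returned by paginate.

```python
def harvest_table_columns(database_name, glue=None):
    """Return 'table.column' strings for every table in a Glue database.

    `glue` is an optional, hypothetical hook for injecting a client;
    when omitted, a real boto3 Glue client is created.
    """
    if glue is None:
        import boto3  # imported lazily so a stub client needs no AWS setup
        glue = boto3.client("glue")

    # The paginator handles NextToken internally and yields one page per request.
    paginator = glue.get_paginator("get_tables")

    names = []
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            # Assumption: a table without column metadata contributes nothing.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            for column in columns:
                names.append(f"{table['Name']}.{column['Name']}")
    return names
```

Called as harvest_table_columns('my_database'), where my_database is a placeholder name, this would return a list in the shape the question asks for.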
Tags: python amazon-web-services pyspark boto3 aws-glue