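【Posted】: 2021-07-09 19:55:31
【Question】:
I am new to AWS and am trying to use AWS Textract to parse the tables in a multi-page file into a CSV file.
I tried the AWS example on this page, but with a multi-page file the call response = client.analyze_document(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES']) fails, because multi-page documents require asynchronous processing, as described in the documentation here. The correct function to call is client.start_document_analysis; once the job has run, the results are retrieved with client.get_document_analysis(JobId).
So I adapted their example to use this logic instead of the client.analyze_document function. The adapted code looks like this: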
client = boto3.client('textract')
response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
jobid=response['JobId']
jobstatus="IN_PROGRESS"
while jobstatus=="IN_PROGRESS":
    response=client.get_document_analysis(JobId=jobid)
    jobstatus=response['JobStatus']
    if jobstatus == "IN_PROGRESS": print("IN_PROGRESS")
    time.sleep(5)
But when I run it, I get the following error:
Traceback (most recent call last):
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 125, in <module>
    main(file_name)
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 112, in main
    table_csv = get_table_csv_results(file_name)
  File "/Users/santanna_santanna/PycharmProjects/KlooksExplore/PDFWork/textract_python_table_parser.py", line 62, in get_table_csv_results
    response = client.start_document_analysis(Document={'Bytes': bytes_test}, FeatureTypes=['TABLES'])
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 316, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 608, in _make_api_call
    api_params, operation_model, context=request_context)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/client.py", line 656, in _convert_to_request_dict
    api_params, operation_model)
  File "/Users/santanna_santanna/anaconda3/lib/python3.6/site-packages/botocore/validate.py", line 297, in serialize_to_request
    raise ParamValidationError(report=report.generate_report())
botocore.exceptions.ParamValidationError: Parameter validation failed:
Missing required parameter in input: "DocumentLocation"
Unknown parameter in input: "Document", must be one of: DocumentLocation, FeatureTypes, ClientRequestToken, JobTag, NotificationChannel
This happens because the standard way to call start_document_analysis is with a file stored in S3, using this syntax:
response = client.start_document_analysis(
    DocumentLocation={
        'S3Object': {
            'Bucket': s3BucketName,
            'Name': documentName
        }
    },
    FeatureTypes=["TABLES"])
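However, if I do that, I will break the command-line logic proposed in the AWS example:
python textract_python_table_parser.py file.pdf
The question is: how can I adapt the AWS example to handle multi-page files?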
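One possible direction, given here only as a minimal sketch: keep the command-line interface by uploading the local file to S3 first, then starting the asynchronous job on the uploaded object. The helper names (start_analysis_from_local_file, wait_for_job, collect_blocks, analyze_local_file) and the bucket argument are hypothetical and not part of the AWS example. The sketch assumes the bucket already exists and that the credentials in use allow both the S3 upload and the Textract calls. It also follows NextToken, since get_document_analysis can return the blocks of a large document across several responses.

```python
import os
import time


def start_analysis_from_local_file(s3_client, textract_client, file_name, bucket):
    """Upload a local file to S3, then start an asynchronous TABLES analysis on it."""
    key = os.path.basename(file_name)  # hypothetical choice: reuse the file name as the S3 key
    s3_client.upload_file(file_name, bucket, key)
    response = textract_client.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=["TABLES"],
    )
    return response["JobId"]


def wait_for_job(textract_client, job_id, delay=5.0, max_attempts=120):
    """Poll until the job leaves IN_PROGRESS and return its final status."""
    for _ in range(max_attempts):
        response = textract_client.get_document_analysis(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        time.sleep(delay)
    raise TimeoutError("Textract job %s did not finish in time" % job_id)


def collect_blocks(textract_client, job_id):
    """Gather the Blocks from every result page by following NextToken."""
    blocks = []
    kwargs = {"JobId": job_id}
    while True:
        response = textract_client.get_document_analysis(**kwargs)
        blocks.extend(response.get("Blocks", []))
        token = response.get("NextToken")
        if not token:
            return blocks
        kwargs["NextToken"] = token


def analyze_local_file(file_name, bucket):
    """Run the whole flow with real AWS clients (assumes configured credentials)."""
    import boto3  # imported here so the helpers above can be tried with stand-in clients

    s3_client = boto3.client("s3")
    textract_client = boto3.client("textract")
    job_id = start_analysis_from_local_file(s3_client, textract_client, file_name, bucket)
    status = wait_for_job(textract_client, job_id)
    if status != "SUCCEEDED":
        raise RuntimeError("Textract job %s ended with status %s" % (job_id, status))
    return collect_blocks(textract_client, job_id)
```

Under these assumptions, get_table_csv_results(file_name) could call analyze_local_file(file_name, bucket) and pass the returned blocks to the table-parsing code where it previously read response['Blocks']. The bucket name would have to come from somewhere, for example a second command-line argument or a configuration value.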
【Discussion】:
Tags: amazon-web-services amazon-s3 amazon-textract