Python pandas exports a MongoDB collection to CSV with the columns out of order

【Question title】: Python pandas export mongodb collection to CSV columns out of order
【Posted】: 2020-07-30 07:57:08
【Question description】:

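I have a Python script that builds a list of EC2 instances across all of our AWS accounts (about 150) and stores the results in MongoDB.

I'm using the Python pandas module to export the MongoDB collection to a CSV file. It works, except that the headers are out of order, and I don't want to print the MongoDB index.

In the original version of the script (before I added the database), I used the csv module to write the file, and the headers were in the correct order.

I added the database both as a learning exercise and because it makes it easier to work with all of the Amazon accounts we have.

If I look at the JSON of the collection I'm printing in the mongo database, all the fields are in the correct order: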

{'_id': ObjectId('5f14f9ffa40de31278dade03'), 'AWS Account': 'jf-master-pd', 'Account Number': '123456789101', 'Name': 'usawsweb001', 'Instance ID': 'i-01e5e920b4d3d5dcb', 'AMI ID': 'ami-006219aba10688d0b', 'Volumes': 'vol-0ce8db4e071bc7229, vol-099f6d212a91121d0, vol-0bb36e343e9c01374, vol-05610645edfd02253, vol-05adc01d70d75d649', 'Private IP': '172.31.62.168', 'Public IP': 'xx.xx.xx.xx', 'Private DNS': 'ip-172-31-62-168.ec2.internal', 'Availability Zone': 'us-east-1e', 'VPC ID': 'vpc-68b1ff12', 'Type': 't2.micro', 'Key Pair Name': 'jf-timd', 'State': 'running', 'Launch Date': 'July 20 2020'}
{'_id': ObjectId('5f14f9ffa40de31278dade05'), 'AWS Account': 'jf-master-pd', 'Account Number': '123456789101', 'Name': 'usawsweb002', 'Instance ID': 'i-0b7db2bcab853ef96', 'AMI ID': 'ami-006219aba10688d0b', 'Volumes': 'vol-095a9dcf54ca97c0e, vol-0c8e96b71fbb7dfcf, vol-070c16c457f91c54e, vol-0dc1eaf2e826fa3a6, vol-0f0f157a8489ab939', 'Private IP': '172.31.63.131', 'Public IP': 'xx.xx.xx.xx', 'Private DNS': 'ip-172-31-63-131.ec2.internal', 'Availability Zone': 'us-east-1e', 'VPC ID': 'vpc-68b1ff12', 'Type': 't2.micro', 'Key Pair Name': 'jf-timd', 'State': 'running', 'Launch Date': 'July 20 2020'}
{'_id': ObjectId('5f14f9ffa40de31278dade07'), 'AWS Account': 'jf-master-pd', 'Account Number': '123456789101', 'Name': 'usawsweb003', 'Instance ID': 'i-0611acf4b6cc53b61', 'AMI ID': 'ami-006219aba10688d0b', 'Volumes': 'vol-0aa28f89f6ce50577, vol-0e37ff844e8b9c47a, vol-0d54c713ae231739c, vol-0e29df46edc814619, vol-07e0c40a8913b1d31', 'Private IP': '172.31.52.44', 'Public IP': 'xx.xx.xx.xx', 'Private DNS': 'ip-172-31-52-44.ec2.internal', 'Availability Zone': 'us-east-1e', 'VPC ID': 'vpc-68b1ff12', 'Type': 't2.micro', 'Key Pair Name': 'jf-timd', 'State': 'running', 'Launch Date': 'July 20 2020'}

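But when I export from the mongo database using Python pandas, the headers are out of order. The data lines up with the correct headers, but the columns are completely out of order.

In my code, I create a dictionary containing the server information, then pass the dictionary to the function that prints the Mongo collection: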

def list_instances(aws_account, aws_account_number, interactive, regions, show_details, instance_col):
    for region in regions:
        if 'gov' in aws_account and 'admin' not in aws_account:
            try:
                session = boto3.Session(profile_name=aws_account, region_name=region)
            except botocore.exceptions.ProfileNotFound as e:
                profile_missing_message = f"An exception has occurred: {e}"
                account_found = 'no'
                raise
        else:
            try:
                session = boto3.Session(profile_name=aws_account, region_name=region)
                account_found = 'yes'
            except botocore.exceptions.ProfileNotFound as e:
                profile_missing_message = f"An exception has occurred: {e}"
                raise
        try:
            ec2 = session.client("ec2")
        except Exception as e:
            print(f"An exception has occurred: {e}")
        message = f"  Region: {region} in {aws_account}: ({aws_account_number})  "
        banner(message)

        print(Fore.RESET)
        # Loop through the instances
        try:
            instance_list = ec2.describe_instances()
        except Exception as e:
            print(f"An exception has occurred: {e}")
        for reservation in instance_list["Reservations"]:
            for instance in reservation.get("Instances", []):
                instance_count = instance_count + 1
                launch_time = instance["LaunchTime"]
                launch_time_friendly = launch_time.strftime("%B %d %Y")
                tree = objectpath.Tree(instance)
                block_devices = set(tree.execute('$..BlockDeviceMappings[\'Ebs\'][\'VolumeId\']'))
                if block_devices:
                    block_devices = list(block_devices)
                    block_devices = str(block_devices).replace('[','').replace(']','').replace('\'','')
                else:
                    block_devices = None
                private_ips =  set(tree.execute('$..PrivateIpAddress'))
                if private_ips:
                    private_ips_list = list(private_ips)
                    private_ips_list = str(private_ips_list).replace('[','').replace(']','').replace('\'','')
                else:
                    private_ips_list = None
                public_ips =  set(tree.execute('$..PublicIp'))
                if len(public_ips) == 0:
                    public_ips = None
                if public_ips:
                    public_ips_list = list(public_ips)
                    public_ips_list = str(public_ips_list).replace('[','').replace(']','').replace('\'','')
                else:
                    public_ips_list = None
                name = None
                if 'Tags' in instance:
                    try:
                        tags = instance['Tags']
                        name = None
                        for tag in tags:
                            if tag["Key"] == "Name":
                                name = tag["Value"]
                            if tag["Key"] == "Engagement" or tag["Key"] == "Engagement Code":
                                engagement = tag["Value"]
                    except ValueError:
                        # print("Instance: %s has no tags" % instance_id)
                        raise
                key_name = instance['KeyName'] if instance['KeyName'] else None
                vpc_id = instance.get('VpcId') if instance.get('VpcId') else None
                private_dns = instance['PrivateDnsName'] if instance['PrivateDnsName'] else None
                ec2info[instance['InstanceId']] = {
                    'AWS Account': aws_account,
                    'Account Number': aws_account_number,
                    'Name': name,
                    'Instance ID': instance['InstanceId'],
                    'AMI ID': instance['ImageId'],
                    'Volumes': block_devices,
                    'Private IP': private_ips_list,
                    'Public IP': public_ips_list,
                    'Private DNS': private_dns,
                    'Availability Zone': instance['Placement']['AvailabilityZone'],
                    'VPC ID': vpc_id,
                    'Type': instance['InstanceType'],
                    'Key Pair Name': key_name,
                    'State': instance['State']['Name'],
                    'Launch Date': launch_time_friendly
                }
                mongo_instance_dict = {'_id': '', 'AWS Account': aws_account, "Account Number": aws_account_number, 'Name': name, 'Instance ID': instance["InstanceId"], 'AMI ID': instance['ImageId'], 'Volumes': block_devices,  'Private IP': private_ips_list, 'Public IP': public_ips_list, 'Private DNS': private_dns, 'Availability Zone': instance['Placement']['AvailabilityZone'], 'VPC ID': vpc_id, 'Type': instance["InstanceType"], 'Key Pair Name': key_name, 'State': instance["State"]["Name"], 'Launch Date': launch_time_friendly}
                insert_doc(mongo_instance_dict)
        mongo_export_to_file(interactive, aws_account)

This is the function that inserts the dictionary into MongoDB:

def insert_doc(mydict):
    mydb, mydb_name, instance_col = set_db()
    mydict['_id'] = ObjectId()
    instance_doc = instance_col.insert_one(mydict)
    return instance_doc

This is the function that writes the MongoDB collection to a file:

def mongo_export_to_file():
    aws_account = 'jf-master-pd'
    today = datetime.today()
    today = today.strftime("%m-%d-%Y")
    mydb, mydb_name, instance_col = set_db()
    # make an API call to the MongoDB server
    cursor = instance_col.find()
    # extract the list of documents from cursor obj
    mongo_docs = list(cursor)

    # create an empty DataFrame for storing documents
    docs = pandas.DataFrame(columns=[])

    # iterate over the list of MongoDB dict documents
    for num, doc in enumerate(mongo_docs):
        # convert ObjectId() to str
        doc["_id"] = str(doc["_id"])
        # get document _id from dict
        doc_id = doc["_id"]
        # create a Series obj from the MongoDB dict
        series_obj = pandas.Series( doc, name=doc_id )
         # append the MongoDB Series obj to the DataFrame obj
        docs = docs.append(series_obj)
        # get document _id from dict
        doc_id = doc["_id"]
        # Set the output file
        output_dir = os.path.join('..', '..', 'output_files', 'aws_instance_list', 'csv', '')
        output_file = os.path.join(output_dir, 'aws-instance-master-list-' + today +'.csv')

        # export MongoDB documents to a CSV file
        docs.to_csv(output_file, ",") # CSV delimited by commas

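Here is a link to the original code directory on GitHub. The files we want are aws_ec2_list_instances.py and ec2_mongo.py.

Why are the columns and headers out of order in the MongoDB version? And how do I get rid of the extra column that mongo adds for the ID when printing from pandas to a file?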

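【Question discussion】:

Try using OrderedDict from the collections module instead of a dict.
Do you have a test bed somewhere that we can use? I tried running your code; pandas was hard to install, and after that I couldn't be sure my mongodb collection was set up the same way as yours. Your create_mongodb is not defined in the repo you posted.
That's odd. I've re-added the create_mongodb definition; I don't know why it disappeared. The script is working now, please see my answer. If you look at the repo again, note that the drop_mongodb function isn't fully in place yet; it's still a work in progress. Thanks!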

Tags: python mongodb amazon-web-services

【Solution 1】:

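Pandas is a very flexible and forgiving library for managing and analyzing data. It is complete overkill if all you want is to convert a MongoDB collection to a CSV file, given that the csv module comes standard with Python, and the way you are using it is very inefficient. Another thing to note is that, until recently, neither Python nor Pandas made any attempt to preserve the order of items in a dict. Before CPython 3.6 began preserving insertion order (as an implementation detail), code was written on the assumption that the order of items in a dict did not matter. Only since Python 3.7 has preserving the order of dict entries been an official language feature.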
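
A quick way to check the ordering guarantee on your own interpreter is sketched below. The field names are taken from the question; the values are placeholders, and the snippet assumes Python 3.7 or later.

```python
# On Python 3.7+ a dict preserves insertion order as a language guarantee.
record = {}
for key in ["AWS Account", "Account Number", "Name", "Instance ID"]:
    record[key] = "placeholder"  # illustrative value only

# Iterating a dict yields its keys in the order they were inserted.
keys_in_order = list(record)
print(keys_in_order)
```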

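The DataFrame is the main data structure in Pandas; it represents a two-dimensional array of data. A few things about it can be confusing, and I think you got tripped up by the fact that both rows and columns can have named indexes. In general, when talking about data in Pandas, "index" refers to the row index.

In your data, the row index will be the value of the MongoDB _id, which you want to discard. That's fine, but it may have led you to think that "index" refers to the columns.

A Series is generally used to represent a single column of data. When it is initialized from a dict, the keys are treated as the index, that is, as row labels rather than column labels. You will find that most operations between DataFrames and Series treat the Series as a column. But as I said, Pandas is flexible, so it has the DataFrame.append function, which treats a Series as a row.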
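
The distinction between row labels and column labels can be seen in a small sketch like the one below. It assumes a reasonably recent pandas on Python 3.7+, and the document is a trimmed-down stand-in for one of the records in the question.

```python
import pandas as pd

# A trimmed-down stand-in for one MongoDB document from the question.
doc = {"AWS Account": "jf-master-pd", "Name": "usawsweb001", "Type": "t2.micro"}

# Built from a dict, the keys become the Series index (row labels).
series = pd.Series(doc)
series_labels = list(series.index)

# Built from a list of dicts, the keys become DataFrame columns instead,
# and each dict becomes one row.
frame = pd.DataFrame([doc])
frame_columns = list(frame.columns)

print(series_labels)
print(frame_columns)
```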

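The problem is that, when appending a row, Pandas expects the Series to add a row to the existing columns. When the Series has index labels (the keys of the original dict) that don't exist in the DataFrame, it adds them as new columns after the existing ones and, as you have seen, it adds them in sorted order. This is actually a bug in the current version (1.0.5). It has probably been allowed to persist unfixed for so long because dict order was ignored anyway, but be thankful for it, because it prompted you to investigate further.

Converting a MongoDB collection to a DataFrame by appending Series to an initially empty DataFrame is really inefficient. DataFrame is perfectly capable of reading your MongoDB collection directly, with far less code for you to write.

If you need Pandas, this is the version of mongo_export_to_file that I recommend: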

def mongo_export_to_file():
    today = datetime.today()
    today = today.strftime("%m-%d-%Y")
    _, _, instance_col = set_db()
    # make an API call to the MongoDB server
    mongo_docs = instance_col.find()

    # Convert the mongo docs to a DataFrame
    docs = pandas.DataFrame(mongo_docs)
    # Discard the Mongo ID for the documents
    docs.pop("_id")

    # compute the output file directory and name
    output_dir = os.path.join('..', '..', 'output_files', 'aws_instance_list', 'csv', '')
    output_file = os.path.join(output_dir, 'aws-instance-master-list-' + today +'.csv')

    # export MongoDB documents to a CSV file, leaving out the row "labels" (row numbers)
    docs.to_csv(output_file, sep=",", index=False) # CSV delimited by commas
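
To try the same idea without a MongoDB server, a sketch along these lines may help. The list of dicts stands in for the documents returned by find(), the _id values are placeholder strings rather than real ObjectIds, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd

# Stand-in for the documents a find() call would return (assumed data).
mongo_docs = [
    {"_id": "id-1", "AWS Account": "jf-master-pd", "Name": "usawsweb001", "Type": "t2.micro"},
    {"_id": "id-2", "AWS Account": "jf-master-pd", "Name": "usawsweb002", "Type": "t2.micro"},
]

# One DataFrame call: each dict is a row, keys become columns in key order.
docs = pd.DataFrame(mongo_docs)
# Drop the Mongo ID column; index=False below drops the row labels.
docs = docs.drop(columns="_id")

buffer = io.StringIO()
docs.to_csv(buffer, sep=",", index=False)
csv_text = buffer.getvalue()
header = csv_text.splitlines()[0]
print(header)
```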

This is the version I would use in a project that doesn't otherwise need Pandas:

def mongo_export_to_file():  
    today = datetime.today()
    today = today.strftime("%m-%d-%Y")
    _, _, instance_col = set_db()
    # make an API call to the MongoDB server and materialize the documents
    # (Cursor.count() was removed in PyMongo 4, so check the list instead)
    mongo_docs = list(instance_col.find())
    if not mongo_docs:
        return

    fieldnames = list(mongo_docs[0].keys())
    fieldnames.remove('_id')

    # compute the output file directory and name
    output_dir = os.path.join('..', '..', 'output_files', 'aws_instance_list', 'csv', '')
    output_file = os.path.join(output_dir, 'aws-instance-master-list-' + today +'.csv')
    with open(output_file, 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(mongo_docs)
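
The csv-only approach can be exercised with in-memory data as a rough sketch. The documents below are assumed sample data with placeholder _id strings, and the output goes to a StringIO buffer rather than a file on disk.

```python
import csv
import io

# Stand-in for the documents a find() call would return (assumed data).
mongo_docs = [
    {"_id": "id-1", "AWS Account": "jf-master-pd", "Name": "usawsweb001", "Type": "t2.micro"},
    {"_id": "id-2", "AWS Account": "jf-master-pd", "Name": "usawsweb002", "Type": "t2.micro"},
]

# Take the column order from the first document, leaving out the Mongo ID.
fieldnames = [key for key in mongo_docs[0] if key != "_id"]

buffer = io.StringIO(newline="")
# extrasaction="ignore" silently skips keys (such as _id) not in fieldnames.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
writer.writeheader()
writer.writerows(mongo_docs)

rows = buffer.getvalue().splitlines()
print(rows[0])
```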

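【Discussion】:

Brilliant solution! I learned a lot from your example. Thank you very much!
Awesome! Of course, the only reason I didn't do this last night is that the option to award the bounty wasn't available yet; I had to wait. Thanks again! I really appreciate your solution.

【Solution 2】:

Why are the columns and headers out of order in the MongoDB version?

Given that the JSON is in the correct order, the problem arises in the mongo_export_to_file() function. First, note that the columns in the output are sorted alphabetically. A quick and dirty workaround is to prefix each column name with a letter so that sorting preserves the original order (AWS Account -> a_AWS_Account; Account Number -> b_Account_Number). That leaves the rest of the code unchanged.

In any case, you must be losing the original column order somewhere. A Python dict does not necessarily keep its original order. Following @Shubham's comment, I would try two things:

Replace the doc dict with an OrderedDict in the first line of the for loop: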

...
from collections import OrderedDict
...
...
    # iterate over the list of MongoDB dict documents
    for num, doc in enumerate(mongo_docs):
        doc = OrderedDict(doc)

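If the problem persists, then it comes from the API call cursor = instance_col.find(). Inspect the contents of the cursor. There should be a way to preserve the order found in the JSON. It may be documented in the pymongo library (see the find function), although the sort parameter does not appear to have any effect.

How do I get rid of the extra column that mongo adds for the ID when printing from pandas to a file?

When exporting to CSV, add index=False: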

      # export MongoDB documents to a CSV file
      docs.to_csv(output_file, sep=",", index=False) # CSV delimited by commas

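【Discussion】:

@bluethundr, I don't understand why you accepted this answer. It doesn't solve your problem of preserving column order, and I'm concerned that accepting it will mislead future readers. Creating an OrderedDict from an existing dict doesn't help, because all it does is preserve whatever order the dict was going to give anyway. Your problem is that Series is the wrong data type to use: it gives you a column of keys and a column of values, and the keys get sorted later when they are converted from row labels to column headers. Also, setting index=False removes the row labels, but it does not remove the _id column.
OK, thanks @OldPro. I'll try your solution tomorrow, and I'll probably accept your answer. That does make sense. Thanks for your time!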
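
The point about OrderedDict in the comment above can be checked with a short sketch. The dict below is made-up data whose keys are deliberately in an unwanted order.

```python
from collections import OrderedDict

# Pretend this dict arrived with its keys already in an unwanted order.
scrambled = {
    "Name": "usawsweb001",
    "AWS Account": "jf-master-pd",
    "Account Number": "123456789101",
}

# Wrapping it copies the keys in whatever order the dict yields them;
# it cannot recover an order the dict never had.
wrapped = OrderedDict(scrambled)
same_order = list(wrapped) == list(scrambled)
print(same_order)
```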

【Solution 3】:

I found the solution here.

What I had to do was create a list of field names and apply it to the DataFrame:

# export MongoDB documents to a CSV file
fieldnames = [ 'AWS Account', 'Account Number', 'Name', 'Instance ID', 'AMI ID', 'Volumes', 'Private IP', 'Public IP', 'Private DNS', 'Availability Zone', 'VPC ID', 'Type', 'Key Pair Name', 'State', 'Launch Date']
docs.to_csv(output_file, columns=fieldnames, sep=",", index=False) # CSV delimited by commas

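【Discussion】:

This is an unfortunate choice of solution, because it now requires you to maintain a complete list of fields and their order that is entirely separate from the database. When the database gains a column, you will lose it unless you also update the list here. With my solution, you don't lose the column order in the first place. (My solution is also more efficient.)
OK, thanks. I'll try your solution and see how it goes.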
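
One possible way to avoid a hard-coded field list is to derive it from the documents themselves, as sketched below. The helper collect_fieldnames is hypothetical (it does not come from the thread or from any library), the "Engagement" field is assumed sample data, and the approach reads every document once, which may not suit very large collections.

```python
import csv
import io


def collect_fieldnames(documents, exclude=("_id",)):
    """Return every key seen across documents, in first-seen order.

    Hypothetical helper for illustration only.
    """
    seen = {}  # dict keys keep insertion order on Python 3.7+
    for document in documents:
        for key in document:
            if key not in exclude:
                seen.setdefault(key, None)
    return list(seen)


# Assumed sample data: the second document has one extra field.
documents = [
    {"_id": "id-1", "AWS Account": "jf-master-pd", "Name": "usawsweb001"},
    {"_id": "id-2", "AWS Account": "jf-master-pd", "Engagement": "E-1", "Name": "usawsweb002"},
]

fieldnames = collect_fieldnames(documents)

buffer = io.StringIO(newline="")
# restval fills in fields a document lacks; extrasaction skips _id.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", extrasaction="ignore")
writer.writeheader()
writer.writerows(documents)
rows = buffer.getvalue().splitlines()
print(rows[0])
```

New fields then show up in the export automatically, appended after the fields that were seen first.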