【问题标题】:Create CSV from XML/Json using Python Pandas使用 Python Pandas 从 XML/Json 创建 CSV
【发布时间】:2020-10-27 04:13:48
【问题描述】:

我正在尝试将一个 xml 解析为多个不同的文件 -

示例 XML

<integration-outbound:IntegrationEntity
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <integrationEntityHeader>
        <integrationTrackingNumber>281#963-4c1d-9d26-877ba40a4b4b#1583507840354</integrationTrackingNumber>
        <referenceCodeForEntity>25428</referenceCodeForEntity>
        <attachments>
            <attachment>
                <id>d6esd1d518b06019e01</id>
                <name>durance.pdf</name>
                <size>0</size>
            </attachment>
            <attachment>
                <id>182e60164ddd4236b5bd96109</id>
                <name>ssds</name>
                <size>0</size>
            </attachment>
        </attachments>
        <source>SIM</source>
        <entity>SUPPLIER</entity>
        <action>CREATE</action>
        <timestampUTC>20200306T151721</timestampUTC>
        <zDocBaseVersion>2.0</zDocBaseVersion>
        <zDocCustomVersion>0</zDocCustomVersion>
    </integrationEntityHeader>
    <integrationEntityDetails>
        <supplier>
            <requestId>2614352</requestId>
            <controlBlock>
                <dataProcessingInfo>
                    <key>MODE</key>
                    <value>Onboarding</value>
                </dataProcessingInfo>
                <dataProcessingInfo>
                    <key>Supplier_Type</key>
                    <value>Operational</value>
                </dataProcessingInfo>
            </controlBlock>
            <id>1647059</id>
            <facilityCode>0001</facilityCode>
            <systemCode>1</systemCode>
            <supplierType>Operational</supplierType>
            <systemFacilityDetails>
                <systemFacilityDetail>
                    <facilityCode>0001</facilityCode>
                    <systemCode>1</systemCode>
                    <FacilityStatus>ACTIVE</FacilityStatus>
                </systemFacilityDetail>
            </systemFacilityDetails>
            <status>ACTIVE</status>
            <companyDetails>
                <displayGSID>254232128</displayGSID>
                <legalCompanyName>asdasdsads</legalCompanyName>
                <dunsNumber>03-175-2493</dunsNumber>
                <legalStructure>1</legalStructure>
                <website>www.aaadistributor.com</website>
                <noEmp>25</noEmp>
                <companyIndicator1099>No</companyIndicator1099>
                <taxidAndWxformRequired>NO</taxidAndWxformRequired>
                <taxidFormat>Fed. Tax</taxidFormat>
                <wxForm>182e601649ade4c38cd4236b5bd96109</wxForm>
                <taxid>27-2204474</taxid>
                <companyTypeFix>SUPPLIER</companyTypeFix>
                <fields>
                    <field>
                        <id>LOW_CUURENT_SERV</id>
                        <value>1</value>
                    </field>
                    <field>
                        <id>LOW_COI</id>
                        <value>USA</value>
                    </field>
                    <field>
                        <id>LOW_STATE_INCO</id>
                        <value>US-PA</value>
                    </field>
                    <field>
                        <id>CERT_INSURANCE</id>
                        <value>d6e6e460fe8958564c1d518b06019e01</value>
                    </field>
                    <field>
                        <id>COMP_DBA</id>
                        <value>asdadas</value>
                    </field>
                    <field>
                        <id>LOW_AREUDIVE</id>
                        <value>N</value>
                    </field>
                    <field>
                        <id>LOW_BU_SIZE1</id>
                        <value>SMLBUS</value>
                    </field>
                    <field>
                        <id>EDI_CAP</id>
                        <value>Y</value>
                    </field>
                    <field>
                        <id>EDI_WEB</id>
                        <value>N</value>
                    </field>
                    <field>
                        <id>EDI_TRAD</id>
                        <value>N</value>
                    </field>
                </fields>
            </companyDetails>
            <allLocations>
                <location>
                    <addressInternalid>1704342</addressInternalid>
                    <isDelete>false</isDelete>
                    <internalSupplierid>1647059</internalSupplierid>
                    <acctGrpid>HQ</acctGrpid>
                    <address1>2501 GRANT AVE</address1>
                    <country>USA</country>
                    <state>US-PA</state>
                    <city>PHILADELPHIA</city>
                    <zip>19114</zip>
                    <phone>(215) 745-7900</phone>
                </location>
            </allLocations>
            <contactDetails>
                <contactDetail>
                    <contactInternalid>12232</contactInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704312142</addressInternalid>
                    <contactType>Main</contactType>
                    <firstName>Raf</firstName>
                    <lastName>jas</lastName>
                    <title>Admin</title>
                    <email>abcd@gmail.com</email>
                    <phoneNo>123-42-23-23</phoneNo>
                    <createPortalLogin>yes</createPortalLogin>
                    <allowedPortalSideProducts>SIM,iSource,iContract</allowedPortalSideProducts>
                </contactDetail>
                <contactDetail>
                    <contactInternalid>1944938</contactInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704342</addressInternalid>
                    <contactType>Rad</contactType>
                    <firstName>AVs</firstName>
                    <lastName>asd</lastName>
                    <title>Founder</title>
                    <email>as@sds.com</email>
                    <phoneNo>21521-2112-7900</phoneNo>
                    <createPortalLogin>yes</createPortalLogin>
                    <allowedPortalSideProducts>SIM,iContract,iSource</allowedPortalSideProducts>
                </contactDetail>
            </contactDetails>
            <myLocation>
                <addresses>
                    <myLocationsInternalid>1704342</myLocationsInternalid>
                    <isDelete>false</isDelete>
                    <addressInternalid>1704342</addressInternalid>
                    <usedAt>N</usedAt>
                </addresses>
            </myLocation>
            <bankDetails>
                <fields>
                    <field>
                        <id>LOW_BANK_KEY</id>
                        <value>123213</value>
                    </field>
                    <field>
                        <id>LOW_EFT</id>
                        <value>123123</value>
                    </field>
                </fields>
            </bankDetails>
            <forms>
                <form>
                    <id>CATEGORY_PRODSER</id>
                    <records>
                        <record>
                            <Internalid>24348</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3127</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel4434</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5545</value>
                                </field>
                            </fields>
                        </record>
                        <record>
                            <Internalid>24349</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3125</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel4268</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5418</value>
                                </field>
                            </fields>
                        </record>
                        <record>
                            <Internalid>24350</Internalid>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>CATEGOR_LEVEL_1</id>
                                    <value>MR</value>
                                </field>
                                <field>
                                    <id>LOW_PRODSERV</id>
                                    <value>RES</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_2</id>
                                    <value>keylevel221</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_3</id>
                                    <value>keylevel3122</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_4</id>
                                    <value>keylevel425</value>
                                </field>
                                <field>
                                    <id>LOW_LEVEL_5</id>
                                    <value>keylevel5221</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>OTHER_INFOR</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>S_EAST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>W_EST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>M_WEST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>N_EAST</id>
                                    <value>N</value>
                                </field>
                                <field>
                                    <id>LOW_AREYOU_ASSET</id>
                                    <value>-1</value>
                                </field>
                                <field>
                                    <id>LOW_SWART_PROG</id>
                                    <value>-1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>ABDCEDF</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>LOW_COD_CONDUCT</id>
                                    <value>-1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
                <form>
                    <id>CODDUC</id>
                    <records>
                        <record>
                            <isDelete>false</isDelete>
                            <fields>
                                <field>
                                    <id>LOW_SUPPLIER_TYPE</id>
                                    <value>2</value>
                                </field>
                                <field>
                                    <id>LOW_DO_INT_BOTH</id>
                                    <value>1</value>
                                </field>
                            </fields>
                        </record>
                    </records>
                </form>
            </forms>
        </supplier>
    </integrationEntityDetails>
</integration-outbound:IntegrationEntity>

目标是实现通用 xml 到 csv 的转换。根据输入文件,xml 应该被展平并分解为多个 csv 并存储。

输入是上面的 xml 和下面的配置 csv 文件。需要用文件中提到的对应XPATH创建3个csv文件

XPATH,ColumName,CSV_File_Name,ParentKey
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/id,id,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/name,name,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/attachments/attachment[]/size,size,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/source,source,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/entity,entity,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/action,action,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/timestampUTC,timestampUTC,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocBaseVersion,zDocBaseVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/zDocCustomVersion,zDocCustomVersion,integrationEntityHeader.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsControlBlock.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/key,key,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/controlBlock/dataProcessingInfo[]/value,value,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsControlBlock.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/id,id,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/Internalid,Internalid,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/isDelete,FormId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/id,SupplierFormRecordFieldId,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/forms/form[]/records/record[]/fields/field[]/value,SupplierFormRecordFieldValue,integrationEntityDetailsForms.csv,
/integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber,integrationTrackingNumber,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityHeader/referenceCodeForEntity,referenceCodeForEntity,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/requestId,requestId,integrationEntityDetailsForms.csv,Y
/integration-outbound:IntegrationEntity/integrationEntityDetails/supplier/id,supplier_id,integrationEntityDetailsForms.csv,Y

我需要从中创建 3 个 csv 文件输出。

设计是选择每个csv文件并获取xpath并从xml中选择相应的值并获取它

第 1 步 - 将 xml 转换为 Json -

import json
import xmltodict

with open("/home/s0998hws/test.xml") as xml_file:
    data_dict = xmltodict.parse(xml_file.read())
    xml_file.close()

    # generate the object using json.dumps()
    # corresponding to json data

json_data = json.dumps(data_dict)

# Write the json data to output
# json file
with open("data.json", "w") as json_file:
    json_file.write(json_data)
    json_file.close()

with open('data.json') as f:
    d = json.load(f)

第 2 步 - 使用 panda normalize 函数进行标准化 - 使用 xpath / 转换为 .和 [] 作为其他分隔符并构建要从 json 中提取的列,即代码将查找 /integration-outbound:IntegrationEntity/integrationEntityHeader/integrationTrackingNumber 并转换为 .integrationEntityHeader.integrationTrackingNumber 和第一个 [] 它会爆炸,在那里

df_1=pd.json_normalize(data=d['integration-outbound:IntegrationEntity'])
df_2=df_1[['integrationEntityHeader.integrationTrackingNumber','integrationEntityDetails.supplier.requestId','integrationEntityHeader.referenceCodeForEntity','integrationEntityDetails.supplier.id','integrationEntityDetails.supplier.forms.form']]
df_3=df_2.explode('integrationEntityDetails.supplier.forms.form') 
df_3['integrationEntityDetails.supplier.forms.form.id']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('id'))
df_3['integrationEntityDetails.supplier.forms.form.records']=df_3['integrationEntityDetails.supplier.forms.form'].apply(lambda x: x.get('records'))

我试图使用 csv 文件中的元数据并对其进行修改,但挑战是

df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid']=df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(lambda x: x.get('Internalid'))

因错误而失败 -

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib64/python3.6/site-packages/pandas/core/series.py", line 3848, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/lib.pyx", line 2327, in pandas._libs.lib.map_infer
  File "<stdin>", line 1, in <lambda>
AttributeError: 'list' object has no attribute 'get'

原因是panda dataframe中的数据有list when和array,使用上述方法无法得到。

下面是生成的输出

integrationEntityHeader.integrationTrackingNumber   integrationEntityDetails.supplier.requestId integrationEntityHeader.referenceCodeForEntity  integrationEntityDetails.supplier.id    integrationEntityDetails.supplier.forms.form    integrationEntityDetails.supplier.forms.form.id integrationEntityDetails.supplier.forms.form.records
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'CATEGORY_PRODSER', 'records': {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}}    CATEGORY_PRODSER    {'record': [{'Internalid': '24348', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3127'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4434'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5545'}]}}, {'Internalid': '24349', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3125'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel4268'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5418'}]}}, {'Internalid': '24350', 'isDelete': 'false', 'fields': {'field': [{'id': 'CATEGOR_LEVEL_1', 'value': 'MR'}, {'id': 'LOW_PRODSERV', 'value': 'RES'}, {'id': 'LOW_LEVEL_2', 'value': 'keylevel221'}, {'id': 'LOW_LEVEL_3', 'value': 'keylevel3122'}, {'id': 'LOW_LEVEL_4', 'value': 'keylevel425'}, {'id': 'LOW_LEVEL_5', 'value': 'keylevel5221'}]}}]}
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'OTHER_INFOR', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}}   OTHER_INFOR {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'S_EAST', 'value': 'N'}, {'id': 'W_EST', 'value': 'N'}, {'id': 'M_WEST', 'value': 'N'}, {'id': 'N_EAST', 'value': 'N'}, {'id': 'LOW_AREYOU_ASSET', 'value': '-1'}, {'id': 'LOW_SWART_PROG', 'value': '-1'}]}}}
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'CORPORATESUSTAINABILITY', 'records': {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}}  CORPORATESUSTAINABILITY {'record': {'isDelete': 'false', 'fields': {'field': {'id': 'LOW_COD_CONDUCT', 'value': '-1'}}}}
    0   281#999eb16e-242c-4239-b33e-ae6f5296fb15#10c7338c-ab63-4c1d-9d26-877ba40a4b4b#1583507840354 2614352 25428   1647059 {'id': 'PRODUCTSERVICES', 'records': {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}}  PRODUCTSERVICES {'record': {'isDelete': 'false', 'fields': {'field': [{'id': 'LOW_SUPPLIER_TYPE', 'value': '2'}, {'id': 'LOW_DO_INT_BOTH', 'value': '1'}]}}}

预期输出 integrationEntityDetailsForms.csv

integrationTrackingNumber   requestId   referenceCodeForEntity  supplier.id integrationEntityDetails.supplier.forms.form.id InternalId  isDelete    SupplierFormRecordFieldId   SupplierFormRecordFieldValue
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_PRODSERV    RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_3 keylevel3127
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_4 keylevel4434
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24348   FALSE   LOW_LEVEL_5 keylevel5545
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   CATEGOR_LEVEL_1 MR
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_PRODSERV    RES
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_2 keylevel221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_3 keylevel3122
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_4 keylevel425
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CATEGORY_PRODSER    24350   FALSE   LOW_LEVEL_5 keylevel5221
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   S_EAST  N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   W_EST   N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   M_WEST  N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   N_EAST  N
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 OTHER_INFOR     FALSE   LOW_AREYOU_ASSET    -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CORPORATESUSTAINABILITY     FALSE   LOW_SWART_PROG  -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 CORPORATESUSTAINABILITY     FALSE   LOW_COD_CONDUCT -1
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 PRODUCTSERVICES     FALSE   LOW_SUPPLIER_TYPE   2
281#963-4c1d-9d26-877ba40a4b4b#1583507840354    2614352 25428   1647059 PRODUCTSERVICES     FALSE   LOW_DO_INT_BOTH 1

【问题讨论】:

  • 您的 XML 格式不正确,命名空间前缀 integration-outbound 未定义。

标签: python xml pandas csv jsonparser


【解决方案1】:

我认为问题中缺少这一行:

df_3['integrationEntityDetails.supplier.forms.form.records.record'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records'].apply(
        lambda x: x.get('record')
    )
)

然后,对于 Internalid,您可以这样做:

df_3['integrationEntityDetails.supplier.forms.form.records.record.Internalid'] = (
    df_3['integrationEntityDetails.supplier.forms.form.records.record'].apply(
        lambda x: x[0].get('Internalid') if type(x) == list else x.get('Internalid')
    )
)

【讨论】:

    【解决方案2】:

    考虑XSLT,这是一种专用语言,旨在转换 XML 文件,例如在某些部分将它们展平。 Python 的第三方模块lxml 可以运行 XSLT 1.0 脚本和 XPath 1.0 表达式。

    具体来说,XSLT 可以处理您的 XPath 提取。然后,从单个转换结果树中,构建所需的三个数据框。为了格式良好,下面假设以下根和数据结构:

    <integration-outbound:IntegrationEntity
        xmlns:integration-outbound="http://example.com"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
       
        ...same content...
    
    </integration-outbound:IntegrationEntity>
    

    XSLT (另存为.xsl,一个特殊的.xml文件)

    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                                  xmlns:integration-outbound="http://example.com"
                                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
        <xsl:strip-space elements="*"/>
        
        <xsl:template match="integration-outbound:IntegrationEntity">
         <data>
           <xsl:apply-templates select="integrationEntityHeader/descendant::attachment"/>
           <xsl:apply-templates select="integrationEntityDetails/descendant::dataProcessingInfo"/>
           <xsl:apply-templates select="integrationEntityDetails/descendant::forms/descendant::field"/>
         </data>
        </xsl:template>
        
        <xsl:template match="attachment">
         <integrationEntityHeader>
           <xsl:copy-of select="ancestor::integrationEntityHeader/*[name()!='attachments']"/>
           <xsl:copy-of select="*"/>
         </integrationEntityHeader>
        </xsl:template>
        
        <xsl:template match="dataProcessingInfo">
         <integrationEntityDetailsControlBlock>
           <xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
           <requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
           <supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
           <xsl:copy-of select="*"/>
         </integrationEntityDetailsControlBlock>
        </xsl:template>
        
        <xsl:template match="field">
         <integrationEntityDetailsForms>
           <form_id><xsl:value-of select="ancestor::form/id"/></form_id>
           <xsl:copy-of select="ancestor::record/*[name()!='fields']"/>
           <SupplierFormRecordFieldId><xsl:value-of select="id"/></SupplierFormRecordFieldId>
           <SupplierFormRecordFieldValue><xsl:value-of select="id"/></SupplierFormRecordFieldValue>
           <xsl:copy-of select="ancestor::integration-outbound:IntegrationEntity/integrationEntityHeader/*[position() &lt;= 2]"/>
           <requestId><xsl:value-of select="ancestor::supplier/requestId"/></requestId>
           <supplier_id><xsl:value-of select="ancestor::supplier/id"/></supplier_id>
         </integrationEntityDetailsForms>
        </xsl:template>
        
    </xsl:stylesheet>
    

    Online Transformation

    Python

    import lxml.etree as et
    import pandas as pd
    
    # LOAD XML AND XSL
    doc = et.parse('Input.xml')
    style = et.parse('Script.xsl')
    
    # INITIALIZE AND RUN TRANSFORMATION
    transformer = et.XSLT(style)
    flat_doc = transformer(doc)
    
    # BUILD THREE DATA FRAMES
    df_header = pd.DataFrame([{i.tag:i.text for i in el} 
              for el in flat_doc.xpath('integrationEntityHeader')])
    
    df_detailsControlBlock = pd.DataFrame([{i.tag:i.text for i in el} 
              for el in flat_doc.xpath('integrationEntityDetailsControlBlock')])
    
    df_detailsForms = pd.DataFrame([{i.tag:i.text for i in el} 
              for el in flat_doc.xpath('integrationEntityDetailsForms')])
    

    【讨论】:

    • 这行得通。但不希望对每个不同的 xml 都使用 xslt 来解析。但令人难以置信的想法!我发布的解决方案将配置作为 csv 并相应地解析。同样可以通过简单的调整来解析 json。非常感谢您的努力并提供解决方案
    • 明白。假设 XML 变化不大,也可以在 XSLT 中进行小幅调整。另外,您留在 DOM 框架中,而不是集成像 JSON 这样的不同可序列化类型。 XSLT is a sibling to XPath。由于 XML 是一种留给设计人员的开放式数据格式(与 2-D 数据帧不同),因此很难为所有用例提供可通用的答案。但祝你好运,编码愉快!
    【解决方案3】:

    xml转换成dict,然后编写解析逻辑,原因是同样可以用于json。 stackoverflow 非常有用,解决方案是基于所有这些链接的响应构建的。为简单起见,我创建了一个 3 级嵌套 xml。这适用于 Python3

    <?xml version="1.0"?><Company><Employee><FirstName>Hal</FirstName><LastName>Thanos</LastName><ContactNo>122131</ContactNo><Email>hal.thanos@xyz.com</Email><Addresses><Address><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Iron</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>iron.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
    <?xml version="1.0"?><Company><Employee><FirstName>Captain</FirstName><LastName>America</LastName><ContactNo>13322</ContactNo><Email>captain.america@xyz.com</Email><Addresses><Address><City>Trivandrum</City><State>Kerala</State><Zip>28115</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Sword</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>sword.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
    <?xml version="1.0"?><Company><Employee><FirstName>Thor</FirstName><LastName>Odison</LastName><ContactNo>156565</ContactNo><Email>thor.odison@xyz.com</Email><Addresses><Address><City>Tirunelveli</City><State>TamilNadu</State><Zip>36595</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form></forms></Address></Addresses></Employee><Employee><FirstName>Spider</FirstName><LastName>Man</LastName><ContactNo>12324</ContactNo><Email>spider.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
    <?xml version="1.0"?><Company><Employee><FirstName>Black</FirstName><LastName>Widow</LastName><ContactNo>16767</ContactNo><Email>black.widow@xyz.com</Email><Addresses><Address><City>Mysore</City><State>Karnataka</State><Zip>12478</Zip><forms><form><id>ID1</id><value>LIC</value></form></forms></Address></Addresses></Employee><Employee><FirstName>White</FirstName><LastName>Man</LastName><ContactNo>5634</ContactNo><Email>white.man@xyz.com</Email><Addresses><Address><type>Permanent</type><City>Bangalore</City><State>Karnataka</State><Zip>560212</Zip><forms><form><id>ID3</id><value>LIC</value></form></forms></Address><Address><type>Temporary</type><City>Concord</City><State>NC</State><Zip>28027</Zip><forms><form><id>ID1</id><value>LIC</value></form><form><id>ID2</id><value>PAS</value></form><form><id>ID3</id><value>SSN</value></form><form><id>ID2</id><value>CC</value></form></forms></Address></Addresses></Employee></Company>
    

    这个 xml 的配置文件是所有可能的数组/多级/爆炸列应该被称为 []。如代码中所述,需要标头。

    根据您的商店更改变量 process_config_csv = 'config.csv' xml_file_name = 'test.xml'

    XPATH,ColumName,CSV_File_Name
    /Company/Employee[]/FirstName,FirstName,Name.csv
    /Company/Employee[]/LastName,LastName,Name.csv
    /Company/Employee[]/ContactNo,ContactNo,Name.csv
    /Company/Employee[]/Email,Email,Name.csv
    /Company/Employee[]/FirstName,FirstName,Address.csv
    /Company/Employee[]/LastName,LastName,Address.csv
    /Company/Employee[]/ContactNo,ContactNo,Address.csv
    /Company/Employee[]/Email,Email,Address.csv
    /Company/Employee[]/Addresses/Address[]/City,City,Address.csv
    /Company/Employee[]/Addresses/Address[]/State,State,Address.csv
    /Company/Employee[]/Addresses/Address[]/Zip,Zip,Address.csv
    /Company/Employee[]/Addresses/Address[]/type,type,Address.csv
    /Company/Employee[]/FirstName,FirstName,Form.csv
    /Company/Employee[]/LastName,LastName,Form.csv
    /Company/Employee[]/ContactNo,ContactNo,Form.csv
    /Company/Employee[]/Email,Email,Form.csv
    /Company/Employee[]/Addresses/Address[]/type,type,Form.csv
    /Company/Employee[]/Addresses/Address[]/forms/form[]/id,id,Form.csv
    /Company/Employee[]/Addresses/Address[]/forms/form[]/value,value,Form.csv
    

    根据配置文件创建多个csv的代码是

    import json
    import xmltodict
    import json
    import os
    import csv
    import numpy as np
    import pandas as pd
    import sys
    from collections import defaultdict
    import numpy as np
    
    def getMatches(L1, L2):
        R = set()
        for elm in L1:
            for pat in L2:
                if elm.find(pat) != -1:
                    if elm.find('.', len(pat)+1) != -1:
                        R.add(elm[:elm.find('.', len(pat)+1)])
                    else:
                        R.add(elm)
        return list(R)
    
    def xml_parse(xml_file_name):
        try:
            process_xml_file = xml_file_name
            with open(process_xml_file) as xml_file:
                for xml_string in xml_file:
                    """Converting the xml to Dict"""
                    data_dict = xmltodict.parse(xml_string)
                    """Converting the dict to Pandas DF"""
                    df_processing = pd.json_normalize(data_dict)
                    xml_parse_loop(df_processing)
                xml_file.close()
        except Exception as e:
            s = str(e)
            print(s)
    
    def xml_parse_loop(df_processing_input):
        CSV_File_Name = []
        """Getting the list of csv Files to be created"""
        with open(process_config_csv, newline='') as csvfile:
            DataCaptured = csv.DictReader(csvfile)
            for row in DataCaptured:
                if row['CSV_File_Name'] not in CSV_File_Name:
                    CSV_File_Name.append(row['CSV_File_Name'])
        """Iterating the list of CSV"""
        for items in CSV_File_Name:
                df_processing = df_processing_input
                df_subset_process = []
                df_subset_list_all_cols = []
                df_process_sub_explode_Level = []
                df_final_column_name = []
                print('Parsing the xml file for creating the file - ' + str(items))
                """Fetching the field list for processs from the confic File"""
                with open(process_config_csv, newline='') as csvfile:
                        DataCaptured = csv.DictReader(csvfile)
                        for row in DataCaptured:
                            if row['CSV_File_Name'] in items:
                                    df_final_column_name.append(row['ColumName'])
                                    """Getting the columns until the first [] """
                                    df_subset_process.append(row['XPATH'].strip('/').replace("/",".").split('[]')[0])
                                    """Getting the All the columnnames"""
                                    df_subset_list_all_cols.append(row['XPATH'].strip('/').replace("/",".").replace("[]",""))
                                    """Getting the All the Columns to explode"""
                                    df_process_sub_explode_Level.append(row['XPATH'].strip('/').replace('/', '.').split('[]'))
                explode_ld = defaultdict(set)
                """Putting Level of explode and column names"""
                for x in df_process_sub_explode_Level:
                    if len(x) > 1:
                        explode_ld[len(x) - 1].add(''.join(x[: -1]))
                explode_ld = {k: list(v) for k, v in explode_ld.items()}
                #print(' The All column list is for the file ' + items + " is " + str(df_subset_list_all_cols))
                #print(' The first processing for the file ' + items + " is " + str(df_subset_process))
                #print('The explode level of attributes for the file ' + items + " is " + str(explode_ld))
                """Remove column duplciates"""
                df_subset_process = list(dict.fromkeys(df_subset_process))
                for col in df_subset_process:
                    if col not in df_processing.columns:
                        df_processing[col] = np.nan
                df_processing = df_processing[df_subset_process]
                df_processing_col_list = df_processing.columns.tolist()
                print ('The total levels to be exploded : %d' % len(explode_ld))
                i=0
                level=len(explode_ld)
                for i in range(level):
                    print (' Exploding the Level : %d' % i )
                    df_processing_col_list = df_processing.columns.tolist()
                    list_of_explode=set(df_processing_col_list) & set(explode_ld[i + 1])
                    #print('List to expolde' + str(list_of_explode))
                    """If founc in explode list exlplode some xml doesnt need to have a list it could be column handling the same"""
                    for c in list_of_explode:
                        print (' There are column present which needs to be exploded - ' + str(c))
                        df_processing = pd.concat((df_processing.iloc[[type(item) == list for item in df_processing[c]]].explode(c),df_processing.iloc[[type(item) != list for item in df_processing[c]]]))
                        print(' Finding the columns need to be fetched ')
                    """From the overall column list fecthing the attributes needed to explode"""
                    next_level_pro_lst = getMatches(df_subset_list_all_cols,explode_ld[ i + 1 ])
                    #print(next_level_pro_lst)
                    df_processing_col_list = df_processing.columns.tolist()
                    for nex in next_level_pro_lst:
                        #print ("Fetching " + nex.rsplit('.', 1)[1] + ' from ' + nex.rsplit('.', 1)[0] + ' from ' + nex )
                        parent_col=nex.rsplit('.', 1)[0]
                        child_col=nex.rsplit('.', 1)[1]
                        #print(parent_col)
                        #print(df_processing_col_list)
                        if parent_col not in df_processing_col_list:
                            df_processing[nex.rsplit('.', 1)[0]] = ""
                        try:
                            df_processing[nex] = df_processing[parent_col].apply(lambda x: x.get(child_col))
                        except AttributeError:
                            df_processing[nex] = ""
                    df_processing_col_list = df_processing.columns.tolist()
                    if i == level-1:
                        print('Last Level nothing to be done')
                    else:
                        """Extracting All columns until the next exlode column list is found"""
                        while len(set(df_processing_col_list) & set(explode_ld[i + 2]))==0:
                            next_level_pro_lst = getMatches(df_subset_list_all_cols, next_level_pro_lst)
                            #print(next_level_pro_lst)
                            for nextval in next_level_pro_lst:
                                if nextval not in df_processing_col_list:
                                    #print("Fetching " + nextval.rsplit('.', 1)[1] + ' from ' + nextval.rsplit('.', 1)[0] + ' from ' + nextval)
                                    if nextval.rsplit('.', 1)[0] not in df_processing.columns:
                                        df_processing[nextval.rsplit('.', 1)[0]] = ""
                                    try:
                                        df_processing[nextval] = df_processing[nextval.rsplit('.', 1)[0]].apply(lambda x: x.get(nextval.rsplit('.', 1)[1]))
                                    except AttributeError:
                                        df_processing[nextval] = ""
    
                            df_processing_col_list = df_processing.columns.tolist()
    
    
                df_processing = df_processing[df_subset_list_all_cols]
                df_processing.columns = df_final_column_name
                # if file does not exist write header
                if not os.path.isfile(items):
                    print("The file does not exists Exists so writing new")
                    df_processing.to_csv('{}'.format(items), header='column_names',index=None)
                else:  # else it exists so append without writing the header
                    print("The file does exists Exists so appending")
                    df_processing.to_csv('{}'.format(items), mode='a', header=False,index=None)
    
    
    from datetime import datetime
    startTime = datetime.now().strftime("%Y%m%d_%H%M%S")
    startTime = str(os.getpid()) + "_" + startTime
    process_task_name = ''
    process_config_csv = 'config.csv'
    xml_file_name = 'test.xml'
    old_print = print
    
    def timestamped_print(*args, **kwargs):
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")
        printheader = now + " xml_parser " + " " + process_task_name + " - "
        old_print(printheader, *args, **kwargs)
    print = timestamped_print
    
    xml_parse(xml_file_name)
    

    创建的输出是

    [, ~]$ cat Name.csv
    FirstName,LastName,ContactNo,Email
    Hal,Thanos,122131,hal.thanos@xyz.com
    Iron,Man,12324,iron.man@xyz.com
    Captain,America,13322,captain.america@xyz.com
    Sword,Man,12324,sword.man@xyz.com
    Thor,Odison,156565,thor.odison@xyz.com
    Spider,Man,12324,spider.man@xyz.com
    Black,Widow,16767,black.widow@xyz.com
    White,Man,5634,white.man@xyz.com
    [, ~]$ cat Address.csv
    FirstName,LastName,ContactNo,Email,City,State,Zip,type
    Iron,Man,12324,iron.man@xyz.com,Bangalore,Karnataka,560212,Permanent
    Iron,Man,12324,iron.man@xyz.com,Concord,NC,28027,Temporary
    Hal,Thanos,122131,hal.thanos@xyz.com,Bangalore,Karnataka,560212,
    Sword,Man,12324,sword.man@xyz.com,Bangalore,Karnataka,560212,Permanent
    Sword,Man,12324,sword.man@xyz.com,Concord,NC,28027,Temporary
    Captain,America,13322,captain.america@xyz.com,Trivandrum,Kerala,28115,
    Spider,Man,12324,spider.man@xyz.com,Bangalore,Karnataka,560212,Permanent
    Spider,Man,12324,spider.man@xyz.com,Concord,NC,28027,Temporary
    Thor,Odison,156565,thor.odison@xyz.com,Tirunelveli,TamilNadu,36595,
    White,Man,5634,white.man@xyz.com,Bangalore,Karnataka,560212,Permanent
    White,Man,5634,white.man@xyz.com,Concord,NC,28027,Temporary
    Black,Widow,16767,black.widow@xyz.com,Mysore,Karnataka,12478,
    [, ~]$ cat Form.csv
    FirstName,LastName,ContactNo,Email,type,id,value
    Iron,Man,12324,iron.man@xyz.com,Temporary,ID1,LIC
    Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,PAS
    Iron,Man,12324,iron.man@xyz.com,Temporary,ID3,SSN
    Iron,Man,12324,iron.man@xyz.com,Temporary,ID2,CC
    Hal,Thanos,122131,hal.thanos@xyz.com,,ID1,LIC
    Hal,Thanos,122131,hal.thanos@xyz.com,,ID2,PAS
    Iron,Man,12324,iron.man@xyz.com,Permanent,ID3,LIC
    Sword,Man,12324,sword.man@xyz.com,Temporary,ID1,LIC
    Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,PAS
    Sword,Man,12324,sword.man@xyz.com,Temporary,ID3,SSN
    Sword,Man,12324,sword.man@xyz.com,Temporary,ID2,CC
    Captain,America,13322,captain.america@xyz.com,,ID1,LIC
    Captain,America,13322,captain.america@xyz.com,,ID2,PAS
    Sword,Man,12324,sword.man@xyz.com,Permanent,ID3,LIC
    Spider,Man,12324,spider.man@xyz.com,Temporary,ID1,LIC
    Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,PAS
    Spider,Man,12324,spider.man@xyz.com,Temporary,ID3,SSN
    Spider,Man,12324,spider.man@xyz.com,Temporary,ID2,CC
    Thor,Odison,156565,thor.odison@xyz.com,,ID1,LIC
    Thor,Odison,156565,thor.odison@xyz.com,,ID2,PAS
    Spider,Man,12324,spider.man@xyz.com,Permanent,ID3,LIC
    White,Man,5634,white.man@xyz.com,Temporary,ID1,LIC
    White,Man,5634,white.man@xyz.com,Temporary,ID2,PAS
    White,Man,5634,white.man@xyz.com,Temporary,ID3,SSN
    White,Man,5634,white.man@xyz.com,Temporary,ID2,CC
    White,Man,5634,white.man@xyz.com,Permanent,ID3,LIC
    Black,Widow,16767,black.widow@xyz.com,,ID1,LIC
    

    片段和答案是从不同的线程中提取的,感谢 @Mark Tolonen @Mandy007 @deadshot

    Create a dict of list using python from csv

    https://stackoverflow.com/questions/62837949/extract-a-list-from-a-list

    How to explode Panda column with data having different dict and list of dict

    这绝对可以做得更短,性能更好,并且可以进一步增强

    【讨论】:

      【解决方案4】:

      火花来救援!

      以下代码使用 Scala 编写,但如果您愿意,可以轻松将其转换为 Python。

      Databrick's XML library 使 XML 处理变得容易。

      val headers = spark.read.format("xml").option("rowTag", "integrationEntityHeader").load("stackOverflowRafaXML.xml")
      
      headers.write.csv(<headerFilename>) // Create CSV from the header file
      
      val details = spark.read.format("xml").option("rowTag", "integrationEntityDetails").load("stackOverflowRafaXML.xml")
      
      // The details need further unnesting. To get suppliers, for instance, you can do
      val supplier = spark.read.format("xml").option("rowTag", "supplier").load("stackOverflowRafaXML.xml")
      
      supplier.show
      
      +--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
      |        allLocations|         bankDetails|      companyDetails|      contactDetails|        controlBlock|facilityCode|               forms|     id|          myLocation|requestId|status|supplierType|systemCode|systemFacilityDetails|
      +--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
      |[[HQ, 2501 GRANT ...|[[[[LOW_BANK_KEY,...|[No, SUPPLIER, 25...|[[[1704312142, SI...|[[[MODE, Onboardi...|           1|[[[CATEGORY_PRODS...|1647059|[[1704342, false,...|  2614352|ACTIVE| Operational|         1|     [[ACTIVE, 1, 1]]|
      +--------------------+--------------------+--------------------+--------------------+--------------------+------------+--------------------+-------+--------------------+---------+------+------------+----------+---------------------+
      
      
      
      
      

      【讨论】:

        【解决方案5】:

        xml 的格式对我来说有点陌生。

        你试过 pandas_read_xml 吗?

        pip install pandas_read_xml
        

        你可以做类似的事情

        import pandas_read_xml as pdx
        
        df = pdx.read_xml('filename.xml')
        

        要变平,你可以

        df = pdx.flatten(df)
        

        df = pdx.fully_flatten(df)
        

        【讨论】:

          猜你喜欢
          • 2021-04-17
          • 1970-01-01
          • 1970-01-01
          • 1970-01-01
          • 2021-08-11
          • 2013-12-02
          • 2017-10-04
          • 1970-01-01
          • 2019-07-27
          相关资源
          最近更新 更多