【Title】: Generate a JSON structure from a file based on the number of columns in the file using Python
【Posted】: 2018-10-31 16:49:17
【Question】:

I need to generate a JSON payload to pass as data to an API, and the problem is that the JSON structure has to be built from the number of IDs coming from a file. For example: if a person has 5 IDs, then I need to generate 5 entries of data; with 4 IDs, 4 entries, and so on...

Here is what my data file looks like:

Member_ID,User_ID,Proxy_ID,A_ID,Login_ID,First_Name,Last_Name
M1000,U1000,P1000,A1000,Jim1,Jim,Kong
M2000,U2000,P2000,A2000,OlilaJ,Olila,Jayavarman
M3000,U3000,P3000,A3000,LisaKop,Lisa,Kopkingg
M4000,U4000,P4000,A4000,KishoreP,Kishore,Pindhar
M5000,U5000,P5000,A5000,Gobi123,Gobi,Nadar

The data can also look like this:

Member_ID,User_ID,A_ID,Login_ID,First_Name,Last_Name
M1000,U1000,A1000,Jim1,Jim,Kong
M2000,U2000,A2000,OlilaJ,Olila,Jayavarman
M3000,U3000,A3000,LisaKop,Lisa,Kopkingg
M4000,U4000,A4000,KishoreP,Kishore,Pindhar
M5000,U5000,A5000,Gobi123,Gobi,Nadar

I can't figure out a way to generate the number of entries dynamically for each such input file.

from datetime import datetime
import json
import requests

start_time = datetime.now()

delim = "," # Just in case we switch to tsv or something
headers = {'content-type': 'application/json'}

with open('Documents/Onboarding_sample.csv', 'r') as file:
    i = next(file)
    listcolumns = i.split(",")
    sub = "ID"
    IDcolumns = [s for s in listcolumns if sub.lower() in s.lower()]
    print len(IDcolumns)
    for line in file:
        line_list = line.split(delim)
        Member_ID = line_list[0]
        User_ID = line_list[1]
        Proxy_ID = line_list[2]
        A_ID = line_list[3]
        payload = { 
            "IndividualInfo":
            [{
            "Member_ID": Member_ID,
            "Identifiertype":"001",
            "EType:01"
            }
            {
            "User_ID": User_ID,
            "Identifiertype":"001",
            "EType:01"
            }
            {
            "Proxy_ID": Proxy_ID,
            "Identifiertype":"001",
            "EType:01"
            }
            {
            "A_ID": A_ID,
            "Identifiertype":"001",
            "EType:01"
            }
            ]
        }
        try:
            r = requests.post("http://www.google.com/blahblah", data=json.dumps(payload), timeout=(1,20), headers=headers)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh:
            print "HTTP Error:%s" %errh
        except requests.exceptions.ConnectionError as errc:
            print "Error Connecting:%s" %errc
        except requests.exceptions.Timeout as errt:
            print "Timeout error:%s" %errt

        print "This is a JSON object."
        print payload

end_time = datetime.now()

print('Duration: {}'.format(end_time - start_time))

Can someone show me how to do this correctly and dynamically, whatever IDs I get in the file?

【Question comments】:

  • Note that your data is invalid; the quoting in "Identifiertype:001" is probably wrong, and some commas are missing
  • Your payload is neither valid Python syntax nor a valid JSON structure. What is your expected JSON output?
  • I changed it to include the commas, does it look okay now?
  • Unfortunately no. Have you run this code? It will definitely give you a SyntaxError. You are still missing commas between the nested dicts, and the key/value pairs in the nested dicts are messed up as well. Should "Identifiertype:001" be {"Identifiertype": "001"}?

Tags: python python-3.x python-2.7 python-requests


【Solution 1】:

As others have suggested, it is probably easier with the csv module, but it can also be done with a plain string-splitting approach:

delim = "," # Just in case we switch to tsv or something

with open('test.txt', 'r') as file:
    # Create a list of valid headers in comma-separated values and their respective index
    header = [(i, col) for i, col in enumerate(next(file).rstrip().split(delim)) if col.endswith('_ID')]

    # Create a list of data in comma-separated values
    data = [l.rstrip().split(delim) for l in file.readlines()]

    # Go through each record to create a payload
    for record in data:

        # Here we use the header index to retrieve the respective data to create the dictionary with list comprehension
        payload = {'IndividualInfo': [{key: record[i], 'Identifiertype': '001', 'EType':'01'} for i, key in header]}

        # Do whatever you need with json.dumps(payload)

The results look like this:

# the index/header pairs
# [(0, 'Member_ID'), (1, 'User_ID'), (2, 'Proxy_ID'), (3, 'A_ID'), (4, 'Login_ID')]

# the separated data
# [['M1000', 'U1000', 'P1000', 'A1000', 'Jim1', 'Jim', 'Kong'], ['M2000', 'U2000', 'P2000', 'A2000', 'OlilaJ', 'Olila', 'Jayavarman'], ['M3000', 'U3000', 'P3000', 'A3000', 'LisaKop', 'Lisa', 'Kopkingg'], ['M4000', 'U4000', 'P4000', 'A4000', 'KishoreP', 'Kishore', 'Pindhar'], ['M5000', 'U5000', 'P5000', 'A5000', 'Gobi123', 'Gobi', 'Nadar']]

# The payloads
# {'IndividualInfo': [{'Member_ID': 'M1000', 'Identifiertype': '001', 'EType': '01'}, {'User_ID': 'U1000', 'Identifiertype': '001', 'EType': '01'}, {'Proxy_ID': 'P1000', 'Identifiertype': '001', 'EType': '01'}, {'A_ID': 'A1000', 'Identifiertype': '001', 'EType': '01'}, {'Login_ID': 'Jim1', 'Identifiertype': '001', 'EType': '01'}]}
# {'IndividualInfo': [{'Member_ID': 'M2000', 'Identifiertype': '001', 'EType': '01'}, {'User_ID': 'U2000', 'Identifiertype': '001', 'EType': '01'}, {'Proxy_ID': 'P2000', 'Identifiertype': '001', 'EType': '01'}, {'A_ID': 'A2000', 'Identifiertype': '001', 'EType': '01'}, {'Login_ID': 'OlilaJ', 'Identifiertype': '001', 'EType': '01'}]}
# {'IndividualInfo': [{'Member_ID': 'M3000', 'Identifiertype': '001', 'EType': '01'}, {'User_ID': 'U3000', 'Identifiertype': '001', 'EType': '01'}, {'Proxy_ID': 'P3000', 'Identifiertype': '001', 'EType': '01'}, {'A_ID': 'A3000', 'Identifiertype': '001', 'EType': '01'}, {'Login_ID': 'LisaKop', 'Identifiertype': '001', 'EType': '01'}]}
# {'IndividualInfo': [{'Member_ID': 'M4000', 'Identifiertype': '001', 'EType': '01'}, {'User_ID': 'U4000', 'Identifiertype': '001', 'EType': '01'}, {'Proxy_ID': 'P4000', 'Identifiertype': '001', 'EType': '01'}, {'A_ID': 'A4000', 'Identifiertype': '001', 'EType': '01'}, {'Login_ID': 'KishoreP', 'Identifiertype': '001', 'EType': '01'}]}
# {'IndividualInfo': [{'Member_ID': 'M5000', 'Identifiertype': '001', 'EType': '01'}, {'User_ID': 'U5000', 'Identifiertype': '001', 'EType': '01'}, {'Proxy_ID': 'P5000', 'Identifiertype': '001', 'EType': '01'}, {'A_ID': 'A5000', 'Identifiertype': '001', 'EType': '01'}, {'Login_ID': 'Gobi123', 'Identifiertype': '001', 'EType': '01'}]}

Note that I used enumerate() to create the index/header pairs, because it gives you an accurate way to locate the corresponding data even when other columns sit between the _ID columns.
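If the file may contain quoted fields, the same index/header pairing works on top of the csv module as well; a minimal sketch, with the question's sample data inlined via io.StringIO so it runs standalone (a real file object works the same way):

```python
import csv
import io

# Sample data inlined for illustration; substitute an open file object in practice
SAMPLE = """Member_ID,User_ID,Proxy_ID,A_ID,Login_ID,First_Name,Last_Name
M1000,U1000,P1000,A1000,Jim1,Jim,Kong
"""

reader = csv.reader(io.StringIO(SAMPLE))
# Pair each _ID column with its index so the right field is found
# even when other columns sit between the ID columns
header = [(i, col) for i, col in enumerate(next(reader)) if col.endswith('_ID')]

payloads = []
for record in reader:
    payloads.append({'IndividualInfo': [
        {key: record[i], 'Identifiertype': '001', 'EType': '01'}
        for i, key in header
    ]})
```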

Edit:

For Python 2.7, use the following instead (sample on repl.it):

delim = "," # Just in case we switch to tsv or something

with open('test.txt', 'r') as file:
    # Create a list of valid headers in comma-separated values and their respective index
    header = [(i, col) for i, col in enumerate(next(file).rstrip().split(delim)) if col.endswith('_ID')]
    # Create a list of data in comma-separated values
    data = []
    for f in file:
        data.append(f.rstrip().split(delim))

# We're done with reading the file,
# We can proceed outside the `with` context manager from this point

# Go through each record to create a payload
for record in data:

    # Here we use the header index to retrieve the respective data to create the dictionary with list comprehension
    payload = {'IndividualInfo': [{key: record[i], 'Identifiertype': '001', 'EType':'01'} for i, key in header]}

    # Do whatever you need with json.dumps(payload)

【Discussion】:

  • I'm trying to run it, but it gives me an error: ValueError: Mixing iteration and read methods would lose data
  • Runs fine on my Python 3.7. Which version are you on?
  • Oh! I'm on 2.7
  • Ouch. If you're not constrained, I'd recommend upgrading to version 3 whenever possible. Either way, I've updated the answer. It has been tested on repl.it, so it should be fine now.
  • This works great! Thank you so much :) Also, how could we parallelize the process so that larger files with over a million records run faster?
【Solution 2】:

Use DictReader to get the headers from the file:

import csv
with open('names.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    print reader.fieldnames # gets you file header
    for row in reader:
        Member_ID = row["Member_ID"]
        User_ID = row["User_ID"]
        Proxy_ID = row.get("Proxy_ID", "")
        A_ID = row.get("A_ID", "")

        if Proxy_ID:
            ....
        else:
            ....
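A sketch of how the DictReader idea above could be extended to build the payload for whatever *_ID columns the file happens to have, instead of hard-coding each one (the Identifiertype/EType constants come from the question; the sample data is inlined so the snippet runs standalone):

```python
import csv
import io

# Sample input without a Proxy_ID column, inlined for illustration
SAMPLE = """Member_ID,User_ID,A_ID,Login_ID,First_Name,Last_Name
M1000,U1000,A1000,Jim1,Jim,Kong
"""

payloads = []
reader = csv.DictReader(io.StringIO(SAMPLE))
for row in reader:
    # reader.fieldnames preserves the file's column order
    entries = [{name: row[name], 'Identifiertype': '001', 'EType': '01'}
               for name in reader.fieldnames if name.endswith('_ID')]
    payloads.append({'IndividualInfo': entries})
```

Because the entries are derived from reader.fieldnames, a file with 4 ID columns yields 4 entries and one with 5 yields 5, with no code change.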

【Discussion】:

【Solution 3】:

You can use pandas' to_json(orient='records'):

import pandas as pd

df = pd.read_csv('Documents/Onboarding_sample.csv')
df.to_json(orient='records')

This will output as many records as there are IDs in the file:

[{"Member_ID":"M1000","User_ID":"U1000","A_ID":"A1000","Login_ID":"Jim1","First_Name":"Jim","Last_Name":"Kong"},...,{"Member_ID":"M2000","User_ID":"U2000","A_ID":"A2000","Login_ID":"OlilaJ","First_Name":"Olila","Last_Name":"Jayavarman"}]
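Note that each of those records still carries every column, not just the IDs. To reshape them into the nested per-ID structure from the question, the JSON string can be parsed back and filtered; a sketch of that post-processing step (the input string here is an abbreviated, single-record version of the output above):

```python
import json

# One flat record per row, as produced by df.to_json(orient='records')
records_json = ('[{"Member_ID":"M1000","User_ID":"U1000","A_ID":"A1000",'
                '"Login_ID":"Jim1","First_Name":"Jim","Last_Name":"Kong"}]')

payloads = []
for record in json.loads(records_json):
    payloads.append({'IndividualInfo': [
        {key: value, 'Identifiertype': '001', 'EType': '01'}
        for key, value in record.items() if key.endswith('_ID')
    ]})
```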
    

【Discussion】:

  • This approach will output as many records as there are IDs in the file.
  • No, it won't do that!
  • I see it now; so if there are multiple users, there will be multiple individual-info entries in the list?
【Solution 4】:

You can do it like this:

...
with open('Documents/Onboarding_sample.csv') as f:
    rows = [line.strip().split(',') for line in f.readlines()]

payload = [{key: val for key, val in zip(rows[0], row) if key.endswith('_ID')}
           for row in rows[1:]]
...

Or, using the csv module:

import csv

...
with open('Documents/Onboarding_sample.csv') as f:
    rows = [row for row in csv.reader(f)]

payload = [{key: val for key, val in zip(rows[0], row) if key.endswith('_ID')}
           for row in rows[1:]]
...
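Note that this produces one flat dict per row rather than the nested IndividualInfo structure from the question; the same zip/filter idea can be wrapped per row to get that shape. A sketch with the sample data inlined so it runs standalone:

```python
import csv
import io

SAMPLE = """Member_ID,User_ID,Proxy_ID,A_ID,Login_ID,First_Name,Last_Name
M1000,U1000,P1000,A1000,Jim1,Jim,Kong
M2000,U2000,P2000,A2000,OlilaJ,Olila,Jayavarman
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
# One payload per data row; rows[0] is the header row
payloads = [
    {'IndividualInfo': [
        {key: val, 'Identifiertype': '001', 'EType': '01'}
        for key, val in zip(rows[0], row) if key.endswith('_ID')
    ]}
    for row in rows[1:]
]
```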
    

【Discussion】:

  • No, this isn't what I want. Your logic outputs one JSON object for every single row in the file.
【Solution 5】:

import json
import requests

delim = ","  # Just in case we switch to tsv or something
headers = {'content-type': 'application/json'}

with open('Documents/Onboarding_sample.csv', 'r') as file:
    listcolumns = next(file).rstrip().split(delim)
    sub = "ID"
    for line in file:
        line_list = line.rstrip().split(delim)

        # Build a fresh payload for each line, pairing every header with its value
        payload = {"IndividualInfo": []}
        for header, val in zip(listcolumns, line_list):
            if header.lower().endswith(sub.lower()):
                payload["IndividualInfo"].append(
                    {
                        header: val,
                        "Identifiertype": "001",
                        "EType": "01"
                    }
                )

        try:
            r = requests.post("http://www.google.com/blahblah", data=json.dumps(payload), timeout=(1, 20), headers=headers)
            r.raise_for_status()
        except requests.exceptions.HTTPError as errh:
            print "HTTP Error:%s" % errh
        except requests.exceptions.ConnectionError as errc:
            print "Error Connecting:%s" % errc
        except requests.exceptions.Timeout as errt:
            print "Timeout error:%s" % errt

        print "This is a JSON object."
        print payload

The above solution is not ideal; it would be better to combine it with the csv DictReader as suggested in the other answers, and then filter out the IDs:

import csv
with open('names.csv') as csvfile:
    reader = csv.DictReader(csvfile)
    print reader.fieldnames # gets you the file header
    for row in reader:
        for k in row:
            if k.lower().endswith('id'):
                  ....

【Discussion】:

  • Building the entire payload in each "else" clause?
  • Actually I hadn't noticed that all the IDs get the same treatment. A simple if is enough. The above solution isn't ideal; it's probably better to combine it with csv as suggested in the other answers and then filter out the IDs.