Converting a row-wise CSV file to JSON in Python: answers

【Question title】: Converting a row-wise csv file to json in python
【Posted】: 2015-12-02 10:53:49
【Question】:

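I have this input file and I want to convert it to JSON.

1.] As you can see, the key: value pairs are laid out row-wise rather than column-wise.

2.] Each record has a "Comments" key whose value is spread across several rows, because some users write very long comments.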

key,values

heading,A
Title,1
ID,12
Owner,John
Status,Active
Comments,"Im just pissed "
        ,"off from your service"
,
heading,B
Title,2
ID,21
Owner,Von
Status,Active
Comments,"Service is  "
        ,"really great"
        ,"I just enjoyed my weekend"
,
heading,C
Title,3
ID,31
Owner,Jesse
Status,Active
Comments,"Service"
        ,"needs to be"
        ,"improved"

Desired output:

[{'heading':'A','Title':1,'ID':12,'Owner':'John','Status':'Active','Comments':'Im just pissed off from your service'},
{....},
{.....}]

Because my CSV file has its "key": "value" pairs laid out row-wise, I really don't know how to convert it to JSON.

===== What I tried =====

import csv

f = open('csv_sample.csv')
reader = csv.DictReader(f, fieldnames=("key", "value"))
for i in reader:
    print(i)


{'value': 'values', 'key': 'key'}
{'value': 'A', 'key': 'heading'}
{'value': '1', 'key': 'Title'}
{'value': '12', 'key': 'ID'}
{'value': 'John', 'key': 'Owner'}
{'value': 'Active', 'key': 'Status'}

As you can see, this is not what I want. Please help.

【Question comments】:

Do you want the result to be {'key':'values','heading': 'A' ...

Tags: python json csv

【Solution 1】:

Edit: maybe try something like this:

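import json

def headingGen(lines):
    """Yield one dict per record from the row-wise "key,value" lines."""
    newHeading = {}
    prevk = None  # most recent non-blank key; continuation lines extend it
    for line in lines:
        try:
            # Split on the first comma only, so commas inside the value survive.
            k, v = line.strip('\n').split(',', 1)
            v = v.strip('"')
            if not k and not v:
                # A lone "," separates two records.
                yield newHeading
                newHeading = {}
            elif not k.strip():
                # Blank key: continuation of the previous value.
                newHeading[prevk] = newHeading[prevk] + v
            else:
                prevk = k
                newHeading[k] = v
        except Exception as e:
            print("I had a problem at line "+line+" : "+str(e))
    yield newHeading


def file_to_json(filename):
    with open(filename, 'r') as fh:
        next(fh)  # skip the "key,values" header
        next(fh)  # skip the blank line that follows it
        return json.dumps(list(headingGen(fh)))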

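【Discussion】:

I updated the post: each record has a "Comments" key whose value is spread across several rows, because some users write lengthy comments.
Thanks Dain, but I get a "too many values to unpack" error on k, v = line.strip('\n').split(',')
If that is because the comments contain commas, try k, v = line.strip('\n').split(',', 1)
Now I get this: UnicodeDecodeError: 'utf8' codec can't decode byte 0x85 in position 6: invalid start byte
The problem is with comments of this kind (with commas inside them), and almost all comments look like this: "For special reasons, we may sometimes need to shut down our service" [i.e. comments in ", , ," format]. This causes the "too many values to unpack" error.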
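
Both errors reported in these comments can be reproduced on a single line. The sketch below is only an illustration under assumptions: the sample line is made up, and the cp1252 (Windows-1252) encoding is a guess, since the thread never says how the real file was encoded (byte 0x85 happens to be the ellipsis character in cp1252). It shows that a naive split on every comma breaks a quoted comment apart, while csv.reader keeps it as one field.

```python
import csv
import io

# Hypothetical line: a quoted comment containing commas and the byte 0x85.
raw = b'Comments,"Closed today\x85 sorry, back soon, promise"\n'

# Decoding as UTF-8 fails, because 0x85 cannot start a UTF-8 sequence.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Assumption: the file is really cp1252, where 0x85 is an ellipsis.
text = raw.decode('cp1252')

# Splitting on every comma cuts the comment into pieces ...
naive_parts = text.strip('\n').split(',')

# ... whereas the csv module honours the quotes and returns two fields.
csv_fields = next(csv.reader(io.StringIO(text)))

print(utf8_ok)
print(len(naive_parts))
print(csv_fields)
```

When reading a real file in Python 3, the same guess would be expressed as open(filename, encoding='cp1252'); the right value depends on how the file was actually produced.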

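【Solution 2】:

This is not a trivial conversion, so we first need to specify it precisely:

The input file is a CSV file with two columns named key and values
A record is made up of several rows, each defining one key and one value of the mapping
The key heading marks the start of a record
A blank key marks a continuation line: its value should be appended to the previous value
If the continuation value does not start with a separator and the previous value does not end with one, a space is inserted (separators are space, tab, dot, comma and -)
The heading field cannot have continuation lines, which allows simpler decoding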
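
Taken in isolation, the spacing rule for continuation lines can be sketched as a small helper. The names join_continuation and SEPARATORS are hypothetical and do not appear in the answer; the helper only illustrates the rule as specified above.

```python
# Separators named in the specification: space, tab, comma, dot and hyphen.
SEPARATORS = ' \t,.-'

def join_continuation(previous, continuation):
    """Append a continuation value, inserting a space only when needed."""
    # Slicing ([-1:], [:1]) instead of indexing keeps empty strings safe:
    # '' is contained in any string, so no space is added for empty values.
    if previous[-1:] in SEPARATORS or continuation[:1] in SEPARATORS:
        return previous + continuation
    return previous + ' ' + continuation

print(join_continuation('Service', 'needs to be'))
print(join_continuation('Im just pissed ', 'off from your service'))
```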

The code could be:

import csv
import json

with open('csv_sample.csv') as fd:
    rd = csv.DictReader(fd)
    rec = None
    lastkey = None
    sep = ' \t,.-'
    for row in rd:
        # print(row)
        key = row['key'].strip()
        if key == 'heading':
            if rec is not None:
                # process previous record
                print(json.dumps(rec))
            rec = {key: row['values']}
        elif key == '':  # continuation line (or the lone "," between records)
            # no space if the previous value ends with a separator
            # or the continuation value starts with one
            if (rec[lastkey][-1:] in sep) or (row['values'][:1] in sep):
                rec[lastkey] += row['values']
            else:
                rec[lastkey] += ' ' + row['values']
        else:
            # normal field: add it to rec and store key
            rec[key] = row['values']
            lastkey = key
    # process last record
    if rec is not None:
        print(json.dumps(rec))

You can easily turn this into a generator by putting the code in a function and replacing each print of json.dumps(rec) with yield json.dumps(rec).

With your example, it gives:

{"Status": "Active", "Title": "1", "Comments": "Im just pissed off from your service", "heading": "A", "Owner": "John", "ID": "12"}
{"Status": "Active", "Title": "2", "Comments": "Service is  really great I just enjoyed my weekend", "heading": "B", "Owner": "Von", "ID": "21"}
{"Status": "Active", "Title": "3", "Comments": "Service needs to be improved", "heading": "C", "Owner": "Jesse", "ID": "31"}

Because this code uses the csv module, it is by construction unaffected by commas inside the comments.
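
The generator variant mentioned above might look like the following sketch. It assumes Python 3; the function name iter_records and the inline sample are illustrative and not part of the answer. Unlike the answer's code, it yields plain dicts and leaves the json.dumps call to the caller, and it silently skips rows that appear before the first heading.

```python
import csv
import io
import json

SEP = ' \t,.-'  # separators after or before which no space is inserted

def iter_records(lines):
    """Yield one dict per record from row-wise "key,values" CSV lines."""
    rec, lastkey = None, None
    for row in csv.DictReader(lines):
        key = (row['key'] or '').strip()
        value = row['values'] or ''
        if key == 'heading':
            if rec is not None:
                yield rec              # previous record is complete
            rec, lastkey = {key: value}, key
        elif rec is None:
            continue                   # ignore rows before the first heading
        elif key == '':                # continuation line or lone "," row
            if not (rec[lastkey][-1:] in SEP or value[:1] in SEP):
                rec[lastkey] += ' '
            rec[lastkey] += value
        else:                          # normal field
            rec[key] = value
            lastkey = key
    if rec is not None:
        yield rec                      # last record

# Shortened, hand-written version of the sample from the question.
sample = io.StringIO(
    'key,values\n'
    '\n'
    'heading,A\n'
    'Title,1\n'
    'Comments,"Im just pissed "\n'
    '        ,"off from your service"\n'
    ',\n'
    'heading,B\n'
    'Title,2\n'
    'Comments,"Service"\n'
    '        ,"needs to be"\n'
    '        ,"improved"\n'
)
records = list(iter_records(sample))
for record in records:
    print(json.dumps(record))
```

Any iterable of lines works as input, so an open file object can be passed in place of the StringIO sample.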

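【Discussion】:

Line 7: KeyError: 'key'
@shalini: I cannot reproduce this under Python 2.7. What is your Python version, and what is your OS? Or could you show your real input, since I can see a UnicodeDecodeError in a comment on another answer?
Could you uncomment the print row line and show what is displayed before the error?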

【Solution 3】:

Try this:

def convert_to_json(fname):
    result = []
    rec = {}
    with open(fname) as f:
        for l in f:
            if not l.strip() or l.startswith('key'):
                continue

            if l.startswith(','):
                result.append(rec)
                rec = {}
            else:
                # split on the first comma only, so commas in comments survive
                k, v = l.strip().split(',', 1)
                if k.strip():
                    try:
                        rec[k] = int(v)
                    except ValueError:
                        rec[k] = v.strip('"')
                else:
                    rec['Comments'] += v.strip('"')
    result.append(rec)
    return result

print(convert_to_json('./csv_sample.csv'))

Output:

[{'Status': 'Active', 'Title': 1, 'Comments': 'Im just pissed off from your service', 'heading': 'A', 'Owner': 'John', 'ID': 12}, {'Status': 'Active', 'Title': 2, 'Comments': 'Service is  really greatI just enjoyed my weekend', 'heading': 'B', 'Owner': 'Von', 'ID': 21}, {'Status': 'Active', 'Title': 3, 'Comments': 'Serviceneeds to beimproved', 'heading': 'C', 'Owner': 'Jesse', 'ID': 31}]
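
Note that the output above is the Python repr of a list of dicts (single quotes), not JSON text. A minimal sketch of the remaining step, using one hand-written record in place of the function's return value:

```python
import json

# Stand-in for the list returned by convert_to_json(); hand-written here.
records = [{'heading': 'A', 'Title': 1, 'ID': 12, 'Owner': 'John',
            'Status': 'Active',
            'Comments': 'Im just pissed off from your service'}]

# json.dumps produces real JSON text (double quotes, no Python repr).
text = json.dumps(records)
print(text)

# The text parses back to the same data.
round_trip = json.loads(text)
```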

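【Discussion】:

【Solution 4】:

This answer uses Python list comprehensions to offer a functional style, as an alternative to the other (also good) answers, which use an imperative style. I like this style because it cleanly separates the different aspects of the problem.

The nested list comprehension first splits the input into sections, then builds a dictionary from each section: a regular expression splits the section into items, and the function split_item() is applied to each item to obtain the key/value pair.

The source data is read section by section for memory efficiency.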

import re
import json

# Define a regular expression splitting a section into items.
# Each newline which is not followed by whitespace splits.
splitter = re.compile(r'\n(?!\s)')

def section_generator(f):
    # Generator reading a single section from the input file in each iteration.
    # The sections are separated by a comma on a separate line.
    section = ''
    for line in f:
        if line == ',\n':
            yield section
            section = ''
        else:
            section += line
    yield section

def split_item(item):
    # Convert the item containing "key,value" into a key/value pair.
    key, value = item.split(',', 1)
    if value.startswith('"'):
        # Convert multiline quoted string to unquoted single line.
        value = ''.join(line.strip().lstrip(',').strip('"')
                        for line in value.strip().splitlines())
    elif value.isdigit():
        # Convert numeric value to int.
        value = int(value)
    return key, value

with open('csv_sample.csv') as f:
    # Ignore the "header" (skip everything until the empty line is found).
    for line in f:
        if line == '\n':
            break

    # Construct the resulting list of dictionaries using list comprehensions.
    result = [dict(split_item(item) for item in splitter.split(section) if item)
              for section in section_generator(f)]

print(json.dumps(result))
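
To see what the splitter regular expression does on its own, here is a small sketch applied to one hand-written section (record C from the question, shortened). The variable names section and items are only for illustration.

```python
import re

# Same pattern as in the answer: split at each newline that is
# not followed by whitespace, so indented continuation lines stay attached.
splitter = re.compile(r'\n(?!\s)')

section = ('heading,C\n'
           'Title,3\n'
           'Comments,"Service"\n'
           '        ,"needs to be"\n'
           '        ,"improved"\n')

# The final newline produces an empty trailing piece, which is filtered out.
items = [item for item in splitter.split(section) if item]

for item in items:
    print(repr(item))
```

The third item still contains the two indented continuation lines, which is what lets split_item() treat the multi-line comment as a single value.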

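【Discussion】:

Did you try it before editing? It is correct. It uses a list comprehension. If you don't understand something, please ask before editing.
I reverted to the original version. A list comprehension is more efficient than a loop, and I believe it is also more readable (once you are used to it).
I just moved the key/value splitting and conversion into a separate function. Hopefully that improves readability.
Since the example was changed to include multi-line quoted strings in "Comments", I updated the code to handle that correctly, while trying to stick with list comprehensions so as to offer an alternative to the other answers here (some of which I like too).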