Parsing through CSV file to convert to JSON format file: answers

【Question title】: Parsing through CSV file to convert to JSON format file
【Posted】: 2013-07-23 04:18:50
【Question description】:

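I have been given the following CSV file, which was extracted from an Excel spreadsheet. To give some background that may help: it lists AGI numbers (think of these as protein identifiers), the unmodified peptide sequences for those protein identifiers, the modified peptide sequences (the unmodified sequences with the modifications applied), the index or indices of those modifications, and then the combined spectral count for repeated peptides. The text file is called MASP.GlycoModReader.txt and the information is in the following format: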

AGI,UnMd Peptide (M) = x,Mod Peptide (oM) = Ox,Index/Indeces of Modification,counts,Combined Spectral count for repeated Peptides

AT1G56070.1,NMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR,NoMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR,2,17
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",1
AT1G56070.1,EAMTPLSEFEDKL,EAoMTPLSEFEDKL,3,7
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",2
AT1G56070.1,EGPLAEENMR,EGPLAEENoMR,9,2
AT1G56070.1,DLQDDFMGGAEIIK,DLQDDFoMGGAEIIK,7,1

The output file that needs to be produced after parsing the above should be in the following format:

AT1G56070.1,{"peptides": [{"sequence": "NMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR", "mod_sequence":    
"NoMSVIAHVDHGKSTLTDSLVAAAGIIAQEVAGDVR" , "mod_indeces": 2, "spectral_count": 17}, {"sequence": 
"LYMEARPMEEGLAEAIDDGR" , "mod_sequence": "LYoMEARPoMEEGLAEAIDDGR", "mod_indeces": [3, 9], 
"spectral_count": 3}, {"sequence": "EAMTPLSEFEDKL" , "mod_sequence": "EAoMTPLSEFEDKL", 
"mod_indeces": [3,9], "spectral_count": 7}, {"sequence": "EGPLAEENMR", "mod_sequence": 
"EGPLAEENoMR", "mod_indeces": 9, "spectral_count": 2}, {"sequence": "DLQDDFMGGAEIIK", 
"mod_sequence": "DLQDDFoMGGAEIIK", "mod_indeces": [7], "spectral_count": 1}]}

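The target format above implies two steps: grouping rows by AGI, and merging repeated peptides by adding their counts (the two LYMEARPMEEGLAEAIDDGR rows with counts 1 and 2 appear as one entry with a count of 3). The following is a minimal Python sketch of that reading, not the asker's code. The function name `group_peptides`, the inline `SAMPLE` rows, and the choice to always store `mod_indeces` as a list are assumptions made for illustration.

```python
import csv
import io
import json

# A few rows copied from the question's sample data (header omitted).
SAMPLE = '''AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",1
AT1G56070.1,EAMTPLSEFEDKL,EAoMTPLSEFEDKL,3,7
AT1G56070.1,LYMEARPMEEGLAEAIDDGR,LYoMEARPoMEEGLAEAIDDGR,"3, 9",2
'''


def group_peptides(lines):
    """Group rows by AGI and sum the counts of repeated peptides.

    Assumption: two rows describe the same peptide when both the
    unmodified and the modified sequence match.
    """
    proteins = {}  # AGI -> {(sequence, mod_sequence): peptide dict}
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        agi, seq, mod_seq, indices, count = row[:5]
        peptides = proteins.setdefault(agi, {})
        key = (seq, mod_seq)
        if key in peptides:
            # Repeated peptide: combine the spectral counts.
            peptides[key]["spectral_count"] += int(count)
        else:
            peptides[key] = {
                "sequence": seq,
                "mod_sequence": mod_seq,
                # "3, 9" -> [3, 9]; int() ignores the surrounding spaces.
                "mod_indeces": [int(i) for i in indices.split(",")],
                "spectral_count": int(count),
            }
    return {agi: {"peptides": list(p.values())} for agi, p in proteins.items()}


grouped = group_peptides(io.StringIO(SAMPLE))
for agi, data in grouped.items():
    print(agi + "," + json.dumps(data))
```

Keying the inner dictionary on the sequence pair keeps the merge to a single pass over the file. Whether a single modification index should be written as `3` or `[3]` is left open by the sample output, so the list form here is only one possible choice.
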
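I have provided my solution below. If anyone can offer a better solution in another language, or can analyze my solution and let me know whether there is a more efficient way to approach this problem, please comment below. Thank you.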

    #!/usr/bin/env node

    var fs = require('fs');
    var csv = require('csv');
    var data ="proteins.csv";

    /* Uses csv nodejs module to parse the proteins.csv file.
    * Parses the csv file row by row and updates the peptide_arr.
    * For new entries creates a peptide object, for similar entries it updates the
    * counts in the peptide object with the same AGI#.
    * Uses a peptide object to store protein ID AGI#, and the associated data.
    * Writes all formatted peptide objects to a txt file - output.txt.
    */

    // Tracks current row
    var x = 0;
    // An array of peptide objects stores the information from the csv file
    var peptide_arr = [];

    // csv module reads row by row from data 
    csv()
    .from(data)
    .to('debug.csv')
    .transform(function(row, index) {
        // For the first entry push a new peptide object with the AGI# (row[0]) 
        if(x == 0) {
        // cur is the current peptide read into row by csv module
        Peptide cur = new Peptide( row[0] );

        // Add the assoicated data from row (1-5) to cur
        cur.data.peptides.push({
            "sequence" : row[1];
            "mod_sequence" : row[2];
            if(row[5]){
            "mod_indeces" : "[" + row[3] + ", " + row[4] + "]";
            "spectral_count" : row[5];  
            } else {
            "mod_indeces" : row[3];
            "spectral_count" : row[4];  
            }
        });

        // Add the current peptide to the array
        peptide_arr.push(cur);
        }

        // Move to the next row
        x++;
    });

    // Loop through peptide_arr and append output with each peptide's AGI# and its data
    String output = "";
    for(var peptide in peptide_arr) 
    {
        output = output + peptide.toString()
    }
    // Write the output to output.txt
    fs.writeFile("output.txt", output);

    /* Peptide Object :
     *  - id:AGI#
     *  - data: JSON Array associated
     */
    function Peptide(id) // this is the actual function that does the ID retrieving and data 
                        // storage
    {
        this.id = id;
        this.data = {
            peptides: []
        };
    }

    /* Peptide methods :
     *  - toJson : Returns the properly formatted string
     */
    Peptide.prototype = {
        toString: function(){
            return this.id + "," + JSON.stringify(this.data, null, " ") + "/n"
        }
    };

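Edit note: when I run the solution I posted, there seems to be a memory leak error; it runs indefinitely but never produces any substantial, readable output. If anyone is willing to help work out why this happens, that would be great.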

【Question discussion】:

This belongs on Code Review rather than SO.

Tags: javascript python scripting

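【Solution 1】:

Does your version work? It looks like you only ever create one Peptide object. Also, what is the "if(row[5])" statement doing? In your sample data there are always 5 elements. And mod_indeces is always supposed to be a list, right? I ask because in your sample output file mod_indeces is not a list for the first peptide. Anyway, here is what I came up with in Python: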

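import csv
import json

# Maps each AGI identifier to {'peptides': [...]}
data = {}
# newline='' lets the csv module handle line endings itself (Python 3)
with open('proteins.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        name = row[0]
        sequence = row[1]
        mod_sequence = row[2]
        # "3, 9" -> [3, 9]; list() is needed because map is lazy in Python 3
        mod_indeces = list(map(int, row[3].split(', ')))
        spectral_count = int(row[4])
        peptide = {'sequence':sequence,'mod_sequence':mod_sequence,
                   'mod_indeces':mod_indeces,'spectral_count':spectral_count}
        if name in data:
            data[name]['peptides'].append(peptide)
        else:
            data[name] = {'peptides':[peptide]}

# One line per protein: the AGI, a comma, then the JSON object.
# newline='' writes '\n' exactly as given, with no platform translation.
with open('output.txt', 'w', newline='') as f:
    for protein in data:
        f.write(protein)
        f.write(',')
        f.write(json.dumps(data[protein]))
        f.write('\n')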

If you are on Windows and want to view the file as plain text, you may need to replace '\n' with '\r\n' or os.linesep.
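
As a small illustration of the line-ending point, here is a sketch that assumes Python 3, where `open` accepts a `newline` argument that controls how each '\n' is written. The helper name `write_lines` and the temporary output path are made up for this example.

```python
import os
import tempfile


def write_lines(path, lines, newline="\r\n"):
    """Write each line followed by the chosen line ending (hypothetical helper)."""
    # With newline="\r\n", every "\n" written is translated to "\r\n".
    # With newline="", "\n" would be written unchanged.
    with open(path, "w", newline=newline) as f:
        for line in lines:
            f.write(line + "\n")


path = os.path.join(tempfile.mkdtemp(), "output.txt")
write_lines(path, ['AT1G56070.1,{"peptides": []}'])

# Read the bytes back to see which line ending ended up on disk.
with open(path, "rb") as f:
    raw = f.read()
print(repr(raw))
```

Passing the line ending to `open` keeps the writing loop unchanged, so the same script can target either convention by changing one argument.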

If you want to skip some rows (a header, for example), you can do this:

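import csv
import json

# Maps each AGI identifier to {'peptides': [...]}
data = {}
rows_to_skip = 1  # number of leading rows (e.g. the header) to ignore
rows_read = 0
# newline='' lets the csv module handle line endings itself (Python 3)
with open('proteins.csv', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        if rows_read >= rows_to_skip:
            name = row[0]
            sequence = row[1]
            mod_sequence = row[2]
            # "3, 9" -> [3, 9]; list() is needed because map is lazy in Python 3
            mod_indeces = list(map(int, row[3].split(', ')))
            spectral_count = int(row[4])
            peptide = {'sequence':sequence,'mod_sequence':mod_sequence,
                       'mod_indeces':mod_indeces,'spectral_count':spectral_count}
            if name in data:
                data[name]['peptides'].append(peptide)
            else:
                data[name] = {'peptides':[peptide]}
        rows_read += 1

# One line per protein: the AGI, a comma, then the JSON object.
# newline='' writes '\n' exactly as given, with no platform translation.
with open('output.txt', 'w', newline='') as f:
    for protein in data:
        f.write(protein)
        f.write(',')
        f.write(json.dumps(data[protein]))
        f.write('\n')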
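
The counter-based skipping above can also be expressed with `itertools.islice`, which drops the first rows before the loop body ever sees them. This is an alternative sketch, not the answer's own method. The function name `read_rows` is hypothetical, and `SAMPLE` uses a shortened header invented for the example.

```python
import csv
import io
from itertools import islice

# Shortened, made-up header followed by two rows from the question's data.
SAMPLE = (
    "AGI,UnMd Peptide,Mod Peptide,Index,counts\n"
    "AT1G56070.1,EGPLAEENMR,EGPLAEENoMR,9,2\n"
    "AT1G56070.1,DLQDDFMGGAEIIK,DLQDDFoMGGAEIIK,7,1\n"
)


def read_rows(lines, rows_to_skip=1):
    """Return the parsed rows after skipping the first rows_to_skip rows."""
    reader = csv.reader(lines)
    # islice(reader, n, None) lazily discards the first n rows;
    # the trailing "if row" also drops any blank lines.
    return [row for row in islice(reader, rows_to_skip, None) if row]


rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

This removes the `rows_read` bookkeeping and one level of indentation from the loop, at the cost of an extra import.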

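If you want the dictionary keys to come out in a particular order, you can use an OrderedDict instead of a regular dict. Just replace the peptide line with the following: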

peptide = OrderedDict([('sequence',sequence),
                       ('mod_sequence',mod_sequence),
                       ('mod_indeces',mod_indeces),
                       ('spectral_count',spectral_count)])

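Now the order is preserved: sequence, followed by mod_sequence, followed by mod_indeces, followed by spectral_count. To change the order, just change the order of the elements in the OrderedDict.

Note that you also have to add from collections import OrderedDict in order to use OrderedDict.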
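
To show the effect on the serialized output, here is a short self-contained sketch. The peptide values are taken from the question's sample data, and the variable name `encoded` is chosen just for this example.

```python
import json
from collections import OrderedDict

# Keys are listed in the order they should appear in the JSON text.
peptide = OrderedDict([
    ("sequence", "EGPLAEENMR"),
    ("mod_sequence", "EGPLAEENoMR"),
    ("mod_indeces", [9]),
    ("spectral_count", 2),
])

# json.dumps writes the keys in the OrderedDict's insertion order.
encoded = json.dumps(peptide)
print(encoded)
# {"sequence": "EGPLAEENMR", "mod_sequence": "EGPLAEENoMR", "mod_indeces": [9], "spectral_count": 2}
```

On Python 3.7 and later a plain dict also keeps insertion order, so the OrderedDict mainly matters on older interpreters or when the ordering intent should be explicit.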

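【Discussion】:

Thank you, Matthew! I saved your script as a Python file and ran it from the terminal on Mac OS X. I got the following error, which may be down to how I ran it, but I will post it anyway:
Traceback (most recent call last): File "/Users/zsyed/PythonPeptideJSON.py", line 8, in <module> for row in reader: _csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
Thanks for the feedback on my program. I will think over what you said and give it another try.
Strange, I don't have that problem. People here say that opening the file in 'rU' mode seems to fix it, so give that a try.
Awesome, that worked, but unfortunately I have run into another error :\. Sorry to bother you: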