【Question title】: Reformat a text file to csv using bash script
【Posted】: 2017-11-06 21:04:35
【Question】:

I have a file (exOut.txt) containing a few thousand lines of text in the following format:

[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405
[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.497312, total=11.0min
[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499232, total=11.0min
[Parallel(n_jobs=-2)]: Done   2 out of   6 | elapsed: 11.0min remaining: 22.0min
[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499762, total=11.1min
[Parallel(n_jobs=-2)]: Done   3 out of   6 | elapsed: 11.1min remaining: 11.1min
[CV]  solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.449309, total=19.6min
[Parallel(n_jobs=-2)]: Done   4 out of   6 | elapsed: 19.6min remaining: 9.8min
[CV]  solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.449831, total=19.7min
[CV]  solver=newton-cg, penalty=l2, multi_class=ovr, max_iter=187.637633813, C=0.778324314482, score=0.451609, total=19.7min
[Parallel(n_jobs=-2)]: Done   6 out of   6 | elapsed: 19.7min remaining:    0.0s
[Parallel(n_jobs=-2)]: Done   6 out of   6 | elapsed: 19.7min finished
...

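I am trying to write a shell script that takes this file and reformats it into a new CSV file, keeping only the lines that have a "score" attribute. The result should look something like: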

solver,penalty,multi_class,max_iter,C,score
sag,l2,multinomial,187.638,0.312,0.497
sag,l2,multinomial,187.638,0.312,0.499
sag,l2,multinomial,187.638,0.312,0.500
newton-cg,l2,ovr,187.638,0.779,0.449
newton-cg,l2,ovr,187.638,0.779,0.450
newton-cg,l2,ovr,187.638,0.779,0.450

If possible, all values should be rounded to the nearest thousandth.

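Eventually, I would like to take this CSV and produce a condensed version by identifying records in which every field except "score" is equal, and replacing those records with a single record holding the average score for those parameters. For example: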

solver,penalty,multi_class,max_iter,C,avg_score
sag,l2,multinomial,187.638,0.312,0.499
newton-cg,l2,ovr,187.638,0.779,0.450

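Any help is appreciated! I'm no expert with regular expressions, which is mainly why I'm asking.

Edit 1: Thanks for the feedback, here is some more information:

So far I have tried various scripts using grep, awk, and sed, including grep '=.*,' exOut.txt, which only identifies one large occurrence of the pattern rather than the multiple fields, and sed 's/^[^\=]*\=//g' exOutput.txt > firstCSV.csv, which only cleans up the first part of each line.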
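
The "one large occurrence" behaviour comes from greedy matching: `=.*,` runs from the first `=` on the line to the last `,`. A minimal sketch of the difference, written in Python rather than grep purely for illustration (the sample line is copied from the log above; the per-field pattern is one possible choice, not the only one):

```python
import re

line = ("[CV]  solver=sag, penalty=l2, multi_class=multinomial, "
        "max_iter=187.637633813, C=0.31181629405, score=0.497312, total=11.0min")

# Greedy: ".*" extends to the last comma on the line, giving one large match.
greedy = re.findall(r"=.*,", line)

# Per-field: capture each key, then its value up to the next comma or space.
fields = re.findall(r"(\w+)=([^,\s]+)", line)

print(len(greedy))  # number of greedy matches
print(fields)       # list of (key, value) pairs
```

Restricting the value to "anything except a comma or whitespace" is what stops each match at the end of its own field.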

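【Comments】:

  • Welcome to Stack Overflow. This is not a code-writing service where you post your requirements and language of choice and someone writes the code for you. We are more than happy to help, but we do expect you to make an effort to solve the problem yourself first and to include that effort in your question. Please edit your question to show the code you have tried yourself before asking here. If you need more information, see How to Ask.
  • This should be straightforward with awk. Give it a try, and if you still can't get it working, post your code and a sample of the output it gives you. (And awk is well worth the time to learn.)
  • If you want a bash script, why is this tagged python?
  • The data file was generated by a Python program using scikit-learn, so I figured someone who had produced this kind of log file before might have an idea... but you're right, a python tag may be confusing.

Tags: regex bash shell csv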


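【Solution 1】:

1. For the first problem (extracting the scored lines), see the parseRawDataFile function.

2. For the second problem (averaging the scores), see the parseCsvDataFile function.

3. Note that the code contains some hard-coded values, such as the file paths.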

import codecs
from collections import OrderedDict

# Hard-coded paths: adjust these to your environment.
datausage = r"e:\temp\111.txt"
storagePath = r"e:\temp\222.csv"
storagePath_New = r"e:\temp\222_New.csv"
strStr = "score"

def parseRawDataFile(path):
    """Build the CSV text from the raw log, keeping only lines that contain a score."""
    sumResult = "solver,penalty,multi_class,max_iter,C,score\r\n"
    # Example input line:
    # [CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.497312, total=11.0min

    with open(path) as f:
        for line in f:
            if strStr in line:
                # Turn "k1=v1, k2=v2, ..." into [k1, v1, k2, v2, ...]; the values sit at odd indices.
                tempList = line.replace(",", "=").split("=")
                sumResult += "%s,%s,%s,%.03f,%.03f,%.03f\r\n" % (tempList[1], tempList[3], tempList[5], float(tempList[7]), float(tempList[9]), float(tempList[11]))

    return sumResult

def parseCsvDataFile(path):
    """Average the score over rows whose other five fields are all equal."""
    sumResult = "solver,penalty,multi_class,max_iter,C,avg_score\r\n"
    # Example input line:
    # sag,l2,multinomial,187.638,0.312,0.497

    # Maps (solver, penalty, multi_class, max_iter, C) -> [score sum, row count],
    # kept in order of first appearance.
    groups = OrderedDict()
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("solver"):
                continue  # skip blank lines and the header row
            tempList = line.split(",")
            # Group on every field except the score, not on the solver alone.
            entry = groups.setdefault(tuple(tempList[:5]), [0.0, 0])
            entry[0] += float(tempList[-1])
            entry[1] += 1

    for key, (temptSum, tempNum) in groups.items():
        sumResult += "%s,%.03f\r\n" % (",".join(key), temptSum / tempNum)
    print(sumResult)
    return sumResult

def writeLog(path, data):
    """Write data to path as UTF-8."""
    try:
        with codecs.open(path, 'w', 'utf-8') as f:
            f.write(data)
    except IOError as e:
        print("could not write %s: %s" % (path, e))

result = parseRawDataFile(datausage)
writeLog(storagePath, result)
result = parseCsvDataFile(storagePath)
writeLog(storagePath_New, result)
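
The positional indexing in parseRawDataFile depends on the keys always appearing in the same order. As an alternative, here is a rough sketch that looks values up by key name and works on in-memory lines instead of files. The names FIELDS, PAIR, fmt, parse_lines and average_scores are made up for this illustration, and it assumes every scored line carries all six keys:

```python
import re
from collections import OrderedDict

# Hypothetical names for this sketch; the column order follows the question.
FIELDS = ["solver", "penalty", "multi_class", "max_iter", "C", "score"]
PAIR = re.compile(r"(\w+)=([^,\s]+)")

def fmt(value):
    """Round numeric strings to three decimals; leave other values unchanged."""
    try:
        return "%.3f" % float(value)
    except ValueError:
        return value

def parse_lines(lines):
    """Return one list of formatted field values per line that has a score."""
    rows = []
    for line in lines:
        pairs = dict(PAIR.findall(line))
        if "score" in pairs:
            rows.append([fmt(pairs[name]) for name in FIELDS])
    return rows

def average_scores(rows):
    """Average the last column over rows whose other columns are all equal."""
    groups = OrderedDict()
    for row in rows:
        groups.setdefault(tuple(row[:-1]), []).append(float(row[-1]))
    return [list(key) + ["%.3f" % (sum(v) / len(v))] for key, v in groups.items()]

# Sample lines copied from the log in the question.
sample = [
    "[CV] solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405",
    "[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.497312, total=11.0min",
    "[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499232, total=11.0min",
    "[Parallel(n_jobs=-2)]: Done   2 out of   6 | elapsed: 11.0min remaining: 22.0min",
    "[CV]  solver=sag, penalty=l2, multi_class=multinomial, max_iter=187.637633813, C=0.31181629405, score=0.499762, total=11.1min",
]

rows = parse_lines(sample)
for row in rows + average_scores(rows):
    print(",".join(row))
```

Looking values up by name means that a reordered or extra key (such as total) does not shift the columns. One caveat: fmt treats any string that float() accepts as a number.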
