Splitting a large text file into columns

【Question title】: Splitting a large text file into columns
【Posted】: 2012-02-09 19:51:22
【Question】:

Solved: a good friend of mine wrote the following program for me:

filename="my_input_file"
context="channel"               # this is the key which separates the blocks in the input file
desired_column_separator=","    # this will separate the columns in the output file
output_prefix="modified_"       # prefix for the output file


# refuse to run if ./tmp already exists, since it is deleted at the end
if [ -d ./tmp ]
then
    echo " "
    echo "***WARNING***"
    echo "I want to use and delete a ./tmp/ directory, but one already exists... please remove/rename it, or alter my code***"
    echo " "
    exit 1
fi


mkdir ./tmp
cd ./tmp

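# split the input into xx0000, xx0001, ... at every line matching $context
csplit -z -n 4 "../$filename" "/$context/" '{*}' 1> /dev/null

filenum=$(ls -1 ./ | wc -l)      # number of blocks produced by csplit
limit=$((filenum - 1))           # index of the last block
lines=$(wc -l < xx0000)          # lines per block, taken from the first block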

touch tmp.dat


for j in $(seq 1 "$lines")
do
    oldstring=''

    # take line $j from every block and join the pieces with the separator
    for i in $(seq 0 "$limit")
    do
        inputNo=$(printf "%04d" "$i")
        string=$(head -n "$j" "xx$inputNo" | tail -n 1)
        oldstring=$oldstring$string$desired_column_separator
    done

    # strip carriage returns and newlines left over from the input
    finalstring=$(echo "$oldstring" | tr -d '\r' | tr -d '\n')

    echo "working on line $j out of $lines"
    echo -n "$finalstring" >> tmp.dat
    echo -e "\r" >> tmp.dat      # end the row with CR+LF
done

mv tmp.dat ../$output_prefix$filename
cd ..
rm -r -f ./tmp/

echo "...done!"

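Original question: I know splitting text files has been done to death on this forum, but I can't find an approach specific to my problem. I want to split one large file (>200 MB) into columns at a particular line of text, but the `split` function puts each column in its own file. Frankly, 3,000+ separate text files are hard to load into other programs. Beyond that, I would also like to extract part of the text file to use as the header for my data (the last part of line 4). The initial file consists of a single column, like this: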

channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:08:02.311422
dt:
0.000000
data:
-8.000000E-4
-8.000000E-4
-1.600000E-3
... (9,994 lines omitted)
-2.400000E-3
-1.600000E-3
-2.400000E-3
channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:33:11.169533
dt:
0.000000
data:
-8.000000E-4
-1.600000E-3
-1.600000E-3
... (another 9,997 lines omitted)

I would like it to look like this:

channel names:                     channel names:
03/02/2012 12:03:03 - TDS3k(CH1)   03/02/2012 12:03:03 - TDS3k(CH1)
start times:                       start times:
03/02/2012 12:08:02.311422         03/02/2012 12:33:11.169533
dt:                                dt:
0.000000                           0.000000
data:                              data:
-8.000000E-4                       -8.000000E-4   ...
-8.000000E-4                       -1.600000E-3   ...
-1.600000E-3                       -1.600000E-3   ...
...                                ...

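I suspect that splitting in the right places is easier than handling the headers, but I haven't managed to do either.

Thanks in advance.

Edit: I'm not using any particular language yet. I just need the data in a format I can analyze in R. I'll take any workable suggestion you come up with.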

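【Comments】:

Which programming language do you want to use?
You know, you might want to give at least a hint about the tools (language/spreadsheet/database/other) you're using for this.
I'm not using any particular language. A friend suggested both `sed` and `awk`, but I couldn't get them to work. I will load the result into R at some point.

Tags: split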

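【Solution 1】:

What language are you using? How many "data" values are there per entry?

With Python, the easiest approach is to first break the data into "entries", then write a parse function that, for each entry, produces only the values you want to see in the final output. Then simply join the final output, or write it with the csv module.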

input = """channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:33:11.169533
dt:
0.000000
data:
-8.000000E-4
-1.600000E-3
-1.600000E-3
channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:33:11.169533
dt:
0.000000
data:
-8.000000E-4
-1.600000E-3
-1.600000E-3
"""

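LINES_PER_ENTRY = 10

def parseEntry(entry):
    # placeholder: return only the values wanted in the final output
    return entry

# splitlines() avoids the empty trailing element that split('\n') leaves behind
raw = input.splitlines()

# chunk the flat list of lines into fixed-size entries
entries = [raw[i*LINES_PER_ENTRY:(i+1)*LINES_PER_ENTRY] for i in range(len(raw) // LINES_PER_ENTRY)]

parsed_entries = [parseEntry(entry) for entry in entries]

# write one tab-separated line per entry
with open('outfile.txt', 'w') as outfile:
    for parsed_entry in parsed_entries:
        outfile.write('\t'.join(parsed_entry) + "\n")
print(parsed_entries)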
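
As a rough illustration of the "parse function plus csv module" idea, here is a minimal sketch of what a non-trivial parse function might look like. The names `parse_entry` and `sample_entry` are made up for this example, and the choice of header fields is an assumption: it takes the channel label after " - " on the second line and the start time on the fourth line, based on the sample layout in the question.

```python
import csv
import io

def parse_entry(entry):
    """Return [label, start_time, value, value, ...] for one entry.

    Assumes the layout shown in the question: line 2 ends with the
    channel label after " - ", line 4 is the start time, and the
    numeric samples follow the "data:" line.
    """
    label = entry[1].rsplit(" - ", 1)[-1]
    start_time = entry[3]
    data_start = entry.index("data:") + 1
    values = [float(v) for v in entry[data_start:]]
    return [label, start_time] + values

# one entry, already split into lines (hypothetical sample data)
sample_entry = [
    "channel names:",
    "03/02/2012 12:03:03 - TDS3k(CH1)",
    "start times:",
    "03/02/2012 12:33:11.169533",
    "dt:",
    "0.000000",
    "data:",
    "-8.000000E-4",
    "-1.600000E-3",
    "-1.600000E-3",
]

row = parse_entry(sample_entry)

# csv.writer takes care of delimiters and quoting; StringIO stands in for a file
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue())
```

Looking up the "data:" line instead of hard-coding its position keeps the function usable if the number of header lines changes.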

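【Comments】:

I'm trying to do the analysis in R, but I first need to get the data into the right format. I'll go with whatever language works. A friend suggested `sed` and `awk`, but I couldn't get either of them to work. Will this output the data as columns in a single file? Each output column (including text and headers) should have 10007 rows in total, but that may vary, so using a fixed entry length is risky. Thanks for the help; I'll look into Python tomorrow when I'm awake enough. Also, what is the deal with the "5-minute edit"? It's really annoying.
In that case, could you provide a larger example of what you are trying to parse?
I can if you want, but I'm not sure what that would achieve. Maybe I can describe it better: my data contains multiple "instances" in a single column. Each "instance" consists of 7 lines of text (as shown in the OP), followed by 10,000 lines of numbers. The "instances" are simply concatenated.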
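
Given that description, and the concern that a fixed entry length is risky, one possible approach is to start a new instance at every "channel names:" line and then write the instances side by side. The following is only a sketch under those assumptions. `split_instances` and `write_columns` are names invented for this example, the marker line is assumed to open every instance, and the whole file is held in memory.

```python
import csv
import io
from itertools import zip_longest

MARKER = "channel names:"  # assumed to be the first line of every instance

def split_instances(lines, marker=MARKER):
    """Group lines into instances, starting a new one at each marker line."""
    instances = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        if line == marker or not instances:
            instances.append([])
        instances[-1].append(line)
    return instances

def write_columns(instances, out, delimiter=","):
    """Write the instances side by side, one instance per column."""
    writer = csv.writer(out, delimiter=delimiter)
    # zip_longest pads shorter instances with "" so unequal lengths still line up
    for row in zip_longest(*instances, fillvalue=""):
        writer.writerow(row)

# shortened stand-in for the real file; the second instance has one extra value
sample = """channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:08:02.311422
dt:
0.000000
data:
-8.000000E-4
-8.000000E-4
channel names:
03/02/2012 12:03:03 - TDS3k(CH1)
start times:
03/02/2012 12:33:11.169533
dt:
0.000000
data:
-8.000000E-4
-1.600000E-3
-1.600000E-3
"""

instances = split_instances(io.StringIO(sample))
out = io.StringIO()
write_columns(instances, out)
output_lines = out.getvalue().splitlines()
print("\n".join(output_lines))
```

For the real data, an open file object can be passed to `split_instances` in place of the `io.StringIO` wrapper, and a file opened with `newline=""` in place of `out`. The resulting comma-separated file should then be readable in R with `read.csv`.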