【Question title】: Formatting columns in a txt file (Bash)
【Posted】: 2020-08-03 21:49:52
【Question】:

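I'm very new to coding, and I'm working with a txt file whose columns are not properly formatted. I've tried various column invocations to split on tabs or commas, but nothing seems to work. The txt file I want to change looks like this: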

"","SNPID","chr","position","coded_all","noncoded_all","strand_genome","beta","SE","pval","AF_coded_all","HWE_pval","callrate","n_total","imputed","used_for_imp","oevar_imp","cases_hwe","controls_hwe","cases_maf","controls_maf","ORIG_RSID","ORIG_oevar_imp"
"1","rs12238997",1,693731,"G","A","+",-0.38288,0.217017,0.0671875,0.136449469720989,0.830832,1,3558,0,1,0.68019,1,0.717247,0.111293,0.137696,"1:693731",0.68523
"2","rs55727773",1,706368,"G","A","+",0.0495436,0.184018,0.787664,0.495455309418386,0.737299,1,3558,0,1,0.37193,1,0.731107,0.501533,0.495154,"1:706368",0.374977
"3","rs144155419",1,717587,"A","G","+",-0.610968,0.783633,0.40289,0.01515919857503,1,1,3558,0,1,0.53527,1,1,0.0109798,0.0153663,"1:717587",0.5463
"4","1:718624",1,718624,"G","C","+",-2.49538,4.33782,0.45716,0.00158445978704298,1,1,3558,0,1,0.44799,1,1,0.000520833,0.00163717,"1:718624",0.537329
"5","1:718625",1,718625,"G","T","+",-2.49432,4.33666,0.457265,0.00158445978704298,1,1,3558,0,1,0.44808,1,1,0.000520833,0.00163717,"1:718625",0.537482
"6","rs564367954",1,720984,"G","T","+",0.152295,1.9944,0.939879,0.00164619033597763,1,1,3558,0,1,0.38407,1,1,0.00183333,0.00163691,"1:720984",0.418335

However, I would like it to be formatted like this:

SNPID   chr position    coded_all   noncoded_all    strand_genome   beta    SE  pval    AF_coded_all    n_total oevar_imp
1:10177_A/AC_1:10177    1   10177   AC  A   +   -0.1885 0.3084  0.5411  0.5 5552    0.00451
1:10235_T/TA_1:10235    1   10235   TA  T   +   -6.782  18.56   0.7149  0   5552    0.00020
1:10352_T/TA_1:10352    1   10352   TA  T   +   0.2509  0.2392  0.2942  0.5 5552    0.00721
1:10539_C/A_1:10539 1   10539   A   C   +   -1.832  5.502   0.7392  0   5552    0.00420
1:10616_CCGCCGTTGCAAAGGCGCGCCG/C_1:10616    1   10616   C   CCGCCGTTGCAAAGGCGCGCCG  +   1.213   1.308   0.3537  1   5552    0.00778
1:10642_G/A_1:10642 1   10642   A   G   +   5.491   14.83   0.7111  0   5552    0.00012
1:11008_C/G_1:11008 1   11008   G   C   +   -0.3325 0.3133  0.2886  0   5552    0.01250
1:11012_C/G_1:11012 1   11012   G   C   +   -0.3314 0.3132  0.2901  0   5552    0.01251
1:11063_T/G_1:11063 1   11063   G   T   +   0.1657  14.14   0.9906  0   5552    0.00019

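Each header should sit above its own column, rather than the first line being a single string of headers. Any suggestions would be much appreciated!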

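【Question comments】:

  • The formatted text in your example is quite different from the source text. Can you clarify how the text should be formatted? Please also provide your code.

Tags: bash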


【Solution 1】:

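There are two approaches:

  1. Preprocess the CSV file with sed '/"/!s/,/ /g;s/","/ /g; s/"//g', then align the result with column -t <filename>

  2. However, the most effective approach is to use csvkit. Install it first: sudo pip install csvkit (this requires python3 and python3-pip)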
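
For the first approach, here is a minimal sketch that assumes no field contains an embedded comma or double quote. It is a variation on the sed expression above rather than the same command: it drops the unnamed leading row-number column, strips the quotes, and converts every comma to a tab, which also covers the unquoted numeric fields in the sample. The function name csv_to_tsv is hypothetical.

```bash
#!/usr/bin/env bash
# csv_to_tsv: hypothetical helper for simple CSV files.
# Assumption: no field contains an embedded comma or double quote.
# Reads the named files (or stdin if none are given) and writes tab-separated text.
csv_to_tsv() {
  sed -e 's/^"[^"]*",//' -e 's/"//g' "$@" | tr ',' '\t'
  #       ^ drop the first (row-number) column
  #                          ^ remove all remaining quotes
  #                                          ^ turn each comma into a tab
}
```

It could then be combined with column to align the output for viewing, for example csv_to_tsv file.csv | column -t, or redirected to a file to keep the tab-separated version.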

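csvkit includes a range of tools, such as csvcut, which lets you inspect and cut out the columns you are interested in. For example, csvcut -n file.csv lists the available columns (note that I copied your sample data into file.csv):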

$ csvcut -n file.csv 
  1: 
  2: SNPID
  3: chr
  4: position
  5: coded_all
  6: noncoded_all
  7: strand_genome
  8: beta
  9: SE
 10: pval
 11: AF_coded_all
 12: HWE_pval
 13: callrate
 14: n_total
 15: imputed
 16: used_for_imp
 17: oevar_imp
 18: cases_hwe
 19: controls_hwe
 20: cases_maf
 21: controls_maf
 22: ORIG_RSID
 23: ORIG_oevar_imp

To cut out columns 2 through 6 and render them as a table:

$ csvcut -c 2,3,4,5,6 file.csv  | csvlook
| SNPID       |  chr | position | coded_all | noncoded_all |
| ----------- | ---- | -------- | --------- | ------------ |
| rs12238997  | True |  693,731 | G         | A            |
| rs55727773  | True |  706,368 | G         | A            |
| rs144155419 | True |  717,587 | A         | G            |
| 1:718624    | True |  718,624 | G         | C            |
| 1:718625    | True |  718,625 | G         | T            |
| rs564367954 | True |  720,984 | G         | T            |

The True values under chr and the thousands separators under position come from csvlook's type inference (every chr value in the sample is 1, which is read as a boolean); the underlying data is unchanged. Passing -I (--no-inference) to csvlook displays the raw values.


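You can use csvformat to separate the columns with a delimiter of your choice. For example, for tab-delimited output, here are the command and its output: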

$ csvcut -c 2,3,4,5,6 file.csv  | csvformat -T
SNPID   chr position    coded_all   noncoded_all
rs12238997  1   693731  G   A
rs55727773  1   706368  G   A
rs144155419 1   717587  A   G
1:718624    1   718624  G   C
1:718625    1   718625  G   T
rs564367954 1   720984  G   T

Hope this is useful. You can now turn this into a script if needed. Of course, you may have to adjust the data slightly to suit your needs.
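
As one possible starting point for such a script, the sketch below selects columns by header name and prints them tab-separated using only awk, for cases where csvkit is not installed. It is not part of csvkit; the function name select_columns is hypothetical, and it assumes no field contains an embedded comma or double quote.

```bash
#!/usr/bin/env bash
# select_columns: hypothetical helper, not a csvkit tool.
# Usage: select_columns "SNPID,chr,position" < file.csv
# Assumption: no field contains an embedded comma or double quote.
select_columns() {
  awk -F',' -v want="$1" '
    BEGIN { OFS = "\t"; n = split(want, names, ",") }
    { gsub(/"/, "") }   # strip quotes; changing $0 makes awk re-split the fields
    NR == 1 {
      # Map each header name to its column index.
      for (i = 1; i <= NF; i++) pos[$i] = i
      # Stop early if a requested column is not in the header.
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "unknown column: " names[j] > "/dev/stderr"
          exit 1
        }
      }
    }
    {
      # Emit the requested columns, in the requested order, for every row.
      line = $(pos[names[1]])
      for (j = 2; j <= n; j++) line = line OFS $(pos[names[j]])
      print line
    }
  '
}
```

For the layout asked for in the question, it could be called as select_columns "SNPID,chr,position,coded_all,noncoded_all,strand_genome,beta,SE,pval,AF_coded_all,n_total,oevar_imp" < file.csv. Because the header line goes through the same selection as the data rows, each header ends up above its own column.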

You can read more about csvkit here: https://csvkit.readthedocs.io/en/latest/tutorial/1_getting_started.html#installing-csvkit

All the best!

【Comments】:
