【Question title】: Sorting .csv from command line
【Posted】: 2015-06-22 02:20:48
【Question】:

I am trying to sort rows like the one below (census block population densities for the entire US, 2010) by the last column.

12001,2,1009,Alachua FL,29.65612,-82.327274,0.0005131,0.013289229,12,902.9869232

censusBlockDensities.csv (moved here from the comments)

17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
17001,1,1041,Adams IL,39.94333,-91.345319,0.000358,0.0009236128,0,0480.4506562
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427

【Comments】:

  • Great! What have you tried? Which shell are you using? Is it a UNIX or a Windows command line?

Tags: shell csv sorting unix


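【Solution 1】:

I am assuming a Unix shell (e.g. bash).

Read the manual page for the sort command: man sort

From the man page:

The locale specified by the environment affects sort order. Set LC_ALL=C to get the traditional sort order that uses native byte values.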

export LC_ALL=C

sort -t , -k 10,10 -n censusBlockDensities.csv
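To make the locale note above concrete, here is a minimal sketch. The four-line input is hypothetical and not taken from the question's data. It also shows that LC_ALL=C can be set for a single command instead of being exported for the whole session, which may be preferable if other commands should keep the current locale.

```shell
# Hypothetical four-line input (not from the question's data).
words=$(printf 'b\nA\na\nB\n')

# Prefixing the assignment applies the C locale to this one sort call only;
# the rest of the shell session keeps whatever locale it had.
c_sorted=$(printf '%s\n' "$words" | LC_ALL=C sort)

# In the C locale, lines are compared by byte value,
# so all uppercase letters come before all lowercase letters.
printf '%s\n' "$c_sorted"
```

In other locales the same input may come out in a different order (for example with upper and lower case interleaved), which is why the man page recommends pinning the locale.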

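Explanation of the flags:

-t ,: specifies the comma as the field separator.

-k 10,10: sorts on the 10th field only (start, stop). The first field is 1, not 0.

KEYDEF is F[.C][OPTS][,F[.C][OPTS]] for start and stop position, where F is a field number and C a character position in the field; both are origin 1, and the stop position defaults to the line's end. If neither -t nor -b is in effect, characters in a field are counted from the beginning of the preceding whitespace. OPTS is one or more single-letter ordering options [bdfgiMhnRrV], which override global ordering options for that key. If no key is given, use the entire line as the key.

-n: performs a numeric sort instead of the default lexicographic (character-by-character) sort. Alternatively, append "n" to the -k argument, as noted in the comments below.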
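The difference between the default comparison and a numeric one can be sketched as follows. The three rows are the 10-field rows from the sample data; the temporary file and the variable names are only for illustration. The third column (the block ID) is printed so the resulting order is easy to read.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427
EOF

# Without -n, field 10 is compared as text: "115..." < "4.05..." < "613...".
text_order=$(LC_ALL=C sort -t , -k 10,10 "$tmp" | cut -d , -f 3 | paste -sd ' ' -)

# With the n modifier on the key, field 10 is compared as a number.
num_order=$(LC_ALL=C sort -t , -k 10,10n "$tmp" | cut -d , -f 3 | paste -sd ' ' -)

echo "text:    $text_order"   # 1051 1020 1010
echo "numeric: $num_order"    # 1020 1051 1010
rm -f "$tmp"
```

Attaching n to the key (-k 10,10n) limits the numeric comparison to that key, whereas a global -n applies to every key that has no modifiers of its own.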

censusBlockDensities.csv

17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
17001,1,1041,Adams IL,39.94333,-91.345319,0.000358,0.0009236128,0,0480.4506562
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427

Output:

17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427
17001,1,1041,Adams IL,39.94333,-91.345319,0.000358,0.0009236128,0,0480.4506562
17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336

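Edit: Helpful comments pointed out that my answer was wrong. You also need the -n flag to perform a numeric sort (the default is lexicographic). I have revised my answer to include it. You can also verify that it works by sorting in reverse order with the -r flag. I also added the stop field index to the -k 10 argument, as described in another post.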
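As a rough sketch of that check, the snippet below sorts three of the sample rows in descending order with -r. It then uses sort -c, which is not mentioned in the answer but is a standard option, to confirm that an ascending result really is ordered by the key. The temporary file names are only for illustration.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427
EOF

# -r reverses the comparison, so the largest density comes first.
desc_order=$(LC_ALL=C sort -t , -k 10,10 -n -r "$tmp" | cut -d , -f 3 | paste -sd ' ' -)
echo "$desc_order"   # 1010 1051 1020

# sort -c writes no sorted output; it only checks the order
# and exits with status 0 if the input is already sorted by the given key.
LC_ALL=C sort -t , -k 10,10 -n "$tmp" > "$tmp.sorted"
if LC_ALL=C sort -c -t , -k 10,10 -n "$tmp.sorted"; then
    check=sorted
else
    check=unsorted
fi
echo "$check"
rm -f "$tmp" "$tmp.sorted"
```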

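In addition, you should check the input file to make sure every line has the same number of fields. The command below prints the number of commas on each line, which is one less than the number of fields: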

awk '{print gsub(/,/,"")}' censusBlockDensities.csv

9
9
10 <-- the third record has an additional field
9
9
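A variant of the same check, sketched under the assumption that the first line has the correct number of fields, prints only the lines that deviate. It uses awk's -F option and the built-in NF (number of fields) instead of counting commas. Like the command above, it splits naively on commas, so it would miscount fields that contain quoted commas.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427
EOF

# Take the field count of line 1 as the expected value,
# then report the line number and field count of every line that differs.
bad_lines=$(awk -F , 'NR == 1 { want = NF } NF != want { print NR ": " NF " fields" }' "$tmp")

echo "$bad_lines"   # 2: 11 fields
rm -f "$tmp"
```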

【Comments】:

  • Tried it, it didn't work. Note that this is a .csv that I manipulated in Excel and then exported to .csv again.
    export LC_ALL=C
    sort -t , -k 10 censusBlockDensities.csv > 2.csv
    Output:
    17001,1,1010,Adams IL,39.960197,-91.373363,0.08861,00.037495258,23,613.41090336
    17001,1,1020,Adams IL,39.955861,-91.354113,0.19038,0.493081936,2,4.05612100686
    17001,1,1031,Adams IL,39.956978,-91.369,0.002268,0.005874093,0,0,22.8543955664
    17001,1,1041,Adams IL,39.94333,-91.345319,0.000358,0.0009236128,0,0480.4506562
    17001,1,1051,Adams IL,39.948201,-91.352052,0.213797,0.553731688,64,115.5794427
  • sort -n -t , -k 10 or sort -t , -k 10n: the "n" is for numeric sort
  • @user3502552 Regarding your new sample data, the first thing I notice is that the third line has an additional field. See my updated answer for a way to detect such problems.