【Question title】: Can't explain sort(1) behaviour
【Posted】: 2013-12-22 16:53:33
【Question description】:

I have been puzzled ever since I saw ls list the following files in a strange order:

Star Wars Episode II - Attack of the Clones (2002) BDRip.mkv
Star Wars Episode III - Revenge of the Sith (2005) BDRip.mkv
Star Wars Episode I - The Phantom Menace (1999) BDRip.mkv
Star Wars Episode IV - A New Hope (1977) BDRip.mkv
Star Wars Episode VI - Return of the Jedi (1983) BDRip.mkv
Star Wars Episode V - The Empire Strikes Back (1980) BDRip.mkv

From a human point of view, "I" should come first, then "II", and so on.

So I created a file with the following contents:

$ cat 1
Star Wars Episode II - Attack
Star Wars Episode III - Revenge
Star Wars Episode I - The
Star Wars Episode IV - A
Star Wars Episode VI - Return
Star Wars Episode V - The

If I sort it, I get this:

$ sort 1
Star Wars Episode II - Attack
Star Wars Episode III - Revenge
Star Wars Episode I - The
Star Wars Episode IV - A
Star Wars Episode VI - Return
Star Wars Episode V - The

However, if I remove the "-" and everything after it, the sort is correct:

$ cat 1
Star Wars Episode II 
Star Wars Episode III 
Star Wars Episode I 
Star Wars Episode IV 
Star Wars Episode VI 
Star Wars Episode V 

$ sort 1
Star Wars Episode I 
Star Wars Episode II 
Star Wars Episode III 
Star Wars Episode IV 
Star Wars Episode V 
Star Wars Episode VI 

So as soon as I add any character after the space, the sort order becomes unpredictable to me:

$ cat 1
Star Wars Episode II y
Star Wars Episode III x
Star Wars Episode I z
Star Wars Episode IV w
Star Wars Episode VI v
Star Wars Episode V u

$ sort 1
Star Wars Episode III x
Star Wars Episode II y
Star Wars Episode IV w
Star Wars Episode I z
Star Wars Episode VI v
Star Wars Episode V u

Any hints about this sort behaviour?

Update: sort: using ‘en_CA.UTF-8’ sorting rules

Update #2: according to the comments below, this is caused by the locale.

ls | LANG=C sort
Star Wars Episode I - The Phantom Menace (1999) BDRip.mkv
Star Wars Episode II - Attack of the Clones (2002) BDRip.mkv
Star Wars Episode III - Revenge of the Sith (2005) BDRip.mkv
Star Wars Episode IV - A New Hope (1977) BDRip.mkv
Star Wars Episode V - The Empire Strikes Back (1980) BDRip.mkv
Star Wars Episode VI - Return of the Jedi (1983) BDRip.mkv
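
A note on the LANG=C used above versus the LC_ALL=C used in the comments: under POSIX, the collation locale comes from LC_ALL if it is set, otherwise from LC_COLLATE, otherwise from LANG. So LANG=C only changes the sort order when the other two are unset. A minimal sketch that reports which setting would win; effective_collation is a made-up helper name, and the final fallback to C is an assumption about the implementation default:

```shell
#!/bin/sh
# Hypothetical helper: print the locale name that collation would be
# taken from, following the POSIX precedence LC_ALL > LC_COLLATE > LANG.
effective_collation() {
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        printf '%s\n' "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        # Assumption: with nothing set, the default is the C locale
        # (POSIX leaves the default implementation-defined).
        printf 'C\n'
    fi
}

# Report the setting in effect for the current shell.
effective_collation
```

If this prints something like en_CA.UTF-8 because LC_ALL is exported, prefixing the command with LANG=C has no effect on sort, while LC_ALL=C still does.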

Why does a UTF8 locale make a difference? I checked ru_RU.UTF8 (sorts incorrectly) and ru_RU.KOI8-R (sorts correctly).

Update #3, regarding locales: http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

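【Comments】:

  • Prefixing the command with LC_ALL=C makes it work, so it must be locale-related.
  • unix.com/showthread.php?t=156805 a script for sorting files with Roman numerals
  • Is "ii" a digraph that sorts before "i" in the ru_RU locale (when it is not treated as a Roman numeral)? A quick Google search shows that bugs have been reported about sort-order problems in the ru_RU.UTF8 locale, so that may well be part of what you are seeing...
  • Please see my answer below and the updates to the original question. This is the default behaviour of UTF8 locales, at least the ones I have used: they ignore spaces. My original question was not about the ru_RU.* locales but about *.UTF8 locales, en_CA.UTF8 in particular.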

Tags: linux sorting ls


【Solution 1】:

    With locale-based sorting, sort ignores all non-alphanumeric characters when it compares lines:

    II - Attack   -> "IIA"
    III - Revenge -> "III"
    I - The       -> "ITh"
    IV - A        -> "IVA"
    VI - Return   -> "VIR"
    V - The       -> "VTh"
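
The mapping above can be approximated in plain shell. This is only a sketch: real locale collation compares in several passes and is more involved than deleting characters, and sort_ignoring_punct is a made-up helper name. It builds a key with everything except ASCII letters and digits removed, sorts on that key by byte value, then drops the key again:

```shell
#!/bin/sh
# Hypothetical helper: emulate "ignore non-alphanumerics" ordering.
# Each line is decorated as "<KEY><TAB><line>", sorted bytewise on the
# whole decorated line, then undecorated. Input must not contain tabs.
sort_ignoring_punct() {
    LC_ALL=C awk '{
        key = $0
        gsub(/[^A-Za-z0-9]/, "", key)   # drop spaces, dashes, etc.
        print toupper(key) "\t" $0      # fold case, keep original line
    }' |
        LC_ALL=C sort |
        cut -f2-
}

# The six lines from the question, fed in numeric order.
printf '%s\n' \
    'Star Wars Episode I - The' \
    'Star Wars Episode II - Attack' \
    'Star Wars Episode III - Revenge' \
    'Star Wars Episode IV - A' \
    'Star Wars Episode V - The' \
    'Star Wars Episode VI - Return' |
    sort_ignoring_punct
# Prints the episodes in the order II, III, I, IV, VI, V.
```

That is the same order the question shows for plain sort under en_CA.UTF-8, which is consistent with the "IIA" < "III" < "ITh" comparison above.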
    

    With LC_ALL=C, comparison is by byte value, and the space character sorts before all alphanumerics:

    I - The       -> "I -"
    II - Attack   -> "II "
    III - Revenge -> "III"
    IV - A        -> "IV "
    V - The       -> "V -"
    VI - Return   -> "VI "
    

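    So the correct result is a coincidence: byte order only happens to match Roman-numeral order for these titles, and it would already misplace an Episode IX, which sorts between IV and V.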
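
How far byte order keeps agreeing with Roman-numeral order is easy to check. A small sketch, assuming made-up placeholder titles and a hypothetical wrapper named sort_bytewise:

```shell
#!/bin/sh
# Hypothetical helper: sort stdin strictly by byte value, regardless of
# the caller's locale settings.
sort_bytewise() {
    LC_ALL=C sort
}

# Ten placeholder titles, one per Roman numeral from I to X.
for n in I II III IV V VI VII VIII IX X; do
    printf 'Episode %s - title\n' "$n"
done | sort_bytewise
# Prints the numerals in the order I, II, III, IV, IX, V, VI, VII, VIII, X.
```

"IX" lands before "V" because the byte for I is smaller than the byte for V, so LC_ALL=C gives the expected order only for I through VIII. For longer series, a sort on the numeric value of the numeral would be needed.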

    【Comments】:

    • OK, thanks. One correction: as far as I can tell, this only happens with UTF8 locales.