【Question title】: Sort in alphabetical order with lowercase before uppercase?
【Posted】: 2018-12-14 14:37:09
【Question】:

I started with "The Complete Works of William Shakespeare" from Project Gutenberg, a UTF-8 text file available from http://www.gutenberg.org/ebooks/100. In PowerShell, I ran

Get-Content -Tail 50 $filename | Sort-Object -CaseSensitive

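which, I believe, pipes the last 50 lines of the file (i.e., strings delimited by line breaks) to Sort-Object, which is configured to sort alphabetically, with strings that start with a lowercase letter coming before strings that start with an uppercase letter.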

Why is the output (especially the lines beginning with P) not sorted according to the -CaseSensitive switch? What is the solution?

【Comments】:

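  • Looks correct to me. -CaseSensitive puts lowercase before uppercase when the words are otherwise identical, but it does not put all lowercase before all uppercase. E.g. pleasant > Pleasant > please > Please > portal > Power.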

Tags: powershell sorting output


【Solution 1】:

Note: This answer focuses on the general case of sorting by whole strings (by all of their characters, not just by the first one).

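You are looking for ordinal sorting, where characters are sorted numerically by their Unicode code point ("ASCII value"), so that all uppercase letters, as a group, sort before all lowercase letters.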

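As of Windows PowerShell v5.1 / PowerShell Core v7.0, Sort-Object always uses lexical sorting[1] (using the invariant culture by default, which can be changed with the -Culture parameter). There, case-sensitive sorting only means that the lowercase form of a given letter comes directly before its uppercase form, not that all lowercase letters collectively come before all uppercase ones; for example, b sorts before B, but both come after a and A (also, the logic is the reverse of the ordinal case, where uppercase letters come first):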

PS> 'B', 'b', 'A', 'a' | Sort-Object -CaseSensitive
a
A
b
B

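There is a workaround, which, however, (a) sorts uppercase letters before lowercase ones and (b) comes at the expense of performance:

  • For better performance with direct ordinal sorting, you need to use the .NET framework directly - see below, which also offers a solution for sorting lowercase letters first.
  • Enhancing Sort-Object to support ordinal sorting is being discussed in a GitHub issue.
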
# PSv4+ syntax
# Note: Uppercase letters come first.
PS> 'B', 'b', 'A', 'a' |
      Sort-Object { -join ([int[]] $_.ToCharArray()).ForEach('ToString', 'x4') } 
A
B
a
b

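The solution maps each input string to a string made up of the 4-digit hexadecimal representations of its characters' code points; e.g., 'aB' becomes '00610042', representing code points 0x61 and 0x42. Comparing these representations is then equivalent to sorting the strings by their characters' code points.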
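To make the key mapping easier to inspect, it can be run on its own. The following is a small sketch; the function name Get-OrdinalKey is hypothetical and exists only for this illustration.

```powershell
# Hypothetical helper (not a built-in cmdlet): builds the sort key used by the workaround.
function Get-OrdinalKey([string] $Text) {
  # Format each UTF-16 code unit as 4 hex digits and concatenate the results.
  -join ([int[]] $Text.ToCharArray()).ForEach('ToString', 'x4')
}

$keyLower = Get-OrdinalKey 'aB'   # code units 0x61, 0x42 -> 00610042
$keyUpper = Get-OrdinalKey 'Ab'   # code units 0x41, 0x62 -> 00410062
$keyLower
$keyUpper
```

Because every character contributes exactly four hex digits, comparing two keys position by position mirrors comparing the original strings code unit by code unit, which is why 'Ab' ends up before 'aB'.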


Using .NET for direct, better-performing ordinal sorting:

# Get the last 50 lines as a list.
[Collections.Generic.List[string]] $lines = Get-Content -Tail 50 $filename

# Sort the list in place, using ordinal sorting
$lines.Sort([StringComparer]::Ordinal)

# Output the result.
# Note that uppercase letters come first.
$lines

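[StringComparer]::Ordinal returns an object that implements the [System.Collections.IComparer] interface (as well as the generic [System.Collections.Generic.IComparer[string]] interface, which is what List[string].Sort() accepts).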
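The comparer can also be called directly, which is a convenient way to check how two specific strings relate under ordinal rules. A minimal sketch, with sample strings chosen only for illustration:

```powershell
# A negative result means the first argument sorts before the second.
$result = [StringComparer]::Ordinal.Compare('B', 'a')   # 'B' is 0x42, 'a' is 0x61
$result -lt 0   # True: in ordinal order, uppercase ASCII letters precede lowercase ones

# The static [string]::CompareOrdinal() method performs the same kind of comparison.
[string]::CompareOrdinal('B', 'a') -lt 0
```

Only the sign of the result is meaningful; the exact magnitude should not be relied on.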

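Using this solution in a pipeline is possible, but requires sending the array of lines through the pipeline as a single object, which -ReadCount 0 provides: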

Get-Content -Tail 50 $filename -ReadCount 0 | ForEach-Object { 
  ($lines = [Collections.Generic.List[string]] $_).Sort([StringComparer]::Ordinal)
  $lines # output the sorted lines 
}

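Note: As stated, this sorts uppercase letters first.


To sort all lowercase letters first, you need to implement custom sorting via a [System.Comparison[string]] delegate, which in PowerShell can be implemented as a script block ({ ... }) that accepts two input strings and returns their sort ranking: -1 (or any negative value) for less than, 0 for equal, and 1 (or any positive value) for greater than: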

$lines.Sort({ param([string]$x, [string]$y)
  # Determine the shorter of the two lengths.
  $count = if ($x.Length -lt $y.Length) { $x.Length } else { $y.Length }
  # Loop over all characters in corresponding positions.
  for ($i = 0; $i -lt $count; ++$i) {
    if ([char]::IsLower($x[$i]) -ne [char]::IsLower($y[$i])) {
      # Sort all lowercase chars. before uppercase ones.
      return (1, -1)[[char]::IsLower($x[$i])]
    } elseif ($x[$i] -ne $y[$i]) { # compare code points (numerically)
      return $x[$i] - $y[$i]
    }
    # So far the two strings compared equal, continue.
  }
  # The strings compared equal in all corresponding character positions,
  # so the difference in length, if any, is the decider (longer strings sort
  # after shorter ones).
  return $x.Length - $y.Length
})
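One way to check the comparison logic is to run it against a small hand-made list. The sketch below restates the same rules in a variable so that it is self-contained; the name $lowerFirst and the sample strings are assumptions made for illustration.

```powershell
# Same rules as the script block above, stored in a (hypothetically named) variable.
[Comparison[string]] $lowerFirst = {
  param([string] $x, [string] $y)
  $count = [Math]::Min($x.Length, $y.Length)
  for ($i = 0; $i -lt $count; ++$i) {
    $xLower = [char]::IsLower($x[$i])
    $yLower = [char]::IsLower($y[$i])
    if ($xLower -ne $yLower) {
      # Exactly one of the two characters is lowercase; that string sorts first.
      if ($xLower) { return -1 } else { return 1 }
    }
    if ([int] $x[$i] -ne [int] $y[$i]) {
      # Same case category: fall back to the numeric code-point difference.
      return ([int] $x[$i]) - ([int] $y[$i])
    }
  }
  # All shared positions are equal: the shorter string sorts first.
  return $x.Length - $y.Length
}

$sample = [Collections.Generic.List[string]] ('B', 'b', 'Ab', 'a', 'A', 'ab')
$sample.Sort($lowerFirst)
$sample -join ' '   # a ab b A Ab B
```

Note that characters that are neither lowercase nor uppercase (digits, punctuation) fall into the same group as uppercase letters under these rules.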

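Note: For English text, the above should work fine, but supporting all Unicode text, which may contain surrogate code-unit pairs and differing normalization forms (composed vs. decomposed accented characters), requires even more work.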
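The normalization issue can be seen with a single accented letter. This is only a sketch of the problem, not a complete fix; the variable names are illustrative.

```powershell
# Two ways to write the same accented letter: one precomposed character
# versus a plain "e" followed by a combining acute accent.
$composed   = [string] [char] 0x00E9
$decomposed = 'e' + [char] 0x0301

# Ordinal comparison looks at the raw code units, so the two forms differ.
$rawEqual = [string]::CompareOrdinal($composed, $decomposed) -eq 0

# After normalizing both strings to the same form (Form C is the default),
# the ordinal comparison treats them as equal.
$normalizedEqual = [string]::CompareOrdinal($composed.Normalize(), $decomposed.Normalize()) -eq 0

$rawEqual          # False
$normalizedEqual   # True
```

Normalizing each line before it reaches the comparison would be one possible first step toward handling such input consistently.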


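[1] On Windows, a so-called word sort is performed by default: "Certain nonalphanumeric characters might have special weights assigned to them. For example, the hyphen (-) might have a very small weight assigned to it so that coop and co-op appear next to each other in a sorted list."; on Unix-like platforms, a string sort is the default, where no special weights apply to non-alphanumeric characters - see the docs.

【Comments】:

    【Solution 2】:

    One way to get the desired result is to take the first character of each string and cast it to an int, which gives you that character's ASCII code; you can then sort numerically to get the desired order.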

    Get-Content -Tail 50 $filename | Sort-Object -Property @{E={[int]$_[0]};Ascending=$true} 
    

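    We create an expression with the -Property parameter of Sort-Object: $_ is the current string/line in the pipeline, [0] takes the first character of that string, [int] casts it to an int, and the result is sorted in ascending order.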
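The effect of the expression is easiest to see on a few short strings. A minimal sketch, with sample words chosen only for illustration:

```powershell
# Sort by the numeric code of each string's first character only.
$sorted = 'banana', 'Apple', 'apple', 'Banana' | Sort-Object -Property { [int] $_[0] }
$sorted -join ' '   # Apple Banana apple banana
```

Because only the first character takes part in the comparison, lines that share a first character are not ordered relative to one another, which is why the output further down is not fully alphabetical within each letter.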

    This gives the following output.

    You may want to strip the blank lines from the output, but I will leave that up to you.

     
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
    
        DONATIONS or determine the status of compliance for any particular state
        Foundation, how to help produce our new eBooks, and how to subscribe to
        Gutenberg-tm eBooks with only a loose network of volunteer support.
        International donations are gratefully accepted, but we cannot make any
        Most people start at our Web site which has the main PG search facility:
        Project Gutenberg-tm eBooks are often created from several printed
        Please check the Project Gutenberg Web pages for current donation
        Professor Michael S. Hart was the originator of the Project Gutenberg-tm
        Section 5. General Information About Project Gutenberg-tm electronic
        This Web site includes information about Project Gutenberg-tm, including
        While we cannot and do not solicit contributions from states where we
        against accepting unsolicited donations from donors in such states who
        approach us with offers to donate.
        concept of a library of electronic works that could be freely shared
        considerable effort, much paperwork and many fees to meet and keep up
        editions, all of which are confirmed as not protected by copyright in
        have not met the solicitation requirements, we know of no prohibition
        how to make donations to the Project Gutenberg Literary Archive
        including checks, online payments and credit card donations. To donate,
        methods and addresses. Donations are accepted in a number of other ways
        necessarily keep eBooks in compliance with any particular paper edition.
        our email newsletter to hear about new eBooks.
        please visit: www.gutenberg.org/donate
        statements concerning tax treatment of donations received from outside
        the United States. U.S. laws alone swamp our small staff.
        the U.S. unless a copyright notice is included. Thus, we do not
        visit www.gutenberg.org/donate
        with anyone. For forty years, he produced and distributed Project
        www.gutenberg.org
        we have not received written confirmation of compliance. To SEND
        with these requirements. We do not solicit donations in locations where
        works.
    

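    Update

    To sort lowercase first and to trim the blank lines: essentially, I just multiply the ASCII number of an uppercase letter by an arbitrary factor so that it ends up numerically higher than the lowercase codes.

    In the sample text, no lines start with special characters or punctuation; the code would probably need to be modified to handle those cases correctly.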

    Get-Content -Tail 50 $filename | ? { -not [string]::IsNullOrEmpty($_) } | Sort-Object -Property {
        if($_[0] -cmatch "[A-Z]")
        {
            5*[int]$_[0]
        }
        else
        {
            [int]$_[0]
        } 
    }
    

    This outputs:

    against accepting unsolicited donations from donors in such states who
    approach us with offers to donate.
    considerable effort, much paperwork and many fees to meet and keep up
    concept of a library of electronic works that could be freely shared
    editions, all of which are confirmed as not protected by copyright in
    how to make donations to the Project Gutenberg Literary Archive
    have not met the solicitation requirements, we know of no prohibition
    including checks, online payments and credit card donations. To donate,
    methods and addresses. Donations are accepted in a number of other ways
    necessarily keep eBooks in compliance with any particular paper edition.
    our email newsletter to hear about new eBooks.
    please visit: www.gutenberg.org/donate
    statements concerning tax treatment of donations received from outside
    the U.S. unless a copyright notice is included. Thus, we do not
    the United States. U.S. laws alone swamp our small staff.
    visit www.gutenberg.org/donate
    with these requirements. We do not solicit donations in locations where
    works.
    www.gutenberg.org
    with anyone. For forty years, he produced and distributed Project
    we have not received written confirmation of compliance. To SEND
    DONATIONS or determine the status of compliance for any particular state
    Foundation, how to help produce our new eBooks, and how to subscribe to
    Gutenberg-tm eBooks with only a loose network of volunteer support.
    International donations are gratefully accepted, but we cannot make any
    Most people start at our Web site which has the main PG search facility:
    Please check the Project Gutenberg Web pages for current donation
    Professor Michael S. Hart was the originator of the Project Gutenberg-tm
    Project Gutenberg-tm eBooks are often created from several printed
    Section 5. General Information About Project Gutenberg-tm electronic
    This Web site includes information about Project Gutenberg-tm, including
    While we cannot and do not solicit contributions from states where we
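
The update notes that lines starting with digits or punctuation would need extra handling. One possible variation, which is not part of the original answer, is to sort on two keys instead of multiplying: first on whether the first character is uppercase, then on its numeric code. The sample strings below are made up for illustration.

```powershell
$sample = 'Beta', 'alpha', '-dash', 'Alpha', 'beta', '9 lives'

# Key 1: $false (not uppercase) sorts before $true (uppercase).
# Key 2: the numeric code of the first character orders lines within each group.
$sorted = $sample | Sort-Object -Property { [char]::IsUpper($_[0]) }, { [int] $_[0] }
$sorted -join ' | '   # -dash | 9 lives | alpha | beta | Alpha | Beta
```

With this approach, lines starting with digits or punctuation land in the same group as lowercase lines; whether that is the desired placement is a design decision. Like the original, it assumes blank lines have already been filtered out.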
    

    【Comments】:

    • Your solution works well. I simplified it to: echo '.'; Get-Content -Tail 10 $filename | Sort-Object -Property {[int]$_[0]}
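    • I'm glad it helped. As a side note, if you want to swap the order so that lowercase sorts above uppercase, I have updated the answer to show the solution I came up with (there may be a better way).
    【Solution 3】:

    Comparing the responses from Jacob and mklement0: Jacob's solution has the advantages of being visually simple, being intuitive, using the pipeline, and being extensible to sorting by the second character of the first word, or the first character of the second word, and so on. mklement0's solution has the advantage of being faster, and it gave me ideas for how to sort lowercase before uppercase.

    Below I want to share my extension of Jacob's solution, which sorts by the first character of the second word. It is not particularly useful for the complete works of Shakespeare, but it is very useful for a comma-delimited table.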

    Function Replace-Nulls($line) {
    
     $dump_var = @(
          if ( !($line) ) {
               $line = [char]0 + " " + [char]0 + " [THIS WAS A LINE OF NULL WHITESPACE]"
          } # End if
          if ( !(($line.split())[1]) ) {
               $line += " " + [char]8 + " [THIS WAS A LINE WITH ONE WORD AND THE REST NULL WHITESPACE]"
          } # End if
     ) # End definition of dump_var
    
     return $line
    
    } # End Replace-Nulls
    
    echo "."
    $cleaned_output = Get-Content -Tail 20 $filename | ForEach-Object{ Replace-Nulls($_) }
    $cleaned_output | Sort-Object -Property {[int]((($_).split())[1])[0]}
    

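    【Comments】:

    • Unfortunately, your question is ambiguous: your own (unsuccessful) attempt with Sort-Object would definitely sort by all characters in the input strings, not just by the first one. Sorting by the first letter only strikes me as an exotic use case. I suggest not posting this answer and instead moving your comparison of the other two answers into your question, omitting the code snippet altogether, because it is incidental to the question you asked and may just distract future readers.
    • P.S.: My sort-by-all-characters solution can also work in a pipeline, but not easily (see my update). Ultimately, only enhancing Sort-Object will provide a good solution.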