Need help improving performance of PowerShell delimited-text parsing script: answers

【Question title】: Need help improving performance of PowerShell delimited-text parsing script
【Posted】: 2012-02-12 16:31:32
【Question description】:

I need to parse a large pipe-delimited file and count the records whose 5th column meets, and does not meet, my criterion.

PS C:\temp> gc .\items.txt -readcount 1000 | `
  ? { $_ -notlike "HEAD*" } | `
  % { foreach ($s in $_) { $s.split("|")[4] } } | `
  group -property {$_ -ge 256} -noelement | `
  ft -autosize

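This command does what I want and returns output like this:

  Count Name
  ----- ----
1129339 True
2013703 False

However, for a 500 MB test file, the command takes about 5.5 minutes to run, as measured by Measure-Command. A typical file is over 2 GB, and waiting 20+ minutes is too long.

Do you see a way to improve the performance of this command?

For example, is there a way to determine an optimal value for Get-Content's ReadCount? Without it, the same file takes 8.8 minutes to complete.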

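【Question comments】:

Have you tried StreamReader? I think Get-Content loads the whole file into memory before it does anything with it.
You mean by importing System.IO?
Yes, use the .NET Framework if you can. I used to read large log files generated by SQL Server that way, with good results. I don't know of any other way to read large files efficiently in PowerShell, but I'm no expert.
@Gisli, if you write your comment up as an answer, I can upvote it and eventually accept it. Using StreamReader got me down to 1 minute for the test file.

Tags: performance powershell csv

【Solution 1】:

Have you tried StreamReader? I think Get-Content loads the whole file into memory before it does anything with it.

StreamReader class

【Comments】:

【Solution 2】:

Using @Gisli's hint, here is the script I ended up with: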

param($file = $(Read-Host -prompt "File"))
# Resolve to a full path so .NET opens the intended file
$fullName = (Get-Item "$file").FullName
$sr = New-Object System.IO.StreamReader("$fullName")
$trueCount = 0
$falseCount = 0
while (($line = $sr.ReadLine()) -ne $null) {
    # Skip header rows
    if ($line -like 'HEAD|*') { continue }
    # Note: split() returns strings, so -ge compares as text here;
    # cast with [int] if a numeric comparison is intended
    if ($line.split("|")[4] -ge 256) {
        $trueCount++
    }
    else {
        $falseCount++
    }
}
$sr.Dispose()
write "True count:   $trueCount"
write "False count: $falseCount"

It produces the same results in about a minute, which meets my performance requirements.
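One thing worth checking in a script like this: Split() returns strings, and PowerShell converts the right-hand operand of -ge to the type of the left-hand one, so comparing a split field against 256 is a text comparison unless the field is cast first. Below is a minimal sketch of the same loop wrapped in a function, with an [int] cast and a try/finally around the reader. The function name Get-ThresholdCount, its parameters, and the TrueCount/FalseCount property names are made up for illustration, and it assumes the 5th field of every non-header row is an integer.

```powershell
# Hypothetical helper (not part of the original answer): counts rows whose
# 5th pipe-delimited field is >= $Threshold, using a numeric comparison.
function Get-ThresholdCount {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [int]$Threshold = 256
    )
    $trueCount = 0
    $falseCount = 0
    # .NET resolves relative paths against the process working directory,
    # which can differ from the PowerShell location, so pass a full path.
    $fullName = (Get-Item -LiteralPath $Path).FullName
    $sr = New-Object System.IO.StreamReader($fullName)
    try {
        while ($null -ne ($line = $sr.ReadLine())) {
            if ($line.StartsWith('HEAD|')) { continue }   # skip header rows
            # Assumption: the field is an integer; [int] throws on other text.
            $value = [int]$line.Split('|')[4]
            if ($value -ge $Threshold) { $trueCount++ } else { $falseCount++ }
        }
    }
    finally {
        $sr.Dispose()   # release the file handle even if parsing throws
    }
    New-Object PSObject -Property @{ TrueCount = $trueCount; FalseCount = $falseCount }
}
```

It could be called as Get-ThresholdCount -Path .\items.txt. If the counts differ from the uncast version, the text comparison was affecting the result (for example, "1000" sorts before "256" as text).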

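【Comments】:

【Solution 3】:

Just adding another example: using StreamReader to read through a very large IIS log file and output all unique client IP addresses, plus some performance metrics.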

$path = 'A_245MB_IIS_Log_File.txt'
$r = [IO.File]::OpenText($path)

$clients = @{}

while ($r.Peek() -ge 0) {
    $line = $r.ReadLine()

    # String processing here...
    if (-not $line.StartsWith('#')) {
        $split = $line.Split()
        $client = $split[-5]
        if (-not $clients.ContainsKey($client)){
            $clients.Add($client, $null)
        }
    }
}

$r.Dispose()
$clients.Keys | Sort

Performance comparison against Get-Content:

StreamReader: Completed: 5.5 seconds, PowerShell.exe: 35,328 KB RAM.

Get-Content: Completed: 23.6 seconds, PowerShell.exe: 1,110,524 KB RAM.
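For anyone who wants to repeat this kind of comparison on their own file, here is a rough sketch that times Get-Content at a few ReadCount values against a StreamReader loop using Measure-Command. The function name Measure-ReadSpeed and the default ReadCount values are arbitrary choices for illustration, not recommendations. It measures elapsed time only, not memory, and the loop bodies just touch each line so that the read cost dominates.

```powershell
# Hypothetical helper: times several ways of reading every line of a file.
function Measure-ReadSpeed {
    param(
        [Parameter(Mandatory = $true)][string]$Path,
        [int[]]$ReadCounts = @(1, 1000, 10000)   # arbitrary values to try
    )
    $fullName = (Get-Item -LiteralPath $Path).FullName

    foreach ($rc in $ReadCounts) {
        $elapsed = Measure-Command {
            # With -ReadCount above 1, $_ is an array of lines, hence the inner loop.
            Get-Content -LiteralPath $fullName -ReadCount $rc |
                ForEach-Object { foreach ($s in $_) { $null = $s.Length } }
        }
        New-Object PSObject -Property @{
            Method  = "Get-Content -ReadCount $rc"
            Seconds = $elapsed.TotalSeconds
        }
    }

    $elapsed = Measure-Command {
        $sr = New-Object System.IO.StreamReader($fullName)
        try {
            while ($null -ne ($line = $sr.ReadLine())) { $null = $line.Length }
        }
        finally { $sr.Dispose() }
    }
    New-Object PSObject -Property @{
        Method  = 'StreamReader.ReadLine'
        Seconds = $elapsed.TotalSeconds
    }
}
```

Example call: Measure-ReadSpeed -Path .\items.txt | Sort-Object Seconds | Format-Table Method, Seconds -AutoSize. Timings vary with disk caching, so running it more than once gives a fairer picture.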

【Comments】: