Although the file does not have a fixed line length, I tried a binary search anyway.
First some considerations, then the code:
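Sometimes you need to extract the last n lines of a log file based on an ascending sort key at the start of each line. The key can really be anything, but in log files it typically represents a date and time, usually in the format YYMMDDHHNNSS (possibly with some punctuation in between).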
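Log files are typically text files consisting of many lines, at times millions of them. Log files often have fixed-length lines, in which case a specific key is easy to reach by binary search. Probably just as often, though, log files have variable-length lines. To access those, you can use an estimate of the average line width to calculate a position near the end of the file and then process sequentially from there to EOF.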
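The estimate-based approach described above could look roughly like the following. This is a minimal sketch, not part of the function presented later: the names `ReadTail`, `iLines` and `iAvgLineBytes` are hypothetical, and the average line width is a guess the caller has to supply.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Linq
Imports System.Text

Module TailByEstimate
    'Returns roughly the last iLines lines, assuming an average line width
    'of iAvgLineBytes bytes. Both parameters are assumptions of this sketch.
    Function ReadTail(sLogFile As String, iLines As Integer,
                      iAvgLineBytes As Integer) As String()
        Using oFS As New FileStream(sLogFile, FileMode.Open, FileAccess.Read,
                                    FileShare.ReadWrite)
            'Estimated start position, clamped to the start of the file.
            Dim lStart As Long = Math.Max(0L, oFS.Length - CLng(iLines) * iAvgLineBytes)
            oFS.Seek(lStart, SeekOrigin.Begin)
            Using oReader As New StreamReader(oFS, Encoding.UTF8)
                'The estimated position most likely falls into the middle of
                'a line, so the first (partial) line is discarded.
                If lStart > 0 Then oReader.ReadLine()
                Dim oLines As New List(Of String)
                Dim sLine As String = oReader.ReadLine()
                Do While sLine IsNot Nothing
                    oLines.Add(sLine)
                    sLine = oReader.ReadLine()
                Loop
                Return oLines.ToArray()
            End Using
        End Using
    End Function

    Sub Main()
        'Build a small sample file so the sketch can be run on its own.
        Dim sPath As String = Path.GetTempFileName()
        File.WriteAllLines(sPath, Enumerable.Range(1, 1000).Select(Function(n) "line " & n.ToString()))
        For Each sLine As String In ReadTail(sPath, 10, 12)
            Console.WriteLine(sLine)
        Next
        File.Delete(sPath)
    End Sub
End Module
```

Because the line width is only an estimate, the number of lines returned is approximate; the caller would still have to filter the result by key.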
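But a binary approach can be used for this type of file too, as shown here. The advantage shows as soon as file sizes grow. The maximum size of a log file is determined by the file system: in theory, NTFS allows 16 EiB (16 x 2^60 B); in practice, under Windows 8 or Server 2012, the limit is 256 TiB (256 x 2^40 B).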
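(What 256 TiB means in practice: a typical log file is designed to be read by humans and rarely exceeds 80 characters per line by much. Suppose your log file logs along happily and completely uninterrupted for an astonishing 12 years, a total of 4,383 days of 86,400 seconds each. Then your application can write 9 entries per millisecond into said log file and will only hit the 256 TiB limit in its 13th year.)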
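The arithmetic behind that estimate can be checked with a few lines. This is only a sanity check under the assumptions stated in the text (80 bytes per line, 9 entries per millisecond); the module and variable names are made up for illustration.

```vbnet
Imports System

Module LogSizeEstimate
    Sub Main()
        Const BytesPerLine As Long = 80      'Assumed line length from the text.
        Const EntriesPerMs As Long = 9       'Assumed write rate from the text.
        Const Days As Long = 4383            '12 years including leap days.

        Dim lLimit As Long = 256L << 40      '256 TiB in bytes.
        Dim lMilliseconds As Long = Days * 86400L * 1000L
        Dim lWritten As Long = lMilliseconds * EntriesPerMs * BytesPerLine

        'After 12 years the file is still below the limit.
        Console.WriteLine(lWritten < lLimit)     'Prints True.

        'Years until the limit is reached at this rate (roughly 12.4).
        Dim dYears As Double = lLimit / (EntriesPerMs * BytesPerLine) / 1000.0 / 86400.0 / 365.25
        Console.WriteLine(dYears)
    End Sub
End Module
```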
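The great advantage of the binary approach is that for a log file of 2^n bytes, about n comparisons are enough, so the benefit grows quickly with file size: a 1 KiB file requires 10 comparisons (1 per 102.4 B), a 1 MiB file only 20 (1 per 51.2 KiB), a 1 GiB file 30 (1 per roughly 34.1 MiB), and a 1 TiB file a mere 40 (1 per 25.6 GiB).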
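The comparison counts above follow from repeatedly halving the search window. A minimal sketch that reproduces them (the helper `ComparisonsFor` is hypothetical and counts halving steps only, not the forward scan for the next line break that each step also needs):

```vbnet
Imports System

Module ComparisonCount
    'Counts how often a window of lSize bytes can be halved until a single
    'byte remains.
    Function ComparisonsFor(lSize As Long) As Integer
        Dim iCount As Integer = 0
        Dim lWindow As Long = lSize
        Do While lWindow > 1
            lWindow = (lWindow + 1) \ 2   'Halve, rounding up.
            iCount += 1
        Loop
        Return iCount
    End Function

    Sub Main()
        '1 KiB, 1 MiB, 1 GiB and 1 TiB.
        For Each lSize As Long In {1L << 10, 1L << 20, 1L << 30, 1L << 40}
            Dim iComparisons As Integer = ComparisonsFor(lSize)
            Console.WriteLine("{0} bytes: {1} comparisons, one per {2:F1} bytes",
                              lSize, iComparisons, lSize / iComparisons)
        Next
    End Sub
End Module
```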
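On to the function. These assumptions are made: the log file is encoded in UTF-8, log lines are separated by a CR/LF sequence, and the timestamp sits at the start of each line in ascending order, probably in the format [YY]YYMMDDHHNNSS, possibly with some punctuation in between. (All of these assumptions could easily be changed and handled by overloaded functions.)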
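In the outer loop, the window is narrowed binary-wise by comparing against the provided earliest date-time to match. As soon as a new position in the stream has been determined, an independent forward search in the inner loop locates the next CR/LF sequence. The byte after this sequence marks the start of the record key to compare. If this key is greater than or equal to the key we are searching for, it is ignored. Only if the key found is smaller than the searched key is its position treated as a possible candidate for the record just before the ones we want. We end up with the last record carrying the largest key that is still smaller than the searched key.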
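Finally, all log records after the final candidate are returned to the caller as a string array.
The function requires System.IO to be imported (and System.Linq for SequenceEqual, which VB projects usually import by default).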
Imports System.IO
Imports System.Linq 'Needed for SequenceEqual.

'This function expects a log file which is organized in lines of varying
'lengths, delimited by CR/LF. At the start of each line is a sort criterion
'of any kind (in log files typically YYMMDD HHMMSS), by which the lines are
'sorted in ascending order (newest log line at the end of the file). The
'earliest match allowed to be returned must be provided. From this the sort
'key's length is inferred. The key does not necessarily have to exist in the
'file. If it does, it can occur multiple times, like any other sort key. The
'returned string array contains all lines that follow the last line whose
'key is smaller than the provided sort key.
'As a Shared function it has to be placed inside a class.
Public Shared Function ExtractLogLines(sLogFile As String,
                                       sEarliest As String) As String()
    Dim lMin, lPos, lMax As Long                'Examined stream window.
    Dim i As Long                               'Iterator to find CR/LF.
    Dim abEOL(0 To 1) As Byte                   'Bytes to find CR/LF.
    Dim abCRLF() As Byte = {13, 10}             'Search for CR/LF.
    Dim bFound As Boolean                       'CR/LF found.
    Dim iKeyLen As Integer = sEarliest.Length   'Length of sort key.
    Dim sActKey As String                       'Key of examined log record.
    Dim abKey() As Byte                         'Reading the current key.
    Dim lCandidate As Long = 0                  'File position of promising candidate.
    Dim bSkipFirst As Boolean = True            'Drop the first line of the tail.
    Dim sRecords As String                      'All wanted records.

    'The byte array accepting the records' keys is as long as the provided
    'key. This assumes that every key character is a single byte (ASCII).
    ReDim abKey(0 To iKeyLen - 1) '0-based!

    'Using closes the stream even if an exception is thrown.
    Using oFS As New FileStream(sLogFile, FileMode.Open, FileAccess.Read,
                                FileShare.Read) 'The log file as file stream.
        'We search the last log line whose sort key is smaller than the
        'sort key provided in sEarliest.
        lMin = 0 'Start at stream start.
        lMax = oFS.Length - 1 - 2 '0-based, and without terminal CRLF.
        Do
            lPos = (lMax - lMin) \ 2 + lMin 'Position to examine now.
            'Although the key to be compared with sEarliest is located
            'after lPos, it is important that lPos itself is not modified
            'when searching for the key.
            i = lPos 'Iterator for the CR/LF search.
            bFound = False
            Do While i < lMax
                oFS.Seek(i, SeekOrigin.Begin)
                oFS.Read(abEOL, 0, 2)
                If abEOL.SequenceEqual(abCRLF) Then 'CR/LF found.
                    bFound = True
                    Exit Do
                End If
                i += 1
            Loop
            If Not bFound Then
                'Between lPos and lMax no more CR/LF could be found. This
                'means that the search is over.
                Exit Do
            End If
            i += 2 'Skip CR/LF.
            oFS.Seek(i, SeekOrigin.Begin) 'Read the key after the CR/LF
            oFS.Read(abKey, 0, iKeyLen)   'into a string.
            sActKey = System.Text.Encoding.UTF8.GetString(abKey)
            'Compare the actual key with the earliest key. We want to find
            'the largest key just before the earliest key.
            If sActKey >= sEarliest Then
                'Not interested in this one, look for an earlier key.
                lMax = lPos
            Else
                'Possibly interesting, remember this.
                lCandidate = i
                lMin = lPos
            End If
        Loop While lMin < lMax - 1

        'lCandidate is the position of the last record whose key is smaller
        'than sEarliest; everything after that record is returned. If no
        'candidate was found (lCandidate is still 0), the first line has not
        'been examined yet, because the search only looks at keys following
        'a CR/LF. It is dropped only if its key is smaller than sEarliest.
        If lCandidate = 0 Then
            oFS.Seek(0, SeekOrigin.Begin)
            oFS.Read(abKey, 0, iKeyLen)
            bSkipFirst = System.Text.Encoding.UTF8.GetString(abKey) < sEarliest
        End If

        'Read from the candidate to the end of the file, including the final
        'CR/LF, so that skipping the candidate line below also works when
        'there are no entries to be returned (sEarliest being larger than
        'the last log line). Note that CInt limits the tail to 2 GiB.
        ReDim abKey(CInt(oFS.Length - lCandidate - 1)) '0-based.
        oFS.Seek(lCandidate, SeekOrigin.Begin)
        oFS.Read(abKey, 0, CInt(oFS.Length - lCandidate))
    End Using

    'Convert into a string, omit the candidate line if there is one, then
    'return as a string array split at CR/LF, without empty entries.
    sRecords = System.Text.Encoding.UTF8.GetString(abKey)
    If bSkipFirst Then
        sRecords = sRecords.Substring(sRecords.IndexOf(Chr(10)) + 1)
    End If
    Return sRecords.Split(ControlChars.CrLf.ToCharArray(),
                          StringSplitOptions.RemoveEmptyEntries)
End Function
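
To cross-check the binary search on small files, a plain linear scan can serve as a reference. The following is a minimal sketch, not part of the function above; `ExtractLogLinesLinear` and the sample data are made up for illustration, and the comparison is ordinal on the first characters of each line.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO

Module LinearReference
    'Returns all non-empty lines from the first line whose key is greater
    'than or equal to sEarliest up to the end of the file.
    Function ExtractLogLinesLinear(sLogFile As String,
                                   sEarliest As String) As String()
        Dim oResult As New List(Of String)
        For Each sLine As String In File.ReadLines(sLogFile)
            If sLine.Length = 0 Then Continue For
            'Once the first match is found, every following line is taken,
            'because the file is sorted in ascending order.
            If oResult.Count > 0 OrElse
               String.CompareOrdinal(sLine, 0, sEarliest, 0, sEarliest.Length) >= 0 Then
                oResult.Add(sLine)
            End If
        Next
        Return oResult.ToArray()
    End Function

    Sub Main()
        'Sample log with a date key at the start of each line.
        Dim sPath As String = Path.GetTempFileName()
        File.WriteAllLines(sPath, {"20240101 first", "20240102 second",
                                   "20240102 third", "20240103 fourth"})
        'Prints the lines "second", "third" and "fourth".
        For Each sLine As String In ExtractLogLinesLinear(sPath, "20240102")
            Console.WriteLine(sLine)
        Next
        File.Delete(sPath)
    End Sub
End Module
```

Calling ExtractLogLines with the same file and key should return the same lines, provided the file uses CR/LF line endings.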