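I've recently been tinkering with the problem of detecting the separator/delimiter of a CSV file. I came up with the following and hope it helps others; feedback for improving it is welcome.
My solution is based on several articles I read on the subject.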
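Since there is no restriction on what the field separator can be, I decided to start from the ASCII table and rule out the obvious non-candidates (alphanumeric characters) and the less obvious ones (non-printable characters), with the exception of the TAB code. With the remaining values I populate a dictionary in which the ASCII code is the key and the value is a counter that my code fills in.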
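To make the candidate set concrete, here is a minimal sketch that derives it from the rule just described instead of listing ranges by hand. The module name `CandidateDemo` and the helper `BuildCandidates` are hypothetical, and skipping the double quote is an assumption (it usually serves as a text qualifier rather than a separator).

```vbnet
Imports System
Imports System.Collections.Generic

Module CandidateDemo
    ' Hypothetical helper: returns ASCII code -> counter (initially 0)
    ' for every character that could plausibly be a separator.
    Function BuildCandidates() As Dictionary(Of Integer, Integer)
        Dim candidates As New Dictionary(Of Integer, Integer)
        candidates.Add(9, 0) ' TAB is the only non-printable character kept
        For code As Integer = 33 To 126 ' printable ASCII, space excluded
            Dim c As Char = Convert.ToChar(code)
            If Char.IsLetterOrDigit(c) Then Continue For ' alphanumerics are ruled out
            If c = """"c Then Continue For ' assumption: the double quote is a text qualifier
            candidates.Add(code, 0)
        Next
        Return candidates
    End Function

    Sub Main()
        Dim candidates = BuildCandidates()
        Console.WriteLine(candidates.Count)           ' prints 32
        Console.WriteLine(candidates.ContainsKey(59)) ' 59 is ";", prints True
    End Sub
End Module
```

Deriving the set from a rule makes it harder to leave out a common separator such as `;` by accident.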
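Then it is a matter of reading the CSV line by line, checking each line for occurrences of any of the dictionary's key characters, and adding the number of occurrences to that key's value. The loop runs until the end of the file or, in this example, until 100 lines have been read. You can change this limit as needed, but 100 lines are usually enough to detect the separator. The separator is then the dictionary key (ASCII code) with the largest value.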
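The counting pass can be tried out without a file. The following is only a sketch over in-memory lines: the module name `CountingDemo`, the helper `MostFrequent`, the sample data and the reduced candidate list are all made up for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module CountingDemo
    ' Hypothetical helper: tallies every candidate over at most 100 lines
    ' and returns the candidate with the highest total.
    Function MostFrequent(lines As IEnumerable(Of String),
                          candidates As IEnumerable(Of Char)) As Char
        Dim counts As New Dictionary(Of Char, Integer)
        For Each c In candidates
            counts(c) = 0
        Next
        For Each line In lines.Take(100)
            For Each ch In line
                If counts.ContainsKey(ch) Then counts(ch) += 1
            Next
        Next
        ' Keep the entry with the larger count; on a tie the earlier entry stays.
        Return counts.Aggregate(Function(l, r) If(l.Value >= r.Value, l, r)).Key
    End Function

    Sub Main()
        Dim sample = {"id;name;price", "1;Widget;9.99", "2;Gadget;12.50"}
        Dim candidates = {";"c, ","c, "."c, Convert.ToChar(9)}
        Console.WriteLine(MostFrequent(sample, candidates)) ' prints ;
    End Sub
End Module
```

In the sample, `;` occurs six times and `.` only twice, so `;` is reported.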
Sample calling routine
Private Sub Main()
    ' txtInputFile.Text supplies the path of the CSV file to inspect.
    Dim separator As Char = separatorDetect(txtInputFile.Text)
End Sub
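Main detection function
' Requires Imports System.IO and Imports System.Linq at the top of the file.
Private Function separatorDetect(ByVal StrFileName As String) As Char
    Const maxLines As Integer = 100 ' number of lines to sample

    ' Candidate separators: ASCII code -> number of occurrences found so far.
    Dim dictSeparators As New Dictionary(Of Integer, Integer)
    dictSeparators.Add(9, 0)  ' TAB
    dictSeparators.Add(33, 0) ' !
    ' 34 (double quote) is not a candidate; it usually serves as the text qualifier.
    For i As Integer = 35 To 47   ' # $ % & ' ( ) * + , - . /
        dictSeparators.Add(i, 0)
    Next
    For i As Integer = 58 To 64   ' : ; < = > ? @
        dictSeparators.Add(i, 0)
    Next
    For i As Integer = 91 To 96   ' [ \ ] ^ _ `
        dictSeparators.Add(i, 0)
    Next
    For i As Integer = 123 To 126 ' { | } ~
        dictSeparators.Add(i, 0)
    Next

    ' Snapshot of the keys, so the dictionary values can be updated inside the loop.
    Dim keyList As New List(Of Integer)(dictSeparators.Keys)
    Dim lineCounter As Integer = 0

    Using textReader As New StreamReader(StrFileName)
        Do Until textReader.EndOfStream OrElse lineCounter >= maxLines
            ' The line is used untrimmed so that leading/trailing TABs (empty fields) are counted.
            Dim line As String = textReader.ReadLine()
            For Each key In keyList
                dictSeparators(key) += InStrCount(line, Convert.ToChar(key))
            Next
            lineCounter += 1
        Loop
    End Using

    ' The key with the highest count wins; on a tie the later entry is kept.
    Dim max = dictSeparators.Aggregate(Function(l, r) If(l.Value > r.Value, l, r)).Key
    Return Convert.ToChar(max)
End Function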
Helper that counts occurrences with IndexOf
Private Function InStrCount(ByVal SourceString As String, ByVal SearchString As Char) As Integer
    Dim count As Integer = 0
    Dim pos As Integer = SourceString.IndexOf(SearchString)
    ' Jump from match to match until IndexOf reports no further occurrence.
    While pos > -1
        count += 1
        pos = SourceString.IndexOf(SearchString, pos + 1)
    End While
    Return count
End Function
This works for me, but I'm always happy to see better or more optimized ways of doing it.