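【Question title】: How to remove and count words from a text file?
【Posted】: 2013-02-15 13:44:30
【Question】:

I want to compute term frequency and inverse document frequency (TF-IDF) for the text files in a particular collection of files.

So in this case I just want to count the total number of words in a file and the number of occurrences of a specific word in the file, and remove words such as "a", "an", "the", etc.

Is there a parser for VB.NET?
Thanks in advance.

【Comments】:

  • Go through this tutorial and let me know if it helps.

Tags: vb.net information-retrieval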


【Solution 1】:

The simplest way I know is:

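Private Function CountWords(Filename As String) As Integer
    ' Split on spaces; RemoveEmptyEntries avoids counting empty strings produced by repeated spaces
    Return IO.File.ReadAllText(Filename).Split({" "c}, StringSplitOptions.RemoveEmptyEntries).Length
End Function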

If you want to remove words, you can:

Private Sub RemoveWords(Filename As String, DeleteWords As List(Of String))
    Dim AllWords() As String = IO.File.ReadAllText(Filename).Split(" "c)
    ' The query returns an IEnumerable, so ToArray() is needed (requires Imports System.Linq)
    Dim RemainingWords() As String = (From Word As String In AllWords
                                      Where DeleteWords.IndexOf(Word) = -1).ToArray()

    ' Do something with RemainingWords, e.g.:
    ' IO.File.WriteAllText(Filename, String.Join(vbNewLine, RemainingWords))
End Sub

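This assumes that the words are separated by single spaces. Words separated by line breaks or followed by punctuation are not split correctly, and the comparison against DeleteWords is case-sensitive.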
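
If the space-only assumption is too strict, one possible variation is to split on a wider set of separators. This is only a sketch: the helper name SplitIntoWords and the separator list are illustrative choices, not part of the answer above, and real text may need more separators.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module WordCountSketch
    ' Hypothetical helper: splits on whitespace and a few common punctuation marks.
    ' The separator list is an assumption; extend it for your own data.
    Function SplitIntoWords(text As String) As String()
        Dim separators() As Char = {" "c, ControlChars.Tab, ControlChars.Cr, ControlChars.Lf,
                                    "."c, ","c, ";"c, ":"c, "!"c, "?"c}
        ' RemoveEmptyEntries drops the empty strings created by adjacent separators
        Return text.Split(separators, StringSplitOptions.RemoveEmptyEntries)
    End Function

    Sub Main()
        Dim words() As String = SplitIntoWords("The cat sat." & vbCrLf & "The  dog ran, too!")
        Console.WriteLine("Total words: {0}", words.Length) ' prints "Total words: 7"
    End Sub
End Module
```

The total word count is then simply the length of the returned array, which is the denominator needed for term frequency.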

【Comments】:

    【Solution 2】:

    The simplest way to do this is to read the text file into a single string and then use the .NET Framework to look for a match:

    Dim text As String = File.ReadAllText("D:\Temp\MyFile.txt")
    Dim index As Integer = text.IndexOf("hello")
    If index >= 0 Then
       ' String is in file, starting at character "index"
    End If
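
IndexOf only reports the first match, while the question asks for a count. A minimal sketch of one way to count every occurrence is to keep calling IndexOf from just past the previous hit. The helper name CountOccurrences is hypothetical, and note that this matches substrings, so "hello" would also be counted inside a longer word.

```vbnet
Imports System

Module OccurrenceCountSketch
    ' Hypothetical helper: counts non-overlapping, case-insensitive occurrences of term in text.
    Function CountOccurrences(text As String, term As String) As Integer
        If String.IsNullOrEmpty(term) Then Return 0
        Dim count As Integer = 0
        Dim index As Integer = text.IndexOf(term, StringComparison.OrdinalIgnoreCase)
        While index >= 0
            count += 1
            ' Continue searching just past the match that was found
            index = text.IndexOf(term, index + term.Length, StringComparison.OrdinalIgnoreCase)
        End While
        Return count
    End Function

    Sub Main()
        Console.WriteLine(CountOccurrences("Hello world, hello again", "hello")) ' prints 2
    End Sub
End Module
```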
    

    Alternatively, as a second approach, you can use StreamReader and Regex (the snippet is in C#):

    // Read the file content with a StreamReader
    StreamReader txtReader = new StreamReader(fName);
    string szReadAll = txtReader.ReadToEnd(); // Reads the whole text file to the end
    txtReader.Close(); // Closes the text file after it is fully read
    txtReader = null;
    // Search for the word in the file content
    if (Regex.IsMatch(szReadAll, "SearchME", RegexOptions.IgnoreCase)) // True if a match is found in szReadAll
      MessageBox.Show("found");
    else
      MessageBox.Show("not found");
    

    That's it. I hope this answers your question. Regards

    【Comments】:

      【Solution 3】:

      Perhaps regular expressions will help you:

      Imports System.IO
      Imports System.Text.RegularExpressions
      
      ...
      
      Dim anyWordPattern As String = "\b\w+\b"
      Dim myWordPattern As String = "\bMyWord\b"
      Dim replacePattern As String = "\b(?<sw>a|an|the)\b"
      Dim content As String = File.ReadAllText(<file name>)
      Dim coll = Regex.Matches(content, anyWordPattern)
      Console.WriteLine("Total words: {0}", coll.Count)
      coll = Regex.Matches(content, myWordPattern, RegexOptions.Multiline Or RegexOptions.IgnoreCase)
      Console.WriteLine("My word occurrences: {0}", coll.Count)
      Dim replacedContent = Regex.Replace(content, replacePattern, String.Empty, RegexOptions.Multiline Or RegexOptions.IgnoreCase)
      Console.WriteLine("Replaced content: {0}", replacedContent)
      

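      Explanation of the regular expressions used:

      • \b - word boundary;
      • \w - any word character;
      • + - quantifier, one or more;
      • (?<sw>...) - named group called "sw" (stop word);
      • a|an|the - alternation: "a" or "an" or "the".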
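
For TF-IDF you usually need the count of every term rather than of one fixed word. As a rough sketch built on the same \b\w+\b pattern, the matches can be tallied into a dictionary while skipping stop words. The helper name CountTerms and the choice of case-insensitive comparison are assumptions made for illustration.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module TermFrequencySketch
    ' Hypothetical helper: counts how often each word occurs, skipping stop words.
    Function CountTerms(content As String, stopWords As HashSet(Of String)) As Dictionary(Of String, Integer)
        ' Case-insensitive keys, so "The" and "the" are counted as the same term
        Dim counts As New Dictionary(Of String, Integer)(StringComparer.OrdinalIgnoreCase)
        For Each m As Match In Regex.Matches(content, "\b\w+\b")
            If stopWords.Contains(m.Value) Then Continue For
            Dim current As Integer = 0
            counts.TryGetValue(m.Value, current) ' leaves current at 0 when the key is missing
            counts(m.Value) = current + 1
        Next
        Return counts
    End Function

    Sub Main()
        Dim stopWords As New HashSet(Of String)(New String() {"a", "an", "the"}, StringComparer.OrdinalIgnoreCase)
        Dim counts = CountTerms("The cat and the hat and a bat", stopWords)
        Console.WriteLine("and: {0}", counts("and"))            ' prints "and: 2"
        Console.WriteLine("distinct terms: {0}", counts.Count)  ' prints "distinct terms: 4"
    End Sub
End Module
```

Filtering during the count, instead of replacing stop words in the text first, avoids leaving doubled spaces behind in the content.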

      【Comments】:

        【Solution 4】:

        You could try something like this:

        Dim text As String = IO.File.ReadAllText("C:\file.txt")
        Dim wordsToSearch() As String = New String() {"Hello", "World", "foo"}
        Dim words As New List(Of String)()
        Dim findings As Dictionary(Of String, List(Of Integer))
        
        'Dividing into words'
        words.AddRange(text.Split(New String() {" ", Environment.NewLine()}, StringSplitOptions.RemoveEmptyEntries))
        'Discarding all the words you do not want'
        words.RemoveAll(New Predicate(Of String)(AddressOf WordsDeleter))
        
        findings = SearchWords(words, wordsToSearch)
        
        Console.WriteLine("Number of 'foo': " & findings("foo").Count)
        

        And the functions used:

        Private Function WordsDeleter(ByVal obj As String) As Boolean
            Dim wordsToDelete As New List(Of String)(New String() {"a", "an", "the"})
            Return wordsToDelete.Contains(obj.ToLower)
        End Function
        
        Private Function SearchWords(ByVal allWords As List(Of String), ByVal wordsToSearch() As String) As Dictionary(Of String, List(Of Integer))
            ' Maps each search word to the list of positions where it occurs in allWords
            Dim dResult As New Dictionary(Of String, List(Of Integer))()

            For Each s As String In wordsToSearch
                dResult.Add(s, New List(Of Integer))

                ' Restart the scan from the beginning for every search word
                Dim i As Integer = allWords.IndexOf(s)
                While i >= 0
                    dResult(s).Add(i)
                    ' A start index equal to Count is allowed and returns -1, which ends the loop
                    i = allWords.IndexOf(s, i + 1)
                End While
            Next

            Return dResult
        End Function
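
To connect word counts like these back to the TF-IDF goal in the question, here is a minimal sketch. It assumes one common variant of the formula, tf = (occurrences of the term) / (words in the document) and idf = Log(N / df), where N is the number of documents and df is the number of documents containing the term. Other weightings exist. The module and function names are hypothetical, and each document is assumed to be already split into words.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module TfIdfSketch
    ' Term frequency: occurrences of term divided by the number of words in the document
    Function TermFrequency(docWords As List(Of String), term As String) As Double
        If docWords.Count = 0 Then Return 0.0
        Dim hits As Integer = docWords.Where(
            Function(w) String.Equals(w, term, StringComparison.OrdinalIgnoreCase)).Count()
        Return hits / docWords.Count ' "/" is floating-point division in VB
    End Function

    ' Inverse document frequency: natural log of (number of documents / documents containing term)
    Function InverseDocumentFrequency(docs As List(Of List(Of String)), term As String) As Double
        Dim df As Integer = docs.Where(
            Function(d) d.Any(Function(w) String.Equals(w, term, StringComparison.OrdinalIgnoreCase))).Count()
        If df = 0 Then Return 0.0 ' assumption: a term that appears nowhere gets weight 0
        Return Math.Log(docs.Count / df)
    End Function

    Function TfIdf(docs As List(Of List(Of String)), docIndex As Integer, term As String) As Double
        Return TermFrequency(docs(docIndex), term) * InverseDocumentFrequency(docs, term)
    End Function

    Sub Main()
        Dim docs As New List(Of List(Of String)) From {
            New List(Of String) From {"cat", "sat"},
            New List(Of String) From {"dog", "sat"},
            New List(Of String) From {"cat", "cat", "ran"}
        }
        ' "cat" occurs 2 times in 3 words of the last document and appears in 2 of 3 documents
        Console.WriteLine("TF-IDF of 'cat' in document 2: {0}", TfIdf(docs, 2, "cat"))
    End Sub
End Module
```

With this variant, a term that appears in every document gets an idf of Log(1) = 0, so very common words are down-weighted even without an explicit stop-word list.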
        

        【Comments】:
