【Question title】: NSString tokenize in Objective-C
【Posted】: 2010-09-20 13:43:03
【Question】:

What is the best way to tokenize/split an NSString in Objective-C?

【Comments】:

    Tags: objective-c cocoa tokenize


    【Answer 1】:

    Found the answer:

    NSString *string = @"oop:ack:bork:greeble:ponies";
    NSArray *chunks = [string componentsSeparatedByString: @":"];
    

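    【Comments】:

    • For the benefit of future readers, I'd like to note that the opposite operation is [anArray componentsJoinedByString:@":"];
    • Thanks, but how do I split an NSString that is separated by several different tokens? (If you know what I mean; my English isn't very good.) @Adam
    • @Adam, I think what you want is componentsSeparatedByCharactersInSet:. See the answer below.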
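The split from this answer and the join mentioned in the comments are inverses of each other. A minimal sketch, assuming Foundation is available; `SplitOnColon` and `JoinWithColon` are hypothetical helper names, not part of the thread:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: splits the input at every ':' character.
static NSArray *SplitOnColon(NSString *input) {
    return [input componentsSeparatedByString:@":"];
}

// Hypothetical helper: the inverse operation, joining the chunks with ':'.
static NSString *JoinWithColon(NSArray *chunks) {
    return [chunks componentsJoinedByString:@":"];
}
```

Passing the result of `SplitOnColon` straight to `JoinWithColon` should give back the original string, because no chunk can contain the separator.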
    【Answer 2】:

    If you just want to split a string, use -[NSString componentsSeparatedByString:]. For more complex tokenization, use the NSScanner class.
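To give an idea of what NSScanner adds over a plain split, here is a minimal sketch that reads `name = integer` pairs separated by semicolons. The input format and the helper name `ParsePairs` are assumptions made up for illustration; the scanner relies on its default behaviour of skipping whitespace before each scan.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: parses input such as "width = 42; height = 7"
// into a dictionary mapping each name to an NSNumber.
static NSDictionary *ParsePairs(NSString *input) {
    NSScanner *scanner = [NSScanner scannerWithString:input];
    NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    NSMutableDictionary *pairs = [NSMutableDictionary dictionary];

    while (![scanner isAtEnd]) {
        NSString *name = nil;
        NSInteger value = 0;
        // Read the name up to '=', consume the '=', then read the integer.
        BOOL ok = [scanner scanUpToString:@"=" intoString:&name]
               && [scanner scanString:@"=" intoString:NULL]
               && [scanner scanInteger:&value];
        if (!ok) {
            break; // malformed input: stop instead of looping forever
        }
        // scanUpToString: keeps trailing whitespace, so trim the name.
        name = [name stringByTrimmingCharactersInSet:whitespace];
        [pairs setObject:[NSNumber numberWithInteger:value] forKey:name];
        // The separator is optional after the last pair.
        [scanner scanString:@";" intoString:NULL];
    }
    return pairs;
}
```

Unlike componentsSeparatedByString:, the scanner can mix different delimiters and typed values in one pass, which is where it starts to pay off.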

    【Comments】:

      【Answer 3】:

      If your tokenization needs are more complex, check out my open-source Cocoa string tokenizing/parsing toolkit, ParseKit:

      http://parsekit.com

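      For simple splitting of strings using a delimiter char (like ':'), ParseKit would definitely be overkill. But again, for complex tokenization needs, ParseKit is extremely powerful and flexible.

      See also the ParseKit Tokenization documentation.

      【Comments】:

      • Does this still work? I tried it out and got a few errors that I'm hesitant to try to fix myself.
      • Hm? Alive? The ParseKit project is actively maintained, yes. However, comments here are not the right place to file bugs against the project. If you need to file bugs, it is on both Google Code and Github.
      • Sounds great, but now I can't remove my downvote until you edit the answer somehow (rules of the site). Perhaps you could note which versions it works on, or whether it uses ARC, etc.? Or you could just add a space somewhere, it's up to you :)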
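      【Answer 4】:

      Everyone has mentioned componentsSeparatedByString:, but you can also use CFStringTokenizer (remember that NSString and CFString are interchangeable), which also tokenizes natural languages (such as Chinese/Japanese, which do not split words on spaces).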
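A minimal sketch of word tokenization with CFStringTokenizer, assuming an Apple platform where CoreFoundation provides it. The helper name `WordTokens` is hypothetical, and passing NULL for the locale is a simplifying assumption; a specific CFLocale may give better results for a given language.

```objective-c
#import <Foundation/Foundation.h>
#import <CoreFoundation/CoreFoundation.h>

// Hypothetical helper: returns the word tokens CFStringTokenizer finds in text.
static NSArray *WordTokens(NSString *text) {
    NSMutableArray *tokens = [NSMutableArray array];
    // NSString and CFString are toll-free bridged, so a cast is enough.
    CFStringRef cfText = (__bridge CFStringRef)text;
    CFStringTokenizerRef tokenizer =
        CFStringTokenizerCreate(kCFAllocatorDefault,
                                cfText,
                                CFRangeMake(0, CFStringGetLength(cfText)),
                                kCFStringTokenizerUnitWord,
                                NULL); // assumption: no explicit locale

    // Advance until the tokenizer reports that no token is left.
    while (CFStringTokenizerAdvanceToNextToken(tokenizer) != kCFStringTokenizerTokenNone) {
        CFRange range = CFStringTokenizerGetCurrentTokenRange(tokenizer);
        [tokens addObject:[text substringWithRange:NSMakeRange(range.location, range.length)]];
    }
    CFRelease(tokenizer); // Create rule: we own the tokenizer
    return tokens;
}
```

The tokenizer works on ranges rather than substrings, so the original string is only copied when a token is actually extracted.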

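      【Comments】:

      【Answer 5】:

      If you want to tokenize on multiple characters, you can use NSString's componentsSeparatedByCharactersInSet:. NSCharacterSet has some handy ready-made sets such as whitespaceCharacterSet and illegalCharacterSet, and it has initializers for Unicode ranges.

      You can also combine character sets and use them to tokenize, like this: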

      // Tokenize sSourceEntityName on both whitespace and punctuation.
      NSMutableCharacterSet *mcharsetWhitePunc = [[NSCharacterSet whitespaceAndNewlineCharacterSet] mutableCopy];
      [mcharsetWhitePunc formUnionWithCharacterSet:[NSCharacterSet punctuationCharacterSet]];
      NSArray *sarrTokenizedName = [self.sSourceEntityName componentsSeparatedByCharactersInSet:mcharsetWhitePunc];
      [mcharsetWhitePunc release];
      

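      Be aware that componentsSeparatedByCharactersInSet: produces empty strings whenever it encounters more than one member of the character set in a row, so you may want to test for lengths less than 1.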
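One way to handle the empty strings is to filter them out right after splitting. A minimal sketch; `NonEmptyComponents` is a hypothetical helper name, not part of the answer's code:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: splits input on any character in separators and
// drops the empty strings produced by adjacent separator characters.
static NSArray *NonEmptyComponents(NSString *input, NSCharacterSet *separators) {
    NSMutableArray *tokens = [NSMutableArray array];
    for (NSString *piece in [input componentsSeparatedByCharactersInSet:separators]) {
        if ([piece length] > 0) {
            [tokens addObject:piece];
        }
    }
    return tokens;
}
```

For example, splitting "a, b" on whitespace and punctuation gives "a", "" and "b", because the comma and the space are adjacent separators; the helper keeps only "a" and "b".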

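      【Comments】:

      • This does not address languages in which whitespace does not separate all the logical tokens. Poor solution.
      • @uchuugaka In that case you would use a different character set, or sets, to tokenize with. I'm just using a specific example to illustrate a general concept.
      【Answer 6】:

      I had a case where I had to split console output after an LDAP query with ldapsearch. First, set up and execute the NSTask (I found a good code sample here: Execute a terminal command from a Cocoa app). But then I had to split and parse the output in order to extract only the print-server names from the LDAP query output. Unfortunately this is rather tedious string manipulation, which would be no problem at all if we were manipulating C strings/arrays with simple C array operations. So here is my code using Cocoa objects. If you have better suggestions, let me know.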

      //as the ldap query has to be done when the user selects one of our Active Directory Domains
      //(an according comboBox should be populated with print-server names we discover from AD)
      //my code is placed in the onSelectDomain event code
      
      //the following variables are declared in the interface .h file as globals
      @protected NSArray* aDomains;//domain combo list array
      @protected NSMutableArray* aPrinters;//printer combo list array
      @protected NSMutableArray* aPrintServers;//print server combo list array
      
      @protected NSString* sLdapQueryCommand;//for LDAP Queries
      @protected NSArray* aLdapQueryArgs;
      @protected NSTask* tskLdapTask;
      @protected NSPipe* pipeLdapTask;
      @protected NSFileHandle* fhLdapTask;
      @protected NSMutableData* mdLdapTask;
      
      IBOutlet NSComboBox* comboDomain;
      IBOutlet NSComboBox* comboPrinter;
      IBOutlet NSComboBox* comboPrintServer;
      //end of interface globals
      
      //after collecting the print-server names they are displayed in an according drop-down comboBox
      //as soon as the user selects one of the print-servers, we should start a new query to find all the
      //print-queues on that server and display them in the comboPrinter drop-down list
      //to find the shares/print queues of a windows print-server you need samba and the net -S command like this:
      // net -S yourPrintServerName.yourBaseDomain.com -U yourLdapUser%yourLdapUserPassWord -W adm rpc share -l
      //which displays a long list of the shares
      
      - (IBAction)onSelectDomain:(id)sender
      {
          static int indexOfLastItem = 0; //unfortunately we need to compare this because we are called also if the selection did not change!
      
          if ([comboDomain indexOfSelectedItem] != indexOfLastItem && ([comboDomain indexOfSelectedItem] != 0))
          {
      
              indexOfLastItem = [comboDomain indexOfSelectedItem]; //retain this index for next call
      
          //the print-servers-list has to be loaded on a per university or domain basis from a file dynamically or from AN LDAP-QUERY
      
          //initialize an LDAP-Query-Task or console-command like this one with console output
          /*
      
           ldapsearch -LLL -s sub -D "cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com" -h "yourLdapServer.com" -p 3268 -w "yourLdapUserPassWord" -b "dc=yourBaseDomainToSearchIn,dc=com" "(&(objectcategory=computer)(cn=ps*))" "dn"
      
      //our print-server names start with ps* and we want the dn as result, which comes like this:
      
           dn: CN=PSyourPrintServerName,CN=Computers,DC=yourBaseDomainToSearchIn,DC=com
      
           */
      
          sLdapQueryCommand = [[NSString alloc] initWithString: @"/usr/bin/ldapsearch"];
      
      
          if ([[comboDomain stringValue] compare: @"firstDomain"] == NSOrderedSame) {
      
            aLdapQueryArgs = [NSArray arrayWithObjects: @"-LLL",@"-s", @"sub",@"-D", @"cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com",@"-h", @"yourLdapServer.com",@"-p",@"3268",@"-w",@"yourLdapUserPassWord",@"-b",@"dc=yourFirstDomainToSearchIn,dc=com",@"(&(objectcategory=computer)(cn=ps*))",@"dn",nil];
          }
          else {
            aLdapQueryArgs = [NSArray arrayWithObjects: @"-LLL",@"-s", @"sub",@"-D", @"cn=yourLdapUser,ou=yourOuWithLdapUserAccount,dc=yourDomain,dc=com",@"-h", @"yourLdapServer.com",@"-p",@"3268",@"-w",@"yourLdapUserPassWord",@"-b",@"dc=yourSecondDomainToSearchIn,dc=com",@"(&(objectcategory=computer)(cn=ps*))",@"dn",nil];
      
          }
      
      
          //prepare and execute ldap-query task
      
          tskLdapTask = [[NSTask alloc] init];
          pipeLdapTask = [[NSPipe alloc] init];//instead of [NSPipe pipe]
          [tskLdapTask setStandardOutput: pipeLdapTask];//hope to get the tasks output in this file/pipe
      
          //The magic line that keeps your log where it belongs, has to do with NSLog (see https://stackoverflow.com/questions/412562/execute-a-terminal-command-from-a-cocoa-app and here http://www.cocoadev.com/index.pl?NSTask )
          [tskLdapTask setStandardInput:[NSPipe pipe]];
      
          //fhLdapTask  = [[NSFileHandle alloc] init];//would be redundant here, next line seems to do the trick also
          fhLdapTask = [pipeLdapTask fileHandleForReading];
          mdLdapTask  = [NSMutableData dataWithCapacity:512];//prepare capturing the pipe buffer which is flushed on read and can overflow, start with 512 Bytes but it is mutable, so grows dynamically later
          [tskLdapTask setLaunchPath: sLdapQueryCommand];
          [tskLdapTask setArguments: aLdapQueryArgs];
      
      #ifdef bDoDebug
          NSLog (@"sLdapQueryCommand: %@\n", sLdapQueryCommand);
          NSLog (@"aLdapQueryArgs: %@\n", aLdapQueryArgs );
          NSLog (@"tskLdapTask: %@\n", [tskLdapTask arguments]);
      #endif
      
          [tskLdapTask launch];
      
          while ([tskLdapTask isRunning]) {
            [mdLdapTask appendData: [fhLdapTask readDataToEndOfFile]];
          }
          [tskLdapTask waitUntilExit];//might be redundant here.
      
          [mdLdapTask appendData: [fhLdapTask readDataToEndOfFile]];//add another read for safety after process/command stops
      
          NSString* sLdapOutput = [[NSString alloc] initWithData: mdLdapTask encoding: NSUTF8StringEncoding];//convert output to something readable, as NSData and NSMutableData are mere byte buffers
      
      #ifdef bDoDebug
          NSLog(@"LdapQueryOutput: %@\n", sLdapOutput);
      #endif
      
          //Ok now we have the printservers from Active Directory, lets parse the output and show the list to the user in its combo box
          //output is formatted as this, one printserver per line
          //dn: CN=PSyourPrintServer,OU=Computers,DC=yourBaseDomainToSearchIn,DC=com
      
          //so we have to search for "dn: CN=" to retrieve each printserver's name
          //unfortunately splitting this up will give us a first line containing only "" empty string, which we can replace with the word "choose"
          //appearing as first entry in the comboBox
      
          aPrintServers = [[sLdapOutput componentsSeparatedByString:@"dn: CN="] mutableCopy];//split output at each "dn: CN=" marker; take a mutable copy, because componentsSeparatedByString: returns an immutable NSArray and a cast does not make it mutable
      
      #ifdef bDoDebug
          NSLog(@"aPrintServers: %@\n", aPrintServers);
      #endif
      
          if ([[aPrintServers objectAtIndex: 0 ] compare: @"" options: NSLiteralSearch] == NSOrderedSame){
            [aPrintServers replaceObjectAtIndex: 0 withObject: slChoose];//replace with localized string "choose"
      
      #ifdef bDoDebug
            NSLog(@"aPrintServers: %@\n", aPrintServers);
      #endif
      
          }
      
      //Now comes the tedious part to extract only the print-server-names from the single lines
          NSRange r;
          NSString* sTemp;
      
          for (int i = 1; i < [aPrintServers count]; i++) {//skip first line with "choose". To get rid of the rest of the line, we must isolate/preserve the print server's name to the delimiting comma and remove all the remaining characters
            sTemp = [aPrintServers objectAtIndex: i];
            sTemp = [sTemp stringByTrimmingCharactersInSet: [NSCharacterSet whitespaceAndNewlineCharacterSet]];//remove newlines and line feeds
      
      #ifdef bDoDebug
            NSLog(@"sTemp: %@\n", sTemp);
      #endif
            r = [sTemp rangeOfString: @","];//now find first comma to remove the whole rest of the line
            //r.length = [sTemp lengthOfBytesUsingEncoding:NSUTF8StringEncoding];
            if (r.location != NSNotFound) {//guard: without a comma the range would be invalid and the replace below would raise
              r.length = [sTemp length] - r.location;//number of chars from the first comma found to the end of the string
      #ifdef bDoDebug
              NSLog(@"range: %lu, %lu\n", (unsigned long)r.location, (unsigned long)r.length);
      #endif

              sTemp = [sTemp stringByReplacingCharactersInRange:r withString: @"" ];//remove rest of line
            }
      #ifdef bDoDebug
            NSLog(@"sTemp after replace: %@\n", sTemp);
      #endif
      
            [aPrintServers replaceObjectAtIndex: i withObject: sTemp];//put back string into array for display in comboBox
      
      #ifdef bDoDebug
            NSLog(@"aPrintServer: %@\n", [aPrintServers objectAtIndex: i]);
      #endif
      
          }
      
          [comboPrintServer removeAllItems];//reset combo box
          [comboPrintServer addItemsWithObjectValues:aPrintServers];
          [comboPrintServer setNumberOfVisibleItems:aPrintServers.count];
          [comboPrintServer selectItemAtIndex:0];
      
      #ifdef bDoDebug
          NSLog(@"comboPrintServer reloaded with new values.");
      #endif
      
      
      //release memory we used for LdapTask
      //only the objects created with alloc/init are owned here and must be released
          [sLdapQueryCommand release];
          [sLdapOutput release];

          [pipeLdapTask release];
          [tskLdapTask release];

      //aLdapQueryArgs, fhLdapTask, mdLdapTask and sTemp come from convenience methods, so they are autoreleased; releasing them here would over-release them
      
          }
      }
      

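      【Comments】:

        【Answer 7】:

        I have run into cases myself where just separating a string by components was not enough; there were many tasks, such as
        1) categorizing tokens into types
        2) adding new tokens
        3) splitting the string between custom closures, such as all the words between "{" and "}"
        For any such requirement, I found ParseKit to be a lifesaver.

        I used it to parse .PGN (Portable Game Notation) files successfully; it is very fast and lightweight.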

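        【Comments】:

          【Answer 8】:

          If you want to tokenize a string into search terms while preserving "quoted phrases", here is an NSString category that respects various types of quote pairs: "" '' ‘’ “”

          Usage: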

          NSArray *terms = [@"This is my \"search phrase\" I want to split" searchTerms];
          // results in: ["This", "is", "my", "search phrase", "I", "want", "to", "split"]
          

          Code:

          @interface NSString (Search)
          - (NSArray *)searchTerms;
          @end
          
          @implementation NSString (Search)
          
          - (NSArray *)searchTerms {
          
              // Strip whitespace and setup scanner
              NSCharacterSet *whitespace = [NSCharacterSet whitespaceAndNewlineCharacterSet];
              NSString *searchString = [self stringByTrimmingCharactersInSet:whitespace];
              NSScanner *scanner = [NSScanner scannerWithString:searchString];
              [scanner setCharactersToBeSkipped:nil]; // we'll handle whitespace ourselves
          
              // A few types of quote pairs to check
              NSDictionary *quotePairs = @{@"\"": @"\"",
                                           @"'": @"'",
                                           @"\u2018": @"\u2019",
                                           @"\u201C": @"\u201D"};
          
              // Scan
              NSMutableArray *results = [[NSMutableArray alloc] init];
              NSString *substring = nil;
              while (scanner.scanLocation < searchString.length) {
                  // Check for quote at the scan location (index into the trimmed string the scanner reads, not into self)
                  unichar unicharacter = [searchString characterAtIndex:scanner.scanLocation];
                  NSString *startQuote = [NSString stringWithFormat:@"%C", unicharacter];
                  NSString *endQuote = [quotePairs objectForKey:startQuote];
                  if (endQuote != nil) { // if it's a valid start quote we'll have an end quote
                      // Scan quoted phrase into substring (skipping start & end quotes)
                      [scanner scanString:startQuote intoString:nil];
                      [scanner scanUpToString:endQuote intoString:&substring];
                      [scanner scanString:endQuote intoString:nil];
                  } else {
                      // Single word that is non-quoted
                      [scanner scanUpToCharactersFromSet:whitespace intoString:&substring];
                  }
                  // Process and add the substring to results
                  if (substring) {
                      substring = [substring stringByTrimmingCharactersInSet:whitespace];
                      if (substring.length) [results addObject:substring];
                  }
                  // Skip to next word
                  [scanner scanCharactersFromSet:whitespace intoString:nil];
              }
          
              // Return non-mutable array
              return results.copy;
          
          }
          
          @end
          

          【Comments】:

            【Answer 9】:

            If you want to split a string by linguistic features (words, paragraphs, characters, sentences and lines), use string enumeration:

            NSString * string = @" \n word1!    word2,%$?'/word3.word4   ";
            
            [string enumerateSubstringsInRange:NSMakeRange(0, string.length)
                                       options:NSStringEnumerationByWords
                                    usingBlock:
             ^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
                 NSLog(@"Substring: '%@'", substring);
             }];
            
             // Logs:
             // Substring: 'word1'
             // Substring: 'word2'
             // Substring: 'word3'
             // Substring: 'word4' 
            

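            This API works with other languages in which spaces are not always the delimiter (e.g. Japanese). Likewise, using NSStringEnumerationByComposedCharacterSequences is the proper way to enumerate characters, because many characters, non-Western ones in particular, occupy more than one unichar (UTF-16 code unit) in an NSString.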
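To illustrate enumerating by composed character sequences, here is a minimal sketch that collects the user-visible characters of a string. `ComposedCharacters` is a hypothetical helper name, and the snippet assumes a compiler with blocks support:

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: returns one NSString per composed character sequence,
// so a base letter plus its combining mark stays together as one element.
static NSArray *ComposedCharacters(NSString *input) {
    NSMutableArray *characters = [NSMutableArray array];
    [input enumerateSubstringsInRange:NSMakeRange(0, [input length])
                              options:NSStringEnumerationByComposedCharacterSequences
                           usingBlock:
     ^(NSString *substring, NSRange substringRange, NSRange enclosingRange, BOOL *stop) {
         [characters addObject:substring];
     }];
    return characters;
}
```

For a string such as @"e\u0301x" ("e" followed by a combining acute accent, then "x"), -length reports 3 UTF-16 code units, while the helper returns 2 elements.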

            【Comments】:
