Grouping words based on hierarchy in R

【Question title】: Grouping words based on hierarchy in R
【Posted】: 2013-09-09 20:20:03
【Question】:

I would like to derive a hierarchy from my vector of words, as in this example:

# Start (in reality these will not be right next to each other)

words <- c("hello-world", "hello", "string", "sub-string", "custom-fields", 
           "custom", "hi-hat", "hat") 

# Result

highlevel <- c("hello-world", "sub-string", "custom-fields", "hi-hat")
lowerlevel <- c("hello", "string", "custom", "hat")

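In practice I will be dealing with large data, so I am looking for an efficient way to group the words. If possible, I would also like the two levels to be linked in some way. The goal is to search for the high-level words first and fall back to the lower-level words only when no high-level word is found.

Ideas?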

【Comments】:

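Are "high-level" words defined as words containing a dash? If so: grep('-', words, value=TRUE), or g=grep('-', words); hl=words[g]; ll=words[-g].
In the current case, I think the criterion would be "-", "." (literal, not regex) or digits.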
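
The "literal, not regex" remark matters because an unescaped `.` in a regular expression matches any character. A small sketch of the difference, using a made-up vector `x` that is not part of the question:

```r
# Hypothetical sample data, for illustration only
x <- c("hello", "hello.world", "v2")

any_char    <- grepl(".", x)                # regex: "." matches any character
literal_dot <- grepl(".", x, fixed = TRUE)  # literal dot only
char_class  <- grepl("[-.[:digit:]]", x)    # inside [], a leading "-" and "." are literal

any_char     # TRUE  TRUE  TRUE
literal_dot  # FALSE TRUE  FALSE
char_class   # FALSE TRUE  TRUE
```

Putting the characters inside a bracket expression, as in the last call, is one way to test for all three criteria at once without escaping the dot.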

Tags: string r hierarchy

【Solution 1】:

g <- grep('[-.[:digit:]]', words) # indices of words containing "-", "." or a digit

highlevel <- words[g]
lowlevel <- words[-g] # caution: yields character(0) if g is empty (no matches)
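
The question also asks for the two levels to be linked and for high-level words to be searched first. One possible way to do that is sketched below. It assumes a high-level word is linked to a low-level word when the latter appears as one of its parts after splitting on "-", "." or digits. That rule, and the names `is_high`, `links` and `find_word`, are illustrative assumptions and not part of the answer above. Using `grepl` rather than `grep` also keeps `lowlevel` correct when nothing matches.

```r
words <- c("hello-world", "hello", "string", "sub-string", "custom-fields",
           "custom", "hi-hat", "hat")

# Logical mask instead of indices: safe even when nothing matches
is_high   <- grepl("[-.[:digit:]]", words)
highlevel <- words[is_high]
lowlevel  <- words[!is_high]

# Assumed linking rule: split each high-level word on "-", "." or digits
# and keep the parts that also occur among the low-level words
parts <- strsplit(highlevel, "[-.[:digit:]]+")
links <- lapply(parts, intersect, lowlevel)
names(links) <- highlevel

# Hypothetical lookup: try the high-level words first, then the low-level ones
find_word <- function(query) {
  if (query %in% highlevel) return(list(level = "high", linked = links[[query]]))
  if (query %in% lowlevel)  return(list(level = "low",  linked = character(0)))
  NULL  # not found at either level
}

links[["hello-world"]]   # "hello"
find_word("hat")$level   # "low"
```

Whether this scales well enough for very large vectors would need to be measured on the real data; the sketch only shows the shape of the approach.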

【Discussion】: