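Basically, you first need to split the block of text into sentences. That is tricky enough on its own, even in English, since you have to watch for periods, question marks, exclamation marks and any other sentence terminators.
Then process one sentence at a time, after removing all punctuation (commas, semicolons, colons and so on).
Once you're left with an array of words, it gets simpler: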
    for i = 1 to num_words-1:
        for j = i+1 to num_words:
            phrase = words[i through j inclusive]
            store phrase
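That's it, pretty simple (after the initial massaging of the text block, which may not be as simple as you think).
This gives you every phrase of two or more words in each sentence.
Splitting sentences, splitting words, removing punctuation and so on will be the hardest part, but I've shown you some simple initial rules. Add the rest each time a block of text breaks the algorithm.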
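For what it's worth, those initial rules could be sketched roughly as follows. This is only one possible approach, not a complete solution: it assumes the JDK's `java.text.BreakIterator` is good enough for sentence detection and that replacing every `\p{Punct}` character with a space is acceptable. The class name `SentencePrep` and the helpers `splitSentences` and `toWords` are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentencePrep {
    // Split a block of text into sentences using the JDK's
    // locale-aware sentence BreakIterator (assumed English here).
    static List<String> splitSentences(String block) {
        List<String> sentences = new ArrayList<>();
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(block);
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String s = block.substring(start, end).trim();
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Replace ASCII punctuation with spaces, then split on whitespace.
    static String[] toWords(String sentence) {
        String cleaned = sentence.replaceAll("\\p{Punct}+", " ").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        String block =
            "My username is click upvote. Do I have 4k rep? Yes, I do!";
        for (String sentence : splitSentences(block)) {
            // Print the words of each sentence separated by '|'.
            System.out.println(String.join("|", toWords(sentence)));
        }
    }
}
```

Note that treating all punctuation as a separator also splits words like "don't" in two, and abbreviations such as "e.g." may still confuse the sentence splitter, so expect to add special cases as you run into them.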
Update:
As requested, here's some Java code that produces the phrases:
    public class testme {
        public final static String text =
            "My username is click upvote." +
            " I have 4k rep on stackoverflow.";

        public static void procSentence (String sent) {
            System.out.println ("==========");
            System.out.println ("sentence [" + sent + "]");

            // Split sentence at whitespace into array.
            String [] sa = sent.split("\\s+");

            // Process each starting word.
            for (int i = 0; i < sa.length - 1; i++) {
                // Process each phrase (ending word).
                for (int j = i+1; j < sa.length; j++) {
                    // Build the phrase from words i through j.
                    String phrase = sa[i];
                    for (int k = i+1; k <= j; k++) {
                        phrase = phrase + " " + sa[k];
                    }
                    // This is where you have your phrase. I just
                    // print it out but you can do whatever you
                    // wish with it.
                    System.out.println (" " + phrase);
                }
            }
        }

        public static void main(String[] args) {
            // This is the block of text to process.
            String block = text;
            System.out.println ("block [" + block + "]");

            // Keep going until no more sentences.
            while (!block.equals("")) {
                // Remove leading spaces.
                if (block.startsWith(" ")) {
                    block = block.substring(1);
                    continue;
                }

                // Find end of sentence.
                int pos = block.indexOf('.');

                // Extract sentence and remove it from text block. If no
                // period is left, the remainder is the last sentence.
                String sentence = (pos < 0) ? block : block.substring(0,pos);
                block = (pos < 0) ? "" : block.substring(pos+1);

                // Process the sentence (this is the "meat").
                procSentence (sentence);
                System.out.println ("block [" + block + "]");
            }
            System.out.println ("==========");
        }
    }
which outputs:
    block [My username is click upvote. I have 4k rep on stackoverflow.]
    ==========
    sentence [My username is click upvote]
     My username
     My username is
     My username is click
     My username is click upvote
     username is
     username is click
     username is click upvote
     is click
     is click upvote
     click upvote
    block [ I have 4k rep on stackoverflow.]
    ==========
    sentence [I have 4k rep on stackoverflow]
     I have
     I have 4k
     I have 4k rep
     I have 4k rep on
     I have 4k rep on stackoverflow
     have 4k
     have 4k rep
     have 4k rep on
     have 4k rep on stackoverflow
     4k rep
     4k rep on
     4k rep on stackoverflow
     rep on
     rep on stackoverflow
     on stackoverflow
    block []
    ==========
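Now, keep in mind this is pretty basic Java (some might say it's C written in a Java dialect :-). It's only meant to illustrate how to output the word groupings from a sentence, as you asked.
It does not do all the fancy sentence detection and punctuation removal that I mentioned in the original answer.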