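【Posted】: 2016-06-26 05:52:37
【Question】:
I have a large number of sentences (10,000) in a file, one sentence per line. Across the whole set, I want to find which words occur together in a sentence, and how often.
Example sentences: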
"Proposal 201 has been accepted by the Chief today.",
"Proposal 214 and 221 are accepted, as per recent Chief decision",
"This proposal has been accepted by the Chief.",
"Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
"Proposal 214, ValueMania, has been accepted by the Chief."};
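I want to write code that produces the following output. I should be able to pass three starting words as arguments to the program: "Chief, accepted, Proposal"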
Chief accepted Proposal 5
Chief accepted Proposal has 3
Chief accepted Proposal has been 3
...
...
for all combinations.
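I know the number of combinations can be large.
I have searched online but found nothing. I have written some code, but I can't get my head around the problem. Maybe someone who knows this domain can help.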
ReadFileLinesIntoArray rf = new ReadFileLinesIntoArray();
try {
    String[] tmp = rf.readFromFile("c:/scripts/SelectedSentences.txt");
    for (String t : tmp) {
        String[] keys = t.split(" ");
        String[] uniqueKeys;
        int count = 0;
        System.out.println(t);
        uniqueKeys = getUniqueKeys(keys);
        for (String key : uniqueKeys) {
            if (null == key) {
                break;
            }
            for (String s : keys) {
                if (key.equals(s)) {
                    count++;
                }
            }
            System.out.println("Count of [" + key + "] is : " + count);
            count = 0;
        }
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
private static String[] getUniqueKeys(String[] keys) {
    String[] uniqueKeys = new String[keys.length];
    uniqueKeys[0] = keys[0];
    int uniqueKeyIndex = 1;
    boolean keyAlreadyExists = false;
    for (int i = 1; i < keys.length; i++) {
        for (int j = 0; j <= uniqueKeyIndex; j++) {
            if (keys[i].equals(uniqueKeys[j])) {
                keyAlreadyExists = true;
            }
        }
        if (!keyAlreadyExists) {
            uniqueKeys[uniqueKeyIndex] = keys[i];
            uniqueKeyIndex++;
        }
        keyAlreadyExists = false;
    }
    return uniqueKeys;
}
Can someone help with the code?
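One reading that appears consistent with the expected numbers above is "count the sentences that contain every given word, ignoring case and punctuation". The following is only a minimal sketch under that assumption, not a confirmed solution. The class name `SentenceCooccurrence`, the methods `wordsOf`, `countContainingAll` and `extendByOne`, and the `SAMPLE` list are all hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class SentenceCooccurrence {

    // The five example sentences from the question.
    static final List<String> SAMPLE = Arrays.asList(
        "Proposal 201 has been accepted by the Chief today.",
        "Proposal 214 and 221 are accepted, as per recent Chief decision",
        "This proposal has been accepted by the Chief.",
        "Both proposal 3 MazerNo and patch 4 have been accepted by the Chief.",
        "Proposal 214, ValueMania, has been accepted by the Chief.");

    // Assumption: words are compared case-insensitively and punctuation is dropped,
    // so "Chief." and "chief" are treated as the same word.
    static Set<String> wordsOf(String sentence) {
        Set<String> words = new HashSet<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Number of sentences that contain every word in 'wanted'.
    static int countContainingAll(List<String> sentences, List<String> wanted) {
        Set<String> needed = wordsOf(String.join(" ", wanted));
        int count = 0;
        for (String s : sentences) {
            if (wordsOf(s).containsAll(needed)) {
                count++;
            }
        }
        return count;
    }

    // For each extra word, the number of sentences containing the seed words plus that word.
    static Map<String, Integer> extendByOne(List<String> sentences, List<String> seed) {
        Set<String> needed = wordsOf(String.join(" ", seed));
        Map<String, Integer> counts = new TreeMap<>();
        for (String s : sentences) {
            Set<String> words = wordsOf(s);
            if (!words.containsAll(needed)) {
                continue;
            }
            for (String w : words) {
                if (!needed.contains(w)) {
                    counts.merge(w, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> seed = Arrays.asList("Chief", "accepted", "Proposal");
        System.out.println(String.join(" ", seed) + " " + countContainingAll(SAMPLE, seed));
        for (Map.Entry<String, Integer> e : extendByOne(SAMPLE, seed).entrySet()) {
            System.out.println(String.join(" ", seed) + " " + e.getKey() + " " + e.getValue());
        }
    }
}
```

Growing the combination one word at a time, and only from sentences that already match, avoids enumerating every possible word combination up front. Longer combinations can be explored by calling `extendByOne` again with the extended seed.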
【Comments】:
-
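Yes, you will end up with a very large set of permutations. You could use a Map (such as a TreeMap), storing each unique string as a key and its count as the value. Alternatively, you could create your own small data structure to hold the name/value information.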
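A minimal sketch of the Map idea from this comment, assuming plain whitespace splitting as in the question's code. The class name `WordCounts` and the method `countWords` are hypothetical. Here the key is a single word, but the same pattern would apply if the key were a joined word combination.

```java
import java.util.Map;
import java.util.TreeMap;

class WordCounts {

    // Counts how often each whitespace-separated token occurs in one sentence.
    // TreeMap keeps the keys unique and sorted; merge() inserts 1 or adds 1.
    static Map<String, Integer> countWords(String sentence) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : sentence.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only for a blank sentence
            }
            counts.merge(word, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String sentence = "This proposal has been accepted by the Chief.";
        for (Map.Entry<String, Integer> e : countWords(sentence).entrySet()) {
            System.out.println("Count of [" + e.getKey() + "] is : " + e.getValue());
        }
    }
}
```

This replaces both the `getUniqueKeys` helper and the nested counting loop with a single pass over the tokens.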
-
What does an output of 3 mean for Chief, accepted, Proposal? Does it mean that 3 sentences contain all three of these words? Does case matter?
-
Sorry, "Chief accepted Proposal" should be 5 and "Chief accepted Proposal has" should be 3... will edit.
-
@JonathanGrey: Why does "Chief accepted Proposal" have the value 5?
-
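Perhaps a question you should consider: what do you intend to do with these permutations? Depending on that, do you really need to generate all of them? Given a set of dimensions, could you build a graph with explicit sink/top nodes and operate on that, with edges representing the number of occurrences?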
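A rough sketch of the graph idea, assuming nodes are lower-cased words and an edge weight is the number of sentences in which both words appear. The explicit sink/top nodes mentioned in the comment are not modelled here. The class name `CooccurrenceGraph` and its methods `addSentence` and `weight` are hypothetical.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class CooccurrenceGraph {

    // Adjacency map: word -> (other word -> number of sentences containing both).
    private final Map<String, Map<String, Integer>> edges = new TreeMap<>();

    // Adds one sentence; every pair of distinct words in it gets its edge weight increased by 1.
    void addSentence(String sentence) {
        Set<String> words = new TreeSet<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        for (String a : words) {
            for (String b : words) {
                if (!a.equals(b)) {
                    edges.computeIfAbsent(a, k -> new TreeMap<>()).merge(b, 1, Integer::sum);
                }
            }
        }
    }

    // Number of sentences in which both words occur, or 0 if they never co-occur.
    int weight(String a, String b) {
        Map<String, Integer> out = edges.get(a.toLowerCase(Locale.ROOT));
        if (out == null) {
            return 0;
        }
        return out.getOrDefault(b.toLowerCase(Locale.ROOT), 0);
    }

    public static void main(String[] args) {
        CooccurrenceGraph g = new CooccurrenceGraph();
        g.addSentence("Proposal 201 has been accepted by the Chief today.");
        g.addSentence("This proposal has been accepted by the Chief.");
        System.out.println("chief-accepted: " + g.weight("Chief", "accepted"));
    }
}
```

Pairwise edge weights only give an upper bound for combinations of three or more words, so exact counts for longer combinations would still need a pass over the matching sentences.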
Tags: java hashmap linkedhashmap