【Question title】: Remove repeated content Java [closed]
【Posted】: 2020-06-26 18:47:04
【Question description】:

I have this text and I need to filter out the repeated lines and words. I don't know whether there is a better way than what I am doing.

00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:00,413|03:50:25,600|ISDB|PERFEITAMENTE. EU
00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A?
00:00:04,357|00:00:05,259|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,259|00:00:05,414|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT?QUE
00:00:07,914|00:00:07,916|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT?QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
00:00:08,997|00:00:09,178|ISDB|PAREDE, PARECE AT?QUE EU GOSTO

I am using the code below, which puts the lines in a HashSet so that they are not repeated.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.util.HashSet;
import java.util.Set;

public class Testecc {
   public static void main(String[] args) throws Exception {
      String filePath = "C://teste//teste1.txt";
      // HashSet used to eliminate duplicate messages
      Set<String> set = new HashSet<>();
      // try-with-resources closes the reader and the writer (which creates the output file)
      try (BufferedReader br = new BufferedReader(new FileReader(filePath));
           FileWriter writer = new FileWriter("C://teste//teste.txt")) {
         String line;
         // Read every line exactly once
         while ((line = br.readLine()) != null) {
            String line1 = line.substring(0, 31); // timestamps and "ISDB|"
            String line2 = line.substring(31);    // message text
            System.out.println(line);
            // add() returns false when the message was already seen
            if (set.add(line2)) {
               writer.append(line1 + line2 + "\n");
            }
         }
      }
      System.out.println("Pronto!");
   }
}

This way I removed the duplicate lines:

00:00:01,135|00:00:01,315|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:01,315|00:00:02,218|ISDB|BOBAS PARA
00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS
00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV�S
00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A�.
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A�. ELES
00:00:05,414|00:00:05,775|ISDB|COLOCARAM AS FOTOS
00:00:05,775|00:00:06,677|ISDB|COLOCARAM AS FOTOS COMO
00:00:06,677|00:00:06,858|ISDB|COLOCARAM AS FOTOS COMO PAPEL
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:06,858|03:50:32,400|ISDB|PAREDE, PARECE AT� QUE
00:00:07,914|00:00:07,916|ISDB|PAREDE, PARECE AT� QUE EU
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT� QUE EU GOSTO

But I also need to remove the repeated words.

I am really out of ideas.

How can I do this?

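【Comments】:

  • What is the rule for deciding which "duplicate" line to keep? I don't see any obvious rule.
  • What exactly does "but I also need to remove the repeated words" mean? Do you want to keep "COLOCARAM AS FOTOS COMO PAPEL DE" rather than "COLOCARAM AS FOTOS", "COLOCARAM AS FOTOS COMO", ...?
  • Please explain what you mean by "repeated words" and how you need to handle the lines that contain them.
  • Also, please use English comments; it helps convey your intent.
  • I only need to keep the last line, for example: 00:00:01,315|00:00:02,218|ISDB|BOBAS PARA 00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS 00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO 00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INVÉS

Tags: java java.util.scanner bufferedreader hashset filewriter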


【Solution 1】:

You can use the part of each log line after the last pipe as a key, then insert every line into a LinkedHashMap to remove the duplicates:

String filePath = "C:/log.txt";
Map<String, String> logMap = new LinkedHashMap<>();
try (BufferedReader br = new BufferedReader(new FileReader(filePath))) {
    String input;
    // readLine() is called once per iteration, so no line is skipped
    while ((input = br.readLine()) != null) {
        // Key: everything after the last pipe
        String key = input.replaceAll("^.*\\|", "");
        logMap.put(key, input);
    }
}

// Now print out the map minus duplicates
for (String line : logMap.values()) {
    System.out.println(line);
}

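Instead of printing to the console, you could just as easily write the filtered log to another file. Note that this approach keeps the last line of each set of duplicates, at the position where its message first appeared.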
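
For completeness, here is a minimal, self-contained sketch of the same idea that writes the result to a file. The class name `LogDedup`, the method `dedupeByMessage`, and the command-line paths are made up for illustration. Reading the file as ISO-8859-1 is also only an assumption, chosen because that charset accepts any byte sequence and the sample shows encoding problems.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LogDedup {

    // Keeps one line per distinct message (the text after the last '|').
    // A later duplicate replaces the stored line, but the key keeps the
    // position of its first insertion in the LinkedHashMap.
    public static List<String> dedupeByMessage(List<String> lines) {
        Map<String, String> byMessage = new LinkedHashMap<>();
        for (String line : lines) {
            String message = line.substring(line.lastIndexOf('|') + 1);
            byMessage.put(message, line);
        }
        return new ArrayList<>(byMessage.values());
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: java LogDedup <input> <output>");
            return;
        }
        // Hypothetical input and output paths taken from the command line.
        Path in = Path.of(args[0]);
        Path out = Path.of(args[1]);
        // Assumption: ISO-8859-1, so that malformed bytes do not abort the read.
        List<String> lines = Files.readAllLines(in, StandardCharsets.ISO_8859_1);
        Files.write(out, dedupeByMessage(lines), StandardCharsets.ISO_8859_1);
    }
}
```

This only removes exact repeats of a message. Lines such as "BOBAS PARA" and "BOBAS PARA AMIGOS" are different keys and both survive.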

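【Comments】:

  • Same result: 00:00:00,413|03:50:25,600|ISDB|PERFEITAMENTE. EU 00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS 00:00:01,315|00:00:02,218|ISDB|BOBAS PARA 00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS 00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO 00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV�S 00:00:03,274|00:00:04,357|ISDB|DISSO TROUXERAM ISSO A�. 00:00:06,677|00:00:06,858|ISDB|DISSO TROUXERAM ISSO A.. ELES
【Solution 2】:

Use a map that holds the lines grouped by a specific key. The key is the beginning of the message, starting at the word you are interested in, for example its first 5 letters (the code below uses the first word). Then add the lines to the map, and whenever a line is longer than the one previously stored for its key, replace it.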

// filepath is the path of the log file
try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {

  final Map<String, String> map = new LinkedHashMap<>();

  br.lines().forEach(line -> {
        // Message: everything after the last pipe
        String message = line.substring(line.lastIndexOf("|") + 1);
        if (message.isEmpty()) {
          return;
        }
        // Key: the first word of the message
        String key = message.split(" ")[0];
        if (map.get(key) == null) {
          map.put(key, line);
        } else if (map.get(key).length() < line.length()) {
          // A longer line replaces the stored one; remove + put also
          // moves the entry to the end of the LinkedHashMap
          map.remove(key);
          map.put(key, line);
        }
      }
  );

  map.forEach((k, v) -> System.out.println(v));
}

The code above gives you the following output.

00:00:00,413|03:50:25,600|ISDB|>> FALAM QUE A GENTE COMBINA
00:00:01,135|00:00:01,315|ISDB|PERFEITAMENTE. EU PEDI REVISTAS
00:00:02,759|00:00:03,274|ISDB|BOBAS PARA AMIGOS E AO INV?
00:00:04,357|00:00:05,259|ISDB|DISSO TROUXERAM ISSO A? ELES
00:00:06,858|03:50:32,400|ISDB|COLOCARAM AS FOTOS COMO PAPEL DE
00:00:07,914|00:00:08,997|ISDB|PAREDE, PARECE AT?QUE EU GOSTO
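
Keying on the first word works for this sample, but two unrelated captions that start with the same word would collide. As a possible alternative (a rough sketch, not part of the answer above), a message can be dropped only when another message extends it by at least one more word. The class name `CaptionRollup` and the method `keepLongestVersions` are invented for this example, and the nested loop is quadratic, which I am assuming is acceptable for a file of this size.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CaptionRollup {

    // Returns the lines whose message is not a word-boundary prefix of
    // another, longer message. For each message the first line carrying
    // it is kept, and the original order is preserved.
    public static List<String> keepLongestVersions(List<String> lines) {
        // message -> first line carrying it, in first-seen order
        Map<String, String> firstLine = new LinkedHashMap<>();
        for (String line : lines) {
            String message = line.substring(line.lastIndexOf('|') + 1).trim();
            if (!message.isEmpty()) {
                firstLine.putIfAbsent(message, line);
            }
        }

        List<String> result = new ArrayList<>();
        for (String message : firstLine.keySet()) {
            boolean extended = false;
            for (String other : firstLine.keySet()) {
                // "BOBAS PARA" is extended by "BOBAS PARA AMIGOS"
                if (other.length() > message.length()
                        && other.startsWith(message)
                        && other.charAt(message.length()) == ' ') {
                    extended = true;
                    break;
                }
            }
            if (!extended) {
                result.add(firstLine.get(message));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "00:00:01,315|00:00:02,218|ISDB|BOBAS PARA",
                "00:00:02,218|00:00:02,398|ISDB|BOBAS PARA AMIGOS",
                "00:00:02,398|00:00:02,759|ISDB|BOBAS PARA AMIGOS E AO");
        keepLongestVersions(sample).forEach(System.out::println);
    }
}
```

Unlike the longest-line-per-first-word rule, this keeps the first line on which the final version of a caption appears, so the timestamps in the result can differ from the output shown above.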

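【Comments】:

  • Your map-handling logic is redundant; you don't need to check for null before inserting. Just do a put (or see my answer). Also, a hash map does not maintain insertion order, so your answer may print the log lines in the wrong order.
  • @TimBiegeleisen You're right, I changed it to LinkedHashMap. Thanks for the tip!
  • I guess we're almost there. But I got this error: "Exception in thread "main" java.lang.StringIndexOutOfBoundsException: begin 31, end 36, length 33"
  • @RodrigoMarros Check the updated code.
  • @Eugene Yes, now it works. Everything is fine. Thank you very much. There is a lot to learn and study.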
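
The exception quoted in the comments is what a fixed-offset call such as `substring(31, 36)` throws when a line is only 33 characters long, for example a line whose message is just "EU". The earlier revision of the answer is not shown here, so that cause is a guess. A bounds-safe way to take "the first 5 letters" could look like this sketch, where `SafeKey` and `keyOf` are hypothetical names:

```java
public class SafeKey {

    // Returns at most the first n characters of the message after the last '|'.
    public static String keyOf(String line, int n) {
        String message = line.substring(line.lastIndexOf('|') + 1);
        // Math.min keeps the end index inside the string for short messages
        return message.substring(0, Math.min(n, message.length()));
    }

    public static void main(String[] args) {
        // 33 characters long: line.substring(31, 36) would throw here
        String shortLine = "00:00:00,413|03:50:25,600|ISDB|EU";
        String longLine = "00:00:01,315|00:00:02,218|ISDB|BOBAS PARA";
        System.out.println(keyOf(shortLine, 5)); // prints "EU"
        System.out.println(keyOf(longLine, 5));  // prints "BOBAS"
    }
}
```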