【Question title】: java spark word matches between two strings
【Posted】: 2017-04-02 07:24:07
【Question description】:

I want to find out, using Spark (Java API), whether two different long strings have any words in common.

String string1 = "car bike bus ..."; // about 100 words
String string2 = "boat plane car ..."; // about 100 words

How can I do this?

I wrote one approach, but I don't think it is efficient (too many iterations):

List<String> a1 = new ArrayList<>();
List<String> a2 = new ArrayList<>();

a1.add("car");
a1.add("boat");
a1.add("bike");

a2.add("car");
a2.add("nada");
a2.add("otro");


JavaRDD<String> rdd = jsc.parallelize(a1);
JavaRDD<String> counts = rdd.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String s) throws Exception {
        Boolean occurrence = false;
        for(int i=0; i<a2.size(); i++) {
            if(StringUtils.containsIgnoreCase(s, a2.get(i))) {
                System.out.println("encontrado");
                occurrence = true;
                break;
            }
        }
        return occurrence;
    }
});
System.out.println(counts.count());

【Question discussion】:

    Tags: java apache-spark parallel-processing sparkapi


    【Solution 1】:

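    You can compute the set intersection directly: Dataset provides an intersect method, and RDD provides the equivalent intersection method. Below is an example using Spark 2.0, Java, and Dataset.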

    import java.util.Arrays;
    import java.util.List;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.SparkSession;

    public class SparkIntersection {
        public static void main(String[] args) {
            // SparkSession
            SparkSession spark = SparkSession
                    .builder()
                    .appName("SparkIntersection")
                    .config("spark.sql.warehouse.dir", "/file:C:/temp")
                    .master("local[*]")
                    .getOrCreate();
            // Lists of words to compare
            List<String> data1 = Arrays.asList("one","two","three","four","five");
            List<String> data2 = Arrays.asList("one","six","three","nine","ten");
            // One Dataset per list
            Dataset<String> ds1 = spark.createDataset(data1, Encoders.STRING());
            Dataset<String> ds2 = spark.createDataset(data2, Encoders.STRING());
            // Intersect: keeps only the values present in both Datasets
            Dataset<String> ds = ds1.intersect(ds2);
            ds.show();
            // Stop the session
            spark.stop();
        }
    }
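
The example above starts from ready-made word lists, while the question starts from two whitespace-separated strings. As a rough sketch of that missing step, the plain-Java helper below splits each string into words and intersects the two sets. The class name `WordOverlap` and its methods are hypothetical and not part of the original answer, and lower-casing the words is an assumption based on the question's use of `containsIgnoreCase`. For inputs of about 100 words each this may well be enough on its own; the word collections it builds could also be passed to `spark.createDataset` as shown above.

```java
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the original answer.
public class WordOverlap {

    // Splits the text on whitespace and returns its lower-cased words as a sorted set.
    static Set<String> words(String text) {
        Set<String> result = new TreeSet<>();
        for (String w : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    // Returns the words that occur in both strings (whole-word, case-insensitive match).
    static Set<String> commonWords(String s1, String s2) {
        Set<String> common = words(s1);
        common.retainAll(words(s2)); // keep only words also present in s2
        return common;
    }

    public static void main(String[] args) {
        String string1 = "car bike bus";
        String string2 = "boat plane car";
        System.out.println(commonWords(string1, string2)); // prints [car]
    }
}
```

Unlike the `containsIgnoreCase` loop in the question, this compares whole words, so "car" would not match "scar", and each set lookup avoids rescanning the second list for every word.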
    

    【Discussion】:
