【问题标题】:Sorting disordered after joining in Spark RDD加入Spark RDD后排序无序
【发布时间】:2018-06-11 12:42:59
【问题描述】:

我正在尝试从评分数据集中获取观看次数最多的电影,并将电影数据集中的相应电影名称映射到公共电影 ID。当我加入我已经排序的前 10 名观看次数最多的电影 id 时,未按最终结果排序。我还尝试了 sortbykey(false) ,但它不起作用。

JavaRDD<String> movies = sc.textFile("in/ml-1m/ratings.dat");
     // System.out.println(getmovieidanduserid());
     JavaRDD<String> movieid = movies.flatMap(line -> Arrays.asList(line.split("::")[1]).iterator());

 JavaPairRDD<String, Integer> moviemost = movieid.mapToPair(id -> new Tuple2<>(id, 1));

   JavaPairRDD<String, Integer> moviemostlist = moviemost.reduceByKey((x, y) -> x + y);

 JavaPairRDD<Integer, String> countToWordParis = moviemostlist.mapToPair(wordToCount -> new Tuple2<>(wordToCount._2(),
                    wordToCount._1()));
JavaPairRDD<Integer, String> sortedCountToWordParis = countToWordParis.sortByKey(false);

JavaPairRDD<String, Integer> sortedWordToCountPairs = sortedCountToWordParis
                .mapToPair(countToWord -> new Tuple2<>(countToWord._2(), countToWord._1()));

 JavaPairRDD<String, Integer> mostwatched = sc.parallelizePairs(sortedWordToCountPairs.take(10));

System.out.println(sortedWordToCountPairs.take(10));

 for( Tuple2<String, Integer>  mlost : mostwatched.collect())
 {
     System.out.println(mlost._1() + " : " + mlost._2());
    }

 JavaRDD<String> moviesname = sc.textFile("in/ml-1m/movies.dat");

JavaPairRDD<String, String> moviesiduser=moviesname.mapToPair(getPairFunction());
JavaPairRDD<String,Tuple2<Integer,String>> joindata=mostwatched.join(moviesiduser);
System.out.println("-------top movies----------");
System.out.println(joindata.take(10));



for (Tuple2<String, Tuple2<Integer, String>> wordToCount : joindata.collect()) {
System.out.println(wordToCount._1() + " : " + wordToCount._2());
}
    }   


    private static PairFunction<String, String, String> getmovieidanduserid() {
        return (PairFunction<String, String, String>) line -> new Tuple2<>(line.split("::")[1],
                                                                           line.split("::")[0]);
    }

    private static PairFunction<String, String, String> getPairFunction() {
        return (PairFunction<String, String, String>) line -> new Tuple2<>(line.split("::")[0],
                                                                           line.split("::")[1]);
    }

前 10 部电影 ID 和观看次数

2858 : 3428
260 : 2991
1196 : 2990
1210 : 2883
480 : 2672
2028 : 2653
589 : 2649
2571 : 2590
1270 : 2583
593 : 2578

与电影名称映射后

593 : (2578,Silence of the Lambs, The (1991))
589 : (2649,Terminator 2: Judgment Day (1991))
480 : (2672,Jurassic Park (1993))
2858 : (3428,American Beauty (1999))
260 : (2991,Star Wars: Episode IV - A New Hope (1977))
2571 : (2590,Matrix, The (1999))
2028 : (2653,Saving Private Ryan (1998))
1270 : (2583,Back to the Future (1985))
1210 : (2883,Star Wars: Episode VI - Return of the Jedi (1983))
1196 : (2990,Star Wars: Episode V - The Empire Strikes Back (1980))

美国美女最受关注,但排在第 4 行

【问题讨论】:

    标签: java apache-spark rdd


    【解决方案1】:

    在spark(其实几乎所有的类SQL查询引擎)中,join操作并不能保证顺序的维护。加入后需要排序。

    【讨论】:

      猜你喜欢
      • 2015-05-04
      • 2016-05-07
      • 1970-01-01
      • 2016-08-26
      • 2020-05-23
      • 2015-06-19
      • 2021-06-04
      • 1970-01-01
      • 2020-06-28
      相关资源
      最近更新 更多