【Posted】: 2016-01-22 22:07:02
【Question】:
(u'142578', (u'The-North-side-9890', (u' 12457896', 45.0)))
(u'124578', (u'The-West-side-9091', (u' 14578217', 0.0)))
This is what I got from joining the two RDDs based on their IDs. It has the form (key, (value_left, value_right)), as produced by a Spark join.
So I want output like this:
The-North-side-9890,12457896,45.0
The-West-side-9091,14578217,0.0
To get this, I tried the following code:
from pyspark import SparkContext
sc = SparkContext("local", "info")
file1 = sc.textFile('/home/hduser/join/part-00000').map(lambda line: line.split(','))
result = file1.map(lambda x: (x[1]+', '+x[2],float(x[3][:-3]))).reduceByKey(lambda a,b:a+b)
result = result.map(lambda x:x[0]+','+str(x[1]))
result = result.map(lambda x: x.lstrip('[(').rstrip(')]')).coalesce(1).saveAsTextFile("hdfs://localhost:9000/finalop")
But it gives me the following output:
(u'The-North-side-9896', (u' 12457896',0.0
(u'The-East-side-9876', (u' 47125479',0.0
So I want to clean this up. How can I do that?
Please help me achieve this.
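One possible approach, sketched under assumptions: the attempt above reads the saved text file, where each line is the printed form of a tuple, so split(',') returns fragments such as "(u'142578'" instead of clean fields. If the joined RDD is still available in the job, it may be simpler to flatten each (key, (value_left, value_right)) record before saving. The names `to_csv_line` and `joined` below are hypothetical and do not come from the original code.

```python
def to_csv_line(record):
    """Flatten (key, (name, (ident, amount))) into 'name,ident,amount'."""
    _key, (name, (ident, amount)) = record
    # strip() removes the leading space seen in values such as u' 12457896'
    return ",".join([name.strip(), ident.strip(), str(amount)])


# The two sample records from the question, used here as plain Python data.
records = [
    (u'142578', (u'The-North-side-9890', (u' 12457896', 45.0))),
    (u'124578', (u'The-West-side-9091', (u' 14578217', 0.0))),
]

lines = [to_csv_line(r) for r in records]
for line in lines:
    print(line)

# With Spark, the same function could be mapped over the joined RDD
# (assumed here to be called `joined`) before writing it out:
#   joined.map(to_csv_line).coalesce(1).saveAsTextFile("hdfs://localhost:9000/finalop")
```

Because the function works on the tuple itself and not on its printed form, there is no u'...' prefix or parenthesis to remove afterwards.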
【Comments】:
-
Can you improve this example?
-
See stackoverflow.com/questions/34198439/… to learn more about the question.
-
How can we remove the ( ) u' and u" characters to get clean output?
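If only the saved text file is available, each line is the printed form of a Python tuple, so one option is to parse it back into a tuple instead of stripping characters by hand. The sketch below uses `ast.literal_eval` from the standard library for that. It assumes every line is a complete tuple literal like the samples in the question, and `clean_saved_line` is a hypothetical helper name.

```python
import ast


def clean_saved_line(line):
    """Parse one saved tuple line and return 'name,ident,amount'."""
    # literal_eval safely evaluates Python literals (tuples, strings, numbers).
    _key, (name, (ident, amount)) = ast.literal_eval(line.strip())
    return "{},{},{}".format(name.strip(), ident.strip(), amount)


# One line as it appears in the question's join output.
saved = "(u'142578', (u'The-North-side-9890', (u' 12457896', 45.0)))"
cleaned = clean_saved_line(saved)
print(cleaned)

# In a Spark job this could replace the split(',') step, for example:
#   sc.textFile('/home/hduser/join/part-00000').map(clean_saved_line)
```

This avoids chaining lstrip/rstrip calls, which only remove characters at the very start or end of a string and leave the inner quotes and parentheses in place.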
Tags: python apache-spark