How to convert numeric string to int in a RDD of string words and numbers? (Answers)

【Question title】: How to convert numeric string to int in a RDD of string words and numbers?
【Posted】: 2020-10-24 02:35:48
【Question description】:

So I have an RDD containing words and numbers as strings, which I have already split and stripped of punctuation and whitespace:

['Hi', 'today', 'is', 'a', 'great', 'day', 'to', 'gather', 'flowers', 'lets', 'collect', '50', 'Roses', '400', 'Tulips', 'and', '20', 'Sunflowers', 'today']

I want to count the occurrences of each distinct word and sort them alphabetically and numerically, so that the output looks like this:

(20, 1)
(50, 1)
(400, 1)
('Hi', 1)
('today', 2)

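I tried using sortBy, but I suspect that because the numbers are strings, they are compared character by character starting with the first digit, so 400 comes before 50. How can I solve this?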
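The suspicion above can be checked in plain Python, without Spark. This is only a small illustration; the `tokens` list is a hypothetical subset of the data in the question.

```python
# Numeric strings taken from the question's example data.
tokens = ['50', '400', '20']

# Default string ordering is lexicographic: '4' < '5', so '400' sorts before '50'.
as_strings = sorted(tokens)

# Sorting with int as the key compares the numeric values instead.
as_numbers = sorted(tokens, key=int)

print(as_strings)  # ['20', '400', '50']
print(as_numbers)  # ['20', '50', '400']
```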

Tags: apache-spark pyspark rdd

【Solution 1】:

You need to split the RDD into two parts, reduce and sort each part separately, and then merge the results:

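import re
# Tokens made up only of digits: convert the key to int so it sorts numerically.
numbers = (rdd.filter(lambda l: re.match('^[0-9]+$', l))
              .map(lambda l: (int(l), 1))
              .reduceByKey(lambda a,b: a+b)
              .sortByKey())
# Everything else: keep the key as a string and sort it alphabetically.
text = (rdd.filter(lambda l: not re.match('^[0-9]+$', l))
           .map(lambda l: (l, 1))
           .reduceByKey(lambda a,b: a+b)
           .sortByKey())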

Then merge the two:

numbers.union(text).collect()

[(20, 1),
 (50, 1),
 (400, 1),
 ('Hi', 1),
 ('Roses', 1),
 ('Sunflowers', 1),
 ('Tulips', 1),
 ('a', 1),
 ('and', 1),
 ('collect', 1),
 ('day', 1),
 ('flowers', 1),
 ('gather', 1),
 ('great', 1),
 ('is', 1),
 ('lets', 1),
 ('to', 1),
 ('today', 2)]

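The split is needed because you cannot sort a single flat RDD whose keys require different comparisons: int keys and str keys cannot be compared with each other.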
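The limitation can be reproduced in plain Python, and one possible alternative to splitting is a composite sort key that never compares an int with a str directly. The sketch below is not part of the original answer: `mixed_key` is a hypothetical helper, and a `Counter` stands in for the `map`/`reduceByKey` step so that the example runs without Spark.

```python
import re
from collections import Counter

words = ['Hi', 'today', 'is', 'a', 'great', 'day', 'to', 'gather', 'flowers',
         'lets', 'collect', '50', 'Roses', '400', 'Tulips', 'and', '20',
         'Sunflowers', 'today']

# Python 3 refuses to order int against str, which is why mixed keys fail.
try:
    sorted([20, 'Hi'])
    mixed_ok = True
except TypeError:
    mixed_ok = False

def mixed_key(pair):
    """Hypothetical helper: numeric strings first (by value), then words."""
    word = pair[0]
    if re.fullmatch(r'[0-9]+', word):
        return (0, int(word), '')   # group 0: compare by numeric value
    return (1, 0, word)             # group 1: compare alphabetically

# Counter stands in for map(lambda w: (w, 1)).reduceByKey(...) here.
counts = Counter(words)
ordered = sorted(counts.items(), key=mixed_key)

# Convert numeric keys to int only after sorting, for display.
result = [(int(w) if re.fullmatch(r'[0-9]+', w) else w, c) for w, c in ordered]
print(result[:4])  # [(20, 1), (50, 1), (400, 1), ('Hi', 1)]
```

With an RDD, the same key function could presumably be passed to `sortBy` after `reduceByKey`, for example `rdd.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b).sortBy(mixed_key)`. That variant is an assumption and has not been run here.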
