Translating a dataframe with a second dataframe: answers

【Question title】: Translating a dataframe with a second dataframe
【Posted】: 2019-11-03 13:14:34
【Question】:

I have two text files:

One with translations/aliases of the following form:

And another with three entries per line:

34 456 9900
111 333 444
234 2 562
...

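I would like to translate the second column wherever a translation exists. For example, I want the output dataframe to contain the rows: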

34, 99999, 9900
111, 333, 444
234, 278, 562

Reading the text files works fine. However, I am having trouble translating column b. This is my basic code structure so far:

translation = sc.textFile("transl.txt")\
    .map(lambda line: line.split(" "))

def translate(string):
    x = translation.filter(lambda x: x[0] == string).collect()
    if x == []:
        return string
    return x[0][1]

d = sc.textFile("text.txt")\
    .map(lambda line: line.split(" "))\
    .toDF(["a", "b", "c"])\
    .withColumn("b", translate(d.b))\

Everything works except the last line. I know that applying a function to a column in Spark is not straightforward, but I don't know how to do it.

【Comments】:

Tags: python dataframe apache-spark pyspark

【Solution 1】:

You can achieve this with a left join. Have a look at the commented code below:

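from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Reuses the active session (e.g. in the pyspark shell) or creates a new one
spark = SparkSession.builder.getOrCreate()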

l1 = [
(123, 456)
,(2, 278)
,(456, 99999)
]

l2 = [
(34, 456, 9900)
,(111, 333, 444)
,(234, 2, 562)
]

df1=spark.createDataFrame(l1, ['one1', 'two1'])
df2=spark.createDataFrame(l2, ['one2', 'two2', 'three2'])

# creates a dataframe with five columns: one1, two1, one2, two2, three2
df = df2.join(df1, df2.two2 == df1.one1 , 'left')

# checks whether a translation is available in the dictionary dataframe;
# if not, the current value is kept, otherwise the value is translated
df = df.withColumn('two2', F.when(F.col('two1').isNull(), F.col('two2') ).otherwise(F.col('two1')))

df = df.drop('one1', 'two1')

df.show()

Output:

+----+-----+------+
|one2| two2|three2|
+----+-----+------+
| 111|  333|   444|
| 234|  278|   562|
|  34|99999|  9900|
+----+-----+------+

【Comments】:

【Solution 2】:

If you import the two files as dataframes, a slightly different approach is to join them. I show an example below:

# Sample DataFrames built from the provided example
import pandas as pd
translations = pd.DataFrame({
    'Key': [123,2,456],
    'Translation': [456,278,99999]
    })  

entries = pd.DataFrame({
    'A': [34,11,234],
    'B': [456,333,2],
    'C': [9900,444,562]
    })

After importing the files, we can merge them on the lookup key using a left join:

df = pd.merge(entries, translations, left_on='B', right_on='Key', how='left')

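However, this leaves us with a column that contains NaN wherever no lookup was found. To fix this, we fall back to the values from 'B' in those rows, and overwrite the original 'B' column with the result.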

df['B'] = df['Translation'].mask(pd.isna, df['B'])

Now we need to drop the extra columns to get the result you asked for:

# drop returns a new DataFrame, so assign the result back
df = df.drop(columns=['Key', 'Translation'])

df now looks like this:

    A   B       C
0   34  99999   9900
1   11  333     444
2   234 278     562
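
In pandas, the same lookup with fallback can also be sketched without a merge, by mapping column 'B' through a Series indexed by the key. This is only an illustrative alternative, not the approach used above; the name `lookup` is introduced here for the example, and the data is the sample from the question.

```python
import pandas as pd

translations = pd.DataFrame({
    'Key': [123, 2, 456],
    'Translation': [456, 278, 99999]
})

entries = pd.DataFrame({
    'A': [34, 111, 234],
    'B': [456, 333, 2],
    'C': [9900, 444, 562]
})

# Series indexed by Key, so that map() can look values up by key.
lookup = translations.set_index('Key')['Translation']

# map() yields NaN for keys that have no translation; fillna() restores the
# original value in those rows, and astype(int) undoes the float upcast
# caused by the intermediate NaN.
entries['B'] = entries['B'].map(lookup).fillna(entries['B']).astype(int)
```

This avoids the extra 'Key' and 'Translation' columns, so nothing has to be dropped afterwards.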

【Comments】:

Thanks for your answer, but I am looking for a solution that uses pyspark dataframes rather than pandas.