[Posted]: 2020-10-29 10:50:41
[Problem description]:
I have a PySpark dataframe with a column of URLs, and a second PySpark dataframe that also contains a url column plus an id column. The URLs in the second frame are links under the domains in the first, e.g. abc.com in the first and abc.com/contact in the second. I want to collect, in a new column of the first dataframe, the ids of all links that belong to each domain. This is what I am currently doing:
url_list = df1.select('url').collect()
all_rows = df2.collect()
ids = list()
urls = list()
for row in all_rows:
    ids.append(row.id)
    urls.append(row.url)
dict_ids = dict([(i.url, "") for i in url_list])
for url, id in zip(urls, ids):
    res = [ele.url for ele in url_list if ele.url in url]
    if len(res) > 0:
        print(res)
        dict_ids[res[0]] += ('\n\n\n' + id + '\n\n\n')
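As a side note, the driver-side loop above can be collapsed into a single pass that collects the ids into lists instead of concatenating strings. A plain-Python sketch on the sample data from the question (list names are illustrative):

```python
from collections import defaultdict

# Sample data mirroring the question's df1 / df2, as if collected to the driver.
domains = ["http://example.com", "http://example2.com/index.html"]
rows = [("http://example.com/contact", "12"),
        ("http://example2.com/index.html/pif", "45"),
        ("http://example.com/about", "68"),
        ("http://example2.com/index.html/juk/er", "96")]

# For each row, find the first domain its url starts with and collect the id.
dict_ids = defaultdict(list)
for url, id_ in rows:
    for domain in domains:
        if url.startswith(domain):
            dict_ids[domain].append(id_)
            break

print(dict(dict_ids))
```

This still runs entirely on the driver, so it only reduces constant overhead; it does not parallelize the work.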
This is too time-consuming, so I wanted to do the processing in Spark, and I also tried this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def add_id(url, id):
    for i in url_list:
        if i.url in url:
            dict_ids[i.url] += id

add_id_udf = udf(add_id, StringType())
test = df_crawled_2.withColumn("Test", add_id_udf(df2['url'], df2['id']))
display(test)
input:
df1::
url
http://example.com
http://example2.com/index.html
df2::
url,id
http://example.com/contact, 12
http://example2.com/index.html/pif, 45
http://example.com/about, 68
http://example2.com/index.html/juk/er, 96
expected output:
df1::
url,id
http://example.com, [12,68]
http://example2.com/index.html, [45,96]
or even a dictionary is fine with urls as keys and id as values.
But in this second case dict_ids is still empty. Can someone help me?
[Comments]:
- Could you share a sample input and its ideal output?
Tags: python-3.x apache-spark pyspark