[Posted]: 2016-12-09 21:01:09
[Question]:
My sample data looks like this:
{ Line 1
Line 2
Line 3
Line 4
...
...
...
Line 6
Complete info:
Dept : HR
Emp name is Andrew lives in Colorodo
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Alex lives in Texas
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Mathew lives in California
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Dept : QC
Emp name is Nguyen lives in Nevada
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : Retail
DOJ : 11/04/2011
DOL : 08/21/2013
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Cassey lives in Newyork
DOB : 03/09/1958
Project name : Healthcare
DOJ : 06/04/2011
DOL : 09/21/2011
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
Emp name is Ronney lives in Alasca
DOB : 03/09/1958
Project name : Audit
DOJ : 09/11/2013
DOL : 09/01/2014
Project name : ContorlManagement
DOJ : 01/08/2015
DOL : 02/14/2016
line21
line22
line23
...
}
The output I need:
{
Dept Empname State Dob Projectname DOJ DOE
HR Andrew Colorodo 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Andrew Colorodo 03/09/1958 Retail 11/04/2011 08/21/2013
HR Andrew Colorodo 03/09/1958 Audit 09/11/2013 09/01/2014
HR Andrew Colorodo 03/09/1958 ControlManagement 06/04/2011 09/21/2011
HR Alex Texas 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Alex Texas 03/09/1958 ControlManagement 06/04/2011 09/21/2011
HR Mathews California 03/09/1958 Healthcare 06/04/2011 09/21/2011
HR Mathews California 03/09/1958 Retail 11/04/2011 08/21/2013
HR Mathews California 03/09/1958 Audit 09/11/2013 09/01/2014
HR Mathews California 03/09/1958 ControlManagement 06/04/2011 09/21/2011
QC Nguyen Nevada 03/09/1958 Healthcare 06/04/2011 09/21/2011
QC Nguyen Nevada 03/09/1958 Retail 11/04/2011 08/21/2013
QC Nguyen Nevada 03/09/1958 Audit 09/11/2013 09/01/2014
QC Nguyen Nevada 03/09/1958 ControlManagement 06/04/2011 09/21/2011
QC Casey Newyork 03/09/1958 Healthcare 06/04/2011 09/21/2011
QC Casey Newyork 03/09/1958 Retail 11/04/2011 08/21/2013
QC Casey Newyork 03/09/1958 Audit 09/11/2013 09/01/2014
QC Casey Newyork 03/09/1958 ControlManagement 06/04/2011 09/21/2011}
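I tried the following options: 1) I considered using a map inside a map and then matching. That produced a lot of errors. Then I read a post here explaining that I can't have a map inside another map. In fact, no RDD transformation can be performed inside another one. Sorry, I am new to Spark.
2) I tried using regular expressions and then calling map on the captured groups. But since each department has multiple employees, and each employee has multiple project records, I can't group this repeating part of the data or map it back to the corresponding employee. The same goes for the employee and department details.
Q1: Is it possible in Spark/Scala to convert the sample data above into the output format above?
Q2: If so, what logic or concept should I pursue?
Thanks in advance.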
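[Comments]:
-
This isn't a perfect fit for Spark. Anything that needs a linear pass is usually not best done in Spark. It is quite simple to do in plain Scala, though. Why not preprocess the file that way and load the result into Spark for later processing?
-
How big is the data? Do you really need Spark?
-
The data is about 75 GB. If there is any solution or logic that works in Spark (even if the code is complex, lengthy, or inefficient), I'd like to try it before trying other options. Any ideas? Thanks.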
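The linear pass suggested in the first comment could be sketched in plain Scala roughly as follows. This is only an illustration, not a tested solution for the 75 GB file: it assumes every record follows exactly the line patterns shown in the sample (`Dept :`, `Emp name is ... lives in ...`, `DOB :`, `Project name :`, `DOJ :`, `DOL :`), that a `DOL` line always closes a project, and that all other lines can be ignored. The names `ProjectRow`, `Acc`, `Flatten`, `step` and `flatten` are hypothetical.

```scala
// Hypothetical row type; the fields mirror the columns of the desired output.
final case class ProjectRow(dept: String, emp: String, state: String, dob: String,
                            project: String, doj: String, dol: String)

object Flatten {
  // Context carried from one line to the next (illustrative name).
  final case class Acc(dept: String = "", emp: String = "", state: String = "",
                       dob: String = "", project: String = "", doj: String = "",
                       rows: Vector[ProjectRow] = Vector.empty)

  // Patterns assumed from the sample data; each must match the whole trimmed line.
  private val Dept    = """Dept\s*:\s*(.+)""".r
  private val Emp     = """Emp name is (\S+) lives in (.+)""".r
  private val Dob     = """DOB\s*:\s*(\S+)""".r
  private val Project = """Project name\s*:\s*(.+)""".r
  private val Doj     = """DOJ\s*:\s*(\S+)""".r
  private val Dol     = """DOL\s*:\s*(\S+)""".r

  // Update the context for one line; a DOL line completes a row.
  def step(acc: Acc, line: String): Acc = line.trim match {
    case Dept(d)    => acc.copy(dept = d)
    case Emp(n, st) => acc.copy(emp = n, state = st)
    case Dob(d)     => acc.copy(dob = d)
    case Project(p) => acc.copy(project = p)
    case Doj(d)     => acc.copy(doj = d)
    case Dol(d)     =>
      val row = ProjectRow(acc.dept, acc.emp, acc.state, acc.dob, acc.project, acc.doj, d)
      acc.copy(rows = acc.rows :+ row)
    case _          => acc // header, footer, and any other unrelated lines
  }

  // Single sequential pass over the lines.
  def flatten(lines: Iterator[String]): Vector[ProjectRow] =
    lines.foldLeft(Acc())(step).rows
}
```

It could be called as `Flatten.flatten(scala.io.Source.fromFile("input.txt").getLines())`, where `input.txt` is a placeholder path. The pass is sequential because every row depends on the most recent `Dept` and `Emp name` lines, which is why a per-line RDD `map` cannot see that context. To stay inside Spark, one option that may be worth checking is to split the input into one record per department (for example with Hadoop's `textinputformat.record.delimiter` setting) and apply the same fold to each record; whether that fits this file is an assumption to verify.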
Tags: scala apache-spark nested-loops