Note: version compatibility between Spark, Scala, and Hadoop is critical!
1. Installing Spark 2.0.1
Download a pre-built Spark package from https://archive.apache.org/dist/spark/spark-2.0.1/, choosing the build that matches Hadoop 2.7 (spark-2.0.1-bin-hadoop2.7).
Since it is already compiled, simply extract it. Note, however, that the extraction path must not contain spaces; for example, extract to D:\spark-2.0.1-bin-hadoop2.7.
After extraction, add the bin directory to the PATH environment variable.
Then run spark-shell in cmd.
It will complain that the Hadoop environment is missing; setting up Hadoop is covered below.
2. Installing Scala 2.11.8
Download Scala from https://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.msi
After downloading, run the installer; it automatically adds Scala to PATH.
Run the scala -version command in cmd:
If it prints the Scala version, the installation succeeded.
3. Setting up a Scala development environment in IDEA
See also: http://dblab.xmu.edu.cn/blog/1327/
First install the Scala plugin for IDEA from the plugin marketplace.
I already have it installed here; if the marketplace download is slow, you can also fetch the plugin package from https://confluence.jetbrains.com/display/SCA/Scala+Plugin+for+IntelliJ+IDEA and install it from disk.
After installing the plugin and restarting IDEA, create a Scala project.
Choose IDEA (suitable for beginners), click Next, and select an SDK; if no SDK is listed, click Create and you will see the Scala 2.11.8 installed earlier.
Then create a Scala object, because only an object can hold the main method.
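The reason is that Scala classes have no static members, so the JVM entry point main must live in a singleton object. A minimal sketch (the object name and message are just illustrative):

```scala
// HelloSpark.scala — only an `object` (a singleton) can carry the JVM
// entry point `main`, because Scala classes have no static members.
object HelloSpark {
  def greeting(name: String): String = s"Hello, $name"

  def main(args: Array[String]): Unit =
    println(greeting("Spark")) // prints "Hello, Spark"
}
```

Running it from IDEA (right-click the object, then Run) should print the greeting.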
At this point the Scala environment is ready; the last step is setting up Hadoop.
4. Setting up Hadoop 2.7.7
Download Hadoop from https://mirror.bit.edu.cn/apache/hadoop/common/hadoop-2.7.7/
On Windows, open the archive with administrator rights; otherwise extraction fails with "the client does not hold the required privilege".
Then set the HADOOP_HOME environment variable to the Hadoop extraction directory.
Next add its bin directory to the system PATH (in my case C:\Hadoop\bin); since HADOOP_HOME is already set, %HADOOP_HOME%\bin works too.
With both variables set, open a new cmd window and run spark-shell again. It now fails with:
java.io.IOException: Could not locate executable D:\hadoop-2.7.7\bin\winutils.exe in the Hadoop binaries.
Following the hint, go to https://github.com/steveloughran/winutils, pick the directory matching your Hadoop version, open its bin folder, click winutils.exe, and use the Download button near the upper right of the page.
Copy the downloaded bin contents over Hadoop's bin directory.
If winutils.exe then complains that it cannot run in a 64-bit environment, download a 64-bit build of the bin directory (for example from http://www.pc6.com/softview/SoftView_578664.html) and overwrite Hadoop's bin with that instead.
Create a new classpath environment variable set to D:\hadoop-2.7.7\bin\winutils.exe
Also copy bin\hadoop.dll into C:\Windows\System32.
Make sure the Spark installation directory is neither hidden nor read-only.
Because Hadoop does not reliably pick up the system JAVA_HOME, configure the Java path inside Hadoop as well: open D:\hadoop-2.7.7\etc\hadoop\hadoop-env.cmd and set
set JAVA_HOME=D:\hadoop-2.7.7\jdk1.8.0_181
Note that the JAVA_HOME path must not contain spaces (and should not be wrapped in quotes, which become part of the value in hadoop-env.cmd), otherwise it will not be recognized.
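If your JDK sits under C:\Program Files, the space in the path is exactly this problem. A common workaround (the JDK path below is an example, not taken from the original setup) is to use the directory's 8.3 short name:

```bat
rem hadoop-env.cmd — PROGRA~1 is the 8.3 short name of "Program Files",
rem which avoids the space; adjust the JDK folder name to your install.
set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_181
```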
Edit D:\hadoop-2.7.7\etc\hadoop\core-site.xml:
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
Edit D:\hadoop-2.7.7\etc\hadoop\hdfs-site.xml:
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:/hadoop/data/dfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:/hadoop/data/dfs/datanode</value>
    </property>
</configuration>
With the file:/hadoop/... values above, Hadoop creates a hadoop folder at the drive root, alongside the Hadoop installation directory, for the namenode and datanode data.
Once configured, run start-dfs.cmd from the sbin directory.
This starts the namenode and datanode, but the datanode may fail with:
java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: Cannot run program "D:\hadoop-2.7.7\bin\winutils.exe": CreateProcess error=740, The requested operation requires elevation.
Error 740 means elevation is required: close the window and run start-dfs.cmd again from a cmd opened as administrator.
You can check the Hadoop version with the hadoop version command.
Open http://localhost:50070/dfshealth.html#tab-overview in a browser
to view the HDFS overview page.
5. Creating a Spark project with Maven
On the IDEA welcome screen, click Create New Project and create a Maven project.
After it is created, right-click the project and choose Add Framework Support... to add Scala support.
Then add the Spark dependencies to pom.xml:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>

  <groupId>spark</groupId>
  <artifactId>picc-spark</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>picc-spark</name>
  <!-- FIXME change it to the project's website -->
  <url>http://www.example.com</url>

  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <!-- keep hadoopVersion in step with the locally installed Hadoop (2.7.7 above) -->
    <hadoopVersion>2.7.7</hadoopVersion>
    <sparkVersion>2.0.1</sparkVersion>
    <scala.version>2.11</scala.version>
  </properties>

  <dependencies>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.11</version>
      <scope>test</scope>
    </dependency>
    <!-- Hadoop start -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-common</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-hdfs</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-mapreduce-client-core</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>${hadoopVersion}</version>
    </dependency>
    <!-- Hadoop end -->
    <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-hive_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_${scala.version}</artifactId>
      <version>${sparkVersion}</version>
    </dependency>
  </dependencies>

  <build>
    <pluginManagement><!-- lock down plugins versions to avoid using Maven defaults (may be moved to parent pom) -->
      <plugins>
        <!-- clean lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#clean_Lifecycle -->
        <plugin>
          <artifactId>maven-clean-plugin</artifactId>
          <version>3.1.0</version>
        </plugin>
        <!-- default lifecycle, jar packaging: see https://maven.apache.org/ref/current/maven-core/default-bindings.html#Plugin_bindings_for_jar_packaging -->
        <plugin>
          <artifactId>maven-resources-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.8.0</version>
        </plugin>
        <plugin>
          <artifactId>maven-surefire-plugin</artifactId>
          <version>2.22.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-jar-plugin</artifactId>
          <version>3.0.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-install-plugin</artifactId>
          <version>2.5.2</version>
        </plugin>
        <plugin>
          <artifactId>maven-deploy-plugin</artifactId>
          <version>2.8.2</version>
        </plugin>
        <!-- site lifecycle, see https://maven.apache.org/ref/current/maven-core/lifecycles.html#site_Lifecycle -->
        <plugin>
          <artifactId>maven-site-plugin</artifactId>
          <version>3.7.1</version>
        </plugin>
        <plugin>
          <artifactId>maven-project-info-reports-plugin</artifactId>
          <version>3.0.0</version>
        </plugin>
      </plugins>
    </pluginManagement>
  </build>
</project>
Finally, write a WordCount program to verify the whole setup.
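Word count has the classic flatMap → map → reduce shape. The sketch below runs the same pipeline on plain Scala collections so it can be tried without a cluster; in the actual Spark program you would build an RDD with sc.textFile(...) and replace the local groupBy step with reduceByKey(_ + _). The names here are illustrative, not taken from an existing project.

```scala
// WordCount.scala — the counting pipeline on local collections.
// The Spark equivalent of `count` is:
//   sc.textFile(path).flatMap(_.split("\\s+")).map((_, 1)).reduceByKey(_ + _)
object WordCount {
  def count(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))  // split every line into words
      .filter(_.nonEmpty)        // drop empty tokens from blank lines
      .groupBy(identity)         // local stand-in for reduceByKey
      .map { case (word, occurrences) => (word, occurrences.size) }

  def main(args: Array[String]): Unit = {
    val sample = Seq("hello spark", "hello hadoop")
    count(sample).foreach { case (word, n) => println(s"$word: $n") }
  }
}
```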
If it runs successfully, the environment is complete!