Python Spark Installation and Configuration Steps
I. Scala Installation
Scala download URL:
https://www.scala-lang.org/files/archive/
1. Download the package
muyi@master:~$ wget http://www.scala-lang.org/files/archive/scala-2.12.7.tgz
2. Extract the archive
tar xvf '/home/muyi/Desktop/scala-2.12.7.tgz'
3. Move it to the target directory
sudo mv scala-2.12.7 /usr/local/scala
4. Edit the profile
gedit /home/muyi/.bash_profile
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
5. Apply the changes
source /home/muyi/.bash_profile
6. Start the Scala REPL
muyi@master:~$ scala
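If the REPL prompt does not appear, a quick check that the PATH change took effect (type :quit to leave the REPL):
scala -version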
****
On CentOS and other Linux systems, most package installation failures come down to missing dependencies. The error zipimport.ZipImportError: can't decompress data is caused by missing zlib packages, so installing the relevant dependency is enough to fix it.
1. Open a terminal and run the following command to install the zlib dependencies:
yum -y install zlib*
2. In the Python source tree, edit the Setup file under the Modules directory:
vim Modules/Setup
Find the following line and uncomment it:
#zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz
With the comment removed it reads:
zlib zlibmodule.c -I$(prefix)/include -L$(exec_prefix)/lib -lz
If you hit this error while building Python itself, install the dependency above, then return to the Python source directory and run:
make && make install
to rebuild and reinstall.
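To confirm the rebuilt interpreter picked up zlib, a quick check (run with the newly built python):
import zlib
print(zlib.ZLIB_VERSION)   # succeeds only if the zlib module was built; prints the linked zlib version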
****
II. Spark Installation
Spark download URL:
https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
1. Download the package
muyi@master:~$ wget https://www.apache.org/dyn/closer.lua/spark/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
2. Extract the archive
tar zxf spark-2.4.0-bin-hadoop2.7.tgz
3. Move the files
sudo mv spark-2.4.0-bin-hadoop2.7 /usr/local/spark/
4. Configure environment variables
gedit /home/muyi/.bash_profile
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
5. Apply the changes
source /home/muyi/.bash_profile
6. Start PySpark
pyspark
7. Reduce the PySpark console output
Change to /usr/local/spark/conf and copy the template:
cp log4j.properties.template log4j.properties
Edit log4j.properties
and change log4j.rootCategory to WARN, console
8. Prepare test files
file:/home/muyi/wordcount/input/LICENSE.txt
hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt
The HDFS file requires the Hadoop services to be running (start-all.sh).
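A sketch of how the two copies can be prepared (the local source is just an example; here the LICENSE file shipped at the top of the Spark distribution is used):
mkdir -p /home/muyi/wordcount/input
cp /usr/local/spark/LICENSE /home/muyi/wordcount/input/LICENSE.txt
hadoop fs -mkdir -p /user/hduser/wordcount/input
hadoop fs -put /home/muyi/wordcount/input/LICENSE.txt /user/hduser/wordcount/input/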
III. Running Python Spark
1. Running PySpark locally
pyspark --master local[4]
sc.master
textFile = sc.textFile("file:/usr/local/spark/README.md")
textFile.count()
Note: use the file: prefix for local paths.
In [3]: textFile = spark.read.text("file:/home/muyi/wordcount/input/LICENSE.txt")
In [4]: textFile.count()
Out[4]: 1594
In [6]: textFile = sc.textFile("file:/home/muyi/wordcount/input/LICENSE.txt")
In [7]: textFile.count()
Out[7]: 1594
Reading a file from HDFS:
textFile = sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
textFile.count()
In [3]: lines=sc.textFile("file:/home/muyi/hadooplist.txt")
In [5]: pairs=lines.map(lambda s:(s,1))
In [6]: counts = pairs.reduceByKey(lambda a,b:a+b)
In [7]: counts
Out[7]: PythonRDD[8] at RDD at PythonRDD.scala:53
In [8]: counts.count()
Out[8]: 11
In [14]: accum =sc.accumulator(0)
In [15]: accum
Out[15]: Accumulator
In [16]: sc.parallelize([1,2,3,4]).foreach(lambda x:accum.add(x))
In [17]: accum.value
Out[17]: 10
Exit:
exit()
2. Running pyspark on Hadoop YARN
yarn-site.xml has to be configured on both the master and the slaves. The values used in this setup: mapreduce_shuffle for the NodeManager aux-services; master:18040, master:18030, master:18025, and master:18141 for the ResourceManager, scheduler, resource-tracker, and admin addresses; yarn.resourcemanager.webapp.address set to master:18088; two memory-check properties set to false; and the numeric memory/allocation limits 100000, 10000, 3000, and 2000.
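Each of these goes into yarn-site.xml as a standard property element, for example:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>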
Start Hadoop: start-all.sh
HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop pyspark --master yarn --deploy-mode client
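Once the shell starts, a quick smoke test (reusing the LICENSE.txt uploaded to HDFS earlier):
sc.master      # reports 'yarn' in this mode
sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt").count()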
3. Running pyspark on Spark Standalone
3.1 Configure spark-env.sh
export JAVA_HOME=/usr/java/jdk1.8.0_181
export SCALA_HOME=/usr/local/scala
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export SPARK_MASTER_IP=192.168.222.3
export SPARK_MASTER_PORT=7077
export SPARK_HOME=/usr/local/spark
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop
export SPARK_DIST_CLASSPATH=$(/usr/hadoop/hadoop-2.7.7/bin/hadoop classpath)
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=1g
export SPARK_WORKER_INSTANCES=1
3.2 Copy the Spark installation to the slave machines
3.3 Edit the slaves file on the master VM
Edit /usr/local/spark/conf/slaves on the master VM
and add the slave host name: slave
3.4 Start the services
Start Spark: /usr/local/spark/sbin/start-all.sh
Start Hadoop: start-all.sh
[muyi@master spark]$ pyspark --master spark://master:7077 --num-executors 1 --total-executor-cores 3 --executor-memory 512m
Python 3.7.0 (default, Jun 28 2018, 13:15:42)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
18/12/07 04:38:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.4.0
/_/
Using Python version 3.7.0 (default, Jun 28 2018 13:15:42)
SparkSession available as 'spark'.
In [1]: ts=sc.textFile("hdfs://master:9000/user/hduser/wordcount/input/LICENSE.txt")
In [2]: ts.count()
Out[2]: 1594
IV. Running Python Spark Programs in IPython Notebook
Steps to install Anaconda3 on the Linux VM
1. Download the installer from the Tsinghua mirror site
https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/
Command line:
wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.3.1-Linux-x86_64.sh
2. Run the Anaconda installer
bash Anaconda3-5.3.1-Linux-x86_64.sh -b
3. Edit ~/.bash_profile and add the Anaconda bin directory to PATH:
export PATH=/home/muyi/anaconda3/bin:$PATH
export ANACONDA_PATH=/home/muyi/anaconda3
source ~/.bash_profile
4. Restart the VM
Commands to run IPython Notebook in different modes
4.1 Local
[muyi@master Desktop]$ cd '/home/muyi/pythonwork/ipynotebook'
[muyi@master ipynotebook]$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
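A first notebook cell can confirm the SparkContext that pyspark creates automatically:
sc.master     # 'local[*]' unless a master was passed on the command line
sc.version    # '2.4.0' for this installation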
4.2 Hadoop YARN client
Only files on the cluster, i.e. HDFS paths, can be used in this mode.
Start Hadoop (start-all.sh), then launch IPython Notebook in YARN client mode:
[muyi@master ~]$ cd '/home/muyi/pythonwork/ipynotebook'
[muyi@master ipynotebook]$ PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop MASTER=yarn-client pyspark
or
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop MASTER=yarn-client pyspark
4.3 Spark Standalone
First start Hadoop: start-all.sh
Then start Spark: /usr/local/spark/sbin/start-all.sh
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=spark://master:7077 pyspark --num-executors 1 --total-executor-cores 3 --executor-memory 512m
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" MASTER=spark://master:7077 pyspark --num-executors 1 --total-executor-cores 2 --executor-memory 512m
V. Running with spark-submit
1、[muyi@master spark]$ ./bin/spark-submit examples/src/main/python/pi.py
2、[muyi@master spark]$ ./bin/spark-submit examples/src/main/python/wordcount.py 'file:/home/muyi/Desktop/test.txt'
3. Verify the Spark installation
Check the job in Hadoop YARN:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 1g --executor-memory 1g --executor-cores 1 examples/jars/spark-examples*.jar 10
4. Test a Python file with spark-submit
[muyi@master spark]$ ./bin/spark-submit examples/src/main/python/pi.py
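Unlike the pyspark shell, a script run through spark-submit has to create its own SparkSession. A minimal sketch (the file name mycount.py and the test file path are just examples):
# mycount.py - a minimal word count to submit with spark-submit
from pyspark.sql import SparkSession

if __name__ == "__main__":
    spark = SparkSession.builder.appName("MyWordCount").getOrCreate()
    sc = spark.sparkContext
    lines = sc.textFile("file:/home/muyi/Desktop/test.txt")
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    for word, n in counts.take(10):   # print a sample of the counted words
        print(word, n)
    spark.stop()
It is submitted the same way as the bundled examples:
./bin/spark-submit /home/muyi/mycount.py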
Notes
Running PySpark against the Spark standalone cluster requires Anaconda on the slaves as well.
Running on YARN requires Anaconda and Spark on both the master and the slaves.
Do not give the virtual machine less than 2 GB of memory.
Final ~/.bash_profile configuration:
PATH=$PATH:$HOME/bin
export PATH
export JAVA_HOME=/usr/java/jdk1.8.0_181
export PATH=$JAVA_HOME/bin:$PATH
export HADOOP_HOME=/usr/hadoop/hadoop-2.7.7
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
export SCALA_HOME=/usr/local/scala
export PATH=$PATH:$SCALA_HOME/bin
export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin
export ANACONDA_PATH=/home/muyi/anaconda3
export PATH=$PATH:$ANACONDA_PATH/bin
export PYSPARK_DRIVER_PYTHON=$ANACONDA_PATH/bin/ipython
export PYSPARK_PYTHON=$ANACONDA_PATH/bin/python
export HADOOP_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop
export HDFS_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop
export YARN_CONF_DIR=/usr/hadoop/hadoop-2.7.7/etc/hadoop
VI. Python Spark RDD
Transformations
intRdd = sc.parallelize([3,1,2,5,5])
intRdd.collect()
stringRdd = sc.parallelize(['apple','orange','pear','apple'])
stringRdd.collect()
def addOne(x):
    return (x+1)
intRdd.map(addOne).collect()
intRdd.map(lambda x : x + 2).collect()
stringRdd.map(lambda x:'fruit:'+x).collect()
intRdd.filter(lambda x:x<3).collect()
intRdd.filter(lambda x:x==3).collect()
intRdd.filter(lambda x:x>1 and x<5).collect()
stringRdd.filter(lambda x:'a' in x).collect()
intRdd.distinct().collect()
srdd=intRdd.randomSplit([0.4,0.6])
srdd[0].collect()
srdd[1].collect()
grdd= intRdd.groupBy(lambda x: "even" if(x%2==0) else "odd").collect()
print(grdd[0][0],sorted(grdd[0][1]))
print(grdd[1][0],sorted(grdd[1][1]))
intRdd1=sc.parallelize([3,1,2,5,5])
intRdd2=sc.parallelize([5,6])
intRdd3=sc.parallelize([2,7])
intRdd1.union(intRdd2).union(intRdd3).collect()
intRdd1.intersection(intRdd2).collect()
intRdd1.subtract(intRdd2).collect()
intRdd1.cartesian(intRdd2).collect()
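For these inputs the set operations return roughly the following (element order within a result may vary):
intRdd1.union(intRdd2).union(intRdd3).collect()   # [3, 1, 2, 5, 5, 5, 6, 2, 7]
intRdd1.intersection(intRdd2).collect()           # [5]
intRdd1.subtract(intRdd2).collect()               # [3, 1, 2]
intRdd1.cartesian(intRdd2).collect()              # the 10 pairs combining each element of intRdd1 with each of intRdd2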
Actions
intRdd.first()
intRdd.take(2)
intRdd.takeOrdered(3,key=lambda x: -x)
intRdd.takeOrdered(3,key=lambda x: x)
intRdd.stats()
intRdd.min()
intRdd.max()
intRdd.stdev()
intRdd.count()
intRdd.sum()
intRdd.mean()
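For intRdd = sc.parallelize([3,1,2,5,5]) the actions above return values along these lines:
intRdd.first()                            # 3
intRdd.take(2)                            # [3, 1]
intRdd.takeOrdered(3,key=lambda x: -x)    # [5, 5, 3]
intRdd.takeOrdered(3,key=lambda x: x)     # [1, 2, 3]
intRdd.stats()                            # (count: 5, mean: 3.2, stdev: 1.6, max: 5.0, min: 1.0)
intRdd.sum()                              # 16
intRdd.mean()                             # 3.2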
RDD Key-Value Transformations
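A short sketch of common key-value transformations (the pairs below are made up for illustration):
kvRdd = sc.parallelize([(3, 4), (3, 6), (5, 6), (1, 2)])
kvRdd.keys().collect()                           # [3, 3, 5, 1]
kvRdd.values().collect()                         # [4, 6, 6, 2]
kvRdd.filter(lambda kv: kv[0] < 5).collect()     # [(3, 4), (3, 6), (1, 2)]
kvRdd.mapValues(lambda v: v * v).collect()       # [(3, 16), (3, 36), (5, 36), (1, 4)]
kvRdd.sortByKey().collect()                      # [(1, 2), (3, 4), (3, 6), (5, 6)]
kvRdd.reduceByKey(lambda a, b: a + b).collect()  # [(1, 2), (3, 10), (5, 6)], order may vary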
Word Count example
[muyi@master Desktop]$ cd '/home/muyi/pythonwork/ipynotebook'
[muyi@master ipynotebook]$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
textFile=sc.textFile("file:/home/muyi/pythonwork/ipynotebook/data/test.txt")
stringRDD=textFile.flatMap(lambda line : line.split(" "))
countsRDD=stringRDD.map(lambda word:(word,1)).reduceByKey(lambda x,y : x+y)
countsRDD.saveAsTextFile("file:/home/muyi/pythonwork/ipynotebook/data/output")
%ll data
%ll data/output
%cat data/output/part-00000
If a second run fails with an error that the output directory already exists,
delete the output directory:
%rm -R data/output
Integrated development environment
GTK upgrade
When Eclipse asks for a GTK+ upgrade, install it with:
yum install gtk2 gtk2-devel gtk2-devel-docs
Check whether GTK is installed:
pkg-config --list-all | grep gtk
Check the version:
pkg-config --modversion gtk+-2.0