Link: https://github.com/markgrover/cloudcon-hive
2008.tar.gz: the 2008 flight delay dataset.
airports.csv: a dataset mapping airport codes to their full names.
2. Copy the dataset to the local filesystem of a CDH node
Using an SFTP tool such as Xftp, drag the file from the local machine into /home/my_flight on the slave1 node.
Use the hdfs dfs -put command to copy the data from the local filesystem into HDFS, where /user/tmp/ is the destination path in HDFS:
hdfs dfs -put /home/my_flight /user/tmp/
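You can verify the upload with hdfs dfs -ls /user/tmp/my_flight.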
The Hive CLI works with HiveServer1: it connects to a remote HiveServer1 instance over the Thrift protocol. To connect, you must specify the hostname; the port number is optional.
$ hive -h <host_name> -p <port>
Beeline connects to a remote HiveServer2 instance over JDBC. The connection parameters include a JDBC URL of the form jdbc:hive2://<host>:<port>/<database>; HiveServer2 listens on port 10000 by default.
$ beeline -u <url> -n <username> -p <password>
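The same handshake can also be made programmatically. Below is a minimal Java sketch of a JDBC connection to HiveServer2; the class name is hypothetical, the host and credentials are placeholders, and it assumes the org.apache.hive:hive-jdbc driver is on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hypothetical example: the same URL shape Beeline receives via -u.
public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:hive2://<host_name>:10000/default"; // 10000 = HiveServer2 default port
        try (Connection conn = DriverManager.getConnection(url, "<username>", "<password>");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
            while (rs.next()) {
                System.out.println(rs.getString(1));
            }
        }
    }
}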
CREATE EXTERNAL TABLE flight_data(
year INT,
month INT,
day INT,
day_of_week INT,
dep_time INT,
crs_dep_time INT,
arr_time INT,
crs_arr_time INT,
unique_carrier STRING,
flight_num INT,
tail_num STRING,
actual_elapsed_time INT,
crs_elapsed_time INT,
air_time INT,
arr_delay INT,
dep_delay INT,
origin STRING,
dest STRING,
distance INT,
taxi_in INT,
taxi_out INT,
cancelled INT,
cancellation_code STRING,
diverted INT,
carrier_delay STRING,
weather_delay STRING,
nas_delay STRING,
security_delay STRING,
late_aircraft_delay STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/user/tmp/my_flight' INTO TABLE flight_data;
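Note that LOAD DATA INPATH moves the files rather than copying them: once the statement runs, the data lives under the table's storage directory and is gone from /user/tmp/my_flight.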
Hive uses MapReduce as its default execution engine. Run the following in Hive:
SELECT COUNT(*)
FROM flight_data;
This takes about a minute, because the query executes as a MapReduce job.
In CDH, switch Hive's execution engine to Spark.
Running the same query again fails. The cause: this machine hosts a VMware-based pseudo-distributed environment, and the cores and memory configured for each YARN node cannot satisfy the Spark job's resource requests. Adjust the YARN configuration according to the error message, typically yarn.nodemanager.resource.memory-mb, yarn.scheduler.maximum-allocation-mb, and yarn.nodemanager.resource.cpu-vcores (note that every node must be updated).
The pom pulls cdh-versioned artifacts, which live in Cloudera's repository rather than Maven Central:
<repositories>
<repository>
<id>cloudera</id>
<url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
</repository>
</repositories>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
<scala.version>2.11.12</scala.version>
<avro.version>1.8.2-cdh6.3.2</avro.version>
<crunch.version>0.11.0-cdh6.3.2</crunch.version>
<flume.version>1.9.0-cdh6.3.2</flume.version>
<connector.version>hadoop3-1.9.10-cdh6.3.2</connector.version>
<hadoop.version>3.0.0-cdh6.3.2</hadoop.version>
<hbase.version>2.1.0-cdh6.3.2</hbase.version>
<indexer.version>1.5-cdh6.3.2</indexer.version>
<hive.version>2.1.1-cdh6.3.2</hive.version>
<kafka.version>2.2.1-cdh6.3.2</kafka.version>
<kitesdk.version>1.0.0-cdh6.3.2</kitesdk.version>
<kudu.version>1.10.0-cdh6.3.2</kudu.version>
<oozie.version>5.1.0-cdh6.3.2</oozie.version>
<pig.version>0.17.0-cdh6.3.2</pig.version>
<search.version>1.0.0-cdh6.3.2</search.version>
<sentry.version>2.1.0-cdh6.3.2</sentry.version>
<solr.version>7.4.0-cdh6.3.2</solr.version>
<spark.version>2.4.0-cdh6.3.2</spark.version>
<sqoop.version>1.4.7-cdh6.3.2</sqoop.version>
<zookeeper.version>3.4.5-cdh6.3.2</zookeeper.version>
</properties>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>${spark.version}</version>
</dependency>
<plugin>
<artifactId>maven-jar-plugin</artifactId>
<version>3.0.2</version>
<configuration>
<archive>
<manifest>
<mainClass>com.spark.SparkOnHive</mainClass>
</manifest>
</archive>
<finalName>SparkOnHive</finalName>
</configuration>
</plugin>
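The pom's <mainClass> points at com.spark.SparkOnHive, which is not listed in the article. A minimal sketch of what such a driver might look like, assuming it simply reruns the earlier count through Spark SQL with Hive support (hive-site.xml must be on the classpath, which spark-submit on a CDH node provides; running outside the cluster would also need the spark-hive_2.11 dependency):

package com.spark;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Hypothetical sketch of the driver class named in the pom's <mainClass>.
public class SparkOnHive {
    public static void main(String[] args) {
        // enableHiveSupport() wires the session to the Hive metastore,
        // so the flight_data table created earlier is visible here.
        SparkSession spark = SparkSession.builder()
                .appName("SparkOnHive")
                .enableHiveSupport()
                .getOrCreate();

        // The same query that took about a minute as a MapReduce job.
        Dataset<Row> count = spark.sql("SELECT COUNT(*) FROM flight_data");
        count.show();

        spark.stop();
    }
}

Build the jar with Maven: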
& mvn clean package -f "c:\Users\GRE\Desktop\SparkLearn\spark_java\pom.xml"
Submit to YARN with spark-submit; note that the yarn-cluster master URL is deprecated in Spark 2.x in favor of --master yarn --deploy-mode cluster. For example, using the bundled SparkPi:
./spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi ../lib/spark-examples-xx.jar 100
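The jar built above would be submitted the same way, for example ./spark-submit --master yarn --deploy-mode cluster --class com.spark.SparkOnHive SparkOnHive.jar (the jar name comes from the pom's <finalName>).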