Apache Spark is a fast, general-purpose cluster computing system that keeps working data in memory wherever possible. It provides convenient APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs and achieves fault tolerance through lineage tracking.
For a beginner, running a local copy of Spark might seem a daunting task: a production deployment of Apache Spark involves many installation steps. In this post I show how to install and experiment with Spark in 3 easy commands on Ubuntu 14.04. Here they are, starting from a bash prompt:
1) Install Java
sudo apt-add-repository ppa:webupd8team/java && sudo apt-get update && sudo apt-get install oracle-java7-installer
2) Download Spark and extract
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.2.0-bin-hadoop2.4.tgz && tar -xvf spark-1.2.0-bin-hadoop2.4.tgz
3) Start the pyspark shell/repl:
spark-1.2.0-bin-hadoop2.4/bin/pyspark
This should be enough to get a fully functioning shell in which to experiment with the Spark Python API:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Python version 2.7.6 (default, Mar 22 2014 22:59:56)
SparkContext available as sc.
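Before reading any files, it is worth a quick sanity check that the SparkContext is alive. Here is a minimal sketch typed at the >>> prompt; the INFO log lines that pyspark prints around each action are omitted, and the outputs are simply what the arithmetic gives:

>>> sc.parallelize(range(100)).sum()   # ship a local list to Spark and sum it
4950
>>> sc.parallelize(range(100)).filter(lambda x: x % 2 == 0).count()   # count the even numbers
50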
A couple of commands against a real file should also work. Here we create an RDD from the README.md file shipped with Spark and count its lines.
>>> textFile = sc.textFile("spark-1.2.0-bin-hadoop2.4/README.md")
15/01/31 10:16:02 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=0, maxMem=280248975
15/01/31 10:16:02 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 159.9 KB, free 267.1 MB)
15/01/31 10:16:02 INFO MemoryStore: ensureFreeSpace(22692) called with curMem=163705, maxMem=280248975
15/01/31 10:16:02 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 267.1 MB)
15/01/31 10:16:02 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:40146 (size: 22.2 KB, free: 267.2 MB)
15/01/31 10:16:02 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/01/31 10:16:02 INFO SparkContext: Created broadcast 0 from textFile at NativeMethodAccessorImpl.java:-2
>>> textFile.count()
15/01/31 10:16:11 INFO FileInputFormat: Total input paths to process : 1
15/01/31 10:16:12 INFO SparkContext: Starting job: count at <stdin>:1
15/01/31 10:16:12 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 1 output partitions (allowLocal=false)
15/01/31 10:16:12 INFO DAGScheduler: Final stage: Stage 0(count at <stdin>:1)
15/01/31 10:16:12 INFO DAGScheduler: Parents of final stage: List()
15/01/31 10:16:12 INFO DAGScheduler: Missing parents: List()
15/01/31 10:16:12 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[2] at count at <stdin>:1), which has no missing parents
15/01/31 10:16:12 INFO MemoryStore: ensureFreeSpace(5464) called with curMem=186397, maxMem=280248975
15/01/31 10:16:12 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 5.3 KB, free 267.1 MB)
15/01/31 10:16:12 INFO MemoryStore: ensureFreeSpace(4003) called with curMem=191861, maxMem=280248975
15/01/31 10:16:12 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.9 KB, free 267.1 MB)
15/01/31 10:16:12 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:40146 (size: 3.9 KB, free: 267.2 MB)
15/01/31 10:16:12 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/01/31 10:16:12 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:838
15/01/31 10:16:12 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
15/01/31 10:16:12 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1317 bytes)
15/01/31 10:16:12 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/01/31 10:16:12 INFO HadoopRDD: Input split: file:/home/ubuntu/spark-1.2.0-bin-hadoop2.4/README.md:0+3645
15/01/31 10:16:12 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
15/01/31 10:16:12 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
15/01/31 10:16:12 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
15/01/31 10:16:12 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
15/01/31 10:16:12 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
15/01/31 10:16:12 INFO PythonRDD: Times: total = 700, boot = 603, init = 95, finish = 2
15/01/31 10:16:12 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1797 bytes result sent to driver
15/01/31 10:16:13 INFO DAGScheduler: Stage 0 (count at <stdin>:1) finished in 0.859 s
15/01/31 10:16:13 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 0.973213 s
15/01/31 10:16:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 843 ms on localhost (1/1)
15/01/31 10:16:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
98
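The 98 on the last line, easy to miss among the logging, is the result of count(): README.md has 98 lines. From here the rest of the RDD API is one keystroke away. The sketch below chains a few common operations on the same RDD; the lambdas are my own illustrations, and the log output is again omitted:

>>> linesWithSpark = textFile.filter(lambda line: "Spark" in line)   # transformation: lazy, runs nothing yet
>>> linesWithSpark.count()                                           # action: triggers a job
>>> textFile.map(lambda line: len(line.split())).reduce(lambda a, b: max(a, b))   # words in the longest line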
The pyspark shell also starts a local web UI on http://localhost:4040, where we can look at the jobs submitted to the local Spark engine. We can see the count() action, when it was submitted, its status, how many tasks it ran and other details. We will be exploring more of those options in future posts.
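Every action typed at the prompt lands on that page as a separate job. For something with more structure to click through than a single-stage count, the classic word count below should show up as a job with two stages, a map stage followed by a shuffle for reduceByKey. It is a sketch of my own, not part of Spark's quick start:

>>> words = textFile.flatMap(lambda line: line.split())                  # one record per word
>>> counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b) # shuffle: sums counts per word
>>> counts.take(5)                                                       # sample five (word, count) pairs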
And that is all it takes: a fully working Apache Spark installation to experiment with, in 3 easy commands!