First steps with Spark and Clojure
27 July 2014
Spark being all the rage these days, I decided to set myself up and try out one of the canonical samples (distributed Pi computation) in Clojure. Note that the discussion below focuses on the raw Spark APIs. There is already a nice-looking wrapper library, clj-spark.
Basic setup
Let's start with getting the built-in Scala version of the Pi calculation running. For convenience (and to avoid yet more dependencies) this will be a local master+worker setup. Unfortunately, there seems to be a cliff around the local IP address, at least on my Mac. So let's throw in a couple of additional settings: set SPARK_LOCAL_IP and, for good measure, change the Mac's "unknown-blah" host name. Note that the master's spark:// URL is printed in the master logs, in case the hostname works out to something random.
> sudo scutil --set HostName "mylocalbox"
> export SPARK_LOCAL_IP=127.0.0.1
> cd <spark install directory>
> sbin/start-master.sh
> bin/spark-class org.apache.spark.deploy.worker.Worker spark://mylocalbox:7077
After this the Spark master should be up (web UI on port 8080) alongside one worker node (on 8081). So it's time to run the test:
> bin/spark-submit --master spark://mylocalbox:7077 --class org.apache.spark.examples.SparkPi lib/spark-examples-1.0.1-hadoop2.2.0.jar
...
14/07/27 19:37:10 INFO SparkContext: Job finished: reduce at SparkPi.scala:35, took 3.908062 s
Pi is roughly 3.13882
Writing the Clojure task
First, set up the Leiningen project. The only tricky part is getting the dev dependencies right, so that the uberjar is not bloated with all the Spark dependencies.
(defproject spark "0.1.0-SNAPSHOT"
  :aot [spark.clojure.pi]
  ;; make Spark available at compile time but keep it out of the uberjar,
  ;; credit to the clj-spark project.clj
  :profiles {:dev {:dependencies [[org.apache.spark/spark-core_2.10 "1.0.1"]]}}
  :dependencies [[org.clojure/clojure "1.6.0"]])
With the project in place, the driver class comes next. Translating the Java/Scala algorithm to Clojure is pretty straightforward. The algorithm calculates an approximation of Pi via a Monte Carlo process, checking whether randomly chosen points in the square [-1,1] x [-1,1] fall into the unit circle (the fraction of points that do approximates the ratio of the circle's area to the square's, i.e. Pi/4, so 4 times that fraction gives Pi). We are using Spark here to map in parallel over a collection of sample points and reduce the results.
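Before wiring this into Spark, here is what the same estimate looks like in plain, sequential Clojure (a minimal sketch for reference; the estimate-pi name and the sample count are just for illustration):

;; Sequential Monte Carlo estimate of Pi, for reference only.
(defn estimate-pi [samples]
  (let [inside (count (filter (fn [_]
                                (let [x (- (* (rand) 2) 1)   ; x in [-1, 1)
                                      y (- (* (rand) 2) 1)]  ; y in [-1, 1)
                                  (< (+ (* x x) (* y y)) 1)))
                              (range samples)))]
    (/ (* 4.0 inside) samples)))

;; (estimate-pi 100000) ;=> something close to 3.14

The Spark driver below distributes exactly this sampling step across the worker nodes.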
(ns spark.clojure.pi
  (:gen-class)
  (:import [org.apache.spark SparkConf]
           [org.apache.spark.api.java JavaSparkContext]
           [org.apache.spark.api.java.function Function Function2]))

(defn -main [& args]
  (let [conf (SparkConf.)
        _ (.setAppName conf "ClojureSparkPi")
        ;; number of partitions (default 1), with 100000 samples each
        no-slices (if (seq args) (Integer/parseInt (first args)) 1)
        n (* no-slices 100000)
        count
        (-> (JavaSparkContext. conf)
            (.parallelize (vec (range n)) no-slices)
            ;; map each sample to 1 if a random point in [-1,1] x [-1,1]
            ;; falls inside the unit circle, else 0
            (.map (proxy [Function] []
                    (call [i]
                      (let [x (- (* (rand) 2) 1)
                            y (- (* (rand) 2) 1)]
                        (if (< (+ (* x x) (* y y)) 1)
                          1
                          0)))))
            ;; sum up the hits
            (.reduce (proxy [Function2] []
                       (call [i j] (+ i j)))))]
    (println "pi is roughly" (/ (* 4.0 count) n))))
Overall this does not look too horrible. The function declarations are a bit awkward, but they could be conveniently wrapped via macros, as sketched below.
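For illustration, a wrapper along these lines could hide the proxy boilerplate. This is just a sketch; the spark-fn name is made up here and not part of Spark or clj-spark:

;; Hypothetical helper: picks the Spark Java API function interface by
;; argument count and builds the proxy for us.
(defmacro spark-fn
  [args & body]
  (let [iface (case (count args)
                1 'org.apache.spark.api.java.function.Function
                2 'org.apache.spark.api.java.function.Function2)]
    `(proxy [~iface] []
       (call [~@args] ~@body))))

;; The map/reduce steps above would then shrink to something like
;;   (.map rdd (spark-fn [i] ...))
;;   (.reduce rdd (spark-fn [i j] (+ i j)))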
To execute the code, first create an uberjar via Leiningen and then run it against the Spark deployment, similar to the Scala SparkPi example above:

> bin/spark-submit --master spark://mylocalbox:7077 --class spark.clojure.pi <path to Clojure project>/spark-0.1.0-SNAPSHOT-standalone.jar
...
14/07/27 19:50:57 INFO spark.SparkContext: Job finished: reduce at pi.clj:25, took 6.575753 s
pi is roughly 3.14252