In the previous post, a Spark cluster was set up using 2 VirtualBox Ubuntu guests. While this is a viable option for many, it may not be convenient for everyone. For those who find setting up such a cluster inconvenient, there is another option, which is relying on Spark's local mode. In this post, a BitBucket repository is introduced; it is an R project that includes Spark 1.6.0 pre-built for Hadoop 2.0 and later as well as hadoop-common 2.2.0 - the latter is necessary if the project is tested on Windows. Then several initialization steps are discussed, such as setting up environment variables and the library path, as well as including the spark-csv package and a JDBC driver. Finally, some examples of reading JSON and CSV files in cluster mode are shown.
Spark 1.6.0 pre-built for Hadoop 2.0 and later is downloaded and, after decompressing, renamed as spark. Also, hadoop-common 2.2.0 is downloaded from this GitHub repo and saved within the spark folder as hadoop. The SparkR package is in R/lib, and the bin path includes files that execute Spark applications interactively and in batch mode. The conf folder includes a number of configuration templates. At the moment, only one of the templates is modified - log4j.properties.template. This template is renamed as log4j.properties and log4j.rootCategory is set to WARN as shown below. Previously it was INFO, which can be distracting as a lot of messages are printed with that option.
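The change in conf/log4j.properties is a single line; the console appender name below follows the default template and the rest of the file is left as-is:

```
# reduce console logging from INFO to WARN
log4j.rootCategory=WARN, console
```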
The spark folder in the repository is shown below.
There are 2 data files. iris.json is the popular iris data set in JSON format. iris_up.csv is the same data set in CSV format with 3 extra columns - 1 integer, 1 date and 1 integer column with NA values - how to read them will be discussed shortly. postgresql-9.3-1103.jdbc3.jar is a JDBC driver to connect to PostgreSQL-compatible database servers such as a PostgreSQL server or Amazon Redshift - you may add another driver for your own DB server.
The script sets two environment variables: SPARK_HOME and HADOOP_HOME. The latter is mandatory if you're running this example on Windows (possibly in local mode); otherwise the following error is thrown: java.lang.NullPointerException. Note that the Spark pre-built distribution doesn't include hadoop-common, so it is downloaded from this repo and added within the spark folder - the folder is named hadoop. If the script runs on Linux, this part can be commented out as in the cluster mode example below.
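A minimal sketch of this step, assuming the repository root is the working directory (the paths mirror the folder layout described above):

```r
# point SPARK_HOME at the bundled Spark distribution
Sys.setenv(SPARK_HOME = file.path(getwd(), "spark"))

# HADOOP_HOME is only needed on Windows, where the hadoop-common
# binaries are required to avoid a java.lang.NullPointerException;
# on Linux this block can be commented out
if (.Platform$OS.type == "windows") {
  Sys.setenv(HADOOP_HOME = file.path(Sys.getenv("SPARK_HOME"), "hadoop"))
}
```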
Also, the spark bin directory is added to the PATH environment variable - this path is where spark-submit (the Spark batch executor) and the Spark REPL launchers exist. Then the path of the DB driver (postgresql-9.3-1103.jdbc3.jar) is specified - see postgresql_drv. As can be seen in sparkR.init(), the path is added to an environment variable on the worker nodes via sparkEnvir, and the driver is shipped to the worker nodes by setting sparkJars.
Finally, the search path is updated to add the SparkR package (sparkr_lib). For the local mode, the master can be set as local[*] if all available cores are to be used, or a specific number of cores can be given instead - see spark_link. The last environment variable (SPARKR_SUBMIT_ARGS) controls spark-submit. In this setting, the spark-csv package is included at launch.
At the end, a Spark context (sc) and a SQL context (sqlContext) are defined. Note that, in this way, it is possible to run a Spark application interactively as well as in batch mode using Rscript, rather than using spark-submit.
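Putting these initialization steps together, the script can be sketched as below. The variable names (postgresql_drv, sparkr_lib, spark_link) mirror those mentioned above; the spark-csv version, the app name and the SPARK_CLASSPATH variable used inside sparkEnvir are assumptions for illustration:

```r
# add the Spark bin directory to PATH so spark-submit and the REPL launchers are found
Sys.setenv(PATH = paste(Sys.getenv("PATH"),
                        file.path(Sys.getenv("SPARK_HOME"), "bin"),
                        sep = .Platform$path.sep))

# the JDBC driver shipped with the repository
postgresql_drv <- file.path(getwd(), "postgresql-9.3-1103.jdbc3.jar")

# make the bundled SparkR package visible on the search path
sparkr_lib <- file.path(Sys.getenv("SPARK_HOME"), "R", "lib")
.libPaths(c(sparkr_lib, .libPaths()))

# local[*] uses all cores; replace * with a number to limit them
spark_link <- "local[*]"

# include the spark-csv package at launch (version is an assumption)
Sys.setenv(SPARKR_SUBMIT_ARGS =
  '"--packages" "com.databricks:spark-csv_2.10:1.3.0" "sparkr-shell"')

library(SparkR)

# pass the driver path to the workers via sparkEnvir and ship the jar via sparkJars
sc <- sparkR.init(master = spark_link,
                  appName = "sparkr-local",
                  sparkEnvir = list(SPARK_CLASSPATH = postgresql_drv),
                  sparkJars = postgresql_drv)
sqlContext <- sparkRSQL.init(sc)
```

Being plain R, a script like this runs interactively as well as in batch mode with Rscript, without going through spark-submit.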
In order to run the script in the cluster mode, the two data files (iris.json and iris_up.csv) are copied to ~/data in both the master and slave machines. (Files should exist in the same location if you’re not using HDFS, S3 …) I simply used WinSCP and you may find this post useful. Also I started a cluster by ~/spark/sbin/start-all.sh - see this post for further details.
The main difference is the Spark master, which is set to be spark://192.168.1.10:7077.
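In cluster mode, only the master URL differs when creating the contexts; a hedged sketch (postgresql_drv is the driver path set earlier, and the app name is an assumption):

```r
# connect to the standalone cluster instead of running in local mode
sc <- sparkR.init(master = "spark://192.168.1.10:7077",
                  appName = "sparkr-cluster",
                  sparkJars = postgresql_drv)
sqlContext <- sparkRSQL.init(sc)
```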
The JSON format is built-in, so it suffices to specify the source. By default, read.df() infers the schema (or data types), and all data types are identified correctly except for the last column, which includes some NA values.
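Reading the JSON file can be sketched as follows, with the file path following the ~/data location used above:

```r
# JSON is a built-in data source, so specifying source = "json" suffices
iris_json <- read.df(sqlContext, path.expand("~/data/iris.json"), source = "json")

# inspect the inferred schema; the column containing NA values
# is the one whose type is not identified as expected
printSchema(iris_json)
```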
Schema inference is worse for CSV, as the date field is identified as string.
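With the spark-csv package included at launch, reading the CSV file with inferred types can be sketched as below; the header and inferSchema options follow the spark-csv package:

```r
# let spark-csv infer the types; the date column will come out as string
iris_csv <- read.df(sqlContext, path.expand("~/data/iris_up.csv"),
                    source = "com.databricks.spark.csv",
                    header = "true", inferSchema = "true")
printSchema(iris_csv)
```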
It is possible to specify individual data types by constructing a custom schema (customSchema). Note that the last column (null) is set as string in the custom schema and converted into integer using cast() after the data is loaded. The reason is that NA is read as a string, so the following error would be thrown if the column were set as integer: java.lang.NumberFormatException: For input string: "NA".
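A sketch of the custom schema approach: the iris columns plus the 3 extra columns (integer, date, and the NA-containing column called null) follow the description above, while the exact column names are assumptions:

```r
# last column is declared as string to avoid NumberFormatException on "NA"
customSchema <- structType(
  structField("Sepal_Length", "double"),
  structField("Sepal_Width", "double"),
  structField("Petal_Length", "double"),
  structField("Petal_Width", "double"),
  structField("Species", "string"),
  structField("ind", "integer"),
  structField("dt", "date"),
  structField("null", "string")
)

iris_csv <- read.df(sqlContext, path.expand("~/data/iris_up.csv"),
                    source = "com.databricks.spark.csv",
                    header = "true", schema = customSchema)

# convert the string column to integer after loading; "NA" values become null
iris_csv$null <- cast(iris_csv$null, "integer")
```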