Apache Spark for beginners- Chapter 1 : Get Started

Mahendran
3 min read · Jun 23, 2020

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Get started today: download the simple WordCount program, open it as a Maven project in your IDE (IntelliJ IDEA), run it, and begin your Spark learning.

Chapter 1: Local Setup and Run a Wordcount App

Objective

Get your hands dirty by running a simple wordcount program in under 5 minutes. The project is built with Maven for Scala and Spark 3.0, with FunSpec as the test suite.

Prerequisites

Java 11 | brew install openjdk@11

Scala 2.12 | brew install scala@2.12

Install or upgrade spark to latest version

brew install apache-spark

OR

Download Spark from https://spark.apache.org/downloads.html

mkdir -p /Users/${USER}/repos
tar -xvf spark-3.5.0-bin-hadoop3.tgz -C /Users/${USER}/repos

echo 'export SPARK_HOME="/Users/${USER}/repos/spark-3.5.0-bin-hadoop3"' >> ~/.zshrc
echo 'export PATH="${SPARK_HOME}/bin:${PATH}"' >> ~/.zshrc
source ~/.zshrc
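The two exports above can be verified in one go. A minimal sketch that writes them to a throwaway rc file (so it won't touch your real ~/.zshrc), sources it, and echoes the result — it assumes the extraction path used in this chapter, written as ${HOME} rather than /Users/${USER}:

```shell
# Write the Spark exports to a temporary rc file and source it,
# so the setup can be checked without modifying ~/.zshrc.
rc="$(mktemp)"
echo 'export SPARK_HOME="${HOME}/repos/spark-3.5.0-bin-hadoop3"' >> "$rc"
echo 'export PATH="${SPARK_HOME}/bin:${PATH}"' >> "$rc"
. "$rc"
echo "SPARK_HOME=${SPARK_HOME}"
# Once Spark is actually extracted there, spark-submit resolves from
# ${SPARK_HOME}/bin:  command -v spark-submit
```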

Start Spark locally

cd $SPARK_HOME
./sbin/start-all.sh
mkdir /tmp/spark-events
./sbin/start-history-server.sh

Check the UIs

Master UI: http://localhost:8080/

History Server: http://localhost:18080/

Alternatively, create a new Maven-Scala-Spark project from scratch by following the setup chapter listed in the index below.

1. Clone the git project

spark-poc

Add Scala framework support to the project to enable the Scala compiler in the IDE.

2. Build the project

mvn clean package

3. Setup

mkdir -p /tmp/input /tmp/output

cp src/main/resources/data.txt /tmp/input/

4. Execute the jar

The master URL (here spark://localhost:7077) is shown at the top of the Master UI at http://localhost:8080/.

spark-submit --name wordcount_`date +%F_%T` \
--class com.mahendran.example.wordcount.WordCount \
--master spark://localhost:7077 \
target/spark-poc-1.0-SNAPSHOT.jar \
/tmp/input/data.txt /tmp/output

Note: options such as --queue and spark.yarn.submit.waitAppCompletion apply only when submitting to YARN; they have no effect on a standalone master.

4.1 verify the results

Verify the output directory

Note: The input file data.txt is ~21MB, which is less than the configured split size of 128MB, so the number of partitions is 1.
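The partition count follows from simple arithmetic: ceil(file size / split size). A quick sketch of that calculation, using the 21MB sample file and the 1.1GB exercise dataset from later in this chapter:

```shell
# Partition count for a file input: ceil(file_size / split_size),
# computed here with integer arithmetic.
split_mb=128
small_mb=21      # the sample data.txt
large_mb=1100    # the ~1.1GB exercise dataset
partitions_small=$(( (small_mb + split_mb - 1) / split_mb ))
partitions_large=$(( (large_mb + split_mb - 1) / split_mb ))
echo "21MB  -> $partitions_small partition(s)"   # -> 1
echo "1.1GB -> $partitions_large partition(s)"   # -> 9
```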

5. Override the default split size

In the code, WordCount.scala, the default split size (32MB for the local filesystem) is overridden by setting the minimum input split size (which defaults to 0) to 128MB.

6. Exercise: Generate 1GB dataset and run the wordcount

mkdir -p /tmp/input /tmp/output

cd src/main/shell

./data-gen.sh

mv 1gb-data.txt /tmp/input/data.txt
Note the number of partitions changed to 9 (ceil(1.1GB / 128MB) = 9).

6.1 Repartition the resulting dataset

Repartitioned into 3 files

7. Verify

ls -l /tmp/output
Three part files are generated, each listing the number of occurrences of each word.
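As a sanity check on the Spark output, the same word counts can be produced for a small sample with plain coreutils — split on whitespace, sort, group, count. The sample file below is made up for illustration:

```shell
# Cross-check the wordcount logic with coreutils on a tiny sample:
# one word per line, then group identical lines and count them.
printf 'spark runs spark jobs\n' > /tmp/sample.txt
tr -s '[:space:]' '\n' < /tmp/sample.txt | sort | uniq -c | sort -rn
# The top line of output is "2 spark".
```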

Now you have a working environment. Start exploring.

Notes:

  1. To create a 10KB dataset from dictionary
 cat /usr/share/dict/words | sort -R | head -1024 > data.txt
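data-gen.sh lives in the repo and is not shown here; a minimal sketch of the same idea is to double a small seed file until it reaches a target size. The sketch is scaled down to a few KB so it runs instantly — swap the target for ~1073741824 bytes to reproduce the 1GB exercise. The seed text and paths are illustrative:

```shell
# Grow a dataset by repeatedly doubling a seed file until it
# reaches the target size (in bytes).
seed=/tmp/seed.txt
out=/tmp/gen-data.txt
printf 'the quick brown fox jumps over the lazy dog\n' > "$seed"
cp "$seed" "$out"
target=4096   # bytes; use ~1073741824 for the 1GB exercise
while [ "$(wc -c < "$out")" -lt "$target" ]; do
  cat "$out" "$out" > "${out}.tmp" && mv "${out}.tmp" "$out"
done
wc -c "$out"
```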


Index

  1. Setup Spark-Scala-Maven In Intellij IDEA
  2. Chapter 1 : Get Started [You are here]
  3. Chapter 1.1 : Configure Spark Web UI in Local

Comment any issues or feedback below. I worked through a lot of issues to set this up in a way that lets me run the Spark-Scala tests locally in debug mode.


Mahendran

A Software/Data Engineer, Photographer, Mentor, and Traveler