Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
Get started by opening the simple WordCount Maven project in your IDE (IntelliJ IDEA), running it, and continuing your Spark learning from there.
Chapter 1: Local Setup and Run a Wordcount App
Objective
Get your hands dirty by running a simple wordcount program in under 5 minutes. The project is built with Maven (Scala, Spark 3) and uses FunSpec as the test suite.
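The heart of the application is a classic word count. The snippet below is a minimal sketch of what the job does; the actual WordCount.scala in the repository may differ in structure and argument handling.

import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // args(0) = input file, args(1) = output directory
    val spark = SparkSession.builder().appName("WordCount").getOrCreate()
    spark.sparkContext
      .textFile(args(0))            // read the input file
      .flatMap(_.split("\\s+"))     // split lines into words
      .map(word => (word, 1))       // pair each word with a count of 1
      .reduceByKey(_ + _)           // sum the counts per word
      .saveAsTextFile(args(1))      // write (word,count) pairs to the output directory
    spark.stop()
  }
}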
Prerequisites
java 11 | brew install openjdk@11
scala 2.12 | brew install scala@2.12
Install or upgrade Spark to the latest version
brew install apache-spark
OR
Download Spark from https://spark.apache.org/downloads.html
tar -xvf spark-3.5.0-bin-hadoop3.tgz -C /Users/${USER}/repos
echo "export SPARK_HOME=/Users/${USER}/repos/spark-3.5.0-bin-hadoop3" >> ~/.zshrc
echo 'export PATH="${SPARK_HOME}/bin:$PATH"' >> ~/.zshrc
source ~/.zshrc
Start Spark locally
cd $SPARK_HOME
./sbin/start-all.sh
mkdir /tmp/spark-events
./sbin/start-history-server.sh
Check the master UI at http://localhost:8080 and the history server UI at http://localhost:18080 (the history server reads event logs from /tmp/spark-events).
Alternatively, follow the Setup Spark-Scala-Maven In Intellij IDEA guide linked in the Index below to create a new Maven-Scala-Spark project from scratch.
1. Clone the git project
Add Scala framework support to the project to enable the Scala compiler in the IDE
2. Build the project
mvn clean package
3. Setup
mkdir -p /tmp/input /tmp/output
cp src/main/resources/data.txt /tmp/input/
4. Execute the jar
The master URL for the local standalone cluster is spark://localhost:7077; it is shown on the master UI.
spark-submit --name wordcount_`date +%F_%T` \
--class com.mahendran.example.wordcount.WordCount \
--conf spark.yarn.submit.waitAppCompletion=false \
--master spark://localhost:7077 \
--queue testing \
target/spark-poc-1.0-SNAPSHOT.jar \
/tmp/input/data.txt /tmp/output
4.1 Verify the results
Note: The input file data.txt is ~21 MB, which is less than the configured split size of 128 MB (see step 5), so the number of partitions is 1 in this case.
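You can check both the output and the input partitioning yourself from spark-shell. A quick sketch, assuming the paths used above:

// Peek at the word-count output written by the job
spark.read.textFile("/tmp/output").show(10, false)

// Reproduce the 128 MB minimum split size from step 5, then count input partitions
sc.hadoopConfiguration.setLong("mapreduce.input.fileinputformat.split.minsize", 128L * 1024 * 1024)
sc.textFile("/tmp/input/data.txt").getNumPartitions   // 1 for the ~21 MB file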
5. Override the default split size
In the code, WordCount.scala, the default block size of the local file system (32 MB) is overridden by setting the minimum input split size (default 0) to 128 MB.
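A minimal sketch of how such an override can be applied on the Hadoop configuration before reading the input; the exact property and value used in WordCount.scala may differ, and spark and inputPath stand in for the session and input path already defined in the job.

// Raise the minimum input split size to 128 MB so a small file lands in one partition
val minSplitBytes = 128L * 1024 * 1024
spark.sparkContext.hadoopConfiguration
  .setLong("mapreduce.input.fileinputformat.split.minsize", minSplitBytes)
val lines = spark.sparkContext.textFile(inputPath)   // read using the new split size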
6. Exercise: Generate 1GB dataset and run the wordcount
mkdir -p /tmp/input /tmp/output
cd src/main/shell
./data-gen.sh
mv 1gb-data.txt /tmp/input/data.txt
6.1 Repartition the resulting dataset
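With a 1 GB input and a 128 MB split size the job reads roughly 8 partitions. To control the number of output files you can repartition before writing. A sketch, where counts stands for the (word, count) RDD built earlier and the target of 4 partitions is illustrative, not taken from the project:

// Repartition the word counts before writing to control the number of output files
counts
  .repartition(4)                  // illustrative; choose based on data size and cores
  .saveAsTextFile("/tmp/output")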
7. Verify
ls -l /tmp/output
Now you have a working environment. Start exploring.
Notes:
- To create a ~10 KB dataset from the system dictionary
cat /usr/share/dict/words | sort -R | head -1024 > data.txt
Index
- Setup Spark-Scala-Maven In Intellij IDEA
- Chapter 1 : Get Started [You are here]
- Chapter 1.1 : Configure Spark Web UI in Local
Comment with issues/feedback below. I went through a lot of issues to set this up in a way that the Spark-Scala tests can be run locally in debug mode.
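For reference, a local test in this style typically spins up a local SparkSession inside a FunSpec suite, which is what makes debugging from the IDE possible. A sketch, assuming ScalaTest's AnyFunSpec; the actual test class in the repository may differ:

import org.apache.spark.sql.SparkSession
import org.scalatest.funspec.AnyFunSpec

class WordCountSpec extends AnyFunSpec {
  describe("WordCount") {
    it("counts each word once per occurrence") {
      // Local master so the test runs (and can be debugged) without a cluster
      val spark = SparkSession.builder().master("local[2]").appName("wordcount-test").getOrCreate()
      val counts = spark.sparkContext
        .parallelize(Seq("to be or not to be"))
        .flatMap(_.split("\\s+"))
        .map((_, 1))
        .reduceByKey(_ + _)
        .collectAsMap()
      assert(counts("to") == 2 && counts("or") == 1)
      spark.stop()
    }
  }
}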