How to choose spark settings?

Spark comes with a plethora of settings. Good luck trying to make sense of them and setting their values. There is ton of documentation but not one good page describing in step by step form how to choose the values of important parameters like:

  • number of executors
  • number of cores per executor
  • executor memory
  • executor memory overhead
  • spark.default.parallelism
  • spark.sql.shuffle.partitions
  • spark.memory.storageFraction
  • spark.memory.fraction
  • spark dynamicAllocation

Spark processes a job in stages. Your goal as a developer is to minimize the # of stages since data needs to be shuffled between stages and that is an expensive operation; also the stages cannot be executed in parallel – one stage has to complete before the next one can be executed. Within a stage, spark processes data in tasks. Thus N tasks need to be completed to complete a stage. Tasks can be executed in parallel. A task is processed by a thread. And a task processes a partition of the data. Executor is a JVM instance (i.e., its a process) that lives for the lifetime of the spark job and executes tasks. Just as multiple threads live in a process and share the process memory, so multiple tasks can live in a executor and share the executor memory. Task has a 1:1 relationship with a thread (a task is executed in a thread) and thread has 1:1 relationship with CPU core (a thread consumes one CPU core when running)

In general, it is recommend to use 2-3 tasks per CPU core in the cluster [link]. we can start with the total number of virtual cores available in the cluster. Lets say that number is 500 and we want to use 80% of the cluster’s capacity. This means we want to use total 0.8*500 = 400 VCores. That settles one number. So total # of tasks = pick a number between 2 and 3 * 400 = 2.5 * 400 = 1000. That settles another number. Since a task has a 1:1 relationship with a partition, that means spark.default.parallelism and spark.sql.shuffle.partitions should be set to 1000.

Sandy Ryza notes that

  • I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.

  • Running tiny executors (with a single core and just enough memory needed to run a single task, for example) throws away the benefits that come from running multiple tasks in a single JVM. For example, broadcast variables need to be replicated once on each executor, so many small executors will result in many more copies of the data.

So lets set number of cores per executor = 5 based on above notes (a constant which does not need to be changed based on cluster properties etc.) Any number between 3-5 should be fine in general.

as we have calculated total # of cores and # of cores per executor, we can now calculate the # of executors as 400 / 5 = 80. 

the executor memory can be set equal to the max. executor memory provided by the administrator minus the overhead. the default non-heap overhead in spark is 384MB. One might want to bump this to 1GB if one encounters the “container killed by YARN for exceeding memory limits” error.

If you are using Spark SQL, you might sometimes have to forcefully repartition the dataframe into the # of partitions calculated above. Be careful because in our case spark would not repartition the data even when we asked it to by making a call to repartition. To force it to repartition the data, we followed the call to repartition with persist. When spark has to shuffle data between stages it will use the # of partitions set by spark.default.parallelism or spark.sql.shuffle.partitions, but during the first stage it does not automatically use the # of partitions set by spark.default.parallelism or spark.sql.shuffle.partitions. And if the # of partitions is only 20 whereas 400 VCores were allocated in spark-submit, only 20 tasks will be active at a time and 380 VCores will be idle. Don’t let this happen! The executors tab in the spark UI shows how many tasks are active at a given point of time. make sure the # of tasks here are equal to the # of vcores – otherwise there are CPUs sittle idle.

If data is not being cached or persisted, then the storage memory fraction can be set to 0. in our experience we found spark dynamic allocation is more of a hassle. our jobs would sometimes get stuck when using dynamic allocation. moreover, the time to process becomes more unpredictable. static allocation translates to a predictable time to process.

So here we are again:

  • number of cores per executor = 4 (average of 3 and 5)
  • number of executors = # of cores in cluster * load factor / number of cores per executor
  • executor memory = can set it to max. available – the overhead
  • executor memory overhead = default is 384MB. bump it to 1GB if getting “container killed by YARN for exceeding memory limits”
  • spark.default.parallelism = 2.5 * # of cores in cluster * load factor
  • spark.sql.shuffle.partitions = 2.5 * # of cores in cluster * load factor
  • spark.memory.storageFraction = 0 if not doing any caching or persisting
  • spark.memory.fraction
  • spark dynamicAllocation = false (gave us more trouble than the promised benefit)

More Notes:

Do use kryo serialization when running spark jobs

Further Reading:

Posted in Software | Leave a comment

10 reasons why outages occur

  1. Most often outages happen because of config changes in dependencies used by your application. To give an example, lets say you are running hadoop jobs on a cluster and relying on HADOOP_USER_NAME to set the privileges under which your job runs. This has been working for quite some time. But now the cluster admin enables kerberos on the cluster and your job suddenly fails and you have an outage.

When Kerberos is disabled, the identity of a user is picked up by Hadoop first from the environment variable HADOOP_USER_NAME, then from the OS-level username (e.g. the system property

2. Second most common cause of outage is outage in a dependency. Say you are relying on AWS S3 to fetch some static content. And this happened.

3. A third common reason is expired credentials. Your service has been working fine for months. But now all of a sudden some credentials expire and you get a pager alarm.


Posted in Software | Leave a comment

Making GQL queries in new appengine dashboard

select the icon at the left


then select datastore from the dropdown


you should now be able to run GQL queries like before

Posted in Software | Leave a comment

Steps to setup NES on Rasberry Pi

I have this model of Rasberry Pi

Raspberry Pi 2 Model B (1GB) Basic Starter Kit Includes Raspberry Pi 2 Model B–

To setup NES you need 7-zip and Win32DiskImager

  1. Download Retropie v4.1
  2. Extract using 7-zip. This step should dump out a 2GB file or so
  3. Insert SD card and open Win32DiskImager
  4. Write the file in step 2 to the SD Card
  5. Insert SD card into rasberry pi, plug in ethernet cable, and boot up
  6. Transfer ROMS to NES folder over Ethernet
  7. Might need to restart to see the games

That’s It! No need to format the SD card etc. in case of problems. Just overwrite the image using step 4.

Posted in Computers | Leave a comment

Failed to load class org.slf4j.impl.StaticLoggerBinder

Ran into this issue today while trying to run a program from IntelliJ and using SLF4J for logging. pom.xml looked like below:



I was not creating any fat jar. The problem was that IntelliJ was not able to import the

slf4j-simple dependency. To check this goto View -> Tool Windows -> Maven Projects.
Saw red swiggly lines as described in Could not find artifact descriptor.
Downloaded the file manually from and installed it manually by running:

mvn install:install-file -Dfile=/Users/siddjain/Downloads/slf4j-simple-1.6.1.jar -DgroupId=org.slf4j -DartifactId=slf4j-simple -Dversion=1.6.1 -Dpackaging=jar

Now did a reimport in IntelliJ and the error went away

Posted in Software | Leave a comment

IntelliJ IDEA: Cannot resolve symbol

Step 1. Check that your Settings (accessed from File -> Settings) match following screenshots or equivalent:





Step 2: Close IntelliJ

Step 3: Delete all .idea directory and .iml files in your repo:

rm $(find . -name *.iml)

rm -r $(find . -name .idea)

will delete all files recursively on mac. On windows use the simpler

dir *.iml /s

dir .idea /s

Step 4: Delete IDEA system directory. On Windows the IDEA system directory is C:\Users\username\.IntelliJIdea2016.2\system

On Mac do:

rm -r ~/Library/Logs/IntelliJIdea15/*

rm -r ~/Library/Caches/IntelliJIdea15/*


First of all you should try File | Invalidate Caches and if it doesn’t help, delete IDEA system directory. Then re-import the Maven project and see if it helps.



Posted in Software | Leave a comment

Scenic Places

Lower Antelope Canyon


Valley Of Fire State Park


Red Rock Canyon


Hoover Dam


Butchart Gardens


Taj Mahal


Taj Mahal Entrance


Posted in Travel | Leave a comment