Apache Spark - Which language to pick?

Karthik Kamalakannan, Founder and CEO

Even after all these years, people still ask the community which language to use for their Apache Spark project. If you browse the forums today, you will find a slight bias towards Scala over Python. Here are the reasons usually given for that bias:

  • Scala is generally faster than Python, since it compiles to JVM bytecode and Spark itself runs on the JVM.
  • You can write neat code with Scala.
  • Maintainability is much higher since the code is clean.
  • Writing code is easier with the Play framework in place.
  • Writing asynchronous code is much easier with Scala.
  • Scala makes parallel processing easier for developers.
  • Scala helps in real-time data streaming and processing with server push capabilities.

None of this means that Python is bad for Spark or for processing data at scale. Streaming data processing is not just about speed and performance; there are plenty of other things to consider.

The DataFrame API is probably one of the things you will work with most when processing data. It is here that Python pulls ahead of Scala in convenience and largely closes the performance gap: DataFrame operations written with the built-in expressions are executed by Spark's JVM engine rather than by Python workers. Sticking to those built-in expressions, and avoiding a lot of data transfer between the DataFrame and Resilient Distributed Datasets (RDDs), will push your processing speeds further. Python is definitely the better choice for getting started with Spark. The learning curve is gentler, community support is widespread (and your peers probably know the language already), and making calls is easy.
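To make the point about built-in expressions concrete, here is a minimal PySpark sketch. It assumes PySpark is installed and a local Spark session is acceptable; the names `normalize_city` and `compare_paths` and the sample rows are illustrative only and do not come from our benchmark. It puts a Python UDF next to the equivalent built-in expressions, which is where the difference in data movement shows up.

```python
def normalize_city(name):
    """Plain Python logic. Wrapped in a UDF, it runs row by row in Python workers."""
    return None if name is None else name.strip().upper()


def compare_paths():
    # Imported here so the helper above stays usable without Spark installed.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StringType

    spark = (
        SparkSession.builder
        .master("local[*]")            # assumption: local mode, for illustration
        .appName("builtin-vs-udf")
        .getOrCreate()
    )
    df = spark.createDataFrame([(" paris ",), ("Chennai",), (None,)], ["city"])

    # UDF path: every value is serialized to a Python worker and back to the JVM.
    normalize_udf = F.udf(normalize_city, StringType())
    via_udf = df.withColumn("city_norm", normalize_udf(F.col("city")))

    # Built-in path: trim and upper are evaluated inside the JVM engine,
    # so no per-row round trip to Python is needed.
    via_builtin = df.withColumn("city_norm", F.upper(F.trim(F.col("city"))))

    via_udf.show()
    via_builtin.show()
    spark.stop()
```

Calling `compare_paths()` should print the same normalized column for both versions on this sample data. The second version is the one to prefer whenever a built-in function exists for what you need.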

But again, the debate goes back to the concept of type safety. Python is dynamically typed, so type errors surface only at runtime. For smaller ad hoc experiments that is a trade we happily accept, and our vote there goes to Python. That said, small Python projects and experiments on Spark do not feel production-ready. We are facing this issue ourselves right now: we are running the same operations in Python and Scala on two separate systems with the same configuration and benchmarking the results.
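The type safety trade-off can be sketched in a few lines of plain Python. This is an illustrative example, not code from our project: `total_revenue`, `try_total`, and the sample rows are hypothetical. The type hints document the intent, but Python does not enforce them at runtime, so the bad value is only discovered when the row is actually processed.

```python
from typing import Dict, List, Optional, Tuple


def total_revenue(rows: List[Dict[str, float]]) -> float:
    # The hint says every amount is a float, but nothing checks that at runtime.
    return sum(row["amount"] for row in rows)


def try_total(rows) -> Tuple[Optional[float], Optional[str]]:
    """Return (total, None) on success, or (None, error name) on a type error."""
    try:
        return total_revenue(rows), None
    except TypeError as exc:
        return None, type(exc).__name__


good_rows = [{"amount": 10.0}, {"amount": 5.5}]
bad_rows = [{"amount": 10.0}, {"amount": "7.25"}]  # a string slipped in by mistake

print(try_total(good_rows))  # succeeds
print(try_total(bad_rows))   # fails only once the bad row is reached
```

In a statically typed language such as Scala, the equivalent mistake (passing a string where a `Double` field is declared) would be rejected by the compiler before the job ever runs. That matters more as jobs get longer and more expensive to rerun.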

In our tests so far, Scala has comfortably outperformed Python on large-scale projects. This could be because Scala is a statically typed language, which makes it easier to write, run and maintain.

Our conclusion is pretty simple. Use Python for proofs of concept or experiments, but go with Scala if you want to run your project in production and work at scale.