What is Stream Processing?

Stream Processing vs. Batch Processing

Having worked on various projects in the big data landscape of German companies, I have realized that many of them face similar problems. They have old legacy systems in place that are very good at what they were designed and built for at the time. Today, new technologies arise and new things become possible. People are talking about stream processing and real-time data processing. Everyone wants to adopt new technologies to invest in the future. Even though I personally think this is a reasonable ambition, I am also convinced that one first has to understand these technologies and what they were intended to be used for. While working with various clients I realized that it is not easy to define clearly what stream processing is and which use cases we can leverage it for. Therefore, in this series of articles I will share some of my thoughts and we will elaborate on both approaches to better understand what they are. In this first article I will try to draw a clear distinction between the well-known batch processing and stream processing.

The relational database model is probably the most mature and widely adopted form of database. I would guess that almost all companies have it in place as a central component to implement their operational business. Usually a copy of this central data store exists that is used for analytical use cases. The replication and the analytics queries, however, usually run as so-called batch jobs. Even though most of us might have an extensive background in relational databases, I would like to start the article with a clear definition of the characteristics of a batch job.
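To make this more concrete, here is a minimal sketch of what such a replication-plus-analytics batch job could look like in Spark. The job name, the JDBC URL of the replica and the table and column names are made up for illustration; the point is only that the job reads a bounded snapshot, processes it completely and then terminates.

```scala
import org.apache.spark.sql.SparkSession

object NightlyOrderReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("nightly-order-report")
      .getOrCreate()

    // Read a bounded snapshot of the replicated operational table.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://replica-host/shop") // hypothetical replica
      .option("dbtable", "orders")
      .load()

    // Aggregate over the complete data set known at this point in time ...
    val revenuePerDay = orders
      .groupBy("order_date")
      .sum("amount")

    // ... write the result to the analytics store, then terminate.
    revenuePerDay.write
      .mode("overwrite")
      .parquet("/warehouse/revenue_per_day")

    spark.stop()
  }
}
```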

Read More


The bigger picture: How SparkSQL relates to Spark core (RDD API)

Apache Spark’s high-level API SparkSQL offers a concise and very expressive API for executing structured queries on distributed data. Even though it builds on top of the Spark core API, it is often not clear how the two are related. In this post I will try to draw the bigger picture and illustrate how they fit together.
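As a small appetizer, the following sketch (assuming a local SparkSession and a made-up toy data set) runs a structured query, prints the physical plan that Catalyst compiles it into, and then peeks at the underlying RDD on which the plan is eventually executed:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sparksql-vs-rdd")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // A structured query expressed with the high-level SparkSQL API ...
    val people = Seq(("Alice", 34), ("Bob", 29)).toDF("name", "age")
    val adults = people.filter($"age" >= 30).select($"name")

    // ... is compiled by Catalyst into a physical plan ...
    adults.explain()

    // ... which is ultimately executed as transformations on RDDs.
    println(adults.rdd.toDebugString)

    spark.stop()
  }
}
```

explain() shows the physical plan, while rdd exposes the RDD that Spark core actually runs, which is exactly the relationship this post is about.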

Read More


Choosing the right data model

One essential choice to make when building data-driven applications is the technology of the underlying data persistence layer. As the database landscape becomes more and more densely populated, it is essential to understand the fundamental concepts and their implications in order to choose the appropriate technology for a specific use case. A very comprehensive overview on a conceptual level can be found in Martin Kleppmann’s big data bible “Designing Data-Intensive Applications”. Kleppmann writes: “… every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward.” He compares existing data models along various criteria. Here, I will use these criteria to provide an overview of existing concepts and to elaborate on their advantages and shortcomings.
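To illustrate what “some data transformations feel natural and some are awkward” can mean in practice, here is a toy sketch, not tied to any particular database and with all names invented, that contrasts a normalized, relational-style model with a nested, document-style model of the same one-to-many relationship:

```scala
// Relational style: normalized records linked by a foreign key.
case class User(id: Long, name: String)
case class Position(userId: Long, title: String, company: String)

// Document style: the one-to-many relationship is nested inside one record.
case class PositionDoc(title: String, company: String)
case class UserDoc(id: Long, name: String, positions: Seq[PositionDoc])

object DataModelSketch extends App {
  val users     = Seq(User(1, "Alice"))
  val positions = Seq(Position(1, "Engineer", "Acme"), Position(1, "Lead", "Initech"))

  // In the relational model the profile has to be reassembled with a join ...
  val profile = for {
    u <- users
    p <- positions if p.userId == u.id
  } yield (u.name, p.title, p.company)
  println(profile)

  // ... while in the document model it is stored and read as one unit.
  val doc = UserDoc(1, "Alice",
    Seq(PositionDoc("Engineer", "Acme"), PositionDoc("Lead", "Initech")))
  println(doc.positions.map(_.title))
}
```

Which of the two feels natural depends on the access pattern: the document model reads a whole profile in one piece, while the normalized model makes it easy to query positions across all users.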

Read More


Understanding Apache Spark Hash-Shuffle

As described in Understanding Spark Shuffle, there are currently three shuffle implementations in Spark. Each of them implements the interface ShuffleWriter. The goal of a shuffle writer is to take an iterator of records and write them to a partitioned file on disk - the so-called map output file. The map output file is partitioned so that subsequent stages can fetch the data of a specific partition only.

Several approaches exist to accomplish this, each with specific pros and cons and therefore beneficial in different situations. The ShuffleManager attempts to select the most appropriate one based on your configuration and your application. In this post we will take a deep dive into the so-called BypassMergeSortShuffleWriter, which is also referred to as hash-shuffle. We will understand how it works and elaborate on its up- and downsides to see in which situations it is a good choice.
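As a teaser, the following sketch shows a job for which this writer can come into play. I only rely here on the configuration key spark.shuffle.sort.bypassMergeThreshold and on the fact that groupByKey does no map-side aggregation; whether the bypass writer is actually chosen is exactly what we will examine in the deep dive:

```scala
import org.apache.spark.sql.SparkSession

object BypassShuffleTeaser {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("bypass-merge-sort-teaser")
      .master("local[*]")
      // The bypass path is only considered for shuffles whose reduce-side
      // partition count stays at or below this threshold (default: 200).
      .config("spark.shuffle.sort.bypassMergeThreshold", "200")
      .getOrCreate()
    val sc = spark.sparkContext

    val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

    // groupByKey performs no map-side aggregation, and 10 reduce partitions
    // are far below the threshold, so the hash-style bypass writer can be used.
    val grouped = pairs.groupByKey(10)
    println(grouped.count())

    spark.stop()
  }
}
```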

Read More


Understanding Apache Spark Shuffle

This article is dedicated to one of the most fundamental processes in Spark - the shuffle. To understand what a shuffle actually is and when it occurs, we will first look at the Spark execution model from a higher level. Next, we will go on a journey inside the Spark core and explore how the core components work together to execute shuffles. Finally, we will elaborate on why shuffles are crucial for the performance of a Spark application.

Currently, there are three different shuffle implementations in Spark, each with its own advantages and drawbacks. Therefore, there will be three follow-up articles, one on each of the shuffles. These articles will take a deep dive into each of the implementations to elaborate on their principles and the performance implications they entail.

To fully understand this article you should have a basic understanding of the Spark execution model and the basic concepts: MapReduce, logical plan, physical plan, RDD, partitions, narrow dependency, wide dependency, stages, tasks, ShuffleMapStage and ResultStage.
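If you want to see a shuffle and the resulting stage boundary for yourself before reading on, the following minimal word-count sketch (local mode, toy data) should do; the toDebugString output contains the ShuffledRDD that separates the ShuffleMapStage from the ResultStage:

```scala
import org.apache.spark.sql.SparkSession

object ShuffleStagesDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-stages-demo")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(Seq("spark", "shuffle", "spark", "stage"))

    // map only creates a narrow dependency; reduceByKey introduces a wide
    // dependency, splitting the job into a ShuffleMapStage and a ResultStage.
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // The lineage shows the ShuffledRDD that marks the stage boundary.
    println(counts.toDebugString)
    counts.collect().foreach(println)

    spark.stop()
  }
}
```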

Read More
