Lambda is one of the most discussed architecture patterns in the data science space.
Let's look at what it really means.
Lambda is a data processing architecture and framework designed to address the robustness, scalability, and fault tolerance of big data systems.
In this study we focus on the batch and speed layers to achieve data processing. As proposed in the summary, we will use Spark for both the batch and stream layers.
What constitutes the Lambda architecture for data processing?
Lambda is an architecture pattern. The architecture we investigate implements Lambda primarily with Spark for batch and stream processing, Cassandra for NoSQL database storage, Kafka for receiving and sending the stream data, and Zeppelin for visualisation.
Let's investigate what a sample application workflow would look like.
As the architecture diagram makes clear, the data processing architecture has two processing engines: a real-time processing layer and a batch processing layer. The real-time layer is handled by Apache Spark; in fact, Apache Spark can be used to power both the batch and real-time processing engines. The batch layer can be built using traditional data processing platforms such as HDFS (the Hadoop Distributed File System). The data persistence layer also sits on HDFS so that batch operations can be performed against it.
The processed data loads are stored in Cassandra, and it is highly recommended to use a tool such as Zeppelin to visualise the data. To work against the data, use Spark and Cassandra commands for any custom new queries. The application should also use a click-stream producer to send data to Kafka in a few different formats.
The sample application we are going to create uses Spark to synchronise the data to HDFS and perform the stream and batch processing.
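To make the click-stream producer step concrete, here is a minimal sketch of how such events could be built and serialized. The event fields, page paths, and user IDs are illustrative assumptions, not the application's actual schema; a real producer would hand each serialized payload to a Kafka client (for example, kafka-python's KafkaProducer) and publish it to a topic instead of collecting it in a list.

```python
import json
import random
import time
import uuid

PAGES = ["/home", "/products", "/cart", "/checkout"]  # assumed sample pages

def make_click_event(user_id):
    """Build one click event as a plain dict (illustrative schema)."""
    return {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id,
        "page": random.choice(PAGES),
        "timestamp": int(time.time() * 1000),  # epoch milliseconds
    }

def produce(n_events):
    """Serialize n_events click events to JSON-encoded bytes,
    the payload format a Kafka producer would publish."""
    return [
        json.dumps(make_click_event("user-%d" % (i % 3))).encode("utf-8")
        for i in range(n_events)
    ]

payloads = produce(5)
```

Sending the events in JSON keeps them readable by both the Spark streaming job and any ad-hoc consumers; the same producer could also emit a second, more compact format if the application needs it.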
What is Apache Spark?
Spark is a general-purpose cluster computing platform designed with components for scheduling and executing against large datasets.
Spark is now at version 2.x, where there is more focus on structured streaming.
How does Spark fit into the Lambda architecture?
Spark is a general engine for large-scale data processing. Like MapReduce it scales horizontally; the major difference comes in the speed. Spark is one of the frameworks that was built to address some of the inefficiencies of MapReduce.
Spark performs specific optimizations by building its own directed acyclic graph (DAG) from your program and optimizing that DAG so that substantially less data hits disk and more is passed through memory instead. Spark also builds its own execution DAG, with its own optimizations and scheduling for executing it. The core strength of Spark's performance compared to other frameworks is that it can utilize memory and cache objects efficiently, and that it keeps a lineage graph of your operations, so it can recompute them on failure. These are two of the fundamental ideas behind Spark's resilient distributed dataset (RDD) implementation.
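The lineage idea can be sketched with a toy dataset class (not Spark's actual API): each derived dataset records its parent and the transformation that produced it, so a lost result can be recomputed by replaying the lineage rather than by replicating the data.

```python
class ToyDataset:
    """A minimal model of lineage-based recovery, assuming a toy
    dataset class rather than Spark's real RDD implementation."""

    def __init__(self, data=None, parent=None, fn=None):
        self._data = data      # materialized values, if any
        self._parent = parent  # lineage: where this dataset came from
        self._fn = fn          # lineage: how it was derived

    def map(self, fn):
        # Record the transformation lazily; nothing is computed yet.
        return ToyDataset(parent=self, fn=fn)

    def compute(self):
        # If the materialized data is gone, recompute it from lineage.
        if self._data is not None:
            return self._data
        return [self._fn(x) for x in self._parent.compute()]

base = ToyDataset(data=[1, 2, 3])
derived = base.map(lambda x: x * 10).map(lambda x: x + 1)
result = derived.compute()  # [11, 21, 31], rebuilt by walking the lineage
```

Because only the small lineage graph needs to survive a failure, recovery is cheap compared to replicating every intermediate result across the cluster.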
What are the Spark components, and how is scheduling performed?
A Spark download usually contains Spark Core, which includes the high-level API and an optimised engine supporting general execution graphs; Spark SQL, for SQL and structured data processing; Spark Streaming, which enables scalable, high-throughput, fault-tolerant stream processing of live data streams; Spark MLlib, which contains the machine learning libraries; and GraphX, for graph computation.
Spark also supports a variety of languages: Java, Python, Scala, and R.
As can be seen, Apache Spark is an important constituent of the Lambda architecture. Let's look at some of the high-level abstractions Apache Spark provides.
Different abstractions are available in Apache Spark.
The fundamental abstraction and building block is the RDD, which stands for "Resilient Distributed Dataset".
An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. RDDs are created by starting with a file in the Hadoop file system (or any other Hadoop-supported file system), or an existing Scala collection in the driver program, and transforming it. Users may also ask Spark to persist an RDD in memory, allowing it to be reused efficiently across parallel operations. Finally, RDDs automatically recover from node failures.
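The partition-and-parallelise idea can be illustrated with plain Python (a toy model, not Spark's API): the collection is split into partitions, a transformation runs on each partition independently, and the partition results are then combined. The data, partition count, and square/sum operations are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(10))
num_partitions = 3

# Split the collection into partitions, roughly what parallelize() does.
partitions = [data[i::num_partitions] for i in range(num_partitions)]

def process_partition(part):
    # The "map" step runs independently on each partition.
    return [x * x for x in part]

# Each partition is processed in parallel by a separate worker.
with ThreadPoolExecutor(max_workers=num_partitions) as pool:
    squared = [x for part in pool.map(process_partition, partitions)
               for x in part]

total = sum(squared)  # the "reduce" step combines partition results
```

Because each partition is processed with no shared state, the same program scales from a handful of threads to a cluster of machines, which is exactly the property the RDD abstraction is built around.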
Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, the data is organized into named columns, like a table in a relational database. Designed to make processing of large data sets even easier, the DataFrame allows developers to impose a structure onto a distributed collection of data, enabling higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data; and it makes Spark accessible to a wider audience beyond specialized data engineers.
Spark 1.6 brings us the Dataset API, which builds on the DataFrame API by adding type safety to the structured table representation of data that DataFrames provide.
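The difference named columns make can be shown with a toy illustration in plain Python (not Spark's API): an RDD-like list of tuples forces you to address fields by position, while a DataFrame-like structure addresses them by name. The sample names and ages are invented for illustration.

```python
# RDD-like: rows are tuples, fields addressed by opaque position.
rdd_like = [("alice", 34), ("bob", 28), ("carol", 41)]
adults_positional = [row for row in rdd_like if row[1] > 30]  # what is [1]?

# DataFrame-like: rows carry column names, so queries are self-describing.
df_like = [{"name": n, "age": a} for n, a in rdd_like]
adults_named = [row for row in df_like if row["age"] > 30]
```

The named form is also what lets an engine inspect and optimise the query (it knows a filter touches only the "age" column), which is the basis of the higher-level optimisations the DataFrame API enables.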
Spark also supports pulling data sets into a cluster-wide in-memory cache. This is very useful when data is accessed repeatedly, such as when querying a small “hot” dataset or when running an iterative algorithm like PageRank.
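The benefit of caching can be sketched with a toy dataset class (an assumed simplified model, not Spark's persist/cache API): once cache() materializes a result, repeated reads reuse the in-memory copy instead of redoing the work, which is what makes iterative algorithms over a hot dataset fast.

```python
class ToyCachedDataset:
    """Recomputes an expensive transformation on every read
    unless the result has been cached."""

    def __init__(self, source):
        self._source = source
        self._cached = None
        self.compute_count = 0  # how many times the work actually ran

    def cache(self):
        # Materialize once and keep the result in memory.
        self._cached = self._materialize()
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._materialize()

    def _materialize(self):
        self.compute_count += 1
        return [x * 2 for x in self._source]

ds = ToyCachedDataset(range(5)).cache()
first = ds.collect()
second = ds.collect()  # served from the cache; compute_count stays at 1
```

An iterative job such as PageRank reads the same dataset on every iteration, so avoiding one recomputation per pass is where most of the speed-up over disk-based engines comes from.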