Location-Stickiness Capability
RDDs are capable of defining placement preferences to compute their partitions. A placement preference is information about the location of an RDD's data. Here the DAG scheduler comes into play and places the partitions so that each task runs close to the data it needs. Hence, the speed of computation is increased.
RDD vs Dataset vs DataFrame
The complete comparison is described below:
RDD APIs
The RDD is the fundamental data structure of Apache Spark. RDDs are immutable (read-only) collections of objects of varying types, computed across the different nodes of a given cluster. They provide the ability to perform in-memory computations on large clusters in a fault-tolerant manner. Every dataset in a Resilient Distributed Dataset is partitioned across many servers so that it can be computed efficiently on the different nodes of the cluster.
DataSet APIs
In Apache Spark, the Dataset is a data structure in Spark SQL that is strongly typed, object-oriented and mapped to a relational schema. It represents a structured query with encoders and is an extension of the DataFrame API. Datasets are both serializable and queryable, so they can be persisted. The API provides a single interface for both Scala and Java and reduces the burden of maintaining separate libraries.
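A minimal sketch, assuming a local SparkSession and a hypothetical `Employee` case class, of how a strongly typed Dataset is built from objects through an implicit encoder:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for illustration.
case class Employee(name: String, age: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._ // brings encoders for case classes into scope

    // A strongly typed Dataset backed by an encoder and a relational schema.
    val employees = Seq(Employee("Alice", 34), Employee("Bob", 41)).toDS()

    // Object-oriented, compile-time checked operations.
    val adults = employees.filter(e => e.age >= 18)
    adults.show()

    spark.stop()
  }
}
```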
DataFrame APIs
DataFrames are Datasets organized into named columns, much like tables in a relational database. The idea is to allow processing of large amounts of structured data. A DataFrame contains rows with a schema, where the schema describes the structure of the data. DataFrames provide memory management and optimized execution plans.
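A minimal sketch, again assuming a local SparkSession and placeholder data, showing a DataFrame with named columns and its schema:

```scala
import org.apache.spark.sql.SparkSession

object DataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // A DataFrame is a Dataset organized into named columns, like a relational table.
    val people = Seq(("Alice", 34), ("Bob", 41)).toDF("name", "age")

    people.printSchema()              // prints the schema: name (string), age (int)
    people.filter($"age" > 40).show() // column-based, SQL-like operations

    spark.stop()
  }
}
```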
Methods to Create an Apache Spark RDD
Three ways to create an RDD in Apache Spark are listed below:
Parallelizing a Collection
We take an already existing collection in the driver program and pass it to SparkContext's parallelize() method. This is the simplest method and creates RDDs quickly in the Spark shell for performing operations on them. It is very rarely used in practice, as it requires the entire dataset to be present on one machine.
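A minimal sketch of this approach, assuming a local SparkContext; the collection and partition count are arbitrary:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ParallelizeExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An existing in-memory collection is handed to parallelize(), yielding an RDD.
    val numbers = Seq(1, 2, 3, 4, 5)
    val rdd = sc.parallelize(numbers, numSlices = 2) // split into 2 partitions

    println(rdd.sum()) // 15.0

    sc.stop()
  }
}
```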
Referencing External Dataset
In Spark, a Resilient Distributed Dataset can be formed from any data source supported by Hadoop, including local file systems, HDFS, HBase, Cassandra, etc. Here, data is loaded from an external dataset. We can use SparkContext's textFile method, which takes the URL of a file and reads it as a collection of lines. The URL can also be a local path on the machine itself.
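A minimal sketch of reading an external file, where the path is a placeholder and could just as well be an HDFS or S3 URL:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TextFileExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TextFileExample").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Placeholder path; any Hadoop-supported URL (file://, hdfs://, s3a://, ...) works.
    val lines = sc.textFile("data/input.txt")

    // Each element of the resulting RDD is one line of the file.
    println(lines.count())

    sc.stop()
  }
}
```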
Creating an RDD from an Existing RDD
A transformation turns one RDD into another, and transformation is the way to create a new RDD from an existing one. This is a point of difference between Apache Spark and Hadoop MapReduce. A transformation works like a function that takes an RDD as input and produces a new RDD as output. The input Resilient Distributed Dataset does not change; because RDDs are immutable, new RDDs are generated by applying operations to existing ones.
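A spark-shell style sketch of this, assuming `sc` is the SparkContext provided by the shell:

```scala
// Assuming `sc` is the SparkContext provided by spark-shell.
val numbers = sc.parallelize(Seq(1, 2, 3, 4, 5))

// map() does not modify `numbers`; it returns a brand new RDD.
val doubled = numbers.map(_ * 2)

// The original RDD is untouched because RDDs are immutable.
println(numbers.collect().mkString(", ")) // 1, 2, 3, 4, 5
println(doubled.collect().mkString(", ")) // 2, 4, 6, 8, 10
```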
Operations on RDDs
There are two operations on Apache Spark RDDs: transformations and actions. A transformation is a function that produces a new Resilient Distributed Dataset from an existing one. It takes an RDD as input and generates one or more RDDs as output. Every transformation creates a new RDD; the input RDDs cannot be changed since they are immutable. Some points are -
- No change is made to the cluster; transformations are evaluated lazily.
- Each transformation produces a DAG that keeps track of when each RDD was created in the lifecycle (its lineage).
- Examples: map(func), filter(func), reduceByKey(func), intersection(dataset), distinct(), groupByKey(), union(dataset), mapPartitions(func), flatMap().
Types of Transformations
- Narrow Transformations: All the elements required to compute the records in a single partition live in a single partition of the parent RDD. Only a limited subset of partitions is used to calculate the result. Narrow transformations are the result of map() and filter().
- Wide Transformations: The elements required to compute the records in a single partition may live in many partitions of the parent RDD. Transformations such as groupByKey() and reduceByKey() are wide, as they shuffle data across partitions; see the sketch below.
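A spark-shell style sketch contrasting the two, assuming `sc` is the SparkContext provided by the shell and using an arbitrary word list:

```scala
// Assuming `sc` is the SparkContext provided by spark-shell.
val words = sc.parallelize(Seq("spark", "rdd", "spark", "dataframe", "rdd", "spark"))

// Narrow transformations: each output partition depends on one parent partition.
val nonEmpty = words.filter(_.nonEmpty)
val pairs    = nonEmpty.map(word => (word, 1))

// Wide transformation: reduceByKey shuffles data across partitions to group keys.
val counts = pairs.reduceByKey(_ + _)

counts.collect().foreach(println) // e.g. (spark,3), (rdd,2), (dataframe,1)
```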
Spark Actions
- Transformations in Apache Spark create RDDs from each other, but to work on the actual dataset we perform actions. An action does not form a new Resilient Distributed Dataset; it returns non-RDD values as results, which are stored on the driver or written to an external storage system. Transformations are lazy, and it is the action that triggers the actual processing.
- Actions are a means of sending data from the executors to the driver, where the executors are responsible for executing tasks and the driver is a JVM process that manages the workers and the execution of tasks.
- Some examples include: count(), collect(), take(n), top(), countByValue(), reduce(), fold(), aggregate(), foreach(). A few of these appear in the sketch below.
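A spark-shell style sketch of some common actions, assuming `sc` is the SparkContext provided by the shell:

```scala
// Assuming `sc` is the SparkContext provided by spark-shell.
val rdd = sc.parallelize(Seq(5, 1, 4, 2, 3))

// Actions return plain (non-RDD) values to the driver and trigger execution.
println(rdd.count())                // 5
println(rdd.take(2).mkString(", ")) // 5, 1
println(rdd.top(2).mkString(", "))  // 5, 4
println(rdd.reduce(_ + _))          // 15
rdd.collect().foreach(println)      // brings all elements to the driver
```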
The Flow of RDDs in the Spark Architecture
- Spark creates a graph of RDDs when you enter code in the Spark console.
- When an action is called on an RDD, Spark submits the graph to the DAG scheduler.
- The DAG scheduler divides the operators into stages of tasks.
- The stages are passed on to the task scheduler, which launches the tasks through the cluster manager (the sketch below shows the graph being built lazily until an action is called).
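A spark-shell style sketch of this flow, assuming `sc` is the SparkContext provided by the shell; toDebugString prints the lineage graph Spark has recorded so far:

```scala
// Assuming `sc` is the SparkContext provided by spark-shell.
val lines  = sc.parallelize(Seq("a b", "b c", "c d"))
val words  = lines.flatMap(_.split(" ")) // transformation: nothing runs yet
val pairs  = words.map((_, 1))           // transformation: still nothing runs
val counts = pairs.reduceByKey(_ + _)    // transformation: the graph keeps growing

// Show the lineage (the graph of RDDs built so far).
println(counts.toDebugString)

// Only when an action is called does Spark hand the graph to the DAG scheduler,
// split it into stages, and launch tasks through the cluster manager.
counts.collect()
```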
Limitations of the Apache Spark RDD
No automatic Optimization
Apache Spark RDDs have no option for automatic input optimization. They cannot make use of Spark's advanced optimizers such as the Catalyst optimizer and the Tungsten execution engine, so only manual Resilient Distributed Dataset optimization is possible. This is overcome by the Dataset and DataFrame concepts, which both use Catalyst to generate an optimized logical and physical query plan, providing space and speed efficiency.
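A minimal sketch, assuming a local SparkSession and placeholder data, of how a DataFrame query can be inspected to see the plans Catalyst produces:

```scala
import org.apache.spark.sql.SparkSession

object CatalystExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CatalystExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val df = Seq(("Alice", 34), ("Bob", 41), ("Cara", 29)).toDF("name", "age")

    // Catalyst rewrites the query into optimized logical and physical plans;
    // explain(true) prints the parsed, analyzed, optimized and physical plans.
    df.filter($"age" > 30).select($"name").explain(true)

    spark.stop()
  }
}
```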
No static and Runtime type safety
The RDD does not provide static or runtime type safety, so such errors only surface when the job actually runs. The Dataset, in contrast, provides compile-time type safety for building complex data workflows; this helps detect errors at compile time and thus makes the code safer.
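A minimal sketch, assuming a local SparkSession and a hypothetical `User` case class, contrasting compile-time checking on a Dataset with a column-name error that a DataFrame only reports at runtime:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical case class used only for illustration.
case class User(name: String, age: Int)

object TypeSafetyExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypeSafetyExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    val users = Seq(User("Alice", 34), User("Bob", 41)).toDS()

    // Dataset: field access is checked by the compiler.
    users.filter(u => u.age > 30).show()
    // users.filter(u => u.salary > 30)  // would not compile: no such field

    // DataFrame: a misspelled column name only fails when the query runs.
    val df = users.toDF()
    // df.filter($"agee" > 30).show()    // compiles, but throws AnalysisException at runtime

    spark.stop()
  }
}
```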
The Problem of Overflow
Performance degrades when there is not enough memory available to store the RDD in memory or on disk. When RAM runs short, the partitions that overflow may be stored on disk, which does not provide the same level of performance. We need to increase the RAM and disk size to overcome this problem.
Overhead of serialization and garbage collection (Performance limitation)
As the RDD is an in-memory object, it involves the overhead of garbage collection and Java serialization, which becomes expensive as the data grows. To overcome this, we can use data structures with fewer objects to lower the cost, or persist objects in serialized form, as sketched below.
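A minimal sketch, assuming a local SparkContext, of persisting an RDD in serialized form and switching to Kryo serialization to lower the cost:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SerializedCacheExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SerializedCacheExample")
      .setMaster("local[*]")
      // Kryo is typically more compact and faster than default Java serialization.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 1000000)

    // Persisting in serialized form trades some CPU for a smaller memory footprint,
    // which reduces garbage collection pressure.
    rdd.persist(StorageLevel.MEMORY_ONLY_SER)

    println(rdd.count())

    sc.stop()
  }
}
```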
No Schema View of Data
The Resilient Distributed Dataset has a problem handling structured data, because it does not provide a schema view of the data and has no provision for one. The Dataset and DataFrame do provide a schema view; a DataFrame is a distributed collection of data organized into named columns.
Conclusion
Hadoop MapReduce had a lot of shortcomings. To overcome them, the Spark RDD was introduced; it offered in-memory processing, immutability and the other features mentioned above, which gave users a better option. But the RDD too had limitations that restricted Spark from being more versatile, and thus the concepts of the DataFrame and the Dataset evolved.
- Get insight on TDD for Big Data and Apache Spark
- Click to read about Apache Spark Optimization Techniques