Wednesday, February 1, 2017

Spark Interview Questions for Professionals


Contents


  1. What is the difference between Spark and Hadoop?

Inspiration
  • Spark: Inspired by Hadoop MapReduce and the Scala programming language; developed by UC Berkeley's AMPLab in 2009. Uses generalized computation instead of MapReduce and offers real-time processing capability.
  • Hadoop: Based on Google's 2004 papers outlining MapReduce. Batch processing.

Speed
  • Spark: Up to 100x faster in memory and 10x faster on disk.
  • Hadoop: Heavy, disk-read-I/O intensive.

Ease of Use
  • Spark: Easily write applications in Java, Scala, Python, or R (functional programming style). Interactive shells are available for Scala and Python. High-level, simple map-reduce operations.
  • Hadoop: Java (imperative programming style). No shell. Complex map-reduce operations.

Iterative Workflow
  • Spark: Great at iterative workloads (machine learning, etc.).
  • Hadoop: Not ideal for iterative work.

Tools
  • Spark: Well-integrated tools (Spark SQL, Spark Streaming, MLlib, and GraphX) for developing complex analytical applications.
  • Hadoop: A loosely coupled but mature set of tools.

Deployment
  • Spark: Hadoop YARN, Mesos, or Amazon EC2.
  • Hadoop: Usually uses Oozie or Azkaban to create workflows.

Data Sources
  • Spark: HDFS, HBase, Cassandra, MongoDB, Amazon S3, RDBMS, files, sockets, Twitter.
  • Hadoop: RDBMS (using Sqoop); streaming data (using Flume).

Applications
  • Spark: An 'application' is the higher-level unit; it runs multiple jobs in sequence or in parallel. Application processes, called executors, run on the cluster's workers.
  • Hadoop: A 'job' is the higher-level unit; it processes data with map and reduce phases and writes the output to storage.

Executors
  • Spark: An executor can run multiple tasks in a single process.
  • Hadoop: Each map or reduce task runs in its own process.

Shared Variables (see the sketch after this table)
  • Spark: Broadcast variables are read-only (lookup) variables shipped to each worker only once. Accumulators are fault-tolerant variables to which workers add values and from which the driver reads the result.
  • Hadoop: Counters, including built-in system metric counters such as 'Map input records'.

Persisting/Caching RDDs
  • Spark: Cached RDDs can be reused across operations, which increases processing speed.
  • Hadoop: None.

Lazy Evaluation
  • Spark: Transformations are bundled into an execution plan and executed only when an RDD action is called.
  • Hadoop: None.

Memory Management and Compression
  • Spark: Memory is conserved because of the compact format; speed is improved by custom code generation.
  • Hadoop: Custom compression is possible using Avro or Kryo; no memory management.

Optimizer and Query Planning
  • Spark: The optimizer is a rule executor for logical plans and applies a collection of logical-plan optimizations. Encoders are generated via runtime code generation, and the generated code can operate directly on the Tungsten compact format. Queries are optimized through logical and physical plans (inspired by RDBMS query planning and optimization).
  • Hadoop: None.
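
A minimal, runnable sketch of Spark's two shared-variable types, assuming the Spark 2.x Scala API (the object name and lookup data are illustrative):

    import org.apache.spark.sql.SparkSession

    object SharedVariablesDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shared-variables-demo")
          .master("local[*]")              // local mode, for illustration only
          .getOrCreate()
        val sc = spark.sparkContext

        // Broadcast variable: a read-only lookup table shipped to each worker once.
        val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

        // Accumulator: workers add to it; only the driver reads the final value.
        val unknownCodes = sc.longAccumulator("unknown country codes")

        val codes = sc.parallelize(Seq("US", "IN", "XX"))

        // Accumulator updates inside an action (foreach) are applied exactly once.
        codes.foreach { code =>
          if (!countryNames.value.contains(code)) unknownCodes.add(1)
        }
        println(s"Unknown codes seen: ${unknownCodes.value}")   // 1

        spark.stop()
      }
    }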


    2. What are the differences between functional and imperative languages, and why is functional programming important?

    The following features of Scala make it uniquely suitable for Spark (several of them are illustrated in the sketch below).
    Immutability - Immutable means you can't change your variables; you mark them as final in Java, or use the val keyword in Scala.
    Higher-order functions - Functions that take other functions as parameters, or whose result is a function. Here is a function apply which takes another function f and a value v and applies f to v: def apply(f: Int => String, v: Int) = f(v)
    Lazy loading - A lazy val is evaluated only when it is accessed for the first time; until then it is not executed at all.
    Pattern matching - Scala has a built-in general pattern-matching mechanism that allows matching on any sort of data with a first-match policy.
    Currying - A chain of one-parameter functions is called a curried function. For example, turning a size-constraint function into a function object that can be assigned or passed around gives a signature like: val sizeConstraintFn: IntPairPred => Int => Email => Boolean = sizeConstraint _
    Partial application - When applying a function, you pass in arguments for only some of the parameters defined by the function, leaving the remaining ones blank. What you get back is a new function whose parameter list contains only those parameters from the original function that were left blank.
    Monads - Most Scala collections are monadic; operating on them using map and flatMap operations, or using for-comprehensions, is referred to as monadic style.
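
    A minimal, self-contained Scala sketch of these features (all names are illustrative):

    object FunctionalFeatures extends App {
      // Immutability: a val cannot be reassigned.
      val xs = List(1, 2, 3)

      // Higher-order function: apply takes a function f and applies it to v.
      def apply(f: Int => String, v: Int): String = f(v)
      println(apply(n => s"number $n", 42))

      // Lazy loading: evaluated only on first access.
      lazy val expensive = { println("computing..."); xs.sum }
      println(expensive)        // "computing..." prints here, not at definition

      // Pattern matching with a first-match policy.
      def describe(x: Any): String = x match {
        case 0         => "zero"
        case n: Int    => s"int $n"
        case s: String => s"string $s"
        case _         => "something else"
      }
      println(describe(7))

      // Currying: a chain of one-parameter functions, partially applied.
      def add(a: Int)(b: Int): Int = a + b
      val addTwo: Int => Int = add(2) _   // partial application of the first argument
      println(addTwo(3))                  // 5

      // Monadic style: map/flatMap via a for-comprehension.
      val pairs = for { a <- xs; b <- xs if a < b } yield (a, b)
      println(pairs)
    }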
    Programming approach differences:

    Programmer focus
      • Imperative: How to perform tasks (algorithms) and how to track changes in state.
      • Functional: What information is desired and what transformations are required.

    State changes
      • Imperative: Important.
      • Functional: Non-existent.

    Order of execution
      • Imperative: Important.
      • Functional: Low importance.

    Primary flow control
      • Imperative: Loops, conditionals, and function (method) calls.
      • Functional: Function calls, including recursion.

    Primary manipulation unit
      • Imperative: Instances of structures or classes.
      • Functional: Functions as first-class objects and data collections.
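
    A small Scala contrast to make the table concrete, computing the sum of squares both ways (the object name is illustrative):

    object ApproachContrast extends App {
      val numbers = List(1, 2, 3, 4)

      // Imperative approach: mutable state, explicit loop, order of execution matters.
      var total = 0
      for (n <- numbers) {
        total += n * n
      }

      // Functional approach: describe the desired transformation; no mutable state.
      val totalFn = numbers.map(n => n * n).sum

      assert(total == totalFn)  // both yield 30
      println(totalFn)
    }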

    3. What is a resilient distributed dataset (RDD)? Explain with diagrams.

    A resilient distributed dataset (RDD) is a read-only, fault-tolerant collection of objects, partitioned across a cluster of machines, that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system such as a shared filesystem, HDFS, HBase, S3, Cassandra, or an RDBMS. Both are sketched after the list below.
    RDDs are the basic abstraction in Apache Spark: they represent the data coming into the system in object format and support in-memory computation on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:
    • Immutable – RDDs cannot be altered once created.
    • Resilient – if a node holding a partition fails, another node can recompute it from lineage.
    • Lazily evaluated
    • Cacheable
    • Type inferred
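
    A minimal sketch of both ways to create an RDD (the file path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-creation-demo").setMaster("local[*]"))

        // 1. Parallelize an existing collection in the driver program.
        val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // 2. Reference a dataset in external storage (path is illustrative;
        //    definition is lazy, so no job runs until an action is called).
        val fromFile = sc.textFile("hdfs:///data/input.txt")

        println(fromCollection.count())   // the action triggers evaluation: 5
        sc.stop()
      }
    }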
    4. Explain transformations and actions (in the context of RDDs).

    Transformations are lazy functions that produce a new RDD; they are executed on demand, only when an action is called. Some examples of transformations are map, filter, and reduceByKey.
    reduceByKey merges the values for each key using an associative and commutative reduce function. It also performs the merging locally on each mapper before sending results to a reducer, similar to a "combiner" in MapReduce.
    Actions trigger the computation described by the chain of transformations and return the result from the RDD to the driver program. Some examples of actions are reduce, collect, first, and take.
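
    A minimal sketch of transformations followed by actions (the object name and data are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object TransformationsAndActions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("transformations-demo").setMaster("local[*]"))

        val words = sc.parallelize(Seq("spark", "hadoop", "spark", "scala"))

        // Transformations: lazy, each produces a new RDD; nothing runs yet.
        val counts = words
          .map(w => (w, 1))                 // pair each word with a count of 1
          .filter { case (w, _) => w.nonEmpty }
          .reduceByKey(_ + _)               // merge values per key; combines locally
                                            // on each partition first, like a combiner

        // Actions: trigger the whole lineage and return results to the driver.
        counts.collect().foreach(println)   // e.g. (spark,2), (hadoop,1), (scala,1)
        println(counts.first())

        sc.stop()
      }
    }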
    5. What are the Spark use cases?

    • Data integration and ETL
    • Interactive analytics or business intelligence
    • High performance batch computation
    • Machine learning and advanced analytics
    • Real-time stream processing
    Many teams run data integration and ETL on MapReduce, along with batch computation, machine learning, and batch analytics, and these workloads run much faster on Spark. Spark additionally enables interactive analytics and BI, as well as real-time stream processing.


    Need all the answers? See my Kindle book.
