What is the difference between Spark and Hadoop?
| Features | Spark | Hadoop |
| --- | --- | --- |
| Inspiration | | Google's 2004 papers outlining MapReduce; batch processing |
| Speed | | Heavy disk-read, I/O intensive |
| Ease of use | | Java, imperative programming style; no shell; complex MapReduce operations |
| Iterative workflow | | Not ideal for iterative work |
| Tools | | Loosely coupled but mature, large set of tools |
| Deployment | | Usually uses Oozie or Azkaban to create workflows |
| Data sources | | RDBMS (using Sqoop), streaming using Flume |
| Applications | | A Hadoop "job" is the higher-level unit: it processes data with MapReduce and writes the results to storage |
| Executors | | Each MapReduce task runs in its own process |
| Shared variables | | Hadoop counters include additional system metric counters such as "Map input records" |
| Persisting/caching RDDs | | None |
| Lazy evaluation | | None |
| Memory management and compression | | Custom compression is possible using Avro or Kryo; no memory management |
| Optimizer and query planning | | None |
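To make the ease-of-use row concrete, here is the classic word count against Spark's RDD API in Scala. This is a minimal sketch, not from the original post: it assumes an existing SparkContext `sc`, and the HDFS paths are placeholders.

```scala
// Word count on Spark's RDD API. `sc` is an assumed, pre-built SparkContext;
// both HDFS paths are placeholders for your own input and output locations.
val counts = sc.textFile("hdfs:///input/docs.txt")  // read lines (lazily)
  .flatMap(line => line.split("\\s+"))              // split each line into words
  .map(word => (word, 1))                           // pair every word with a count of 1
  .reduceByKey(_ + _)                               // sum the counts per word
counts.saveAsTextFile("hdfs:///output/wordcounts")  // action: triggers the whole job
```

The equivalent Hadoop MapReduce program needs separate mapper, reducer, and driver classes written in Java.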
What are the differences between functional and imperative languages, and why is functional programming important?
The following features of Scala make it uniquely suitable for Spark (a combined sketch of all of them follows the list).
- Immutability - Immutable means that you can't change your variables once they are bound; you mark them as final in Java, or use the val keyword in Scala.
- Higher-order functions - These are functions that take other functions as parameters, or whose result is a function. Here is a function apply which takes another function f and a value v and applies f to v: def apply(f: Int => String, v: Int) = f(v)
- Lazy loading - A lazy val is evaluated only the first time it is accessed; if it is never accessed, it is never evaluated.
- Pattern matching - Scala has a built-in general pattern-matching mechanism that allows matching on any sort of data with a first-match policy.
- Currying - A chain of one-parameter functions is called a curried function. For example, turning a multi-parameter method sizeConstraint into a function object that can be assigned or passed around gives it this signature: val sizeConstraintFn: IntPairPred => Int => Email => Boolean = sizeConstraint _
- Partial application - When applying a function, you pass in arguments for only some of the parameters it defines, leaving the remaining ones blank. What you get back is a new function whose parameter list contains only the parameters that were left blank.
- Monads - Most Scala collections are monadic; operating on them using map and flatMap operations, or using for-comprehensions, is referred to as monadic style.
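Here is a minimal, self-contained sketch of these features in one place. The object name, the simplified sizeConstraint signature, and the greet helper are illustrative assumptions, not from the original post:

```scala
object ScalaFeaturesDemo extends App {
  // Immutability: a `val` binding cannot be reassigned.
  val fixed = 42
  // fixed = 43  // would not compile: reassignment to val

  // Higher-order function: takes a function `f` and a value `v`, applies f to v.
  def apply(f: Int => String, v: Int): String = f(v)
  println(apply(n => s"number $n", 7)) // number 7

  // Lazy loading: the right-hand side runs only on first access.
  lazy val expensive = { println("computing..."); 99 }
  println(expensive) // prints "computing..." then 99; later reads skip the work

  // Pattern matching: cases are tried in order, first match wins.
  def describe(x: Any): String = x match {
    case 0         => "zero"
    case n: Int    => s"the int $n"
    case s: String => s"the string $s"
    case _         => "something else"
  }
  println(describe(5)) // the int 5

  // Currying: a chain of one-parameter functions (a simplified sizeConstraint).
  def sizeConstraint(pred: (Int, Int) => Boolean)(max: Int)(size: Int): Boolean =
    pred(size, max)
  val underLimit: Int => Boolean = sizeConstraint(_ <= _)(100)
  println(underLimit(42)) // true

  // Partial application: fix some arguments, leave the rest blank.
  def greet(greeting: String, name: String): String = s"$greeting, $name!"
  val hello: String => String = greet("Hello", _)
  println(hello("Spark")) // Hello, Spark!

  // Monadic style: the for-comprehension desugars to flatMap and map.
  val pairs = for {
    x <- List(1, 2)
    y <- List("a", "b")
  } yield (x, y)
  println(pairs) // List((1,a), (1,b), (2,a), (2,b))
}
```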
Programming approach differences (a small Scala sketch contrasting the two styles follows the table):
| Characteristic | Imperative approach | Functional approach |
| --- | --- | --- |
| Programmer focus | How to perform tasks (algorithms) and how to track changes in state. | What information is desired and what transformations are required. |
| State changes | Important. | Non-existent. |
| Order of execution | Important. | Low importance. |
| Primary flow control | Loops, conditionals, and function (method) calls. | Function calls, including recursion. |
| Primary manipulation unit | Instances of structures or classes. | Functions as first-class objects and data collections. |
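As an illustration (not from the original post), here is the same task, summing the squares of the even numbers, written in both styles:

```scala
// The same task in both styles: sum the squares of the even numbers.
val nums = List(1, 2, 3, 4, 5)

// Imperative approach: mutable state, an explicit loop, and a focus on
// *how* the result is computed and how state changes along the way.
var total = 0
for (n <- nums) {
  if (n % 2 == 0) total += n * n
}

// Functional approach: declare *what* transformations are required;
// no mutable state, and the intermediate order of evaluation is implicit.
val totalFunctional = nums.filter(_ % 2 == 0).map(n => n * n).sum

assert(total == totalFunctional) // both yield 20
```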
What is a resilient distributed dataset (RDD)? Explain with diagrams.
A resilient distributed dataset (RDD) is a read-only, fault-tolerant collection of objects partitioned across a cluster of computers that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system such as a shared filesystem, HDFS, HBase, S3, Cassandra, or an RDBMS. Both are shown in the sketch below.
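A minimal sketch of both creation paths; the app name, master URL, and HDFS path are placeholder assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder configuration: the app name and local master URL are illustrative.
val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

// 1) Parallelize an existing collection in the driver program.
val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

// 2) Reference a dataset in an external storage system (path is a placeholder).
val fromStorage = sc.textFile("hdfs://namenode:8020/data/events.txt")
```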
RDDs (resilient distributed datasets) are the basic abstraction in Apache Spark; they represent the data coming into the system in object format and are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:
- Immutable – RDDs cannot be altered.
- Resilient – if a node holding a partition fails, another node takes over the data.
- Lazily evaluated – nothing is computed until an action requires it (see the sketch after this list).
- Cacheable – RDDs can be persisted in memory or on disk.
- Type inferred – the element type is inferred by the Scala compiler.
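A short sketch of the lazy-evaluation and caching properties, reusing the hypothetical fromStorage RDD from the previous example:

```scala
// Transformations are lazy: defining `lengths` runs no job at all.
val lengths = fromStorage.map(_.length)

lengths.cache()            // marks the RDD for in-memory caching; still no job runs

val total = lengths.sum()  // first action: computes the lineage and populates the cache
val max   = lengths.max()  // second action: served from the cached partitions
```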
Explain transformations and actions (in the context of RDDs)
Transformations are functions executed on demand to produce a new RDD; they are lazy, running only when a subsequent action needs their result. Some examples of transformations include map, filter, and reduceByKey.
reduceByKey merges the values for each key using an associative and commutative reduce function. It also performs the merging locally on each mapper before sending results to a reducer, similar to a "combiner" in MapReduce.
Actions trigger the computations described by an RDD's transformations and return results; after an action is performed, the data moves from the RDD back to the local machine. Some examples of actions include reduce, collect, first, and take (see the sketch below).
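A small sketch of the distinction, assuming the SparkContext sc from earlier; the sample data is illustrative:

```scala
val words = sc.parallelize(Seq("spark", "hadoop", "spark", "rdd"))

// Transformations: each line builds a new RDD; nothing executes yet.
val counts  = words.map(w => (w, 1)).reduceByKey(_ + _)
val popular = counts.filter { case (_, n) => n > 1 }

// Actions: trigger the computation and bring results back to the driver.
println(popular.collect().mkString(", ")) // (spark,2)
println(counts.count())                   // 3 distinct words
println(words.first())                    // spark
println(words.take(2).mkString(", "))     // spark, hadoop
```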
What are the Spark use cases?
- Data integration and ETL
- Interactive analytics or business intelligence
- High-performance batch computation
- Machine learning and advanced analytics
- Real-time stream processing
Need all the answers? See my Kindle book.