Wednesday, February 1, 2017

Spark Interview Questions for Professionals


Contents


  1. What is the difference between Spark and Hadoop?

Inspiration
  • Spark: Inspired by Hadoop MapReduce and the Scala programming language; developed by UC Berkeley's AMPLab in 2009. Uses generalized computation instead of MapReduce and offers real-time processing capability.
  • Hadoop: Based on Google's 2004 papers outlining MapReduce. Batch processing.

Speed
  • Spark: Up to 100x faster in memory and 10x faster on disk.
  • Hadoop: Heavy, disk-read-I/O intensive.

Ease of Use
  • Spark: Easily write applications in Java, Scala, Python, or R (functional programming style). Interactive shells are available for Scala and Python. High-level, simple map-reduce operations.
  • Hadoop: Java (imperative programming style). No shell. Complex map-reduce operations.

Iterative Workflow
  • Spark: Great at iterative workloads (machine learning, etc.).
  • Hadoop: Not ideal for iterative work.

Tools
  • Spark: Well-integrated tools (Spark SQL, Spark Streaming, MLlib, and GraphX) for developing complex analytical applications.
  • Hadoop: A loosely coupled but mature set of tools.

Deployment
  • Spark: Hadoop YARN, Mesos, or Amazon EC2.
  • Hadoop: Usually uses Oozie or Azkaban to create workflows.

Data Sources
  • Spark: HDFS, HBase, Cassandra, MongoDB, Amazon S3, RDBMS, files, sockets, Twitter.
  • Hadoop: RDBMS (using Sqoop); streaming data (using Flume).

Applications
  • Spark: An 'application' is the higher-level unit; it runs multiple jobs in sequence or in parallel. Application processes, called executors, run on the cluster's workers.
  • Hadoop: A 'job' is the higher-level unit; it processes data with map and reduce phases and writes the output to storage.

Executors
  • Spark: An executor can run multiple tasks in a single process.
  • Hadoop: Each map or reduce task runs in its own process.

Shared Variables (see the sketch after this table)
  • Spark: Broadcast variables are read-only (lookup) variables shipped to each worker only once. Accumulators are fault-tolerant variables to which workers add values and from which the driver reads the result.
  • Hadoop: Counters, including built-in system metric counters such as 'Map input records'.

Persisting/Caching RDDs
  • Spark: Cached RDDs can be reused across operations, which increases processing speed.
  • Hadoop: None.

Lazy Evaluation
  • Spark: Transformations are bundled into an execution plan and executed only when an RDD action is called.
  • Hadoop: None.

Memory Management and Compression
  • Spark: Memory is conserved because of the compact format; speed is improved by custom code generation.
  • Hadoop: Custom compression is possible using Avro or Kryo; no memory management.

Optimizer and Query Planning
  • Spark: The optimizer is a rule executor for logical plans and applies a collection of logical-plan optimizations. Encoders are generated via runtime code generation, and the generated code can operate directly on the Tungsten compact format. Queries are optimized through logical and physical plans (inspired by RDBMS query planning and optimization).
  • Hadoop: None.
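
A minimal, runnable sketch of Spark's two shared-variable types, assuming the Spark 2.x Scala API (the object name and lookup data are illustrative):

    import org.apache.spark.sql.SparkSession

    object SharedVariablesDemo {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("shared-variables-demo")
          .master("local[*]")              // local mode, for illustration only
          .getOrCreate()
        val sc = spark.sparkContext

        // Broadcast variable: a read-only lookup table shipped to each worker once.
        val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

        // Accumulator: workers add to it; only the driver reads the final value.
        val unknownCodes = sc.longAccumulator("unknown country codes")

        val codes = sc.parallelize(Seq("US", "IN", "XX"))

        // Accumulator updates inside an action (foreach) are applied exactly once.
        codes.foreach { code =>
          if (!countryNames.value.contains(code)) unknownCodes.add(1)
        }
        println(s"Unknown codes seen: ${unknownCodes.value}")   // 1

        spark.stop()
      }
    }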


    2. What are the differences between functional and imperative languages, and why is functional programming important?

    The following features of Scala make it uniquely suitable for Spark (several of them are illustrated in the sketch below).
    Immutability - Immutable means you can't change your variables; you mark them as final in Java, or use the val keyword in Scala.
    Higher-order functions - Functions that take other functions as parameters, or whose result is a function. Here is a function apply which takes another function f and a value v and applies f to v: def apply(f: Int => String, v: Int) = f(v)
    Lazy loading - A lazy val is evaluated only when it is accessed for the first time; until then it is not executed at all.
    Pattern matching - Scala has a built-in general pattern-matching mechanism that allows matching on any sort of data with a first-match policy.
    Currying - A chain of one-parameter functions is called a curried function. For example, turning a size-constraint function into a function object that can be assigned or passed around gives a signature like: val sizeConstraintFn: IntPairPred => Int => Email => Boolean = sizeConstraint _
    Partial application - When applying a function, you pass in arguments for only some of the parameters defined by the function, leaving the remaining ones blank. What you get back is a new function whose parameter list contains only those parameters from the original function that were left blank.
    Monads - Most Scala collections are monadic; operating on them using map and flatMap operations, or using for-comprehensions, is referred to as monadic style.
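
    A minimal, self-contained Scala sketch of these features (all names are illustrative):

    object FunctionalFeatures extends App {
      // Immutability: a val cannot be reassigned.
      val xs = List(1, 2, 3)

      // Higher-order function: apply takes a function f and applies it to v.
      def apply(f: Int => String, v: Int): String = f(v)
      println(apply(n => s"number $n", 42))

      // Lazy loading: evaluated only on first access.
      lazy val expensive = { println("computing..."); xs.sum }
      println(expensive)        // "computing..." prints here, not at definition

      // Pattern matching with a first-match policy.
      def describe(x: Any): String = x match {
        case 0         => "zero"
        case n: Int    => s"int $n"
        case s: String => s"string $s"
        case _         => "something else"
      }
      println(describe(7))

      // Currying: a chain of one-parameter functions, partially applied.
      def add(a: Int)(b: Int): Int = a + b
      val addTwo: Int => Int = add(2) _   // partial application of the first argument
      println(addTwo(3))                  // 5

      // Monadic style: map/flatMap via a for-comprehension.
      val pairs = for { a <- xs; b <- xs if a < b } yield (a, b)
      println(pairs)
    }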
    Programming approach differences:

    Programmer focus
      • Imperative: How to perform tasks (algorithms) and how to track changes in state.
      • Functional: What information is desired and what transformations are required.

    State changes
      • Imperative: Important.
      • Functional: Non-existent.

    Order of execution
      • Imperative: Important.
      • Functional: Low importance.

    Primary flow control
      • Imperative: Loops, conditionals, and function (method) calls.
      • Functional: Function calls, including recursion.

    Primary manipulation unit
      • Imperative: Instances of structures or classes.
      • Functional: Functions as first-class objects and data collections.
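
    A small Scala contrast to make the table concrete, computing the sum of squares both ways (the object name is illustrative):

    object ApproachContrast extends App {
      val numbers = List(1, 2, 3, 4)

      // Imperative approach: mutable state, explicit loop, order of execution matters.
      var total = 0
      for (n <- numbers) {
        total += n * n
      }

      // Functional approach: describe the desired transformation; no mutable state.
      val totalFn = numbers.map(n => n * n).sum

      assert(total == totalFn)  // both yield 30
      println(totalFn)
    }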

    3. What is a resilient distributed dataset (RDD)? Explain with diagrams.

    A resilient distributed dataset (RDD) is a read-only, fault-tolerant collection of objects, partitioned across a cluster of machines, that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system such as a shared filesystem, HDFS, HBase, S3, Cassandra, or an RDBMS. Both are sketched after the list below.
    RDDs are the basic abstraction in Apache Spark: they represent the data coming into the system in object format and support in-memory computation on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are:
    • Immutable – RDDs cannot be altered once created.
    • Resilient – if a node holding a partition fails, another node can recompute it from lineage.
    • Lazily evaluated
    • Cacheable
    • Type inferred
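
    A minimal sketch of both ways to create an RDD (the file path is hypothetical):

    import org.apache.spark.{SparkConf, SparkContext}

    object RddCreationDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("rdd-creation-demo").setMaster("local[*]"))

        // 1. Parallelize an existing collection in the driver program.
        val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

        // 2. Reference a dataset in external storage (path is illustrative;
        //    definition is lazy, so no job runs until an action is called).
        val fromFile = sc.textFile("hdfs:///data/input.txt")

        println(fromCollection.count())   // the action triggers evaluation: 5
        sc.stop()
      }
    }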
    4. Explain transformations and actions (in the context of RDDs).

    Transformations are lazy functions that produce a new RDD; they are executed on demand, only when an action is called. Some examples of transformations are map, filter, and reduceByKey.
    reduceByKey merges the values for each key using an associative and commutative reduce function. It also performs the merging locally on each mapper before sending results to a reducer, similar to a "combiner" in MapReduce.
    Actions trigger the computation described by the chain of transformations and return the result from the RDD to the driver program. Some examples of actions are reduce, collect, first, and take.
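
    A minimal sketch of transformations followed by actions (the object name and data are illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    object TransformationsAndActions {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("transformations-demo").setMaster("local[*]"))

        val words = sc.parallelize(Seq("spark", "hadoop", "spark", "scala"))

        // Transformations: lazy, each produces a new RDD; nothing runs yet.
        val counts = words
          .map(w => (w, 1))                 // pair each word with a count of 1
          .filter { case (w, _) => w.nonEmpty }
          .reduceByKey(_ + _)               // merge values per key; combines locally
                                            // on each partition first, like a combiner

        // Actions: trigger the whole lineage and return results to the driver.
        counts.collect().foreach(println)   // e.g. (spark,2), (hadoop,1), (scala,1)
        println(counts.first())

        sc.stop()
      }
    }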
    5. What are the Spark use cases?

    • Data integration and ETL
    • Interactive analytics or business intelligence
    • High performance batch computation
    • Machine learning and advanced analytics
    • Real-time stream processing
    Many teams run data integration and ETL on MapReduce, along with batch computation, machine learning, and batch analytics, and these workloads run much faster on Spark. Spark additionally enables interactive analytics and BI, as well as real-time stream processing.


    Need all the answers? See my Kindle book.
