Troubled by confusing concepts such as Executor, Node, RDD, and Task in Spark? Invest just 2 minutes of your time to bring some order to this mess! I'll clear up these Apache Spark concepts for you.

Spark building blocks: executor, tasks, cache, SparkContext, cluster manager

Executor => Multiple Tasks: An executor is a JVM process running on a worker node. Executors receive serialized tasks (your code, shipped in jars), deserialize them, and run them. Executors use a cache so that tasks can run faster.

Node => Multiple Executors: Each node can host multiple executors.

RDD => Big Data Structure: Its main strength is that it can represent data too large to store on a single machine, so the data is partitioned and distributed across computers.

Input => RDD: Every RDD is born out of some input, such as a text file or Hadoop files.

Output => RDD: The output of a function in Spark can itself be an RDD. So it's like one function after another: each receives an input RDD and
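To make the "partitioned data" and "one function after another" ideas concrete, here is a minimal toy model in plain Python. It is not Spark's API: `ToyRDD`, `from_input`, and the sample records are hypothetical names invented for this sketch, and everything runs on one machine. It only imitates the shape of the idea: input becomes a partitioned dataset, and each function takes one dataset and returns a new one.

```python
from typing import Callable, List


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark): data held as a list of partitions."""

    def __init__(self, partitions: List[list]):
        self.partitions = partitions

    @classmethod
    def from_input(cls, records: list, num_partitions: int) -> "ToyRDD":
        # Split the input round-robin, as if each partition lived on a different machine.
        return cls([records[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn: Callable) -> "ToyRDD":
        # Each partition is processed independently and the result is a new ToyRDD.
        return ToyRDD([[fn(x) for x in part] for part in self.partitions])

    def filter(self, pred: Callable) -> "ToyRDD":
        # Keep only matching records, again partition by partition.
        return ToyRDD([[x for x in part if pred(x)] for part in self.partitions])

    def collect(self) -> list:
        # Gather all partitions back into one local list.
        return [x for part in self.partitions for x in part]


# Input => RDD: a small made-up "text file" split into 2 partitions.
lines = ToyRDD.from_input(["a b", "c", "d e f", "g"], num_partitions=2)

# Output => RDD: each step receives a dataset and returns a new one.
word_counts = lines.map(lambda line: len(line.split()))
long_lines = word_counts.filter(lambda n: n > 1)
```

In real Spark the partitions would sit on different nodes and each partition would be handled by a task running inside an executor; the toy version just keeps them as separate Python lists.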