We launched the Hadoop Developer In Real World course in November 2015 and got an excellent response from Hadoop In Real World members right away. Within a few months we received several requests to create a Hadoop Administration course. We looked at other courses in the market to get a feel for what was already out there. Hadoop administration is more than just installing and restarting services, yet to our surprise, the other courses in the market covered just installation, restarting services and a couple of random topics. It was shocking!
We have created a video version of this post. Click here if you prefer watching the video.
So the very first thing we did was come up with a list: a list of things that every Hadoop administrator must know, which is essentially also the list of things employers look for in a Hadoop administrator. We decided to cover every single item on that list in our Hadoop Administration course. Our students also told us that they would like the course to help them clear certifications like Cloudera Certified Administrator for Apache Hadoop, or CCAH for short. So we made the course CCAH ready, meaning we cover all the topics that are in scope for CCAH in our Hadoop Administrator In Real World course. In this post we will explain what is covered in the Hadoop Administrator In Real World course.
In the very first chapter, we will explain the structure of the course and give you instructions on how to connect to our cluster. When you enroll in the Hadoop Administrator In Real World course, you will get access to our 3 node Hadoop cluster hosted on Amazon Web Services. This cluster is shared by the students of Hadoop In Real World. With cluster access you can execute MapReduce programs, access HDFS, run Pig and Hive scripts, and more. This cluster will help you get your feet wet right away.
We have designed the course keeping in mind that it should be easy for someone who is new to the Big Data and Hadoop world, so the next three chapters will cover all the needed basics.
In the next chapter, we will introduce you to Big Data. At some point in your Hadoop administration career you will be given a set of problems and asked to design a solution using Hadoop. How can you be certain whether the problems in front of you can be solved by Hadoop? As an administrator, given a set of scenarios, you should be able to ascertain whether the problem at hand is a big data problem or not. This chapter will help you with exactly that.
In the HDFS chapter, you will learn what the Hadoop Distributed File System, or HDFS for short, is and, more importantly, why you need HDFS when there are many file systems already available in the market. This chapter will teach you how to navigate and work with HDFS. You will also learn how reads and writes work behind the scenes in HDFS.
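To give you a taste of working with HDFS, here are a few of the standard `hdfs dfs` commands you will use constantly as an administrator (the paths here are just illustrative):

```shell
# List the contents of a directory in HDFS
hdfs dfs -ls /user/hirw

# Copy a local file into HDFS, then read it back
hdfs dfs -put sales.csv /user/hirw/input/
hdfs dfs -cat /user/hirw/input/sales.csv

# See how the blocks of a file are laid out across datanodes
hdfs fsck /user/hirw/input/sales.csv -files -blocks -locations
```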
As a Hadoop administrator, you will not be asked to write MapReduce programs; that is the job of Hadoop developers. But a good Hadoop administrator must know how MapReduce works and have a good understanding of its different phases. Knowing how MapReduce works will help you troubleshoot performance issues and debug problems. The MapReduce chapter will walk you through all the details behind MapReduce.
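To make the phases concrete, here is a tiny word count sketched in plain Python that mimics the map, shuffle and reduce phases. This is an illustration of the data flow only, not actual Hadoop API code:

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle & sort: group all values by key, as the framework
    # does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: sum the grouped counts for each word
    return {word: sum(counts) for word, counts in groups}

lines = ["hadoop is fun", "hadoop is big"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)  # {'big': 1, 'fun': 1, 'hadoop': 2, 'is': 2}
```

When a job misbehaves, knowing which of these three stages the symptom belongs to (slow mappers, a heavy shuffle, skewed reducers) is usually the first step in diagnosing it.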
In the Architecture chapter, we cover both the MapReduce version 1 architecture and the MapReduce version 2 architecture, simply referred to as YARN. We cover both versions because if you end up working for a client who is trying to migrate from MapReduce version 1 to version 2, you will not be lost. Hadoop has single points of failure, and in this chapter we will discuss the solutions Hadoop has in place to protect against them. We will talk about the secondary namenode and High Availability in this chapter.
Cluster planning is the most critical element in Hadoop administration, because the decisions made during cluster planning have a long term impact and cost implications too. We were shocked to see that other courses out there do not cover cluster planning at all. An administrator who does not know how to plan a cluster cannot call themselves a good administrator. Due to the significance of cluster planning, the CCAH exam includes a lot of questions on it. In the Cluster Planning chapter, we will talk about software and hardware requirements, the choice between JBOD and RAID, sizing the cluster and network topology. In short, everything you need to plan a Hadoop cluster.
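As a flavor of the sizing discussion, here is a back-of-envelope capacity estimate. The replication factor of 3 is the HDFS default, but the overhead percentage and per-node disk figure below are assumptions we picked for illustration, not fixed rules:

```python
import math

# Back-of-envelope HDFS capacity estimate (a sketch; the overhead
# and per-node disk numbers are assumptions, not fixed rules).
data_tb = 100          # expected raw data, in TB
replication = 3        # HDFS default replication factor
temp_overhead = 0.25   # headroom for intermediate/temp data (assumed)
node_disk_tb = 24      # usable disk per datanode (assumed)

# Total storage needed: replicated data plus temp headroom
needed_tb = data_tb * replication * (1 + temp_overhead)

# Number of datanodes, rounded up
nodes = math.ceil(needed_tb / node_disk_tb)

print(needed_tb, nodes)  # 375.0 16
```

The real chapter goes much deeper, but even this simple arithmetic shows why replication alone triples your raw storage requirement.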
The next chapter is Cluster Setup. We want our students to learn how Hadoop is used in the real world, and that is why any time we teach how to install and configure a Hadoop service or set up a cluster, we use Amazon Web Services, because AWS is widely used in a lot of production deployments. In this chapter we will install a Hadoop cluster on AWS EC2 instances. We will also look at another widely used Amazon service named EMR. As you go through the course, we want our students to practise what they learn: installing, configuring and uninstalling services. You can practise on AWS, but there is a charge associated with it. Instead of using AWS, you can configure Virtual Machines to simulate a multi node setup. We will teach you how to set up VMs in this chapter as well.
The next chapter covers all the essentials required for your day to day tasks as a Hadoop administrator. Let's say today is your first day on your brand new job as a Hadoop administrator; how do you get to know the details of your Hadoop cluster? You can ask around, but that is no fun. In this chapter you will learn how to explore your Hadoop cluster, add nodes, protect against accidental data loss, set quotas, configure network topology, examine logs and more. You will learn everything you need to perform as efficiently as possible as a Hadoop administrator.
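Quotas are a good example of these day to day tasks. HDFS supports both name quotas (how many files and directories) and space quotas (how many bytes, counting replication), managed with `hdfs dfsadmin` (the directory below is illustrative):

```shell
# Cap the number of names (files + directories) under a directory
hdfs dfsadmin -setQuota 10000 /user/projectA

# Cap the raw space (including replication) the directory may consume
hdfs dfsadmin -setSpaceQuota 1t /user/projectA

# Inspect the quotas currently in effect
hdfs dfs -count -q /user/projectA

# Remove the quotas again
hdfs dfsadmin -clrQuota /user/projectA
hdfs dfsadmin -clrSpaceQuota /user/projectA
```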
When Hadoop first came out, there was no strong concept of authentication in it, and quite honestly that was one of Hadoop's weak points at the time. Later, Kerberos authentication was introduced and it gained wide acceptance right away. Protecting a Hadoop cluster against improper access is an important task of a Hadoop administrator. Don't you agree? This chapter will teach you how to install, configure and enable Kerberos authentication in your Hadoop cluster.
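At the heart of enabling Kerberos is switching Hadoop's authentication mode from "simple" in core-site.xml. This is only a minimal fragment; a real setup also needs principals and keytabs configured for every service, which the chapter walks through:

```xml
<!-- core-site.xml: switch authentication from "simple" to Kerberos -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>
```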
The namenode is the master node in a Hadoop cluster and it is also a Single Point Of Failure (SPOF). Hadoop was widely criticized for having a critical element like the namenode be a single point of failure. Hadoop committers rose to the occasion and introduced High Availability, which offers a solution to the namenode's single point of failure problem. In this chapter, we will configure Hadoop with High Availability using Quorum Journal Managers.
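To give you a sense of what that configuration looks like, here is a minimal hdfs-site.xml sketch for an HA pair backed by Quorum Journal Managers. The nameservice and host names are illustrative, and a complete setup needs more properties (RPC and HTTP addresses per namenode, fencing, automatic failover) which the chapter covers:

```xml
<!-- hdfs-site.xml: a minimal HA sketch with Quorum Journal Managers -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```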
The next chapter, on resource management, is very important. A Hadoop cluster is a significant investment for any company, so it is very common for multiple teams to share a single production cluster. This means that as an administrator you need to answer the question: how can you share a Hadoop cluster between teams and between users? This chapter has the answer. In it we will talk about three different types of schedulers: FIFO, Capacity and Fair. We will discuss the differences between the three, then configure and experiment with all of them.
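As a preview of the Capacity Scheduler, sharing a cluster between teams boils down to defining queues and assigning each a slice of the cluster in capacity-scheduler.xml. The queue names and percentages below are illustrative:

```xml
<!-- capacity-scheduler.xml: splitting the cluster between two teams -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>engineering,analytics</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.engineering.capacity</name>
  <value>60</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.analytics.capacity</name>
  <value>40</value>
</property>
```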
So far, whenever we install or configure a service in a lesson, we do it manually, step by step, and cover every single configuration detail. We do not use cluster management software like Cloudera Manager to install and configure services at the start. Why? To be an efficient Hadoop administrator you need to know how things work behind the scenes in detail, and tools like Cloudera Manager and Ambari hide all those details. That sounds like a good thing, but when it is time to troubleshoot an issue or tweak a property, administrators who are not familiar with the configuration details are usually lost. That is why we cover the hard things, manual step by step configuration, first and the easy things, like Cloudera Manager, last. In this chapter you will learn how to install services and work with Cloudera Manager. In addition, you will also learn how to use Cloudera Manager for monitoring and troubleshooting.
In the next chapter we will introduce you to Apache Pig, Hive, Sqoop and Flume. As a Hadoop administrator you will not be required to implement solutions using Pig, Hive, Sqoop or Flume, but you are definitely expected to understand what they do and how to install and configure them. That is what you will learn in this chapter.
The Hadoop ecosystem is very dynamic and changes almost every day, which means there are several popular tools in the ecosystem and new ones are coming in all the time. We plan to keep Hadoop Administrator In Real World a living course, meaning we will keep it up to date and add new tools based on demand from our students. In the near future we plan to include Spark and Kafka. We also plan to include the topics covered in the Hortonworks Data Platform Certified Administrator certification, HDPCA for short, very soon.
We hope this post gave you a pretty good idea about the concepts and topics covered in the Hadoop Administrator In Real World course. Thank you so much for reading this post and thank you for being a part of Hadoop In Real World community.