In this post we will discuss what employers expect from Hadoop Administrators. We also have a video version of this post; click here if you prefer to watch it.
Hadoop environments are distributed systems: you could have hundreds or even thousands of nodes working together, along with a lot of tools in the Hadoop ecosystem, which means there is a good probability of things going wrong. In most companies Hadoop is at the center of everything, so you cannot afford issues in your Hadoop production environments. For this reason, most employers look for Hadoop Administrators who have well-rounded knowledge of Hadoop and the tools in its ecosystem, and who also understand how they work behind the scenes.
If you are an aspiring Hadoop Administrator, it helps to know what employers like to see in a candidate when they interview for a Hadoop Administrator position, and what will be expected of you once you land the job. So here is what you need to know about what is expected from Hadoop Administrators.
We often hire Hadoop Administrators for our clients and we know precisely what to look for in a candidate. It does not take too long for us to know whether a candidate is good or bad and whether he or she will be able to handle our Hadoop environments.
Most Hadoop Administration interviews start with installation. Even though most production clusters are managed by cluster management software like Cloudera Manager or Apache Ambari, administrators are expected to be able to install and configure Hadoop clusters and other ecosystem components manually, without the help of any tool.
In an interview, we always ask the candidate to explain the steps to configure a highly available Hadoop cluster without the help of tools like Cloudera Manager or Ambari. This question is important to hiring managers because you can enable high availability with Cloudera Manager in a matter of minutes, with just a few clicks, but that hides all the behind-the-scenes details. When a candidate can configure something like high availability manually and explain the steps, it tells us that he or she has strong experience and knows how things work behind the scenes, which means that when something breaks or needs to be tweaked, he or she knows where to look and what to change.
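To give a sense of what "manually" means here, a rough sketch of enabling NameNode high availability by hand might look like the following. This is a hedged outline, not a recipe: the nameservice, host names and ports are all made up for illustration.

```shell
# Illustrative sketch only -- nameservice and host names are made up.
# 1. In hdfs-site.xml, define the nameservice and its two NameNodes:
#      dfs.nameservices                        = mycluster
#      dfs.ha.namenodes.mycluster              = nn1,nn2
#      dfs.namenode.rpc-address.mycluster.nn1  = nn1.example.com:8020
#      dfs.namenode.rpc-address.mycluster.nn2  = nn2.example.com:8020
#      dfs.namenode.shared.edits.dir = qjournal://jn1:8485;jn2:8485;jn3:8485/mycluster
#      dfs.ha.automatic-failover.enabled       = true
# 2. Start the JournalNodes, then initialize the failover state in ZooKeeper:
hdfs zkfc -formatZK
# 3. On the second NameNode, pull a copy of the active NameNode's metadata:
hdfs namenode -bootstrapStandby
# 4. Start both NameNodes, each with a ZKFC daemon alongside it.
```

A candidate who can walk through steps like these, and explain why each one exists, is exactly who this interview question is designed to find.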
To put it very simply, you won’t give your car to a mechanic who does not understand how things work under the hood. The same applies to Hadoop administration: you cannot expect an employer to offer a Hadoop Administrator job to someone who does not understand how Hadoop works behind the scenes.
When it comes to Hadoop Administration, most courses out there cover only a few architecture concepts, install Hadoop with tools like Cloudera Manager, and explain how to start and stop services. That’s insane. Hadoop Administration is much more than that and much more involved.
To give you a few examples: as a Hadoop Administrator you are expected to know how to install, configure and enable Kerberos security. If a cluster does not have high availability enabled, you need to know how to configure and enable it. You are expected to know what the Fair and Capacity schedulers are, how they differ, and when to choose one over the other. You are also expected to know how to protect against accidental data loss.
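As one concrete illustration of the last point, two common guards against accidental data loss in HDFS are the trash and directory snapshots. The sketch below is illustrative; the paths and the retention value are assumptions, not recommendations.

```shell
# Illustrative sketch -- paths and retention values are hypothetical.
# 1. Enable the HDFS trash in core-site.xml so that 'hadoop fs -rm'
#    moves files into the user's .Trash directory instead of deleting
#    them immediately (the value is the retention period in minutes):
#      fs.trash.interval = 1440
# 2. Snapshots: allow snapshots on a critical directory, then take one
#    before a risky operation; files can later be restored from it.
hdfs dfsadmin -allowSnapshot /data/critical
hdfs dfs -createSnapshot /data/critical before-cleanup
```

An administrator who knows these mechanisms can often turn "we deleted production data" from a disaster into a ten-minute restore.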
All these concepts are very important for any Hadoop Administrator to understand. To put it very simply, ask yourself this question: “Would you hand your car over for diagnosis to a mechanic who only knows how to change tires?” No. Absolutely not. You expect the mechanic to know about the engine, transmission, brakes and so on. Similarly, don’t expect any employer to offer a Hadoop administration position if all the candidate knows is cluster installation and starting and stopping services.
Planning a Hadoop cluster is without a doubt the job of a Hadoop Administrator. This task may sound simple, but it is quite involved. The difficulty with planning a cluster is that you won’t know for sure how the cluster will be used in the future. We have seen, time and time again, the actual use of a Hadoop cluster deviate from the initial plan within just a few months of the cluster going live.
Since there is some level of uncertainty in how the cluster will be used, as an administrator you should analyse the current needs, do your best to project future needs, and present a balanced configuration with those future needs in mind. Your job in planning a cluster is to estimate the storage and computational needs, decide the number of nodes in the cluster, pick the right configuration for individual nodes, design a good network topology, choose between storage-intensive and compute-intensive nodes, and so on. As we said, it is more involved than it sounds.
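To make the storage side of that estimate concrete, here is a back-of-the-envelope sizing sketch. Every number in it is an assumption you would replace with your own projections; the point is the shape of the calculation, not the figures.

```shell
#!/bin/sh
# Back-of-the-envelope storage sizing; all numbers are illustrative.
RAW_TB=100        # projected raw data to store
REPLICATION=3     # HDFS replication factor
OVERHEAD_PCT=25   # headroom for temporary and intermediate data
NODE_TB=24        # usable disk per DataNode

NEEDED_TB=$(( RAW_TB * REPLICATION * (100 + OVERHEAD_PCT) / 100 ))
NODES=$(( (NEEDED_TB + NODE_TB - 1) / NODE_TB ))   # ceiling division
echo "need ${NEEDED_TB} TB of raw disk across ${NODES} DataNodes"
```

The compute side plays out the same way: you weigh cores and memory per node against the expected job mix, then revisit the whole estimate as real usage data comes in.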
Given the importance of this topic, we are surprised, actually shocked, to see that many courses out there don’t even mention it.
In any software environment it is safe to assume that things will fail. This is certainly true for a distributed environment like Hadoop. With Hadoop, you have hundreds or even thousands of nodes running at the same time, many jobs running at the same time, many users accessing the cluster, data coming in from many sources, and many tools like Hive, Sqoop and Flume in use all at once. You get the idea: with so many things happening at the same time, there is a high potential for things to go wrong.
Hadoop Administrators are paid top dollar to handle the chaos when things break. They will be called on to troubleshoot and fix issues when things go wrong, which means any aspiring Hadoop Administrator should be prepared for such scenarios. It also means you need to know how things work even when you don’t work on a topic directly. For example, as a Hadoop Administrator you will not be asked to write a MapReduce program, but you are expected to know how MapReduce works, what its phases are, what happens in the shuffle phase, and why the shuffle phase is resource intensive. Knowing these details will help you troubleshoot and advise on potential solutions when there is a performance issue with a MapReduce program, for instance. Similarly, you should have a decent understanding of other Hadoop ecosystem tools like Hive, Pig, Sqoop and Flume, and how they work.
This also goes back to knowing the behind-the-scenes details. When you know how things work under the hood, that is, the configuration details behind the tools and their functionality, you will be in a better position to fix issues efficiently.
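To give a flavour of what that looks like in practice, these are the kind of first-response checks an administrator might reach for on a misbehaving cluster. This is a sketch of a starting point, not an exhaustive runbook.

```shell
# Illustrative first-response checks on a misbehaving cluster.
hdfs dfsadmin -report            # DataNode liveness, capacity, last contact
hdfs fsck / -blocks -locations   # block health: missing or corrupt replicas
yarn node -list                  # NodeManager status across the cluster
yarn application -list           # applications currently running on YARN
```

Knowing which of these to run first, and how to read the output, is exactly the behind-the-scenes fluency employers are probing for.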
Certifications like Cloudera Certified Administrator for Apache Hadoop (CCAH) certainly put you and your resume in the spotlight and give you instant credibility. Keep in mind that being certified does not guarantee you a job, nor does it mean you will be effective on the job; most employers will still evaluate the candidate before making an offer. But all things being equal between two candidates, an employer will prefer the one with the certification over the one without.
When students and members of the Hadoop In Real World community asked us to create a Hadoop Administration course, we went ahead with our research. Soon we realized that the courses out there did not focus on any of the critical elements we just looked at. As we said before, almost all courses just touched on architecture, installation, and starting and stopping services, and that was it. It was shocking. So we decided to design and create a course that would really help aspiring Hadoop Administrators manage and administer real-world Hadoop clusters in real production environments, with confidence and without stress.
We cover all the admin essentials in the Hadoop Administrator in Real World course, from getting to know your cluster, starting and stopping services, and adding and removing nodes, to recovering from data loss. That is not the exciting part though; that is just the tip of the iceberg. We are more interested in teaching our students the critical functionality: installing and configuring high availability, installing and configuring Kerberos, installing and configuring schedulers like the Fair and Capacity schedulers, and so on. And let’s be clear: when we install any component, we teach manual installation and manual configuration of every element, so that you understand exactly what is happening behind the scenes. This means our students can survive even when there are no cluster management tools like Cloudera Manager or Ambari. We have also covered many tips, tricks and shortcuts in the course that will help you with day-to-day administration tasks.
The Hadoop Administrator in Real World course is also CCAH ready, meaning we have covered all the components needed to clear the Cloudera Certified Administrator for Apache Hadoop (CCAH) exam.
In short, we have designed the Hadoop Administrator in Real World course to equip students with all the skills necessary to administer and manage chaotic real-world Hadoop production clusters with confidence and without stress. Click here to check out the full curriculum of the course. Shoot us an email at firstname.lastname@example.org if you have any questions and we will be happy to answer.
Thank you for reading this post and thank you for being a part of Hadoop In Real World community.