What is Microsoft Azure HDInsight, and why should I care?
HDInsight is a managed Apache Hadoop offering from Microsoft that runs in the Azure cloud. We can help you understand, build and maintain enterprise HDInsight clusters.
 
Let's try to understand what Microsoft HDInsight is. To do that, we first need to understand what Big Data and Hadoop are, so let's start with Big Data. Big Data is data that is too large or complex for analysis in traditional relational databases. Traditional relational databases can actually handle large volumes of data, so it is not just about the size of the data; it is also about its complexity. When we talk about what Big Data is, we quite often talk about the three Vs: Volume, Variety and Velocity.
Volume, Variety and Velocity
First, there is the idea of a large volume of data that you need to process. An example might be large volumes of web server logs that you use to analyze click streams as people work their way through your website. But it is not just about volume; it may also be variety. You may have a mixture of structured and unstructured data: nice neat columns and rows, but also some free-form text, or data that is not text at all, such as images or videos. Analyzing structured and unstructured data together is not easy in a traditional database, so it is a Big Data problem. The third V is velocity, where the data arrives very quickly. You have a constant stream of data, perhaps from sensors or some other Internet of Things (IoT) source, and you need to capture, process, analyze and visualize it in real time. Those three Vs make up the working definition of Big Data and the types of problem that we are going to try to solve using Hadoop and HDInsight.
The problems we are going to try to solve fall into three major areas, as follows.
1. Batch Processing
Batch processing is where we filter, cleanse and shape the data for analysis. We might have a huge volume of data in a variety of formats, some of it arriving in real time. We capture it, filter it to get rid of the data we are not interested in, cleanse it to make sure it holds valid values in the right format, and shape it into the right structure for us to analyze and visualize, perhaps as tables in a database. Batch processing is a very common Big Data scenario. It is also the core Hadoop scenario, and it lets us explore all of the core technologies in Hadoop to understand how to ingest, filter, cleanse and shape data.
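To make the filter, cleanse and shape steps a bit more concrete, here is a minimal sketch in plain Python. It is not Hadoop code. The log format, the sample lines and the function names are all made up for illustration; a real batch job would read far more data from cluster storage.

```python
from collections import Counter

# Hypothetical raw web server log lines: "timestamp,user_id,url,status"
RAW_LOG_LINES = [
    "2016-01-01T10:00:00,u1,/home,200",
    "2016-01-01T10:00:05,u2,/products,200",
    "2016-01-01T10:00:07,,/home,200",        # missing user id
    "2016-01-01T10:00:09,u1,/products,404",  # failed request
    "not a log line",                        # malformed
    "2016-01-01T10:00:12,u3,/home,200",
]


def cleanse(line):
    """Parse one line into a record, or return None if it is not valid."""
    parts = line.strip().split(",")
    if len(parts) != 4:
        return None
    timestamp, user_id, url, status = parts
    if not user_id or not status.isdigit():
        return None
    return {"timestamp": timestamp, "user_id": user_id,
            "url": url, "status": int(status)}


def keep(record):
    """Filter: keep only the successful requests we are interested in."""
    return record["status"] == 200


def shape(records):
    """Shape: aggregate into (url, hits) rows, sorted for stable output."""
    hits = Counter(record["url"] for record in records)
    return sorted(hits.items())


def run_batch(lines):
    """Run the whole batch: cleanse, then filter, then shape."""
    parsed = (cleanse(line) for line in lines)
    valid = [record for record in parsed if record is not None]
    return shape(record for record in valid if keep(record))


if __name__ == "__main__":
    for url, hits in run_batch(RAW_LOG_LINES):
        print(url, hits)
```

The output is a small table of rows, which is the kind of structure you could load into a database table for analysis.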
2. Real-Time Processing
Real-time processing is where you capture data as it arrives, filter out the events you are not interested in, and perhaps aggregate over temporal windows. For example, you might want to find out how many tweets you have had in the past hour, or how many people have visited your website in the past twenty minutes. So your data has a temporal aspect, and you have low-latency requirements: because the data is coming in in real time, you want to be able to read and write it very quickly so that you get near real-time analytics as it arrives.
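Here is a rough sketch of aggregating over temporal windows, again in plain Python. It counts events in fixed twenty-minute windows over a small in-memory list, so it only illustrates the windowing arithmetic; a real stream processor would do this incrementally as events arrive. The event tuples and function names are assumptions for the example.

```python
from collections import Counter

# Hypothetical events as (seconds_since_start, event_type) pairs.
SAMPLE_EVENTS = [
    (0, "visit"), (30, "visit"), (61, "click"),
    (1199, "visit"), (1200, "visit"), (2500, "visit"),
]


def filter_events(events, wanted_type):
    """Drop the events we are not interested in."""
    return [(ts, kind) for ts, kind in events if kind == wanted_type]


def count_per_window(events, window_seconds):
    """Count events in fixed, non-overlapping time windows.

    Returns a dict mapping each window's start time to its event count.
    """
    counts = Counter()
    for timestamp, _kind in events:
        # Round the timestamp down to the start of its window.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)


if __name__ == "__main__":
    visits = filter_events(SAMPLE_EVENTS, "visit")
    twenty_minutes = 20 * 60
    for start, count in sorted(count_per_window(visits, twenty_minutes).items()):
        print(f"window starting at {start}s: {count} visits")
```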
3. Predictive Analytics
Predictive analytics is where you apply statistical models to the data that you have captured and, based on historical data, try to predict future data. Maybe you are classifying data as it comes in. Maybe you are applying regression techniques so that you can predict a particular value from the other values you already have. Maybe you are clustering the data into groups of similar entities. Or maybe you are doing something like basket analysis on a website, where you recommend other items based on what the user has in their basket right now. This predictive scenario is an increasingly popular area, where people apply machine learning techniques to data that arrives in huge volume, variety and velocity.
Apache Hadoop
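As a toy illustration of basket analysis, the sketch below counts how often items were bought together in past baskets and recommends the items that most often accompany what is in the current basket. Simple co-occurrence counting is only one possible technique, chosen here because it is short, and the basket data and function names are invented for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase history: one list of items per past basket.
PAST_BASKETS = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "jam"],
]


def build_cooccurrence(baskets):
    """For each item, count how often every other item shared a basket with it."""
    pair_counts = {}
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts.setdefault(a, Counter())[b] += 1
            pair_counts.setdefault(b, Counter())[a] += 1
    return pair_counts


def recommend(current_basket, pair_counts, top_n=2):
    """Recommend items most often bought with the current basket's items."""
    scores = Counter()
    for item in current_basket:
        for other, count in pair_counts.get(item, {}).items():
            if other not in current_basket:
                scores[other] += count
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _score in ranked[:top_n]]


if __name__ == "__main__":
    model = build_cooccurrence(PAST_BASKETS)
    print(recommend(["bread"], model))
```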
So what is Hadoop? It is an open source, distributed data processing cluster technology. It is based on clustering multiple servers together and dividing the workload across them. Within the cluster you have Name Nodes: at least one, but usually more than one for redundancy. The Name Node accepts processing requests coming in from client applications and hands the work to Data Nodes. Typically there are lots of Data Nodes, and they share the processing work between them. They can do that because they all have access to a shared file system, typically referred to as the Hadoop Distributed File System, or HDFS. The important thing here is that the data is not processed by one big computer. We distribute the processing load across multiple computers, the nodes within a cluster. The analogy that is often used is ploughing a large field: you don't necessarily need a bigger ox, you need more oxen. You team them up and make them work together, and that is the approach taken with Hadoop, where a set of cluster nodes work together to process the data.
As I said, the nodes use the Hadoop Distributed File System (HDFS) as a shared storage area, so that all of them can access the data and operate on it. The data is in what we call a splittable format: the nodes read one file but split its contents between the different Data Nodes, so that each processes a small part of the data. That is an important aspect of Hadoop. To do that, we use server resources across the nodes in the cluster, and that resource management is performed by a technology called YARN, which stands for Yet Another Resource Negotiator. You will quite often see YARN appearing when you work with Hadoop. You can think of it as the ringmaster, the coordinator of all the work performed across the cluster as it reads and processes data from HDFS.
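The split-and-combine idea can be sketched on a single machine. In the example below, worker threads stand in for Data Nodes: the input is split, each worker counts words in its own split, and the partial counts are merged at the end. This only simulates the pattern. It does not use Hadoop, HDFS or YARN, and the function names and sample text are made up for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical input; on a real cluster this would be a large file in shared storage.
SAMPLE_LINES = [
    "more oxen",
    "not bigger oxen",
    "more nodes not bigger nodes",
]


def split_lines(lines, num_splits):
    """Divide the input into one split per worker (round-robin by line)."""
    return [lines[i::num_splits] for i in range(num_splits)]


def count_words(split):
    """The work one worker does on its own split of the data."""
    counts = Counter()
    for line in split:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_workers=3):
    """Split the input, count each split in parallel, then merge the results."""
    splits = split_lines(lines, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, splits))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


if __name__ == "__main__":
    for word, count in sorted(word_count(SAMPLE_LINES).items()):
        print(word, count)
```

The result is the same however many workers you use, which is the point of the oxen analogy: adding workers changes how the work is shared, not the answer.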
Microsoft HDInsight
You might be wondering where HDInsight comes in; is this post about HDInsight or not? So let's talk about HDInsight. The simplest definition is that HDInsight is Apache Hadoop running on Azure. It is regular open source Hadoop, not a special Microsoft version of Hadoop. It comes from Hortonworks, the vendor that Microsoft works with to create it: HDInsight is an instance of the Hortonworks HDP Hadoop distribution running on virtual machines in Azure. The benefit is that we can spin up a cluster just by going to the Azure portal and telling it that we want a Hadoop cluster. Azure then manages the provisioning and running of all the virtual machines and the installation of Hadoop on them, so we don't have to worry about managing the servers. What we get are virtual machines configured as a cluster running the Hortonworks implementation of Hadoop.
The HDFS shared storage that I mentioned earlier is implemented slightly differently when we work with Azure. Rather than the storage being managed on the hard disks attached to the virtual machines, we move it into the cloud: either into an Azure storage account or into Azure Data Lake, a newer service created for managing large volumes of data. It behaves exactly like HDFS. When you are writing code it makes no difference, and your Hadoop cluster works with it just as it would with a native HDFS implementation. All we have done is abstract HDFS and move the data into Azure storage or Data Lake. The advantage is that the shared storage still behaves the same way for Hadoop, so there are no changes to the way Hadoop processes the data. But because it is shared storage in the cloud, other applications can write data directly into it or read data directly from it. We have separated the storage from the processing in the cluster, and that lets us manage the lifetime of the data separately from the lifetime of the cluster. You can run the cluster only when you need it and take it down when you no longer want to pay for it, while keeping the storage and the data that you have been working on.
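One place where the storage abstraction becomes visible is in how files are addressed. As far as I know, Hadoop on HDInsight refers to files in an Azure storage account using a wasb:// (or wasbs:// over TLS) URI of the form container@account.blob.core.windows.net/path, in the same way a local cluster would use an hdfs:// path. The small helper below just builds such a string. Treat the format as an assumption to check against the Azure documentation; the helper name and the container and account names are hypothetical.

```python
def wasb_uri(container, account, path, secure=True):
    """Build a URI for a file in an Azure storage container.

    Assumed format: wasb[s]://<container>@<account>.blob.core.windows.net/<path>
    """
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


if __name__ == "__main__":
    # "data" and "mystorage" are placeholder names, not real resources.
    print(wasb_uri("data", "mystorage", "/logs/2016/01.log"))
```

Because the URI names the storage account and not a machine in the cluster, the same address stays valid after the cluster is deleted and a new one is created later.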
The other piece of Azure that HDInsight uses is a database. The Hadoop cluster keeps some metadata about the data it is working on, for things like Hive tables or Oozie workflows, and that metadata is stored in a database. If you were installing Hadoop on your own local server, you would typically install some sort of database system for this, such as MySQL or SQL Server. When we use HDInsight to host our Hadoop cluster, the metadata is stored in an instance of Azure SQL Database. You don't need to explicitly provision or manage it; it is all handled for you under the covers. Isn't that a simpler, time-saving option compared with going directly to the Hadoop ecosystem and installing and configuring everything manually? In the next post we will create the Hadoop equivalent of a Hello World program, a word count, on an HDInsight cluster.
