Big Data

What Is Big Data?

Today, internet usage is at its peak. The number of people accessing the web has grown from millions to billions, and so has the data. When data becomes so huge that it goes beyond the storage and processing capacity of a single system, it is called Big Data.

Where does this huge data come from?
Facebook : Have you ever wondered how much data Facebook processes per day? It's close to 500 terabytes, and Facebook's warehouse currently stores a total of about 300 petabytes.

Google : When you type something into the Google search bar, you get your results within a fraction of a second. Ever thought about how much data it has to process per day? It's around 100 petabytes, and its total storage is around 15 exabytes.


Now just think for a moment. Could a single computer handle all of this, no matter how powerful it is? The answer is no. The main problem is not just storage but processing as well. What if Google returned your search results after 15 minutes? What if Facebook took half an hour to upload a photo?

The solution to this problem is a distributed computing system.

Distributed Computing System

In simple words, just think of how a manager works in an office. When a new assignment comes in, the manager distributes the work among the co-workers.

A similar idea applies to a distributed computing system. A single computer is not responsible for handling all the tasks. Instead, the work is distributed among several computers connected to each other. Each computer is called a node, and the group of computers connected to each other is called a cluster.

These standalone computers are commodity hardware (low-cost computers). If we increase the number of nodes (standalone computers), it is not just the storage capacity that increases; the processing capability increases as well. The sketch below illustrates this idea.
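To build some intuition for how a cluster divides work, here is a minimal Java sketch. It is not Hadoop code; the Node class, the word-counting task and the thread-pool "cluster" are simplifications made up purely for illustration. Each simulated node processes only its own slice of the data, and the partial results are then combined.

import java.util.*;
import java.util.concurrent.*;

public class ClusterSketch {

    // A "node" in our toy cluster: it only processes the slice of data given to it.
    static class Node implements Callable<Integer> {
        private final List<String> slice;

        Node(List<String> slice) {
            this.slice = slice;
        }

        @Override
        public Integer call() {
            // Each node counts the words in its own slice of the data.
            int words = 0;
            for (String line : slice) {
                words += line.split("\\s+").length;
            }
            return words;
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> data = Arrays.asList(
                "big data is data that is too big for one machine",
                "so the data is distributed across a cluster of nodes",
                "adding nodes adds both storage and processing power");

        int nodeCount = 3; // our toy cluster has 3 nodes
        ExecutorService cluster = Executors.newFixedThreadPool(nodeCount);

        // Distribute the work: each node gets one slice of the data.
        List<Future<Integer>> partialResults = new ArrayList<>();
        int sliceSize = (int) Math.ceil((double) data.size() / nodeCount);
        for (int i = 0; i < data.size(); i += sliceSize) {
            List<String> slice = data.subList(i, Math.min(i + sliceSize, data.size()));
            partialResults.add(cluster.submit(new Node(slice)));
        }

        // Combine the partial results from all nodes.
        int total = 0;
        for (Future<Integer> result : partialResults) {
            total += result.get();
        }
        cluster.shutdown();

        System.out.println("Total words counted by the cluster: " + total);
    }
}

Adding more nodes to the pool would let more slices be processed at the same time, which is exactly why a cluster's processing capability grows with the number of nodes.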


Who manages the data distributed across the systems?

So, there should be someone who is responsible for:

  1. Distributing data across the systems.

  2. Distributing data for processing across multiple systems.

  3. Replicating data across multiple systems. Since commodity hardware can fail easily, data has to be duplicated across multiple systems (a small sketch of this idea is shown below).

To solve the above problems, Hadoop came into the picture.
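Before going into Hadoop itself, here is a small, hypothetical Java sketch of the replication idea from point 3. It is not HDFS code; the node names, block names and round-robin placement are made up for illustration. The point is simply that each block of a file is stored on more than one node, so the data survives the failure of a single machine.

import java.util.*;

public class ReplicationSketch {

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("node1", "node2", "node3", "node4");
        List<String> blocks = Arrays.asList("block-A", "block-B", "block-C");
        int replicationFactor = 2; // each block is stored on 2 different nodes

        // Map from node name -> blocks stored on that node.
        Map<String, List<String>> storage = new LinkedHashMap<>();
        for (String node : nodes) {
            storage.put(node, new ArrayList<>());
        }

        // Place each copy of each block on the next node, round-robin.
        int next = 0;
        for (String block : blocks) {
            for (int copy = 0; copy < replicationFactor; copy++) {
                String node = nodes.get(next % nodes.size());
                storage.get(node).add(block);
                next++;
            }
        }

        // If any single node fails, every block still exists on another node.
        storage.forEach((node, stored) ->
                System.out.println(node + " stores " + stored));
    }
}

In Hadoop's HDFS, a similar bookkeeping job is done by the NameNode, which keeps track of which DataNode holds which block and arranges new copies when a machine fails.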