
HADOOP - BRIEF DESCRIPTION

About Hadoop and Big Data processing




Hadoop is an open-source framework designed to work with large data sets in a distributed computing environment. It is an Apache project, released under the Apache License.
Hadoop provides fast data processing across the nodes of a cluster, along with high availability
and fault tolerance for Hadoop clusters.
Hadoop is a powerful platform for processing data and coordinating its movement across the various architectural components.

Google initially built GFS (Google File System), which works mainly on part files.
These part files are small, chunk-sized files or blocks. Hadoop was designed after GFS.

In April 2008, Hadoop established itself as a powerful system by sorting 1 terabyte of
data on a 910-node cluster in less than 4 minutes.
In April 2009, 500 GB of data was sorted in 59 seconds on 1,406 Hadoop nodes,
and 1 TB of data was sorted in 62 seconds on the same cluster.


The hardware used during the sort was:
2 quad-core Xeon processors at 2.0 GHz per node
4 SATA disks per node
8 GB RAM per node
1 Gb Ethernet per node
40 nodes per rack
Red Hat Enterprise Linux Server release 5.1
Sun JDK 1.6.0
(http://sortbenchmark.org/YahooHadoop.pdf)


A Hadoop distribution comes with the Hadoop kernel, HDFS, and MapReduce. We can add more components or Hadoop sub-projects such as Hive, HBase, ZooKeeper, Sqoop, etc.
Hadoop uses a file system known as HDFS (Hadoop Distributed File System), which maintains the physical files in Hadoop.
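As a rough illustration of how an application talks to HDFS, here is a minimal sketch using Hadoop's Java FileSystem API. The NameNode address (hdfs://namenode:9000) and the paths are placeholders of my own choosing and assume a running cluster with write access.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Point the client at the NameNode (placeholder address).
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        FileSystem fs = FileSystem.get(conf);

        // Create a directory and copy a local file into HDFS (example paths).
        fs.mkdirs(new Path("/user/demo"));
        fs.copyFromLocalFile(new Path("/tmp/input.txt"),
                             new Path("/user/demo/input.txt"));

        // List what is stored under /user/demo.
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }
        fs.close();
    }
}

Behind this simple API, HDFS splits each file into blocks, replicates them across DataNodes, and tracks their locations on the NameNode.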

Internally, Hadoop uses the MapReduce paradigm, which consists of a Mapper section and a Reducer section. The input data is divided into small chunks, stored as part files in HDFS, and processed in a distributed fashion with data-local access patterns. A small sketch of a Mapper and Reducer follows below.
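Here is a minimal word-count sketch written against the org.apache.hadoop.mapreduce API, just to give a feel for the Mapper and Reducer sections. The class names WordCountMapper and WordCountReducer are illustrative, and a driver (Job) class would still be needed to submit this to a cluster.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper section: emit (word, 1) for every word in an input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}

// Reducer section: sum the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

The mappers run in parallel on the chunks stored in HDFS, and the reducers receive all values for a given key, which is what makes the processing distributed.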

Hadoop runs on commodity hardware. That means Hadoop supports existing infrastructure, reusable machines, and low- or mid-range systems; there is no need for high-end machines to work with Hadoop.
Generally, a node with two quad-core 2.25 GHz CPUs, 16-24 GB of RAM, and 1 TB SATA hard disks is enough.

When we are working with Hadoop, we need to configure the NameNode, DataNode, JobTracker, and TaskTracker. I will show you how to configure them in the notes on how to set up a Hadoop cluster.

Hadoop is written in Java and runs in any environment where a JVM is available.
Most of the time we run Hadoop on Ubuntu or CentOS; to work in a Windows environment we need a tool called Cygwin.

Hadoop distributions and commercial support are provided by Cloudera, MapR, Hortonworks, Karmasphere, and others. Refer to http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support for the full list of Hadoop distributions and commercial support.

Hadoop comes with its sub-projects, which we can call the Hadoop ecosystem or Hadoop components.
Data storage: HDFS, HBase
Data processing: MapReduce, Hive, Pig
Coordination between the components: ZooKeeper
Data import and export: Sqoop
Data serialization: Avro
Log data collection: Flume

Apart from the above, the Hadoop ecosystem includes more components:
Ambari, HCatalog, Mahout, Oozie, Cassandra, Cascading, Vertica, Hadoop Common, etc.

The data or datasets we work on with Hadoop generally include:
Users' entire browsing history
Web logs
User interaction logs
User interaction history
User transaction history
User tweets generated from Twitter
Climate sensor data

And we use Hadoop to:

Analyze data
Identify the most important customers
Identify the best time to perform maintenance based on usage patterns
Analyze brand reputation and social media


