Hadoop is an open source framework designed to work with large data sets in a distributed computing environment. It is part of the Apache project and is released under the Apache license.
Hadoop is a powerful platform for processing data and coordinating the movement of data across various architectural components, and it brings fault tolerance to Hadoop clusters.
Initially Google started GFS (Google File System), which mainly works on part files. These part files are small chunks of a file, also called blocks. Hadoop was designed after GFS.
In April 2008, Hadoop established itself as a powerful system by sorting 1 terabyte of data on a 910-node cluster in less than 4 minutes.
In April 2009, 500 GB of data was sorted in 59 seconds on 1406 Hadoop nodes, and 1 TB of data was sorted in 62 seconds on the same cluster.
The hardware used during sorting was:
2 quad-core Xeons at 2.0 GHz per node
4 SATA disks per node
8 GB RAM per node
1 Gb Ethernet on each node
40 nodes per rack
Red Hat Linux Server release 5.1
Sun JDK 1.6.0
(http://sort.benchmark.org/yahoohadoop.pdf)
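To put the 2009 sort figures in perspective, here is a small back-of-the-envelope calculation of the sort rate for the whole cluster and per node. It is only a sketch: it assumes decimal units (1 TB = 1000 GB, 1 GB = 1000 MB) and uses the node count quoted above; the function name `throughput` is made up for this example.

```python
# Figures quoted in the text for the April 2009 sort runs.
# Assumption: decimal units (1 TB = 1000 GB, 1 GB = 1000 MB).
NODES = 1406


def throughput(data_gb, seconds, nodes):
    """Return (cluster GB/s, per-node MB/s) for one sort run."""
    cluster_gb_per_s = data_gb / seconds
    per_node_mb_per_s = cluster_gb_per_s * 1000 / nodes
    return cluster_gb_per_s, per_node_mb_per_s


runs = {"500 GB in 59 s": (500, 59), "1 TB in 62 s": (1000, 62)}
for label, (data_gb, seconds) in runs.items():
    cluster, per_node = throughput(data_gb, seconds, NODES)
    print(f"{label}: {cluster:.1f} GB/s overall, {per_node:.1f} MB/s per node")
```

Under these assumptions the cluster sorts roughly 8.5 GB per second in the first run and roughly 16.1 GB per second in the second, while each single node only handles a few MB per second. That is the point of Hadoop: many modest machines add up to a very high total rate.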
A Hadoop distribution comes with the Hadoop kernel, HDFS, and MapReduce. We can add more components or Hadoop subprojects such as Hive, HBase, ZooKeeper, and Sqoop.
Hadoop uses a file system known as HDFS (Hadoop Distributed File System), which is used to maintain physical files in Hadoop.
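The idea of storing a file as fixed-size blocks (the part files mentioned earlier) can be sketched in a few lines of Python. This is not how HDFS is implemented; it only illustrates the splitting step. The function name `split_into_blocks` and the tiny block size are made up for this example, and real HDFS blocks are much larger (tens of megabytes or more).

```python
from typing import List


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut data into consecutive blocks of at most block_size bytes."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


# A 1000-byte "file" split into 256-byte blocks (toy sizes for illustration).
file_content = b"x" * 1000
blocks = split_into_blocks(file_content, 256)
print(len(blocks), [len(b) for b in blocks])

# Joining the blocks in order gives back the original file.
assert b"".join(blocks) == file_content
```

Note that the last block is usually smaller than the others, because a file size is rarely an exact multiple of the block size.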
Hadoop internally uses the MapReduce paradigm, which consists of a Mapper section and a Reducer section. The data is divided into small chunks that are stored as part files in HDFS, and the Mappers and Reducers work on those chunks. MapReduce handles distributed data processing and data access patterns.
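The Mapper and Reducer sections are easiest to understand with a word count. The sketch below imitates the map, shuffle, and reduce steps in plain Python on a single machine; it does not use the Hadoop API, and the names `mapper`, `shuffle`, `reducer`, and `run_job` are made up for this example. In a real cluster each chunk would be handled by a different node.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reducer(key, values):
    """Reduce step: add up the counts collected for one word."""
    return key, sum(values)


def run_job(chunks):
    """Run map, shuffle, and reduce over a list of chunks (lists of lines)."""
    mapped = [pair for chunk in chunks for line in chunk for pair in mapper(line)]
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in grouped.items())


# Two chunks stand in for two part files stored on different nodes.
chunks = [
    ["hadoop stores data", "hadoop processes data"],
    ["data is split into chunks"],
]
counts = run_job(chunks)
print(counts["hadoop"], counts["data"])
```

The mapper never needs to see the whole data set, only its own chunk. That is what lets the work be spread across many machines.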
Hadoop runs on commodity hardware. This means Hadoop works with existing infrastructure, reused machines, and low or mid-range systems. There is no need to buy high-end machines to work with Hadoop.
Generally we need two quad-core processors at 2.25 GHz, 16-24 GB of RAM, and 1 TB SATA hard disks.
When we work with Hadoop, we need to configure the NameNode, DataNode, JobTracker, and TaskTracker. I will cover how to configure them in the notes on how to set up a Hadoop cluster.
Hadoop is written in Java and runs in any environment where a JVM is available.
Most of the time we run Hadoop on Ubuntu or CentOS. To work in a Windows environment we need a tool such as Cygwin.
Hadoop distributions and commercial support are provided by Cloudera, MapR, Hortonworks, and Karmasphere. Refer to http://wiki.apache.org/hadoop/Distributions%20and%20Commercial%20Support for Hadoop commercial support.
Hadoop comes with its subprojects, which we can call the Hadoop ecosystem or Hadoop components:
Data storage: HDFS, HBase
Data processing: MapReduce, Hive, Pig
Data coordination between the components: ZooKeeper
Data import and export: Sqoop
Data serialization: Avro
Data in log files: Flume
Apart from the above, the Hadoop ecosystem includes more components, such as Ambari, HCatalog, Mahout, Oozie, Cassandra, Cascading, Vertica, and Common.
The data or datasets we work on in Hadoop generally include:
Users' entire browsing history
Weblogs
User interaction logs
User interaction history
User transaction history
List of user tweets generated from Twitter
Climate sensor data
And we use Hadoop to analyze this data, for example:
Identifying customers who are most important
Identifying the best time to perform maintenance based on usage patterns
Analyzing brand reputation and social media
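As a small illustration of the first use case, the sketch below ranks customers by total spend from a transaction history. It is only a single-machine sketch in plain Python, not Hadoop code. The sample records and the names `total_spend` and `top_customers` are made up, and "most important" is assumed here to mean "highest total spend".

```python
from collections import defaultdict

# Hypothetical transaction history: (customer, amount) records.
transactions = [
    ("alice", 120.0),
    ("bob", 35.5),
    ("alice", 80.0),
    ("carol", 300.0),
    ("bob", 10.0),
]


def total_spend(records):
    """Add up the amounts per customer (the same idea as a reduce by key)."""
    totals = defaultdict(float)
    for customer, amount in records:
        totals[customer] += amount
    return dict(totals)


def top_customers(records, n):
    """Return the n customers with the highest total spend, largest first."""
    totals = total_spend(records)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]


print(top_customers(transactions, 2))
```

On a real cluster the per-customer totals would be computed in parallel over many part files, but the logic stays the same: group by customer, sum, then sort.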
Facebook id: hadoopframework