Anyone concerned
with information technology needs to know about Hadoop.
The seeds of Hadoop were first planted in 2002.
What it is:
The Hadoop platform was designed to solve problems where you have a lot of data — perhaps a mixture of complex and structured data — and it doesn’t fit nicely into tables.
Examples: images, audio, video, PDF, Word documents, XML, or any other kind of unstructured data.
Hadoop is a data storage and processing system. It is scalable, fault-tolerant and distributed. The software was originally developed by the world’s largest internet companies to capture and analyze the data that they generate. Unlike older platforms, Hadoop is able to store any kind of data in its native format and to perform a wide variety of analyses and transformations on that data. Hadoop stores terabytes, and even petabytes, of data inexpensively. It is robust and reliable and handles hardware and system failures automatically, without losing data or interrupting data analyses.
Hadoop is designed to store big data cheaply on a distributed file system across commodity servers. How you get that data there, though, is up to you, and it’s a surprisingly critical issue: Hadoop isn’t a replacement for existing infrastructure, but rather a tool to augment data management and storage capabilities, so data will be continually going in and out.
How it works:
Hadoop runs on clusters of commodity servers. Each of those servers has local CPU and storage. Each can store a few terabytes of data on its local disk.
The two critical components of the Hadoop software are:
The Hadoop Distributed File System, or HDFS
HDFS is the storage system for a Hadoop cluster. When data arrives at the cluster, the HDFS software breaks it into pieces and distributes those pieces among the different servers participating in the cluster. Each server stores just a small fragment of the complete data set, and each piece of data is replicated on more than one server.
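To make that concrete, here is a minimal sketch of how a client program could copy a local file into HDFS through Hadoop's Java FileSystem API. The file and directory paths are made-up examples, and the cluster address is assumed to come from the usual core-site.xml configuration rather than anything described in this post.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
        // Cluster settings (such as fs.defaultFS) are normally read from
        // core-site.xml on the classpath; this just loads them.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: a local log file and a target directory in HDFS.
        Path localFile = new Path("/tmp/weblog-2020-01-01.log");
        Path hdfsDir   = new Path("/data/raw/weblogs/");

        // HDFS splits the file into blocks, spreads the blocks across the
        // cluster's servers, and replicates each block on several of them.
        fs.copyFromLocalFile(localFile, hdfsDir);

        fs.close();
    }
}
```

In practice the same step is often done from the command line with hdfs dfs -put; either way, HDFS handles the splitting and replication automatically.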
A distributed data processing framework called MapReduce
Because Hadoop stores the entire dataset in small pieces across a collection of servers, analytical jobs can be distributed, in parallel, to each of the servers storing part of the data. Each server evaluates the question against its local fragment simultaneously and reports its results back for collation into a comprehensive answer.
MapReduce is the plumbing that distributes the work and collects the results.
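The standard word-count job is the simplest illustration of that plumbing. The sketch below follows the canonical Hadoop example; the input and output paths are assumptions chosen to match the ingest sketch above. The map step runs on each server against its local fragment of the data, and the reduce step collates the partial counts into the final answer.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: runs in parallel on each server, against the HDFS blocks stored there.
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);   // emit (word, 1) for every word seen
            }
        }
    }

    // Reduce: collates the partial counts from all mappers into final totals.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical input and output locations in HDFS.
        FileInputFormat.addInputPath(job, new Path("/data/raw/weblogs"));
        FileOutputFormat.setOutputPath(job, new Path("/data/results/wordcount"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

MapReduce ships the map tasks to the servers that already hold the data and routes the intermediate (word, count) pairs to the reducers, so the programmer never writes any of the distribution logic.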
Both HDFS and MapReduce are designed to continue to work in the face of system failures. The HDFS software continually monitors the data stored on the cluster. If a server becomes unavailable, a disk drive fails or data is damaged, whether due to hardware or software problems, HDFS automatically restores the data from one of the known good replicas stored elsewhere
on the cluster. When an analysis job is running, MapReduce monitors the progress of each of the servers participating in the job. If one of them is slow in returning an answer or fails before completing its work, MapReduce automatically starts another instance of that task on another server that has a copy of the data. Because of the way that HDFS and MapReduce work,
Hadoop provides scalable, reliable and fault-tolerant services for data storage and analysis at very low cost.
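Both the degree of replication and the re-execution of slow tasks are configurable. As a rough sketch, assuming the standard Hadoop property names (the values shown are only illustrative), a job's configuration might set them like this:

```java
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceSettings {
    public static Configuration tunedConfiguration() {
        Configuration conf = new Configuration();

        // Number of copies HDFS keeps of each block; 3 is the usual default,
        // so losing one disk or one server still leaves two good replicas.
        conf.setInt("dfs.replication", 3);

        // Allow MapReduce to launch a duplicate ("speculative") attempt of a
        // task that is running unusually slowly and keep whichever finishes first.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        return conf;
    }
}
```

Three replicas per block is the common default, which is what lets the cluster lose a disk or an entire server without losing data or interrupting a running analysis.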
Hadoop stores any type of data, structured or complex, from any number of sources, in its natural format. No conversion or translation is required on ingest. Data from many sources can be combined and processed in very powerful ways, so that Hadoop can do deeper analyses than older legacy systems. Hadoop integrates cleanly with other enterprise data management
systems. Moving data among existing data warehouses, newly available log or sensor feeds and Hadoop is easy. Hadoop is a powerful new tool that complements current infrastructure with new ways to store and manage data at scale.
What it can do:
Hadoop solves the hard scaling problems caused by large amounts of complex data. As the amount of data in a cluster grows, new servers can be added incrementally and inexpensively to store and analyze it. Because MapReduce takes advantage of the processing power of the servers in the cluster, a 100-node Hadoop instance can answer questions on 100 terabytes of data just as quickly as a ten-node instance can answer questions on ten terabytes.
Of course, many vendors promise scalable, high-performance data storage and analysis. Hadoop was invented to solve the problems that early internet companies like Yahoo! and Facebook faced in their own data storage and analysis. These companies and others actually use Hadoop today to store and analyze petabytes (thousands of terabytes) of data. Hadoop is not merely faster than legacy systems. In many instances, the legacy systems simply could not do these analyses.