Hadoop is an open-source, Java-based framework that manages the storage and processing of large amounts of data for applications. It was originally designed for clusters built from commodity hardware, which remains the common use case. Rather than relying on one large computer to store and process the data, Hadoop clusters multiple computers so that massive datasets can be analyzed in parallel more quickly. The core Hadoop framework consists of four modules that work together to form the Hadoop ecosystem:
- Hadoop Distributed File System (HDFS): A distributed file system that runs on standard or low-end hardware. HDFS provides better data throughput than traditional file systems, along with high fault tolerance and native support for large datasets (see the HDFS sketch after this list).
- Hadoop Common: Provides the common Java libraries used across all modules. These shared utilities make it easier to use all the storage and processing capacity in cluster servers and to execute distributed processes against huge amounts of data, and they supply the building blocks on which other services and applications can be built.
- Hadoop YARN: A framework for job scheduling and cluster resource management.
- Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (see the word-count sketch after this list).
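To make the HDFS bullet concrete, here is a minimal Java sketch that writes a file to HDFS and reads it back through the `org.apache.hadoop.fs.FileSystem` API. It assumes a reachable cluster whose address is configured via `fs.defaultFS` on the classpath; the path `/tmp/hello.txt` is purely illustrative.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml from the classpath; the cluster address
        // (fs.defaultFS, e.g. hdfs://namenode:8020) is assumed to be configured there.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file to HDFS (path is illustrative).
        Path path = new Path("/tmp/hello.txt");
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("Hello, HDFS!\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and print its contents.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```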
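The classic way to see MapReduce and YARN working together is the word-count job from the Hadoop tutorials; the sketch below follows that pattern. It assumes the input and output locations are HDFS paths passed on the command line and that the job is submitted against a running YARN cluster.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every word in each input line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are HDFS paths supplied on the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a JAR, a job like this would typically be launched with something like `hadoop jar wordcount.jar WordCount /input /output`, where YARN schedules the map and reduce tasks across the cluster and HDFS holds the input and output files.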
Hadoop is used by a wide variety of companies and organizations for both research and production.