A Solution to Big Data: Apache Hadoop
In my previous article, I talked about Big Data and summed it up like this: big data isn’t a concept, it’s a problem to solve. Now we have a solution: Apache Hadoop!
What is Apache Hadoop?
Apache Hadoop is a framework that enables the distributed processing of large data sets across clusters of computers using simple programming models. Instead of relying on one large computer to store and process the data, Hadoop clusters multiple computers together so that massive datasets can be analyzed in parallel, and therefore more quickly.
Hadoop systems can handle various forms of structured and unstructured data, giving users more flexibility for collecting, processing, analyzing and managing data than relational databases and data warehouses provide.
Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Doug Cutting named it after his son’s toy elephant.
Ecosystem
The Hadoop ecosystem is a platform, or suite, of services built around the core framework to solve big data problems. It includes official Apache projects as well as various commercial tools and solutions.
Modules
Hadoop consists of four main modules:
HDFS (Hadoop Distributed File System): As the primary storage component of the Hadoop ecosystem, HDFS is a distributed file system that provides high-throughput access to application data with no need for schemas to be defined up front (a minimal Java sketch of its FileSystem API follows this list).
YARN (Yet Another Resource Negotiator): YARN is Hadoop’s resource-management platform; it manages the compute resources in the cluster and schedules users’ applications onto them.
MapReduce: MapReduce is a programming model for large-scale data processing. A job is split into a map phase that processes input records in parallel across the cluster and a reduce phase that aggregates the intermediate results, so the processing logic moves to where the data lives and huge datasets are boiled down to one manageable result (see the word-count sketch after this list).
Hadoop Common: Hadoop Common includes the libraries and utilities used and shared by other Hadoop modules.
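To make HDFS a little more concrete, here is a minimal Java sketch that writes a small file and reads it back through Hadoop’s FileSystem API. The NameNode address and the /tmp/hello.txt path are assumptions for illustration; on a real cluster, fs.defaultFS normally comes from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address

    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/tmp/hello.txt");           // hypothetical path

    // Write a small file; HDFS splits larger files into replicated blocks behind the scenes.
    try (FSDataOutputStream out = fs.create(path, true)) {
      out.write("hello from hdfs".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back through the same FileSystem abstraction.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }
    fs.close();
  }
}
```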
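The MapReduce model is easiest to see in the classic word-count job. The sketch below closely follows the standard WordCount example from the Hadoop documentation: the mapper emits a (word, 1) pair for every word in its input split, the reducer sums the pairs for each word, and the input and output HDFS paths are passed on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on each input split and emits (word, 1).
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: receives all counts for one word and sums them.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // combine locally before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```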
Applications
HBase: HBase is a non-relational, distributed database that runs on top of the Hadoop cluster and stores large amounts of structured data.
Cassandra: Cassandra is a wide-column store NoSQL database management system.
Pig: Pig consists of a high-level language for expressing data analysis programs and an infrastructure for evaluating those programs. It was designed for long series of data operations, making it ideal for ETL data pipelines, research on raw data, and iterative processing of data.
Hive: Hive makes it easy to read, write, and manage large datasets residing on HDFS. It has its own query language, HQL (Hive Query Language), which is very similar to SQL, so programmers can express MapReduce-style jobs as simple HQL queries instead of writing the map and reduce functions themselves (see the JDBC sketch after this list).
Sqoop: Sqoop plays an important part in bringing data from relational databases into HDFS. The commands written in Sqoop are internally converted into MapReduce tasks that are executed over HDFS, and it works with almost all relational databases.
Flume: Flume aggregates, collects, and moves large amounts of log data.
Kafka: Kafka sits between the applications generating data (producers) and the applications consuming data (consumers). It is distributed, with built-in partitioning, replication, and fault tolerance, and it handles streaming data so businesses can analyze it in real time (see the producer sketch after this list).
Oozie: Oozie is a workflow scheduler system that lets users link jobs written on various platforms. With Oozie you can schedule a job in advance and build a pipeline of individual jobs to be executed sequentially or in parallel to achieve a bigger task.
Zookeeper: Zookeeper is an open-source, distributed, and centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services across the cluster.
Ambari: Ambari is a web-based interface for provisioning, managing, and monitoring Big Data clusters and their components.
Spark: Spark is an alternative processing engine to MapReduce. It is written in Scala but supports applications in Java, Python, and other languages, and it adds stream processing, SQL, machine learning, and graph processing on top of the same cluster.
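To show how close HQL is to SQL, here is a minimal sketch that runs a query against HiveServer2 over JDBC. The connection URL, the empty credentials, and the access_logs table are assumptions for illustration, and the Hive JDBC driver has to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // HiveServer2 commonly listens on port 10000; URL and credentials are assumptions.
    try (Connection conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         // Plain SQL-like HQL; Hive compiles it into distributed jobs behind the scenes.
         ResultSet rs = stmt.executeQuery("SELECT page, COUNT(*) AS hits FROM access_logs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```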
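And here is a minimal sketch of the producer side of Kafka, using the Java kafka-clients library. The broker address and the clickstream topic are assumptions for illustration; a consumer application would subscribe to the same topic to read the stream.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ClickProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    // Records are partitioned by key and replicated according to the topic's configuration.
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.send(new ProducerRecord<>("clickstream", "user-42", "page=/home"));
    }
  }
}
```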
Advantages of Hadoop
Fast: In HDFS, the data is distributed over the cluster and mapped, which helps in faster retrieval. The tools that process the data often run on the same servers, reducing processing time further; Hadoop can process terabytes of data in minutes and petabytes in hours.
Scalable: A Hadoop cluster can be extended simply by adding more nodes.
Cost-effective: Hadoop is open source and stores data on commodity hardware, so it is really cost-effective compared to traditional relational database management systems.
Resilient to failure: HDFS replicates data across the network, so if one node goes down or some other failure occurs, Hadoop simply uses another copy of the data. Normally, data is replicated three times, but the replication factor is configurable (a short sketch follows this list).
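As a small illustration of that last point, the default replication factor is controlled by the dfs.replication property, and it can also be changed per file through the FileSystem API. This is only a minimal sketch; the NameNode address and the file path are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000"); // assumed NameNode address
    conf.set("dfs.replication", "2");                 // default for new files written by this client

    FileSystem fs = FileSystem.get(conf);
    // Raise the replication factor of an existing (hypothetical) file to three copies.
    fs.setReplication(new Path("/data/important.csv"), (short) 3);
    fs.close();
  }
}
```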