Some Hadoop Concepts

Recently I have been taking the Big Data Specialization courses by UCSD on Coursera, and I earned a certificate for the first course.



Apache Hadoop is an open source software framework for storage and large-scale processing of data sets on clusters of commodity hardware

Apache Framework Basic Modules

  • Hadoop Common (libraries and utilities)
  • Hadoop Distributed File System (HDFS)
  • Hadoop YARN (a resource-management platform, scheduling)
  • Hadoop MapReduce (a programming model for large scale data processing)
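The MapReduce programming model can be sketched in plain Python. This is only an illustration of the three phases, not the Hadoop API: real Hadoop MapReduce distributes map, shuffle, and reduce across a cluster, while here everything runs in one process on made-up sample data.

```python
# A minimal single-process sketch of the MapReduce model (word count).
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data", "big hadoop data"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 1}
```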


Hadoop Distributed File System: Distributed, scalable, and portable file system written in Java for the Hadoop framework


YARN separates cluster resource management from job scheduling, letting multiple processing engines share a Hadoop compute cluster

Apache Sqoop

Tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases
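The shape of a Sqoop-style import can be sketched in pure Python: read rows from a relational source and write them out as delimited text records, which is the default format Sqoop produces in HDFS. The table and data below are made up for illustration; this is not the Sqoop tool itself.

```python
# Toy sketch of an RDBMS -> delimited-records import, Sqoop-style.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "linus")])

# Each database row becomes one comma-delimited text record
records = [",".join(str(col) for col in row)
           for row in conn.execute("SELECT id, name FROM users ORDER BY id")]
print(records)  # ['1,ada', '2,linus']
```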


Apache HBase

A scalable, distributed database that can host very large tables

  • Column-oriented database management system
  • Key-value store
  • Based on Google Big Table
  • Can hold extremely large data
  • Dynamic data model
  • Not a Relational DBMS
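HBase's data layout can be modeled as nested key-value maps: a table maps a row key to column families, each holding dynamic column-to-value cells. This toy in-memory sketch (with made-up names) shows the dynamic data model; real HBase additionally versions cells by timestamp and distributes table regions across the cluster.

```python
# Toy in-memory model of HBase's key-value, column-oriented layout.
table = {}  # row key -> {column family -> {column qualifier -> value}}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def get(row, family, qualifier):
    return table.get(row, {}).get(family, {}).get(qualifier)

put("user1", "info", "name", "Ada")
put("user1", "info", "email", "ada@example.com")  # columns are dynamic
print(get("user1", "info", "name"))  # Ada
```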


Apache Pig

A high-level data-flow language and execution framework for parallel computation

  • High level programming on top of Hadoop MapReduce
  • The language: Pig Latin.
  • Data analysis problems as data flows
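Pig Latin expresses an analysis as a sequence of data-flow steps (LOAD, FILTER, GROUP, FOREACH ... GENERATE). A rough Python analogue of such a pipeline, over made-up sample data:

```python
# Data-flow pipeline in the spirit of Pig Latin (not Pig itself).
from collections import Counter

# LOAD: raw (user, action) records
logs = [("ada", "click"), ("bob", "view"), ("ada", "click"), ("bob", "click")]

# FILTER: keep only click events
clicks = [(user, action) for user, action in logs if action == "click"]

# GROUP BY user / GENERATE COUNT: clicks per user
clicks_per_user = Counter(user for user, _ in clicks)
print(dict(clicks_per_user))  # {'ada': 2, 'bob': 1}
```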

Apache Hive

A data warehouse infrastructure that provides data
summarization and ad hoc querying

  • Data warehouse software that facilitates querying and managing large datasets residing in distributed storage (HDFS)
  • Provides a mechanism to project structure onto this data
  • Queries use a SQL-like language called HiveQL
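Because HiveQL is close to standard SQL, its flavor can be illustrated with Python's built-in sqlite3. The table and query below are made up; real Hive would run a query like this over files in HDFS, compiling it down to MapReduce or another execution engine.

```python
# An ad hoc aggregation query of the kind HiveQL supports,
# demonstrated via sqlite3 since HiveQL is SQL-like.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, views INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?)",
                 [("home", 10), ("about", 3), ("home", 5)])

rows = conn.execute(
    "SELECT page, SUM(views) FROM page_views GROUP BY page ORDER BY page"
).fetchall()
print(rows)  # [('about', 3), ('home', 15)]
```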


Apache Oozie

  • Workflow scheduler system to manage Apache Hadoop jobs
  • Oozie Coordinator jobs trigger recurring workflows by time or data availability
  • Supports MapReduce, Pig, Apache Hive, Sqoop, etc.


Apache ZooKeeper

  • Provides operational services for a Hadoop cluster
  • Centralized service for maintaining configuration information and naming
  • Provides distributed synchronization and group services


Apache Flume

  • Distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data


Cloudera Impala

Cloudera’s open source massively parallel processing (MPP) SQL query engine for Apache Hadoop


Apache Spark

Apache Spark™ is a fast and general engine for large-scale data processing

  • Multi-stage in-memory primitives provide performance up to 100 times faster for certain applications
  • Allows user programs to load data into a cluster’s memory and query it repeatedly
  • Well-suited to machine learning
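The source of Spark's speedup can be sketched without the Spark API: load a dataset into memory once and run many queries against it, rather than re-reading from disk on every pass as chained MapReduce jobs do. This is plain Python over a made-up dataset, illustrating the idea of caching and repeated querying.

```python
# Sketch of Spark's load-once, query-repeatedly pattern (not Spark code).
data = list(range(1_000_000))  # dataset "cached" in memory once

# Each query reuses the same in-memory dataset instead of rescanning disk
total = sum(data)
evens = sum(1 for x in data if x % 2 == 0)
maximum = max(data)
print(total, evens, maximum)
```

Iterative machine-learning algorithms benefit most from this pattern, since they scan the same training data many times.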