
HDFS Architecture

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS can store huge amounts of data, scale up incrementally, and survive the failure of significant parts of the storage infrastructure without losing data. Hadoop creates clusters of machines and coordinates work among them; clusters can be built with inexpensive computers.

MapReduce Architecture

The processing pillar of the Hadoop ecosystem is the MapReduce framework. The framework lets you specify an operation to apply to a huge data set; it then divides the problem and the data and runs the work in parallel. From an analyst's point of view, this can occur across multiple dimensions. The outputs of these jobs can be written back to HDFS or loaded into a traditional data warehouse. MapReduce defines two functions:

  • map – takes key/value pairs as input and generates an intermediate set of key/value pairs
  • reduce – merges all the intermediate values associated with the same intermediate key
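The two functions above can be sketched in plain Python. This is a toy single-process simulation of the model (the function names and the `run_mapreduce` driver are invented for illustration, not part of Hadoop's API), counting words as the classic example:

```python
from collections import defaultdict

# Map: for each input (key, value) pair, emit intermediate (word, 1) pairs.
def map_fn(key, value):
    for word in value.split():
        yield (word, 1)

# Reduce: merge all intermediate values that share the same intermediate key.
def reduce_fn(key, values):
    return (key, sum(values))

def run_mapreduce(records):
    # Map phase, plus the implicit "shuffle": group values by intermediate key.
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

counts = run_mapreduce([(0, "the quick brown fox"), (1, "the lazy dog")])
# counts["the"] == 2, counts["fox"] == 1
```

In real Hadoop the map and reduce calls run on different machines and the grouping step is a distributed shuffle, but the contract is the same: map emits intermediate pairs, reduce merges the values for each key.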

Hadoop is an open-source software framework for storing and processing large datasets on clusters of commodity hardware. Its runtime environment has five main building blocks (from bottom to top):

    • The cluster is the set of host machines (nodes). Nodes may be partitioned into racks. This is the hardware part of the infrastructure.
    • The YARN infrastructure (Yet Another Resource Negotiator) is the framework responsible for providing the computational resources (e.g., CPUs and memory) needed for application execution. Two of its elements are important:
      The Resource Manager (one per cluster) is the master. It knows where the slaves are located (rack awareness) and how many resources they have. It runs several services, the most important of which is the Resource Scheduler, which decides how to assign resources.
      The Node Manager (many per cluster) is the slave of the infrastructure. When it starts, it announces itself to the Resource Manager, and it periodically sends a heartbeat to the Resource Manager. Each Node Manager offers some resources to the cluster; its resource capacity is an amount of memory and a number of vcores. At run time, the Resource Scheduler decides how to use this capacity: a Container is a fraction of the Node Manager's capacity, and it is used by the client to run a program.
    • The HDFS Federation is the framework responsible for providing permanent, reliable, distributed storage. It is typically used for storing inputs and outputs (but not intermediate results).
    • Alternative storage solutions can also be used; for instance, Amazon uses the Simple Storage Service (S3).
    • The MapReduce framework is the software layer implementing the MapReduce paradigm.
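The idea that a Container is a fraction of a Node Manager's capacity can be sketched as follows. This is a toy model (the class and method names are invented, not YARN's API) where an allocation succeeds only if it fits within the node's remaining memory and vcores:

```python
# Toy sketch of a Node Manager's announced capacity and how Containers
# are carved out of it. Not YARN's actual API; names are illustrative.
class NodeManager:
    def __init__(self, name, memory_mb, vcores):
        self.name = name
        self.free_memory = memory_mb   # capacity announced to the Resource Manager
        self.free_vcores = vcores

    def allocate(self, memory_mb, vcores):
        # A Container is a fraction of this node's capacity; refuse requests
        # that exceed what remains.
        if memory_mb <= self.free_memory and vcores <= self.free_vcores:
            self.free_memory -= memory_mb
            self.free_vcores -= vcores
            return {"node": self.name, "memory_mb": memory_mb, "vcores": vcores}
        return None

nm = NodeManager("node-1", memory_mb=8192, vcores=8)
c1 = nm.allocate(2048, 2)   # granted: fits in the free capacity
c2 = nm.allocate(16384, 1)  # refused: exceeds the remaining memory
```

In real YARN the Resource Scheduler inside the Resource Manager makes this decision globally across all Node Managers, using the capacities and heartbeats they report.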

The YARN infrastructure and the HDFS federation are completely decoupled and independent: the former provides resources for running applications, while the latter provides storage. The MapReduce framework is only one of many possible frameworks that can run on top of YARN (although it is currently the only one implemented).

YARN: Application Startup

In YARN, there are at least three actors:

  • the Job Submitter (the client)
  • the Resource Manager (the master)
  • the Node Manager (the slave)

The application startup process is the following:

  1. a client submits an application to the Resource Manager
  2. the Resource Manager allocates a container
  3. the Resource Manager contacts the related Node Manager
  4. the Node Manager launches the container
  5. the Container executes the Application Master
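The five steps above can be sketched as a toy simulation. The class and method names here are invented for illustration (they are not the YARN API); each step appends to a log so the sequence is visible:

```python
# Toy simulation of YARN application startup. Names are illustrative only.
log = []

class NodeManagerSim:
    def launch_container(self, app):
        log.append("NM launches container")                            # step 4
        log.append(f"container executes ApplicationMaster for {app}")  # step 5

class ResourceManagerSim:
    def __init__(self, node_manager):
        self.nm = node_manager

    def submit_application(self, app):
        log.append(f"client submits {app}")         # step 1
        log.append("RM allocates a container")      # step 2
        log.append("RM contacts the Node Manager")  # step 3
        self.nm.launch_container(app)               # steps 4-5

rm = ResourceManagerSim(NodeManagerSim())
rm.submit_application("wordcount")
# log now holds the five steps in order
```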

The Application Master is responsible for the execution of a single application. It requests containers from the Resource Scheduler (part of the Resource Manager) and executes specific programs (e.g., the main of a Java class) in the containers it obtains. The Application Master knows the application logic and is therefore framework-specific; the MapReduce framework provides its own implementation of an Application Master.

The Resource Manager is a single point of failure in YARN. By delegating to Application Masters, YARN spreads the metadata related to running applications over the cluster. This reduces the load on the Resource Manager and lets it recover quickly.


Uddeshya Rana

