The Hadoop Distributed File System
Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler
Yahoo!, Sunnyvale, California USA
{Shv, Hairong, SRadia, Chansler}@Yahoo-Inc.com
Presenter: Alex Hu

- HDFS Introduction
- Architecture
- File I/O Operations and Replica Management
- Practice at Yahoo!
- Future Work
- Critiques and Discussion

Introduction and Related Work
What is Hadoop?
- Provides a distributed file system and a framework
- For the analysis and transformation of very large data sets
- Uses MapReduce

Introduction (cont.)
What is the Hadoop Distributed File System (HDFS)?
- The file system component of Hadoop
- Stores metadata on a dedicated server, the NameNode
- Stores application data on other servers, the DataNodes
- Uses TCP-based protocols
- Replicates data for reliability and durability
- Replication also multiplies data transfer bandwidth

Architecture
- NameNode
- DataNodes
- HDFS Client
- Image and Journal
- CheckpointNode
- BackupNode
- Upgrades, File System Snapshots

Architecture Overview

NameNode – one per cluster
- Maintains the HDFS namespace, a hierarchy of files and directories represented by inodes
- Maintains the mapping of file blocks to DataNodes
  - Read: ask the NameNode for the locations
  - Write: ask the NameNode to nominate DataNodes
- Image and Journal
- Checkpoint: native files store a persistent record of the image (block locations are not persisted)
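The split described above can be pictured with two in-memory maps. This is a toy sketch with hypothetical structures, not the real Hadoop data model: the namespace (path to inode) is checkpointed, while the block map (block to DataNodes) is rebuilt from DataNode block reports at startup.

```python
# Toy sketch of the NameNode's two in-memory maps (hypothetical structure).
namespace = {
    "/user/alex/data.txt": {"blocks": ["blk_1", "blk_2"]},
}
block_map = {
    # Reported by DataNodes at startup; never written to the checkpoint.
    "blk_1": ["dn3", "dn7", "dn9"],
    "blk_2": ["dn1", "dn3", "dn8"],
}

def locate(path):
    # A read: the client asks the NameNode where each block of a file lives,
    # then contacts the DataNodes directly.
    inode = namespace[path]
    return [(b, block_map[b]) for b in inode["blocks"]]

print(locate("/user/alex/data.txt"))
```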

DataNodes
- Two files represent a block replica on a DataNode:
  - The data itself – length flexible
  - Metadata: checksums and the generation stamp
- Handshake when connecting to the NameNode:
  - Verify the namespace ID and software version
  - A new DataNode receives the namespace ID when it joins
- Register with the NameNode:
  - The storage ID is a unique internal identifier
  - The storage ID is assigned once and never changes

DataNodes (cont.) – control
- Block report: identifies the block replicas a DataNode holds
  - Block ID, generation stamp, and length
  - Sent first upon registration, then every hour
- Heartbeats: messages indicating availability
  - Default interval is three seconds
  - A DataNode is considered "dead" if no heartbeat is received for 10 minutes
  - Carry information for space allocation and load balancing:
    - Storage capacity
    - Fraction of storage in use
    - Number of data transfers currently in progress
  - The NameNode replies with instructions to the DataNode
  - Kept frequent for scalability
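The liveness rule above can be sketched in a few lines. Class and field names are hypothetical; the real Hadoop implementation differs, but the 3-second interval and 10-minute timeout are the defaults named on the slide.

```python
import time

HEARTBEAT_INTERVAL_S = 3     # DataNodes report every 3 seconds by default
DEAD_TIMEOUT_S = 10 * 60     # no heartbeat for 10 minutes => "dead"

class DataNodeStatus:
    def __init__(self, storage_id, last_heartbeat):
        self.storage_id = storage_id
        self.last_heartbeat = last_heartbeat   # seconds since the epoch

    def is_dead(self, now):
        # The NameNode never calls DataNodes directly; it judges liveness
        # purely from the time of the last received heartbeat.
        return now - self.last_heartbeat > DEAD_TIMEOUT_S

now = time.time()
alive = DataNodeStatus("DS-1", now - 5)        # heartbeat 5 s ago
dead = DataNodeStatus("DS-2", now - 11 * 60)   # silent for 11 minutes
print(alive.is_dead(now), dead.is_dead(now))   # False True
```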

HDFS Client
A code library that exports the HDFS interface
- Read a file:
  - Ask the NameNode for the list of DataNodes hosting replicas of the blocks
  - Contact a DataNode directly and request the transfer
- Write a file:
  - Ask the NameNode to choose DataNodes to host replicas of the first block of the file
  - Organize a pipeline and send the data
  - Repeat for each subsequent block
- Delete a file, create/delete directories
- Various APIs:
  - Schedule tasks to where the data are located
  - Set the replication factor (number of replicas)
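The write path above can be sketched as a toy pipeline (hypothetical names, not the real Hadoop API): the client asks the "NameNode" for DataNodes, then the data flows client → DN1 → DN2 → DN3, each node storing a copy and forwarding downstream.

```python
class ToyNameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes

    def allocate_block(self, replication=3):
        # Real HDFS applies its rack-aware placement policy here;
        # this sketch simply takes the first `replication` nodes.
        return self.datanodes[:replication]

class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def receive(self, block_id, data, downstream):
        self.blocks[block_id] = data              # store a local replica
        if downstream:                            # forward along the pipeline
            downstream[0].receive(block_id, data, downstream[1:])

def write_block(namenode, block_id, data):
    pipeline = namenode.allocate_block()
    pipeline[0].receive(block_id, data, pipeline[1:])
    return [dn.name for dn in pipeline]

dns = [ToyDataNode(f"dn{i}") for i in range(4)]
nn = ToyNameNode(dns)
print(write_block(nn, "blk_1", b"hello"))   # data lands on the first 3 nodes
```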

HDFS Client (cont.)

Image and Journal
- Image: metadata describing the organization of the namespace
  - Its persistent record is called a checkpoint
  - A checkpoint is never changed in place; it can only be replaced
- Journal: a log of changes, for persistence
  - Flushed and synced before a change is committed
- Both are stored in multiple places to prevent loss
  - The NameNode shuts down if no storage location is available
- Bottleneck: threads wait for flush-and-sync
  - Solution: batch multiple transactions into one flush
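The batching fix can be sketched as follows (hypothetical names; a stand-in counter replaces the real fsync): many threads append journal records, but one sync commits every record appended so far, so later waiters find their transaction already durable and skip the expensive flush.

```python
class BatchedJournal:
    def __init__(self):
        self.buffer = []
        self.synced_txid = 0      # highest transaction known durable
        self.next_txid = 0
        self.sync_count = 0       # stands in for real fsync() calls

    def log_edit(self, record):
        self.next_txid += 1
        self.buffer.append((self.next_txid, record))
        return self.next_txid

    def sync_upto(self, txid):
        # If a concurrent sync already covered this transaction,
        # skip the flush-and-sync entirely.
        if txid <= self.synced_txid:
            return
        self.sync_count += 1              # one flush commits the whole batch
        self.synced_txid = self.next_txid
        self.buffer.clear()

j = BatchedJournal()
t1 = j.log_edit("mkdir /a")
t2 = j.log_edit("create /a/f")
j.sync_upto(t2)          # one sync makes both edits durable
j.sync_upto(t1)          # already durable: no extra sync
print(j.sync_count)      # 1
```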

CheckpointNode
- A CheckpointNode is a NameNode that runs on a different host
- Creates a new checkpoint:
  - Download the current checkpoint and journal
  - Merge them
  - Create a new checkpoint and return it to the NameNode
  - The NameNode then truncates the tail of the journal
- Challenge: a large journal makes restart slow
  - Solution: create a daily checkpoint
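The merge step can be pictured with a toy model (hypothetical representation; real HDFS images and edit logs are far richer): the new checkpoint is the old image with every journal edit applied, after which the journal can be truncated.

```python
def make_checkpoint(image, journal):
    # image: {path: inode}; journal: list of (op, path) edits.
    new_image = dict(image)
    for op, path in journal:
        if op == "create":
            new_image[path] = {}
        elif op == "delete":
            new_image.pop(path, None)
    return new_image, []     # new checkpoint, truncated journal

image = {"/a": {}}
journal = [("create", "/a/f"), ("delete", "/a")]
ckpt, journal = make_checkpoint(image, journal)
print(sorted(ckpt), journal)   # ['/a/f'] []
```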

BackupNode
- A recent feature, similar to the CheckpointNode
- Maintains an in-memory, up-to-date image
  - Creates checkpoints without downloading; the journal is streamed to it
- Can be viewed as a read-only NameNode:
  - Holds all metadata information except block locations
  - Accepts no modifications

Upgrades, File System Snapshots
- Minimize damage to data during upgrades
- Only one snapshot can exist at a time
- NameNode:
  - Merge the current checkpoint and journal in memory
  - Write a new checkpoint and empty journal in a new location
  - Instruct the DataNodes to create a local snapshot
- DataNode:
  - Create a copy of the storage directory
  - Hard-link the existing block files

Upgrades, File System Snapshots – Rollback
- The NameNode recovers the checkpoint
- Each DataNode restores its directory and deletes replicas created after the snapshot
- The layout version, stored on both the NameNode and DataNodes:
  - Identifies the data representation formats
  - Prevents inconsistent formats
- Snapshot creation is an all-cluster effort
  - Prevents data loss

File I/O Operations and Replica Management
- File Read and Write
- Block Placement and Replication Management
- Other Features

File Read and Write
- Checksums:
  - Read by the HDFS client to detect any corruption
  - DataNodes store checksums in a place separate from the data
  - Shipped to the client when performing an HDFS read
  - Clients verify the checksums
- Choose the closest replica to read
- A read can fail due to:
  - An unavailable DataNode
  - A replica of the block no longer being hosted
  - A corrupted replica
- Read while writing: ask for the latest length
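Client-side checksum verification can be sketched as follows. Real HDFS uses CRC32 over fixed-size chunks of each block; a single hash per block stands in for that here, and all names are hypothetical.

```python
import hashlib

def store_block(data):
    # The DataNode keeps the checksum in a separate metadata file.
    return {"data": data, "checksum": hashlib.sha256(data).hexdigest()}

def read_block(replica):
    # The checksum ships to the client with the data; the client verifies it
    # and, on mismatch, would retry against a different replica.
    data = replica["data"]
    if hashlib.sha256(data).hexdigest() != replica["checksum"]:
        raise IOError("corrupted replica: try another DataNode")
    return data

good = store_block(b"block contents")
print(read_block(good) == b"block contents")   # True

bad = store_block(b"block contents")
bad["data"] = b"bit-rotted block"              # simulate on-disk corruption
try:
    read_block(bad)
except IOError as e:
    print("detected:", e)
```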

File Read and Write (cont.)
- New data can only be appended
- Single writer, multiple readers
- Lease:
  - Whoever opens a file for writing is granted a lease
  - Renewed by heartbeats and revoked when the file is closed
  - Soft limit and hard limit
  - Many readers are still allowed to read
- Optimized for sequential reads and writes
  - Can be improved:
    - Scribe: provides real-time data streaming
    - HBase: provides random, real-time access to large tables
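The two lease limits can be sketched as a small state function. The constants are assumptions (Hadoop's defaults are commonly a one-minute soft limit and a one-hour hard limit; check the implementation before relying on them): before the soft limit the writer has exclusive access, after it another client may preempt the lease, and after the hard limit HDFS recovers the lease and closes the file on the writer's behalf.

```python
SOFT_LIMIT_S = 60      # assumed soft limit: 1 minute
HARD_LIMIT_S = 3600    # assumed hard limit: 1 hour

def lease_state(seconds_since_renewal):
    if seconds_since_renewal < SOFT_LIMIT_S:
        return "exclusive"     # only the lease holder may write
    if seconds_since_renewal < HARD_LIMIT_S:
        return "preemptible"   # another client may take over the lease
    return "expired"           # HDFS recovers the lease, closes the file

print(lease_state(10), lease_state(600), lease_state(4000))
```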

Add Block and hflush
- Each block has a unique block ID
- After performing a write operation, the new data is not guaranteed to be visible to readers
- hflush: makes the data written so far visible to all readers

Block Placement
- Not practical to connect all nodes in a flat topology
- Nodes are spread across multiple racks:
  - Communication between racks has to go through multiple switches
  - Inter-rack bandwidth is lower than intra-rack bandwidth
  - Shorter distance, greater bandwidth
- The NameNode decides the rack of a DataNode
  - Via a configured script

Replica Placement Policy
- Improves data reliability, availability, and network bandwidth utilization
- Minimizes write cost
- Reduces inter-rack and inter-node writes
- Rule 1: No DataNode contains more than one replica of any block
- Rule 2: No rack contains more than two replicas of the same block, provided there are sufficient racks in the cluster
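The two rules above can be checked mechanically. This sketch uses a hypothetical representation of a placement as a list of (node, rack) pairs; it validates a placement rather than generating one.

```python
from collections import Counter

def valid_placement(placement, num_racks):
    # placement: list of (node, rack) pairs, one per replica.
    nodes = Counter(node for node, _ in placement)
    racks = Counter(rack for _, rack in placement)
    rule1 = all(c <= 1 for c in nodes.values())        # one replica per node
    # Rule 2 only applies when the cluster has enough racks.
    rule2 = num_racks < 2 or all(c <= 2 for c in racks.values())
    return rule1 and rule2

# Typical 3-replica layout: one on the writer's rack, two on a remote rack.
ok  = [("dn1", "r1"), ("dn2", "r2"), ("dn3", "r2")]
bad = [("dn1", "r1"), ("dn2", "r1"), ("dn3", "r1")]    # all on one rack
print(valid_placement(ok, num_racks=2), valid_placement(bad, num_racks=2))
```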

Replication Management
- Under- and over-replication are detected by the NameNode
- Under-replicated blocks:
  - Placed in a priority queue (a block with only one replica has the highest priority)
  - New replicas are placed following a policy similar to the replica placement policy
- Over-replicated blocks:
  - Remove a replica
  - Without reducing the number of racks that host replicas
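The under-replication priority queue can be sketched with a heap (hypothetical names; the real NameNode uses several priority levels): blocks with the fewest live replicas are re-replicated first, and fully replicated blocks are never queued.

```python
import heapq

def replication_queue(blocks, target=3):
    # blocks: {block_id: live_replica_count}
    heap = [(count, block_id) for block_id, count in blocks.items()
            if count < target]
    heapq.heapify(heap)                    # fewest replicas first
    return [block_id for _, block_id in
            [heapq.heappop(heap) for _ in range(len(heap))]]

blocks = {"blk_a": 2, "blk_b": 1, "blk_c": 3, "blk_d": 2}
print(replication_queue(blocks))  # blk_b first; blk_c is fully replicated
```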

Other Features
- Balancer:
  - Balances disk space usage across DataNodes
  - Its bandwidth consumption is controllable
- Block Scanner:
  - Periodically verifies the replicas
  - A corrupted replica is not deleted immediately
- Decommissioning:
  - Include and exclude lists
  - The lists are re-evaluated
  - A decommissioning DataNode is removed only after all blocks on it are replicated elsewhere
- Inter-Cluster Data Copy:
  - DistCp – a MapReduce job
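The Balancer's criterion can be sketched as a simple check (hypothetical names; the real tool takes the threshold as a command-line parameter): a cluster counts as balanced when every node's disk utilization is within a threshold of the cluster-wide utilization.

```python
def is_balanced(node_utilizations, threshold=0.10):
    # node_utilizations: fraction of each node's capacity in use (0..1).
    cluster = sum(node_utilizations) / len(node_utilizations)
    return all(abs(u - cluster) <= threshold for u in node_utilizations)

print(is_balanced([0.50, 0.55, 0.45]))   # True: all within 10% of 0.50
print(is_balanced([0.90, 0.20, 0.40]))   # False: badly skewed
```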

Practice at Yahoo!
- 3500 nodes and 9.8 PB of storage available
- Durability of data:
  - Uncorrelated node failures:
    - Chance of a node failing each month: 0.8%
    - Chance of losing a block during one year: 0.5%
  - Correlated node failures:
    - Failure of a rack or switch
    - Loss of electrical power
- Caring for the commons:
  - Permissions – modeled on UNIX
  - Total space available

Benchmarks
- DFSIO benchmark:
  - DFSIO Read: 66 MB/s per node
  - DFSIO Write: 40 MB/s per node
- Production cluster:
  - Busy cluster read: 1.02 MB/s per node
  - Busy cluster write: 1.09 MB/s per node
- Sort benchmark
- Operation benchmark

Future Work
- Automated failover solution
  - ZooKeeper
- Scalability:
  - Multiple namespaces sharing the physical storage
  - Advantages:
    - Isolates namespaces
    - Improves overall availability
    - Generalizes the block storage abstraction
  - Drawback: cost of management
  - Job-centric namespaces rather than cluster-centric

Critiques and Discussion
Pros:
- Architecture: the NameNode, DataNodes, and powerful supporting features provide a rich set of operations, detect corrupted replicas, balance disk space usage, and provide consistency.
- HDFS is easy to use: users don't have to worry about the different servers. It can be used like a local file system and provides the usual operations.
- The benchmarks are sufficient: they use real data with a large number of nodes and a large amount of storage, covering several kinds of experiments.
Cons:
- Fault tolerance is not very sophisticated. All the recovery mechanisms introduced assume the NameNode is alive; the paper offers no proper solution for handling NameNode failure.
- Scalability, especially the handling of heartbeat replies with instructions: if too many messages come in, the NameNode's performance is not properly measured in this paper.
- No test of correlated failures is provided, so we cannot learn how HDFS performs after a correlated failure is encountered.

Thank you very much