Transcription

Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.

Abstract

Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

1 Introduction

Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.

In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.

Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made to improve Bigtable's performance. Section 7 provides measurements of Bigtable's performance. We describe several examples of how Bigtable is used at Google in Section 8, and discuss some lessons we learned in designing and supporting Bigtable in Section 9.
Finally, Section 10 describes related work, and Section 11 presents our conclusions.

2 Data Model

A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.

(row:string, column:string, time:int64) → string
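Purely as an illustration of this logical model (ignoring the sparseness, distribution, and persistence of the real system), the map can be pictured as nested ordered maps from row key to column key to timestamped values. The sketch below is our own, not Bigtable's implementation, and the sample cells are taken from Figure 1.

// A minimal sketch of the *logical* data model only: a sorted map from
// (row, column, timestamp) to an uninterpreted byte string. Bigtable's
// actual implementation is distributed and looks nothing like this.
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Timestamps are 64-bit integers; versions are read most-recent-first,
// so order them by decreasing timestamp.
using Versions = std::map<int64_t, std::string, std::greater<int64_t>>;
// Column key ("family:qualifier") -> versioned values.
using Columns = std::map<std::string, Versions>;
// Row key -> columns; rows are kept in lexicographic order.
using Webtable = std::map<std::string, Columns>;

int main() {
  Webtable table;
  // Store an anchor cell and a contents cell for CNN's home page,
  // using the reversed-URL row key from Figure 1 (timestamps are
  // placeholder values for the example).
  table["com.cnn.www"]["anchor:cnnsi.com"][9] = "CNN";
  table["com.cnn.www"]["contents:"][6] = "<html>...";
  return 0;
}

Because the real map is sparse, rows that never store a value under a given column key consume no space for it.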

"contents:""com.cnn.www""anchor:cnnsi.com"" html ."t3" html ."t5" html ."t6"CNN""anchor:my.look.ca"t9"CNN.com"t8Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home pageis referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.comand anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t 3 , t5 , and t6 .We settled on this data model after examining a varietyof potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions,suppose we want to keep a copy of a large collection ofweb pages and related information that could be used bymany different projects; let us call this particular tablethe Webtable. In Webtable, we would use URLs as rowkeys, various aspects of web pages as column names, andstore the contents of the web pages in the contents: column under the timestamps when they were fetched, asillustrated in Figure 1.RowsThe row keys in a table are arbitrary strings (currently upto 64KB in size, although 10-100 bytes is a typical sizefor most of our users). Every read or write of data undera single row key is atomic (regardless of the number ofdifferent columns being read or written in the row), adesign decision that makes it easier for clients to reasonabout the system’s behavior in the presence of concurrentupdates to the same row.Bigtable maintains data in lexicographic order by rowkey. The row range for a table is dynamically partitioned.Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of shortrow ranges are efficient and typically require communication with only a small number of machines. Clientscan exploit this property by selecting their row keys sothat they get good locality for their data accesses. Forexample, in Webtable, pages in the same domain aregrouped together into contiguous rows by reversing thehostname components of the URLs. For example, westore data for maps.google.com/index.html under thekey com.google.maps/index.html. Storing pages fromthe same domain near each other makes some host anddomain analyses more efficient.206Column FamiliesColumn keys are grouped into sets called column families, which form the basic unit of access control. All datastored in a column family is usually of the same type (wecompress data in the same column family together). Acolumn family must be created before data can be storedunder any column key in that family; after a family hasbeen created, any column key within the family can beused. It is our intent that the number of distinct columnfamilies in a table be small (in the hundreds at most), andthat families rarely change during operation. In contrast,a table may have an unbounded number of columns.A column key is named using the following syntax:family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, whichstores the language in which a web page was written. Weuse only one column key in the language family, and itstores each web page’s language ID. Another useful column family for this table is anchor; each column key inthis family represents a single anchor, as shown in Figure 1. 
Column Families

Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.

A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page's language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the cell contents is the link text.

Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons).

Timestamps

Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent "real time" in microseconds, or be explicitly assigned by client applications.

Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.

To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days).

In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.

3 API

The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.

Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows C++ code that uses a RowMutation abstraction to perform a series of updates. (Irrelevant details were elided to keep the example short.) The call to Apply performs an atomic mutation to the Webtable: it adds one anchor to www.cnn.com and deletes a different anchor.

// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Figure 2: Writing to Bigtable.

Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row. Clients can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce anchors whose columns match the regular expression anchor:*.cnn.com, or to only produce anchors whose timestamps fall within ten days of the current time.

Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

Figure 3: Reading from Bigtable.

Bigtable supports several other features that allow the user to manipulate data in more complex ways. First, Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in a language developed at Google for processing data called Sawzall [28]. At the moment, our Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.
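To make the scan restrictions mentioned above concrete, the sketch below extends Figure 3 with illustrative calls. The restriction-setting methods (SetColumnFilter, SetMinMicroTimestamp) and the NowMicros helper are hypothetical stand-ins; the paper does not name the actual interface for limiting a scan by column pattern or timestamp.

// A sketch only: restrict the Figure 3 scan to anchors whose column keys
// match anchor:*.cnn.com and whose timestamps fall within the last ten
// days. SetColumnFilter, SetMinMicroTimestamp, and NowMicros are
// hypothetical names, not part of the published Bigtable API.
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetColumnFilter("anchor:*.cnn.com");              // hypothetical
int64 ten_days_us = 10LL * 24 * 60 * 60 * 1000000;
stream->SetMinMicroTimestamp(NowMicros() - ten_days_us);  // hypothetical
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s\n", stream->ColumnName(), stream->Value());
}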
Bigtable can be used with MapReduce [12], a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.

4 Building Blocks

Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) [17] to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.

The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then reading the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.

Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8]. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm [9, 23] to keep its replicas consistent in the face of failure.
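The single-seek SSTable lookup described above (a binary search over the in-memory block index followed by one block read) can be sketched as follows. The structures and function names here are our own illustration of the idea, not the actual SSTable implementation.

// Illustration only: invented structures and names showing the lookup path
// described in the text (binary-search the in-memory block index, then read
// a single block from disk). This is not the real SSTable code.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct BlockIndexEntry {
  std::string last_key;  // largest key stored in the block
  uint64_t offset;       // block offset within the SSTable file
  uint64_t length;       // block length (typically around 64KB)
};

// Stand-in: a real SSTable would read [offset, offset + length) from disk.
std::string ReadBlockFromDisk(uint64_t offset, uint64_t length) {
  (void)offset; (void)length;
  return std::string();
}

// Stand-in: a real block would be decoded and searched for the key.
bool FindKeyInBlock(const std::string& block, const std::string& key,
                    std::string* value) {
  (void)block; (void)key; (void)value;
  return false;
}

bool SSTableLookup(const std::vector<BlockIndexEntry>& index,  // in memory
                   const std::string& key, std::string* value) {
  // Binary search: first block whose last key is >= the lookup key.
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const BlockIndexEntry& e, const std::string& k) { return e.last_key < k; });
  if (it == index.end()) return false;  // key is past the last block
  std::string block = ReadBlockFromDisk(it->offset, it->length);  // one seek
  return FindKeyInBlock(block, key, value);
}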