
Design Implications for Enterprise Storage Systems via Multi-Dimensional Trace Analysis

Yanpei Chen, Kiran Srinivasan, Garth Goodson, Randy Katz
University of California, Berkeley, NetApp Inc.
{ychen2, randy}@eecs.berkeley.edu, {skiran, goodson}@netapp.com

ABSTRACT
Enterprise storage systems are facing enormous challenges due to increasing growth and heterogeneity of the data stored. Designing future storage systems requires comprehensive insights that existing trace analysis methods are ill-equipped to supply. In this paper, we seek to provide such insights by using a new methodology that leverages an objective, multi-dimensional statistical technique to extract data access patterns from network storage system traces. We apply our method on two large-scale real-world production network storage system traces to obtain comprehensive access patterns and design insights at user, application, file, and directory levels. We derive simple, easily implementable, threshold-based design optimizations that enable efficient data placement and capacity optimization strategies for servers, consolidation policies for clients, and improved caching performance for both.

Categories and Subject Descriptors
C.4 [Performance of Systems]: Measurement techniques; D.4.3 [Operating Systems]: File Systems Management—Distributed file systems

1. INTRODUCTION
Enterprise storage systems are designed around a set of data access patterns. The storage system can be specialized by designing to a specific data access pattern; e.g., a storage system for streaming video supports different access patterns than a document repository. The better the access pattern is understood, the better the storage system design. Insights into access patterns have been derived from the analysis of existing file system workloads, typically through trace analysis studies [1, 3, 17, 19, 24].
While this is the correct general strategy for improving storage system design, past approaches have critical shortcomings, especially given recent changes in technology trends. In this paper, we present a new design methodology to overcome these shortcomings.

The data stored on enterprise network-attached storage systems is undergoing changes due to a fundamental shift in the underlying technology trends. We have observed three such trends:
- Scale: Data size grows at an alarming rate [12], due to new types of social, business and scientific applications [20], and the desire to “never delete” data.
- Heterogeneity: The mix of data types stored on these storage systems is becoming increasingly complex, each having its own requirements and access patterns [22].
- Consolidation: Virtualization has enabled the consolidation of multiple applications and their data onto fewer storage servers [6, 23]. These virtual machines (VMs) also present aggregate data access patterns more complex than those from individual clients.

Better design of future storage systems requires insights into the changing access patterns due to these trends. While trace studies have been used to derive data access patterns, we believe that they have the following shortcomings:
- Unidimensional: Although existing methods analyze many access characteristics, they do so one at a time, without revealing cross-characteristic dependencies.
- Expertise bias: Past analyses were performed by storage system designers looking for specific patterns based on prior workload expectations. This introduces a bias that needs to be revisited based on the new technology trends.
- Storage server centric: Past file system studies focused primarily on storage servers.
This creates a critical knowledge gap regarding client behavior.

To overcome these shortcomings, we propose a new design methodology backed by the analysis of storage system traces. We present a method that simultaneously analyzes multiple characteristics and their cross dependencies. We use a multi-dimensional, statistical correlation technique, called k-means [2], that is completely agnostic to the characteristics of each access pattern and their dependencies. The k-means algorithm can analyze hundreds of dimensions simultaneously, providing added objectivity to our analysis. To further reduce expertise bias, we involve as many relevant characteristics as possible for each access pattern. In addition, we analyze patterns at different granularities (e.g., at the user session, application, and file levels) on the storage server as well as the client, thus addressing the need for understanding client patterns. The resulting design insights enable policies for building new storage systems.
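
To make the clustering step concrete, the following is a minimal sketch of k-means (Lloyd's algorithm) in plain Python. It is an illustration only: the two per-file features and their values are hypothetical and not taken from the traces, and seeding the centroids with the first k points is a simplification chosen for reproducibility, not a description of the authors' procedure.

```python
def kmeans(points, k, iterations=50):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    # Simplifying assumption: seed the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid
        # (squared Euclidean distance over all feature dimensions).
        labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # converged: centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


# Hypothetical per-file feature vectors (illustrative values only):
# (fraction of sequential reads, fraction of repeated reads)
files = [
    (0.95, 0.02), (0.90, 0.00), (0.88, 0.05),  # mostly sequential, rarely re-read
    (0.10, 0.80), (0.05, 0.90), (0.15, 0.85),  # mostly random, often re-read
]
labels, centroids = kmeans(files, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The same function works unchanged for vectors with many more dimensions, which is the property the methodology relies on. In practice, features measured in different units would likely need to be normalized before clustering so that no single dimension dominates the distance.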

Client side observations and design implications
1. Client sessions with IO sizes >128KB are read only or write only.
   Implication: Clients can consolidate sessions based on only the read-write ratio.
2. Client sessions with duration 8 hours do 10MB of IO.
   Implication: Client caches can already fit an entire day’s IO.
3. Number of client sessions drops off linearly by 20% from Monday to Friday.
   Implication: Servers can get an extra “day” for background tasks by running them at appropriate times during weekdays.
4. Applications with 4KB of IO per file open and many opens of a few files do only random IO.
   Implication: Clients should always cache the first few KB of IO per file per application.
5. Applications with 50% sequential read or write access entire files at a time.
   Implication: Clients can request file prefetch (read) or delegation (write) based on only the IO sequentiality.
6. Engineering applications with 50% sequential read and sequential write are doing code compile tasks, based on file extensions.
   Implication: Servers can identify compile tasks; servers have to cache the output of these tasks.

Server side observations and design implications
7. Files with 70% sequential read or write have no repeated reads or overwrites.
   Implication: Servers should delegate sequentially accessed files to clients to improve IO performance.
8. Engineering files with repeated reads have random accesses.
   Implication: Servers should delegate repeatedly read files to clients; clients need to store them in flash or memory.
9. All files are active (have opens, IO, and metadata access) for only 1-2 hours in a few months.
   Implication: Servers can use file idle time to compress or deduplicate to increase storage capacity.
10. All files have either all random access or 70% sequential access. (Seen in past studies too.)
   Implication: Servers can select the best storage medium for each file based on only access sequentiality.
11. Directories with sequentially accessed files almost always contain randomly accessed files as well.
   Implication: Servers can change from per-directory placement policy (default) to per-file policy upon seeing any sequential IO to any files in a directory.
12. Some directories aggregate only files with repeated reads and overwrites.
   Implication: Servers can delegate these directories entirely to clients, tradeoffs permitting.

Table 1: Summary of design insights, separated into insights derived from client access patterns and server access patterns.

We analyze two recent, network-attached storage file system traces from a production enterprise datacenter. Table 1 summarizes our key observations and design implications; they are detailed later in the paper. Our methodology leads to observations that would be difficult to extract using past methods. We illustrate two such access patterns, one showing the value of multi-granular analysis (Observation 1 in Table 1) and another showing the value of multi-feature analysis (Observation 7).

First, we observe (Observation 1) that sessions with more than 128KB of data reads or writes are either read-only or write-only. This observation affects shared caching and consolidation policies across sessions. Specifically, client OSs can detect and co-locate cache sensitive sessions (read-only) with cache insensitive sessions (write-only) using just one parameter (read-write ratio). This improves cache utilization and consolidation (increased density of sessions per server).

Similarly, we observe (Observation 7) that files with 70% sequential read or sequential write have no repeated reads or overwrites. This access pattern involves four characteristics: read sequentiality, write sequentiality, repeated read behavior, and overwrite behavior. The observation leads to a useful policy: sequentially accessed files do not need to be cached at the server (no repeated reads), which leads to an efficient buffer cache.

These observations illustrate that our methodology can derive unique design implications that leverage the correlation between different characteristics.
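
The single-parameter consolidation policy described for Observation 1 can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the session tuples, and the 5% tolerance used to decide that a session is effectively read-only or write-only are all assumptions introduced here.

```python
def cache_sensitivity(read_bytes, write_bytes, tolerance=0.05):
    """Classify a session using only its read-write ratio.

    `tolerance` is a hypothetical slack for "effectively" read-only or
    write-only sessions; it is not a value from the paper.
    """
    total = read_bytes + write_bytes
    if total == 0:
        return "idle"
    read_fraction = read_bytes / total
    if read_fraction >= 1 - tolerance:
        return "cache-sensitive"    # read-only: benefits from a shared cache
    if read_fraction <= tolerance:
        return "cache-insensitive"  # write-only: little cache benefit
    return "mixed"


def co_locate(sessions):
    """Pair each cache-sensitive session with a cache-insensitive one."""
    sensitive = [name for name, r, w in sessions
                 if cache_sensitivity(r, w) == "cache-sensitive"]
    insensitive = [name for name, r, w in sessions
                   if cache_sensitivity(r, w) == "cache-insensitive"]
    return list(zip(sensitive, insensitive))  # unmatched sessions are left out


# Hypothetical sessions: (name, bytes read, bytes written)
sessions = [
    ("s1", 200_000, 0),
    ("s2", 0, 300_000),
    ("s3", 500_000, 1_000),
    ("s4", 100, 400_000),
]
pairs = co_locate(sessions)
print(pairs)  # [('s1', 's2'), ('s3', 's4')]
```

The point of the sketch is that the decision needs one measured quantity per session, which is what makes a threshold-based policy of this kind cheap to implement on a client.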
To summarize, our contributions are:
- Identify storage system access patterns using a multi-dimensional, statistical analysis technique.
- Build a framework for analyzing traces at different granularity levels at both server and client.
- Analyze our specific traces and present the access patterns identified.
- Derive design implications for various storage system components from the access patterns.

In the rest of the paper, we motivate and describe our analysis methodology (Sections 2 and 3), present the access patterns we found and the design insights (Section 4), provide the implications on storage system architecture (Section 5), and suggest future work (Section 6).

2. MOTIVATION AND BACKGROUND
Past trace-based studies have examined a range of storage system protocols and use cases, delivering valuable insights for designing storage servers. Table 2 summarizes the contributions of past studies. Many studies predate current technology trends. Analysis of real-world, corporate workloads or traces has been sparse, with only three studies among the ones listed [13, 15, 18]. A number of studies have focused on NFS trace analysis only [8, 10]. This focus somewhat neglects systems using the Common Internet File System (CIFS) protocol [5], with only a single CIFS study [15]. CIFS systems are important since CIFS is the network storage protocol for Windows, the dominant OS on commodity platforms. Our work uses the same traces as [15], but we perform analysis using a methodology that extracts multi-dimensional insights at different layers. This methodology is sufficiently different from prior work as to make the analysis findings not comparable. The following subsections discuss the need for this methodology.

2.1 Need for Insights at Different Layers
We divide our view of the storage system into behavior at clients and servers. Storage clients interface directly with users, who create and view content via applications. Separately, servers store the content in a durable and efficient fashion over the network.
Past network storage system trace studies focus mostly on storage servers (Table 2). Storage client behavior is underrepresented primarily due to the reliance on stateless NFS traces. This leaves a knowledge gap about access patterns at storage clients. Specifically, these questions are unanswered:
- Do applications exhibit clear access patterns?
- What are the user-level access patterns?
- Is there any correlation between users and applications?
- Do all applications interact with files the same way?

Study | Date of Traces | Insights/Contributions
Ousterhout, et al. [17] | 1985 | Seminal patterns analysis: Large, sequential read access; limited read-write; bursty I/O; short file lifetimes, etc.
Ramakrishnan, et al. [18] | 1988-89 | Relationship between files and processes - on usage patterns, sharing, etc.
Baker, et al. [3] | 1991 | Analysis of distributed file system; comparison to [17], caching effects.
Gribble, et al. [10] | 1991-97 | Workload self-similarity
Douceur, et al. [7] | 1998 | Analysis of file and directory attributes: size, age, lifetime, directory [...]
Vogels [24] | 1998 | Supported past observations and trends in NTFS
Zhou et al. | | Analysis of personal computer workloads
Roselli, et al. [19] | 1997-00 | Increased block lifetimes, caching strategies
Ellard, et al. [8] | 2001 | NFS peculiarities, pathnames can aid file layout
Agrawal, et al. [1] | 2000-04 | Distribution of file size and type in namespace, change in file contents over time
Leung, et al. [15] | 2007 | File re-open, sharing, activity characteristics; changes compared to previous studies
Kavalanekar, et al. [13] | 2007 | Study of web (live maps, web content, etc.) workloads in servers via events tracing.
This paper | | Section 4

[The File System, N/w FS, and Data Set columns of this table are not recoverable from the extracted text.]

Table 2: Past studies of storage system traces. “Corp” stands for corporate use cases. “Eng” stands for engineering use cases. “Live” implies live requests or events in traces were studied, “Snap” implies snapshots of file systems were studied.

Insights on these access patterns lead to better design of both clients and servers. They enable server capabilities such as per session quality of service (QoS), or per application service level objectives (SLOs). They also inform various consolidation, caching, and prefetching decisions at clients. Each of these access patterns is visible only at a particular semantic layer within the client: users or applications.
We define each such layer as an access unit, with the observed behaviors at each access unit being an access pattern. The analysis of client side access units represents an improvement on prior work.

On the server side, we extend the previous focus on files. We need to also understand how files are grouped within a directory, as well as cross-file dependencies and directory organization. Thus, we perform multi-layer and cross-layer dependency analysis on the server also. This is another improvement on past work.

2.2 Need for Multi-Dimensional Insights
Each access unit has certain inherent characteristics. Characteristics that can be quantified are features of that access unit. For example, for an application, the read size in bytes is a feature; the number of unique files accessed is another. Each feature represents an independent mathematical dimension that describes an access unit. We use the terms dimension, feature, and characteristic interchangeably. The global