Computer Architecture Lab at Carnegie Mellon Technical Report CALCM-TR-2008-001Submitted to ISCA 2009R-NUCA: Data Placement in Distributed Shared CachesNikos Hardavellas1, Michael Ferdman1,2, Babak Falsafi1,2 and Anastasia Ailamaki3,112Computer Architecture Lab (CALCM), Carnegie Mellon University, Pittsburgh, PA, USAParallel Systems Architecture Lab (PARSA), École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland3École Polytechnique Fédérale de Lausanne, Lausanne, SwitzerlandAbstractIncreases in on-chip communication delay and the large workingsets of commercial and scientific workloads complicate the designof the on-chip last-level cache for multicore processors. The largeworking sets favor a shared cache design that maximizes theaggregate cache capacity and minimizes off-chip memory requests.At the same time, the growing on-chip communication delay favorscore-private caches that replicate data to minimize delays onglobal wires. Recent hybrid proposals offer lower average latencythan conventional designs. However, they either address theplacement requirements of only a subset of the data accessed bythe application, require complicated lookup and coherencemechanisms that increase latency, or fail to scale to high corecounts.In this work, we observe that the cache access patterns of a rangeof server and scientific workloads can be classified into distinctcategories, where each class is amenable to different dataplacement policies. Based on this observation, we proposeReactive NUCA (R-NUCA), a distributed shared cache designwhich reacts to the class of each cache access and places blocks atthe appropriate location in the cache. Our design cooperates withthe operating system to support intelligent placement, migration,and replication without the overhead of an explicit coherencemechanism for the on-chip last-level cache. We evaluate R-NUCAon a range of server, scientific and multi-programmed workloadsand find that its performance matches the best alternative design,providing a speedup of 17% on average against the competingalternative, and up to 26% at best.1 IntroductionIn recent years, processor manufacturers have shifted towards producing multicore processors to remain within the power and cooling constraints of modern chips while maintaining the expectedperformance advances with each new processor generation.Increasing device density enables exponentially more cores on asingle die. Major manufacturers already ship 8-core chip multiprocessors [25] with plans to scale to 100s of cores [1, 32], while specialized vendors already push the envelope further (e.g., Cisco’sCRS-1 with 192 Tensilica network-processing cores, Azul’s Vega 3with 54 out-of-order cores). The exponential increase in the number of cores results in the commensurate increase in the on-chipcache size required to supply these cores with data. Physical andmanufacturing considerations suggest that future processors will betiled, where groups of processor cores and banks of on-chip cachewill be physically distributed throughout the chip area [1,43]. Tiledarchitectures give rise to varying access latencies between thecores and the cache slices spread around the die, naturally leadingto a Non-Uniform Cache Access (NUCA) organization of the onchip last-level cache (LLC), where the latency of a cache hitdepends on the physical distance between the requesting core andthe location of the cached data.However, growing cache capacity comes at the cost of accesslatency. As a result, modern workloads already spend most of theirexecution time on on-chip cache accesses. Recent research showsthat server workloads lose as much as half of the potential performance due to the high latency of on-chip cache hits [20]. Althoughincreasing device switching speeds result in faster cache-bankaccess times, communication delay remains constant across technologies [8], and access latency of far away cache slices becomesdominated by wire delays and on-chip communication [24]. Thus,from an access-latency perspective, an LLC organization whereeach core treats a nearby LLC slice as a private cache is desirable.While a private distributed LLC organization results in fast localhits, it requires area-intensive, slow and complex mechanisms toguarantee coherence. In turn, coherence mechanisms reduce theavailable cache area and penalize data sharing, which is prevalentin many multicore workloads [3,20]. At the same time, growingapplication working sets render private caching schemes impractical due inefficient use of cache capacity, as cache blocks are replicated between private cache slices and waste space. At the otherextreme, a shared distributed LLC organization where blocks arestatically address-interleaved in the aggregate cache offers maximum capacity by ensuring that no two cache frames are used tostore the same block. Because static interleaving defines a single,fixed location for each block, a shared distributed LLC does notrequire a coherence mechanism, enabling a simple LLC design andallowing for larger aggregate cache capacity. However, static interleaving results in a random distribution of cache blocks across theL2 slices, leading to frequent accesses to distant cache slices andhigh access latency.An ideal LLC organization enables the fast access times of the private LLC and the design simplicity and large capacity of the sharedLLC. Recent research advocates hybrid and adaptive designs tobridge the gap between private and shared organizations. However,prior proposals require complex, area-intensive, and high-latencylookup and coherence mechanisms [4, 10, 7, 43], waste cachecapacity [4, 43], do not scale to high core counts [7, 19], or optimize only for a subset of cache accesses [4, 7, 11].In this paper we propose Reactive NUCA (R-NUCA), a scalable,low-overhead and low-complexity cache architecture that optimizes data placement for all cache accesses, while at the same time

Computer Architecture Lab at Carnegie Mellon Technical Report CALCM-TR-2008-001Submitted to ISCA 2009attaining the fast local access of the private organization and thelarge aggregate capacity of the shared scheme.R-NUCA cooperates with the operating system to classify accessesat the page granularity, achieving negligible hardware overheadand avoiding complex heuristics that are prone to error, oscillation,or slow convergence [4, 10, 7]. The placement decisions in RNUCA guarantee that each modifiable block is mapped to a singlelocation in the aggregate cache, thereby obviating the need forcomplex, area- and power-intensive coherence mechanisms thatare commonplace in other proposals [7, 4, 10, 43]. At the sametime, R-NUCA allows read-only blocks to be shared by neighboring cores and replicated at distant ones, ensuring low accesslatency for surrounding cores while balancing capacity constraints.In the process of doing so, R-NUCA utilizes rotational interleaving, a novel lookup mechanism that matches the fast speed ofaddress-interleaved lookup without pinning the block to a singlelocation in the cache, thereby allowing the block’s replicationwhile avoiding expensive lookup operations [43, 10].More specifically, in this paper we make the following contributions: Through execution trace analysis, we show that cacheaccesses for instructions, private data, and shared data exhibitdistinct characteristics leading to different replication, migration, and placement policies. We leverage the characteristics of each access class to designR-NUCA, a novel, low-overhead, low-latency mechanism fordata placement in the NUCA cache of large-scale multicorechips. We propose rotational interleaving, a novel mechanism toallow fast lookup of nearest-neighbor caches, eliminatingmultiple cache probes and allowing replication withoutwasted space and without coherence overheads. We use full-system cycle-accurate simulation of multicoresystems to evaluate R-NUCA and find that its performancealways exceeds the best of private or shared design for eachworkload, attaining a speedup of 20% on average against thecompeting alternative when running server workloads (17%on average when including scientific and multi-programmedworkloads) and up to 26% speedup at best.The rest of this paper is organized as follows. Section 2 presentsbackground on distributed caches and tiled architectures, and provides insight into the data-access patterns of modern workloads.Section 3 presents our classification and offers a detailed empiricalanalysis of the cache-access classes of commercial, scientific, andmulti-programmed workloads. We detail the R-NUCA design inSection 4 and evaluate it in Section 5 using cycle-accurate full-system simulation. We summarize prior work in this area in Section 6and conclude in Section 7.While our techniques are applicable to any last-level cache, weassume a 2-level cache hierarchy in our evaluation. Thus, in theinterest of clarity, we refer to our last-level cache as L2 in theremainder of this work.P0P1P2P3P4P5P6P7P8P9P10P11P6 TileCOREI D L2 sliceP12P13P14P15FIGURE 1. Typical tiled architecture. Each tile containsa core, L1 instruction and data caches, and a shared-L2cache slice, interconnected into a 2-D folded torus.required for multicore processors renders caches with uniformaccess impractical, as increases in capacity simultaneouslyincrease access latency [20]. To mitigate this problem, recentresearch [24] proposes to decompose the cache into multiple slices.Each slice may consist of multiple cache banks to optimize for lowaccess latency [5], with all slices physically distributed across theentire chip. Thus, cores realize fast accesses to the cache slices intheir physical proximity and slower accesses to physically remoteslices.Just as cache slices are distributed across the entire die, processorcores are similarly distributed. Thus, it is natural to couple coresand cache slices together and allow each core to have a “local”slice that affords fast access. Furthermore, economic, manufacturing, and physical design considerations [1, 43] suggest “tiled”architectures that co-locate distributed cores with distributed cacheslices in tiles communicating via an on-chip interconnect.2.2 Tiled architecturesFigure 1 presents a typical tiled architecture. Multiple tiles, eachcomprising a processor core, caches, and network router/switch,are replicated to fill the die area. Each tile includes private L1 dataand instruction caches and an L2 cache slice. L1 cache missesprobe the on-chip L2 caches via an on-chip network that interconnects the tiles (typically a 2D mesh). Depending on the L2 organization, the L2 slice can be either a private L2 cache or a portion ofa larger distributed shared L2 cache. Also depending on the cachearchitecture, the tile may include structures to support cache coherence such as L1 duplicate tags [2] or sections of the L2-cache distributed directory.Tiled architectures scale well to large processor counts, with anumber of commercial implementations already in existence (e.g.,Tilera’s Tile64, Intel’s Teraflops Research Chip). Tiled architectures are attractive from a design and manufacturing perspective,enabling developers to concentrate on the design of a single tileand then replicate it across the die [1]. Moreover, tiled architecturesare beneficial from an economic standpoint, as they can easily support families of products with varying number of tiles and power/cooling requirements.Private L2 organization. Each tile’s L2 slice serves as a privatesecond-level cache for the tile’s core. On an L1 cache miss, onlythe L2 slice located in the same tile is probed. On a read miss in thelocal L2 slice, the coherence mechanism (a network broadcast or2 Background2.1 Non-Uniform Cache ArchitecturesGrowing wire delays have necessitated a departure from conventional cache architectures that present each core with a uniformcache access latency. The exponential increase in cache sizes2

Computer Architecture Lab at Carnegie Mellon Technical Report CALCM-TR-2008-001Submitted to ISCA 2009access to a statistically address-interleaved distributed directory)confirms that a copy of the block is not present on chip. On a writemiss, the coherence mechanism invalidates all other on-chip copies. With a directory-based coherence mechanism, a typical coherence request is performed in three network hops. With tokencoherence [27], a broadcast must be performed and a responsemust be received from the farthest tile.Enforcing coherence requires large storage or complexity overheads. For example, a full-map directory for a 16-tile multicoreprocessor with 1MB per L2 slice and 64-byte blocks requires 256Kentries. Assuming a 42-bit physical address space, the directorysize per tile is 1.3MB, exceeding the L2 capacity of the slice. Thus,full-map directories are impractical for the private L2 organization.Limited-directory mechanisms use smaller directories, but requirecomplex, slow, and non-scalable fall-back mechanisms such asfull-chip broadcast. In the rest of this work, we optimisticallyassume a private L2 organization where each tile has a full-mapdirectory with zero area overhead.Shared L2 organization. Each L2 slice is statically dedicated tocaching a part of the address space, servicing requesting from anytile through the interconnect. On an L1 cache miss, the missaddress dictates the tile responsible for caching the block, and arequest is sent directly to that tile. The target tile stores both thecache block and its coherence state. Because each block has aunique location in the aggregate L2 cache, the coherence state mustcover only the L1 cache tags; following the example for the privateL2 organization and assuming 64KB split I/D L1 caches per core,the directory size is 168KB per cache block placement can improve performance by bringingdata close to the requesting cores, allowing fast access.We identify three key requirements to enable high performanceoperation of distributed NUCA caches through intelligent blockplacement