An Accurate Prefetch Technique for Dynamic Paging Behaviour ...
An Accurate Prefetch Technique for Dynamic Paging Behaviour for Software Distributed Shared Memory Jie Cai and Peter Strazdins Research School of Computer Science The Australian National University ICPP 2012 Pittsburgh, PA, USA Outline Introduction Background Related Work on Existing Prefetch Techniques Stride-augmented Run-length Encoding Method (sRLE) Dynamic Region-based Prefetch Technique Evaluation Results Conclusion ICPP 2012 @ Pittsburgh, PA
Introduction Software Distributed Shared Memory (sDSM) systems provide programming environments that enable the use of shared programming model such as OpenMP on clusters. sDSM systems inherit the good programmability of shared memory programming models. Removing explicit control of data exchange from programmer However, sDSM suffers from significant system overheads. Prefetch techniques, fitting well with lazy release consistency (LRC), can be used to improve performance. Prefetch techniques for sDSM face two major challenges: Applications dynamic memory access patterns
Page misses caused by non-global synchronization operations ICPP 2012 @ Pittsburgh, PA Introduction (Cont.) In this talk, we address the challenges of prefetch techniques for sDSM systems Reconstruct page miss record using strided-augmented run-length encoding (sRLE) method Designed a dynamic region-based prefetch (DReP) technique based on sRLEd records to predict and issue prefetches. Implemented into the only commercialized sDSM system, Intel Cluster OpenMP (CLOMP) DReP and sRLE with CLOMP are evaluated using NPBOMP benchmark suite, LINPACK, and a memory consistency cost micro-benchmark (MCBENCH) ICPP 2012 @ Pittsburgh, PA
Background (1) Fork-join type shared memory programming models: Single Thread Start parallel region, implicit barrier Thread0 Thread1 Thread2 Regions are separated Thread3 Parallel Region #1 Explicit barrier
using global synchronizations, e.g. implicit and explicit barriers; Sequential Region #1 Thread0 Thread1 Thread2 Thread3 Parallel Region #2 End parallel region, implicit barrier Region-executions are multiple executions of the same region when this region is enclosed in
a loop. Single Thread Sequential Region #2 Start parallel region, implicit barrier Thread0 Thread1 Thread2 Thread3 Parallel Region #3 End parallel region, implicit barrier Single Thread ICPP 2012 @ Pittsburgh, PA Sequential Region #3 Global memory is managed in
blocks, pages Background (2) sDSM memory consistency model Shared Memory Programming Model (OpenMP) Each process has a local view of the shared pages The shared pages are kept consistent via mprotect (please refer to the page state machine for details). Virtual Shared Memory MPI
Local Memory ICPP 2012 @ Pittsburgh, PA Local Memory Background (3) sDSM memory consistency costs The major sDSM system overhead is the memory
consistency cost. MCBENCH is a in-house developed microbenchmark that measures this cost for different OpenMP implementations, including cluster enabled OpenMPs. ICPP 2012 @ Pittsburgh, PA Related Work Dynamic Aggregation (C. Amza et al. 1997) Simple assumption of temporal paging behavior before and after a barrier.
B+ and Adaptive++ (R. Bianchini et al. 1996 & 1998) B+: simple assumption of temporal paging behavior before and after a barrier. Adaptive++: assuming page misses occurred before a barrier or even before the previous barrier will occur again after the barrier. Third order differential finite context method (TODFCM) (E. Speight et al. 2002) Generic technique prefetch a page when three previous consecutive misses had happened before. ICPP 2012 @ Pittsburgh, PA Related Work (Cont.) Temporal region-based pretech (TReP) technique (J. Cai
et al. 2010) Deployed idea of region and region-executions Assume page misses in the previous region-execution will occur in the current region-execution Considered temporal paging behaviour between consecutive region-executions Hybrid region-based prefetch (HReP) technique (J. Cai et al. 2010) Deployed idea of region and region-executions Combined TReP and Adaptive++ Addressed temporal paging behaviour between consecutive region-executions and spatial paging behaviour within a region-execution. ICPP 2012 @ Pittsburgh, PA sRLE Method -- Observation
LINPACK dynamic page access pattern with 4 processes Corresponding dynamic page miss pattern ICPP 2012 @ Pittsburgh, PA sRLE Method Step (a) group sub-list with common stride; Step (b) encode the sublists into first level format:
Step (c) group consecutive encoded sub-lists with common stride into second level encoding format: (start page, stride, run length) (first level encoded record, stride, run length) Ordinary page fault list can be converted to 2D fault regions with sRLE. ICPP 2012 @ Pittsburgh, PA DReP Technique Designs
All page fault records (per region) has been encoded twice with sRLE method. Each record contains a list of second level encoded entries. ICPP 2012 @ Pittsburgh, PA DReP Technique Designs (cont.) Beginning of a region-execution At the beginning of each region-execution, DReP predict and prefetch pages.
Yes Previously executed twice? Compare every entries between two records. Issue prefetches ONLY for the following three cases. Case 1: Prefetch the entry if it is common to both lists No No Prefetch issued Case 2: Case 3: When strides and run lengthes are
common to both lists, predict a start page, and prefetch with the common strides and run length When strides are common and run lengthes are highly similar to both lists, predict a start page and a run lengthes, then prefetch with the common strides. pred.l1_en_col.start_page = p_list.l1_en_col.start_page + (p_list.l1_en_col.start_page bp_list.l1_en_col.start_page) pred.l1_en_col.run len = p_list.l1_en_col.run_len + (p_list.l1_en_col.run_len bp_list.l1_en_col.run_len) pred.run_len = p_list.run_len + (p_list.run_len bp_list.run_len) ICPP 2012 @ Pittsburgh, PA DReP Implementation
DReP has been implemented into Intel Cluster OpenMP runtime. New region notification user interactive interface: KMP_USER_NOTIFY_NEW_REGION(1) : 1 indicates this is a parallel region KMP_USER_NOTIFY_NEW_REGION(0): 0 indicates this is a sequential region Flush filtering solved the problem of single page can be missed multiple times within one region-execution by removing duplicated records. Enlarged message header of the communication layer which can accommodate 128 page IDs to leverage network bandwidth. Each process first communicate to its right neighbor that avoid network congestion. ICPP 2012 @ Pittsburgh, PA DReP Implementation (Cont.)
DReP has been implemented into Intel Cluster OpenMP runtime. Page state machine has been updated with two new introduced page states Prefetched_diff Prefetched_page ICPP 2012 @ Pittsburgh, PA Evaluation Experimental setup Software and benchmarks NPB-OMP suite LINPAK OpenMP implementation (n=8196, nb=64)
MCBENCH (a = 4MB, c = 4B and 4KB) Hardware platform 8-node Intel cluster Each node consists of 2 Intel E5472 3.0Ghz CPUs 16GB memory Gigabit Ethernet DDR Infiniband ICPP 2012 @ Pittsburgh, PA Efficiency and Coverage Nf: total number of page faults Np: number of prefetches
Nu: number of useful prefetches, Nu = Nf*C C = Nu/Nf, coverage E = Nu/Np, efficiency Bold font represents best results ICPP 2012 @ Pittsburgh, PA Efficiency and Coverage (Cont.) MCBENCH: DReP vs TReP and HReP c = 4B: extreme false sharing c= = 4KB: no false sharing Bold font represents best results ICPP 2012 @ Pittsburgh, PA Memory Consistency Cost
Measured using MCBENCH, a = 4MB, c = 4B and 4KB c = 4B: extreme false sharing (reduced ~86% cost) c = 4KB: no false sharing ICPP 2012 @ Pittsburgh, PA Memory Consistency Cost (Cont.) LINPACK OpenMP implementation with n=8196 and nb=64 DReP is represented as a reduction rate to that of original CLOMP implementation, e.g. (Orig-DReP)/Orig. ICPP 2012 @ Pittsburgh, PA Memory Consistency Cost (Cont.) NPB-OMP Rates are represented as an average of each class from A to C.
ICPP 2012 @ Pittsburgh, PA Overhead Analysis of DReP NPB-OMP IS.C Tsegv: total memory consistency cost in seconds for original CLOMP and DReP enabled CLOMP. TMK Comm (% to Tsegv): communication time spent in the DSM layer of CLOMP (TMK) TMK local (% to Tsegv): the local software overhead of TMK layer DReP Comm (% to Tsegv): communication cost of data prefetching DReP local (% to Tsegv): the local software cost introduced by DReP Communication costs are further broke down to cost for transferring diffs and pages.
ICPP 2012 @ Pittsburgh, PA Conclusions With assistance of sRLE, DReP accurately analyses the paging behaviour exhibiting both static and dynamic memory access patterns, such as NPB-OMP and LINPACK. On average of NPB and LINPACK, DReP improves 34% efficiency and 47% coverage based on existing prefetch techniques, in details: 55% and 5% better efficiency compared to Adaptive++ and TODFCM; 55% and 44% better coverage compared to Adaptive++ and TODFCM 47% and 30% better efficiency compared to TReP and HReP; and 56% and 34% better coverage compared to TReP and HReP. DReP dramatically reduces 86% memory consistency cost for the false sharing scenario; and ~45% and ~38% for LINPACK and NPB on GigE and IB respectively. A detailed breakdown analysis showed a ~2% introduced overhead for DReP.
ICPP 2012 @ Pittsburgh, PA Acknowledgement Australian Research Council Grant LP0669726 ANU CECS Faculty Research Grant Intel Corp. Sun Microsystems NCI National Facility / ANU Supercomputer Facility ICPP 2012 @ Pittsburgh, PA
While squaring a number means to multiply a number by itself, cubing a number means to multiply a number by itself three times. So a perfect cube is the result of cubing an integer. This means 8 is a perfect...
001 - BIBLE TRIVIA QUIZ . This quiz consists of 10 multiple choice type questions of varying difficulty. To begin Click on Start below. The quiz is pretty self-explanatory. If you have suggestions how to improve it please use contact...
SDH. SDH. SDH. Summary Amplified networks. Amplifiers are used in DWDM networks and increases the reach of the optical signals up to 3000 km. Amplifiers are a better choice than electrical repeaters and are usually distributed each 80-100 km.
"De-Revolutionising" the peasants. High price of land led to high mortgage (redemption) payments. One of the reasons that peasant joined the 1905 revolution is the fear that government would reposes land. In 1905 government cleared this debt to "de-revolutionise" the...
Strengthen and Expand Programming through Innovation and Responsiveness to Community and Student Needs: Flexible scheduling, coherent pathways, and utilization of different delivery modalities. Develop the Faculty: Establish a center for teaching and learning, focusing on new faculty development and ongoing...
L'élève devra indiquer le numéro de compte pour le règlement des factures au moment de soumettre son paiement. Le nom à utiliser aux fins de paiement est le suivant : « Centre de demande d'admission aux universités de l'Ontario »...
Virtue ethics Ethics isn't just about acting, but about living An action is right if and only if it is what a virtuous agent would characteristically (i.e. acting in character) do in the circumstances Knowing how to act takes practical...
Ready to download the document? Go ahead and hit continue!