The Parallel I/O Software Crisis

Achievements and Challenges for I/O in Computational Science
Rob Ross, Mathematics and Computer Science Division, Argonne National Laboratory
SciDAC 2005

I/O in Computational Science

I/O is an increasingly important part of computational science, and applications have many different I/O needs:
- Initialization: input datasets vary widely in size and format
- Checkpointing (defensive I/O): lots of data written all at once
- Visualization: a subset of checkpoint data, written more frequently during runtime than checkpoints, and probably read many times
- Data movement: wide-area data access

I/O volumes for representative applications (all values are for a single run; units are TBytes; data primarily from the workshop on Requirements for Ultrascale Computing, Washington, DC, June 2003):

Application       Checkpointing   Generation   Reading and Post-processing/Analysis
Astrophysics      20-200          20-200       20
Supernova         20              2            2
Climate Modeling  2               2            1
Cosmology         5               1            1
Fusion            1,000           1            0.5

Parallel I/O

Parallel I/O is simply using many I/O resources in a coordinated way to solve a single problem more quickly; for example, storing a checkpoint into a single file. It is the same thing we do in parallel processing.

Parallel I/O is becoming mandatory for applications; the old way of doing I/O no longer works. A single BG/L compute node has no more than 60 MByte/sec of I/O bandwidth, but the whole machine might have 30 GByte/sec of I/O bandwidth (e.g., the system at LLNL)! I/O software determines how well we can make use of the available I/O hardware, especially at scale.
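To make the checkpoint example concrete, here is a minimal sketch of many processes storing a checkpoint into one shared file through MPI-IO. The file name, buffer size, and contents are invented for illustration; a real checkpoint would write actual simulation state.

/* Sketch: each rank writes its portion of a checkpoint into a single
 * shared file with a collective MPI-IO write. Build with an MPI C
 * compiler (e.g., mpicc); "ckpt.dat" and LOCAL_DOUBLES are illustrative. */
#include <mpi.h>
#include <stdlib.h>

#define LOCAL_DOUBLES 1048576   /* doubles held by each process */

int main(int argc, char **argv)
{
    MPI_File fh;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *data = malloc(LOCAL_DOUBLES * sizeof(double));
    for (int i = 0; i < LOCAL_DOUBLES; i++)
        data[i] = rank;   /* stand-in for real simulation state */

    MPI_File_open(MPI_COMM_WORLD, "ckpt.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes a disjoint region at an offset based on its rank.
     * The collective call lets the MPI-IO layer coordinate and optimize
     * the access across all processes. */
    MPI_Offset offset = (MPI_Offset)rank * LOCAL_DOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, data, LOCAL_DOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(data);
    MPI_Finalize();
    return 0;
}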

What Drives I/O in HPC?

It is not just about providing performance with parallel I/O. There are three metrics on which we measure success:
- Usability: how well I/O interfaces map to application data models and access patterns. Solutions here are unique to HPC.
- Performance and scalability: how well our I/O systems are tuned for common application patterns (e.g., concurrent access, noncontiguous access) and for metadata access.
- Reliability and management: how much maintenance our parallel I/O systems require, and how well they handle failures.
This talk covers all three areas, pointing out both successes and challenges.

Usability

Application View of I/O

It doesn't matter how fast the I/O system is if applications can't use it well. Applications internally use complex data structures to organize data; ideally, data would be stored in a similar format:

- Canonical representation
- Typed data
- Multidimensional, unstructured datasets
- Attributes of the data and of the run
[Graphics from A. Siegel, ANL, and J. Tannahill, LLNL.]
More domain or data model specificity leads to more convenience for applications, but we can't afford to rewrite everything for each application.

Organization of I/O Software

I/O components are layered to provide the needed functionality ("I/O stacks"), and common APIs allow components to be combined:

  Application
  High-level I/O Library
  I/O Middleware (MPI-IO)
  Parallel File System (POSIX)
  I/O Hardware

The parallel file system organizes the hardware into a single, fast storage space. The I/O middleware matches the programming model and provides optimizations (for example, collective I/O operations in MPI-IO). High-level I/O libraries (HLLs) provide usability.

High-level I/O Libraries

High-level I/O libraries provide structured data storage: multidimensional, typed datasets, plus attributes of the data and its provenance. Metadata is placed in the file itself, simplifying data movement and archiving. Two good examples:
- HDF5: the first to use MPI-IO, and widely used
- PnetCDF: a parallel API for netCDF data
These are a compelling alternative to POSIX and MPI-IO. Even so, both of these are too low-level: an important step, but still somewhat general.
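As a small taste of what the structured model buys, here is a minimal sketch using the PnetCDF C API to store a typed, two-dimensional dataset with a provenance attribute. The file name, dimension sizes, and variable name are invented for illustration.

/* Sketch: storing a typed, multidimensional dataset plus an attribute
 * with PnetCDF. Names ("out.nc", "temperature") and sizes are
 * illustrative. Link against PnetCDF (e.g., mpicc ... -lpnetcdf). */
#include <mpi.h>
#include <pnetcdf.h>

int main(int argc, char **argv)
{
    int ncid, dimids[2], varid, rank, nprocs;
    MPI_Offset start[2], count[2];
    float local[10];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Define the file: a 2-D float variable and a provenance attribute.
     * The metadata lives in the file itself. */
    ncmpi_create(MPI_COMM_WORLD, "out.nc", NC_CLOBBER, MPI_INFO_NULL, &ncid);
    ncmpi_def_dim(ncid, "row", nprocs, &dimids[0]);
    ncmpi_def_dim(ncid, "col", 10, &dimids[1]);
    ncmpi_def_var(ncid, "temperature", NC_FLOAT, 2, dimids, &varid);
    ncmpi_put_att_text(ncid, NC_GLOBAL, "run_id", 7, "test001");
    ncmpi_enddef(ncid);

    for (int i = 0; i < 10; i++)
        local[i] = (float)rank;   /* stand-in for real data */

    /* Each rank writes one row, collectively; PnetCDF maps this onto
     * MPI-IO underneath. */
    start[0] = rank;  start[1] = 0;
    count[0] = 1;     count[1] = 10;
    ncmpi_put_vara_float_all(ncid, varid, start, count, local);

    ncmpi_close(ncid);
    MPI_Finalize();
    return 0;
}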

Challenge: Bridging the Usability Gap

Applications still struggle to use this infrastructure. The approach: build new layers on top of the existing I/O software stack, maximizing code reuse and benefiting from existing optimizations.

  Application
  Model-Specific I/O API
  High-level I/O Library
  I/O Middleware (MPI-IO)
  Parallel File System
  I/O Hardware

The goal is to match I/O interfaces to data models or domains. This must be a collaborative effort: application people know the models, and I/O system people know the optimizations.

Challenge: Standard APIs to Wide-Area Data Access

A recent trend is accessing data between sites, and tools exist for moving data across the wide area: GridFTP, Storage Resource Managers (SRM), Logistical Networking (LN), and Storage Resource Brokers (SRB). Groups are developing MPI-IO interfaces to these wide-area data transfer tools, which in turn allows HDF5 and PnetCDF to be used between sites.

[Chart: Writing a Subarray to LN with MPI-IO -- aggregate write bandwidth (MB/sec) for naive independent, optimized independent, and collective writes, with and without sync. Performance can vary even more widely than with local file systems!]

Performance and Scalability

Goal: minimize the time applications spend performing I/O-related operations, and maximize the time they spend computing. End-to-end I/O performance includes:
- Concurrent access to files, for real application access patterns
- Metadata operations: creating files, traversing directories, etc.
- Overhead of all the I/O software layers; features aren't free

Parallel File Systems

A parallel file system sits between the clients (1000s-10,000s of them) and the I/O devices or servers (10s-1000s), connected by a storage or system network. There are three popular parallel file system solutions: GPFS, Lustre, and PVFS/PVFS2. All three are being actively developed and deployed, and competition in this space is good; there is no one-size-fits-all solution at this time. All three are already in use on BG/L systems, and all are capable of 10 GByte/sec+ I/O rates, given adequate storage hardware and easy access patterns.

[Chart: average aggregate read rate (MB/s) versus number of concurrent clients for NFS, PVFS2 (A), PVFS2-1Gbit (B), and Lustre. Updated results from "Shared Parallel Filesystems in Heterogeneous Linux Multi-Cluster Environments" by Cope, Oberg, Tufo, and Woitaszek of Univ. of Colorado at Boulder, using the caggreIO benchmark.]

Complication: I/O Access Patterns

Application I/O is often complex, not just big blocks, and I/O interfaces determine our ability to extract performance. Ignoring ghost cells, extracting subarrays, and the additional data stored by high-level I/O libraries all result in noncontiguous I/O. The interfaces define the knowledge that the I/O system has to work with.

The standard (POSIX) file system interface does not allow for efficient noncontiguous access.

Supporting Noncontiguous I/O

There are three approaches to noncontiguous I/O:
- Use POSIX and suffer
- Perform optimizations at the MPI-IO layer as a work-around
- Augment the parallel file system
Augmenting the parallel file system API is the most effective, as shown in the sketch below.

[Chart: bandwidth (MB/s) for POSIX I/O, data sieving I/O, two-phase I/O, list I/O, and datatype I/O, grouped as POSIX I/O, MPI-IO optimizations, and PFS enhancements. Results from the Datatype I/O prototype in PVFS1 with the tile example.]
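To make "noncontiguous" concrete, the sketch below (with invented sizes and file name) writes the interior of a local 2-D array while skipping a one-cell ghost layer. An MPI derived datatype describes the whole noncontiguous memory region in one request, giving the I/O system the knowledge it needs, instead of issuing one small POSIX write per row.

/* Sketch: writing the interior of a local 2-D array (excluding a
 * one-cell ghost layer) in a single MPI-IO call. Sizes and the file
 * name are illustrative. */
#include <mpi.h>

#define N 10   /* local array is N x N including ghost cells */

int main(int argc, char **argv)
{
    double local[N][N];            /* ghost cells on all four sides */
    int sizes[2]    = {N, N};      /* full local array */
    int subsizes[2] = {N-2, N-2};  /* interior region to write */
    int starts[2]   = {1, 1};      /* skip the ghost layer */
    MPI_Datatype interior;
    MPI_File fh;

    MPI_Init(&argc, &argv);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            local[i][j] = i * N + j;   /* stand-in data */

    /* Describe the noncontiguous memory region once; the MPI-IO layer
     * (and, with datatype I/O, the file system itself) can then handle
     * it as a single logical request. */
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &interior);
    MPI_Type_commit(&interior);

    MPI_File_open(MPI_COMM_SELF, "subarray.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write(fh, local, 1, interior, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    MPI_Type_free(&interior);
    MPI_Finalize();
    return 0;
}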

Creating Files

Even creating files can take significant time on very large machines! Why? It's complicated, but it mostly has to do with the interface we have to work with and its implications for synchronization. What happens if we change this interface?

Creating Files Efficiently

Improving the file system interface improves performance for computational science. The POSIX file model forces all processes to open a file, causing a storm of system calls. A handle-based model instead uses a single file system lookup followed by a broadcast of the handle, leveraging communication in the MPI-IO layer (implemented in PVFS2).

[Chart: time to create files through MPI-IO -- average create time (ms) versus number of processes for GPFS, Lustre, and PVFS2.]
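The handle-based pattern can be sketched in plain MPI terms: one process performs the lookup/create and the result is broadcast, replacing N metadata operations with one plus an O(log N) communication. The sketch below uses an ordinary POSIX open on rank 0 to stand in for the file system lookup; a real implementation (as in PVFS2's MPI-IO driver) would broadcast the file system's opaque handle rather than an error code.

/* Sketch of handle-based create: rank 0 touches the file system once,
 * then shares the outcome; other ranks never hit the metadata server. */
#include <mpi.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, err = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Single create: one metadata operation total, instead of one
         * per process. */
        int fd = open("shared.dat", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) err = 1;
        else close(fd);
    }

    /* One O(log N) broadcast replaces N racing open() calls. */
    MPI_Bcast(&err, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (err) {
        if (rank == 0) perror("create");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    /* ...each process can now use the file knowing it exists... */
    MPI_Finalize();
    return 0;
}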

High-Level I/O Library Performance

High-level I/O libraries cost performance. Second-generation high-level I/O libraries are showing promise, better leveraging the features of MPI-IO and using simpler file models that allow for greater concurrency. Still, performance is only a fraction of peak! In some cases applications must make tough decisions between functionality/usability and performance.

[Chart: FLASH I/O benchmark -- MBytes/sec versus number of processors (16-256) for HDF5 and PnetCDF.] The FLASH I/O benchmark shows PnetCDF performance to be competitive with, and in some cases significantly higher than, HDF5 performance. This is due to the light-weight, low-overhead nature of PnetCDF and its tight coupling to MPI-IO (results from the ASCI Frost machine at LLNL, rates in MB/sec). This work was performed in collaboration with Alok Choudhary and Jianwei Li of Northwestern University.

Challenge: Minimizing I/O Costs

We need other parallel file systems to adopt the API enhancements currently available only in the PVFS2 file system, and we need to standardize extensions to POSIX I/O for HPC.

High-level I/O libraries also need more work: caching components integrated into the HLLs (or maybe into the I/O middleware?), and new file formats tuned for performance.

Reliability and Management

I/O System Complexity

The sheer number of devices is an issue, both for administration (configuration and tuning) and for reliability.

[Diagram: a multi-cluster storage deployment -- 16 dual-P4 servers, 7.3 TB each (116 TB total), multi-homed across an IB switch and a GigE switch, serving 112 dual-P4 nodes (IB), 250 IA64 nodes (FastE), and 144 dual-P4 nodes (GigE).]

File System Administration

It is the role of the parallel file system to organize and manage the I/O resources, but PFSes are themselves difficult to manage:

- Failure tolerance
- Tuning
- Installation and configuration
Similar technologies (e.g., relational databases, networking) already need experts to manage them. New software solutions can alleviate many of these problems for I/O systems.

Autonomic Storage

Self-healing, self-maintaining, self-tuning storage is not a reality for parallel I/O, yet. New PFS designs integrate communication between servers, which exchange information about health, load, and allocated space. Such a system:
- Adapts to device failures transparently
- Automatically integrates new storage devices
- Balances data to preserve performance
This is being prototyped in the PVFS2 parallel file system. The next step will be to integrate policies and enforce them; moving data in response to policy decisions is the easy part!

Impact of Hardware Failures

More components usually means more failures. Disk failures may be tolerated with RAID-like concepts, and server failures with high-availability approaches, but client failures can be a real problem, especially at scale. Clients will not all be online: 99.99% uptime indicates ~6 nodes down at any time on a 64K-node system, and 99.9% uptime indicates ~65 down at any time on the same system. MTTFs of 6-8 hours have been seen on large DOE machines (e.g., ASCI Q). We need approaches that minimize the impact of client failures.
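The ~6 and ~65 figures follow directly from expected downtime: multiply the node count by the unavailable fraction (taking a "64K-node system" to mean 65,536 nodes):

\[
\mathbb{E}[\text{nodes down}] = N(1-p), \qquad
65536 \times (1 - 0.9999) \approx 6.6, \qquad
65536 \times (1 - 0.999) \approx 65.5 .
\]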

NFS Did Get This Right

NFS (v3) doesn't store important data on clients, an approach known as "stateless clients": client failures don't impact servers or other clients. Parallel file systems may be built similarly, and PVFS2 takes this approach. But we then lose traditional performance enhancements such as client-side caching (there is no room for a cache on BG/L nodes anyway).

Challenge: Reliability, Manageability, and Performance

Autonomic storage concepts are not yet a reality for parallel file systems, and maintaining predictable I/O performance in autonomic storage will be tricky! Getting both reliability and performance is a challenge: we can start with simple, stateless clients (an analog to the smaller OSes being used on clients), but this is very difficult if we want to minimize cost!

Conclusions

Summary

There have been many recent successes in I/O for computational science:
- Multiple file system options
- Multiple high-level interfaces available for applications
- Remote data access capabilities
At the same time, the usability, performance, management, and reliability of existing parallel I/O systems can all be improved:
- Application interfaces aren't convenient to use
- Observed performance rarely reaches peak performance
- Parallel file systems are difficult to manage, require too much expertise, and are reliability-challenged
Development and adoption of solutions to these issues are critical to the future success of HPC systems.

It's (Almost) All About Interfaces

APIs play a fundamental role in I/O system software development and use:
- Organization of components into I/O stacks using common APIs
- Development of new, domain- or model-specific I/O libraries for better usability
- Extensions to traditional parallel file system interfaces to increase performance
- Common interfaces for wide-area data access
- More database-like interfaces for finding data in file systems
Changing interfaces is never easy!

Looking Forward

Efforts are underway to revitalize I/O system software to tackle problems for current and future HPC systems, and deployment and adoption of these solutions will enable new and more data-oriented applications. It has to be a team effort; the Scientific Data Management SciDAC is actively pursuing these collaborations. If you can't get enough I/O, attend our Parallel I/O in Practice tutorial at SC2005.

Acknowledgements

The Scientific Data Management Center; colleagues at ANL (W. Gropp, R. Thakur, S. Lang, R. Latham, J. Lee); and members of the I/O and data management community and their respective teams:

- A. Choudhary, Northwestern University
- W. Ligon, Clemson University
- P. Wyckoff, Ohio Supercomputer Center
- A. Shoshani, Lawrence Berkeley National Laboratory
- N. Samatova, Oak Ridge National Laboratory
- G. Grider, Los Alamos National Laboratory
- L. Ward, Sandia National Laboratories
- T. Critchlow and W. Loewe, Lawrence Livermore National Laboratory
- D.K. Panda, Ohio State University
- G. Gibson, Panasas
- R. Haskin, IBM

This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under Contract W-31-109-ENG-38.
