LA-UR-
Approved for public release; distribution is unlimited.

Title: A Hybrid Parallel Framework for the Cellular Potts Model Simulations

Author(s): Kejing He, Yi Jiang, Shoubin Dong

Intended for: Proceedings/Meeting, CCGrid 2009, IEEE International Symposium on Cluster Computing and the Grid

Los Alamos National Laboratory, an affirmative action/equal opportunity employer, is operated by the Los Alamos National Security, LLC for the National Nuclear Security Administration of the U.S. Department of Energy under contract DE-AC52-06NA25396. By acceptance of this article, the publisher recognizes that the U.S. Government retains a nonexclusive, royalty-free license to publish or reproduce the published form of this contribution, or to allow others to do so, for U.S. Government purposes. Los Alamos National Laboratory requests that the publisher identify this article as work performed under the auspices of the U.S. Department of Energy. Los Alamos National Laboratory strongly supports academic freedom and a researcher's right to publish; as an institution, however, the Laboratory does not endorse the viewpoint of a publication or guarantee its technical correctness.

Form 836 (7/06)

A Hybrid Parallel Framework for the Cellular Potts Model Simulations

Kejing He*, Member, IEEE, Yi Jiang† and Shoubin Dong*
*School of Computer Science and Engineering, South China University of Technology, Guangzhou, China 510641
†Theoretical Division, Los Alamos National Laboratory, Los Alamos, New Mexico 87544

Abstract—The Cellular Potts Model (CPM) has been widely used for biological simulations. However, most current implementations are either sequential or approximate, and cannot be used for large-scale, complex 3D simulations. In this paper we present a hybrid parallel framework for CPM simulations. The time-consuming PDE solving, cell division, and cell reaction operations are distributed to clusters using the Message Passing Interface (MPI). The Monte Carlo lattice update is parallelized on shared-memory SMP systems using OpenMP. Because the Monte Carlo lattice update is much faster than the PDE solving, and SMP systems are more and more common, this hybrid approach achieves good performance and high accuracy at the same time. Based on the parallel Cellular Potts Model, we study avascular tumor growth using a multiscale model. The application and performance analysis show that the hybrid parallel framework is quite efficient. The hybrid parallel CPM can be used for large-scale simulations (~10^8 sites) of the complex collective behavior of numerous cells (~10^6).

I. INTRODUCTION

The Cellular Potts Model (CPM) [1] is a simple and flexible framework for cell-oriented models of development. It has been used for modeling many morphogenesis processes, including cell sorting [2], tumor growth [3], vascularization [4], angiogenesis [5], limb growth [6] and slime mold development [7]. The underlying structure of the CPM is a three-dimensional rectangular lattice. Every site of the lattice has a value (formally, an ID), and the sites with the same ID form a cell, which thus has a volume and shape.
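As a concrete picture of this two-layer representation, the sketch below (an illustrative minimal example, not the authors' code) stores the lattice as a single integer array of IDs; a cell is simply the set of sites sharing one ID, so its volume is just a site count.

```python
import numpy as np

# Illustrative sketch of the CPM data layout (not the authors' code):
# the underlying lattice is one integer array; sites sharing an ID
# form one cell, whose volume is the number of those sites.
lattice = np.zeros((8, 8, 8), dtype=np.int32)   # ID 0 = surrounding medium
lattice[2:5, 2:5, 2:5] = 1                      # cell 1 occupies a 3x3x3 block

volume_of_cell_1 = int((lattice == 1).sum())    # 27 sites
```

The per-cell attributes (type, state, target volume) would then live in a separate table keyed by ID, which is the "top layer" discussed in section II-A.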
In this approach, cells are deformable and can interact with each other through the cell boundaries. The cells can have Type and State as part of their properties. The cell-cell interactions are described through an effective total energy (formally, a Hamiltonian), which typically includes a cell-type dependent surface adhesion energy and a volume energy:

\mathcal{H} = \sum_{(i,j)} J_{\tau(\sigma_i),\tau(\sigma_j)} [1 - \delta(\sigma_i,\sigma_j)] + \lambda_{volume} \sum_{\sigma} [v(\sigma) - V_{target}(\sigma)]^2 + \lambda_{surface} \sum_{\sigma} [s(\sigma) - S_{target}(\sigma)]^2    (1)

where J_{\tau(\sigma_i),\tau(\sigma_j)} is the adhesive energy between the cell type of ID \sigma_i and the cell type of ID \sigma_j. \lambda_{volume} and \lambda_{surface} are the volume elasticity and the membrane elasticity, respectively. In general, the second and third terms make the cells grow gradually toward their target size. The classic CPM employs a modified Metropolis Monte Carlo algorithm [8] to evolve the system. In the Metropolis algorithm, a random lattice site is chosen, and the ID of this site is updated to be one of the other IDs. Let \Delta\mathcal{H} be the change of energy due to this update; the Metropolis-Boltzmann probability for accepting such an update is:

P = \begin{cases} 1 & \Delta\mathcal{H} \le 0 \\ \exp(-\Delta\mathcal{H}/T) & \Delta\mathcal{H} > 0 \end{cases}    (2)

Since the underlying structure of the CPM is a lattice, and the typical discretization scale is several microns per lattice site, a realistic tissue-level biological simulation implies a large lattice requiring large memory and computing resources. In the case of tumor growth modeling (III-A), the simulation domain consists of up to 400^3 lattice sites, and we need to simulate the tumor growth for more than 30 days. For such simulations, parallelization becomes a necessity.

Domain decomposition is widely used in the parallelization of lattice-based models. The existing domain-decomposition methods usually decompose a large domain into a number of smaller subdomains, each assigned to a separate physical processor. In addition to the local subdomain information, each processor also has to maintain some data (ghost zones) from its neighbouring subdomains.
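The acceptance rule of Eq. (2) is simple to state in code. The following sketch (illustrative names, not the authors' implementation) evaluates the Metropolis-Boltzmann probability for a proposed ID flip:

```python
import math
import random

def metropolis_accept(delta_H, temperature, uniform=random.random):
    """Metropolis-Boltzmann rule of Eq. (2): accept unconditionally
    when the energy does not increase, otherwise accept with
    probability exp(-delta_H / T)."""
    if delta_H <= 0:
        return True
    return uniform() < math.exp(-delta_H / temperature)
```

In the classic sequential CPM loop, a random site is chosen, a neighboring ID is proposed, delta_H is evaluated from Eq. (1), and the flip is committed only if this test succeeds.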
After one calculation step, the data in the ghost zones are updated, so each processor has current information for its neighboring subdomains. For most lattice-based models (such as Cellular Automata), the domain decomposition method parallelizes efficiently. But because the CPM has a two-layer structure (section II-A and Fig. 1), and the Monte Carlo operation (section II-D) in the CPM requires consistency of the information on cell properties, e.g. cell volume, it is difficult to parallelize the Monte Carlo lattice update on a distributed-memory computing system while achieving accuracy and performance at the same time.

The basis of the CPM, the Potts model, is effectively a generalized Ising model, which has long served as an example system for parallel implementations. Barkema et al. [9] found that to reproduce the results obtained from the sequential algorithm, the number of interprocess synchronizations during one Monte Carlo step (MCS) should be statistically equal to N_s, where N_s is the number of sites located on the border of adjacent subdomains. They also presented a "smart" algorithm that reduces the synchronization number to \sqrt{2N_s/\pi} in two dimensions (2D). According to this estimate, for a 400^3 lattice, the processes have to synchronize tens of thousands of times

each MCS, which would severely limit the parallel performance in a distributed-memory environment. To reduce the number of synchronizations, a few methods have been proposed that exchange some accuracy for parallel efficiency. Wright et al. [10] used a "checkerboard" strategy to parallelize the Potts Model. In their approach, each subdomain is divided into four subgrids (in 2D, and eight in 3D), colored with different colors using the same schema. During a "subgrid" step, the processors work on subgrids of one color, and the other subgrids are left intact. After each "subgrid" step, the processors are synchronized. Next, the processors work on subgrids of another color. The "checkerboard" approach goes through lattice sites differently than the sequential algorithm: the parallel algorithm does not pick sites randomly, even within the subdomain owned by one processor [10]. Chen et al. [11] followed the "checkerboard" approach to parallelize the Cellular Potts Model, and they analyzed the relationship between the subgrid switching frequency, the accuracy, and the efficiency. Cercato et al. [12] developed a parallel algorithm for the CPM based on a Random Walker (RW) approach [13]. All of these approximation algorithms sacrifice some degree of accuracy for efficiency. A full analysis of the Monte Carlo lattice update is presented in section II-D.

The development of domain decomposition methods is closely related to the progress of high performance computing. Because symmetric multiprocessing (SMP) clusters are becoming increasingly common among the top computer systems, it has become desirable for domain decomposition methods to take advantage of the memory architecture of modern distributed computing systems. An important property of a good parallel domain decomposition method is that it can integrate with both the scientific problem and the underlying computing system architecture.
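The checkerboard scheme described above can be sketched in a few lines (a minimal 2D illustration with hypothetical names, assuming each subdomain is split into four quadrant subgrids as in Wright et al.): all processors sweep one color at a time, with a synchronization point between colors, which is exactly why the site-selection order differs from the sequential algorithm.

```python
def subgrid_color(y, x, ny, nx):
    """Quadrant color of site (y, x) inside an ny-by-nx subdomain
    (four subgrids in 2D; the 3D case would use eight octants)."""
    return (2 if y >= ny // 2 else 0) + (1 if x >= nx // 2 else 0)

def checkerboard_sweep(ny, nx, update_site):
    """One MCS under the checkerboard strategy: work on all sites of
    one color, then synchronize, then move to the next color."""
    for color in range(4):
        for y in range(ny):
            for x in range(nx):
                if subgrid_color(y, x, ny, nx) == color:
                    update_site(y, x)
        # In the MPI version, a barrier / ghost-zone exchange goes here.
```

Note that sites are visited color by color rather than uniformly at random over the whole subdomain, which is the source of the accuracy loss discussed in the text.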
The Message Passing Interface (MPI) [14] is a de facto computer communications standard for communication among the nodes running a parallel program on a distributed-memory system. Since MPI has been implemented for almost every distributed-memory architecture and each implementation is optimized for the hardware on which it runs, MPI is portable and efficient. OpenMP [15] is another well-known application programming interface (API) which supports multi-platform, shared-memory, multiprocessing programming in C/C++ and Fortran on many architectures. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. To achieve both high efficiency and high accuracy, we adopt a hybrid approach: the time-consuming PDE solving, cell division, and cell death operations are distributed across the SMP cluster, where the inter-node communication is fast and trivial, while the Monte Carlo lattice update is parallelized on the shared-memory system through OpenMP. Because the Monte Carlo lattice update is much faster than the PDE solving, the OpenMP-based parallelization is a good balance between performance and accuracy. Section II-A briefly describes the CPM. Section II discusses the parallelization of the various CPM operations, including the Monte Carlo lattice update, the advection-diffusion partial differential equation, cell division, and cell death. Why the hybrid approach is essential and advantageous is also analyzed there. The accuracy, performance and productivity of the parallel CPM are demonstrated in section III through a tumor growth problem.

II. HYBRID PARALLEL FRAMEWORK

A. Cellular Potts Model

The underlying structure of the CPM is a three-dimensional lattice. A "cell" σ is a simply-connected domain of lattice sites with the same ID number σ, which could represent a biological cell or a generalized domain entity, such as extracellular matrix. Cell attributes include the cell type (tumor vs.
endothelial), cell state (proliferating vs. dead), center of mass, volume, target volume, etc. The hierarchical structure of the CPM is illustrated in Fig. 1.

Fig. 1. The hierarchical structure of the Cellular Potts Model.

B. Domain Decomposition Method (DDM)

Domain decomposition generally refers to the splitting of a problem into coupled problems on smaller subdomains, which are partitions of the original domain. From section II-A, we see that a classic CPM is composed of two layers. The underlying layer is a regular rectangular lattice; each site in the lattice contains a number identifying which cell the site belongs to. In the computer implementation of the model, the sites in the lattice are represented by one array, whose indices correspond to locations in the lattice. The top layer is not a lattice, but a collection of different cells. If every site interacted with every other site, it would be hard to divide the computing tasks among a set of parallel processors. Fortunately, all the operations in the CPM have a local effective region. This means that to apply one specific operation O to the system (at site S) from time t to time t + Δt, we need only the information of the values at time t in a limited neighborhood of S. In this paper, this set of information is defined as the knowledge zone of site S regarding operation O. In PCPM, different operations have different kinds of knowledge zones; they are detailed

in section II-D. A natural way to address the parallelization of the CPM is the domain decomposition method: we divide the physical domain into several subdomains (one for each processor); each processor can then model the subsystem, storing only the data associated with its subdomain, and exchanging knowledge zones with the processors responsible for the neighboring subdomains. In PCPM, the data to be exchanged may include the underlying sites' ID numbers and the cell information associated with these IDs.

The Safe Cell Width (SCW) is chosen large enough that every cell, regardless of its shape, can be contained in an SCW^3 cube. In the parallel CPM, the width of the ghost zone is set to SCW. A typical value of SCW is twice the target radius, i.e. SCW ≈ 2 (3 V_{target} / 4π)^{1/3}.

Fig. 2. The different communication patterns in 1D-decomposition and 2D-decomposition. In 1D-decomposition, every ghost zone is continuous. In 2D-decomposition, either ghost zones #1 and #3 or ghost zones #2 and #4 are non-continuous.

Fig. 3. The local subdomain and ghost zone of the domain decomposition method. The width of the ghost zone is set to the Safe Cell Width (SCW).

If some subdomain γ's local area (dark area in Fig. 3) contains a site whose ID number is σ, the process responsible for subdomain γ is guaranteed to have all the information of cell σ. That is, there is enough information for the parallel operations presented in section II-D.

D. Parallel Cellular Potts Model

We choose a 1D-decomposition scheme in PCPM, based on two considerations. First, the communication overhead influences the performance of PCPM. In a 2D-decomposition of a 2D domain, as illustrated in figure 2, each subdomain has four ghost zones, numbered #1 to #4 respectively. Among those four ghost zones, two are memory-discontinuous. In MPI, though discrete scatter and gather are possible, the performance of continuous send and receive is much better than that of discrete scatter and gather. So memory recomposition is necessary before every neighbour communication.
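The memory-continuity argument can be seen directly in a row-major array (a small illustrative check, not the authors' code): a ghost zone made of whole rows occupies one contiguous block, while a column-shaped ghost zone is strided and must be packed before sending.

```python
import numpy as np

# Row-major (C-order) storage, as in a typical C/MPI implementation.
sub = np.arange(16, dtype=np.int32).reshape(4, 4)

row_ghost = sub[0:1, :]   # boundary rows, as in 1D-decomposition
col_ghost = sub[:, 0:1]   # a side boundary column, as in 2D-decomposition

# Whole rows are contiguous in memory; a column is strided.
row_contiguous = row_ghost.flags['C_CONTIGUOUS']   # True
col_contiguous = col_ghost.flags['C_CONTIGUOUS']   # False
```

In MPI one would either copy the strided column into a send buffer before each exchange or describe it once with a derived datatype such as MPI_Type_vector; either way, 1D-decomposition avoids the extra step entirely.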
If we use 1D-decomposition, the ghost memory is one continuous block, and we do not need to recompose memory before every communication. Second, the CPM is composed of two layers (section II-A). In the 2D- or 3D- decomposition