The Multiscale Systems Immunology project: software for cell-based immunological simulation
© Mitha et al; licensee BioMed Central Ltd. 2008
Received: 04 December 2007
Accepted: 28 April 2008
Published: 28 April 2008
Skip to main content
© Mitha et al; licensee BioMed Central Ltd. 2008
Received: 04 December 2007
Accepted: 28 April 2008
Published: 28 April 2008
Computer simulations are of increasing importance in modeling biological phenomena. Their purpose is to predict behavior and guide future experiments. The aim of this project is to model the early immune response to vaccination by an agent based immune response simulation that incorporates realistic biophysics and intracellular dynamics, and which is sufficiently flexible to accurately model the multi-scale nature and complexity of the immune system, while maintaining the high performance critical to scientific computing.
The Multiscale Systems Immunology (MSI) simulation framework is an object-oriented, modular simulation framework written in C++ and Python. The software implements a modular design that allows for flexible configuration of components and initialization of parameters, thus allowing simulations to be run that model processes occurring over different temporal and spatial scales.
MSI addresses the need for a flexible and high-performing agent based model of the immune system.
Computer simulations are becoming increasingly important in biological research, complementing both laboratory experiments and the venerable models of mathematical biology. While the line between mathematical modeling and simulation is somewhat indistinct, simulation typically incorporates a greater level of biological detail than mathematical models, facilitating the mapping between the biological and formal representations at the cost of increased complexity and reduced analytical tractability. This more direct correspondence allows subject-matter specialists with little mathematical experience to participate in the design and interpretation of computational experiments more easily; a consideration of great importance for the evolution and improvement of the model.
We have undertaken the development of an individual cell-based simulation system comprising components of the immune system, based on the relevant biophysics and intracellular dynamics, to help elucidate the early immune response to vaccination and natural infection. We believe that such a framework for immunological simulations can be used to explore the details of molecular and cellular dynamics, in much the same way as a new optical technology such as two-photon microscopy has radically advanced our understanding of leukocyte dynamics by permitting visualization of in vivo cell motility and interactions. The construction of realistic complex spatiotemporal simulations, however, is relatively new to mathematical and theoretical immunology, and poses significant software engineering challenges with which few researchers in the field will be familiar.
The immune response to an antigenic challenge such as vaccination consists of processes occurring over several temporal and spatial scales, from intracellular signaling (seconds to minutes), to the complex spatial reorganization that occurs in the draining lymph nodes some distance away (hours to days). The outcome depends on interactions between several cell types communicating by direct cell-cell contact and over short distances by diffusible small messenger molecules known as cytokines, and on the interactions of the motile immune cells, known as leukocytes, with the resident parenchymal cells. Adding to the complexity, all these interactions may be modulated by the local extracellular matrix, which guide the motions of leukocytes and provide a substrate upon which bound cytokines interact with these cells. While many of these individual interactions have been studied in detail experimentally, there is yet little real understanding of how these components add up to produce a coherent immune response. The reconstructive approach embodied by computational simulation holds promise for effecting the necessary synthesis.
We therefore describe the challenges and trade-offs inherent in building such a simulation, and our specific choices, in the hope that other researchers will gain a better understanding of the issues involved and consequently make more informed software engineering choices. Finally, we compare our package with two closely related simulation software packages with similar goals, namely Compucell [1–3] and Simmune [4, 5], that illustrate complementary approaches to engineering complex immune simulation software.
The first decision was that of programming language. We wanted a design driven by the immunological domain we are simulating, while maintaining the high performance necessary for such a complex simulation, which would have to deal with reaction-diffusion equations as well as the intracellular dynamics and interactions of thousands to millions of leukocytes and parenchymal cells. Our specific needs for a language included support for robust pseudorandom number libraries, multi-dimensional arrays and parallel computation. While languages such as Fortran and C fit these requirements well, they failed at our other requirement of a flexible high-level interface for description of the model domain. C++ includes C as a subset while providing high-level object-oriented facilities, and also has the advantage of being well-known and familiar to the scientific community. For better or worse, C++ has strong static typing, and this basic feature to a large extent dictates usage of the language. Strongly typed languages are comparatively inflexible and are poorly suited to rapid development, particularly in exploratory modes of programming. As a compromise, we opted to use a mixed language environment, where an interpreted dynamically typed language extends a statically typed compiled language. This hybrid system to some extent inherits the advantages of both languages.
For pragmatic reasons, we decided on a Python/C++ hybrid. Python is a popular object-oriented interpreted language in widespread use as a scripting language, but it also has good and ever-improving scientific support. It is also well-suited as a glue language, facilitating the connection of the base simulation code with databases and visualization systems. An important technical tool that made such a hybrid system feasible was the existence of the Boost Python library , which offers a sophisticated C++ API to convert functions and data types automatically to and from the two languages. One approach to exploiting such a hybrid system is to code most of the simulation in Python, using C++ only for the time-intensive routines. We were constrained, however, by the need for high performance and the consequent need to develop for implementation on Beowulf cluster architectures; the Python MPI wrappers are, for this purpose, unacceptably slow. We chose instead to create two functionally identical systems, one in Python and one in C++. Synchrony between the two was enforced by writing a suite of Python unit tests, which depending on a switch, tests either the Python directly or C++ system via Boost Python wrappers. This seems to enhance productivity: it is simpler to code and test prototypes in Python and, at the appropriate point in their development, port them to C++ than to develop and test prototypes from scratch in C++. We also take advantage of flexible and efficient arrays in Fortran by calling individual Fortran routines from C++ for low level matrix and vector computations.
Just as fundamental were the choices we made for our software development environment. We chose to use the distributed version control system Mercurial , because a distributed version control system offered superior support for multiple developers in terms of branching, merging and disconnected operations. For software builds, we chose Scons . Among other benefits, Scons automatically manages file dependencies, and allows us to customize our build scripts using the full power of Python. We elected to use Trac for bug and issue management, as it allows us to manage version control, bug tracking and documentation within the same web-based package. Not coincidentally, all the tools we chose are Python based, which makes it possible for us to extend these tools if necessary.
One of our design constraints was that the primary domain for our simulation was molecular and cellular immunology, not mathematics, physics or computer science. At the least, this meant that we had to be able to specify and describe a simulation run using the vocabulary of immunology. Another constraint was that we were constantly researching new prototypes, adding more functionality to the system and investigating more efficient data structures and algorithms for performance bottlenecks, so the system had to be extensible; ideally, we would be able to swap in new functionality for old with minimal additional effort. To meet these constraints, we decided on a hierarchical modular system. The implementation of such a system was naturally object-oriented, a paradigm both C++ and Python support. At the top level, we split the project into Catalog, Core and Graphics/Analysis modules. Within each of these were separate sub-modules and classes that actually implemented the functionality of the system. This organization allowed us to develop individual modules independently, so long as there was a standard API that allowed communication between modules.
The Catalog module is a repository of biological and physical information. Its entries include, for example, parameters governing the locomotion and behavior of specific leukocyte types and the diffusion coefficients and reaction rates of cytokines. The information in the Catalog is stored in a PostgreSQL relational database, and simulation classes are mapped to database tables using the strategy of mapping each inheritance tree to a table. To facilitate data entry into the Catalog, the Python library
SQLAlchemy citesqlalchemy was used to provide an object-relational mapping, making it possible for developers to create new class specifications without knowing SQL. Our plans call for the development of a web interface to the Catalog, so that experimentalists and other non-developers can contribute items to the database. After describing the Experiment module, we will describe how the Catalog is used to construct class entities in a simulation at run-time.
The Simulation class manages administrative tasks, with the ability to run or step through a simulation and log simulation details. It will also provide hooks to interface with run-time logging, computational steering and visualization systems when these are developed.
The Environment class represents a physical spatial volume, which may be simple, such as when modeling an in vitro environment, or complex, such as when modeling physiological tissue. Since our task allocation is done in a parallel environment using a spatial decomposition, each processor in a parallel run will contain its own Environment instance. Most of the work done by the Environment class is delegated to three private classes which handle the reaction-diffusion equations (Diffusion), collision detection (Occupation) and cellular immigration and emigration (Trafic). Since these classes are an implementation of the well-known Strategy design pattern, it is simple to upgrade the functionality provided by simply pointing to a new class. For example, we recently replaced the original forward Euler scheme with a Multigrid reaction-diffusion algorithm , resulting in an order of magnitude speedup. This was accomplished by simply replacing the Diffusion class in Environment. The Environment also serves as a container for various cell types (Cell), blood and lymphatic vessels (Vessel), and various cytokines and chemokines (Soluble_factor), and manages the interactions among these components.
Similar to Environment, the Cell base class is also a composite class, with distinct behaviors delegated to their own classes, and hence also extensible. A cell's functional state is given by, among other things, the local concentrations of various proteins in each of their post-translationally modified states (e.g., phosphorylated on specific residues). These characteristics of a protein are themselves dynamically regulated, and typically modeled using nonlinear ordinary differential equations. Importantly, these proteins determine the functional behavior of the cell, and it is sufficient to couple cell behavior modules to these protein concentrations to model cell stimulus-response behavior. The protein readouts are encapsulated in a State class, which tracks and exposes the fluctuating protein levels. In our current implementation, there is a mechanistic model that governs the protein levels, in which the rates of protein synthesis and/or degradation are governed by an explicit cascade involving model classes for Receptors, Inducers, Promoters, Genes and regulatory Proteins. It would be straightforward, however, to replace this mechanistic model with a probabilistic model, or even a much more detailed mechanistic model depending on the goals of the simulation. By having a flexible granularity, we can study an immune response at different levels of spatiotemporal resolution. When studying the behavior of individual cells, for example, we can increase the resolving power, while for studying large population behaviors, a coarser approximation of internal cell dynamics might be sufficient. Other behaviors like Motility and Secretion can then base the quality/quantity of their response at any time step on the internal environment represented by the State class and/or the external environment (e.g. concentration or gradient of various cytokines or cell-cell contacts). This design allows an extremely flexible stimulus-response coupling between the internal and external cellular environments and cellular behavior. Because of the loose coupling and the ability to change cellular behavior by "plug-in", an inheritance hierarchy to represent the different cell types is not necessary. Instead, we just "plug-in" different behavior modules for the various cell types to implement their observed behaviors. In practice, this means that specific behaviors for a particular cell type are implemented as instances of their respective classes that are given to the cell class constructor as parameters.
The Vessel class is used for the representation of the blood vessels that permeate host tissue. Leukocytes typically enter tissue by migrating across the endothelial barrier at capillaries, and egress at post-capillary venules to enter the lymphatic system. The rate of entry and egress of various cell types is regulated by pro-inflammatory cytokines, which regulate the adhesiveness and permeability of the blood vessels, as well as by chemokines, which influence leukocyte migratory properties. Our implementation of blood vessels is simple; the entry and egress vessels are stored in matrices representing positions in space, and the probability of cell trafficking at a point depends on the permeability and adhesiveness of the corresponding matrix entry, which in turn is sensitive to the local concentration of the relevant soluble factors. We currently assume that the blood vessels are homogeneously distributed; it is trivial to relax this assumption to reflect tissue histology more accurately by masking certain matrix entries.
The Soluble_factor class is essentially a container for a array of concentrations, with a few accessor and mutator methods for convenience and some basic attributes including its diffusion coefficient. An instantiation of the Soluble_factor class is made for every cytokine, chemokine and other soluble factor of concern in the simulation. The array is updated by the Diffusion class which takes into account secretion from cellular sources, adsorption by cells and extracellular matrix, binding reactions between soluble factors, and diffusion. There are several classes derived from Diffusion that represent different methods for solving the diffusion problem, including a forward Euler method and a backward Euler method which incorporates multigrid.
While the bundling of distinct behaviors into separate classes certainly simplifies the dynamical configuration of classes, their instantiation can be clumsy, accounting for a significant proportion of the code required to run a simulation. Biological cells, however once adequately characterized, can be reused in many different simulations, just as an experimentalist would use the same cultured cell lines (possibly ordered from a catalog!) for many different experiments. This was the idea driving the creation of the Catalog module, and here we will describe how it is used for instantiation of Experiment classes.
We started by writing generic code that could construct a class instance without advance specification of the instance type or even its constructor parameters. This design provided the flexibility to easily incorporate new classes in the database, or new subclasses of currently existing types. To do so, we adopted the object factories scheme, which allows flexible creation of object instances using C++ templates .
For visualization and analysis, we export simulation variables using the standard HDF5 format , which can then be visualized using our own wxPython and OpenGL based visualization application or imported into a statistical package such as R or Matlab for further analysis.
where g μ is a smoothly cut-off Gaussian with support over the volume of the cell. Here D i denotes the diffusion coefficient for soluble factor i, r ij is the rate at which soluble factor i is removed by interaction with soluble factor j, and λ i is the rate of removal of soluble factor i by other processes. These positive constants are stored in the Soluble_factor class. The secretion of soluble factor i through the surface of cell μ is approximated by a source term centered at the cell position x μ with secretion rate J iμ (x, t). Although the model in [13, 14] contains point sources, the existence of a solution to (1) requires a smooth source term as discussed in . In the simulation, the rate J iμ (x, t) is determined by the Secretion member of each cell. The source term for each soluble factor is a sum over the M individual cells indexed by μ.
The positive constants γ, c0, h0, σ0, q and χi are stored in the Motility class, which is a member of Cell.
The motion of the cells (2).
Solve the diffusion, (5), for Δt using the initial data u i (x, t), and x0 and v0.
Solve the reaction, (6), for Δt using the solution from the previous step as initial data.
Move the cells according to (2) for one time step Δt using the soluble factors from the previous step and the initial cell positions and velocities, x0 and v0, as initial data.
The error due to this splitting is (Δt). This result is well-known for ordinary and partial differential equations; we have extended it to the reaction-diffusion-stochastic system . Furthermore, if the numerical schemes for (5), (6) and (2) are at least (Δt), then the resulting error is (Δt). Here we see that operator splitting allows us to solve three tractable problems rather than one complex system. Splitting the reaction and the diffusion reduces a system of nonlinear partial differential equations to a system of linear partial differential equations for the diffusion, and a system of ordinary differential equations for the reaction. The partial differential equations can be solved using a backward Euler scheme with multigrid for the linear solve as described in  and . The multigrid scheme is implemented as a member function of the Multigrid_diffusion class which is derived from Diffusion. This class contains a necessary set of parameters, as well as functions to compute the relevant sinks and sources from the list of cells. The reaction equations are solved numerically using a first order semi-implicit Euler scheme. A separate scheme for the reaction is also a member function of the Diffusion class; it updates the entire list of soluble factors for one time step. Finally, the Langevin process for the cell motion can be simulated exactly as described in . The Langevin motility class stores a set of parameters that are relevant to the equations and numerical scheme. It also contains member functions that use the concentrations of the soluble factors to update the velocity and return the displacement of the cell for one time step. Although the soluble factors and the cells depend on each other, operator splitting allows us to update them sequentially instead of simultaneously and further modularizes the simulation.
We run unit tests using the Python unittest module, with the only novelty being that since all our C++ classes are compiled to a shared library readable in Python, we can drive both Python and C++ unit tests with the same test code. At a slightly higher level, the user may compare simulated behavior with analytical predictions where closed form solutions (either deterministic or statistical) are available. Examples include testing cell motility modeled as Langevin processes or simple diffusion from a single point-source secreting at a constant rate. There are also informal eyeball tests, in which simulations with simple configurations are run and visually checked to ensure that the resulting behavior is qualitatively consistent with experimental results. We have not incorporated code coverage analysis  or continuous build systems  yet, but these would certainly be sensible additions.
Due to the highly modular nature of our code, it is generally simple to replace an inefficient module with a more efficient one. Identification of bottlenecks is done using the callgrind tool widely available on Unix platforms, and the kcachegrind GUI to visualize the resulting profile. Examples of such high level optimizations include the replacement of a forward Euler diffusion scheme with a Multigrid scheme, and replacing collision detection by looping over all cells with a grid-based method implemented as a hash table. At a lower level, we have removed unnecessary copies by using references where possible for large data structures, and rearranged code to minimize unnecessary looping. So far we have avoided doing micro-optimizations (e.g. loop unrolling) believing that such tasks are best left to the compiler. One simple optimization we did incorporate was to specify the appropriate compiler flags for each machine architecture, allowing the compiler to make optimal use of pipelining and vectorization.
Running a complex simulation inevitably requires a fairly sophisticated supporting computer software infrastructure. The primary purpose of the MSI software is to provide such an infrastructure for running simulations of the immune response, making it possible for users inexperienced in simulation methods to conduct, visualize and analyze complex computational experiments. The functionality provided by the software falls into three main categories – providing plug-in components for an immune simulation that can be assembled to form complex simulations, data and parameter management via a relational database, and a graphical user interface for visualization and control.
The MSI software provides components that model both the physical and biological aspects of an immune simulation as a hierarchy of loosely coupled classes. We provide physics-based classes that model chemical reaction and diffusion, contact forces and collisions, and biology-based classes that model the environment, cells and soluble factors. Some of these may be composite classes and have sub-components that can be plugged in to provide new functionality. Cells can, for example, be fitted with different model classes that provide specific motility, secretion, sensory abilities etc.
The development of useful biological models requires extensive parameterization, typically done by mining the available literature or conducting experimental measurements. Within the MSI system, these data are managed by storing the parameters that characterize specific biological entities (e.g. cell types such as macrophages, neutrophils, fibroblasts, etc.) in a relational database. The database is closely integrated with the model classes in the code, so that biological entities can be instantiated by the appropriate database name lookup. The integration of the database catalog of biological entities and model classes makes it possible to specify a computational experiment using the appropriate domain by simply specifying the appropriate environment, cell types and soluble factors that must be present in the simulation. Clearly, this framework also makes re-use simple; the same cell types used in one simulation can be used to populate another one with, say, a different environment.
Another aspect of data management is the collection of the vast quantities of numerical and descriptive data that can be generated in simulation runs. To support this functionality, we use the HDF5 hierarchical data format developed by NCSA, which is highly efficient, scalable and provides a utility library for data access and manipulation. The HDF5 files are stored for off-line analysis and visualization. Paraview  is an example of a tool that can read and parse the stored HDF5 files to generate an openGL-based rendering of the simulation.
▹ Number of leukocytes.
DBConnStr ← (user, dbname, host, port, password)
▹ Initialize database.
DBMaker ← DBMaker::Instance(DBConnStr)
Environment ← DBMaker.CreateEnvironment("testenv")
Source1 ← DBMaker.CreateCell("source")
▹ Creating and adding point sources to Environment.
Source1.SetPosition(x1, y1, z1)
Source2 ← DBMaker.CreateCell("source")
Source2.SetPosition(x2, y2, z2)
Source3 ← DBMaker.CreateCell("source")
Source3.SetPosition(x3, y3, z3)
TNF ← DBMaker.CreateSolfac("tnf")
▹ Creating and adding soluble factors to environment.
STNFR ← DBMaker.CreateSolfac("stnfr")
MCP1 ← DBMaker.CreateSolfac("mcp1")
Vasculature ← DBMaker.CreateVasculature("testvessel")
for i ← 1, n0 do
▹ Creating and adding leukocytes to environment.
Cell ← DBMaker.CreateCell("macrophage")
procedure Main(dt, numsteps)
Simulation ← new Simulation(Environment)
for i ← 1, numsteps do
▹ Log of simulation results.
We have also recently completed a series of computational experiments which illustrate how the software can be used to gain insight into a biologically relevant process – the mechanisms regulating inflammation following a tissue insult. The insult triggers the release of chemotactic soluble factors (chemokines) by parenchymal cells (fibroblasts), which in turn recruit immune cells (leukocytes). Using multiple simulation runs to systematically perturb the system variables, we studied the feedback regulation of tumor necrosis factor alpha (TNF-α) by its shed receptor soluble TNF receptor (sTNFR) and its effect on the recruitment and spatial distribution of the leukocytes over time. The results have been published  and serve to illustrate the kinds of simulations enabled by the MSI software package.
The two most closely related software packages that simulate multiple cells interacting in a dynamic environment for immune simulation are Compucell [1–3] and Simmune [4, 5]. Compucell uses the cellular Potts model to model cell dynamics by partitioning space into pixels, each of which is regarded as a cellular automata. The overall behavior of individual cells, modeled as a collection of pixels, depends on minimizing the total effective energy of the system subject to constraints that channel the dynamics. In contrast to our direct mapping of cell behavior modules to the underlying biology, the specification of cell behaviors in Compucell is indirect, via constraint terms in the energy function. Such an abstract specification limits its appeal to experimental immunologists, since it is difficult to translate the simulation to detailed biological mechanisms. Simmune is also an agent-based model with a sophisticated user interface design that allows a graphical specification of the simulation biochemistry and inter-cellular interactions by biologists. In particular, Simmune uses a GUI which allows cellular behavior to be copied across cell types. Also, it simulates explicit reaction-diffusion across all scales, and uses an adaptive integrator to separate the different timescales with a user-defined precision. We agree with Simmune's philosophy that immune simulations must have models that map closely to the underlying biology to be useful to experimentalists. Our approach is slightly different, in that we provide 'default' collections of cells and cellular behaviors in the Catalog, that can be used as a template and easily modified to simulate experimental perturbations, making it simpler to set up typical simulated experiments of cellular immune responses. Currently, the software emphasis is also different – Simmune focuses more on biochemistry, while we have paid more attention to the biophysics of cell motility and diffusion from a motile source. Due to the complexity and scope of immune systems, we strongly believe that multiple complementary approaches to large-scale quantitative immune simulations will continue to develop and cross-fertilize each other's development.
The code is currently functional and has been tested on Debian, Ubuntu, OpenSuSE and Mac OS X platforms. We are hereby releasing it as an Open Source project and invite interested developers to participate in the project. Source code, documentation and a community wiki can be found at the Multiscale Systems Immunology website . In the near future, our priorities are to populate the Catalog with additional cell types, soluble factors and environments, so that more realistic and complex immune simulations can be conducted. We also plan to fully parallelize the code using MPI so that larger-scale simulations can be run in a reasonable time on a moderate-size Beowulf cluster. We are also prototyping more capable visualization frontends based on the Visualization Toolkit (VTK) , as well as the possibility of real-time visualization and control with computational steering techniques. We intend to provide the means to simulate immune processes in heterogeneous environments with virtual tissue architectures based on histological analysis of biological tissues. The primary challenge for this task will be the development of efficient methods for the integration of reaction-diffusion equations in irregularly bounded spaces. Our group is actively engaged in research in this area. Finally, while we use the software exclusively for simulating immune responses, the software is sufficiently generic to run simulations of any cell population phenomena with minimal adaptation.
There are many software engineering techniques that can enhance productivity and make more pleasant (or at least less painful) the development of large, complex scientific simulations. Since most biological modelers today come from a more theoretical background and their coding experience is often limited to smaller single-developer efforts, these important techniques may not be familiar to researchers who could most profit from them. We have here described some of the tools and methods that we have found useful in the construction of an cell-based immune simulation, including setting up sensible version control and build systems, and testing suites; the use and integration of a relational database, modular decomposition of the project, object-oriented design incorporating ideas from domain-driven design and the patterns community, and the use of standard formats to facilitate communication between subsystems (RDBMS, HDF5). Perhaps more unusually, we have opted to use a mixed language approach to software development – this choice has worked well for us, and we are comfortable recommending it to others. We hope that this paper will help other groups develop better, more sophisticated simulators for molecular and cellular biology, and allow a more integrated understanding of organisms.
Project name: The Multiscale Systems Immunology (MSI) Project.
Project home page: .
An apt repository for the MSI code is available for readers using the Debian or Ubuntu operating systems. This repository currently contains the source code and binary packages for Debian etch (i386 and amd64) and Ubuntu feisty (i386 and amd64). The source code can also be obtained here as a tar.gz file [Additional file 1]. For further information about the apt repository, see the README at  (Documentation → Installation from the main MSI project page).
Operating system(s): Linux (tested on Debian 4.0 (etch) and Ubuntu 7.04 (Feisty Fawn)).
Programming language: Python, C++ and Fortran.
Other requirements: See README in source for code dependencies.
Application Programming Interface
Hierarchical Data Format 5
Integrated Development Environment
Message Passing Interface
National center for Supercomputing Applications
Partial Differential Equation
Relational Database Management System
Structured Query Language
We wish to thank Bill Rankin and John Pormann for their contributions in the area of High Performance Computing, and Rachael Brady for her contributions in the area of Scientific Visualization. We also wish to thank Bill Allard for important contributions to the mathematical methodology. This work was supported by the NIH through grant R21 AI058227-01 and research contract HHSN268200500019C.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.