Presented at Python Community Conference (PyCon) Washington, DC 28 March 2003 P. 1/20
A Population Simulation System Written in Python
Tom Olivier
Green Creek Paradigms, LLC
Schuyler, VA
Introduction
This paper describes a population simulation system being developed in Python. This system is intended as a tool for biological researchers and conservationists who seek to model a range of demographic, genetic and other processes in animal populations of several hundred to a few thousand individuals. A goal of this system is modeling of populations subdivided by social groupings, space or landscape features into subpopulations. These subpopulations may be unstructured internally or may consist of social groups with further internal subdivisions. The system includes detailed portrayal of life histories of individuals and their kinship relationships.
To be useful to biologists working with a range of real populations, a system of this type must be capable of modeling variations of important population processes. For example, it should treat animal survival processes that are population density dependent and independent. The system should be configurable for different forms of gene transmission (for example, diploid inheritance common in mammals and haplodiploid inheritance seen in some social insects).
The intent of the author is to provide a modeling system of Python classes and modules that can be customized with modest programming effort to meet many of the preceding biological modeling challenges. My goal in this paper is to describe the design, implementation and experience with operation of the system to date.
Motivation
Mathematical models have played an important role in many areas of population biology, such as genetics and demography, throughout the twentieth century. With the development of digital computers, computer simulations appeared as an alternative to mathematical analysis. However, for decades after they first appeared, many scientists viewed computer simulations as a tool to be used only when mathematical treatments were not possible.
This attitude began to change around 1980, driven by several factors. First, empirical researchers in biology were finding that real populations often exhibited complex structures and processes that could not be represented adequately in rigorous mathematical models. For example, classical mathematical population genetic models often assumed for analytical convenience that regional populations were subdivided into permanent local subpopulations of fixed size, with essentially no significant internal structure (Crow and Kimura, 1970). However, field biologists often found populations of animals, such as Old World monkeys, divided into local groups that were both highly complex and dynamic. In contrast to mathematical models, data structures native to computers conveniently represent many of the complexities observed in real populations. For example, syntactically organized character strings, can represent complex, dynamic social group structures found in baboons, rhesus monkeys and other primate populations (Olivier, 1985). Finally, cheap, fast, microcomputers have become available, making simulation more accessible as a routine modeling tool for biologists.
For many biological modeling problems, computer simulations now seem a natural choice. This trend perhaps has come to a head with Stephen Wolfram’s text, A New Kind of Science, in which he argues that digital simulations are broadly preferable to analytical mathematical models (Wolfram, 2002).
Predecessors of the system described in this paper were written during the 1980’s in BASIC, Forth and C. Object oriented methods available today in Python and many other languages offer broadly different options for representing simulated populations than did the procedural languages of the past. Scripting languages offer data type flexibility not easily available in compiled languages. However, flexibilities like this often come with new prices and pitfalls.
The system described here now consists of over 20 Python modules and 5,000 lines of code. During development, it has been used to construct models of two different kinds of populations – a simply subdivided hypothetical mammalian population and a monkey population with complex local group organization. More testing, some refactoring and extensions remain to be done, but the system is at a point that differences between building a modeling system of this type in Python and many other languages are plainly evident.
Design and Implementation
In this system, keyed character strings (also referred to as key strings), equivalent graphs, or indented text structures provide data structures for representing subpopulations. Keyed strings also facilitate syntactic analysis of correctness of subpopulation structures generated in simulations. Alternately, structures portrayed in keyed strings can be represented in graphs implemented in dictionaries in Python. These usually are easier to read by humans and are used as principal storage medium for keyed structures in the current system. Indented text can be used to represent population structure as well. This form of keyed structure information seems most easily read by humans and routinely is used for input and output of subpopulation structure information. In this system, each element of a keyed structure is given a unique tuple identifier. The first element of the tuple is a text string that identifies the element type. The second is a decimal integer that identifies the specific instance of that type. See Olivier (1985) and Appendix A for further discussion and examples of keyed character strings, graphs and indented text structures.
The system targets creation of models that run as unattended batch applications. Text file input and output are used to the extent practical. Standard files supported by the system include input of initial simulation state, output of terminal simulation state, console messages regarding run progress, a temporal simulation event log and an error log that captures and reports on unhandled exceptions that arise in the course of a simulation. A design guideline stipulates that system classes provide input and output methods sufficient for their use. Text tags identify and format data input and output, improving human readability and facilitating identification of ill-formed input.
The system includes some utility modules. One module, texio, supports tagged input of data in various formats including lists and dictionaries. Another, randy, wraps the rv random number generator library (Frohne, 1998). This allows simple generation of random numbers fitting many widely used probability distributions within simulations.
A single execution of a simulation application steps at least one population through a series of discrete time periods, with births, deaths, migrations and other biological events potentially taking place in each time period. Each population contains subpopulations, representing occupants of a different spatial neighborhoods or social groups. Attributes of simulated individuals are read from the input file initially or generated in the simulation when new individuals are born. Individual attributes are deleted at simulated death. A kinshipserver class instance in each population stores parentage records of individuals in a dictionary. The system retains kinship links after death of simulated individuals for analyses of familial relationships. Such relationships can affect the course of processes in the model and are of direct interest to geneticists. In one of the examples considered in this paper, the rsb model, family relationships traced by female descent are particularly important. This model uses an instance of class uniline, derived from class kinshipserver, to facilitate identifying and utilizing information on female familial relationships.
A hierarchy of class objects embedded within other objects is used to provide a basic framework in the modeling system. A simulation class provides an envelope in which other objects exist. The simulation class opens and closes input files, manages the procession of simulated time periods, provides random number generation and captures and logs exceptions not dealt with at lower levels. A simulation normally contains a population class instance and might contain more than one population.
Embedded within each population instance are objects that manage many kinds of information including individual attributes, familial relationships, fertility and mortality rates. A template server instance in each population reads the keyed string template for allowable subpopulation structures from the simulation initialization file. The template server automatically creates a parser that allows it to check the structures of strings submitted to it at run time for consistency with the template. Methods present in the system allow simple translation of subpopulation keyed graphs to keyed strings. These keyed strings then can be forwarded to the template server for syntactic checking after modeling of biological events that could disrupt subpopulation structures. Discrepancies between template and keyed string structures raise exceptions that include detailed information regarding the discrepancy. Such exceptions can be used to trigger corrective restructurings of the subpopulation organizations or may be allowed to go unhandled and trigger stoppage of simulations by the simulation objects. A dictionary embedded in each population instance stores subpopulation objects. Each subpopulation object contains a keyed graph object that stores its composition and organization in a keyed structure graph. An identification number generator, an instance of class idnserver, is created and initialized within each population instance at run time. It generates unique id numbers for new subpopulations, group elements and individuals that do not clash with numbers identifying elements of same types entered into the simulation from the initialization file.
At present, modules providing classes that model diploid genetic inheritance and simple infectious disease transmission exist and can be included in a population should modeling of these kinds of events be desired. The system is designed to support simultaneous modeling of populations of different, potentially interacting species populations within a single simulation. Multi-species population models allow studies of ecological competition, predator-prey interactions and parasite-host relationships.
Class inheritance is used to specialize objects to the requirements of models of populations of particular species. Classes base_popframe and base_subpopframe provide some basic facilities for input, output, information management and key structure verification. Base classes provide no modeling of biological processes. Classes mammal_popframe (derived from base_popframe) and mammal_subpopframe (derived from base_subpopframe) add functionality to assist modeling of age-specific survival and reproduction widely seen in mammals, as well as emigration and immigration processes approximately like those seen in some mammalian populations. Module mammal_demes includes population and subpopulation classes, derived from their mammalian framework counterparts, that complete the functionality required for modeling of a simply subdivided hypothetical mammalian population. Module rsb also provides population and subpopulation classes derived from the mammalian framework classes. However, the rsb classes provide the more elaborate capabilities needed to model rhesus monkey or savanna baboon populations. One can imagine different sequences of derivations from base population and subpopulation frameworks to model populations of other kinds of organisms. For example, a bees and wasp framework could be developed to provide features matching observations on these biologically interesting and important animals.
One aim of the system is to provide means of representing spatial effects in population processes. At present, the system deals with subdivision but provides no support for spatial distance effects. A modest treatment of distance effects could be obtained by adding geographic coordinate attributes to each subpopulation, permitting calculation of distances between subpopulations and modeling of the influence of distance on processes such as migration of individuals among subpopulations. Simple network structures attached to each subpopulation also could be used to reflect connectedness between subpopulations as affected by location.
Ideally, one would like to represent subpopulation locations in space along with local landscape variations in altitude, vegetation, water resources and other environmental features known to influence organisms. One would like to incorporate into simulations the interactions between the landscape states and population processes and states. Modeling such landscape and population interactions would seem particularly valuable to conservation-oriented land use planning and management.
Geographic information systems (GIS) provide many of the desired landscape portrayal and analysis capabilities. As a means of exploiting the ability of GIS in this simulation system, I have constructed a Python interface to the Idrisi GIS (Clark Labs, 2001). This package provides extensive facilities for visualization and analysis of many forms of landscape data. Idrisi supports the Microsoft Component Object Model (COM) interface (Rogerson, 1997). I constructed a class with a Python interface to Idrisi using the win32com package (Stein and Hammond, 2001). This interface now permits calling of IDRISI functions from Python programs. However, a derived interface containing methods particularly adapted to the needs of population simulation remains to be developed.
To date, the system has been developed on a Microsoft Windows 2000 platform. One goal of development is to produce a system that for the most part also will run under the Linux operating system. While most modules are expected to port in straightforward fashion, some, such as the interface to Idrisi (a Windows program), are likely to remain platform specific.
Examples
Appendices present code samples plus initial and terminal output from simulations of two different kinds of populations. Appendix B discusses a simply subdivided hypothetical mammalian population, mammal_demes. The mammal_demes listing shows modeling in each time period of birth, death, ageing and migration between two subpopulations. Sections from the input and output files illustrate use of tags to format files, use of indented text to represent subpopulation compositions, entries for values of parameters that control population process, attributes of simulated individuals and their familial links. A close comparison of the input and output contents shows that in the single simulated time period, some individuals have appeared through births, some have disappeared through death and some have moved between groups.
Samples from a second model,rsb, are presented in Appendix C. The rsb model simulates populations of rhesus monkeys or savanna baboons. Much like the mammal_demes model, births, deaths, migration and ageing occur in the rsb model. However, unlike the simpler mammal_demes model, rsb subpopulations are social groups that are further subdivided internally into a segment of immigrant adult males that relate in a dominance hierarchy and a natal segment composed of matrilineally related females born in the group and their immature offspring. Groups may fission when large (largely along matrilineal lines) or fuse together if they become small. Matrilines also may fission if they become large. A comparison of rsb inputs and outputs illustrates the more complex group structures, more process parameters and changes in groups present as well as individuals present. Appendix A provides details on rsb subpopulation features and their representation in keyed structures.
Python as a Language Choice
So far, Python appears well suited for implementation of a system of this type. Object oriented programming, with its objects, attributes and methods, seems better suited than strictly procedural languages to modeling biological entities that contain information, are related in complex ways and engage in a variety of behaviors. Lists and dictionaries, with their automatic garbage collecting and flexible typing, make quick work of tasks that take far longer to accomplish in low-level languages. The high logical level of Python language features means that fewer steps are required to map the concepts in a biologist’s mind into an implemented simulation. Additionally, the code for such a simulation is more likely to be intelligible to scientists who are occasional programmers. Use of an interpreted language must come with a price regarding execution speed. However, so far execution speed appears sufficient.
Next Steps
Additional testing, extensions and population modeling applications are planned in the coming months. Development of a geographic information system interface class with methods specifically useful for population simulations is a high priority. I intend to release the system with documentation under an open source license by the end of 2003.
References
Clark Labs. Idrisi32. Vers. I32.21. Worcester, 2001.
Crow, J. and M. Kimura. An Introduction to Population Genetics Theory. Harper and Row: New York, 1970.
Frohne, I. rv. Vers. 1.1 Nov. 1998. <
Olivier, T.J. “Use of Keyed Character String Data Structures and Operators in Models of Primate Groups,” J. Theor. Biol. 115, 539-549, 1985.
Rogerson, D. Inside COM. Microsoft Press: Redmond, 1997.
Stein, G. and M. Hammond. Win32com. Ver 145. 2001. <
Wolfram, S. A New Kind of Science. Wolfram Media: Champaign, 2002.
Appendix A.
Use of Templates, Keyed Strings, Graphs and Indented Text
To Represent Composition and Structure in Subpopulations
Keyed strings flexibly represent simple subpopulation structures and complex subpopulation structures. Keyed strings begin with a text label of letters and perhaps digits, followed by an open parenthesis, then some content and then a terminating closed parenthesis. The text label specifies the type of population entity represented by the keyed string (a subpopulation or group or a biologically recognizable subset of a subpopulation or an individual animal). Keyed strings may contain other keyed strings in their content. Thus, nested, inverted tree structures can be represented in the linear sequence of characters in a keyed string. In the keyed strings used in this simulation system, an identifying number unique in the system for elements of that label type occurs immediately after the opening parenthesis of each keyed string. Thus, in this system each keyed string is uniquely identified by a tuple composed of its text type label and an identification number.
Subpopulation structures vary greatly between different animal species. Subpopulations in some species may be simple while others may exist in social groups with different kinds of recognizable subgroups. These complex social structures are not static but are conserved in the face of many changes, such as births, deaths and movements of individuals. One of the challenges in modeling complex, dynamic populations is developing models of population events that capture both the dynamism of the social groups and conservation of their organization. To this end, the system provides means of comparing subpopulation structures against templates that specify subpopulation structures anticipated in simulations. These templates are similar to, but not identical to, keyed strings. They play a role similar to the patterns used in text search processes. Templates are provided by modelers and read at the start of simulations for recurrent testing of subpopulation structures as simulations proceed.