Probably a Paragraph About PILPS and the Need for and Difficulty in the Evaluation of Land

LAND SURFACE SCHEME INTERCOMPARISONS – A BRIEF HISTORY

The land surface is the locus of most human activities: it is where we live, grow our food and harvest fresh water. As a consequence, realistic simulation of continental near-surface weather and climate is of importance to people. A land-surface parameterisation scheme, or simply a land-surface scheme (LSS), is an algorithm for determining the exchanges of energy, mass and momentum between the atmosphere and the land surface. These exchanges are complex functions of a number of processes (physical, chemical and biological) that have a range of temporal and spatial scales. It is impossible, and probably unnecessary, to incorporate all the details of these processes into a numerical scheme; hence, land-surface schemes have been developed based on various simplifications. Depending on these simplifications, land-surface schemes in today's atmospheric models exhibit a wide range of complexity, from classic "bucket" models (e.g., Manabe, 1969) to detailed soil-vegetation-atmosphere transfer schemes (SVATs) (e.g., Dickinson et al., 1993; Sellers et al., 1996). A subset of SVATs takes into account the sub-grid scale variations of surface characteristics and/or atmospheric conditions (e.g., Avissar, 1992; Famiglietti and Wood, 1995; Entekhabi and Eagleson, 1989) to deal with the usually nonlinear relationships among the surface processes/variables. A third generation of LSSs combines the physical processes with the biophysical exchanges needed to represent photosynthesis, respiration and, in some schemes, decay (e.g., Xiao et al., 1998; Tian et al., 1999).

The perceived need to systematically analyse these diverse schemes motivated the World Climate Research Programme (WCRP) to launch the Project for the Intercomparison of Land-surface Parameterization Schemes (PILPS) in 1992 (Henderson-Sellers and Brown, 1992; Henderson-Sellers et al., 1993). PILPS’ goal is to enhance understanding of the parameterisation of fluxes of heat, moisture and momentum between the atmosphere and the continental surface in climate and weather forecast models. PILPS diagnoses the behaviours of participating land-surface schemes (hereafter, "PILPS schemes") in controlled experiments implemented in four phases. The first two phases include studies of scheme behaviour when driven in "off-line" (one-way feedback) mode by atmospheric forcings prescribed from GCM output (Phase 1) (Pitman et al., 1993) or from varied observational data sets (Phase 2) (Shao and Henderson-Sellers 1995; Chen et al., 1997; Wood et al., 1998). PILPS Phase 3 entails the diagnosis of land-surface schemes coupled to their "home" atmospheric host models, while Phase 4 concerns the analysis of results from coupling different land-surface schemes to a common host (Henderson-Sellers et al., 1996).

In practice, Phase 3 has involved the analysis of land-surface schemes in the Atmospheric Model Intercomparison Project (AMIP) models, organised as Diagnostic Subproject No. 12 (DSP12). AMIP, another initiative of the WCRP, is a widely implemented protocol for testing the performance of atmospheric general circulation models (AGCMs) under common specifications of radiative forcings and observed ocean boundary conditions (Gates, 1992; Gates et al., 1999). The first phase of AMIP (designated as AMIP I) was conducted between 1992 and 1996. The second phase of AMIP (AMIP II) is currently ongoing. From the perspective of land-surface specialists, AMIP affords a unique opportunity to study the interactions of a wide range of LSSs with their atmospheric host models. DSP12 studies of this type aim to address an overarching question: “To what extent does AGCM performance in simulating continental climates depend on the parameterisations of the coupled LSS?”

The importance of answering this question is readily appreciated, since the impacts of climate variability on human populations are most keenly felt at or near the land surface; nevertheless, a definitive resolution remains elusive for several reasons. First, there is the dilemma of how to reliably validate simulation performance, given the present dearth of multi-annual global land-surface data sets. Even if observational data were more plentiful, some inherent ambiguities would remain. Specifically, validation of AGCM continental climate does not verify the workings of the LSS per se since, in addition to the intrinsic properties of the LSS, the continental simulation is also affected by atmospheric forcings and by mediating land-surface characteristics. Despite these complications, there are preliminary indications that characteristic “signatures” of different LSSs can be detected in coupled climate and weather simulations, provided that suitable diagnostics are chosen (e.g. Gedney et al., 1999).

Atmospheric models, in general, and land-surface parameterisation schemes, in particular, are based on many empirical functions. The empirical nature of the schemes implies that they may not perform equally well when generalised (applied) to different environmental conditions. DSP12 aims to test this hypothesis by evaluating the performance and intercomparing the simulated land-surface climates of the AMIP II GCMs. To perform this test we propose stratifying the global data into different climate zones and evaluating the model simulations over each climate zone.
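The stratified evaluation proposed above can be sketched in a few lines. This is a minimal illustration only, not the DSP12 procedure: the zone labels, the choice of root-mean-square error as the score, the function name rmse_by_zone and all numerical values are assumptions introduced here.

```python
from collections import defaultdict
from math import sqrt


def rmse_by_zone(zones, simulated, observed):
    """Return {zone: RMSE} for paired simulated/observed values grouped by zone label."""
    squared_errors = defaultdict(list)
    for zone, sim, obs in zip(zones, simulated, observed):
        squared_errors[zone].append((sim - obs) ** 2)
    return {zone: sqrt(sum(errs) / len(errs)) for zone, errs in squared_errors.items()}


# Hypothetical grid cells: a climate-zone label and a latent heat flux (W m^-2) per cell.
# The numbers are illustrative and are not taken from any AMIP II model or data set.
zones = ["tropical", "tropical", "arid", "arid"]
simulated = [110.0, 90.0, 20.0, 30.0]
observed = [100.0, 100.0, 20.0, 20.0]

scores = rmse_by_zone(zones, simulated, observed)
```

Scoring each zone separately keeps a model that performs well in one climate from masking poor performance in another, which is the point of the stratification.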

A Off-line Experiments

The Project for Intercomparison of Land-surface Parameterization Schemes (PILPS)

The perceived need to systematically analyse these diverse schemes motivated the World Climate Research Programme to launch the Project for the Intercomparison of Land-surface Parameterization Schemes (PILPS) in 1992. The first two phases of PILPS consist of analyses of land-surface schemes in off-line mode; hence they unrealistically exclude the feedback of land-surface processes on atmospheric conditions. The third phase of PILPS, which involves analysis of land-surface simulations in coupled mode, is organised as Diagnostic Subproject No. 12 (DSP12) of the Atmospheric Model Intercomparison Project (AMIP). The first phase of AMIP failed to deliver much information about the land surface because the initialisation of soil moisture differed widely among models and because of poor quality control of results related to the continental climate. The analysis was also limited by the lack of global validation data sets. The second phase of AMIP (AMIP II) offers a new opportunity to evaluate the surface simulations of different global climate models under well-controlled conditions. A number of reanalysis data sets that estimate the land-surface variables for the corresponding AMIP II period have been considered for validation of model simulations.

PILPS Phase 1 and Phase 2 experiments were conducted off-line; i.e. there was no feedback to the atmosphere. In PILPS Phase 1, participating LSSs were integrated for many years using synthetic meteorological forcing from a global climate model. The same descriptions of surface vegetation and soil were used by all land-surface schemes. A single year’s meteorology was used for as many annual cycles as were required for a particular LSS to come into equilibrium with the prescribed atmospheric conditions, i.e. until mean changes in the surface heat and moisture storage were negligible. Therefore, the net radiant energy at the surface was not expected to differ considerably among the schemes; the differences should instead be manifested in the partitioning of this energy between sensible and latent heat fluxes and in the partitioning of precipitation between evapotranspiration and runoff.
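The equilibrium criterion described above (repeat one year of forcing until the year-to-year change in storage is negligible) can be illustrated with a deliberately simple single-store bucket. This is a sketch under stated assumptions, not any PILPS scheme: the 150 mm capacity, the 0.75 moisture-stress threshold, the 0.01 mm tolerance and the monthly forcing values are all hypothetical.

```python
def run_year(storage, precip, pot_evap, capacity=150.0):
    """Advance a single-store bucket through one forcing cycle; return end storage (mm)."""
    for p, ep in zip(precip, pot_evap):
        # Evaporation is limited by soil moisture below an assumed threshold.
        evap = ep * min(1.0, storage / (0.75 * capacity))
        # The bucket cannot evaporate more water than it holds plus what falls.
        evap = min(evap, storage + p)
        # Water in excess of the capacity leaves the store as runoff.
        storage = min(storage + p - evap, capacity)
    return storage


def spin_up(initial_storage, precip, pot_evap, tol=0.01, max_years=500):
    """Repeat the same forcing year until the annual storage change is below tol (mm)."""
    storage = initial_storage
    for year in range(1, max_years + 1):
        new_storage = run_year(storage, precip, pot_evap)
        if abs(new_storage - storage) < tol:
            return new_storage, year
        storage = new_storage
    raise RuntimeError("no equilibrium reached within max_years")


# Hypothetical monthly forcing (mm per month), reused for every annual cycle.
precip = [80, 70, 60, 50, 40, 30, 30, 40, 50, 60, 70, 80]
pot_evap = [20, 30, 50, 80, 110, 130, 140, 120, 90, 60, 30, 20]

dry_equilibrium, dry_years = spin_up(0.0, precip, pot_evap)
wet_equilibrium, wet_years = spin_up(150.0, precip, pot_evap)
```

In this toy case both the dry and the wet start reach the same end-of-year storage, with the dry start needing more cycles. More complex schemes with deep or slowly draining stores can take many more cycles.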

The results of the first set of PILPS Phase 1 experiments (1(a)) showed very large differences among the simulations. In an attempt to understand this large scatter, changes in the experimental design were made to ensure that models were physically self-consistent. This included checking the convergence of the results to steady state and the closure of the water and energy balances in the annual means. The results also showed that most land-surface schemes required many years to come to thermal and hydrologic equilibrium with the forcing meteorology and that the time needed and the final equilibrium state could differ depending on the scheme and the initialisation of moisture stores. Thus short-period simulations of land-surface disturbances will produce a range of results depending upon initialisation and surface scheme. After all refinements in the experimental design (Phase 1(c)), the range in latent heat flux in PILPS Phase 1 was about 100 W m⁻² for the tropical forest and about 50 W m⁻² for the grassland, with commensurate ranges in the sensible heat flux. Monthly and diurnal ranges were larger (e.g. Pitman et al., 1999) and were found in all diagnosed characteristics of the surface climate. The simpler (bucket) schemes tend to evaporate more than LSSs which include a representation of canopy processes or at least canopy resistance.

Phase 2 experiments, where observations from real sites were used, represented major advances over the first phase of PILPS in that comparisons were made not only among the schemes themselves, but also with observations. In Phase 2(a), one year (1987) of point meteorological data from Cabauw, the Netherlands (51° 58’ N, 4° 56’ E) was used to force the land-surface schemes. This site has saturated deep soil year-round, so it overcomes the problem of soil moisture initialisation (Chen et al., 1997). The range amongst the predictions as compared with observed turbulent fluxes and the problem of a non-zero total energy budget continued to be a concern, as did the arbitrary specification of soil depths and hence soil moisture stores (Slater et al., 2001).

In PILPS Phase 2(b) (Shao and Henderson-Sellers, 1995) a subset of PILPS schemes was run using measured meteorological forcing from the Caumont site of the HAPEX-MOBILHY experiment. The results were compared with observed surface fluxes for a 35-day intensive observation period and with soil moisture measurements taken over the year. Also, observed runoff from two adjacent catchments was used for the evaluation of the schemes’ results. The results showed that simulation of soil moisture by land-surface schemes is sensitive to the choice of soil hydraulic properties/parameters and that there were large differences among the schemes in predicting soil moisture. Results show the predicted seasonal cycles of total soil moisture for the 1.6 m soil layer compared with the HAPEX-MOBILHY measurements, with ±10% error margins. All schemes correctly describe the annual cycle of soil moisture in a qualitative sense. However, the range among the LSSs’ simulations is 70 mm to 100 mm, depending on the season. Most schemes showed greater seasonal variations than the observations, with large underestimation during the growing season. The bucket model performed differently from most of the SVATs, with much drier soils during the growing season. Shao and Henderson-Sellers (1995) show the partitioning of precipitation between annual evapotranspiration and runoff by different schemes. Their results reveal that there is a surface water imbalance in some of the simulations (deviation of results from a line with a slope of –1 and an intercept equal to annual precipitation) and that the differences in water partitioning among the schemes were quite large. Most schemes underestimate runoff and hence overestimate evapotranspiration compared to observations, probably because they neglect the subgrid-scale variations of soil moisture/precipitation.
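The imbalance test mentioned above follows from the annual water balance: when the change in storage over the year is negligible, evapotranspiration plus runoff should equal precipitation, so each scheme should plot on a line of slope –1 with an intercept equal to annual precipitation. A minimal sketch of that check follows. The scheme names, the 800 mm precipitation total, the 5 mm tolerance and the flux values are hypothetical and are not Phase 2(b) results.

```python
def water_balance_residual(precip, evap, runoff, storage_change=0.0):
    """Annual water-balance residual in mm; zero means the scheme conserves water."""
    return precip - evap - runoff - storage_change


ANNUAL_PRECIP = 800.0  # mm, hypothetical forcing total shared by all schemes
TOLERANCE = 5.0        # mm, assumed threshold for flagging an imbalance

# Hypothetical annual totals per scheme: (evapotranspiration, runoff) in mm.
schemes = {
    "scheme_a": (600.0, 200.0),
    "scheme_b": (650.0, 120.0),
    "scheme_c": (500.0, 300.0),
}

residuals = {
    name: water_balance_residual(ANNUAL_PRECIP, evap, runoff)
    for name, (evap, runoff) in schemes.items()
}
# Schemes whose point lies off the slope -1 line by more than the tolerance.
imbalanced = sorted(name for name, res in residuals.items() if abs(res) > TOLERANCE)
```

Here scheme_a and scheme_c partition the same precipitation very differently but both close the balance, whereas scheme_b leaves 30 mm unaccounted for.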

The Phase 2(c) experiment (Wood et al., 1998), performed over the Red-Arkansas River basin, aimed to evaluate the ability of land-surface schemes to reproduce measured energy and water fluxes over multiple seasonal cycles across a climatically diverse, continental-scale basin. It also aimed to test the ability of schemes to calibrate their parameters using data from small catchments and to transfer this information from the calibration basins to other small catchments and to computational grid boxes. The results showed that there were significant differences among the schemes with respect to partitioning water and energy on an average annual basis. For instance, the mean seven-year Bowen ratio averaged over 61 1° × 1° grid cells varied from 0.072 to 1.726, and the runoff ratio varied from 0.020 to 0.409. In general, the LSSs that calibrated their parameters based on the streamflow of some calibration catchments performed better than uncalibrated schemes in validation runs. The results strongly suggested that there was value in using catchment data to calibrate the parameters of land-surface schemes. One possible implication for global implementation is the desirability of establishing a global set of calibration catchments that could be used by land-surface schemes for parameter estimation.
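For reference, the two diagnostics quoted above have simple definitions: the Bowen ratio is the sensible heat flux divided by the latent heat flux, and the runoff ratio is runoff divided by precipitation. The sketch below computes both for a set of schemes and reports the inter-scheme range. The scheme names and all values are hypothetical and do not reproduce the Phase 2(c) numbers.

```python
def bowen_ratio(sensible_heat, latent_heat):
    """Bowen ratio: sensible heat flux divided by latent heat flux (same units)."""
    return sensible_heat / latent_heat


def runoff_ratio(runoff, precip):
    """Runoff ratio: runoff divided by precipitation (same units)."""
    return runoff / precip


# Hypothetical multi-year means per scheme: fluxes H and LE in W m^-2, R and P in mm.
schemes = {
    "scheme_a": {"H": 30.0, "LE": 60.0, "R": 100.0, "P": 800.0},
    "scheme_b": {"H": 45.0, "LE": 45.0, "R": 200.0, "P": 800.0},
}

bowen = {name: bowen_ratio(v["H"], v["LE"]) for name, v in schemes.items()}
runoff_frac = {name: runoff_ratio(v["R"], v["P"]) for name, v in schemes.items()}

# Inter-scheme range (minimum, maximum) of each diagnostic.
bowen_range = (min(bowen.values()), max(bowen.values()))
runoff_range = (min(runoff_frac.values()), max(runoff_frac.values()))
```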

In PILPS Phase 2(d) (Slater et al., 2001) the simulation of snow by land-surface schemes was investigated using data from Valdai, Russia. The results showed that the LSSs as a group can capture the general patterns of accumulation and ablation on an inter-annual basis, but weaknesses such as mid-season ablation exist. Among the schemes there is considerable scatter (e.g. a 40-day difference in the timing of the complete ablation of snow), but much of the scatter is systematic. It was found that the early part of the snow season, especially during ablation events, produces considerable scatter because in such low ‘snow water equivalent’ conditions the amount of energy incident on the portion of the grid assigned as snow varies widely across the schemes. The divergence amongst LSSs’ simulations, once established, tends to persist during the winter due to internal feedback processes or LSS-independent snowfall events.

All the results from the first and second phases of PILPS showed that the simple ‘bucket’ model behaves differently from more complex land-surface schemes. One of the main objectives of PILPS and DSP12 is to investigate the impact of land-surface parameterisation complexity on the simulation of land-surface climates by atmospheric models. To do this, we have employed a multi-mode land-surface scheme, the CHAmeleon Surface Model (CHASM) (Desborough, 1999), to investigate the simulation of land-surface climate as a function of LSS complexity. Desborough (1999) compares the partitioning of precipitation between evaporation and runoff as simulated by different modes of CHASM with observations from Valdai, Russia, superimposed on the range of corresponding values from different land-surface schemes, derived from PILPS Phase 2(d). The results clearly show that a considerable portion of the ranges of scaled runoff and scaled evapotranspiration found among the PILPS land-surface schemes for Valdai is due to variation in the complexity of the schemes and that, generally, the more complex schemes tend to simulate values more consistent with observations.

To investigate the impact of parameter calibration on scheme simulations, a multi-criteria calibration method (Gupta et al., 1999) has been applied. The results of the different modes converged towards the observations, with the greatest improvement occurring for the simplest mode. Although tuning of parameters enables the simple schemes to perform more consistently with the observations, it is preferable to choose complex schemes for GCMs as long as global data for parameter calibration are not available, or tuning at every point in a global model is deemed too costly in time or resources. Furthermore, calibrated parameters are a function of time-scale. While calibrating parameters may improve simulations for a given time-scale (e.g. annual mean), it does not necessarily improve the scheme performance at other time-scales (e.g. diurnal and seasonal variations) (Irannejad, 1999). Experiments with CHASM using data from the Thorne River Basin, used for PILPS Phase 2(e), also show that the simulation of evapotranspiration (latent heat flux) depends on the complexity of the land-surface scheme. Furthermore, it was found that calibrating the parameters (here based on streamflow measurements) improves the performance of the schemes, but does not fully eliminate the residual errors in the simulations.

B Coupled Intercomparisons

The first phase of AMIP (designated as AMIP I, circa 1990-1996) saw the participation of about 30 AGCMs that simulated the global climate for the period 1979-1988. From 1990 to 1996, Diagnostic Subproject No. 12 contributed analyses of AMIP I results of relevance to the characterisation of the continental climate. In the AMIP I global models, the land-surface schemes were solely boundaries for energy and water exchanges between the air and the land. Most land-surface schemes were well established, dating back to Manabe (1969), and were generally believed to work well. The AMIP I experiment was designed to include evaluation of the land surface. Despite this, there were a number of difficulties associated with the analysis. These included:

the failure to initialise soil moisture consistently among AMIP I models;
a rather restricted set of “standard output” variables;
the limited variety of LSSs in the AGCMs of that time;
poor quality control of archived results; and
relatively poor documentation of the LSSs which were represented in AMIP I.

Despite these obstacles and inhibiting factors, it was possible to undertake evaluations of the simulated land-surface climates. Although it was originally intended to concentrate on “validation”, the outcomes of diagnostic subproject 12 of AMIP I are perhaps best described as “learning on the job”. The AMIP I/12 results can be summarised into three main findings:

No “best” land-surface simulation could be identified: every surface climate was an outlier in some respect (Love and Henderson-Sellers, 1994);
results revealed serious errors of execution such as non-conservation of continental moisture and/or energy and pronounced trends in moisture stores which were traced back to coding/coupling mistakes and incorrect/inadequate initialisation (Love et al., 1995);
at a regional scale, the inter-model scatter in energy and moisture partitions among the coupled simulations was substantially greater than in comparable off-line experiments, suggesting that the hypothesis that two-way feedbacks between land and atmosphere dampen land-surface climate differences is incorrect or at least unproven (Irannejad et al., 1995; Qu and Henderson-Sellers, 1998).

The conclusions of AMIP I/12 were twofold: (i) land-surface codes were not working especially well, and certainly not as well as the AGCM owners and users believed; and (ii) the neglect of stomatal resistance parameterisation in the incorporated LSSs rendered the representation of many continental climates rather poor. Results also suggested that at least two years’ simulation, and often much longer, is required to overcome the difficulties of soil moisture initialisation. Energy and moisture differences over multi-year periods in AMIP I were caused by the philosophies underpinning the incorporation of deep soil processes, by poor experimental design, and by coding errors.