SPECIFICATIONS OF THE PEDOTRANSFER RULES PROJECT.

______

Joel DAROUSSIN

Unite de Science du Sol

Institut National de la Recherche Agronomique

INRA - Avenue de la Pomme de Pin

BP 20619 - Ardon

F-45166 OLIVET CEDEX

France

Tel: +33 (0)2 38 41 78 42

Fax: +33 (0)2 38 41 78 69

E-mail:

Last update of this file: 19/12/94

Last update of this file: 05/02/95: minor changes and fixes.

Last update of this file: 25/03/96: version 2.0 of the project.

Last update of this file: 18/07/96: version 3.2 of the SGDE.

Last update of this file: 23/07/96: new computation of confidence levels.

SUMMARY:

------

1 - Objective of the work:

2 - General specifications:

2.1 - Dataset, objects, attributes, values, NODATA:

2.2 - Rules, occurrences, input attributes, output attributes, facts, inferences:

2.3 - Wild cards:

2.4 - Confidence levels:

3 - Technical specifications:

3.1 - Expert type rules:

3.2 - Class type rules:

3.3 - Other rule descriptors:

3.4 - Dataset:

3.5 - Naming conventions:

3.6 - Class rule structure:

3.7 - Expert rule structure:

3.8 - Item coding and constraints:

3.9 - Rules data:

4 - Project organization:

1 - Objective of the work:

------

The purpose of this work is to implement the capability to provide information necessary to a particular field of interest (in the present case: environmental studies) using that information which is available in another field of interest (in the present case: description of soils). Transcription of information from one field to the other is done by applying transfer rules (in the present case: pedotransfer rules).

Implementation of the system will take place within the Arc/Info Geographical Information System (GIS) software package, using its macro programming language (AML: Arc Macro Language). The reasons for this choice are 1) the database of available information (soils description) is stored and managed within Arc/Info, 2) the resulting information (environmental parameters) has to be stored and managed within Arc/Info for map display and database query purposes, and 3) this implementation has to be made within time and means limits which do not allow for acquisition of, and staff training in, specialised software.

Although the implementation is tailored for general utilization (within the context stated above), it is firstly meant to provide the European Environmental Agency with spatialized environmental indicators that can possibly be derived from the Soils Geographical Database of Europe at Scale 1:1,000,000.

2 - General specifications:

------

2.1 - Dataset, objects, attributes, values, NODATA:

All the available information in the "from" field of interest is stored in a so-called "dataset" (e.g. the soils dataset). The dataset is physically stored as a dataset Info file.

The dataset holds information about a number of "objects" (e.g. a number of soils such as Luvisols, Cambisols, etc). Each object is physically stored as a line or record in the dataset Info data file.

The objects in the dataset have a number of characteristics called "attributes" (e.g. soils have a soil name, a texture, etc). Each attribute is physically stored as a column or "item" in the dataset Info file.

Each object in the dataset has a particular "value" for each one of its attributes (e.g. Rankers have a Medium texture). Each value is physically stored at the intersection of the object's record and the corresponding item in the dataset Info file.

Values generally follow a coding schema before being physically stored in the dataset (e.g. soil name Ranker is encoded and stored as U, Medium texture is stored as 2, etc).

Some objects might not be fully described when some of their attributes are unknown (e.g. for such a soil, texture is not known). An unknown value for an attribute is called a "NODATA" value. As there is no pre-defined way of coding and physically storing NODATA values in Info files, each attribute coding schema will have to make provision for a NODATA value code (e.g. 0 will mean unknown texture).
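To make these terms concrete, here is a minimal sketch, in Python rather than in the Info/AML environment actually used, of a dataset with two objects; the attribute names and codes (SOIL, TEXTURE, the NODATA code 0) are only illustrative:

NODATA_TEXTURE = "0"   # the TEXTURE coding schema reserves 0 for "unknown"

dataset = [
    # one dict per object (one record per object in the dataset Info file);
    # keys are the attributes (items), values follow each attribute's coding schema
    {"SOIL": "U", "TEXTURE": "2"},               # e.g. Ranker, medium texture
    {"SOIL": "Bd", "TEXTURE": NODATA_TEXTURE},   # texture unknown for this object
]

for obj in dataset:
    if obj["TEXTURE"] == NODATA_TEXTURE:
        print("texture is NODATA for soil", obj["SOIL"])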

2.2 - Rules, occurrences, input attributes, output attributes, facts, inferences:

Soil Science experts of the project working group provide the system with so-called "pedotransfer rules". A "rule" is the means by which new, needed information describing an object of the dataset can be derived - i.e. "inferred" - from existing available information describing the object - i.e. factual information or "fact" - using expert knowledge in the field (e.g. the depth to rock of a particular soil can be inferred from its known soil name, parent material and phase).

A set of rules holds all the usable knowledge to derive all the new needed information from an available dataset. It is physically stored as a rules Info database.

A rule holds all the usable knowledge to derive one single piece of new information from a fact (available information about an object). A rule is physically stored as a rule Info file.

A rule can be seen as a statement of the form:

IF <available information is ...> THEN <new information is ...>

ELSE IF <available information is ...> THEN <new information is ...>

...

ELSE IF <available information is ...> THEN <new information is ...>

Each line in this statement is called an "occurrence" of the rule. An occurrence is physically stored as a line or record in the rule Info file.

An occurrence can be seen as a statement of the form:

IF (or ELSE IF)

<factual value for attribute i is w

and factual value for attribute j is x

...

and factual value for attribute n is y>

THEN

<assign value z to a new attribute m of the object>

where attributes i to n provide the factual information (values w to y of an object), and attribute m provides the new - inferred - information (value z).

Attributes providing the factual information are called the "input attributes" to the rule. The attribute providing the new - inferred - information is called the "output attribute" from the rule. The input attributes are physically stored as columns or "input items" in the rule Info file. The output attribute is physically stored as a column or "output item" in the rule Info file.

Example:

IF <soil name is Luvisol and parent material is "any" and phase is "any">

THEN <depth to rock is deep>

ELSE IF <soil name is Orthic Luvisol and parent material is Marl

and phase is "any">

THEN <depth to rock is very deep>

ELSE IF <soil name is "any" and parent material is "any" and phase is Lithic>

THEN <depth to rock is shallow>

As with the dataset, "values" are physically stored at each intersection of each occurrence's record and each input and output item in the rule Info file.

Input items in a rule must have the same definition (name, type, size...) and coding schema as their corresponding item in the dataset.

An "inference" is the action of producing a new derived information to an object according a) to the available information it provides, and b) to the rule which is activated. It proceeds in 5 steps:

1. The input attributes are identified in the rule.

2. The values for these attributes are retrieved from the object in the

dataset and constitute a fact.

3. The occurrence of the rule which matches the fact is searched for.

4. The output attribute definition and value are retrieved from the

matching occurrence

5. and are added to the object in the dataset.

When a rule is activated on a dataset, one inference takes place for each object of the dataset, one after the other. The result is a single new attribute in the dataset holding the inferred values, one value per object.

An attribute of the dataset that has been previously inferred using a rule is further considered as storing available information. It can thus be used as an input attribute to other rules.
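The five inference steps above can be pictured with the following sketch. This is an illustrative Python mock-up, not the AML implementation; the rule contents and item names (SOIL, PHASE, DEPTH) are invented, and wild cards and confidence levels (described in the next sections) are ignored here:

rule = {
    "inputs": ["SOIL", "PHASE"],
    "output": "DEPTH",
    "occurrences": [
        {"SOIL": "L",  "PHASE": "1", "DEPTH": "deep"},
        {"SOIL": "Lo", "PHASE": "2", "DEPTH": "shallow"},
    ],
}

def infer(obj, rule):
    # steps 1 and 2: identify the input attributes and retrieve their values
    # from the object; together they constitute the fact
    fact = {name: obj[name] for name in rule["inputs"]}
    # step 3: search for the occurrence of the rule which matches the fact
    for occ in rule["occurrences"]:
        if all(occ[name] == fact[name] for name in rule["inputs"]):
            # steps 4 and 5: retrieve the output value from the matching
            # occurrence and add it to the object
            obj[rule["output"]] = occ[rule["output"]]
            return
    obj[rule["output"]] = ""   # no matching occurrence: output left blank

obj = {"SOIL": "L", "PHASE": "1"}
infer(obj, rule)
print(obj["DEPTH"])            # -> deep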

2.3 - Wild cards:

It is difficult, if not impossible, for an expert to foresee all the cases that can possibly occur in a set of available information. Furthermore, in some cases several, or even many, different values of a fact will lead to the same conclusion (e.g. IF <texture is sandy or loamy or ...> THEN ...).

Therefore a "wild card" mechanism allows the expert to define occurrences of rules which will match several different facts.

The "any" terms in the expressions of the last example above show such situations.

The "any" wild card will be, by convention, denoted as a star character (*) in a rule.

A fact for which an exact matching occurrence can be found will receive this occurrence’s output attribute value.

A fact for which an exact matching occurrence cannot be found will receive the output attribute value of the last occurrence of the rule that matches, if one can be found using the wild card convention. This assumes that an expert builds a rule by refining its occurrences, considering the most general cases before the most particular ones.

When no matching occurrence at all can be found for a fact, no value is provided to the output attribute, thus leaving it "blank" (or "0" (zero) depending on the output item's type). This can lead to confusion if blank (or 0) is a possible normal output value. Therefore having a fully "wild carded" occurrence as a header of a rule will "pick up" all facts for which no information can be provided and force the output value to, say, the NODATA value.

Using these specifications, the above example would become:

Example:

IF <soil name is "any" and parent material is "any" and phase is "any">

THEN <depth to rock is unknown>

ELSE IF <soil name is Luvisol and parent material is "any" and phase is "any">

THEN <depth to rock is deep>

ELSE IF <soil name is Orthic Luvisol and parent material is Marl

and phase is "any">

THEN <depth to rock is very deep>

ELSE IF <soil name is "any" and parent material is "any" and phase is Lithic>

THEN <depth to rock is shallow>

The wild card convention simulates the logical OR operator.
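The matching behaviour described above (exact match first, otherwise the last occurrence matching under the wild card convention, otherwise no value at all) can be sketched as follows; again this is an illustrative Python mock-up with invented codes, not the AML implementation:

WILD = "*"   # the "any" wild card

def infer_value(fact, rule):
    inputs, out = rule["inputs"], rule["output"]
    # a fact with an exactly matching occurrence receives that occurrence's value
    for occ in rule["occurrences"]:
        if all(occ[i] == fact[i] for i in inputs):
            return occ[out]
    # otherwise the LAST occurrence matching under the wild card convention wins,
    # the expert having written the most general cases before the most particular ones
    value = None
    for occ in rule["occurrences"]:
        if all(occ[i] in (fact[i], WILD) for i in inputs):
            value = occ[out]
    return value   # None when no occurrence matches at all (output left blank)

rule = {
    "inputs": ["SOIL", "MAT", "PHASE"],
    "output": "DEPTH",
    "occurrences": [
        {"SOIL": WILD, "MAT": WILD, "PHASE": WILD, "DEPTH": "unknown"},   # header occurrence
        {"SOIL": "L",  "MAT": WILD, "PHASE": WILD, "DEPTH": "deep"},
        {"SOIL": "Lo", "MAT": "4",  "PHASE": WILD, "DEPTH": "very deep"},
        {"SOIL": WILD, "MAT": WILD, "PHASE": "LI", "DEPTH": "shallow"},
    ],
}

print(infer_value({"SOIL": "L",  "MAT": "1", "PHASE": ""}, rule))   # -> deep
print(infer_value({"SOIL": "Xx", "MAT": "9", "PHASE": ""}, rule))   # -> unknown (header occurrence)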

2.4 - Confidence levels:

Expert knowledge is fuzzy and subject to evolution. Furthermore, the available information on the one hand, and the inferences that can be made using that information and the expert knowledge on the other hand, both have a certain reliability. Therefore it is necessary to have a mechanism that allows each piece of available information (or factual value) held in the dataset, and each piece of inferred information (or output value) held in the rules database, to be complemented with its reliability.

The reliability of a piece of information is called its "confidence level".

Confidence levels are held by confidence level attributes, one for each attribute of the dataset, and one for the output attribute of each rule.

Therefore each object in the dataset has a confidence level value for each one of its attributes. And each occurrence of each rule has a confidence level value for its output attribute.

The coding schema for confidence levels is the following:

v Very low or no information

l Low

m Moderate

h High

When an inference takes place, the following 5 steps complement those listed above:

6. The output confidence level attribute definition is retrieved from the matching occurrence,

7. and is added to the object in the dataset.

8. The confidence level of each input attribute of the object is determined from:

. its own associated confidence level item, named <in_item>.CL in the dataset,

. or, if the former is not found, the global confidence level item of the object, named CFL in the dataset,

. or, if neither is found, as an assumed high (h) confidence level.

9. The minimum (worst) confidence level value is retrieved from the confidence levels of all the attributes implied in the inference process (input confidence levels of the object, and output confidence level of the occurrence).

10. The resulting confidence level value is added to the output confidence level attribute in the object.

We have seen that an attribute of the dataset that has been previously inferred using a rule can be used as an input attribute to other rules. Its confidence level will be used in the same way as for any other input attribute.
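Steps 8 to 10 can be sketched as follows (illustrative Python, not the AML implementation; item names and codes are examples). The confidence level codes are ordered from worst (v) to best (h), and the worst level among all attributes implied in the inference is kept:

CL_ORDER = {"v": 0, "l": 1, "m": 2, "h": 3}   # worst to best

def input_confidence(obj, item):
    # step 8: <in_item>.CL if present, else the object's global CFL item,
    # else an assumed high (h) confidence level
    return obj.get(item + ".CL") or obj.get("CFL") or "h"

def output_confidence(obj, rule, occurrence):
    # steps 9 and 10: keep the minimum (worst) confidence level among all
    # attributes implied in the inference
    levels = [input_confidence(obj, i) for i in rule["inputs"]]
    levels.append(occurrence[rule["output"] + ".CL"])   # output CL of the occurrence
    return min(levels, key=CL_ORDER.get)

rule = {"inputs": ["SOIL", "PHASE"], "output": "DEPTH"}
occ  = {"DEPTH": "deep", "DEPTH.CL": "m"}
obj  = {"SOIL": "L", "SOIL.CL": "h", "PHASE": "1", "CFL": "l"}
print(output_confidence(obj, rule, occ))   # -> l (PHASE falls back to the global CFL)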

3 - Technical specifications:

------

3.1 - Expert type rules:

When a rule is applied to a dataset, it is processed in the following manner:

1. Input items to the rule are located and checked in the dataset

2. and output and confidence level items are added (empty) to the dataset.

3. Then for each record in the dataset:

4. the combination of actual values for the input items is matched to its corresponding combination in the rule Info file,

5. the corresponding value for the output item is retrieved from the rule Info file,

6. the corresponding value for the output confidence level item is computed from all the available input confidence levels in the dataset and the output confidence level in the rule,

7. and finally these values are updated in the current record of the dataset.

Input and output items of a rule are limited to a small number of possible Info data types: character (C), clear integer (I) and clear numeric (N). None of the other Info data types (date (D), binary integer (B) and binary floating point numeric (F)) may be used in rule data files.
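A minimal sketch of this data type restriction, in illustrative Python with invented item names, could look like:

AUTHORIZED_TYPES = {"C", "I", "N"}   # character, clear integer, clear numeric

def check_rule_item_types(items):
    # items maps each input/output item name to its Info data type;
    # D, B and F types are rejected
    for name, info_type in items.items():
        if info_type not in AUTHORIZED_TYPES:
            raise ValueError(f"item {name}: Info type {info_type} is not allowed in a rule")

check_rule_item_types({"SOIL": "C", "DEPTH": "I"})   # passes
# check_rule_item_types({"AREA": "F"})               # would raise ValueError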

3.2 - Class type rules:

The rules described above are called "expert type rules" as opposed to "class type rules". Class type rules are simple reclassification or recoding rules. They are used in any of the following cases:

1) convert the Info data type of an input item in the dataset from an unauthorized to an authorized type (e.g. B to I, or F to N),

2) reduce the number of different values for an input item (e.g. reclass detailed texture classes into less detailed texture classes),

3) recode the values of an input item (e.g. change codes to a more "speaking" coding schema),

4) a combination of the above cases.

Class type rules accept only one input item and produce one output item. The input item has no limitation as to its Info data type. The output item follows the same limitations as those applicable to expert type rules.

Class type rules do not follow the wild card convention: wild cards may not be used in them.

Class type rules do not hold an output item associated confidence level for their occurrences. But if the input item has an associated confidence level in the dataset, the class type rule copies it to a confidence level item associated with the output item in the dataset.

Thus class type rules may or may not produce a confidence level item together with the output item. Whereas expert type rules always produce a confidence level item.
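A class type rule can thus be pictured as a simple lookup table. The following illustrative Python sketch (not the AML implementation) recodes the TYPE attribute into CLASSTYPE, as in the example of section 3.6, and copies the input confidence level when one exists; the recoding table is invented:

def apply_class_rule(obj, class_rule):
    # recode obj[in_item] into obj[out_item] through a simple lookup table;
    # class rules use exact codes only (no wild cards)
    in_item, out_item = class_rule["input"], class_rule["output"]
    obj[out_item] = class_rule["table"].get(obj[in_item], "")
    # if the input item has an associated confidence level in the dataset,
    # copy it to a confidence level item associated with the output item
    if in_item + ".CL" in obj:
        obj[out_item + ".CL"] = obj[in_item + ".CL"]

class_rule = {
    "input": "TYPE",
    "output": "CLASSTYPE",
    "table": {"A1": "1", "A2": "1", "B1": "2"},   # invented recoding
}

obj = {"TYPE": "A2", "TYPE.CL": "m"}
apply_class_rule(obj, class_rule)
print(obj["CLASSTYPE"], obj["CLASSTYPE.CL"])   # -> 1 m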

3.3 - Other rule descriptors:

Each occurrence of a rule furthermore carries the following descriptors:

- an author identification number,

- a last update date,

- and a pointer to a text file holding free explanatory notes giving further details about the occurrence (not implemented at this time).

The rules database also holds a rules information file (DICTIONARY) and an authors information file (AUTHORS).

3.4 - Dataset:

Each time a rule is activated or "fired", the input items to the rule are checked against the dataset. All input items of a rule must exist within the dataset. They must have the same definition (name, type, size) in both the dataset and the rule.

Each item in the dataset may or may not have an associated confidence level item. When an expert rule is fired, if an input item does not have an associated confidence level in the dataset, it is assumed to have the best confidence level.
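A sketch of this check, in illustrative Python with hypothetical item definitions, could be:

def check_inputs_against_dataset(rule_inputs, dataset_items):
    # both arguments map an item name to its definition, here simply (type, width);
    # every input item of the rule must exist in the dataset with the same definition
    for name, definition in rule_inputs.items():
        if name not in dataset_items:
            raise KeyError(f"input item {name} does not exist in the dataset")
        if dataset_items[name] != definition:
            raise ValueError(f"input item {name} is defined differently in rule and dataset")

check_inputs_against_dataset(
    {"SOIL": ("C", 3)},
    {"SOIL": ("C", 3), "TEXTURE": ("I", 1)},
)   # passes: SOIL exists in the dataset with the same definition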

3.5 - Naming conventions:

A rule is an Info file stored in the $PTRHOME/xxx_rules Arc/Info workspace, where xxx refers to the domain to which the rules apply (e.g. eur32_rules refers to rules applicable to the Soils Geographical Database of Europe at Scale 1:1,000,000 version 3.2).

It is named RULE<rule_number> in which <rule_number> identifies uniquely the rule in the rule database.

Each record of the rule file is called an occurrence of the rule.

Input and output items in a rule follow the Arc/Info naming conventions with one restriction: an item name must not exceed 13 characters. (The reason for this is Info's 16-character limit, which is reached with the naming convention for associated confidence level items described below.)

An associated confidence level item has the same name as the item to which it is associated but is suffixed by ".CL" (e.g. if item name is ITEM then associated confidence level item name must be ITEM.CL).
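These conventions can be summarised in a small illustrative Python sketch (the rule number and item names are examples):

MAX_ITEM_NAME = 13   # leaves room for the 3-character ".CL" suffix within Info's 16-character limit

def rule_file_name(rule_number):
    # a rule Info file is named RULE<rule_number>
    return f"RULE{rule_number}"

def confidence_item_name(item_name):
    # the associated confidence level item is the item name suffixed with ".CL"
    if len(item_name) > MAX_ITEM_NAME:
        raise ValueError(f"item name {item_name} exceeds {MAX_ITEM_NAME} characters")
    return item_name + ".CL"

print(rule_file_name(12))              # -> RULE12
print(confidence_item_name("DEPTH"))   # -> DEPTH.CL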

3.6 - Class rule structure:

A class rule is one that classifies or recodes one and only one input attribute into an output attribute.

COLUMN ITEM NAME WIDTH OUTPUT TYPE N.DEC ALTERNATE NAME INDEXED?

1 NUM_AUTHOR 2 2 I - -

Identification number of author of the occurrence of the rule.

3 LAST_UPD 8 8 D - -

Last update date of the occurrence of the rule.

11 NOTE 4 5 B - -

Pathname to an ASCII explanatory note file of the occurrence of the rule.

(Not used at this time.)

15 <out_item> ? ? ? - -

Output attribute from the rule.

? <in_item> ? ? ? - -

Input attribute to the rule. In the case of a class rule, there is only one input attribute.

** REDEFINED ITEMS **

1 CLASS_RULE 2 2 I - -

The name of this redefined attribute is only used to differentiate class type from expert type rules.

Example of a class rule to recode an attribute named TYPE to a new attribute named CLASSTYPE:

COLUMN ITEM NAME WIDTH OUTPUT TYPE N.DEC ALTERNATE NAME INDEXED?

1 NUM_AUTHOR 2 2 I - -

3 LAST_UPD 8 8 D - -

11 NOTE 4 5 B - -

15 CLASSTYPE 1 1 I - -

16 TYPE 2 2 C - -

** REDEFINED ITEMS **

1 CLASS_RULE 2 2 I - -

3.7 - Expert rule structure:

An expert rule is one that uses one or several input attributes to infer the values of an output attribute.

COLUMN ITEM NAME WIDTH OUTPUT TYPE N.DEC ALTERNATE NAME INDEXED?

1 NUM_AUTHOR 2 2 I - -

Identification number of author of the occurrence of the rule.

3 LAST_UPD 8 8 D - -

Last update date of the occurrence of the rule.

11 NOTE 4 5 B - -

Pathname to an ASCII explanatory note file of the occurrence of the rule.

(Not used at this time.)

15 <out_item> ? ? ? - -

Output attribute from the rule.

? <out_item>.CL 1 1 C - -

Output confidence level attribute from the rule.

? <in_item 1> ? ? ? - -

1st input attribute to the rule.

{ ? <in_item 2> ? ? ? - -

2nd input attribute to the rule.

...

? <in_item N> ? ? ? - -

Nth input attribute to the rule. }