Elaboration, explanation and specification in graphical models

Svend Kreiner

Dept. of Biostatistics, University of Copenhagen

Part I: The elaboration paradigm and graphical models

Part II: A bootstrap evaluation of the confidence of graphical models provided by model search procedures and the estimates of associations between variables based on these models

Association between intelligence measured in 1968 and income reported 25 years later

γ = 0.20, p < 0.0005
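As a reminder of what the coefficient above measures, here is a minimal sketch of Goodman and Kruskal's gamma for a two-way table of ordered categories. The function names and the 2 x 2 toy table are illustrative assumptions; the table is not the intelligence-by-income table from the talk.

```python
from typing import List, Sequence, Tuple


def concordant_discordant(table: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Count concordant and discordant pairs in a two-way table of counts."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair cell (i, j) with every cell in a strictly lower row.
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return concordant, discordant


def goodman_kruskal_gamma(table: Sequence[Sequence[float]]) -> float:
    """Gamma = (C - D) / (C + D); pairs tied on either variable are ignored."""
    concordant, discordant = concordant_discordant(table)
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical toy table (rows and columns are ordered categories).
toy_table: List[List[float]] = [[20, 10], [10, 20]]
toy_gamma = goodman_kruskal_gamma(toy_table)  # (400 - 100) / (400 + 100) = 0.6
```

For a 2 x 2 table gamma reduces to Yule's Q, which is why the toy value can be checked by hand.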

The category collapsed table

The graphical model underlying category collapsibility

Income ⊥⊥ Intelligence | Collapsed income

Income ⊥⊥ Intelligence | Collapsed intelligence


Elaboration, explanation and specification

A paradigm for quantitative sociological research

Lazarsfeld, PF (1946) : Interpretation of Statistical Relation as a Research Operation

Lazarsfeld, PF & Kendall, PL (1950): Problems of Survey Research

Lazarsfeld, PF & Rosenberg, M (eds) (1955): The language of Social Research

Rosenberg, M (1962): Test factor Standardization as a Method of Interpretation

Davis, JA (1967): A partial coefficient for Goodman and Kruskal's Gamma

Rosenberg, M (1968): The Logic of Survey Analysis

Davis, JA (1975): Analyzing contingency tables with linear flow graphs

Davis, JA (1980): Contingency table analysis: proportions and flow graphs

Davis, JA (1984): Extending Rosenberg’s Technique for Standardizing Percentage Tables


Elaboration, explanation and specification

Elaboration - Analysis of conditional association

Explanation - Z explains the association between X and Y if X ⊥⊥ Y | Z

Specification - Description of conditional associations that cannot be explained by other relevant variables. Is the strength of the X-Y association constant across levels of Z?
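Elaboration can be illustrated with a partial gamma that pools concordant and discordant pairs within the levels of Z, in the spirit of Davis (1967). The sketch below is a minimal illustration under that assumption; the function names and the two toy strata are hypothetical.

```python
from typing import Sequence, Tuple

Table = Sequence[Sequence[float]]


def concordant_discordant(table: Table) -> Tuple[float, float]:
    """Count concordant and discordant pairs in one X-by-Y table."""
    n_rows, n_cols = len(table), len(table[0])
    concordant = discordant = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            for k in range(i + 1, n_rows):
                for l in range(n_cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return concordant, discordant


def partial_gamma(strata: Sequence[Table]) -> float:
    """Pool pairs over the levels of Z: (sum C_z - sum D_z) / (sum C_z + sum D_z)."""
    total_c = total_d = 0.0
    for table in strata:
        c, d = concordant_discordant(table)
        total_c += c
        total_d += d
    return (total_c - total_d) / (total_c + total_d)


# Hypothetical X-by-Y tables, one per level of Z.
strata = [
    [[20, 10], [10, 20]],  # Z = 1: stratum gamma = 0.6
    [[10, 10], [10, 10]],  # Z = 2: stratum gamma = 0.0
]
pooled = partial_gamma(strata)  # (500 - 200) / (500 + 200) = 3/7
```

In this toy example the stratum-specific coefficients differ (0.6 versus 0.0), which is the situation specification is meant to describe; a single partial coefficient summarises the association well only when it is roughly constant across levels of Z.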


Graphical models and elaboration

Graphical models are models defined by explanations obtained during attempts to elaborate associations in a multivariate set of variables.

Graphical models provide the solution to the problem that killed the elaboration paradigm:

The problem - which variables are needed in order to explain or specify the association between two variables?

The solution - is given by the global Markov properties and decompositions of the Markov graph
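The global Markov property says that if a set of variables separates X from Y in the independence graph, then X and Y are conditionally independent given that set. A minimal sketch of a separation check by breadth-first search follows; the function name and the four-variable toy graph are hypothetical and unrelated to the model in the talk.

```python
from collections import deque
from typing import Dict, Hashable, Iterable, Set, Tuple


def is_separated(
    edges: Iterable[Tuple[Hashable, Hashable]],
    x: Hashable,
    y: Hashable,
    separator: Iterable[Hashable],
) -> bool:
    """Return True if every path from x to y passes through the separator."""
    adjacency: Dict[Hashable, Set[Hashable]] = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    blocked = set(separator)
    if x in blocked or y in blocked:
        raise ValueError("x and y must not belong to the separator")

    # Breadth-first search from x that never enters a separator vertex.
    seen = {x}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if node == y:
            return False
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen and neighbour not in blocked:
                seen.add(neighbour)
                queue.append(neighbour)
    return True


# Hypothetical four-cycle A - B - D - C - A.
toy_edges = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D")]
both_block = is_separated(toy_edges, "A", "D", {"B", "C"})  # True
one_blocks = is_separated(toy_edges, "A", "D", {"B"})       # False: path A - C - D remains
```

In the toy graph, explaining the A-D association requires conditioning on both B and C; conditioning on B alone leaves a path open.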


Describing associations in graphical models

Both collapse onto for analysis of the AD association

The graphical model

Specification of the Intelligence-Income association

Is there nothing more to be said about the association between Intelligence and Income within the inference frame defined by graphical models?


Loglinear modelling on top of the graphical model

The graphical model is loglinear, but higher order interactions are always included.

The graphical model therefore assumes that the Income-Intelligence association is modified by Education, School and Sex

If we, during specification of associations, conclude/assume that the association is constant across levels defined by other variables, then the model has to be replaced by a loglinear model.
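The difference between the two kinds of model can be written out for a three-variable illustration. The notation below (variables X, Y, Z, expected counts m, and lambda parameters) is the standard loglinear notation and is an assumption here, not taken from the slides.

```latex
% Graphical model with the single clique {X, Y, Z}: all interactions included,
% so the X-Y association may differ across levels of Z.
\log m_{xyz} = \lambda + \lambda^{X}_{x} + \lambda^{Y}_{y} + \lambda^{Z}_{z}
             + \lambda^{XY}_{xy} + \lambda^{XZ}_{xz} + \lambda^{YZ}_{yz}
             + \lambda^{XYZ}_{xyz}

% Constant X-Y association across levels of Z: the three-factor term is dropped.
% This model is loglinear but no longer graphical.
\log m_{xyz} = \lambda + \lambda^{X}_{x} + \lambda^{Y}_{y} + \lambda^{Z}_{z}
             + \lambda^{XY}_{xy} + \lambda^{XZ}_{xz} + \lambda^{YZ}_{yz}
```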

Collapsibility properties of graphical models guarantee that all parameters relating to the Intelligence-Income association are included in the marginal table including Education, School and Sex.

Specification of the Intelligence-Income association therefore only requires analysis of the 5-way table with these variables.


The marginal loglinear model

The marginal model is saturated.

The Income-Educ-School and Educ-School-Intel-Sex interactions are fixed in the model. Attempts to examine these interactions in the 5-way table will tell us nothing about the associations between these variables in the full model.

The results of analyses of the other interaction parameters in the 5-way table also apply to the full model.

At the end of the day:

No evidence against Income-Intelligence, Income-Educ-Sex, Income-School-Sex, Income-Educ-School, Educ-School-Intel-Sex

The Income-Intelligence association is constant over all levels of the other variables of the model. (Observed partial γ = 0.079; fitted partial γ = 0.050 under the loglinear model.)


The estimation problem

The properties of estimates are well-known under the model.

The model itself, however, is the result of a model search procedure and is therefore also an estimate.

Very little is known about the properties of both

1)  the estimates of the model

2)  the estimates of unknown parameters based on the model estimates

Non-parametric bootstrapping is one way to examine these properties

The model search procedure in this example

(much better strategies are available – but not discussed here)

1)  Initial screening (Kreiner, 1986) of 2- and 3-way tables defining a starting point for a proper model search procedure

2)  Stepwise naïve p-value driven search (backwards – and forwards). P-values are Monte Carlo estimates (Kreiner, 1987) based on 400 random tables for each hypothesis. Significance is evaluated at a 1 % critical level
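The idea of a Monte Carlo p-value based on a fixed number of random tables can be sketched with a simple permutation test. This is a generic illustration, not the procedure of Kreiner (1987): permuting one variable generates random tables with the observed margins, and the function names, the two-sided statistic, the (count + 1) / (n + 1) estimator and the toy data are all assumptions.

```python
import random
from typing import List, Sequence


def gamma_from_pairs(x: Sequence[int], y: Sequence[int]) -> float:
    """Goodman-Kruskal gamma computed from paired ordinal observations."""
    concordant = discordant = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def monte_carlo_p_value(
    x: Sequence[int], y: Sequence[int], n_tables: int = 400, seed: int = 0
) -> float:
    """Estimate a two-sided p-value from n_tables random tables with fixed margins."""
    rng = random.Random(seed)
    observed = abs(gamma_from_pairs(x, y))
    shuffled: List[int] = list(y)
    at_least_as_extreme = 0
    for _ in range(n_tables):
        rng.shuffle(shuffled)  # a random table with the same row and column totals
        if abs(gamma_from_pairs(x, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Counting the observed table itself keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (n_tables + 1)


# Hypothetical, perfectly associated toy data.
toy_x = [0] * 15 + [1] * 15 + [2] * 15
toy_y = list(toy_x)
toy_p = monte_carlo_p_value(toy_x, toy_y)
```

With 400 random tables the smallest attainable estimate under this estimator is 1/401, which is below the 1 % critical level used in the search.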

The screening will – apart from statistical errors – identify the parts of the models defined as strings and trees defined by cliques sharing only one edge.

Naïve p-value driven model search procedures are not consistent.

Type II errors may disappear, but type I errors (spurious edges) will continue to turn up even as n increases.

How reliable is the estimate of the association between intelligence and income?

G = the “true” graphical model

The partial γ is estimated relative to G

If G is known then γ(G) has nice asymptotic properties.

If Intelligence and Income are connected in G then

If Intelligence and Income are disconnected in G then

G is not known. γ must therefore be estimated relative to the estimate of the model, Ĝ.

Let γ̂ be the estimate under Ĝ.

The distribution of γ̂ is not known and (probably) not nice

The properties of Ĝ and γ̂ can be examined by naïve nonparametric bootstrapping.
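The bootstrap loop can be sketched as follows: resample cases with replacement, rerun the model search on each resample, and record both the selected model and the estimate relative to it. The search and estimation functions here are hypothetical stand-ins (a single X-Y edge kept when |gamma| exceeds an arbitrary threshold), and the toy data and number of replications are assumptions, not those of the talk.

```python
import random
from collections import Counter
from typing import Callable, FrozenSet, List, Sequence, Tuple

Case = Tuple[int, int]
Model = FrozenSet[Tuple[str, str]]


def gamma_from_cases(cases: Sequence[Case]) -> float:
    """Goodman-Kruskal gamma from a list of (x, y) ordinal observations."""
    concordant = discordant = 0
    for i in range(len(cases)):
        for j in range(i + 1, len(cases)):
            sign = (cases[i][0] - cases[j][0]) * (cases[i][1] - cases[j][1])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def toy_search(cases: Sequence[Case], threshold: float = 0.2) -> Model:
    """Hypothetical stand-in for a model search: keep the X-Y edge if |gamma| is large."""
    if abs(gamma_from_cases(cases)) > threshold:
        return frozenset({("X", "Y")})
    return frozenset()


def toy_estimate(cases: Sequence[Case], model: Model) -> float:
    """Gamma relative to the selected model; a missing edge implies gamma = 0."""
    return gamma_from_cases(cases) if ("X", "Y") in model else 0.0


def naive_bootstrap(
    cases: Sequence[Case],
    search: Callable[[Sequence[Case]], Model],
    estimate: Callable[[Sequence[Case], Model], float],
    n_boot: int = 200,
    seed: int = 1,
) -> Tuple[Counter, List[float]]:
    """Rerun search and estimation on n_boot resamples drawn with replacement."""
    rng = random.Random(seed)
    models: Counter = Counter()
    estimates: List[float] = []
    for _ in range(n_boot):
        resample = [rng.choice(cases) for _ in cases]
        model = search(resample)
        models[model] += 1
        estimates.append(estimate(resample, model))
    return models, estimates


# Hypothetical toy data, not the data analysed in the talk.
toy_cases = [(0, 0)] * 12 + [(0, 1)] * 8 + [(1, 0)] * 8 + [(1, 1)] * 12
models, estimates = naive_bootstrap(toy_cases, toy_search, toy_estimate)
```

The counter of selected models describes the variability of the model estimate, and the list of estimates describes the variability of the coefficient with the model uncertainty included.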


How stable is the model estimate

Mean edge entropy = 0.317 (0.904/0.096 distribution)

Mean number of departures from data model = 6.5 (18.0 %)
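The two stability measures above can be sketched under stated assumptions: that edge entropy is the Shannon entropy (natural logarithm) of the bootstrap frequency with which an edge is included, and that a departure is an edge present in exactly one of two models. The slides do not define either quantity, so both readings, the function names and the toy edge sets are assumptions. Under the first reading, a 0.904/0.096 split gives roughly 0.316, close to the reported 0.317.

```python
import math
from typing import FrozenSet, Iterable, Sequence, Set


def binary_entropy(p: float) -> float:
    """Shannon entropy (natural log) of the two-point distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mean_edge_entropy(inclusion_frequencies: Sequence[float]) -> float:
    """Average entropy over edges, given each edge's bootstrap inclusion frequency."""
    return sum(binary_entropy(p) for p in inclusion_frequencies) / len(inclusion_frequencies)


def departures(model: Iterable[FrozenSet[str]], reference: Iterable[FrozenSet[str]]) -> int:
    """Number of edges present in exactly one of the two models."""
    return len(set(model) ^ set(reference))


reference_entropy = binary_entropy(0.904)  # roughly 0.316

# Hypothetical edge sets on variables A, B, C, D.
data_model: Set[FrozenSet[str]] = {frozenset("AB"), frozenset("BC"), frozenset("CD")}
boot_model: Set[FrozenSet[str]] = {frozenset("AB"), frozenset("BD"), frozenset("CD")}
n_departures = departures(boot_model, data_model)  # edges B-C and B-D differ
```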


Estimating partial g coefficients

The model found in the original data collapses onto a table with Income, Intelligence, Education, School and Sex for estimation of the partial γ coefficient.

21.8 % of the bootstrapped models collapse on the same table even though none of the bootstrapped models are equal to the data model


Bootstrap distribution of collapsibility properties

Collapsed table (variables) / Frequency / Percent / Valid Percent / Cumulative Percent
FGM / 110 / 22.0 / 22.0 / 22.0
BCFGLM / 86 / 17.2 / 17.2 / 39.1
FGLM / 65 / 13.0 / 13.0 / 52.1
BCFGM / 52 / 10.4 / 10.4 / 62.5
BCFGKM / 28 / 5.6 / 5.6 / 68.1
BCFGKLM / 24 / 4.8 / 4.8 / 72.9
CFGLM / 22 / 4.4 / 4.4 / 77.2
BFGLM / 19 / 3.8 / 3.8 / 81.0
BFGM / 17 / 3.4 / 3.4 / 84.4
FLM / 17 / 3.4 / 3.4 / 87.8
BCFLM / 11 / 2.2 / 2.2 / 90.0
FGKM / 9 / 1.8 / 1.8 / 91.8
FKM / 9 / 1.8 / 1.8 / 93.6
BCFKLM / 6 / 1.2 / 1.2 / 94.8
CFGM / 4 / 0.8 / 0.8 / 95.6
CFKLM / 4 / 0.8 / 0.8 / 96.4
BFLM / 3 / 0.6 / 0.6 / 97.0
CFGKM / 3 / 0.6 / 0.6 / 97.6
BFGKM / 2 / 0.4 / 0.4 / 98.0
CFLM / 2 / 0.4 / 0.4 / 98.4
FGKLM / 2 / 0.4 / 0.4 / 98.8
FKLM / 2 / 0.4 / 0.4 / 99.2
FM / 2 / 0.4 / 0.4 / 99.6
BFGKLM / 1 / 0.2 / 0.2 / 99.8
CFGKLM / 1 / 0.2 / 0.2 / 100.0
Total / 501 / 100.0 / 100.0


Partial γ coefficient estimated in tables defined by decomposition


Estimated partial γ coefficients under different collapsibility assumptions


γ may sometimes be estimated in collapsed tables defined both by separation and by decomposition.

Collapsed tables defined by separation are smaller and the loglinear parameters of interest are the same, but separation does not guarantee collapsibility of partial γ coefficients.

These may therefore be systematically different from those estimated in collapsed tables defined by decomposition.

Are there any apparent differences when we compare the bootstrap estimates?


The association between γ coefficients estimated under different collapsibility conditions

Uncorrelated estimates. Estimates from tables defined by separation are unbiased


No significant difference between estimates in tables defined by separation and decomposition

Table defined by / Mean / Std. Deviation
separation / 0.074336 / 0.0372415
decomposition / 0.071078 / 0.0403192
Total / 0.072522 / 0.0389971


The distribution of the gamma coefficients when edges are present in the model. Mean = 0.10, s.d. = 0.0216


The distribution of gamma coefficients including zeros implied by missing edges. Mean = 0.049, s.d. = 0.053
