1. INTRODUCTION

Graphical methods have played a central role in the development of statistical theory and practice. From Sir Edmund Halley's graphical analysis of barometric pressure as a function of altitude, published in 1686, to the latest advertisements for computer graphics technology, the pages of scientific journals have recorded the importance of statistical graphics to the scientific enterprise.

There has been a crescendo of interest and research in graphical methods that has built to a fortissimo since the 1960's. The force behind the rise has been the computer graphics revolution: high-quality hardware and software for graphing data are now available at low cost, and systems with a substantial increase in power are about to penetrate the analyst's workplace. But the revolution has provided only a medium for development. Statisticians are actively exploiting this medium to produce important new tools for the data analyst.

Statistical graphics comprises three areas: methodology, graphical perception, and computing. Before examining these areas, we give a brief history of graphics in statistics. This historical survey shows which graphical methods have been used and how they have changed over time. We then introduce the three areas in turn: methods, perception, and computing.

Research in graphical methods can involve studying quantification methodology, that is, what quantitative information should be shown on a graph to explore data or to provide a diagnostic for a model fitted to data. Research in methods can also involve studying visualization, that is, what visual vehicle should be used to convey a certain set of quantitative information. This paper traces the historical development of the methodology of statistical graphics. One particularly exciting topic of research that involves a tight weaving of methods and computing is dynamic graphics.

Work on graphical perception is recent and has focused on the nature of the processes that operate when people decode the information represented in graphs. Section 2.3 introduces research in this area.

A large number of statistical software packages have recently become available. Of interest are the combinations of computing hardware, general-purpose software, and statistical software that will, together, help statisticians work productively and insightfully. In computing, one important research topic is the building of an interface between the data analyst and the computations that produce visual displays on a computer screen. Another important research topic is how one structures the computations, since many graphical methods require fast algorithms. Section 2.4 introduces the features of present statistical computing environments.

It is important to turn these ideas into a practical realization. SGI (Statistical Graphics Interface) is a practical analytic tool for statistical graphics analysis. Section 3 explains the use of SGI. The user can interact and experiment with the display, quickly compose a series of statistical graphics routines, and thus build up knowledge of the characteristics of a data set. The practical approach is discussed in more detail in section 4 with an example of clustering.

2. STATISTICAL GRAPHICS

2.1 Graphics in Statistics: History

Quantitative graphics have been central to the development of science, and statistical graphics date from the earliest attempts to analyze data. Many familiar forms, including bivariate plots, statistical maps, bar charts, and coordinate paper, were used in the 18th century. Statistical graphics developed through attention to four problems: spatial organization (17th and 18th centuries), discrete comparison (18th and early 19th centuries), continuous distribution (19th century), and multivariate distribution and correlation (late 19th and early 20th centuries). Today, statistical graphics appear to be reemerging as an important analytic tool, with recent innovations exploiting computer graphics and related technologies. Quantitative graphics did not originate with science; they can be traced back to prehistory. The earliest known map, extant on a clay tablet dated to 3800 B.C., depicts all of Northern Mesopotamia with conventions and symbols still familiar today. From about 3200 B.C., Egyptian surveyors abstracted their lands in terms of coordinates not unlike the Cartesian system still in use. By the 10th century A.D., medieval astronomers depicted planetary movements as cyclic lines on spatial-temporal grids, diagrams strikingly similar to modern line graphs. Musical notation, standardized by the Vedic hymnists of the 7th century B.C., had become a true time series, following the Franconian reforms, by the 13th century A.D.

Statistical graphics, beginning with simple tables and plots, date from the earliest attempts to analyze empirical data; many of the most familiar forms and techniques were well established at least 200 years ago. At the turn of the 19th century, to use a convenient benchmark, a statistical analyst might have resorted to the following graphical tools: bivariate plots of data points (used since the mid-17th century), line graphs of time series data (since 1724), curve-fitting and interpolation (1760), the notion of measurement error as deviation from a regular graphed line (1765), graphical analysis of periodic variation (1779), statistical mapping (1782), bar charts (1786), and printed coordinate paper (1794). Today quantitative graphics are reemerging as an important statistical tool, as evidenced by developments as diverse as the growing use of computer graphics by statisticians, the proliferation of descriptive graphics in statistical publications of the Federal government, the progressive elaboration of graphics for exploratory data analysis, and the recent formation of an Ad Hoc Committee on Statistical Graphics within the American Statistical Association.

2.1.1 Spatial Organization for Data Analysis

The early problem of spatial organization grew with the amount of data to be analyzed. Multiple measurements proliferated with the Industrial Revolution in Europe, which brought a spate of new measuring devices: the air and water thermometer (1590), micrometer (1636), barometer (1643), pendulum clock (1656), water clock (1660), mercury thermometer (1714), etc. Spatial organization of multiple measurements was achieved in two competing forms, coordinate systems and tables, which dominated quantitative graphics in the 17th and 18th centuries.

2.1.2 Discrete Quantitative Comparison

A second graphical problem, that of discrete quantitative comparison, arose in the state statistics, or Staatenkunde, of the early 18th century. Both tables of statistics in general, and comparative political data in particular, suggested the need for graphical comparison, especially in the growing volume of atlases and chartbooks intended for popular consumption.

2.1.3 Continuous Distribution

By the 1820's, an increasing number of the scientific journals in Europe began to publish graphs and charts that described and compared measurements of a wide range of natural and social phenomena. Graphical analysis of data finally emerged in the period 1830-1835 as a regular feature of scientific publication, particularly in England. At the same time, a relatively new field, vital statistics, generated a third graphical problem, that of representing continuous distributions. Two solutions, the ogive and the histogram, proved essential to the further development of vital statistics in the 19th century.

2.1.4 Multivariate Distribution and Correlation

By the mid-19th century, quantitative graphics had become an accepted part of statistics. The Third International Statistical Congress, meeting in Vienna in 1857, organized an exhibition display of graphs and cartograms and debated the merits of various graphical methods. In 1872, the U. S. Congress appropriated the first money for a graphical treatment of statistical data: the cartograms of Ninth Census results, which appear in Statistics of Wealth and Public Indebtedness. During the same period, vital statistics increasingly involved interrelationships among at least three variables (population, age, and time), which led to the graphical problem of representing multivariate distributions and correlations. Two general solutions, contour plots and stereograms (i.e., orthographic and axonometric projection), occupied statisticians into the 20th century.

2.1.5 The 20th Century

At the turn of this century, statistical graphics had begun to diffuse, through textbooks, college courses, and the mass media, into the popular domain. A major vehicle for this diffusion was the pictogram, a comparative form based on similar drawings of different sizes, developed by M. G. Mulhall for his popular Dictionary of Statistics published in 1884. W. C. Brinton corrected the pictogram's most serious flaw, the ambiguity of whether comparisons are to be made in one, two, or three dimensions, with his suggestion to string out unit drawings, thus making the form analogous to the bar chart. Otto Neurath was probably most influential in promoting pictorial statistics, first in the "Vienna Method" of the Social and Economic Museum, which popularized statistics of the city during the period 1924-1934, and later in Neurath's Isotype System, which became a profitable commercial enterprise.

With the advent of World War II, interest in graphical methods appeared to wane among academic statisticians, as attention turned to more mathematical concerns. This trend did not begin to reverse until the mid-1960's, when developments in computer technology made possible the manipulation and analysis of large multivariate data sets. As W. H. Kruskal noted in 1975:

The role of statistical graphics within statistics generally ... has had tremendous ups and downs: at one time, graphical methods were near the core of statistics; Karl Pearson devoted considerable attention to graphics and he was following the emphasis of his hero, Francis Galton. Later on, statistical graphics became neglected and even scorned in comparison with the blossoming of the mathematical side of statistics. In recent years, however, there has been a renaissance of concern with graphics and some of our best statistical minds have suggested new graphical approaches of great interest.

New graphical approaches published during the past 30 years include, in 1957, Edgar Anderson's circular glyphs to represent multivariate data; in 1965, the first of J. W. Tukey's innovations for exploratory data analysis (EDA); in 1968, the "graphic rational patterns" of R. Bachi; in 1972, D. F. Andrews' use of Fourier series to generate multivariate plots; in 1973, H. Chernoff's cartoons of a human face to represent multiple variables; in 1974, a color-coded matrix to represent two variables in a single map, developed by the U. S. Bureau of the Census; in 1975, S. E. Fienberg's "Floating Four-Fold Circular Display" (FCD) to represent 2 x 2 tables; in 1978, R. McGill, J. W. Tukey, and W. A. Larsen's "Variations of Box Plots"; in 1981, "Representing Points in Many Dimensions by Trees and Castles" by B. Kleiner and J. A. Hartigan; in 1984, J. J. Mezrich, S. Frysinger, and R. Slivjanovski's "Dynamic Representation of Multivariate Time Series Data"; in 1986, D. W. Turner and J. W. Seaman's "Using Polyhedra to Graphically Display k-Dimensional Data"; in 1987, R. A. Becker, W. S. Cleveland, and A. R. Wilks's "Dynamic Graphics for Data Analysis"; in 1988, "MacSpin: Dynamic Graphics on a Desktop Computer" by A. W. Donoho, D. L. Donoho, and M. Gasko; and in 1989, G. A. Mead's "The Sorted Binary Plot: A New Technique for Exploratory Data Analysis" and R. Dunn's "A Dynamic Approach to Two-Variable Color Mapping". All of these innovations exploit modern computer technology. Recent innovations in statistical graphics have followed developments in computer graphics hardware and software, and include solutions to problems generated or made tractable by the computer and associated technologies.

2.2 Graphical Methods in Statistics

S. E. Fienberg's paper "Graphical Methods in Statistics" (1979) documents a decline in the use of statistical graphs during this century. This decline, however, reflects the relative increase in statistical theory and nongraphical methodology. Despite what may appear to be a prolonged decline in the relative use of graphics in statistical journals, the past 30 years have seen an almost astonishing increase in innovative graphical ideas for data display and analysis. The statistical groups at Princeton University and AT&T Bell Laboratories have provided much of the leadership for the development of what might be called the "new statistical graphics". We now briefly review some of these innovations.

2.2.1 Graphs for Displaying Multidimensional Data

Anderson (1957) developed his method of using glyphs and metroglyphs, which are circles of fixed radius with rays of various lengths representing the values of different variables. There are many variants of the glyph technique, involving the plotting of triangles (Pickett and White 1966), k-sided polygons (Siegel, Goldwyn, and Friedman 1971), and weathervanes (Cleveland and Kleiner 1974), as well as much more elaborate devices such as constellations (Wakimoto and Taguri 1978). Bertin (1967) developed Profiles, which represent each point by k vertical bars for k-dimensional data, each bar having height proportional to the value of the corresponding variable. The profile refers to the tops of the bars; sometimes the profile is drawn as a polygonal line. This is undoubtedly the most common method of representing multivariate data. Goldwyn (1971) developed Stars, also called polygons, which represent each variable as a value along one of k equally spaced rays issuing from the center of a circle. This technique is similar to profiles, except that values are arranged in polar coordinates rather than rectangular ones. The points on the rays are usually connected in a polygon. Welsch (1976) describes the standard TROLL graph capabilities, as well as a series of experimental graphic devices, including STARS. Ashton, Healy, and Lipton (1957) used graphical techniques to compare measurements on the teeth of fossils and different "races" of humans and apes. Hartigan (1975) developed Boxes, which represent each variable as a length in one of the three dimensions of a box; if there are more than three variables, there will be several segments within each dimension. The final appearance is of a box wrapped by strings, which break the edges of the box into segments corresponding to the variables. Andrews (1972) suggested representing a k-tuple, x = (x_1, x_2, ..., x_k), by the finite Fourier series

$$f_x(t) = \frac{x_1}{\sqrt{2}} + x_2 \sin t + x_3 \cos t + x_4 \sin 2t + x_5 \cos 2t + \cdots$$
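As a minimal sketch of how this finite Fourier series might be evaluated (an illustration only, not Andrews' own code), the Python functions below compute the value of the curve for one observation at a given t and sample it over the interval from -pi to pi. The function names andrews_curve and sample_curve, the choice of interval, and the number of sample points are assumptions introduced here for illustration.

```python
import math


def andrews_curve(x, t):
    """Evaluate x_1/sqrt(2) + x_2 sin t + x_3 cos t + x_4 sin 2t + x_5 cos 2t + ...

    x is a sequence holding one k-dimensional observation (hypothetical input format).
    """
    total = x[0] / math.sqrt(2.0)
    for i, value in enumerate(x[1:], start=1):
        harmonic = (i + 1) // 2  # frequencies 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            total += value * math.sin(harmonic * t)  # x_2, x_4, ... pair with sine
        else:
            total += value * math.cos(harmonic * t)  # x_3, x_5, ... pair with cosine
    return total


def sample_curve(x, n_points=101):
    """Return (t, f_x(t)) pairs on an evenly spaced grid over [-pi, pi]."""
    step = 2.0 * math.pi / (n_points - 1)
    grid = [-math.pi + j * step for j in range(n_points)]
    return [(t, andrews_curve(x, t)) for t in grid]


if __name__ == "__main__":
    observation = [1.0, 0.5, -0.25, 2.0, 0.0]  # made-up 5-dimensional point
    for t, value in sample_curve(observation, n_points=5):
        print(f"t = {t:+.3f}  f(t) = {value:+.3f}")
```

Sampling each observation in this way and drawing the resulting curves on common axes gives one curve per observation.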


Plots of the Fourier representations of the multivariate observations will be curves which can be visually grouped. In his article, Andrews goes on to develop significance tests and confidence intervals to make comparisons on the plots. Chernoff (1973) proposed representing a point in 18-dimensional space by drawing a face whose 18 characteristics (such as length of nose, shape of face, curvature of mouth, and size of eyes) are determined by the coordinates or position of the point. Gnanadesikan (1977) discusses ways of using probability plotting techniques for multivariate point clouds, and gives a number of references. Wachter (1975) has developed a form of probability plotting as an adjunct to principal component analysis.

In the glyph-type displays described above, each k-dimensional observation is represented by mapping the value of each variable onto a component of the graphical display. The correlational structure among the variables is ignored, and the assignment of variables to components of the graph is arbitrary (Wainer 1981). A solution to this apparent shortcoming was given by Kleiner and Hartigan (1981). They suggested that a hierarchical clustering of the variables be used to generate a well-defined algorithm for assigning variables to components of the display. This approach can be used in conjunction with any graphical representation of multidimensional data (Jacob 1981).

Due to advances in computing, recent research has focused on interactive and dynamic methods. A dynamic graphical method is one in which a data analyst interacts in real time with a data display on a computer graphics terminal. Using a screen input device such as a mouse, the analyst can specify, in a visual way, points or regions on the display and cause aspects of the display to change nearly instantaneously. A prototype computer software system of this sort, PRIM-9, was developed by Tukey,