DataGrove: Exploring Network Trace Data with Hierarchical
Multi-dimensional Meta-data
Diane Tang, Robert Bosch, Chris Stolte, Mary Baker, and Pat Hanrahan
Computer Science Department
Stanford University
Abstract
With the increasing frequency and sophistication of data collection, hierarchical structured data described using star and snowflake schemas is becoming more important. In these schemas, additional semantic knowledge is connected to a main relational data table. While the database community recognizes the importance of structuring data in this fashion for analyzing large data sets, most visualization packages do not handle hierarchical multi-dimensional data.
Our goal is to foster intuitive user exploration of data described by both snowflake and star schemas. We present a conceptual model and user interface for exploring this type of data. While we demonstrate the benefits of navigating this structure using a single data set, specifically a wireless network trace, we believe these techniques can be generally applied to other large data sets.
1 Introduction
The database community has long recognized the importance of hierarchical multi-dimensional data, in which structured semantic knowledge is connected to a single main relational data table. This additional semantic knowledge is especially useful when analyzing large data sets.
For example, a store may keep transaction records in a simple relational table: each transaction record includes attributes such as product, customer, store location, and transaction date. Supplemental tables contain additional data relating to each attribute, such as product type, supplier, warehouse, etc. These separate data sets combine with the raw transaction data to form a multi-dimensional data hierarchy.
This combination of simple data augmented with hierarchically organized semantic knowledge, or meta-data, occurs in a wide range of problem domains. The meta-data hierarchy enables analysts to aggregate and organize the data set logically into different levels of detail, presenting the data in a more manageable, but still meaningful, form.
Unfortunately, most visualization packages do not take advantage of data structured in this fashion. Instead, they only support data in a single flat table, stored either internally or in a relational database. With a flat table, additional semantic knowledge must be explicitly included in the table, which can significantly increase its memory footprint. While a relational database can store the data more efficiently, in either case the hierarchical structure is typically unknown to the visualization and therefore not exposed to the user. Consequently, the user must have explicit knowledge of the hierarchy and manually incorporate this knowledge when using the visualization.
Our goal in this work is to facilitate intuitive user exploration of hierarchical structured data. We present both a conceptual model used to support hierarchical multi-dimensional data within a visualization environment and an interface for exploring this type of data.
We demonstrate the importance of leveraging this meta-data information for a particular data set, specifically a trace of a wireless network. However, we believe these techniques are generally applicable to other data sets and extensible to a wide variety of visualizations.
Section 2 provides background information about hierarchical multi-dimensional structures in general and the particular data set we explore in this paper. Related work is presented in Section 3, and Section 4 presents the virtual fact table conceptual model we use throughout the rest of the paper. Section 5 presents our current user interface, and Section 6 describes how virtual fact tables are generated. We then show how this structure is useful for exploring the network trace data, and we conclude with several directions for future work.
2 Background
The database community describes hierarchical multi-dimensional structures through the use of star and snowflake schemas [4][14]. Both types of schemas have a root fact table, consisting of the basic data that the user wants to explore. Each field in the root fact table is either a dimension or a measure: dimensions are independent fields, either nominal or quantitative, whereas measures are dependent quantitative fields.
Additional semantic knowledge describing the dimensions is stored in associated existence tables. Each existence table represents semantic information for the dimension at a particular level of detail.
An example of a snowflake schema is shown in Figure1. In this schema the dimensions are Time, User, Access Point, Application, Direction, and Remote Host, and the measures are Packets and Bytes. This schema is a snowflake schema because the User set of existence tables branches into both Research Group and Wing/Floor. Star schemas are a subset of snowflake schemas with only single, non-branching lines of existence tables.
The schema presented in Figure 1 describes a trace of the wireless network installed in the Gates Computer Science Building at Stanford University over a 12-week period in Fall, 1999. For every packet transmitted through the gateway for this wireless network, we recorded:
· the timestamp,
· the wireless network user that sent or received the packet,
· the access point (location) of the user,
· the application that generated the packet,
· the direction (incoming or outgoing) of the packet,
· and the remote host with which the user was communicating.
We collected this information for 78,739,933 packets over the 12 week period. Because we wanted to find interesting trends in this data, we pre-processed the data to compute the number of packets and bytes sent every 15 seconds for a unique set of user, access point, application, direction, and remote host, resulting in 2,890,497 unique records.
While we tried exploring this data using existing visualization packages, we found the inability to navigate the underlying semantic hierarchy frustrating. For example, we wanted to display the data grouped by the user's research group, or the users and the access points grouped by floor. However, to perform this grouping, we either had to pre-process the data manually into these groups, or add this supplemental data into the main table. Neither option was particularly palatable.
This data set is the basis for studying the techniques presented in the rest of this paper. A detailed analysis of this data is presented in Tang [12].
3 Related Work
There has been a great deal of work on visualizing trees, including techniques from tree maps [5] to cone trees [11] to hyperbolic browsers [9]. However, these techniques handle data that is itself a hierarchy, such as filesystems; our work is focused on visualizing data sets that are organized using hierarchical meta-data, such as the schema shown in Figure 1.
The Pad++ project [1] has explored interfaces for navigating large data spaces using semantic zooming and task-based filtering. While Pad++ has explored multiscale displays, they do not address data management schemes for hierarchical multi-dimensional data.
The Sage group at CMU has worked with hierarchical data, combining their Aggregate Manipulator with dynamic queries [3]. However, they restrict their discussion of data manipulation to desired operations such as on-the-fly aggregation and filtering.
A popular approach for handling hierarchical data in the database community is the datacube [4]. Users can query the datacube for slices using interfaces such as the Pivot Table [6]. Datacubes handle hierarchical data by essentially building a datacube of datacubes, which users can then query. While our conceptual model is similar, we normalize the resulting datacubes into tables and provide a focus-plus-context interface for user exploration.
4 Conceptual Model
Using hierarchical meta-data to display a data set at multiple levels of detail is similar to using techniques such as mipmaps, clipmaps, and r-sets to render images at different resolutions with a constant amount of work.
A mipmap [16] is an image pyramid, where each level of the pyramid represents a different level of detail (i.e. the image at a different resolution). The display resolution available to the image determines which level of the pyramid is used. Clipmaps [13] extend mipmaps to handle arbitrarily large textures. A clipmap filters, or clips, the more detailed levels of the pyramid to just the area being displayed and keeps an "invalid" border to allow for quick update when panning. In both cases, zooming into and out of the image corresponds to moving up and down the pyramid, changing the level of detail. Another technique similar to mipmaps is r-sets [9], which vary the level of detail of the horizontal and vertical dimensions independently.
We can apply these ideas from the graphics community to hierarchical multi-dimensional data. However, rather than varying the level of detail over a single dimension (for mipmaps, the image resolution), we must handle multiple levels over multiple dimensions. Changing the level for a dimension is equivalent to transforming how the root fact table is aggregated over that dimension. For example, changing the User dimension to the ResearchGroup level means that all users in a particular research group are aggregated together. We then define a virtual fact table to be the root fact table transformed so that each dimension is at a specific level of detail. Examples of virtual fact tables given the schema presented in Figure 1 are shown in Figure 2.
Exploring a hierarchical multi-dimensional data set is then equivalent to exploring the space of virtual fact tables. We can think of navigating a graph of virtual fact tables, similar to the mipmap image pyramid, where a user can go from one virtual table to another by changing the level of detail of a single dimension. A portion of this graph for our data schema is shown in Figure 2.
We can also apply the clipmap extension to our virtual fact table model. Filtering the virtual fact table to show only a subset of the domain (e.g., showing only the graphics research group in our data) is analogous to clipping mipmaps. Changing this filter corresponds to panning the displayed image data.
We can divide the exploration of this type of data into two parts: the user interface to explore and specify which virtual fact table we want to display, presented in Section 5, and the generation of virtual fact tables, presented in Section6.
Note that there is an additional stage in the exploration process: determining how to display the generated virtual fact table. However, for this paper we focus on a single visual representation, specifically a collection of strip charts displaying data varying over time.
5 User Interface
To enable easy exploration of hierarchical multi-dimensional data, we had several goals when designing our user interface:
- Enable the user to navigate the dimensional hierarchies easily and intuitively by zooming into and out of the data.
- Allow the user to choose how to traverse dimensional hierarchies. This ability is especially important for exploring snowflake schemas, which can branch.
- Provide a focus-plus-context view of the hierarchy.
- Provide direct manipulation of elements in the hierarchy, enabling the user to scroll the data display and change the data filters.
Figure 3 shows our user interface for both a nominal and a quantitative dimension. The interface is multi-tiered, with each tier corresponding to a different level of the dimensional hierarchy. The tiers are presented from lower to higher levels of detail, corresponding to navigation from the leaves of the snowflake schema towards the dimension in the root fact table. In addition to the levels explicitly specified in the schema, we provide an “All” node, corresponding to the lowest possible level of detail. Conceptually, the “All” level is an implicit child of all leaf nodes in the meta-data hierarchy. Figure 4 shows how one dimension of the schema is mapped to the hierarchy traversed by the user interface.
Each entry in a tier is color-coded: purple if the entry can be further expanded and green otherwise. Each entry also has faint lines indicating how many elements there would be if the user expanded that entry. The user zooms into and out of the hierarchy by clicking on a purple entry to expand or collapse it. Trapezoids connect an expanded entry with its children in the next tier. An optional final tier enables users to select a subset of the entries from the current last level in the hierarchy; this subset is drawn in red. In order to prevent disconcerting shifts in the interface during exploration, levels of the hierarchy maintain a constant size; the amount of space allocated to the interface grows and shrinks as levels are expanded and collapsed.
The user can change the data filter by selectively expanding entries or scrolling the selected areas in the hierarchy. For example, if the user begins with the Research Group tier and expands the graphics entry, only users in the graphics group will be displayed. If the user then collapses the graphics entry and expands the robotics entry, only robotics students will be displayed. The user can expand multiple entries: clicking on the database entry will show both robotics and database students. The user can also scroll a selected area. In a quantitative dimension, for example, we can scroll the display from week to week. This functionality is also useful in a nominal dimension with too many entries (e.g., the application level has over 100 entries, and a strip chart per entry would be illegible, so scrolling becomes necessary at that point).
If there is more than one choice on how to expand an entry, as is the case in a snowflake schema, a pulldown menu enables the user to choose the expansion path. For example, the user can choose to expand the User dimension by Floor or Research Group.
The interfaces to nominal and quantitative dimensions are very similar: both allow the user to specify the desired level of detail and the data filter. For a quantitative dimension with a continuous domain, the level of detail determines the discretization granularity of the data. For example, specifying the Time dimension at the Hour level of granularity means that all tuples within an hour are aggregated together.
Because each interface displays both the level of detail and a filter for a single dimension, several interfaces together uniquely specify the desired virtual fact table.
6 Virtual Fact Table Generation
Given a schema specification (described in the Appendix) and a virtual fact table specification from the interfaces described in the previous section, we now discuss the series of data transformations that must be performed on the root fact table to generate the virtual fact table.