Example File Space Management Scenarios

HDF5 File Space Management

Introduced with HDF5 Release 1.10.0



Copyright Notice and License Terms for HDF5 (Hierarchical Data Format 5) Software Library and Utilities

HDF5 (Hierarchical Data Format 5) Software Library and Utilities

Copyright 2006-2012 by The HDF Group.

NCSA HDF5 (Hierarchical Data Format 5) Software Library and Utilities

Copyright 1998-2006 by the Board of Trustees of the University of Illinois.

All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted for any purpose (including commercial purposes) provided that the following conditions are met:

  1. Redistributions of source code must retain the above copyright notice, this list of conditions, and the following disclaimer.
  2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions, and the following disclaimer in the documentation and/or materials provided with the distribution.
  3. In addition, redistributions of modified forms of the source or binary code must carry prominent notices stating that the original code was changed and the date of the change.
  4. All publications or advertising materials mentioning features or use of this software are asked, but not required, to acknowledge that it was developed by The HDF Group and by the National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign and credit the contributors.
  5. Neither the name of The HDF Group, the name of the University, nor the name of any Contributor may be used to endorse or promote products derived from this software without specific prior written permission from The HDF Group, the University, or the Contributor, respectively.

DISCLAIMER: THIS SOFTWARE IS PROVIDED BY THE HDF GROUP AND THE CONTRIBUTORS "AS IS" WITH NO WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED. In no event shall The HDF Group or the Contributors be liable for any damages suffered by the users arising out of the use of this software, even if advised of the possibility of such damage.

Contributors: National Center for Supercomputing Applications (NCSA) at the University of Illinois, Fortner Software, Unidata Program Center (netCDF), The Independent JPEG Group (JPEG), Jean-loup Gailly and Mark Adler (gzip), and Digital Equipment Corporation (DEC).

Portions of HDF5 were developed with support from the Lawrence Berkeley National Laboratory (LBNL) and the United States Department of Energy under Prime Contract No. DE-AC02-05CH11231.

Portions of HDF5 were developed with support from the University of California, Lawrence Livermore National Laboratory (UC LLNL). The following statement applies to those portions of the product and must be retained in any redistribution of source code, binaries, documentation, and/or accompanying materials:

This work was partially produced at the University of California, Lawrence Livermore National Laboratory (UC LLNL) under contract no. W-7405-ENG-48 (Contract 48) between the U.S. Department of Energy (DOE) and The Regents of the University of California (University) for the operation of UC LLNL.

DISCLAIMER: This work was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately-owned rights. Reference herein to any specific commercial products, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or the University of California, and shall not be used for advertising or product endorsement purposes.


Contents

1. Introduction
1.1. Definitions and Concepts
2. File Space Allocation Mechanisms
2.1. Free-Space Manager
2.2. Aggregators
2.3. Virtual File Driver
3. File Space Management Strategies
3.1. The All Persist Strategy
3.2. The All Strategy
3.3. The Aggregator VFD Strategy
3.4. The VFD Strategy
4. Setting or Changing a File Space Management Strategy
4.1. Specifying a Strategy at File Creation with H5Pset_file_space
4.2. Changing the Strategy with h5repack
4.3. Summary of Strategies and Implementation
5. Example File Space Management Scenarios
5.1. Empty Files
5.1.1. Create an Empty File with the All Strategy
5.1.2. Create an Empty File with the All Persist Strategy
5.1.3. Create an Empty File with the Aggregator VFD Strategy
5.1.4. Create an Empty File with the VFD Strategy
5.2. Scenario A: The All Strategy, Multiple Sessions
5.2.1. Session 1: Create a File and Add Datasets
5.2.2. Session 2: Delete One Dataset
5.2.3. Session 3: Add One Dataset
5.2.4. Session 4: Add another Dataset
5.3. Scenario B: The All Persist Strategy, Multiple Sessions
5.3.1. Session 1: Create a File and Add Datasets
5.3.2. Session 2: Delete One Dataset
5.3.3. Session 3: Add One Dataset
5.3.4. Session 4: Add another Dataset
5.4. Scenarios C – F, Single Sessions
5.4.1. Scenario C: Create File, Manipulate Objects, All Strategy
5.4.2. Scenario D: Create File, Manipulate Objects, All Persist Strategy
5.4.3. Scenario E: Create File, Manipulate Objects, Aggregator VFD Strategy
5.4.4. Scenario F: Create File, Manipulate Objects, VFD Strategy
5.4.5. Comparing Scenarios A - F
5.5. Scenarios G – J, Single Sessions
5.5.1. Scenario G: Create File, Manipulate Objects, All Strategy
5.5.2. Scenario H: Create File, Manipulate Objects, All Persist Strategy
5.5.3. Scenario I: Create File, Manipulate Objects, Aggregator VFD Strategy
5.5.4. Scenario J: Create File, Manipulate Objects, VFD Strategy
5.5.5. Comparing Scenarios A, B, G - J
5.6. Scenarios K – N, no Objects Deleted
5.6.1. Scenario K: Create File, Add Objects, All Strategy
5.6.2. Scenario L: Create File, Add Objects, All Persist Strategy
5.6.3. Scenario M: Create File, Add Objects, Aggregator VFD Strategy
5.6.4. Scenario N: Create File, Add Objects, VFD Strategy
5.6.5. Comparing Scenarios K to N


1. Introduction

The space within an HDF5 file is called its file space. When a user first creates an HDF5 file, the HDF5 Library (also referred to in this document as the library) immediately allocates space to store file metadata. File metadata is information the library uses to describe the HDF5 file and to identify its associated objects. When a user subsequently creates HDF5 objects, the library allocates file space to store data values and the necessary additional file metadata. When a user removes HDF5 objects from an HDF5 file, the space associated with those objects becomes free space. The library manages this free space. The library’s file space management activities encompass both the allocation of space and the management of free space.

The library has a variety of mechanisms that allow it to implement several different file space management strategies. Users can select a strategy when they create an HDF5 file. Depending on the file’s usage patterns, one strategy may be better than the others. Users of HDF5 files that have large datasets added and removed on a regular basis might prefer one strategy while users of HDF5 files that are fairly static might prefer a different strategy.

This document describes the available file space allocation mechanisms and strategies, the tools (API and command line) that are available to set or change a strategy, and how the file space management strategies affect the file size and access time for various HDF5 file usage patterns.

1.1. Definitions and Concepts

The following are some terms and concepts used in this document.

Session

A session comprises all actions on a file from the time it is opened until it is closed. Opening the file again starts a new session. Depending on the file space management strategy that was chosen for the file, file space information might be tracked during the current session only, tracked over multiple sessions, or not tracked during any session.

Tracked Free Space and Unaccounted Space

Space within an HDF5 file is released when an object is removed from the file. While the file is open, the freed space is monitored and reported as tracked free space. Depending on the file space management strategy that was chosen for the file, the tracked free space may be stored when the file is closed (at the end of the current session). If the tracked free space is not stored when the file is closed, it is considered unaccounted space the next time the file is opened.

Tracked free space is sometimes referred to as tracked space.

Tracked free space and unaccounted space are reported in the output of the command h5stat -S.

Tracked free space and unaccounted space can be reclaimed with the h5repack tool.
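
The relationship between sessions, tracked free space, and unaccounted space can be illustrated with a small model. The Python sketch below is not HDF5 code: the class ToyFile, its tracking modes ("persist", "session", "never"), and its attribute names are hypothetical and exist only for illustration. It also assumes that a file whose free space is never tracked treats freed space as unaccounted immediately.

```python
class ToyFile:
    """Toy model of free-space bookkeeping across sessions (not the HDF5 Library)."""

    def __init__(self, tracking):
        # Hypothetical modes: "persist" (tracked across sessions),
        # "session" (tracked in the current session only), "never".
        if tracking not in ("persist", "session", "never"):
            raise ValueError("unknown tracking mode: " + tracking)
        self.tracking = tracking
        self.stored_free = 0    # tracked free space saved when the file was closed
        self.tracked_free = 0   # free space known during the current session
        self.unaccounted = 0    # freed space that is no longer tracked
        self.is_open = False

    def open(self):
        # A new session begins; only persisted information is reloaded.
        self.tracked_free = self.stored_free
        self.is_open = True

    def remove_object(self, nbytes):
        if not self.is_open:
            raise RuntimeError("file is not open")
        if self.tracking == "never":
            # Assumption: untracked freed space is unaccounted right away.
            self.unaccounted += nbytes
        else:
            self.tracked_free += nbytes

    def close(self):
        if self.tracking == "persist":
            self.stored_free = self.tracked_free
        else:
            # Tracking information is dropped at the end of the session.
            self.unaccounted += self.tracked_free
        self.tracked_free = 0
        self.is_open = False
```

In this model, removing an object in one session and reopening the file shows the difference: with "persist" the freed bytes are still tracked in the next session, while with "session" they reappear as unaccounted space.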

VFD

VFD is short for Virtual File Driver.

Raw Data

Raw data are the data values in HDF5 dataset objects. For example, in a file that holds weather data, the raw data might include temperatures at different locations and at a variety of times.

File Metadata

File metadata is information the library uses to describe the HDF5 file and to identify its associated objects. One example is the file space management strategy used by a file. The strategy is stored in file metadata. For more information, see the “HDF5 Metadata” paper.

2. File Space Allocation Mechanisms

The HDF5 Library has three different mechanisms for allocating space to store file metadata and raw data. These are described in the sections below.

2.1. Free-Space Manager

The HDF5 Library’s free-space manager tracks sections in the HDF5 file that are not being used to store file metadata or raw data. These sections will be of various sizes. When the library needs to allocate space, the free-space manager searches the tracked free space for a section of the appropriate size to fulfill the request. If a suitable section is found, the allocation can be made from the file’s existing free space. If the free-space manager cannot fulfill the request, the request falls through to the aggregator level.
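
As a rough illustration of this behavior, the following Python sketch models a free-space manager as a list of (offset, size) sections. It is a simplified model, not the library's implementation: the class and method names are hypothetical, and the best-fit search and the splitting of oversized sections are assumptions made for the example.

```python
class ToyFreeSpaceManager:
    """Toy free-space manager holding (offset, size) sections (not HDF5 code)."""

    def __init__(self):
        self.sections = []

    def free(self, offset, size):
        # Record a section that no longer stores file metadata or raw data.
        self.sections.append((offset, size))

    def allocate(self, size):
        """Return the offset of a suitable section, or None if none fits.

        None means the request would fall through to the aggregator level.
        """
        candidates = [s for s in self.sections if s[1] >= size]
        if not candidates:
            return None
        # Assumption: pick the smallest section that is large enough.
        offset, section_size = min(candidates, key=lambda s: s[1])
        self.sections.remove((offset, section_size))
        if section_size > size:
            # Keep tracking the unused remainder of the section.
            self.sections.append((offset + size, section_size - size))
        return offset
```

A request that no tracked section can satisfy returns None here, which corresponds to the request falling through to the next mechanism.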

2.2. Aggregators

The HDF5 Library has two aggregators. Each aggregator manages a block of contiguous bytes in the file that has not been allocated previously. One aggregator allocates space for file metadata from the block it manages; the other aggregator handles allocations for raw data. The maximum number of bytes in each aggregator’s block is tunable.

If the library’s allocation request exceeds the maximum number of bytes an aggregator’s block can contain, the aggregator cannot fulfill the request, and the request falls through to the virtual file driver level.

After space has been allocated from an aggregator’s block, that space is no longer managed by the aggregator. If the space is later freed, it comes under the control of the free-space manager; it does not revert to the aggregator. Unallocated bytes in the block continue to be managed by the aggregator.

If an aggregator does not have enough space in its block to fulfill a request, it will then request a new block of contiguous bytes from the virtual file driver. Any unallocated space from the old block will become free space.
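
The aggregator rules described in this section can be sketched as follows. This Python model is illustrative only: ToyAggregator and the callables extend_file and release are hypothetical stand-ins for the virtual file driver and the free-space manager, and the model always requests a block of the maximum size, which is an assumption.

```python
class ToyAggregator:
    """Toy aggregator managing one block of contiguous bytes (not HDF5 code)."""

    def __init__(self, max_block_size, extend_file, release):
        self.max_block_size = max_block_size
        self.extend_file = extend_file  # callable(nbytes) -> offset; stands in for the VFD
        self.release = release          # callable(offset, size); stands in for freeing space
        self.block_offset = 0
        self.block_remaining = 0

    def allocate(self, size):
        """Return an offset, or None if the request must fall through to the VFD."""
        if size > self.max_block_size:
            return None
        if size > self.block_remaining:
            # Not enough room: leftover bytes of the old block become free space,
            # and a new block is requested from the file driver.
            if self.block_remaining > 0:
                self.release(self.block_offset, self.block_remaining)
            self.block_offset = self.extend_file(self.max_block_size)
            self.block_remaining = self.max_block_size
        offset = self.block_offset
        self.block_offset += size
        self.block_remaining -= size
        return offset


def make_toy_file():
    """Return (extend_file, eof) where eof is a one-item list holding the file size."""
    eof = [0]

    def extend_file(nbytes):
        offset = eof[0]
        eof[0] += nbytes
        return offset

    return extend_file, eof
```

For example, with a 2048-byte block, a 1500-byte request followed by a 1000-byte request causes the remaining 548 bytes of the first block to be released and a second block to be obtained from the driver stand-in.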

2.3. Virtual File Driver

The HDF5 Library’s virtual file driver (VFD) interface dispatches requests for additional space to the allocation routine of the file driver associated with an HDF5 file. For example, if the POSIX file driver, H5FD_SEC2, is being used, its allocation routine will increase the size of the single file on disk that stores the HDF5 file contents to accommodate the additional space that was requested. For more information on VFDs, see the “Alternate File Storage Layouts and Low-level File Drivers” section in “The HDF5 File” chapter in the HDF5 User’s Guide.
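
Taken together, the three mechanisms form a chain in which a request that one level cannot satisfy falls through to the next. The Python sketch below models that chain under simplifying assumptions: ToyAllocator is a hypothetical name, it uses a single aggregator (the library has two), a first-fit search of free sections, and a driver level that simply grows the end of the file.

```python
class ToyAllocator:
    """Toy chain: free-space manager -> aggregator -> file driver (not HDF5 code)."""

    def __init__(self, aggregator_block_size=2048):
        self.eof = 0                # end of file; only the driver level grows it
        self.free_sections = []     # (offset, size) pairs of tracked free space
        self.block_size = aggregator_block_size
        self.block_offset = 0
        self.block_remaining = 0

    def _driver_alloc(self, size):
        # Driver level: extend the file to make room for the request.
        offset = self.eof
        self.eof += size
        return offset

    def _free_space_alloc(self, size):
        # Assumption: first-fit search over the tracked sections.
        for i, (offset, section_size) in enumerate(self.free_sections):
            if section_size >= size:
                del self.free_sections[i]
                if section_size > size:
                    self.free_sections.append((offset + size, section_size - size))
                return offset
        return None

    def _aggregator_alloc(self, size):
        if size > self.block_size:
            return None
        if size > self.block_remaining:
            if self.block_remaining:
                # Leftover bytes of the old block become free space.
                self.free_sections.append((self.block_offset, self.block_remaining))
            self.block_offset = self._driver_alloc(self.block_size)
            self.block_remaining = self.block_size
        offset = self.block_offset
        self.block_offset += size
        self.block_remaining -= size
        return offset

    def allocate(self, size):
        """Return (offset, name of the level that satisfied the request)."""
        offset = self._free_space_alloc(size)
        if offset is not None:
            return offset, "free-space manager"
        offset = self._aggregator_alloc(size)
        if offset is not None:
            return offset, "aggregator"
        return self._driver_alloc(size), "file driver"

    def free(self, offset, size):
        self.free_sections.append((offset, size))
```

In this model the file grows only when the driver level is reached, either directly for a large request or indirectly when the aggregator needs a new block.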

3. File Space Management Strategies

The file space allocation mechanisms described above can be used to implement a variety of file space management strategies. The strategies differ in two main ways: when the library will track free space and how many of the mechanisms the library will use to allocate space for file metadata and raw data. The strategies are listed in the table below and are described in more detail in the sections following the table.

Table 1. File space management strategies

Strategy Name: All Persist
The strategy might be useful under these conditions: Use with files where raw data and metadata are added and removed frequently and where the files are opened and closed frequently. Maximizes the use of space in a file over any number of sessions.
Implementation Comments:
·  Uses all of the file space allocation mechanisms
·  Tracks file free space across sessions

Strategy Name: All
The strategy might be useful under these conditions: Use with files where raw data and metadata are added and removed frequently. Maximizes the use of space in a file during a single session.
Implementation Comments:
·  Uses all of the file space allocation mechanisms
·  Tracks file free space only in the current session

Strategy Name: Aggregator VFD
The strategy might be useful under these conditions: Use with files where small datasets might be added and where few if any datasets are removed. Adding small datasets means the library can take advantage of the aggregators. Maximizes rate at which small datasets are written to the file.
Implementation Comments:
·  Uses the aggregator and VFD mechanisms
·  Never tracks free space

Strategy Name: VFD Only
The strategy might be useful under these conditions: Use with files where large amounts of raw data are added and where few if any datasets are removed. Maximizes rate at which data is written to the file.
Implementation Comments:
·  Uses only the VFD mechanism
·  Never tracks free space
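
The contents of Table 1 can also be written down as a small lookup structure. The Python sketch below is only a restatement of the table: the names ToyStrategy, STRATEGIES, and suggest_strategy are hypothetical, and the selection rules are a rough paraphrase of the conditions listed above, not guidance from the library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyStrategy:
    name: str
    mechanisms: tuple        # allocation mechanisms the strategy uses
    tracks_free_space: str   # "across sessions", "current session", or "never"


ALL_MECHANISMS = ("free-space manager", "aggregators", "VFD")

STRATEGIES = {
    "All Persist": ToyStrategy("All Persist", ALL_MECHANISMS, "across sessions"),
    "All": ToyStrategy("All", ALL_MECHANISMS, "current session"),
    "Aggregator VFD": ToyStrategy("Aggregator VFD", ("aggregators", "VFD"), "never"),
    "VFD Only": ToyStrategy("VFD Only", ("VFD",), "never"),
}


def suggest_strategy(objects_removed_often, reopened_often, mostly_small_datasets):
    """Paraphrase of the 'might be useful' column of Table 1 (illustrative only)."""
    if objects_removed_often:
        # Free space is worth tracking; persist it if the file is reopened often.
        return "All Persist" if reopened_often else "All"
    # Few removals: skip free-space tracking and favor write rate.
    return "Aggregator VFD" if mostly_small_datasets else "VFD Only"
```

A call such as suggest_strategy(True, True, False) returns "All Persist", matching the first entry of the table.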

For more information on implementing one of these strategies, see Section 4, “Setting or Changing a File Space Management Strategy.”

3.1. The All Persist Strategy

The aim of the All Persist strategy is to maximize the use of space within an HDF5 file over a number of sessions.