Loading the Database

ESTAT/Unit B5March 2012 (version 1.1)1/14

LOADING AND UPDATING

DATA AND STRUCTURAL METADATA

INTO EUROBASE

Version 1.1

TABLE OF CONTENTS

1. Initiation

2. Minimum requirements for the uploading of data

3. Data files formats

4. Processing of data

5. The staging environment

6. Updating procedure and dissemination of data

7. Workflow diagram

8. Addresses and contacts

This document is available in theCybernews pages of Unit B5

This document aims to informdomain managers (= data producers)how to load and update data and structural metadata (code lists, table titles, navigation tree, flags and special value)into the Reference database "Eurobase". Domain managers that receive this document should be responsible for the dissemination of the information to new colleagues and contractors.

Two types oftables are available to users via the Reference database: main tables and datasets.

Main tables:

They are used to provide easy access to the main statistical indicators. They are based in general on datasets and are derived from them. They are pre-defined, non-modifiable and presented as twoor three dimensional tables. They are accessible through the TGM (Tables, Graphs and Maps).

Datasets:

They are multi-dimensional tables, more appropriate for use by statistical and other experts via special applications. They are accessible through the Data Explorer and the bulkdownload.

Data and associated structural metadata are loaded, stored and updated in Eurobase. The loading of data and structural metadata are two interconnected processes and are coordinated by the Reference DBA team in unit B5.

1.Initiation

The process starts when the production unit requests a modification of the Reference database Eurobase. At this stage, there should be a distinction between standard updates of existing tables where the Reference DBA team has only a role of monitoring, from a real change of the structure of a domain or the creation of new tables that is done by the Reference DBA team.

The Reference DBA team may occasionally initiate the process, either for harmonisation, correction or improvements purposes.

A meeting, or at least an exchange of mails, should take place between domain managers and the Reference DBA team once an action has been requested. The information that should be provided concerns:

The structure and format of the data
The use of standard code lists or existing dictionaries (code lists)
The structure of the navigation tree
The appropriate time schedule
General administrative information.

The production unit should respect the list of minimum technical requirements (see below) for loading data in order to facilitate and accelerate this procedure.

A clear discussion between the two parties is very important at this stage in order to clarify the demands and constraints of each side.

2. Minimum requirements for the uploading of data

2.1 Structure

Data must be organised in order to find a reasonable balance between logical structure, size, density and number of tables in the domain, as well as the length of the structure path to reach the table in the navigation tree (this should not consist of more than six levels including theme, sub-theme and domain).

However, there should be more than one or two tables per folder.

2.2 Incoming files

Use a DFT or a TXT format (blank or tab delimited flat file).
Use "." as the decimal point (values cannot contain commas), and send data with the number of decimals you want to see in the Reference database.
Create at least one data file for each table; data files can be merged but not divided.
The size of the tables and the number of dimensionsin each table should be kept below certain levels.

-The size of the table (in compressed DFT format) should not exceed 8 MB and have a maximum of 6variable dimensions. The maximum of 6 variable (= not fixed) dimensions is recommended for the readability of the tables.

-The maximum file size acceptedfor updates is 10 MB, regardless the format.

Apply the standards for flags and special value (see2.5 below).

2.3 Dictionaries (code lists)

Adictionary (code list) contains all the positions and labels describing a dimension used in a table.The updating file should contain dimensions/codes that exist beforehand in the Reference database.

Use preferably the Standard Code Lists available in Ramon. Alldictionaries (code lists)used in the Reference databaseare available in 3 languages at the following address: Code lists used in the Reference database.

Dictionaries should be supplied in TXTorEXCEL format with codes in capitals and labels in English, French and German languages.Special attention should be paid to clear and coherent wording of labels and correct translations.

Dimension codes must have a maximum of 8 alphanumeric characters (only the underscore and the dashare allowed). Concerning the positions, there are no software limitations but less than 8-10 characters are recommended.Codes for every level of the navigation tree (folders, domain, tables) have to be unique and must be as complete as possible to identify clearly each dataset. Table codes are prefixed with the acronym of the domain to which they belong.

In general, for every change in the dictionaries:

Base the modification on the most recent version of the dictionary in the Reference database.
Please indicate exactly what has changed; it is highly recommended to use the MS Word option "track changes".
Ensure that the files are transmitted to B5 well in advance for loading or updating.
Check the coherence between the data file and the dictionaries before sending them to the Reference DBA team.

2.4 Other information

Domain manager(s) with name, address, phone number and e-mail address. The domain managers are responsible for the quality of the information they send to the Reference environment; they are the contact point for any information about the Reference database, unit B5 should thus be informed in case of change.

User-id of the person(s) responsible for the update of the data and/or to be informed in case of update.

For every new domain, the related reference metadata (ESMS) should be created. All information about reference metadata can be found in the chapter "Reference Metadata" in the Cybernews page for unit B5.

2.5 Flags and special value

Flags give supplementary information about the statistical values; they are represented by a code (usually a letter) which is stored in a separate field and shown next to the actual value. Special values are codes (usually a special character) which replace the statistical value.Only one non-numeric special value is used in the Reference database.

The flags and the special value are standardised in the Reference database and it is highly recommended to the domain managers to adopt them inthe production databases.

Flags are codes added to the data and defining a specific characteristic:

b=break in series

c= confidential

e=estimated

f=forecast

i=see metadata

The i-flag should be used for footnote-type information. The relevant explanations should be available in the ESMS (reference metadata), usually in "21. Comment", either as notes (21.1) or as detailed footnotes files (21.3). An example for detailed footnotes files is the ESMS for the domain EDUC().

n=not significant

p=provisional

r=revised

s=Eurostat estimate

u=unreliable

z=not applicable (incl. real zero)

The flags 'c', 'n' and 'z' are meaningful only when combined with the special value ':' = not available (': c', ': n', ': z').

A special value is a code that replaces the real data. Only one special value is used:

:=not available

The previously used special value '-' = not applicable (incl. real zero) is no longer used in the online database; it has been replaced by the flag 'z'.

Other conventions

0=less than half the final digit shown and greater than real zero

The symbol used to split the value and its flag(s) shown in the same cell is ~.

Logical combinations of several flags can be attached to the same value (example: 14.5~ei).

3. Data files formats

3.1 Blank delimited flat files

In standard blank delimited flat files, there must be only one record per line and each record must contain a fixed number of blank or tab delimited fields. The last field of each record always shows the value, which cannot contain commas ("." has to be used for decimal point). The other fields consist of codes that refer to Eurobase dictionaries (code lists); each field refers to a specific dictionary/dimension.

It should be avoided to use in the records of the flat file codes that may differ from Eurobase; in this case appropriate transcodification(mappings) routines have to be put in place. Production units inform the Reference DBA team about their codifications. Please remember to communicate any change in codifications.

Neitherthe header nor the codes may include any empty space.If a separation is required the underscore ("_") should be used.

Example:

FLAT_FILE=STANDARD

ID_KEYS=SAS,LFSQ,EGDN2

FIELDS=GEO,AGE,NACE_R2,SEX,DURATION,TIME

UPDATE_MODE=RECORDS(optional)

ATY15_24CFGE_3M2009Q0126.8

ATY15_24CFLT_3M2009Q01:

ATY15_24CMGE_3M2009Q0168.0

ATY15_24CMLT_3M2009Q01:

[...]

END_OF_FLAT_FILE

FLAT_FILE=STANDARD means blank or tab delimited fields, with one value at the end of each record.

ID_KEYS = the input data set identifier used by the Eurobasetable definitionfor identifying the target table. The input data set identifiers should be defined in advance and can be modified by the domain manager only in agreement with the Reference DBA team. The input data set identifiers must be different for each table and include normally software, domain and table code.

FIELDS = the codes of the columns’ dimensions as they are defined in the Eurobasetable definition. The number of fields must correspond to the number of columns of codes (excluding the value column).

There are 3 alternative update modes:

A. UPDATE_MODE=RECORDS means that sparse records are being sent to Eurobase. In this case, only real values are used to update the target table; records with "not available" are ignored and are not used for updating the target table.RECORDS is the updating mode used by default.

Example: If the producer sends the following two records to the Reference environment:

Dimension GEO / Dimension NACE / Values
BE / A / V1
DK / B / V2

The values that are not mentioned (BE-B and DK-A) are not modified; thus the update mode RECORDS will create the following table:

Dimension GEO / Dimension NACE
A / B
BE / V1 / old value
DK / old value / V2

B. UPDATE_MODE=CUBEIn this case, the updating routine takes into account real values, “not available” and all the combinations of codes even not explicitly included in the flat file. In the resulting cube, missing records will be filled with "not available". In this way it is possible to add data, to replace data and to erase data. This mode must be used with great cautionsince it is possible to make mistakes and accidentally erase values.

Example: If the producer sends the same two records as above, but uses the update mode CUBE, the following table will be created:

Dimension GEO / Dimension NACE
A / B
BE / V1 / :
DK / : / V2

Here, the system supposes that non-mentioned values do not exist or have to be deleted and replaces them with “:”.

C. UPDATE_MODE=REPLACE As in the CUBE mode, the updating routine takes into account real values, “not available” and all the combinations of codes even not explicitly included in the flat file for replacing the corresponding table. This mode should be used with cautionsince it is possible to accidentally erase the original table.

Here too, the system supposes that non-mentioned values do not exist and replaces them with “:”.

FLAG_DELIMITER = the character used in the flat file for separating values and flags (this line is optional). Each flag must have a length of one character, but more than one flag can be attached to each value; in this case there shall be no space between the different flags.

END_OF_FLAT_FILE = the standard record that ends a flat file. This record must end with a carriage return to be readable by the system. No trailing characters are allowed.

3.2 DFT and RLC formats

DFT files are intended to efficiently store data organised as multi-dimensional tables. It is a compact way of storing figures and it uses codes for the dimensions and the positions in each dimension.

The DFT files can be divided into two parts:

1.A header with information about the table and a description of the dimensions used.

2.A data set with values that can be displayed in the data presentation area of the browsing and editing programs

Description of DFT's sections / DFT file
General information
Mode replace or merge (by default) / INFO
Created: MON 21 JUN 2010 09:03:39 UPDATE_MODE=MERGE
Last update / LASTUP
WED 30 JUN 2010 17:12:07
Type of DFT file / TYPE
V
Characters used as delimiters / DELIMS
( ),@~
List of
dimensions' codes[1] / DIMLST
(soft,domain,table,sex,age,duration,nace_r2,geo,time)
List of
dimensions' category / DIMUSE
(R,N,N,V,V,V,V,V,V)
Lists of positions' codes
(one per dimension)[2] / POSLST
(sas)
(lfsq)
(egdn2)
(T,M,F)
(Y15_MAX,Y20_64,…………,15_74,Y50_74,Y75_MAX)
(LT_3M,GE_3M,UNK,NRESP)
(TOTAL,A,B,C,D,E,F,G,………………,R,S,T,U,NRESP)
(EU,EU27,EU25,EU15,………………,SE,UK,HR,IS,MK,TR,NO,CH)
(2010Q1,2009Q4,2009Q3,…….,2008Q3,2008Q2,2008Q1)
type of DFT format
(FORMAT0 or FORMATR) / FORMAT
FORMATR
Character used
for missing values / NOTAV
:
List of values / VALLST
(0)
(:,8981.0,9061.5,8200.0,7406.4,9954.6,10659.4,9725.9,9576.4,:,8981.0,
[...]
,:@135,5.9,4.2,4.1,6.2,:@793,28.0,39.8,29.4,12.8,:@9382,9.9~u,:@8851)
(-1)

UPDATE_MODE must be included in the INFO field on one single line.

The syntax must be precisely the following; otherwise the update mode will be ignored:

UPDATE_MODE=MERGEorUPDATE_MODE=REPLACE (no blanks or tabs allowed).

The two alternative update modes are:

MERGE: the incoming table is used for updating the target table, both real values and “not available” (code “:”). In this way it is possible to add data, to replace data and to erase data. MERGE is the updating mode used by default.

REPLACE: works exactly the same way, but the target table is replaced by the incoming one, and not added as in the merge mode.

TYPE. TheDFT files are always of type “V”, which stands for values. From this section onwards, it is not allowed to use space or tabulators in the text.

DELIMS, i.e. delimiters, tell you the beginning of a list (1st symbol), the end of a list (2nd symbol) and the separator between two common elements of a list (3rd symbol); the 4th symbol used in the RLC format is a multiplier (@) and the 5th is the delimiter for flags. It is not permitted to use any one of these characters in other ways and they must not be present in other elements (position, dimension, value, etc.) in order to avoid ambiguity.

DIMLST is a list of all the dimensions used. This list contains service information, identification of the dataset and “true” dimensions. The DIMUSE block specifies their use as following:

1.Reserved dimension (= R): service information including source database if applicable.

2.Name dimension (= N): definition of the data in the source database.

3.Variable dimension (= V): the true dimensions, definitions of the cells in the DFT.

The DIMUSE block refers to the block DIMLST.

• The first item of the list DIMLST is used as specified by the first item of the list DIMUSE,

• The second item of the list DIMLST is used as specified by the second item of the list DIMUSE list,

• ....

• The last item of the list DIMLST is used as specified by the last item of the list DIMUSE.

The POSLST paragraph shows for each dimension the list of positions belonging to this dimension (for R or N dimensions, this list must contain exactly one element). Empty lists are not allowed.

FORMAT is used to specify the coding of the values in the DFT file. Different formats use different conventions to define the cells. The old original FORMAT0 used to store cells one by one, i.e. all the values were mentioned. The new format FORMATR adopted is a RLC format, which uses a reduced VALLST: each sequence of identical values is written as: "one value followed by a multiplier".

Example:

the sequence: / :,:,:,:,:,:,:
is written as: / :@7
to be read as: / 7 times ":"

NOTAV specifies string used in a cell to represent an empty cell ("not available value"). The value declared in this field must correspond to the value used in the list of values VALLST.

4. Processing of data

In Eurobase, the upload process requires for each table the definition of three fields:

Dimensions:

It contains the list of dimensions and the corresponding code listsfor each table; this information is used to monitor if incoming files fit the agreed structure. If not, the files are rejected and the producers have to contact unit B5 in order to solve the problem.

Load dimension / Mappings:

Mappings are created in exceptional cases for the codes that have to be transcoded before the update of the Reference database. In parallel unit B5 will try to harmonise these codes in cooperation with the production unit.

Load parameters:

Itcontainsthe information about the 'input dataset identifiers' (=identification key to identify the target table), the list of the authorised persons and the corresponding update modes allowed.

It provides also the list of persons/functional mailboxes who receive the informationabout the updates (successful or not).

This table definition is required for the first upload of the files and used for every successive update of the data: updating is afterwards carried out automatically, since all the instruments are already in place.

5. The staging environment

Anindependent staging environment exits for testing purposes when a domain manager wants to load a new domain or new tables.

Both the Reference DBA team and the production unit have the opportunity to control the structure of the database and the domain manager can check the content of the tables.

The general updating procedure is explained in detail in chapter 6,but for testing purposes, the addresswhere to send update files is the following: "".