DLI/IDD 1997 Workshop

DLI Ordering and Processing

3.1 Identifying files for acquisition

The two driving forces behind data acquisition are expressed need on the part of a user and collection development policy.

Be aware that it is the explicit policy of the DLI that data files should be acquired only in response to expressed user need, and not simply because a file is available from DLI.

The presumption here is that a reference interview has taken place. Therefore, ascertain the following:

either the specific data file the user requires, or

the variables the user needs, including

what coding of these variables is needed

what time period the data should cover

what geographic area the user needs to describe

what the user needs to describe: individuals, groups of individuals, or a geographic area (i.e. what unit of observation is needed)

what the user intends to do with the numbers

what product the user wants

what software the user will be using

what platform the user will be doing his/her work on.

3.1.1 Selecting files

The characteristics of a data file can be categorized into those which describe the intellectual content of the data file, and those which describe the physical form of the data file.

Characteristics of the intellectual content:

Variables

substantive (dependent variables) versus demographic (independent variables)

level of coding of variables (e.g. age or income as a categorical or continuous variable)

Time period of data

 date of data collection

 time period covered by data (which is not necessarily the date of collection; e.g., the income variables in the Survey of Consumer Finances and the Census normally refer to the previous year.)

Geography

coverage of geographic area

availability of variables to identify required level of geography. For example, the Survey of Consumer Finances microdata files contain coding for region, but not province or census metropolitan area.

Level of observation

microdata

aggregate data

time-series data

It is possible to aggregate from microdata to aggregate data, but not to disaggregate already aggregated data into smaller units or into microdata.

Desired Output Data Type  / Microdata / Aggregate data / Time-series data
Microdata / x / x
Aggregate data / x / x
Time-series data / x
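
The one-way nature of aggregation can be illustrated with a minimal sketch in Python. The records, region names and income values below are hypothetical and are not taken from any DLI file; the point is only that individual records can be collapsed into group summaries, while the summaries alone cannot be turned back into individual records.

```python
from collections import defaultdict

# Hypothetical microdata: one record per individual (region code and income).
microdata = [
    {"region": "Atlantic", "income": 30000},
    {"region": "Atlantic", "income": 50000},
    {"region": "Prairies", "income": 40000},
]

def aggregate_by(records, key, value):
    """Collapse individual records into a count and a mean per group."""
    totals = defaultdict(lambda: {"n": 0, "sum": 0})
    for rec in records:
        group = totals[rec[key]]
        group["n"] += 1
        group["sum"] += rec[value]
    return {k: {"n": g["n"], "mean": g["sum"] / g["n"]} for k, g in totals.items()}

aggregated = aggregate_by(microdata, "region", "income")
```

Once only the count and mean per region are kept, the individual incomes are no longer recoverable, which is why a request for microdata cannot be met from an aggregate file.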

For spatial data, and georeferenced data (standard Statistics Canada geography), it is possible to merge smaller units into larger ones, but not to disaggregate larger units into smaller ones. The following table displays, for 1991 census geographic products, the level of geography that can be generated (output) from spatial or georeferenced products at the commonly used input levels:

Output
Input / ea / ct / cma/ca / csd / cd / fed / rp
ea / x / x / x / x / x / x / x
ct* / x / x / (x) / (x) / (x)
cma/ca / x
csd / x / x / x
cd / x / x
fed / x / x

*Note: with 1971, 1976, and 1981 census tract level data (which included census tracts in census metropolitan areas/census agglomerations as well as provincial census tracts) it was also possible to generate census subdivision, census division and region/province level data. With 1986 and 1991 census tract level data, which include only census tracts in census metropolitan areas/census agglomerations, this is no longer possible.

Edition/version of the data

version of data: different software-dependent versions

edition (1st, 2nd, 3rd, etc.)

Characteristics of format

Access to the data

direct access by the user or access through an intermediary

ease of use (familiarity of software)

remote delivery versus hardware/software infrastructure to use/access

Input requirements of task

‘manual’ input

output from another task

Output requirements of task (what output does user need?)

subset (file) for further analysis

generic format

software-dependent format

report

table

map

3.2 Ordering/acquisition process

The acquisition process consists of two parts: first establishing that the data are available from DLI (and if not, where else the data might be available), and secondly, actually acquiring a copy of the data.

3.2.1 Establishing availability

3.2.1.1 If you know the title of the data file you require, you can determine its availability through DLI using information from the DLI web site or the DLI mail lists:

the DLI WWW-site:

lists data files that will become available through DLI

lists data files that are currently available for ftp from the DLI ftp site, including the subdirectory in which each is to be found

The ‘dlilist’ listserv:

New data files are announced via this listserv. If you have not already done so, you should subscribe to it.

To subscribe, send an e-mail message to ,

with message text: subscribe dlilist [your first name] [your last name]

It is a good idea to save the messages from this listserv, perhaps in alphabetical order by title, for future reference, so that when you receive an inquiry, you will have the information to hand.

Dlilist runs using the listproc software. It is important to make a distinction between messages which manage your subscription to the list (such messages should be sent only to ) versus those messages which are to everyone else on the list, which messages should be sent to

Table of listproc commands:

N.B. These commands should be sent to:

help [topic] / get information re listproc commands
set [listname] [option] [argument] / with [option] [argument] change option to new value
subscribe [listname] [your name] / subscribe to specified list
unsubscribe [listname] / remove yourself from specified list
signoff [listname] / same as ‘unsubscribe’
recipients [listname] / receive a listing of non-concealed people subscribed to specified list
review [listname] / same as ‘recipients’
information [listname] / receive general information file about specified list
index [listname] / get a list of files in [list] archive
get [archive] [filename] / receive a copy of specified file(s) from specified archive
search [archive] [pattern] / receive a list of archive files that contain the character string [pattern]
which / receive a list of the local lists to which you are subscribed

3.2.1.2 If you do not know the title of the data file:

A separate handout will include a list of other reference tools which are available to determine the exact title and version/edition of the data file you require.

3.2.1.3 If the data are not currently available via DLI, but you think they should or might be, send an inquiry to . Responses are now received within a day or two.

A WWW-based order status system has been set up on the DLI WWW-page, at:

Access to this order status page is subject to two conditions:

only the official DLI-contact person at your institution can access the page

access is only possible from a platform that has had its IP-address registered with Mr. Jackie Godfrey (e-mail to to register your IP-address for this purpose).

3.2.1.4 Knowing what data are not available from DLI, nor will be, is as important as knowing what is available.

Data files that are not available through DLI:

data that are collected by Statistics Canada but for which no standard computer-readable product is produced (such as the CANSIM cross-classified database, etc.)

data collected by other federal government departments (other than Statistics Canada)

data collected by provincial and municipal government departments

attitudinal or poll data

A separate handout will offer suggestions as to where to search for data files that are not available via DLI.

3.2.2 Acquiring the data

Data files from DLI are disseminated in two major ways:

those that are cd-rom products are mailed in one copy only, with documentation, to the official DLI-contact.

all other products are made available via ftp to the official DLI-contact from the Statistics Canada ftp site at ftp://ftp.statcan.ca

The official DLI-contact receives occasional e-mail messages containing the current password for the ftp site. There are plans to provide a WWW-interface so that files can also be ftp’d via WWW.

To acquire a copy of the data,

if the data file is a cd-rom product, the official DLI-contact person should

send an e-mail message to

if the data file is available via ftp, the official DLI-contact person should:

ftp the data from ftp://ftp.statcan.ca

and

send an e-mail message to requesting a copy of the documentation.

3.2.3 Ftping files

The File Transfer Protocol allows you to copy files from one computer to another. It is possible to transfer files to or from a remote computer, regardless of format, allowing you to retrieve software programs, graphic images, sound files, etc., in addition to ASCII text files.

To access ftp-able resources you must use one of the following: a superclient (such as Mosaic, Netscape or lynx), a gopher client, or an ftp client (such as ftp on Unix, WinSockFTP or Rapid Filer on Windows, or kermit on DOS).

[Note from Laine: Be aware that Rapid Filer is the only ftp client I am aware of that will allow you to ftp the content of an entire subdirectory with one command; however, it does so by assuming the transfer mode of each file on the basis of the file extension, and thus is only useful when all files in the subdirectory have standard MIME-compliant extensions. Other clients, such as WS_ftp, require you to specify the transfer mode on a file-by-file basis, which gives you control over the mode of transfer of each file.]

SYNTAX

(Unix client): ftp [site] [port]

(from within a superclient): ftp://[site]:[port]

COMMANDS (selected)

FTP accepts only a limited set of commands. Not all FTP servers accept all of the commands listed below; which commands are acceptable will depend on both the FTP server software installed on the remote host and the FTP client software you are using. Use ‘help’ or ‘help ftp’ to display the commands available in the version of FTP software you are using. Selected FTP commands are:

Command / Action

Navigation on the REMOTE host (host you are ftping to):

cd [dirname] / change remote working directory
cdup / change directory ('upwards') to the parent directory (same as 'cd ..')
cd .. / change directory ('upwards') one subdirectory
close / close connection to remote host
del [fn] / delete a file
dir / list files in current directory
get [fn] |more / display specified file without copying it to local system - use ‘q’ to exit
help / display help information
ls -al / list content of current directory
mdel * / delete all files in a directory
mkdir [dirname] / make a new directory
open [site] / connect to new system [site]
pwd / print path information to current directory
rmdir [dirname] / remove a directory

Navigation on the LOCAL host (host you are ftping from):

!dir / display a list of files in current directory
!dir |more / display list of files one page at a time
lcd [dirname] / change 'local' directory
lcd .. / move up/back one directory
lcdup / move up/back one directory
!ls -al / display a list of files in current directory
!ls -al |more / display list of files one page at a time
!mkdir [dirname] / create a new directory with name [dirname]
!pwd / display path information to current local directory
!rmdir [dirname] / delete directory

Copying files between hosts

ascii / change transfer mode to transfer ASCII text files
binary / change transfer mode to transfer binary files
prompt / turn prompting for transfer of each individual file off/on
get [fn] / copy specified file from remote host to local host
mget * / copy all files from remote host to local host
mput * / copy all files from local host to remote host
put [fn] / copy specified file from local host to remote host
<ctrl><c> / cancel file transfer
<ctrl><z> / suspend file transfer
bg %1 / resume the suspended transfer in the background
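
The interactive commands above have rough equivalents in scripting libraries. The following is a minimal sketch, assuming Python's standard ftplib module, which is not part of the workshop materials. The function name fetch_file, the login name, the password, the subdirectory and the local filename are all hypothetical placeholders; only the host name and Readme.first come from this handout.

```python
import ftplib

def fetch_file(ftp, remote_dir, filename, local_path, binary=True):
    """Copy one remote file: roughly 'cd', then 'binary' or 'ascii', then 'get'."""
    ftp.cwd(remote_dir)  # cd [dirname]
    if binary:
        # Binary mode: bytes are written to disk exactly as received.
        with open(local_path, "wb") as out:
            ftp.retrbinary("RETR " + filename, out.write)
    else:
        # ASCII mode: ftplib delivers each line without its end-of-line
        # characters; writing "\n" to a text-mode file produces the local
        # end-of-line convention.
        with open(local_path, "w") as out:
            ftp.retrlines("RETR " + filename, lambda line: out.write(line + "\n"))

def main():
    # Hypothetical session; replace the placeholders before use.
    with ftplib.FTP("ftp.statcan.ca") as ftp:
        ftp.login("your-user", "your-password")
        fetch_file(ftp, "some/subdirectory", "Readme.first", "readme.txt", binary=False)
```

The transfer logic is kept in a function that receives an already-open connection, so the same code can be reused for several files in one session.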

Notes (for the Unix ftp client):

You must be logged on to both the remote and local computer simultaneously, and have an account and password on both computers.

Read the Readme.first files. On a Unix server (such as ftp.statcan.ca) it is possible to read files in remote ftp directories without actually 'getting' them first.

Type get [filename] |more

and use <ctrl><c> or <q> to exit reading mode.

Actually, it is a good idea both to read the Readme.first files and to ‘get’ them, for later reference.

Two files that are especially useful on the ftp.statcan.ca site are in the ftp root directory (the very first directory you see when you login):

Readme.first contains a list of data file titles with the corresponding subdirectory names

Dirlist.txt contains a directory listing of the entire ftp site, and is very useful for discovering how the site is organized and where all possible variants of a file are, especially in very complex subdirectories, such as the ‘geography’ subdirectory.

Unix systems are case-sensitive. ALWAYS give the directory or filename(s) exactly as shown, including punctuation, and upper and lower case characters.

Distinguish between directories and files. Only files can be transferred.

Enter 'cd [subdirectory name]' to move from one directory to the next lower subdirectory.

Enter 'cd ..' or 'cdup' to move back up in the subdirectory hierarchy, one directory level at a time. Use cd ../../.. to move up three subdirectories at a time, etc.

To 'get' or 'put' a file, you must know how much disk space you have free, and how big the file is; use 'ls -al' or 'dir' to display file sizes. On a Unix system, use ‘df’ to display remaining disk space before you run ftp.

When 'getting' a file, you may supply a filename for the incoming file, if you wish to change it, e.g. get Readme.first readme.gss10.

Use 'get' to get one file, or 'mget' to get several files with the same filename characteristics. The 'wild card' is '*'; it works with 'mget' (and the other multiple-file commands, such as 'mput' and 'mdel'), but not with ‘get’.

E.g. mget *.txt

If you make a mistake in typing a filename, try to backspace using:

the backspace key, or

<ctrl><h>, or

<ctrl><backspace>.

To abort a file transfer, the terminal interrupt key sequence is usually <ctrl><c>. To merely suspend a file transfer, use <ctrl><z>, followed by ‘bg %1’ to resume the transfer in the background. Be very scrupulous in checking file sizes after ftping in the background.

It is considered good netiquette to avoid using ftp sites during their working hours, and to not linger at a site any longer than is necessary to retrieve the files you need.
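
Three of the notes above (case-sensitive wild cards, checking free disk space, and checking file sizes after a transfer) can be sketched together in Python. This is an illustration only: the function names and the listing of file sizes are hypothetical, and only the names Readme.first and Dirlist.txt come from this handout.

```python
import fnmatch
import os
import shutil

def select_files(names, pattern):
    """Mimic 'mget [pattern]': keep names matching the wild card, case-sensitively."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]

def has_room_for(total_bytes, directory="."):
    """Like running 'df' first: is there enough free space in 'directory'?"""
    return total_bytes <= shutil.disk_usage(directory).free

def size_matches(local_path, remote_size):
    """After a binary-mode transfer, local and remote sizes should be equal."""
    return os.path.getsize(local_path) == remote_size

# Hypothetical remote listing: filename -> size in bytes, as shown by 'dir'.
remote_listing = {
    "Readme.first": 2048,
    "Dirlist.txt": 51200,
    "gss10.txt": 700000,
    "GSS10.TXT": 700000,
}

wanted = select_files(remote_listing, "*.txt")           # GSS10.TXT does not match
needed = sum(remote_listing[name] for name in wanted)    # bytes required locally
```

Note that the size comparison is only meaningful for binary-mode transfers; a file moved in ASCII mode between operating systems can legitimately change size because its end-of-line characters are converted.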

Transferring files between different environments

When ftping files, i.e. transferring them from one computing environment to another, two things are very important:

whether the file contains ‘binary’ codes, especially when being copied between an ASCII environment and an EBCDIC environment;

end-of-line conventions in the environments between which the file is being transferred.

When files are transferred in binary mode, they are copied from one system to another exactly. This is absolutely essential for files in a software-dependent format, especially files which contain binary codes (i.e. ASCII upper-128 codes). If the file being copied contains binary characters and is transferred in ASCII mode, some of the binary characters may be interpreted as ftp control characters, which may either terminate the ftp session or corrupt the transferred file. When transferring a file containing binary codes between an ASCII environment and an EBCDIC environment, translation problems may also occur with ASCII upper-128 codes if the file is not transferred in binary mode.

When files are transferred in ASCII mode, however, although the bulk of the file is copied byte by byte as is, some things do change, especially when you are ftp-ing between different operating systems. When a file is moved from a DOS/Windows environment to a Unix environment in ASCII mode, the end-of-line (EOL) marker at the end of each physical record is changed from the two characters DOS uses (CR-LF) to the single character Unix uses (LF); the reverse change is made in the other direction. CMS, on the other hand, normally stores files in a fixed-length format, and the length of each record is constant.

NEWLINE CHARACTERS IN PLAIN TEXT DATA FILES[1]

Data files / typically have added / bytes per line
‘mainframe’ tapes / neither CR nor LF / Zero
written for IBM mainframes / neither CR nor LF / Zero
written for Unix / LF / One
written for DOS / both CR and LF / Two
written for Macintosh / CR / One

In short, if the file is plain text and its EOLs need to be converted to the convention of the receiving system, move the file between different operating systems in ASCII mode; if EOL codes are irrelevant (e.g. in system files) or the file contains binary fields, move the file between operating systems in binary mode.
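
The effect of end-of-line conversion can be sketched in a few lines of Python. This is a simplified illustration of what an ASCII-mode transfer between DOS and Unix does to line endings, not a description of any particular ftp client; the two-record data file and the binary field are hypothetical.

```python
def dos_to_unix(data: bytes) -> bytes:
    """Replace each CR-LF pair with a single LF."""
    return data.replace(b"\r\n", b"\n")

def unix_to_dos(data: bytes) -> bytes:
    """Replace each LF with CR-LF; normalise first so existing pairs are not doubled."""
    return dos_to_unix(data).replace(b"\n", b"\r\n")

# Hypothetical two-record raw data file written on DOS (CR-LF after each record).
dos_text = b"001 25 M\r\n002 31 F\r\n"
unix_text = dos_to_unix(dos_text)   # one byte shorter per record

# Hypothetical binary field that happens to contain the byte values of CR and LF.
binary_field = bytes([0x01, 0x0D, 0x0A, 0x80])
damaged = dos_to_unix(binary_field)  # the 0x0D byte is silently dropped
```

For the text file the conversion is harmless and reversible, but the binary field loses a byte, which is the kind of corruption that a binary-mode transfer avoids.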

Enter the ‘binary’ command before ftping a binary file. Use the ‘ascii’ command prior to transferring text files. The following table lists some common file name extensions, as well as some commonly occurring data-related file types, and whether the files should be transferred in binary or ascii mode.

File type/extension: / Ftp mode: / Operating system:
.arc file / binary
.cat file / binary / DOS/Mac
.com file / binary / DOS
.doc file / binary / DOS/Mac
.exe file / binary / DOS/Unix
.gz file / binary
.tar file / binary
.wp[n] file / binary / DOS/Mac
.z file / binary
.zip file / binary
.Z file / binary
ArcInfo export files / ascii / DOS/Mac
ASCII text file / ascii
dBase file (.dbf) / binary / DOS/Mac
DDMS dictionary file / ascii / DOS
Lotus 1-2-3 file / binary / DOS/Mac
MapInfo files / binary / DOS/Mac
PDF file (.pdf) / binary
PostScript file (.ps) / ascii
raw data / ascii
SPSS system file / binary
SAS system file / binary
SPSS export file / ascii
SAS export file / binary
SPSS command file / ascii
SAS command file / ascii

If in doubt, copy the file twice, once in binary mode and once in ascii mode, to different filenames of course, and see which one gives the most satisfactory result.
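
The extension-based part of the table above can be turned into a small lookup, sketched here in Python. The function name ftp_mode is hypothetical, and the ".txt" entry is an assumption standing in for the table's "ASCII text file" row, which names no extension. File types that the table identifies by product rather than by extension (SPSS and SAS files, raw data, and so on) are deliberately left out, so the function returns None for them.

```python
import os

# Extensions listed in the table above as binary-mode transfers.
# Comparison is done in lower case, so ".Z" and ".z" are both covered.
BINARY_EXTENSIONS = {
    ".arc", ".cat", ".com", ".doc", ".exe", ".gz",
    ".tar", ".z", ".zip", ".dbf", ".pdf",
}

# ".ps" is from the table; ".txt" is an assumed extension for ASCII text files.
ASCII_EXTENSIONS = {".ps", ".txt"}

def ftp_mode(filename):
    """Return 'binary', 'ascii', or None when the extension is not in the table."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in BINARY_EXTENSIONS:
        return "binary"
    if ext in ASCII_EXTENSIONS:
        return "ascii"
    return None  # unknown: decide by file type, or try both modes
```

Returning None instead of guessing keeps the decision with the person doing the transfer whenever the extension alone does not settle the question.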