Spring Cleaning Weblogs

Ronald A. Martinez, Fairbank, Maslin, Maullin, & Associates, Santa Monica, CA

Introduction:

Parsing web logs aids in gathering information about your site’s visitors. From where they came from to which pages they viewed, you can start extracting metrics and learn more about your users’ characteristics to better serve them. In this paper I will briefly discuss some tools that will aid you through this process.

Goal:

1. Use the colon modifier to read in a data file with variable length fields

2. Demonstrate use of the SCAN function in parsing out a variable

First, we would like to gain some insight into the layout of the data file. Depending on your system’s resources and the size of the file, opening it directly can be slow. By using SAS as a borescope you can overcome this obstacle and obtain a rough layout of the data. In the code below we use an INPUT statement with no variables to load each record into the input buffer, and the PUT statement with _infile_ (the input record text buffer) to write that record to the log. Notice that we are using the OPTION statement to limit the number of observations that we read in. Also, remember that OPTION is a global statement, so the setting will remain in effect until you change it.

LIBNAME datafold "H:\WUSS 2008\Access Logs";

FILENAME datafile "H:\WUSS 2008\Access Logs\Copy of Access1.csv";

OPTION obs=30;

DATA _null_;

INFILE datafile DLM="," MISSOVER LRECL=800;

INPUT;

PUT _infile_;

RUN;

/*resetting OBS option*/

OPTION obs=max;
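Because the OPTION statement is global, one possible alternative is to place the limit on the INFILE statement itself with its OBS= option, so that nothing has to be reset afterwards. The sketch below assumes the same DATAFILE fileref defined above and is meant only as an illustration of that variation.

```sas
/* Peek at the first 30 records without touching the global OBS option */
DATA _null_;
   INFILE datafile LRECL=800 OBS=30;  /* OBS= here applies to this INFILE only */
   INPUT;                             /* load the record into the input buffer */
   PUT _infile_;                      /* write the raw record to the log       */
RUN;
```

Later steps then read the full file, since the system OBS option was never changed.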

By viewing some of the data file in the LOG window, it is evident that values of the same variable differ in length from one observation to the next, which would cause incorrect values to be read with fixed-width input. To address this issue and create a usable SAS data set, we will need to use modified list input. In particular we will use the “ : ” (colon) modifier.

The colon modifier will read a value until it encounters:

a) a blank (or other specified delimiter)

b) the defined length of the variable (here, the width of the informat), or

c) the end of the data line.

Taking advantage of this, I have specified lengths that are longer than necessary to ensure that everything is captured. The product is the WEBLOGS SAS data set, with variables that can now be parsed and analyzed for information.


DATA datafold.weblogs;

%let _EFIERR_ = 0; /* set the ERROR detection macro variable */

INFILE datafile DLM= ',' MISSOVER DSD LRECL=30000 FIRSTOBS=2 ;

INPUT

date :$10. /* a plain $ defaults to 8 bytes, which would truncate a 10-character date */

time $

s_sitename :$50.

s_ip :$20.

cs_method :$30.

cs_uri_stem :$50.

cs_uri_query :$50.

s_port :$40.

cs_username :$40.

c_ip :$20.

cs_User_Agent_ :$60.

sc_status :$30.

sc_substatus

sc_win32_status

;

IF _ERROR_ THEN CALL symput('_EFIERR_',1);

/* set ERROR detection macro variable */

RUN;

PROC PRINT DATA=datafold.weblogs;

RUN;
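To see the colon modifier in isolation, a small self-contained sketch can read a few in-stream records. The data lines and the data set name COLON_DEMO are hypothetical; only the variable names are borrowed from the step above.

```sas
/* Hypothetical records; only the variable names come from the paper */
DATA colon_demo;
   INFILE DATALINES DLM=',' DSD MISSOVER;
   INPUT s_sitename  :$50.   /* read up to the next comma, store up to 50 bytes */
         cs_method   :$30.
         cs_uri_stem :$50.;
   DATALINES;
W3SVC1,GET,/index.html
W3SVC1,POST,/members/login/default.aspx
;
RUN;

PROC PRINT DATA=colon_demo;
RUN;
```

Without the colon, an informat such as $50. would be applied as formatted input and read a fixed 50 columns, running across the commas into the neighboring fields.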

A tip for writing out a data set: in the same manner that we use the ":" and an informat to read non-standard data, you can use the ":" and a format in a PUT statement to write out non-standard data.
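As a rough illustration of that tip, the sketch below writes one delimited line to the log using the colon with a format on the PUT statement. The variables PAGE and HITS and their values are hypothetical and do not come from the web log.

```sas
DATA _null_;
   page = "/index.html";   /* hypothetical character value */
   hits = 1234567;         /* hypothetical numeric value   */
   FILE LOG DLM=',';       /* list output separated by commas */
   /* The colon applies the format and then trims the blanks */
   /* that the full format width would otherwise leave.       */
   PUT page : $50. hits : COMMA12.;
RUN;
```

Replacing LOG with a fileref would send the same line to an external file.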

Before demonstrating the use of the SCAN function in parsing out a variable, let us review some information regarding this function.

The SCAN function “separates a character string to return a word based on its position. It defines words by counting delimiters, which are characters that are used as word separators. The name of the function is followed, in parentheses, by the name of the character variable, the number of delimiters to count, and the specified delimiters enclosed in quotation marks.” (SAS Institute Inc. 2004:441)

The general form of the SCAN function is: SCAN(argument,n,delimiters)

By default the SCAN function assigns a length of 200 to each target variable whose length has not already been defined, and the default list of delimiters is:

blank . < ( + | & ! $ * ) ; ^ - / , %
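A minimal sketch of SCAN on a single value may help before applying it to the data set. The path below is a hypothetical URI stem, and FIRST and LAST are illustrative variable names.

```sas
DATA _null_;
   LENGTH first last $20;                  /* avoid the default length of 200 */
   path  = "/members/login/default.aspx";  /* hypothetical URI stem */
   first = SCAN(path, 1, "/");   /* first word from the left: members       */
   last  = SCAN(path, -1, "/");  /* negative n counts from the right:       */
                                 /* default.aspx                            */
   PUT first= last=;
RUN;
```

The leading slash does not produce an empty first word, because SCAN ignores delimiters that come before the first word. Naming "/" explicitly also matters here: with the default delimiter list, the period in default.aspx would split the file name as well.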

To apply this to the WEBLOGS data set, we will parse out the CS_URI_STEM variable. To reduce our use of system resources, we first specify a length so that the new variables do not take the default length of 200, and by using an array and a DO loop we also reduce the amount of code to write.

LENGTH uristem1-uristem6 $10;

ARRAY uristem [6] uristem1-uristem6;

DO j=1 TO 6;

uristem[j]=

SCAN(cs_uri_stem,j,"/");

END;

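By using some default characteristics of the ARRAY statement and the DIM function we can minimize the code even more. When no element names are listed, the ARRAY statement creates URISTEM1 through URISTEM6 itself, and DIM returns the number of elements in the array.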

ARRAY uristem[6] $10;

DO j=1 TO DIM(uristem);

uristem[j]=

SCAN(cs_uri_stem,j,"/");

END;

This produces six new variables (URISTEM1, URISTEM2, …, URISTEM6) created from CS_URI_STEM, each one holding a segment of the path between forward slash delimiters. These variables illustrate the path that a user browsed through to log in and view the website. By applying these and other techniques you can begin to parse other relevant variables and gather information regarding the traffic of your website. Finding out what content is popular and what content is ignored is vital, so that you can begin tailoring your content to satisfy your visitors.
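Put together, the fragment above might sit inside a DATA step like the following sketch, followed by a simple frequency count as one way to see which top-level sections draw the most requests. The data set name WEBLOGS_PARSED and the choice of PROC FREQ are assumptions made for illustration.

```sas
/* WEBLOGS_PARSED is a hypothetical name for the parsed copy */
DATA weblogs_parsed;
   SET datafold.weblogs;
   ARRAY uristem[6] $10;               /* creates uristem1-uristem6 */
   DO j = 1 TO DIM(uristem);
      uristem[j] = SCAN(cs_uri_stem, j, "/");
   END;
   DROP j;                             /* loop counter is not needed in the output */
RUN;

/* Count requests by first path segment, most frequent first */
PROC FREQ DATA=weblogs_parsed ORDER=FREQ;
   TABLES uristem1;
RUN;
```

Note that a length of $10 truncates any path segment longer than ten characters, so a longer length may be needed depending on the site's folder and file names.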

Conclusion:

This brief application of SAS tools is a starting point for reading in and manipulating web logs.

Using these tools to gather information can aid in generating traffic, encouraging visitors to view information about your services, and timing maintenance tasks. [1]

References:

1. SAS Institute Inc. 2004. SAS Certification Prep Guide: Base Programming, Cary, NC: SAS Institute Inc.

2. Cody, Ron 1999. Cody’s Data Cleaning Techniques Using SAS Software, Cary, NC: SAS Institute Inc.

3. Cody, Ron. Having a Ball with String: SAS Character Functions, Piscataway, NJ: Robert Wood Johnson Medical School

4. Cody, Ron 1998. The Input Statement: Where it’s @, Piscataway, NJ: Robert Wood Johnson Medical School

Ronald A. Martinez

Statistical Programmer/Data Analyst

Fairbank, Maslin, Maullin, & Assoc

2425 Colorado Ave.

Santa Monica, CA 90404


[1] Upon further research, I encountered a similar paper written by: Wang, Wei. Parsing Web Logs Using Base SAS, Pittsburgh, PA: Highmark Blue Cross Blue Shield