- Using if-then statements
Grammar:
1. if condition then (single) action;
2. if condition then do;
action;
action;
end;
3. if condition then action;
else if condition then action;
else if condition then action;
else action;
Comparison and logical operators:
=(eq), ^=(~=, ne), >(gt), <(lt), >=(ge), <=(le), &(and), |(or)
data homeimprovements;
/* the & modifier lets Description contain single blanks; two or more blanks end the field */
input Owner $ Description & $30. cost;
if cost = . then costgroup='missing'; /* or equivalently: if missing(cost) then costgroup='missing'; */
else if cost < 2000 then do;
costgroup='low';
cost1=cost-1000;
end;
else if cost < 10000 then do;
costgroup='medium';
cost1=cost-2000;
end;
else do;
costgroup='high';
cost1=cost-5000;
end;
datalines;
Bob Kitchen cabinet face-lift  1253.00
Shirley Bathroom addition  11350.70
Silvia Paint exterior  .
Shhirley Bathroom addition  11350.70
Al backyard gazebo  3098.63
;
proc print data=homeimprovements;
run;
An example of multiple OR conditions and the IN operator:
data homeimprovements_new;
set homeimprovements;
if costgroup="low" or costgroup="high" or costgroup="medium" then print="yes";
/*equivalently use if costgroup in ("low", "high", "medium") then print="yes";*/
else print="no";
proc print; run;
Subsetting if statement;
data traffic;
input type $ name $ 9-38 AMtraffic PMtraffic;
Mtraffic=mean(of AMtraffic PMtraffic);
*if type="freeway" then delete;
*if type="surface" then output;
*if type="surface";
*---+----1----+----2----+----3----+----4;
datalines;
freeway 408                            3684 3459
surface Martin Luther King Jr. Blvd.   1590 1234
surface Broadway                       1259 1290
surface Rodeo Dr.                      1890 2067
freeway 608                            4583 3860
freeway 808                            2386 2518
surface Lake Shore Dr.                 1590 1234
surface Pennsylvania Ave.              1259 1290
;
proc print; run;
data traffic_surface;
set traffic;
where type="surface";
proc print data=traffic;
where type="surface";
run;
WHERE statements:
Similar to the subsetting IF, but with these differences:
1. WHERE can also be used in procedures.
2. WHERE works with existing SAS data sets only; the subsetting IF can also be used when reading raw data with an INPUT statement.
3. WHERE supports more operators, for example,
IS MISSING: where gender is missing
IS NULL: where gender is null
BETWEEN AND: where age between 20 and 40
CONTAINS: where name contains 'Mac'
LIKE: where name like 'R_n%' (_ matches exactly one character, % matches any number of characters)
data home_new;
set homeimprovements;
where cost between 2000 and 4000;
*where owner like 'S_i%' or owner like 'Si%';
proc print; run;
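The remaining WHERE operators can be tried the same way. The following is a minimal sketch, assuming the homeimprovements data set created earlier still exists in the WORK library; it also shows WHERE used directly inside a procedure, which the subsetting IF cannot do.

```sas
/* WHERE in a procedure: rows with a missing cost, or a description mentioning a bathroom */
proc print data=homeimprovements;
where cost is missing or Description contains 'Bathroom';
run;

/* LIKE with %: owners whose name starts with S */
proc print data=homeimprovements;
where Owner like 'S%';
run;
```

No new data set is created in either step; the WHERE statement only restricts which observations the procedure reads.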
- The output statement
/*SAS secret: there is a hidden output statement at the end of each data step*/
data tmp;
input x;
output;
datalines;
11
2
3
10
;
proc print data=tmp;
run;
/*SAS secret: the hidden output statement is suppressed when there are explicit output statements*/
data generate;
do x=1 to 6;
y=x ** 2;
output;
end;
proc print;
run;
data tmp;
input x;
y=x ** 2;
if x > 5;
z = x+1;
datalines;
1
2
3
10
;
proc print data=tmp;
run;
- Writing multiple data sets using the output statement
1) output data-set-name
data freeway surface;
set traffic;
if type='freeway' then output freeway;
if type='surface' then output surface;
run;
proc print data=freeway; run;
proc print data=surface; run;
2) Making multiple observations from one with the output statement
data a;
infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597\speed.dat';
input id sex vo35 vo4 vo45 vo5 vo55 vo6;
speed=3.5;vo=vo35;output;
speed=4.0;vo=vo4;output;
speed=4.5;vo=vo45;output;
speed=5;vo=vo5;output;
speed=5.5;vo=vo55;output;
speed=6.0;vo=vo6;output;
keep id sex speed vo; /* without this every record corresponding to a subject would continue to also have vo35 to vo6 */
run;
title 'first method';
proc print noobs;
run;
- Do loops syntax
Syntax:
1) Do var = varlist;
End;
2) Do var = start TO stop By increment;
End;
data b;
infile 'C:\Documents and Settings\anna\Desktop\597\speed.dat';
input id sex @; /* the trailing @ holds the current line for the next INPUT statement */
do speed = 3.5 to 6 by .5;
input vo @;
output;
end;
title 'second method';
proc print;
run;
/*Example:
Normal Placebo group: 50 45 55 52
Normal Treatment group: 76 60 58 65
Hyperactive Placebo group: 70 72 68 75
Hyperactive Treatment group 51 57 48 55*/
data trtment;
length group drug $ 20;
Do group = "Normal", "Hyperactive";
Do drug = "placebo", "Treatment";
input activity1-activity4 @;
output;
/*Do subject = 1 TO 4 by 1;
input activity@;
output;
End;
*/
End;
End;
Datalines;
50 45 55 52 76 60 58 65 70 72 68 75 51 57 48 55
;
proc print;
run;
- Mathematical function
log, log10, sin, cos, tan, arsin, arcos, artan, int, sqrt, round, mean, sum, max, n, nmiss
data example;
input x1-x10;
if N(OF x1-x10) > 7 then ave=mean(OF x1-x10);
*numNonMissing=N(OF x1-x3, OF x6-x10);
datalines;
12 14 18 10 9 1 . 5 3 19
. 4 . . . 8 . 9 10 13
;
proc print; run;
Random number generators: ranuni(seed), rannor(seed), rantbl(seed, p1, p2, ..., pk), rand('dist', parm1, ..., parmk). A positive seed produces a reproducible sequence.
With a nonpositive seed, the sequence is initialized from the system clock and is therefore not reproducible.
data example;
*seed=123;
do i=1 to 10;
x=rannor(123); *or equivalently;
*call rannor(seed, x);
output;
end;
proc print; run;
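The rand('dist', ...) form listed above does not take a seed argument; its stream is seeded with CALL STREAMINIT instead. A minimal sketch, where the data set name example_rand and the seed 123 are arbitrary choices made for illustration:

```sas
data example_rand;
call streaminit(123);     /* seeds the RAND function so the run is reproducible */
do i=1 to 10;
u=rand('uniform');        /* uniform on (0,1) */
z=rand('normal', 0, 1);   /* normal with mean 0 and standard deviation 1 */
output;
end;
run;
proc print data=example_rand; run;
```

Rerunning the step with the same seed should give the same ten rows; changing or removing the CALL STREAMINIT line changes the stream.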
Probability distribution functions: CDF, PDF, QUANTILE, PROBNORM, PROBT
data example;
do x=-3 to 3 by 0.1;
y1=PDF('normal',x);
y2=CDF('normal',x);
xx=QUANTILE('normal', y2);
output;
end;
proc print; run;
proc gplot data=example;
plot y1*x y2*x;
run;
The lag and dif functions: The LAGn function simply looks back in the file n records, allowing you to obtain a previous value of a variable and store it in the current observation.
options ps=60 ls=80 nodate;
data mouse;
infile 'C:\Documents and Settings\anna\Desktop\597\newmice.dat';
input year stand logmice mice;
/* mice is the mouse density, logmice is log10(mice +1) */
proc print; run;
data b1;
set mouse;
lagmice=lag(mice); /*lag1(mice) would give the same answer*/
proc print;
run;
The following uses lag8 to calculate the yearly difference within a stand. Using lag8 is correct only when consecutive observations of the same stand are exactly 8 records apart in the file.
Many times the only thing you want to do with a previous value of a variable is to compare it with the current value to compute the difference. The DIFn function works the same way as LAGn, but rather than simply assigning a value, it assigns the difference between the current value and a previous value of a variable. The statement a=difn(x) tells SAS that 'a should equal the current value of x minus the value x had n records back in the file'.
data b8;
set mouse;
lagmice=lag8(mice);
change = mice - lagmice;
*change=dif8(mice);
proc print;
run;
Calculate the difference between the last and the first observation within a stand and create a dataset with the last-first differences only
data b72;
set mouse;
lagmice=lag72(mice);
change = mice - lagmice;
if lagmice ^= .;
drop mice logmice;
proc print;
run;
Cautions about the LAG function:
LAG returns the value its argument had the last time LAG was executed, which is not necessarily the value from the previous observation
This means:
1. LAG returns the previous observation's value only when it is executed on every iteration of the data step
2. (Almost) never call the LAG function conditionally
data lagged;
input x;
if x > 5 then lag_x=lag(x);
datalines;
7
9
1
8
;
proc print;
run;
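In the conditional version above, LAG is skipped for x=1, so for the last record (x=8) it returns 9, the value from the last time it was executed, rather than 1. One common way around this is to call LAG unconditionally and apply the condition afterwards. A minimal sketch; the names lagged_ok and prev_x are made up for this illustration:

```sas
data lagged_ok;
input x;
prev_x=lag(x);                  /* executed on every iteration */
if x > 5 then lag_x=prev_x;     /* the condition is applied after the LAG call */
datalines;
7
9
1
8
;
proc print data=lagged_ok;
run;
```

Here the last record gets lag_x=1, the true previous value of x.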
- Retain statement:
The RETAIN statement preserves a variable's value from the previous iteration of the data step instead of resetting it to missing
The RETAIN statement can appear anywhere in the data step
General grammar for the retain statement:
retain variable-list;
retain variable-list initial-value;
data b1;
retain mice;
lagmice=mice;
set mouse;
proc print;
run;
Will the following work?
data b1;
retain mice;
set mouse;
lagmice=mice;
proc print;
run;
The following generated the same b8 data as before
proc sort data=mouse;
by stand year;
data b8;
retain mice;
lagmice=mice;
set mouse;
by stand;
if first.stand then lagmice=.;
change=mice-lagmice;
proc print;
run;
The following generated the same last-first difference data as before when we used lag72
data diff;
retain lagmice;
set mouse;
by stand;
if first.stand then lagmice=mice;
if last.stand;
change=mice-lagmice;
proc print;
run;
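The second form of the grammar, retain variable-list initial-value, is not used in the examples above. A minimal sketch of it is a running total; the data set name running and the variable total are invented for this illustration:

```sas
data running;
retain total 0;       /* total starts at 0 and keeps its value across iterations */
input x;
total=total+x;
datalines;
11
2
3
10
;
proc print data=running;
run;
```

The printed totals are 11, 13, 16 and 26. Without the RETAIN statement, total would be reset to missing at the start of each iteration and every sum would be missing.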
- Procedure transpose: very much like matrix transpose -- turning observations into variables or vice versa
Grammar:
proc transpose data=old-data-set out=new-data-set;
by variable-list; /* groups within which the transposition is applied */
id variable; /* its values are used to create the new variable names */
var variable-list; /* the variables that are actually transposed */
proc sort data=mouse;
by stand;
run;
proc transpose data=mouse out=long_mouse prefix=year;
by stand;
ID year;
var logmice mice;
run;
data long_mouse;
set long_mouse;
drop _NAME_;
run;
proc transpose data=long_mouse out=mouse2 prefix=mice;
by stand;
var year86 year87;
run;
proc print data=mouse2; run;
- Time and Date functions
Today, MDY, YRDIF, DAY, MONTH, YEAR, WEEKDAY, HOUR, MINUTE, SECOND
data example;
TodayDate=Today();
DOB="15May2005"D;
time1='10:20't;
hour1=hour(time1);
Day1=DAY(DOB);
Month1=Month(DOB);
Age=YRDIF(DOB, TodayDate, 'ACTUAL');
proc print;
format TodayDate DOB mmddyy9.;
format time1 hhmm6.;
run;
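Several of the listed functions (MDY, YEAR, WEEKDAY, MINUTE) are not used in the example above. A small sketch that writes their results to the log; the variable names are arbitrary and chosen only for this illustration:

```sas
data _null_;
d=mdy(5, 15, 2005);     /* builds a SAS date from month, day, year */
y=year(d);
wd=weekday(d);          /* 1 = Sunday, ..., 7 = Saturday */
t='10:20't;
m=minute(t);
put d= date9. y= wd= m=;
run;
```

DATA _NULL_ runs the step without creating a data set, which is convenient for quick checks like this one.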
- Character functions: length, compress, substr, input, put, translate, trim, upcase
The following example uses the substr function to extract and change part of the value of a character variable. Usage: substr(char_var, start_position, length).
The input function converts a character value to a numeric value. Usage: input(char_var, informat). The put function converts a numeric value to a character value. Usage: put(numeric_var, format).
data example;
input ID $10. ;
state=substr(ID, 1, 2);
numchar=substr(ID, 7,3);
num=input(numchar, 3.);
substr(ID, 3,4)=' ';
datalines;
NYAAAA123
NJ1234567
;
proc print; run;
data example;
input sbp dbp @@;
length sbp_chk $ 4 dbp_chk $ 4;
sbp_chk=put(sbp, 3.);
dbp_chk=put(dbp, 3.);
if sbp > 160 then substr(sbp_chk, 4, 1)='*';
if dbp > 90 then substr(dbp_chk, 4, 1)='*';
datalines;
120 80 180 92 200 110
;
proc print; run;
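The other character functions in the list (upcase, compress, translate, length) can be explored in the same way. A minimal sketch with a made-up value; the variable names are chosen only for this illustration:

```sas
data _null_;
length raw $ 12;
raw=' ny-123 ';
up=upcase(raw);                   /* lowercase letters become uppercase */
nodash=compress(raw, '-');        /* removes every '-' from the value */
swapped=translate(raw, '_', '-'); /* replaces each '-' with '_' (arguments: source, to, from) */
len=length(raw);                  /* position of the last nonblank character */
put up= nodash= swapped= len=;
run;
```

Note the argument order of translate: the replacement characters come before the characters being replaced.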
- Array: an array is a facility that can reduce the amount of coding in a SAS DATA step.
An array temporarily groups variables, making loop processing convenient. An array exists only for the duration of the current data step. It is NOT a variable.
Syntax for ARRAY definition:
ARRAY array-name[subscript] varlist (val1, val2, ..., valn); This is for numeric arrays
ARRAY array-name[subscript] $ varlist (val1, val2, ..., valn); This is for character arrays
Note:
- array-name identifies the array. It follows the naming convention for SAS variable names;
- subscript can be
1) An integer: specifying the length of the array
2) *: SAS determines the length from the varlist
3) lower:upper: the lower and upper bounds of the subscript.
- All variables in the varlist must be of the same type (numeric or character).
- (val1, val2, ..., valn) are optional initial values for the array elements.
- Brackets can be replaced by braces {} or parentheses ()
- Without the varlist, SAS uses array-name1, array-name2, ..., array-namen as the varlist
Syntax for referencing an array: array-name[subscript]
The subscript can be
- A variable or expression that evaluates to a valid subscript value for the array
- *: array-name[*] can be used in input and put statements and with some SAS functions, for example, input array-name[*], mean(of array-name[*]).
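The definition options above (initial values, lower:upper bounds, character arrays, and the [*] reference) can be sketched in one small step. All array and variable names here are invented for illustration:

```sas
data _null_;
array w[3] w1-w3 (0.2 0.3 0.5);              /* numeric array with initial values */
array yr[2001:2003] y2001-y2003 (10 20 30);  /* subscripts run from 2001 to 2003 */
array grade[*] $ 1 g1-g3 ('A' 'B' 'C');      /* character array, length taken from the varlist */
total=0;
do i=lbound(yr) to hbound(yr);               /* loop over the declared bounds */
total=total+yr[i];
end;
avg_w=mean(of w[*]);                         /* [*] passes every element to the function */
second_grade=grade[2];
put total= avg_w= second_grade=;
run;
```

Here total is 60 and second_grade is B. The lbound and hbound functions are used instead of hard-coded limits so that the loop still works if the bounds in the ARRAY statement change.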
data a;
infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597\597C\speed.dat';
input id sex vo35 vo4 vo45 vo5 vo55 vo6;
if vo35 = . then newvo1=.;
else if vo35 > 25 then newvo1=1;
else newvo1=0;
if vo4 = . then newvo2=.;
else if vo4 > 25 then newvo2=1;
else newvo2=0;
if vo45 = . then newvo3=.;
else if vo45 > 25 then newvo3=1;
else newvo3=0;
if vo5 = . then newvo4=.;
else if vo5 > 25 then newvo4=1;
else newvo4=0;
if vo55 = . then newvo5=.;
else if vo55 > 25 then newvo5=1;
else newvo5=0;
if vo6 = . then newvo6=.;
else if vo6 > 25 then newvo6=1;
else newvo6=0;
data a;
infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597\597C\speed.dat';
input id sex vo35 vo4 vo45 vo5 vo55 vo6;
array vo[6] vo35 vo4 vo45 vo5 vo55 vo6;
array newvo[6] newvo1-newvo6;
do i=1 to 6;
if vo[i] = . then newvo[i]=.;
else if vo[i] > 25 then newvo[i]=1;
else newvo[i]=0;
end;
drop i;
proc print data=a;
run;
data a;
infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597\597C\speed.dat';
input id sex vo35 vo4 vo45 vo5 vo55 vo6;
array vo[6] vo35 vo4 vo45 vo5 vo55 vo6;
/*This varlist has to be present. It can be written as vo35--vo6*/
array newvo[6] newvo1-newvo6;
/*This varlist can be omitted*/
do i=1 to 6;
if vo[i] = . then newvo[i]=.;
else if vo[i] > 25 then newvo[i]=1;
else newvo[i]=0;
end;
drop i;
proc print data=a;
run;
vo35--vo6: the double hyphens specify all the variables between vo35 and vo6. The order is determined by the order of appearance of the variables in the DATA step
data a;
infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597\597C\speed.dat';
input id sex vo35 vo4 vo45 vo5 vo55 vo6;
array vo[*] vo35--vo6;
array newvo[*] newvo1-newvo6;
do i=1 to dim(vo);
if vo[i] = . then newvo[i]=.;
else if vo[i] > 25 then newvo[i]=1;
else newvo[i]=0;
end;
newvo[dim(newvo)]=1000;
drop i;
* drop vo; /* This does not work as vo is not a variable, it cannot be used in drop/keep/rename
statements. use drop vo35--vo6 instead*/
proc print data=a;
run;
Special SAS name lists: _ALL_, _CHARACTER_, _NUMERIC_
data a;
infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597\597C\speed.dat';
input id $ sex $ vo35 vo4 vo45 vo5 vo55 vo6;
array vo[*] _NUMERIC_;
array char[*] _CHARACTER_;
array newvo[*] newvo1-newvo6;
if char[2]='2';
do i=1 to dim(vo);
if vo[i] = . then newvo[i]=.;
else if vo[i] > 25 then newvo[i]=1;
else newvo[i]=0;
end;
drop i;
proc print data=a;
run;
In the following example, _N_ is a SAS automatic variable that counts how many times the data step has iterated (including the current iteration). Values of a temporary array are automatically retained. Note that in the example below, the temporary array called "key" is assigned values only when _N_=1, but it is used for scoring every person.
data score;
array key[10] $ 1 _temporary_;
array ans[10] $;
if _N_=1 then
do i=1 to 10;
input key[i] @;
end;
input id $ @5 (ans[*]) ($1.);
score = 0;
do i=1 to 10;
scorei = (ans[i]= key[i]);
score=score+scorei;
end;
drop i scorei;
datalines;
A B C D E E D C B A
001 ABCDEABCDE
002 AAAAABBBBB
;
proc print data=score; run;