- The print procedure
PROC PRINT data=data-set NOOBS LABEL;
BY variable-list;         Group the output by the BY variables. The data must be presorted.
ID variable-list;         Replace the observation numbers with the ID variables.
SUM variable-list;        Print sums for the variables in the list.
VAR variable-list;        Specify which variables to print and in what order.
LABEL variable='label';   Use the label for the specified variable.
The NOOBS option suppresses the observation numbers. The LABEL option prints variable labels instead of variable names, if labels have been defined with a LABEL statement. Note that when a LABEL statement is used in a DATA step, the labels become part of the data set; when it is used in a PROC step, the labels stay in effect only for the duration of that step.
data shoes;
/* style is read from columns 1-15; the :$10. modifier reads ExerciseType as list input */
input style $ 1-15 ExerciseType :$10. Sales price;
datalines;
Max Flight      running 1930 142.99
Zip fit leather walking 2250 83.99
zoom airborne   running 4150 112.99
Light step      walking 1130 73.99
Max step woven  walking 2230 75.99
zip sneak       c-train 1190 92.99
Air             basketball 1000 150
;
run;
proc sort data=shoes;
by ExerciseType;
run;
proc print data=shoes label;
by ExerciseType;
sum sales;
var style sales price;
label Sales="sales in 2009";
run;
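The difference between a label stored in a DATA step and a label set inside a PROC step can be seen in a minimal sketch like the one below. The data set shoes_lbl and its two rows are made up for illustration only.

```sas
/* Illustrative data; the data set name and values are assumptions */
data shoes_lbl;
  input style $ sales;
  label sales = "Sales in 2009";   /* stored with the data set */
  datalines;
Max 1930
Zip 2250
;
run;

/* The stored label is used because LABEL is specified */
proc print data=shoes_lbl label noobs;
run;

/* A LABEL statement in a PROC overrides it for this step only */
proc print data=shoes_lbl label noobs;
  label sales = "Units sold";
run;

/* PROC CONTENTS still reports the stored label */
proc contents data=shoes_lbl;
run;
```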
- The sort procedure
PROC SORT DATA=messy OUT=neat NODUPKEY;
BY state DESCENDING city;
The NODUPKEY option tells SAS to eliminate any duplicate observations that have the same values for the BY variables. The DESCENDING keyword before the city variable requests that SAS sort in descending order of city. By default, SAS sorts in ascending order.
proc sort data=shoes out=shoes_sorted NODUPKEY;
by ExerciseType;
run;
proc print; run;
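A small sketch of NODUPKEY and DESCENDING together, using the messy/neat names from the syntax line above. The variable pop, the data values, and the DUPOUT= data set name dups are assumptions added for illustration.

```sas
/* Made-up data: two rows share the same state and city */
data messy;
  input state $ city $ pop;
  datalines;
PA Erie 95
PA Erie 97
PA Altoona 43
NY Albany 99
;
run;

/* Keep one row per state/city; rows that are dropped go to DUPS */
proc sort data=messy out=neat nodupkey dupout=dups;
  by state descending city;
run;

proc print data=neat; run;
proc print data=dups; run;
```

Within each state, cities appear in reverse alphabetical order in neat, and the DUPOUT= data set makes it easy to check which observations NODUPKEY removed.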
- The format procedure: mainly used to recode variable values through user-defined formats.
PROC FORMAT LIBRARY=libref.catalogname;
VALUE numfmt value1='formatted-value-1' value2='formatted-value-2'
...... valuen='formatted-value-n' ;
VALUE $charfmt 'value1'='formatted-value-1' 'value2'='formatted-value-2'
...... 'valuen'='formatted-value-n' ;
RUN;
PROC FORMAT statement: Without the LIBRARY= option, formats are stored in a catalog called FORMATS in the temporary WORK library and exist only for the duration of the SAS session. If the LIBRARY= option specifies only a libref, formats are permanently stored in that library in a catalog called FORMATS.
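Storing a format permanently might look like the sketch below. The libref mylib, the folder path, and the format name yesno are hypothetical; the FMTSEARCH= system option is what tells SAS to look in that library for formats in later sessions.

```sas
/* 'C:\myformats' is a hypothetical folder; use one that exists */
libname mylib 'C:\myformats';

/* Only a libref is given, so the catalog is mylib.formats */
proc format library=mylib;
  value yesno 1='Yes'
              0='No';
run;

/* Search mylib first, then WORK, when resolving format names */
options fmtsearch=(mylib work);
```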
data temp;
infile cards dlm=',';
input id $ sex $ emp_stat yr_edu jobcat $ ;
cards;
A, m, 2, 18, 42
B, F, 0, 16, 00
C, f, 2, 16, 32
D, M, 1, 12, 52
E, f, 1, 18, 01
;
run;
proc format;
value $job '01'='Teacher'
'31'-'33'='Computing Consultant'
'41'-'49','51'-'59'='Medical Professional'
other='N/A' ;
value empst 0='Not Employed'
1='Part-time Employed'
2='Full-time Employed' ;
value edu_cat 1-11='Less than High School'
12='High School'
12<-high='More than High School' ;
value $gen 'M','m'='Males'
'F','f'='Females' ;
run;
data rec;
set temp;
FORMAT emp_stat empst. jobcat $job.;
run;
proc print data=rec;
run;
/* Using formats temporarily in a PROC step */
proc freq data=rec;
tables sex yr_edu ;
format sex $gen. yr_edu edu_cat. ;
run;
data new;
set rec;
/* Detaching formats from these variables */
FORMAT emp_stat jobcat ;
run;
proc print data=new;
title "Data with NO user-written formats" ;
run;
Specifying ranges of values:
- Ranges can be constant values or values separated by commas:
- 'a', 'b', 'c'
- 1, 22, 43
- Ranges can include intervals such as:
<lower> - <higher> means that the interval includes both endpoints.
<lower> <- <higher> means that the interval includes the higher endpoint, but not the lower.
<lower> -< <higher> means that the interval includes the lower endpoint, but not the higher.
<lower> <-< <higher> means that the interval does not include either endpoint.
- The numeric (.) and character (' ') missing values can be individually assigned values.
- Ranges can be specified with special keywords:
- LOW: From the least (most negative) possible number.
- HIGH: To the largest (positive) possible number.
- OTHER: All other values not otherwise specified.
- The LOW keyword does not format missing values.
- The OTHER keyword does include missing values.
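The range rules above can be tried out with a short sketch. The format name agegrp, its labels, and the test values are made up for illustration; the results are written to the SAS log.

```sas
proc format;
  value agegrp
    .          = 'Missing'     /* numeric missing gets its own label */
    low -< 18  = 'Under 18'    /* 18 itself is excluded */
    18 - 64    = '18 to 64'    /* both endpoints included */
    64 <- high = 'Over 64';    /* 64 itself is excluded */
run;

data _null_;
  do age = ., 5, 18, 64, 64.5, 90;
    grp = put(age, agegrp.);   /* apply the format to get the label */
    put age= grp=;
  end;
run;
```

The boundary values 18 and 64 are included in the test list on purpose, since they show which interval owns each endpoint.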
- The means procedure
proc means options;
statements;
Commonly used options:
n, nmiss, mean, median, std, stderr, clm, lclm, uclm, min, max, sum, var, q1, q3,
qrange, cv, skewness, kurtosis, t, prt (p-value for the t-test), maxdec.
Commonly used statements:
class variable-list;   Request a summary for each group. The data need not be sorted first.
by variable-list;      Request a separate analysis for each group. The data must be sorted by these variables first.
var variable-list;     Specify the analysis variables.
output out=data-set-name statistic-keyword=names;   Save the requested statistics in a data set.
data htwt;
input subject $ gender $ height weight score $;
datalines;
1 M 68.5 155 L
2 F 61.2 99 H
3 F 63.0 115 M
4 M 70.0 205 .
5 M 68.6 170 M
6 F 65.1 125 H
7 M 72.4 220 L
8 M . 188 H
;
proc means data=htwt maxdec=2 n mean std stderr clm;
run;
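The CLASS and OUTPUT statements listed earlier can be sketched as follows. The data set scores, its values, and the output variable names (mean_ht, mean_wt, sd_ht, sd_wt) are assumptions chosen for illustration.

```sas
/* Made-up data for illustration */
data scores;
  input gender $ height weight;
  datalines;
M 68.5 155
F 61.2 99
F 63.0 115
M 70.0 205
;
run;

proc means data=scores n mean std maxdec=1;
  class gender;                 /* no presorting needed with CLASS */
  var height weight;
  /* names after each keyword are matched to the VAR list in order */
  output out=stats mean=mean_ht mean_wt std=sd_ht sd_wt;
run;

proc print data=stats; run;
```

The output data set contains an overall row as well as one row per gender, distinguished by the automatic _TYPE_ variable.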
- The univariate procedure
proc univariate options;
statements / statement-options;
Some useful options:
normal: test for normality
plot: produce three text plots: a stem-and-leaf plot, a box plot, and a normal probability plot (QQ plot)
Statements:
var variable-list;
by variable-list;
histogram variable-list / normal;   Generate a histogram with a normal density curve superimposed.
qqplot: quantile-quantile plot
probplot: quantile-probability plot
inset: add a box that displays selected statistics
proc univariate data=htwt plot normal;
var weight;
histogram weight / normal;
inset mean='Mean' (5.2)
std='Standard deviation' (6.3) / font='Arial' pos=nw height=3;
qqplot weight / normal(mu=160 sigma=44 color=red);
probplot weight / normal(mu=160 sigma=44 color=red);
run;
See for a detailed annotation on the output
- Two sample comparisons
T-test: testing the differences between two independent group means
Assumptions:
1. Two groups are independent and samples within each group are independent
2. The means of the two groups are normally distributed
3. The variances of the two groups are approximately equal
data grouptime;
do group="C", "T";
do sub=1 to 5;
input time @;
output;
end;
end;
drop sub;
datalines;
80 93 83 89 98 100 103 104 99 102
;
proc ttest data=grouptime;
class group;
var time;
run;
- Wilcoxon rank-sum test
Appropriate for non-normal distributions, small sample sizes, and ordinal data. The null hypothesis is that the distributions of X in both groups are the same.
Group A: 3.1 2.2 1.7 2.7 2.5
Group B: 0.0 0.0 1.0 2.3
Ordered data: 0.0  0.0  1.0  1.7  2.2  2.3  2.5  2.7  3.1
Group:        B    B    B    A    A    B    A    A    A
Rank:         1.5  1.5  3    4    5    6    7    8    9
The two tied values of 0.0 share the average of ranks 1 and 2.
Rank-sum A: 4+5+7+8+9 = 33
Rank-sum B: 1.5+1.5+3+6 = 12
Test statistic: min(Rank-sum A, Rank-sum B)
data tumor;
infile datalines missover;
input group $ mass1-mass5;
datalines;
A 3.1 2.2 1.7 2.7 2.5
B 0.0 0.0 1.0 2.3
;
proc transpose data=tumor out=tumor1 prefix=mass;
by group;
var mass1-mass5;
run;
proc npar1way data=tumor1 wilcoxon;
class group;
var mass1;
exact wilcoxon;
run;
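The hand calculation of the rank sums can be checked with PROC RANK, as in the sketch below. The data set names tumor_long and ranked and the variable mass_rank are assumptions; the values are the same nine measurements, entered in long form.

```sas
data tumor_long;
  input group $ mass @@;
  datalines;
A 3.1 A 2.2 A 1.7 A 2.7 A 2.5
B 0.0 B 0.0 B 1.0 B 2.3
;
run;

/* TIES=MEAN assigns tied values the average of their ranks */
proc rank data=tumor_long out=ranked ties=mean;
  var mass;
  ranks mass_rank;
run;

/* Sum of ranks within each group */
proc means data=ranked sum n;
  class group;
  var mass_rank;
run;
```

The sums reported for groups A and B should agree with the 33 and 12 worked out by hand.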
- Paired t-test
The same subject is measured under the two different treatment conditions
Assumptions: The mean of the within-pair differences is normally distributed
data grouptime1;
set grouptime;
ctime=lag5(time);
if ctime ^=.;
rename time=ttime;
drop group;
run;
proc ttest data=grouptime1;
paired ctime*ttime;
run;
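PROC UNIVARIATE can also be used for the paired comparison by analyzing the within-pair differences. It has an advantage over PROC TTEST because it also performs nonparametric tests (the sign test and the signed-rank test).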
data grouptime2;
set grouptime1;
change=ctime-ttime;
keep change;
run;
proc univariate data=grouptime2;
run;
- One way analysis of variance (one way ANOVA): Comparison of one continuous variable among multiple groups
Assumptions:
1. Groups are independent and samples within each group are independent
2. The data are normally distributed
3. The variances of the groups are approximately equal
F-test:
- Total Sum of Squares: TSS
- Sum of Squares due to treatment: SST
- Sum of Squares due to error: SSE
$$\mathrm{TSS} = \mathrm{SST} + \mathrm{SSE}$$
$$F = \frac{\mathrm{SST}/(k-1)}{\mathrm{SSE}/(N-k)}$$
where N = total sample size and k = number of groups.
data reading;
input group $ words @@;
datalines;
X 700 X 850 X 820 X 640 X 920
Y 480 Y 460 Y 500 Y 570 Y 580
Z 500 Z 550 Z 480 Z 600 Z 610
;
proc anova data=reading;
class group;
model words=group;
means group;
run;
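As a check on the F formula, the sums of squares can be computed step by step with PROC MEANS, using the identity TSS = SST + SSE. This is only a sketch: the intermediate data set names (total, within, err, ftest) are assumptions, and the data are the reading data from the example above, repeated so the block runs on its own.

```sas
data reading;
  input group $ words @@;
  datalines;
X 700 X 850 X 820 X 640 X 920
Y 480 Y 460 Y 500 Y 570 Y 580
Z 500 Z 550 Z 480 Z 600 Z 610
;
run;

/* TSS: corrected sum of squares over all observations, and N */
proc means data=reading noprint;
  var words;
  output out=total css=tss n=n;
run;

/* Corrected sum of squares within each group */
proc means data=reading noprint nway;
  class group;
  var words;
  output out=within css=ss;
run;

/* SSE: sum of the within-group sums of squares; k: number of groups */
proc means data=within noprint;
  var ss;
  output out=err sum=sse n=k;
run;

data ftest;
  merge total(keep=tss n) err(keep=sse k);   /* one row each */
  sst = tss - sse;
  f = (sst/(k-1)) / (sse/(n-k));
  p = 1 - probf(f, k-1, n-k);                /* upper-tail p-value */
run;

proc print data=ftest; run;
```

The value of f should match the F statistic that PROC ANOVA reports for the group effect.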
- Categorical data
One binomial or multinomial variable
proc freq data=data;
tables variable-list / statement-options;
Some TABLES statement options that control the display:
missing: includes missing values in frequency statistics
nocum: suppresses cumulative frequencies
nopercent: suppresses percentages
list: prints cross-tabulations in list format rather than as a grid
nocol: suppresses column percentages in cross-tabulations
norow: suppresses row percentages in cross-tabulations
TABLES statement options that request statistics:
AGREE: requests tests and measures of classification agreement, including McNemar's test, kappa statistics, etc.
BIN: requests the binomial proportion, confidence limits, and test for one-way tables
CHISQ: requests chi-square tests of homogeneity and measures of association
CL: requests confidence limits for measures of association
EXACT: requests Fisher's exact test
MEASURES: requests measures of association, including Pearson and Spearman correlation coefficients, etc.
RELRISK: requests relative risk measures for 2x2 tables
RISKDIFF: requests risk differences and confidence limits for 2x2 tables
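Several of these options can be combined on one TABLES statement, as in the sketch below. The data set survey and its cell counts are made up for illustration.

```sas
/* Made-up 2x2 table entered as cell counts */
data survey;
  input gender $ answer $ count;
  datalines;
M Y 20
M N 10
F Y 12
F N 18
;
run;

proc freq data=survey;
  weight count;                 /* each row represents COUNT subjects */
  tables gender*answer / nocol nopercent chisq relrisk;
  tables answer / nocum;        /* one-way table without cumulative columns */
run;
```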
data htwt;
input subject gender $ height weight score $;
datalines;
1 M 68.5 155 L
2 F 61.2 99 H
3 F 63.0 115 M
4 M 70.0 205 .
5 M 68.6 170 M
6 F 65.1 125 H
7 M 72.4 220 L
8 M . 188 H
;
proc freq data=htwt;
tables score/bin(p0=.6 level="L");
*tables score;
*exact bin;
run;
data htwt1;
set htwt;
score1=(score="H");
run;
proc print; run;
proc freq data=htwt1;
tables gender*score1/riskdiff;
run;
2-way contingency table (cross tabulations)
Useful for
1. Comparing two proportions with independent samples
2. Testing independence between two categorical variables for one sample
Commonly used tests:
1. Chi-square test (chisq): requires the expected count in each cell to be greater than 5
2. Fisher's exact test (fisher): for small sample sizes
data fisher;
input gender $ vote $ count;
datalines;
M Y 5
M N 0
F Y 1
F N 4
;
proc freq data=fisher;
tables gender*vote/fisher;
weight count;
run;
proc freq data=htwt;
tables gender*score / chisq fisher;
exact chisq;
run;
- proc gplot
proc gplot data=data_name;
plot y1*x=symbol y2*x=symbol/overlay haxis vaxis;
run;
The example below is from data collected on a series of plots in Maryland to examine the relationship between gypsy moth egg mass densities and subsequent defoliation. The plots are 60 ha in size, and a random sample of 0.01 ha subplots was obtained in each plot. Egg masses were counted and defoliation (as a percent) was measured.
options ls=70;
title 'create defol means';
data dd;
infile 'C:\Documents and Settings\anna\Desktop\597\87md.dat';
input plot $ subplot egg def;
run;
proc print data=dd; run;
proc means data=dd nway;
class plot;
var egg def;
output out=result2 mean=meanegg meandef stderr=seegg sedef;
run;
proc print data=result2; run;
The NWAY option in the PROC MEANS statement limits the output statistics to the observations with the highest _TYPE_ value, that is, those defined by all of the CLASS variables together.
data c;
set result2;
up=meanegg+(1.96*seegg);
low = meanegg-(1.96*seegg);
run;
proc print data=c; run;
title1 'Mean egg mass and two standard errors';
title2 ' Maryland 1987';
title2;
axis2 label=("Egg mass");
symbol1 value=u color=red;
symbol2 value=l color=red;
symbol3 value=m color=black;
proc gplot data=c;
plot up*plot=1 low*plot=2 meanegg*plot=3 / overlay vaxis=axis2;
run;
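The effect of the NWAY option on the OUTPUT data set is easier to see with two CLASS variables. In the sketch below, the data set plots and its variables site, year, and count are made up for illustration.

```sas
data plots;
  input site $ year count;
  datalines;
A 1987 10
A 1988 14
B 1987 7
B 1988 9
;
run;

/* Without NWAY: overall, per year, per site, and per site*year rows */
proc means data=plots noprint;
  class site year;
  var count;
  output out=all_types mean=mean_count;
run;

/* With NWAY: only the site*year rows (highest _TYPE_) are kept */
proc means data=plots noprint nway;
  class site year;
  var count;
  output out=nway_only mean=mean_count;
run;

proc print data=all_types; run;   /* _TYPE_ = 0, 1, 2, 3 */
proc print data=nway_only; run;   /* _TYPE_ = 3 only */
```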
Title statement:
1. Global statement
2. TITLE1 is twice the height of all other titles and uses the SWISS font.
3. All other TITLE statements are one unit high and use the default hardware font.
4. The following quoted passage is from the SAS online documentation:
“Using TITLE and FOOTNOTE Statements
You can define TITLE and FOOTNOTE statements anywhere in your SAS program. They are global and remain in effect until you cancel them or until you end your SAS session. All currently defined FOOTNOTE and TITLE statements are automatically displayed.
You can define up to ten TITLE statements and ten FOOTNOTE statements in your SAS session. A TITLE or FOOTNOTE statement without a number is treated as a TITLE1 or FOOTNOTE1 statement. You do not have to start with TITLE1 and you do not have to use sequential statement numbers. Skipping a number in the sequence leaves a blank line.
You can use as many text strings and options as you want, but place the options before the text strings they modify.
The most recently specified TITLE or FOOTNOTE statement of any number completely replaces any other TITLE or FOOTNOTE statement of that number. In addition, it cancels all TITLE or FOOTNOTE statements of a higher number. For example, if you define TITLE1, TITLE2, and TITLE3, resubmitting the TITLE2 statement cancels TITLE3.
To cancel individual TITLE or FOOTNOTE statements, define a TITLE or FOOTNOTE statement of the same number without options (a null statement):
title4;
But remember that this will cancel all other existing statements of a higher number.
To cancel all current TITLE or FOOTNOTE statements, use the RESET= graphics option in a GOPTIONS statement:
goptions reset=footnote;
Specifying RESET=GLOBAL or RESET=ALL also cancels all current TITLE and FOOTNOTE statements as well as other settings.”
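The replacement and cancellation rules for titles can be seen in a short sketch. It assumes the SASHELP.CLASS sample data set that ships with SAS is available; the title text is made up.

```sas
title1 'First line';
title2 'Second line';
title3 'Third line';
proc print data=sashelp.class(obs=3); run;   /* three title lines */

/* Redefining TITLE2 replaces it and cancels TITLE3 */
title2 'New second line';
proc print data=sashelp.class(obs=3); run;   /* two title lines */

/* A null TITLE statement (same as TITLE1;) cancels all titles */
title;
proc print data=sashelp.class(obs=3); run;
```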
Symbol statement:
1. Global statement
2. Syntax
symbol<1...255> keyword=value;
keywords include:
color
line
value
interpol
3. A new symbol definition of any number replaces the old symbol definition of the same number with the same keywords
Axis statement:
1. Global statement
2. Syntax:
axis<1...99> label=("value");
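A minimal sketch of SYMBOL and AXIS definitions used together in PROC GPLOT (this requires SAS/GRAPH). The data set growth, its values, and the axis label text are made up for illustration.

```sas
data growth;
  input day height;
  datalines;
1 2.0
2 2.9
3 4.1
4 4.8
;
run;

/* Blue dots joined by a solid line */
symbol1 value=dot color=blue interpol=join line=1;
axis1 label=("Day");
axis2 label=(angle=90 "Height (cm)");

proc gplot data=growth;
  plot height*day=1 / haxis=axis1 vaxis=axis2;   /* =1 selects SYMBOL1 */
run;
quit;

/* SYMBOL and AXIS definitions are global; clear them when done */
goptions reset=(symbol axis);
```

Because these statements are global, resetting them at the end keeps the definitions from leaking into later plots in the same session.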
proc reg data=htwt;
model weight=height;
plot weight*height p.*height / overlay;
run;
symbol1 value=plus color=black;
symbol2 I=RLCLM95 line=1 color=red;
symbol3 I=RLCLI95 line=4 color=blue;
proc gplot data=htwt;
plot weight*height weight*height=2 weight*height=3 / overlay;
run;