Dear Professor,
I have some questions regarding Tutorial 4 and Chapters 5 and 6 of the lecture notes:
1) Chapter 5, Slide 15
Descriptive statistics: R
> ex5.1=read.table("F:/ST2137/lecdata/ex5_1ar.txt",header=T)
> ex5.1ar=ex5.1[,1] What is the function of this statement? What does [,1] represent? Is this statement necessary? What will happen if it is not included?
%% a[,1] denotes the first column of the data frame “a”. Similarly, a[2,] denotes the second row of “a” and a[1,3] denotes the element of the first row and third column.
> summary(ex5.1ar)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   -2.70   14.17   22.98   27.00   32.62   91.15
> mean(ex5.1ar)
[1] 27.00079
> median(ex5.1ar)
[1] 22.98
> min(ex5.1ar)
[1] -2.7
> max(ex5.1ar)
[1] 91.15
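%% A minimal sketch of the indexing rules above, using a small made-up data frame (the name “a” and its values are hypothetical, not the lecture data). It also suggests why the extraction step matters: functions such as mean() expect a vector, and in recent R versions mean() applied to a whole data frame returns NA with a warning.

```r
# Hypothetical toy data frame standing in for ex5.1 (values are made up).
a <- data.frame(ar = c(10.5, -2.7, 22.98), id = c(1, 2, 3))

first_col  <- a[, 1]   # first column, returned as a plain numeric vector
second_row <- a[2, ]   # second row, returned as a one-row data frame
element    <- a[1, 2]  # element in row 1, column 2

# Descriptive statistics are computed on the extracted vector.
mean_first <- mean(first_col)
```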
2) Chapter 5, Slide 27
Histogram: SAS
/* Histogram and qqplot */
goptions reset=all
ftext='Arial/bo'
%% This sets the font of the text to ‘Arial/bo’, that is, Arial bold.
cback=white
colors=(black)
gunit=pct
%% This sets the graph unit to “pct”, so sizes are given as a percentage of the graphics output area.
htext=2
%% The height of the text is 2 units (2 percent, given gunit=pct).
hpos=15;
%% hpos=15 sets the number of columns (horizontal character positions) in the graphics output area to 15.
What do the parts in blue mean?
3) Chapter 5, Slide 31
Histogram: R
> hist(return,include.lowest=TRUE,freq=TRUE, What is the function of return in this statement? Is it a variable?
%% Yes, it is a variable named “return”; it holds the data to be plotted.
main=paste("Histogram of return"),
xlab="return", ylab="frequency", axes=TRUE)
> # Normal curve imposed on the histogram
xpt=seq(-10,100,0.1) What does 0.1 stand for?
%% 0.1 is the increment (step size) of the sequence generated by the “seq” function; seq(-10,100,0.1) gives the sequence -10, -10+0.1, -10+0.2, …, 100.
ypt=dnorm(seq(-10,100,0.1),mean(return),sd(return))
ypt=ypt*length(return)*10
lines(xpt,ypt)
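%% A minimal sketch of the grid and the scaling step, without the plotting calls. The data in “ret” are made up, and reading the factor 10 as the histogram's bin width is an assumption: a normal density multiplied by the number of observations and the bin width is on the same scale as the frequency counts.

```r
# Hypothetical data standing in for the variable "return" (made-up values).
ret <- c(5, 12, 18, 25, 31, 47)

xpt <- seq(-10, 100, 0.1)              # -10.0, -9.9, -9.8, ..., 100.0
ypt <- dnorm(xpt, mean(ret), sd(ret))  # normal density at each grid point

# Assumed bin width of 10, matching the factor 10 in the code above:
# density * n * bin width puts the curve on the frequency (count) scale.
bin_width <- 10
ypt_freq <- ypt * length(ret) * bin_width

n_points <- length(xpt)  # (100 - (-10)) / 0.1 + 1 grid points
```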
4) Chapter 5, Slide 47
Descriptive Statistics by groups: SAS
proc format;
value $risk '1'='Average Risk'
'2'='High Risk';
data ex5_2;
infile "F:\ST2137\lecdata\ex5_2.txt";
input return risk$;
label return='Return Percentage';
format risk $risk.;
%% The format statement assigns the format “$risk.”, defined earlier with “proc format”, to the variable “risk”.
Why is '$' placed just before risk? I thought '$' should be placed behind a string variable? What do the first and second risk stand for?
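%% This is the syntax of the format statement: the first “risk” is the variable and “$risk.” is the format applied to it. The “$” prefix marks a character format, whereas in the input statement “$” follows the name of a character variable. The dot after “$risk” is also required.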
5) Chapter 5, Slide 67
Plot of bivariate data: R
plot(height[gender=="M"],weight[gender=="M"],
+main="Use Gender to generate the plotting symbol",
+ylab="Weight",xlab="Height",
+xlim=c(150,190), ylim=c(40,80))
par(new=T)
plot(height[gender=="F"],weight[gender=="F"],
+main="",ylab="",xlab="",xlim=c(150,190), ylim=c(40,80),
+axes=F,pch=0,col=2)
Is '+' necessary? This sign does not appear in tutorial 4's answer key. What does 'axes= F' mean?
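%% The “+” is R's continuation prompt: it shows that the previous line is not complete and continues on the next line. You do not type it yourself, which is why it does not appear in the answer key. “axes=F” suppresses the drawing of the axes, so the second plot does not draw them again over the first.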
6) Tutorial 4, Q1(d) R procedures for drawing histogram for the processing time for each of the two plants.
##Histogram
wip1a=wip[plant==1,c("plant","time")] What does c("plant", "time") mean?
%% It selects the columns named “plant” and “time”. Here “c” is R's c() function, which combines its arguments into a vector.
wip2a=wip[plant==2,c("plant","time")]
par(mfrow=c(2,1))
hist(wip1a$time,include.lowest=T,freq=T,
main=paste("Histogram of time for first production plant"),
xlab="time",ylab="frequency",axes=T)
hist(wip2a$time,include.lowest=T,freq=T,
main=paste("Histogram of time for second production plant"),
xlab="time",ylab="frequency",axes=T)
%% “wip1a$time” extracts the variable “time” from the data frame wip1a.
## Another solution
wip1=wip[plant==1,1] What do the numbers 1 and 2 mean? Why are there three 1s?
wip2=wip[plant==2,1]
%% “plant==1” gives all those rows with the variable “plant” taking value 1.
%% “plant==2” gives all those rows with the variable “plant” taking value 2.
%% wip[plant==2,1] gives the first-column values of the rows for which “plant==2” holds. The 1 in “wip1” is simply part of the object's name.
par(mfrow=c(2,1))
hist(wip1,include.lowest=T,freq=T,
main=paste("Histogram of time for first production plant"),
xlab="time",ylab="frequency",axes=T)
hist(wip2,include.lowest=T,freq=T,
main=paste("Histogram of time for second production plant"),
xlab="time",ylab="frequency",axes=T)
What do the parts in blue mean?
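%% A minimal sketch of the two ways of subsetting used in this question. The data frame here is a made-up stand-in for “wip”, with time assumed to be the first column as in the second solution. The sketch writes “wip$plant” where the tutorial code writes “plant”, on the assumption that the tutorial code relies on the data frame having been attached.

```r
# Hypothetical stand-in for the data frame "wip" (made-up values); column 1 is time.
wip <- data.frame(time  = c(5.2, 7.1, 6.3, 9.8, 8.4),
                  plant = c(1, 2, 1, 2, 1))

rows_plant1 <- wip$plant == 1                  # TRUE/FALSE for each row

wip1a <- wip[rows_plant1, c("plant", "time")]  # plant-1 rows, two named columns
wip1  <- wip[rows_plant1, 1]                   # plant-1 rows, first column only (a vector)
```

Both routes lead to the same processing times: wip1a$time and wip1 hold the same values.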
7) Tutorial 4, Q2(a) SAS procedure for drawing the scatterplot for the two test scores for all the trainees with a different symbol for different gender.
data testscores;
infile "F:/ST2137/tutorialdata/testscores.txt" firstobs=2; When do we start from the 2nd observation?
%% “firstobs=2” is used when the data file “testscores.txt” has a header, that is, the names of the variables are in its first row; reading then starts from the second line.
input A B gender$; Is the order of the variables important?
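%% Yes. The input statement lists the variables in the order in which they appear in each line of the data file, and this is also their order in the resulting SAS data set.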
run;
proc gplot data=testscores;
title "Scatter plot for two tests";
plot A*B=gender;
symbol1 value=circle color=red;
symbol2 value=square color=black;
run;