Dear Professor,

I have some questions regarding Tutorial 4 and Chapters 5 and 6 of the lecture notes:

**1) Chapter 5, Slide 15**

Descriptive statistics: R

> ex5.1=read.table("F:/ST2137/lecdata/ex5_1ar.txt",header=T)

**> ex5.1ar=ex5.1[,1] What is the function of this statement? What does [,1] represent? Is this statement necessary? What will happen if it is not included?**

**%% a[,1] denotes the first column of the data frame “a”. Similarly, a[2,] denotes the second row of “a” and a[1,3] denotes the element of the first row and third column.**

> summary(ex5.1ar)

Min. 1st Qu. Median Mean 3rd Qu. Max.

-2.70 14.17 22.98 27.00 32.62 91.15

> mean(ex5.1ar)

[1] 27.00079

> median(ex5.1ar)

[1] 22.98

> min(ex5.1ar)

[1] -2.7

> max(ex5.1ar)

[1] 91.15
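The indexing rules in the answer above can be tried on a small example. This is only a sketch: the data frame `a` and its column names `x`, `y`, `z` are made up for illustration and are not the lecture data.

```r
# Small made-up data frame; the column names are hypothetical.
a <- data.frame(x = c(10, 20, 30), y = c(1.5, 2.5, 3.5), z = c(7, 8, 9))

first_col  <- a[, 1]   # first column, returned as a plain numeric vector
second_row <- a[2, ]   # second row, returned as a one-row data frame
one_value  <- a[1, 3]  # element in row 1, column 3

# mean() expects a numeric vector, so it works on the extracted column.
# Calling mean(a) on the whole data frame would return NA with a warning.
mean(first_col)
```

This is probably why the slide extracts the column first: functions such as mean() and median() operate on the vector `ex5.1ar`, not on the data frame `ex5.1`.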

**2) Chapter 5, Slide 27**

Histogram: SAS

/* Histogram and qqplot */

goptions reset=all

ftext=‘Arial/bo’

**%% this sets the text font to ‘Arial/bo’, i.e. Arial in bold**

cback=white

colors=(black)

gunit=pct

**%% this sets the graph units to “pct”, i.e. sizes are given as a percentage of the graphics output area**

htext=2

**%% the text height is 2, measured in the units set by gunit**

hpos=15;

**%% hpos=15 sets the number of horizontal positions (columns) in the graphics output area to 15; it is not a height**

**What do the parts in blue mean?**

**3) Chapter 5, Slide 31**

Histogram: R

> hist(return,include.lowest=TRUE,freq=TRUE, **What is the function of return in this statement? Is it a variable?**

**%% it is a variable named “return”**

main=paste("Histogram of return"),

xlab="return", ylab="frequency", axes=TRUE)

> # Normal curve imposed on the histogram

xpt=seq(-10,100,0.1) **What does 0.1 stand for?**

**%% 0.1 is the step size (increment) of the sequence generated by the “seq” function; seq(-10,100,0.1) gives the sequence -10, -10+0.1, -10+0.2, …, 100**

ypt=dnorm(seq(-10,100,0.1),mean(return),sd(return))

ypt=ypt*length(return)*10

lines(xpt,ypt)
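The role of the third argument of seq(), and of the two multipliers applied to `ypt`, can be checked with a short sketch. The vector `ret` below is made-up data standing in for the slide's `return` variable, and `bin_width` is a hypothetical name; reading the slide's factor of 10 as the histogram bin width is an assumption, not something the slide states.

```r
# Made-up data standing in for the slide's `return` variable.
ret <- c(5, 12, 18, 23, 27, 31, 44, 58)

# seq(from, to, by): the third argument is the increment between values.
xpt <- seq(-10, 100, 0.1)
head(xpt, 3)     # the first three values are -10, -9.9 and -9.8
length(xpt)      # (100 - (-10)) / 0.1 + 1 = 1101 points

# dnorm() returns a density, whose total area is 1. Multiplying by the
# number of observations and by the bin width rescales it to the expected
# count per bin, which is the scale used when freq=TRUE.
bin_width <- 10  # assumption: the histogram bins are 10 units wide
ypt <- dnorm(xpt, mean(ret), sd(ret)) * length(ret) * bin_width
```

With a different bin width, the last multiplier would need to change accordingly for the curve to sit on the histogram.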

**4) Chapter 5, Slide 47**

Descriptive Statistics by groups: SAS

proc format;

value $risk '1'='Average Risk'

'2'='High Risk';

data ex5_2;

infile "F:\ST2137\lecdata\ex5_2.txt";

input return risk$;

label return=‘Return Percentage’;

**format risk $risk.; **

%% The format statement assigns the format “$risk.”, defined earlier with “proc format”, to the variable “risk”. The first “risk” is the variable name and the second, “$risk.”, is the format name.

**Why is '$' placed just before risk? I thought '$' should be placed after a string variable? What do the first and second risk stand for?**

%% This is the required syntax of the format statement: the name of a character format begins with “$”, and the dot following “$risk” is also necessary.

**5) Chapter 5, Slide 67**

Plot of bivariate data: R

plot(height[gender=="M"],weight[gender=="M"],

+ main="Use Gender to generate the plotting symbol",

+ ylab="Weight",xlab="Height",

+ xlim=c(150,190), ylim=c(40,80))

par(new=T)

plot(height[gender=="F"],weight[gender=="F"],

+ main="",ylab="",xlab="",xlim=c(150,190), ylim=c(40,80),

+ axes=F,pch=0,col=2)

Is '+' necessary? This sign does not appear in tutorial 4's answer key. What does 'axes= F' mean?

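%% The “+” is R's continuation prompt: it shows that the previous line is not complete and continues on the next line. R prints it in the console; you do not type it yourself, which is presumably why it does not appear in the answer key. “axes=F” suppresses the axes of the second plot so that they are not drawn a second time over the first plot.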
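As a minimal sketch of the continuation behaviour, the call below is written the way it would appear in a script file, with no “+” typed. The name `total` is made up for illustration.

```r
# In a script file, a call can span several lines with no "+" typed.
# R keeps reading because the opening parenthesis has not been closed yet.
total <- sum(1,
             2,
             3)
total
```

When the same lines are entered one at a time in the console, R displays “+” at the start of the second and third lines by itself.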

**6) Tutorial 4, Q1(d)** R procedure for drawing a histogram of the processing time for each of the two plants.

##Histogram

wip1a=wip[plant==1,c("plant","time")] What does c("plant", "time") mean?

%% This selects the columns named “plant” and “time”. Here “c” is the “c” function, which combines its arguments into a vector.

wip2a=wip[plant==2,c("plant","time")]

par(mfrow=c(2,1))

hist(wip1a$time,include.lowest=T,freq=T,main=paste("Histogram of time for first production plant"),xlab="time",ylab="frequency",axes=T)

hist(wip2a$time,include.lowest=T,freq=T,main=paste("Histogram of time for second production plant"),xlab="time",ylab="frequency",axes=T)

%% “wip1a$time” extracts the variable “time” from the data frame wip1a.
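The column selection with c() and the “$” extraction described above can be sketched on a toy data set. The values below are made up and only stand in for the tutorial's `wip` data. The sketch writes `wip$plant` explicitly, on the assumption that the tutorial code can refer to `plant` directly because the data frame was attached earlier.

```r
# Made-up stand-in for the tutorial's `wip` data (values are hypothetical).
wip <- data.frame(time  = c(5.2, 7.1, 6.3, 9.8, 8.4),
                  plant = c(1, 2, 1, 2, 1))

# c("plant", "time") builds a character vector of column names.
# Used as the column index, it keeps those columns in that order.
cols <- c("plant", "time")
wip1a <- wip[wip$plant == 1, cols]

# `$` pulls a single column out of the data frame as a vector.
wip1a$time
```

Here `wip1a` is still a data frame with two columns, whereas `wip1a$time` is a numeric vector that hist() can use directly.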

## Another solution

wip1=wip[plant==1,1] What do the numbers 1 and 2 mean? Why are there three 1s?

wip2=wip[plant==2,1]

%% “plant==1” gives all those rows with the variable “plant” taking value 1.

%% “plant==2” gives all those rows with the variable “plant” taking value 2.

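%% wip[plant==2,1] denotes the first-column values of those rows associated with “plant==2”. In “wip1=wip[plant==1,1]”, the first 1 is part of the object name “wip1”, the second is the value of “plant”, and the third is the column number.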

par(mfrow=c(2,1))

hist(wip1,include.lowest=T,freq=T,main=paste("Histogram of time for first production plant"),xlab="time",ylab="frequency",axes=T)

hist(wip2,include.lowest=T,freq=T,main=paste("Histogram of time for second production plant"),xlab="time",ylab="frequency",axes=T)

What do the parts in blue mean?
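To make the row and column positions in “wip[plant==2,1]” concrete, here is a small sketch on made-up data. The values and the name `is_plant2` are hypothetical, and placing `time` in the first column is an assumption that mirrors how the tutorial code uses column 1.

```r
# Made-up stand-in for the tutorial's `wip` data; `time` is column 1.
wip <- data.frame(time  = c(5.2, 7.1, 6.3, 9.8, 8.4),
                  plant = c(1, 2, 1, 2, 1))

# The comparison yields one TRUE/FALSE value per row.
is_plant2 <- wip$plant == 2

# Row index: keep rows where the value is TRUE.
# Column index: 1 means the first column, so the result is a numeric vector.
wip2 <- wip[is_plant2, 1]
wip2
```

Because only one column is selected by number, the result is already a vector, so no “$time” is needed before passing it to hist().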

**7) Tutorial 4, Q2(a)** SAS procedure for drawing the scatterplot of the two test scores for all the trainees, with a different symbol for each gender.

data testscores;

infile "F:/ST2137/tutorialdata/testscores.txt" firstobs=2; When do we start from the 2nd observation?

%% “firstobs=2” is used when the file “testscores.txt” has a header, that is, it has the names of the variables in its first row, so reading starts from the second line.

input A B gender$; Is the order of the variables important?

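%% Yes. The input statement reads the values on each line in the order listed, so the names must follow the column order of the file; this is also the variable order of the resulting SAS data set.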

run;

proc gplot data=testscores;

title "Scatter plot for two tests";

plot A*B=gender;

symbol1 value=circle color=red;

symbol2 value=square color=black;

run;