CSSS 508: Intro R

1/25/06

Lab 3: Being Loopy and Creating Variables

In Venables and Ripley’s MASS library, there are several datasets that are available for exploring and analyzing.

> help.start()

Click on Packages, then MASS. Scroll through; look at short descriptions of datasets.

>library(MASS) (loads the package in your R session)

Today our sample dataset is the Veteran’s Administration Lung Cancer Trial data.

>help(VA)

> dim(VA)

[1] 137 8

> n<-nrow(VA)

> p<-ncol(VA)

> n

[1] 137

> p

[1] 8

There are 137 subjects, 8 variables. The data is a data frame; each column is named and can be accessed by VA$(the column name). You can reassign the columns if you want.

VA$stime: survival or followup time in days

VA$status: dead (1) or censored (0)

VA$treat: standard (1) or test (2)

VA$age: patient’s age in years

VA$Karn: patient’s Karnofsky score; on scale 0-100;

high values are for relatively well patients

VA$diag.time: times since diagnosis in months at entry to trial

VA$cell: one of four cell types

VA$prior: did the patient have prior therapy (10-Yes, 0-No)

Do summary() on each variable (or on VA) to get an idea of what kind of data you have.

Any missing data?


Creating a quartile age category variable using a for loop and if statements:

> summary(VA$age)

Min. 1st Qu. Median Mean 3rd Qu. Max.

34.00 51.00 62.00 58.31 66.00 81.00

> age.cat<-rep(0,n)

> for(i in 1:n){

+ if(34<=VA$age[i] & VA$age[i]<51) age.cat[i]<-1

+ if(51<=VA$age[i] & VA$age[i]<62) age.cat[i]<-2

+ if(62<=VA$age[i] & VA$age[i]<66) age.cat[i]<-3

+ if(66<=VA$age[i] & VA$age[i]<=81) age.cat[i]<-4

+ }

> table(age.cat)

age.cat

1 2 3 4

34 30 36 37

Note: we had an if statement for every possible age value, so we didn’t need an else statement. Could also do this with a string of if/else statements.

> age.cat<-rep(0,n)

> for(i in 1:n){

+ if(34<=VA$age[i] & VA$age[i]<51) age.cat[i]<-1

+ else if(51<=VA$age[i] & VA$age[i]<62) age.cat[i]<-2

+ else if(62<=VA$age[i] & VA$age[i]<66) age.cat[i]<-3

+ else age.cat[i]<-4

+ }

> table(age.cat)

age.cat

1 2 3 4

34 30 36 37

Or with a series of conditional assignments:

> age.cat<-rep(0,n)

> age.cat[34<=VA$age & VA$age<51]<-1

> age.cat[51<=VA$age & VA$age<62]<-2

> age.cat[62<=VA$age & VA$age<66]<-3

> age.cat[age.cat==0]<-4

> table(age.cat)

age.cat

1 2 3 4

34 30 36 37

Make a matrix copy of your VA data.

> VA2<-as.matrix(VA)

Change 10 random elements to missing.

> VA2[sample(seq(1,n*p),10)]<-NA


Double for loops:

If we want to loop over all elements in a matrix, we can index over two for loops.

The below code will loop over a matrix and return a list of the locations of the missing variables. The rows are indexed by i; the columns by j.

missing<-NULL

for(i in 1:n){

for(j in 1:p){

if(is.na(VA2 [i,j])) missing<-rbind(missing, c(i,j))

}

}

> missing (these are the random ones I had; yours will be different)

[,1] [,2]

[1,] 2 4

[2,] 21 3

[3,] 26 2

[4,] 41 2

[5,] 42 1

[6,] 56 5

[7,] 66 5

[8,] 90 8

[9,] 112 3

[10,] 125 7

Loops can take a long time; try to do things vector by vector if you can.

missing<-NULL

for(i in 1:n){

missing.loc<-which(is.na(VA2[i,]))

how.many<-length(missing.loc)

if(how.many!=0)

missing<-rbind(missing,cbind(rep(i,how.many),missing.loc))

}

> missing

missing.loc

age 2 4

treat 21 3

status 26 2

status 41 2

stime 42 1

Karn 56 5

Karn 66 5

prior 90 8

treat 112 3

cell 125 7


Logistic Regression: Often we want to create variables that represent a subgroup

(1 if in the subgroup; 0 if not)

Create subgroup variables for the 4 categories:

(standard treatment / Karn < 50, standard / Karn 50, test / Karn 50, test / Karn 50)

> stand.lowKarn<-stand.highKarn<-test.lowKarn<-test.highKarn<-rep(0,n)

> stand.lowKarn[VA$treat==1&VA$Karn<50]<-1

> stand.highKarn[VA$treat==1&VA$Karn>=50]<-1

> test.lowKarn[VA$treat==2&VA$Karn<50]<-1

> test.highKarn[VA$treat==2&VA$Karn>=50]<-1

> table(stand.lowKarn)

stand.lowKarn

0 1

119 18

> table(stand.highKarn)

stand.highKarn

0 1

86 51

> table(test.lowKarn)

test.lowKarn

0 1

117 20

> table(test.highKarn)

test.highKarn

0 1

89 48

Note that the number of ones adds up to 137. Each person is in one of the subgroups.

How would you do this with a for loop?

General Practice:

What is the mean survival time for each of the four cell types?

> mean(VA$stime[VA$cell==1])

[1] 200.2

> mean(VA$stime[VA$cell==2])

[1] 71.66667

> mean(VA$stime[VA$cell==3])

[1] 64.11111

> mean(VA$stime[VA$cell==4])

[1] 166.1111

Which patients were in the test treatment group and had prior therapy?

> which(VA$treat==2&VA$prior==10)

[1] 70 73 75 77 84 87 89 91 95 96 106 109 113 122 127 128 131 132 135

Looking at More Examples of While loops:

Mostly while loops are used when you’re testing a condition and you want to know when you’ve reached convergence or some point of interest.

How many data points do we need to simulate to get within .1 of the specified mean?

data<-NULL

check<-0

while(check==0){

data<-c(data,rpois(1,lambda=2))

new.mean<-mean(data)

if(1.9<=new.mean & new.mean<=2.1) check<-1

}

> data

[1] 2

> data

[1] 4 4 0 2 1 2 2 1

To get within .05?

data<-NULL

check<-0

while(check==0){

data<-c(data,rpois(1,lambda=2))

new.mean<-mean(data)

if(1.95<=new.mean & new.mean<=2.05) check<-1

}

> data

[1] 0 2 2 1 1 2 3 0 0 1 2 2 0 2 3 2 2 2 5 1 1 1 0 0 0 0 3 2 1 3 3 3 7 0 1 0 1 1 5 3 4 4 1 1 0 1 0 4 1 3 4 4 3 3 3 4 3 4

To get within .01?

data<-NULL

check<-0

while(check==0){

data<-c(data,rpois(1,lambda=2))

new.mean<-mean(data)

if(1.99<=new.mean & new.mean<=2.01) check<-1

}

> data

[1] 0 0 1 2 1 2 4 2 3 0 2 1 2 2 2 2 3 1 3 1 0 1 1 0 1 1 0 1 4 2 3 3 3 1 1 1 3

[38] 3 0 1 0 1 3 2 1 3 5 2 4 1 0 2 2 2 2 0 4 1 2 3 4 2 1 3 3 2 0 2 3 1 0 1 1 3

[75] 6 3 1 4 2 1 2 1 5 1 4 2 4 2 1 3 2 1 3 4 0 4 2 4 2 3 1 5

Rebecca Nugent, Department of Statistics, U. of Washington - 5 -