Numerical Methods for Function Optimization in Statistics

Abstract

The purpose of this paper is to present some numerical methods for function optimization in Statistics. I mainly introduce two numerical methods, the Newton-Raphson method and the Steepest-Descent method. The basic mathematical theory behind these two methods is illustrated together with some applications, such as approximating the MLE for a binomial model, approximating gradients and Hessians via finite-difference schemes, and searching for the parameters of logistic and probit regression models. Implementations of these two methods in the R programming language are also provided.

Newton-Raphson Method for Solving the MLE[1] of Statistical Models

The Newton-Raphson algorithm is an iterative procedure that can be used to calculate MLEs.

The basic idea behind the algorithm is the following. First, construct a quadratic approximation to the function of interest around some initial parameter value (hopefully close to the MLE). Next, move to the parameter value that maximizes the quadratic approximation. This procedure is iterated until the parameter values stabilize.
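Written out (a standard Taylor-expansion argument, stated here for completeness), if x_m is the current iterate and f''(x_m) < 0, the quadratic approximation and the value that maximizes it are

\[
f(x) \approx f(x_m) + f'(x_m)(x - x_m) + \tfrac{1}{2} f''(x_m)(x - x_m)^2,
\qquad
x_{m+1} = x_m - \frac{f'(x_m)}{f''(x_m)} .
\]

Setting the derivative of the quadratic to zero gives the update on the right, which is exactly the step used in the algorithm below.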

Algorithm: Newton-Raphson, 1-dimensional (tolerance TOL)

Comment: Find the value of x that maximizes f(x)

x <- x_0 (initial guess)

while |f'(x)| > TOL

do x <- x - f'(x)/f''(x)

return (x)
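As a concrete illustration, here is a minimal R sketch of this one-dimensional iteration; the names newton1d, f1 and f2 (functions returning the first and second derivatives of f) are my own choices for illustration and do not appear in the original code.

# Minimal sketch of one-dimensional Newton-Raphson maximization.
# f1 and f2 are assumed to return the first and second derivatives of f.
newton1d <- function(x0, f1, f2, tol = 0.01, maxit = 30) {
    x <- x0
    for (m in 1:maxit) {
        if (abs(f1(x)) <= tol) break      # stop once the derivative is within tolerance
        x <- x - f1(x) / f2(x)            # Newton-Raphson update
    }
    x
}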

Example: Calculating the MLE of a Binomial Sampling Model by Newton-Raphson Method

To see how the Newton-Raphson algorithm works in practice, let's look at a simple example with an analytical solution: a simple model of binomial sampling.

Suppose we have the log-likelihood function

l(θ|y) = y ln(θ) + (n - y) ln(1 - θ),

where n is the sample size, y is the number of successes, and θ is the probability of a success. The first derivative of the log-likelihood function is

l'(θ|y) = y/θ - (n - y)/(1 - θ),

and the second derivative of the log-likelihood function is

l''(θ|y) = -y/θ^2 - (n - y)/(1 - θ)^2.

Analytically, setting the first derivative to zero gives the MLE θ = y/n, so if we let n = 10 and y = 5, the MLE is θ = 0.5.

Let's see how the Newton-Raphson algorithm works here. We begin by setting a tolerance level; in this case, we set TOL = 0.01. Then we make an initial guess (denoted θ_0) as to the MLE. Suppose θ_0 = 0.55. We calculate l'(θ_0|y), which is larger in absolute value than our tolerance of 0.01. Thus we set

θ_1 = θ_0 - l'(θ_0|y)/l''(θ_0|y).

Now we calculate l'(θ_1|y), which is still larger in absolute value than our tolerance of 0.01. Thus we set

θ_2 = θ_1 - l'(θ_1|y)/l''(θ_1|y).

l'(θ_2|y) is approximately equal to 0.0012, which is smaller in absolute value than our tolerance of 0.01, so we can stop. The Newton-Raphson algorithm here returns a value of θ_2 = 0.5335892, which is reasonably close to the analytical value of 0.5.

Note that if we want to make the Newton-Raphson procedure more accurate (to within machine precision), we can set the tolerance closer to 0.
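Before turning to the deriv()-based implementation below, here is a minimal sketch of the same iteration written directly from the analytical derivatives above; the names score and hess are mine, and the numbers are those of the worked example (n = 10, y = 5, starting value 0.55, tolerance 0.01).

# Sketch: Newton-Raphson for the binomial log-likelihood using the
# analytical first and second derivatives given above.
n <- 10; y <- 5
score <- function(theta) y/theta - (n - y)/(1 - theta)          # l'(theta|y)
hess  <- function(theta) -y/theta^2 - (n - y)/(1 - theta)^2     # l''(theta|y)
theta <- 0.55
while (abs(score(theta)) > 0.01) {
    theta <- theta - score(theta)/hess(theta)                   # Newton-Raphson update
}
theta                                  # close to the analytical MLE y/n = 0.5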

The R implementation of Newton-Raphson for solving the MLE of a Binomial Sampling Model

# Program to use the Newton-Raphson method to solve the MLE
# of a Binomial Sampling Model

# Define the log-likelihood function l(theta|y) = y*ln(theta) + (n-y)*ln(1-theta)
pltt <- function(tet) {
    tet2 <- log(tet)
    tet1 <- log(1 - tet)
    pltt <- y * tet2 + (n - y) * tet1
    pltt
}

# Define the derivative function for l(tet) using deriv()
dpltt <- deriv(~ y * log(tet) + (n - y) * log(1 - tet), "tet", function(tet) {})

# Define the function that performs the Newton-Raphson iteration
itter <- function(tets) {
    tet1 <- tets
    tol <- 0.1
    count <- 0
    Resp <- "Function Converged"
    while (tol > 0.0001) {
        dplt1 <- dpltt(tet1)                   # log-likelihood value with gradient attribute
        dplt2 <- attr(dplt1, "gradient")[1, ]  # first derivative at tet1
        tnew <- tet1 - dplt1 / dplt2           # update step
        tol <- abs(tet1 - tnew)
        tet2 <- tet1; tet1 <- tnew
        count <- count + 1
        if (count >= 30) {                     # give up after 30 iterations
            Resp <- "No Convergence"
            tol <- 0.0001
        } else if (count > 10) {               # damp the step when convergence is slow
            tet1 <- (tet2 + tnew) / 2
        }
    }
    list(tnew = tnew, Resp = Resp)
}

# Main Program
tet0 <- 0.55                 # initial guess
xobs <- c(1, 0, 0, 1, 0)     # observed Bernoulli sample
y <- sum(xobs)               # number of successes
n <- length(xobs)            # sample size
tet3 <- itter(tet0)
t.rep <- tet3$Resp
if (t.rep == "Function Converged") {
    cat("\n Parameter Estimate=", tet3$tnew)
}
cat("\n", t.rep, "\n")

Parameter Estimate = 0.5335892
> cat("\n ", t.rep, "\n")

Function Converged

> tet0

[1] 0.55

> tet3

$tnew

[1] 0.5335892

attr(,"gradient")

tet

[1,] -505.0362

$Resp

[1] "Function Converged"

The Newton-Raphson Algorithm for solving g(x) = 0 for a k-dimensional variable

Mathematical idea of NR

The multivariate Newton-Raphson (NR) method for solving an equation g(x) = 0, where g is a smooth (k-vector-valued) function of a k-dimensional vector variable x whose Jacobian matrix

J_g(x) = ∂g(x)/∂x (a k × k matrix)

never vanishes (i.e. is nonsingular), is to write and implement the equation saying that the linear (first-order Taylor series) approximation about x to the function, evaluated at an updated variable value x', is precisely 0, i.e.

g(x) + J_g(x) (x' - x) = 0, or x' = x - J_g(x)^{-1} g(x).

The key application of this idea, which we make in computational statistics, is, for a fixed dataset X, to take g to be the gradient of the log-likelihood as a function of the parameter vector, so that solving g = 0 locates a stationary point of the likelihood, typically the MLE.

The Newton-Raphson computational algorithm, which we code below in R, both from first principles and by using existing standard functions, is to begin with some initial value x^(0) and then iteratively, for m = 0, 1, ..., define

x^(m+1) = x^(m) - J_g(x^(m))^{-1} g(x^(m)),

repeating until some termination criterion is met, usually that either m is equal to a fixed large number (like 30) or ||x^(m+1) - x^(m)|| falls below a fixed tolerance (like 10^(-5)).

Example: Here is a simple R function that numerically approximates gradients; it is needed only when we do not have a function implementing an analytical formula for the gradient.

Gradmat <- function(parvec, infcn, eps = 1e-06)
{
    # Function to calculate the difference-quotient approximate gradient
    # (matrix) of an arbitrary input (vector) function infcn
    # Now recoded to use central differences!
    dd <- length(parvec)
    aa <- length(infcn(parvec))
    epsmat <- (diag(dd) * eps)/2
    gmat <- array(0, dim = c(aa, dd))
    for (i in 1:dd)
        gmat[, i] <- (infcn(parvec + epsmat[, i]) -
                      infcn(parvec - epsmat[, i]))/eps
    if (aa > 1) gmat else c(gmat)
}
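As a quick check (my own example, not from the original text), Gradmat can be compared against a function whose gradient is known exactly:

# f(x) = x1^2 + 3*x2 has exact gradient (2*x1, 3); at (1, 2) this is (2, 3).
Gradmat(c(1, 2), function(x) x[1]^2 + 3 * x[2])
# returns approximately 2 3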

NRroot <- function(inipar, infcn, nmax = 25, stoptol = 1e-05,
                   eps = 1e-06, gradfunc = NULL)
{
    # Newton-Raphson root-finder for a (vector-valued) function infcn.
    # If no analytical gradient/Jacobian function is supplied, the
    # numerical approximation Gradmat above is used.
    if (is.null(gradfunc))
        gradfunc <- function(x) Gradmat(x, infcn, eps)
    ctr <- 0
    newpar <- inipar
    oldpar <- inipar - 1
    while (ctr < nmax & sqrt(sum((newpar - oldpar)^2)) > stoptol) {
        oldpar <- newpar
        newpar <- oldpar - solve(gradfunc(oldpar), infcn(oldpar))   # NR update
        ctr <- ctr + 1
    }
    list(nstep = ctr, initial = inipar, final = newpar,
         funcval = infcn(newpar))
}
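A small self-contained check of NRroot (again my own example) on a system whose root is known:

# g(x) = ( x1^2 - 1, x1 + x2 - 3 ) has a root at (1, 2).
gtest <- function(x) c(x[1]^2 - 1, x[1] + x[2] - 3)
NRroot(c(2, 0), gtest)
# $final is approximately c(1, 2) and $funcval is approximately c(0, 0)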

Steepest Descent Method for minimizing the negative log-likelihood function

Our most frequent statistical objective in using the NRroot finder on the gradient of the log-likelihood is to minimize the negative log-likelihood function. Here we introduce another simple method of numerical optimization, the Steepest-Descent method, which can often serve as an initialization stage for a Newton-Raphson method and which must be coded 'by hand'.

The idea of this method is simply to move an initial guess x to an improved one x' by taking a step in the direction of the negative gradient, x' = x - η ∇f(x) for some step size η > 0. This improves the objective-function value as long as the step size is small enough.

A simple implementation follows.

GradSrch <- function(inipar, infcn, step, nmax = 25, stoptol = 1e-05,
                     unitfac = FALSE, eps = 1e-06, gradfunc = NULL)
{
    # Function to implement Steepest descent. The unitfac condition
    # indicates whether or not the supplied step-length factor(s) multiply
    # the negative gradient itself, or the unit vector in the same direction.
    if (is.null(gradfunc))
        gradfunc <- function(x) Gradmat(x, infcn, eps)
    steps <- if (length(step) > 1) step else rep(step, nmax)
    newpar <- inipar
    oldpar <- newpar - 1
    ctr <- 0
    while (ctr < nmax & sqrt(sum((newpar - oldpar)^2)) > stoptol) {
        ctr <- ctr + 1
        oldpar <- newpar
        newstep <- gradfunc(oldpar)
        newstep <- if (unitfac) newstep/sum(newstep^2) else newstep
        newpar <- oldpar - steps[ctr] * newstep
    }
    list(nstep = ctr, initial = inipar, final = newpar,
         funcval = infcn(newpar))
}
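As a sanity check (my own example), GradSrch can be run on a quadratic whose minimum is known:

# f(x) = (x1 - 2)^2 + (x2 - 3)^2 has its minimum at (2, 3).
GradSrch(c(0, 0), function(x) sum((x - c(2, 3))^2), step = 0.1)
# $final is close to c(2, 3) after the allowed number of steps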

Example: Maximizing Logistic and Probit Log-Likelihoods by Steepest Descent

Probit analysis and logistic regression are the two major methods for analyzing binary response data. The term "probit" was coined in the 1930s by Chester Bliss and stands for "probability unit". The two analyses, logit and probit, are very similar to one another: logit analysis is based on log odds, while probit uses the cumulative normal probability distribution.

The probit model is defined as Pr(y=1|x) = Φ(xb), where Φ is the standard cumulative normal probability distribution and xb is called the probit score or index.

First we create logistic- and probit-regression data with the same set of four independent binary regressors, with coefficients 0.5, 0.4, -0.3, 0.7 and intercept -2.

> bc <- c(0.5, 0.4, -0.3, 0.7)
> matcov <- matrix(rbinom(800, 1, 0.5), ncol = 4)
> respLgst <- rbinom(200, 1, plogis(-2 + c(matcov %*% bc)))
> respPrbt <- rbinom(200, 1, pnorm(-2 + c(matcov %*% bc)))
> matcov

Next we construct a function to calculate both the Logistic and Probit log-likelihoods.

> binregLik
function(b0, a0, covmat, yresp, dist = plogis)
{
    pvec <- dist(c(covmat %*% b0) + a0)
    sum(log(ifelse(yresp == 1, pvec, 1 - pvec)))
}

> binregLik(bc,-2,matcov,respLgst, dist=plogis)

[1] -105.1509

> binregLik(bc,-2,matcov,respLgst, dist=pnorm)

[1] -120.7362

> binregLik(bc,-2,matcov,respPrbt, dist=pnorm)

[1] -67.61083

> binregLik(bc,-2,matcov,respPrbt, dist=plogis)

[1] -75.05095

Now we define the functions we will use in the Logistic or Probit regression maximizations. For simplicity, we begin by maximizing over only the first two regression parameters, treating the intercept and the other regression parameters as known.

> tempfunc1 <- function(bb)
      -binregLik(c(bb, -.3, .7), -2, matcov, respLgst)
> tempfunc2 <- function(bb)
      -binregLik(c(bb, -.3, .7), -2, matcov, respPrbt, dist = pnorm)
> c(Gradmat(c(.3, .3), tempfunc1))
[1] -8.645333 -7.061996
> c(Gradmat(c(.2, .6), tempfunc2))
[1] -11.653346  -5.858309

Now we consider estimation by Steepest Descent (with every step equal to 0.05 times the negative of the gradient) and by Newton-Raphson.

> btmp <- c(0.2, 0.2)
> tempfunc1(c(0.2, 0.2))
[1] 108.8181
> for (i in 1:10) {
      btmp <- btmp - 0.05 * Gradmat(btmp, tempfunc1)
      cat(round(c(btmp, tempfunc1(btmp)), digits = 5), " \n")
  }

0.74331 0.6776 105.0981

0.60541 0.42907 104.7035

0.7286 0.51856 104.5731

0.69225 0.4514 104.5422

0.72549 0.47351 104.5333

0.71621 0.45524 104.5310

0.72527 0.46079 104.5303

0.72296 0.45582 104.5302

0.72544 0.4572 104.5301

0.72488 0.45584 104.5301

> unlist(GradSrch(c(0.2,0.2), tempfunc1, 0.05))

nstep initial1 initial2 final1 final2 funcval

16 0.2 0.2 0.72488 0.45584 104.5301

> unlist(GradSrch(c(0.2,0.2), tempfunc1, 0.05, unitfac=T))

nstep initial1 initial2 final1 final2 funcval

25 0.2 0.2 0.69225 0.4514 104.5422
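The glm fit referred to below as tmpglm does not appear in the listing above. A plausible reconstruction, consistent with the deviance table that follows, is sketched here; the data-frame name datLgst, the covariate names v1-v4, and the explicit column of ones Int are my assumptions, not taken from the original.

# Hypothetical reconstruction of the fit behind tmpglm (not in the original):
# the columns of matcov become v1-v4, Int is an explicit column of ones, and
# "- 1" suppresses the default intercept so that Int enters as the last term.
datLgst <- data.frame(matcov, Int = 1, rsp = respLgst)
names(datLgst)[1:4] <- c("v1", "v2", "v3", "v4")
tmpglm <- glm(cbind(rsp, 1 - rsp) ~ v1 + v2 + v3 + v4 + Int - 1,
              family = binomial, data = datLgst)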

Concerning the last model fit, tmpglm, we can examine the incremental deviances due to successively added model terms as follows:

> anova(tmpglm)
Analysis of Deviance Table, Binomial model

Response: cbind(rsp, 1 - rsp)

Terms added sequentially (first to last)
      Df  Deviance  Resid. Df  Resid. Dev
NULL                      200    277.2589
v1     1  24.30848        199    252.9504
v2     1   5.96717        198    246.9832
v3     1  27.07522        197    219.9080
v4     1   0.06075        196    219.8473
Int    1  48.84046        195    171.0068

This says in particular that the intercept coefficient is highly significant: p-value = 1 - pchisq(48.84, 1) ≈ 3e-12. The correct intercept value, we know, is -2, while the estimated value is -2.899, and the other coefficients are also different from the true ones! However:

> -binregLik(tmpglm$coef[1:4], tmpglm$coef[5], matcov, respLgst)
[1] 85.5034
> -binregLik(c(NRroot(rep(0, 4),
      function(bb) t(Gradmat(bb,
          function(bb) -binregLik(bb, -2, matcov, respLgst, dist = plogis))))$final),
      -2, matcov, respLgst)
[1] 87.32924

So the likelihood ratio statistic (a chi-square statistic with 1 degree of freedom) for the difference of the intercept coefficient from -2 with these data is twice the difference of the two negative log-likelihood values above, 3.472, which gives a p-value of 0.062. Thus the difference between the fitted intercept and -2, which looked striking, is nevertheless not significant.
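For reference, the two chi-square tail probabilities quoted above can be reproduced with base R (a check of the arithmetic, not part of the original session):

# Chi-square (1 df) tail probabilities quoted in the text:
1 - pchisq(48.84, 1)   # roughly 3e-12: the intercept term is highly significant
1 - pchisq(3.472, 1)   # roughly 0.062: intercept not significantly different from -2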

Conclusion

Numerical methods are widely used for function optimization in Statistics. They are very fast for well-behaved functions, but in some situations a given method may no longer make rapid progress during the iterations. In such cases we can combine more than one method on a real problem: for example, if we use the Steepest-Descent method to begin a likelihood maximization, then when the steepest-descent steps are no longer making rapid progress we can switch over to Newton-Raphson iterations for more rapid final convergence to the minimum.

References

The Newton-Raphson Algorithm for Function Optimization.

David G. Luenberger. Optimization by Vector Space Methods (Series in Decision and Control). John Wiley & Sons, New York, 1997. ISBN 0-471-18117-X.

[1] Maximum Likelihood Estimation.