Numerical Methods for function Optimization in Statistics
Abstract
The purpose of this paper is to present some numerical methods for function Optimization in Statistics. I will mainly introduce two numerical methods, which are Newton-Raphson method, and Steep-Decent method. The basic mathematical theory behind these two methods and some applications will be illustrated. Such as approximate MLE for binominal model, approximate gradients and Hessians via finite-difference schemes, and searching for parameters for the Logistic and Probit Regression model and so on. Also the implementations of these two methods by using the R programming language are providing.
Newton-Raphson Method for Solving the MLE[1] of Statistical Models
The Newton Raphson algorithm is an iterative procedure that can be used to calculate MLEs.
The basic idea behind the algorithm is the following. First, construct a quadratic approximation to the function of interest around some initial parameter value (hopefully close to the MLE). Next, adjust the parameter value to which maximizes the quadratic approximation. This procedure is iterated until the parameter values stabilize.
Algorithm: Newton-Raphson 1-dimension (tolerance)
Comment: Find the value of x that maximizes f(x)
while |> tolerance
do
return ()
Example: Calculating the MLE of a Binomial Sampling Model by Newton-Raphson Method
To see how the Newton-Raphson algorithm works in practice lets look at a simple example with an analytical solution- a simple model of binomial sampling.
Suppose we have a log-likelihood function is:
Where n is the sample size, y is the number of successes, and is the probability of a success. The first derivative of the log-likelihood function is
and the second derivative of the log-likelihood function is
Analytically, if we let n=10, y=5, we get the MLE is =0.5.
Let’s see how the Newton Raphson algorithm works here. We begin by setting a tolerance level. In this case, we set TOL=0.01. Then we make an initial guess (denoted ) as to the MLE. Suppose =0.55. |Y) which is larger in absolute value than our tolerance of 0.01. Thus we set
Now we calculate which is still larger in absolute value than our tolerance of 0.01. Thus we set
is approximately equal to 0.0012 which is smaller in absolute value than our tolerance of 0.01 so we can stop. The Newton Raphson algorithm here returns a value of
=0.5335892 which is reasonably close to the analytical value of 0.5.
Note if we want to make the Newton Raphson procedure more accurate (within machine precision), we can set the tolerance closer to 0.
The R implementation of NR for solving MLE for Binomial Sampling Modeling
#program for MLE
#program to use Newton-Raphson method to solve MLE
of a Binomial Sampling Sampling Model
#Define LLF log-likelihood function l(theta/y)=yln(theta)+(n-y)ln(1-theta)
pltt<-function (tet) {
tet2<-log(tet)
tet1<-log(1-tet)
pltt<- y * tet2 + (n-y) * tet1
pltt
}
#Define derivative function for f(tet)
dpltt<-deriv((~y*log(tet) +(n-y)*log(1-tet)),"tet", function (tet){})
#Define function(tet) to perform iteration
itter<-function ( tets ) {
tet1<- tets
tol<- 0.1
count<-0
Resp <- "Function Converged"
while ( tol > 0.0001) {
dplt1<-dpltt(tet1)
dplt2<- attr(dplt1, "gradient")[1,]
tnew<-tet1- dpltt(tet1) / dplt2
tol <- abs(tet1-tnew)
tet2<- tet1; tet1 <- tnew
count<- count +1
if ( count > 10) {
tet1<- ( tet2+ tnew)/2
count<-count +1
}
else if ( count ==30){
Resp <- "No Convergence"
tol <- 0.0001
}
}
list ( tnew= tnew, Resp = Resp )
}
#Main Program
tet0<-0.55
xobs <- c(1,0,0,1,0)
y<-sum(xobs)
n <- length(xobs)
tet3<-itter(tet0)
t.rep<-tet3$Resp
if(t.rep=="Function Converged"){
cat("\n Parameter Estimate=",tet3$tnew )}
cat("\n",t.rep,"\n")
}
Parameter Estimate = 0.5335892> cat ("\n ", t.rep, "\n")
Function Converged
> tet0
[1] 0.55
> tet3
$tnew
[1] 0.5335892
attr(,"gradient")
tet
[1,] -505.0362
$Resp
[1] "Function Converged"
The Newton Raphson Algorithm for solving g(x)=0 of a k-dimensional variable
-Mathematical idea of NR
The multivariate Newton-Raphson(NR) method of solving an equation g(x)=0, where g is a smooth(k-vector-valued) function of a k-dimensional vector variable x whose Jacobian matrix
never vanishes, is to write and implement an equation saying that the linear (first-order Taylor series) approximation about x to the function at an updated variable value x’ precisely 0, i.e.
, or
The key application of this idea, which we make in computational statistics, is, for a fixed dataset X, to
The Newton-Raphson computational algorithm, which we code below in R-both from first principles and by using existing standard functions, is to begin with some initial value and then iteratively for m= 0,1… define
repeatedly until some termination-criterion is met, usually that either m is equal to a fixed large number (like 30) or || xx|| falls below a fixed tolerance (like 10.
Example: Hereis a simple R function that numerically approximates gradients, is needed only when we do not have a function implementing an analytical formula for the gradient.
Gradmat<-
function(parvec, infcn, eps = 1e-06)
{
# Function to calculate the difference-quotient approx gradient
# (matrix) of an arbitrary input (vector) function infcn
# Now recoded to use central differences !
dd <- length(parvec)
aa <- length(infcn(parvec))
epsmat <- (diag(dd) * eps)/2
gmat <- array(0, dim = c(aa, dd))
for(i in 1:dd)
gmat[, i] <- (infcn(parvec + epsmat[, i]) - infcn(parvec -
epsmat[, i]))/eps
if(aa > 1)
gmat
else c(gmat)
}
uniroot
function(inipar, infcn, nmax = 25, stoptol = 1e-05,
eps = 1e-06, gradfunc = NULL)
{
assign("Infcn", infcn, frame = 0)
assign("Eps", eps, frame = 0)
if(is.null(gradfunc))
gradfunc <- function(x)
Gradmat(x, Infcn, Eps)
ctr <- 0
newpar <- inipar
oldpar <- inipar - 1
while(ctr < nmax & sqrt(sum((newpar - oldpar)^2)) > stoptol) {
oldpar <- newpar
newpar <- oldpar - solve(gradfunc(oldpar), infcn(oldpar))
ctr <- ctr + 1
}
list(nstep = ctr, initial = inipar, final = newpar,
funcval = infcn(newpar))
}
Steepest Descent Method for minimizing the negative log-likelihood function
Our most frequent statistical objective in using the “NRRoot” finder for the gradient log-likelihood is to minimize the negative log-likelihood function. Here we introduce another simple method of numerical optimization called Steepest Descent Method, which can often served as an initialization-stage for a Newton-Raphson method, which must be coded ‘by hand’.
The idea of this method is simply to move an initial guess x to an improved one x` by making a step in the direction of the negative gradient. This method improves the objective-function value as long as the step-size is small enough and positive.
Here is a simple implementation as follows.
GradSrch<-
function(inipar, infcn, step, nmax = 25, stoptol = 1e-05,
unitfac = F, eps = 1e-06, gradfunc = NULL)
{
# Function to implement Steepest-descent. The unitfac condition
# indicates whether or not the supplied step-length factor(s) multiply
# the negative gradient itself, or the unit vector in the same direction.
assign("Infcn", infcn, frame = 0)
assign("Eps", eps, frame = 0)
if(is.null(gradfunc))
gradfunc <- function(x)
Gradmat(x, Infcn, Eps)
steps <- if(length(step) > 1) step else rep(step, nmax)
newpar <- inipar
oldpar <- newpar - 1
ctr <- 0
while(ctr < nmax & sqrt(sum((newpar - oldpar)^2)) > stoptol) {
ctr <- ctr + 1
oldpar <- newpar
newstep <- gradfunc(oldpar)
newstep <- if(unitfac) newstep/sum(newstep^2) else newstep
newpar <- oldpar - steps[ctr] * newstep
}
list(nstep = ctr, initial = inipar, final = newpar,
funcval = infcn(newpar))
}
Example: Maximizing logistic & Probit log-likelihoods by Steepest -Descent
Probit analysis along with logistic regression are the two major methods for analyzing binary response data. The term "probit' was coined in the 1930's by Chester Bliss and stands for probability unit. These two analyses, logit and probit, are very similar to one another. Logit analysis is based on log odds while probit uses the cumulative normal probability distribution.
The probit model is defined as Pr(y=1|x) = Φ(xb) where Φ is the standard cumulative normal probability distribution and xb is called the probit score or index.
First we create logistic and probit –regression data with (the same set of four independent binary regressors, with coefficients 0.5,0.4, -0.3,0.7 and intercept -2.
bc <- c(0.5,0.4,-0.3,0.7)
> matcov <- matrix(rbinom(800,1,0.5),ncol=4)
> respLgst <- rbinom(200,1,plogis(-2 + c(matcov %*% bc)))
> respPrbt <- rbinom(200,1,pnorm(-2 + c(matcov %*% bc)))
> matcov
Next construct function to calculate both Logistic and Probit log- likelihoods.
> binregLik
function(b0, a0, covmat, yresp, dist = plogis)
{
pvec <- dist(c(covmat %*% b0) + a0)
sum(log(ifelse(yresp == 1, pvec, 1 - pvec)))
}
> binregLik(bc,-2,matcov,respLgst, dist=plogis)
[1] -105.1509
> binregLik(bc,-2,matcov,respLgst, dist=pnorm)
[1] -120.7362
> binregLik(bc,-2,matcov,respPrbt, dist=pnorm)
[1] -67.61083
> binregLik(bc,-2,matcov,respPrbt, dist=plogis)
[1] -75.05095
Now we define the functions we will use in doing Logistic or Probit Regression
maximizations. For simplicity, we begin by maximization over only the ¯rst
two regression parameters, treating the intercept and the other regression,
parameters as known.
> tempfunc1 <- function(bb)
-binregLik(c(bb,-.3,.7),-2,matcov,respLgst)
> tempfunc2 <- function(bb)
-binregLik(c(bb,-.3,.7),-2,matcov,respPrbt, dist=pnorm)
>c(Gradmat(c(.3,.3),tempfunc1))
[1] -8.645333 -7.061996
> c(Gradmat(c(.2,.6),tempfunc2))
[1] -11.653346 -5.858309
Now we consider the estimation by Steepest Descents (with all steps equal to –0.05 multiplied by the gradient) and New-Raphson.
> btmp <- c(0.2,0.2)
> tempfunc1(c(0.2,0.2))
[1] 108.8181
> for (i in 1:10) { btmp <- btmp - 0.05*
Gradmat(btmp,tempfunc1)
cat(round(c(btmp, tempfunc1(btmp)), digits=5)," \n") }
0.74331 0.6776 105.0981
0.60541 0.42907 104.7035
0.7286 0.51856 104.5731
0.69225 0.4514 104.5422
0.72549 0.47351 104.5333
0.71621 0.45524 104.5310
0.72527 0.46079 104.5303
0.72296 0.45582 104.5302
0.72544 0.4572 104.5301
0.72488 0.45584 104.5301
> unlist(GradSrch(c(0.2,0.2), tempfunc1, 0.05))
nstep initial1 initial2 final1 final2 funcval
16 0.2 0.2 0.72488 0.45584 104.5301
> unlist(GradSrch(c(0.2,0.2), tempfunc1, 0.05, unitfac=T))
nstep initial1 initial2 final1 final2 funcval
25 0.2 0.2 0.69225 0.4514 104.5422
Now we concerning the last model-fit, in the last-fitted model, we can examine the incremental deviances due to successively added model terms as follows:
> anova(tmpglm)
Analysis of Deviance Table, Binomial model
Response: cbind(rsp, 1 - rsp)
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev
NULL 200 277.2589
v1 1 24.30848 199 252.9504
v2 1 5.96717 198 246.9832
v3 1 27.07522 197 219.9080
v4 1 0.06075 196 219.8473
Int 1 48.84046 195 171.0068
This says in particular that the intercept coeffcient is highly significant: p-value = 1-pchisq (48.4, 1) = 3e-12. The correct intercept value, we know, is - 2, while the estimated value is – 2.899, and the other coefficients are also different from the true ones! However:
> - binregLik(tmpglm$coef[1:4], tmpglm$coef[5],matcov,respLgst)
[1] 85.5034
> - binregLik(c(NRroot(rep(0,4), function(bb) t(Gradmat(bb, function(bb)
-binregLik(bb,-2,matcov,respLgst,dist=plogis))) )$final),-2,
matcov,respLgst)
[1] 87.32924
So the likelihood ratio statistic () for the difference of the Intercept coefficient from - 2 with these data is 2* (87.329, 85.503)= 3.472, which give p-value 0.062. Thus the difference between the fitted Intercept and - 2, which looked striking, is nevertheless not significant.
Conclusion
Numerical Method has been widely used for function Optimization in Statistics. They are very fast for well-behaved functions. But there are some particular situations we might found some methods no longer making rapid progress during the iterations. So we can use more than one method to calculate during the real problem, Such as if we use steep-decent method to begin a
Estimation of a likelihood maximization problem, when the steepest-descent steps are no longer making rapid progress, we can switch over to Newton-Raphson iterations for more rapid final stages of convergence to a minimum.
Reference
The Newton Raphson algorithm for function optimization
Optimization by Vector Space Methods (Series in Decision and Control)by David G. Luenberger January 25, 1997. ISBN:047118117X published by John Wiley & Sons, Inc. New York, NY, USA
[1] Maximum Likelihood Estimation
