Optimization of some target function is of great relevance in many fields, including finance (portfolio optimization), engineering (minimizing air resistance), and statistics (likelihood maximization). Often, the optimization problem cannot be solved analytically, for example when explicit formulas for gradient or Hessian are unknown. In these cases, numerical optimization algorithms are helpful. They iteratively explore the parameter space, guaranteeing to improve the function value over each iteration, and eventually converge to a point where no more improvements can be made (Bonnans et al. 2006).

In R, several functions are available that can be applied to
numerically solve optimization problems, including (quasi) Newton
(`stats::nlm()`

, `stats::nlminb()`

,
`stats::optim()`

), direct search
(`pracma::nelder_mead()`

), and conjugate gradient methods
(`Rcgmin::Rcgmin()`

). The CRAN Task View:
Optimization and Mathematical Programming provides a comprehensive
list of packages for solving optimization problems. All these functions
share the requirement that initial parameters values must be supplied
from where the optimization is started.

Optimization theory (Nocedal and Wright 2006) shows that the choice of an initial point has a large influence on the convergence time and rate. In general, starting in areas of function saturation increases computation time, starting in areas of non-concavity may lead to convergence problems or convergence to local rather than global optima. Therefore, numerical optimization can be facilitated by putting effort on identifying good starting values. However, it is generally unclear what plausible initial values are and how they might affect the optimization problem at hand.

The purpose of the {ino} R package is therefore threefold: to provide a comprehensive toolbox for

evaluating the effect of the initial values on the optimization,

comparing different initialization strategies,

and comparing different optimizer.

- The starting point for working with {ino} is the
`setup_ino()`

function to specify the optimization problem, including the numerical optimizer. - Subsequently, the following functions can be applied to execute
different initialization strategies:
`fixed_initialization()`

to optimize with fixed initial starting values`random_initialization()`

to optimize with randomly chosen starting values`standardize_initialization()`

to optimize after standardizing the optimization problem`subset_initialization()`

to start optimization on a sub-problem

- For evaluating the results,
`overview_optima()`

prints an overview of all local and global optima, and the`plot()`

and`summary()`

methods summarize the optimization runs.

We demonstrate the basic {ino} workflow in the context of likelihood
maximization, where we fit a two-class Gaussian mixture model to to the
eruption times in the popular `faithful`

data set that is
provided via base R.

Remark: Optimization in this example is very fast. This is because the data set is relatively small and we consider a model with only two classes. Therefore, it might not seem relevant to be concerned about initialization here. However, the problem scales: optimization time will rise with more data and more parameters, in which case initialization becomes a greater issue, see for example Shireman, Steinley, and Brusco (2017).

The `faithful`

data set contains information about the
eruption times of the Old Faithful geyser in Yellowstone National Park,
Wyoming, USA. The following histogram indicates two clusters with short
and long eruption times, respectively:

```
library("ggplot2")
data("faithful")
str(faithful)
#> 'data.frame': 272 obs. of 2 variables:
#> $ eruptions: num 3.6 1.8 3.33 2.28 4.53 ...
#> $ waiting : num 79 54 74 62 85 55 88 85 51 85 ...
ggplot(faithful, aes(x = eruptions, y = ..density..)) +
geom_histogram(bins = 30) +
xlab("Eruption time (min)")
```

For both clusters, we assume a normal distribution here, such that we consider a mixture of two Gaussian densities for modeling the overall eruption times. The log-likelihood is given by

\[ \ell(\boldsymbol{\theta}) = \sum_{i=1}^n \log\Big( \pi f_{\mu_1, \sigma_1^2}(x_i) + (1-\pi)f_{\mu_2,\sigma_2^2} (x_i) \Big), \] where \(f_{\mu_1, \sigma_1^2}\) and \(f_{\mu_2, \sigma_2^2}\) denote the normal density for the first and second cluster, respectively, and \(\pi\) is the mixing proportion. The vector of parameters to be estimated is thus \(\boldsymbol{\theta} = (\mu_1, \mu_2, \sigma_1, \sigma_2, \pi)\). As there exists no closed-form solution for the maximum likelihood estimator, we need numerical optimization for finding the function optimum, where the {ino} package will help us in applying different initialization strategies.

The following function implements the log-likelihood of the normal
mixture. Note that we restrict the standard deviations
`sigma`

to be positive and `pi`

to be between 0
and 1, and that the function returns the negative log-likelihood
value.

```
<- function(theta, data, column){
normal_mixture_llk <- theta[1:2]
mu <- exp(theta[3:4])
sigma <- plogis(theta[5])
pi <- sum(log(pi * dnorm(data[[column]], mu[1], sigma[1]) +
llk 1 - pi) * dnorm(data[[column]], mu[2], sigma[2])))
(return(-llk)
}
```

The optimization problem is specified using the function
`setup_ino`

. Here,

`f`

constitutes the function to be optimized (i.e.,`normal_mixture_llk`

),`npar`

specifies the number of parameters (i.e., the length of the parameter vector),`data`

gives the data frame of observations as required by our likelihood function,`column = "eruptions"`

is an additional function argument,- and
`opt`

sets the numerical optimizer.

Numerical optimizer must be specified through the framework provided by the {optimizeR} package. It is possible to define a list of multiple optimizer for comparison.

```
<- setup_ino(
geyser_ino f = normal_mixture_llk,
npar = 5,
data = faithful,
column = "eruptions",
opt = set_optimizer_nlm()
)
geyser_ino#> Function to be optimized
#> f: normal_mixture_llk
#> npar: 5
#>
#> Numerical optimizer
#> 'stats::nlm': <optimizer 'stats::nlm'>
#>
#> Optimization runs
#> Records: 0
```

Behind the scenes, `setup_ino`

runs several input checks
that can be specified via the `test_par`

argument.
Subsequently, it returns an `ino`

object that can be passed
to the other {ino} functionalities.

We will first consider fixed starting values for the likelihood
optimization, i.e., we make “educated guesses” about starting values
that are probably close to the global optimum. Based on the histogram
above, the means of the two normal distributions may be somewhere around
2 and 4. For the variances, we set the starting values to 1 (note that
we use the log transformation here since we restrict the standard
deviations to be positive by using `exp()`

in the
log-likelihood function).

We will use two sets of starting values where the means are lower and larger than 2 and 4, respectively. For both sets, the starting value for the mixing proportion is 0.5.

```
<- list(c(1.7, 4.3, log(1), log(1), qlogis(0.5)),
starting_values c(2.3, 3.7, log(1), log(1), qlogis(0.5)))
```

For comparison, we also consider a third set of starting values which are somewhat unplausible.

`3]] <- c(10, 8, log(0.1), log(0.2), qlogis(0.5)) starting_values[[`

The function provided by {ino} to run the optimization with chosen
starting values is `fixed_initialization()`

. We loop over the
set of starting values by passing them to the argument
`at`

.

```
for(val in starting_values)
<- fixed_initialization(geyser_ino, at = val) geyser_ino
```

Let’s look at the results:

```
summary(geyser_ino)
#> # A tibble: 3 × 4
#> .strategy .time .optimum .optimizer
#> <chr> <drtn> <dbl> <chr>
#> 1 fixed 0.011204004 secs 276. stats::nlm
#> 2 fixed 0.009703875 secs 276. stats::nlm
#> 3 fixed 0.015278101 secs 421. stats::nlm
```

The `summary()`

method returns an overview of the three
optimization runs. The first column names the initialization strategy,
the second column gives the optimization time, the third column the
function value at the optimum, and the fourth column an identifier for
the optimizer.

The third run (with the unplausible starting values) converged to a local minimum, as we can deduce from the column “.optimum”. We discard this run from further comparisons:

`<- clear_ino(geyser_ino, which = 3) geyser_ino `

The `geyser_ino`

object saved more information than those
provided by the `summary()`

method. The
`var_names()`

function returns the names of all available
variables:

```
var_names(geyser_ino)
#> [1] ".fail" ".strategy" ".time" ".optimum" ".init"
#> [6] ".estimate" ".optimizer" "data" "column" "gradient"
#> [11] "code" "iterations"
```

Variable names starting with a “.” are the variables that {ino} provides per default:

`.fail`

, the error message of a failed optimization run (if any),`.strategy`

, the label for the initialization strategy,`.time`

, the optimization time,`.optimum`

, the function value at the estimated optimum,`.init`

, the initial parameter vector,`.estimate`

, the estimated (optimal) parameter vector,`.optimizer`

, the label for the optimizer.

The other variables come either from the function or the optimizer.
For example, “iterations” is the number of iterations of the
`stats::nlm()`

optimizer. The value can be accessed via the
`get_vars()`

function:

```
get_vars(geyser_ino, runs = 1:2, vars = "iterations")
#> [[1]]
#> [[1]]$iterations
#> [1] 17
#>
#>
#> [[2]]
#> [[2]]$iterations
#> [1] 16
```

When using randomly chosen starting values, we apply the function
`random_initialization()`

. The most simple function call
would be as follows:

`<- random_initialization(geyser_ino, runs = 10) geyser_ino `

Here, starting values are randomly drawn from a standard normal distribution. Depending on the application and the magnitude of the parameters to be estimated, this may not be a good guess. We can, however, easily modify the distribution that is used to draw the random numbers. For example, the next code snippet uses starting values drawn from a \(\mathcal{N}(2, 0.5)\):

```
<- function() stats::rnorm(5, mean = 2, sd = 0.5)
sampler <- random_initialization(
geyser_ino runs = 10, sampler = sampler, label = "random_cs"
geyser_ino, )
```

The argument `sampler`

allows to use any random number
generator, while further arguments for the sampler can easily be
added.

The `overview_optime()`

function provides an overview of
the identified local and global optima so far:

```
overview_optima(geyser_ino)
#> optimum frequency
#> 1 276.36 16
#> 2 421.42 6
```

For standardized initialization, we standardize all columns in our data before running the optimization. Specifically, after standardizing the data, we can again select to use either fixed or randomly chosen starting values. In the example code snippet shown below, we consider ten sets of randomly selected starting values:

```
<- standardize_initialization(
geyser_ino initialization = random_initialization(runs = 10),
geyser_ino, label = "standardize"
)
```

If the data set considered shows some complex structures or is very large, numerical optimization may become computationally costly. In such cases, subset initialization can be a helpful tool.

The function `subset_initialization`

first optimizes the
function on a data subset of proportion `prop`

, and
subsequently optimizes on the full data set using the estimates from the
first step as initial values. For example, the following code snippet
randomly samples 25% of the observations and uses random
initialization:

```
<- subset_initialization(
geyser_ino how = "random", prop = 0.25,
geyser_ino, initialization = random_initialization(runs = 10),
label = "subset"
)
```

In addition to selecting subsamples at random, we can specify two
further options using the argument `how`

. When dealing with
time series data, we usually do not want to delete single observations
at random. Instead, we can select a proportion of the first rows by
specifying `how = "first"`

. We could also cluster our data
first using `how = "kmeans"`

.

The `plot()`

method provides a boxplot comparison of the
optimization times. We can compare the optimization times across
initialization strategies by specifying `by = .strategy`

:

`plot(geyser_ino, by = ".strategy", nrow = 1)`

Bonnans, Joseph-Frédéric, Jean Charles Gilbert, Claude Lemaréchal, and
Claudia A Sagastizábal. 2006. *Numerical Optimization: Theoretical
and Practical Aspects*. Springer Science & Business Media.

Nocedal, Jorge, and Stephen J Wright. 2006. “Quadratic
Programming.” *Numerical Optimization*, 448–92.

Shireman, Emilie, Douglas Steinley, and Michael J Brusco. 2017.
“Examining the Effect of Initialization Strategies on the
Performance of Gaussian Mixture Modeling.” *Behavior Research
Methods* 49 (1): 282–93.