# Building a simple emulator

In this example, we emulate a simple 1d function.
We use the function `EMULATE.lm` to fit the mean function and the function `EMULATE.gpstan` to emulate the residuals.
Then we use `stanLOOplot` for to produce a diagnostic plot, `EMULATOR.gpstan` to predict new values of the simulator.
Some comments in the end might help to improve the code.


```{r}
suppressPackageStartupMessages(library(tidyverse))
knitr::opts_chunk$set(fig.path='figure/simple-emulator/', fig.height=4, 
                      cache.path='_knitr_cache/simple-emulator/')
```

## The simulator

We define the following function to be the simulator that we want to emulate:

```{r simulator-function}
simulator = function(x) {
  sin(2*x) * exp(-0.5 * x^2)
}
df_simulator = data_frame(x=seq(-5, 5, .01), y=simulator(x))
ggplot(df_simulator, aes(x=x, y=y)) + geom_line()
```

## The sample points

The preliminary step is that we need to identify the possible range of input `x`. 
In our example, a reasonable range is `[-3, 3]`. 
For emulation, we have to normalize our input `x` to `[-1, 1]`. 
We decided to choose 15 equidistant design points and evaluate our simulator at these design points in order to construct GP emulator.
We also added a `Noise` term to avoid overfitting our mean function component.

```{r design-points}
# define tData data frame
tData = data_frame(x = seq(-3, 3, length.out = 15), 
                   y = simulator(x),
                   Noise = rnorm(15, 0, 0.4)) %>%
        as.data.frame

# Normalize x to [-1, 1]
tData = mutate(tData, x = (x - min(x)) / (max(x) - min(x)) * 2 - 1)

# show and plot
head(tData)
ggplot(tData, aes(x = x, y=y)) + geom_point()
```


## Loading the code

The emulator code is currently located in the file `BuildEmulator/BuildEmulator.R`.

```{r}
old_wd = getwd() 
setwd('..')
source('BuildEmulator/BuildEmulator.R')
setwd(old_wd)
```


## Building the emulator

**Firstly**, we start with fitting the emulator mean function. 
The function `EMULATE.lm` uses the stepwise regression to find the best regressors for the mean function. 
The terms we consider for addition to the mean function are linear, quadratic, cubic, and Fouriers transformations with intercations up to third order between all parameters. 
(For this simple example we do not need such complicated terms, but in more complicated settings we might.)


```{r emulate-lm, cache=TRUE}
myem_lm = EMULATE.lm(Response='y', tData=tData, tcands='x', tcanfacs=NULL, maxOrder=2, maxdf = 2)
```

**Secondly**, we construct the Gaussian process emulator using the function `EMULATE.gpstan`. 
We have to specify the `nugget` term, which we set to 0.0001 for this particular example, to ensure numerical stability of the covariance matrix. 
Another important function argument is `FastVersion`: 
When the fully Bayesian analysis is too expensive, it allows us to fix parameters at their posterior mode values.

```{r emulate-gpstan, cache=TRUE}
myem_gp = EMULATE.gpstan(meanResponse=myem_lm, nugget=0.0001, tData=tData, 
                         additionalVariables='x', FastVersion=TRUE)
```

The calls to `EMULATE.lm` and `EMULATE.gpstan` are two main steps in constructing Bayesian Gaussian process emulator. 


## Validation of the emulator: Leave-one-out analysis

To validate our GP emulator we produce Leave One Out plots against inputs. 
The function `stanLOOplot` calculates leave-one-out posterior means at the design points: 

```{r plot-stan-loo, cache=TRUE}
tLOOs = stanLOOplot(StanEmulator = myem_gp, ParamNames='x')
```

Leave One Out plot against the input data.
The predictions and two standard deviation prediction intervals for the left out points are in black. 
The true values are in either green, if they are within two standard deviations of the prediction, or red otherwise.


## Emulator predictions

 
The function `EMULATOR.gpstan` allows to produce emulator predictions for unseen point. 
It returns a list with two vectors: `Expectation` and `Variance` at the prediction points. 
An important point is that `FastVersion = FALSE` uses the posterior samples to generate predictions at unseen points, while `FastVersion = TRUE` uses the parameters values fixed at the posterior mode. 


```{r emulator-gpstan, cache=TRUE}
new_data_x = data.frame(x = seq(-1, 1, len=200))
new_data_y = EMULATOR.gpstan(new_data_x, Emulator = myem_gp, FastVersion = FALSE)
```

```{r plot-predictions}
new_data = new_data_x %>% 
  bind_cols(as_data_frame(new_data_y)) %>%
  mutate(lower= Expectation - 2 * sqrt(Variance),
         upper= Expectation + 2 * sqrt(Variance))
ggplot(data=new_data, aes(x=x)) + 
  geom_ribbon(aes(ymin=lower, ymax=upper), fill='magenta') +
  geom_point(aes(y=Expectation), cex=.8)
```


## Comments:

- What exactly does the `Noise` term do in `EMULATE.lm`?
- Is normalisation of the inputs really necessary? Or would it be better to define the length scales for the covariance matrix of the GP?
- Can the stan warnings be suppressed? 
- The return value of `EMULATOR.gpstan` could be a data frame instead of a list, including the input values?
- The column names in `tLOOs` are `upper quartile` and `lower quartile` but the text description says "two standard deviation" -- Which one is it? (also variable names shouldn't contain spaces)
- I had problems using the `ValidPlot` function because the `x` values have to be provided unnormalised to the `simulator` function but in normalised form to the emulator. 
- In general, the code could be consolidated by making the input/output data format more consistent and more predictable. I would suggest using `data_frame`s from the `tidyverse`.