# Emulating the Canadian climate model CanAM4

```{r, echo=FALSE}
knitr::opts_chunk$set(fig.path='figure/emulate-canam4/', 
                      fig.height=6, fig.width=10,
                      cache.path='_knitr_cache/emulate-canam4/')
set.seed(14112017)
```

In this example we use the well-known R-package [`DiceKriging`](https://cran.r-project.org/web/packages/DiceKriging/index.html) to emulate output from the Canadian climate model CanAM4.


First we load the necessary packages, and the data file `globalCanAM4W1.RData`. The emulator inputs are saved in columns 2 to 14 of the object `tData`:

```{r}
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(DiceKriging))

load('../Demonstrations/globalCanAM4W1.RData')

design = tData[, 2:14] %>% as_data_frame
design
```

# DiceKriging emulator for BALX

## Build the emulator


We build the emulator for the output variable `BALX`.
We use the built-in R function `step` to choose the emulator mean function by stepwise regression.
`step` returns an object of class `lm`, whose list element `terms` contains the R formula for the mean function.
That formula is passed to the `DiceKriging` function `km`, along with the design matrix and the response, to fit the Gaussian process emulator.
See `?km` for options to control the emulator; here we only specify to use a Matern covariance matrix (with `nu=5/2`) and let all parameters be estimated automatically:


```{r emulate-balx}
# define response
response = tData$BALX

# stepwise regression to fit trend model
em_lm = step(lm(response ~ ., data=design), 
             direction='both', trace=0)

# fit GP using DiceKriging function `km`
em_km = km(formula  = em_lm$terms, 
           design   = design, 
           response = response, 
           covtype  = 'matern5_2',
           control  = list(trace=FALSE))
```

The object `em_km` contains the emulator.
We can use `predict` to evaluate the emulator at new inputs, for example at the mean of all input points:

```{r}
pred = predict(em_km, newdata=map_df(design, mean), type='UK')
pred[c('mean', 'sd', 'lower95', 'upper95')] %>% as.data.frame
```


## Validate emulator

The emulator can be validated by leave-one-out cross validation, using the `DiceKriging` function `leaveOneOut.km`:

```{r}
em_km_loo = leaveOneOut.km(em_km, type='UK') %>% as_data_frame
em_km_loo 
```

`em_km_loo` is a list with two vectors `mean` and `sd` of equal length, which is why it can be transformed to a data frame.
To plot the cross validation results in a faceted panel plot with `ggplot2`, we have to set up the appropriate data frame with columns for the response, and lower and upper limit of the prediction interval.
We also add a boolean variable `inside` and a corresponding `colr` variable to change the colour of the intervals if the response falls outside.
The variable `pit` is the probability integral transform, which should be uniformly distributed if the emulator is reliable.
At last we append the design matrix, and use `gather` to turn the data frame from wide into long form appropriate for facet plotting:


```{r}
# dataframe for plotting
plot_df = em_km_loo %>% 
  mutate(response = response,
         index    = seq_along(response),
         lwr      = mean - 2 * sd, 
         upr      = mean + 2 * sd,
         pit      = pnorm(response, mean, sd),
         inside   = (response > lwr & response < upr),
         colr     = ifelse(inside, 'deepskyblue3', 'darkorange3')) %>%
  bind_cols(design) %>%
  gather('variable', 'value', names(design))
```

The following plots the emulator output and the actual value of the response over the value of the input in a separate panel for each input.

```{r plot-balx-loo, fig.height=15}
ggplot(plot_df, aes(x=value)) + facet_wrap( ~ variable, ncol=2) +
  geom_point(aes(y=mean), colour=plot_df$colr) + 
  geom_segment(aes(xend=value, y=lwr, yend=upr), colour=plot_df$colr) +
  geom_point(aes(y=response), cex=1.5) 
```

The histogram over the PIT values should be approximately flat:

```{r plot-balx-pit}
ggplot(plot_df, aes(x=pit)) + 
  geom_histogram(breaks=seq(0,1,0.2), colour='black', fill='deepskyblue3')
```


# Repeat everyting with PCP as a response

## Build emulator

```{r}
response = tData$PCP
em_lm = step(lm(response ~ ., data=design), direction='both', trace=0)
em_km = km(formula=em_lm$terms, design=design, response=response, covtype='matern5_2', control=list(trace=FALSE))
```

## Validate emulator

```{r}
plot_df = leaveOneOut.km(em_km, type='UK') %>% as_data_frame %>% mutate(response = response, index = seq_along(response), lwr = mean - 2 * sd, upr = mean + 2 * sd, pit = pnorm(response, mean, sd), inside = (response > lwr & response < upr), colr = ifelse(inside, 'deepskyblue3', 'darkorange3')) %>% bind_cols(design) %>% gather('variable', 'value', names(design))
```

```{r plot-pcp-loo, fig.height=15}
ggplot(plot_df, aes(x=value)) + facet_wrap(~variable, ncol=2) + geom_point(aes(y=mean), colour=plot_df$colr) + geom_segment(aes(xend=value, y=lwr, yend=upr), colour=plot_df$colr) + geom_point(aes(y=response), cex=1.5) 
```

```{r plot-pcp-pit}
ggplot(plot_df, aes(x=pit)) + geom_histogram(breaks=seq(0,1,0.2), colour='black', fill='deepskyblue3')
```