--- title: "Emulation Development with ExeUQ Package" author: "Daniel Williamson and Victoria Volodina" date: "03/05/2018" output: html_document --- ```{r setup, echo=FALSE} knitr::opts_knit$set(root.dir = '~/Documents/ExeterUQ', echo = TRUE, warning = FALSE, fig.height=6, fig.width=10 ) set.seed(14112017) ``` > This vignette aims to present functionality of ExeUQ Package: estimation, prediction and validation for stationary and nonstationary GP models on the example of CNRM-CM5. ## Core steps and functionality (currently) with CNRM-CM5 We begin by loading the key functions and compiling the Stan code, that can be loaded for sampling. ```{r results = 'hide'} twd <- getwd() source('BuildEmulator/BuildEmulator.R') ``` We proceed to loading the CNRM-CM5 data and plotting the **SCM.data** (Average Potential Temperature) against the inputs. In our data frame we we have 13 different input variables, however we are going to consider only 9 of them. ```{r} load('Demonstrations/cnrmWave1.RData') head(tData) cands = c(names(tData)[-c(10:14)]) par(mfrow = c(3, 3), mar = c(4, 4, 2, 1)) for(i in 1:(length(cands)-1)) plot(tData[, cands[i]], tData$SCMdata, pch = 20, xlab = cands[i], ylab = 'Pot Temp Average') ``` ### Mean function for emulator object ### The first step is to construct a mean function. The complicated mean function generated by **EMULATE.lm** function is easily replaced with a linear mean function. In fact, you could specify any form of the mean function manually. 
```{r message=FALSE, warning=FALSE, results = 'hide'}
tData.valid <- tData[121:160, ] # used to validate emulator performance
tData <- tData[1:120, ]         # used to train the emulator
myem.lm.cnrm <- EMULATE.lm(Response="SCMdata", tData=tData, tcands=cands, tcanfacs=NULL, TryFouriers=TRUE, maxOrder=4, maxdf=30)
myem.lm.cnrm$linModel <- lm(SCMdata ~ ., data = tData[, c(cands[-10], "SCMdata")])
myem.lm.cnrm$Names <- cands[-10]
```

### Build the emulator object ###

The emulator requires a mean function, a form for the covariance function, a method, and priors for the parameters. The **EMULATE.gpstan** function constructs the emulator in a fully Bayesian way, and the resulting object contains the posterior samples of the parameters. The *meanResponse* argument is the linear mean function computed previously; *additionalVariables* specifies which inputs, in addition to those fitted in *meanResponse*, are active, i.e. have an effect on the model response. *FastVersion* builds all of the required covariance matrices and Cholesky factors at the MAP (maximum a posteriori) estimate, so that fast predictions and diagnostics are possible.

```{r}
myem.gp.cnrm1 = EMULATE.gpstan(meanResponse=myem.lm.cnrm, tData=tData, additionalVariables=cands[-10], FastVersion = TRUE)
```

### Diagnostics for the emulator object ###

We can easily produce leave-one-out (LOO) diagnostic plots against each of the model inputs. The predictions and two-standard-deviation prediction intervals are in black. The model values are in green if they lie within the two-standard-deviation prediction intervals, and red otherwise.

```{r}
tLOOs <- LOO.plot(StanEmulator = myem.gp.cnrm1, ParamNames=cands[-10])
```

To predict with an emulator we use the function **EMULATOR.gpstan**; here we validate the predictions against the held-out data with **ValidationStan**:

```{r}
validation <- ValidationStan(tData.valid, myem.gp.cnrm1)
```

To identify problems with the emulator performance, we can examine the traditional standardized leave-one-out (LOO) diagnostics for the design data. We need to load the key functions first.
```{r results = 'hide'}
source('BuildEmulator/BuildEmulatorNSt.R')
```

We use the **CalStError** function, where *LOOs* is the output from the **LOO.plot** function and *true.y* contains the model values at the training (design) set. From the standardized-error plots we observe heteroscedasticity in response to the ALFX input, which clearly indicates the failure of our stationary (standard) GP emulator.

```{r}
std.err <- CalStError(LOOs = tLOOs, true.y = tData$SCMdata)
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))
for(i in 1:(length(cands)-1)) plot(tData[, cands[i]], std.err, pch = 20, xlab = cands[i], ylab = 'std. error')
```

We are interested in capturing the distinct regional behaviour visible in the standardized errors: they are close to zero for ALFX less than zero and grow rapidly as ALFX increases. We develop a mixture model that clusters the input points according to the values of the standardized errors. For each input point in *X*, we derive a vector of probabilities indicating the dominant local behaviour around that point. For this purpose we use the function **MIXTURE.design**, where *y.std* is the vector of standardized errors and *L* is the number of input regions of distinct model behaviour.
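Conceptually, this resembles fitting a two-component Gaussian mixture to the standardized errors and reading off the per-point membership probabilities. A minimal EM sketch on simulated errors (illustrative only; this is not the ExeterUQ implementation, which also uses the input locations in *X*):

```{r}
set.seed(1)
e <- c(rnorm(60, sd = 0.3), rnorm(60, sd = 2))  # simulated standardized errors
s <- c(0.5, 1.5)                                # initial component sds
p <- 0.5                                        # initial mixing weight
for (it in 1:200) {
  d1 <- p * dnorm(e, 0, s[1])
  d2 <- (1 - p) * dnorm(e, 0, s[2])
  z <- d1 / (d1 + d2)                 # E-step: P(low-variability component)
  p <- mean(z)                        # M-step: update weight and sds
  s[1] <- sqrt(sum(z * e^2) / sum(z))
  s[2] <- sqrt(sum((1 - z) * e^2) / sum(1 - z))
}
lambda <- cbind(z, 1 - z)             # analogue of myem.mixture$MixtureMat
```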
```{r}
myem.mixture <- MIXTURE.design(y.std = std.err, X = tData[, cands[-10]], L=2)
mixture.cnrm <- data.frame(cbind(std.err, myem.mixture$MixtureMat, tData[, cands[-10]]))
names(mixture.cnrm) <- c('std.err', 'red', 'blue', cands[-10])
head(mixture.cnrm)
ggplot(mixture.cnrm, aes(x = ALFX, y = std.err, col = rgb(red = red, blue = blue, green = 0), ymin = -3.5, ymax = 3.5)) + geom_point(size = 2) + scale_color_identity()
```

We produce a plot of the coloured standardized errors against ALFX, where deep blue corresponds to a higher probability of a point being allocated to region 1 (the region of high error variability) and deep red corresponds to a higher probability of allocation to region 2 (the region of low error variability).

We partition the input space into $L$ input regions and define a stationary covariance function for each region. For our standard GP emulator we have the covariance function
\[k(x, x';\sigma^{2}, \delta)= \sigma^{2}r(x, x'; \delta) + \tau^{2}\mathbb{1}\{x=x'\}.\]
For our nonstationary GP emulator we have the modified covariance function
\[ k(x, x'; \boldsymbol{\sigma}^2, \boldsymbol{\delta}) = \sum_{l=1}^L \lambda_l(x)\lambda_l(x')\sigma_l^2r_l(x, x';\delta_l) + \mathbb{1}\{x=x'\}\sum_{l=1}^Lz_l(x)z_l(x')\tau_l^2,\]
where $z_l(x) = \mathbb{1}(\lambda_l(x)=\max_k(\lambda_k(x)))$. This specification allows non-zero covariance between points from different regions. The function **MIXTURE.design** produces the matrix *myem.mixture$MixtureMat*, which contains $\lambda_l(x), l=1, \dots, L$ for the design matrix.

### Build the nonstationary emulator object ###

**EMULATE.gpstanNSt** takes exactly the same inputs as **EMULATE.gpstan**, except that it additionally requires the *mixtureComp* object computed previously.
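To make the modified covariance concrete, here is a direct R sketch of the formula above, assuming squared-exponential correlation functions $r_l$ and taking the mixture weights $\lambda_l(\cdot)$ as given (all names in this chunk are illustrative, not ExeterUQ internals):

```{r}
# Squared-exponential correlation with per-region correlation lengths delta.
r.sqexp <- function(x, xp, delta) exp(-sum(((x - xp) / delta)^2))

# Nonstationary covariance: weighted sum of regional kernels plus a nugget
# that is only added when x = x', using the dominant-region indicators z_l.
k.nst <- function(x, xp, lam.x, lam.xp, sigma2, delta, tau2) {
  L <- length(sigma2)
  cov <- sum(sapply(seq_len(L), function(l)
    lam.x[l] * lam.xp[l] * sigma2[l] * r.sqexp(x, xp, delta[[l]])))
  if (isTRUE(all(x == xp))) {
    z.x  <- seq_len(L) == which.max(lam.x)   # z_l(x)
    z.xp <- seq_len(L) == which.max(lam.xp)  # z_l(x')
    cov <- cov + sum(z.x * z.xp * tau2)
  }
  cov
}
```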
```{r}
myem.gp.cnrm.nst1 <- EMULATE.gpstanNSt(meanResponse=myem.lm.cnrm, tData=tData, additionalVariables=cands[-10], FastVersion = TRUE, mixtureComp = myem.mixture)
```

### Diagnostics for the nonstationary emulator object ###

We can produce the same leave-one-out diagnostic plots against each of the model inputs for the nonstationary GP emulator. The predictions and two-standard-deviation prediction intervals are in black. The model values are in green if they lie within the two-standard-deviation prediction intervals, and red otherwise.

```{r}
tLOOs.nst <- LOO.plot.NSt(StanEmulator = myem.gp.cnrm.nst1, ParamNames=cands[-10])
```

To predict with our nonstationary emulator we first compute the mixture components $\lambda_l(x), l=1, \dots, L$ for the validation points using the function **MIXTURE.predict**, and then call **ValidationStanNSt**:

```{r}
mixture.predict1 <- MIXTURE.predict(x = tData.valid[, cands[-10]], mixtureComp = myem.mixture)
validation.nst <- ValidationStanNSt(tData.valid, Emulator=myem.gp.cnrm.nst1, mixturePredict=mixture.predict1)
```

Let's compare the performance of the stationary GP emulator against the nonstationary GP emulator:

```{r}
par(mfrow = c(1, 2), mar = c(4, 4, 2, 1))
ValidPlot(validation, tData.valid[, cands[-10]], tData.valid$SCMdata, axis = 1, heading = "st-GP", xrange = c(-1, 1), interval = range(validation, validation.nst), ParamNames = names(tData.valid))
ValidPlot(validation.nst, tData.valid[, cands[-10]], tData.valid$SCMdata, axis = 1, heading = "nst-GP", xrange = c(-1, 1), interval = range(validation, validation.nst), ParamNames = names(tData.valid))
```
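Beyond the visual comparison, a simple numerical check is the root-mean-square error and the empirical coverage of the two-standard-deviation intervals on the validation set. A hedged sketch, assuming vectors of posterior predictive means and standard deviations have been extracted from each validation object (the internal structure of those objects is not shown here, so the vectors in the comment are placeholders):

```{r}
# Illustrative scoring helper, not an ExeterUQ function.
score <- function(y, pred.mean, pred.sd) {
  c(rmse = sqrt(mean((y - pred.mean)^2)),
    coverage = mean(abs(y - pred.mean) <= 2 * pred.sd))
}
# e.g. compare score(tData.valid$SCMdata, mu.st, sd.st) with
#      score(tData.valid$SCMdata, mu.nst, sd.nst)
```

A well-calibrated emulator should have coverage near 0.95 for two-standard-deviation intervals; the nonstationary emulator is expected to improve both scores where the stationary one fails.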