--- title: "Core Emulation in ExeUQ package" author: "Daniel Williamson" date: "28/11/2017" output: html_document --- ```{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE, warning = FALSE) ``` # Building emulators: the precursor to calibration This vignette aims to demonstrate the functionality of the emulators we have currently, highlighting the key steps and potentially highlighting things that need work. ## Core steps and functionality (currently) with CanAM4 We begin by loading the key functions in the code and the CanAM4 data. ```{r} twd <- getwd() source('BuildEmulator/BuildEmulator.R') source('HistoryMatching/HistoryMatching.R') load("Demonstrations/globalCanAM4W1.RData") head(tData) ``` The first step is to set the data up so that a mean function can be built. Any mean function can be used, but we provide an advanced stepwise algorithm that has proved useful in many previous applications to climate models. We first generate a "Noise" variable and add it to the data. "Noise" is an insurance against over fitting. Our algorithm will try to add many different functions of combinations of inputs to our mean function. If anything involving "Noise" is added, we stop adding terms as we believe this indicates overfitting. ```{r} Noise <- rnorm(dim(tData)[1],0,0.5) tData <- cbind(tData,Noise) ``` We then tell the code which variables in the data frame (which we require to be named 'tData') are model inputs and we add "Noise" to this set. ```{r} cands <- names(tData)[2:14] cands <- c(cands, "Noise") ``` We then use the function EMULATE.lm to fit a mean function to pass to the emulator object. We should allow the user to specify their own mean function (via formula or lm-type object) or to use our algorithm. ```{r} myem.lm.canPCP <- EMULATE.lm(Response="PCP", tData=tData, tcands=cands, tcanfacs=NULL, TryFouriers=TRUE, maxOrder=2, maxdf = 20) ``` DW to edit this function to remove requirement for canfacs and to explain output object. ```{r} names(myem.lm.canPCP) summary(myem.lm.canPCP$linModel) par(mfrow=c(2,2),mar=c(4,4,1,1)) plot(myem.lm.canPCP$linModel) ``` Are all of the items in this list required by the core emulator code? What is required? Can the object be simplified? Should it be? How will we extract what we require from an alternative mean function? ### Build the emulator object An emulator requires a mean function, a form for the covariance function, a method, and priors for parameters. The Emulator object contains everything that we can pre-compute that is required for prediction. This includes, for our fully Bayesian emulators, a set of posterior samples for the parameters. The EMULATE functions build emulators. Below is an example using the gpstan method (Gaussian process fitted in Stan). - The mean function is our linear model. - Currently we specify a nugget (and there are alternative methods coded up), we need to remove or default this requirement. - We pass the tData object. - additionalVariables asks which inputs, in addition to those fitted in the meanResponse, are "Active" - CreatePredictObject was an attempt to compile the stan file for prediction, but this didn't work - FastVersion builds all of the required covariance matrices/cholesky factors at the MAP estimate of the parameters so that fast prediction and diagnostics are possible. 
```{r}
CanadaEmulator <- EMULATE.gpstan(meanResponse = myem.lm.canPCP, nugget = 0.0001,
                                 tData = tData, additionalVariables = NULL,
                                 CreatePredictObject = FALSE, FastVersion = TRUE)
```

Predicting with an emulator requires a predict-type function. In our code, the EMULATOR functions call emulators (for prediction, diagnostics or otherwise). Here we call a gpstan emulator.

```{r}
validation <- EMULATOR.gpstan(x = tData[1:10, ], Emulator = CanadaEmulator)
validation
```

Calls to the EMULATOR functions sit inside our history matching functions (see UniImplausibility, for example).

# Speed

One of our main issues is the speed of the functions that call Stan. For example,

```{r}
tLOOs <- stanLOOplot(StanEmulator = CanadaEmulator, ParamNames = CanadaEmulator$Names)
```

takes a (relatively) very long time to run. Run time seems to increase exponentially with ensemble size, and this is one of the main issues we have. It is mainly a problem with Stan, but it means we should write the code so that large ensembles are never passed to Stan unless the user deliberately works around our default (safety?) settings.
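One possible shape for such a default is sketched below. This is only a sketch: checkStanDesignSize, maxStanDesignSize, force and the threshold of 400 are hypothetical names and values, not part of the current code. The idea is simply that any function which hands a design to Stan should refuse (or at least warn) when the design is large, unless the user explicitly overrides it.

```{r}
## Hypothetical safeguard, not current ExeUQ behaviour: refuse to pass very
## large designs to Stan unless the user explicitly overrides the default.
checkStanDesignSize <- function(tData, maxStanDesignSize = 400, force = FALSE) {
  n <- nrow(tData)
  if (n > maxStanDesignSize && !force) {
    stop("Design has ", n, " rows; Stan run times grow quickly with ensemble size. ",
         "Thin the design or call again with force = TRUE to proceed anyway.")
  }
  invisible(n)
}
```

Whether this should be an error or just a warning, and what a sensible default threshold would be, are design choices still to settle; the point is that the default path should never silently hand a large ensemble to Stan.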