In this example we use the well-known R-package DiceKriging
to emulate output from the Canadian climate model CanAM4.
First we load the necessary packages, and the data file globalCanAM4W1.RData
. The emulator inputs are saved in columns 2 to 14 of the object tData
:
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(DiceKriging))
load('../Demonstrations/globalCanAM4W1.RData')
design = tData[, 2:14] %>% as_data_frame
design
## # A tibble: 60 x 13
## FACAUT UICEFAC CT CBMF WEIGHT DRNGAT
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.11111111 0.21739130 -0.09090909 -0.2000000 0.44444444 -0.9000000
## 2 0.93964656 0.21272724 0.44617789 -0.6328893 -0.12804819 0.5548678
## 3 -0.49607751 0.49866598 0.66486451 -0.4184081 -0.57449420 -0.7147499
## 4 -0.01823092 0.87500017 -0.78226390 0.7481327 0.20387546 0.7659221
## 5 0.18724746 -0.24641582 0.37947947 -0.9198253 -0.03209664 0.8924434
## 6 -0.33415933 -0.04815128 -0.96533195 -0.2130182 -0.93268338 -0.2679695
## 7 0.61434413 -0.67240104 0.06517013 0.9333383 -0.83407115 0.2400356
## 8 -0.11223438 0.80259298 0.08011285 0.3629515 0.92100957 0.1179780
## 9 -0.85252199 -0.12957552 0.89944743 0.5471016 -0.51206544 0.4832247
## 10 -0.68412334 -0.81095030 -0.12232986 0.4093460 0.76273577 -0.4389870
## # ... with 50 more rows, and 7 more variables: RKXMIN <dbl>, XINGLE <dbl>,
## # ALMC <dbl>, ALF <dbl>, TAUS1 <dbl>, CSIGMA <dbl>, CVSG <dbl>
We build the emulator for the output variable BALX
. We use the built-in R function step
to choose the emulator mean function by stepwise regression. step
returns an object of class lm
, whose list element terms
contains the R formula for the mean function. That formula is passed to the DiceKriging
function km
, along with the design matrix and the response, to fit the Gaussian process emulator. See ?km
for options to control the emulator; here we only specify to use a Matern covariance matrix (with nu=5/2
) and let all parameters be estimated automatically:
# define response
response = tData$BALX
# stepwise regression to fit trend model
em_lm = step(lm(response ~ ., data=design),
direction='both', trace=0)
# fit GP using DiceKriging function `km`
em_km = km(formula = em_lm$terms,
design = design,
response = response,
covtype = 'matern5_2',
control = list(trace=FALSE))
The object em_km
contains the emulator. We can use predict
to evaluate the emulator at new inputs, for example at the mean of all input points:
pred = predict(em_km, newdata=map_df(design, mean), type='UK')
pred[c('mean', 'sd', 'lower95', 'upper95')] %>% as.data.frame
## mean sd lower95 upper95
## 1 6.230499 1.142164 3.935236 8.525763
The emulator can be validated by leave-one-out cross validation, using the DiceKriging
function leaveOneOut.km
:
em_km_loo = leaveOneOut.km(em_km, type='UK') %>% as_data_frame
em_km_loo
## # A tibble: 60 x 2
## mean sd
## <dbl> <dbl>
## 1 -1.0139507 1.731929
## 2 5.0540275 1.482308
## 3 0.4496219 1.619887
## 4 6.1947459 1.613111
## 5 -5.3986640 1.785493
## 6 6.6694612 1.610073
## 7 10.4919816 1.572071
## 8 -1.7387301 1.383712
## 9 -5.4089395 1.652246
## 10 7.7264413 1.680943
## # ... with 50 more rows
em_km_loo
is a list with two vectors mean
and sd
of equal length, which is why it can be transformed to a data frame. To plot the cross validation results in a faceted panel plot with ggplot2
, we have to set up the appropriate data frame with columns for the response, and lower and upper limit of the prediction interval. We also add a boolean variable inside
and a corresponding colr
variable to change the colour of the intervals if the response falls outside. The variable pit
is the probability integral transform, which should be uniformly distributed if the emulator is reliable. At last we append the design matrix, and use gather
to turn the data frame from wide into long form appropriate for facet plotting:
# dataframe for plotting
plot_df = em_km_loo %>%
mutate(response = response,
index = seq_along(response),
lwr = mean - 2 * sd,
upr = mean + 2 * sd,
pit = pnorm(response, mean, sd),
inside = (response > lwr & response < upr),
colr = ifelse(inside, 'deepskyblue3', 'darkorange3')) %>%
bind_cols(design) %>%
gather('variable', 'value', names(design))
The following plots the emulator output and the actual value of the response over the value of the input in a separate panel for each input.
ggplot(plot_df, aes(x=value)) + facet_wrap( ~ variable, ncol=2) +
geom_point(aes(y=mean), colour=plot_df$colr) +
geom_segment(aes(xend=value, y=lwr, yend=upr), colour=plot_df$colr) +
geom_point(aes(y=response), cex=1.5)
The histogram over the PIT values should be approximately flat:
ggplot(plot_df, aes(x=pit)) +
geom_histogram(breaks=seq(0,1,0.2), colour='black', fill='deepskyblue3')
response = tData$PCP
em_lm = step(lm(response ~ ., data=design), direction='both', trace=0)
em_km = km(formula=em_lm$terms, design=design, response=response, covtype='matern5_2', control=list(trace=FALSE))
plot_df = leaveOneOut.km(em_km, type='UK') %>% as_data_frame %>% mutate(response = response, index = seq_along(response), lwr = mean - 2 * sd, upr = mean + 2 * sd, pit = pnorm(response, mean, sd), inside = (response > lwr & response < upr), colr = ifelse(inside, 'deepskyblue3', 'darkorange3')) %>% bind_cols(design) %>% gather('variable', 'value', names(design))
ggplot(plot_df, aes(x=value)) + facet_wrap(~variable, ncol=2) + geom_point(aes(y=mean), colour=plot_df$colr) + geom_segment(aes(xend=value, y=lwr, yend=upr), colour=plot_df$colr) + geom_point(aes(y=response), cex=1.5)
ggplot(plot_df, aes(x=pit)) + geom_histogram(breaks=seq(0,1,0.2), colour='black', fill='deepskyblue3')