\documentclass{article}
\usepackage{Sweave}
%\SweaveOpts{eps=TRUE}
%\documentclass[12pt]{article}
%\usepackage{fullpage}
%\usepackage{setspace}
\usepackage[footnotesize]{caption}
\usepackage{amsmath}
\usepackage{amscd}
\usepackage{epsfig}
\newcommand{\bm}[1]{\mbox{\boldmath $#1$}}
\newcommand{\mb}[1]{\mathbf{#1}}
%\VignetteIndexEntry{a guide to the tgp package}
%\VignetteKeywords{tgp}
%\VignetteDepends{tgp,maptree,MASS}
%\VignettePackage{tgp}
\begin{document}
%\doublespacing
\setkeys{Gin}{width=0.85\textwidth}
<>=
library(tgp)
options(width=65)
@
\title{{\tt tgp}: an {\sf R} package for Bayesian
nonstationary,\\ semiparametric nonlinear regression and design
by treed Gaussian process models}
\author{Robert B. Gramacy\\
Department of Statistics\\
Virginia Tech\\
rbg@vt.edu}
\maketitle
\begin{abstract}
The {\tt tgp} package for {\sf R} \cite{cran:R} is a tool for fully
Bayesian nonstationary, semiparametric nonlinear regression and
design by treed Gaussian processes with jumps to the limiting linear
model. Special cases also implemented include Bayesian linear
models, linear CART, stationary separable and isotropic Gaussian
processes. In addition to inference and posterior prediction, the
package supports the (sequential) design of experiments under these
models paired with several objective criteria. 1-d and 2-d
plotting, with higher dimension projection and slice capabilities,
and tree drawing functions (requiring {\tt maptree} and {\tt
combinat} libraries), are also provided for visualization of {\tt
tgp}-class output.
\end{abstract}
\subsection*{Intended audience}
\label{sec:discaimer}
This document is intended to familiarize a (potential) user of {\tt
tgp} with the models and analyses available in the package. After a
brief overview, the bulk of this document consists of examples on
mainly synthetic and randomly generated data which illustrate the
various functions and methodologies implemented by the package. This
document has been authored in {\tt Sweave} (try {\tt help(Sweave)}).
This means that the code quoted throughout is certified by {\tt
R}, and the {\tt Stangle} command can be used to extract it.
Note that this tutorial was not meant to serve as an instruction
manual. For more detailed documentation of the functions contained in
the package, see the package help-manuals. At an {\sf R} prompt, type
{\tt help(package=tgp)}. PDF documentation is also available on the
world-wide-web.
\begin{center}
\tt http://www.cran.r-project.org/doc/packages/tgp.pdf
\end{center}
The outline is as follows. Section \ref{sec:implement} introduces the
functions and associated regression models implemented by the {\tt
tgp} package, including plotting and visualization methods. The
Bayesian mathematical specification of these models is contained in
Section \ref{sec:model}. In Section \ref{sec:examples} the functions
and methods implemented in the package are illustrated by example.
The appendix covers miscellaneous topics such as how to link with the
{\tt ATLAS} libraries for fast linear algebra routines, compile--time
support for {\tt Pthreads} parallelization, the gathering of parameter
traces, the verbosity of screen output, and some miscellaneous details
of implementation.
\subsection*{Motivation}
Consider as motivation the Motorcycle Accident Dataset
\cite{silv:1985}. It is a classic data set used in recent literature
\cite{rasm:ghah:nips:2002} to demonstrate the success of nonstationary
regression models. The data consists of measurements of the
acceleration of the head of a motorcycle rider as a function of time
in the first moments after an impact. Many authors have commented on
the existence of two---perhaps three---regimes in the data over time
where the characteristics of the mean process and noise level change
(i.e., a nonstationarity and heteroskedasticity, respectively). It
can be interesting to see how various candidate models handle this
nuance.
\begin{figure}[ht!]
\centering
\includegraphics[trim=0 25 0 0]{motovate_bgp}
\includegraphics[trim=0 25 0 0]{motovate_btgp}
\caption{Fit of the Motorcycle Accident Dataset using a GP ({\em top})
and treed GP model ({\em bottom}). The $x$-axis is time in
milliseconds after an impact; the $y$--axis is acceleration of the
helmet of a motorcycle rider measured in ``$g$'s'' in a simulated
impact.}
\label{f:motivate}
\end{figure}
Figure \ref{f:motivate} shows a fit of this data using a standard
(stationary) Gaussian process (GP; {\em left}), and the treed GP model
({\em right}).\footnote{Note that these plots are {\em static}, i.e.,
they were not generated in--line with {\tt R} code. See Section
\ref{sec:moto} for {\em dynamic} versions.} Notice how stationary
GP model is unable to capture the smoothness in the linear section(s),
nor the decreased noise level. We say that the standard GP model is
stationary because it has a single fixed parameterization throughout
the input space. An additive model would be inappropriate for similar
reasons. In contrast, the treed GP model is able to model the first
linear part, the noisy ``whiplash'' middle section, and the smooth
(possibly linear) final part with higher noise level, thus exhibiting
nonstationary modeling behavior and demonstrating an ability to cope
with heteroskedasticity.
The remainder of this paper describes the treed GP model in detail,
and provides illustrations though example. There are many special
cases of the treed GP model, e.g., the linear model (LM), treed LM,
stationary GP, etc.. These are outlined and demonstrated as well.
\section{What is implemented?}
\label{sec:implement}
The {\tt tgp} package contains implementations of seven Bayesian
multivariate regression models and functions for visualizing posterior
predictive surfaces. These models, and the functions which implement
them, are outlined in Section \ref{sec:breg}. Also implemented in the
package are functions which aid in the sequential design of
experiments for {\tt tgp}-class models, which is what I call {\em
adaptive sampling}. These functions are introduced at the end of
Section \ref{sec:model} and a demonstration is given in Section
\ref{sec:as}.
\subsection{Bayesian regression models}
\label{sec:breg}
The seven regression models implemented in the package are summarized in
Table \ref{t:reg}. They include combinations of treed partition
models, (limiting) linear models, and Gaussian process models as
indicated by T, LM/LLM, \& GP in the center column of the table. The
details of model specification and inference are contained in Section
\ref{sec:model}. Each is a fully Bayesian regression model, and in
the table they are ordered by some notion of ``flexibility''. These
{\tt b*} functions, as I call them, are wrappers around the master
{\tt tgp} function which is an interface to the core {\tt C} code.
\begin{table}
\centering
\begin{tabular}{l|l|l}
{\sf R} function & Ingredients & Description \\
\hline
blm & LM & Linear Model \\
btlm & T, LM & Treed Linear Model \\
bcart & T & Treed Constant Model \\
bgp & GP & GP Regression \\
bgpllm & GP, LLM & GP with jumps to the LLM \\
btgp & T, GP & treed GP Regression \\
btgpllm & T, GP, LLM & treed GP with jumps to the LLM \\
\hline
tgp & & Master interface for the above methods
\end{tabular}
\caption{Bayesian regression models implemented by the {\tt tgp} package}
\label{t:reg}
\end{table}
The {\tt b*} functions are intended as the main interface, so little
further attention to the {\tt tgp} master function will be included
here. The easiest way to see how the master {\tt tgp} function
implements one of the {\tt b*} methods is to simply type the name of
the function of interest into {\sf R}. For example, to see the
implementation of {\tt bgp}, type:
<>=
bgp
@
The output (return-value) of {\tt tgp} and the {\tt b*} functions is a
{\tt list} object of class ``{\tt tgp}''. This is what is meant by a
``{\tt tgp}-class'' object. This object retains all of the relevant
information necessary to summarize posterior predictive inference,
maximum {\em a' posteriori} (MAP) trees, and statistics for adaptive
sampling. Information about its actual contents is contained in the
help files for the {\tt b*} functions. Generic {\tt print}, {\tt
plot}, and {\tt predict} methods are defined for {\tt tgp}-class
objects. The {\tt plot} and {\tt predict} functions are discussed
below. The {\tt print} function simply provides a list of the names
of the fields comprising a {\tt tgp}-class object.
\subsubsection{Plotting and visualization}
\label{sec:plot}
The two main functions provided by the {\tt tgp} package for
visualization are {\tt plot.tgp}, inheriting from the generic {\tt
plot} method, and a function called {\tt tgp.trees} for graphical
visualization of MAP trees.
The {\tt plot.tgp} function can make plots in 1-d or 2-d. Of course,
if the data are 1-d, the plot is in 1-d. If the data are 2-d, or
higher, they are 2-d image or perspective plots unless a 1-d
projection argument is supplied. Data which are 3-d, or higher,
require projection down to 2-d or 1-d, or specification of a 2-d
slice. The {\tt plot.tgp} default is to make a projection onto the
first two input variables. Alternate projections are specified as an
argument ({\tt proj}) to the function. Likewise, there is also an
argument ({\tt slice}) which allows one to specify which slice of the
posterior predictive data is desired. For models that use treed
partitioning (those with a T in the center column of Table
\ref{t:reg}), the {\tt plot.tgp} function will overlay the region
boundaries of the MAP tree ($\hat{\mathcal{T}}$) found during MCMC.
A few notes on 2-d plotting of {\tt tgp}-class objects:
\begin{itemize}
\item 2-d plotting requires interpolation of the data onto a uniform
grid. This is supported by the {\tt tgp} package in two ways: (1)
{\tt loess} smoothing, and (2) the {\tt akima} package, available
from CRAN. The default is {\tt loess} because it is more stable and
does not require installing any further packages. When {\tt akima}
works it makes (in my opinion) smarter interpolations. However
there are two bugs in the {\tt akima} package, one malign and the
other benign, which preclude it from the default position in {\tt
tgp}. The malign bug can cause a segmentation fault, and bring
down the entire R session. The benign bug produces {\tt NA}'s when
plotting data from a grid. For beautiful 2-d plots of gridded data
I suggest exporting the {\tt tgp} predictive output to a text file
and using {\tt gnuplot}'s 2-d plotting features.
\item The current version of this package contains no examples---nor
does this document---which demonstrate plotting of data with
dimension larger than two. The example provided in Section
\ref{sec:fried} uses 10-d data, however no plotting is required.
{\tt tgp} methods have been used on data with input dimension as
large as 15 \cite{gra:lee:2008}, and were used in a sequential
design and detailed analysis of some proprietary 3-d input and 6-d
output data sampled using a NASA supercomputer
\cite{gra:lee:2009}.
\item The {\tt plot.tgp} function has many more options than are
illustrated in [Section \ref{sec:examples} of] this document.
Please refer to the help files for more details.
\end{itemize}
The {\tt tgp.trees} function provides a diagrammatic representation of
the MAP trees of each height encountered by the Markov chain during
sampling. The function will not plot trees of height one, i.e., trees
with no branching or partitioning. Plotting of trees requires the
{\tt maptree} package, which in turn requires the {\tt combinat}
package, both available from CRAN.
\subsubsection{Prediction}
\label{sec:predintro}
Prediction, naturally, depends on fitted model parameters
$\hat{\bm{\theta}}|\mbox{data}$, or Monte Carlo samples from the
posterior distribution of $\bm{\theta}$ in a Bayesian analysis.
Rather than saving samples from $\pi(\bm{\theta}|\mbox{data})$ for
later prediction, usually requiring enormous amounts of storage, {\tt
tgp} samples the posterior predictive distribution in-line, as
samples of $\bm{\theta}$ become available. [Section \ref{sec:pred}
and \ref{sec:llmpred} outlines the prediction equations.] A {\tt
predict.tgp} function is provided should it be necessary to obtain
predictions {\em after} the MCMC has finished.
The {\tt b*} functions save the MAP parameterization
$\hat{\bm{\theta}}$ maximizing $\pi(\bm{\theta}|\mbox{data})$. More
specifically, the ``{\tt tgp}''--class object stores the MAP tree
$\hat{{\mathcal T}}$ and corresponding GP (or LLM) parameters
$\hat{\bm{\theta}}|\hat{\mathcal{T}}$ found while sampling from the
joint posterior $\pi(\bm{\theta},\mathcal{T}|\mbox{data})$. These may
be accessed and used, via {\tt predict.tgp}, to obtain
posterior--predictive inference through the MAP parameterization. In
this way {\tt predict.tgp} is similar to {\tt predict.lm}, for
example. Samples can also be obtained from the MAP--parameterized
predictive distributions via {\tt predict.tgp}, or a
re--initialization of the joint sampling of the posterior and
posterior predictive distribution can commence starting from the
$(\hat{\bm{\theta}},\hat{\mathcal{T}})$.
The output of {\tt predict.tgp} is also a {\tt tgp} class object.
Appendix \ref{sec:apred} illustrates how this feature can be useful
in the context of passing {\tt tgp} model fits between collaborators.
There are other miscellaneous demonstrations in
Section~\ref{sec:examples}.
\subsubsection{Speed}
\label{sec:speed}
Fully Bayesian analyses with MCMC are not the super-speediest of all
statistical models. Nor is inference for GP models, classical or
Bayesian. When the underlying relationship between inputs and
responses is non-linear, GPs represent a state of the art
phenomenological model with high predictive power. The addition of
axis--aligned treed partitioning provides a divide--and--conquer
mechanism that can not only reduce the computational burden relative
to the base GP model, but can also facilitate the efficient modeling
of nonstationarity and heteroskedasticity in the data. This is in
stark contrast to other recent approaches to nonstationary spatial
models (e.g., via deformations \cite{dam:samp:gutt:2001,schmidt:2003},
or process convolutions
\cite{higd:swal:kern:1999,fuentes:smith:2001,Paci:2003}) which can
require orders of magnitude more effort relative to stationary GPs.
Great care has been taken to make the implementation of Bayesian
inference for GP models as efficient as possible [see Appendix
\ref{sec:howimplement}]. However, inference for non-treed GPs can be
computationally intense. Several features are implemented by the
package which can help speed things up a bit. Direct support for {\tt
ATLAS} \cite{atlas-hp} is provided for fast linear algebra. Details
on linking this package with {\tt ATLAS} is contained in Appendix
\ref{sec:atlas}. Parallelization of prediction and inference is
supported by a producer/consumer model implemented with {\tt
Pthreads}. Appendix \ref{sec:pthreads} shows how to activate this
feature, as it is not turned on by default. An argument called {\tt
linburn} is made available in tree class (T) {\tt b*} functions in
Table \ref{t:reg}. When {\tt linburn = TRUE}, the Markov chain is
initialized with a run of the Bayesian treed linear model
\cite{chip:geor:mccu:2002} before burn-in in order to pre-partition
the input space using linear models. Finally, thinning of the
posterior predictive samples obtained by the Markov chain can also
help speed things up. This is facilitated by the {\tt E}-part of the
{\tt BTE} argument to {\tt b*} functions.
\subsection{Sequential design of experiments}
\label{sec:design}
Sequential design of experiments, a.k.a. {\em adaptive sampling}, is
not implemented by any {\em single} function in the {\tt tgp} package.
Nevertheless, options and functions are provided in order to
facilitate the automation of adaptive sampling with {\tt tgp}-class
models. A detailed example is included in Section \ref{sec:as}.
Arguments to {\tt b*} functions, and {\tt tgp}, which aid in adaptive
sampling include {\tt Ds2x} and {\tt improv}. Both are booleans,
i.e., should be set to {\tt TRUE} or {\tt FALSE} (the default for both
is {\tt FALSE}). {\tt TRUE} booleans cause the {\tt tgp}-class output
list to contain vectors of similar names with statistics that can be
used toward adaptive sampling. When {\tt Ds2x = TRUE} then $\Delta
\sigma^2(\mb{\tilde{x}})$ statistic is computed at each
$\tilde{\mb{x}} \in \mbox{\tt XX}$, in accordance with the ALC (Active
Learning--Cohn) algorithm \cite{cohn:1996}. Likewise, when {\tt
improv = TRUE}, statistics are computed in order to asses the
expected improvement (EI) for each $\tilde{\mb{x}} \in \mbox{\tt XX}$
about the global minimum \cite{jones:schonlau:welch:1998}. The ALM
(Active Learning--Mackay) algorithm \cite{mackay:1992} is implemented
by default in terms of difference in predictive quantiles for the
inputs {\tt XX}, which can be accessed via the {\tt ZZ.q} output
field. Details on the ALM, ALC, and EI algorithms are provided in
Section \ref{sec:model}.
Calculation of EI statistics was considered ``beta''
functionality while this document was being prepared. At that time
it had not been adequately tested, and its implementation changed
substantially in future versions of the package. For updates
see the follow-on vignette
\cite{gra:taddy:2010}, or \verb!vignette("tgp2")!.
That document also discusses sensitivity analysis, handling of
categorical inputs, and importance tempring.
The functions included in the package which explicitly aid in the
sequential design of experiments are {\tt tgp.design} and {\tt
dopt.gp}. They are both intended to produce sequential $D$--optimal
candidate designs {\tt XX} at which one or more of the adaptive
sampling methods (ALM, ALC, EI) can gather statistics. The {\tt
dopt.gp} function generates $D$--optimal candidates for a stationary
GP. The {\tt tgp.design} function extracts the MAP tree from a {\tt
tgp}-class object and uses {\tt dopt.gp} on each region of the MAP
partition in order to get treed sequential $D$--optimal candidates.
\section{Methods and Models}
\label{sec:model}
This section provides a quick overview of the statistical models and
methods implemented by the {\tt tgp} package. Stationary Gaussian
processes (GPs), GPs with jumps to the limiting linear model (LLM;
a.k.a.~GP LLM), treed partitioning for nonstationary models, and
sequential design of experiments (a.k.a.~{\em adaptive sampling})
concepts for these models are all briefly discussed. Appropriate
references are provided for the details, including the original paper
on Bayesian treed Gaussian process models \cite{gra:lee:2008}, and an
application paper on adaptively designing supercomputer experiments
\cite{gra:lee:2009}.
As a first pass on this document, it might make sense to skip this
section and go straight on to the examples in Section
\ref{sec:examples}.
\subsection{Stationary Gaussian processes}
\label{sec:gp}
Below is a hierarchical generative model for a stationary GP with
linear tend for data $D=\{\mb{X}, \mb{Z}\}$ consisting of $n$ pairs of
$m_X$ covariates and a single response variable $\{(x_{i1},\dots,
x_{im_X}), z_i\}_{i=1}^n$.
\begin{align}
\mb{Z} | \bm{\beta}, \sigma^2, \mb{K} &\sim
N_{n}(\mb{\mb{F}} \bm{\beta}, \sigma^2 \mb{K}) &
\sigma^2 &\sim IG(\alpha_\sigma/2, q_\sigma/2) \nonumber \\
\bm{\beta} | \sigma^2, \tau^2, \mb{W},
\bm{\beta}_0 &\sim N_{m}(\bm{\beta}_0,
\sigma^2 \tau^2 \mb{W}) &
\tau^2 &\sim IG(\alpha_\tau/2, q_\tau/2) \label{eq:model} \\
\bm{\beta}_0 &\sim N_{m}(\bm{\mu}, \mb{B}) &
\mb{W}^{-1} &\sim W((\rho \mb{V})^{-1}, \rho), \nonumber
\end{align}
$\mb{X}$ is a design matrix with $m_X$ columns. An intercept term
is added with $\mb{F} = (\mb{1}, \mb{X})$ which has $m\equiv m_X+1$
columns, and $\mb{W}$ is a $m \times m$ matrix. $N$, $IG$, and $W$
are the (Multivariate) Normal, Inverse-Gamma, and Wishart
distributions, respectively. Constants $\bm{\mu}, \mb{B},\mb{V},\rho,
\alpha_\sigma, q_\sigma, \alpha_\tau, q_\tau.$ are treated as known.
The GP correlation structure $\mb{K}$ is chosen either from the
isotropic power family, or separable power family, with a fixed power
$p_0$ (see below), but unknown (random) range and nugget parameters.
Correlation functions used in the {\tt tgp} package take the form
$K(\mb{x}_j, \mb{x}_k) = K^*(\mb{x}_j, \mb{x}_k) + {g} \delta_{j,k}$,
where $\delta_{\cdot,\cdot}$ is the Kronecker delta function, $g$ is
the {\em nugget}, and $K^*$ is a {\em true} correlation
representative from a parametric family. The isotropic Mat\'{e}rn
family is also implemented in the current version as ``beta''
functionality.
All parameters in (\ref{eq:model}) can be sampled using Gibbs steps,
except for the covariance structure and nugget parameters, and their
hyperparameters, which can be sampled via Metropolis-Hastings
\cite{gra:lee:2008}.
\subsubsection{The nugget}
\label{sec:intro:nug}
The $g$ term in the correlation function $K(\cdot,\cdot)$ is referred
to as the {\em nugget} in the geostatistics literature
\cite{math:1963,cressie:1991} and sometimes as {\em jitter} in the
Machine Learning literature \cite{neal:1997}. It must always be
positive $(g>0)$, and serves two purposes. Primarily, it provides a
mechanism for introducing measurement error into the stochastic
process. It arises when considering a model of the form:
\begin{equation}
Z(\mb{X}) = m(\mb{X}, \bm{\beta}) + \varepsilon(\mb{X}) + \eta(\mb{X}),
\label{eq:noisemodel}
\end{equation}
where $m(\cdot,\cdot)$ is underlying (usually linear) mean process,
$\varepsilon(\cdot)$ is a process covariance whose underlying
correlation is governed by $K^*$, and $\eta(\cdot)$ represents
i.i.d.~Gaussian noise. Secondarily, though perhaps of equal practical
importance, the nugget (or jitter) prevents $\mb{K}$ from becoming
numerically singular. Notational convenience and conceptual
congruence motivates referral to $\mb{K}$ as a correlation matrix,
even though the nugget term ($g$) forces $K(\mb{x}_i,\mb{x}_i)>1$.
\subsubsection{Exponential Power family}
\label{sec:pow}
Correlation functions in the {\em isotropic power} family are {\em
stationary} which means that correlations are measured identically
throughout the input domain, and {\em isotropic} in that correlations
$K^*(\mb{x}_j, \mb{x}_k)$ depend only on a function of the Euclidean
distance between $\mb{x}_j$ and $\mb{x}_k$: $||\mb{x}_j - \mb{x}_k||$.
\begin{equation}
K^*(\mb{x}_j, \mb{x}_k|d) =
\exp\left\{-\frac{||\mb{x}_j - \mb{x}_k||^{p_0}}{d} \right\},
\label{eq:pow}
\end{equation}
where $d>0$ is referred to as the {\em width} or {\em range}
parameter. The power $0>=
hist(c(rgamma(100000,1,20), rgamma(100000,10,10)),
breaks=50, xlim=c(0,2), freq=FALSE, ylim=c(0,3),
main = "p(d) = G(1,20) + G(10,10)", xlab="d")
d <- seq(0,2,length=1000)
lines(d,0.2+0.7/(1+exp(-10*(d-0.5))))
abline(h=1, lty=2)
legend(x=1.25, y=2.5, c("p(b) = 1", "p(b|d)"), lty=c(1,2))
@
<>=
graphics.off()
@
\includegraphics[trim=0 25 0 10]{tgp-gpllm}
%\vspace{-0.5cm}
\caption{\footnotesize Prior distribution for the boolean ($b$)
superimposed on $p(d)$. There is truncation in the left--most
bin, which rises to about 6.3. }
\label{f:boolprior}
\end{center}
\end{figure}
Probability mass functions which increase as a function of $d_i$,
e.g.,
\begin{equation}
p_{\gamma, \theta_1, \theta_2}(b_i=0|d_i) =
\theta_1 + (\theta_2-\theta_1)/(1 + \exp\{-\gamma(d_i-0.5)\})
\label{eq:boolp}
\end{equation}
with $0<\gamma$ and $0\leq \theta_1 \leq \theta_2 < 1$, can encode
such a preference by calling for the exclusion of dimensions $i$ with
large $d_i$ when constructing $\mb{K}^*$. Thus $b_i$ determines
whether the GP or the LLM is in charge of the marginal process in the
$i^{\mbox{\tiny th}}$ dimension. Accordingly, $\theta_1$ and
$\theta_2$ represent minimum and maximum probabilities of jumping to
the LLM, while $\gamma$ governs the rate at which $p(b_i=0|d_i)$ grows
to $\theta_2$ as $d_i$ increases. Figure \ref{f:boolprior} plots
$p(b_i=0|d_i)$ %as in (\ref{eq:boolp})
for $(\gamma,\theta_1,\theta_2) =(10, 0.2, 0.95)$ superimposed on a
convenient $p(d_i)$ which is taken to be a mixture of Gamma
distributions,
\begin{equation}
p(d) = [G(d|\alpha=1,\beta=20) + G(d|\alpha=10,\beta=10)]/2,
\label{eq:dprior}
\end{equation}
representing a population of GP parameterizations for wavy surfaces
(small $d$) and a separate population of those which are quite smooth
or approximately linear. The $\theta_2$ parameter is taken to be
strictly less than one so as not to preclude a GP which models a
genuinely nonlinear surface using an uncommonly large range setting.
The implied prior probability of the full $m_X$-dimensional LLM is
\begin{equation}
p(\mbox{linear model}) = \prod_{i=1}^{m_X} p(b_i=0|d_i)
= \prod_{i=1}^{m_X} \left[ \theta_1 + \frac{\theta_2-\theta_1}
{1 + \exp\{-\gamma (d_i-0.5)\}}\right].
\label{e:linp}
\end{equation}
Notice that the resulting process is still a GP if any of the booleans
$b_i$ are one. The primary computational advantage associated with
the LLM is foregone unless all of the $b_i$'s are zero. However, the
intermediate result offers increased numerical stability and
represents a unique transitionary model lying somewhere between the GP
and the LLM. It allows for the implementation of a semiparametric
stochastic processes like $Z(\mb{x}) = \bm{\beta} f(\mb{x}) +
\varepsilon(\tilde{\mb{x}})$ representing a piecemeal spatial
extension of a simple linear model. The first part
($\bm{\beta}f(\mb{x})$) of the process is linear in some known
function of the full set of covariates $\mb{x} = \{x_i\}_{i=1}^{m_X}$,
and $\varepsilon(\cdot)$ is a spatial random process (e.g. a GP) which
acts on a subset of the covariates $\mb{x}'$. Such models are
commonplace in the statistics community~\cite{dey:1998}.
Traditionally, $\mb{x}'$ is determined and fixed {\em a'
priori}. The separable boolean prior (\ref{eq:boolp}) implements
an adaptively semiparametric process where the subset $\mb{x}'
= \{ x_i : b_i = 1, i=1,\dots,m_X \}$ is given a prior distribution,
instead of being fixed.
\subsubsection{Prediction and Adaptive Sampling under LLM}
\label{sec:llmpred}
Prediction under the limiting GP model is a simplification of
(\ref{eq:pred}) when it is known that $\mb{K} = (1+g)\mb{I}$. It can
be shown \cite{gra:lee:2008b} that the predicted value of $z$ at
$\mb{x}$ is normally distributed with mean $\hat{z}(\mb{x}) =
\mb{f}^\top(\mb{x}) \tilde{\bm{\beta}}$ and variance
$\hat{\sigma}(\mb{x})^2 = \sigma^2 [1 +
\mb{f}^\top(\mb{x})\mb{V}_{\tilde{\beta}} \mb{f}(\mb{x})]$, where $
\mb{V}_{\tilde{\beta}} = (\tau^{-2} + \mb{F}^\top \mb{F}(1+g))^{-1}$.
This is preferred over (\ref{eq:pred}) with $\mb{K}=\mb{I}(1+g)$
because an $m \times m$ inversion is faster than an $n\times n$ one.
Applying the ALC algorithm under the LLM also offers computational
savings. Starting with the predictive variance given in
(\ref{eq:pred}), the expected reduction in variance under the LM is
\cite{gra:lee:2009}
\begin{equation}
\Delta \hat{\sigma}^2_\mb{y} (\mb{x}) = \frac{
\sigma^2 [\mb{f}^\top(\mb{y}) \mb{V}_{\tilde{\beta}_N} \mb{f}(\mb{x})]^2}
{1+g + \mb{f}^\top(\mb{x}) \mb{V}_{\tilde{\beta}_N} \mb{f}(\mb{x})}
\label{e:llmalc}
\end{equation}
which is similarly preferred over (\ref{e:gpalc}) with $\mb{K} =
\mb{I}(1+g)$.
The statistic for expected improvement (EI; about the minimum) is the
same under the LLM as (\ref{eq:ego}) for the GP. Of course, it helps
to use the linear predictive equations instead of the kriging ones for
$\hat{z}(\mb{x})$ and $\hat{\sigma}^2(\mb{x})$.
\subsection{Treed partitioning}
\label{sec:treed}
Nonstationary models are obtained by treed partitioning and inferring
a separate model within each region of the partition. Treed
partitioning is accomplished by making (recursive) binary splits on
the value of a single variable so that region boundaries are parallel
to coordinate axes. Partitioning is recursive, so each new partition
is a sub-partition of a previous one. Since variables may be
revisited, there is no loss of generality by using binary splits as
multiple splits on the same variable are equivalent to a non-binary
split.
\begin{figure}%[ht!]
\centering
\includegraphics{tree}
\caption{\footnotesize An example tree $\mathcal{T}$ with two splits,
resulting in three regions, shown in a diagram ({\em left}) and
pictorially ({\em right}). The notation $\mb{X}[:,u] < s$ represents
a subsetting of the design matrix $\mb{X}$ by selecting the rows
which have $u^{\mbox{\tiny th}}$ column less than $s$, i.e. columns
$\{i: x_{iu} < s\}$, so that $\mb{X}_1$ has the rows $I_1$ of
$\mb{X}$ where $I_1 = \{x_{iu_1} < s_1 \;\&\; x_{iu_2} <
s_2\}$, etc. The responses are subsetted similarly so that
$\mb{Z}_1$ contains the $I_1$ elements of $\mb{Z}$. We have that
$\cup_j D_i = \{\mb{X},\mb{Z}\}$ and $D_i \cap D_j = \emptyset$ for
$i\ne j$. }
\label{f:tree}
\end{figure}
Figure \ref{f:tree} shows an example tree. In this example, region $D_1$
contains $\mb{x}$'s whose $u_1$ coordinate is less than $s_1$ and
whose $u_2$ coordinate is less than $s_2$. Like $D_1$, $D_2$ has
$\mb{x}$'s whose coordinate $u_1$ is less than $s_1$, but differs from
$D_1$ in that the $u_2$ coordinate must be bigger than or equal to
$s_2$. Finally, $D_3$ contains the rest of the $\mb{x}$'s differing
from those in $D_1$ and $D_2$ because the $u_1$ coordinate of its
$\mb{x}$'s is greater than or equal to $s_1$. The corresponding response
values ($z$) accompany the $\mb{x}$'s of each region.
These sorts of models are often referred to as Classification and
Regression Trees (CART) \cite{brei:1984}. CART has become popular
because of its ease of use, clear interpretation, and ability to
provide a good fit in many cases. The Bayesian approach is
straightforward to apply to tree models, provided that one can specify
a meaningful prior for the size of the tree. The trees implemented in
the {\tt tgp} package follow Chipman et al.~\cite{chip:geor:mccu:1998}
who specify the prior through a tree-generating process. Starting
with a null tree (all data in a single partition), the tree,
${\mathcal T}$, is probabilistically split recursively with each
partition, $\eta$, being split with probability $p_{\mbox{\sc
split}}(\eta, {\mathcal T}) = a (1 + q_\eta)^{-b}$ where $q_\eta$
is the depth of $\eta$ in $\mathcal{T}$ and $a$ and $b$ are parameters
chosen to give an appropriate size and spread to the distribution of
trees.
Extending the work of Chipman et al.~\cite{chip:geor:mccu:2002}, the
{\tt tgp} package implements a stationary GP with linear trend, or GP
LLM, independently within each of the regions depicted by a tree
$\mathcal{T}$ \cite{gra:lee:2008}. Integrating out dependence on
$\mathcal{T}$ is accomplished by reversible-jump MCMC (RJ-MCMC) via
tree operations {\em grow, prune, change}, and {\em
swap}~\cite{chip:geor:mccu:1998}.
%(2002)\nocite{chip:geor:mccu:2002}. %, however
%Tree proposals can change the size of the parameter space ($\bm{\theta}$).
To keep things simple, proposals for new parameters---via an increase
in the number of partitions (through a {\em grow})---are drawn from
their priors\footnote{Proposed {\em grows} are the {\em only} place
where the priors (for $d$, $g$ and $\tau^2$ parameters; the others
can be integrated out) are used for MH--style proposals. All other
MH proposals are ``random--walk'' as described in Appendix
\ref{sec:howimplement}.}, thus eliminating the Jacobian term usually
present in RJ-MCMC. New splits are chosen uniformly from the set of
marginalized input locations $\mb{X}$. The {\em swap} operation is
augmented with a {\em rotate} option to improve mixing of the Markov
chain \cite{gra:lee:2008}.
There are many advantages to partitioning the input space into
regions, and fitting separate GPs (or GP LLMs) within \index{each}each
region. Partitioning allows for the modeling of non-stationary
behavior, and can ameliorate some of the computational demands by
fitting models to less data. Finally, fully Bayesian model averaging
yields a uniquely efficient nonstationary, nonparametric, or
semiparametric (in the case of the GP LLM) regression tool. The most
general Bayesian treed GP LLM model can facilitate a model comparison
between its special cases (LM, CART, treed LM, GP, treed GP, treed GP
LLM) through the samples obtained from the posterior distribution.
\subsection{(Treed) sequential D-optimal design}
\label{sec:treedopt}
In the statistics community, sequential data solicitation goes under
the general heading of {\em design of experiments}. Depending on a
choice of utility, different algorithms for obtaining optimal designs
can be derived. Choose the Kullback-Leibler distance between the
posterior and prior distributions as a utility leads to what are
called $D$--optimal designs. For GPs with correlation matrix
$\mb{K}$, this is equivalent to maximizing det$(\mb{K})$. Choosing
quadratic loss leads to what are called $A-$optimal designs. An
excellent review of Bayesian approaches to the design of experiments
is provided by Chaloner \& Verdinelli~\cite{chaloner:1995}. Other
approaches used by the statistics community include space-filling
designs: e.g. max-min distance and Latin Hypercube (LH) designs
\cite{sant:will:notz:2003}. The {\tt FIELDS} package
\cite{fields:2004} implements space-filling designs along side kriging
and thin plate spline models.
A hybrid approach to designing experiments employs active learning
techniques. The idea is to choose a set of candidate input
configurations $\tilde{\mb{X}}$ (say, a $D-$optimal or LH design) and
a rule for determining which $\tilde{\mb{x}}\in \tilde{\mb{X}}$ to
add into the design next. The ALM algorithm has been shown to
approximate maximum expected information designs by choosing
$\tilde{\mathbf{x}}$ with the the largest predictive variance
\cite{mackay:1992}. The ALC algorithm selects $\tilde{\mathbf{x}}$
minimizing the reduction in squared error averaged over the input
space \cite{cohn:1996}. Seo et al.~\cite{seo00} provide a comparison
between ALC and ALM using standard GPs. The EI
\cite{jones:schonlau:welch:1998} algorithm can be used to find global
minima.
Choosing candidate configurations $\tilde{\mb{X}}$ ({\tt XX} in the
{\tt tgp} package), at which to gather ALM, ALC, or EI statistics, is
a significant component in the hybrid approach to experimental design.
Candidates which are are well-spaced relative to themselves, and
relative to already sampled configurations, are clearly preferred.
Towards this end, a sequential $D$--optimal design is a good first
choice, but has a number of drawbacks. $D$--optimal designs are based
require a {\em known} parameterization, and are thus not well-suited
to MCMC inference. They may not choose candidates in the
``interesting'' part of the input space, because sampling is high
there already. They are ill-suited partition models wherein
``closeness'' may not measured homogeneously across the input space.
Finally, they are computationally costly, requiring many repeated
determinant calculations for (possibly) large covariance matrices.
One possible solution to both computational and nonstationary modeling
issues is to use treed sequential $D$--optimal design
\cite{gra:lee:2009}, where separate sequential $D$--optimal designs
are computed in each of the partitions depicted by the maximum {\em a
posteriori} (MAP) tree $\hat{\mathcal{T}}$. The number of
candidates selected from each region can be proportional to the volume
of---or to the number of grid locations in---the region. MAP
parameters $\hat{\bm{\theta}}_\nu|\hat{\mathcal{T}}$, or ``neutral''
or ``exploration encouraging'' ones, can be used to create the
candidate design---a common practice \cite{sant:will:notz:2003}. Small
range parameters, for learning about the wiggliness of the response,
and a modest nugget parameter, for numerical stability, tend to work
well together.
Finding a local maxima is generally sufficient to get well-spaced
candidates. The {\tt dopt.gp} function uses a stochastic ascent
algorithm to find local maxima without calculating too many
determinants. This works work well with ALM and ALC. However, it is
less than ideal for EI as will be illustrated in Section \ref{sec:as}.
Adaptive sampling from EI (with {\tt tgp}) is still an open area of
research.
\section{Examples using {\tt tgp}}
\label{sec:examples}
The following subsections take the reader through a series of examples
based, mostly, on synthetic data. At least two different {\tt b*}
models are fit for each set of data, offering comparisons and
contrasts. Duplicating these examples in your own {\sf R} session is
highly recommended. The {\tt Stangle} function can help extract
executable {\sf R} code from this document. For example, the code for
the exponential data of Section \ref{sec:exp} can be extracted with
one command.
\begin{verbatim}
> Stangle(vignette("exp", package="tgp")$file)
\end{verbatim}
\noindent This will write a file called ``exp.R''. Additionally, each
of the subsections that follow is available as an {\sf R} demo. Try
{\tt demo(package="tgp")} for a listing of available demos. To invoke
the demo for the exponential data of Section \ref{sec:exp} try {\tt
demo(exp, package="tgp")}. This is equivalent to {\tt
source("exp.R")} because the demos were created using {\tt Stangle}
on the source files of this document.
\footnote{Note that this vignette functionality is only supported in
{\tt tgp} version $<2.x$. In 2.x and later the vignettes were
coalesced in order to reduce clutter.
The demos in 2.x, however, still correspond to their respective sections.}
Each subsection (or subsection of the appendix) starts by seeding the
random number generator with \verb!set.seed(0)!. This is done to
make the results and analyses reproducible within this document, and
in demo form. I recommend you try these examples with different seeds
and see what happens. Usually the results will be similar, but
sometimes (especially when the data ({\tt X, Z}) is generated
randomly) they may be quite different.
Other successful uses of the methods in this package include
applications to the Boston housing data \cite{harrison:78,
gra:lee:2008}, and designing an experiment for a reusable NASA
launch vehicle \cite{glm:04,gra:lee:2009} called the Langely
glide-back booster (LGBB).
<>=
seed <- 0; set.seed(seed)
@
\subsection{1-d Linear data}
\label{sec:ex:1dlinear}
Consider data sampled from a linear model.
\begin{equation}
z_i = 1 + 2x_i + \epsilon_, \;\;\;\;\; \mbox{where} \;\;\;
\epsilon_i \stackrel{\mbox{\tiny iid}}{\sim} N(0,0.25^2)
\label{eq:linear:sim}
\end{equation}
The following {\sf R} code takes a sample $\{\mb{X}, \mb{Z}\}$ of size
$N=50$ from (\ref{eq:linear:sim}). It also chooses $N'=99$ evenly spaced
predictive locations $\tilde{\mb{X}} = \mbox{\tt XX}$.
<<>>=
# 1-d linear data input and predictive data
X <- seq(0,1,length=50) # inputs
XX <- seq(0,1,length=99) # predictive locations
Z <- 1 + 2*X + rnorm(length(X),sd=0.25) # responses
@
Using {\tt tgp} on this data with a Bayesian
hierarchical linear model goes as follows:
<<>>=
lin.blm <- blm(X=X, XX=XX, Z=Z)
@
\begin{figure}[ht!]
\centering
<