ICSKAT Tutorial

Ryan Sun



The ICSKAT package implements the Interval-Censored Sequence Kernel Association Test (ICSKAT), the interval-censored Burden test, and the ICSKAT-Optimal (ICSKATO) test for inference on sets of features (e.g. all the SNPs in a gene or pathway) in genetic association studies.

Interval-censored data are a type of failure time data (also known as time-to-event data) quite commonly found in modern genetic datasets including the UK Biobank (UKB) and St. Jude Lifetime Cohort Study (SJLIFE). Many people are more familiar with right-censored data, which are a special case of interval-censored data. In right-censored data, an event time is either known to happen exactly or is only known to fall after some last observation time. In interval-censored data, an event time is not known exactly but is only known to fall in some interval (e.g between 0 and 20, or between 19.5 and 22.3, or between 50 and infinity).

For example, in SJLIFE, childhood cancer subjects who have turned 18 and have survived at least five years past their initial cancer treatment are invited back at regular intervals for comprehensive, multi-day clinical checkups. These patients are at extremely high risk of many chronic conditions, and they suffer an average of three severe or life-threatening conditions by age 40. At the (free) visits, the patients undergo a variety of diagnostic tests across many organ systems. Many diseases, for instance bone mineral density deficiency, are only able to be found at these visits because patients cannot self-diagnose the conditions on their own at home. Thus, if a subject visited the hospital for check-ups at age 20, 25, and 30, and bone mineral density deficiency was found at the age 25 visit, the disease onset time is only known to fall between 20 and 25.

In UKB, some patients visit assessment centers multiple times, and each time they take a questionnaire asking about a variety of health outcomes. For example, one question asks whether they have had any fractures. If the patients answers “no” at the first visit and “yes” at the second visit, then the fracture is only known to have occured between the first and second visits.

There are some ad-hoc ways to work with interval-censored data, including dichotomizing the occurrence of the event and working with binary outcomes (e.g. logistic regression) or imputing the occurrence time to fall in the middle of the interval and applying right-censored survival analysis methodology. However applying interval-censored methodology to interval-censored data often results in better operating characteristics (e.g. better control of Type I error rate or more power).


There are three steps to running the ICSKAT test with this package:

  1. First you need to generate the design matrices for the null model in the expected format. You only need to do this once for all SNP-sets to be tested. The function is make_IC_dmat(), and it takes the following arguments:
    • xMat is the n*p matrix of non-genetic covariates (not including an intercept)
    • lt is the n*1 vector of times for the left side of the interval. If an observation is left-censored, you can just put in 0.
    • rt is the n*1 vector of times for the right side of the interval. If an observation is right-censored, you can put in Inf or any numeric value.
    • obs_ind is the n*1 indicator vector for whether the observation is right-censored. If so, there should be a 0, otherwise there should be a 1.
    • tpos_ind is the n*1 indicator vector of whether the observation is left-censored. If so, there should be a 0, otherwise there should be 1.
    • quant_r can be used to pass in knot locations for the spline, we suggest not specifying this and letting the package work automatically.
    • nKnots is a scalar number of interior knots. The default of 1 will create three total knots, one interior and two endpoint knots.
  2. Next you need to fit the null model. You only need to do this once for all SNP-sets to be tested. The call is ICSKAT_fit_null(), and you need the following arguments:
    • init_beta is a vector holding the initial guess at the covariates for each column in the design matrices generated by make_IC_dmat(). The number of elements should be equal to the number of columns in the design matrices. Usually you can just initialize it to a vector of 0s or 1s. If you happen to have a good idea of what the coefficients are, then this will speed up convergence.
    • left_dmat is output directly from make_IC_dmat().
    • right_dmat is output directly from make_IC_dmat
    • obs_ind, tpos_ind, lt, and rt have been covered above.
  3. Finally, call ICskat() for each set of SNPs that you want to test. The gMat argument should be the n*q matrix of genotypes. You will also need left_dmat, right_dmat, lt, rt, obs_ind, and tpos_ind, which have been covered above. Finally, you will need null_beta and Itt, which come directly from ICSKAT_fit_null, see below for an example. You can also call ICSKATO() to run the ICSKATO test. This function takes the output form ICskat() as the only argument.

Worked Example

Suppose we are interested in testing whether a specific gene is associated with time to bone mineral density deficiency. We will simulate event times for 10,000 subjects under a proportional hazards model with baseline cumulative hazard function H(t)=t. We will set four observations times at times 1, 2, 3, and 4, with each subject’s exact visit times drawn from a Uniform(-0.25, 0.25) distribution surrounding these times. Each subject will have a 10% chance of missing any given visit. Our genetic data will consist of 50 SNPs in the gene of interest, and for each patient we have their minor allele count (0,1,2) at each of the 50 SNPs. Additionally suppose have non-genetic covariates for each subject’s gender and whether they take daily vitamins (both binary):

n <- 10^4
q <- 50

# generate data
# all SNPs have minor allele frequency 0.3 in this toy example
gMat <- matrix(data=rbinom(n=n*q, size=2, prob=0.3), nrow=n)
# gender and whether they take daily vitamins
xMat <- matrix(data=rbinom(n=n*2, size=1, prob=0.5), nrow=n)
# the baseline cumulative hazard function
bhFunInv <- function(x) {x}
# observation times
obsTimes <- 1:4
# no effect of either gender or daily vitamins
etaVec <- rep(0, n)

# generate data
outcomeDat <- gen_IC_data(bhFunInv = bhFunInv, obsTimes = obsTimes, windowHalf = 0.25,
                          probMiss = 0.1, etaVec = etaVec)
lt <- outcomeDat$leftTimes
rt <- outcomeDat$rightTimes
# indicators of left- and right-censoring
tpos_ind <- as.numeric(lt > 0)
obs_ind <- as.numeric(rt != Inf)

# make design matrix with cubic spline terms
dmats <- make_IC_dmat(xMat, lt, rt, obs_ind, tpos_ind)
# fit null model - only need to do this once for each genetic set (note there is no information
# on the SNPs used here)
nullFit <- ICSKAT_fit_null(init_beta = rep(0, 5), left_dmat = dmats$left_dmat, right_dmat=dmats$right_dmat, 
                           obs_ind = obs_ind, tpos_ind = tpos_ind, lt = lt, rt = rt)
# perform the ICSKAT and Burden tests
icskatOut <- ICskat(left_dmat = dmats$left_dmat, right_dmat=dmats$right_dmat, lt = lt, rt = rt,
       obs_ind = obs_ind, tpos_ind = tpos_ind, gMat = gMat, null_beta = as.numeric(nullFit$beta_fit), Itt = nullFit$Itt)
## [1] 0.6011683
## [1] 0.256489
# perform the ICSKATO test
ICSKATO(icskatOut = icskatOut)$pval
## [1] 0.4500931

If we want to test another SNP set (e.g. another gene), we don’t need to run the make_IC_dmat or ICSKAT_fit_null functions again. Just run ICskat() on the new genotype matrix.

# another gene with 100 SNPs in it
gMat2 <- matrix(data=rbinom(n=n*100, size=2, prob=0.3), nrow=n)

# we don't need to run the make_IC_dmat or ICSKAT_fit_null functions again as long
# as the event times and non-genetic covariates haven't changed.
icskatOut2 <- ICskat(left_dmat = dmats$left_dmat, right_dmat=dmats$right_dmat, lt = lt, rt = rt,
       obs_ind = obs_ind, tpos_ind = tpos_ind, gMat = gMat2, null_beta = as.numeric(nullFit$beta_fit), Itt = nullFit$Itt)
## [1] 0.8072349
## [1] 0.4946116
# perform the ICSKATO test
ICSKATO(icskatOut = icskatOut2)$pval
## [1] 0.7827799

Questions or novel applications? Please let me know! Contact information can be found in the package description.