This tutorial shows a step-by-step analysis of toy data:

- Step 1: Load the R package ‘wcox’ and the toy data.
- Step 2: Acquire population incidence rate in the correct format.
- Step 3: Prepare the sample data using Prepare_data().
- Step 4: Calculate weights using Calculate_weights().
- Step 5: Fit a weighted Cox model using R package ‘survival’.

Let’s start!

Load the package.
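Loading follows the usual pattern for an installed R package:

```
# Load the wcox package (assumes it has been installed, e.g. via install.packages()).
library(wcox)
```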

Load the toy data set.
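Assuming the toy data set ‘fam_dat’ ships with the package, it can be loaded with data():

```
# Load the toy data set that comes with the package.
data("fam_dat", package = "wcox")
```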

Note that this is a simulated data set, not real observations. However, it is similar in structure to what one might come across in, for example, familial cancer studies.

```
# Show the first few lines of the toy data.
head(fam_dat)
#>    family_id           x event_indicator age individual_id
#> 8          1  0.33093247               1  38             1
#> 9          1  0.18551392               1  67             2
#> 10         1 -1.65904097               0 100             3
#> 11         1 -0.38816903               1  14             4
#> 12         1  0.03958956               1   2             5
#> 18         2  1.05115476               1   9             6
```

Families share the same ‘family_id’; ‘individual_id’ is unique per individual. Risk modifier ‘x’ is a continuous variable. The event indicator (‘event_indicator’) takes value 1 if the individual experienced the event during follow-up and 0 otherwise. We consider the follow-up time since birth, i.e. age. For individuals experiencing the event, ‘age’ is the time-to-event. For others, this is the time-to-censoring.

```
# The pipe (%>%) and grouping functions below come from 'dplyr'.
library(dplyr)
# Show the number of families.
cat("There are", unique(fam_dat$family_id) %>% length(), "families in the data set.")
#> There are 219 families in the data set.
# Examine family sizes.
fam_sizes <- fam_dat %>% group_by(family_id) %>% count()
cat("Family size is on average ",
    fam_sizes$n %>% mean() %>% round(1),
    " and ranges from ",
    fam_sizes$n %>% min(), " to ",
    fam_sizes$n %>% max(), ".", sep = "")
#> Family size is on average 4.9 and ranges from 2 to 7.
```

It is good practice to report the distribution of family size together with the analysis results.

Besides sample data, we need some information about the population in order to calculate weights that correct for ascertainment bias.

In the weight calculation *‘wcox’* uses the difference between
the incidence rate of the event of interest in the sample versus the
population. Therefore, the incidence rate in the population at risk is
needed.

The incidence rate is the number of events in a certain time window divided by the number of individuals at risk in the population.
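As a small illustration of this definition (with hypothetical numbers, not taken from any registry):

```
# Hypothetical example: 150 events among 520 000 individuals at risk in one age window.
events <- 150
at_risk <- 520000
# Scale to the per-100 000 format that the package expects.
rate_per_100k <- events / at_risk * 100000
rate_per_100k
#> [1] 28.84615
```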

Think of sentences like *“Breast cancer occurs in xx
out of 100 000 women between age xx and xx.”* For many (medical)
events, such data is available in national registries, for example the
Dutch cancer registry (https://iknl.nl/nkr/cijfers-op-maat, select crude rate
for incidence to get per 100 000). For high-risk populations of carriers
of certain pathogenic genetic variants, incidence rates are often
available in scientific publications or can be inferred based on
published hazard ratios multiplied by underlying general
population-based incidence rates.

Two aspects are important here. We need the incidence rate per 100
000 individuals, as that is what the package expects. And we need to
choose age groups. *‘wcox’* allows you to select the
standard 5-year or 10-year age groups or to define age groups
manually.

The population incidence needs to be a vector, i.e. a sequence
of incidence rates per age group. For entering the age group cut-offs:
*1)* We can specify them manually, where the vector ‘breaks’ needs
to be one item longer than the incidence rate vector because the end of
the last age group is also included. Intervals are of the form [begin,
end), which means that in our example below, we will only include individuals
with a follow-up time smaller than 100. *2)* We can choose
5-year categories (0-4, 5-9, 10-14 et cetera) or 10-year categories
(0-9, 10-19, 20-29 et cetera). Then, ‘wcox’ will define the cut-offs
automatically.

```
# Enter the incidence by age group per 100 000, starting with the youngest age group.
incidence_rate <- c(2882, 1766, 1367, 1155, 987, 845, 775, 798, 636, 650)
# Define the age group cutoffs.
breaks_manually <- c(0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100)
breaks_10yrs <- "10 years"
```

We are almost ready to calculate.

The function *Prepare_data()* expects a data.frame in which
each row concerns one individual (wide format). Moreover, it looks for
the individual identifier, event indicator and time-to-event/censoring
(age) by searching for variables named ‘id’, ‘d’ and ‘y’, respectively. In our
case the latter two variables are named differently, so we have to
rename them.
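For the toy data, one way to do this renaming is with ‘dplyr’ (a minimal sketch; the target names ‘id’, ‘d’ and ‘y’ are what Prepare_data() searches for, and ‘my_dat’ is the name used in the call below):

```
library(dplyr)
# Rename the toy data's columns to the names Prepare_data() expects;
# the family identifier 'family_id' is kept as-is.
my_dat <- fam_dat %>%
  rename(id = individual_id, d = event_indicator, y = age)
```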

Now, it’s time to combine the three pieces (sample data, external data, choice of age categories) to prepare the data for weight calculation. Remember that the calculation is based on the comparison between the sample data and the population and therefore we need to have the population incidence.

The function *Prepare_data()* takes three arguments: ‘dat’
inputs the sample data.frame with ‘id’, ‘d’, ‘y’ and a family
identifier; ‘population_incidence’ inputs the vector (i.e. a sequence of the
form c(xxx, xxx, xxx, …)) with incidence rates per 100 000; ‘breaks’
inputs either a vector with breaks or one of the pre-set options (“5
years” or “10 years”). The code below uses the manual breaks. To try out
the pre-set option for 10-year age groups, uncomment the last lines.

```
# Using option 1 (manual cut-offs):
my_dat_out <- Prepare_data(dat = my_dat, population_incidence = incidence_rate,
                           breaks = breaks_manually)
# Uncomment to use option 2 (pre-set cut-offs):
# my_dat_out <- Prepare_data(dat = my_dat, population_incidence = incidence_rate,
#                            breaks = "10 years")
```

Let’s see what the prepared data looks like.

```
# Select the newly added columns.
my_dat_out %>% select(id, k, d, S_k, S_k.) %>% arrange(id) %>% head(8)
#>   id        k d       S_k      S_k.
#> 1  1  (30,40] 1 0.8909206 0.9182261
#> 2  2  (60,70] 1 0.9254270 0.9634811
#> 3  3 (90,100] 0 0.9370675 0.9924243
#> 4  4  (10,20] 1 0.8381150 0.8331193
#> 5  5   (0,10] 1 0.7496117 0.6760318
#> 6  6   (0,10] 1 0.7496117 0.6760318
#> 7  7   (0,10] 1 0.7496117 0.6760318
#> 8  8   (0,10] 0 0.7496117 0.6760318
```

The weights will be calculated based on the newly added columns ‘S_k’ and ‘S_k.’.

In the paper, sample-based quantities have a superscript ‘0’, which is replaced by a dot (.) here; so ‘S_k.’ is based on the sample, while ‘S_k’ is based on the external information (incidence rate).

With the prepared data set, we are ready to calculate the weights!

Function *Calculate_weights()* requires the data set prepared
in the previous step, which we will refer to as
**prepared** data. N.B.: Using the original data set
directly will fail as the external information is not integrated
there.

```
# Calculate weights using the object that is output from the Prepare_data() function.
w <- Calculate_weights(dat = my_dat_out)
#> [1] "No negative weights"
```

The function indicates that there are no negative weights. This means that our weights are valid. We will have a look at them now.

```
# Show the weights.
my_dat_out %>% mutate(weight = w) %>%
select(id, d, weight) %>%
arrange(id) %>%
filter(id %in% c(1,3))
#>   id d   weight
#> 1  1 1 1.374799
#> 2  3 0 1.000000
```

Individual 1 (id = 1) experiences the event between age 30 and 39 (d = 1). The weight for the interval in which the event took place is 1, while other intervals get weighted by less than 1. Individual 3 never experiences the event during follow-up and is censored within interval 90-99 years.

Now, we show how to fit a Cox proportional hazards model with our
calculated weights to correct for ascertainment bias, using R package
‘survival’. Weights can be included using the argument ‘weights’. Note
that because we inputted the exact same data.frame in the function
*Calculate_weights()* in the previous step, i.e. the
**prepared** data, the resulting weight vector can be
directly used: the order of individuals is the same. The covariate of
interest is risk modifier ‘x’. In order to obtain robust standard
errors, the cluster term needs to be included.

```
# Fit the model.
fit <- coxph(Surv(y, d==1) ~ x + cluster(family_id), weights = w, data = my_dat_out)
fit
#> Call:
#> coxph(formula = Surv(y, d == 1) ~ x, data = my_dat_out, weights = w,
#> cluster = family_id)
#>
#>      coef exp(coef) se(coef) robust se    z      p
#> x 1.39863   4.04963  0.06048   0.06789 20.6 <2e-16
#>
#> Likelihood ratio test=600.2 on 1 df, p=< 2.2e-16
#> n= 1069, number of events= 631
```

What does this say about the effect of the risk modifier?

```
# Extract estimates.
fit.summary <- summary(fit)
# Summarize findings.
cat("Covariate effect (95% CI): ",
exp(fit.summary$coefficients[1]), " (",
exp(fit.summary$coefficients[1] - 1.96*fit.summary$coefficients[4]), ", ",
exp(fit.summary$coefficients[1] + 1.96*fit.summary$coefficients[4]), ").", sep = "")
#> Covariate effect (95% CI): 4.049629 (3.545061, 4.626013).
```

The risk of experiencing the event in the next instant of time is estimated to be about 4.0 times higher for a unit increase in the risk modifier. The corresponding 95% confidence interval does not include 1, so this positive association is significant (using alpha = 0.05).

Please cite the accompanying paper when using ‘wcox’.