`icc_varcomp.Rmd`

This Document describes the computational background and the use of
the `icc()`

function from the `Agree`

package. We
developed the `icc()`

functions for this package in
connection with a simulation study about sample size requirements for
studies on reliability and measurement error Mokkink, de Vet, and Eekhout (2022a) and a
methodological paper about how to design and conduct a study on
reliability and measurement error Mokkink, de
Vet, and Eekhout (2022b).

```
library(Agree)
>
> Attaching package: 'Agree'
> The following object is masked from 'package:base':
>
> kappa
```

The intra-class agreement is usually obtained for continuous ratings.
As an example we can use data from data study by Dikmans et al. (2017). This data is based on
photographs of breasts of 50 women after breast reconstruction. The
photographs are independently scored by 5 surgeons, the patients, and
three mammography nurses. They each rated the quality of the
reconstruction on a 5 point ordinal scale with the verbal anchors on the
left side ‘very dissatisfied’ on the left end and on the right end ‘very
satisfied’ on the right end. They specifically rated the volume, shape,
symmetry, scars and nipple. For the `icc`

examples we can use
the sum scores for volume, shape, symmetry, scars and nipple as an
overall rating from each rater.

```
<-
breast_scores ::breast %>%
Agree::select(Patient_score, PCH1_score, PCH2_score, PCH3_score, PCH4_score,
dplyr
PCH5_score, Mam1_score, Mam2_score, Mam3_score)
head(breast_scores)
> Patient_score PCH1_score PCH2_score PCH3_score PCH4_score PCH5_score
> 1 10.0 9 9 8 7.0 8
> 2 NA 7 NA NA 8.0 7
> 3 5.5 8 7 7 7.0 8
> 4 7.0 NA 5 4 4.0 8
> 5 1.0 1 2 2 3.0 1
> 6 8.0 8 7 NA 7.5 8
> Mam1_score Mam2_score Mam3_score
> 1 8 8 8
> 2 9 7 NA
> 3 8 NA 7
> 4 NA 5 NA
> 5 3 3 4
> 6 9 7 7
```

The example data shows missings. The `icc`

function can
deal with these missings, because a mixed model is used to estimate the
variances to compute the icc with.

For a mixed model, the data needs to be restructured to a long
format. we can use the `pivot_longer()`

function from the
`tidyr`

package to do that:

```
<- breast_scores %>%
breast_long mutate(id = 1:nrow(breast_scores)) %>% #add id column
pivot_longer(cols = -id, names_to = "rater", values_to = "score")
breast_long>
[38;5;246m# A tibble: 450 x 3
[39m
> id rater score
>
[3m
[38;5;246m<int>
[39m
[23m
[3m
[38;5;246m<chr>
[39m
[23m
[3m
[38;5;246m<dbl>
[39m
[23m
>
[38;5;250m 1
[39m 1 Patient_score 10
>
[38;5;250m 2
[39m 1 PCH1_score 9
>
[38;5;250m 3
[39m 1 PCH2_score 9
>
[38;5;250m 4
[39m 1 PCH3_score 8
>
[38;5;250m 5
[39m 1 PCH4_score 7
>
[38;5;250m 6
[39m 1 PCH5_score 8
>
[38;5;250m 7
[39m 1 Mam1_score 8
>
[38;5;250m 8
[39m 1 Mam2_score 8
>
[38;5;250m 9
[39m 1 Mam3_score 8
>
[38;5;250m10
[39m 2 Patient_score
[31mNA
[39m
>
[38;5;246m# ... with 440 more rows
[39m
>
[38;5;246m# i Use `print(n = ...)` to see more rows
[39m
```

The variances that are used to compute the `icc`

are
obtained from a linear mixed model. This model is estimated with the
`lmer()`

function from the `lme4`

package. In the
`breast`

example we have two levels: patients are level 1 and
raters/observers are level 2. The two-level multilevel model is defined
as \(Y_{ijr} = \beta_0 + b_{0j} + b_{0r} +
\epsilon_{ijr}\), where \(b_{0j}\) is the random intercept at the
subject level and \(b_{0r}\) the random
intercept at the rater/observer level. The \(\epsilon_{ijr}\) is the residual error. The
`r-code`

for the model in `lme4`

is:
`lmer(score ~ (1|id) + (1|observer), data, REML = T)`

De exact specification of the multilevel model, depends on the design of the study and the type of ICC that one wants to compute.

There are three types of icc incorporated in the `icc`

function. The ICC oneway, ICC agreement and the ICC consistency.

The ICC type oneway is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)) and the residual variance (\(\sigma^2_{\epsilon}\)). The \(ICC_{oneway}\) is computed as follows:

\(ICC_{oneway} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}\)

The ICC oneway assumes that each subject is rated by a different set
of raters, that are randomly selected from a larger population of judges
(Shrout and Fleis (1979)). The
`icc_oneway()`

uses the `varcomp()`

function to
compute the variance components. These variances are estimated from a
`lmer`

model with random slope for the subjects. \(Y_{ij} = \beta_0 + b_{0j} +
\epsilon_{ij}\)

The standard error of measurement (\(sem\)) is the square root of the error variance (i.e. \(sem = \sigma^2_{\epsilon}\)). The confidence intervals are computed with the exact F method. \(F = \frac{k \sigma^2_{j} + \sigma^2_{\epsilon}}{\sigma^2_{\epsilon}}\), with \(df1 = n - 1\) and \(df2 = n (k - 1)\) (Shrout and Fleis (1979)).

For the oneway ICC, only the level 1, the patient level, is random. The rater variance is not used.

The `r-code`

to extract the variance component with the
`varcomp`

function is:

```
varcomp(score~(1|id), data = breast_long)
> grp vcov
> id id 2.051072
> Residual Residual 1.047193
```

The variance components, can be used to compute the ICC oneway:

```
<- varcomp(score~(1|id), data = breast_long)
vc "id", "vcov"]/sum(vc[,"vcov"])
vc[> [1] 0.6620067
```

We have also incorporated a function that computed the ICC oneway
directly from the data in the wide format, using the same steps. This is
the `icc_oneway`

function.

```
icc_oneway(breast_scores)
> icc L_icc U_icc sem varj_oneway varerr_oneway
> 1 0.6620067 0.5638296 0.7598147 1.023324 2.051072 1.047193
```

The icc type agreement is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)), rater variance (\(\sigma^2_k\)) and the residual variance (\(\sigma^2_\epsilon\)). The \(ICC_{agreement}\) is computed as follows:

\(ICC_{agreement} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_k + \sigma^2_{\epsilon}}\)

The ICC for agreement generalizes to other raters within a population
(Shrout and Fleis (1979)). All subjects
are rated by the same set of raters, and the rater variance is taken
into account in the calculation of the ICC. The variance components are
computed with the `icc_model()`

function. This is a
`lmer`

model with a random slope for the subjects and for the
raters. The \(sem\) is the square root
of the sum of the rater variance and the error variance (i.e. \(sem = \sqrt{\sigma^2_r +
\sigma^2_\epsilon}\)). The confidence intervals are approximated
to account for the three independent variance components, as defined by
Satterthwaite (1946) & Shrout and Fleis (1979).

For the ICC for agreement, both the level 1 and level 2 are random.

The `r-code`

to extract the variance component with the
`varcomp`

function is:

```
varcomp(score ~ (1|id) + (1|rater), data = breast_long)
> grp vcov
> id id 2.0069835
> rater rater 0.1167749
> Residual Residual 0.9424505
```

The variance components, can be used to compute the ICC agreement:

```
<- varcomp(score~ (1|id) + (1|rater), data = breast_long)
vc "id", "vcov"]/sum(vc[,"vcov"])
vc[> [1] 0.6545488
```

We have also incorporated a function that computed the ICC for
agreement directly from the data in the wide format, using the same
steps. This is the `icc_agreement`

function.

```
icc_agreement(breast_scores)
> icc L_icc U_icc sem varj_agr varr_agr varerr_agr
> 1 0.6545488 0.5554292 0.75384 1.029187 2.006984 0.1167749 0.9424505
```

The ICC type consistency is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)) and the residual variance (\(\sigma^2_\epsilon\)). The rater variance is not used to calculate the ICC and can therefore also be considered as a fixed effect. The \(ICC_{consistency}\) is computed as follows:

\(ICC_{consistency} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}\)

The ICC for consistency generalizes only to the set of raters in the
data (Shrout and Fleis (1979)). The
`varcomp()`

function is used to compute the variance
components. These variances are computed from a a `lmer`

model with a random slope for the subjects and a fixed effect for the
raters. The sem is the square root of the error variance. The confidence
are computed with the exact F method. \(F =
\frac{(k \sigma^2_j + \sigma^2_\epsilon)}{\sigma^2_\epsilon}\),
with \(df1 = n - 1\) and \(df2 = (n - 1) (k - 1)\) (Shrout and Fleis (1979)).

For the ICC for consistency, the level 1 is a random effect and the level 2 is fixed.

The `r-code`

to extract the variance component with the
`varcomp`

function is:

```
varcomp(score ~ (1|id) + rater, data = breast_long)
> grp vcov
> id id 1.9956492
> Residual Residual 0.9428479
```

The variance components, can be used to compute the ICC consistency:

```
<- varcomp(score~ (1|id) + rater, data = breast_long)
vc "id", "vcov"]/sum(vc[,"vcov"])
vc[> [1] 0.6791394
```

We have also incorporated a function that computed the ICC
consistency directly from the data in the wide format, using the same
steps. This is the `icc_consistency`

function.

```
icc_consistency(breast_scores)
> icc L_icc U_icc sem varj_cons varerr_cons
> 1 0.6791394 0.5831551 0.7734836 0.9710035 1.995649 0.9428479
```

We have one general `icc`

function that computes all three
ICC types for a data set. The differences in computations between the
ICC types can quickly be seen in the variance components returned by the
`icc`

function. We can obtain the variances by using
`var = TRUE`

in the `icc()`

function, the
`var_level2`

shows the variance between the raters. Only for
the ICC for agreement this variance component is estimated.

```
# ICC for all methods
icc(breast_scores, var = TRUE)
> icc lower upper sem var_level1 var_level2
> oneway 0.6620067 0.5638296 0.7598147 1.0233245 2.051072 NA
> agreement 0.6545488 0.5554292 0.7538400 1.0291868 2.006984 0.1167749
> consistency 0.6791394 0.5831551 0.7734836 0.9710035 1.995649 NA
> var_Residual
> oneway 1.0471930
> agreement 0.9424505
> consistency 0.9428479
```

When we estimate the ICC for the surgeons only, we can see that the variance at the rater level is decreased. This effect is directly shown in the ICC.

In the icc we can also use the data in wide format and use the
`cols`

option to define the rater columns that we want to
use.

```
# ICC for all methods
icc(breast_scores,
cols = c("PCH1_score", "PCH2_score", "PCH3_score", "PCH4_score", "PCH5_score"),
var = TRUE)
> icc lower upper sem var_level1 var_level2
> oneway 0.7615871 0.6710817 0.8403886 0.8591071 2.357677 NA
> agreement 0.7593693 0.6682981 0.8387870 0.8611481 2.340225 0.05026114
> consistency 0.7711953 0.6829366 0.8473933 0.8317713 2.331886 NA
> var_Residual
> oneway 0.7380650
> agreement 0.6913149
> consistency 0.6918435
```

When we estimate the ICC for the mammography nurses only, we see that the variance at the rater level is increased. This effect is directly shown in the ICC.

```
# ICC for all methods
icc(breast_scores,
cols = c("Mam1_score", "Mam2_score", "Mam3_score"),
var = TRUE)
> icc lower upper sem var_level1 var_level2
> oneway 0.6001478 0.4493754 0.7309390 1.1515241 1.990237 NA
> agreement 0.5698257 0.4137129 0.7079019 1.1919337 1.881923 0.4443499
> consistency 0.6558989 0.5162467 0.7724701 0.9899547 1.868020 NA
> var_Residual
> oneway 1.3260076
> agreement 0.9763561
> consistency 0.9800104
```

Term | Description |
---|---|

\(\beta_0\) | Fixed intercept |

\(b_{0j}\) | Random intercept at subject level |

\(b_{0r}\) | Random intercept at rater level |

\(\epsilon_{ijr}\) | Residual error |

\(\sigma_j\) | Variance between subjects |

\(\sigma_r\) | Variance between raters |

\(\sigma_\epsilon\) | Residual error variance |

\(k\) | Number of raters/observers |

\(n\) | Number of subjects |

Dikmans, R. E., L. Nene, M. B. Bouman, H. C. W. de Vet, M. Mireau, M. E.
Buncamper, H. Winters, M. Ritt, and M. G. Mullender. 2017. “The Aesthetic Items Scale: A Tool for the Evaluation of
Aesthetic Outcome After Breast Reconstruction.” *Plastic
and Reconstructive Surgery. Global Open.* 5 (3): e1254.

Mokkink, L., H. C. W. de Vet, and I Eekhout. 2022a. “Titel Paper 1.” *Tbd*.

———. 2022b. “Titel Paper 2.” *Tbd*.

Satterthwaite, F. E. 1946. “An Approximate Distribution
of Estimates of Variance Components.” *Biometrics* 2:
110–14.

Shrout, P. E., and J. L. Fleis. 1979. “Intraclass
Correlations: Uses in Assessing Rater Reliability.”
*Psychologiclal Bulletin* 87 (2): 420–28.