This Document describes the computational background and the use of the icc() function from the Agree package. We developed the icc() functions for this package in connection with a simulation study about sample size requirements for studies on reliability and measurement error Mokkink, de Vet, and Eekhout (2022a) and a methodological paper about how to design and conduct a study on reliability and measurement error Mokkink, de Vet, and Eekhout (2022b).

library(Agree)
> 
> Attaching package: 'Agree'
> The following object is masked from 'package:base':
> 
>     kappa

Data example

Breast

The intra-class agreement is usually obtained for continuous ratings. As an example we can use data from data study by Dikmans et al. (2017). This data is based on photographs of breasts of 50 women after breast reconstruction. The photographs are independently scored by 5 surgeons, the patients, and three mammography nurses. They each rated the quality of the reconstruction on a 5 point ordinal scale with the verbal anchors on the left side ‘very dissatisfied’ on the left end and on the right end ‘very satisfied’ on the right end. They specifically rated the volume, shape, symmetry, scars and nipple. For the icc examples we can use the sum scores for volume, shape, symmetry, scars and nipple as an overall rating from each rater.

breast_scores <- 
Agree::breast %>%
  dplyr::select(Patient_score, PCH1_score, PCH2_score, PCH3_score, PCH4_score, 
                PCH5_score, Mam1_score, Mam2_score, Mam3_score)

head(breast_scores)
>   Patient_score PCH1_score PCH2_score PCH3_score PCH4_score PCH5_score
> 1          10.0          9          9          8        7.0          8
> 2            NA          7         NA         NA        8.0          7
> 3           5.5          8          7          7        7.0          8
> 4           7.0         NA          5          4        4.0          8
> 5           1.0          1          2          2        3.0          1
> 6           8.0          8          7         NA        7.5          8
>   Mam1_score Mam2_score Mam3_score
> 1          8          8          8
> 2          9          7         NA
> 3          8         NA          7
> 4         NA          5         NA
> 5          3          3          4
> 6          9          7          7

The example data shows missings. The icc function can deal with these missings, because a mixed model is used to estimate the variances to compute the icc with.

For a mixed model, the data needs to be restructured to a long format. we can use the pivot_longer() function from the tidyr package to do that:

breast_long <- breast_scores %>%
 mutate(id = 1:nrow(breast_scores)) %>% #add id column
  pivot_longer(cols = -id, names_to = "rater", values_to = "score")

breast_long
> 
[38;5;246m# A tibble: 450 x 3
[39m
>       id rater         score
>    
[3m
[38;5;246m<int>
[39m
[23m 
[3m
[38;5;246m<chr>
[39m
[23m         
[3m
[38;5;246m<dbl>
[39m
[23m
> 
[38;5;250m 1
[39m     1 Patient_score    10
> 
[38;5;250m 2
[39m     1 PCH1_score        9
> 
[38;5;250m 3
[39m     1 PCH2_score        9
> 
[38;5;250m 4
[39m     1 PCH3_score        8
> 
[38;5;250m 5
[39m     1 PCH4_score        7
> 
[38;5;250m 6
[39m     1 PCH5_score        8
> 
[38;5;250m 7
[39m     1 Mam1_score        8
> 
[38;5;250m 8
[39m     1 Mam2_score        8
> 
[38;5;250m 9
[39m     1 Mam3_score        8
> 
[38;5;250m10
[39m     2 Patient_score    
[31mNA
[39m
> 
[38;5;246m# ... with 440 more rows
[39m
> 
[38;5;246m# i Use `print(n = ...)` to see more rows
[39m

Variance components

The variances that are used to compute the icc are obtained from a linear mixed model. This model is estimated with the lmer() function from the lme4 package. In the breast example we have two levels: patients are level 1 and raters/observers are level 2. The two-level multilevel model is defined as \(Y_{ijr} = \beta_0 + b_{0j} + b_{0r} + \epsilon_{ijr}\), where \(b_{0j}\) is the random intercept at the subject level and \(b_{0r}\) the random intercept at the rater/observer level. The \(\epsilon_{ijr}\) is the residual error. The r-code for the model in lme4 is: lmer(score ~ (1|id) + (1|observer), data, REML = T)

De exact specification of the multilevel model, depends on the design of the study and the type of ICC that one wants to compute.

Types of ICC

There are three types of icc incorporated in the icc function. The ICC oneway, ICC agreement and the ICC consistency.

ICC oneway

The ICC type oneway is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)) and the residual variance (\(\sigma^2_{\epsilon}\)). The \(ICC_{oneway}\) is computed as follows:

\(ICC_{oneway} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}\)

The ICC oneway assumes that each subject is rated by a different set of raters, that are randomly selected from a larger population of judges (Shrout and Fleis (1979)). The icc_oneway() uses the varcomp() function to compute the variance components. These variances are estimated from a lmer model with random slope for the subjects. \(Y_{ij} = \beta_0 + b_{0j} + \epsilon_{ij}\)

The standard error of measurement (\(sem\)) is the square root of the error variance (i.e. \(sem = \sigma^2_{\epsilon}\)). The confidence intervals are computed with the exact F method. \(F = \frac{k \sigma^2_{j} + \sigma^2_{\epsilon}}{\sigma^2_{\epsilon}}\), with \(df1 = n - 1\) and \(df2 = n (k - 1)\) (Shrout and Fleis (1979)).

For the oneway ICC, only the level 1, the patient level, is random. The rater variance is not used.

The r-code to extract the variance component with the varcomp function is:

varcomp(score~(1|id), data = breast_long)
>               grp     vcov
> id             id 2.051072
> Residual Residual 1.047193

The variance components, can be used to compute the ICC oneway:

vc <- varcomp(score~(1|id), data = breast_long)
vc["id", "vcov"]/sum(vc[,"vcov"])
> [1] 0.6620067

We have also incorporated a function that computed the ICC oneway directly from the data in the wide format, using the same steps. This is the icc_oneway function.

icc_oneway(breast_scores)
>         icc     L_icc     U_icc      sem varj_oneway varerr_oneway
> 1 0.6620067 0.5638296 0.7598147 1.023324    2.051072      1.047193

ICC agreement

The icc type agreement is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)), rater variance (\(\sigma^2_k\)) and the residual variance (\(\sigma^2_\epsilon\)). The \(ICC_{agreement}\) is computed as follows:

\(ICC_{agreement} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_k + \sigma^2_{\epsilon}}\)

The ICC for agreement generalizes to other raters within a population (Shrout and Fleis (1979)). All subjects are rated by the same set of raters, and the rater variance is taken into account in the calculation of the ICC. The variance components are computed with the icc_model() function. This is a lmer model with a random slope for the subjects and for the raters. The \(sem\) is the square root of the sum of the rater variance and the error variance (i.e. \(sem = \sqrt{\sigma^2_r + \sigma^2_\epsilon}\)). The confidence intervals are approximated to account for the three independent variance components, as defined by Satterthwaite (1946) & Shrout and Fleis (1979).

For the ICC for agreement, both the level 1 and level 2 are random.

The r-code to extract the variance component with the varcomp function is:

varcomp(score ~ (1|id) + (1|rater), data = breast_long)
>               grp      vcov
> id             id 2.0069835
> rater       rater 0.1167749
> Residual Residual 0.9424505

The variance components, can be used to compute the ICC agreement:

vc <- varcomp(score~ (1|id) + (1|rater), data = breast_long)
vc["id", "vcov"]/sum(vc[,"vcov"])
> [1] 0.6545488

We have also incorporated a function that computed the ICC for agreement directly from the data in the wide format, using the same steps. This is the icc_agreement function.

icc_agreement(breast_scores)
>         icc     L_icc   U_icc      sem varj_agr  varr_agr varerr_agr
> 1 0.6545488 0.5554292 0.75384 1.029187 2.006984 0.1167749  0.9424505

ICC consistency

The ICC type consistency is the variance between the subjects (\(\sigma^2_j\)) divided by the sum of the subject variance (\(\sigma^2_j\)) and the residual variance (\(\sigma^2_\epsilon\)). The rater variance is not used to calculate the ICC and can therefore also be considered as a fixed effect. The \(ICC_{consistency}\) is computed as follows:

\(ICC_{consistency} = \frac{\sigma^2_j}{\sigma^2_j + \sigma^2_{\epsilon}}\)

The ICC for consistency generalizes only to the set of raters in the data (Shrout and Fleis (1979)). The varcomp() function is used to compute the variance components. These variances are computed from a a lmer model with a random slope for the subjects and a fixed effect for the raters. The sem is the square root of the error variance. The confidence are computed with the exact F method. \(F = \frac{(k \sigma^2_j + \sigma^2_\epsilon)}{\sigma^2_\epsilon}\), with \(df1 = n - 1\) and \(df2 = (n - 1) (k - 1)\) (Shrout and Fleis (1979)).

For the ICC for consistency, the level 1 is a random effect and the level 2 is fixed.

The r-code to extract the variance component with the varcomp function is:

varcomp(score ~ (1|id) + rater, data = breast_long)
>               grp      vcov
> id             id 1.9956492
> Residual Residual 0.9428479

The variance components, can be used to compute the ICC consistency:

vc <- varcomp(score~ (1|id) + rater, data = breast_long)
vc["id", "vcov"]/sum(vc[,"vcov"])
> [1] 0.6791394

We have also incorporated a function that computed the ICC consistency directly from the data in the wide format, using the same steps. This is the icc_consistency function.

icc_consistency(breast_scores)
>         icc     L_icc     U_icc       sem varj_cons varerr_cons
> 1 0.6791394 0.5831551 0.7734836 0.9710035  1.995649   0.9428479

Comparing ICC types

We have one general icc function that computes all three ICC types for a data set. The differences in computations between the ICC types can quickly be seen in the variance components returned by the icc function. We can obtain the variances by using var = TRUE in the icc() function, the var_level2 shows the variance between the raters. Only for the ICC for agreement this variance component is estimated.

# ICC for all methods
icc(breast_scores, var = TRUE)
>                   icc     lower     upper       sem var_level1 var_level2
> oneway      0.6620067 0.5638296 0.7598147 1.0233245   2.051072         NA
> agreement   0.6545488 0.5554292 0.7538400 1.0291868   2.006984  0.1167749
> consistency 0.6791394 0.5831551 0.7734836 0.9710035   1.995649         NA
>             var_Residual
> oneway         1.0471930
> agreement      0.9424505
> consistency    0.9428479

When we estimate the ICC for the surgeons only, we can see that the variance at the rater level is decreased. This effect is directly shown in the ICC.

In the icc we can also use the data in wide format and use the cols option to define the rater columns that we want to use.

# ICC for all methods
icc(breast_scores, 
    cols = c("PCH1_score", "PCH2_score", "PCH3_score", "PCH4_score", "PCH5_score"), 
    var = TRUE)
>                   icc     lower     upper       sem var_level1 var_level2
> oneway      0.7615871 0.6710817 0.8403886 0.8591071   2.357677         NA
> agreement   0.7593693 0.6682981 0.8387870 0.8611481   2.340225 0.05026114
> consistency 0.7711953 0.6829366 0.8473933 0.8317713   2.331886         NA
>             var_Residual
> oneway         0.7380650
> agreement      0.6913149
> consistency    0.6918435

When we estimate the ICC for the mammography nurses only, we see that the variance at the rater level is increased. This effect is directly shown in the ICC.

# ICC for all methods
icc(breast_scores, 
    cols = c("Mam1_score", "Mam2_score", "Mam3_score"), 
    var = TRUE)
>                   icc     lower     upper       sem var_level1 var_level2
> oneway      0.6001478 0.4493754 0.7309390 1.1515241   1.990237         NA
> agreement   0.5698257 0.4137129 0.7079019 1.1919337   1.881923  0.4443499
> consistency 0.6558989 0.5162467 0.7724701 0.9899547   1.868020         NA
>             var_Residual
> oneway         1.3260076
> agreement      0.9763561
> consistency    0.9800104

Overview technical terms

Term Description
\(\beta_0\) Fixed intercept
\(b_{0j}\) Random intercept at subject level
\(b_{0r}\) Random intercept at rater level
\(\epsilon_{ijr}\) Residual error
\(\sigma_j\) Variance between subjects
\(\sigma_r\) Variance between raters
\(\sigma_\epsilon\) Residual error variance
\(k\) Number of raters/observers
\(n\) Number of subjects

References

Dikmans, R. E., L. Nene, M. B. Bouman, H. C. W. de Vet, M. Mireau, M. E. Buncamper, H. Winters, M. Ritt, and M. G. Mullender. 2017. The Aesthetic Items Scale: A Tool for the Evaluation of Aesthetic Outcome After Breast Reconstruction.” Plastic and Reconstructive Surgery. Global Open. 5 (3): e1254.
Mokkink, L., H. C. W. de Vet, and I Eekhout. 2022a. Titel Paper 1.” Tbd.
———. 2022b. Titel Paper 2.” Tbd.
Satterthwaite, F. E. 1946. An Approximate Distribution of Estimates of Variance Components.” Biometrics 2: 110–14.
Shrout, P. E., and J. L. Fleis. 1979. Intraclass Correlations: Uses in Assessing Rater Reliability.” Psychologiclal Bulletin 87 (2): 420–28.