Psychometric properties of the Kessler psychological scales in a Swiss young‐adult community sample indicate poor suitability for community screening for mental disorders

The Kessler psychological distress scales (K10 and K6) are used as screening tools to assess psychological distress related to the likely presence of a mental disorder. Thus, we studied the psychometric properties of their German versions in a Swiss community sample to evaluate their potential usefulness to screen for mental disorders or relevant mental problems in the community and, relatedly, for low threshold transdiagnostic German‐speaking services.


| INTRODUCTION
Mental disorders are a main source of illness-related costs and burden, affecting 38% of the European population (Trautmann et al., 2016). Frequently, help for mental disorders is only sought when they have become severe and functionally disabling, which contributes to high rates of negative outcomes and high burden (Boerema et al., 2017; Trautmann et al., 2016). Consequently, from a clinical and public health perspective, earlier detection and treatment of mental disorders is imperative (Trautmann et al., 2016). Therefore, valid, economical and easy-to-use screenings, which appropriately identify relevant mental problems/disorders in the community and primary health care, are of great public health importance (Trautmann et al., 2016; Webb et al., 2016).
Today, the 10-item Kessler psychological distress scale (K10; Kessler et al., 2003) and its six-item short version (K6; Kessler et al., 2002) are frequently used for such clinical screening as well as for monitoring of outcomes in primary mental health services (Cotton et al., 2021). In epidemiological research, K10/K6 are used to assess mental disorder caseness and non-specific psychological distress (Andrews & Slade, 2001; Cotton et al., 2021; Kessler et al., 2010).
Since K10 was first developed with a focus on disorders of the anxiety-depression spectrum, items mainly focus on signs of these.
Good psychometric properties of K10/K6, including good validity for mental health problems/disorders, have been reported from several countries (Ferro, 2019; Batterham et al., 2016; Sampasa-Kanyinga et al., 2018; Stolk et al., 2014; Thelin et al., 2017). However, validity was mostly assessed against self-report questionnaires or fully standardized lay-interviewer assessments and only rarely against clinical interviews conducted by mental health professionals (Furukawa et al., 2003; Sampasa-Kanyinga et al., 2018; Searle et al., 2017; Sunderland et al., 2011). Furthermore, despite their good performance in epidemiological studies, inconsistent evidence for the cultural appropriateness of K6/K10 in clinical settings and a lack of clinical norms for different countries indicate the importance of further research into their use in clinical settings (Shon, 2020; Stolk et al., 2014). With regard to German-speaking countries, to the best of our knowledge, only one Austrian study has evaluated the psychometric properties of the German translation of K10 against the Brief Symptom Inventory and State-Trait Anxiety Inventory in a sample of psychotherapeutic outpatients and medical students (Giesinger et al., 2008). It found K10 to be a suitable measurement of unspecific psychological distress in clinical settings (Giesinger et al., 2008). A validation of the German K10/K6 against the gold standard of clinician-assessed mental disorders in a less selected community sample, however, is still lacking.
Thus, we examined the psychometric properties of the German K10/K6 against clinician-assessed mental disorders in a young-adult community sample, thereby paying attention to the differential validity for different diagnostic categories. Given the scales' focus on depression and anxiety, we expected the best performance for depressive and anxiety disorders and problems.

| Study design and procedure
The sample consisted of 839 adults aged 19-45 years with main residency in the Swiss Canton Bern, who were assessed between 06/2015 and 03/2018 as part of the second wave (response rate: 66.4%) of the random-sampling Bern epidemiological at-risk (BEAR) community study and, at follow-up, oversampled for lifetime mental problems in terms of symptoms related to a clinical high risk of psychosis. Eligibility criteria included participation in the first wave (response rate: 63.4%) and agreement to be re-contacted given at baseline (provided by 97.9%) (Appendix S1). Five respondents had no data on K10/K6 due to study-conform early termination of the interview because of development of a psychotic disorder (Schultze-Lutter et al., 2021), and another five broke off the assessment. Thus, complete data sets of 829 cases were analysed (Table 1).
All participants provided informed verbal consent on the phone.
The Ethics Committee of the University of Bern had approved the study.

| Assessments
Items of K10/K6 are answered on a five-point Likert scale (Table S1), with total scores ranging from 10-50 and 6-30, respectively (Andrews & Slade, 2001; Kessler et al., 2002). The interpretation of total scores varies with the purpose of administration and the setting as well as between studies and cultures, with K10 ≥ 20 frequently being used as a threshold for a likely mental disorder (Table S2). In an extended version of K10, K10+ (Australian Government Department of Health, 2018), four add-on questions assess functioning and related factors (Table S1). We used the validated German translation by Giesinger et al. (2008).
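The scoring rule described above can be illustrated with a minimal sketch. It assumes items are coded 1 to 5, which follows from the stated total-score ranges, and it uses the K10 ≥ 20 threshold mentioned in the text. The function names and the example respondent are hypothetical and are not taken from the study.

```python
from typing import Sequence

# Threshold named in the text for a likely mental disorder on the K10.
K10_THRESHOLD = 20


def total_score(items: Sequence[int], n_items: int) -> int:
    """Sum Likert responses after checking item count and the assumed 1-5 coding."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not 1 <= x <= 5 for x in items):
        raise ValueError("each response must lie between 1 and 5")
    return sum(items)


def k10_likely_disorder(items: Sequence[int], threshold: int = K10_THRESHOLD) -> bool:
    """Return True when the K10 total reaches the chosen threshold."""
    return total_score(items, 10) >= threshold


# Hypothetical respondent, for illustration only.
responses = [2, 3, 1, 2, 2, 1, 3, 2, 2, 2]
print(total_score(responses, 10))      # 20
print(k10_likely_disorder(responses))  # True
```

Because the interpretation of total scores varies between settings, the threshold is kept as a parameter rather than fixed.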
Mental disorders were assessed by the mini international neuropsychiatric interview (M.I.N.I.), a brief semi-structured interview to reliably and validly assess mental disorders according to DSM-IV and ICD-10 (Sheehan et al., 1997, 1998). The M.I.N.I. uses a two-step procedure: (1) screening questions and (2) full interview of disorders with affirmed screening question. Presence of any subthreshold mental problem that signals a need of professional assessment and, consequently, help-seeking was assumed when a screening question was affirmed (Alexander et al., 2008). Because of the recently reported low clinical relevance of specific phobia in the community when not accompanied by another mental disorder (Sancassiani et al., 2019), specific phobia, which was the most frequent mental disorder in our sample (11.8%), was not considered in the analyses.
Psychosocial functioning was estimated using the social and occupational functioning assessment scale (SOFAS; American Psychiatric Association, 1994) that has good psychometric properties incl.

| Statistical analysis
Persons with and without a likely mental disorder according to K10 were compared using χ² tests for categorical data and Mann-Whitney U tests for rank data (K10/K6, SOFAS, EQ-5D-3L, BMLSS) and non-normally distributed ratio data (age).
Internal consistency was examined by Cronbach's alpha. Convergent validity was tested against M.I.N.I., and discriminant (or divergent) validity against BMLSS domains (except health domain) and sum score of the four somatic items of the EQ-5D-3L. Validity was assessed by Cohen's kappa (κ) for dichotomized variables, and intra-class correlation coefficients (ICC) for continuous variables using the one-way random-effects model from single measurement (ICC1,1) (Koo & Li, 2016). Prevalence indices (PI) were also calculated as κ tends to be underestimated in case of low- or high-prevalence outcomes, in which case correspondence rates (CR) give a better estimation (Burn & Weir, 2011). The global diagnostic accuracy of K10/K6 was examined by receiver-operating characteristic (ROC) analyses, whose areas under the curve (AUCs) were used to select optimal cut-offs. Thereby, emphasis was put on high sensitivity (≥70%) as the most important diagnostic feature of a screener (Michel, Schultze-Lutter, et al., 2014) while keeping specificity as high as possible. Additionally, positive and negative likelihood ratios (LR+, LR−) were calculated as conjoint estimations of sensitivity and specificity, i.e., of a test's ability to rule in or rule out a disorder (Jaeschke et al., 1994).

TABLE 1 Sociodemographic and clinical characteristics of participants with and without a K10 score ≥ 20

| General discriminative ability and optimal cut-offs
The general ability of K10/K6 to discriminate between individuals with and without mental problems was insufficient (AUC = 0.650) but excellent (AUC = 0.822) for mental disorders (Table 2). Overall, the K10 performed slightly better than the K6 (Table 2, Table S5).
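The ROC-based procedure named in the statistical analysis can be sketched as follows. This is an illustration under stated assumptions, not the software or exact procedure used in the study: the AUC is computed as the probability that a randomly chosen case scores higher than a randomly chosen non-case (ties counted as one half), and the cut-off rule picks, among all cut-offs with sensitivity of at least 70%, the one with the highest specificity. All function names and the toy data are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sens_spec(scores: Sequence[float], labels: Sequence[int],
              cutoff: float) -> Tuple[float, float]:
    """Sensitivity and specificity when 'score >= cutoff' counts as screen-positive."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= cutoff)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one non-case")
    return tp / (tp + fn), tn / (tn + fp)


def pick_cutoff(scores: Sequence[float], labels: Sequence[int],
                min_sensitivity: float = 0.70) -> Optional[float]:
    """Highest-specificity cut-off among those meeting the sensitivity floor."""
    best, best_spec = None, -1.0
    for cutoff in sorted(set(scores)):
        sens, spec = sens_spec(scores, labels, cutoff)
        if sens >= min_sensitivity and spec > best_spec:
            best, best_spec = cutoff, spec
    return best


# Toy data for illustration only (not study data).
toy_scores = [10, 11, 12, 14, 15, 18, 16, 19, 22, 25]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
print(round(auc(toy_scores, toy_labels), 3))
print(pick_cutoff(toy_scores, toy_labels))
```

The sensitivity floor is a parameter, so a rule that balances sensitivity and specificity instead would only require changing the selection criterion in `pick_cutoff`.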

| Convergent validity
Testing the accuracy of the K10 and K6 cut-offs suggested by Andrews and Slade (2001) and Kessler et al. (2003) revealed poor convergent validity (Table 3), as most κ indicated no to minimal agreement (Table S4). This was against the background of unfavourable PIs. Thus, as expected, CRs, ranging from 52.35% to 99.40%, indicated better convergent validity. However, in particular for K6, sensitivities were small, rarely >40% (Table 3).
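The gap between low κ and high CR at unfavourable prevalence can be illustrated with a small sketch on a 2×2 table. It rests on two assumptions that are not confirmed by the text: that CR denotes the overall proportion of agreeing classifications, and that the prevalence index is computed as (d − a)/n, a sign convention chosen here so that a high index corresponds to a low prevalence, as stated in the note to Table 4. Other sources use (a − d)/n or its absolute value. The counts are invented.

```python
from typing import NamedTuple


class Agreement(NamedTuple):
    kappa: float
    prevalence_index: float
    correspondence_rate: float


def agreement_2x2(a: int, b: int, c: int, d: int) -> Agreement:
    """Agreement statistics for a 2x2 table.

    a: screener positive, reference positive
    b: screener positive, reference negative
    c: screener negative, reference positive
    d: screener negative, reference negative
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty table")
    p_obs = (a + d) / n  # assumed meaning of the correspondence rate
    p_exp = ((a + b) * (a + c) + (c + d) * (b + d)) / (n * n)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    kappa = (p_obs - p_exp) / (1.0 - p_exp)
    prevalence_index = (d - a) / n  # assumed sign convention, see lead-in
    return Agreement(kappa, prevalence_index, p_obs)


# Invented counts: a rare outcome with mostly concordant negatives.
result = agreement_2x2(a=4, b=6, c=6, d=84)
print(result)
```

In this invented table, 88% of classifications agree, yet κ is only about 0.33 because most of the agreement is expected by chance when the outcome is rare.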
Using the lower cut-offs of the ROC analyses (Table 4)

| Discriminant validity
ICCs between the sum of the first four items of EQ-5D-3L on somatic problems and K10/K6 sum scores were −0.678 (p = 1.0) and −0.407 (p = 1.0), respectively. ICCs between the sum scores of K10/K6 and the sum score of the four non-health-related BMLSS domains were −0.942 (p = 1.0) for K10 and −0.964 (p = 1.0) for K6. These results indicated poor discriminant validity of both Kessler scales.
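For readers unfamiliar with the coefficient, ICC(1,1) from the one-way random-effects model can be sketched as below. This is the standard textbook formula, (MSB − MSW) / (MSB + (k − 1)·MSW), and is not necessarily identical to the software routine used in the study. The function name and the toy pairs are hypothetical.

```python
from typing import Sequence


def icc_1_1(rows: Sequence[Sequence[float]]) -> float:
    """ICC(1,1): one-way random effects, single measurement.

    Each row holds the k measurements of one participant.
    """
    n = len(rows)
    k = len(rows[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in rows):
        raise ValueError("need >= 2 participants with the same number (>= 2) of measurements")
    grand = sum(sum(r) for r in rows) / (n * k)
    means = [sum(r) / k for r in rows]
    # Between-participant and within-participant mean squares.
    ms_between = k * sum((m - grand) ** 2 for m in means) / (n - 1)
    ms_within = sum((x - m) ** 2 for r, m in zip(rows, means) for x in r) / (n * (k - 1))
    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ICC is undefined when all values are identical")
    return (ms_between - ms_within) / denom


# Toy pairs for illustration only: two sum scores on different numeric ranges.
toy_pairs = [(10, 1), (12, 2), (11, 1), (14, 3)]
print(round(icc_1_1(toy_pairs), 3))
```

In this formulation, values near or below zero arise when the differences between the two scores within a person are as large as, or larger than, the differences between persons. With two measurements per person, the lower bound is −1.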

| DISCUSSION
To enhance the early transdiagnostic detection of mental disorders at community and primary care level, easy-to-use screenings of good psychometric properties are required (Trautmann et al., 2016; Webb et al., 2016). One suggested screener is the originally English K10 and its short version, K6 (Cotton et al., 2021). Yet, validity of K10/K6 was reported to strongly depend on subtleties of wording and, thus, to vary between languages and countries (Stolk et al., 2014). Therefore, using clinician assessment with the M.I.N.I. as a gold-standard assessment for the first time in a German-speaking country, we examined the diagnostic validity of the German translation of K10/K6 for non-psychotic mental disorders/problems. Overall, both scales showed poor discriminant validity and, depending on the chosen threshold, failed either to sufficiently rule in or to sufficiently rule out mental disorders and relevant mental problems. Since their validity might be higher in clinical samples, i.e., among persons seeking help for mental problems, we reran the analyses separately for participants with and without lifetime help-seeking for mental disorders at points-of-contact other than family and friends (see Tables S6-S8). Although agreement was slightly better in help-seekers, in whom the K10 cut-off suggested by Andrews and Slade (2001) demonstrated weak agreement for mental disorders (κ = 0.483), mood problems (κ = 0.465) and anxiety disorders (κ = 0.456), sensitivities for these were still poor (≤57%). Thus, our results discourage the use of K10/K6 to screen transdiagnostically for mental disorders in the community or in community-based low-threshold adult services in German-speaking countries.
On the whole, this was also true for disorders/problems related to depression and anxiety, for whose assessment K10/K6 had been developed and for which good validity had been reported in earlier studies (Cairney et al., 2007; Ferro, 2019; Furukawa et al., 2003; Kessler et al., 2002, 2003; Sakurai et al., 2011; Sampasa-Kanyinga et al., 2018). Only the additional question of the K10+ on frequency partly exhibited acceptable diagnostic accuracy and validity in terms of CR.
Interestingly, our cut-offs for K10/K6 were much lower than the ones earlier generated for K10 by Andrews and Slade (2001), and for K6 by Kessler et al. (2003). This might be due to different rates of mental disorders. Our point-prevalence of 5.2% was lower than the 13.1% prevalence reported by Andrews and Slade (2001) and the 23.1% unweighted 12-month prevalence rate reported by Kessler et al. (2003). Yet, already at the cut-offs suggested by these earlier studies (Andrews & Slade, 2001; Kessler et al., 2003), sensitivities were lower than specificities (K10: 66% vs. 96%, and K6: 36% vs. 96%). These discrepancies between specificity and sensitivity were enlarged in our sample (K10: 40% vs. 98%, and K6: 5% vs. 100%), indicating a poor ability to identify true positive cases by these earlier suggested cut-offs in our sample. Additionally, differences in cut-offs likely resulted from the different emphasis put on sensitivity and specificity. While these were balanced by Andrews and Slade (2001) and Kessler et al. (2003), we put emphasis on sensitivity, because a screener, generally a first step in diagnosis, should not miss positive cases (i.e., possess excellent sensitivity), while ruling out as many negative cases as possible.
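To make the rule-in and rule-out reading of these figures concrete, the sketch below converts the sensitivity and specificity pairs quoted above into likelihood ratios, using LR+ = sensitivity / (1 − specificity) and LR− = (1 − sensitivity) / specificity. The inputs are the rounded percentages from the text, so the outputs are illustrative arithmetic and need not match the values in the study's tables. The function names are hypothetical.

```python
import math


def lr_pos(sensitivity: float, specificity: float) -> float:
    """Positive likelihood ratio; infinite when specificity is exactly 1."""
    if specificity >= 1.0:
        return math.inf
    return sensitivity / (1.0 - specificity)


def lr_neg(sensitivity: float, specificity: float) -> float:
    """Negative likelihood ratio; undefined when specificity is 0."""
    if specificity <= 0.0:
        raise ValueError("LR- is undefined when specificity is 0")
    return (1.0 - sensitivity) / specificity


# (sensitivity, specificity) pairs as quoted in the text, rounded percentages.
quoted = {
    "K10, earlier studies": (0.66, 0.96),
    "K6, earlier studies": (0.36, 0.96),
    "K10, present sample": (0.40, 0.98),
    "K6, present sample": (0.05, 1.00),
}

for label, (sens, spec) in quoted.items():
    print(f"{label}: LR+ = {lr_pos(sens, spec):.2f}, LR- = {lr_neg(sens, spec):.2f}")
```

With high specificity and low sensitivity, LR+ is large while LR− stays close to 1, which corresponds to a test that can rule in but hardly rule out a disorder at these cut-offs.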
Cultural impacts that lower affirmation on the items of the Kessler scales (Stolk et al., 2014) might be another reason for their poorer performance in our study. Significant variations in the form and symptomatic expression of mood and anxiety disorders, including obsessive compulsive and posttraumatic stress disorders, and somatization disorders were described across cultures (Kirmayer & Ryder, 2016). Thus, the selection of items in US populations assessed in the 1990s (Kessler et al., 2002) and their verbatim translation into German (Giesinger et al., 2008) Table S4.
TABLE 3 Agreement between K10/K6 scores dichotomized according to the cut-off scores suggested by Andrews and Slade (2001) for K10, and Kessler et al. (2003)

| Limitations and strengths
The study has some potential limitations. First, the Swiss community sample may not necessarily represent all German-speaking communities. However, Switzerland has the third largest community of German speakers. Second, psychotic disorders (n = 5) were not included. Third, the point-prevalence of mental disorders (5.2%) was rather low, with only low rates of some diagnoses (mania, GAD, dysthymia, anorexia and bulimia nervosa), which, however, were comparable to the rates of these disorders in the community (Ritchie & Roser, 2018). Fourth, before conducting the BEAR study, the validity of telephone assessments against face-to-face assessments had only been assured for clinical high-risk symptoms/criteria but not examined for M.I.N.I. diagnoses.
However, assessing mental disorders/symptoms over the phone has been considered sufficiently comparable to face-to-face interviews, with an added positive effect on the disclosure of personal/intimate data (Azad et al., 2021; Muskens et al., 2014; Smith et al., 2009; Zhang et al., 2017). Thus, the low prevalence rate of mental disorders is unlikely to have been caused by the telephone assessment, which, like face-to-face assessments, has to be considered a much more valid assessment of mental disorders/problems than the questionnaire assessments (Zhang et al., 2017) that were mainly used in studies on the Kessler scales' validity. Furthermore, the far higher frequency of mental problems (32.3%) did not substantially increase validity measures.
Despite the limitations, strengths of this study include validation of K6/10 against a reliable outcome, i.e., clinician-assessment of men-

| CONCLUSION
Given that integrated pre-clinical and primary care youth health services are becoming increasingly widespread around the world (Hetrick et al., 2017), reliable and valid screenings that accurately identify mental health problems/disorders are more important than ever to ensure effective and targeted intervention in high-risk populations. Our findings, however, suggest that K10/K6 should not be recommended for use as main screening tools without additional clinical assessment.
TABLE 4 Agreement between K10 and K6 (cut-off score of 10 and 6 based on ROC analyses) and M.
Note: The prevalence index takes values between −1 and 1 and is 0 when both responses are equally probable (i.e., their prevalence is 50%). A high prevalence index = a low prevalence rate; a low prevalence index = a high prevalence rate.
Although the focus of K10/K6 is mainly on depression and anxiety disorders, our findings illustrate that the screening performance for these most common mental disorders was nevertheless poor. The present findings speak to the vital need for validating a screening tool in both German-speaking community and clinical settings, as symptoms might also be expressed differently between these. Further validation studies are required to conclude whether K10/K6 make an adequate initial screening tool in community settings, but also in very low-threshold programs.