Notes
Article history
The research reported in this issue of the journal was funded by the HTA programme as project number 11/63/01. The contractual start date was in November 2016. The draft report began editorial review in February 2017 and was accepted for publication in April 2018. The authors have been wholly responsible for all data collection, analysis and interpretation, and for writing up their work. The HTA editors and publisher have tried to ensure the accuracy of the authors’ report and would like to thank the reviewers for their constructive comments on the draft document. However, they do not accept liability for damages or losses arising from material published in this report.
Declared competing interests of authors
Andrew Price reports personal fees from Zimmer Biomet, DePuy Synthes and Smith & Nephew plc, and grants from the National Institute for Health Research (NIHR) and Arthritis Research UK, outside the submitted work. Alastair Gray reports grants from NIHR, during the conduct of the study. Rachael Gooberman-Hill reports grants from the NIHR Health Services and Delivery Research programme for this work during the conduct of the study. Helen Dakin reports grants from NIHR during the conduct of the study and consultancy for Halyard Health outside the submitted work. David Beard reports grants from NIHR outside the submitted work. Jonathan Cook was a member of the NIHR Health Technology Assessment (HTA) Efficient Trial Designs Board (2014–16). Jill Dawson reports grants from the NIHR HTA programme during the conduct of the study and royalty payments from Oxford University Innovation (a university technology transfer company) outside the submitted work, and is one of the original developers of the Oxford Hip and Knee Scores. Raymond Fitzpatrick is one of the developers of the Oxford Hip and Knee Scores. Ashley Blom is the principal investigator in a research project funded by Stryker Corporation. Andrew Judge reports personal fees for consultancy from Anthera Pharmaceuticals, Inc., and Freshfields Bruckhaus Deringer LLP, outside the submitted work. Nigel Arden reports grants from Bioibérica and Novartis Pharmaceuticals UK Ltd, and personal fees from Bioventus, Flexion Therapeutics, Freshfields Bruckhaus Deringer LLP, Janssen Pharmaceutica, Merck & Co. Inc. and Regeneron Pharmaceuticals, Inc., outside the submitted work. David Murray reports grants from the NIHR HTA programme and grants and personal fees from Zimmer Biomet outside the submitted work.
Permissions
Copyright statement
© Queen’s Printer and Controller of HMSO 2019. This work was produced by Price et al. under the terms of a commissioning contract issued by the Secretary of State for Health and Social Care. This issue may be freely reproduced for the purposes of private research and study and extracts (or indeed, the full report) may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NIHR Journals Library, National Institute for Health Research, Evaluation, Trials and Studies Coordinating Centre, Alpha House, University of Southampton Science Park, Southampton SO16 7NS, UK.
Chapter 1 Introduction
Research questions specified in the National Institute for Health Research Health Technology Assessment research call
- Can clinical tools for assessment of a patient’s suitability for knee or hip replacement be used to set thresholds for an operation?
- How does the choice of threshold affect the cost-effectiveness of the procedure and subsequent improvements in patient quality of life?
Background
Hip and knee osteoarthritis is a common musculoskeletal condition causing significant pain and loss of function for patients. Using patient-reported outcome measures (PROMs), joint replacement treatment for end-stage disease has been shown to be an effective treatment. 1 Each year, 150,000 hip and knee replacements are carried out in the UK, with the majority of patients having successful outcomes. 2 However, the nationally collected patient-reported outcome data for hip and knee replacements have identified two striking issues with regard to the provision of joint replacement in the UK. First, there is marked variation in current clinical practice in referring and undertaking surgery in patients with arthritis of the hip and knee. 3 Previous studies from the UK support this observation, with recent evidence showing that access to joint replacement is currently inequitable, with deprived areas associated with greater symptom severity and lower surgery rates. 4–6 A previous large national survey of UK NHS patients undergoing joint surgery7 also concluded that there was no evidence that patients were being prioritised on the basis of the severity of their symptoms and function. Second, the national outcomes data have revealed that 10–15% of patients undergoing hip or knee joint replacement are not satisfied with their treatment, and these findings, particularly for the knee patients, are supported by other recent studies. 8,9 It has been suggested that selecting patients too early in their disease process may play a role in producing dissatisfaction with surgery. 9 Overall, these findings suggest that there is no standardisation to the process by which patients are assessed and selected for hip and knee replacement surgery. This is a particular concern given both the projected increased need for joint replacement over the next decade to accommodate an ageing population and the pressure of potential reductions in NHS funding. 10
Assessing patients for joint replacement surgery within the NHS is generally a two-stage process that begins with the patient presenting to a general practitioner (GP) with hip or knee pain (Figure 1). The assessment process usually takes place over a number of consultations, often including radiography to confirm osteoarthritis. The GP continues to monitor symptoms and responses to non-operative treatments, eventually deciding when the patient is a candidate for joint replacement surgery, and at this point the patient is referred to secondary care. Currently, there are no widely accepted guidelines within the NHS specifically to help health professionals estimate the level of patient symptoms that warrants referral. 11 In the secondary care setting, the specialist assesses the patient, confirms the diagnosis and radiographic severity, reviews the patient’s symptoms and shares information with them about available treatment options. For each patient, the decision to offer surgery requires a personalised assessment of individual preferences, expectations, functional limitations and requirements, degree of radiographic pathology, comorbidities and predicted outcome. 11 Ideally, the patient and their specialist then make a shared decision, with the patient ultimately deciding whether or not to undergo surgery. We estimate that there are around 1 million GP patient assessments for hip and knee pain each year, with around one-quarter of patients who present being referred to a specialist. Although only a small proportion of these patients receive joint replacement, this amounts to 150,000 procedures a year. 12,13 At the Nuffield Orthopaedic Centre (NOC), our data suggest that approximately 15% of patients who are referred from primary care for possible joint replacement do not have symptoms at a level that requires joint replacement and could have been safely managed without an appointment in secondary care.
FIGURE 1.
Patient pathway framework for the NHS to identify candidates for hip and knee replacement. The selected assessment score may also be used later in the pathway to measure the outcome of surgery and to offer surveillance for joint replacement post surgery. THA, total hip arthroplasty; TKA, total knee arthroplasty.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig1.png)
Given the issues of unwarranted variation and poor outcome in some patients, outlined above, there has been significant interest in trying to standardise the process of referral and selection for joint replacement. The use of certain ‘priority criteria’ (such as the Western Canada Waiting List score,14 the Ontario criteria15 or the New Zealand score16) has been investigated as a more consistent method of selecting patients for referral and treatment. These tools identify candidates for surgery in primary care and are based on estimating a patient’s capacity to benefit from surgery. They are generic and attempt to standardise the patient pathway for joint replacement at the entry point. The New Zealand priority criteria16 have been used in some regions within the NHS but have not reached widespread acceptance, and the current evidence of their reliability and validity is minimal. 17,18 Other tools have been developed but not fully tested in clinical practice within the UK. 19–21 The Osteoarthritis Research Society International (OARSI) Standing Committee for Clinical Trials Response Criteria Initiative and the Outcome Measures in Rheumatology (OMERACT) international initiative has attempted to deliver a standardised approach and has highlighted pain and disability as among the key domains for identifying the capacity to benefit. 22–25 It has therefore been a logical progression to investigate if existing assessment tools used in the joint replacement pathway, that measure pain and disability, could be used as a single score to identify candidates for surgery by referring their preoperative assessment score to a threshold for intervention.
In 2009, the Department of Health and Social Care introduced the routine collection of PROMs for hip and knee surgery to measure the outcome of surgery undertaken in NHS hospitals. 8,26 There has been government support for extending the use of scoring systems preoperatively to create thresholds for referral and candidacy for surgery. 27,28 In fact, many primary care trusts and NHS trusts have already introduced PROM-based severity score thresholds for surgery, although the thresholds used vary widely between regions. 29–36 However, evidence underpinning and endorsing the use of PROMs or any assessment score for thresholds is scant and without validation. This poses a significant risk to patients as an incorrectly set threshold may unfairly restrict access to care or, conversely, inappropriately select patients for joint replacement. 2 The development of a preoperative threshold score to identify candidates for hip and knee replacement offers a significant opportunity to standardise the patient pathway. However, this Health Technology Assessment (HTA) call reflects the pressing need within the NHS to produce evidence to support or refute their use.
A number of scoring systems are used to assess the patients in their care pathway. Many are PROM based, such as the Intermittent and Constant Osteoarthritis Pain Measure (ICOAP),37 the EuroQol-5 Dimensions (EQ-5D) and the Oxford Hip and Knee Scores,38–40 whereas others require a clinician’s involvement (e.g. the New Zealand score16). Some systems were designed to measure the burden of osteoarthritis symptoms [e.g. the Western Ontario and McMaster Universities Arthritis Index (WOMAC®) or ICOAP],41 whereas others were designed to measure the effect of an intervention (e.g. Oxford Hip and Knee Scores). Some scores were produced to measure more general aspects of health status [Short Form questionnaire-36 items (SF-36) or EQ-5D]42 and others aim to prioritise patients for surgery (e.g. the New Zealand priority criteria16). None of these scores has been developed for the specific role of applying thresholds for access to care for joint replacement within the setting of the NHS. It may be that one or more of these scores may be appropriate for such use but evidence is required to validate and justify this role.
To be fit for purpose as a screening device, any candidate score must satisfy a number of requirements.
First, the score must have adequate measurement properties to enable assessment of patients for joint replacement, namely adequate validity. This includes evidence of adequate reliability at an individual level (test–retest and intraclass correlation coefficient), precision [standard error (SE) of the measurement] and responsiveness to change [smallest detectable change and minimally clinically important difference (MCID)]. The effect of comorbidity on the score must also be established.
Second, valid evidence-based thresholds must be produced. The calculation of thresholds is not straightforward, with several different methods available. To generate upper thresholds (i.e. least severity) in preintervention scores, above which patients should not be considered candidates for arthroplasty, methods must account for the likelihood of a patient’s capacity to benefit (i.e. likelihood of achieving a positive change score) and perceive satisfactory improvement following surgery. 43 The measurement properties of the instrument (as described previously), such as the MCID (i.e. the smallest amount of change in a score that patients detect and consider important) and standard error of measurement (SEM) (which relates to the reliability of the instrument and denotes the amount of change that is ‘real’ and beyond measurement error), are also important operational considerations when calculating thresholds. Furthermore, any chosen threshold must distinguish between cases (patients in need of surgery) and non-cases with a consistent level of diagnostic accuracy (discriminative ability).
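As an illustration of the discriminative-ability requirement, the sketch below scores a candidate upper threshold by its sensitivity and specificity and picks the cut-off that maximises Youden's index (sensitivity + specificity − 1). This is only one of the several possible methods mentioned above and is not presented as the method used in this report. The data, the function names and the assumption that lower scores indicate greater severity are all hypothetical.

```python
def sensitivity_specificity(scores, is_case, threshold):
    """Classify score <= threshold as a candidate (assumes lower score = more severe)."""
    tp = sum(1 for s, c in zip(scores, is_case) if c and s <= threshold)
    fn = sum(1 for s, c in zip(scores, is_case) if c and s > threshold)
    tn = sum(1 for s, c in zip(scores, is_case) if not c and s > threshold)
    fp = sum(1 for s, c in zip(scores, is_case) if not c and s <= threshold)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one non-case")
    return tp / (tp + fn), tn / (tn + fp)


def best_threshold(scores, is_case):
    """Return (threshold, Youden's J) for the cut-off with the highest J."""
    best = None
    for t in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, is_case, t)
        j = sens + spec - 1
        if best is None or j > best[1]:  # ties keep the lowest threshold
            best = (t, j)
    return best


# Hypothetical preoperative scores and an external 'case' label.
scores = [10, 14, 18, 22, 26, 30, 34, 38]
is_case = [True, True, True, True, False, True, False, False]

threshold, youden_j = best_threshold(scores, is_case)
print(threshold, round(youden_j, 2))
```

Youden's index weights false positives and false negatives equally. A threshold intended to avoid unfairly restricting access to care might instead favour sensitivity.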
The process of calculating absolute thresholds will also produce additional and valuable information for patients who are found to be candidates for surgery. By highlighting an individual’s ‘chance’ of benefit following surgery (based on their preoperative score), patients are provided with key information to help with their decision-making, particularly in secondary care. It would provide evidence to support the use of a score embedded within the NHS direct knee/hip osteoarthritis decision aid. Hence, clearly highlighting the risks and benefits may make the decision to have surgery clearer for many patients. This type of information allows patients to more comprehensively participate in the decisions made about their care.
Third, we must understand how the introduction of thresholds for surgery affects the cost-effectiveness of the treatment. Lower-limb joint replacement has previously been shown to be highly cost-effective, costing between €1276 and €18,300 per quality-adjusted life-year (QALY) gained for the average patient,44–48 which is substantially lower than the £20,000–30,000 per QALY range that the National Institute for Health and Care Excellence (NICE) considers to be cost-effective for use within the NHS. 49 However, it is important for commissioners of hip and knee replacement surgery to understand how cost-effectiveness varies between patient and procedure subgroups, and how thresholds for hip and knee surgery affect the cost utility of the interventions. We have recently demonstrated the feasibility of this approach in a pilot study exploring the relationship between costs and improvements in EQ-5D utility and preoperative PROM scores in total knee arthroplasty (TKA). 50
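The comparison against the NICE range can be sketched as a cost-per-QALY calculation. The example below is illustrative only: the incremental costs and QALY gains for the two subgroups are invented, the function names are hypothetical, and only the £20,000–30,000 range is taken from the text above.

```python
NICE_LOWER = 20_000  # GBP per QALY, lower bound of the range quoted above
NICE_UPPER = 30_000  # GBP per QALY, upper bound of the range quoted above


def cost_per_qaly(incremental_cost, incremental_qalys):
    """Incremental cost divided by incremental QALYs gained."""
    if incremental_qalys <= 0:
        raise ValueError("QALY gain must be positive for this simple ratio")
    return incremental_cost / incremental_qalys


def position_against_nice_range(icer):
    """Describe where a cost per QALY sits relative to the quoted range."""
    if icer < NICE_LOWER:
        return "below range"
    if icer <= NICE_UPPER:
        return "within range"
    return "above range"


# Hypothetical subgroups defined by preoperative score band.
subgroups = {
    "more severe preoperative score": (7_500.0, 1.5),
    "less severe preoperative score": (7_500.0, 0.25),
}

results = {
    name: cost_per_qaly(cost, qalys) for name, (cost, qalys) in subgroups.items()
}
for name, icer in results.items():
    print(name, round(icer), position_against_nice_range(icer))
```

With the same cost, a smaller QALY gain raises the cost per QALY. This is why the choice of threshold, which changes the preoperative severity of the patients who are treated, can be expected to affect cost-effectiveness.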
Finally, having identified and validated a clinical tool and calculated valid and evidence-based thresholds for surgery, within the NHS, it must be established whether or not the tools are acceptable to the ‘end-users’. Despite some thresholds for hip and knee replacement having already been introduced to clinical practice in parts of the country, there has been little or no engagement with the wider stakeholders about the appropriateness of this approach or how thresholds should be used in practice. The introduction of thresholds requires the support of patients, health-care professionals and commissioners.
Although the requirements of threshold scores in primary and secondary care may differ, in order to provide consistency for patients and health-care professionals any scoring system would ideally be applicable to both sectors. One aim is to ensure this compatibility by consciously considering the requirements within each setting. In primary care, the requirement is for a simple-to-use patient-based score linked to the patient’s potential to benefit from arthroplasty. This would provide a distinct upper threshold for referral and candidacy for joint replacement. The thresholds calculated for the identified scoring system will be incorporated into a user-friendly knee and hip replacement candidacy assessment tool – the Arthroplasty Candidacy Help Engine (ACHE) tool. Secondary care involves more complex assessments, involving expectation, comorbidity and age-related factors. The ACHE tool would be a starting point for secondary care assessment, linking to other patient decision-support tools. 51,52
In summary, greater standardisation is required in the patient pathway leading to hip or knee joint replacement surgery. The aim of this study is to develop an evidence-based method for identifying patients in primary care who are possible candidates for surgery, using valid thresholds applied to scoring systems that are already available (see Figure 1).
Research objectives
The following research objectives will be met:
- Create a shortlist of scoring systems that are potentially useful for selecting candidates for arthroplasty surgery.
  - From the literature, establish the scores/instruments available. Published evidence concerning their measurement properties, and their past or projected use in setting thresholds for hip and knee replacement, will be reviewed. This will generate a shortlist of potential scoring systems.
  - Using existing data sets and guidance from users, refine the shortlist by establishing the necessary measurement properties of potential scores/instruments when not available in the literature.
- Identify a scoring system, and a set of threshold values, to be used to select candidates for hip and knee surgery.
  - For each shortlisted instrument, determine score thresholds for candidacy for joint replacement surgery.
  - Determine the relationship between threshold levels and cost-effectiveness of hip and knee arthroplasty surgery.
  - Select the most applicable single score and set of thresholds for incorporation into the ACHE tool.
- Explore the clinical effectiveness and cost-effectiveness of the ACHE tool and determine the potential acceptability of the tool and thresholds to stakeholders and patients.
  - Determine the effect of using the ACHE tool on patterns of referral of hip and knee patients to secondary care.
  - Evaluate user opinion – GPs and patients.
  - Engage with a wider stakeholder group to assess the acceptability of the ACHE tool.
The overall aim was to develop a standardised NHS framework for identifying patients for hip and knee replacement surgery using safe and equitable thresholds. This was achieved by creating the ACHE tool, based on a currently available assessment score, with thresholds that take account of patients’ capacity to benefit from surgery and the cost-effectiveness of the treatment. The new system is applicable in both primary and secondary care.
The user group
From the outset of the original design for this project, all service users’ perspectives were considered integral to its success. The ACHE tool was to be designed to help patients, GPs, secondary care personnel, arthroplasty specialists and health-care commissioning staff. It was therefore decided very early on to utilise a ‘user group’ concept, in which representatives of these identified bodies inputted, critiqued and reviewed the progress of the study at appropriate intervals. Importantly, the user group was established as a proactive entity rather than as a passive and disengaged review and agreement exercise. The decisions of the group were critical to the direction of the project and were instrumental in sanctioning various aspects/decisions and vetoing others. The user group was given full autonomy under the direction of the chairperson. Investigators of the study were allowed to be present but did not participate in the meeting unless invited by the chairperson for clarification purposes only. As a result, the final ACHE tool did not reflect the wishes of the researchers, but the group for whom the instrument was designed.
The user group was assembled and provided input at regular and preset intervals during the course of the study. The sequence was predetermined and the role dovetailed with each stage of the project (Figure 2). One option for this report was to describe user group input in separate sections (chapters) in sequence and inserted within the main report at the appropriate temporal intervals. However, some meetings were introductory or had limited remit and do not contain sufficient content to justify separate chapters. A decision was made to report all user group activity in a single separate chapter (see Chapter 10). Readers of the report are required to cross-reference this section and appreciate that user group input took place for each academic section.
FIGURE 2.
Work plan schema highlighting the user group meetings and input. HES, Hospital Episode Statistics; NJR, National Joint Registry; WP, work package.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig2.png)
Chapter 2 Systematic review of existing measures (work package 1)
Background
The aim of this study was to develop an evidence-based system for identifying patients who might be candidates for hip or knee replacement surgery, introducing valid thresholds based on scores that are already available. The first objective in achieving this aim was to create a shortlist of scoring systems that could be used in this way. After discussion within the user group (see Chapter 10), it was established that candidate scores placed on the shortlist would need to meet certain essential criteria:
- A score must be a patient-reported measure to ensure that patients were engaged in the assessment process and that the score used reflected their perspective on the outcome.
- A score must demonstrate adequate measurement properties and have been validated within the hip and knee replacement populations. 53–58
Many different scoring systems and outcome measures have been used for assessing the outcomes of hip or knee arthroplasty, but not all measures have evidence of, or reach, even the minimum psychometric standards for their proposed uses. 1,59–61 Therefore, the aims of this work were to use systematic review methodology to identify and evaluate English-language versions of PROMs that have been evaluated with patients undergoing hip or knee replacement surgery and to provide a comprehensive profile of their measurement properties so that a shortlist of candidate scores could be established.
Methods
Identification of studies
The search was conducted in May 2014; it was limited to English-language articles and no time restrictions were set. MEDLINE, EMBASE, PsycINFO and the Allied and Complementary Medicine Database (AMED) were searched using an adjusted methodological filter through Ovid. ProQolid, the Oxford PROMs Database, the Database of Abstracts of Reviews of Effects (DARE) and EconLit were also searched using a combination of medical subject heading and free-text terms. 62 Hand-searching of titles of the following key journals in the 6 months preceding the search was also conducted: Health and Quality of Life Outcomes, The Journal of Bone and Joint Surgery (American and British volumes) and The Journal of Arthroplasty.
Screening of articles and instruments
Titles and abstracts of all identified articles were assessed for inclusion/exclusion by two reviewers (KH and EG), with agreement assessed on a screening sample of 313 abstracts. The first round of testing yielded a 77% agreement rate and the second round yielded a 99% agreement rate between reviewers. Full texts of the articles that were to be included in the review were retrieved. Inclusion criteria were:
- The instrument uses a standard scoring system (representing indices or scales).
- The instrument is already available and has been used in clinical settings or research to assess adult (aged > 18 years) patients prior to hip or knee replacement.
- The instrument has been validated for the English-language population.
- The study design is principal development, concurrent revalidation or a prospective study of a score with information on its measurement properties (e.g. reliability, validity and responsiveness). Retrospective studies (except historical cohort studies) were excluded.
- The sample size in the study was > 50 subjects/patients.
Titles and abstracts were obtained relating to any tools identified at this stage, and these were scrutinised using the aforementioned inclusion criteria. The same methodology was applied to full-text documents for their inclusion in the review. Selected full-text articles were then screened for all measures that were used in analyses. The aforementioned inclusion criteria were applied to the list of identified measures. Furthermore, the following exclusion criteria were applied to the initial list of measures:
- The assessment is not patient reported and requires the patient to be assessed on each/every occasion by a clinician.
- The assessment requires some kind of technical information or equipment [such as a magnetic resonance imaging (MRI) scan or radiographic report], which might not always be available or standardised, or which might not make sense as part of an assessment conducted at both preoperative and postoperative stages.
- The measure is not capable of demonstrating patients’ ‘capacity to benefit’ because it was not designed to be a health status/outcome measure, and therefore cannot measure change (e.g. purely retrospective measures were excluded).
Instrument-specific search
A specific search was undertaken for each of the identified instruments, with a developmental study and then a population and validation filter applied to the list of citations stemming from the developmental study.
Data extraction
Data were extracted on the psychometric performance and operational characteristics of each PROM. Assessment and evaluation of the methodological quality of PROMs were undertaken independently by three reviewers adapting the London School of Hygiene & Tropical Medicine appraisal criteria outlined in a previous review. 1
Measurement properties assessed
Reliability was assessed by test–retest reliability and internal consistency. Test–retest reliability refers to the stability of a measuring instrument over time, assessed by administering the instrument to respondents on two different occasions and examining the correlation between test and retest scores. Internal consistency refers to the extent to which items constituting a scale measure the same construct (e.g. homogeneity of items in a scale) and is assessed by Cronbach’s alpha and item–total correlations.
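To make the internal consistency statistic concrete, the following is a minimal sketch of Cronbach's alpha computed from item-level responses. The response data and the function name are hypothetical; the formula is the standard one and is not taken from the studies reviewed here.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("need at least two items")
    # Variance of each item across respondents.
    item_vars = [
        pvariance([resp[i] for resp in item_scores]) for i in range(n_items)
    ]
    # Variance of each respondent's total (summed) score.
    total_var = pvariance([sum(resp) for resp in item_scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 5 respondents x 4 items, each item scored 0-4.
responses = [
    [4, 3, 4, 4],
    [2, 2, 3, 2],
    [1, 1, 0, 1],
    [3, 3, 3, 4],
    [0, 1, 1, 0],
]

alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

Population variance is used for both the items and the total, so the ratio is the same as it would be with sample variance.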
Content and construct validity were assessed. Content validity relates to the extent to which the content of a scale is representative of the conceptual domain it is intended to cover and is usually assessed qualitatively during the questionnaire development phase through pretesting with patients, with patients involved in item generation. Construct validity looks at the evidence that the scale is correlated with other measures of the same or similar constructs in the hypothesised direction and is assessed on the basis of correlations between the measure and other similar measures, preferably based on an a priori hypothesis with predicted strength of correlation.
Responsiveness refers to the ability of a scale to detect significant change over time and is assessed by comparing scores before and after an intervention of known efficacy or when other evidence indicates important change on the basis of various methods including paired t-tests, effect sizes (ESs), standardised response mean values or responsiveness statistics. Ideally, evidence of responsiveness will include high correlations between the change scores of the scale and relevant constructs, preferably based on an a priori hypothesis with predicted strength of correlation.
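Two of the responsiveness statistics named above can be sketched as follows. The definitions used here are common conventions and are stated as assumptions: the effect size is taken as mean change divided by the standard deviation of baseline scores, and the standardised response mean as mean change divided by the standard deviation of the change scores. The pre- and postoperative scores are invented.

```python
from statistics import mean, stdev


def change_scores(pre, post):
    """Paired change scores (post minus pre)."""
    if len(pre) != len(post):
        raise ValueError("pre and post must be paired")
    return [b - a for a, b in zip(pre, post)]


def effect_size(pre, post):
    """Mean change divided by the SD of baseline scores (assumed definition)."""
    return mean(change_scores(pre, post)) / stdev(pre)


def standardised_response_mean(pre, post):
    """Mean change divided by the SD of the change scores (assumed definition)."""
    changes = change_scores(pre, post)
    return mean(changes) / stdev(changes)


# Hypothetical scores for five patients before and after surgery.
pre_scores = [18, 22, 15, 25, 20]
post_scores = [38, 40, 30, 44, 33]

es = effect_size(pre_scores, post_scores)
srm = standardised_response_mean(pre_scores, post_scores)
print(round(es, 2), round(srm, 2))
```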
Interpretability relates to the degree to which one can assign qualitative meaning – that is, clinical or commonly understood connotations – to an instrument’s quantitative change in score. It can be assessed by estimating the precision of the measure when used at an individual patient level, by multiplying the SE of measurement by the standard score (z-value). In addition, MCIDs can be calculated by relating change to an external anchor, using either mean change or the receiver operating characteristic (ROC) curve method.
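The individual-level precision calculation can be sketched as below. The step from reliability to the SE of measurement uses the textbook relation SE of measurement = SD × √(1 − reliability), which is an assumption here because the text does not state how the SE of measurement was obtained. The standard deviation and reliability values are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SE of measurement from score SD and a reliability coefficient (assumed relation)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def individual_precision(se_measurement, z=1.96):
    """Half-width of the interval around one patient's score: z times the SE of measurement."""
    return z * se_measurement


# Hypothetical instrument: score SD of 8 points, test-retest reliability of 0.91.
se_m = standard_error_of_measurement(sd=8.0, reliability=0.91)
precision_95 = individual_precision(se_m)
print(round(se_m, 2), round(precision_95, 2))
```

With z = 1.96, the result is the range either side of an observed score within which the patient's true score is expected to lie with about 95% confidence.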
Floor and ceiling effects relate to the ability of an instrument to accurately measure across the full spectrum of a construct. If a measure has > 15% of participants achieving a top or bottom score, this is indicative of a ceiling/floor effect.
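The 15% rule can be checked directly from a set of scores. This is a minimal sketch; the score range of 0 to 48 and the data are hypothetical, and the function name is illustrative.

```python
def floor_ceiling_effects(scores, min_score, max_score, cutoff=0.15):
    """Flag a floor or ceiling effect when more than `cutoff` of scores sit at an extreme."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    floor_prop = sum(1 for s in scores if s == min_score) / n
    ceiling_prop = sum(1 for s in scores if s == max_score) / n
    return {
        "floor_proportion": floor_prop,
        "ceiling_proportion": ceiling_prop,
        "floor_effect": floor_prop > cutoff,
        "ceiling_effect": ceiling_prop > cutoff,
    }


# Hypothetical scores on a 0-48 scale: 4 of 20 at the top, 1 of 20 at the bottom.
scores = [48] * 4 + [0] + [20, 25, 30, 31, 33, 35, 36, 38, 40, 41, 42, 43, 44, 45, 46]

result = floor_ceiling_effects(scores, min_score=0, max_score=48)
print(result)
```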
Acceptability is a practical property of an instrument and reflects respondents’ willingness to complete it without feeling unduly burdened, indicated by, for example, response rates and completion rates.
Measurement properties for each instrument were assessed separately for hip, knee and mixed hip and knee populations (depending on the availability of published studies). The information was then summarised into the appraisal summary tables, which rated the overall quality of evidence for each of the measurement properties. Three authors (KH, EG and JD) reviewed their own respective sections, following which the results were cross-checked to ensure consistency of assessment and scoring across the reviewers.
Results
Identification of studies
The initial search in Ovid yielded 3774 abstracts. After the removal of duplicates, the number of abstracts for assessment was 2887. In addition, keyword searches (combination of knee, hip and orthopaedics) in EconLit yielded 162 results, the PROMs database identified 454 results and DARE had no results (Figure 3).
FIGURE 3.
Instrument flow diagram. AMED, Allied and Complementary Medicine Database; EconLit, American Economic Association’s electronic bibliography; OU, University of Oxford. Reproduced with permission from Harris et al. 63 This is an Open Access article distributed in accordance with the terms of the Creative Commons Attribution (CC BY 3.0) license, which permits others to distribute, remix, adapt and build upon this work, for commercial use, provided the original work is properly cited. See: http://creativecommons.org/licenses/by/3.0/.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig3.png)
Hand-searching of titles of the following key journals in the 6 months preceding the search was conducted:
- Health and Quality of Life Outcomes (number of articles, one)
- The Journal of Bone and Joint Surgery (American and British volumes; number of articles, one)
- The Journal of Arthroplasty (number of articles, three).
Screening of articles and instruments
Out of the 167 selected abstracts, 146 eligible full-text articles were then screened for all PROMs that were analysed, identifying 135 instruments. If the instrument was not validated (developed for or subsequently validated) for use in a population of patients undergoing hip or knee replacement surgery, it was excluded, leaving 67 instruments. A reliability exercise was undertaken for 16 full-text articles between two reviewers, and the agreement was 95% (38/40 questionnaires identified). An instrument-specific search was then undertaken for each of the 67 identified instruments. By this method, 21 new validation papers (in addition to 42 developmental papers) in the targeted population were identified. Furthermore, on closer examination of shortlisted instruments, 21 initially identified instruments were additionally excluded.
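The agreement figure from the reliability exercise (38/40, or 95%) is a simple percentage agreement, which can be reproduced as below. The paired decision lists are a hypothetical reconstruction with 2 disagreements out of 40; the report does not give the item-level data, and the function name is illustrative.

```python
def percent_agreement(decisions_a, decisions_b):
    """Percentage of paired decisions on which two reviewers agree."""
    if len(decisions_a) != len(decisions_b) or not decisions_a:
        raise ValueError("need two non-empty lists of equal length")
    matches = sum(1 for a, b in zip(decisions_a, decisions_b) if a == b)
    return 100.0 * matches / len(decisions_a)


# Hypothetical decisions: 40 questionnaires, reviewers differ on 2 of them.
reviewer_a = [True] * 40
reviewer_b = [True] * 38 + [False] * 2

agreement = percent_agreement(reviewer_a, reviewer_b)
print(agreement)
```

Percentage agreement does not correct for agreement expected by chance. A chance-corrected statistic would need the full table of paired decisions, which is not reported here.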
Data extraction
Relevant data on the psychometric performance and operational characteristics were extracted for each PROM. The summary texts were sent to corresponding authors from the developmental study of each respective PROM, and further information was added as a result of this exercise. The appraisal summaries are presented in Tables 1–4.
Instrument (groups tested) | Score | ||||||||
---|---|---|---|---|---|---|---|---|---|
Hip | Knee | ||||||||
HOOS | HRQ | PSI | OHS | Knee disorders subjective history (VAS) | KOOS | KOOS-PS | OKS | OKS-APQ | |
Number of studies | 5 | 1 | 4 | 20 | 1 | 3 | 2 | 23 | 1 |
Reproducibility | ++ | + | + | ++ | 0 | + | 0 | +++ | +++ |
Internal consistency | + | 0 | 0 | ++ | 0 | 0 | +++ | +++ | +++ |
Validity: content | 0 | 0 | ++ | ++ | + | + | + | +++ | +++ |
Construct | ++ | + | ++ | +++ | + | + | ++ | +++ | +++ |
Responsiveness | + | + | ++ | +++ | 0 | 0 | ++ | +++ | +++ |
Interpretability | 0 | 0 | 0 | +++ | 0 | 0 | 0 | ++ | 0 |
Floor and ceiling/precision | + | 0 | 0 | –/+ | 0 | + | 0 | ++ | ++ |
Acceptability | 0 | 0 | 0 | +++ | – | 0 | 0 | +++ | +++ |
Instrument (group tested) | Score | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Lower limb | Pain | |||||||||
LEFS (h/k) | WOMAC (h/k) | WOMAC (h) | WOMAC (k) | WOMAC SF (h/k) | Lower limb core score (h/k) | MODEMS-HK (AAOS) hip and knee core score (h/k) | ICOAP (h/k) | P4 (h/k) | McGill pain-short form (h/k) | |
Number of studies | 5 | 25 | N/A | N/A | N/A | 1 | 1 | 2 | 1 | 2 |
Reproducibility | + | ++ | ++ | + | 0 | 0 | 0 | + | 0 | ++ |
Internal consistency | + | + | 0 | 0 | + | 0 | 0 | + | ++ | 0 |
Validity: content | + | + | + | + | + | + | + | ++ | + | 0 |
Construct | ++ | +++ | + | ++ | ++ | 0 | + | + | + | + |
Responsiveness | ++ | +++ | ++ | ++ | + | 0 | ++ | – | 0 | – |
Interpretability | + | ++ | ++ | ++ | 0 | 0 | 0 | 0 | 0 | 0 |
Floor and ceiling/precision | 0 | –/+ | – | 0 | 0 | 0 | ++ | 0 | 0 | 0 |
Acceptability | 0 | ++ | + | + | 0 | 0 | + | 0 | 0 | 0 |
Instrument (group tested) | Instrument group | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Utility | Generic | |||||||||||
SF-6D (h) | HUI2 and HUI3 (h) | EQ-5D (h/k) | EQ-5D (h) | EQ-5D (k) | SF-36 (h/k) | SF-36 (h) | SF-36 (k) | SF-12 (h/k) | SF-12 (h) | SF-12 (k) | SIP (h) | |
Number of studies | 1 | 4 | 9 | N/A | N/A | 14 | N/A | N/A | 3 | N/A | N/A | 2 |
Reproducibility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++ | 0 | 0 | 0 |
Internal consistency | 0 | 0 | N/A | N/A | N/A | 0 | 0 | – | 0 | 0 | 0 | 0 |
Validity: content | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | + |
Construct | 0 | ++ | + | 0 | + | + | 0 | + | 0 | 0 | + | + |
Responsiveness | ++ | + | 0 | 0 | + | 0 | ++ | + | 0 | + | + | – |
Interpretability | 0 | 0 | 0 | ++ | ++ | 0 | + | + | 0 | + | + | 0 |
Floor and ceiling/precision | – | 0 | 0 | 0 | ++ | 0 | – | 0 | +++ | 0 | 0 | – |
Acceptability | 0 | 0 | 0 | 0 | ++ | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Instrument (group tested) | Instrument | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
WHOQOL-BREF (h/k) | Aberdeen IAP (h/k) | Aberdeen IAP (modified) (h/k) | NEADL (h) | AQOL (h/k) | MSK functional limitations index (k) | HAQ (k) | MHAQ (h/k) | MHAQ (h) | K10 (h/k) | |
Number of studies | 1 | 1 | 1 | 1 | 2 | 1 | 2 | 2 | N/A | 1 |
Reproducibility | 0 | 0 | 0 | ++ | 0 | 0 | 0 | 0 | 0 | 0 |
Internal consistency | ++ | + | ++ | ++ | 0 | 0 | – | 0 | 0 | 0 |
Validity: content | + | + | 0 | – | 0 | 0 | 0 | 0 | + | – |
Construct | 0 | + | + | + | + | + | ++ | + | + | + |
Responsiveness | + | 0 | 0 | – | ++ | 0 | – | – | + | – |
Interpretability | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Floor and ceiling/precision | ++ | + | 0 | + | 0 | 0 | + | ++ | 0 | ++ |
Acceptability | 0 | 0 | 0 | 0 | 0 | + | 0 | 0 | 0 | 0 |
Table 1 summarises the evidence of measurement and operational performance, applying the adapted appraisal criteria, for the hip and knee PROMs identified in this review. On the basis of the volume and quality of evidence, the Oxford Hip Score (OHS) clearly has the best evidence of measurement properties within the hip-specific PROM category. Within the ‘knee scores’ subgroup (see Table 1), the Oxford Knee Score (OKS) [with the OKS – Activity and Participation Questionnaire (OKS-APQ)] demonstrated the best evidence of its measurement properties within the knee-specific PROM category. The Knee injury and Osteoarthritis Outcome Score (KOOS) and the KOOS – Physical Score (KOOS-PS) have some favourable evidence of their measurement properties, although, compared with the OKS, the evidence is lacking and further evaluations are needed.
Table 2 summarises the evidence of measurement and operational performance by applying the adapted appraisal criteria to the lower-limb and pain PROMs identified in these reviews. The best-performing lower-limb measure for hip/knee patients is the WOMAC, followed by the Lower Extremity Functional Scale. The WOMAC also performed best when applied to separate hip or knee groups. Satisfactory evidence of measurement properties was generally lacking for all three identified pain measures (ICOAP, P4 and the McGill Pain-Short Form). ICOAP and McGill Pain-Short Form had no evidence in favour of their responsiveness and P4 did not have any reported evidence of its responsiveness. Three utility and three generic measures identified in the review are listed in Table 3. As with the pain scores, the evidence for utility PROMs was generally lacking, with the EQ-5D scoring worse on construct validity and responsiveness than the Short Form questionnaire-6 Dimensions (SF-6D) and the Health Utilities Index Mark 2 (HUI2) and Mark 3 (HUI3). On the basis of the volume and quality of evidence, among all identified generic measures, the Short Form questionnaire-12 items (SF-12) is clearly the most promising one.
Nine measures identified in the review were categorised as ‘other’ scales. Table 4 summarises evidence of their measurement properties. The World Health Organization Quality of Life (WHOQOL)-BREF instrument, the Aberdeen Impairment, Activity Limitation and Participation Restriction (Aberdeen IAP) measure and the Assessment of Quality of Life (AQOL) instrument had the best overall evidence in this subcategory (on a mixed hip/knee population). However, the overall evidence of their validity was generally lacking.
Discussion
Our review has identified the WOMAC, OHS and OKS to be the most promising disease-/site-specific scores that perhaps provide best coverage of the construct of interest and better responsiveness. The best-performing generic measure was the SF-12. However, further research on some of the missing measurement properties in these measures is required. For the WOMAC, further evidence on ceiling/floor effect, content validity and acceptability is required in both the hip and the knee groups of patients. The OHS is currently lacking evidence on its ceiling/floor effects. Many other PROMs do not have sufficient measurement property validation to recommend their use. Given its widespread use in this clinical area (e.g. national PROMs data), it was disappointing that the EQ-5D score did not perform better.
Our findings are supported by existing literature. Alviar et al. 60 published a systematic review of measurement properties of 28 PROMs used in hip/knee arthroplasty based on published evidence up to December 2009 and found the WOMAC, OKS and SF-36 to be the most comprehensively tested measures at that time, although the need for more rigorous evaluation of reliability, responsiveness and interpretability was noted. Our review has updated this evidence, both in breadth (we have assessed 67 instruments) and time period (our search was until May 2014). Browne et al. 64 identified the OHS and OKS (used alongside the EQ-5D) as primary outcome measures of choice to be used in the UK PROMs programme for hip and knee replacement.
It should be noted that the standards (and indeed scope/tolerance) for reporting details of qualitative procedures and psychometric analysis have changed over the past 20 years (very much so in the musculoskeletal literature). As a result, although measures that were devised earlier in that period have had a longer time in which to accrue evidence of their measurement properties, they can frequently lack relevant detail, specifically in relation to the development of the instrument. Reporting has improved, probably as a consequence of the evolving methods and the recognition that minimum standards are required [e.g. Streiner et al.,65 COSMIN (COnsensus-based Standards for the selection of health Measurement INstruments)54 and the US Food and Drug Administration55].
Further detail and supplementary material can be found in the publication based on this work by Harris et al. 63
Chapter 3 Calculation of measurement properties (work package 1)
Background
The systematic review reported the measurement properties of the EuroQol-5 Dimensions, three-level version (EQ-5D-3L), SF-12, OHS, OKS and WOMAC tools based on the existing literature. The OHS (20 studies) and OKS (23 studies) were rated as good in terms of reproducibility, internal consistency, content validity, construct validity, responsiveness, interpretability, floor and ceiling effects/precision (except the OHS) and acceptability. The WOMAC (25 studies) was rated as good in terms of reproducibility, content validity, construct validity, responsiveness, interpretability and acceptability. Only fair outcomes were reported for the EQ-5D-3L in knee patients (nine studies) in terms of interpretability and acceptability, when construct validity and responsiveness were not applicable. The SF-12 (three studies) was poor in terms of construct validity, responsiveness and interpretability. There were a number of outcomes for which there was little or no available evidence on one or more of the measurement properties. To be fit for purpose, any candidate score to be used as a screening instrument must satisfy a number of requirements, one being that the score must have adequate measurement properties to enable assessment of patients for joint replacement {i.e. adequate validity [ACHE protocol version 4, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences (NDORMS), 2015]}.
Methods
General approach
Following the review of the evidence on the measurement properties of the possible instruments for measuring outcome after knee and hip replacement operations, a number of gaps in the evidence base were identified. Before we refined the shortlist of candidate tools, the missing measurement properties were calculated, when possible, using the available data sets (Table 5).
Data set | EQ-5D-3L index | SF-12/SF-36 | OKS | OHS | WOMAC |
---|---|---|---|---|---|
Knee | | | | | |
KAT | ✓ | ✓ | ✓ | | |
ADAPT | | ✓ | | | ✓ |
APEX | ✓ | | | | ✓ |
Hip | | | | | |
EUROHIP | ✓ | | | ✓ | ✓ |
EPOS | | ✓ | | ✓ | |
ADAPT | | ✓ | | | ✓ |
APEX | ✓ | | | | ✓ |
Research aim and objectives
Patient-level data were available from a number of data sets, which included data on several relevant scoring systems. As reported in Chapter 1 and summarised in Tables 1–4, most of the instruments identified in the systematic review lacked evidence on one or more measurement properties. Using the available data sets, missing measurement properties were calculated. The instruments covered varied across the data sets (see Table 5). No data sets were available that included SF-6D, ICOAP, KOOS or KOOS-PS, among others.
Data sets
A brief description of the data sets used is provided in the following sections.
The Knee Arthroplasty Trial
Summary
The Knee Arthroplasty Trial (KAT)66 is a National Institute for Health Research (NIHR) HTA programme-funded study that has examined the outcome of 2352 total knee arthroplasties (TKAs) over a median of 10 years, and includes data on costs and resource use. Continued development of TKA systems has seen improvement in quality of life and increased duration of prosthetic survival. It was a pragmatic, multicentre (116 surgeons in 34 centres in the UK) randomised controlled trial (RCT). A total of 2352 participants were randomly allocated to be treated with or without a metal backing of the tibial component (n = 409), a patellar resurfacing (n = 1715) or a mobile bearing (n = 539); in total, 2318 patients took part.
The trial is registered as ISRCTN45837371.
Available outcome measures of interest
The available outcome measures of interest were the OKS, SF-12 [predominantly version 2 (n = 2091), although a small number of version 1 questionnaires were initially used (n = 126)] and the EQ-5D-3L descriptive system (referred to as the ‘EQ-5D-3L index’ in this report).
The European Collaborative Database of Cost and Practice Patterns of Total Hip Replacement
Summary
This used a cohort of people having primary hip replacement for primary osteoarthritis from the UK and other European countries. The European Collaborative Database of Cost and Practice Patterns of Total Hip Replacement (EUROHIP) consortium includes 20 orthopaedic centres in 12 different European countries. 67 The cohort was comprised of 1051 people having primary hip replacement for primary hip osteoarthritis. Originally, 147 patients in the cohort came from the UK (143 remained for the analysis for the OHS) and the remaining 1373 patients were from other European countries (1184 remained for the analysis). In total, 1327 participants were used for analyses of WOMAC and the EQ-5D-3L index. A total of 908 participants (68.4%) completed the 12-month follow-up questionnaire. A minimum of 50 consecutive, consenting patients receiving primary total hip arthroplasty (THA) for hip osteoarthritis in each of the 20 participating orthopaedic centres entered the study. Preoperative data included demographics, employment and educational attainment, drug utilisation and involvement of other joints. 67
Available outcome measures of interest
The available outcome measures of interest were the OHS, EQ-5D-3L index and WOMAC (a five-point Likert version).
The Exeter Primary Outcome Study
Summary
The Exeter Primary Outcome Study (EPOS) recruited 1590 patients who had undergone Exeter hip replacement implants between March 1999 and February 2002. 68 There were 1375 patients (1431 hips) with a primary diagnosis of osteoarthritis. The unit of analysis was the implant rather than the patient, of whom 56 had bilateral procedures. A total of 1431 THRs were undertaken by consultant and non-consultant surgeons using anterolateral or posterior approaches. 69
Available outcome measures of interest
The available outcome measures of interest were the OHS and SF-36.
Assessing Disability After Partial and Total Joint Replacement
Between February 2010 and November 2011, 125 patients undergoing THA and 128 patients undergoing TKA were recruited to the Assessing Disability After Partial and Total Joint Replacement (ADAPT) cohort. The protocol and full details of the research design and findings have been published. 70 The studies were approved by the Southampton and South West Hampshire Research Ethics Committee (09/H0102/72) and all participants provided informed, written consent. 71 The patients recruited were due to undergo a range of hip and knee replacement procedures, meaning that functional measures could be investigated across a range of people with diverse indications for surgery and degrees of functional impairment.
Available outcome measures of interest
The available outcome measures of interest were the SF-12 version 1 and WOMAC.
Arthroplasty Pain Experience
Summary
Between 2009 and 2012, 322 patients undergoing THA and 316 patients undergoing TKA were recruited. The inclusion criteria were waiting for a primary unilateral THA or TKA for osteoarthritis. The published protocol and clinical results paper for the Arthroplasty Pain EXperience (APEX) study provide full details of the research design and findings. 70,72 The exclusion criteria were the inability to provide informed consent or complete questionnaires and medical comorbidity precluding the use of spinal anaesthesia, regional blocks or strong analgesics postoperatively.
Available outcome measures of interest
The available outcome measures of interest were the EQ-5D-3L index and WOMAC.
Approvals
We successfully sought direct approval from the data controller of each data set to obtain access to the anonymised data.
Available data by data set
Tables 6 and 7 show the number of individuals who responded to each hip and knee measurement tool for the respective data sets. The percentages of items missing in those with an incomplete measurement tool were about 6–24% for the EQ-5D-3L index, 15–41% for the OHS, 12% for the OKS and 21–36% for the WOMAC total score post operation (see Report Supplementary Material 1, Online Supplement 1, for further details). Please note that all further citations to ‘online supplements’ refer to those within Report Supplementary Material 1.
Measurement tool | Pre operation | Post operation | ||||
---|---|---|---|---|---|---|
Individuals who responded to any question in the measurement tool (n) | Individuals who fully completed the measurement tool (n) | Items missing for those with an incomplete measurement tool response (%) | Individuals who responded to any question in the measurement tool (n) | Individuals who fully completed the measurement tool (n) | Items missing for those with an incomplete measurement tool response (%) | |
APEX | ||||||
EQ-5D-3L index | 309 | 302 | 86.8 | 271 | 266 | 94.8 |
EUROHIP | ||||||
EQ-5D-3L index | 1266 | 1228 | 72.9 | 903 | 883 | 96.9 |
ADAPT | ||||||
SF-12 MCS/PCS | 125 | 119 | 53.8 | 112 | 104 | 76.0 |
EPOS | ||||||
OHS | 1534 | 1517 | 79.9 | 1262 | 1239 | 94.9 |
EUROHIP | ||||||
OHS | 140 | 127 | 33.9 | 123 | 114 | 74.7 |
OHS pain | 140 | 133 | 51.7 | 123 | 120 | 92.0 |
OHS function | 139 | 131 | 47.2 | 123 | 115 | 39.6 |
ADAPT | ||||||
WOMAC total | 125 | 112 | 34.5 | 111 | 102 | 70.4 |
WOMAC pain | 125 | 122 | 77.5 | 111 | 109 | 93.3 |
WOMAC function | 125 | 115 | 42.4 | 111 | 104 | 75.8 |
WOMAC stiffness | 125 | 125 | – | 111 | 109 | 95.2 |
APEX | ||||||
WOMAC total | 324 | 261 | 42.7 | 283 | 234 | 65.7 |
WOMAC pain | 324 | 323 | 96.0 | 283 | 279 | 95.0 |
WOMAC function | 308 | 270 | 53.7 | 273 | 242 | 74.0 |
WOMAC stiffness | 309 | 300 | 89.5 | 273 | 268 | 96.7 |
EUROHIP | ||||||
WOMAC total | 1272 | 1243 | 73.4 | 902 | 865 | 94.3 |
WOMAC pain | 1268 | 1255 | 91.1 | 886 | 875 | 98.8 |
WOMAC function | 1270 | 1253 | 86.7 | 901 | 888 | 98.4 |
WOMAC stiffness | 1266 | 1266 | – | 888 | 888 | – |
Measurement tool | Pre operation | Post operation | ||||
---|---|---|---|---|---|---|
Individuals who responded to any question of the measurement tool (n) | Patients who fully completed the measurement tool (n) | Items missing in those with incomplete measurement tool (%) | Individuals who responded to any question of the measurement tool (n) | Patients who fully completed the measurement tool (n) | Items missing in those with incomplete measurement tool (%) | |
APEX | ||||||
EQ-5D-3L index | 302 | 298 | 92.2 | 263 | 261 | 97.9 |
KAT | ||||||
EQ-5D-3L index | 2156 | 2120 | 71.6 | 1995 | 1939 | 84.2 |
ADAPT | ||||||
SF-12 MCS/PCS | 128 | 116 | 43.2 | 110 | 96 | 68.5 |
KAT | ||||||
SF-12 MCS/PCS | 2156 | 2087 | 58.3 | – | 1904 | – |
KAT | ||||||
OKS | 2159 | 2112 | 59.9 | 1996 | 1691 | 47.9 |
OKS pain | 2159 | 2136 | 76.4 | 1996 | 1906 | 76.2 |
OKS function | 2159 | 2132 | 75.8 | 1996 | 1753 | 58.8 |
ADAPT | ||||||
WOMAC total | 128 | 118 | 41.4 | 110 | 102 | 74.6 |
WOMAC pain | 127 | 123 | 64.4 | 110 | 109 | 96.5 |
WOMAC function | 127 | 120 | 49.0 | 110 | 103 | 77.5 |
WOMAC stiffness | 127 | 127 | – | 110 | 110 | – |
APEX | ||||||
WOMAC total | 318 | 246 | 41.6 | 277 | 214 | 59.0 |
WOMAC pain | 318 | 318 | 0.0 | 277 | 269 | 90.9 |
WOMAC function | 301 | 253 | 50.5 | 268 | 224 | 66.6 |
WOMAC stiffness | 301 | 293 | 91.3 | 268 | 260 | 94.9 |
Statistical analysis
The statistical methods used to calculate the five measurement properties of interest are described in the following sections.
Internal consistency
Cronbach’s alpha was calculated using preoperation and postoperation data for the SF-12 [physical component score (PCS) and mental component score (MCS)] and OKS (pain, function and total scores). For the OKS total scores and subscales, the actual internal consistency can be assessed (as these summary scores are simple summations of the individual items). For the EQ-5D-3L index and the SF-12 summary scores, the respective Cronbach’s alpha relates to a summary score derived from the five domains and the 12 items, respectively, and, therefore, internal consistency was at best indirectly assessed. The SF-12 uses the 12 item response values to generate two summary scores: physical and mental. This was carried out for the baseline data only and by version. The analyses were carried out in Stata® (version 14; StataCorp LP, College Station, TX, USA) using the alpha command. Cronbach’s alpha was calculated including all standard items and with each of the constituent items dropped in turn, along with the correlation between each individual item and the sum of all the other items.
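The calculation can be illustrated with a minimal Python sketch. It is not the Stata alpha command used in the study; the function names and the small data set in the usage note are hypothetical, and complete-case item responses are assumed.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a list of items.

    items -- list of per-item score lists, one entry per respondent,
             all of the same length (complete cases are assumed).
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are needed")
    n = len(items[0])
    if any(len(col) != n for col in items):
        raise ValueError("all items must have the same number of respondents")
    # Total score per respondent (simple summation of the items)
    totals = [sum(col[i] for col in items) for i in range(n)]
    sum_item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))


def alpha_if_item_dropped(items):
    """Alpha recalculated with each constituent item dropped in turn."""
    return [cronbach_alpha(items[:j] + items[j + 1:]) for j in range(len(items))]
```

For example, `cronbach_alpha([[1, 2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4]])` describes three hypothetical items that rise together across four respondents, so alpha is at its maximum; comparing the overall alpha with `alpha_if_item_dropped` shows whether removing any one item would improve it.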
Construct validity
A priori hypotheses about the magnitude and direction of correlations between primary outcomes at pre operation have been proposed (see Analysis plan document). Spearman’s and Pearson’s correlations were calculated for each pair of measures. These were calculated in Stata® using the ci2 command with 95% confidence intervals (CIs). The CI for Pearson’s correlation was calculated in two ways: using Fisher’s r-to-z transformation and using bootstrapping with 1000 replications in Stata® using the bootstrap command. Cohen’s convention is to interpret ES as follows: a correlation coefficient of 0.10 is thought to represent a weak or small association, a correlation coefficient of 0.30 is considered a moderate correlation and a correlation coefficient of 0.50 or larger is thought to represent a strong or large correlation. 73
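As a rough illustration of the Fisher r-to-z interval and of Cohen's thresholds, the following Python sketch may help. It is not the ci2 command; the function names are hypothetical, and the ‘negligible’ label for coefficients below 0.10 is an assumption added here.

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, level=0.95):
    """Confidence interval for a correlation via Fisher's r-to-z transformation."""
    if n <= 3:
        raise ValueError("n must exceed 3")
    z = math.atanh(r)                      # r-to-z
    se = 1.0 / math.sqrt(n - 3)            # standard error on the z scale
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    # Back-transform the limits to the correlation scale
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


def cohen_label(r):
    """Label the size of a correlation using Cohen's convention."""
    size = abs(r)
    if size >= 0.50:
        return "large"
    if size >= 0.30:
        return "moderate"
    if size >= 0.10:
        return "small"
    return "negligible"  # assumption: label for values below 0.10
```

Calling `fisher_ci(r, n)` with a coefficient and its sample size returns the lower and upper limits. The transformation is derived for Pearson's correlation, so applying it to a Spearman coefficient is only an approximation.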
Responsiveness
Responsiveness was assessed by examining the magnitude and direction of correlations of the change scores (pre to post operation) between the primary outcomes.
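A minimal sketch of this change-score approach is given below. Spearman's rank correlation is used as an assumption (the coefficient applied to change scores is not stated here), and all function names are hypothetical.

```python
import math
from statistics import mean


def ranks(values):
    """Ranks starting at 1, with tied values given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho as Pearson's correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def change_score_correlation(pre_a, post_a, pre_b, post_b):
    """Correlate the pre-to-post change in instrument A with that in instrument B."""
    change_a = [post - pre for pre, post in zip(pre_a, post_a)]
    change_b = [post - pre for pre, post in zip(pre_b, post_b)]
    return spearman(change_a, change_b)
```

The four arguments are parallel lists for the same patients, so only patients with both instruments completed at both time points should be passed in.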
Floor and ceiling effects
The proportion of patients responding with the highest and lowest possible scores at pre and post operation were calculated to assess the possibility of floor/ceiling effects. Jette et al. 74 considered that for the measurement of a stage to be useful, no more than 20% of patients’ measurements should show floor and ceiling effects. More values close to the extremes of the instrument’s range suggest more limited ability of a measurement to discriminate among patients’ function at the minimum or maximum possible scale. 74 The proportion specified to designate a floor/ceiling effect is arbitrary, with 20% used for the current study.
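The check described above can be sketched in a few lines of Python. The function name, the handling of missing scores as `None` and the example scores are hypothetical; the 20% threshold is the one stated in the text.

```python
def floor_ceiling(scores, min_score, max_score, threshold=0.20):
    """Proportion of respondents at the lowest and highest possible scores.

    scores    -- total scores, with None for missing values (ignored)
    threshold -- proportion above which an effect is flagged (20% here)
    """
    valid = [s for s in scores if s is not None]
    if not valid:
        raise ValueError("no valid scores")
    floor = sum(1 for s in valid if s == min_score) / len(valid)
    ceiling = sum(1 for s in valid if s == max_score) / len(valid)
    return {
        "floor": floor,
        "ceiling": ceiling,
        "floor_effect": floor > threshold,
        "ceiling_effect": ceiling > threshold,
    }
```

For an instrument scored from 0 to 48, such as the OHS or OKS, a call would look like `floor_ceiling(scores, 0, 48)`, run separately on the preoperation and postoperation scores.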
Interpretability
The ACHE project methodology aims to calculate the minimal detectable change (MDC), minimally important change (MIC) and minimally important difference (MID). Various approaches can be used to assess these properties. Minor variations in these definitions exist. The definitions used for the estimation of these attributes are given in the following sections.
The MDC was calculated as stated in the following section. The MIC was calculated in three ways: the ROC MIC, MIC (group) and MIC (ES). The MID was calculated in two ways: using a patient-reported global transition item MID (anchor) and using an ES approach MID. The specific methods used were in accordance with the methodology proposed by Beard et al. 75
Minimal detectable change (90% confidence level)
This is often called a distribution method for calculating an important change. The SE of the measurement (SEM) can be defined as:

$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - R}$$

where SD is the standard deviation and R is a reliability parameter [e.g. test–retest reliability or intraclass correlation coefficient (ICC)]. In this analysis, test–retest reliability was used. 76,77 Applying a 90% confidence level for the z-distribution (z-statistic value of 1.645), a range for the possible difference between two observations under the same conditions (test–retest scenario) was calculated to define the MDC:77

$$\mathrm{MDC}_{90} = 1.645 \times \sqrt{2} \times \mathrm{SEM}$$
To calculate an estimate of the MDC, the ICC (test–retest reliability) had to be taken from published repeatability data, because it is not possible to estimate the ICC in the KAT data set (this would require the outcome to be measured multiple times at the same time point). A previous study has estimated ICCs of 0.84 and 0.80 for the SF-12 physical and mental scores, respectively, on a mixed sample of patients undergoing hip and knee replacement surgery, comparing scores at 3 weeks and 1 week pre operation. 78 Although these estimates are not optimal (the ICC would preferably be estimated separately for hip and knee replacement surgery), it is thought that they will be sufficiently close to be used to enable the respective MDCs to be calculated. A previous study79 has reported ICCs of 0.73, 0.78 and 0.53 for WOMAC pain, physical function and stiffness, respectively, at pre operation for patients undergoing THA. In this analysis, the MDCs were calculated for the SF-12 physical and mental scores pre operation using the preoperative ICCs given above. 78
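A short worked sketch of the MDC calculation follows. It assumes the standard distribution-based definitions, SEM = SD × √(1 − R) and MDC = z × √2 × SEM; the function names and the SD of 10 points are hypothetical, whereas the ICC of 0.84 is the published SF-12 PCS value quoted above.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - R), where R is e.g. a test-retest ICC."""
    return sd * math.sqrt(1.0 - reliability)


def minimal_detectable_change(sd, reliability, z=1.645):
    """MDC = z * sqrt(2) * SEM; z = 1.645 corresponds to the 90% level."""
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, reliability)


# Hypothetical baseline SD of 10 points with the published ICC of 0.84
sem = standard_error_of_measurement(10.0, 0.84)
mdc = minimal_detectable_change(10.0, 0.84)
print(f"SEM = {sem:.2f}, MDC90 = {mdc:.2f}")
```

The √2 term reflects that a change score is the difference between two measurements, each of which carries its own measurement error.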
Minimally important change
Receiver operating characteristic minimally important change
The anchor-based MIC approach can be modified by using ROC curve methodology (using the anchor definition as the reference standard for an important change) in order to determine the optimal cut-point. The optimal cut-point can be defined in various ways. In this analysis, two definitions were used: Youden’s Index,80 which maximises (sensitivity + specificity – 1), and the shortest distance to the top-left corner of the ROC plot, which minimises:

$$\sqrt{(1 - \text{sensitivity})^2 + (1 - \text{specificity})^2}$$
The area under the corresponding curve was calculated using a non-parametric ROC approach in Stata® (roctab command) to generate an associated 95% CI.
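The two cut-point definitions can be sketched as follows. This is not the roctab command; the function name is hypothetical, and the sketch assumes that higher change scores indicate more improvement and that the anchor has been dichotomised.

```python
import math


def roc_cutpoints(change, improved):
    """Cut-points chosen by Youden's Index and by the shortest distance.

    change   -- change scores (post minus pre operation)
    improved -- parallel list of booleans from the anchor (True = important change)
    A patient is classed as test positive when change >= cut-point.
    """
    n_pos = sum(1 for flag in improved if flag)
    n_neg = len(improved) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("the anchor must define both groups")

    best_youden = (None, -math.inf)   # (cut-point, Youden's J)
    best_closest = (None, math.inf)   # (cut-point, distance to the corner)
    for cut in sorted(set(change)):
        tp = sum(1 for c, flag in zip(change, improved) if flag and c >= cut)
        tn = sum(1 for c, flag in zip(change, improved) if not flag and c < cut)
        sens, spec = tp / n_pos, tn / n_neg
        youden = sens + spec - 1
        distance = math.hypot(1 - sens, 1 - spec)
        # Ties are resolved in favour of the lowest cut-point
        if youden > best_youden[1]:
            best_youden = (cut, youden)
        if distance < best_closest[1]:
            best_closest = (cut, distance)
    return {"youden": best_youden, "closest": best_closest}
```

The two criteria often select the same cut-point but can differ when sensitivity and specificity are unbalanced, which is one reason for reporting both.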
Minimally important change (group)
The MIC can be calculated as the mean change score for patients who identify themselves as having a ‘minimal’ (e.g. ‘a little’) difference on a patient-reported global transition item (anchor). An anchor-based MID was calculated using ‘somewhat satisfied’ versus ‘somewhat dissatisfied’ groups when there was no neutral scale.
Minimally important change (effect size)
An ES for the MIC can be calculated as follows:
where pooled SD is:
and SDs are group-specific SDs and Ns are study sample sizes. An ES of 0.5 has been proposed as an estimate of a MIC; the MIC (ES). 81 The MICs for health-related quality of life instruments have been noted to be close to half a SD in other studies. 77
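As an illustration only, and assuming the conventional standardised-mean-change form, the ES and pooled SD are commonly written as shown below. The symbols $\bar{x}_1$, $\bar{x}_2$, $N_1$, $N_2$, $\mathrm{SD}_1$ and $\mathrm{SD}_2$ are labels introduced here, with groups 1 and 2 taken to be the preoperation and postoperation scores.

$$\mathrm{ES} = \frac{\bar{x}_2 - \bar{x}_1}{\mathrm{SD}_{\text{pooled}}}, \qquad \mathrm{SD}_{\text{pooled}} = \sqrt{\frac{(N_1 - 1)\,\mathrm{SD}_1^2 + (N_2 - 1)\,\mathrm{SD}_2^2}{N_1 + N_2 - 2}}$$

Under the half-SD convention, setting ES = 0.5 gives $\mathrm{MIC(ES)} = 0.5 \times \mathrm{SD}_{\text{pooled}}$; for a hypothetical pooled SD of 8 points, this corresponds to a change of 4 points.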
Minimally important difference
Minimally important difference (group)
The MID (group) can be calculated as the difference in the mean change score for patients who identify themselves as having a ‘minimal’ difference (e.g. ‘a little better’) and those who identify themselves as having no change (e.g. ‘about the same’) on a patient-reported global transition item (anchor). EPOS and EUROHIP data sets both had a satisfaction Likert scale (Table 8). An anchor-based MID was calculated using ‘somewhat satisfied’ versus ‘somewhat dissatisfied’ groups when there was no neutral scale.
Satisfaction | Data set | |||
---|---|---|---|---|
EUROHIP (12 months) | EPOS (24 months) | |||
n | % | n | % | |
Very satisfied | 165 | 68 | 829 | 79 |
Somewhat satisfied | 59 | 24 | 167 | 16 |
Somewhat dissatisfied | 13 | 5 | 39 | 4 |
Very dissatisfied | 7 | 3 | 18 | 2 |
Total | 244 | 100 | 1053 | 100 |
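The group-based MIC and MID definitions can be sketched as follows. The function names and change scores are hypothetical; the response labels follow the satisfaction scale in Table 8, with ‘somewhat dissatisfied’ standing in for the reference group because the scale has no neutral category.

```python
from statistics import mean


def mic_group(change_by_response, minimal):
    """MIC (group): mean change score in the 'minimal' anchor group."""
    return mean(change_by_response[minimal])


def mid_group(change_by_response, minimal, reference):
    """MID (group): difference in mean change between the 'minimal' and reference groups."""
    return mean(change_by_response[minimal]) - mean(change_by_response[reference])


# Hypothetical change scores grouped by satisfaction response
changes = {
    "somewhat satisfied": [10, 12, 14],
    "somewhat dissatisfied": [4, 6, 8],
}
print(mid_group(changes, "somewhat satisfied", "somewhat dissatisfied"))
```

Because the reference group here reports some dissatisfaction and not ‘no change’, an MID obtained this way may differ from one based on a true neutral category.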
Minimally important difference (effect size)
An ES estimate of the MID can be calculated using an anchor (e.g. satisfaction after the operation):
where pooled SD is:
and the SD is the group-specific SD and N is the study sample size. 82 Half of a SD (ES approach) has been proposed as an estimate of a MID. 83 The MIDs for health-related quality of life instruments had previously been suggested to be close to a half of a SD. 84
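A minimal sketch of an ES calculation between two anchor groups is shown below. It assumes the usual pooled-SD form, with the difference in group means divided by the pooled SD; the exact expression used in the study may differ, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def pooled_sd(sd1, n1, sd2, n2):
    """Pooled SD of two groups, weighting each variance by its degrees of freedom."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def effect_size(group_a, group_b):
    """Standardised difference in mean change scores between two anchor groups."""
    sd = pooled_sd(stdev(group_a), len(group_a), stdev(group_b), len(group_b))
    return (mean(group_a) - mean(group_b)) / sd


def half_sd_mid(sd):
    """Half-SD rule of thumb for the MID."""
    return 0.5 * sd
```

Here `group_a` and `group_b` would hold the change scores of, for example, the ‘somewhat satisfied’ and ‘somewhat dissatisfied’ respondents, and `half_sd_mid` converts an SD into the half-SD estimate mentioned above.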
Results
Internal consistency
Internal consistency results for hip and knee scores are shown in Table 9.
Measurement tool (total) | Time point | |||
---|---|---|---|---|
Pre operation | Post operationa | |||
n | α | n | α | |
Hip | ||||
EQ-5D-3L index | ||||
APEX | 302 | 0.66 | 266 | 0.82 |
EUROHIP | 1228 | 0.66 | 883 | 0.81 |
SF-12 | ||||
ADAPT | 119 | 0.86 | 104 | 0.89 |
OHS | ||||
EPOS | 1517 | 0.88 | 1239 | 0.93 |
EUROHIP | 127 | 0.89 | 114 | 0.93 |
WOMAC | ||||
ADAPT | 112 | 0.97 | 102 | 0.98 |
APEX | 261 | 0.96 | 234 | 0.98 |
EUROHIP | 1243 | 0.95 | 865 | 0.98 |
Knee | ||||
EQ-5D-3L index | ||||
APEX | 298 | 0.66 | 261 | 0.80 |
KAT | 2120 | 0.55 | 1939 | 0.79 |
SF-12a | ||||
ADAPT | 116 | 0.81 | 96 | 0.89 |
KAT | ||||
Version 1 | 116 | –0.40 | – | |
Version 2 | 1791 | 0.38 | – | |
OKS | ||||
KAT | 2112 | 0.86 | 1691 | 0.93 |
WOMAC | ||||
ADAPT | 118 | 0.96 | 102 | 0.98 |
Hip
The internal consistency of all the instruments demonstrated that the total score can be adequately considered as one scale for hip [except the preoperative EQ-5D-3L Index (Cronbach’s alpha = 0.66)], with Cronbach’s alphas in the range of 0.88–0.89 and 0.93 at pre and post operation, respectively, for the OHS and no improvement obtained by removal of any item. Likewise, no significant improvement was obtained by removal of any item in the EQ-5D-3L index, SF-12 and WOMAC. Cronbach’s alpha values for each individual item were similarly high and are provided in Appendix 1. The postoperation data set showed slightly higher Cronbach’s alphas.
Knee
The internal consistency of all of the instruments demonstrated that the total score can be adequately considered as one scale for knee [except the preoperative EQ-5D-3L Index (Cronbach’s alpha = 0.55–0.66) and the SF-12 (Cronbach’s alpha = 0.38 and 0.40 with KAT)], with Cronbach’s alphas of 0.86 and 0.93 at pre and post operation, respectively, for the OKS and no improvement obtained by removal of any item. No significant improvement was obtained by the removal of any item in the EQ-5D-3L index, SF-12 and WOMAC. Alpha values for each individual item were similarly high and are provided in Appendix 1. The postoperation data set showed slightly higher Cronbach’s alphas.
Construct validity
Construct validity results for hip and knee data sets are shown in Tables 10 and 11.
Comparator | Time point | |||
---|---|---|---|---|
Baseline | 12 months | |||
n | Spearman’s correlation (95% CI) | n | Spearman’s correlation (95% CI) | |
EQ-5D index | ||||
APEX | ||||
WOMAC pain | 301 | 0.59 (0.51 to 0.66) | 262 | 0.67 (0.60 to 0.73) |
WOMAC function | 266 | 0.69 (0.62 to 0.75) | 233 | 0.68 (0.60 to 0.74) |
WOMAC stiffness | 294 | 0.53 (0.44 to 0.61) | 261 | 0.54 (0.44 to 0.62) |
WOMAC total | 259 | 0.70 (0.63 to 0.76) | 228 | 0.70 (0.63 to 0.76) |
EUROHIP | ||||
OHS | 124 | 0.75 (0.66 to 0.82) | 113 | 0.78 (0.70 to 0.84) |
OHS function | 128 | 0.71 (0.62 to 0.79) | 114 | 0.78 (0.70 to 0.84) |
OHS pain | 130 | 0.70 (0.59 to 0.78) | 119 | 0.62 (0.50 to 0.72) |
SF-12 PCS | ||||
ADAPT | ||||
WOMAC total | 118 | 0.57 (0.44 to 0.68) | 75 | 0.70 (0.56 to 0.80) |
WOMAC pain | 118 | 0.45 (0.29 to 0.58) | 75 | 0.71 (0.58 to 0.81) |
WOMAC function | 118 | 0.58 (0.45 to 0.69) | 75 | 0.72 (0.59 to 0.82) |
WOMAC stiffness | 118 | 0.38 (0.21 to 0.52) | 75 | 0.52 (0.33 to 0.67) |
SF-12 MCS | 118 | –0.11 (–0.28 to 0.08) | 75 | –0.13 (–0.35 to 0.09) |
SF-12 MCS | ||||
WOMAC total | 118 | 0.28 (0.10 to 0.44) | 75 | 0.18 (–0.05 to 0.39) |
WOMAC pain | 118 | 0.30 (0.12 to 0.45) | 75 | 0.15 (–0.08 to 0.36) |
WOMAC function | 118 | 0.27 (0.09 to 0.43) | 75 | 0.17 (–0.06 to 0.38) |
WOMAC stiffness | 118 | 0.24 (0.06 to 0.40) | 75 | 0.24 (0.02 to 0.45) |
SF-12 PCS | 118 | –0.11 (–0.28 to 0.08) | 75 | –0.13 (–0.35 to 0.09) |
OHS | ||||
EPOS | ||||
SF-36 general health | 1043 | 0.26 (0.21 to 0.32) | 765 | 0.50 (0.44 to 0.55) |
OHS function | ||||
SF-36 physical function | 1042 | 0.71 (0.68 to 0.74) | 773 | 0.79 (0.76 to 0.81) |
SF-36 role physical | 1038 | 0.37 (0.32 to 0.43) | 785 | 0.63 (0.59 to 0.67) |
SF-36 role emotional | 1043 | 0.26 (0.20 to 0.31) | 781 | 0.44 (0.38 to 0.49) |
SF-36 pain | 1048 | 0.65 (0.61 to 0.68) | 800 | 0.65 (0.61 to 0.69) |
SF-36 vitality | 1033 | 0.42 (0.37 to 0.47) | 780 | 0.60 (0.55 to 0.64) |
SF-36 mental health | 1034 | 0.30 (0.24 to 0.35) | 780 | 0.41 (0.35 to 0.47) |
SF-36 social function | 1042 | 0.58 (0.54 to 0.62) | 783 | 0.64 (0.58 to 0.67) |
OHS pain | ||||
SF-36 physical function | 1042 | 0.53 (0.49 to 0.58) | 771 | 0.57 (0.52 to 0.62) |
SF-36 role physical | 1038 | 0.32 (0.26 to 0.37) | 783 | 0.49 (0.43 to 0.54) |
SF-36 pain | 1048 | 0.69 (0.66 to 0.72) | 800 | 0.61 (0.56 to 0.65) |
WOMAC total | ||||
ADAPT | ||||
WOMAC pain | 125 | 0.88 (0.83 to 0.91) | 111 | 0.81 (0.74 to 0.87) |
WOMAC function | 125 | 0.99 (0.98 to 0.99) | 111 | 0.97 (0.96 to 0.98) |
WOMAC stiffness | 125 | 0.85 (0.79 to 0.89) | 111 | 0.74 (0.64 to 0.81) |
SF-12 MCS | 118 | 0.28 (0.10 to 0.44) | 75 | 0.18 (–0.05 to 0.39) |
SF-12 PCS | 118 | 0.57 (0.44 to 0.68) | 75 | 0.70 (0.56 to 0.80) |
APEX | ||||
EQ-5D score | 301 | 0.59 (0.51 to 0.66) | 262 | 0.67 (0.60 to 0.73) |
WOMAC function | 269 | 0.77 (0.72 to 0.81) | 241 | 0.70 (0.63 to 0.76) |
WOMAC stiffness | 299 | 0.63 (0.56 to 0.70) | 264 | 0.58 (0.49 to 0.65) |
WOMAC total | 261 | 0.85 (0.81 to 0.88) | 234 | 0.79 (0.74 to 0.84) |
WOMAC function | ||||
EUROHIP | ||||
OHS function | 115 | 0.84 (0.78 to 0.89) | 114 | 0.85 (0.79 to 0.89) |
WOMAC pain | ||||
EUROHIP | ||||
OHS pain | 131 | 0.78 (0.70 to 0.84) | 118 | 0.66 (0.54 to 0.75) |
Comparator | Baseline: n | Baseline: Spearman’s correlation (95% CI) | 12 months: n | 12 months: Spearman’s correlation (95% CI) |
---|---|---|---|---|
EQ-5D index | ||||
KAT | ||||
OKS | 2073 | 0.70 (0.67 to 0.72) | 1647 | 0.78 (0.76 to 0.80) |
OKS function | 2097 | 0.65 (0.63 to 0.68) | 1702 | 0.74 (0.72 to 0.76) |
OKS pain | 2093 | 0.62 (0.59 to 0.65) | 1857 | 0.74 (0.72 to 0.76) |
SF-12 PCS | 2055 | 0.43 (0.40 to 0.47) | 1857 | 0.72 (0.70 to 0.74) |
SF-12 MCS | 2055 | 0.42 (0.38 to 0.45) | 1857 | 0.46 (0.43 to 0.50) |
SF-12 PCS | ||||
ADAPT | ||||
WOMAC total | 114 | 0.36 (0.19 to 0.51) | 74 | 0.75 (0.64 to 0.84) |
WOMAC pain | 115 | 0.38 (0.21 to 0.52) | 74 | 0.65 (0.49 to 0.76) |
WOMAC function | 115 | 0.35 (0.18 to 0.50) | 74 | 0.74 (0.61 to 0.83) |
WOMAC stiffness | 115 | 0.20 (0.02 to 0.37) | 74 | 0.57 (0.39 to 0.70) |
SF-12 MCS | 116 | –0.36 (–0.51 to –0.19) | 74 | 0.0 (–0.23 to 0.23) |
KAT | ||||
OKS function | 2062 | 0.50 (0.47 to 0.53) | 1678 | 0.71 (0.68 to 0.73) |
OKS pain | 2066 | 0.50 (0.47 to 0.54) | 1831 | 0.62 (0.59 to 0.65) |
SF-12 MCS | ||||
ADAPT | ||||
WOMAC total | 114 | 0.11 (–0.07 to 0.29) | 74 | 0.25 (0.02 to 0.45) |
WOMAC pain | 115 | 0.02 (–0.16 to 0.21) | 74 | 0.19 (–0.04 to 0.40) |
WOMAC function | 115 | 0.12 (–0.07 to 0.29) | 74 | 0.26 (0.03 to 0.46) |
WOMAC stiffness | 115 | 0.19 (0.01 to 0.36) | 74 | 0.23 (0.01 to 0.44) |
SF-12 PCS | 116 | –0.36 (–0.51 to –0.19) | 74 | 0 (–0.23 to 0.23) |
KAT | ||||
OKS function | 2062 | 0.41 (0.37 to 0.44) | 1678 | 0.43 (0.39 to 0.47) |
WOMAC total | ||||
ADAPT | ||||
WOMAC pain | 126 | 0.86 (0.81 to 0.90) | 110 | 0.92 (0.88 to 0.94) |
WOMAC function | 126 | 0.98 (0.97 to 0.98) | 110 | 0.99 (0.98 to 0.99) |
WOMAC stiffness | 126 | 0.69 (0.59 to 0.77) | 110 | 0.81 (0.73 to 0.87) |
SF-12 MCS | 114 | 0.11 (–0.07 to 0.29) | 74 | 0.25 (0.02 to 0.45) |
SF-12 PCS | 114 | 0.36 (0.19 to 0.51) | 74 | 0.75 (0.64 to 0.84) |
APEX | ||||
EQ-5D index | 244 | 0.72 (0.65 to 0.77) | 209 | 0.80 (0.74 to 0.84) |
WOMAC pain | 246 | 0.85 (0.81 to 0.88) | 214 | 0.89 (0.86 to 0.92) |
WOMAC function | 246 | 0.98 (0.98 to 0.99) | 214 | 0.99 (0.98 to 0.99) |
WOMAC stiffness | 246 | 0.70 (0.63 to 0.76) | 214 | 0.79 (0.73 to 0.84) |
Hip
Construct validity (based on Spearman’s correlation) was generally supported with moderate to strong correlations between the instruments, except for correlations involving the SF-12 PCS, SF-12 MCS and SF-36. Correlations between the instruments at pre operation versus post operation were generally similar.
Knee
For the knee instruments, the pattern was broadly similar to the hip instruments. Construct validity (based on Spearman’s correlation) was generally supported with moderate to strong correlations between the instruments, except for correlations involving the SF-12 PCS and SF-12 MCS. Correlations between the instruments at post operation tended to be higher than those at pre operation.
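The correlations and intervals in Tables 10 and 11 can be illustrated with a short sketch. It computes Spearman’s correlation as Pearson’s correlation on average ranks and attaches a 95% CI using the Fisher z-transformation. The CI method is an assumption (it is not stated in this section), and all function names and data are hypothetical.

```python
import math

def average_ranks(values):
    """Ranks from 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def fisher_ci(rho, n, z_crit=1.96):
    """Approximate CI for a correlation (assumed method: Fisher z, SE = 1/sqrt(n-3))."""
    if abs(rho) >= 1:
        return rho, rho  # transformation is undefined at exactly +/-1
    z = math.atanh(rho)
    half_width = z_crit / math.sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)

def spearman_with_ci(x, y):
    rho = pearson(average_ranks(x), average_ranks(y))
    return (rho, *fisher_ci(rho, len(x)))

# Hypothetical paired scores from two instruments
rho, lower, upper = spearman_with_ci([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(f"rho = {rho:.2f} ({lower:.2f} to {upper:.2f})")
```

Under this assumed method, `fisher_ci(0.59, 301)` gives roughly 0.51 to 0.66, which is in line with the interval reported for the EQ-5D index against WOMAC pain in the APEX hip data at baseline.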
Responsiveness
Responsiveness results for hip and knee scores are shown in Tables 12 and 13, respectively.
Comparator | n | Spearman’s correlation (95% CI) | Pearson’s correlation (95% CI) |
---|---|---|---|
Change EQ-5D-3L index | |||
APEX | |||
WOMAC pain | 247 | 0.53 (0.44 to 0.62) | 0.55 (0.46 to 0.63) |
WOMAC function | 198 | 0.58 (0.48 to 0.66) | 0.59 (0.49 to 0.67) |
WOMAC stiffness | 243 | 0.46 (0.36 to 0.56) | 0.48 (0.38 to 0.57) |
WOMAC total | 193 | 0.58 (0.48 to 0.67) | 0.59 (0.49 to 0.67) |
Change SF-12 PCS | |||
ADAPT | |||
SF-12 MCS | 72 | –0.27 (–0.47 to –0.04) | –0.21 (–0.42 to 0.02) |
WOMAC pain | 72 | 0.47 (0.26 to 0.63) | 0.44 (0.23 to 0.61) |
WOMAC function | 72 | 0.46 (0.25 to 0.62) | 0.48 (0.28 to 0.64) |
WOMAC stiffness | 72 | 0.18 (–0.06 to 0.39) | 0.22 (–0.01 to 0.43) |
WOMAC total | 72 | 0.45 (0.25 to 0.62) | 0.47 (0.26 to 0.63) |
Change SF-12 MCS | |||
SF-12 PCS | 72 | –0.27 (–0.47 to –0.04) | –0.21 (–0.42 to 0.02) |
WOMAC pain | 72 | 0.27 (0.04 to 0.47) | 0.35 (0.13 to 0.54) |
WOMAC function | 72 | 0.24 (0.01 to 0.45) | 0.31 (0.09 to 0.51) |
WOMAC stiffness | 72 | 0.34 (0.12 to 0.53) | 0.33 (0.10 to 0.52) |
WOMAC total | 72 | 0.27 (0.04 to 0.47) | 0.33 (0.11 to 0.52) |
Change OHS | |||
EPOS | |||
SF-36 general health | 739 | –0.25 (–0.31 to –0.18) | –0.27 (–0.34 to –0.20) |
EUROHIP | |||
EQ-5D-3L usual activities | 113 | 0.44 (0.28 to 0.58) | 0.44 (0.28 to 0.58) |
EQ-5D-3L pain | 113 | 0.58 (0.44 to 0.69) | 0.57 (0.43 to 0.68) |
EQ-5D-3L self-care | 113 | 0.33 (0.15 to 0.48) | 0.36 (0.18 to 0.51) |
EQ-5D-3L index | 110 | 0.57 (0.43 to 0.68) | 0.56 (0.42 to 0.68) |
WOMAC pain | 110 | 0.63 (0.50 to 0.73) | 0.60 (0.47 to 0.71) |
EUROHIP | |||
EQ-5D-3L index | 107 | 0.54 (0.40 to 0.67) | 0.53 (0.38 to 0.66) |
WOMAC function | 107 | 0.80 (0.71 to 0.86) | 0.80 (0.72 to 0.86) |
Change WOMAC total | |||
ADAPT | |||
SF-12 MCS | 72 | 0.27 (0.04 to 0.47) | 0.33 (0.11 to 0.52) |
SF-12 PCS | 72 | 0.45 (0.25 to 0.62) | 0.47 (0.26 to 0.63) |
WOMAC pain | 110 | 0.89 (0.85 to 0.93) | 0.89 (0.84 to 0.92) |
WOMAC function | 110 | 0.99 (0.99 to 0.99) | 0.99 (0.99 to 0.99) |
WOMAC stiffness | 110 | 0.81 (0.74 to 0.87) | 0.83 (0.77 to 0.88) |
APEX | |||
EQ-5D-3L index | 193 | 0.58 (0.48 to 0.67) | 0.59 (0.49 to 0.67) |
WOMAC pain | 200 | 0.79 (0.73 to 0.84) | 0.84 (0.79 to 0.87) |
WOMAC function | 200 | 0.98 (0.97 to 0.99) | 0.98 (0.98 to 0.99) |
WOMAC stiffness | 200 | 0.70 (0.62 to 0.76) | 0.71 (0.63 to 0.77) |
Comparator | n | Spearman’s correlation (95% CI) | Pearson’s correlation (95% CI) |
---|---|---|---|
Change EQ-5D-3L index | |||
APEX | |||
WOMAC pain | 240 | 0.50 (0.40 to 0.59) | 0.53 (0.43 to 0.62) |
WOMAC function | 180 | 0.53 (0.41 to 0.63) | 0.56 (0.45 to 0.65) |
WOMAC stiffness | 233 | 0.50 (0.40 to 0.59) | 0.53 (0.43 to 0.62) |
WOMAC total | 169 | 0.57 (0.46 to 0.67) | 0.60 (0.49 to 0.69) |
KAT | |||
OKS | 1565 | 0.56 (0.52 to 0.59) | 0.56 (0.53 to 0.60) |
OKS function | 1632 | 0.47 (0.43 to 0.51) | 0.49 (0.45 to 0.53) |
OKS pain | 1784 | 0.55 (0.52 to 0.58) | 0.55 (0.52 to 0.59) |
SF-12 PCS | 1749 | 0.41 (0.37 to 0.45) | 0.42 (0.38 to 0.45) |
SF-12 MCS | 1749 | 0.24 (0.20 to 0.29) | 0.27 (0.22 to 0.31) |
Change SF-12 PCS | |||
ADAPT | |||
SF-12 MCS | 65 | 0.04 (–0.20 to 0.28) | 0.10 (–0.15 to 0.33) |
WOMAC pain | 65 | 0.56 (0.37 to 0.71) | 0.56 (0.37 to 0.71) |
WOMAC function | 65 | 0.63 (0.45 to 0.76) | 0.61 (0.43 to 0.75) |
WOMAC stiffness | 64 | 0.48 (0.26 to 0.65) | 0.41 (0.18 to 0.59) |
WOMAC total | 64 | 0.64 (0.47 to 0.76) | 0.62 (0.44 to 0.75) |
KAT | |||
OKS function | 1581 | 0.54 (0.50 to 0.57) | 0.55 (0.52 to 0.59) |
OKS pain | 1731 | 0.54 (0.50 to 0.57) | 0.55 (0.52 to 0.58) |
Change SF-12 MCS | |||
ADAPT | |||
SF-12 PCS | 65 | 0.04 (–0.20 to 0.28) | 0.10 (–0.15 to 0.33) |
WOMAC pain | 65 | 0.24 (–0.00 to 0.46) | 0.28 (0.04 to 0.49) |
WOMAC function | 65 | 0.27 (0.03 to 0.49) | 0.34 (0.10 to 0.54) |
WOMAC stiffness | 64 | 0.28 (0.03 to 0.49) | 0.33 (0.09 to 0.53) |
WOMAC total | 64 | 0.28 (0.04 to 0.50) | 0.34 (0.11 to 0.54) |
KAT | |||
OKS function | 1581 | 0.27 (0.22 to 0.31) | 0.30 (0.25 to 0.34) |
Change WOMAC total | |||
ADAPT | |||
SF-12 MCS | 64 | 0.28 (0.04 to 0.50) | 0.34 (0.11 to 0.54) |
SF-12 PCS | 64 | 0.64 (0.47 to 0.76) | 0.62 (0.44 to 0.75) |
WOMAC pain | 108 | 0.88 (0.83 to 0.92) | 0.90 (0.86 to 0.93) |
WOMAC function | 108 | 0.98 (0.98 to 0.99) | 0.99 (0.98 to 0.99) |
WOMAC stiffness | 108 | 0.74 (0.65 to 0.82) | 0.73 (0.63 to 0.81) |
APEX | |||
EQ-5D-3L index | 169 | 0.57 (0.46 to 0.67) | 0.60 (0.49 to 0.69) |
WOMAC pain | 174 | 0.84 (0.79 to 0.88) | 0.86 (0.82 to 0.90) |
WOMAC function | 174 | 0.98 (0.97 to 0.98) | 0.98 (0.98 to 0.99) |
WOMAC stiffness | 174 | 0.74 (0.66 to 0.80) | 0.77 (0.70 to 0.82) |
Hip
Overall, correlations between the change scores of the instruments show a moderate (> 0.5) association, except mainly for the correlations involving SF-12 MCS and PCS instruments. Some of the individual EQ-5D-3L domains also had lower correlations. Pearson’s and Spearman’s correlations were similar.
Knee
Overall, correlations between the change scores of the instruments show a moderate (> 0.5) association, except for the correlations involving the SF-12 MCS and PCS instruments. Pearson’s and Spearman’s correlations were similar.
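The responsiveness analysis correlates change scores between instruments. The sketch below shows one way this could be set up: change is taken as postoperative minus preoperative score, and only patients with both change scores available contribute, which is why n differs between rows of Tables 12 and 13. The complete-case handling, the function names and the data are all assumptions made for illustration.

```python
import math

def change_scores(pre, post):
    """Post minus pre for each patient; None where either score is missing."""
    return [None if a is None or b is None else b - a for a, b in zip(pre, post)]

def pearson_complete_cases(x, y):
    """Pearson's r over patients who have both change scores; returns (n, r)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return n, sxy / math.sqrt(sxx * syy)

# Hypothetical data for six patients; None marks a missing questionnaire
ohs_pre = [15, 20, 10, 18, None, 22]
ohs_post = [40, 38, 30, 45, 41, None]
eq5d_pre = [0.30, 0.50, 0.10, 0.40, 0.20, 0.60]
eq5d_post = [0.80, 0.75, 0.60, 1.00, 0.70, 0.90]

n, r = pearson_complete_cases(
    change_scores(ohs_pre, ohs_post),
    change_scores(eq5d_pre, eq5d_post),
)
print(f"n = {n}, Pearson's r = {r:.2f}")
```

Spearman’s correlation would follow the same steps applied to the ranks of the change scores.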
Floor and ceiling effects
Floor and ceiling effect results for the hip and knee scores are shown in Tables 14 and 15, respectively.
Measurement tool | Pre operation: n | Pre operation: floor (%) | Pre operation: ceiling (%) | Post operation: n | Post operation: floor (%) | Post operation: ceiling (%) |
---|---|---|---|---|---|---|
EUROHIP | ||||||
EQ-5D-3L index | 1228 | 0.1 | 0.7 | 883 | – | 39.0 |
APEX | ||||||
EQ-5D-3L index | 302 | 0.0 | 1.3 | 266 | 0.0 | 46.2 |
ADAPT | ||||||
SF-12 MCS | 118 | 0.0 | 0.0 | 75 | 0.0 | 0.0 |
SF-12 PCS | 118 | 0.0 | 0.0 | 75 | 0.0 | 0.0 |
EPOS | ||||||
OHS total | 1517 | 0.1 | – | 1239 | – | 19.1 |
OHS pain | 1527 | 2.6 | 0.1 | 1247 | – | 33.8 |
OHS function | 1520 | 0.3 | 0.1 | 1248 | 0.2 | 26.4 |
EUROHIP | ||||||
OHS total | 127 | – | – | 114 | – | 14.0 |
OHS pain | 133 | 3.8 | – | 120 | – | 35.8 |
OHS function | 131 | – | – | 115 | – | 19.1 |
ADAPT | ||||||
WOMAC total | 125 | 0 | 0 | 111 | 0 | 20.7 |
WOMAC pain | 125 | 1.6 | 1.6 | 111 | 0 | 55.0 |
WOMAC function | 125 | 0.8 | 1.6 | 111 | 0 | 24.3 |
WOMAC stiffness | 125 | 4.8 | 4.8 | 111 | 0 | 53.2 |
APEX | ||||||
WOMAC total | 261 | 0.4 | 0 | 234 | 0 | 19.7 |
WOMAC pain | 323 | 2.5 | 0.6 | 279 | 0 | 46.6 |
WOMAC function | 270 | 0.7 | 0 | 242 | 0 | 27.7 |
WOMAC stiffness | 300 | 4.3 | 3.3 | 268 | 0 | 44.8 |
EUROHIP | ||||||
WOMAC total | 1243 | 0.2 | – | 865 | 0.1 | 7.2 |
WOMAC pain | 1255 | 1.0 | 0.2 | 875 | 0.1 | 33.3 |
WOMAC function | 1266 | 5.9 | 1.6 | 888 | 0.5 | 26.6 |
WOMAC stiffness | 1253 | 0.6 | – | 888 | – | 9.2 |
Measurement tool | Pre operation: n | Pre operation: floor (%) | Pre operation: ceiling (%) | Post operation: n | Post operation: floor (%) | Post operation: ceiling (%) |
---|---|---|---|---|---|---|
APEX | ||||||
EQ-5D-3L index | 298 | 0 | 0.7 | 261 | 0.4 | 29.5 |
KAT | ||||||
EQ-5D-3L index | 2120 | – | 0.4 | 1939 | – | 25.4 |
SF-12 MCS | 116 | 0 | 0 | 74 | 0 | 0 |
SF-12 PCS | 116 | 0 | 0 | 74 | 0 | 0 |
KAT | ||||||
OKS total | 2112 | 0.1 | – | 1691 | 0.12 | 2.3 |
OKS pain | 2132 | 0.1 | 0.1 | 1753 | 0.2 | 3.7 |
OKS function | 2136 | 0.9 | 0.1 | 1906 | 0.1 | 17.0 |
ADAPT | ||||||
WOMAC total | 126 | 0.8 | 0.0 | 110 | 0.0 | 3.6 |
WOMAC pain | 127 | 0.8 | 0.0 | 110 | 0.0 | 20.9 |
WOMAC function | 127 | 0.8 | 0.0 | 110 | 0.0 | 9.1 |
WOMAC stiffness | 127 | 3.9 | 2.4 | 110 | 0.9 | 17.3 |
APEX | ||||||
WOMAC total | 246 | 0.0 | 0.0 | 214 | 0.0 | 5.1 |
WOMAC pain | 318 | 2.2 | 0.0 | 269 | 0.4 | 30.5 |
WOMAC function | 253 | 0.0 | 0.0 | 224 | 0.0 | 9.4 |
WOMAC stiffness | 293 | 2.7 | 1.0 | 260 | 0.8 | 17.3 |
Hip
No floor effects were detected at pre operation for any of the instruments. Substantial ceiling effects were noted at post operation only. This was particularly the case for the EQ-5D-3L index (39–46%), although all instruments assessed mostly had a substantial proportion with the highest possible value. The minimum ceiling effect was 7%, which was observed for the WOMAC total in the EUROHIP data set.
Knee
No floor effects were detected at pre operation for any of the instruments. Substantial ceiling effects were noted at post operation only, for the EQ-5D-3L index and WOMAC pain scores, and to a lesser extent than for the hip data sets. Again, this was most strongly the case for the EQ-5D-3L index (25–30%).
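Floor and ceiling percentages of the kind shown in Tables 14 and 15 are the share of respondents at the worst and best possible scores. A minimal sketch follows, assuming a scale on which higher scores indicate better health (e.g. 0–48 for the OHS and OKS); the function name and data are hypothetical.

```python
def floor_ceiling(scores, min_score, max_score):
    """Return (n, floor %, ceiling %) for the non-missing scores."""
    valid = [s for s in scores if s is not None]
    n = len(valid)
    floor_pct = 100 * sum(s == min_score for s in valid) / n
    ceiling_pct = 100 * sum(s == max_score for s in valid) / n
    return n, floor_pct, ceiling_pct

# Hypothetical postoperative scores on a 0-48 scale; None marks a missing response
post_scores = [0, 48, 48, 30, None, 12, 48, 20, 5, 48, 40]
n, floor_pct, ceiling_pct = floor_ceiling(post_scores, min_score=0, max_score=48)
print(f"n = {n}, floor = {floor_pct:.1f}%, ceiling = {ceiling_pct:.1f}%")
```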
Interpretability
Minimal detectable change (90% significance level)
Literature-based minimal detectable change
Minimal detectable change (90% significance) was calculated using the ICC values identified in the literature review; the results are presented in Table 16. Assuming an ICC of 0.84, the MDC for the SF-12 PCS was ± 7–8 points. With an ICC of 0.80, the MDC for the SF-12 MCS was 10–12 points for hip and knee. 78 Using ICCs of 0.73, 0.78 and 0.53 for the WOMAC pain, physical function and stiffness subscales, respectively, the MDC values for hip and knee ranged from 20 to 27, 20 to 27 and 23 to 39, respectively. 79
Measurement tool | n (pre operation) | ICC | MDC 90% |
---|---|---|---|
Hip | |||
ADAPT | |||
SF-12 PCS | 72 | 0.84 | ±7.94 |
SF-12 MCS | 72 | 0.80 | ±9.55 |
ADAPT | |||
WOMAC pain | 110 | 0.73 | ±26.60 |
WOMAC function | 110 | 0.78 | ±24.27 |
WOMAC stiffness | 110 | 0.53 | ±38.71 |
APEX | |||
WOMAC pain | 278 | 0.73 | ±22.71 |
WOMAC function | 209 | 0.78 | ±20.64 |
WOMAC stiffness | 255 | 0.53 | ±36.60 |
EUROHIP | |||
WOMAC pain | 1255 | 0.73 | ±21.55 |
WOMAC function | 1253 | 0.53 | ±26.67 |
WOMAC stiffness | 1266 | 0.78 | ±22.62 |
Knee | |||
ADAPT | |||
SF-12 PCS | 65 | 0.84 | ±6.67 |
SF-12 MCS | 65 | 0.80 | ±9.54 |
KAT | |||
SF-12 PCS | 2087 | 0.84 | ±7.60 |
SF-12 MCS | 2087 | 0.80 | ±11.99 |
ADAPT | |||
WOMAC pain | 109 | 0.73 | ±22.89 |
WOMAC function | 109 | 0.78 | ±19.87 |
WOMAC stiffness | 108 | 0.53 | ±32.34 |
APEX | |||
WOMAC pain | 269 | 0.73 | ±20.20 |
WOMAC function | 187 | 0.78 | ±19.54 |
WOMAC stiffness | 242 | 0.53 | ±32.21 |
Minimal detectable change using assumed intraclass correlation coefficient values
Reported ICC values from previous studies reflect the limitations of those studies in terms of population and precision. MDCs were therefore also calculated using arbitrary ICC values of 0.5, 0.7 and 0.9; these are reported in Table 17 for hip and knee scores.
Measurement tool | n (pre operation) | MDC (90%): ICC 0.5 | MDC (90%): ICC 0.7 | MDC (90%): ICC 0.9 |
---|---|---|---|---|
Hip | ||||
EUROHIP | ||||
EQ-5D-3L index | 1228 | ±0.54 | ±0.42 | ±0.24 |
APEX | ||||
EQ-5D-3L index | 250 | ±0.54 | ±0.42 | ±0.24 |
ADAPT | ||||
SF-12 PCS | 72 | ±14.04 | ±10.87 | ±6.28 |
SF-12 MCS | 72 | ±15.11 | ±11.70 | ±6.76 |
EPOS | ||||
OHS | 1517 | ±13.12 | ±10.16 | ±5.87 |
EUROHIP | ||||
OHS | 127 | ±13.10 | ±10.15 | ±5.86 |
ADAPT | ||||
WOMAC total | 110 | ±35.00 | ±27.11 | ±15.65 |
WOMAC pain | 110 | ±36.20 | ±28.04 | ±16.19 |
WOMAC function | 110 | ±36.59 | ±28.34 | ±16.36 |
WOMAC stiffness | 110 | ±39.92 | ±30.92 | ±17.85 |
APEX | ||||
WOMAC total | 200 | ±29.47 | ±22.83 | ±13.18 |
WOMAC pain | 278 | ±30.90 | ±23.93 | ±13.82 |
WOMAC function | 209 | ±31.11 | ±24.10 | ±13.91 |
WOMAC stiffness | 255 | ±37.75 | ±29.24 | ±16.88 |
EUROHIP | ||||
WOMAC total | 1243 | ±26.41 | ±20.46 | ±11.81 |
WOMAC pain | 1255 | ±29.33 | ±22.72 | ±13.12 |
WOMAC function | 1253 | ±27.51 | ±21.31 | ±12.30 |
WOMAC stiffness | 1266 | ±34.10 | ±26.41 | ±15.25 |
Knee | ||||
KAT | ||||
EQ-5D-3L index | 2120 | ±0.51 | ±0.39 | ±0.23 |
APEX | ||||
EQ-5D-3L index | 248 | ±0.51 | ±0.39 | ±0.23 |
KAT | ||||
SF-12 PCS | 2087 | ±13.43 | ±10.40 | ±6.01 |
SF-12 MCS | 2087 | ±18.96 | ±14.68 | ±8.48 |
ADAPT | ||||
SF-12 PCS | 65 | ±11.78 | ±9.13 | ±5.27 |
SF-12 MCS | 65 | ±15.08 | ±11.68 | ±6.74 |
KAT | ||||
OKS | 2112 | ±12.40 | ±9.60 | ±5.54 |
ADAPT | ||||
WOMAC total | 108 | ±28.68 | ±22.22 | ±12.83 |
WOMAC pain | 109 | ±31.15 | ±24.13 | ±13.93 |
WOMAC function | 109 | ±29.96 | ±23.20 | ±13.40 |
WOMAC stiffness | 108 | ±33.36 | ±25.84 | ±14.92 |
APEX | ||||
WOMAC total | 174 | ±27.06 | ±20.96 | ±12.10 |
WOMAC pain | 269 | ±27.49 | ±21.29 | ±12.29 |
WOMAC function | 187 | ±29.45 | ±22.82 | ±13.17 |
WOMAC stiffness | 242 | ±33.23 | ±25.74 | ±14.86 |
Hip
Using an ICC of 0.9 provided MDCs of 0.24 for the EQ-5D-3L index, 6 points for the OHS total score, 6 and 7 for the SF-12 PCS and MCS, respectively, and 12–16 points for the WOMAC total score across the data sets. MDCs with ICCs of 0.5 and 0.7 were substantially larger, as would be anticipated.
Knee
Using an ICC of 0.9 provided MDCs of 0.23 for the EQ-5D-3L index, 5 and 6 for the SF-12 PCS and MCS, respectively, 6 points for the OKS total score and 13 for the WOMAC subscales across the data sets. MDCs with assumed ICCs of 0.5 and 0.7 were substantially larger, as would be anticipated.
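The dependence of the MDC on the assumed ICC can be sketched with the widely used formula MDC90 = 1.645 × √2 × SEM, where SEM = SD × √(1 – ICC). This formula is an assumption here, as the calculation is not restated in this section, and the SD of 0.33 is an illustrative value rather than one taken from the data sets.

```python
import math

Z_90 = 1.645  # two-sided 90% quantile of the standard normal distribution

def mdc90(sd, icc):
    """Minimal detectable change at the 90% level (assumed standard formula)."""
    sem = sd * math.sqrt(1 - icc)      # standard error of measurement
    return Z_90 * math.sqrt(2) * sem   # sqrt(2) reflects two measurements

# Illustrative baseline SD (hypothetical) and the three assumed ICC values
sd_baseline = 0.33
for icc in (0.5, 0.7, 0.9):
    print(f"ICC {icc}: MDC90 = +/-{mdc90(sd_baseline, icc):.2f}")
```

With an SD of 0.33 this gives about ±0.54, ±0.42 and ±0.24, showing how strongly the MDC shrinks as the assumed ICC rises from 0.5 to 0.9.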
Minimally important change and minimally important difference
Hip
A suitable anchor was available in two data sets, EUROHIP (EQ-5D-3L, OHS and WOMAC) and EPOS (OHS), and was applied to the total score only. In EUROHIP, 244 patients answered the satisfaction question after operation. Of these, 59 patients (24%) answered ‘somewhat satisfied’ and 13 patients (5.3%) answered ‘somewhat dissatisfied’. Because the satisfaction scale had no neutral category, these ‘somewhat satisfied’ versus ‘somewhat dissatisfied’ groups were used for the anchor-based analyses in this report. In EPOS, 1053 patients answered the satisfaction question after operation. Of these, 167 patients (16%) answered ‘somewhat satisfied’ and 39 patients (4%) answered ‘somewhat dissatisfied’. The same two response groups were used for the anchor-based analyses.
Figure 4 shows the EQ-5D-3L index ROC curve for the EUROHIP data. Table 18 gives the preoperative and postoperative values in accordance with the anchor, and Table 19 gives the area under the curve (AUC), MIC (ROC), MIC (group), MIC (ES), MID (group) and MID (ES) estimates, along with the sensitivity and specificity for the optimal cut-off point. The AUC was 0.69. The MIC (ROC) value was 0.07 for both the Youden and the shortest distance methods. MIC (group) and MID (group) were much larger, at 0.36 and 0.28 points, respectively. The values for MIC (ES) and MID (ES) were both around 1.00.
FIGURE 4.
The EQ-5D-3L index: EUROHIP data set ROC curve.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig4.png)
Score | Somewhat satisfied: n | Somewhat satisfied: mean score | Somewhat satisfied: SD | Somewhat dissatisfied: n | Somewhat dissatisfied: mean score | Somewhat dissatisfied: SD |
---|---|---|---|---|---|---|
Preoperative | 56 | 0.33 | 0.32 | 13 | 0.34 | 0.33 |
Postoperative | 59 | 0.69 | 0.25 | 13 | 0.42 | 0.32 |
Change | 56 | 0.36 | 0.29 | 13 | 0.08 | 0.33 |
MIC | MID | |||||
---|---|---|---|---|---|---|
Individual-level MIC (ROC analysis) | Scale | MIC | ||||
AUC (95% CI) | Optimal cut-off point | Sensitivity | 1 – specificity | |||
0.69 (0.52 to 0.85) | ||||||
Youden’s index | 0.07 | 0.79 | 0.46 | Group | 0.36 | 0.28 |
Shortest distance | 0.07 | 0.79 | 0.46 | ES | 1.26 | 0.97 |
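The two cut-off rules named in Table 19 can be illustrated as follows. For each candidate change score, sensitivity and specificity are computed against the anchor; Youden’s index picks the cut-off that maximises sensitivity + specificity – 1, and the shortest distance method picks the cut-off closest to the top-left corner of the ROC plot. The sketch assumes that a larger change means greater improvement, and the function names and data are hypothetical.

```python
import math

def roc_cutoffs(changes, satisfied):
    """Return (Youden cut-off, shortest-distance cut-off) for change scores."""
    pos = [c for c, s in zip(changes, satisfied) if s]
    neg = [c for c, s in zip(changes, satisfied) if not s]
    best_youden = best_dist = None
    for cut in sorted(set(changes)):
        sens = sum(c >= cut for c in pos) / len(pos)   # satisfied at or above cut-off
        spec = sum(c < cut for c in neg) / len(neg)    # dissatisfied below cut-off
        youden = sens + spec - 1
        dist = math.hypot(1 - sens, 1 - spec)          # distance to the (0, 1) corner
        if best_youden is None or youden > best_youden[0]:
            best_youden = (youden, cut)
        if best_dist is None or dist < best_dist[0]:
            best_dist = (dist, cut)
    return best_youden[1], best_dist[1]

def auc(changes, satisfied):
    """Area under the ROC curve as the proportion of concordant pairs."""
    pos = [c for c, s in zip(changes, satisfied) if s]
    neg = [c for c, s in zip(changes, satisfied) if not s]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical change scores; True = 'somewhat satisfied', False = 'somewhat dissatisfied'
changes = [0.4, 0.3, 0.5, 0.1, 0.35, 0.0, 0.05, 0.2, -0.1]
satisfied = [True, True, True, True, True, False, False, False, False]
youden_cut, distance_cut = roc_cutoffs(changes, satisfied)
print(youden_cut, distance_cut, auc(changes, satisfied))
```

The two rules often select the same cut-off, as in Tables 19 and 24, but they can differ when the ROC curve is flat near its optimum, as in Table 22.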
Owing to the low number of OHS observations in the EUROHIP data set for which satisfaction was ‘somewhat satisfied’ or ‘somewhat dissatisfied’ (Table 20), the MIC (ROC) was not calculated for the OHS in this data set.
Score | Somewhat satisfied: n | Somewhat satisfied: mean | Somewhat satisfied: SD | Somewhat dissatisfied: n | Somewhat dissatisfied: mean | Somewhat dissatisfied: SD |
---|---|---|---|---|---|---|
Preoperative | 11 | 14.36 | 3.61 | 3 | 13.33 | 10.50 |
Postoperative | 11 | 30.50 | 7.99 | 4 | 26.50 | 3.32 |
Change | 10 | 16.14 | 6.09 | 3 | 13.17 | 7.12 |
Figure 5 shows the OHS ROC curve for the EPOS data set. Table 21 gives the preoperative and postoperative values in accordance with the anchor, and Table 22 gives the corresponding AUC, MIC (ROC), MIC (group), MIC (ES), MID (group) and MID (ES) estimates, along with the sensitivity and specificity for the optimal cut-off point. The AUC was 0.69. The MIC (ROC) values were 7 and 14 points for the Youden and shortest distance methods, respectively. The MIC (group) was larger, at 16 points, but the MID (group) was similar, at 6 points. The MIC (ES) and MID (ES) were approximately 2 and 1, respectively.
FIGURE 5.
The total OHS: EPOS data set ROC curve.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig5.png)
Score | Somewhat satisfied: n | Somewhat satisfied: mean | Somewhat satisfied: SD | Somewhat dissatisfied: n | Somewhat dissatisfied: mean | Somewhat dissatisfied: SD |
---|---|---|---|---|---|---|
Preoperative | 163 | 14.52 | 7.10 | 39 | 14.80 | 8.00 |
Postoperative | 164 | 30.85 | 9.09 | 38 | 25.05 | 8.13 |
Change | 163 | 16.33 | 8.16 | 39 | 10.25 | 8.06 |
MIC | MID | |||||
---|---|---|---|---|---|---|
Individual-level MIC (ROC analysis) | Scale | MID | ||||
AUC (95% CI) | Optimal cut-off point | Sensitivity | 1 – specificity | |||
0.69 (0.59 to 0.78) | ||||||
Youden’s index | 7 | 0.01 | 1 | Group | 16.33 | 6.08 |
Shortest distance | 14 | 0.44 | 0.68 | ES | 2 | 0.75 |
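The group-level and ES estimates in Table 22 appear to follow directly from the summary statistics in Table 21. The sketch below uses definitions inferred from the reported numbers rather than stated in this section: MIC (group) as the mean change in the ‘somewhat satisfied’ group, MID (group) as the difference in mean change between the two groups, and the ES versions as each of these divided by the SD of change in the ‘somewhat satisfied’ group. The function name is hypothetical.

```python
def group_estimates(mean_change_sat, sd_change_sat, mean_change_dissat):
    """Group-level MIC/MID and effect-size versions (inferred definitions)."""
    mic_group = mean_change_sat
    mid_group = mean_change_sat - mean_change_dissat
    return {
        "MIC (group)": mic_group,
        "MID (group)": mid_group,
        "MIC (ES)": mic_group / sd_change_sat,
        "MID (ES)": mid_group / sd_change_sat,
    }

# Change-score summary values for the OHS in EPOS, as given in Table 21
estimates = group_estimates(
    mean_change_sat=16.33, sd_change_sat=8.16, mean_change_dissat=10.25
)
for name, value in estimates.items():
    print(f"{name}: {value:.2f}")
```

Under these inferred definitions the sketch returns 16.33, 6.08, 2.00 and 0.75, matching the Group and ES rows of Table 22.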
Figure 6 shows the WOMAC total ROC curve for the EUROHIP data. Table 23 gives the preoperative and postoperative values in accordance with the anchor, and Table 24 gives the AUC, MIC (ROC), MIC (group), MIC (ES), MID (group) and MID (ES) estimates, along with the sensitivity and specificity for the optimal cut-off point. The AUC was 0.76. The MIC (ROC) optimal cut-off point was the same for the Youden and shortest distance methods: 26 points. The MIC (group) and MID (group) were 31 and 19 points, respectively. The MIC (ES) and MID (ES) were approximately 2 and 1, respectively.
FIGURE 6.
The total WOMAC score: EUROHIP data set ROC curve.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig6.png)
Score | Somewhat satisfied: n | Somewhat satisfied: mean | Somewhat satisfied: SD | Somewhat dissatisfied: n | Somewhat dissatisfied: mean | Somewhat dissatisfied: SD |
---|---|---|---|---|---|---|
Preoperative | 58 | 63.37 | 14.41 | 13 | 61.23 | 13.33 |
Postoperative | 58 | 32.10 | 18.63 | 13 | 48.64 | 20.55 |
Change | 55 | –31.27 | 16.60 | 13 | –12.59 | 17.32 |
MIC | MID | |||||
---|---|---|---|---|---|---|
Individual-level MIC (ROC analysis) | Scale | MID | ||||
AUC (95% CI) | Optimal cut-off point | Sensitivity | 1 – specificity | |||
0.76 (0.62 to 0.90) | ||||||
Youden’s index | 26 | 0.63 | 0.15 | Group | 31.27 | 18.68 |
Shortest distance | 26 | 0.63 | 0.15 | ES | 1.88 | 1.13 |
Knee
None of the available knee data sets had a suitable anchor for which interpretability properties could be assessed.
Discussion
We calculated the measurement properties of the candidate instruments using various statistical methods applied to multiple hip and knee replacement data sets. This enabled us to obtain estimates for some properties for which population-specific estimates had not previously been reported. Estimates for most of the missing measurement properties could be calculated from the available data sets for the measurement tools of interest.
The OHS (EUROHIP and EPOS) and OKS (KAT) showed generally positive additional evidence regarding the measurement properties (internal consistency, construct validity, responsiveness and interpretability). Similarly, the WOMAC [ADAPT (Assessing Disability After Partial and Total Joint Replacement), the APEX study and EUROHIP] showed generally positive additional evidence regarding the measurement properties. For the SF-12 PCS and MCS, the evidence was more mixed and the lack of agreement between MCS and PCS was noteworthy and, although perhaps unsurprising, is problematic in terms of use in the proposed context.
There was no sign of a floor effect in any instrument. The OHS showed high ceiling effects at post operation, whereas the OKS had a ceiling effect of only 2% at post operation. The high ceiling effects of the OHS after arthroplasty surgery could be examined in relation to patient-reported satisfaction and/or assessments measuring success. The EQ-5D-3L index showed high ceiling effects for both hip and knee data sets, in keeping with previous evidence. 85 The systematic review also found a substantial ceiling effect for patient ratings in the EQ-5D-3L measurement tool. Different versions of the SF-12 questionnaire were used in different studies, illustrating how difficult it can be to obtain unified data sets for one measurement tool. Further research is still required on the relative responsiveness of the generic questionnaires [e.g. the EQ-5D-3L/EuroQol-5 Dimensions, five-level version (EQ-5D-5L) and SF-12] and disease-specific questionnaires. Taken overall, there was generally a reasonable amount of positive evidence of the measurement properties of the OHS, OKS and WOMAC.
Although the WOMAC physical function, pain and stiffness scales were reported as reliable in the systematic review of measurement properties presented in Chapter 2, the pain scale was highly related to physical function including item scales. 86 In this study, we focused on the relationship between the WOMAC total score and WOMAC subscales; further research may be required regarding the WOMAC subscales and their use as instruments for developing thresholds. Measurement properties of the SF-12 PCS were more positive than those of the MCS, which is similar to some previous studies. 87,88 Further research may be required to clarify what values of the SF-12 MCS are plausible as thresholds for referral and candidacy for joint replacement surgery and the role, if any, it can play in this context.
The study had a number of limitations. Most important were the variations between available data sets in terms of size and in the collection of relevant instruments and variables (e.g. anchor questions). Evidence on a number of properties for a number of instruments is still lacking or limited. None of the data sets was ideal and each contained only a subset of relevant instruments, which made comparison between instruments difficult. For the WOMAC score, only relatively small data sets were available. Imbalance between data sets and outcomes collected makes direct comparison between the instruments very problematic and, therefore, we have restricted reporting to the individual properties rather than contrasting the instruments. Similarly, the version of the instrument used varied and in some cases (e.g. the EQ-5D-3L) a more recent version of the tool has been proposed (the EQ-5D-5L) for which no data sets were available. The methods used to assess the measurement properties, although commonly used in the literature, were relatively simplistic and in general provide only suggestive findings, not definitive answers.
Conclusion
From the data sets available, additional data on measurement properties were calculated for the EQ-5D-3L, OHS, OKS, WOMAC, SF-12 (PCS and MCS) and SF-36. These results were added to the information identified from the systematic review to produce a summary table of the measurement properties of each of the 36 instruments originally identified. This additional information was used to update the summary of instrument measurement property evidence produced as part of the systematic review in Chapter 2. This updated summary, shown in Tables 25–28, was presented to the user group to inform the choice of candidate instrument to take forward.
Criteria | Instrument | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
HOOS | HRQ | PSI | OHS | Knee disorders subjective history (VAS) | KOOS | KOOS-PS | OKS | OKS-APQ | LEFS (h/k) | LEFS (h) | LEFS (k) | WOMAC (h/k) | WOMAC (h) | WOMAC (k) | WOMAC SF (h/k) | WOMAC SF (h) | WOMAC SF (k) |
Number of studies | 5 | 1 | 4 | 20 | 1 | 3 | 2 | 23 | 1 | 5 | 0 | 0 | 25 | 0 | 0 | |||
Reproducibility | ++ | + | + | ++ | 0 | + | 0 | +++ | +++ | + | 0 | 0 | ++ | ++ | + | 0 | 0 | 0 |
Internal consistency | + | 0 | 0 | +++ | 0 | 0 | +++ | +++ | +++ | + | 0 | 0 | + | +++ | +++ | + | 0 | 0 |
Validity: content | 0 | 0 | ++ | ++ | + | + | + | +++ | +++ | + | 0 | 0 | + | + | + | + | 0 | 0 |
Construct | ++ | + | ++ | +++ | + | + | ++ | +++ | +++ | ++ | 0 | 0 | +++ | + | ++ | ++ | 0 | 0 |
Responsiveness | + | + | ++ | +++ | 0 | 0 | ++ | +++ | +++ | ++ | 0 | 0 | +++ | ++ | ++ | + | 0 | 0 |
Interpretability | 0 | 0 | 0 | +++ | 0 | 0 | 0 | ++ | 0 | + | 0 | 0 | ++ | ++ | ++ | 0 | 0 | 0 |
Floor and ceiling/precision | + | 0 | 0 | –/+ | 0 | + | 0 | ++ | ++ | 0 | 0 | 0 | –/+ | –/+ | –/+ | 0 | 0 | 0 |
Acceptability | 0 | 0 | 0 | +++ | – | 0 | 0 | +++ | +++ | 0 | 0 | 0 | ++ | + | + | 0 | 0 | 0 |
Data accessible | N | N | N | Y | N | N | N | Y | N | N | N | N | N | N | N | N | N | N |
Criteria | Instrument | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Lower limb core score (h/k) | Lower limb core score (h) | Lower limb core score (k) | MODEMS-HK/(AAOS) hip and knee core score (h/k) | MODEMS-HK/(AAOS) hip and knee core score (h) | MODEMS-HK/(AAOS) hip and knee core score (k) | ICOAP (h/k) | ICOAP (h) | ICOAP (k) | P4 (h/k) | P4 (h) | P4 (k) | McGill Pain-SF (h/k) | McGill Pain-SF (h) | McGill Pain-SF (k) | |
Number of studies | 1 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 0 |
Reproducibility | 0 | 0 | 0 | 0 | 0 | 0 | + | 0 | 0 | 0 | 0 | 0 | ++ | 0 | 0 |
Internal consistency | 0 | 0 | 0 | 0 | 0 | 0 | + | +++ | +++ | ++ | 0 | 0 | 0 | 0 | 0 |
Validity: content | + | 0 | 0 | + | 0 | 0 | ++ | 0 | 0 | + | 0 | 0 | 0 | 0 | 0 |
Construct | 0 | 0 | 0 | + | 0 | 0 | + | 0 | 0 | + | 0 | 0 | + | 0 | 0 |
Responsiveness | 0 | 0 | 0 | ++ | 0 | 0 | – | 0 | 0 | 0 | 0 | 0 | – | 0 | 0 |
Interpretability | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Floor and ceiling/precision | 0 | 0 | 0 | ++ | 0 | 0 | 0 | – | – | 0 | 0 | 0 | 0 | 0 | 0 |
Acceptability | 0 | 0 | 0 | + | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Data accessible | N | N | N | N | N | N | N | Y | Y | N | N | N | N | N | N |
Criteria | Instrument | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
SF-6D (h) | SF-6D (k) | SF-6D (h/k) | HUI2 and HUI3 (h) | HUI2 and HUI3 (h) | HUI2 and HUI3 (h) | EQ-5D (h/k) | EQ-5D (h) | EQ-5D (k) | SF-36 (h/k) | SF-36 (h) | SF-36 (k) | SF-12 (h/k) | SF-12 (h) | SF-12 (k) | SIP (h) | SIP (k) | SIP (h/k) | |
Number of studies | 1 | 0 | 0 | 4 | 0 | 0 | 9 | 14 | 3 | 2 | 0 | 0 | ||||||
Reproducibility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++ | 0 | 0 | 0 | 0 | 0 |
Internal consistency | 0 | 0 | 0 | 0 | 0 | 0 | N/A | N/A | N/A | 0 | 0 | – | 0 | +++ | +++ | 0 | 0 | 0 |
Validity: content | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | + | 0 | 0 |
Construct | 0 | 0 | 0 | ++ | 0 | 0 | + | – | – | + | + | + | 0 | + | –/+ | + | 0 | 0 |
Responsiveness | ++ | 0 | 0 | + | 0 | 0 | 0 | – | – | 0 | +/– | + | 0 | + | –/+ | – | 0 | 0 |
Interpretability | + | 0 | 0 | 0 | 0 | 0 | 0 | ++ | ++ | 0 | + | + | 0 | + | + | 0 | 0 | 0 |
Floor and ceiling/precision | 0 | 0 | 0 | 0 | 0 | 0 | 0 | – | + | 0 | – | 0 | +++ | +++ | +++ | 0 | 0 | 0 |
Acceptability | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Data accessible | N | N | N | N | N | N | Y | Y | Y | N | Y | N | Y | Y | Y | N | N | N |
Criteria | Instrument | ||||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
WHOQOL-BREF (h/k) | WHOQOL-BREF (h) | WHOQOL-BREF (k) | Aberdeen IAP (h/k) | Aberdeen IAP (h) | Aberdeen IAP (k) | Aberdeen IAP (modified) (h/k) | Aberdeen IAP (modified) (h) | Aberdeen IAP (modified) (k) | NEADL (h/k) | NEADL (h) | NEADL (k) | AQOL (h/k) | AQOL (h) | AQOL (k) | MSK Functional Limitations Index (h/k) | MSK Functional Limitations Index (h) | MSK Functional Limitations Index (k) | HAQ (h/k) | HAQ (h) | HAQ (k) | MHAQ (h/k) | MHAQ (h) | MHAQ (k) | K10 (h/k) | K10 (h) | K10 (k) |
Number of studies | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 2 | 0 | 1 | 0 | 0 | |
Reproducibility | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ++ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Internal consistency | ++ | 0 | 0 | + | 0 | 0 | ++ | 0 | 0 | 0 | ++ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | – | 0 | 0 | 0 | 0 | 0 | 0 |
Validity: content | + | 0 | 0 | + | 0 | 0 | 0 | 0 | 0 | 0 | – | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | + | 0 | – | 0 | 0 |
Construct | 0 | 0 | 0 | + | 0 | 0 | + | 0 | 0 | 0 | + | 0 | + | 0 | 0 | 0 | 0 | + | 0 | 0 | ++ | + | + | 0 | + | 0 | 0 |
Responsiveness | + | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | – | 0 | ++ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | – | – | + | 0 | – | 0 | 0 |
Interpretability | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Floor and ceiling/precision | ++ | 0 | 0 | + | 0 | 0 | 0 | 0 | 0 | 0 | + | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | + | ++ | 0 | 0 | ++ | 0 | 0 |
Acceptability | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | + | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Data accessible | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N | N |
Chapter 4 Calculation of threshold values (work package 2)
Background
The selected candidate tools are shown in Table 29. The OKS and OHS each consist of 12 questions with 0–48 integer score ranges, with higher scores indicating better health status. The SF-12 has two component scores: PCS, with a theoretical range of 4.3 to 76.4 (US version 2), and MCS, with a theoretical range of –1.1 to 79.6 (US version 2), with higher scores also indicating better health status (see Appendix 2). The WOMAC total score consists of 24 questions: the pain subsection has five questions, the physical function subsection has 17 questions and the stiffness subsection has two questions (all have a score range of 0–100). The WOMAC scores were converted so that higher scores indicate better health status. The KOOS-PS consists of seven questions (score range of 0–100); no KOOS-PS data were available for evaluation in this study.
Tool | Subscale |
---|---|
Hip | |
OHS | |
Knee | |
KOOS-PS (no data) | |
OKS | |
Hip and knee | |
SF-12 | PCS |
 | MCS |
WOMAC | Total |
 | Pain |
 | Physical function |
 | Stiffness |
Methods
Data sets
Multiple data sets were used to calculate the thresholds of individual candidate scores when possible (Table 30). We used the 2012–15 web-based NHS PROMs data together with the KAT and EPOS data sets for the analyses of the OHS, OKS and SF-12 PCS and MCS scores. The EUROHIP data set was used for WOMAC scores. The ADAPT and APEX studies were used for the analyses of the SF-12 PCS and MCS scores (ADAPT only) and the WOMAC scores (for both hip and knee). Postoperative scores were assessed at 6 months post surgery for NHS PROMs; at 12 months for the KAT, EUROHIP, ADAPT and APEX studies; and at 2 years for EPOS. The inclusion criterion was that patients had to have received primary knee or hip replacement surgery (no revision). Cross-validation was completed when multiple data sets were available for both development and validation.
PROM tool | Subscale | Data set 1 | Data set 2 | Data set 3 |
---|---|---|---|---|
Hip | | | | |
OHS | | NHS PROMs | EPOS | – |
SF-12 | PCS | ADAPT | – | – |
 | MCS | ADAPT | – | – |
WOMAC | Total | ADAPT | APEX | EUROHIP |
 | Pain | ADAPT | APEX | EUROHIP |
 | Physical function | ADAPT | APEX | EUROHIP |
 | Stiffness | ADAPT | APEX | EUROHIP |
KOOS-PS (no data) | | – | – | – |
Knee | | | | |
OKS | | NHS PROMs | KAT | – |
SF-12 | PCS | KAT | ADAPT | – |
 | MCS | KAT | ADAPT | – |
WOMAC | Total | ADAPT | APEX | – |
 | Pain | ADAPT | APEX | – |
 | Physical function | ADAPT | APEX | – |
 | Stiffness | ADAPT | APEX | – |
KOOS-PS (no data) | | – | – | – |
Improvement criteria
Patient benefit can be defined in various ways. We restricted the definition in this study to improvement and used approaches that were applicable to all candidate scores (limited by data; see Online Supplement 2). Four definitions of improvement were applied:
- A: any increase after surgery from before surgery (change score of > 0)
- B: medium ES (0.5) × SD of change score (MCID)
- C: minimal detectable change (MDC90 and ICC of 0.7)
- D: minimal detectable change (MDC90 and ICC of 0.9).
In this study, the MCID is derived from the assumption that the mean change score needs to reach a medium or large ES to be clinically meaningful. 89 Clinically meaningful refers to a change indicating the efficacy of an intervention (i.e. hip and knee replacement surgery in this study) in domains of health-related functional status tools. 90 For clinical evaluation studies such as this one, the usefulness of the candidate measurement tools will depend on their ability to detect a change that is clinically meaningful. 90 Applying a 0.5 ES (medium/moderate practical importance), as classified by Cohen,73 to the variability of the change scores is ideal in this context. The MCID was calculated as the SD of the change score of each candidate tool multiplied by the medium ES (i.e. 0.5 × the SD of the change score; criterion B). For the remainder of this chapter, results based on criterion B are presented.
The MDC was defined as the minimal change that falls beyond the measurement error in the score of a candidate tool measuring a symptom. 91 In this study, fixed (arbitrary) values of 0.7 and 0.9 were applied for the reliability parameter (e.g. test–retest reliability or ICC). We applied the 90% confidence level of the z-distribution (z-statistic of 1.645) to the range of possible differences between two observations made under the same conditions (test–retest scenario), defining the MDC as ± 1.645 × √2 × SE of the measurement, where the SE of the measurement is the SD of the preoperative score × √(1 – R) and R is the reliability parameter of 0.7 or 0.9 (improvement criteria C and D, respectively). 92 Stata® version 14 was used for all statistical analyses.
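The two distribution-based definitions above can be sketched in a few lines. This is a minimal illustration rather than the study's Stata code: the function names and the paired scores are hypothetical, and the sample SD (n – 1 denominator) is assumed.

```python
import math
import statistics

Z_90 = 1.645  # z value for the 90% confidence level, as stated in the text


def mcid(change_scores, effect_size=0.5):
    """Criterion B: medium effect size (0.5) x SD of the change scores."""
    return effect_size * statistics.stdev(change_scores)


def mdc90(preop_scores, reliability):
    """Criteria C and D: 1.645 x sqrt(2) x SEM, with SEM = SD_pre x sqrt(1 - R)."""
    sem = statistics.stdev(preop_scores) * math.sqrt(1.0 - reliability)
    return Z_90 * math.sqrt(2.0) * sem


# Hypothetical paired scores on a 0-48 scale (illustrative values, not study data)
pre = [10, 14, 18, 20, 22, 25, 28, 30]
post = [30, 36, 40, 38, 41, 44, 40, 46]
change = [after - before for before, after in zip(pre, post)]

criteria = {
    "A": 0.0,              # any increase (change score > 0)
    "B": mcid(change),     # 0.5 x SD of the change score
    "C": mdc90(pre, 0.7),  # MDC90 with an assumed ICC of 0.7
    "D": mdc90(pre, 0.9),  # MDC90 with an assumed ICC of 0.9
}
```

A lower assumed reliability always gives a larger MDC, so criterion C is stricter than criterion D for the same data.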
Statistical analysis
Absolute threshold
We estimated an absolute threshold in the preoperative score, using each data set above, as the level above which an individual could not achieve the defined improvement. We examined the theoretical thresholds for the different definitions of improvement (B, C and D) by subtracting each improvement score from the maximum possible score of the candidate tool. The specificity (i.e. true negative) of each threshold was calculated for all data sets. Sensitivity was 100% by definition.
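The subtraction described above, and the specificity check, might look as follows. This is a sketch under stated assumptions: the function names are hypothetical, a patient is treated as "test negative" when the preoperative score lies strictly above the threshold, and the toy patient data are invented. The MCID inputs are the criterion B values reported in Table 32.

```python
def absolute_threshold(max_score, improvement):
    """Preoperative score above which the improvement can no longer be attained."""
    return max_score - improvement


def specificity(preop, improved, threshold):
    """Share of non-improvers whose preoperative score lies above the threshold."""
    non_improvers = [x for x, ok in zip(preop, improved) if not ok]
    if not non_improvers:
        raise ValueError("no non-improvers in the data")
    true_negatives = sum(1 for x in non_improvers if x > threshold)
    return true_negatives / len(non_improvers)


# Criterion B (MCID) values taken from Table 32
ohs = absolute_threshold(48, 5.0)             # OHS, NHS PROMs: 43.0
womac_hip = absolute_threshold(100, 11.1)     # WOMAC total (hip), ADAPT: about 88.9

# Invented toy data: three non-improvers, one improver
spec = specificity([40, 45, 47, 30], [False, False, False, True], ohs)
```

Because nobody above the threshold can improve by the required amount, every improver lies at or below it, which is why sensitivity is 100% by construction.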
Relative threshold
We calculated the preoperative value (relative threshold) below which individuals are more likely to improve than those above it, using each improvement criterion (A–D). We used two modelling approaches and assessed model properties for the four different definitions of improvement (A–D).
Linear regression
Linear regressions including the best-fit third-degree polynomials of the change score (postoperative to preoperative score) were used to estimate at what preoperative score the predicted change is likely to fall below each criterion.
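The equation of the linear regression is:
$$\hat{y}_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + e_i$$
where $\hat{y}_i = (y_i - x_i)$ and $e_i \sim N(0, \sigma^2)$, and $y_i$ and $x_i$ indicate the postoperative and preoperative scores for the ith observation, respectively.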
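Once a third-degree polynomial of the preoperative score has been fitted, the relative threshold is the preoperative score at which the predicted change drops below the chosen criterion. The sketch below assumes the coefficients are already estimated; the coefficient values, the function names and the grid search are illustrative assumptions, not the study's procedure.

```python
def predicted_change(x, coefs):
    """Cubic polynomial b0 + b1*x + b2*x^2 + b3*x^3 at preoperative score x."""
    b0, b1, b2, b3 = coefs
    return b0 + b1 * x + b2 * x ** 2 + b3 * x ** 3


def relative_threshold(coefs, criterion, lo, hi, step=0.1):
    """Lowest score in [lo, hi] at which the predicted change is below the criterion."""
    n_steps = int(round((hi - lo) / step))
    for k in range(n_steps + 1):
        x = lo + k * step
        if predicted_change(x, coefs) < criterion:
            return x
    return None  # the predicted change never falls below the criterion


# Hypothetical coefficients for a 0-48 scale (not estimates from the study)
coefs = (45.0, -1.2, 0.02, -0.0004)
threshold_b = relative_threshold(coefs, criterion=5.0, lo=0.0, hi=48.0)
```

A grid search is used here only because it is easy to read; solving the cubic directly would give the same crossing point.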
Logistic regression
Logistic regressions with dichotomised change score (postoperative to preoperative score) by each improvement criterion were used to estimate the preoperative score at which the probability of improving fell below 50% and 75%.
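$$\ln\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_i$$
where $p$ is the probability of improvement, estimated by $\hat{p}$ = number improved/N at each preoperative score, $y_i \sim B(n_i, p)$, and $y_i$ and $x_i$ indicate the postoperative and preoperative scores for the ith observation, respectively.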
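For a fitted logistic model, the preoperative score at which the probability of improving equals a given level can be read off by inverting the logit. The sketch assumes a single linear term in the preoperative score and uses hypothetical coefficients; neither is confirmed by the report.

```python
import math


def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))


def prob_improve(x, b0, b1):
    """Modelled probability of improvement at preoperative score x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def threshold_at_probability(p, b0, b1):
    """Preoperative score at which the modelled probability equals p."""
    if b1 == 0:
        raise ValueError("slope is zero; probability does not depend on the score")
    return (logit(p) - b0) / b1


# Hypothetical coefficients (not study estimates); the negative slope means
# that higher preoperative scores carry a lower probability of improving.
b0, b1 = 8.0, -0.2
t50 = threshold_at_probability(0.50, b0, b1)  # logit(0.5) = 0, so t50 = -b0 / b1
t75 = threshold_at_probability(0.75, b0, b1)
```

With a negative slope the 75% threshold always sits below the 50% threshold, which matches the ordering of the two probability levels in Tables 37 and 38.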
Model performance for the relative threshold
The area under the ROC curve with 95% binomial exact CIs was calculated to examine discriminative ability with each improvement criterion using the observed (rather than estimated) postoperative score. In addition, goodness of fit incorporated inspection using plots of observed versus fitted values, plots of residuals versus fitted values and mean/SD of residuals by decile (or quintile). The predictive models’ performance for the logistic regression was assessed using calibration graphs: an illustration of the Hosmer–Lemeshow goodness-of-fit test, by decile (or quintile) of the predicted probabilities. 93
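The area under the ROC curve has a simple pairwise interpretation that can be sketched directly. The function below is an illustrative assumption (name and orientation included): it treats a lower preoperative score as predicting improvement and does not compute the exact binomial CI.

```python
def auc_lower_is_positive(preop, improved):
    """Probability that a random improver has a lower preoperative score
    than a random non-improver (ties count as one half)."""
    pos = [x for x, ok in zip(preop, improved) if ok]
    neg = [x for x, ok in zip(preop, improved) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one improver and one non-improver")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p < n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented toy data: one improver scores higher than one non-improver
example_auc = auc_lower_is_positive([10, 30, 20, 40], [True, True, False, False])
```

A value of 0.5 means the preoperative score carries no discriminative information; the chapter treats values below 0.7 as poor.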
Threshold reporting
Thresholds for the OHS, OKS, SF-12 and WOMAC were reported, always rounding up the estimated value (e.g. 89.1 to 90) so that rounding down would not exclude any individuals who could benefit. Sensitivity and specificity values were calculated using these thresholds and rounded to the nearest integer except when < 1%.
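The reporting rules can be expressed as two small helpers. The names are hypothetical, and keeping one decimal place for values below 1% is an assumption read off the tables rather than a stated rule.

```python
import math


def report_threshold(estimate):
    """Round an estimated threshold up to the next integer."""
    return math.ceil(estimate)


def report_percentage(value):
    """Round to the nearest integer, except values below 1%,
    which are kept to one decimal place (assumed)."""
    if value < 1.0:
        return round(value, 1)
    return round(value)


hip_pain = report_threshold(89.1)   # 90
hip_total = report_threshold(88.9)  # 89
```

Note that Python's `round` rounds exact halves to the nearest even integer, so a different convention would be needed if half values must always round upwards.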
Percentages of population coverage
Study population coverage was calculated for each data set as the cumulative percentage of patients with preoperative scores up to the estimated absolute or relative threshold; it indicates the percentage of those who achieved the improvements.
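Coverage is a cumulative percentage, so it reduces to a count. In this sketch the function name is hypothetical and the threshold is assumed to be inclusive (scores equal to the threshold count as covered).

```python
def population_coverage(preop_scores, threshold):
    """Cumulative percentage of patients with a preoperative score
    at or below the threshold."""
    if not preop_scores:
        raise ValueError("no scores supplied")
    at_or_below = sum(1 for x in preop_scores if x <= threshold)
    return 100.0 * at_or_below / len(preop_scores)


# Invented toy scores on a 0-48 scale
coverage = population_coverage([10, 20, 30, 40, 45], threshold=43)
```

A coverage close to 100% means that applying the threshold retrospectively would have excluded very few of the patients who actually received surgery.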
Results
Characteristics of the data sets
The mean ages of participants were 65–70 years (SD 8–14 years), the mean body mass index (BMI) values were 27–32 kg/m2 (SD 4–6 kg/m2) and the percentage of females was between 52% and 62% (Table 31).
Characteristic | Data set | ||||||||
---|---|---|---|---|---|---|---|---|---|
NHS PROMsa | KAT | EPOS | EUROHIP | ADAPT | APEX | ||||
Hip | Knee | Knee | Hip | Hip | Hip | Knee | Hip | Knee | |
Age (years) | |||||||||
n | 95,890 | 103,519 | 1634 | 1580 | 1298 | 125 | 128 | 343 | 339 |
Mean | – | – | 69.8 | 68.3 | 65.7 | 64.5 | 66.6 | 67.0 | 69.8 |
SD (< 60%) | 13.7 | 9.5 | 8.1 | 10.8 | 10.9 | 11.7 | 9.7 | 11.0 | 8.6 |
IQR: 25 (60–80%) | 73.2 | 78.3 | 65.0 | 62.6 | 59.0 | 57 | 60.1 | 60.0 | 63.3 |
IQR: 75 (≥ 80%) | 13.1 | 12.2 | 76.0 | 75.8 | 73.8 | 72.5 | 73.5 | 75.0 | 75.9 |
BMI (kg/m2) | |||||||||
n | – | – | 1584 | 1487 | 1225 | 130 | 132 | 338 | 334 |
Mean | – | – | 29.7 | 27.2 | 27.5 | 27.5 | 31.5 | 29.0 | 32.4 |
SD | – | – | 5.5 | 4.9 | 4.4 | 4.4 | 5.9 | 5.5 | 6.4 |
IQR: 25 | – | – | 26.1 | 24.1 | 24.5 | 24.5 | 27.5 | 26.0 | 27.9 |
IQR: 75 | – | – | 32.5 | 30.0 | 29.8 | 29.8 | 35.3 | 32.0 | 36.1 |
Sex | |||||||||
n | 95,816 | 103,412 | 1634 | 1581 | 1267 | 130 | 132 | 343 | 339 |
Female (%) | 61.4 | 58.3 | 55.9 | 62.2 | 55.9 | 49.2 | 53.0 | 59.0 | 52.2 |
Male (%) | 38.7 | 41.8 | 44.1 | 37.8 | 44.1 | 50.8 | 47.0 | 41.0 | 47.8 |
Minimally clinically important difference and minimally detectable change (90%) values
Hip
The MCID was 5 (range 5.0–5.1) units for the OHS; 5.4 and 4.5 for the SF-12 PCS and MCS, respectively; and 9.5–11.1 for the WOMAC total score across the data sets. Using an ICC of 0.9 provided MDCs of 6 (range 5.9–6.0) units for the OHS; 6.3 and 6.8 for the SF-12 PCS and MCS, respectively; and 11.8–15.7 for the WOMAC total score (Table 32).
Data set | Candidate tools | |||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OKS | OHS | SF-12 | WOMAC | |||||||||||
PCS | MCS | Total | Pain | Function | Stiffness | |||||||||
K | H | K | H | K | H | K | H | K | H | K | H | K | H | |
PROMs | ||||||||||||||
MCID 0.5 SD | 4.9 | 5 | ||||||||||||
MDC (±) | ||||||||||||||
L | ||||||||||||||
F | ||||||||||||||
0.7 | 9.8 | 10.3 | ||||||||||||
0.9 | 5.7 | 6 | ||||||||||||
KAT | ||||||||||||||
MCID 0.5 SD | 5.1 | 5.4 | 5.8 | |||||||||||
MDC (±) | ||||||||||||||
L | 7.6 | 12 | ||||||||||||
F | ||||||||||||||
0.7 | 9.6 | 10 | 15 | |||||||||||
0.9 | 5.5 | 6 | 8.5 | |||||||||||
EPOS | ||||||||||||||
MCID 0.5 SD | 5.1 | |||||||||||||
MDC (±) | ||||||||||||||
L | ||||||||||||||
F | ||||||||||||||
0.7 | 10 | |||||||||||||
0.9 | 5.9 | |||||||||||||
EUROHIP | ||||||||||||||
MCID 0.5 SD | 10.7 | 11.6 | 10.9 | 14.3 | ||||||||||
MDC (±) | ||||||||||||||
L | 21.6 | 18.3 | 33.1 | |||||||||||
F | ||||||||||||||
0.7 | 20.5 | 22.7 | 21.3 | 26.4 | ||||||||||
0.9 | 11.8 | 13.1 | 12.3 | 15.3 | ||||||||||
ADAPT | ||||||||||||||
MCID 0.5 SD | 5.3 | 5.4 | 4.6 | 4.5 | 10 | 11.1 | 11.2 | 10.9 | 10.2 | 11.7 | 12.6 | 12.5 | ||
MDC (±) | ||||||||||||||
L | 6.7 | 7.9 | 9.5 | 9.6 | 22.9 | 26.6 | 19.9 | 24.3 | 32.3 | 38.7 | ||||
F | ||||||||||||||
0.7 | 9.1 | 10.9 | 11.7 | 11.7 | 22.2 | 27.1 | 24.1 | 28 | 23.2 | 28.3 | 25.8 | 11.7 | ||
0.9 | 5.3 | 6.3 | 6.7 | 6.8 | 12.8 | 15.7 | 13.9 | 16.9 | 13.4 | 16.4 | 14.9 | 6.8 | ||
APEX | ||||||||||||||
MCID 0.5 SD | 9.8 | 9.5 | 10.5 | 11 | 9.9 | 10.4 | 12.7 | 12.3 | ||||||
MDC (±) | ||||||||||||||
L | 20.2 | 22.7 | 19.5 | 20.6 | 32.2 | 36.6 | ||||||||
F | ||||||||||||||
0.7 | 21 | 22.8 | 21.3 | 23.9 | 22.8 | 24.1 | 25.7 | 29.2 | ||||||
0.9 | 12.1 | 13.2 | 12.3 | 13.8 | 13.2 | 13.9 | 14.9 | 16.9 |
Knee
The MCID was 5 (range 4.9–5.1) units for the OKS, 5.3–5.4 for the SF-12 PCS, 4.6–5.8 for the SF-12 MCS, and 9.8–10.0 for the WOMAC total score. Using an ICC of 0.9 provided MDCs of 6 (range 5.5–5.7) units for the OKS; 5.3–6.0 and 6.7–8.5 for the SF-12 PCS and MCS, respectively; and 12.1–12.8 for the WOMAC total score (see Table 32).
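As a consistency check on Table 32, the MDC definition given in the Methods reduces to a fixed multiple of the preoperative SD for each assumed reliability. The multipliers below follow directly from that definition; the notation $\mathrm{SD}_{\mathrm{pre}}$ is introduced here for illustration.

```latex
\begin{aligned}
\mathrm{MDC}_{90} &= 1.645 \times \sqrt{2} \times \mathrm{SD}_{\mathrm{pre}} \times \sqrt{1 - R} \\
\mathrm{MDC}_{90}\big|_{R = 0.7} &\approx 1.274 \times \mathrm{SD}_{\mathrm{pre}} \\
\mathrm{MDC}_{90}\big|_{R = 0.9} &\approx 0.736 \times \mathrm{SD}_{\mathrm{pre}} \\
\frac{\mathrm{MDC}_{90}\big|_{R = 0.7}}{\mathrm{MDC}_{90}\big|_{R = 0.9}} &= \sqrt{\frac{0.3}{0.1}} = \sqrt{3} \approx 1.73
\end{aligned}
```

The tabulated values follow this ratio: for the OHS in NHS PROMs, 10.3/6 ≈ 1.72, and for the OKS, 9.8/5.7 ≈ 1.72.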
Percentage of population improving
Using the MCID 0.5 improvement criterion B, 94% of patients improved on the OHS and 85–88% improved on the OKS. On the WOMAC total score, 84–86% improved for the hip and 75–88% improved for the knee (Tables 33 and 34). Some theoretically possible SF-12 scores are unrealistic to obtain as a postoperative quality-of-life or health status outcome.
Candidate tools | Data sets | ||||||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PROMs | EPOS | EUROHIP | ADAPT | APEX | |||||||||||||||||||||
n | Improvement criteria | n | Improvement criteria | n | Improvement criteria | n | Improvement criteria | n | Improvement criteria | ||||||||||||||||
A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | D | ||||||
OHS | 102,404 | 97 | 94 | 86 | 93 | 1179 | 98 | 94 | 90 | 94 | |||||||||||||||
SF-12 | |||||||||||||||||||||||||
PCS | 118 | 86 | 65 | 49 | 61 | ||||||||||||||||||||
MCS | 118 | 47 | 26 | 15 | 21 | ||||||||||||||||||||
WOMAC | |||||||||||||||||||||||||
Total | 845 | 93 | 86 | 76 | 85 | 125 | 96 | 84 | 56 | 75 | 261 | 98 | 95 | 87 | 95 | ||||||||||
Pain | 865 | 93 | 83 | 72 | 83 | 125 | 96 | 87 | 61 | 76 | 323 | 97 | 94 | 86 | 94 | ||||||||||
Physical function | 874 | 93 | 85 | 73 | 84 | 125 | 95 | 81 | 55 | 69 | 270 | 97 | 94 | 82 | 94 | ||||||||||
Stiffness | 883 | 71 | 53 | 71 | 83 | 125 | 96 | 89 | 56 | 73 | 300 | 98 | 89 | 60 | 77 |
Candidate tools | Data sets | |||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
PROMs | KAT | ADAPT | APEX | |||||||||||||||||
n | Improvement criteria | n | Improvement criteria | n | Improvement criteria | n | Improvement criteria | |||||||||||||
A | B | C | D | A | B | C | D | A | B | C | D | A | B | C | D | |||||
OKS | 108,832 | 94 | 88 | 76 | 86 | 1634 | 91 | 85 | 74 | 85 | ||||||||||
SF-12 | ||||||||||||||||||||
PCS | 1518 | 60 | 46 | 62 | 80 | 127 | 77 | 51 | 40 | 51 | ||||||||||
MCS | 1518 | 54 | 33 | 12 | 25 | 127 | 55 | 35 | 12 | 22 | ||||||||||
WOMAC | ||||||||||||||||||||
Total | 126 | 90 | 75 | 52 | 69 | 246 | 93 | 88 | 73 | 86 | ||||||||||
Pain | 127 | 93 | 82 | 59 | 82 | 318 | 97 | 89 | 78 | 89 | ||||||||||
Physical function | 127 | 88 | 78 | 47 | 63 | 253 | 93 | 86 | 70 | 82 | ||||||||||
Stiffness | 127 | 91 | 64 | 36 | 64 | 293 | 94 | 67 | 45 | 67 |
Absolute threshold using criterion B
In this section, results based on improvement in criterion B are provided (see Online Supplement 3 for results based on the other criteria). Tables 35 and 36 describe the absolute threshold with specificity (95% CIs) and the study population coverage (%) by the thresholds for hips and knees, respectively.
Candidate tools | Preoperative threshold | Specificity (%) (95% CI) | Population coverage (%) |
---|---|---|---|
OHS | |||
NHS PROMs | 43 | 2 (1 to 2) | 100 |
EPOS | 43 | 2 (< 0.1 to 9) | 100 |
SF-12 PCS | |||
ADAPT | 66 | 0 (0 to 14) | 100 |
SF-12 MCS | |||
ADAPT | 65 | 0 (0 to 7) | 100 |
WOMAC total | |||
ADAPT | 89 | 22 (6 to 48) | 96 |
APEX | 91 | 0 (0 to 31) | 100 |
EUROHIP | 90 | 0 (0 to 2) | 100 |
WOMAC pain | |||
ADAPT | 90 | 21 (5 to 51) | 96 |
APEX | 89 | 18 (4 to 43) | 99 |
EUROHIP | 89 | 4.2 (1 to 9) | 100 |
WOMAC physical function | |||
ADAPT | 89 | 33 (15 to 57) | 94 |
APEX | 90 | 17 (2 to 48) | 99 |
EUROHIP | 90 | 0.8 (0.0 to 5) | 100 |
WOMAC stiffness | |||
ADAPT | 88 | 42 (15 to 72) | 95 |
APEX | 88 | 29 (13 to 49) | 97 |
EUROHIP | 86 | 8.1 (5 to 12) | 99 |
Candidate tools | Preoperative threshold | Specificity (%) (95% CI) | Population coverage (%) |
---|---|---|---|
OKS | |||
NHS PROMs | 43 | 0.5 (0.4 to 0.7) | 100 |
KAT | 43 | 0.9 (0.1 to 3) | 100 |
SF-12 PCS | |||
ADAPT | 66 | 0 (0 to 11) | 100 |
KAT | 71 | 0 (0 to 1) | 100 |
SF-12 MCS | |||
ADAPT | 65 | 2 (0 to 13) | 100 |
KAT | 74 | 0.3 (0.1 to 0.9) | 100 |
WOMAC total | |||
ADAPT | 90 | 7 (1 to 24) | 100 |
APEX | 91 | 0 (0 to 16) | 100 |
WOMAC pain | |||
ADAPT | 89 | 5 (0 to 25) | 100 |
APEX | 90 | 0 (0 to 12) | 100 |
WOMAC physical function | |||
ADAPT | 90 | 8 (1 to 27) | 99 |
APEX | 91 | 7 (1 to 24) | 100 |
WOMAC stiffness | |||
ADAPT | 88 | 8 (2 to 21) | 98 |
APEX | 88 | 3 (0 to 9) | 99 |
Hip
The absolute threshold was 43 for the OHS. The preoperative thresholds for the SF-12 PCS and MCS were 66 and 65, respectively. The threshold ranges were 89–91 for the WOMAC total, 89–90 for the WOMAC pain, 89–90 for the WOMAC physical function and 86–88 for the WOMAC stiffness. Study population coverages were 100% for the OHS and SF-12 and ≥ 94% for the WOMAC (see Table 35).
Knee
The absolute threshold was 43 for the OKS. The ranges of the preoperative thresholds for the SF-12 PCS and MCS were 66–71 and 65–74, respectively. The threshold range was 90–91 for the WOMAC total, 89–90 for the WOMAC pain, 90–91 for the WOMAC physical function and 88 for the WOMAC stiffness. The study population coverages were 100% for the OKS and SF-12 and ≥ 98% for the WOMAC (see Table 36).
Relative threshold using criterion B
Tables 37 and 38 show the relative threshold with sensitivity and specificity (95% CIs) (see Online Supplement 4 for full results) and the study population coverage (%) by each threshold. The AUC (95% CI) for the OHS, OKS, SF-12 PCS and WOMAC total showed poor (< 0.7) discrimination abilities overall.
Candidate tools | Probability level | Preoperative threshold | Specificity (%) (95% CI) | AUC (95% CI) | Population coverage (%) |
---|---|---|---|---|---|
OHS | |||||
NHS PROMs | |||||
Model 1 | 40 | 4 (4 to 5) | 100 | ||
Model 2 | 0.5 | 43 | 2 (1 to 2) | 0.65 (0.65 to 0.66) | 100 |
0.75 | 38 | 6 (6 to 7) | 99 | ||
EPOS | |||||
Model 1 | 40 | 3 (0 to 12) | 100 | ||
Model 2 | 0.5 | 42 | 2 (0 to 9) | 0.62 (0.59 to 0.64) | 100 |
0.75 | 39 | 3 (0 to 12) | 99 | ||
SF-12 PCS | |||||
ADAPT | |||||
Model 1 | 46 | 24 (9 to 45) | 91 | ||
Model 2 | 0.5 | 47 | 20 (7 to 41) | 0.58 (0.46 to 0.7) | 92 |
0.75 | 35 | 48 (28 to 69) | 70 | ||
SF-12 MCS | |||||
ADAPT | |||||
Model 1 | 42 | 91 (79 to 97) | 26 | ||
Model 2 | 0.5 | 39 | 96 (87 to 100) | 0.93 (0.85 to 0.98) | 19 |
0.75 | 37 | 100 (93 to 100) | 16 | ||
WOMAC total | |||||
ADAPT | |||||
Model 1 | 81 | 50 (26 to 74) | 90 | ||
Model 2 | 0.5 | 85 | 44 (22 to 69) | 0.76 (0.67 to 0.84) | 94 |
0.75 | 78 | 56 (31 to 78) | 86 | ||
APEX | |||||
Model 1 | 86 | 20 (3 to 56) | 99 | ||
Model 2 | 0.5 | 86 | 20 (3 to 56) | 0.61 (0.53 to 0.67) | 99 |
0.75 | 82 | 20 (3 to 56) | 98 | ||
EUROHIP | |||||
Model 1 | 83 | 3 (1 to 8) | 99 | ||
Model 2 | 0.5 | 87 | 1 (0 to 5) | 0.56 (0.53 to 0.59) | 100 |
0.75 | 80 | 4 (1 to 9) | 99 | ||
WOMAC pain | |||||
ADAPT | |||||
Model 1 | 83 | 36 (13 to 65) | 94 | ||
Model 2 | 0.5 | 89 | 36 (13 to 65) | 0.66 (0.57 to 0.75) | 96 |
0.75 | 82 | 36 (13 to 65) | 94 | ||
APEX | |||||
Model 1 | 83 | 18 (4 to 43) | 99 | ||
Model 2 | 0.5 | 88 | 18 (4 to 43) | 0.65 (0.59 to 0.7) | 99 |
0.75 | 83 | 18 (4 to 43) | 99 | ||
EUROHIP | |||||
Model 1 | 84 | 6 (3 to 12) | 99 | ||
Model 2 | 0.5 | 89 | 4 (2 to 9) | 0.58 (0.55 to 0.62) | 100 |
0.75 | 78 | 8 (4 to 14) | 98 | ||
WOMAC physical function | |||||
ADAPT | |||||
Model 1 | 81 | 48 (26 to 70) | 86 | ||
Model 2 | 0.5 | 86 | 43 (22 to 66) | 0.82 (0.73 to 0.89) | 91 |
0.75 | 78 | 52 (30 to 74) | 82 | ||
APEX | |||||
Model 1 | 86 | 17 (2 to 48) | 99 | ||
Model 2 | 0.5 | 88 | 17 (2 to 48) | 0.65 (0.58 to 0.72) | 99 |
0.75 | 83 | 17 (2 to 48) | 99 | ||
EUROHIP | |||||
Model 1 | 82 | 5 (2 to 10) | 99 | ||
Model 2 | 0.5 | 86 | 3 (1 to 8) | 0.55 (0.52 to 0.58) | 100 |
0.75 | 79 | 7 (3 to 13) | 99 | ||
WOMAC stiffness | |||||
ADAPT | |||||
Model 1 | 82 | 50 (21 to 79) | 95 | ||
Model 2 | 0.5 | 91 | 42 (15 to 72) | 0.81 (0.72 to 0.88) | 100 |
0.75 | 86 | 50 (21 to 79) | 95 | ||
APEX | |||||
Model 1 | 82 | 32 (16 to 52) | 97 | ||
Model 2 | 0.5 | 90 | 29 (13 to 49) | 0.77 (0.71 to 0.82) | 100 |
0.75 | 83 | 32 (16 to 52) | 97 | ||
EUROHIP | |||||
Model 1 | 65 | 20 (15 to 25) | 98 | ||
Model 2 | 0.5 | 67 | 20 (15 to 25) | 0.71 (0.68 to 0.74) | 98 |
0.75 | 36 | 81 (75 to 85) | 61 |
Candidate tools | Probability level | Preoperative threshold | Specificity (%) (95% CI) | AUC (95% CI) | Population coverage (%) |
---|---|---|---|---|---|
OKS | |||||
NHS PROMs | |||||
Model 1 | 37 | 4 (4 to 5) | 99 | ||
Model 2 | 0.5 | 40 | 2 (2 to 2) | 0.62 (0.61 to 0.62) | 100 |
0.75 | 33 | 11 (10 to 11) | 96 | ||
KAT | |||||
Model 1 | 35 | 5 (3 to 9) | 98 | ||
Model 2 | 0.5 | 39 | 3 (1 to 6) | 0.62 (0.60 to 0.65) | 100 |
0.75 | 29 | 14 (10 to 20) | 93 | ||
SF-12 PCS | |||||
ADAPT | |||||
Model 1 | 36 | 25 (11 to 43) | 87 | ||
Model 2 | 0.5 | 34 | 31 (16 to 50) | 0.64 (0.5 to 0.75) | 78 |
0.75 | – | – | – | ||
KAT | |||||
Model 1 | 39 | 27 (23 to 31) | 86 | ||
Model 2 | 0.5 | 43 | 16 (13 to 19) | 0.65 (0.63 to 0.68) | 92 |
0.75 | 22 | 94 (92 to 96) | 13 | ||
SF-12 MCS | |||||
ADAPT | |||||
Model 1 | 40 | 93 (81 to 99) | 22 | ||
Model 2 | 0.5 | 49 | 74 (58 to 86) | 0.81 (0.7 to 0.9) | 51 |
0.75 | 34 | 100 (92 to 100) | 8 | ||
KAT | |||||
Model 1 | 43 | 85 (83 to 88) | 28 | ||
Model 2 | 0.5 | 49 | 72 (69 to 75) | 0.78 (0.75 to 0.80) | 45 |
0.75 | 26 | 100 (99 to 100) | 2 | ||
WOMAC total | |||||
ADAPT | |||||
Model 1 | 86 | 7 (1 to 24) | 100 | ||
Model 2 | 0.5 | 81 | 15 (4 to 34) | 0.55 (0.46 to 0.65) | 97 |
0.75 | 71 | 19 (6 to 38) | 92 | ||
APEX | |||||
Model 1 | 81 | 10 (1 to 30) | 99 | ||
Model 2 | 0.5 | 85 | 5 (0 to 24) | 0.60 (0.53 to 0.68) | 100 |
0.75 | 75 | 10 (1 to 30) | 97 | ||
WOMAC pain | |||||
ADAPT | |||||
Model 1 | 78 | 5 (0 to 25) | 98 | ||
Model 2 | 0.5 | 82 | 5 (0 to 25) | 0.56 (0.46 to 0.65) | 100 |
0.75 | 71 | 10 (1 to 32) | 98 | ||
APEX | |||||
Model 1 | 81 | 0 (0 to 12) | 100 | ||
Model 2 | 0.5 | – | – | 0.59 (0.53 to 0.65) | – |
0.75 | 85 | 0 (0 to 12) | 100 | ||
WOMAC physical function | |||||
ADAPT | |||||
Model 1 | 87 | 13 (3 to 32) | 98 | ||
Model 2 | 0.5 | 89 | 8 (1 to 27) | 0.47 (0.37 to 0.57) | 99 |
0.75 | 82 | 13 (3 to 32) | 96 | ||
APEX | |||||
Model 1 | 82 | 15 (4 to 34) | 98 | ||
Model 2 | 0.5 | 86 | 11 (2 to 29) | 0.62 (0.55 to 0.69) | 99 |
0.75 | 75 | 19 (6 to 38) | 94 | ||
WOMAC stiffness | |||||
ADAPT | |||||
Model 1 | 72 | 18 (8 to 34) | 97 | ||
Model 2 | 0.5 | 65 | 18 (8 to 34) | 0.69 (0.59 to 0.77) | 97 |
0.75 | 24 | 97 (87 to 100) | 24 | ||
APEX | |||||
Model 1 | 68 | 28 (18 to 39) | 97 | ||
Model 2 | 0.5 | 65 | 28 (18 to 39) | 0.71 (0.65 to 0.77) | 97 |
0.75 | 34 | 86 (76 to 93) | 58 |
Hip
The range of relative thresholds was 38–43 with a specificity of 2–6% for the OHS. A histogram of the preoperative OHS distribution for the NHS PROMs data set is given in Figure 7. Threshold ranges of the SF-12 PCS and MCS were 35–47 (specificity 20–48%) and 37–42 (specificity 91–100%), respectively. The threshold ranges were 78–87 (specificity 1–56%) for the WOMAC total, 78–89 (specificity 4–36%) for the WOMAC pain, 78–88 (specificity 3–52%) for the WOMAC physical function and 36–91 (specificity 20–50%, EUROHIP 81%) for the WOMAC stiffness. Study population coverages for the thresholds at the 50% probability level were 100% for the OHS, 92% for the SF-12 PCS, 19% for the SF-12 MCS and 91–100% for the WOMAC (see Table 37). Figures 8 and 9 give the linear regression model, absolute threshold and logistic regression model estimates. The CI bands (see Figure 8) show the lack of fit for the variation in the OHS outcome for the linear model, although the point estimate seems reasonable. Linear regression models for the other outcomes (not shown) showed a similar lack of fit.
FIGURE 7.
The OHS: NHS PROMs preoperative histogram with the absolute and linear relative thresholds using criterion B.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig7.png)
FIGURE 8.
The OHS: NHS PROMs change scores.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig8.png)
FIGURE 9.
The OHS: NHS PROMs percentage improved using criterion B.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig9.png)
Knee
The range of the relative (observed) threshold was 29–40 with a specificity of 2–14% for the OKS. A histogram of the preoperative OKS distribution for the NHS PROMs data set is given in Figure 10. Threshold ranges of the preoperative score of the SF-12 PCS and MCS were 22–43 (specificity 16–31%, KAT 94%) and 26–49 (specificity 72–100%), respectively. The threshold ranges were 71–86 (specificity 5–19%) for the WOMAC total, 71–85 (specificity 0–10%) for the WOMAC pain, 75–89 (specificity 8–19%) for the WOMAC physical function and 24–72 (specificity 18–28%, ADAPT 97% and APEX 86%) for the WOMAC stiffness. Study population coverages for the thresholds at the 50% probability level were 100% for the OKS, 78–92% for the SF-12 PCS, 45–51% for the SF-12 MCS and 97–100% for the WOMAC. Figures 11 and 12 give the linear regression model, absolute threshold and logistic regression model estimates. The CI bands (see Figure 11) show the lack of fit for the variation in the OKS outcome for the linear model, although the point estimate seems reasonable. Linear regression models for the other outcomes (not shown) showed a similar lack of fit.
FIGURE 10.
The OKS: NHS PROMs preoperative histogram with the absolute and linear relative thresholds using criterion B.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig10.png)
FIGURE 11.
The OKS: NHS PROMs change scores.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig11.png)
FIGURE 12.
The OKS: NHS PROMs percentage improved using criterion B.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig12.png)
Discussion
Main findings
We examined the percentage of the population who improved in accordance with each improvement criterion. Based on the MCID 0.5 improvement criterion B, at least 75% of patients improved on the OHS, OKS and WOMAC scores (except the WOMAC stiffness: 53–64%) (see Tables 33 and 34).
The ability to predict postoperative quality-of-life or health status outcome from the preoperative score has limitations; it is apparent from the AUC values that although the preoperative score is informative, it does not fully determine what the postoperative outcome will be. Other factors may improve the prediction (this was explored for the OHS and OKS in Chapter 5); however, it is worth noting that the variability in postoperative quality of life or health status is substantial. Related to this, the linear regression model fitted poorly in terms of representing the variability, although, as it was being used as a best guess, this is not problematic per se to the estimation of the relative threshold based on the average response. Regression analysis techniques that can deal with this are available, such as the use of transformations, modelling the variability separately and quantile regression.
The estimated absolute threshold for the OHS, OKS, SF-12 and WOMAC, when applied retrospectively, only excluded a very small proportion of the patients within our data sets who received hip and knee replacement surgery, which is reassuring. Use of this threshold is reliant on the definition of improvement used, not on the predictive performance. Although a small proportion of patients did receive surgery that may not have been suitable, it is possible that other benefits not detected with these outcome tools could explain and justify the operations for these individuals.
There was a suggestion of ceiling effects in the postoperative OHS, although this may to some degree reflect genuinely positive postoperative results. However, we need to be cautious about the ability of the OHS to distinguish postoperative results at the very top of the range. The OKS seemed to show a ceiling effect only for the function subscale in work package 1; function after the surgery may not be as well differentiated as pain by the instrument. For the OHS and OKS, we used the web-based NHS PROMs data set (2012–15), which was also used for work package 1. Interpreting with the specificity (false-negative) scores suggests that the NHS PROMs data set could be a more reliable source to develop the tools than other data sets.
The SF-12 PCS and, especially, the SF-12 MCS showed very high specificity (very poor false-negative) scores for relative thresholds (hip ranges of 20–48% for PCS and 91–100% for MCS, and knee ranges of 16–94% for PCS and 72–100% for MCS). The SF-12 PCS and MCS did not cover the range of possible scores in the population well (see Appendix 1). This suggests that the SF-12 may not be an ideal tool for developing standardised thresholds for hip or knee replacement surgery. Use of the SF-12 PCS and MCS individually was problematic given that they are negatively correlated. The SF-12 MCS, perhaps unsurprisingly, did not perform well in terms of face validity of the threshold impact.
Strengths and limitations
The main strengths of our study were (1) the use of diverse methods to define the thresholds and assess prognostic performance and impact, (2) the inclusion of multiple candidate instruments that were selected after a comprehensive search and selection process and (3) the use of multiple data sets including a number of large cohorts in musculoskeletal populations (e.g. NHS PROMs data).
The difficulties of defining a clinically important improvement, particularly for measuring quality of life (whether it is generic or disease-specific), are well known. 90,94 Furthermore, criteria had to be applicable to all of the candidate tools. Given the variation in available information on this despite the work carried out in work package 1, the choice was restricted to definitions that could be applied to the data sets and were not reliant on pre-existing work (e.g. the anchor estimate of the MIC in the relevant population), which was not available for all the measures. We used four methods, of which most emphasis was given to the medium ES (0.5)-based definition, in which a medium ES is of moderate clinical importance. 95 It is worth noting that these are relatively simple approaches that, for instance, do not take into account the preoperative score level, which may itself influence the clinically important improvement. We also applied the following approaches: any increase in the change score (> 0) and MDC defined as the minimal change beyond the measurement error. None of these is without criticism as a definition of clinically important improvement, although the need to compare evenly across measurement tools limited the options. Nevertheless, we believe that the findings have value, particularly those based on the medium ES definition.
The study had a number of limitations. Potential sources of bias or limitations in the study methodology were (1) the improvement criteria used reflected limitations in the literature and data sets available for the candidate tools; (2) estimation did not control for the baseline characteristic information, such as patient age, gender and comorbidities, which could be prognostically important; (3) complete-case analyses were undertaken (without imputation of the missing data), which could influence the findings; (4) the thresholds will be influenced by the measurement tool’s properties, such as responsiveness, which may mean that there is a failure to reflect meaningful impact in some patients; and (5) only point threshold estimations were reported in this study.
Further research needed
Our definition of improvement could be more rigorous. Additional benefit could be extended beyond ‘improvement’ to incorporate other concepts (e.g. patient satisfaction) or impacts of treatment (i.e. other relevant measures of benefit will be needed, especially for relative threshold). The MCID approaches can be defined and calculated in various ways, for example the approaches could be defined subjectively based on a patient’s satisfaction after the surgery (an anchor-based approach). The MIC using the mean change score (pre and post operation) is another potential alternative.
The NHS PROMs data were from a very large data set but only had a 6-month follow-up period, whereas most other data sets used 12 months of follow-up (except EPOS, which used 24 months). This may have somewhat influenced the magnitude of findings. 96 It would have been advantageous to have access to larger data sets to assess the other tools (i.e. non-Oxford scores), particularly with regard to the WOMAC score, for which more precise estimates of relative thresholds could have been achieved. The impact of the baseline characteristics (i.e. main prognostic variables) on the estimation of the relative threshold will be applied in work package 3.
Conclusion
In this study, various improvement definitions and analytical methods were used systematically to calculate threshold levels for the candidate tools in various data sets. The results demonstrated that the thresholds of the three candidate tools that showed promising initial cross-sectional psychometric properties in work package 1 (the OHS, OKS and WOMAC) were consistent across data sets; the SF-12 was the exception.
Chapter 5 Health economic evaluation of threshold values (work package 2)
Background
This chapter aims to answer the following two inter-related questions:
- What is the economic threshold for each clinical tool (i.e. what is the highest score at which arthroplasty is cost-effective)?
- How do the incremental costs, QALYs and cost-effectiveness of arthroplasty vary depending on the threshold and clinical tool used?
We addressed these questions in a UK setting from an NHS perspective by conducting a series of cost–utility analyses, comparing the incremental cost-effectiveness of TKA and THA with no arthroplasty in men and women of different ages with different preoperative scores on each clinical tool.
The analyses presented in this chapter focused on total joint arthroplasty (TJA) because this type of surgery constitutes 92% of knee and 99% of hip arthroplasty procedures conducted in the UK. 97 We compared immediate TJA with having no arthroplasty surgery during the 10-year time horizon used in the analysis. This enabled us to calculate the economic threshold for each clinical tool by comparing the cost-effectiveness in different groups with one another and with the £20,000-per-QALY ceiling ratio typically used in NHS decision-making. 98 The economic threshold simply comprised the highest clinical tool score at which the incremental cost-effectiveness ratio (ICER) is < £20,000 per QALY gained. In practice, patients who are not deemed to warrant immediate surgery may have treatment later, after their condition has worsened. However, there are relatively limited data on how clinical tool scores change over time without surgery,99–102 and modelling the referral pathway and outcomes for patients who undergo surgery at different times would have greatly complicated the analysis. Instead, by directly comparing immediate TJA with no arthroplasty over 10 years, we made the most of existing UK data sets and directly assessed the cost-effectiveness of arthroplasty.
We focused on NHS and Personal Social Services costs in line with current UK guidelines,103 but have narrowed the perspective further to focus on NHS costs because only one of the available data sets (APEX) included non-NHS costs. Health benefits were measured in terms of QALYs, in line with guidelines and in order to capture the effect of surgery on both quality of life and mortality. 103 The cost-effectiveness of TJA versus no arthroplasty was therefore calculated as the difference in cost (between patients undergoing TJA and those having no arthroplasty) divided by the difference in QALYs.
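The ICER and economic threshold defined above can be illustrated with a minimal sketch. The function names, the handling of cases with no QALY gain and all cost and QALY figures are assumptions made for illustration only; they are not taken from the study's Excel models.

```python
from __future__ import annotations

from typing import Optional

CEILING_RATIO = 20_000.0  # GBP per QALY gained, the ceiling ratio cited in the text


def icer(cost_tja: float, cost_none: float, qaly_tja: float, qaly_none: float) -> float:
    """Difference in cost divided by difference in QALYs (TJA vs. no arthroplasty)."""
    d_cost = cost_tja - cost_none
    d_qaly = qaly_tja - qaly_none
    if d_qaly <= 0:
        # Simplifying assumption for this sketch: with no QALY gain,
        # TJA is treated as not cost-effective.
        return float("inf")
    return d_cost / d_qaly


def economic_threshold(
    results: dict[int, tuple[float, float, float, float]],
) -> Optional[int]:
    """Highest preoperative score at which the ICER is below the ceiling ratio.

    `results` maps a clinical tool score to
    (cost with TJA, cost without, QALYs with TJA, QALYs without).
    """
    eligible = [score for score, r in results.items() if icer(*r) < CEILING_RATIO]
    return max(eligible) if eligible else None


# Hypothetical inputs: patients with higher (better) preoperative scores gain less.
example = {
    20: (9000.0, 2000.0, 6.0, 5.0),
    30: (9000.0, 2000.0, 6.0, 5.5),
    40: (9000.0, 2000.0, 6.0, 5.8),
}
print(economic_threshold(example))
```

With these made-up inputs, the two lower scores fall below the ceiling ratio and the highest score does not, so the sketch returns the middle score as the threshold.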
The analysis primarily concerned patients aged between 50 and 90 years and with an ASA (American Society of Anesthesiologists) grade of 1–3 who were undergoing unilateral TKA or THA for osteoarthritis. However, patients not meeting these criteria were not explicitly excluded from the regression analyses used to estimate input parameters to ensure consistency with the clinical analyses. Because the model begins when patients undergo (or do not undergo) TJA, the analyses presented in this chapter would apply regardless of whether the thresholds are applied in the setting of general practice, musculoskeletal hubs or secondary care.
Methods
Costs and QALYs for hypothetical patients with different clinical tool scores and demographic characteristics were estimated using decision-analytic models built in Microsoft Excel® 2010 (Microsoft Corporation, Redmond, WA, USA). Model-based economic evaluation enabled us to synthesise data from different data sets and extrapolate beyond the end of the available data. Separate models were built for TKA and THA for each of the three clinical tools. Model parameters (e.g. costs, utilities and the probability of death or revision) were based on regression models estimated using patient-level data from existing data sets; such models were used to estimate model parameters for patients with different preoperative characteristics. Hypothetical patients of different ages and genders with a wide range of clinical tool scores were run through the models separately and the costs and QALYs with and without TJA were calculated for each hypothetical patient. The models were probabilistic and took account of uncertainty around regression parameters.
Like the analyses described in Chapter 4, we considered hypothetical patients with different total OKSs or total OHSs on a 0–48 scale. For the WOMAC, for simplicity we focused on total scores, rather than considering pain, physical functioning and stiffness separately. As for the clinical analyses, the scale on the total WOMAC score (the sum of the three subscores ranging from 0 to 96) was multiplied by a factor of 100/96 to rescale it, and then reversed by subtracting it from 100, such that 0 indicates severe problems and 100 indicates no problems. However, because there is no single summary score for the SF-12, we considered the physical and mental scores as two separate patient characteristics and evaluated the cost-effectiveness of different combinations of physical and mental scores, estimating threshold SF-12 physical scores for patients with different mental scores.
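The WOMAC rescaling can be written out as a short sketch, reading the rescaling factor as 100/96 (so that a raw total of 0–96 maps onto 0–100 before reversal). The function name is hypothetical.

```python
def rescale_womac(total_raw: float) -> float:
    """Map a raw WOMAC total (0 = no problems, 96 = severe problems)
    onto a reversed 0-100 scale (0 = severe problems, 100 = no problems)."""
    if not 0 <= total_raw <= 96:
        raise ValueError("raw WOMAC total must lie between 0 and 96")
    return 100.0 - total_raw * 100.0 / 96.0


print(rescale_womac(0))   # best raw score maps to the top of the reversed scale
print(rescale_womac(96))  # worst raw score maps to the bottom
```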
Model
Literature reviews on modelling approach
A comprehensive literature review was conducted to identify previous decision-analytic models that assessed the cost-effectiveness of arthroplasty, specific types of surgery or prostheses or changes to the timing of surgery (see Online Supplement 7). This review identified 41 previous model-based economic evaluations. Almost all of the published studies used Markov models to allow for repeating cycles. Models typically allowed for a proportion of patients having one or more revision operations, which were sometimes separated into one- and two-stage revisions, revisions for infection or other causes, or total and partial revisions. Some models used separate health states to differentiate between patients with good and poor outcomes or quality of life after surgery. Some of the models differentiated between patient groups based on age, sex, comorbidities, ASA grade and/or obesity, and one analysis estimated results for patients with different Kellgren and Lawrence grades with and without symptoms. 104 However, no model-based evaluations calculated how costs and QALYs varied with clinical tool scores. Five studies using patient-level data assessed how costs and/or QALYs varied with the Oxford Hip and Knee Scores or the WOMAC. 2,68,105–107 One study that was published after our search date assessed how cost-effectiveness varied with the SF-12. 108
Description of the Markov model
We used the results of this literature review to inform the design of our model. Like most studies in the literature, we used a Markov model with annual cycles to allow for the fact that patients are at risk of death or needing a revision each year. We estimated costs and QALYs using cohort simulation, because modelling thousands of individual patient trajectories for up to 1410 hypothetical patients using patient-level simulation would have been too computationally time-consuming. Revision rates and mortality varied depending on the time since primary arthroplasty and the age and sex of the hypothetical individual. Costs and EQ-5D utilities also varied depending on the time since primary arthroplasty, age, sex and clinical tool score.
The model started at the point at which patients in the arthroplasty arm underwent primary TKA or THA (Figure 13). Revision was defined in the same way as in the National Joint Registry (NJR), namely an:
. . . operation performed to remove (and usually replace) one or more components of a total joint prosthesis for whatever reason.
FIGURE 13.
State transition diagram for the Markov model.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig13.png)
The base-case analysis used a 10-year time horizon because it was considered clinically unrealistic to assume that patients in the no arthroplasty arm would never have surgery. This also approximated the longest follow-up time available in the data sets used to estimate model parameters. However, different time horizons were examined in the sensitivity analysis. Costs, QALYs and life-years accrued beyond year 1 were discounted at 3.5% per annum. 103,109
The model did not apply a standard half-cycle correction to all health states, because the cost of primary TJA was assumed to be at the start of the year regardless of what subsequently happened to the patient and is not evenly distributed across the first 12-month period. We assumed that patients who die in the same year as having a primary TJA or revision surgery will incur the entire cost of the hospital stay in which the TJA or revision surgery was conducted. The cost of the admission for primary or revision surgery was assumed to be the same regardless of whether patients died during or soon after surgery. Other costs were assumed to be evenly distributed across the year, such that patients who die in any given year were assumed to accrue half of the cost and half of the number of QALYs that they would have accrued if they had lived for the whole year.
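The discounting and half-cycle assumptions described above can be sketched for a deliberately simplified two-state (alive/dead) cohort. This sketch omits the revision states shown in Figure 13, and the function name, argument names and input values are all hypothetical.

```python
def discounted_totals(p_death, annual_cost, annual_utility, upfront_cost, rate=0.035):
    """Discounted cost and QALYs per patient for a simplified alive/dead cohort.

    p_death[t], annual_cost[t] and annual_utility[t] refer to year t + 1.
    The upfront (surgery) cost is incurred in full at the start of year 1.
    Patients who die during a year accrue half of that year's other costs and QALYs.
    Amounts accrued beyond year 1 are discounted at `rate` per annum.
    """
    alive = 1.0
    total_cost = upfront_cost  # no half-cycle correction for the surgery admission
    total_qalys = 0.0
    for year, p in enumerate(p_death, start=1):
        dying = alive * p
        surviving = alive - dying
        person_years = surviving + 0.5 * dying  # half-cycle correction for deaths
        discount = 1.0 / (1.0 + rate) ** (year - 1)  # year 1 is undiscounted
        total_cost += discount * person_years * annual_cost[year - 1]
        total_qalys += discount * person_years * annual_utility[year - 1]
        alive = surviving
    return total_cost, total_qalys


# Hypothetical 10-year example with a constant 2% annual risk of death.
cost, qalys = discounted_totals([0.02] * 10, [150.0] * 10, [0.75] * 10, upfront_cost=6000.0)
print(round(cost, 2), round(qalys, 3))
```

In the study's models the same logic would apply within each health state, with revision and re-revision handled as additional states and their admission costs also applied in full at the start of the relevant cycle.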
The model used a probabilistic sensitivity analysis (PSA) to propagate the uncertainty around all uncertain parameters (see Presentation of results and analysis of uncertainty).
Data sets
Data inputs
Model input parameters were estimated using regression models predicting the model parameter as a function of clinical tool score, age and sex. Each of the six Markov models required regression models informing 19 sets of model parameters (Table 39). As described in Literature reviews on model inputs, we began by conducting literature searches to identify any published regression models that could be directly used in the model. If no suitable regression model was identified, the following data sets (described in Chapter 3, Data sets) were considered as candidates for estimating the regression models conducted before the second user group meeting:
- PROMs data freely available at www.hscic.gov.uk/proms (accessed 7 October 2015)
- the KAT66
- the EUROHIP132
- Belfast.
Parameter | Knees: OKS | Knees: SF-12 | Knees: WOMAC | Hips: OHS | Hips: SF-12 | Hips: WOMAC
---|---|---|---|---|---|---
Preoperative EQ-5D utility (mapping) | | | | | |
Postoperative EQ-5D utility 3–12 months after arthroplasty | | | | | |
Annual change in EQ-5D at > 3–6 months after arthroplasty, in patients with and without revision | | | | | |
EQ-5D utility before revision | | | | | |
EQ-5D utility after revision | | | | | |
Cost of the initial arthroplasty procedure and hospital stay | | | | | |
Community, outpatient and re-admission costs beyond the initial hospital stay: year 1 | | | | | |
Community, outpatient and re-admission costs beyond the initial hospital stay: > 1 year after TJA | | | | | |
Cost of revision arthroplasty procedure and hospital stay | | | | | |
Community, outpatient and re-admission costs beyond the initial hospital stay for revision (estimated separately for the year of revision and > 1 year after revision) | | | | | |
Community, outpatient and inpatient costs without arthroplasty | | | | | |
Annual change in clinical tools without arthroplasty | | | | | |
Annual change in EQ-5D utility without arthroplasty | | | | | |
Probability of revision surgery | | | | | |
Probability of re-revision | | | | | |
Operative mortality: primary arthroplasty | | | | | |
Operative mortality: revision arthroplasty | | | | | |
Healthy patient effect | | | | | |
All-cause mortality | | | | | |
The COASt is a prospective, dual-centre longitudinal cohort study recruiting patients across two hospitals: the University Hospital Southampton NHS Foundation Trust and NOC as part of Oxford University Hospitals NHS Foundation Trust. The COASt was established in 2010, and patients who were placed on the waiting list for knee or hip replacement surgery were recruited to the study. 131 The data sets extracted for the ACHE tool contained observations for 810 hip surgery patients and 858 patients undergoing knee surgery. After excluding patients who underwent procedures other than TKA/THA (e.g. hip resurfacing), the data sets contained 806 observations on THA patients and 484 observations for TKA patients. Data were collected prior to surgery as well as at 6 weeks and then annually for 5 years thereafter. COASt is funded by NIHR under its Programme Grants for Applied Research programme (reference number RP-PG-0407–10064). The study obtained ethics approval from Oxford Research Ethics Committee (reference number 10/H0604/91) and written consent was obtained from each participant.
As detailed in Chapter 7 and Table 39, nine of the parameters within the OKS/OHS models were re-estimated after the second user group meeting and after a new linked extract of Hospital Episode Statistics (HES) and PROMs data became available. This provides a much larger sample of more recent data, including patients’ exact age, and can be directly used to calculate the payment by results tariff applicable to each patient.
Data sets were not pooled, because they differed substantially in age and data collected and there was little advantage in pooling a small data set with a much larger one (e.g. PROMs). Instead, we selected the most appropriate data set among those reporting the tool and outcome of interest based on size, age of the data set and consistency with other analyses. When possible, we aimed to use the same data set for several time points, tools or arms of the model and aimed to use data sets that measured quality of life at the same time point.
Utility was measured with the EQ-5D-3L and valued using the UK time trade-off tariff133 to ensure consistency across models and with NICE guidelines. 103
Literature reviews on model inputs
We began by conducting literature searches to identify any previous studies that reported regression models that could be used directly in our analysis. Five specific searches were conducted:
- A review of previous economic evaluations and costing studies using patient-level data (see Online Supplement 7) aimed to inform the model structure and identify data for all model parameters.
- A review of studies reporting changes in clinical tool scores over time for osteoarthritis patients without arthroplasty surgery (see Online Supplement 9) aimed to inform assumptions about how clinical tool scores change over time.
- A review of studies reporting long-term changes in clinical scores, mortality and risk of revision after arthroplasty surgery (see Online Supplement 9) aimed to identify any data for long-term costs, utilities and transition probabilities.
- A review of studies mapping from any of the clinical tools onto the EQ-5D, or between any of the clinical tools (see Online Supplement 11) aimed to identify models predicting baseline EQ-5D utility, because several such studies were identified in a previous systematic review. 134
- A review of studies reporting mortality after primary or revision knee/hip arthroplasty (see Online Supplement 10) aimed to identify studies on mortality, because none of the available data sets provided data on mortality for a large sample.
The second literature review aimed to inform a key assumption in the model by identifying studies that reported changes in clinical tool scores over time for patients without arthroplasty; this is the comparator for the analysis and therefore has a strong influence on the results. A total of 22 such studies were identified. 99–101,115–117,135–150 However, the reported results were ambiguous. Most studies focused only on changes in WOMAC subscores and only reported data over a 2- to 5-year follow-up period. Only two studies reported results for the OHS or SF-12. 100,115 Overall, the results indicated that patients’ clinical tool scores might either improve or worsen, with several studies reporting approximately equal probabilities for both. 101,142,145 We therefore assumed that, in the absence of arthroplasty, clinical tool scores remain constant over the 10-year time horizon (see Other model assumptions and inputs). However, we did allow for reductions in EQ-5D utility with age. The review of economic evaluations identified three previous estimates of the rate of change in utility with age;2,112,118 we used the Ara and Brazier112 model 1 in our analysis because it is based on patient-level UK data and is not specific to patients with certain comorbid conditions; the variance–covariance matrix for the model was obtained from the authors.
The reviews identified nine studies estimating how mortality varied with age, sex and/or other characteristics. However, only one series of studies reported the full set of model coefficients, or considered mortality beyond 90 days after surgery;118,119,121,123 therefore, we used the most recent of these studies in our model. 119,121
Four studies were identified that mapped from the OKS, OHS and WOMAC total scores or from the SF-12 version 2 physical and mental domain scores onto the EQ-5D. 50,110,111,126 It was necessary to focus on models mapping from total scores or from physical/mental scores, to match the way in which clinical tool scores are modelled in the rest of the model. However, the study mapping from the WOMAC used a small sample [348 observations vs. 978 (knees)/1067 (hips) available in the APEX study] and presented no measures of uncertainty around model coefficients. 111 The study on the SF-12 version 2 was based on general public samples, rather than patients with arthritis, and mapped onto the US EQ-5D tariff, rather than the UK EQ-5D tariff. 110 Furthermore, variance–covariance matrices were only available for the studies mapping from the OKS and OHS. We therefore used the published studies mapping from the OKS and OHS for the economic evaluations described in this chapter and estimated new mapping algorithms for the WOMAC and SF-12 using the available data sets. 50,126 For the further analyses, we subsequently re-estimated mapping algorithms for the OKS and OHS using PROMs data providing patients’ exact ages.
An analysis conducted as part of the COASt work programme using data from HES and the Clinical Practice Research Datalink (CPRD) was used to provide estimates of ambulatory costs accrued > 1 year after primary THA and the cost of revision surgery, because none of the available data sets provided data on these parameters. 113,131
However, the literature reviews identified no studies predicting any of our model parameters conditional on any of the clinical tools under consideration. Therefore, we used the individual patient data available to us to estimate new regression models.
Costing analyses
The reference year for costs was 2014. Within the data sets giving individual patient data on resource use, the cost of primary TJA was estimated based on the national payment by results tariff for relevant Healthcare Resource Groups (HRGs), whereas other health-care resources were valued using unit costs (see Online Supplement 8). Costs taken from Pinedo Villanueva113 were inflated from 2010–11 to 2013–14 using the Hospital and Community Health Services Index. 151
The analysis focused on resource use associated with the joint that was replaced, because resource use unrelated to the joint in question was excluded from all of the available data sets. The analyses estimating the community, outpatient and re-admission costs after hospital discharge also excluded medications, personal care, nursing homes, convalescence care, equipment, home modifications, alternative practitioners, etc. These costs were available for certain data sets (e.g. the APEX study), but not others (e.g. KAT). Excluding such costs greatly reduced the analysis time and made it easier to make fair comparisons between clinical tools. We assumed that all physiotherapy was paid for by the NHS.
None of the data sets available before the second user group meeting provided sufficient data to put through the NHS Payment Grouper,152 a computer program that allocates individual hospital episodes to HRGs based on procedures, diagnoses and patient characteristics. In the analyses presented in this chapter, we, therefore, manually synthesised an estimate of the cost of the initial hospital stay for primary TJA using information from the national tariff 2014–15153 and the NHS Payment Grouper152 and excess bed-days using the NHS tariff. Under the payment by results scheme, hospitals are paid a fixed amount for each HRG unless the patient remains in hospital for longer than the ‘trim point’. For hospital stays lasting beyond the trim point, hospitals are paid an excess bed-day tariff for each additional day beyond the trim point. For primary arthroplasty, there are separate HRGs for patients with no complications, minor complications and major complications; these HRGs differ in their tariff price and trim point, whereas the excess bed-day price is the same across the relevant HRGs. Because it is difficult to establish defensible and consistent methods for identifying which patients would have had minor or major complications based on the comorbidity and/or complication fields in other data sets, we used national data to estimate weighted average costs for every possible length of stay, taking into account the trim point, tariff price and excess bed-day price, and applied this to all patients with that length of stay (see Online Supplement 12). This approach implicitly assumes that the incidence of complications is unrelated to clinical tool score and ignores the association between age and complications.
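The length-of-stay-based costing described above can be sketched as follows. This is an illustration under stated assumptions, not the study's calculation (which is detailed in Online Supplement 12): the class and function names are hypothetical, the weights are assumed to be the national share of spells falling in each HRG, and all tariff values are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hrg:
    tariff: float    # fixed payment for the hospital spell (GBP)
    trim_point: int  # number of days covered by the fixed tariff
    share: float     # assumed national share of spells in this HRG


def spell_cost(hrg: Hrg, los_days: int, excess_bed_day_price: float) -> float:
    """Tariff plus an excess bed-day payment for each day beyond the trim point."""
    excess_days = max(0, los_days - hrg.trim_point)
    return hrg.tariff + excess_days * excess_bed_day_price


def weighted_cost(hrgs: list, los_days: int, excess_bed_day_price: float) -> float:
    """Share-weighted average cost across HRGs for a given length of stay."""
    total_share = sum(h.share for h in hrgs)
    weighted = sum(h.share * spell_cost(h, los_days, excess_bed_day_price) for h in hrgs)
    return weighted / total_share


# Hypothetical HRGs: no, minor and major complications.
hrgs = [Hrg(5000.0, 10, 0.70), Hrg(5500.0, 14, 0.25), Hrg(6500.0, 25, 0.05)]
for los in (5, 12, 30):
    print(los, round(weighted_cost(hrgs, los, excess_bed_day_price=200.0), 2))
```

Because the same weighted cost is applied to every patient with a given length of stay, the sketch shares the limitation noted in the text: complication rates are implicitly unrelated to clinical tool score and age.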
Changes in guidelines and waiting time targets have halved the average length of stay for arthroplasty in the last 15 years. 154,155 Consequently, length-of-stay data from trials that started in the early 2000s (e.g. KAT and EPOS) do not accurately reflect current practice. However, because KAT and EPOS are the only data sets containing the SF-12 instrument that also report resource use, we relied on these two data sets to estimate how the cost of primary arthroplasty varies with preoperative SF-12 scores. Applying the current national tariff to these older data sets would have systematically overestimated costs, because the share of patients with excess bed-days would be considerably higher than the share for more recent studies. We addressed this problem by adjusting the length-of-stay data for primary TJA using data from the COASt study, which was conducted in 2011 (see Online Supplement 12). 131
A simple approach was used to value re-admissions occurring after patients were discharged from hospital following primary TJA. Such re-admissions were costed by estimating a (weighted) mean cost per orthopaedic bed-day for those HRGs with the word ‘hip’ or ‘knee’ in the Department of Health and Social Care reference costs HRG description, and multiplying this by the length of stay. 156 A similar method was used to value day cases, which was applied to all admissions in which the admission and discharge dates were the same. No adjustment was made for the calendar year in which the re-admission or revision took place, because any such adjustment would have been extremely complex as the re-admissions observed in the KAT data set were spread over a 12-year period. The COASt questionnaires on costs before arthroplasty provide no data on length of stay for re-admissions, only the number of such re-admissions. In these cases, we therefore applied to each admission the (weighted) mean cost per orthopaedic admission for HRGs with the word ‘hip’ or ‘knee’ in the description from NHS Reference Costs 2013 to 2014. 156 In order to avoid underestimating costs by excluding patients with missing length of stay data for re-admissions, we applied the same weighted average cost to all patients with missing data on length of stay unless the providers of the data set also provided clear guidance on how to impute missing length of stay data. For KAT, we used the mean imputation already applied in the KAT data set: a length of stay of about 9 days for washout procedures with missing length of stay and 14.25 days for the second stage of a two-stage revision with missing length of stay. However, these simple analyses were superseded by the HES data used to calculate re-admissions in the analyses presented in Chapter 7.
Owing to lack of data, the cost of re-admissions other than revisions that took place > 1 year after THA was excluded from the analyses presented in this chapter. No patient-level data giving costs > 1 year after THA were available at that time and the only available cost estimates cover only ambulatory consultations, not re-admissions. 113 The analyses presented within this chapter may, therefore, slightly underestimate costs for patients who have had THA. The cost of re-admissions was therefore added to the analyses conducted after the second user group meeting once HES data became available.
Regression analyses
Regression models predicting each model parameter conditional on preoperative clinical tool score, age and sex were estimated on the individual patient data using Stata® version 14.
A complete-case analysis was conducted to avoid overcomplicating the analysis with multiple imputed data sets. Observations with missing data on the clinical tool or the model parameter in question would have provided very little information to inform model estimation, and there is no reason to expect the relationship between clinical tool and outcome to differ between people with and without missing data on these variables. Each individual analysis excluded patients with missing data on age, sex or either the clinical tool score or the outcome variable being estimated in that regression analysis. As a result, sample sizes differed between regression analyses using the same data set.
The EPOS was the only data set providing information on THA patients beyond year 1. However, EPOS participants did not complete the EQ-5D, and SF-6D utilities cannot be directly compared with those measured using the EQ-5D because the SF-6D values health states using standard gamble rather than time trade-off and tends to produce non-comparable utilities (e.g. higher utilities for patients with poor health states). 157 We therefore mapped participants’ SF-12 responses onto the EQ-5D before analysis. Based on the literature review on mapping studies (see Online Supplement 11), we selected a response-mapping algorithm mapping from the SF-12 version 1 item responses that was slightly modified from the one estimated by Gray et al. 158 and estimated predictions using the expected value method. 159 This algorithm was chosen as it had better prediction accuracy than other algorithms based on the same version of the SF-12 that was used in the EPOS (see Online Supplement 11). Postoperative EQ-5D utilities calculated in this way were used in subsequent regression models in the same way as observed EQ-5D utilities, ignoring the uncertainty around the mapping model (which is in any case likely to be small owing to the large sample size used in the mapping study).
For each model parameter, we began by conducting exploratory data analysis to identify the distribution of the dependent variable and the shape of the relationship between the dependent variable and the clinical tool score, age and (when appropriate) time since primary arthroplasty. Exploratory data analysis was used to identify the most appropriate model specifications for each dependent variable. We then estimated regression models on each of the parameters listed in Table 39 and selected the model specification best predicting each model parameter using mean squared error (MSE). MSE was chosen in preference to information criteria in order to focus on model prediction and because information criteria cannot easily be calculated for some model types (e.g. two-part models) or necessarily be compared between linear and non-linear models.
We used K-fold cross-validation to reduce overfitting. For each of the below steps for each model parameter, each data set was divided into 10 parts of approximately equal size using pseudo-randomly generated numbers. Unless otherwise stated, for the analyses estimated on long format data sets (i.e. costs and EQ-5D utility before/after revision and long-term trends in EQ-5D utility), patients were divided into 10 groups and all observations for the same patient were included in the same part. For the analyses based on KAT and EPOS, we also stratified patients based on whether or not they had a revision ≥ 12 months after primary TJA when dividing patients between the 10 parts; no such stratification was done for PROMs/HES when conducting the analyses described in Chapter 7, because the sample size was markedly larger. Candidate models being considered in any given step were estimated on 9 of the 10 parts and validated on the 10th part; estimation and validation were repeated for each of the 10 validation samples, resulting in an estimate of the squared error (i.e. the squared difference between predicted and observed values in the validation sample) for each candidate model for each observation. MSE was calculated as the crude mean across the squared errors and the model specification with the lowest MSE was chosen for use in the next step.
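The cross-validation procedure can be sketched as below. The study's analyses were run in Stata; this pure-Python sketch only illustrates the logic, with all observations for a patient kept in the same part and MSE taken as the crude mean of the squared errors. The function names, the two toy candidate models and the synthetic data are assumptions, and the stratification by revision status used for KAT and EPOS is not shown.

```python
import random


def assign_folds(patient_ids, k=10, seed=0):
    """Pseudo-randomly allocate each patient (not each observation) to one of k parts."""
    unique = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique)
    return {pid: i % k for i, pid in enumerate(unique)}


def fit_mean(xs, ys):
    """Candidate 1: predict the training mean regardless of the covariate."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Candidate 2: simple least-squares line in one covariate."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_mse(rows, fit, k=10, seed=0):
    """rows holds (patient_id, covariate, outcome); returns the crude mean squared error."""
    folds = assign_folds([r[0] for r in rows], k, seed)
    squared_errors = []
    for held_out in range(k):
        train = [r for r in rows if folds[r[0]] != held_out]
        valid = [r for r in rows if folds[r[0]] == held_out]
        if not train or not valid:
            continue
        predict = fit([r[1] for r in train], [r[2] for r in train])
        squared_errors.extend((predict(x) - y) ** 2 for _, x, y in valid)
    return sum(squared_errors) / len(squared_errors)


# Synthetic long-format data: two observations per hypothetical patient.
rng = random.Random(1)
rows = [(pid, float(pid % 48), 0.3 + 0.01 * (pid % 48) + rng.gauss(0, 0.05))
        for pid in range(200) for _ in range(2)]
candidates = {"mean only": fit_mean, "linear": fit_linear}
scores = {name: cv_mse(rows, fit) for name, fit in candidates.items()}
print(min(scores, key=scores.get), scores)
```

The specification with the lowest cross-validated MSE would then be carried forward to the next step of model selection.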
Regression analyses on each of the parameters listed in Table 39 were conducted using the following steps.
Step 1. Functional form of the outcome variable:
- Choose a number of candidate regression functions [e.g. ordinary least squares (OLS), generalised linear model (GLM), two-part model] based on the exploratory data analysis.
- Define a simple model with the following covariates, with no polynomials or interactions:
  - Clinical tool score.
  - Age (this comprised dummies for 10-year age bands for the freely available PROMs data set, and continuous age for other data sets).
  - Sex.
  - For analyses including RCT data, we added a covariate for treatment effect if the treatment allocation was very different from current practice and the trial found significant differences between treatments. A dummy for treatment effect was therefore included in analyses on the APEX study, but not in KAT. However, no dummy was included in the mapping models predicting EQ-5D utility conditional on patients’ contemporaneous clinical tool scores.
  - For certain model parameters, we also included time since primary arthroplasty and indicators of previous revision surgery as covariates (see Online Supplement 12).
- Estimate the simple model using the candidate regression functions.
- Calculate MSE for each model.
- Choose the functional form with the lowest MSE to use in step 2.
Optional step 1b. Functional form for time:
- For the models of long-term EQ-5D utility, re-admission costs beyond year 1, and the cost or utility of revisions, step 1(b) then identified the functional form for time since primary operation based on MSE.
Step 2. Functional form of the clinical tool:
- Based on the exploratory data analysis, we chose candidates for the parameterisation of the clinical tool scores. Depending on the exploratory data analysis results, these included logarithms, polynomials, splines and/or interactions between different domain scores. For the SF-12, we chose the parameterisation for the physical domain first, then the mental domain, then assessed whether or not an interaction between physical and mental domains improved MSE. If non-logarithmic forms were chosen for the SF-12, we only considered the interaction between the absolute physical and absolute mental domain scores (i.e. no splines, quadratic or cubic terms for the interaction); if logarithmic forms were chosen for one (or both) SF-12 domains, we also considered interactions between linear and log scores, or between log-physical and log-mental scores. The model with the lowest MSE was used in step 3.

Step 3. Functional form of covariates:
- Based on the exploratory data analysis, we chose a limited range of candidates for the parameterisation of age, estimated the candidate models and selected the model with the lowest MSE to use in step 3(b). Interactions between the clinical tool and other covariates were not considered. We also considered models that dropped age. For the models of re-admission costs beyond year 1 and the cost or utility of revisions, we first compared the MSE obtained using age at the time of primary surgery with the MSE obtained using age at the time of revision/year of data, and selected the parameterisation for age based on the variable with the lowest MSE. Analyses of costs > 1 year after revisions used age in the current year throughout (owing to the Markovian assumption inherent within the model).
- We assessed whether or not dropping the sex variable reduced the MSE. The model with the lowest MSE from step 3(b) was chosen for step 4.
Step 4. Final regression models and their variance–covariance matrices:
For two-part models, the selection of variables in steps 2 and 3 was done simultaneously for both parts of the model, such that the same variables were included in both parts of the model unless one of the final selected models was unstable (e.g. SEs of > 600 for a logit model); in such cases, we reran model selection for the problematic part of the model, leaving the other part unchanged. Online Supplement 12 describes how models for specific parameters were estimated, specific data cleaning and processing steps were conducted on specific data sets and model parameters, and specific assumptions were used when applying the parameters in the Markov models.
Other model assumptions and inputs
The model made the following assumptions:
- Patients in the no arthroplasty arm were assumed to undergo no knee or hip arthroplasty during the 10-year time horizon. In practice, patients whose symptoms are not currently severe enough to warrant arthroplasty may have surgery later, once their symptoms have deteriorated. However, modelling arthroplasty procedures conducted at different time points as well as allowing for changes in revision rates, mortality, utilities and costs over time would have greatly complicated the model structure and required six patient-level simulation models, each following up to 1410 hypothetical individuals. Furthermore, there were very limited data on the proportion of patients whose clinical tool scores would deteriorate in the absence of arthroplasty, or on the rate at which scores are likely to deteriorate.
- As discussed in Literature reviews on model inputs, the studies identified in the literature review suggested that some patients’ symptoms worsened over time in the absence of arthroplasty, whereas symptoms for other patients improved or remained the same (see Online Supplement 9). Although the literature review identified several US studies using WOMAC, we identified only one UK study using the WOMAC physical functioning subscale144 and one study showing how OHS100 and SF-12,115 respectively, change over time in the absence of arthroplasty. It is also unclear whether or not all of the patients in such data sets would be considered candidates for TJA. We therefore assumed that clinical tool scores remain unchanged for the time horizon of the model in patients who do not have arthroplasty, but assumed that EQ-5D utility decreases with age following a published model. 112 If patients’ osteoarthritis symptoms did, on average, deteriorate in the absence of arthroplasty, this would mean that our analyses overestimate the QALYs accrued in the no arthroplasty arm and, therefore, overestimate ICERs and underestimate the economic threshold.
- We placed no restrictions on the number of revisions that patients could have within the Markov models, although patients could not have more than one revision operation per year.
- Patients in the no arthroplasty arm were assumed to accrue a fixed cost each year. This cost was based on the costs accrued by patients in the COASt study in the year before arthroplasty, because no UK data were available on patients who were potentially eligible for arthroplasty but did not have surgery.
- For certain parameters (e.g. community, outpatient and re-admission costs or utility > 1 year after TJA), there were no data for WOMAC. Furthermore, the only UK data set providing information on costs before or without arthroplasty (COASt) did not include the SF-12. 131 In these cases, we estimated regression models that used preoperative EQ-5D utility as an explanatory variable in place of the WOMAC or SF-12. Mapping models were used to calculate preoperative EQ-5D utility from preoperative clinical tool scores. For those parameters that were estimated as a function of EQ-5D, we calculated predicted costs/utilities by multiplying mapped preoperative EQ-5D utility by the regression coefficient for preoperative EQ-5D utility.
- In all analyses, we assumed that costs and quality-of-life questionnaires were completed at the designated time. For example, if a patient had a revision 1.99 years after primary TJA, we assumed that the year 2 EQ-5D questionnaires were completed after the revision (rather than before).
- Mortality rates incorporated a healthy patient effect estimated by Pennington et al. 119,121 using NJR data that allows for the fact that patients selected to undergo TJA have lower mortality for around 8 years after surgery than people who are not considered candidates for TJA. This was operationalised as a multiple of the annual all-cause mortality risk for individuals in the UK of the relevant age and sex, in which the multiple varies with age, sex and time since TJA. 125 Because the patients in the no arthroplasty arm were assumed to be identical to those patients in the arthroplasty arm, the healthy patient effect was also applied for the first 8 years of the no arthroplasty arm.
- Mortality associated with revision surgery was excluded from the analyses described in this chapter, unless the models developed by Pennington et al. 119,121 predicted that mortality would be > 10% higher than would be expected in the absence of revision surgery (in which case mortality was assumed to be 10% higher than without revision surgery). Furthermore, no excess mortality was applied to revisions taking place within 12 months of primary arthroplasty. However, the published models of mortality associated with revision surgery were applied to all revision procedures in Chapter 7. This assumption is unlikely to have any significant effect on the conclusions because only a small minority of patients have revisions and revision rates are assumed to not vary with clinical tool score.
- The cost of re-admissions and ambulatory costs were excluded for those patients who were revised in year 1 in the analyses described in this chapter. This assumption is unlikely to have any significant effect on the conclusions because only a small minority of patients have revisions and revision rates are assumed to not vary with clinical tool score.
- After a patient’s 100th birthday, we assumed that the all-cause mortality for patients’ 100th year applies for all subsequent years. This assumption was only used in sensitivity analyses extending the time horizon beyond 10 years.
- We assumed that revision rates do not vary with clinical tool score, because the PROMs/HES extract was not linked to NJR and the available data sets included very few revisions that could be linked to clinical scores measured before primary procedures. We therefore used published models estimated using NJR data that predict revision rates conditional on age, sex, time since primary TJA and other variables. 119,121 In these models, revision rates have a non-linear relationship with several variables that are not explicitly captured as patient characteristics in the model.
- Postoperative EQ-5D utility and the cost of re-admissions and community/outpatient consultations in year 1 were estimated for all patients, regardless of whether or not they had been revised. This simplification was made because it was not possible to reliably identify the patients who were revised within 12 months of primary surgery in the freely available PROMs data set. This assumption should not affect estimates of economic thresholds, because revision rates were assumed to not vary with clinical tool score.
- When modelling the QALY profile in the first year after arthroplasty, we assumed that EQ-5D utility at 3 months is the same as EQ-5D utility at 6 months and we assumed a linear change in utilities in the first 3 months after primary arthroplasty. Exploratory data analysis on the APEX study suggested that this approach gives a good approximation of the QALYs that we would calculate for the first year after primary arthroplasty if we had more frequent measurements, and performs at least as well as more complex methods. A recent study on SF-6D utilities after TKA confirmed this finding. 160
- In all other cases, we assumed that utility changes linearly during the year; therefore, the QALYs accrued during the year equal the crude average of utility at the start of the year and utility at the end of the year. For simplicity, this assumption was also applied in the year of revision surgery, because for some data sets (e.g. KAT) revision surgery may take place at any point in the year. Furthermore, patients undergoing revision are likely to have had a quality of life similar to the pre-revision utility in the months leading up to revision surgery and experience utility similar to that observed 6 months after revision later in the year.
- Because the > 1 year after revision state includes people with revision operations that took place anywhere between 1 and 9 years earlier and the Markov model cannot differentiate between people with respect to time since revision, we assumed that the utility in this state was equal to the post-revision utility that would have occurred if the revision had taken place in the present year relative to the annual EQ-5D questionnaire. As a result, any effect of ageing on utility after revision is assumed to be captured in the post-revision utility model.
- For THA, the cost of community and outpatient care > 1 year after hospital discharge was based on an analysis of CPRD data done as part of the COASt study because no individual patient data were available (see Online Supplement 12 for a description of the assumptions made when applying these costs). 131
- In PSA, all utilities were constrained to be between –0.594 and 1. Utilities that would otherwise be < –0.594 were set to –0.594 and those that would otherwise be > 1 were set to 1. With the exception of the community costs taken from Pinedo Villanueva,113 all costs were constrained to be ≥ £0 and values that would otherwise be < £0 were set to £0. The community costs taken from Pinedo Villanueva113 represent differences between resource use in patients with osteoarthritis and resource use in those without osteoarthritis and were, therefore, permitted to be negative, in line with how they were used in the original study.
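The QALY and truncation assumptions in the list above lend themselves to a short sketch. The function names are hypothetical, and the treatment of months 6 to 12 in the first year (a linear change from the 6-month to the 12-month utility) is an assumption made for the example rather than something stated in the list.

```python
UTILITY_MIN, UTILITY_MAX = -0.594, 1.0  # bounds applied to utilities in PSA

def clamp_utility(u):
    """Set utilities outside [-0.594, 1] to the nearest bound."""
    return max(UTILITY_MIN, min(UTILITY_MAX, u))

def clamp_cost(cost, may_be_negative=False):
    """Costs are floored at 0 unless they are cost differences allowed to be negative."""
    return cost if may_be_negative else max(0.0, cost)

def qalys_linear_year(u_start, u_end):
    """Linear change over the year: QALYs equal the crude average of the two utilities."""
    return (u_start + u_end) / 2.0

def qalys_first_year(u_pre, u_6m, u_12m):
    """First year after primary arthroplasty.

    0-3 months: linear change from preoperative utility to the 3-month utility,
                which is assumed equal to the 6-month utility.
    3-6 months: constant at the 6-month utility.
    6-12 months: linear change to the 12-month utility (ASSUMPTION for this sketch).
    """
    first_quarter = 0.25 * (u_pre + u_6m) / 2.0
    second_quarter = 0.25 * u_6m
    second_half = 0.5 * (u_6m + u_12m) / 2.0
    return first_quarter + second_quarter + second_half

# Illustrative values only.
example_qalys = qalys_first_year(u_pre=0.4, u_6m=0.8, u_12m=0.8)
```

With these illustrative inputs the first-year total is 0.75 QALYs, compared with 0.60 if utility were assumed to rise linearly over the whole year, which shows why the shape of the early recovery matters.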
Presentation of results and analysis of uncertainty
Hypothetical individuals with different combinations of age, sex and clinical tool score were run through the model sequentially, both using mean values for all parameters and using parameter values sampled from their distributions (PSA). PSA was run separately on the six Markov models representing TKA and THA with each of the three tools. However, within each model, the same set of sampled values was used for all hypothetical individuals to ensure that differences between hypothetical individuals were not masked by differences between sampled values.
All uncertain parameters, including all regression coefficients, were varied in PSA. We allowed for correlations between coefficients estimated in the same regression model by assuming a multivariate normal distribution. 161 Variance–covariance matrices for published models were obtained from the authors;50,112,119,121,126 those for the models estimated on patient-level data were estimated in Stata® and are available on request. However, for simplicity, we did not allow for correlations between the coefficients from different regression models, or between the coefficients for the logit and OLS/GLM parts of two-part models. Differences in the cost of ambulatory consultations after THA were assumed to follow independent normal distributions, whereas the cost of hip revision surgery in different patient subgroups was assumed to follow independent gamma distributions. 113
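As a sketch of how correlated coefficients from one regression model can be sampled in PSA, the example below draws from a multivariate normal distribution using a Cholesky factor of the variance-covariance matrix. The coefficient means and covariance values are hypothetical, and this is one standard way of generating such draws rather than a description of the software actually used.

```python
import math
import random

def cholesky(matrix):
    """Lower-triangular factor L of a symmetric positive-definite matrix (L L^T = matrix)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def draw_coefficients(means, cov, rng):
    """One PSA draw: means + L z, where z is a vector of independent standard normals."""
    lower = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in means]
    return [m + sum(lower[i][k] * z[k] for k in range(i + 1))
            for i, m in enumerate(means)]

# Hypothetical regression output: intercept and slope with negative covariance.
means = [0.30, 0.012]
cov = [[0.0004, -0.00001],
       [-0.00001, 0.000001]]

rng = random.Random(42)
draws = [draw_coefficients(means, cov, rng) for _ in range(2000)]
```

Coefficients from different regression models would be drawn separately in this scheme, which matches the simplification that no correlation between models was allowed for.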
We ran all six models using men and women aged exactly 50, 60, 70, 80 and 90 years. It should be noted that several data sets include relatively few people at the upper and lower ends of this age range and that the published models of mortality and revision rates excluded individuals aged < 55 or > 85 years. 119,121 Results for patients aged 50 or 90 years should, therefore, be interpreted with caution, but may give an indication of incremental cost-effectiveness for these patients. A wide range of integer values for the OKS, OHS, WOMAC and SF-12 physical score were selected to cover the range of possible scores for the instrument, with greater concentration of scores in the region where preliminary analyses had shown the threshold to be. For the SF-12, results were also repeated for patients with SF-12 mental scores of 30, 50 or 70; although 30 and 70 represent very extreme values that were observed for very few patients, they were chosen to give an indication of the range of possible values. In total, between 490 and 1410 hypothetical individuals were analysed using point estimates for all model parameters and between 220 and 780 hypothetical individuals were analysed in PSA. For the WOMAC, OKS and OHS models, 2000 PSA runs were conducted; for the SF-12 models, 1000 PSA runs were conducted, because there was insufficient computing time for 2000 runs for each of the three mental scores.
We also present the weighted average across men and women and across ages; unless otherwise stated, all figures averaging across sexes and/or ages are weighted by the proportion of people in each group. The proportion of men and women in different age groups was calculated using the number of procedures per 100,000 people aged ≥ 10 years by age and sex published in the final PROMs report for 2013–14162 and the corresponding population numbers from the Office for National Statistics’ Mid-2013 Population Estimates in England163 (see Online Supplement 12). These proportions were multiplied by the total number of primary arthroplasty procedures conducted solely for osteoarthritis in England in 2014–15 (76,617 knee replacements and 69,313 hip replacements) to give patient numbers. The total number of primary arthroplasty procedures (79,726 knee replacements and 77,880 hip replacements) was calculated from HES based on the number of finished consultant episodes for the OPCS (Office of Population Censuses and Surveys Classification of Surgical Operations and Procedures) codes beginning with O or W that the NJR used to identify primary hip/knee replacement and was multiplied by the proportion of all primary arthroplasty procedures conducted solely for osteoarthritis (96% for knees and 89% for hips). 154,164,165 Data from PROMs/HES, the APEX study, KAT and EPOS were used to calculate the proportion of people with different clinical tool scores (see Online Supplement 12). We allowed for the fact that the age distribution varies between men and women and the fact that the distribution of SF-12 mental scores varies with physical scores. However, for simplicity and to ensure consistency between models on different clinical tools, we assumed that the distribution of clinical tool scores was independent of age and sex.
The base-case results represent the point estimates, keeping all parameters at their mean values. This approach was used because it was not feasible to run PSA for all analyses in time for the second user group meeting. Allowing for non-linearities by taking the expected value from PSA had very little effect on ICERs or thresholds (results not shown). 161
The costs and QALYs with and without TJA were calculated for each hypothetical individual and were used to calculate the cost per QALY gained for TJA versus no TJA; these ICERs are displayed in the decision grids shown in Results. The threshold clinical tool score was defined as the highest clinical tool score at which the ICER for TJA versus no arthroplasty was < £20,000 per QALY gained for patients of any given age, sex and, when applicable, SF-12 mental score.
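The threshold definition above can be illustrated with a few of the ICERs reported in Table 40. The function name and data layout are hypothetical; numeric ICERs are assumed to lie in the quadrant where TJA is both more costly and more effective.

```python
CEILING = 20_000  # £ per QALY gained

def threshold_score(icers_by_score, ceiling=CEILING):
    """Highest clinical tool score at which the ICER is below the ceiling.

    icers_by_score maps score -> ICER in £/QALY, or the label 'Dominant'
    (TJA cheaper and more effective) or 'Dominated' (TJA costlier and less effective).
    Returns None if TJA is not cost-effective at any listed score.
    """
    def cost_effective(icer):
        if icer == "Dominant":
            return True
        if icer == "Dominated":
            return False
        return icer < ceiling

    eligible = [score for score, icer in icers_by_score.items() if cost_effective(icer)]
    return max(eligible) if eligible else None

# Selected OKS values from Table 40.
age_50 = {20: "Dominant", 38: 10_733, 39: 15_226, 40: 24_418, 41: 54_138, 42: "Dominated"}
age_60 = {40: 8_761, 41: 11_690, 42: 16_983, 43: 29_746, 44: 99_241, 45: "Dominated"}

threshold_50 = threshold_score(age_50)
threshold_60 = threshold_score(age_60)
```

Applied to these rows, the function returns 39 for 50-year-olds and 42 for 60-year-olds, matching the thresholds given in the final row of Table 40.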
We also used PSA results to calculate 95% credible intervals (CrIs) around the threshold clinical score. These intervals were calculated by first examining the results of each individual PSA replicate to identify the threshold clinical tool score for that PSA draw (within each age/sex group). The 95% CrI limits for the threshold were assumed to equal the 2.5th percentile and the 97.5th percentile across the sets of PSA results. However, because PSA was run for only a finite number of hypothetical individuals owing to the long simulation time, only even-numbered SF-12 physical scores were evaluated and, for this reason, the 95% CrI around the SF-12 thresholds considers only even-numbered scores. Similarly, we only conducted PSA on WOMAC scores that were multiples of five or > 95. We also averaged incremental net benefits (INBs) (INB = Rc • ΔQALYs – ΔCost), calculated at a £20,000-per-QALY ceiling ratio (Rc), across men and women and also across ages, and calculated 95% CrIs for the thresholds based on these averaged results.
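A minimal sketch of the percentile step is shown below. It assumes the threshold has already been identified for each PSA draw and uses a nearest-rank percentile; the text does not say which percentile convention was used, so that choice, the function names and the example draws are all assumptions.

```python
import math

def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def credible_interval(thresholds_per_draw):
    """95% CrI limits: 2.5th and 97.5th percentiles of the per-draw thresholds."""
    return (percentile_nearest_rank(thresholds_per_draw, 2.5),
            percentile_nearest_rank(thresholds_per_draw, 97.5))

# Hypothetical thresholds from 100 PSA draws for one age/sex group.
thresholds = [39] * 10 + [40] * 60 + [41] * 25 + [42] * 5
cri = credible_interval(thresholds)
```

Because thresholds are only evaluated at the scores that were simulated, the interval limits can only take those values, which is the reason the SF-12 and WOMAC intervals are restricted to the reduced score grids.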
We also used data on the distribution of patients by age, sex and clinical tool score to calculate the number of people who currently undergo arthroplasty but would no longer have access to surgery if different threshold tool scores were introduced. We also calculated the net health benefit (or net harm) of stratifying access to TJA using different thresholds using published methods. 166,167 These estimates explicitly exclude patients who do not currently have arthroplasty but might gain access to surgery if national guidelines were to introduce a threshold that was higher than that currently used by local commissioners or primary health-care services; estimates of the impact of different thresholds that include this population of people are described in Chapter 8.
The expected value of stratifying access to TJA based on different clinical tool thresholds, age and sex (ValueofStratifying) was calculated by first multiplying the number of people in England of age (a), sex (s) and clinical tool score (c) ($N_{asc}$) by the incremental net (health) benefit of TJA calculated in the model for that patient group ($\mathrm{INB}_{asc} = R_c \cdot \Delta\mathrm{QALYs}_{asc} - \Delta\mathrm{Cost}_{asc}$). The expected value of introducing a specific threshold was calculated by summing these figures across all clinical tool scores less than or equal to the threshold (T) and subtracting that sum from 0:
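Written out from the verbal definition above for a single age/sex group, the calculation takes the following form. This is a transcription of the description in the text, with the subscripted symbols standing for the quantities defined there; the subscript on ValueofStratifying is added here only to mark the group.

```latex
\mathrm{ValueofStratifying}_{as}
  = 0 - \sum_{c \le T} N_{asc} \times \mathrm{INB}_{asc},
\qquad
\mathrm{INB}_{asc} = R_c \cdot \Delta\mathrm{QALYs}_{asc} - \Delta\mathrm{Cost}_{asc}
```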
For example, if the economic threshold was estimated to be 1 for tool (c) for people of age group (a) and sex (s) and the INB was 2 for the 100 patients with a score of 0 and 1 for the 200 patients with a score of 1, we would calculate the value of stratifying by clinical tool score in this demographic group as:
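Using the figures in this example, and following the verbal definition (the sum over scores at or below the threshold, subtracted from 0), the arithmetic works out as:

```latex
0 - \left( 100 \times 2 + 200 \times 1 \right) = 0 - 400 = -400
```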
We would then add these values across all demographic groups.
Because there may be practical and equity arguments against rationing access to surgery by age or sex, we also calculated the weighted average incremental costs and incremental QALYs for TJA versus no arthroplasty at different clinical tool scores, weighting by the proportion of people of different ages and sexes (see Online Supplement 12). These values were used to calculate the ICER for different clinical tool scores averaged across sexes or across all age/sex groups. We then calculated the value of stratifying only by clinical tool score, and the value of stratifying only by clinical tool score and age:
The proportion of PSA draws that find TJA to be cost-effective compared with no TJA (i.e. the proportion that have positive INB) was identified for each hypothetical patient group at different ceiling ratios. The probability that TKA is cost-effective was plotted against clinical tool score. For brevity, we present figures averaged over men and women by calculating the weighted average probability that TKA is cost-effective, weighted by the proportion of patients who are female.
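The calculation described above can be sketched as follows. The function names and the PSA draws are hypothetical; the INB formula is the one given earlier in this section (INB = Rc • ΔQALYs – ΔCost).

```python
def incremental_net_benefit(delta_qalys, delta_cost, ceiling_ratio):
    """INB = Rc * delta QALYs - delta cost, in monetary units."""
    return ceiling_ratio * delta_qalys - delta_cost

def probability_cost_effective(draws, ceiling_ratio):
    """Proportion of PSA draws in which TJA has a positive INB versus no TJA.

    draws: list of (delta_qalys, delta_cost) pairs, one per PSA replicate.
    """
    positive = sum(1 for dq, dc in draws
                   if incremental_net_benefit(dq, dc, ceiling_ratio) > 0)
    return positive / len(draws)

def sex_weighted(prob_women, prob_men, proportion_female):
    """Average over men and women, weighted by the proportion of patients who are female."""
    return proportion_female * prob_women + (1 - proportion_female) * prob_men

# Hypothetical PSA output for one patient group: (incremental QALYs, incremental cost in £).
draws = [(0.10, 1500), (0.05, 1500), (0.08, 1000), (-0.01, 1200)]
p_20k = probability_cost_effective(draws, 20_000)
p_40k = probability_cost_effective(draws, 40_000)
```

Repeating the calculation over a range of ceiling ratios, or over clinical tool scores at a fixed ceiling ratio, gives the points needed for the plotted curves.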
We also conducted four sensitivity analyses on each of the six Markov models:
- Taking a 5-year time horizon (cf. 10 years in the base-case analysis).
- Taking a 60-year (lifetime) time horizon.
- Assuming that EQ-5D utility without TJA worsens by 0.025 per year (cf. a 0.0036–0.0069 decrease per year, depending on age in the base-case analysis). The figure of 0.025 was based on the smallest measurable difference in the original study used to estimate the EQ-5D time trade-off tariff (3 months in 10 years). 168
- Assuming that EQ-5D without TJA increases by 0.115 per year in the first year of the model and then follows an age-related decline after that. The figure of 0.115 was based on the increase in EQ-5D in the non-surgical treatment arm of a recent RCT that compared TKA against 12 weeks’ non-surgical treatment comprising exercise, education, dietary advice, insoles and pain medication. 169 This figure is conservative because the non-surgical treatment used in the trial was relatively intensive and many NHS patients may have already received similar conservative management before being listed for surgery.
For brevity, and to reduce computation time, we present point estimates only for sensitivity analyses, not PSA. Furthermore, sensitivity analyses were run using the same reduced set of clinical tool score values used for PSA; consequently, the thresholds estimated for sensitivity analyses on the SF-12 and WOMAC are approximate and may slightly overestimate the true threshold, because costs and QALYs for hypothetical individuals who were not simulated were assumed to equal those for patients with a score one point higher than the score that was not simulated.
Results
Effect of scores on costs and utilities
Regression analyses demonstrated that the preoperative OKS, OHS, WOMAC and SF-12 physical and mental scores had a statistically significant effect on preoperative EQ-5D utility, EQ-5D utility 3–12 months after primary arthroplasty, EQ-5D utility > 12 months after primary arthroplasty and EQ-5D utility after knee revision surgery (p < 0.05; see Online Supplement 12). The OKS, OHS and SF-12, but not the WOMAC, had a significant effect on the cost of primary arthroplasty surgery (p < 0.05). The OHS had a significant effect on costs in the absence of hip arthroplasty (p = 0.003), but allowing for preoperative EQ-5D worsened prediction accuracy. However, within the analyses conducted before the second user group meeting, the clinical tool score was found to have no significant impact on EQ-5D utility before revision surgery, the cost of revision surgery and costs in the absence of knee arthroplasty (p > 0.05; see Online Supplement 12). The results of all regression models are shown in Online Supplement 12.
There was a strong relationship between preoperative clinical tool score and both preoperative and postoperative utility that varied slightly with age (Figures 14–19). The magnitude of the difference between preoperative and postoperative utility was, in turn, the main driver determining the QALYs gained from TJA, the ICERs and the threshold clinical tool scores.
FIGURE 14.
Effect of the preoperative OKS and age on preoperative and 6-month EQ-5D utility.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig14.png)
FIGURE 15.
Effect of the preoperative OHS and age on preoperative and 6-month EQ-5D utility, based on a published mapping algorithm and a two-part model of the relationship between log-OHS, age and sex estimated on PROMs data. 126
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig15.png)
FIGURE 16.
Effect of the preoperative WOMAC score and age on preoperative and 6-month EQ-5D utility in TKA patients, based on the APEX study data. Preoperative utilities were predicted based on a Tobit model that included quadratic and cubic terms, whereas postoperative utilities were modelled using linear regression.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig16.png)
FIGURE 17.
Effect of the preoperative WOMAC score and age on preoperative and 6-month EQ-5D utility in THA patients, based on linear regression on the APEX study data. Models of preoperative utility included quadratic and cubic terms for WOMAC.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig17.png)
FIGURE 18.
Effect of the preoperative SF-12 physical score on preoperative and 6-month EQ-5D utility in 70-year-old TKA patients at SF-12 mental scores of (a) 30, (b) 50 or (c) 70, based on Tobit models on KAT data that included polynomial terms for SF-12 scores.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig18a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig18b.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig18c.png)
FIGURE 19.
Effect of the preoperative SF-12 physical score on preoperative and 6-month EQ-5D utility in 70-year-old THA patients at SF-12 mental scores of (a) 30, (b) 50 or (c) 70. Postoperative utility was based on a GLM with log-link estimated on EPOS.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig19a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig19b.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig19c.png)
Figure 14 shows the relationship between the preoperative OKS and preoperative and postoperative EQ-5D utility estimated using regression equations used in the Markov model. Preoperative EQ-5D utility was estimated using a published linear mapping model,50 whereas 6-month utility was estimated on freely available PROMs data using a Tobit model that included quadratic and cubic terms for OKS. Both preoperative and postoperative utility increase sharply with preoperative OKS; however, the relationship between the preoperative OKS and 6-month EQ-5D utility is non-linear and is markedly less steep for patients with a baseline OKS of > 20. As a result, the change in utility following TKA is greatest at an OKS of 9 or 10, and declines steadily at higher scores until the difference becomes negative (i.e. TKA is expected to reduce quality of life) at an OKS of 41–44 or higher. Age had a relatively modest effect, with the postoperative utility being between 0.02 and 0.08 lower for 50-year-old patients than for those aged 70–90 years. After adjusting for age and OKS, gender was found to have no significant impact on utility after TKA and was dropped from the final model. Results for the OHS were similar (see Figure 15).
The relationship between preoperative WOMAC and preoperative and postoperative utility was similar to that for the OKS and OHS (see Figures 16 and 17). However, the models for WOMAC predicted that postoperative utility would be higher than preoperative utility at all baseline WOMAC scores. The linear models predicted postoperative EQ-5D utility to be > 1 for patients with high baseline WOMAC; such predictions were set to 1 in the model.
The relationship between preoperative SF-12 physical score and preoperative and postoperative EQ-5D utility varied with SF-12 mental score (see Figure 18). At a mental score of 30, the change in utility following TKA was very small and was negative for patients with physical scores of ≥ 36, whereas the change in utility was markedly larger for patients with high mental scores and declined more sharply with SF-12 physical score. The way in which SF-12 is calculated means that patients cannot simultaneously get very low scores on both the physical and mental scales, or get very high scores on both. In particular, the highest possible physical score for patients with a mental score of 70 is 42.13, which is lower than the physical score at which the models predict that the change in utility following TKA becomes negative; TKA is therefore predicted to be beneficial for all patients with high mental scores.
For THA, the relationship between SF-12 physical score and change in utility following THA (see Figure 19) was similar to that of TKA at high mental scores. However, for patients with a SF-12 mental score of 30, the curve for utility after THA crossed the curve for utility before THA in two places. As a result, the change in utility was negative for patients with low SF-12 physical scores as well as for those with high scores. For 70-year-olds, the physical scores at which the change in utility is negative are below the minimum possible score for patients with SF-12 mental scores of 30, although the models predict that for 90-year-olds there are some attainable combinations of low physical and mental scores at which the change in utility following THA is negative.
Because operative mortality is rare, relatively few patients have revision surgery and patients’ utility remains relatively constant after 6 months (see Online Supplement 12), the change in utility in the first 6 months after TKA is approximately proportional to the QALY gain from TKA. 119 Figures 14–19 therefore indicate the relationships that are driving the ICERs and clinical tool thresholds shown in the next six sections.
Effect of the Oxford Knee Score on the cost-effectiveness of knee replacement
The Markov models used the regression models predicting EQ-5D utilities, costs, mortality and revision rates shown in Online Supplement 12 to calculate the costs and QALYs that different patient subgroups would accrue over the 10-year period after TKA or after the decision was made not to operate. These results were used to calculate the incremental cost-effectiveness of TKA versus no arthroplasty in people of different ages with different preoperative OKSs (Table 40).
Preoperative OKS (selected values only) | Cost, age 50 years | Cost, age 60 years | Cost, age 70 years | Cost, age 80 years | Cost, age 90 years | Cost, average |
---|---|---|---|---|---|---|
0 | £2059 | £1643 | £1978 | £2781 | £4762 | £2155 |
10 | £33 | £221 | £566 | £1230 | £2778 | £655 |
20 | Dominant | £211 | £616 | £1377 | £3154 | £710 |
21 | £53 | £266 | £680 | £1460 | £3291 | £778 |
24 | £378 | £508 | £956 | £1812 | £3858 | £1071 |
28 | £1167 | £1061 | £1597 | £2627 | £5151 | £1748 |
29 | £1457 | £1254 | £1828 | £2924 | £5624 | £1990 |
30 | £1799 | £1475 | £2099 | £3276 | £6187 | £2274 |
31 | £2207 | £1730 | £2418 | £3696 | £6863 | £2608 |
32 | £2697 | £2025 | £2800 | £4204 | £7689 | £3006 |
33 | £3296 | £2370 | £3263 | £4830 | £8717 | £3485 |
34 | £4044 | £2778 | £3834 | £5619 | £10,027 | £4073 |
35 | £5003 | £3267 | £4556 | £6641 | £11,752 | £4811 |
36 | £6277 | £3865 | £5499 | £8020 | £14,125 | £5765 |
37 | £8058 | £4614 | £6786 | £9981 | £17,593 | £7049 |
38 | £10,733 | £5582 | £8652 | £12,998 | £23,145 | £8874 |
39 | £15,226 | £6888 | £11,616 | £18,258 | £33,482 | £11,687 |
40 | £24,418 | £8761 | £17,084 | £29,791 | £59,554 | £16,616 |
41 | £54,138 | £11,690 | £30,672 | £75,796 | £253,711 | £27,568 |
42 | Dominated | £16,983 | £124,962 | Dominated | Dominated | £73,754 |
43 | Dominated | £29,746 | Dominated | Dominated | Dominated | Dominated |
44 | Dominated | £99,241 | Dominated | Dominated | Dominated | Dominated |
45 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
46 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
47 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
48 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
Threshold (95% CrI) | 39 (37 to 43) | 42 (41 to 45) | 40 (39 to 42) | 39 (38 to 40) | 37 (24 to 41) | 40 (39 to 42) |
The results show that, as expected, TKA is highly cost-effective for most patients who currently undergo surgery (see Table 40). The light blue area in Table 40 and subsequent decision grids indicates patient groups for which TKA is cost-effective (i.e. costs < £20,000 per QALY gained compared with no arthroplasty). Table 40 shows only selected OKS values, focusing on values close to the threshold; for the OKS values not shown in the decision grids, TKA costs < £20,000 per QALY gained. Based on the distribution of the preoperative OKS, age and sex within the final PROMs/HES extract (see Online Supplement 12), TKA costs < £20,000 per QALY gained compared with no arthroplasty in 99.7% of patients who currently undergo surgery and it costs < £5000 per QALY gained in 97.4% of patients. TKA was most cost-effective (i.e. had the lowest cost per QALY gained) for those patients with low preoperative OKSs.
The QALY gains from TKA are highest for patients with a preoperative OKS of around 10 and decrease steadily as the OKS increases (mean incremental QALYs: 2.570 at OKS of 0, 3.301 at OKS of 9 and –0.355 at OKS of 48, averaged over all age and sex groups). This follows naturally from the difference in EQ-5D utility before and after TKA shown in Figure 14. The difference in cost between patients undergoing TKA and those having non-operative management was highest for patients with very low and very high OKS, and lowest for patients with scores between 14 and 21 (mean incremental cost: £5539 at OKS of 0, £6055 at OKS of 48 and £1710 at OKS of 16, averaged over all age and sex groups).
Cost-effectiveness results were primarily driven by the difference in QALYs, and the ICER (calculated as the difference in cost divided by the difference in QALYs) increased sharply as the difference in QALYs approached zero. For any given age group, the ICER for TKA versus no arthroplasty was very low for patients with an OKS below around 35, but increased sharply as the OKS increased and the difference in quality of life between patients with and without TKA approached zero. For those patient subgroups shown in dark or medium green in Table 40, TKA costs > £20,000 per QALY gained; although arthroplasty improves quality of life, it is not cost-effective compared with the cost-effectiveness threshold that is generally used in NHS decision-making. For patient subgroups with very high OKSs (the dominated area, shown in white, in Table 40), TKA is dominated: it increases costs to the NHS and, on average, people in these groups will have fewer QALYs than those having no arthroplasty because their EQ-5D utility is expected to be lower with TKA than without and (to a lesser extent) because of surgical mortality.
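As a minimal sketch of the arithmetic described above (not the report's Markov model), the Python below computes an ICER and labels the result against the £20,000-per-QALY ceiling. The function names `icer` and `classify` are illustrative assumptions; the example inputs are the mean incremental costs and QALYs quoted above for OKSs of 0 and 48.

```python
from typing import Optional

CEILING = 20_000  # £ per QALY gained, the ceiling ratio used in the text


def icer(delta_cost: float, delta_qaly: float) -> Optional[float]:
    """Return the cost per QALY gained, or None when no QALYs are gained."""
    if delta_qaly <= 0:
        return None
    return delta_cost / delta_qaly


def classify(delta_cost: float, delta_qaly: float, ceiling: float = CEILING) -> str:
    """Label arthroplasty relative to no arthroplasty (illustrative labels)."""
    if delta_qaly > 0 and delta_cost <= 0:
        return "dominant"  # more QALYs at no extra cost
    if delta_qaly <= 0 and delta_cost >= 0:
        return "dominated"  # no QALY gain and no saving
    if delta_qaly <= 0:
        # Cheaper but less effective: a case not discussed in the text.
        return "cost-saving but less effective"
    ratio = delta_cost / delta_qaly
    return "cost-effective" if ratio < ceiling else "not cost-effective"


# Mean values averaged over all age and sex groups, as quoted in the text.
low_oks = classify(5539, 2.570)    # OKS of 0
high_oks = classify(6055, -0.355)  # OKS of 48
```

Because the ICER is a ratio, it grows without bound as the QALY difference shrinks towards zero, which is why the tabulated values rise so steeply just below the threshold.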
For any given age group, we can identify the economic threshold as the score at the bottom of the light blue area in Table 40. The economic threshold varies with age, being highest for 60-year-olds [42 (95% CrI 41 to 45)] and lowest for 90-year-olds [37 (95% CrI 24 to 41)]. The economic threshold is slightly lower for older patients because they have lower life expectancy and therefore enjoy the quality of life improvements for a shorter period of time. 125 Similarly, thresholds were somewhat lower for 50-year-olds as they have higher revision rates and lower postoperative EQ-5D utility than patients aged 60–80 years. 119
Cost-effectiveness also varied slightly between men and women, with all ICERs being slightly lower for women than for men (see Online Supplement 12). However, the threshold OKS did not differ.
Averaging costs and QALYs across all age and sex groups suggests that if a single threshold were to be applied for patients of all ages, the threshold OKS would be 40 (95% CrI 39 to 42).
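The threshold definition used here (the highest score at which TKA costs < £20,000 per QALY gained) can be sketched as a simple lookup over the point estimates in Table 40. This is an illustrative reading of the table only: the function name and the encoding of "Dominated" cells as `None` are assumptions, and the reported thresholds and CrIs come from the full model rather than from this shortcut.

```python
from typing import Mapping, Optional

CEILING = 20_000  # £ per QALY gained

# Selected rows of Table 40 (cost per QALY gained); None encodes "Dominated".
# A "Dominant" cell could be encoded as 0.0 under this assumed scheme.
AVERAGE = {38: 8874, 39: 11687, 40: 16616, 41: 27568, 42: 73754, 43: None, 44: None}
AGE_60 = {41: 11690, 42: 16983, 43: 29746, 44: 99241, 45: None}


def economic_threshold(
    icers: Mapping[int, Optional[float]], ceiling: float = CEILING
) -> Optional[int]:
    """Return the highest score whose ICER is below the ceiling, if any."""
    cost_effective = [
        score
        for score, value in icers.items()
        if value is not None and value < ceiling
    ]
    return max(cost_effective) if cost_effective else None


average_threshold = economic_threshold(AVERAGE)
age_60_threshold = economic_threshold(AGE_60)
```

Applied to the average and age-60 columns, the lookup reproduces the point thresholds shown in the final row of Table 40.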
However, there was uncertainty around the results, particularly for 90-year-olds (see Table 40) and patients with an OKS close to the economic threshold. Figure 20 shows how the probability that TKA is cost-effective compared with no arthroplasty (i.e. costs < £20,000 per QALY gained) changes with the amount that the NHS is willing or able to pay to gain 1 QALY. For 70-year-olds with preoperative OKSs of < 40, we can be > 99% confident that TKA is good value for money. However, this decreases to 73% at an OKS of 40 and to 18% at an OKS of 41.
FIGURE 20. Effect of the OKS on the probability that TKA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig20.png)
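The probability of cost-effectiveness reported from the PSA can be illustrated as the share of replicates in which TKA is worth its extra cost at a given ceiling ratio. The sketch below is an assumption about how such a probability is typically computed, not the report's code; the function name and the four replicates are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from typing import Sequence, Tuple

Replicate = Tuple[float, float]  # (incremental cost in £, incremental QALYs)


def probability_cost_effective(
    replicates: Sequence[Replicate], ceiling: float = 20_000
) -> float:
    """Share of replicates with positive incremental net monetary benefit."""
    if not replicates:
        raise ValueError("at least one replicate is required")
    favourable = sum(
        1 for d_cost, d_qaly in replicates if ceiling * d_qaly - d_cost > 0
    )
    return favourable / len(replicates)


# Hypothetical PSA output for a patient group close to the threshold.
draws = [(2000, 0.5), (2500, 0.2), (3000, 0.1), (2800, -0.05)]

at_20k = probability_cost_effective(draws)
at_40k = probability_cost_effective(draws, ceiling=40_000)
```

Repeating the calculation over a range of ceiling ratios traces out how the probability changes with the amount the NHS is willing or able to pay per QALY.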
Effect of the Western Ontario and McMaster Universities Arthritis Index on the cost-effectiveness of total knee arthroplasty
To facilitate the comparison of results across the different clinical tools, the results presented here are based on a reversed and rescaled WOMAC total score, in which 0 represents the worst possible score and 100 represents the best possible score. As in the previous section, incremental cost-effectiveness was calculated using the Markov model, in which EQ-5D utility, costs, revision rates and mortality were predicted based on a patient’s preoperative characteristics.
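The reversal and rescaling can be sketched as follows. The raw scoring range is an assumption: the sketch takes the Likert-format WOMAC total of 0 to 96 (24 items scored 0 to 4, higher meaning worse), which the text above does not state, and the function name is hypothetical.

```python
RAW_MAX = 96  # assumption: 24 items, each scored 0-4, higher = worse


def rescale_womac(raw_total: float, raw_max: float = RAW_MAX) -> float:
    """Map a raw total (0 = best) onto 0-100 where 0 is worst and 100 is best."""
    if not 0 <= raw_total <= raw_max:
        raise ValueError("raw_total must lie between 0 and raw_max")
    return 100 * (raw_max - raw_total) / raw_max


worst = rescale_womac(96)
best = rescale_womac(0)
```

With this orientation, higher values mean better function for the OKS, OHS and WOMAC alike, so thresholds can be read in the same direction across tools.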
For almost all WOMAC total scores, TKA is highly cost-effective (Table 41). The QALY gains were highest for patients with low WOMAC scores (mean incremental QALYs at a score of 0 = 6.428) and decreased steadily with WOMAC score (mean incremental QALYs at a score of 50 = 1.509; at a score of 100 = 0.109). Similarly, incremental costs for patients with TKA compared with patients without TKA were lowest for patients with WOMAC scores of 5 (mean incremental cost = £2) and highest for patients with high scores (mean incremental costs for a score of 100 = £3361).
WOMAC score (rescaled to 0–100; 0 indicates poor function) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
0 | Dominant | Dominant | £57 | £354 | £949 | £83 |
10 | Dominant | Dominant | £14 | £460 | £1350 | £68 |
20 | Dominant | £28 | £313 | £940 | £2276 | £401 |
30 | £262 | £362 | £713 | £1588 | £3574 | £851 |
40 | £647 | £706 | £1131 | £2312 | £5168 | £1329 |
50 | £989 | £996 | £1489 | £2968 | £6816 | £1743 |
60 | £1265 | £1220 | £1768 | £3490 | £8321 | £2064 |
70 | £1522 | £1427 | £2026 | £3993 | £9895 | £2366 |
80 | £1827 | £1677 | £2345 | £4707 | £12,190 | £2755 |
85 | £2289 | £2055 | £2873 | £6029 | £17,305 | £3408 |
86 | £2472 | £2201 | £3085 | £6609 | £19,984 | £3674 |
87 | £2689 | £2371 | £3331 | £7295 | £23,561 | £3983 |
88 | £2943 | £2566 | £3617 | £8118 | £28,555 | £4343 |
89 | £3236 | £2788 | £3944 | £9117 | £35,977 | £4759 |
90 | £3578 | £3040 | £4321 | £10,350 | £48,101 | £5244 |
91 | £3978 | £3327 | £4758 | £11,901 | £71,295 | £5813 |
92 | £4451 | £3656 | £5267 | £13,899 | £132,863 | £6486 |
93 | £5014 | £4033 | £5863 | £16,549 | £763,143 | £7289 |
94 | £5688 | £4466 | £6565 | £20,207 | Dominated | £8256 |
95 | £6504 | £4965 | £7396 | £25,534 | Dominated | £9432 |
96 | £7503 | £5538 | £8387 | £33,931 | Dominated | £10,881 |
97 | £8739 | £6198 | £9578 | £48,964 | Dominated | £12,693 |
98 | £10,294 | £6957 | £11,019 | £83,197 | Dominated | £14,997 |
99 | £12,286 | £7829 | £12,780 | £235,097 | Dominated | £17,997 |
100 | £14,902 | £8830 | £14,958 | Dominated | Dominated | £22,018 |
Threshold (95% CrI) | 100 (80 to 100) | 100 (85 to 100) | 100 (80 to 100) | 93 (65 to 99) | 86 (40 to 90) | 99 (80 to 100) |
The economic threshold averaged across age and gender was identified as 99 (95% CrI 80 to 100; see Table 41), which is very close to the maximum value of 100. It should be noted that the highest observed WOMAC score in the APEX study data was 87; the identified threshold value of 99 lies outside the range observed in the sample. However, there was considerable variation with age. For patients aged ≤ 70 years, TKA is cost-effective at all WOMAC scores. TKA ceases to be cost-effective at a WOMAC score of 93 (95% CrI 65 to 99) for patients aged 80 years, and the threshold is 86 (95% CrI 40 to 90) for those aged 90 years. Again, this reflects the lower life expectancy at these ages, which means that the small benefits of TKA for patients with a relatively good clinical score (see Figure 16) are not accrued for as long a period as they are for younger patients.
There is considerable uncertainty around these threshold values, as shown by the wide 95% CrIs. For example, the 95% CrI for the threshold averaged over all age and gender groups (99) ranges from 80 to 100. Similarly, the 95% CrI for patients aged 90 years ranges from 40 to 90, covering half of all possible scores. This is likely because of the relatively small sample sizes of the data sets used for the regression models (the regressions of WOMAC on EQ-5D utility were based on 221 patients whereas those on costs of the primary operation were based on 272 patients).
There was slight variation in cost-effectiveness between men and women. At higher ages (80 and 90 years), the threshold for men was (in both cases) 3 points higher than for women (95 vs. 92 at age 80 years and 87 vs. 84 at age 90 years).
The PSA demonstrated that there is greater uncertainty around the results for the WOMAC than for the OKS (Figure 21). For 70-year-olds, the probability that TKA was cost-effective was > 99% for patients with WOMAC scores of ≤ 65 and fell to 96% for patients with a score of 90, 78% for a score of 98 and 62% for a score of 100. For 90-year-olds, the probability that TKA was cost-effective was < 99% for scores of ≥ 40 and was < 0.1% for scores of ≥ 96.
FIGURE 21. Effect of the WOMAC on the probability that TKA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig21.png)
Effect of the Short Form questionnaire-12 items on the cost-effectiveness of total knee arthroplasty
For the SF-12, results were estimated for a range of physical scores at each of three mental scores (30, 50 and 70). The SF-12 scoring system decreases the physical scores for patients who have good mental health or good social function and decreases the mental scores for patients with poor physical or general health. As a result, it is not possible for the same person to have the lowest (or highest) scores on both the physical domain and the mental domain. Results are only presented for those combinations of scores that are possible on the SF-12 scoring system.
For all three mental scores, the incremental cost of TKA versus no arthroplasty changed very little with SF-12 physical score, whereas QALYs decreased very sharply. For example, for a mental score of 50, the average incremental cost increased from £1954 per patient at a physical score of 15 to £3169 per patient at a physical score of 60, whereas incremental QALYs decreased from 3.591 at a physical score of 12 to –1.388 at a physical score of 60.
At a mental score of 30 (indicating very poor mental health), the threshold physical score was 58 (95% CrI 54 to 60), averaging across all ages (Table 42). For patients with a mental score of 50 (indicating average mental health), thresholds were generally markedly lower than they were at a mental score of 30, and the effect of age was more pronounced. Averaging across ages, the threshold physical score at a mental score of 50 was 50 (95% CrI 48 to 52) (Table 43).
SF-12 physical score (mental score of 30) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
18 | £367 | £440 | £682 | £1273 | £3586 | £789 |
19 | £321 | £400 | £652 | £1257 | £3620 | £760 |
20 | £287 | £370 | £631 | £1247 | £3658 | £739 |
25 | £236 | £314 | £594 | £1249 | £3860 | £708 |
30 | £279 | £334 | £616 | £1297 | £4088 | £738 |
35 | £367 | £391 | £672 | £1389 | £4440 | £807 |
40 | £509 | £490 | £774 | £1562 | £5164 | £931 |
45 | £769 | £669 | £969 | £1919 | £7018 | £1172 |
49 | £1206 | £953 | £1282 | £2544 | £12,030 | £1573 |
50 | £1391 | £1065 | £1404 | £2806 | £15,284 | £1734 |
51 | £1633 | £1204 | £1557 | £3145 | £21,478 | £1939 |
52 | £1966 | £1384 | £1752 | £3600 | £37,694 | £2206 |
53 | £2450 | £1624 | £2007 | £4236 | £186,187 | £2566 |
54 | £3216 | £1957 | £2355 | £5181 | Dominated | £3078 |
55 | £4610 | £2449 | £2852 | £6722 | Dominated | £3853 |
56 | £7924 | £3247 | £3618 | £9648 | Dominated | £5161 |
57 | £25,718 | £4757 | £4937 | £17,221 | Dominated | £7803 |
58 | Dominated | £8646 | £7707 | £80,227 | Dominated | £15,819 |
59 | Dominated | £40,402 | £17,051 | Dominated | Dominated | Dominated |
60 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
Threshold (95% CrI) | 56 (52 to 60) | 58 (54 to 60) | 59 (56 to 60) | 57 (53.95 to 60) | 50 (44 to 54.05) | 58 (54 to 60) |
SF-12 physical score (mental score of 50) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
12 | Dominant | £198 | £506 | £1101 | £3255 | £581 |
15 | Dominant | £175 | £516 | £1182 | £3736 | £601 |
20 | Dominant | £225 | £598 | £1377 | £4736 | £707 |
25 | £108 | £322 | £718 | £1614 | £6028 | £855 |
30 | £263 | £447 | £866 | £1909 | £7970 | £1039 |
35 | £483 | £619 | £1075 | £2351 | £12,162 | £1303 |
38 | £692 | £775 | £1265 | £2784 | £19,302 | £1551 |
39 | £787 | £843 | £1347 | £2981 | £24,634 | £1661 |
40 | £902 | £923 | £1444 | £3219 | £34,769 | £1791 |
42 | £1229 | £1133 | £1696 | £3877 | £326,196 | £2139 |
43 | £1473 | £1276 | £1864 | £4349 | Dominated | £2380 |
47 | £4823 | £2490 | £3202 | £9381 | Dominated | £4550 |
48 | £9637 | £3230 | £3938 | £13,673 | Dominated | £5978 |
49 | £193,008 | £4562 | £5129 | £25,787 | Dominated | £8763 |
50 | Dominated | £7654 | £7374 | £272,840 | Dominated | £16,538 |
51 | Dominated | £22,689 | £13,137 | Dominated | Dominated | £152,507 |
52 | Dominated | Dominated | £59,560 | Dominated | Dominated | Dominated |
53 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
60 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
Threshold (95% CrI) | 48 (44 to 52) | 50 (48 to 52) | 51 (48 to 52) | 48 (46 to 50) | 38 (25 to 44) | 50 (48 to 52) |
For patients with a mental score of 70 (indicating very good mental health), the highest possible physical score is 42. The threshold physical score is above the maximum that can be achieved at this mental score for all groups other than 90-year-olds (Table 44) and 50- or 80-year-old men (see Online Supplement 13). Averaging across all ages, the threshold physical score is 44 (95% CrI 42 to 46) (see Table 44); this is 2 points higher than the maximum achievable.
SF-12 physical score (mental score of 70) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
12 | Dominant | Dominant | £274 | £845 | £2630 | £314 |
15 | Dominant | Dominant | £353 | £1000 | £3189 | £407 |
20 | Dominant | £81 | £508 | £1309 | £4485 | £589 |
25 | Dominant | £209 | £695 | £1706 | £6689 | £813 |
30 | Dominant | £373 | £940 | £2282 | £11,907 | £1118 |
31 | £5 | £414 | £1002 | £2437 | £14,066 | £1197 |
32 | £44 | £459 | £1070 | £2614 | £17,205 | £1285 |
33 | £89 | £511 | £1147 | £2818 | £22,217 | £1386 |
34 | £141 | £569 | £1235 | £3057 | £31,542 | £1501 |
35 | £204 | £637 | £1336 | £3342 | £55,118 | £1637 |
36 | £282 | £717 | £1454 | £3690 | £233,756 | £1798 |
37 | £382 | £814 | £1594 | £4126 | Dominated | £1994 |
38 | £516 | £932 | £1763 | £4687 | Dominated | £2238 |
39 | £703 | £1083 | £1973 | £5442 | Dominated | £2551 |
40 | £986 | £1279 | £2241 | £6509 | Dominated | £2967 |
41 | £1461 | £1546 | £2594 | £8136 | Dominated | £3548 |
42 | £2421 | £1930 | £3080 | £10,919 | Dominated | £4416 |
Threshold (95% CrI) | 43 (38 to 46) | 44 (42 to 48) | 44 (44 to 48) | 43 (40 to 46) | 32 (25 to 38) | 44 (42 to 46) |
The ICERs and threshold SF-12 physical scores varied with age, with 70-year-olds having the highest threshold SF-12 physical score and (generally) the lowest ICERs. Thresholds were markedly lower for 90-year-olds. There was also substantially more uncertainty around thresholds for older patients, shown by the wider 95% CrI. Thresholds were between 0 and 5 points higher for women than for men for subgroups aged 50–80 years, with large differences between sexes for 90-year-olds (see Online Supplement 13).
The PSA demonstrated that we can be > 99% confident that TKA is cost-effective for 70-year-olds with SF-12 mental scores of 50 and SF-12 physical scores of ≤ 48 at a £20,000-per-QALY ceiling ratio (Figure 22). However, as would be expected, there is substantially more uncertainty around cost-effectiveness for patients around the threshold: at a mental score of 50, the probability that TKA is cost-effective for patients with an SF-12 physical score of 50 is 75%, which decreases to 40% for patients with a physical score of 52 and to 0% for patients with a score of 56. The level of uncertainty was similar across the range of mental scores, but was markedly greater for 90-year-olds.
FIGURE 22. Effect of the SF-12 score on the probability that TKA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig22.png)
Effect of the Oxford Hip Score on cost-effectiveness of total hip arthroplasty
The incremental cost of THA compared with no arthroplasty increased with the OHS, from £779 per patient at an OHS of 0 up to £4258 at an OHS of 48 (averaged across men and women of all ages). THA was less costly than no arthroplasty for 50- and 60-year-old patients with an OHS of 0 or 1, because the cost of non-surgical management was highest for patients with very low OHSs. The QALY gains from surgery were highest for patients with an OHS of 6 or 7: THA produced an increase of 3.68 QALYs for the average patient with an OHS of 0, 4.47 QALYs at an OHS of 6 and –0.61 QALYs at an OHS of 48. Therefore, THA dominated no arthroplasty for 50- and 60-year-olds with an OHS of 0 or 1, given that it was less costly and generated more QALYs.
The threshold OHS above which THA ceased to be cost-effective was 43 (95% CrI 42 to 43) for 70-year-olds, but fell to 38 (95% CrI 34 to 41) for 90-year-olds (Table 45). The analysis averaging across all age groups suggested that if it was not acceptable to have different thresholds for different age groups, a single threshold of 42 (95% CrI 42 to 42) would be most appropriate. PSA demonstrated that there is very little uncertainty around the economic threshold, with 95% of the 2000 PSA replicates indicating a threshold of 42 when results were averaged across men and women of all age groups. The difference between men and women was negligible: the threshold was 1 point higher for 80-year-old men and 1 point lower for 70-year-old women, although thresholds for men and women were otherwise the same as those shown in Table 45 (see Online Supplement 13).
Preoperative OHS (selected values only) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
0 | Dominant | Dominant | £326 | £571 | £1208 | £212 |
10 | £630 | £616 | £737 | £958 | £1885 | £779 |
20 | £1055 | £966 | £1029 | £1273 | £2171 | £1105 |
30 | £1936 | £1699 | £1762 | £2214 | £4101 | £1928 |
35 | £3080 | £2600 | £2737 | £3556 | £7682 | £3027 |
36 | £3480 | £2902 | £3071 | £4042 | £9310 | £3410 |
37 | £3994 | £3280 | £3497 | £4680 | £11,816 | £3901 |
38 | £4682 | £3769 | £4058 | £5556 | £16,172 | £4554 |
39 | £5648 | £4426 | £4830 | £6832 | £25,631 | £5467 |
40 | £7105 | £5355 | £5960 | £8865 | £61,836 | £6832 |
41 | £9557 | £6769 | £7775 | £12,616 | Dominated | £9097 |
42 | £14,554 | £9187 | £11,169 | £21,853 | Dominated | £13,592 |
43 | £30,346 | £14,265 | £19,787 | £81,441 | Dominated | £26,810 |
44 | Dominated | £31,771 | £86,085 | Dominated | Dominated | £910,589 |
45 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
46 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
47 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
48 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
Threshold (95% CrI) | 42 (41 to 43) | 43 (43 to 43) | 43 (42 to 43) | 41 (41 to 42) | 38 (34 to 41) | 42 (42 to 42) |
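One common way to obtain an interval such as the 95% CrIs in the threshold row is to take the 2.5th and 97.5th percentiles of the threshold across PSA replicates. The report does not describe the method at this point, so the percentile approach below is an assumption, and the 2000 replicate values are hypothetical rather than the study's output.

```python
import statistics
from typing import Sequence, Tuple


def percentile_interval(thresholds: Sequence[float]) -> Tuple[float, float]:
    """Return the 2.5th and 97.5th percentiles of the replicate thresholds."""
    # n=40 yields 39 cut points spaced 2.5 percentage points apart, so the
    # first and last cut points bound the central 95% of the replicates.
    cuts = statistics.quantiles(thresholds, n=40)
    return cuts[0], cuts[-1]


# Hypothetical replicates: most agree on 42, with a few on either side.
replicates = [41] * 40 + [42] * 1920 + [43] * 40

interval = percentile_interval(replicates)
```

When nearly all replicates agree on one score, both percentiles coincide with it, giving a degenerate interval like the "42 (42 to 42)" reported for the average column.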
The PSA demonstrated that there was relatively little uncertainty around cost-effectiveness for patients aged ≤ 80 years (Figure 23). For 70-year-olds, the probability that THA cost < £20,000 per QALY gained was > 99% for patients with an OHS of ≤ 42 and fell to 47% for patients with an OHS of 43 and to < 0.1% for patients with an OHS of ≥ 44. Uncertainty was markedly greater for 90-year-olds over a wide range of OHSs.
FIGURE 23. Effect of the OHS on the probability that THA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig23.png)
Effect of the Western Ontario and McMaster Universities Arthritis Index on the cost-effectiveness of total hip arthroplasty
The QALY gains from THA were highest for patients with low WOMAC scores (mean incremental QALYs at a score of 0 = 8.228) and decreased for patients with higher scores (mean incremental QALYs at a score of 50 = 3.054, and at a score of 100 = 0.289). In contrast, the model results suggest that WOMAC scores do not predict costs of THA compared with no arthroplasty. The mean incremental costs were around £3200, regardless of WOMAC score.
Total hip arthroplasty was cost-effective at almost all WOMAC scores (Table 46). The only exception was patients aged 90 years with a WOMAC score of 100 (the best possible score). In contrast to the results for TKA, the 95% CrIs indicate little uncertainty about the threshold; for almost all age groups, the CrIs range from 98 to 100. There are almost no differences in cost-effectiveness between men and women, the only exception being patients aged 90 years with a clinical score of 99, among whom THA is cost-effective for women but not for men. Again, it should be noted that there were no patients observed at these scores in the available data sets.
Preoperative WOMAC score (rescaled to 0–100; 0 = poor function; selected values only) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
0 | £372 | £366 | £399 | £499 | £743 | £424 |
10 | £427 | £418 | £454 | £574 | £878 | £485 |
20 | £510 | £496 | £537 | £683 | £1060 | £577 |
30 | £624 | £604 | £652 | £829 | £1294 | £702 |
40 | £762 | £735 | £790 | £1005 | £1570 | £852 |
50 | £922 | £885 | £949 | £1205 | £1885 | £1024 |
60 | £1115 | £1065 | £1137 | £1441 | £2257 | £1229 |
70 | £1395 | £1323 | £1406 | £1777 | £2791 | £1522 |
80 | £1915 | £1796 | £1899 | £2392 | £3774 | £2060 |
85 | £2391 | £2222 | £2342 | £2944 | £4663 | £2545 |
90 | £3227 | £2952 | £3094 | £3879 | £6188 | £3373 |
95 | £5069 | £4484 | £4650 | £5804 | £9402 | £5104 |
96 | £5742 | £5019 | £5186 | £6465 | £10,529 | £5706 |
97 | £6631 | £5707 | £5870 | £7306 | £11,980 | £6478 |
98 | £7861 | £6623 | £6771 | £8411 | £13,919 | £7504 |
99 | £9672 | £7906 | £8016 | £9929 | £16,644 | £8933 |
100 | £12,610 | £9830 | £9844 | £12,148 | £20,753 | £11,064 |
Threshold (95% CrI) | 100 (98 to 100) | 100 (99 to 100) | 100 (99 to 100) | 100 (98 to 100) | 99 (96 to 100) | 100 (99 to 100) |
The PSA suggested that there was relatively little parameter uncertainty within the model: the probability that THA was cost-effective at a £20,000-per-QALY ceiling ratio was ≥ 99% for all patients with WOMAC scores of ≤ 90 (Figure 24).
FIGURE 24. Effect of the WOMAC on the probability that THA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig24.png)
Effect of the Short Form questionnaire-12 items on the cost-effectiveness of total hip arthroplasty
For patients with SF-12 mental scores of 30 or 50, incremental QALYs were highest at physical scores of between 18 and 28 and were markedly lower for patients with higher or lower scores. By contrast, for patients with a mental score of 70, incremental QALYs were highest at a physical score of 12 and declined steadily with an increasing physical score. For all three mental scores, the incremental cost of THA versus no arthroplasty changed very little with SF-12 physical score. For example, for a mental score of 50, the average incremental cost (across men and women of all ages) decreased from £4267 per patient at a physical score of 19 to £3058 per patient at a physical score of 60, whereas incremental QALYs decreased from 3.42 to –0.26. As was the case for TKA, THA was dominated by no arthroplasty (generating fewer QALYs at a higher cost) for patients with very high physical scores. However, THA was also dominated for 90-year-old patients with extremely low physical scores and below-average mental scores.
For patients with very poor mental health (a mental score of 30), the threshold physical score was 59 (95% CrI 56 to 60), averaging across all ages (Table 47). Thresholds were slightly lower for patients with a mental score of 50 (i.e. average mental health), for whom the threshold physical score was 56 (95% CrI 54 to 58), averaging across ages (Table 48).
Preoperative SF-12 physical score (mental score of 30; selected values only) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
18 | £1268 | £1096 | £1204 | £2051 | Dominated | £1439 |
19 | £1205 | £1057 | £1166 | £1930 | Dominated | £1376 |
20 | £1159 | £1028 | £1138 | £1843 | £103,982 | £1329 |
21 | £1124 | £1007 | £1117 | £1778 | £23,002 | £1294 |
22 | £1098 | £991 | £1102 | £1730 | £13,811 | £1268 |
30 | £1043 | £977 | £1106 | £1650 | £5200 | £1236 |
40 | £1094 | £827 | £936 | £1659 | £5086 | £1151 |
50 | £1293 | £1245 | £1467 | £2448 | £11,212 | £1690 |
51 | £1392 | £1340 | £1590 | £2708 | £13,914 | £1837 |
52 | £1516 | £1456 | £1745 | £3049 | £18,758 | £2022 |
53 | £1672 | £1603 | £1945 | £3514 | £29,816 | £2262 |
54 | £1877 | £1794 | £2209 | £4180 | £79,333 | £2585 |
55 | £2154 | £2050 | £2573 | £5208 | Dominated | £3029 |
56 | £2547 | £2407 | £3104 | £6981 | Dominated | £3694 |
57 | £3144 | £2938 | £3941 | £10,715 | Dominated | £4781 |
58 | £4148 | £3801 | £5435 | £23,396 | Dominated | £6853 |
59 | £6164 | £5423 | £8798 | Dominated | Dominated | £12,230 |
60 | £12,140 | £9504 | £22,862 | Dominated | Dominated | £56,760 |
Threshold (95% CrI) | 60 (58 to 60) | 60 (58 to 60) | 59 (58 to 60) | 57 (54 to 60) | 52 (11 to 56) | 59 (56 to 60) |
Preoperative SF-12 physical score (mental score of 50; selected values only) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
12 | £1226 | £1091 | £1214 | £1987 | £580,230 | £1419 |
15 | £1079 | £996 | £1120 | £1711 | £7280 | £1268 |
20 | £1045 | £993 | £1050 | £1670 | £4454 | £1222 |
30 | £864 | £849 | £976 | £1488 | £4257 | £1103 |
40 | £1110 | £1105 | £1310 | £2051 | £4913 | £1455 |
50 | £2060 | £2056 | £2625 | £4737 | £17,623 | £2933 |
51 | £2311 | £2304 | £2990 | £5621 | £26,282 | £3351 |
52 | £2643 | £2630 | £3485 | £6936 | £52,926 | £3923 |
53 | £3103 | £3075 | £4189 | £9078 | Dominated | £4748 |
54 | £3772 | £3715 | £5257 | £13,126 | Dominated | £6027 |
55 | £4826 | £4699 | £7046 | £23,438 | Dominated | £8244 |
56 | £6702 | £6382 | £10,592 | £99,530 | Dominated | £12,941 |
57 | £10,892 | £9850 | £20,671 | Dominated | Dominated | £29,016 |
58 | £27,878 | £20,768 | £218,915 | Dominated | Dominated | Dominated |
59 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
60 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
Threshold (95% CrI) | 57 (54 to 58) | 57 (56 to 58) | 56 (54 to 58) | 54 (52 to 56) | 50 (40 to 54) | 56 (54 to 58) |
The threshold physical score for patients with a mental score of 70 (i.e. very good mental health) was substantially higher than the maximum that can be achieved at this mental score for all age groups (Table 49). Averaging across all ages, the models predicted that the threshold physical score is 55 (95% CrI 52 to 56) for a mental score of 70: 13 points higher than the maximum achievable score (42).
Preoperative SF-12 physical score (mental score of 70; selected values only) | Cost per QALY gained: age 50 years | Age 60 years | Age 70 years | Age 80 years | Age 90 years | Average |
---|---|---|---|---|---|---|
12 | £940 | £855 | £817 | £1419 | £3356 | £1020 |
15 | £691 | £662 | £730 | £1061 | £2976 | £826 |
20 | £749 | £726 | £809 | £1168 | £2572 | £898 |
25 | £836 | £816 | £921 | £1333 | £2709 | £1014 |
30 | £946 | £929 | £1063 | £1552 | £3094 | £1165 |
35 | £1096 | £1081 | £1256 | £1861 | £3700 | £1373 |
40 | £1328 | £1317 | £1562 | £2375 | £4822 | £1703 |
41 | £1393 | £1383 | £1649 | £2524 | £5169 | £1797 |
42 | £1467 | £1458 | £1748 | £2698 | £5586 | £1904 |
Threshold (95% CrI) | 57 (54 to 58) | 57 (56 to 58) | 56 (54 to 58) | 54 (52 to 56) | 50 (40 to 54) | 56 (54 to 58) |
The PSA suggested that there was very little uncertainty around the results for 70-year-olds (Figure 25). The probability that THA costs < £20,000 per QALY gained was > 99% for 70-year-olds with SF-12 physical scores of ≤ 52 and fell sharply as physical scores increased. However, there was markedly greater uncertainty for 90-year-olds across the range of physical scores.
FIGURE 25. Effect of the SF-12 physical score on the probability that THA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig25.png)
Sensitivity analysis
Sensitivity analyses demonstrated that the results are reasonably robust to large changes in two key assumptions (Table 50). Reducing the time horizon from 10 to 5 years reduced thresholds slightly, because it reduces the duration for which patients are assumed to experience the postarthroplasty EQ-5D utility. Conversely, increasing the time horizon to 60 years had minimal impact on thresholds.
Sensitivity analysis (economic thresholds ignoring age and sex) | TKA: OKS | TKA: WOMAC score | TKA: SF-12 physical score at mental score of 30 | TKA: SF-12 physical score at mental score of 50 | TKA: SF-12 physical score at mental score of 70 | THA: OHS | THA: WOMAC score | THA: SF-12 physical score at mental score of 30 | THA: SF-12 physical score at mental score of 50 | THA: SF-12 physical score at mental score of 70 |
---|---|---|---|---|---|---|---|---|---|---|
Base-case analysis | 40 | 98 | 58 | 49 | 44 | 42 | 100 | 59 | 56 | 56 |
5-year time horizon | 38 | 94 | 57 | 48 | 43 | 41 | 99 | 59 | 55 | 53 |
60-year (lifetime) time horizon | 40 | 98 | 57 | 49 | 44 | 42 | 100 | 60 | 57 | 55 |
EQ-5D utility without TJA worsens by 0.025 per year | 46 | 100 | 59 | 54 | 50 | 47 | 100 | 60 | 60 | 60 |
EQ-5D utility without TJA increases by 0.115 in the first year and follows age-related decline thereafter | 33 | 80 | 52 | 43 | 37 | 36 | 89 | 55 | 51 | 47 |
In the base-case analysis, we assumed that patients’ osteoarthritis symptoms would remain at a constant level in the absence of arthroplasty, with EQ-5D utility reducing by around 0.0036–0.0069 per year because of ageing. Increasing the rate at which EQ-5D utility decreases to an arbitrary rate of 0.025 per year (the smallest difference in valuations that was possible in the EQ-5D valuation study168) substantially increased all thresholds, because it reduced the number of QALYs accrued in the no arthroplasty arm. We also conducted an analysis that assumed that EQ-5D utility for patients in the no arthroplasty arm would increase by 0.115 in year 1 (the mean EQ-5D change in a trial comparing arthroplasty against an intensive non-surgical management intervention)169 and then decline at the same age-dependent rate assumed in the base-case analysis; this analysis substantially reduced thresholds.
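The mechanism behind this sensitivity analysis can be illustrated with a minimal sketch. It is not the Markov model used in the report: it ignores mortality and costs, and the starting utility of 0.6, the 3.5% annual discount rate and the function name are assumptions made purely for illustration. It shows only that a faster annual decline in utility lowers the QALYs accrued without arthroplasty, which is the reason given above for the higher thresholds.

```python
def qalys_without_arthroplasty(start_utility, annual_decline,
                               years=10, discount_rate=0.035):
    """Discounted QALYs when utility falls by a fixed amount each year.

    Illustrative only: start_utility and discount_rate are assumed values.
    """
    total = 0.0
    for year in range(years):
        # EQ-5D utility can be negative, so no floor at zero is applied
        utility = start_utility - annual_decline * year
        total += utility / (1.0 + discount_rate) ** year
    return total


# 0.005 lies within the base-case range of 0.0036-0.0069 per year
base_case = qalys_without_arthroplasty(0.6, 0.005)
# 0.025 is the arbitrary rate used in the sensitivity analysis
faster_decline = qalys_without_arthroplasty(0.6, 0.025)
print(f"base case: {base_case:.2f} QALYs; faster decline: {faster_decline:.2f} QALYs")
```

The lower the QALY total in the no arthroplasty arm, the larger the incremental gain from surgery at any given preoperative score.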
Discussion
Summary of the results
We were able to calculate thresholds for TKA and THA for the OKS, OHS, WOMAC and SF-12 physical scores that represent the highest score at which arthroplasty is cost-effective if the NHS is willing or able to pay £20,000 per QALY gained (Table 51). These thresholds were generally somewhat higher than the relative thresholds calculated in Chapter 4, but a little lower than the absolute thresholds. The economic thresholds are more similar to the absolute thresholds than relative thresholds as they are based on the score at which there is zero difference in net benefits, rather than a MID. However, unlike the clinical analyses, the economic evaluation measures outcomes using the EQ-5D rather than the score in question and takes account of mortality, revisions, complications and costs. The economic evaluation also explicitly compares costs and outcomes with arthroplasty against the costs and outcomes that are expected to be accrued without arthroplasty. Unlike the relative thresholds described in Chapter 4, the economic thresholds also evaluate outcomes on a continuous scale (being more like the linear regression models described in Chapter 4 than the logistic regression models) and are also based on the average (mean) benefit, rather than the probability of benefit. No MID is imposed on the data, and the analysis directly takes account of operative mortality and the cost of surgery and subsequent follow-up and revision surgery. Because arthroplasty is relatively cheap and produces such large QALY gains for those people who benefit, it is cost-effective for almost all patients who are expected to have any improvement in quality of life.
Clinical tool | Threshold (95% CrI) | QALYs gained from stratifyinga | Number of operations avoideda |
---|---|---|---|
Knees | |||
OKS | |||
Economic thresholdb | 40 (39 to 42) | 53 | 219 |
Clinical relative thresholdc | 32 | –2020 | 3342 |
WOMAC | |||
Economic thresholdb | 99 (80 to 100) | 0 | 0 |
Clinical relative thresholdc | 71 | –4234 | 4813 |
SF-12d | |||
Economic thresholdb | 50 (48 to 52) | 867 | 1584 |
Clinical relative thresholdc | 37 | –15,655 | 17,326 |
Hips | |||
OHS | |||
Economic thresholdb | 42 (42 to 42) | 25 | 100 |
Clinical relative thresholdc | 37 | –314 | 814 |
WOMAC | |||
Economic thresholdb | 100 (99 to 100) | 0 | 0 |
Clinical relative thresholdc | 80 | –1775 | 1411 |
SF-12d | |||
Economic thresholdb | 56 (54 to 58) | 54 | 224 |
Clinical relative thresholdc | 35 | –22,389 | 11,566 |
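The definition of an economic threshold used here (the highest score at which the difference in net benefits is still non-negative at £20,000 per QALY) can be sketched as follows. The QALY-gain and cost functions below are hypothetical placeholders, not outputs of the report's model; only the £20,000 ceiling ratio and the 0–48 Oxford score range come from the source.

```python
WILLINGNESS_TO_PAY = 20_000  # pounds per QALY gained (ceiling ratio from the text)


def incremental_net_benefit(delta_qalys, delta_cost, wtp=WILLINGNESS_TO_PAY):
    """Net monetary benefit of arthroplasty versus no arthroplasty."""
    return wtp * delta_qalys - delta_cost


def economic_threshold(scores, qaly_gain, extra_cost, wtp=WILLINGNESS_TO_PAY):
    """Highest preoperative score at which arthroplasty remains cost-effective."""
    cost_effective = [
        s for s in scores
        if incremental_net_benefit(qaly_gain(s), extra_cost(s), wtp) >= 0
    ]
    return max(cost_effective) if cost_effective else None


# Hypothetical inputs: QALY gain shrinks as the preoperative score rises,
# while the extra cost of surgery is held constant.
def example_qaly_gain(score):
    return 0.06 * (46.5 - score)


def example_extra_cost(score):
    return 6_000


threshold = economic_threshold(range(0, 49), example_qaly_gain, example_extra_cost)
print("illustrative economic threshold:", threshold)
```

Because the threshold is the point of zero incremental net benefit, no minimally important difference enters the calculation.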
We also calculated how the number of operations currently done in England would change using each of the six economic thresholds and the corresponding clinical relative thresholds, based on the distribution of scores in the best available data sets (see Online Supplement 12) and assuming that 76,617 knee replacements and 69,313 hip replacements are done each year (see Presentation of results and analysis of uncertainty). 154,165 This calculation has two parts: (1) operations currently done that would be avoided (based on those patients currently receiving operations who are above the threshold) and (2) operations not currently done that might be considered (based on patients who are below the threshold but at present are not listed for arthroplasty). The calculations presented in this chapter focused only on the first part (i.e. the number of current operations conducted on patients above the threshold).
If the economic threshold of 40 (ignoring age and sex) was applied to all 76,617 knee replacements done in England each year,154,165 then 219 operations could be avoided, because only 219 patients (0.29%) having knee replacement operations have an OKS of > 40 (see Table 51). If the NHS stopped doing knee arthroplasty on those 219 patients, the equivalent of 53 QALYs would be gained from avoiding undertaking surgery that would reduce quality of life in some of these 219 patients, and by saving money that could be spent on other patients who would gain additional health benefits. Similarly, applying the economic threshold of 42 would avoid 100 of the 69,313 hip replacements currently done each year,154,165 gaining 25 QALYs.
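The arithmetic behind these figures can be checked with a short sketch; the counts are taken from the text and the function name is illustrative.

```python
def percent_above_threshold(n_above, n_operations):
    """Share of current operations done on patients scoring above the threshold."""
    return 100.0 * n_above / n_operations


knee_share = percent_above_threshold(219, 76_617)   # OKS > 40
hip_share = percent_above_threshold(100, 69_313)    # OHS > 42

# QALYs gained per operation avoided under each economic threshold
knee_qalys_per_operation = 53 / 219
hip_qalys_per_operation = 25 / 100

print(f"knee: {knee_share:.2f}% of operations, "
      f"{knee_qalys_per_operation:.2f} QALYs per operation avoided")
print(f"hip: {hip_share:.2f}% of operations, "
      f"{hip_qalys_per_operation:.2f} QALYs per operation avoided")
```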
For comparison, the clinical analysis suggested a relative threshold for an OKS of 32; using this threshold, we would avoid 3342 operations, but this policy would be expected to reduce the amount of health benefits gained from the NHS budget by 2020 QALYs, by avoiding treatment in patients for whom knee replacement would improve quality of life. Similarly, the clinical relative threshold for hip replacement (37) would avoid 731 operations but reduce health by 284 QALYs.
The economic thresholds estimated for the WOMAC were extremely close to the top of the scale and WOMAC did not identify any patients aged < 90 years for whom THA is not cost-effective, nor any patients aged 60 or 70 years for whom TKA is not cost-effective. No patients recruited to the studies used to estimate EQ-5D utility and costs after TKA had WOMAC scores of > 87, whereas the highest score observed among THA patients was 93. Taken at face value, this implies that no patients who currently undergo surgery would be denied treatment based on the economic thresholds estimated in the analysis. However, owing to the lack of data in patients with high WOMAC scores, the results for patients with high WOMAC scores and the estimated WOMAC thresholds should be interpreted with caution as they are based on predictions extrapolated outside the range of the observed data. In particular, it seems implausible that TJA would be cost-effective in patients with a WOMAC score of 100, who would have no pain, stiffness or loss of function. Conversely, the clinical thresholds estimated for WOMAC are low relative to those for the OKS and OHS and would, therefore, be expected to avoid more operations, but with greater loss of patients’ health. Therefore, more research on larger data sets including patients with scores of > 85 is needed to estimate economic thresholds for the WOMAC.
The threshold SF-12 physical score varied markedly with SF-12 mental scores. As would be expected, thresholds were lower for patients with mental scores of 30 (poor mental health) than for patients with mental scores of 50 (average mental health). However, other than for TKA patients aged 90 years, the threshold physical score for patients with SF-12 mental scores of 70 (indicating very good mental health) was higher than the maximum score that can be achieved at that mental score. As a result, SF-12 cannot distinguish between those people who would benefit from arthroplasty and those who would not among those with very good mental health. Furthermore, the scoring of the SF-12 is such that people with very poor physical scores are assigned better mental scores than people who have good physical scores but the same level of mental health.
The model also suggested that THA (but not TKA) is poor value for money in 90-year-old patients with extremely low SF-12 physical scores and SF-12 mental scores of either 30 or 50. One possible explanation for this unexpected finding is that the generic SF-12 instrument captures a wide range of health problems, including comorbid conditions, whereas the OHS and WOMAC focus on hip or musculoskeletal symptoms. It is possible that these comorbid conditions are worsened by arthroplasty or that, in patients who have other conditions limiting their quality of life, THA produces relatively limited improvements in EQ-5D utility because patients still experience severe pain and limited function because of their other health problems. However, this result could also be an artefact of the way that SF-12 scores are calculated or the shortage of data in this patient group; only nine patients in EPOS were aged ≥ 90 years and no patients (of any age) had SF-12 mental scores of < 35 and SF-12 physical scores of < 22.
The results for the OKS are similar to those estimated previously for KAT by Dakin et al.:2 we estimate the economic threshold to be 40 (95% CrI 39 to 42), and Dakin et al. 2 estimated a threshold of 39 for patients with an ASA grade of 1 or 2 and a threshold of 34 for patients with an ASA grade of 3. However, the difference between men and women appears to be smaller in the current analysis than was suggested by Dakin et al. 2 These differences appear to be largely attributable to the differences in time horizon (5 years vs. 10 years in the current analysis), because the sensitivity analysis using a 5-year time horizon estimated the OKS threshold to be 38. Furthermore, the additional data sets available in this analysis (particularly COASt) enabled us to allow for the costs and quality-of-life changes likely to arise without arthroplasty. Although we used KAT data for costs and long-term outcomes, we also used data from published studies,50,112,119,125 COASt and national PROMs and used very different analytical methods, synthesising data from different sources in a Markov model rather than estimating total costs and total QALYs for individual trial participants in a within-trial analysis. In particular, PROMs provided a much larger sample of patients with very high OKSs. The current study also allowed for a wide range of non-linear models, which suggested that it may be appropriate to have lower thresholds for patients aged < 60 years. Given the differences in analytical approach and data inputs, the similarity of the thresholds in the two studies appears to suggest that the thresholds estimated are relatively robust.
Jenkins et al. 105 also estimated the relationship between the Oxford Hip and Knee Scores and QALYs for both TKA and THA, and Fordham et al. 68 assessed how cost-effectiveness varied with the OHS. Lavernia et al. 106 assessed how cost-effectiveness varied with preoperative WOMAC score and, unlike the current analysis, found that patients in the top quartile for WOMAC scores (with worst function and pain) had the largest QALY gains and lowest ICERs for THA. Whereas these previous studies had found arthroplasty to be highly cost-effective for most patients, Ferket et al. 108 found TKA to cost more than US$100,000 per QALY gained unless treatment was restricted to patients with SF-12 physical scores of ≤ 20. The high ICERs in the Ferket et al. 108 study appear to arise largely from the small estimates of the improvements in utility achieved through arthroplasty (0.008), which appear to contradict the findings of an RCT (which found the difference in EQ-5D utility between patients with and without TKA to be 0.078)169 and English National PROMs data (which observe a mean increase in EQ-5D utility of 0.310). The difference could reflect differences between EQ-5D and SF-12 utilities, use of propensity weights in the Ferket et al. 108 study or the small sample of US patients with mild osteoarthritis who were analysed by Ferket et al. 108 None of these studies estimated thresholds that can be directly compared with the current analysis.
Limitations
The economic evaluation compared immediate TJA with conducting no arthroplasty for 10 years. In practice, arthroplasty may simply be delayed in those patients who have high PROM scores. Although the choice of a 10-year time horizon is arbitrary, sensitivity analyses suggested that extending or shortening the time horizon had minimal effect on the results.
Whereas the NHS PROMs data provided Oxford Hip and Knee Scores for > 80% of all arthroplasty operations conducted in England,162 the data sets available for the WOMAC and SF-12 were < 2% of the size and included very few patients with very high scores, which meant that the economic threshold for the WOMAC could only be estimated by extrapolating beyond the observed data. Furthermore, the only data set providing preoperative costs (COASt) did not use the WOMAC or SF-12 and no data sets with preoperative WOMAC scores provided more than 1 year’s follow-up; as a result, it was necessary to estimate the relationship between the WOMAC and SF-12 and these outcomes indirectly via EQ-5D utilities. KAT was the only study using both the SF-12 and the EQ-5D. All estimates of improvements in EQ-5D utility and long-term changes in utility for the SF-12 in THA are therefore based on mapped utilities, which introduces additional errors not propagated through the analysis, and this could introduce bias. Unlike the Oxford Hip and Knee Scores, there are two available versions of the SF-12; in this analysis, we assumed that versions 1 and 2 are equivalent.
Equity implications
At present, GPs, commissioning groups and hospitals use a wide variety of different tools and mechanisms to determine eligibility for arthroplasty. 170–176 As a result, patients’ access to surgery depends on where they live, not just their clinical need. Furthermore, the thresholds used by commissioning groups, such as 19 or 24,170,171,176 are far below the score at which arthroplasty ceases to be cost-effective and, therefore, thousands of patients who would get substantial benefit from surgery are denied treatment. In principle, introducing an evidence-based threshold across the country could reduce postcode rationing and improve equity of access.
The results suggest that a modest amount of population health could be gained by setting different thresholds for patients of different ages. Focusing on those patients currently undergoing surgery and ignoring the large numbers of patients who are currently not being referred for treatment suggests that society could avoid 219 operations and gain 53 QALYs by stopping all operations on patients with OKSs of > 40, but could avoid 247 operations and gain 62 QALYs by having different thresholds for patients of different ages. Similarly, for the OHS, we could avoid 100 operations and gain 25 QALYs using a fixed threshold of 42, and could avoid 114 operations and gain 29 QALYs by having different thresholds for patients of different ages. However, restricting access to care by age as well as clinical need raises equity and ethics concerns. NICE guidance may refer to age only if age is a good indicator of health status or response to treatment and/or adverse effects, or if there is no practical way of identifying patients other than by age. 98 Our regression models and those identified in our literature review demonstrate that age does influence mortality, revision rates, quality of life and the cost of surgery, suggesting that setting different thresholds by age could be clinically justifiable. However, we found the effect of age on the estimates of the economic thresholds to be generally quite small and most noticeable in the highest age groups (e.g. ≥ 90 years) in which very few data were available. Moreover, the effect of age is likely to arise largely through physical activity, general health and comorbidities, which have not been taken into account in this analysis, and it may be possible to identify those patients likely to have minimal benefit or greatest risk based on these characteristics, rather than relying on age.
In principle, thresholds could also vary by gender; thresholds for the OHS were 1 point lower for women aged 70 years and 1 point higher for men aged 80 years, although OKS thresholds were the same for men and women. However, a policy varying thresholds by sex would avoid no knee operations and only three hip operations and gain 0.25 QALYs across England compared with thresholds that varied only by age. These modest gains are unlikely to be sufficient to justify the additional complexity or the equity implications of rationing access to care by gender.
Implementation issues
Our results suggest that population health could be improved by introducing thresholds that are markedly higher than those used by most hospitals and commissioning groups. If such thresholds were introduced, it is highly likely that more patients would come forward for surgery and be referred for surgical assessment. This would require additional funding for arthroplasty procedures. Our results suggest that health would be improved by increasing spending on arthroplasty in preference to less cost-effective interventions. However, to achieve these health gains, less cost-effective interventions would have to be identified and deprioritised. Furthermore, capacity constraints of trained surgeons and operating theatres would have to be addressed.
The current analysis is based on UK data and may not necessarily be generalisable to other countries, particularly those in which the cost of TJA or non-surgical management is very different. The analysis also focuses on TJA for osteoarthritis; the costs and benefits of unicompartmental knee arthroplasty or hip resurfacing may also differ from those of total arthroplasty, and the risks and benefits of arthroplasty may be very different for patients with conditions other than osteoarthritis.
Additional findings
The models developed to map from the WOMAC and SF-12 to EQ-5D utilities could also be applied in other settings in which WOMAC or SF-12 data are available, but EQ-5D utilities are not. Coefficients for these models are shown in Online Supplement 12. However, owing to the nature of our analysis, we focused on models based on the WOMAC total score and the SF-12 physical and mental scores; more accurate models could have been obtained by exploring the effect of subdomain scores or dummy variables for individual questions. Our analyses suggested that published models mapping from the SF-12 onto EQ-5D utility that were estimated on general public samples performed poorly in knee arthroplasty candidates,110,158,177 probably because these patients have low utility compared with the general public. This highlights the importance of selecting a mapping model that has been developed or validated in the sample of interest.
Conclusion
In conclusion, economic thresholds have been estimated for the OKS, OHS, WOMAC and SF-12 physical score, which reflect the incremental costs and benefits of total hip/knee arthroplasty compared with conducting no arthroplasty. However, there is a shortage of data on the WOMAC, particularly from patients with very high scores, which means that we cannot identify patients for whom arthroplasty is not cost-effective based on the available data. The threshold SF-12 physical score varies with mental score and, among those patients with very good mental health, the SF-12 physical score cannot identify any patients for whom arthroplasty is not cost-effective. The economic thresholds for the Oxford Hip and Knee Scores are somewhat higher than the relative clinical thresholds estimated in Chapter 4 and are markedly higher than those currently used in clinical practice. There is some evidence that modest additional health gains could be achieved by using different thresholds for patients of different ages, but setting different thresholds for men and women is unlikely to be justifiable.
Chapter 6 Further analysis of threshold values (work package 2)
Background
Based on the work presented in the previous chapters and after input from the user group (see Chapter 10), the Oxford Hip and Knee Scores were selected as the candidate tools. In this chapter, relative thresholds for the OHS and OKS were estimated through more advanced modelling approaches using the NHS PROMs data set (2009–15) linked to HES. It was agreed in the third user group meeting to develop the ACHE tool based on the principle of the capacity of a patient to benefit from surgery. To do this, more detailed information was required on the effect of the patient’s preoperative score on the potential range of outcomes. The data could then be incorporated into delivering the ACHE tool.
Research aims
The aims of this work were to:
- estimate the postoperative Oxford score using the respective hip and knee preoperative scores and the respective relative thresholds using an improvement criterion based on an individual-level ROC estimate
- assess the influence and prognostic value of candidate baseline covariates to predict postoperative improvement in the Oxford Hip and Knee Scores.
Methods
General approach
Two modelling approaches were used for analyses of both of the Oxford scores. First, third-degree polynomial-based quantile regression models were used to estimate the change score (postoperative minus preoperative) from the preoperative Oxford score. Quantile regression extends the idea that a population can be divided into segments using quantiles or percentiles: the quantiles of the conditional distribution of the response variable are expressed as functions of observed covariates. 178 The 10th, 20th, 30th and 50th percentile models were produced and their accuracy was assessed by visually comparing the estimated values against the observed percentile values across the preoperative score range. From these models, relative threshold values were estimated as the preoperative score at which the respective percentile of the change score matches the respective improvement criterion (see Improvement criteria for further details). In addition, we compared the proportion of individuals achieving an important difference according to the percentile models (from 1st to 99th) against the observed proportion of individuals who improved in accordance with the improvement criteria. Given the very large size of the available data set, modelling performance was internally validated by assessing sensitivity to key factors (time period, gender and age) through subset analyses, rather than by selecting a random sample or using a bootstrapping approach.
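The idea of a relative threshold can be illustrated with a minimal sketch. Note the substitution: instead of fitting a third-degree polynomial quantile regression, it takes the empirical percentile of the change score at each preoperative score, which corresponds to the 'observed' percentiles rather than the modelled ones. The simulated patients, the function names and the postoperative score distribution are all hypothetical; only the 0–48 score range and the percentiles examined come from the source.

```python
import random
import statistics

MAX_SCORE = 48  # Oxford Hip and Knee Scores range from 0 to 48


def simulate_patients(n=20_000, seed=1):
    """Hypothetical (preoperative score, change score) pairs."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        pre = rng.randint(0, MAX_SCORE)
        post = min(MAX_SCORE, max(0, round(rng.gauss(40, 8))))
        patients.append((pre, post - pre))  # change = postoperative minus preoperative
    return patients


def relative_threshold(patients, percentile, criterion):
    """Highest preoperative score whose change-score percentile meets the criterion."""
    changes_by_pre = {}
    for pre, change in patients:
        changes_by_pre.setdefault(pre, []).append(change)
    meeting = []
    for pre, changes in changes_by_pre.items():
        if len(changes) < 2:
            continue  # statistics.quantiles needs at least two values
        cut_points = statistics.quantiles(changes, n=100)
        if cut_points[percentile - 1] >= criterion:
            meeting.append(pre)
    return max(meeting) if meeting else None


patients = simulate_patients()
for pct in (10, 20, 30, 50):
    print(pct, relative_threshold(patients, pct, criterion=8))
```

A lower percentile is a stricter requirement (more patients must reach the improvement criterion), so it yields a lower threshold than a higher percentile.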
The second approach used fractional polynomial logistic regressions. Using this approach, the contribution of the baseline covariates to predicting the capacity to benefit was investigated. Fractional polynomial regression was introduced as a means of parsimonious modelling. 179
The baseline covariates investigated in the logistic regression models were age (as a continuous variable), gender, the presence or absence of 12 patient-reported comorbidities (heart disease, high blood pressure, problems caused by stroke, leg pain when walking owing to poor circulation, lung disease, diabetes, kidney disease, disease of the nervous system, liver disease, cancer, depression and arthritis) and symptom period (up to 5 years or ≥ 6 years). For age, the optimal form of its relationship with the outcome was assessed using fractional polynomial logistic regression, with the preoperative score included in the default model. Comorbidities were assessed simultaneously, with only those that were significant and had a log-odds ratio (OR) of ≥ 0.8 remaining as candidates. For each of the candidate covariates, the discriminative performance of a model with only that covariate in addition to the preoperative score was calculated. Covariates were selected for inclusion in the final model if the AUC of this model was ≥ 0.01 greater than the AUC for the model including only the preoperative score. The performance of the final model was assessed in terms of discrimination and calibration by calculating the AUC and producing calibration plots. Sensitivity and specificity values for the estimated relative threshold were calculated with corresponding 95% CIs. Complete-case analyses were conducted throughout. Stata® version 14 was used for all of the further statistical analyses looking at the threshold.
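The covariate selection rule can be sketched as follows. This is an illustration in Python, not the Stata code used for the analysis; the AUC is computed with the standard pairwise (Mann–Whitney) definition, and the covariate names and AUC values in the example are hypothetical.

```python
def auc(predicted, outcomes):
    """Probability that a random improver is ranked above a random non-improver."""
    positives = [p for p, y in zip(predicted, outcomes) if y == 1]
    negatives = [p for p, y in zip(predicted, outcomes) if y == 0]
    # Ties between a positive and a negative count as half a concordant pair
    concordant = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return concordant / (len(positives) * len(negatives))


def select_covariates(base_auc, candidate_aucs, min_gain=0.01):
    """Keep covariates whose model beats the preoperative-score-only AUC by min_gain."""
    return [name for name, value in candidate_aucs.items()
            if value - base_auc >= min_gain]


# Hypothetical AUCs for models with the preoperative score plus one covariate
selected = select_covariates(0.700, {"age": 0.715, "depression": 0.703})
print(selected)
```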
Data set
The NHS PROMs data set linked to HES (2009–15) was used. The inclusion criterion was that patients must have undergone a primary procedure; those undergoing revision procedures were excluded. Primary procedures were selected using PROMs and HES variables. First, patients who had the same procedure [as identified by ‘PROMS_PROC_CODE’ (a code identifying the type of procedure that the patient underwent)] carried out on the same side of the body more than once (e.g. two primary hip replacements of the left hip) were excluded based on the PROMs data. The PROMs and HES data were then linked, and duplicate entries were identified and removed, using the variables ‘EPISODE_MATCH_RANK’ [a score is attributed to each part of the linking process, in which the quality of the match is denoted by the rank with the lowest rank (i.e. 1) being the highest-quality match; the scores for each possible match are compared and the highest match is chosen] and ‘EPIKEY’. The former variable ensured that the highest-quality match between HES and PROMs was listed first, and the latter variable ensured that a unique sorting was produced. Based on this sorting, we dropped the second observation from each duplicate pair. In addition, only patients who answered that they had not had previous surgery of the type they were going to undergo were included, using the ‘Q1_PREVIOUS_SURGERY’ variable.
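The de-duplication step described above can be sketched as a sort followed by keeping the first record in each group. This is an illustration only: the analysis itself was run in Stata, and the `proms_record_id` field used to group duplicates is a hypothetical identifier, whereas `EPISODE_MATCH_RANK` and `EPIKEY` are the variables named in the text.

```python
def drop_duplicate_links(records):
    """Keep the best-ranked HES match for each PROMs record."""
    ordered = sorted(
        records,
        # Rank 1 is the highest-quality match; EPIKEY makes the ordering unique
        key=lambda r: (r["proms_record_id"], r["EPISODE_MATCH_RANK"], r["EPIKEY"]),
    )
    kept = {}
    for record in ordered:
        # Only the first (best-ranked) record per PROMs record is retained
        kept.setdefault(record["proms_record_id"], record)
    return list(kept.values())


links = [
    {"proms_record_id": "A", "EPISODE_MATCH_RANK": 2, "EPIKEY": 1002},
    {"proms_record_id": "A", "EPISODE_MATCH_RANK": 1, "EPIKEY": 1001},
    {"proms_record_id": "B", "EPISODE_MATCH_RANK": 1, "EPIKEY": 2001},
]
deduplicated = drop_duplicate_links(links)
print(deduplicated)
```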
Approval
We successfully applied for access to the NHS PROMs/HES-linked data from NHS Digital (reference NIC-392690-F7H2Q).
Improvement criteria
Patient benefit can be defined in various ways. We restricted the definition to improvement in OHS and OKS patient-reported scores using standard approaches. Definitions of improvement were applied in this chapter as follows:
- B. medium ES (0.5) × SD of change score (MCID) – 5 (OHS and OKS)
- E. anchor-based best cut-off point using the individual-level ROC approach from the literature – 8 (OHS) or 7 (OKS).
A previously reported anchor-based best cut-off point was applied (using the NHS PROMs data set from 2009 to 2011). 90 This was a ROC curve analysis with the probability of correctly identifying patients based on a patient-reported anchor, ‘a little better’ versus ‘the same’ from the post-surgery evaluation in the questionnaire. When it was used at an individual level (MIC from ROC analysis), any changes in the OHS and OKS beyond 8 and 7 points, respectively, were considered as ‘clinically relevant’ changes. 75 These values lead (under improvement criterion E) to absolute threshold values for the OHS and OKS of 40 and 41 points, respectively. Corresponding values for definition B were as previously: 43 for both scores.
In addition, we also applied a 0.5 (medium, moderate practical importance) ES approach [as classified by Cohen180] using the variability of the change scores as an approach consistent with work package 2. It was calculated simply by using the SD of change scores of the candidate tools multiplied by the medium ES (0.5): 0.5 × SD of change score (criterion B). 181
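Both improvement criteria reduce to simple arithmetic, sketched below. The absolute thresholds quoted above are consistent with subtracting the required improvement from the maximum Oxford score of 48; this reading is an inference from the reported numbers rather than a formula stated in the text. The function names and the example change scores are hypothetical.

```python
import statistics

MAX_OXFORD_SCORE = 48  # OHS and OKS both range from 0 to 48


def mcid_from_effect_size(change_scores, effect_size=0.5):
    """Criterion B: medium effect size multiplied by the SD of the change scores."""
    return effect_size * statistics.stdev(change_scores)


def absolute_threshold(required_improvement, max_score=MAX_OXFORD_SCORE):
    """Highest preoperative score from which the required improvement is still attainable."""
    return max_score - required_improvement


print(absolute_threshold(8))  # criterion E, OHS
print(absolute_threshold(7))  # criterion E, OKS
print(absolute_threshold(5))  # criterion B, both scores

# Hypothetical change scores with an SD of 10 points
print(mcid_from_effect_size([10, 20, 30]))
```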
Results
Descriptive statistics
A total of 644,656 observations were available for PROMs data linked to HES (2009–15). After dropping the identified duplicates from the data set linked to HES, there were 602,287 observations remaining (289,808 for hip and 312,479 for knee). Excluding those patients who had not confirmed in the PROMs questionnaire whether or not they had undergone previous surgery on the respective side reduced the data set further (to 575,980; 277,786 for hip and 298,194 for knee). Of these, only patients who had completed both the preoperative and the postoperative questionnaires were included, using the ‘Q1 and Q2 Complete’ variables (443,262; 214,038 for hip and 229,224 for knee). Finally, only patients whose preoperative and postoperative questionnaires contained sufficient procedure-specific data to derive scores were included, using the ‘Q1 and Q2 Complete’ variables (209,761 for hip and 222,933 for knee) (Figure 26).
FIGURE 26.
Flow diagrams for hip and knee PROMs linked to HES data sets. (a) Hip; and (b) knee.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig26a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig26b.png)
Demographics
The patient characteristics for those included in the matched hip and knee data sets are provided in Table 52. In total, 209,761 hip replacement surgeries were included; over half (125,058; 59.6%) were in female patients, with ages ranging from 13 to 100 years and a mean age of 68.4 years. In total, 222,933 knee replacement surgeries were included; over half (126,885; 56.9%) were in female patients, with ages ranging from 16 to 102 years and a mean age of 69.5 years. In total, 96,041 NHS PROMs linked to HES (45.8%) were measured for hip replacements and 102,448 (46.0%) were measured for knee replacements in 2009–11. In total, 113,720 (54.2%) post-operation scores were completed for hip and 120,485 (54.0%) for knee in 2012–15. Figure 27 shows histograms of the OHS and OKS pre and post surgery.
Characteristic | Hip replacement (N = 209,761) | Knee replacement (N = 222,933) |
---|---|---|
Age (years) | ||
Mean | 68.37 | 69.52 |
SD | 10.46 | 8.92 |
Minimum | 13 | 16 |
Maximum | 100 | 102 |
Age category (years), n (%) | ||
< 60 | 37,904 (18.07) | 29,349 (13.16) |
60–79 | 144,064 (68.68) | 164,132 (73.62) |
≥ 80 | 27,793 (13.25) | 29,452 (13.21) |
Gender, n (%) | ||
Male | 84,673 (40.37) | 96,006 (43.06) |
Female | 125,058 (59.62) | 126,885 (56.92) |
Comorbidity, n (%) | ||
Heart disease | 19,679 (9.38) | 23,340 (10.47) |
High blood pressure | 82,428 (39.30) | 102,542 (46.00) |
Stroke | 2912 (1.39) | 3733 (1.67) |
Circulation | 11,968 (5.71) | 16,464 (7.39) |
Lung disease | 15,592 (7.43) | 18,571 (8.33) |
Diabetes | 18,449 (8.80) | 27,789 (12.47) |
Kidney | 3550 (1.69) | 4022 (1.80) |
Nervous system | 1566 (0.75) | 2155 (0.97) |
Liver disease | 1081 (0.52) | 1199 (0.54) |
Cancer | 10,085 (4.81) | 10,416 (4.67) |
Depression | 15,264 (7.28) | 18,375 (8.24) |
Arthritis | 151,331 (72.14) | 174,391 (78.23) |
Comorbidity group, n (%) | ||
0 | 29,933 (14.27) | 22,121 (9.92) |
1 | 79,168 (37.74) | 73,599 (33.01) |
2 | 63,076 (30.07) | 74,649 (33.48) |
≥ 3 | 37,584 (17.92) | 52,564 (23.58) |
Year of NHS PROMs, n (%) | ||
2009–11 | 96,041 (45.79) | 102,448 (45.95) |
2012–15 | 113,720 (54.21) | 120,485 (54.05) |
Living arrangement, n (%) | ||
I live with someone | 151,669 (72.31) | 164,451 (73.77) |
I live alone | 53,318 (25.42) | 52,985 (23.77) |
I live in a care home, etc. | 259 (0.12) | 219 (0.10) |
Other | 769 (0.37) | 802 (0.36) |
Symptom period (years), n (%) | ||
< 1 | 29,053 (13.85) | 11,041 (4.95) |
1–5 | 142,960 (68.15) | 116,195 (52.12) |
6–10 | 23,108 (11.02) | 48,340 (21.68) |
> 10 | 13,588 (6.48) | 46,273 (20.76) |
FIGURE 27.
The NHS PROMs histograms. (a) OHS; and (b) OKS.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig27a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig27b.png)
Relative thresholds
Hip
Quantile regression was used to estimate the shape of the 10th, 20th, 30th and 50th percentile change in OHS. For example, the prediction of the third-degree polynomial 50th quantile regression model was:
where x is the preoperative score, $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ and i refers to the ith patient.
The observed postoperative change in the OHS against the preoperative score along with the observed and estimated percentiles using the quantile regression model are shown in Figure 28. Estimated relative thresholds using the E improvement criterion for the 10th, 20th, 30th and 50th percentiles are presented in Table 53 for the whole data set (‘Total’) and also by key factors (age, gender and time period). There was no clear sign of a different threshold value in accordance with the covariates, with some suggestion that the group of patients aged ≥ 80 years had slightly lower estimates. Results based on the B improvement criterion are available in Online Supplement 14.
FIGURE 28.
The OHS: NHS PROMs. (a) Observed percentiles and 10th, 20th, 30th and 50th quantile regression curves (coloured dots indicate each observational percentile and grey dots indicate the actual weighted observations); and (b) proportion improved using improvement criterion E (observed and quantile regression model).
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig28a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig28b.png)
Baseline covariate | n | Predicted 10th percentile | Predicted 20th percentile | Predicted 30th percentile | Predicted 50th percentile | Observed 10th percentile | Observed 20th percentile | Observed 30th percentile | Observed 50th percentile |
---|---|---|---|---|---|---|---|---|---|
Total | 209,761 | 23 | 32 | 35 | 38 | 24 | 31 | 35 | 38 |
Age category (years) | |||||||||
< 60 | 37,904 | 26 | 33 | 36 | 38 | 27 | 32 | 38 | 38 |
60–79 | 144,064 | 24 | 32 | 35 | 38 | 24 | 32 | 35 | 38 |
≥ 80 | 27,793 | 19 | 27 | 32 | 36 | 19 | 26 | 31 | 36 |
Gender | |||||||||
Male | 84,673 | 25 | 32 | 35 | 38 | 26 | 33 | 35 | 38 |
Female | 125,058 | 22 | 31 | 34 | 38 | 23 | 31 | 35 | 38 |
Year of NHS PROMs | |||||||||
2009–11 | 96,041 | 22 | 31 | 34 | 38 | 20 | 31 | 35 | 38 |
2012–15 | 113,720 | 25 | 32 | 35 | 38 | 25 | 31 | 35 | 38 |
Figure 28b shows the percentages improved using the E improvement criterion against preoperative score, both observed and estimated using the predictions from the quantile regression models. Corresponding figures presenting the findings in the six population subsets by age and gender are given in Online Supplement 14.
Knee
Quantile regression was used to estimate the shape of the 10th, 20th, 30th and 50th percentile change in the OKS. For example, the prediction of the third-degree polynomial 50th quantile regression model was:
where x is the preoperative score, $\varepsilon_i \overset{\text{i.i.d.}}{\sim} N(0, \sigma^2)$ and i refers to the ith observation.
The observed postoperative change in score against preoperative score, along with the observed and estimated percentiles using the quantile regression model, is shown in Figure 29a. Estimated relative thresholds using the E improvement criterion for the 10th, 20th, 30th and 50th percentiles are presented in Table 54 for the whole data set (‘Total’) and also by key factors (age, gender and time period). There was no clear sign of different threshold values in accordance with the covariates. Results based on the B improvement criterion are available in Online Supplement 15.
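One plausible way to read a relative threshold off a fitted percentile curve is sketched below. It assumes that the threshold is the highest preoperative score at which the predicted percentile of change still meets the improvement cut-off; that reading, the cubic coefficients and the cut-off of 8 points are all hypothetical and are not taken from the report. The maximum score of 48 is the top of the Oxford scale.

```python
def cubic(coeffs, x):
    """Evaluate b0 + b1*x + b2*x^2 + b3*x^3 for one preoperative score x."""
    b0, b1, b2, b3 = coeffs
    return b0 + b1 * x + b2 * x**2 + b3 * x**3


def relative_threshold(coeffs, cutoff, max_score=48):
    """Highest preoperative score whose predicted percentile change is >= cutoff.

    Returns None when no score on the scale meets the cut-off.
    """
    meeting = [x for x in range(max_score + 1) if cubic(coeffs, x) >= cutoff]
    return max(meeting) if meeting else None


# Hypothetical percentile curve: predicted change = 40 - x (a straight line).
print(relative_threshold((40.0, -1.0, 0.0, 0.0), cutoff=8))
# Hypothetical curve that never reaches the cut-off.
print(relative_threshold((0.0, 0.0, 0.0, 0.0), cutoff=8))
```

Under this reading, a percentile curve that never reaches the cut-off yields no threshold at all, which would be one possible explanation for a table entry reported as N/A.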
FIGURE 29.
The OKS: NHS PROMs. (a) Observed percentiles and 10th, 20th, 30th and 50th quantile regression curves (coloured dots indicate each observational percentile and grey dots indicate the actual weighted observations); and (b) proportion improved using improvement criterion E against preoperative score (observed and quantile regression model).
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig29a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig29b.png)
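Baseline covariate | n | Predicted 10th percentile | Predicted 20th percentile | Predicted 30th percentile | Predicted 50th percentile | Observed 10th percentile | Observed 20th percentile | Observed 30th percentile | Observed 50th percentile |
---|---|---|---|---|---|---|---|---|---|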
Total | 222,933 | N/A | 25 | 31 | 36 | 1 | 25 | 30 | 36 |
Age category (years) | |||||||||
< 60 | 29,349 | N/A | 18 | 29 | 35 | 1 | 21 | 30 | 35 |
60–79 | 164,132 | N/A | 26 | 31 | 36 | 1 | 26 | 31 | 36 |
≥ 80 | 29,452 | 13 | 24 | 30 | 35 | 15 | 26 | 30 | 36 |
Gender | |||||||||
Male | 96,006 | N/A | 26 | 31 | 36 | 0 | 26 | 32 | 36 |
Female | 126,885 | N/A | 25 | 30 | 35 | 1 | 25 | 30 | 36 |
Year of NHS PROMs | |||||||||
2009–11 | 102,448 | N/A | 24 | 30 | 35 | 1 | 23 | 30 | 36 |
2012–15 | 120,485 | N/A | 27 | 31 | 36 | 1 | 26 | 32 | 36 |
Internal validation
Internal validation on the quantile regression estimates was carried out by applying quantile regression to population subsets based on age, gender and year of operation (time period). The data set was split into the 2009–11 and 2012–15 time period subsets to internally validate the estimates and check for sensitivity within the data set to the time period. The two time period subpopulation analyses did not show distinct shapes (Figures 30 and 31 provide the OHS and OKS results, respectively).
FIGURE 30.
The OHS: NHS PROMs. (a) Age; (b) gender; and (c) time period subset populations (observed change and 50th quantile regression model).
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig30a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig30b.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig30c.png)
FIGURE 31.
The OKS: NHS PROMs. (a) Age; (b) gender; and (c) time period subset populations (observed change and 50th quantile regression model).
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig31a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig31b.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig31c.png)
Further figures are shown in Online Supplement 14; the subsets showed somewhat different patterns around the relative threshold values for each percentile, particularly at low scores (see Figures 30 and 31).
The gender subpopulation analyses did not show distinct shapes (see Figures 30 and 31). Corresponding figures presenting the findings in the six age-by-gender population subsets for both the OHS and the OKS are given in Online Supplement 14.
Model performance
Sensitivity and specificity were calculated using the E improvement criterion as the ‘gold standard’ regarding the OHS and OKS relative thresholds for each of the quantile models. The OHS and OKS results are shown in Tables 55 and 56, respectively. Relative threshold values from the 20th quantile regression showed relatively good sensitivity and poor specificity outcomes compared with other thresholds from the 10th, 30th and 50th quantile regressions. Specificity outcomes for the 80% population coverage relative threshold (i.e. 20th percentile) were 19% (95% CI 18% to 19%) for the OHS and 35% (95% CI 34% to 35%) for the OKS. Specificity outcomes for the 50% population coverage relative threshold (i.e. 50th percentile) were 6% (95% CI 5% to 6%) for the OHS and 5% (95% CI 5% to 6%) for the OKS (see Table 4).
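The calculation described above can be sketched as follows. The sketch assumes that a patient counts as test positive when their preoperative score is at or below the relative threshold, which is consistent with sensitivity rising as the threshold rises but is not stated explicitly in the text. The data and names are hypothetical.

```python
def sensitivity_specificity(pre_scores, improved, threshold):
    """Sensitivity and specificity of the rule 'preoperative score <= threshold'.

    `improved` holds the gold standard (True if the improvement criterion was met).
    """
    tp = fn = fp = tn = 0
    for pre, did_improve in zip(pre_scores, improved):
        test_positive = pre <= threshold
        if did_improve and test_positive:
            tp += 1
        elif did_improve:
            fn += 1
        elif test_positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one improved and one non-improved patient")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical patients: preoperative score and whether they improved.
pre = [10, 20, 30, 35, 40, 45]
imp = [True, True, True, False, True, False]

print(sensitivity_specificity(pre, imp, threshold=32))
print(sensitivity_specificity(pre, imp, threshold=48))
```

Raising the threshold to the top of the scale classifies everyone as test positive, so sensitivity reaches 1 while specificity falls to 0; this mirrors the trade-off between the 10th and 50th percentile thresholds in the tables.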
Baseline covariate | 10th percentile: threshold score | 10th percentile: sensitivity, % (95% CI) | 10th percentile: specificity, % (95% CI) | 20th percentile: threshold score | 20th percentile: sensitivity, % (95% CI) | 20th percentile: specificity, % (95% CI) | 30th percentile: threshold score | 30th percentile: sensitivity, % (95% CI) | 30th percentile: specificity, % (95% CI) | 50th percentile: threshold score | 50th percentile: sensitivity, % (95% CI) | 50th percentile: specificity, % (95% CI) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Total | 23 | 77 (77 to 77) | 47 (47 to 48) | 32 | 97 (97 to 97) | 19 (18 to 19) | 35 | 99 (99 to 99) | 12 (12 to 12) | 38 | 100 (100 to 100) | 6 (6 to 7) |
Age category (years) | ||||||||||||
< 60 | 26 | 87 (86 to 87) | 37 (36 to 37) | 33 | 98 (98 to 98) | 16 (16 to 17) | 36 | 100 (99 to 100) | 10 (10 to 10) | 38 | 100 (100 to 100) | 6 (6 to 7) |
60–79 | 24 | 81 (80 to 81) | 44 (43 to 44) | 32 | 97 (97 to 97) | 19 (18 to 19) | 35 | 99 (99 to 99) | 12 (12 to 12) | 38 | 100 (100 to 100) | 6 (6 to 7) |
≥ 80 | 19 | 61 (61 to 61) | 61 (61 to 62) | 27 | 89 (89 to 89) | 34 (33 to 34) | 32 | 97 (97 to 97) | 19 (18 to 19) | 36 | 100 (99 to 100) | 10 (10 to 10) |
Gender | ||||||||||||
Male | 25 | 84 (84 to 84) | 40 (40 to 41) | 32 | 97 (97 to 97) | 19 (18 to 19) | 35 | 99 (99 to 99) | 12 (12 to 12) | 38 | 100 (100 to 100) | 6 (6 to 7) |
Female | 22 | 74 (73 to 74) | 51 (50 to 51) | 31 | 96 (96 to 96) | 22 (21 to 22) | 34 | 99 (99 to 99) | 14 (14 to 15) | 38 | 100 (100 to 100) | 6 (6 to 7) |
Year of NHS PROMs | ||||||||||||
2009–11 | 22 | 74 (73 to 74) | 51 (50 to 51) | 31 | 96 (96 to 96) | 22 (21 to 22) | 34 | 99 (99 to 99) | 14 (14 to 15) | 38 | 100 (100 to 100) | 6 (6 to 7) |
2012–15 | 25 | 84 (84 to 84) | 40 (40 to 41) | 32 | 97 (97 to 97) | 19 (18 to 19) | 35 | 99 (99 to 99) | 12 (12 to 12) | 38 | 100 (100 to 100) | 6 (6 to 7) |
Baseline covariate | 10th percentile: threshold score | 10th percentile: sensitivity, % (95% CI) | 10th percentile: specificity, % (95% CI) | 20th percentile: threshold score | 20th percentile: sensitivity, % (95% CI) | 20th percentile: specificity, % (95% CI) | 30th percentile: threshold score | 30th percentile: sensitivity, % (95% CI) | 30th percentile: specificity, % (95% CI) | 50th percentile: threshold score | 50th percentile: sensitivity, % (95% CI) | 50th percentile: specificity, % (95% CI) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Total | – | – | – | 25 | 83 (83 to 83) | 35 (34 to 35) | 31 | 96 (96 to 96) | 15 (15 to 15) | 36 | 100 (100 to 100) | 5 (5 to 6) |
Age category (years) | ||||||||||||
< 60 | – | – | – | 18 | 52 (52 to 53) | 63 (63 to 64) | 29 | 93 (93 to 93) | 21 (20 to 21) | 35 | 99 (99 to 99) | 7 (6 to 7) |
60–79 | – | – | – | 26 | 86 (86 to 87) | 31 (30 to 31) | 31 | 96 (96 to 96) | 15 (15 to 15) | 36 | 100 (100 to 100) | 5 (5 to 6) |
≥ 80 | 13 | 28 (28 to 28) | 81 (80 to 81) | 24 | 80 (79 to 80) | 38 (38 to 39) | 30 | 95 (95 to 95) | 18 (18 to 18) | 35 | 99 (99 to 99) | 7 (6 to 7) |
Gender | ||||||||||||
Male | – | – | – | 26 | 86 (86 to 87) | 31 (30 to 31) | 31 | 96 (96 to 96) | 15 (15 to 15) | 36 | 100 (100 to 100) | 5 (5 to 6) |
Female | – | – | – | 25 | 83 (83 to 83) | 35 (34 to 35) | 30 | 95 (95 to 95) | 18 (18 to 18) | 35 | 99 (99 to 99) | 7 (6 to 7) |
Year of NHS PROMs | ||||||||||||
2009–11 | – | – | – | 24 | 80 (79 to 80) | 38 (38 to 39) | 30 | 95 (95 to 95) | 18 (18 to 18) | 35 | 99 (99 to 99) | 7 (6 to 7) |
2012–15 | – | – | – | 27 | 89 (89 to 89) | 27 (27 to 28) | 31 | 96 (96 to 96) | 15 (15 to 15) | 36 | 100 (100 to 100) | 5 (5 to 6) |
Figures 30 and 31 show close agreement between the estimated and observed values across the range of preoperative Oxford Hip and Knee Scores, both for the overall data set and by subpopulation.
Influence of covariates
Fourth-degree fractional polynomial logistic regressions of the change score (postoperative minus preoperative score), dichotomised by the improvement criterion (A), were used to examine the influence of the baseline covariates on the capacity of benefit.
Hip
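The selected fractional polynomial model had the following form:
where x is the preoperative score, $\hat{p}_i = N_{\text{improved}} / N_{\text{preoperative score}}$ (the number of patients who improved divided by the number of patients with that preoperative score) and i refers to the ith patient.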
Predicted probabilities with 95% CIs using the logistic regressions in comparison with the observational proportion with 95% CIs are presented in Figure 32a. Generally, the fit was good, with the exception of the lowest handful of preoperative scores. The AUC was 0.65 (95% CI 0.64 to 0.65). Model performance was also examined using the calibration graph in Figure 32b, which showed a good level of calibration.
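For readers less familiar with the AUC, the sketch below computes it directly from its definition: the probability that a randomly chosen patient who improved has a higher predicted probability than a randomly chosen patient who did not, with ties counting as one-half. The predicted probabilities and outcomes are hypothetical, and the pairwise loop is written for clarity rather than speed; a rank-based calculation would be preferable for a data set of the size used here.

```python
def auc(probabilities, outcomes):
    """Area under the ROC curve from predicted probabilities and binary outcomes."""
    improved = [p for p, y in zip(probabilities, outcomes) if y]
    not_improved = [p for p, y in zip(probabilities, outcomes) if not y]
    if not improved or not not_improved:
        raise ValueError("both outcome groups must be present")
    wins = 0.0
    for a in improved:
        for b in not_improved:
            if a > b:
                wins += 1.0      # improver ranked above non-improver
            elif a == b:
                wins += 0.5      # tie counts as half
    return wins / (len(improved) * len(not_improved))


# Hypothetical predicted probabilities of improvement and observed outcomes.
predicted = [0.9, 0.8, 0.7, 0.6, 0.5]
observed = [1, 1, 0, 1, 0]
print(auc(predicted, observed))
```

An AUC of 0.5 corresponds to no discrimination and 1.0 to perfect discrimination, so values in the range of 0.61 to 0.68 indicate modest discrimination even when calibration is good.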
FIGURE 32.
The OHS: NHS PROMs. (a) Proportion improved (improvement criterion E) and observed proportion (logistic regression); and (b) calibration graph.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig32a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig32b.png)
Knee
The selected fractional model had the following form:
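where x is the preoperative score, $\hat{p}_i = N_{\text{improved}} / N_{\text{preoperative score}}$ (the number of patients who improved divided by the number of patients with that preoperative score) and i refers to the ith patient.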
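The observed proportion defined above can be computed as in the following sketch, which dichotomises each patient's change score and then groups by preoperative score. The minimum change of 8 points is a placeholder and not the cut-off used in the report; the scores are hypothetical.

```python
from collections import defaultdict


def proportion_improved(pre_scores, post_scores, min_change):
    """Observed proportion improved at each preoperative score.

    A patient counts as improved when post - pre >= min_change.
    """
    counts = defaultdict(lambda: [0, 0])  # pre score -> [improved, total]
    for pre, post in zip(pre_scores, post_scores):
        counts[pre][1] += 1
        if post - pre >= min_change:
            counts[pre][0] += 1
    return {pre: n_imp / n_all for pre, (n_imp, n_all) in sorted(counts.items())}


# Hypothetical pre- and postoperative scores on the 0-48 scale.
pre = [10, 10, 10, 40, 40]
post = [30, 12, 25, 44, 47]
print(proportion_improved(pre, post, min_change=8))
```

Patients who start near the top of the scale have little room to gain the required number of points, which is why the observed proportion improved falls at high preoperative scores.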
Predicted probabilities with 95% CIs using the logistic regressions in comparison with the observational proportion with 95% CIs are presented in Figure 33a. Generally, the fit was good, with the exception of the lowest handful of preoperative scores. The AUC was 0.61 (95% CI 0.61 to 0.62). Model performance was also examined using the calibration graphs in Figure 33b, which showed a good level of calibration.
FIGURE 33.
The OKS: NHS PROMs. (a) Proportion improved (improvement criterion E) and observed proportion (logistic regression); and (b) calibration graph.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig33a.png)
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig33b.png)
Selected models with baseline covariates
Hip
Following the model-building process, the selected model with covariates including circulation and depression showed meaningful but marginal impacts on the OHS. The model had the following form:
where x is the preoperative score, $\hat{p} = N_{\text{improved}} / N_{\text{preoperative score}}$ and i refers to the ith observation.
The AUC was 0.68 (95% CI 0.67 to 0.68). The proportion improving and the calibration graphs were very similar to the model without the additional covariates and, therefore, are not provided here.
Knee
Following the model-building process, the selected model including age, circulation, diabetes and depression showed meaningful but marginal impacts on the OKS. It had the following form:
where x is the preoperative score, $\hat{p} = N_{\text{improved}} / N_{\text{preoperative score}}$ and i refers to the ith observation.
The area under the ROC curve for the final model was 0.65 (95% CI 0.65 to 0.66). The proportion improving and the calibration graphs were very similar to the model without the additional covariates and, therefore, are not provided here.
Discussion
The postoperative score and the probability of improvement were modelled for both the OHS and the OKS. In addition, we calculated threshold estimates using both of these estimates and assessed performance.
Strengths and limitations
Two different modelling approaches, logistic and quantile regression, were used to obtain robust and consistent threshold estimates. Unlike standard regression models, quantile regression deals with the asymmetric spread of outcomes and enables the corresponding proportion of individuals to be estimated. Performance showed excellent agreement between the observed and estimated results for this approach, with only the highest values showing some signs of an inadequate fit. This reflects both the regression models used and the very large data sets to which they were applied. The results showed that the relative thresholds estimated by percentile and predicted probability were equal or similar to the thresholds from the logistic regressions that used the dichotomised outcomes (i.e. improved or not improved). In this study, the relative thresholds were based on the quantile regressions and were reported for the 10th, 20th, 30th and 50th percentiles.
The baseline covariates were also examined in terms of their impact on the patient’s capacity of benefit, in addition to their impact on preoperative score. In terms of the potential threshold value, they did not generally seem to have much influence, with the possible exception of the ≥ 80-years age group.
The PROMs data set linked to HES provided a very large and rich data source, which enabled robust and reliable estimates. Internal validation was undertaken by dividing the data set into two time periods. Even the data subsets were very large and there were no clear differences in threshold estimates between the two internal data sets. The model diagnostics and validations that were performed suggested that the quantile models were a good fit, except at the extremes of the Oxford Hip and Knee Scores.
This study defined improvement (and, by implication, the capacity of benefit) using a literature-reported anchor-based best cut-off point. 75 This score was calculated using the patient-reported anchor-based question; however, there are various approaches to defining improvement. One of the approaches applied in Chapters 4 and 6 was an MDC-based approach. We explored another definition (medium ES) in this chapter and provided the results in Online Supplement 13. It is worth noting that even the anchor-based approach is still somewhat crude and does not account for the preoperative state, which has been shown in some settings to influence the magnitude of a difference (change) that would be considered important by patients. Measurement errors are problematic in repeatedly measured patient-reported outcomes, and there are several ways to control for them (e.g. using the adjusted improvement score and applying advanced statistical inference approaches such as Bayesian models with computational methods). It is the applicability of the finding to an individual that is problematic; population estimates should be fair. The impact of measurement errors in prognostic modelling is an area of active research. In terms of performance, the estimated proportion improving should accurately reflect the population. Based on Chapter 3, those within 6 points of a cut-off point (MDCs using an ICC of 0.9) could plausibly have a value sufficient to meet the cut-off point. Further research may be required, and care is needed when applying the estimated relative thresholds in this respect. A related issue was that the data set reflected patients who received arthroplasty and, therefore, may not fully represent the population of those who were considering arthroplasty. A further limitation of our analysis was the substantial amount of missing data, which reflected the passive nature of data collection. Missing data analyses that utilised available information on the missing data (which was very limited) are, in our view, very unlikely to substantively alter the findings.
Conclusion
We estimated the postoperative scores and the probability of improvement based on both the OHS and the OKS. From these, we calculated relative thresholds for the OHS and OKS. The model generally showed good performance for predicting the probability of improvement for individuals at each preoperative score level. The results clearly reflect both the substantial improvement in score from before to after the operation and the uncertainty about where an individual patient will end up. The quantile regression nicely models differing proportions of individuals and their expected outcomes.
A very large data set was used, which enabled a data-intensive approach (quantile regression) to be successfully used, with only the extreme handful of values at each end of the scale showing uncertainty. The NHS PROMs data set is arguably the best representation of the typical NHS patient available. Sensitivity to other factors was also assessed; there was some suggestion that the ≥ 80-years age group was somewhat different. Overall, the explanatory value of the available factors (collected routinely) was limited and a number showed no value. There was a substantial amount of variability between the individuals’ postoperative outcomes, which is unexplained.
As previously stated, it should be borne in mind that all of these analyses reflect the change in pain and function estimated by the Oxford Hip and Knee Scores only. Other justifications for arthroplasty surgery exist that would not be fully reflected in these data. Similarly, these analyses do not take into account the potential risk of infection and other problems that may require further treatment (e.g. revision surgery).
Chapter 7 Further health economic evaluation of threshold values (work package 2)
Background
We extended and improved the analyses described in Chapter 5 to obtain more accurate estimates of preoperative economic thresholds for the OKS and OHS using the NHS PROMs data linked to HES. In particular, the PROMs/HES data set provides:
- Sufficient data to estimate the cost of admissions for primary surgery, revision or subsequent re-admissions using the NHS Payment Grouper.
- The ability to link revisions and re-admissions with patients’ primary operations.
- Additional years of data with indicators of which operations were revisions that were not available in the freely available PROMs data.
- Patients’ exact ages, rather than 10-year age bands.
- Data on age and sex for all patients. By contrast, the extract of PROMs data freely available online omitted age and sex data for individuals (generally the very old and very young) to avoid the potential for patients to be identified.
We did not evaluate the impact of additional covariates other than age and sex, because the variables available in the PROMs/HES data extract had very little effect on prediction accuracy and have not been included in the ACHE tool intended for general practice. The effect of BMI was not evaluated because this variable is not included within PROMs/HES and it was not possible to obtain NJR data (in which BMI is recorded) linked to PROMs/HES; consequently, the only data sets providing BMI data were small and likely to be underpowered.
Like the analyses described in Chapter 5, the economic evaluation comprised a cost–utility analysis evaluating the cost-effectiveness of TKA or THA compared with no arthroplasty over a 10-year time horizon in a UK setting from the perspective of the UK NHS (see Chapter 5, Background). However, it is worth noting that, in the absence of linked NJR data, it is not possible to distinguish TKA and THA from other types of knee and hip arthroplasty (e.g. hip resurfacing) or to reliably identify the indication for arthroplasty, and, therefore, all analyses based on PROMs and/or HES data include all types of primary knee and hip arthroplasty conducted for any indication.
Method
We conducted additional regression analyses on the new extract of PROMs/HES linked data and adapted the Markov models described in Chapter 5 to accommodate the new regression models. The Markov model structure and assumptions were generally identical to those described in Chapter 5, Model. In particular, the analysis used a 10-year time horizon and used 2014 as the reference year for costs. We also took an NHS perspective and focused on the costs of hospital admissions and consultations with health-care professionals that were associated with the joint in question.
The model used the same assumptions described in Chapter 5, Other model assumptions and inputs, with the following exceptions:
- The analyses described in this chapter allowed for mortality associated with all revision procedures. By contrast, in Chapter 5, mortality associated with revisions was only included for all revisions > 12 months after primary arthroplasty and only if the mortality associated with revisions was expected to be > 10% above all-cause mortality. In the analyses described in this chapter, the probability of dying in the year of hip revision surgery was based on the mortality in the year of revision estimated by Pennington et al. 121 However, we followed Pennington et al. 119 in using the same model to predict mortality after revision and primary knee arthroplasty.
- For both TKA and THA, we followed Pennington et al. 119,121 by capping mortality in the year of revision at a maximum of 10% above all-cause mortality to avoid extrapolating very high mortality rates to very old patients, who were generally outside the sample used to estimate mortality rates. We also allowed for the additional mortality associated with revisions taking place within 12 months of primary arthroplasty (see Online Supplement 15).
- The costs of re-admissions and ambulatory consultations were taken into account for all health states and in all years of the model.
- We followed the assumption used in Chapter 5 that QALYs in the year of revision were equal to the average of the before-revision utility and the after-revision utility. Although PROMs specifically includes data before revision and 6 months afterwards, calculating QALYs as the average of these utilities allows for the fact that people are likely to have had a quality of life similar to the prerevision utility in the months leading up to revision surgery and experience utility similar to that observed 6 months after revision later in the year.
Revision rates, costs in the no arthroplasty arm and the rate of change in EQ-5D utility with age or over time were the same as those used in Chapter 5, and mortality rates were also based on those by Pennington et al. 119,121 (see Online Supplement 12). In the hip model, the cost of ambulatory consultations was also based on the same models used in Chapter 5. 113 The following parameters were re-estimated (see Table 39):
- Mapping models predicting preoperative EQ-5D score based on preoperative Oxford Hip and Knee Scores were re-estimated using the PROMs/HES extract, replacing the published algorithms that were used previously. 126,134 Re-estimating the mapping equations enabled us to consider non-linear relationships between EQ-5D scores and Oxford Hip and Knee Scores and evaluate how EQ-5D utility varies with patients’ age and sex.
- Models predicting EQ-5D utility 6 months after primary arthroplasty were re-estimated using PROMs/HES data because this updated data set includes a larger sample and patients’ exact ages.
- The EQ-5D utilities before and after revision were re-estimated using PROMs/HES data because this data set enables utility before and after revisions to be linked to the Oxford Hip and Knee Scores measured before patients’ primary arthroplasty procedures and provides a far larger number of revisions than KAT or EPOS. 66
- The costs of primary and revision arthroplasty were re-estimated using PROMs/HES data because this data set is very large and up to date and provides sufficient information to use the NHS Payment Grouper,152 thereby avoiding the additional assumptions that were necessary to estimate the cost of re-admissions using KAT, COASt or published estimates. 66,113,131
- The cost of re-admissions was re-estimated using PROMs/HES data because this large, up-to-date data set enables the Oxford Hip and Knee Scores measured before patients’ primary arthroplasty procedures to be linked to all subsequent re-admissions. This enabled the cost of re-admissions to be included in the hip model; such costs were excluded from the analyses described in Chapter 5 as there were no available data. Although re-admission costs following TKA could be estimated from KAT and COASt, PROMs/HES provided a substantially larger and more up-to-date data set.
- The costs of community and outpatient visits were re-estimated for TKA in all years and for THA in year 1 to exclude the cost of re-admissions, which were now captured separately. The cost of community and outpatient visits beyond year 1 in THA patients continued to be based on published estimates. 113 The methods and assumptions used to estimate the cost of community and outpatient visits were the same as those described in Chapter 5, Costing analyses.
The next section describes how the PROMs/HES data were manipulated prior to conducting these regression analyses. Additional details on the methods and results of each regression analysis are given in Online Supplement 15.
Methods for manipulating and analysing NHS Patient-Reported Outcome Measures/Hospital Episode Statistics linked data
Outline of the approach
We obtained the complete NHS PROMs data set for all patients undergoing hip and knee arthroplasty between April 2009 and October 2015, as well as admitted patient care (APC) data from HES. These two data sets were linked and manipulated to create data sets on preoperative and 6-month utility, costs of primary arthroplasty, costs of revision arthroplasty, utility before and after revisions, and the costs of re-admissions related to arthroplasty. Hospital re-admissions were defined as related to arthroplasty if they took place within 30 days of primary arthroplasty, or if the patient had a primary diagnosis of hip or knee arthritis, a procedure code relating to the hip or knee joint, or a primary diagnosis of an infection commonly associated with hip or knee arthroplasty.
We used the NHS Local Payment Grouper for 2014/15 152 and the corresponding tariffs from the National Schedule 2014/15 153 to derive the costs for primary and revision arthroplasty as well as for relevant re-admissions. We assumed that all primary and revision arthroplasty procedures were elective. For re-admissions, we distinguished between elective and non-elective admissions. Further details on data cleaning, data manipulation and handling of missing data are provided in Online Supplement 15.
The resulting data sets were used to estimate inputs for the Markov models. They included 309,001 primary knee arthroplasty procedures, 286,812 primary hip arthroplasty procedures, 3403 knee revisions and 2346 hip revisions. We also identified 171,459 relevant admissions, of which 75,803 took place within the first 30 days, 41,583 were for patients who had a relevant primary diagnosis, 6613 had a relevant procedure code and 83,774 were for patients who had a diagnosis of infection.
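The definition of a related re-admission given above is a logical OR of four criteria, which can be sketched as below. The field names are hypothetical and do not correspond to HES variable names, and treating "within 30 days" as 30 days or fewer is an assumption. The sketch also adds up the four counts reported in the text.

```python
from dataclasses import dataclass


@dataclass
class Readmission:
    # All field names are hypothetical, chosen for illustration only.
    days_after_primary: int
    arthritis_primary_diagnosis: bool
    joint_procedure_code: bool
    infection_primary_diagnosis: bool


def is_related(adm: Readmission) -> bool:
    """True if the re-admission meets at least one of the four criteria."""
    return (
        adm.days_after_primary <= 30  # assumption: day 30 is included
        or adm.arthritis_primary_diagnosis
        or adm.joint_procedure_code
        or adm.infection_primary_diagnosis
    )


# Counts reported in the text for each criterion and for all relevant admissions.
criterion_counts = [75_803, 41_583, 6_613, 83_774]
total_relevant = 171_459
overlap_excess = sum(criterion_counts) - total_relevant
print(sum(criterion_counts), overlap_excess)
```

The four criterion counts sum to more than the total number of relevant admissions, which is consistent with some admissions meeting more than one criterion.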
Regression analyses
Regression models were estimated in Stata® version 14 using the same methods as described in Chapter 5, Regression analyses. Details of the specific methods used for each regression analysis are described in Online Supplement 15. Variance–covariance matrices for the regression models are available from the corresponding author on request.
Presentation of results and analysis of uncertainty
The same set of hypothetical individuals with different combinations of age, sex and clinical tool score were run through the revised models using mean values for all parameters and using PSA. PSA and calculation of 95% CrIs around the threshold were conducted in the same way as for Chapter 5, generating 2000 estimates of costs and QALYs with and without arthroplasty for each of the 260 hypothetical individuals in both the knee and the hip models. Plots of the probability that arthroplasty is cost-effective against the OHSs and OKSs were generated as described previously (see Chapter 5, Presentation of results and analysis of uncertainty). The cost-effectiveness acceptability curves (CEACs) shown in Online Supplement 16 were generated using the same methods but varying the ceiling ratio.
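As a rough illustration of how PSA output can be turned into a probability of cost-effectiveness and a CEAC, the sketch below counts the share of simulated draws with positive incremental net monetary benefit at each ceiling ratio. The distributions used to generate the 2000 draws, the ceiling ratios and the function names are hypothetical; the report's own draws come from its Markov models.

```python
import random


def probability_cost_effective(draws, ceiling_ratio):
    """Share of PSA draws in which arthroplasty has positive net monetary benefit.

    `draws` is a list of (incremental cost, incremental QALYs) pairs.
    """
    favourable = sum(
        1 for d_cost, d_qalys in draws if ceiling_ratio * d_qalys - d_cost > 0
    )
    return favourable / len(draws)


def ceac(draws, ceiling_ratios):
    """Probability of cost-effectiveness at each ceiling ratio."""
    return {r: probability_cost_effective(draws, r) for r in ceiling_ratios}


# Hypothetical PSA output for one hypothetical individual: 2000 draws.
rng = random.Random(0)
draws = [(rng.gauss(3000, 1500), rng.gauss(0.5, 0.3)) for _ in range(2000)]
curve = ceac(draws, [0, 10_000, 20_000, 30_000])
for ratio, prob in curve.items():
    print(ratio, round(prob, 3))
```

Repeating the calculation for each hypothetical individual at a fixed ceiling ratio gives the probability plotted against the Oxford score, whereas varying the ceiling ratio for one individual gives a CEAC.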
We conducted the following sensitivity analyses:
- taking a 5-, 20- or 60-year time horizon (cf. 10 years in the base-case analysis)
- assuming that EQ-5D utility without TJA worsens by 0.025 per year (see Chapter 5, Presentation of results and analysis of uncertainty)168
- assuming that EQ-5D without TJA increases by 0.115 per year in the first year of the model and then follows an age-related decline after that169
- assuming that patients accrued no costs in the absence of arthroplasty
- halving the cost in the absence of arthroplasty
- doubling the cost in the absence of arthroplasty
- discounting QALYs at 1.5% and costs at 3.5%
- no discounting.
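The discounting scenarios in the list above can be illustrated with a short sketch. It assumes annual discounting with the first model year left undiscounted, which is a common convention but is not confirmed by the text; the yearly QALY and cost streams are hypothetical.

```python
def discounted_total(yearly_values, rate):
    """Present value of a yearly stream, with year 0 left undiscounted (assumption)."""
    return sum(value / (1 + rate) ** year for year, value in enumerate(yearly_values))


# Hypothetical 10-year streams for one patient.
qalys_per_year = [0.7] * 10
costs_per_year = [100.0] * 10

print(discounted_total(qalys_per_year, 0.0))    # no discounting
print(discounted_total(qalys_per_year, 0.015))  # QALYs discounted at 1.5%
print(discounted_total(costs_per_year, 0.035))  # costs discounted at 3.5%
```

A lower discount rate for QALYs than for costs gives more weight to health gains accruing in later years, so this scenario would be expected to favour an intervention whose benefits persist over the time horizon.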
Results
Effect of Oxford Knee Score on cost-effectiveness of total knee arthroplasty
Regression analyses demonstrated that the preoperative OKS had a statistically significant effect on preoperative EQ-5D utility, EQ-5D utility 6 months after surgery, EQ-5D utility before and after revision surgery, the cost of primary arthroplasty, re-admission costs in year 1 and subsequent years, ambulatory costs in year 1 and in subsequent years, and re-admission costs and ambulatory costs > 1 year after revision surgery (p < 0.05) (see Online Supplement 15). However, the OKS had no significant impact on the cost of revision surgery or the cost of re-admissions or ambulatory consultations in the year of revision surgery (p > 0.05) (see Online Supplement 15 for results of all regression models).
The regression models predicting preoperative and 6-month EQ-5D utility were re-estimated using the PROMs/HES extract (Figure 34 and see Online Supplement 15). The model of 6-month EQ-5D produced predictions extremely similar to those of the model used in Chapter 5 (see Figures 14 and 34). However, whereas Chapter 5 used a published linear model of the relationship between the OKS and EQ-5D that included baseline and postoperative measurements,134 the analyses described in this chapter used the PROMs/HES extract to assess how preoperative EQ-5D varies with age and gender as well as preoperative OKS and to explore non-linear functions. The model selection process suggested that prediction accuracy was optimised by a Tobit model including polynomials for OKS and age. This model predicted that EQ-5D utility would rise steeply as OKSs rose from 40 to 48 (see Figure 34). The difference between preoperative and 6-month EQ-5D utility was predicted to be negative for patients with a preoperative OKS of 45–46 or higher (depending on age). In contrast, the regression functions used in Chapter 5 predicted the change in EQ-5D utility to be negative for patients with an OKS of 41–44 or higher (see Figure 14). Although the observed data suggest that EQ-5D utility rises more slowly as the OKS increases from 40 to 48 than is predicted by the Tobit model, the models used in this chapter accurately predict that the change in EQ-5D after knee arthroplasty is negative for patients with a preoperative OKS of 47, but positive for patients with a preoperative OKS of ≤ 44 (see Figure 34).
FIGURE 34.
Comparison of observed mean EQ-5D utility in PROMs/HES for patients with different preoperative OKSs against the predictions for the Tobit regression functions used in the Markov model.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig34.png)
The Markov model predicted the costs and QALYs that each hypothetical individual would accrue over a 10-year period with and without arthroplasty, taking into account the change in EQ-5D utility predicted in the models shown in Figure 34, operative mortality, the cost of primary and revision surgery, inpatient and ambulatory care required by patients with and without arthroplasty and the changes in quality of life that occur before/after revisions and as patients age.
The difference in QALYs between patients with and without arthroplasty was highest for patients with a preoperative OKS of 6 or 7 and declined steadily with increasing OKSs; averaged across all ages, arthroplasty gained 2.99 QALYs per patient with an OKS of 0, 4.05 QALYs per patient with an OKS of 6 and 0.19 QALYs per patient with an OKS of 44. TKA was predicted to increase QALYs for all patients with an OKS of below 44–46, depending on age.
The difference in cost between patients with and without arthroplasty was lowest for patients with an OKS of 16–19, but was markedly higher for patients with lower or higher than average OKSs. The model predicted that TKA was less costly than no surgery for 50-year-old men with an OKS of between 15 and 18 and for 50-year-old women with an OKS of between 10 and 21. Averaged across all ages, the difference in cost was £6457 per patient with an OKS of 0, £1898 per patient with an OKS of 16 and £6481 per patient with an OKS of 48.
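The incremental cost-effectiveness ratio (ICER) that links these QALY and cost differences can be sketched as follows. This is a minimal illustration rather than the Markov model itself: the function name `icer` is hypothetical, the OKS 0 inputs are the rounded all-age averages quoted above, and the inputs used for the "Dominant" and "Dominated" cases are made-up values chosen only to show the classification.

```python
from typing import Union


def icer(delta_cost: float, delta_qalys: float) -> Union[float, str]:
    """Return cost per QALY gained, or a dominance label.

    delta_cost and delta_qalys are differences between arthroplasty
    and no arthroplasty (arthroplasty minus comparator).
    """
    if delta_qalys > 0:
        if delta_cost < 0:
            return "Dominant"  # more QALYs at a lower cost
        return delta_cost / delta_qalys
    if delta_cost > 0:
        return "Dominated"  # no QALY gain at a greater cost
    # Lower cost with no QALY gain needs separate interpretation.
    raise ValueError("ICER not defined for this combination")


# All-age averages for an OKS of 0 quoted in the text (rounded inputs).
print(round(icer(6457, 2.99)))

# Hypothetical inputs, for illustration of the labels only.
print(icer(-100.0, 3.0))
print(icer(6481, -0.1))
```

With the rounded inputs for an OKS of 0, the ratio comes out at roughly £2160 per QALY gained, which is close to the all-age average reported for that score in Table 57; small differences are to be expected because the published inputs are rounded.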
As in Chapter 5, the results demonstrate that TKA is highly cost-effective for the vast majority of patients who currently undergo surgery (Table 57), although ICERs rose as the OKS increased from 4 to 48. The decision grid shows only selected OKS values, focusing on values in the region of the threshold; TKA costs < £20,000 per QALY gained for all age groups at the OKS values omitted from the grid. As a result of re-estimating models using PROMs/HES data, TKA is more cost-effective (i.e. has lower ICERs) in almost all groups compared with the analyses described in Chapter 5 (see Table 40). In particular, TKA was dominated by no arthroplasty (i.e. produced fewer QALYs at a greater cost) only for patients with an OKS of 44–46 or greater (cf. an OKS of 42–45 in Chapter 5). The current analysis also estimated that there was only a narrow band of one or two OKS values at which TKA improves patients’ health but is not cost-effective (the areas shown in dark and medium green in Table 57).
Preoperative OKS (selected values only) | Cost (age 50 years) | Cost (age 60 years) | Cost (age 70 years) | Cost (age 80 years) | Cost (age 90 years) | Cost (average) |
---|---|---|---|---|---|---|
0 | £1979 | £1693 | £1939 | £2808 | £5257 | £2156 |
10 | £41 | £247 | £577 | £1195 | £2656 | £646 |
20 | Dominant | £278 | £911 | £2094 | £4986 | £1015 |
21 | Dominant | £355 | £1035 | £2325 | £5523 | £1150 |
24 | £288 | £711 | £1543 | £3190 | £7473 | £1693 |
28 | £1158 | £1412 | £2393 | £4401 | £9907 | £2582 |
29 | £1404 | £1596 | £2589 | £4625 | £10,252 | £2782 |
30 | £1643 | £1771 | £2765 | £4802 | £10,460 | £2962 |
31 | £1870 | £1936 | £2922 | £4934 | £10,547 | £3119 |
32 | £2083 | £2090 | £3060 | £5033 | £10,547 | £3256 |
33 | £2285 | £2236 | £3187 | £5111 | £10,505 | £3382 |
34 | £2479 | £2378 | £3311 | £5189 | £10,467 | £3504 |
35 | £2675 | £2523 | £3446 | £5288 | £10,485 | £3637 |
36 | £2887 | £2683 | £3605 | £5433 | £10,618 | £3796 |
37 | £3132 | £2871 | £3811 | £5658 | £10,935 | £4003 |
38 | £3437 | £3109 | £4095 | £6014 | £11,547 | £4291 |
39 | £3846 | £3428 | £4509 | £6582 | £12,642 | £4710 |
40 | £4439 | £3885 | £5146 | £7523 | £14,615 | £5357 |
41 | £5377 | £4590 | £6205 | £9199 | £18,462 | £6427 |
42 | £7061 | £5787 | £8197 | £12,653 | £27,668 | £8421 |
43 | £10,813 | £8158 | £12,934 | £22,592 | £68,515 | £13,060 |
44 | £25,078 | £14,537 | £34,786 | £172,396 | Dominated | £32,707 |
45 | Dominated | £67,573 | Dominated | Dominated | Dominated | Dominated |
46 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
47 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
48 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
Threshold (95% CrI) | 43 (43 to 48) | 44 (43 to 48) | 43 (43 to 44) | 42 (42 to 43) | 41 (40 to 42) | 43 (43 to 44) |
The economic threshold OKS (i.e. the highest OKS at which TKA costs < £20,000 per QALY gained, shown in light blue in Table 57) was 44 (95% CrI 43 to 48) for 60-year-olds and 41 (95% CrI 40 to 42) for 90-year-olds (see Table 57). If a single threshold were to be set across all ages, a threshold of 43 (95% CrI 43 to 44) would be the most cost-effective value to choose. This is somewhat higher than the threshold estimated in Chapter 5 (40, 95% CrI 39 to 42) (see Table 40). Overall, TKA costs < £20,000 per QALY gained compared with no arthroplasty for 99.9% of patients who currently undergo surgery and costs < £5000 per QALY gained for 96.6% of patients who currently undergo surgery.
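The definition of the economic threshold can be illustrated by reading it off the ICERs in Table 57. The sketch below is an assumption-laden simplification: the helper names are hypothetical, only the rows near the threshold are included, and the published thresholds come from the full model and PSA rather than from a table lookup.

```python
CEILING = 20_000  # £ per QALY gained

# ICERs copied from Table 57 for OKS values near the threshold.
average_icers = {
    40: 5357, 41: 6427, 42: 8421, 43: 13060, 44: 32707,
    45: "Dominated", 46: "Dominated", 47: "Dominated", 48: "Dominated",
}
age_60_icers = {
    42: 5787, 43: 8158, 44: 14537, 45: 67573,
    46: "Dominated", 47: "Dominated", 48: "Dominated",
}


def is_cost_effective(icer, ceiling=CEILING):
    """True if the ICER is below the ceiling or arthroplasty is dominant."""
    if icer == "Dominant":
        return True
    if icer == "Dominated":
        return False
    return icer < ceiling


def economic_threshold(icers, ceiling=CEILING):
    """Highest score at which arthroplasty costs less than the ceiling per QALY."""
    eligible = [score for score, value in icers.items()
                if is_cost_effective(value, ceiling)]
    return max(eligible) if eligible else None


print(economic_threshold(average_icers))  # 43
print(economic_threshold(age_60_icers))   # 44
```

Both values match the thresholds reported in the final row of Table 57 for the all-age average and for 60-year-olds.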
The effect of age on cost-effectiveness and thresholds was markedly less pronounced than in the analyses described in Chapter 5. In addition, although ICERs were generally slightly lower for women than for men (see Online Supplement 16), the economic threshold was identical for men and women.
However, there was a modest amount of uncertainty around the economic thresholds. The 95% CrI demonstrated that we can be 95% confident that the economic threshold for all ages combined lies between 43 and 44. The probability that TKA is cost-effective varied with age, OKS (Figure 35) and how much the NHS is willing or able to pay per QALY gained (see Online Supplement 16). For 70-year-olds, we can be > 99% confident that TKA is cost-effective at a £20,000-per-QALY ceiling ratio at an OKS of ≤ 42, although this falls to 98% for patients with an OKS of 43, 18% for patients with an OKS of 44 and 2% for patients with an OKS of 45. Substantially greater uncertainty is observed for patients aged 50 or 90. CEACs for men and women at different ages are shown in Online Supplement 16.
FIGURE 35.
Effect of preoperative OKS on the probability that TKA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig35.png)
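One standard way to turn PSA output into the probabilities plotted in a figure of this kind is to count the share of simulations with positive net monetary benefit at the ceiling ratio. The sketch below assumes that approach; the source does not state the exact calculation used, the function name is hypothetical and the four draws are invented purely for illustration.

```python
def prob_cost_effective(draws, ceiling=20_000):
    """Share of PSA draws in which arthroplasty has positive net monetary benefit.

    draws: iterable of (incremental_cost, incremental_qalys) pairs,
    one pair per PSA simulation.
    """
    draws = list(draws)
    favourable = sum(
        1 for d_cost, d_qalys in draws
        if ceiling * d_qalys - d_cost > 0  # net monetary benefit > 0
    )
    return favourable / len(draws)


# Hypothetical draws (not taken from the report).
hypothetical_draws = [(5000, 0.40), (5200, 0.30), (4800, 0.20), (5100, 0.10)]

print(prob_cost_effective(hypothetical_draws))          # 0.5
print(prob_cost_effective(hypothetical_draws, 30_000))  # 0.75
```

Repeating the calculation over a range of ceiling ratios would give the points of a cost-effectiveness acceptability curve for one age and score group.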
Effect of Oxford Hip Score on the cost-effectiveness of total hip arthroplasty
Regression analyses demonstrated that the preoperative OHS had a statistically significant effect on preoperative EQ-5D utility, EQ-5D utility 6 months after surgery, EQ-5D utility before and after revision surgery, cost of primary arthroplasty and revision surgery, re-admission costs in year 1 and subsequent years, ambulatory costs in year 1 and re-admission costs > 1 year after revision surgery (p < 0.05) (see Online Supplement 15). However, the OHS had no significant impact on re-admission costs in the year of revision (p > 0.05) (see Online Supplement 15 for the results of all regression models).
The Tobit models predicting EQ-5D before and 6 months after hip arthroplasty showed similar trends to those for knee arthroplasty (Figure 36). In particular, the polynomial function predicting how preoperative EQ-5D utility varies with the preoperative OHS predicted a sharp increase in EQ-5D as the OHS increases from 40 to 48. The models predicted that arthroplasty would increase EQ-5D utility for patients with a preoperative OHS of ≤ 45–46 (depending on age and gender) and would decrease EQ-5D utility for patients with an OHS of ≥ 46–47. By contrast, within the observed data, the mean EQ-5D change was positive for patients with an OHS of ≤ 47 and negative for patients with an OHS of 48, which suggests that the estimated Tobit models may underestimate thresholds for THA (see Figure 36).
FIGURE 36.
Comparison of observed mean EQ-5D utility in PROMs/HES for patients with different preoperative OHSs against the predictions for the Tobit regression functions used in the Markov model.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig36.png)
Following the trends observed for change in EQ-5D utility (see Figure 36), the QALY gain from THA was greatest for patients with an OHS of 5 or 6 and declined steadily with increasing OHSs; the average patient with an OHS of 5 gained 5.28 QALYs, compared with 4.33 for patients with an OHS of 0 and 0.05 for patients with an OHS of 46. The model predicted that, on average, THA would worsen health by up to 0.44 QALYs for patients with an OHS of 47 or 48, and for 80- or 90-year-olds with an OHS of 46.
The difference in costs between patients with and without THA was smallest for patients with an OHS of 1 (£1975 per patient, averaged across all ages). THA was predicted to be less costly than conducting no arthroplasty surgery for 50-year-old women with an OHS of 1, but was more costly for all other groups. The incremental cost of THA rose gradually as OHS increased from 1 to 48; the average incremental cost across all ages was £5113 for patients with an OHS of 48.
The results demonstrate that THA is highly cost-effective for the vast majority of patients who currently undergo surgery (Table 58). As was observed for TKA, ICERs for patients with high Oxford Hip and Knee Scores were markedly lower than in the analyses described in Chapter 5 (see Table 45). For patients aged ≤ 70 years, THA cost < £20,000 per QALY gained when the OHS was ≤ 45. The economic threshold reduced to 43 (95% CrI 43 to 44) for 90-year-olds. The economic threshold ignoring age and gender was 45 (95% CrI 44 to 45), markedly higher than the threshold of 42 estimated in Chapter 5 (see Table 45).
Preoperative OHS (selected values only) | Cost (age 50 years) | Cost (age 60 years) | Cost (age 70 years) | Cost (age 80 years) | Cost (age 90 years) | Cost (average) |
---|---|---|---|---|---|---|
0 | £92 | £327 | £501 | £926 | £1945 | £533 |
10 | £809 | £792 | £820 | £1187 | £2011 | £923 |
18 | £1413 | £1330 | £1368 | £1776 | £2899 | £1491 |
20 | £1640 | £1537 | £1580 | £2058 | £3369 | £1724 |
21 | £1760 | £1651 | £1696 | £2214 | £3632 | £1852 |
24 | £2148 | £2009 | £2065 | £2710 | £4479 | £2257 |
28 | £2584 | £2409 | £2472 | £3253 | £5410 | £2705 |
29 | £2656 | £2474 | £2537 | £3334 | £5548 | £2776 |
30 | £2711 | £2524 | £2586 | £3391 | £5646 | £2828 |
35 | £2833 | £2634 | £2683 | £3474 | £5752 | £2932 |
40 | £3412 | £3169 | £3228 | £4188 | £6991 | £3529 |
41 | £3786 | £3515 | £3590 | £4695 | £7911 | £3929 |
42 | £4397 | £4079 | £4186 | £5557 | £9533 | £4590 |
43 | £5474 | £5066 | £5247 | £7173 | £12,765 | £5777 |
44 | £7636 | £7027 | £7421 | £10,842 | £21,196 | £8244 |
45 | £13,344 | £12,059 | £13,428 | £24,427 | £76,368 | £15,330 |
46 | £50,387 | £40,160 | £64,676 | Dominated | Dominated | £97,787 |
47 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
48 | Dominated | Dominated | Dominated | Dominated | Dominated | Dominated |
Threshold (95% CrI) | 45 (44 to 46) | 45 (44 to 46) | 45 (44 to 46) | 44 (44 to 45) | 43 (43 to 44) | 45 (44 to 45) |
Age had little impact on ICERs and economic threshold compared with the estimates from Chapter 5. Furthermore, gender had very little impact on ICERs and the economic threshold was the same for men and women (see Online Supplement 16).
Total hip arthroplasty was found to cost < £20,000 per QALY gained compared with no arthroplasty for 99.96% of patients who currently undergo arthroplasty surgery and to cost < £5000 per QALY gained for 99.16% of patients.
The PSA demonstrated that we can be 95% confident that the economic threshold, ignoring age and sex, is between 44 and 45. For 70-year-olds, the probability that THA is cost-effective was > 95% at an OHS of ≤ 44, 90% at an OHS of 45 and 5% at an OHS of 46 (Figure 37). There was markedly greater uncertainty for patients aged 80 and 90 years.
FIGURE 37.
Effect of preoperative OHS on the probability that THA is cost-effective at a £20,000-per-QALY ceiling ratio.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig37.png)
Sensitivity analyses
Ten sensitivity analyses were conducted to evaluate the sensitivity of the results to changes in time horizon and discount rates and to the assumptions made around EQ-5D utility and costs for patients who do not undergo TJA (Table 59 and see Online Supplement 16). These analyses demonstrated that the results are reasonably robust to changes in all of the key assumptions. Halving the time horizon from 10 to 5 years reduced the economic threshold by 1 point, whereas extending the time horizon had no impact. Reducing the discount rates used to adjust for time preference or drastically changing the costs assumed to be accrued by patients in the absence of arthroplasty had little or no impact.
Sensitivity analysis | Economic threshold ignoring age and sex: knee arthroplasty | Economic threshold ignoring age and sex: hip arthroplasty |
---|---|---|
Base-case analysis | 43 | 45 |
5-year time horizon | 42 | 44 |
20-year time horizon | 43 | 45 |
60-year (lifetime) time horizon | 43 | 45 |
EQ-5D utility without TJA worsens by 0.025 per year | 46 | 48 |
EQ-5D utility without TJA increases by 0.115 in the first year and follows age-related decline thereafter | 39 | 41 |
Assuming that patients accrued no costs in the absence of arthroplasty | 43 | 44 |
Halving the cost in the absence of arthroplasty | 43 | 45 |
Doubling the cost in the absence of arthroplasty | 43 | 45 |
Discounting QALYs at 1.5% and costs at 3.5% | 43 | 45 |
No discounting | 43 | 45 |
Assuming that EQ-5D utility will decrease by 0.025 per year (a decrease 4–7 times larger than was assumed in the base-case analysis) markedly increased the threshold Oxford Hip or Knee Score at which TJA ceases to be cost-effective; indeed, in this analysis, THA was cost-effective for all patients < 80 years of age, regardless of OHS. Conversely, making an extremely optimistic assumption about the efficacy of the first year of non-operative management (assuming that EQ-5D utility would increase by 0.115 in the first year)169 reduced the economic threshold for TKA to 39 and reduced the economic threshold for THA to 41. As discussed previously, this figure is conservative because the non-surgical treatment used by Skou et al.169 was relatively intensive.
Discussion
Summary of findings
Findings of the economic evaluation
The results presented in this chapter suggest that TJA is cost-effective for > 99.9% of patients who currently undergo surgery if the NHS is willing to pay £20,000 per QALY gained. Averaging across men and women of all ages, it is cost-effective to conduct TKA on patients with an OKS of ≤ 43 (95% CrI 43 to 44) and to conduct THA on patients with an OHS of ≤ 45 (95% CrI 44 to 45).
These thresholds are slightly higher than those estimated in Chapter 5, largely owing to re-estimation of the models of preoperative EQ-5D utility to take account of age and gender and to allow for non-linear relationships between Oxford Hip and Knee Scores and EQ-5D utility. The values presented in this chapter make full use of the PROMs/HES linked data, which cover 608,170 knee and hip arthroplasty procedures, including around 67.7% of the 897,619 operations conducted in England between April 2009 and October 2015. 162,182–184 Because the analyses described in Chapter 5 used small data sets for several key parameters (including the cost of primary arthroplasty), the economic thresholds and cost-effectiveness results presented in this chapter should be used in decision-making in preference to the results described in Chapter 5 for the OKS and OHS.
The thresholds for TKA were also substantially higher than those estimated previously by Dakin et al.,2 who used KAT data and estimated the economic threshold to be 39 for patients of ASA grades 1 and 2 and 34 for patients of ASA grade 3. As discussed in Chapter 5, Summary of the results, this difference may arise from the substantially larger number of patients with high OKSs in PROMs data, which enabled us to take account of non-linear relationships between OKS, costs and quality of life. Nonetheless, the analyses described in this chapter confirm the earlier findings described in Chapter 5 and by Dakin et al.2 that suggest that TKA is cost-effective even for patients with an OKS as high as 30–39.
The economic thresholds are slightly higher than the absolute thresholds owing to differences in the aims and principles of each analysis. The clinical thresholds are based on the probability that patients will achieve a MIC in Oxford Hip or Knee Score following surgery, which was defined as 7 for the OHS and 8 for the OKS. By contrast, the economic threshold is based on the difference in mean QALYs and mean cost between patients who undergo arthroplasty and those who do not. Although the economic analyses took account of patients’ life expectancy, surgical mortality and the cost of primary and revision arthroplasty, the results are primarily driven by the mean change in EQ-5D utility following arthroplasty. The PROMs/HES data demonstrate that, on average, arthroplasty increases patients’ EQ-5D utility at all Oxford Hip and Knee Scores of ≤ 46. Although patients with Oxford Hip and Knee Scores between 42 and 46 cannot have a 7-point increase in score following surgery, 73.5% (845/1149) of these patients nonetheless had an increase in Oxford Hip or Knee Score and 55.4% (453/817) of such patients had an increase in EQ-5D utility; among patients with improvements in EQ-5D, the mean improvement was 0.215 following arthroplasty. Only 16.9% (138/817) had a decrease in EQ-5D utility; among these patients, the mean change was –0.181. Because TKA costs around £5642 per procedure,153 providing there are no complications and the benefits of surgery last 10 years, EQ-5D utility needs to improve by only 0.032 after arthroplasty in order for TKA to cost < £20,000 per QALY gained [0.032 = £5642/£20,000, all divided by 8.72 years (10 years, discounted at 3.5% per annum)].
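The break-even calculation at the end of the preceding paragraph can be reproduced directly. In this sketch the function name is hypothetical, and the figure of 8.72 discounted years is taken as given from the text rather than re-derived.

```python
def breakeven_utility_gain(cost, ceiling, discounted_years):
    """Utility gain needed for the cost per QALY gained to equal the ceiling.

    cost: cost of surgery (£)
    ceiling: willingness to pay per QALY gained (£)
    discounted_years: discounted duration of benefit (years)
    """
    qalys_needed = cost / ceiling           # QALYs that must be gained in total
    return qalys_needed / discounted_years  # spread over the benefit period


gain = breakeven_utility_gain(cost=5642, ceiling=20_000, discounted_years=8.72)
print(round(gain, 3))  # 0.032
```

The required gain of about 0.032 is small relative to the mean improvement of 0.215 observed among patients whose EQ-5D utility improved, which is why the economic threshold sits close to the top of the score range.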
However, even though TJA is cost-effective for patients with high Oxford Hip and Knee Scores, many patients may choose not to undergo surgery as they feel that the risks of surgery may outweigh the benefits. In particular, most patients are likely to be risk averse and prefer to maintain their current level of quality of life rather than undergo surgery that will, on average, improve their health but that carries a risk that their joint will deteriorate or that they will die or have a serious cardiovascular complication as a result of surgery. By contrast, the NHS as a whole spreads the risk across the population of patients (including the majority who benefit and the minority who are worse off after surgery), and so may be able to ignore risk aversion in its resource allocation decisions.
Nonetheless, the analyses described in this chapter demonstrate that there is no economic justification for restricting access to TKA for patients with an OKS of ≤ 43, or restricting access to THA for patients with an OHS of ≤ 45. The results also demonstrate that there is no economic justification for using different thresholds for men and women and suggest that thresholds based on cost-effectiveness vary little with age.
Strengths/limitations
Strengths and limitations of the economic modelling
The economic evaluation used the best available UK evidence, including PROMs/HES data on > 608,170 primary arthroplasty procedures and trial data sets following patients for up to 12 years after primary arthroplasty. However, some model inputs, including ambulatory costs and the long-term changes in EQ-5D utility, were based on smaller data sets such as KAT and COASt.
The biggest shortcoming of the analysis is the limited data on patients who have not undergone arthroplasty. No UK longitudinal data following patients who have not undergone arthroplasty were available and the only international data sets identified used the WOMAC. 185,186 Within this analysis, we therefore assumed that the costs accrued in the absence of arthroplasty would be the same as those accrued by COASt participants in the year before surgery, and assumed that EQ-5D utility would decrease at the same rate as is observed in the general population samples. In practice, many osteoarthritis patients are likely to experience a marked worsening of symptoms in the absence of arthroplasty; taking account of this trend would raise the economic threshold further. A sensitivity analysis allowing for EQ-5D utility worsening in patients who have not undergone arthroplasty estimated economic thresholds to be 2–3 points higher than the base case. However, other sensitivity analyses showed that thresholds would be substantially lower if non-surgical management markedly improved patients’ quality of life and showed that even substantial changes to the assumptions about the costs accrued by patients without arthroplasty had a negligible impact on the conclusions.
For simplicity, the economic evaluation compared immediate TJA with a strategy of no arthroplasty for 10 years. The choice of a 10-year time horizon is arbitrary, although sensitivity analyses demonstrated that this had little impact on the results. In practice, many patients whose symptoms are considered not sufficiently severe to warrant surgery at the present time may have surgery in the future. However, given the shortage of data on changes in Oxford Hip and Knee Scores in the absence of arthroplasty, the current analysis provides a convenient assessment of the cost-effectiveness of arthroplasty that enables economic thresholds to be estimated. We also excluded the cost of assessing patients to determine whether or not they are appropriate candidates for surgery; the impact of including these costs is evaluated in Chapter 8, which evaluates the cost-effectiveness of referral to surgical assessment compared with no referral.
The analyses took an NHS perspective and excluded the cost of medications, personal care, nursing homes, convalescence, equipment, home modifications, lost productivity and informal care. It is likely that broadening the perspective to include these wider costs and taking account of any deterioration in health in the absence of arthroplasty would improve the cost-effectiveness of TJA and increase the economic threshold. Indeed, one economic evaluation187 evaluating TKA in patients aged ≥ 90 years argued that TKA would be cost-saving owing to reductions in the cost of nursing home placement.
The PSA suggested that there was very little parameter uncertainty around the economic threshold. The 95% CrIs take account of uncertainty around all model inputs and allow for correlations between regression coefficients from the same regression model; these CrIs are narrow because most of the parameters are estimated on PROMs/HES data on around 300,000 operations. However, the PSA does not take account of uncertainty around the choice of regression function, uncertainty around model assumptions/structure or methodological uncertainty. 188
This analysis also cannot evaluate the impact of BMI on thresholds because it was not possible to obtain NJR data linked to PROMs/HES within the time frame of this project, and the only available BMI data came from comparatively small studies that may be underpowered to assess the impact of BMI over and above the impact of the Oxford Hip and Knee Scores, age and sex. In the absence of NJR data, we also assumed that the Oxford Hip and Knee Scores had no impact on revision rates.
Although the PROMs/HES extract includes data on > 608,018 primary arthroplasty procedures, only 0.5% (2884/608,018) of the sample had Oxford Hip or Knee Scores of > 40. Furthermore, only 0.5% (3250/608,018) of those in the sample were aged ≥ 90 years and 5.0% (30,437/608,018) were aged ≤ 50 years, and the published estimates of mortality and revision rates excluded patients aged ≤ 55 or > 84 years. 119,121 Results for 50- and 90-year-olds should therefore be interpreted with caution. Furthermore, the model selection process described in Chapter 5, Regression analyses, is likely to select models that give best prediction accuracy for patients with an OKS close to the mean, and place less importance on prediction accuracy for patients with high Oxford Hip and Knee Scores. This may introduce additional uncertainty around the economic threshold that is not captured within the PSA and the reported 95% CrIs. In particular, the Tobit models predicting preoperative EQ-5D utility may overestimate EQ-5D utility for those patients with Oxford Hip and Knee Scores of > 43 (see Figures 34 and 36), although the models appear to predict change in EQ-5D utility accurately.
As described in Chapter 5, the current analyses are based on UK data and may not generalise to other countries. Analyses conducted on KAT and COASt data sets and the published studies providing mortality and revision rates excluded patients who did not undergo TJA. 119,121 However, in the absence of NJR data, it was not possible to reliably identify which primary arthroplasty procedures within the PROMs/HES data set were TJA and which comprised unicompartmental knee replacement or hip resurfacing. Costs, QALYs and cost-effectiveness may differ between different types of arthroplasty and between different indications in ways that cannot be assessed using the current data.
Further research needed for economic modelling
Further research is needed on the impact of BMI and on whether or not the Oxford Hip and Knee Scores affect the rate of revision surgery. Additional analyses using NJR data linked to PROMs and HES could be used for this purpose.
Further research is needed on the costs accrued in patients who do not undergo arthroplasty and on how costs, Oxford Hip and Knee Scores and EQ-5D utility change over time in the absence of surgery.
Additional findings
We also developed additional models to map from the Oxford Hip and Knee Scores to EQ-5D utilities, which could also be applied in other settings in which OKS or OHS data are available but EQ-5D utilities are not. Coefficients for these models are given in Online Supplement 15. However, owing to the nature of the economic evaluation, we considered only models based on OKS/OHS total score, rather than responses to individual questions. As a result, these models have substantially worse prediction accuracy than those developed by Dakin et al.134 and Pinedo-Villanueva et al.,126 which mapped from dummy variables for individual questions. However, our OKS mapping model had better prediction accuracy than the simple OLS model mapping from total OKS to EQ-5D utility developed by Dakin et al.134 (MSE 0.047 vs. 0.052 in the preoperative estimation sample). The models shown in Online Supplement 15 were estimated only on preoperative data and may not perform as well in data sets that include postoperative scores. These analyses suggest that there is a non-linear relationship between OKS/OHS total score and EQ-5D utility and that age and gender have a significant effect on EQ-5D utility that is not explained by OKS/OHS. The models give good prediction accuracy, although they may overestimate utility for patients with Oxford Hip or Knee Scores of > 43.
Conclusion
The economic evaluation demonstrates that TKA is cost-effective for patients with an OKS of ≤ 43, whereas THA is cost-effective for patients with an OHS of ≤ 45. Therefore, it is not appropriate to restrict access to arthroplasty for patients with Oxford Hip or Knee Scores below these limits on cost-effectiveness grounds. The analysis also suggests that it is not cost-effective to set separate thresholds for men and women and that age has little impact on economic thresholds.
Chapter 8 Determining the outcome of using the Arthroplasty Candidacy Help Engine tool in the NHS (work package 3)
Background
Over the past decade, many clinical commissioning groups (CCGs) have used PROMs such as the Oxford Hip and Knee Scores to set thresholds for arthroplasty. A 2014 review by the Royal College of Surgeons170 found that 31% (16/52) of CCGs that they reviewed imposed an OHS threshold. The thresholds used have frequently been relatively low, such as 19 and 24 on the OKS and OHS, respectively.170,171,176 The work described in Chapters 6 and 7 has shown that these thresholds are inappropriate, because patients with an OKS or OHS of 19 or 24, respectively, have a ≥ 80% chance of a good outcome, and TKA and THA would each cost < £10,000 per QALY gained for these patients.
Whereas a threshold of 24 would exclude 21% of current arthroplasty operations, avoiding referrals for patients with a < 70% chance of a good outcome would exclude only 3% of patients who currently have hip arthroplasty and 7% of those having knee arthroplasty. This demonstrates that the vast majority of arthroplasty operations that are currently done in the UK are appropriate and are on patients who have a high capacity to benefit. Furthermore, some of the patients with high Oxford Hip or Knee Scores may need arthroplasty regardless of the referral thresholds (e.g. for indications other than osteoarthritis, or severe deformity in the absence of other symptoms). However, there are likely to be many patients currently managed in general practice or at musculoskeletal hubs who have high capacity to benefit who do not currently undergo surgery; if the thresholds estimated in Chapters 6 and 7 were introduced, numbers of referrals from this population might increase.
There is a shortage of published studies and national data on the patients who do not currently have surgery, and there are no data on OKSs/OHSs for patients managed in primary care. We therefore obtained data from the musculoskeletal hub that is run from the NOC in Oxford to get an initial estimate of how the introduction of the ACHE tool and the choice of threshold may affect the number of referrals and operations, costs and health benefits.
In the Thames Valley CCGs, osteoarthritis patients cannot be referred for consideration of joint surgery unless their symptoms have a substantial impact on their quality of life and are refractory to non-surgical treatment, including advice, activity and exercise. 172,173 All patients with a BMI of ≥ 25 kg/m2 must also be offered, and be strongly encouraged to participate in, a weight-loss programme. Until October 2016, patients with an OKS of > 32 could only be listed for knee surgery if approved by two consultants. 189 In practice, nearly all patients attending the NOC hub with knee or hip symptoms are asked to complete the OKS or OHS and patients with a BMI of ≥ 40 kg/m2 cannot be listed for surgery until they have first completed a monitored weight-loss programme.
This analysis aims to:
- Evaluate how the probability of referral and undergoing surgery varies with preoperative characteristics in current clinical practice using an audit of a musculoskeletal hub.
- Estimate the number of patients who might be expected to be referred for surgical assessment or undergo arthroplasty if the ACHE tool were introduced into NHS practice.
- Assess how the economic thresholds estimated in Chapter 7 change when we take into account the additional numbers of surgical assessments that would be required if the ACHE tool were introduced into routine clinical practice and evaluate the cost-effectiveness of referring patients for surgical assessment compared with no referral (rather than the cost-effectiveness of arthroplasty vs. no arthroplasty, as was evaluated in Chapters 5 and 7).
- Assess the cost-effectiveness of using the ACHE tool with different thresholds or different probabilities of benefiting from surgery to assess osteoarthritis patients attending the musculoskeletal hub compared with current practice, from the perspective of the UK NHS.
The predictions of the impact the ACHE tool might have if it were introduced into clinical practice are, by their nature, speculative. In particular, they rely on assumptions around how the ACHE tool would be used in practice and what impact it would have on referrals. In addition, the analysis excludes patients who are not currently referred by their GPs but might, in some circumstances, be candidates for arthroplasty, because there are no data on this population. We also assumed that referral patterns across England are the same as those in Oxfordshire, because this comprised the only data set available for analysis. The figures presented in this chapter must therefore be interpreted with caution and represent initial estimates of the direction and possible magnitude of changes that could be brought about by the ACHE tool.
Methods
General approach
We conducted an audit of medical records for patients who were referred to the NOC hub with knee or hip pain. We used the results to estimate the probability that patients with different baseline characteristics are referred from the hub to an orthopaedic surgeon and the probability that such patients then undergo arthroplasty. These probabilities were then used to estimate the number of patients of different ages, genders and Oxford Hip and Knee Scores who may be referred nationally each year. The costs and QALYs estimated for each patient group within the economic evaluation described in Chapter 7 were multiplied by the number of patients anticipated to come forward based on the hub data to predict the potential impact that the ACHE tool may have on cost and health benefits, and how such costs and benefits may vary with the threshold.
In principle, the ACHE tool could be used at several points in the referral pathway. It could be used by GPs to decide if or when to refer the patient to secondary care. Oxford Hip and Knee Scores or the ACHE probability of benefit could be stated on the GP’s referral letter to help hub staff identify the next course of action, or the tool could be used during a hub consultation to inform decisions about whether or not the patient should be referred for surgical assessment. The ACHE tool could also be used at the surgical assessment to guide discussions between the surgeon and the patient about whether or not arthroplasty is appropriate. In this chapter, we assumed that the ACHE tool would be used only during face-to-face consultations at the hub for three reasons. First, discussions with GPs and hub staff suggested that it is likely to be most practical to use the ACHE tool at the hub rather than at the general practice. Second, there are no reliable data on the total number of patients consulting their GPs about hip or knee pain who are not currently referred, and no information on the distribution of Oxford Hip and Knee Scores for such patients. By contrast, we were able to obtain a set of data for patients attending the NOC hub. Third, although the ACHE tool may also be used to inform joint decision-making by the surgeon and patient during the surgical assessment, we did not attempt to model the impact of using the ACHE tool during the surgical assessment because the final decision to proceed or not with surgery is a more complex, personalised, shared decision and cannot be easily modelled given current data.
Approval
We obtained permission to analyse anonymised data from the Oxford musculoskeletal hub under the direction of the lead and clinical director of Oxford University Hospitals NHS Foundation Trust (Oxford University Hospitals research and development reference number 11603).
Analysis of data from the Nuffield Orthopaedic Centre musculoskeletal hub
The primary analysis aimed to estimate how the probability that patients who attend face-to-face consultations at the hub will be referred to secondary care varies with OKS/OHS, age and gender. Following a well-established rule of thumb,190 we aimed to collect sufficient data to have at least 10 referrals from the hub to secondary care for each of these three explanatory variables.
Two medically-qualified surgical research fellows extracted data on patients’ gender, age and OKS/OHS at the time of each patient’s first face-to-face attendance at the hub into a pre-prepared data extraction table, as well as details on the date of attendance and whether or not the patient was referred to secondary care from the hub’s electronic medical records database. The date of any surgical assessment visit, whether or not patients were listed for knee/hip arthroplasty and the date of surgery were also recorded. Free-text fields were also used to record additional information on imaging, referrals to other clinics, other surgery and other diagnoses. Additional data on OKSs were extracted from the clinical pathway database for patients who had been referred directly for surgical assessment. Data on BMI were not extracted because BMI is not part of the ACHE tool.
The following criteria were used to exclude from the analysis patients who would be unlikely to use the ACHE tool in clinical practice:
- Patients aged < 50 years because knee/hip pain in younger patients is unlikely to be caused by osteoarthritis.
- Patients for whom it was clear from the records that the symptoms were attributable to a condition other than osteoarthritis. In particular, the analysis excluded patients with rheumatoid arthritis, fracture (or arthritis secondary to fracture), gout, chondroid lesions, sports or other injuries, bone/joint infections, bursitis or quadriceps rupture. Patients requiring limb reconstruction, those with hip pain caused by previous spinal surgery and those recorded as having no mechanical symptoms were also excluded.
- Patients who had previously had arthroplasty on the joint in question.
- Patients who had attended the hub or surgical assessment before July 2015, and patients for whom medical records were inaccessible for research.
Patients referred for radiography, MRI or physiotherapy were included in the analysis, regardless of whether or not they attended a face-to-face hub assessment.
Modelling the treatment pathway
Decision tree model
We constructed a decision tree model of the treatment pathway based on a preliminary examination of the data and consultation with hub staff (Figure 38). GPs’ letters referring patients to secondary care are first reviewed by senior hub staff, who examine the patient’s radiographs, and take into account the GP’s description of the patient’s symptoms, their BMI and what treatments have been previously tried.
FIGURE 38.
Schematic of the decision tree model. P(x) indicates the probability of x occurring; M indicates the Markov model that is used to calculate payoffs for the relevant branch (green for the arthroplasty Markov model and blue for the no-arthroplasty Markov model); and # indicates the residual probability (i.e. 1 minus the probabilities of the other branches leading from the preceding chance node).
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig38.png)
As a result of this triage process, some patients are referred directly to surgical assessment, including those who have severe symptoms and those who have previously been referred to secondary care but chose not to have surgery at that time.
Triage also identifies some patients who can be managed in primary care and therefore do not need to attend a face-to-face consultation at the hub or with an orthopaedic surgeon. These patients may have mild symptoms, or may not have exhausted all of the non-surgical treatment options (e.g. advice and information, activity and exercise, and, if appropriate, weight loss). 191–193 In other cases, it may be clear from the referral letter that the patient is unfit for surgery or has a recent injury that needs more time to heal before a secondary care assessment is necessary. Other patients are sent referral letters, but do not call to book an appointment or miss booked appointments. A small minority of patients opt for non-NHS care. In some cases, it is necessary to conduct radiography or other imaging before the patient can be assessed.
The remaining patients attend the hub for a face-to-face assessment. This group tends to be those with moderate symptoms, those for whom there is uncertainty around the clinical diagnosis and those patients with BMIs of ≥ 40 kg/m2 who must be referred for a weight-loss programme before surgery. At the hub, consultant physiotherapists or orthopaedic fellows assess the patient to confirm diagnosis. Although the CCG no longer specifies a threshold Oxford Hip or Knee Score for referral,172,173 patients complete the OKS or OHS to assess whether or not symptoms are sufficiently severe to warrant surgery and to guide discussions about patients’ symptom profiles. Staff discuss with the patient what arthroplasty involves, including the potential need for revision surgery, recovery times and the need for support at home after hospital discharge, and how these might be affected by patients’ comorbidities and living arrangements. Landmark injections, injections for trochanteric bursitis and diagnostic imaging may also be done during the hub consultation. Obese patients are referred for monitored weight-loss programmes that must be followed for 12 months before surgery. Based on the hub visit, patients may be referred to an assessment consultation with a hip or knee surgeon if it is considered that they may be appropriate candidates for arthroplasty, or for other interventions such as arthroscopy, interventional radiology or anterior cruciate ligament repair. Other patients may be referred to other outpatient clinics, such as a sports injury clinic, or rheumatology. Patients may also be referred for physiotherapy or other non-surgical management, or may choose not to be referred because they have decided that they would prefer not to have surgery.
Patients attending a surgical assessment consultation will discuss the risks and benefits of surgery with an orthopaedic surgeon and make an informed decision about whether or not to undergo arthroplasty or another type of surgery, taking account of the severity of their symptoms and comorbidities.
Methods for estimating patient numbers
The decision tree model required estimation of five probabilities, shown in grey in Figure 38. We also evaluated whether or not the probability of being referred by the hub to surgical assessment and the probability of subsequently undergoing arthroplasty varied with Oxford Hip or Knee Score, age and sex. From these figures, we estimated the probability that men and women aged 50, 60, 70, 80 and 90 years with different Oxford Hip and Knee Scores will undergo arthroplasty after being referred by their GP with hip/knee symptoms.
We then calculated how many patients in each of these groups would need to be referred with knee/hip symptoms to account for the number of patients undergoing arthroplasty surgery across England each year. The number of operations in each group was first calculated by multiplying the proportion of patients in different age and sex groups (see Online Supplement 15) by the proportion of different Oxford Hip and Knee Scores (see Online Supplement 12) and by the number of primary knee/hip replacements conducted solely for osteoarthritis in England in 2014–15 (76,617 knee replacements and 69,313 hip replacements;154,165 see Chapter 5, Presentation of results and analysis of uncertainty). The number of patients in each group who are likely to have been referred with knee/hip symptoms was then calculated by dividing the number of operations in each group by the probability that each group of patients will undergo surgery.
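The back-calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the example proportions and the example probability are hypothetical, and only the national total of 76,617 knee replacements is taken from the text.

```python
def referrals_needed(total_operations, share_age_sex, share_score, p_surgery):
    """Estimate the GP referrals needed to yield the observed operations in one group.

    total_operations: national number of primary arthroplasties (from the text)
    share_age_sex:    proportion of operations in this age/sex group (hypothetical here)
    share_score:      proportion of that group with a given Oxford score (hypothetical here)
    p_surgery:        probability that a referred patient in this group has surgery
    """
    if not 0 < p_surgery <= 1:
        raise ValueError("p_surgery must be in (0, 1]")
    operations_in_group = total_operations * share_age_sex * share_score
    return operations_in_group / p_surgery


# Hypothetical group: 10% of operations by age/sex, 5% of those at a given OKS,
# and a 20% chance that a referred patient in the group undergoes surgery.
print(round(referrals_needed(76_617, 0.10, 0.05, 0.20)))
```

Dividing by the probability of surgery means that groups with a low chance of proceeding to arthroplasty account for disproportionately many referrals.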
We then estimated what impact using the ACHE tool at the hub might have on the number of referrals to surgical assessment and on the number of operations. In practice, many patients who are not referred to surgical assessment after their hub visit would not be considered candidates for arthroplasty regardless of their Oxford Hip or Knee Scores. In particular, some patients may decide after their hub visit that they do not want arthroplasty surgery, whereas others may be unsuitable owing to comorbid conditions. For the purposes of estimating the potential impact of the ACHE tool, we therefore assumed that half of the patients who are not currently referred from the hub to surgical assessment would not go on to surgical assessment even if the ACHE tool were introduced, whereas the remaining patients will be referred only if the ACHE tool predicts them to have capacity to benefit. We modelled the likely patient numbers if the ACHE tool were used to restrict referral to patients with at least a 50%, 60%, 70%, 80% or 85% probability of achieving a MID in Oxford Hip or Knee Score based on the data.
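One possible reading of this assumption is sketched below. It is illustrative only: the function name and inputs are hypothetical, and it assumes that the patients who would never be referred are spread evenly across the range of predicted benefit, which the text does not state.

```python
def referrals_with_tool(n_hub, p_not_referred_now, share_meeting_threshold,
                        never_fraction=0.5):
    """Hub attendees referred to surgical assessment if the tool were used.

    n_hub:                   number of hub attendees
    p_not_referred_now:      proportion not referred under current practice
    share_meeting_threshold: proportion whose predicted probability of a good
                             outcome is at or above the chosen threshold
    never_fraction:          share of currently non-referred patients assumed
                             not to be referred whatever the tool predicts
    """
    never_referred = n_hub * p_not_referred_now * never_fraction
    eligible = n_hub - never_referred
    return eligible * share_meeting_threshold


# Hypothetical example: 1000 attendees, 53% not currently referred,
# 90% of the remainder meeting the threshold.
print(referrals_with_tool(1000, 0.53, 0.90))
```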
These patient numbers were used to assess the cost of the referral pathway by applying the costs described in Cost inputs. The impact of changing the number of arthroplasty operations on total costs and QALYs was then calculated by applying the costs and QALYs with and without arthroplasty that were estimated in Chapter 7 to the number of patients in each group who were expected to undergo arthroplasty or not have surgery within each scenario. The net health benefit of stratifying patients using the ACHE tool was also evaluated using the methods described in Chapter 5, Presentation of results and analysis of uncertainty, assuming a £20,000-per-QALY ceiling ratio. 166,167
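As a reminder of how net health benefit is conventionally defined, the sketch below subtracts the QALYs forgone elsewhere (cost divided by the ceiling ratio) from the QALYs gained. The function name and the example inputs are hypothetical; only the £20,000-per-QALY ceiling ratio comes from the text.

```python
def net_health_benefit(qalys, cost, ceiling_ratio=20_000):
    """QALYs gained minus the QALYs forgone by spending `cost` (in £) elsewhere."""
    if ceiling_ratio <= 0:
        raise ValueError("ceiling_ratio must be positive")
    return qalys - cost / ceiling_ratio


# Hypothetical strategy: 100 QALYs at a cost of £500,000.
print(net_health_benefit(100, 500_000))
```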
We also estimated the total costs and QALYs that would be accrued if all hub attendees were referred for surgical assessment and those that would be accrued if no hub attendees were referred. From these estimates, we calculated the cost-effectiveness of referring hub attendees of different ages, genders and Oxford Scores to surgical assessment compared with no referral and produced alternative estimates of the decision grids and economic thresholds that take account of the cost of surgical assessment. We also used the ‘Goal Seek’ function in Microsoft Excel® 2010 to estimate the ICER at which the number of knee/hip arthroplasty procedures that would be conducted using the economic threshold equals a specific target.
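The search for an ICER that yields a target number of procedures can be illustrated with a simple bisection routine. This is a stand-in for the idea, not a reproduction of Excel's Goal Seek algorithm, and the relationship between the ICER threshold and the number of procedures used in the example is entirely hypothetical.

```python
def goal_seek(f, target, lo, hi, tol=1e-6, max_iter=200):
    """Find x in [lo, hi] with f(x) close to target, assuming f is monotonic."""
    f_lo = f(lo) - target
    f_hi = f(hi) - target
    if f_lo * f_hi > 0:
        raise ValueError("target is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        f_mid = f(mid) - target
        if abs(f_mid) < tol or (hi - lo) / 2 < tol:
            return mid
        if f_lo * f_mid <= 0:
            hi = mid                 # root lies in the lower half
        else:
            lo, f_lo = mid, f_mid    # root lies in the upper half
    return (lo + hi) / 2


def procedures_at(icer_threshold):
    """Hypothetical, purely illustrative link between threshold and procedures."""
    return 20_000 + 2 * icer_threshold


print(goal_seek(procedures_at, target=30_000, lo=0, hi=20_000))
```

In the real model the number of procedures changes in steps as whole patient groups cross the threshold, so a search of this kind would locate the switching point rather than an exact match.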
The analysis made the following assumptions:
- Because there were no data on OKSs/OHSs for patients who did not attend the hub, the probability of attending a face-to-face visit at the hub and the probability that patients are referred directly to surgery were assumed to be independent of Oxford Hip or Knee Score, age and sex. In practice, it is likely that those patients referred directly to surgical assessment will have more-severe symptoms, whereas those who attend neither the surgical assessment nor the hub may have less-severe symptoms. There may also be differences in the probability of referral by age and sex. However, because we evaluated the impact of the ACHE tool only in those patients attending the hub, allowing for variations in these parameters by age and sex would have had little or no impact on the conclusions. These probabilities were therefore estimated as the proportion of the osteoarthritis patients included in the analyses who attended the hub or were referred directly to surgical assessment without attending the hub.
- For patients who were directly referred to surgical assessment, we assumed that the probability of undergoing surgery was independent of Oxford Hip or Knee Score, age and sex under current practice. As described above, this assumption is unlikely to have affected the estimates of the impact of the ACHE tool on hub attendees.
- The five probabilities estimated based on the NOC hub data were assumed to be representative of clinical practice across the UK. Clinical practice at this hub is likely to differ from practices in other areas in several respects (see Strengths, limitations and further research requirements). However, this assumption was necessary as no other data were available.
- All patients were assumed to attend a surgical assessment visit before surgery, as was observed in the audit.
- We assumed that the ACHE tool would not be used at the surgical assessment and that the probability of undergoing surgery after attending surgical assessment (conditional on Oxford Hip or Knee Score, age and sex) will be unaffected by the threshold used.
- Because there were insufficient data to evaluate non-linear functions, age and Oxford Hip and Knee Score were assumed to have linear effects on the log-odds of being referred for surgical assessment and on the log-odds of having surgery if referred.
- Patients who were on the waiting list for arthroplasty surgery or for whom surgery had been delayed because of comorbidities (including high BMI scores) after being referred at the surgical assessment were counted as having been referred for surgery.
Owing to time and resource constraints, we did not attempt to quantify the uncertainty around the results presented in this chapter using PSA.
Regression methods
Logistic regression models were used to assess how patients’ Oxford Hip and Knee Scores, age and sex affected the odds of being referred for surgical assessment and the odds of subsequently undergoing surgery. Regression analyses and two-sample tests of proportions were conducted in Stata® version 14. Comparisons of Oxford Hip or Knee Scores between patient groups were conducted using unpaired t-tests in Stata®; an F-test for equal variances was first conducted, which determined whether the t-test assumed equal or unequal variance. Oxford Hip and Knee Scores in the hub sample were compared against the population mean (nationally or in Oxfordshire) using a one-sample t-test in Stata®. Population means, nationally and in Oxfordshire, were based on NHS PROMs reports,162,182–184,194,195 which provide average preoperative OKSs and OHSs at the provider and commissioner level. The national average OKSs and OHSs were based on 2014–15 data; owing to the smaller numbers of operations, those for the NHS Oxfordshire CCG are averages over all years from April 2009 to March 2015, weighted by the number of procedures conducted in each year.
Oxford Hip and Knee Scores, age and sex were initially considered as potential explanatory variables, but were dropped from the regression analyses if they were found to be poor predictors. Given the sample of around 100 patients attending the hub, it was not possible to follow the 10-fold cross-validation approach used to select the regression specifications used in the Markov models. We therefore selected regression models based on Akaike information criterion (AIC), testing variables in a prespecified sequence. We began with a model including just the constant term and evaluated whether or not adding OHS or OKS to the model reduced the AIC; if it did, we added in the Oxford Hip or Knee Score in the next step, and if not, we retained the constant-only model. In the next step, we assessed whether or not adding age into the model reduced the AIC, and in the final step, we assessed whether or not adding sex into the model reduced the AIC. The order in which variables were considered for the models (OHS or OKS, then age and then sex) was specified in advance of data analysis; we hypothesised that disease severity (i.e. OHS and OKS) would be most important, followed by age (because it is a proxy for comorbidities) and finally sex (because we had no prior hypotheses about how this variable would affect referrals).
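The sequential AIC procedure can be sketched as follows. The sketch assumes that the maximised log-likelihood of each candidate model is already available (here from a hypothetical lookup table rather than from fitted Stata models); the function names and all numerical values are illustrative.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def select_in_sequence(order, log_likelihoods):
    """Add variables in a prespecified order, keeping each only if it lowers the AIC.

    order:           variable names in the order in which they are tested
    log_likelihoods: maps a tuple of included variables to the maximised
                     log-likelihood of the corresponding logistic model
    """
    selected = ()
    best_aic = aic(log_likelihoods[selected], 1)  # constant term only
    for variable in order:
        trial = selected + (variable,)
        trial_aic = aic(log_likelihoods[trial], len(trial) + 1)
        if trial_aic < best_aic:
            selected, best_aic = trial, trial_aic
    return selected, best_aic


# Hypothetical log-likelihoods for the models that the sequence can reach.
hypothetical_fits = {
    (): -75.0,
    ("score",): -72.0,
    ("score", "age"): -71.8,
    ("score", "sex"): -71.5,
}
chosen, chosen_aic = select_in_sequence(["score", "age", "sex"], hypothetical_fits)
print(chosen, chosen_aic)
```

With these illustrative values, only the Oxford score is retained, because adding age or sex improves the log-likelihood by less than the penalty of one extra parameter.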
The data for patients attending the hub were also used to obtain two measures of the current threshold OHS and OKS for referral. First, the absolute threshold was determined simply as the highest OHS and OKS observed among the patients who were referred for surgical assessment. Second, the logistic threshold was estimated from the regression output for the model selected using the process described in the previous paragraph: the model coefficients were used to calculate the OHS or OKS at which the probability of being referred for surgical assessment is 50%.
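Both threshold measures follow directly from their definitions, as the sketch below shows. For a logistic model with log-odds equal to intercept + slope × score, the predicted probability is 50% where the log-odds are zero, that is at score = −intercept/slope. The function names, coefficients and scores used here are hypothetical.

```python
import math


def referral_probability(score, intercept, slope):
    """Predicted probability of referral from a logistic model."""
    return 1 / (1 + math.exp(-(intercept + slope * score)))


def logistic_threshold(intercept, slope):
    """Score at which the predicted probability of referral is 50%."""
    if slope == 0:
        raise ValueError("slope must be non-zero")
    return -intercept / slope


def absolute_threshold(referred_scores):
    """Highest score observed among referred patients."""
    return max(referred_scores)


# Hypothetical coefficients and scores, for illustration only.
print(logistic_threshold(intercept=0.8, slope=-0.05))
print(absolute_threshold([3, 18, 32, 41]))
```

The absolute threshold is sensitive to a single outlying patient, whereas the logistic threshold reflects the overall relationship between score and referral.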
Cost inputs
We assumed that each 40-minute hub attendance cost £58, whereas surgical assessment visits cost £132 (see Online Supplement 8). The costs of GP consultations and the hub triage process were excluded from the analysis because they occurred before the hub attendance at which the ACHE tool is assumed to be used. Similarly, the cost of radiography, imaging, physiotherapy, injections, weight-loss programmes, missed appointments and referrals to other clinics was excluded because there is no reason to expect the introduction of the ACHE tool to affect the proportion of patients requiring these resources.
Results
Results of the musculoskeletal hub audit
Results of the audit for knee replacement
Records were reviewed for 616 patients referred to the hub with knee symptoms, of whom 315 osteoarthritis patients aged ≥ 50 years were included in the analysis (Figure 39). Of the 18 patients aged ≥ 50 years who were excluded because of conditions other than osteoarthritis, five had rheumatoid arthritis, four had bone or joint infections, two had injuries, one had gout, one had a chondroid lesion, one had no mechanical symptoms, one had quadriceps rupture, one needed limb reconstruction and two had arthritis secondary to fracture.
FIGURE 39.
Patient flow diagram for patients referred with knee symptoms.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig39.png)
Of the patients analysed, 41% (130/315) attended face-to-face consultations at the hub and 23% (71/315) were referred directly to surgical assessment based on the hub triage.
Among the 68 patients who were referred directly to surgical assessment and had outcomes recorded, 56% (38/68) underwent or were awaiting surgery. Data on the OKS measured at the surgical assessment visit were available for 27 of these patients registered in the Javlin study (URL: www.hra.nhs.uk/planning-and-improving-research/application-summaries/research-summaries/local-javlin-registry-v10/; accessed 6 February 2019) who had a mean OKS of 15 (range 1–30). Among the 22 of these patients who went on to have knee arthroplasty, the mean OKS was 14 (range 1–28), non-significantly lower than the average for the five patients who did not have arthroplasty (mean score 21, range 10–30; p = 0.08).
Five of the 114 patients who did not attend face-to-face consultations at the hub or surgical assessment were referred for physiotherapy, 60 had radiography, 16 had MRI and two had ultrasounds. Ten patients had two of these contacts.
Among the 110 patients attending the hub who had complete data, the average OKS was 21 (range 1–48), which was significantly higher than the average for patients undergoing arthroplasty nationally (mean score 18.43; p = 0.015) (Figure 40), but not significantly different from that in Oxfordshire (mean score 19.6; p = 0.134). 162,182–184,194,195 Of the 110 hub attendees, 49 (45%) were referred. OKS values were significantly lower for those patients who were referred (mean score 18, range 3–41) than for those who were not referred (mean score 23, range 1–48; p = 0.013). The highest OKS at which patients were referred to surgical assessment was 41, although the second-highest score was 32. Logistic regression analyses suggested that the odds of being referred for surgical assessment varied with OKS, although allowing for age and gender worsened model fit. This analysis suggested that the odds of being referred to surgical assessment decreased by 4.7% for each 1-point increase in OKS (p = 0.019), and suggested that the OKS at which patients have a 50% chance of being referred is just 16.
FIGURE 40.
Distribution of OKSs for patients attending the hub with knee pain compared with the distribution of patients undergoing knee arthroplasty in England in PROMs/HES data (2009–15).
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig40.png)
Among the 40 hub attendees who were referred to surgical assessment and had data on the outcome of that consultation, 30% (12/40) underwent or were awaiting arthroplasty surgery: significantly fewer than among those patients directly referred to surgical assessment [56% (38/68); p = 0.0092]. Among the sample of hub attendees, OKS, age and gender did not predict which patients would undergo arthroplasty. This may reflect the role of comorbidities and patient choice in the decisions made at the surgical assessment. For two patients (who both attended the hub), the medical notes indicated that it was necessary for the patients to reduce weight before knee replacement could be conducted.
Results of the audit for hip replacement
We reviewed records for 1022 patients referred to the hub with hip symptoms, of whom 607 osteoarthritis patients aged ≥ 50 years were included in the analysis (Figure 41). Among the 19 patients aged ≥ 50 years who were excluded because they had conditions other than osteoarthritis, nine had rheumatoid arthritis, three had fractures, two had bone or joint infections, two needed limb reconstruction, one had a sports injury, one had no mechanical symptoms and one had hip symptoms caused by previous spinal surgery.
FIGURE 41.
Patient flow diagram for patients referred with hip symptoms.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig41.png)
Face-to-face consultations at the hub were attended by 17% of patients (106/607), whereas 39% (236/607) were referred directly to surgical assessment. Of the 235 patients who were directly referred to surgical assessment and had known outcomes, 69% (161) had hip arthroplasty or were awaiting surgery. Because Javlin recruited only knee patients in the period covered by our audit, no OHS data were available on this patient group.
Of the 265 patients (44% of the total) who attended neither the hub nor the surgical assessment, six had physiotherapy, four were referred to other clinics, 10 had hip injections, 186 had radiography, eight had MRI and nine had ultrasound. Twenty-nine patients had two or more of these contacts.
Across the 101 hub attendees with complete data, the mean OHS was 23.7 (range 5–46), which was significantly higher than the average for hip arthroplasty patients nationally (mean score 17.5; p < 0.001) (Figure 42) and in Oxfordshire (mean score 18.7; p < 0.0001). 162,182–184,194,195 Overall, 36% of hub attendees (36/101) were referred for surgical assessment.
FIGURE 42.
Distribution of OHSs for patients attending the hub with hip pain compared with the distribution of patients undergoing hip arthroplasty in England in PROMs/HES data (2009–15).
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig42.png)
The mean OHS was non-significantly lower for patients who were referred (mean score 21, range 5–44) than for patients who were not referred (mean score 25, range 5–46; p = 0.057). One patient was referred with a score of 44, although the second-highest score was 35. Logistic regression analysis suggested that the odds of referral varied with OHS and gender but not age. Each 1-point increase in OHS reduced the odds of referral by 3.9% (p = 0.062). The analysis also suggested that men attending the hub were twice as likely to be referred as women, although the difference did not reach statistical significance (p = 0.147). Women were more likely to attend the hub.
Among the hub attendees who were referred for surgical assessment, 37% (10/27) underwent hip arthroplasty. Logistic regression analysis suggested that OHS, age and gender did not predict which patients would undergo arthroplasty.
Medical records for one patient attending the hub and eight patients who were directly referred to surgical assessment indicated that the patients needed to reduce weight. Of these, two patients had surgery delayed because they needed to lose weight, after they had been directly referred to surgical assessment and then referred for hip arthroplasty.
Anticipated patient numbers, budget impact and cost-effectiveness of the Arthroplasty Candidacy Help Engine
Knee arthroplasty
We extrapolated the hub data to estimate the number of patients who may be referred to secondary care with knee pain in England by dividing the number of knee arthroplasty procedures conducted on patients with different OKSs, ages and genders by the probability of hub attendees undergoing surgery. This suggested that GPs across England refer > 400,000 patients aged ≥ 50 years to secondary care with knee osteoarthritis symptoms each year (Figure 43). Of these, > 170,000 patients might be expected to attend the hub if all CCGs followed a treatment pathway similar to that in Oxfordshire. Among the hub attendees, there may be around 80,000 referrals and 24,000 arthroplasty procedures, suggesting that around 53,000 of the 76,600 primary knee arthroplasty operations conducted for osteoarthritis each year are done on patients who were directly referred to surgical assessment.
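A crude check on the order of magnitude of these figures can be made from the aggregate audit counts alone. The sketch below ignores the variation in referral probability by OKS, age and gender that the full analysis allows for, so it will not reproduce the stratified estimates exactly. It also assumes that patients who attend neither the hub nor a surgical assessment do not undergo arthroplasty; the variable names are illustrative.

```python
# Aggregate knee audit counts reported in the results above.
p_direct = 71 / 315                 # referred directly to surgical assessment
p_surgery_given_direct = 38 / 68    # surgery (or awaiting) after direct referral
p_hub = 130 / 315                   # attended a face-to-face hub consultation
p_referred_given_hub = 49 / 110     # referred on from the hub
p_surgery_given_referred = 12 / 40  # surgery (or awaiting) after hub referral

# Overall probability that a GP-referred patient undergoes arthroplasty.
p_arthroplasty = (p_direct * p_surgery_given_direct
                  + p_hub * p_referred_given_hub * p_surgery_given_referred)

knee_operations = 76_617            # primary knee replacements in England, 2014-15
gp_referrals = knee_operations / p_arthroplasty
hub_attendees = gp_referrals * p_hub

print(round(p_arthroplasty, 3), round(gp_referrals), round(hub_attendees))
```

The result is of the same order as the stratified estimates of > 400,000 GP referrals and > 170,000 hub attendees.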
FIGURE 43.
Number of patients predicted to be referred with knee osteoarthritis symptoms in England.
![](/sites/journalslibrary/files/journal_data/hta23320/11-63-01-fig43.png)
We also modelled the impact that the introduction of the ACHE tool might have on patient numbers, costs and QALYs (Table 60). As discussed in Methods, we assumed that the ACHE tool would only be used in face-to-face hub consultations and would not influence decisions about referrals or surgery in other settings. We also assumed that 26% of all hub attendees (half of the 53% of hub attendees who are not currently referred to surgical assessment) would not be referred regardless of the ACHE tool.
Outcome | Current practice | ACHE tool: ≥ 85% probability of a good outcome | ACHE tool: ≥ 80% probability | ACHE tool: ≥ 70% probability | ACHE tool: ≥ 60% probability | ACHE tool: ≥ 50% probability | Optimal health economic threshold: taking into account assessment pathway costs | Optimal health economic threshold: ignoring assessment pathway costs |
---|---|---|---|---|---|---|---|---|
Range of threshold OKSs across age and gender groups | N/A | 1–20 | 2–27 | 27–32 | 32–35 | 34–37 | 40–44 | 41–44 |
Number of attendances at the surgical outpatient visit | 80,211 | 69,702 | 103,731 | 121,507 | 127,545 | 130,019 | 131,965 | 131,967 |
Number of arthroplasty procedures conducted | 24,063 | 20,911 | 31,119 | 36,452 | 38,263 | 39,006 | 39,589 | 39,590 |
Total cost of the assessment pathway (£000) | 20,397 | 19,005 | 23,511 | 25,865 | 26,665 | 26,992 | 27,250 | 27,250 |
10-year cost excluding the assessment pathway (£000) | 1,077,817 | 1,067,131 | 1,091,969 | 1,105,648 | 1,112,839 | 1,116,290 | 1,119,389 | 1,119,394 |
Total cost (£000) | 1,098,213 | 1,086,137 | 1,115,481 | 1,131,513 | 1,139,503 | 1,143,282 | 1,146,639 | 1,146,645 |
Difference in cost vs. current practice (£000) | N/A | –12,077 | 17,268 | 33,300 | 41,290 | 45,069 | 48,426 | 48,431 |
10-year QALYs | 564,744 | 568,033 | 585,592 | 594,130 | 596,303 | 597,212 | 597,855 | 597,855 |
Difference in QALYs vs. current practice | N/A | 3289 | 20,848 | 29,385 | 31,559 | 32,468 | 33,111 | 33,111 |
Net health benefit (QALYs) | 509,834 | 513,726 | 529,818 | 537,554 | 539,328 | 540,048 | 540,523 | 540,523 |
Difference in net health benefit vs. current practice | N/A | 3892 | 19,984 | 27,720 | 29,494 | 30,214 | 30,689 | 30,689 |
ICER vs. current practice | N/A | Dominant | £828 | £1133 | £1308 | £1388 | £1463 | £1463 |
ICER vs. next best non-dominated option | N/A | Dominant | £1671 | £1878 | £3677 | £4157 | £5220 | £20,470 |
This analysis suggested that introducing the ACHE tool is likely to substantially increase the number of referrals, the number of arthroplasty procedures, costs and health benefits. The ACHE tool was predicted to be cost-effective compared with current practice if the value the NHS places on each QALY is £20,000; for example, using the ACHE tool to identify patients with a ≥ 70% chance of a good outcome (which is equivalent to an OKS threshold of 32 for 60- to 79-year-old men and 30 for 60- to 79-year-old women) would lead to around 122,000 surgical assessments and 36,000 knee arthroplasty procedures among hub attendees, around 41,000 and 12,000 more, respectively, than under current practice. Taking account of the patients who were directly referred to surgical assessment (and for whom it is assumed that the ACHE tool would not be used), this equates to a 24% increase in the number of surgical assessments and a 16% increase in the number of knee arthroplasty procedures. Conducting these additional operations would cost an additional £33 million compared with current practice, but would gain > 29,000 healthy years or QALYs over the 10-year time horizon. Compared with current practice, this strategy costs £1133 per QALY gained: well below the £20,000–30,000 that the NHS is typically willing to pay to gain 1 QALY. 98 Taking into account the improved health for knee arthroplasty patients and the health forgone by spending additional NHS money on knee arthroplasty rather than other services, using the ACHE tool with a 70% cut-off point would gain the equivalent of 28,000 QALYs.
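The cost per QALY quoted above can be reproduced from the differences reported in Table 60. The sketch assumes that the cost differences in the table are expressed in thousands of pounds, as the £33 million figure in the text implies; the function name is illustrative.

```python
def icer(delta_cost, delta_qalys):
    """Incremental cost-effectiveness ratio: extra cost (£) per QALY gained."""
    if delta_qalys <= 0:
        raise ValueError("ICER is only meaningful here when QALYs are gained")
    return delta_cost / delta_qalys


# 70% threshold vs. current practice (Table 60); costs assumed to be in £000.
delta_cost = 33_300 * 1_000
delta_qalys = 29_385

print(round(icer(delta_cost, delta_qalys)))
```

A negative ratio with a positive QALY gain, as for the 85% threshold, corresponds to a strategy that is cheaper and more effective, which the table reports as dominant.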
The number of referrals, procedures, costs and benefits increase as the capacity-to-benefit threshold is decreased. Because the probability of a good outcome from TKA is < 85% for patients aged < 60 years who have an OKS of ≥ 2 and for older patients with an OKS of 19–21 and above, referring only patients with ≥ 85% chance of a good outcome would reduce the number of referrals and operations and save money but would also reduce the health benefits.
Applying the health economic thresholds estimated in Chapter 7 would increase the number of referrals and operations further, but would gain 33,000 QALYs compared with current practice and gain 3700 more QALYs than referring patients who had a ≥ 70% chance of a good outcome. After taking into account the health benefits that are forgone by spending money on knee arthroplasty rather than other NHS services, using the health economic thresholds would gain around 31,000 QALYs compared with current practice.
Hip arthroplasty
Applying the same methods to hip data suggests that GPs in England refer around 235,000 patients with hip osteoarthritis symptoms each year (Figure 44). Of these, around 41,000 would be expected to attend the musculoskeletal hub if all CCGs followed the same referral pathway as that followed in Oxfordshire. Around 18,000 hub attendees may be referred for surgical assessment, of whom around 7000 undergo hip arthroplasty. However, the NOC hub data suggest that around 90% of hip arthroplasty procedures (around 63,000 operations nationally) are conducted on patients who