Notes
Article history
The research reported in this issue of the journal was funded by PGfAR as project number RP-PG-0608-10076. The contractual start date was in October 2010. The final report began editorial review in October 2014 and was accepted for publication in September 2015. As the funder, the PGfAR programme agreed the research questions and study designs in advance with the investigators. The authors have been wholly responsible for all data collection, analysis and interpretation, and for writing up their work. The PGfAR editors and production house have tried to ensure the accuracy of the authors’ report and would like to thank the reviewers for their constructive comments on the final report document. However, they do not accept liability for damages or losses arising from material published in this report.
Declared competing interests of authors
Sarah E Lamb is chairperson of the Health Technology Assessment Clinical Evaluation and Trials (HTA CET) Board. Martin Underwood is a member of the National Institute for Health Research Journals Library Editorial Group.
Permissions
Copyright statement
© Queen’s Printer and Controller of HMSO 2016. This work was produced by Patel et al. under the terms of a commissioning contract issued by the Secretary of State for Health. This issue may be freely reproduced for the purposes of private research and study and extracts (or indeed, the full report) may be included in professional journals provided that suitable acknowledgement is made and the reproduction is not associated with any form of advertising. Applications for commercial reproduction should be addressed to: NIHR Journals Library, National Institute for Health Research, Evaluation, Trials and Studies Coordinating Centre, Alpha House, University of Southampton Science Park, Southampton SO16 7NS, UK.
Chapter 1 Overview of the programme
In this chapter we have provided the background and rationale for our programme to improve the clinical effectiveness and cost-effectiveness of low back pain (LBP) treatment by identifying groups that may gain maximum benefit from therapist-delivered treatments.
Background
Chronic non-specific LBP (CNSLBP) is a common problem affecting a large proportion of the population. 1–4 In the UK, around 70–80% of adults will experience back pain at some point in their life. 5 Some argue that episodic LBP is a universal part of human experience. 6,7 Half of the adult population in the UK (49%) report LBP lasting at least 24 hours in a 1-year period. 5 The 2010 Global Burden of Disease study8 identified LBP as the leading cause of years lived with disability internationally. LBP affects around one-third of the world’s population. 8
Most episodes of back pain are short lived, resolving without the need for any specific treatment. It is the minority of episodes that develop into CNSLBP which create the greatest health need. The natural history of LBP is untidy; around 70% of those affected will experience at least one recurrent episode within a 12-month period. 9
The true prevalence of CNSLBP is difficult to estimate, as definitions and populations vary between studies and countries. However, a review of prevalence studies published between 1966 and 1998 reported a point prevalence of 12–33%, a 1-year prevalence of 22–65% and a lifetime prevalence of up to 84%. 10
Since this review, further reviews on the prevalence, focusing on older people and adolescents, have been published. 3,11 A 2012 systematic review synthesised the global prevalence of LBP in studies published between 1980 and 2009. The greatest prevalence was in females aged 40–80 years. After adjusting for methodological variations the point prevalence of back pain lasting for > 1 day was 11.9% [95% confidence interval (CI) 7.98% to 15.82%] and 1-month period prevalence was estimated at 23.2% (95% CI 17.52% to 28.88%). 12
Defining low back pain
The International Association for the Study of Pain defines pain as ‘an unpleasant sensory and emotional experience associated with actual or potential tissue damage, or described in terms of such damage’. 13 The British Pain Society defines acute pain as ‘short term lasting less than 12 weeks’ duration’, whereas chronic pain is defined as ‘long-term pain of more than 12 weeks or after the time that healing would have been thought to have occurred in pain after trauma or surgery’. 14
Low back pain is diagnosed based on the presence of pain and discomfort in the lumbosacral area. 15 Some people also experience pain in the upper leg as a result of LBP. In the majority of cases it is difficult to identify a single cause for back pain. A 2013 systematic review16 of studies of new presentations of LBP found a combined prevalence of 1.5% for fracture and malignancy in primary care; in secondary and tertiary care, prevalence was 6.5%. Once specific causes for LBP have been excluded [malignancy, fracture, infection, inflammatory disorders (such as ankylosing spondylitis)] then a diagnosis of non-specific LBP (NSLBP) is made. This recognises the difficulty in producing robust classification criteria to identify different populations of people affected by chronic LBP.
There is no evidence for a reduction in the population burden of LBP over time. Between 1990 and 2010, in the UK, the number of disability-adjusted life-years attributable to LBP increased by 3.7%, from 2231 (95% CI 1555 to 3015) per 100,000 to 2313 (95% CI 1574 to 3113) per 100,000 of the age-standardised population. 17
Economic burden of low back pain
Low back pain is a costly condition to society, health care and the individual. It is the leading cause of sickness absence and health-care use. 18–21 In the UK, the direct health-care cost of back pain in 1998 was £163M. However, the larger burden is that of the indirect costs related to lost production and informal care, which were estimated to be at least £5018M. 22 More up-to-date UK estimates are not available. The current cost is likely to be substantially larger. It is difficult to make direct comparisons of the cost of LBP internationally because of varying health and social care systems. 23
Low back pain results in approximately 4% of the UK population taking time off work. This translates to around 90 million working days lost and between 8 and 12 million general practitioner (GP) consultations per year. 22,24 In 2013 the Office for National Statistics reported 131 million lost working days due to sickness absences in that year in the UK; 30.6 million of these (23%) were lost because of musculoskeletal conditions including back and neck pain. 25
Treatment options for low back pain
People experiencing LBP will often seek medical and drug therapies, as well as therapist-delivered complementary therapies, such as acupuncture, chiropractic or osteopathy, to help relieve pain. 26 Until comparatively recently there were few robust trials of treatments for LBP, and no convincing evidence for the effectiveness of any back-pain treatments. Guidance on the management of LBP was based largely on expert opinion, custom and practice. Since the mid-1990s, there has been a substantial investment in high-quality randomised controlled trials (RCTs) of different treatments for NSLBP. We now have good evidence to show that several therapist-delivered treatment approaches are effective, and for some of these there is also evidence that they are cost-effective. 15,27 By ‘therapist-delivered interventions’ we mean non-drug, non-surgical approaches to the treatment of LBP. Typically, these are delivered by physiotherapists or health/clinical psychologists, but they may be delivered by doctors, health trainers, statutorily regulated complementary practitioners (such as osteopaths or chiropractors), or independently registered professionals providing treatments such as acupuncture or the Alexander technique. The types of interventions offered include acupuncture, manual treatments, exercise regimens, cognitive behavioural approaches or combinations of these.
A number of therapist-delivered interventions are superior to ‘treatment as usual’ (GP care) for participants with chronic LBP. There are numerous treatment options for LBP and several guidelines recommending treatment, including those of the National Institute for Health and Care Excellence (NICE), the European Cooperation in Science and Technology, and the American College of Physicians and American Pain Society. Such guidance is typically framed as examining independent treatment modalities. Any recommendation for a treatment modality is, inevitably, a recommendation for a package of care that includes both the non-specific effects of the therapist encounter and the specific effects of the treatment modality in question.
In 2009, NICE guidance15 advised that all people with persistent LBP should be given advice and encouraged to self-manage. As part of this advice, people are encouraged to remain physically active and to engage in daily activity. Subsequently, those affected should be offered a course of acupuncture, exercise or manual therapy. 15 The decision on which treatment to select should be a collaborative decision, taking into account the patient’s treatment preferences. If the selected treatment option is not effective then the patient should be offered another option from the remaining recommended treatments. If the patient is still troubled by back pain then he/she should be considered for an intense physical and psychological intervention. NICE is currently revising its LBP guidelines.
Effectiveness and cost-effectiveness of treatments for low back pain
Although the effectiveness of adding a range of therapist-delivered interventions to best usual care or to no treatment has been well established, the typical mean effect sizes are, at best, modest. By way of illustration, the minimally important (within-person) change in the Roland–Morris Disability Questionnaire (RMDQ) score,28 the most commonly used outcome measure in back pain trials, has been established as 5 points. 29,30 Typical between-group differences in high-quality RCTs are in the order of 1–2 points on the RMDQ, although a few studies have found larger effect sizes (Table 1). These modest mean differences probably translate into ‘numbers needed to treat’ in the order of 5–10. 29,33 These are similar to the numbers needed to treat that are found with antidepressant or antiepileptic drugs which are used to treat chronic painful disorders. 36
Study | Control | Intervention | Mean difference in RMDQ score (95% CI); SMD at 3 months | Mean difference in RMDQ score (95% CI); SMD at 12 months |
---|---|---|---|---|
UK BEAM31 | GP care | Exercise | 1.36 (0.63 to 2.10); 0.34 | 0.39 (–0.41 to 1.19); 0.10 |
UK BEAM31 | GP care | Manipulation | 1.57 (0.82 to 2.32); 0.39 | 1.01 (0.22 to 1.81); 0.25 |
UK BEAM31 | GP care | Manipulation plus exercise | 1.87 (1.15 to 2.60); 0.47 | 1.30 (0.54 to 2.07); 0.33 |
A-TEAM32 | Usual care | Massage | 1.96 (0.74 to 3.18); 0.39 | 0.58 (–0.77 to 1.94); 0.12 |
A-TEAM32 | Usual care | Alexander technique (six sessions) | 1.71 (0.47 to 2.95); 0.34 | 1.40 (0.03 to 2.77); 0.28 |
A-TEAM32 | Usual care | Alexander technique (12 sessions) | 2.91 (1.66 to 4.16); 0.58 | 3.40 (2.03 to 4.76); 0.68 |
BeST33,34 | Advice only | Cognitive–behavioural therapy | 1.10 (0.38 to 1.71); 0.22 | 1.30 (0.56 to 2.06); 0.27 |
York Yoga35 | Usual care | Yoga | 2.17 (1.03 to 3.31); 0.50 | 1.57 (0.42 to 2.71); 0.36 |
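The SMD values in Table 1 can be read alongside the definition used in this report, namely the between-group difference divided by the baseline SD. The sketch below is illustrative only: it takes three 3-month rows from Table 1 and back-calculates the baseline SD that each reported pair implies. The function names are hypothetical, and because the published SMDs are rounded the implied SDs are approximate.

```python
# Illustrative only: values are copied from Table 1; function names are hypothetical.


def smd(mean_difference, baseline_sd):
    """Standardised mean difference: between-group difference / baseline SD."""
    return mean_difference / baseline_sd


def implied_baseline_sd(mean_difference, reported_smd):
    """Back-calculate the baseline SD implied by a reported difference and SMD."""
    return mean_difference / reported_smd


# (mean difference in RMDQ score, reported SMD) at 3 months
table_1_rows = {
    "UK BEAM, exercise": (1.36, 0.34),
    "BeST, cognitive-behavioural therapy": (1.10, 0.22),
    "York Yoga, yoga": (2.17, 0.50),
}

implied_sds = {
    label: round(implied_baseline_sd(diff, reported), 2)
    for label, (diff, reported) in table_1_rows.items()
}

if __name__ == "__main__":
    for label, sd in implied_sds.items():
        print(f"{label}: implied baseline SD of about {sd} RMDQ points")
```

The implied baseline SDs fall between roughly 4 and 5 RMDQ points, which is why between-group differences of 1–2 points correspond to SMDs of about 0.2–0.5.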
The cost per quality-adjusted life-year (QALY) for some of these treatments is well within the cost-effectiveness thresholds that are usually used by NICE. Despite this, evidence of access to such treatments within the UK NHS remains patchy. The guideline-endorsed treatments of interdisciplinary rehabilitation, exercise, acupuncture, spinal manipulation and cognitive–behavioural therapy for subacute or chronic LBP have been shown to be cost-effective, but studies of other endorsed treatments for NSLBP do not yield conclusive or consistent evidence about their relative cost-effectiveness. 37 The scarcity of economic evaluations for some guideline-endorsed treatments means that well-conducted economic evaluations are required to strengthen the evidence base of treatments for LBP.
Subgrouping
Identifying which participants are likely to gain the greatest benefit from different treatments for LBP is an identified high research priority internationally and was one of the key recommendations for future research in the 2009 NICE guidelines for the management of persistent LBP. Current research does not provide any robust data on how to match back pain treatments to participants to maximise effects on outcomes relevant to the participant and cost-effectiveness for the health service.
As different treatment options are argued to work in very different ways, it is a reasonable hypothesis that matching people with LBP to those treatments that are more likely to be effective for their back pain will be a more efficient use of health-care resources and will improve patient outcomes. One might expect that people with high levels of psychological distress that is related to their back pain may gain greater benefit from a psychologically orientated intervention, such as cognitive–behavioural therapy; those with marked loss of physical fitness to benefit most from an exercise intervention; or those with poor back function to benefit most from manual therapy interventions. Developing an evidence base to inform the development of such a stratified care approach has great potential to improve outcomes for people with LBP.
We are aware of one trial of a stratified care approach, published after this programme of work started. The STarT Back trial38 demonstrated that combining a stratification tool with enhanced physiotherapy packages for selected participants improves outcomes and reduces costs when compared with usual physiotherapy care. This study38 does not, however, allow us to assess how well the stratification tool itself identifies subgroups.
There is a myriad of RCTs that could be designed to address individual components of this problem. High-quality trials in this area are very costly and time-consuming, and can address only one small part of this complex problem. Alternative approaches, which make the best possible use of existing data, can produce timely answers to a range of important research questions and provide substantial added value to the money that is already invested in this area.
We present a programme of work – using systematic reviews, methodological development and secondary analyses of existing data sets – to identify strategies to improve outcomes for people seeking treatment for back pain, by improving how participants, clinicians and purchasers choose treatments. Our programme of work ensures that the maximum information is gleaned from existing substantial trial data sets. The analysis plan for these data and modelling of clinical effectiveness and cost-effectiveness are informed by our literature reviews.
Aim and objectives
The overall aim was to improve the clinical effectiveness and cost-effectiveness of LBP treatment by providing participants, their clinical advisors and health service purchasers with better information about which participants are most likely to benefit from which treatment choices. To achieve this, our objectives were to:
- synthesise what is already known about the validity, reliability and predictive value of possible treatment moderators
- develop a repository of individual participant data from RCTs testing therapist-delivered interventions for LBP
- determine which participant characteristics, if any, predict clinical response to different treatments for LBP
- determine which participant characteristics, if any, predict the most cost-effective treatments for LBP.
We have defined a therapist as a person trained in administering any of the available recommended treatments, excluding drug interventions and surgical interventions, for the management of LBP.
Structure of this report
This report has been structured as shown in Figure 1. In this report we use some specific terminology that needs additional definition to aid understanding. We have defined these in the Glossary at the start of this report and in more detail at relevant points in the report.
Chapter 2 Literature reviews
As part of this programme of work, we carried out two systematic reviews. In this chapter, we have presented the details and results of each review, followed by an overall summary.
Systematic review 1: identification of potential moderators
This review has been published in Physiotherapy under the terms of the Creative Commons Attribution – NonCommercial – NoDerivs (CC BY-NC-ND 4.0) Licence (http://creativecommons.org/licenses/by-nc-nd/4.0/). Here we present a summary of the paper. 39
Abstract
Background: in RCTs, moderators are baseline characteristics that predict whether an intervention will be more or less effective for an individual in the trial. For our final individual participant data meta-analyses, we wanted to select potential moderators that were grounded in existing data.
Aim: to identify potential moderators from existing studies of therapist-delivered interventions for LBP to apply to our data set.
Methods: we developed a review protocol detailing the inclusion and exclusion criteria, search strategy, data extraction process and quality assessment method. We conducted electronic searches in MEDLINE, EMBASE, Web of Science (Science Citation Index and Social Science Citation Index) and Cochrane Central Register of Controlled Trials (CENTRAL) databases for studies reporting moderator analyses. Two researchers independently screened the titles and abstracts. Additionally, we searched the reference lists of relevant articles for any further potential references. We included RCTs with ≥ 500 participants, and cohort studies of ≥ 1000 participants. We classified potential moderators into those with strong evidence (p < 0.05) or weaker evidence (0.05 < p ≤ 0.20).
Results: we identified 914 potential citations. We selected 64 papers for detailed evaluation. Four papers, all RCTs, were included. We identified potential moderators with strong evidence (p < 0.05) in one or more studies as age, employment status and type, back pain status, narcotic medication use, treatment expectations and education. Potential moderators with weaker evidence (0.05 < p ≤ 0.20) include gender, psychological distress, pain/disability and quality of life.
Conclusion: the overall data obtained from this review were weak and lacked the rigour to inform clinical practice. However, this review has helped us to identify potential moderators of treatment effect, supported by some weak evidence, to inform our further analyses.
Background
The ability to identify which patients are likely to gain the greatest benefit from a treatment would have significant implications in clinical practice. To explore this it is crucial to identify moderators of treatment response. These are factors that are measured prior to randomisation and that subsequently influence the effect of the treatment. 40 To identify such moderators, large data sets are required to provide sufficient statistical power to detect any interaction between the moderator and treatment. 41
Aims
The purpose of this review was to identify potential moderators which we could test in our individual participant data pooled repository.
Method
Originally this review was conducted up until September 2011. Searches were updated in July 2014. Electronic searches were conducted using the following databases:
- MEDLINE
- Ovid MEDLINE® In-Process & Other Non-Indexed Citations
- EMBASE
- Web of Science (Science Citation Index and Social Science Citation Index)
- CENTRAL.
To ensure that we had not overlooked useful data identifying possible treatment moderators, we searched for both RCTs and observational studies that had tested for effect modification.
Search strategy
We started our searches using the term ‘low back pain’ combined with keywords including ‘subgroup’, ‘effect modifier’ and ‘moderator’. This preliminary search identified only publications that used the term ‘subgroup’ in the title and/or the abstract; it failed to pick up papers that used the term in the main body of the text. We therefore re-ran the searches using the keyword ‘trial’ for RCTs and the keywords ‘Observational’, ‘Cohort’ and ‘Prospective studies’ for non-RCTs or observational studies separately, and then combined each with the term ‘low back pain’. We also hand-searched and screened the included studies to find additional studies.
Minimum sample size for included studies
To allow us to identify meaningful interactions it was critical to select research based on an adequate sample size. We made the following assumptions to determine the sample size criterion:
- the outcome of interest is continuous and normally distributed
- there are two treatment arms (intervention and control)
- the potential moderator is binary.
To determine the minimum sample needed to test for an interaction we used a model proposed by Lachenbruch. 42 To test for a long-term (12 months) moderate standardised effect size [between-group difference/baseline standard deviation (SD)] of 0.5 for the interaction at a 0.05 level of significance and 80% power for the primary outcome, a minimum data set of 503 participants was needed. Recognising the inherent risk of bias in observational studies we set a higher threshold of 1000 participants for any observational studies included.
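The arithmetic behind a figure of this size can be sketched with a standard normal approximation. The sketch below is not a reproduction of the Lachenbruch model cited above, whose exact form is not given here. It assumes, in addition to the three assumptions listed, that participants are split equally across the four cells formed by treatment arm and binary moderator, so that the interaction contrast has variance 16σ²/N. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def _z_sum_squared(alpha, power):
    """(z for two-sided alpha + z for power) squared, standard normal."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) ** 2


def main_effect_sample_size(effect_size, alpha=0.05, power=0.80):
    """Total N for a two-arm comparison with N/2 per arm (variance 4*sigma^2/N)."""
    return ceil(4 * _z_sum_squared(alpha, power) / effect_size ** 2)


def interaction_sample_size(effect_size, alpha=0.05, power=0.80):
    """Total N for a treatment-by-binary-moderator interaction.

    Assumes four equal cells of N/4, so the difference of differences
    has variance 16*sigma^2/N (an assumption of this sketch).
    """
    return ceil(16 * _z_sum_squared(alpha, power) / effect_size ** 2)


if __name__ == "__main__":
    print("Main effect, SMD 0.5:", main_effect_sample_size(0.5))
    print("Interaction, SMD 0.5:", interaction_sample_size(0.5))
```

Under these assumptions the interaction calculation returns 503 for a standardised effect size of 0.5, which agrees with the figure quoted above, and it is four times the requirement for a main effect of the same size. Unequal moderator subgroups would increase the number needed.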
A priori we estimated that we needed to include RCTs with at least 500 participants to identify a moderate standardised mean difference (SMD; between-group difference/baseline SD) of 0.5 for the interaction at a 0.05 level of significance and 80% power. The SMDs in high-quality RCTs of therapist-delivered interventions for LBP are typically in the range of 0.1–0.7 (see Table 1). Smaller trials would be able to detect treatment moderation, at this level, only if the moderation effect was substantially larger than the main treatment effect. Thus, even having set quite a large entry criterion by size we would run the risk of failing to consider potential treatment effect moderators that did not reach the conventional level of statistical significance. Therefore, any variables identified as moderators of treatment effect at p < 0.05 were classed as potential moderators with strong evidence and those at 0.05 < p ≤ 0.20 as potential moderators with weak evidence. For our final analyses we considered potential moderators with both strong and weak evidence to be worth exploring further.
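The classification rule described above can be written as a small function. This is a minimal sketch with a hypothetical function name. The text does not say how a p-value of exactly 0.05 should be treated, so the sketch assumes that it falls into the weak-evidence band because it is not below 0.05.

```python
def classify_moderator(p_value):
    """Classify an interaction p-value using the review's two evidence bands.

    Assumption of this sketch: p exactly equal to 0.05 is treated as weak
    evidence, because the report defines strong evidence as p < 0.05.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < 0.05:
        return "strong"
    if p_value <= 0.20:
        return "weak"
    return "not considered"


if __name__ == "__main__":
    # p-values reported for the BeST trial interactions on the RMDQ
    for label, p in [("Age", 0.035), ("Gender", 0.102)]:
        print(label, classify_moderator(p))
```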
Inclusion and exclusion criteria
Box 1 provides an outline of the inclusion and exclusion criteria for this review.
Inclusion criteria
- Aged ≥ 18 years.
- NSLBP of any duration.
- Therapist-delivered interventions.
- RCTs with sample size of ≥ 500.
- Non-RCTs and observational studies with sample size of ≥ 1000.
- English language.
- Primary and secondary analysis seeking to identify predictors of response to treatment using ‘a priori’ and ‘post hoc’ subgroups and those looking for interaction between baseline variable and treatment.

Exclusion criteria
- Studies with no comparison between two treatment groups.
- Studies that did not report effect sizes for treatment by using moderator interactions.
Screening and data extraction
At all stages two researchers (Dr Tara Gurung and DE) worked independently to screen titles and abstracts based on the inclusion criteria. All agreed full papers were obtained for data extraction. Data were extracted on to a standardised extraction form and any discrepancies were resolved using a third reviewer (DM). As no relevant observational studies were identified we do not address further methodological considerations related to observational studies.
Risk of bias and quality assessment
Both reviewers independently assessed risk of bias for the between-group comparison using the Cochrane Collaboration risk-of-bias tool. 43 From this tool the criteria used were:
- method of randomisation
- allocation concealment
- incomplete outcome data
- selective outcome reporting
- other sources of bias.
To assess quality we used the criteria developed by Pincus et al.,44 whereby the answers to the five questions presented below allowed evidence to be classified as ‘confirmatory’ or ‘exploratory’:
- Was the subgroup analysis specified a priori?
- Was the selection of subgroup factors for analysis theory/evidence driven?
- Were subgroup factors measured prior to randomisation?
- Was measurement of subgroup factors, measured by adequate (reliable and valid) measurements, appropriate for the target population?
- Does the analysis contain an explicit test of the interaction between moderator and treatment?
To reduce conflicts of interest, members of the reviewing team who were authors on any included studies did not participate in the quality assessment exercises.
Results
Our initial electronic searches generated 7208 hits; 6294 were removed as duplicates or on the basis of their title and abstract. We obtained 64 papers for detailed review; of these, 60 papers were excluded (Figure 2). Four studies45–48 were included in this review (Table 2). All four trials45–48 were RCTs, constituting a total sample of n = 5514.
Study | Country | Sample | Interventions |
---|---|---|---|
UK BEAM45 | UK | 1334 | Group exercise, manual therapy and combination therapy |
BeST46 | UK | 701 | Group cognitive–behavioural approach |
Witt47 | Germany | 2841 | Acupuncture |
Cherkin49 | USA | 638 | Acupuncture |
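The counts reported in the Results text and in Table 2 can be cross-checked with simple arithmetic. The sketch below is illustrative only and uses hypothetical variable names; every number is taken from the text and table above.

```python
# Study flow counts as reported in the Results text
initial_hits = 7208
removed_at_screening = 6294
remaining_citations = initial_hits - removed_at_screening

papers_reviewed_in_detail = 64
papers_excluded = 60
studies_included = papers_reviewed_in_detail - papers_excluded

# Sample sizes as reported in Table 2
sample_sizes = {
    "UK BEAM": 1334,
    "BeST": 701,
    "Witt": 2841,
    "Cherkin": 638,
}
total_sample = sum(sample_sizes.values())

if __name__ == "__main__":
    print("Citations remaining after screening:", remaining_citations)
    print("Studies included:", studies_included)
    print("Total sample across included trials:", total_sample)
```

The 914 citations remaining after screening match the number of potential citations given in the abstract of this review, and the four sample sizes sum to the reported total of 5514.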
Once we had identified these papers we revisited our search results to include any studies with a sample size of ≥ 300 in a two-group comparison because the trial by Cherkin et al. 49 was a four-arm trial with a sample of n = 638, whereas our sample size calculation of ≥ 500 was based on a two-arm trial. As this paper48 generated some useful moderators for our exploratory work we decided to include it. We did not identify any additional relevant studies with between 300 and 499 participants.
Although the Witt et al. 47 paper provided insufficient data to judge the quality of its exploratory analysis, it did include a specific test for interaction. The data presented did not allow for any pooling of moderator analyses across studies testing similar interventions.
Risk of bias and methodological quality for subgroups
To assess risk of bias and quality of subgroups we used both the original main trial papers and the associated secondary papers where appropriate (Tables 3 and 4).
Quality of the study based on main trial paper(s) | UK BEAM31 | BeST33,34 | Witt47 | Cherkin48,49 |
---|---|---|---|---|
Random sequence generation | L | L | L | L |
Allocation concealment | L | L | L | L |
Blinding of participants and personnel | H | H | H | H |
Blinding of outcome assessment | L | L | H | L |
Incomplete outcome data | L | L | U | L |
Selective reporting | L | L | U | L |
Generalisability | L | L | L | L |
Sample size calculation | L | L | U | L |
Conflict of interest | L | L | H | L |
Source of funding | MRC | NIHR HTA | Social Health Fund Providers | National Institutes of Health |
Quality of the moderator analyses based on subgroup paper(s) | UK BEAM31 | BeST33,34 | Witt47 | Cherkin48,49 |
---|---|---|---|---|
Was the subgroup analysis specified a priori? | N | Y | N | N |
Was the selection of subgroup factors for analysis theory/evidence driven? | N | Y | N | N |
Were subgroup factors measured prior to randomisation? | Y | Y | U | Y |
Was measurement of subgroup factors measured by adequate (reliable and valid) measurements, appropriate for the target population? | Y | Y | N | Y |
Does the analysis contain an explicit test of the interaction between moderator and treatment? | Y | Y | U | Y |
Strength of evidence | EE | CE for two potential moderators | IE | EE |
Table 5 presents the potential moderators with strong and/or weak evidence from the four included trials. 45–48 The many interactions tested that were not statistically significant are not reported here.
Study ID | Potential moderators | Significant interaction at 12 months: RMDQ | Significant interaction at 12 months: MVK pain | Significant interaction at 12 months: MVK disability |
---|---|---|---|---|
BeST34,46 | Troublesomeness (very/extremely – moderately) | p = 0.190; –1.01 (–2.52 to 0.50) | p = 0.184; –5.04 (–12.47 to 2.40) | NS |
BeST34,46 | Age (≥ 54 years – < 54 years) | p = 0.035; –1.58 (–3.05 to –0.12) | NS | NS |
BeST34,46 | Female – male | p = 0.102; –1.27 (–2.79 to 0.25) | NS | NS |
BeST34,46 | Left FT education (> 16 years of age – ≤ 16 years of age) | p = 0.098; 1.29 (–0.24 to 2.82) | NS | NS |
BeST34,46 | Employed – not employed | p = 0.011; 1.89 (0.43 to 3.35) | p = 0.181; 5.01 (–2.33 to 12.34) | NS |
BeST34,46 | HADS – anxiety (≥ 11 – < 11) | p = 0.195; –1.12 (–2.83 to 0.58) | NS | NS |
BeST34,46 | HADS – depression (≥ 11 – < 11) | p = 0.135; –2.07 (–4.79 to 0.65) | NS | p = 0.051; –14.58 (–29.19 to 0.03) |

Study ID | Potential moderators | Significant interaction, RMDQ at 8 weeks: IA | Significant interaction, RMDQ at 8 weeks: StA | Significant interaction, RMDQ at 8 weeks: SiA | Significant interaction, RMDQ at 52 weeks: IA | Significant interaction, RMDQ at 52 weeks: StA | Significant interaction, RMDQ at 52 weeks: SiA |
---|---|---|---|---|---|---|---|
Cherkin48,49 | Age | NS | p = 0.08; 0.08 (–0.02 to 0.18) | NS | NS | p = 0.15; 0.07 (–0.03 to 0.17) | NS |
Cherkin48,49 | Self-efficacy | p = 0.04; –6.17 (–12.01 to –0.33) | NS | NS | NS | NS | NS |
Cherkin48,49 | RMDQ (B/L) | p < 0.0001; –0.48 (–0.72 to –0.24) | p = 0.004; –0.37 (–0.62 to –0.12) | p = 0.001; –0.41 (–0.66 to –0.16) | p = 0.07; –0.23 (–0.48 to 0.02) | p = 0.07; –0.24 (–0.49 to 0.01) | NS |
Cherkin48,49 | Bothersomeness score (B/L) | NS | p = 0.10; 0.47 (–0.10 to 1.04) | NS | NS | NS | NS |
Cherkin48,49 | Heavy lifting | p = 0.03; 4.29 (0.43 to 8.15) | p = 0.13; 3.00 (–0.86 to 6.86) | p = 0.18; 2.73 (–1.27 to 6.73) | p = 0.01; 5.19 (1.17 to 9.21) | p = 0.15; 3.03 (–1.05 to 7.11) | p = 0.04; 4.45 (0.28 to 8.62) |
Cherkin48,49 | Sedentary | NS | NS | NS | p = 0.12; 2.73 (–0.72 to 6.18) | p = 0.15; 2.47 (–0.90 to 5.84) | NS |
Cherkin48,49 | Use of narcotic medication | p = 0.08; 3.52 (–0.38 to 7.42) | NS | p = 0.01; 4.81 (0.97 to 8.65) | NS | p = 0.04; 4.06 (0.18 to 7.94) | p = 0.19; 2.71 (–1.31 to 6.73) |
Cherkin48,49 | Acupuncture expectation (top tertile) | p = 0.05; –2.65 (–5.28 to –0.02) | NS | NS | NS | p = 0.17; –1.9 (–4.60 to 0.80) | p = 0.03; –2.91 (–5.56 to –0.26) |

Study ID | Potential moderators | Significant interaction, bothersomeness score at 8 weeks: IA | Significant interaction, bothersomeness score at 8 weeks: StA | Significant interaction, bothersomeness score at 8 weeks: SiA | Significant interaction, bothersomeness score at 52 weeks: IA | Significant interaction, bothersomeness score at 52 weeks: StA | Significant interaction, bothersomeness score at 52 weeks: SiA |
---|---|---|---|---|---|---|---|
Cherkin48,49 | Age | NS | p = 0.09; 0.04 (0.001 to 0.08) | p = 0.07; 0.04 (0.001 to 0.08) | NS | p = 0.15; 0.04 (–0.02 to 0.10) | p = 0.08; 0.05 (–0.01 to 0.11) |
Cherkin48,49 | Self-efficacy | p = 0.14; –2.21 (–5.13 to 0.71) | NS | NS | NS | NS | NS |
Cherkin48,49 | Baseline RMDQ score | p = 0.01; –0.15 (–0.27 to –0.03) | NS | p = 0.0005; –0.22 (–0.34 to –0.10) | p = 0.16; –0.09 (–0.23 to 0.05) | NS | NS |
Cherkin48,49 | Heavy lifting | p = 0.05; 1.97 (0.03 to 3.91) | NS | p = 0.04; 2.10 (0.10 to 4.10) | p = 0.02; 2.51 (0.43 to 4.59) | NS | NS |
Cherkin48,49 | Light/medium lifting | NS | p = 0.12; –1.28 (–2.87 to 0.31) | NS | p = 0.12; 1.35 (–0.36 to 3.06) | NS | NS |
Cherkin48,49 | Sedentary | NS | NS | NS | p = 0.19; 1.20 (–0.58 to 2.98) | NS | NS |
Cherkin48,49 | Acupuncture expectation (top tertile) | p = 0.10; –1.10 (–2.41 to 0.21) | NS | NS | p = 0.051; –1.44 (–2.87 to –0.01) | NS | p = 0.06; –1.29 (–2.64 to 0.06) |

Study ID | Treatment | Potential moderators | RMDQ outcome at 3 months | RMDQ outcome at 12 months |
---|---|---|---|---|
UK BEAM31,45 | Combined treatment | Quality of life | p = 0.174; –0.1 (–0.26 to 1.43) | NS |
UK BEAM31,45 | Combined treatment | Treatment expectation (helpful) | p = 0.073; –3.2 (–6.74 to 0.30) | p = 0.038; –3.8 (–7.39 to –0.20) |
UK BEAM31,45 | Combined treatment | Treatment expectation (very helpful) | p = 0.192; –2.2 (–5.49 to 1.11) | p = 0.019; –4.0 (–7.38 to –0.67) |
UK BEAM31,45 | Manipulation | Beliefs | p = 0.07; –0.8 (–1.62 to 0.06) | NS |
UK BEAM31,45 | Manipulation | Quality of life | p = 0.118; 1.4 (–0.35 to 3.07) | NS |
UK BEAM31,45 | Manipulation | Pain/disability | p = 0.176; –1.9 (–4.61 to 0.85) | p = 0.143; –2.2 (–5.16 to 0.75) |
UK BEAM31,45 | Manipulation | Treatment expectation (helpful) | NS | p = 0.083; –0.1 (–0.16 to 0.01) |
UK BEAM31,45 | Manipulation | Treatment expectation (very helpful) | p = 0.113; 1.6 (–0.38 to 3.60) | NS |

Study ID | Potential moderators | Outcome, FFbHR |
---|---|---|
Witt47 | Worse initial back function | p < 0.001 (back function and pain improvement at 3 months with acupuncture treatment) |
Witt47 | Younger | p < 0.001 |
Witt47 | > 10 years of schooling | p = 0.01 |
Moderator variables identified
Potential moderators with strong evidence (p < 0.05) in one or more studies include age (younger participants may gain more benefit), employment status and type (those employed or in sedentary occupations may gain greater benefit), back pain status (those who are worse may gain greater benefit), narcotic medication use (users may benefit less), treatment expectations (those with a greater positive expectation gained more benefit) and education (those with > 10 years of schooling gained a greater benefit). Potential moderators with weaker evidence (0.05 < p ≤ 0.20) include gender (female participants may gain greater benefit), psychological distress (those with anxiety and depressive symptoms may benefit more), pain/disability (those with greater pain/disability at baseline may benefit more) and quality of life (those with a better quality of life may benefit more). It should be noted that these findings may be due to chance, particularly as they come from different studies.
Age: the Back Skills Training (BeST), Cherkin and Witt trials46,49,50 found an interaction with age. In the BeST trial,46 younger participants gained more benefit from cognitive behavioural therapy than older participants on the RMDQ score. The treatment difference was –1.58 (p = 0.035; 95% CI –3.05 to –0.12). As the p-value was < 0.05, the interaction provides strong evidence. Witt et al.50 found a statistically significant additional benefit from acupuncture treatment in younger participants (p < 0.001).
Gender: the BeST trial46 found that gender had a moderating effect on treatment. In this trial, females had comparatively greater improvement following group cognitive behavioural therapy than males. The treatment difference between males and females was –1.27 (p = 0.102; 95% CI –2.79 to 0.25) for the RMDQ score. As the p-value was between 0.05 and 0.20, the interaction provides weak evidence.
Employment status: employment was found to be one of the positive moderating factors. In the BeST trial,46 the authors found that employed participants gained additional benefit from a cognitive behavioural approach compared with those who were unemployed. The treatment difference between employed and unemployed participants was 1.89 (p = 0.011; 95% CI 0.43 to 3.35) for the RMDQ score and 5.01 (p = 0.181; 95% CI –2.33 to 12.34) for the Modified von Korff (MVK) pain score. The interaction effect in the analysis of the MVK pain score was weak.46 The Cherkin trial48,49 found some moderating effect according to type of employment. The participants in this trial48 received acupuncture therapy. Those whose job involved heavy lifting showed a positive moderating effect on the back-related dysfunction score at 8 weeks (p = 0.03 to 0.18) and 52 weeks (p = 0.01 to 0.04). Those doing medium/light lifting at work showed a positive moderating effect on the bothersomeness score (p = 0.12) at 8 and 52 weeks; however, the interaction was weak. Finally, those with sedentary work showed a positive moderating effect at 52 weeks (p = 0.12 to 0.19). The interaction was generally weak.
Education: the BeST trial46 found that participants who had left full-time education after the age of 16 years had better improvement from cognitive behavioural therapy than participants who left full-time education aged ≤ 16 years. The treatment difference was 1.29 (p = 0.098; 95% CI –0.24 to 2.82) for the RMDQ score. The p-value for the interaction was > 0.05 and, therefore, this provides weak evidence. Witt et al.50 found that participants with > 10 years of schooling gained a greater benefit from acupuncture (p = 0.01).
Back pain status: in the Cherkin and Witt trials48–50 participants with a worse initial back pain status (baseline RMDQ score) gained an increased benefit from acupuncture compared with those with a better back pain status at baseline (p-values ranged from < 0.001 to 0.16). The extent to which LBP inconveniences participants – how troublesome or bothersome it is – was found to be a moderator in two trials, with a greater benefit from treatment in those with a more troublesome/bothersome condition. The interaction was weak, with the p-values being > 0.05. In the Cherkin trial,48,49 the p-value was 0.10, whereas in the BeST trial46 the treatment difference was –1.01 (p = 0.190; 95% CI –2.52 to 0.50) for the RMDQ score and –5.04 (p = 0.184; 95% CI –12.47 to 2.40) for the MVK pain score.
Pain/disability: similarly, participants with greater pain/disability at baseline seemed to benefit more from manipulation on the RMDQ score at 3 months (p = 0.176) and 12 months (p = 0.143) in the UK Back pain Exercise And Manipulation (UK BEAM) trial45 (see Table 5). As the p-values lie between 0.05 and 0.20, this provides weak evidence.45
Narcotic medication use: Cherkin et al.48,49 found that use of medication such as narcotics had a negative moderating effect in those receiving acupuncture. The p-values for this interaction ranged from 0.01 to 0.19, spanning strong to weak evidence.
Treatment expectations: having better expectations about the treatment was found to be a moderating factor in two trials.45,48,49 The p-values ranged from 0.03 to 0.192, spanning strong to weak evidence for the interactions. Cherkin et al.48,49 found that participants with a higher expectation that acupuncture would be helpful gained more benefit on the back-related dysfunction score (p = 0.03 to 0.17) and the bothersomeness score (p = 0.05 to 0.10). In the UK BEAM trial,45 treatment expectation showed a positive moderating effect on the RMDQ score for manipulation at 3 months (p = 0.113) and 12 months (p = 0.083), and for a combined treatment of manipulation and exercise at both 3 and 12 months (p = 0.03 to 0.192).
Quality of life: good quality of life showed weak evidence for a moderating effect on treatment outcome for both manipulation (p = 0.118) and combined manipulation and exercise (p = 0.174).45
Psychosocial status: in the BeST trial,46 psychosocial status moderated the treatment effect. The trial46 investigated whether or not psychological status moderated the outcome of cognitive behavioural therapy. Participants with higher levels of anxiety at baseline gained more benefit from treatment in terms of the RMDQ score; the treatment difference was –1.12 (p = 0.195; 95% CI –2.83 to 0.58), demonstrating a weak interaction. Similarly, participants who were more depressed gained more benefit from the treatment than those who were less depressed, on both the RMDQ and MVK disability scores. The treatment differences were –2.07 (p = 0.135; 95% CI –4.79 to 0.65) and –14.58 (p = 0.051; 95% CI –29.19 to 0.03) for the RMDQ and MVK disability scores, respectively.
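As a reading aid, a reported p-value can be checked against its treatment difference and 95% CI. The sketch below is illustrative only and is not part of the original analyses: it assumes the interval is symmetric and based on a normal approximation, and the function name `p_from_ci` is hypothetical. It uses the BeST age interaction quoted above.

```python
from statistics import NormalDist


def p_from_ci(estimate, lower, upper, level=0.95):
    """Approximate two-sided p-value from an estimate and its CI.

    Assumes a symmetric, normal-approximation interval (an assumption,
    not something the trial reports state).
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% CI
    standard_error = (upper - lower) / (2 * z_crit)  # CI half-width / z_crit
    z = estimate / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# BeST trial, age interaction on the RMDQ score: -1.58 (95% CI -3.05 to -0.12)
p_age = p_from_ci(-1.58, -3.05, -0.12)
print(f"approximate p-value: {p_age:.3f}")
```

Under these assumptions the recovered value is close to the reported p = 0.035; small discrepancies are expected from rounding of the published estimates.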
Discussion and conclusion
In this review we aimed to identify potential moderators of treatment effect to test in our repository of data. Only four trials were included. We considered any variables that were identified as moderators of treatment effect at p < 0.05 as potential moderators with strong evidence, and those at 0.05 ≤ p < 0.20 as potential moderators with weak evidence. Confirmatory analyses were performed for only two comparisons, in one study.46 Any apparently positive findings need to be interpreted with considerable caution. We set the threshold for potential moderation with weak evidence at p = 0.20, and the included studies made many comparisons, meaning that any positive results may well be no more than chance findings. Nevertheless, we have identified some domains for which there is some weak evidence of moderation that is worth exploring further.
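The grading rule used in this review, and the reason for caution about chance findings, can be sketched as follows. This is an illustration rather than the review's own code: the function names are hypothetical, the handling of a p-value of exactly 0.05 is an assumption, and the figure of 20 comparisons is an arbitrary example.

```python
def evidence_strength(p_value):
    """Grade an interaction p-value using the review's thresholds.

    Assumption: p < 0.05 is 'strong'; anything from 0.05 up to and
    including 0.20 is 'weak'.
    """
    if p_value < 0.05:
        return "strong"
    if p_value <= 0.20:
        return "weak"
    return "not a potential moderator"


def expected_chance_findings(n_tests, threshold):
    """Expected number of tests below the threshold if no moderation exists.

    Under the null hypothesis p-values are uniform on (0, 1), so each test
    falls below the threshold with probability equal to the threshold.
    """
    return n_tests * threshold


# p-values quoted above for the BeST trial (RMDQ score)
for label, p in [("age", 0.035), ("gender", 0.102), ("employment", 0.011)]:
    print(label, evidence_strength(p))

# Hypothetical example: 20 comparisons screened at the 0.20 threshold
print(expected_chance_findings(20, 0.20))
```

With a threshold as permissive as 0.20, roughly one in five null comparisons would be flagged, which is why the weak-evidence moderators are treated only as candidates for further testing.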
Systematic review 2: quality of subgroup analyses in low back pain trials
This review has been published in Spine.51 Here we present a summary of the paper.
Abstract
Background: trials of back pain interventions have generally shown small to moderate positive effects. Therefore, identifying subgroups in this population is a research priority. This review evaluates the quality, conduct and reporting of subgroup analyses performed in the NSLBP literature.
Aim: to evaluate the quality, conduct and reporting of subgroup analyses performed in RCTs of therapist-delivered interventions for NSLBP.
Method: electronic databases were searched for RCTs of therapist-delivered interventions for NSLBP. We included only papers reporting subgroup analyses (confirmatory or exploratory). The quality of subgroup analyses and quality of conduct and reporting were also evaluated.
Results: thirty-nine papers were included in the final review. Of these, only three (8%) tested hypotheses about moderators (confirmatory findings); 18 (46%) generated hypotheses about moderators to inform future research (exploratory findings) and 18 (46%) provided insufficient findings. The appropriate statistical test for interaction was performed in 27 of the papers, of which 10 papers reported results from interaction tests, four papers incorrectly reported results within individual subgroups and the remaining papers either reported p-values or nothing at all.
Conclusions: subgroup analyses performed in NSLBP trials have been severely underpowered, are able to provide only exploratory or insufficient findings and have rather poor quality of reporting. Using current approaches, few definitive trials of subgrouping in back pain are likely to be performed. There is a need to develop new approaches to subgroup identification in back pain research.
Background
The identification of subgroups that gain the most benefit from interventions for the management of LBP is an important research priority internationally.15,52–54 Although several trials claim to have performed subgroup analyses, the quality, conduct and reporting of the analyses performed has not been critically reviewed. There is some confusion in the papers between investigating ‘subgroup effects’ and investigating ‘differential subgroup effects’, where the former investigates a specific subset or subpopulation of the entire sample for a main effect and the latter investigates treatment effect heterogeneity using an interaction test between subgroups defined by factors measured prior to treatment.55
Aims
The objective of this literature review was first to identify RCTs of therapist-delivered interventions for NSLBP that had performed secondary analyses in the form of subgroup analyses. All identified papers were then assessed using a set of methodological criteria to evaluate the quality of the subgroup analyses. Furthermore, the conduct and reporting of subgroup analyses were also assessed.
Method
This literature review work was carried out as part of the PhD studentship funded in this programme of work.
We used the search strategy described in our previous review (above) to identify potential papers of RCTs of therapist-delivered interventions for LBP. The databases were originally searched up to September 2011, and the searches were updated in July 2014. Electronic searches were conducted using the following databases:
-
MEDLINE
-
Ovid MEDLINE In-Process & Other Non-Indexed Citations
-
EMBASE
-
Web of Science
-
Citation Index and CENTRAL.
Search strategy
As described above, we started our searches using the terms ‘low back pain’ combined with keywords including ‘subgroup’, ‘effect modifier’ and ‘moderator’. This yielded only publications that used the term ‘subgroup’ in the title and/or the abstract; it failed to pick up papers that used the term in the main body of the text. Therefore, we reran the searches to identify all records for ‘low back pain’ and ‘RCTs’, which we then filtered for therapist-delivered interventions.
Inclusion and exclusion criteria
Box 2 outlines the inclusion and exclusion criteria for this review.
Inclusion criteria
-
Randomised controlled trials.
-
Participants aged 18 years or more with a history of NSLBP.
-
Therapist-delivered interventions for NSLBP (including psychological interventions and intensive rehabilitation programmes).
-
Primary or secondary analysis of RCTs reporting that a subgroup analysis had been conducted.
Exclusion criteria
-
LBP with known likely cause (fracture, infection, malignancy specific cause, ankylosing spondylitis and other inflammatory disorders).
-
Studies investigating disorders additional to NSLBP, e.g. NSLBP and neck pain.
-
Outcome not a valid clinical measure of NSLBP, e.g. number of days sick leave.
-
Testing a clinical prediction rule.
-
Treatment effect modification over time, i.e. treatment × moderator × time.
-
Pooled datasets of similar trials.
Reproduced from Mistry D, Patel S, Hee SW, Stallard N, Underwood M. Evaluating the quality of subgroup analyses in randomized controlled trials of therapist-delivered interventions for nonspecific low back pain: a systematic review. Spine 2014;39:618–29; with permission from Lippincott Williams & Wilkins.
Screening and data extraction
We screened titles and abstracts based on the predetermined inclusion criteria. We selected all papers potentially reporting subgroup analysis for further investigation. All agreed full papers were obtained for data extraction. Data were extracted on to a standardised extraction form and any discrepancies were resolved by a second reviewer.
Quality assessment of subgroup analysis
We used the same Pincus et al.44 criteria described in the previous review (see Risk of bias and quality assessment, above) to assess the quality of subgroup analyses. Three independent reviewers (DM, SP and SWH) assessed the quality of the identified papers. All discrepancies were addressed and resolved through discussion.
To reduce conflicts of interest, members of the reviewing team who were authors on any included studies did not participate in the quality assessment exercises.
Analysis
To assess the conduct and reporting of subgroup analysis we referred to existing authoritative reviews. 56,57 Papers were assessed for:
-
design and methods
-
reporting of results
-
interpretation and discussion.
The last two items were assessed only for those papers that used interaction tests for subgroup analyses.
Each paper was examined to see if it conformed to four key recommendations in the area of subgroup analyses (Box 3).
-
Exact subgroup definitions should be given beforehand for continuous and categorical variables, along with some justification to avoid post-hoc data dependent definitions of subgroups.
-
Subgroup analyses should be performed on the primary outcome in the study. This is simply because trials are designed to detect differences in the primary outcome only; therefore, performing subgroup analyses on any other outcome measure will substantially reduce the power.
-
A differential subgroup effect should be formally evaluated using a statistical test for interaction and the interaction effect reported. Performing tests within individual subgroups and then comparing the results is an incorrect approach to subgroup analyses as it does not directly evaluate the subgroup effect.
-
The number of subgroup analyses to be performed should be kept to a minimum. This is to avoid the issue of false-positive discovery (type I error inflation) due to multiple testing; a well-known issue if there are several subgroups of interest. Any concerns regarding multiplicity should be acknowledged and addressed appropriately, e.g. applying a Bonferroni or Sidak correction.
Reproduced from Mistry D, Patel S, Hee SW, Stallard N, Underwood M. Evaluating the quality of subgroup analyses in randomized controlled trials of therapist-delivered interventions for nonspecific low back pain: a systematic review. Spine 2014;39:618–29; with permission from Lippincott Williams & Wilkins.
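The third recommendation in Box 3 can be illustrated with a short sketch. It is not taken from any of the included trials: the subgroup estimates and standard errors are hypothetical, the two subgroups are assumed to be independent, and a normal approximation is used. It contrasts the incorrect approach (testing within each subgroup) with a formal test of the difference between subgroup effects.

```python
import math
from statistics import NormalDist


def two_sided_p(estimate, standard_error):
    """Two-sided p-value for an estimate under a normal approximation."""
    z = abs(estimate / standard_error)
    return 2 * (1 - NormalDist().cdf(z))


def interaction_test(effect_a, se_a, effect_b, se_b):
    """Test the difference between two independent subgroup treatment effects."""
    difference = effect_a - effect_b
    se_difference = math.sqrt(se_a ** 2 + se_b ** 2)
    return difference, se_difference, two_sided_p(difference, se_difference)


# Hypothetical treatment effects (e.g. RMDQ points) in two subgroups
effect_a, se_a = -2.0, 0.9   # subgroup A
effect_b, se_b = -1.0, 0.9   # subgroup B

p_a = two_sided_p(effect_a, se_a)   # 'significant' within subgroup A
p_b = two_sided_p(effect_b, se_b)   # 'non-significant' within subgroup B
diff, se_diff, p_interaction = interaction_test(effect_a, se_a, effect_b, se_b)

print(f"within A: p = {p_a:.3f}; within B: p = {p_b:.3f}")
print(f"interaction: {diff:.1f} (SE {se_diff:.2f}), p = {p_interaction:.3f}")
```

In this constructed example the within-subgroup tests disagree at the 5% level, yet the interaction test gives no evidence that the treatment effect differs between the subgroups, which is the pattern the recommendation warns against misreading.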
Results
Our initial search identified 5581 papers. All titles and abstracts were screened to identify potential papers reporting results of RCTs of therapist-delivered interventions for LBP. We excluded 5521 papers during the screening process. The full text for the remaining 60 papers was then thoroughly examined to look for subgroup analyses, of which 21 were excluded as they either did not meet the inclusion criteria or they met one or more of the exclusion criteria. We included 39 papers in the final review (Figure 3).
A summary of the included studies is given in Table 6 and a summary of excluded studies can be found in Appendix 1. A total of 63% of the included papers were from the Netherlands, the UK or the USA. The median study size was 223, ranging from 100 to 3093.
Subgroup quality assessment | Author | Date of publication | Country | Study size | Interventions compared | Outcome measure and follow-up | Subgroups identified (interaction test only) |
---|---|---|---|---|---|---|---|
Confirmatory findings | Sheets58 | 2012 | Australia | 148 | First-line care group vs. McKenzie group | Pain measured at 1 week and 3 weeks; GPE at 3 weeks | None |
Confirmatory findings | Smeets59 | 2009 | Australia and New Zealand | 259 | Exercise and advice vs. exercise and sham advice vs. sham exercise and advice vs. sham exercise and sham advice | Pain intensity (11-point scale) and patient-specific function scale (0–10 scale) measured at baseline, 6 weeks and 52 weeks | None |
Confirmatory findings | Underwood46 | 2011 | UK | 701 | Advice plus cognitive–behavioural intervention vs. advice only | RMDQ and MVK scores measured at baseline and 3, 6 and 12 months | Age and employment |
Exploratory findings | Becker60 | 2008 | Germany | 1378 | Multifaceted GI vs. GI plus MC vs. postal dissemination of guideline (control) | FFbHR measured at baseline and 6 months | None |
Exploratory findings | Cecchi61 | 2012 | Italy | 210 | Back school vs. individual physiotherapy vs. spinal manipulation | RMDQ score measured at baseline and 3, 6 and 12 months | None |
Exploratory findings | Cherkin62 | 1998 | USA | 321 | Physical therapy vs. chiropractic manipulation vs. educational booklet | Bothersomeness of symptoms and RMDQ score measured at baseline, 4 weeks and 12 weeks | Mental health |
Exploratory findings | Cherkin63 | 2001 | USA | 262 | Chinese acupuncture vs. therapeutic massage vs. self-care education | Bothersomeness of symptoms and RMDQ score measured at baseline, 4 weeks, 10 weeks and 1 year | None |
Exploratory findings | Cherkin49 | 2009 | USA | 638 | Individualised acupuncture vs. StA vs. SiA vs. usual care | Bothersomeness of symptoms and RMDQ score measured at baseline, 8 weeks, 26 weeks and 1 year | None |
Exploratory findings | Hansen64 | 1993 | Denmark | 180 | Intensive dynamic back-muscle exercise vs. conventional physiotherapy vs. placebo control (semi-hot packs and light traction) | Pain level (10-point scale) measured at baseline, 4 weeks, 6 weeks and 1 year | None |
Exploratory findings | Hay65 | 2005 | UK | 402 | Brief pain management vs. manual physiotherapy | RMDQ score measured at baseline, 3 months and 12 months | None |
Exploratory findings | Juni66 | 2009 | Switzerland | 104 | Standard care alone vs. standard care plus SMT | Pain intensity (11-point scale) and analgesic use measured at baseline, days 1–14 and 6 months | None |
Exploratory findings | Karjalainen67 | 2004 | Finland | 170 | Mini-intervention group vs. worksite visit group vs. usual care group | Pain intensity (11-point scale) measured at baseline, 3 months, 6 months, 1 year and 2 years | Perceived risk for not recovering and type of occupation (comparing mini-intervention vs. usual care and worksite visit vs. usual care) |
Exploratory findings | Kole-Snijders68 | 1999 | Netherlands | 159 | OPCO vs. OPDI vs. WLC | Main outcome unclear; outcomes measured post treatment and at 6 months and 1 year | None |
Exploratory findings | Roche69 | 2007 | France | 132 | AIP vs. FRP | Main outcome unclear; outcomes measured at baseline and 5 weeks | Sorenson score |
Exploratory findings | Sherman48 | 2009 | USA | 638 | Individualised acupuncture vs. StA vs. SiA vs. usual care | Bothersomeness of symptoms and RMDQ score measured at baseline, 8 weeks, 26 weeks and 1 year | Baseline RMDQ score |
Exploratory findings | Smeets70 | 2006 | Netherlands | 223 | APT vs. CBT vs. combined APT and CBT (CTrt) vs. WL | RMDQ score measured at baseline, 10 weeks, 6 months and 12 months | Baseline RMDQ |
Exploratory findings | Smeets71 | 2008 | Netherlands | 223 | ATP vs. GAP vs. CTrt vs. WL | RMDQ score measured at baseline, 10 weeks, 6 months and 12 months | None |
Exploratory findings | Tilbrook35 | 2011 | UK | 313 | Yoga vs. usual care | RMDQ score measured at baseline and 3, 6 and 12 months | None |
Exploratory findings | Underwood45 | 2007 | UK | 1334 | Control (best care in general practice) vs. exercise programme vs. spinal manipulation vs. combined treatment (manipulation and exercise) | RMDQ score measured at baseline, 3 months and 1 year | Expectation |
Exploratory findings | Van der Hulst72 | 2008 | Netherlands | 163 | RRP vs. usual care | RMDQ score measured at baseline, 1 week after treatment and 4 months after treatment | Pain intensity and depression |
Exploratory findings | Witt50 | 2006 | Germany | 3093 | Acupuncture vs. control (delayed acupuncture treatment 3 months later) | FFbHR (0–100 scale) measured at baseline and 3 and 6 months | Initial back pain, age and years of schooling |
Insufficient findings | Bendix73 | 1998 | Denmark | 816 | FRP programme vs. outpatients programme (control) | Main outcome unclear; outcomes measured at baseline and 1 year | |
Insufficient findings | Beurskens74 | 1995 | Netherlands | 151 | Traction vs. sham traction | GPE and severity measured on VAS at baseline and 5 weeks | |
Insufficient findings | Bishop75 | 2011 | USA | 112 | Supine thrust technique vs. side-lying thrust vs. non-thrust technique | ODI measured at 1 week, 4 weeks and 6 months | None |
Insufficient findings | Carr76 | 2005 | UK | 237 | Group exercise programme vs. individual physiotherapy | RMDQ score measured at baseline, 3 months and 6 months | |
Insufficient findings | Ferreira77 | 2009 | Australia | 191 | General exercise vs. motor control exercise vs. SMT | GPE (11-point scale), patient-specific functional status, RMDQ score, pain intensity (10-point scale) and spinal stiffness measured at baseline and 8 weeks | None |
Insufficient findings | Glazov78 | 2009 | Australia | 100 | Laser acupuncture vs. sham acupuncture (control) | Pain (VAS) measured at baseline, immediately after treatment, 6 weeks and 6 months | |
Insufficient findings | Gudavalli79 | 2006 | USA | 235 | FD vs. ATEP | Perceived pain (VAS), RMDQ score and SF-36 measured at baseline, 4 weeks, 3 months, 6 months and 1 year | |
Insufficient findings | Hsieh80 | 2004 | China | 146 | Acupressure vs. physical therapy | Short-form pain questionnaire measured at baseline, 4 weeks and 6 months | |
Insufficient findings | Jellema81 | 2005 | Netherlands | 314 | MIS vs. usual care | RMDQ score, perceived recovery (7-point scale) and sick leave measured at baseline; 6, 13 and 26 weeks; and 1 year | |
Insufficient findings | Johnson82 | 2007 | UK | 234 | Group exercise and education using a cognitive behavioural approach vs. usual care | Pain (VAS) and RMDQ score measured at baseline and 3, 9 and 15 months | Patient preference |
Insufficient findings | Kalauokalani83 | 2001 | USA | 166 | Acupuncture vs. massage (subanalysis of Cherkin 2001 paper) | RMDQ score measured at baseline, 4 weeks, 10 weeks and 1 year | Patient expectations |
Insufficient findings | Mellin84 | 1989 | Finland | 456 | Inpatient treatment vs. outpatient treatment vs. control (advice) | LBP disability index (scale 0–45) measured at baseline and 3 months | |
Insufficient findings | Klaber Moffett85 | 2004 | UK | 187 | Exercise vs. usual care | RMDQ score measured at baseline, 6 weeks, 6 months and 1 year | |
Insufficient findings | Myers86 | 2008 | USA | 444 | Usual care vs. usual care plus patient choice of acupuncture, chiropractic or massage | RMDQ score measured at baseline, 5 weeks and 12 weeks | None |
Insufficient findings | Seferlis87 | 1998 | Sweden | 180 | MTP vs. ITP vs. GPP | Main outcome unclear; outcomes measured at baseline and 1, 3 and 12 months | |
Insufficient findings | Thomas88 | 2006 | UK | 241 | Traditional acupuncture vs. usual care | Bodily pain dimension of the SF-36 (0–100 scale) measured at baseline and 3, 12 and 24 months | Expectation |
Insufficient findings | Van der Roer89 | 2008 | Netherlands | 114 | Intensive group training protocol vs. guideline group | RMDQ score measured at baseline and 6, 13, 26 and 52 weeks | |
Insufficient findings | Vollenbroek-Hutten90 | 2004 | Netherlands | 163 | RRP vs. usual care | RMDQ score measured at baseline, 1 week after treatment and 4 months after treatment | |
Methodological quality of subgroup analyses
The methodological quality of the subgroup analyses performed in the identified papers was assessed to determine the strength of evidence that they provide. Of the 39 papers:35,45,46,48–50,58–90
-
Three (8%) papers46,53,54,58,59 met all five criteria and therefore provided confirmatory evidence. Two of these papers58,59 were too small to anticipate finding any important interaction if it were present (n = 148 and 259).
-
Eighteen (46%) papers provided exploratory evidence, that is, they met criteria 3, 4 and 5 (see Table 6).
-
Eighteen (46%) papers provided insufficient evidence (see Table 6).
Assessment of conduct and reporting of subgroups
We examined the conduct and reporting of subgroups in terms of design and methods and found that:
-
One study50 had sufficient power to detect an interaction; however, subgroups of interest were not specified a priori.
-
Thirty-one (79%) studies35,45,48–50,60–63,66–74,76–78,80,81,83–90 did not prespecify subgroups of interest.
-
Eight studies46,58,59,64,65,75,79,82 reported prespecified subgroups for confirmatory analyses; six of these studies also carried out exploratory analyses without clear distinction between analysis types.
-
Sometimes it was not clear from the methods that subgroup analyses were going to be performed; they were just presented in the results.62,69,74,80
-
All papers measured subgroups of interest prior to randomisation, with most using adequate measurements.
-
Prior to performing analyses, only one paper58 reported the expected size and direction of the subgroup effect. A further three papers46,59,85 predicted the direction of the subgroup effect.
-
One-third (13/39) of the papers45,46,48,58,59,72,77,79,83,85–87,90 provided some justification regarding the choice of subgroups to be analysed.
-
In two papers45,59 around 60 interaction tests were conducted, substantially increasing the chances of detecting false-positive findings. Of the three papers46,58,59 that provided confirmatory findings, only one of them46 adjusted for multiplicity. The authors applied a Bonferroni correction to their confirmatory subgroup analyses.
-
Twelve (31%) of the papers64,73,74,76,79–81,84,85,87,89,90 did not use a statistical test for interaction to assess for treatment effect modification. Of these, two of the papers74,87 did not give any indication as to what statistical method they used. Two papers73,84 looked at correlations between individual subgroups and outcomes within each treatment arm separately. Two papers79,80 used t-tests between treatment groups within individual subgroups. Five papers76,81,85,89,90 used either multiple linear regression or multiple logistic regression for each individual subgroup. One paper64 compared the medians across three trial arms within individual subgroups using Kruskal–Wallis tests.
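The multiplicity concern raised above for papers running around 60 interaction tests can be quantified with a brief sketch. It is an illustration only: it assumes the tests are independent and each is run at a 5% significance level, neither of which is stated in the source papers, and the function names are hypothetical.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


def sidak_threshold(n_tests, alpha=0.05):
    """Per-test threshold giving exactly alpha overall for independent tests."""
    return 1 - (1 - alpha) ** (1 / n_tests)


n_tests = 60  # 'around 60' interaction tests, as described above
fwer = familywise_error_rate(n_tests)
print(f"chance of at least one false positive: {fwer:.2f}")
print(f"Bonferroni per-test threshold: {bonferroni_threshold(n_tests):.5f}")
print(f"Sidak per-test threshold: {sidak_threshold(n_tests):.5f}")
```

Under these assumptions at least one spurious 'significant' interaction is close to certain, and either correction would require per-test p-values of well under 0.001.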
We examined the conduct and reporting of subgroups in terms of reporting of results and found that:
-
A statistical test for interaction was reported to have been used in 27 (69%) of the papers.35,45,46,48–50,58–72,75,77,82,83,86,88
-
Six studies45,48,61,72,75,77 reported both the interaction effect sizes with CIs and the corresponding p-values.
-
Four studies46,58,59,82 reported only the interaction effect sizes with CIs.
-
Eight studies35,50,66,67,69,83,86,88 reported only the p-values.
-
Nine papers49,60,62–65,68,70,71 did not report the interaction effect sizes, CIs or p-values.
-
Four studies60,66,70,88 reported subgroup analyses within individual subgroups rather than between-group interaction.
We examined the conduct and reporting of subgroups in terms of reporting of interpretation and discussion and found that:
-
Four60,66,70,88 out of 27 papers that performed interaction tests reported subgroup analyses within individual subgroups and thus based the interpretations and discussion on this as well.
-
References to other relevant studies (supporting or contradicting) were made in around one-third of the papers.
-
The limitations of subgroup analyses were reported in 12 papers.45,46,48,58–61,65,76,79,86,90
Discussion and conclusion
Subgroup analyses have been attempted in several papers; however, there is confusion between investigating ‘subgroup effects’ and ‘differential subgroup effects’.55 The overall quality of the subgroup analyses is poor, with most papers providing only exploratory or insufficient findings. The reporting of subgroup analyses is also generally of a poor standard. The sample sizes of the trials have been small and thus underpowered to detect interactions. Only one trial50 was appropriately powered for the analysis; however, the authors failed to specify the subgroups a priori. The recommended guidelines should be used when performing subgroup analyses to ensure that they are reliable and of a good standard.56,91 The current approaches are not suitable to address the research question. New methods to perform subgroup analyses are required to address the methodological concerns highlighted.
Summary of reviews
Both reviews conducted during this programme of work have been informative in developing our understanding of subgrouping in LBP.
Review 1 looked at identifying potential moderators to be tested within the back pain repository. The literature on moderators is weak and, consequently, lacking in rigour to inform clinical practice. Despite this, the review has helped us to identify some potential moderators of treatment effect, including age, educational attainment, employment status, symptoms of anxiety or depression, longer history of back pain and treatment expectations in at least one trial. We used these variables in our later analyses within our repository of data.
Review 2 looked at the quality of subgroup analyses conducted in the LBP literature. This review concluded that the overall quality was poor. A trial that is sufficiently powered to detect subgroups would need to be approximately four times larger than a traditional trial powered to detect a main effect of the same magnitude.92 This would be a time-consuming and costly undertaking, for which care would also need to be taken to select moderators that were clinically relevant and applicable.
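The 'four times larger' figure can be motivated with a simple calculation. The sketch below is an illustration under stated assumptions, not the derivation in the cited reference: it assumes a continuous outcome with a common standard deviation, equal allocation to two arms, and a binary moderator that splits participants into two equal subgroups. The function names are hypothetical.

```python
import math


def se_main_effect(n_total, sd=1.0):
    """Standard error of the difference in means between two equal arms."""
    n_per_arm = n_total / 2
    return math.sqrt(sd ** 2 / n_per_arm + sd ** 2 / n_per_arm)


def se_interaction(n_total, sd=1.0):
    """Standard error of the difference between two subgroup treatment effects.

    Four cells (2 arms x 2 subgroups) of equal size; the interaction is a
    contrast of four cell means, so four cell variances are summed.
    """
    n_per_cell = n_total / 4
    return math.sqrt(4 * sd ** 2 / n_per_cell)


n = 200
print(f"main effect SE with n = {n}: {se_main_effect(n):.4f}")
print(f"interaction SE with n = {n}: {se_interaction(n):.4f}")
print(f"interaction SE with n = {4 * n}: {se_interaction(4 * n):.4f}")
```

With the same number of participants the interaction estimate has twice the standard error of the main effect, so matching the precision (and hence the power for an effect of the same magnitude) takes four times as many participants. Unequal subgroup sizes would increase the requirement further.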
In addition to these reviews we have previously published a systematic review93 that summarised findings from RCTs testing the effects of a clinical prediction rule for NSLBP. Clinical prediction rules have been developed and are being used in clinical practice to help clinicians to make decisions on treatment; however, the overall effect of such tools is unclear. Multicomponent clinical prediction rules have the potential to be much more powerful tools for targeting treatments than single-component measures. We identified 1821 potential citations after all duplicates had been removed. Two reviewers independently screened the titles and abstracts, and consensus was reached on obtaining 35 papers for full detailed evaluation. Of these, only three papers94–96 were included in the review. The results from the available trials do not convincingly support the use of clinical prediction rules in the management of NSLBP. We concluded that the existing RCTs looking to validate clinical prediction rules in LBP are limited. Methodologies for the validation of these rules lack clarity and, consequently, the evidence for, and development of, the existing prediction rules in LBP is generally weak.
Current approaches have failed to provide the data needed to target treatments for LBP. There is therefore a need to look at alternative methods to address this problem. We propose three recommendations:
-
To develop new and novel methods to identify multiple participant characteristics or clusters of moderators that would identify who is most or least likely to benefit.97–99
-
To apply individual participant data meta-analysis to homogeneous pooled data sets, as this would improve statistical power.
-
To develop subgroups, and suggested interventions, based on clinical reasoning, and test these within trials to determine if the targeted intervention produces a larger average effect size than existing non-specific interventions.
In this programme we address points 1 and 2, leaving point 3 for others within the back pain research community to consider and address.
Chapter 3 Collating data
In this chapter we detail the process of identifying and approaching chief investigators and/or data custodians for trial data for inclusion in our repository of back pain trials.
Identification of potential trials
We used the search results generated from review 1 (Identification of potential moderators, described in Chapter 2) as a starting point for identifying trials of interest. In the first instance we were interested in only:
- RCTs
- trials of therapist-delivered interventions
- trials with a sample size of > 179 participants.
Based on these criteria we filtered the original search output to identify 658 citations. These were systematically screened by two members of the team independently (see Figure 2). We also obtained further data through snowballing; that is, researchers who were aware of the project offered us data from trials that were not on our original list. Although some of the trials obtained through the snowballing process are smaller in sample size than our target studies, we decided to include these to add power to our analysis.
Justification of sample size
We started with an original lower limit of 200 for the sample size. Allowing for some loss to follow-up, a trial of 200 participants would have 90% statistical power to identify a SMD of 0.5 between two treatment groups. Any individual trials smaller than this are likely to be seriously underpowered for their primary outcome. On screening the trials we found many that achieved a final sample size of just under 200 participants; typically these were studies that aimed for around 200 participants but fell short of the target. We therefore revised our inclusion criterion to more than 179 participants. From the practical perspective of approaching trial investigators, this yielded a manageable number of trials to approach; large trials (those with thousands of participants) and small trials (fewer than 100 participants) each create a similar amount of work to collate.
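The power statement above can be checked approximately with a short calculation. The sketch below is illustrative only: it assumes equal allocation (100 participants per arm), a two-sided 5% significance level, a normal approximation to the test statistic and, as a hypothetical figure, 15% loss to follow-up. None of these settings is stated in the original justification, and the function name is ours.

```python
from math import sqrt
from statistics import NormalDist


def power_two_arm(n_per_arm: float, smd: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided z-test comparing two equal-sized arms."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected test statistic: SMD divided by its standard error sqrt(2 / n_per_arm).
    expected_z = smd * sqrt(n_per_arm / 2)
    # Ignores the negligible chance of a significant result in the wrong direction.
    return z.cdf(expected_z - z_crit)


# 200 randomised participants, i.e. 100 per arm, with complete follow-up.
full = power_two_arm(100, 0.5)
# Hypothetical 15% loss to follow-up, leaving 85 analysed per arm.
after_loss = power_two_arm(85, 0.5)
print(f"complete follow-up: {full:.3f}; after 15% loss: {after_loss:.3f}")
```

Under these assumptions the power stays just above 90% after the assumed loss to follow-up, which is consistent with the figure quoted in the text.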
Process for approaching investigators
We identified 42 trials33,49,50,62,63,65,70,76,82,84,94,100–130 that fitted our inclusion criteria. For these trials we identified the chief investigator and the best e-mail contacts for them. Between 2011 and 2012 each investigator was sent an e-mail to invite him/her to participate in the repository. Each e-mail included the following attachments:
- formal invitation letter (see Appendix 2)
- information sheet (see Appendix 3)
- sample data sharing agreement (see Appendix 4).
If a response was not received within a 6- to 8-week period, a reminder e-mail was then sent. If a response was received indicating an interest in sharing data then the data sharing agreement was personalised and sent back to the investigator for review and signature. Once the signed document was received by the university, the investigator was provided with details on how to securely send the data to us. We used the University of Warwick secure file transfer service.
Secure data transfer
We requested all data from a trial. Investigators were advised that any data sets being sent to us needed to be anonymised and encrypted using an open-source compression software program such as 7-Zip 9.20 © Igor Pavlov (www.7-zip.org/). Investigators were then provided with details on how to securely transfer these data to the University of Warwick (see Appendix 5) using an upload system that was set up for the project (available at https://files.warwick.ac.uk/repositorylbpdata/sendto).
Once these data were received it was the responsibility of the team’s statisticians and/or health economists to transform the original data to the repository standard. To aid this process we requested all trial-specific information, including the protocol and questionnaires if they were available.
Final data set obtained
We obtained 14 (33%) trial data sets31,33,50,65,70,76,101–107,131 from the original 42 trials33,49,50,62,63,65,70,76,82,84,94,100–130 we approached. A further five trials132–136 were obtained through snowballing, resulting in a total of 19 data sets (Figure 4). We were unsuccessful in getting a response from 15 (36%) investigators and a further six (14%) data sets were not available for data sharing. We still have seven (17%) data sets in negotiation, for which we were unable to agree on the data sharing before starting our formal analysis; therefore, these trials have not been included in this report.
Through the process of snowballing, further smaller data sets were offered to be included in the repository. The offer of these trials was carefully considered by the research team, and it was decided that any additional data would be helpful in increasing power. Therefore, three (16%) of the 19 trials obtained have a sample size of < 179 participants.
Table 7 shows the trials that were excluded and the reason for the exclusion. Details of papers excluded as a result of multiple publications can be found in Appendix 6. Trials that were unavailable because of a lack of response from the investigator, trials for which data sets were not available and trials still under negotiation are listed in Appendix 7. A final table of included trials and associated papers is presented in Table 8.
Author | Number of participants | Reason for exclusion |
---|---|---|
Jellema137 | 314 | Not therapist delivered |
Kainz B138 | 1274 | Paper not in English |
Long A139 | 312 | Trial of exercise vs. exercise |
Von Korff140 | 255 | Not therapist delivered |
Name of/given name of trial | Corresponding author/chief investigator | Relevant publications related to the trial of interest | Number of participants |
---|---|---|---|
Witt | Witt | Witt CM, Jena S, Selim D, Brinkhaus B, Reinhold T, Wruck K, et al. Pragmatic randomized trial evaluating the clinical and economic effectiveness of acupuncture for chronic low back pain. Am J Epidemiol 2006;164:487–9650 | 3093 |
UK BEAM | Underwood | UK BEAM Trial Team. United Kingdom back pain exercise and manipulation (UK BEAM) randomised trial: effectiveness of physical treatments for back pain in primary care. BMJ 2004;329:137731 Underwood MR, Morton V, Farrin A. Do baseline characteristics predict response to treatment for low back pain? Secondary analysis of the UK BEAM data set [ISRCTN32683578]. Rheumatology (Oxford) 2007;46:1297–30245 | 1334 |
Haake | Haake | Haake M, Müller HH, Schade-Brittinger C, Basler HD, Schäfer H, Maier C, et al. Acupuncture Trials (GERAC) for chronic low back pain: randomized, multicenter, blinded, parallel-group trial with 3 groups. Arch Intern Med 2007;167:1892–8132 | 1163 |
BeST | Lamb | Lamb SE, Hansen Z, Lall R, Castelnuovo E, Withers EJ, Nichols V, et al. Group cognitive behavioural treatment for low-back pain in primary care: a randomised controlled trial and cost-effectiveness analysis. Lancet 2010;375:916–2333 Lamb SE, Lall R, Hansen Z, Castelnuovo E, Withers EJ, Nichols V, et al. A multicentred randomised controlled trial of a primary care-based cognitive behavioural programme for low back pain. The Back Skills Training (BeST) trial. Health Technol Assess 2010;14(41)34 | 701 |
Keele | Hay | Hay EM, Mullis R, Lewis M, Vohora K, Main CJ, Watson P, et al. Comparison of physical treatments versus a brief pain-management programme for back pain in primary care: a randomised clinical trial in physiotherapy practice. Lancet 2005;365:2024–3065 Whitehurst DG, Lewis M, Yao GL, Bryan S, Raftery JP, Mullis R, et al. A brief pain management program compared with physical therapy for low back pain: results from an economic analysis alongside a randomized clinical trial. Arthritis Rheum 2007;57:466–73141 | 402 |
Brinkhaus | Brinkhaus | Brinkhaus B, Witt CM, Jena S, Linde K, Streng A, Wagenpfeil S, et al. Acupuncture in patients with chronic low back pain: a randomized controlled trial. Arch Intern Med 2006;166:450–7101 | 298 |
Dufour | Dufour | Dufour N, Thamsborg G, Oefeldt A, Lundsgaard C, Stender S. Treatment of chronic low back pain: a randomized, clinical trial comparing group-based multidisciplinary biopsychosocial rehabilitation and intensive individual therapist-assisted back muscle strengthening exercises. Spine 2010;35:469–76102 | 286 |
Pengel | Pengel | Pengel LH, Refshauge KM, Maher CG, Nicholas MK, Herbert RD, McNair P. Physiotherapist-directed exercise, advice, or both for subacute low back pain: a randomized trial. Ann Intern Med 2007;146:787–96103 Smeets RJ, Maher CG, Nicholas MK, Refshauge KM, Herbert RD. Do psychological characteristics predict response to exercise and advice for subacute low back pain? Arthritis Rheum 2009;61:1202–959 | 260 |
YACBAC | Thomas | Thomas KJ, MacPherson H, Thorpe L, Brazier J, Fitter M, Campbell MJ, et al. Randomised controlled trial of a short course of traditional acupuncture compared with usual care for persistent non-specific low back pain. BMJ 2006;333:62388 Ratcliffe J, Thomas KJ, MacPherson H, Brazier J. A randomised controlled trial of acupuncture care for persistent low back pain: cost effectiveness analysis. BMJ 2006;333:626142 Thomas KJ, MacPherson H, Ratcliffe J, Thorpe L, Brazier J, Campbell M, et al. Longer term clinical and economic benefits of offering acupuncture care to patients with chronic low back pain. Health Technol Assess 2005;9(32)107 | 241 |
Hancock | Hancock | Hancock MJ, Maher CG, Latimer J, Herbert RD, McAuley JH. Independent evaluation of a clinical prediction rule for spinal manipulative therapy: a randomised controlled trial. Eur Spine J 2008;17:936–4394 Hancock MJ, Maher CG, Latimer J, Herbert RD, McAuley JH. Can rate of recovery be predicted in patients with acute low back pain? Development of a clinical prediction rule. Eur J Pain 2009;13:51–5143 Hancock MJ, Maher CG, Latimer J, McLachlan AJ, Cooper CW, Day RO, et al. Assessment of diclofenac or spinal manipulative therapy, or both, in addition to recommended first-line treatment for acute low back pain: a randomised controlled trial. Lancet 2007;370:1638–43131 | 240 |
VKBIA | Von Korff | Von Korff M, Balderson BH, Saunders K, Miglioretti DL, Lin EH, Berry S, et al. A trial of an activating intervention for chronic back pain in primary care and physical therapy settings. Pain 2005;113:323–30104 | 240 |
HullExPro | Carr | Carr JL, Klaber MJA, Howarth E, Richmond SJ, Torgerson DJ, Jackson DA, et al. A randomized trial comparing a group exercise programme for back pain patients with individual physiotherapy in a severely deprived area. Disabil Rehabil 2005;27:929–3776 | 237 |
VKSC2 | Moore | Moore JE, von Korff M, Cherkin D, Saunders K, Lorig K. A randomized trial of a cognitive-behavioural program for enhancing back pain self care in a primary care setting. Pain 2000;88:145–53105 | 226 |
Smeets | Smeets | Smeets RJ, Vlaeyen JW, Hidding A, Kester AD, van der Heijden GJ, van Geel AC, et al. Active rehabilitation for chronic low back pain: cognitive-behavioural, physical, or both? First direct post-treatment results from a randomized controlled trial [ISRCTN22714229]. BMC Musculoskelet Disord 2006;7:570 | 223 |
Cecchi | Cecchi | Cecchi F, Molino-Lova R, Chiti M, Pasquini G, Paperini A, Conti AA, et al. Spinal manipulation compared with back school and with individually delivered physiotherapy for the treatment of chronic low back pain: a randomized trial with one-year follow-up. Clin Rehabil 2010;24:26–36106 | 210 |
York BP | Torgerson | Moffett JK, Torgerson D, Bell-Syer S, Jackson D, Llewellyn-Phillips H, Farrin A, et al. Randomised controlled trial of exercise for low back pain: clinical outcomes, costs, and preferences. BMJ 1999;319:279–83133 | 187 |
Macedo | Macedo | Macedo LG, Latimer J, Maher CG, Hodges PW, McAuley JH, Nicholas MK, et al. Effect of motor control exercises versus graded activity in patients with chronic nonspecific low back pain: a randomized controlled trial. Phys Ther 2012;92:363–77134 | 172 |
Carlsson | Carlsson | Carlsson CP, Sjölund BH. Acupuncture for chronic low back pain: a randomized placebo-controlled study with long-term follow-up. Clin J Pain 2001;17:296–305135 | 50 |
Kennedy | Kennedy | Kennedy S, Baxter GD, Kerr DP, Bradbury I, Park J, McDonough SM. Acupuncture for acute non-specific low back pain: a pilot randomised non-penetrating sham controlled trial. Complement Ther Med 2008;16:139–46136 | 48 |
Summary of the included trials in the repository
The agreed and included trials in this repository are detailed in Table 9.
Name of/given name of trial | Witt, n = 309350 |
---|---|
Country | Germany |
Interventions | In the RCT part of the study there were two arms
|
Recruitment | Patients consulting a physician for LBP that were insured by one of the participating social health insurance funds were recruited. Details of the study were provided to those patients requesting acupuncture or when the physician considered acupuncture to be a suitable treatment option |
Inclusion criteria | Age ≥ 18 years, with the ability to provide informed consent. A diagnosis of CLBP with a duration of more than 6 months |
Exclusion criteria | Disc prolapse/protrusion with concurrent neurological symptoms, previous back surgery, infectious spondylopathy, LBP caused by inflammatory, malignant or autoimmune disease, congenital deformation, fracture caused by osteoporosis, spinal stenosis, and spondylolysis or spondylolisthesis |
Name of/given name of trial | UK BEAM including feasibility study, n = 133445,100 |
Country |
|
Interventions |
|
Recruitment | Recruited from GP practices after searching computerised records for potential eligible participants |
Inclusion criteria | Aged between 18 and 65 years, consulted with LBP, score of ≥ 4 on RMDQ at randomisation, pain experienced every day for the 28 days before randomisation or on 21 of those 28 days, agreement to avoid other physical treatments during the treatment period |
Exclusion criteria | Aged ≥ 65 years, potential spinal disorder, including malignancy, osteoporosis, ankylosing spondylitis, cauda equina compression, and infection, pain primarily below the knee, previous spinal surgery, another musculoskeletal disorder reported to be more troublesome than the back pain, a previous referral or attendance at a pain management clinic, a severe psychiatric or psychological disorder, other medical condition that could interfere with therapy, moderate to severe hypertension, intake of anticoagulants or long-term steroids, inability to walk 100 m when free of back pain, inability to get up off the floor unaided, receipt of physical therapy in the preceding 3 months, RMDQ score of ≤ 3 on the day of randomisation, inability to read and write English fluently |
Name of/given name of trial | Haake, n = 1163132 |
Country | Germany |
Interventions | All groups received 10 30-minute sessions (two per week). Five additional sessions were offered if after the tenth session patients experienced a 10–50% reduction in pain intensity (von Korff CPG)
|
Recruitment | Patients were recruited through advertising in newspapers, magazines, radio and television |
Inclusion criteria | Aged ≥ 18 years with a clinical diagnosis of CLBP of ≥ 6 months, no previous experience of acupuncture for LBP, a mean von Korff CPG score of ≥ 1 and a FFbHR score of < 70% |
Exclusion criteria | Any previous spinal surgery or fractures, infectious or tumorous spondylopathy, and chronic pain caused by other diseases |
Name of/given name of trial | BeST, n = 70133,34 |
Country |
|
Interventions |
|
Recruitment | Recruited from GP practices after being identified from patient records or from consultation with the GP or practice nurse |
Inclusion criteria | Aged ≥ 18 years, with at least moderately troublesome subacute or chronic LBP, with a minimum of 6 weeks’ duration, consultation with the GP for LBP within the preceding 6 months |
Exclusion criteria | LBP related to a serious cause such as infection, fracture, malignancy, those with severe psychiatric or psychological disorders, and individuals with previous experience of a cognitive–behavioural intervention for LBP |
Name of/given name of trial | Keele, n = 40265,141 |
Country | UK |
Interventions |
|
Recruitment | Recruited from GP practices |
Inclusion criteria | Adults aged 18–64 years consulting with NSLBP of < 12 weeks’ duration for the first or second time, able to give informed consent |
Exclusion criteria | Those with signs of red flags, sick leave of > 12 weeks, diagnosed with osteoporosis or inflammatory arthritis, taking systemic steroids for > 12 weeks, pregnant, previous fracture or hip/back surgery, any abdominal surgery in the preceding 3 months, receipt of treatment by any other professional for the current episode of back pain |
Name of/given name of trial | Brinkhaus, n = 298101 |
Country | Germany |
Interventions | The acupuncture and minimal acupuncture treatments consisted of 12 30-minute sessions delivered over 8 weeks
|
Recruitment | Primary recruitment method was via advertisement in local newspapers and subsequent snowballing |
Inclusion criteria | Aged between 40 and 75 years, with a clinical diagnosis of chronic LBP present for > 6 months, a VAS of ≥ 40 for average pain intensity over the previous 7 days and the use of only oral NSAIDs in the four weeks preceding treatment |
Exclusion criteria | Disc prolapse/protrusion with concurrent neurological symptoms; radicular pain; previous back surgery; infectious spondylopathy; LBP caused by inflammation, malignancy or autoimmune disease; congenital spine problems excluding minor lordosis or scoliosis; compression fracture caused by osteoporosis; spinal stenosis; spondylolysis or spondylolisthesis; a Chinese medicine diagnosis warranting treatment with moxibustion; and receipt of acupuncture treatment in the preceding 12 months |
Name of/given name of trial | Dufour, n = 286102 |
Country | Denmark |
Interventions |
|
Recruitment | Rheumatologists and GPs referred patients |
Inclusion criteria | Patients aged 18–60 years with LBP of > 12 weeks with or without pain radiating into the leg(s). The lumbar spine was assessed through radiography, CT or MRI scans. Physical examinations were also used |
Exclusion criteria | Those with symptoms of spinal pathology, including malignancy, osteoporosis, vertebral fracture and spinal stenosis, clinical symptoms of an acute herniated disc accompanied by nerve root entrapment, unstable spondylolisthesis, spondylitis, other health conditions preventing engagement in exercise and language problems |
Name of/given name of trial | Pengel, n = 26059,103 |
Country | Australia |
Interventions |
|
Recruitment | Recruited by referral to trial from health-care professional, invitation to those on a WL for physiotherapy and advert in newspaper |
Inclusion criteria | Those aged 18–80 years, NSLBP lasting for at least 6 weeks but no longer than 12 weeks |
Exclusion criteria | Those who have had spinal surgery in the past 12 months, any serious spinal abnormality, pregnancy, nerve root compromise, limited understanding of English and a contraindication to exercise |
Name of/given name of trial | YACBAC, n = 24188,142 |
Country | UK |
Interventions |
|
Recruitment | Recruited from GP practices |
Inclusion criteria | 18–65 years with non-specific LBP of 4–52 weeks’ duration |
Exclusion criteria | Patients currently having acupuncture, those with possible spinal disease, motor weakness, prolapsed central disc, past spinal surgery, bleeding disorders or pending litigation |
Name of/given name of trial | Hancock, n = 24094,131,143 |
Country | Australia |
Interventions |
|
Recruitment | Recruited from GP practices |
Inclusion criteria | Pain present in the region between the twelfth rib and buttock crease, causing moderate pain and moderate disability |
Exclusion criteria | Present episode of pain not preceded by a pain-free period of at least 1 month, suspected or known serious spinal pathology; nerve root compromise; presently taking NSAIDs or undergoing spinal manipulation; any spinal surgery within the preceding 6 months; and contraindication to paracetamol/diclofenac or SMT |
Name of/given name of trial | VKBIA, n = 240104 |
Country | USA |
Interventions |
|
Recruitment | Invitations were sent to patients who had consulted in primary care for their back pain and who were enrolled in the Group Health Cooperative |
Inclusion criteria | Patients with back pain, aged 25–65 years, with a RMDQ score of ≥ 7 on a 23-item scale |
Exclusion criteria | Those waiting for back surgery, those seeing a physical therapist or psychologist, and patients planning to disenrol from the Group Health Cooperative |
Name of/given name of trial | HullExPro, n = 23776 |
Country | UK |
Interventions |
|
Recruitment | Physiotherapy departments at acute hospitals |
Inclusion criteria | Those with mechanical LBP lasting at least 6 weeks |
Exclusion criteria | Those with sciatica, recent significant surgery, the presence of a neurological or systemic condition, psychiatric illness or pregnancy; individuals who have had spinal surgery or who were in receipt of physiotherapy in the preceding 6 weeks |
Name of/given name of trial | VKSC2, n = 226105 |
Country | USA |
Interventions |
|
Recruitment | Patients were recruited from primary care by mail 6–8 weeks after a back pain visit to a Group Health primary care physician |
Inclusion criteria | Patients with back pain, aged 25–70 years; patients who had been enrolled into Group Health for at least 1 year |
Exclusion criteria | Those being considered for surgery |
Name of/given name of trial | Smeets, n = 22370 |
Country | The Netherlands |
Interventions |
|
Recruitment | Patients referred for the first time to a rehabilitation centre by their GP or other medical professional were invited to the study |
Inclusion criteria | Aged 18–65 years with CLBP of ≥ 3 months with or without radiation to leg, a RMDQ score of > 3 and ability to walk at least 100 m without interruption |
Exclusion criteria | Vertebral fracture, spinal inflammatory disease, spinal infections or malignancy, current nerve root pathology, spondylolysis or spondylolisthesis, lumbar spondylodesis. A comorbidity preventing exercise, ongoing treatment or investigation for CLBP at the time of referral or a clear treatment preference. Use of other treatments for back pain except pain medication. Any psychopathology affecting ability to take part. Lack of proficiency in Dutch, pregnancy and substance abuse |
Name of/given name of trial | Cecchi, n = 210106 |
Country | Italy |
Interventions | All patients were given an educational booklet on the back
|
Recruitment | Rehabilitation outpatient department by psychiatrists |
Inclusion criteria | NSLBP over at least the last 6 months reported as present ‘often’ or ‘always’ |
Exclusion criteria | Neurological signs or symptoms, spondylolisthesis, spinal stenosis, scoliosis of > 20°, rheumatoid arthritis/spondylitis, previous vertebral fracture, psychiatric condition, cognitive impairment or pain-related litigation |
Name of/given name of trial | York BP, n = 187133 |
Country | UK |
Interventions |
|
Recruitment | Recruited from GP practices |
Inclusion criteria | Patients aged between 18 and 60 years with LBP that has lasted at least 4 weeks but < 6 months, who had consulted their GP. Patients had to be deemed fit to be able to undertake exercise |
Exclusion criteria | Those with a potentially serious pathology, unable to attend or participate in the classes, and those receiving ongoing physiotherapy |
Name of/given name of trial | Macedo, n = 172134 |
Country | Australia |
Interventions | In both arms patients received 12 1-hour sessions over an 8-week period. Home exercises were encouraged in both groups. The home exercises and treatment sessions totalled 20 hours
|
Recruitment | Recruitment via GPs, physiotherapists and public hospitals |
Inclusion criteria | Aged 18–80 years with NSLBP of at least 3 months and seeking care. English speaking, living in the study region for the duration of the study, fit to engage in exercise, score of moderate or greater for amount of bodily pain in the past week, and interference of pain with normal activities |
Exclusion criteria | Serious spinal pathology suspected or known, patients who have had spinal surgery or who are due to have such surgery during the study period, nerve root compromise, any comorbidities preventing participation in exercise |
Name of/given name of trial | Carlsson, n = 50135 |
Country | Sweden |
Interventions |
|
Recruitment | Patients with CLBP, who were referred to an outpatient pain clinic during a 3-year period, were included |
Inclusion criteria | Patients with LBP without radiation below the knee for > 6 months, normal neurological examination of lumbosacral nerve function |
Exclusion criteria | Those who have had previous acupuncture treatment, patients with major trauma or systemic disease and pregnancy |
Name of/given name of trial | Kennedy, n = 48136 |
Country | UK |
Interventions |
|
Recruitment | Patients put on a WL for physiotherapy by their GP |
Inclusion criteria | Adults aged 18–70 years, who are able to give informed consent with NSLBP, with or without referred pain, of up to 12 weeks’ duration |
Exclusion criteria | Those with red flags, pain that has lasted for > 12 weeks, those with a contraindication to acupuncture or previous acupuncture treatment, any other conflicting or ongoing treatments |
Grouping of interventions
Initial examination of the data showed that no two trials studied identical interventions. Even the usual-care arms of included studies are likely to differ according to jurisdiction, site of recruitment and age of the study. Even with our initial large sample size it was clear that, to be able to make meaningful comparisons, we would need to pool interventions into broad groups for our analyses. As a first stage we identified the control interventions and classified these as either usual care or a sham control. There is, for example, evidence from the acupuncture literature that the difference between sham acupuncture and usual care is greater than any difference between sham and verum acupuncture.145 We therefore opted to separate the sham interventions from the usual care control in our analyses comparing different treatments with control or with each other.
There may be qualitative differences between sham treatments. For example, sham acupuncture, through which the participant has had the sensation of being needled, might have a different effect from a sham educational intervention. In some analyses we have included sham interventions, typically sham acupuncture, as a separate category. For this reason we have, where appropriate, specified the nature of the sham intervention considered.
We used the following approach to develop our final grouping of interventions:
1. Careful reading of each trial intervention to decide on core groups [individual physiotherapy, exercise, manipulation, advice/education, psychological therapy, graded activity, acupuncture, combination therapy, mock transcutaneous electrical nerve stimulation (TENS), sham acupuncture and control]. We listed all of the trials contributing to each of the core groups together with the number of participants. Subsequently, links were made between core groups to indicate potential direct and indirect comparisons (Figure 5).
2. To explore further the potential direct and indirect comparisons, a second figure was constructed (Figure 6). This shows the same groups presented in the first step with the additional information on the number of trials and total number of participants contributing to each of the comparisons.
3. Finally, to allow for any meaningful comparisons, we split the groups mentioned in steps 1 and 2 into three broad categories, namely active physical (exercise and graded activity), passive physical (individual physiotherapy, manipulation and acupuncture) and psychological (advice/education and psychological therapy) (Table 10).
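Parent group | Subgroup | Subtype |
---|---|---|
Intervention | Active physical | Exercise |
Intervention | Active physical | Graded activity |
Intervention | Passive physical | Acupuncture |
Intervention | Passive physical | Manual therapy |
Intervention | Passive physical | Individual physiotherapy |
Intervention | Psychological | Advice/education |
Intervention | Psychological | Psychological (cognitive behavioural approach) |
Sham control | Sham acupuncture | |
Sham control | Sham electrotherapy | |
Sham control | Mock TENS | |
Sham control | Sham advice/education | |
Control (GP/usual care) | GP | |
Control (GP/usual care) | WL | |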
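The pooling of core groups into three broad categories can be pictured as a simple lookup. The sketch below is a minimal illustration using the category and group names given in the text; the variable and function names are hypothetical and do not come from the repository itself. Core groups that were not assigned to one of the three broad categories (for example, combination therapy) deliberately raise an error.

```python
# Broad categories and the core groups pooled into each, as named in the text.
BROAD_CATEGORIES = {
    "active physical": ["exercise", "graded activity"],
    "passive physical": ["individual physiotherapy", "manipulation", "acupuncture"],
    "psychological": ["advice/education", "psychological therapy"],
}

# Invert the mapping so that each core group points to its broad category.
CORE_TO_BROAD = {
    core: broad for broad, cores in BROAD_CATEGORIES.items() for core in cores
}


def broad_category(core_group: str) -> str:
    """Return the broad category for a core group (hypothetical helper)."""
    key = core_group.strip().lower()
    if key not in CORE_TO_BROAD:
        raise KeyError(f"no broad category defined for {core_group!r}")
    return CORE_TO_BROAD[key]


print(broad_category("Acupuncture"))
print(broad_category("graded activity"))
```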
In this programme of work we were not seeking to estimate the true effect size of any individual intervention. Rather, we were seeking to identify predictors of treatment response. These analyses were constrained by the availability of data on potential moderators that could be pooled across trials. Considering the potential mechanisms through which the potential moderators might affect outcome, the study team concluded that it was reasonable to pool interventions that might under other circumstances appear rather heterogeneous. In particular, the decision to include several superficially different interventions as passive physical interventions might surprise some readers. Our view, however, is that these are distinctly different from active exercise-based interventions, or those working through a psychological approach. Essentially, they all consist of an assessment, whatever reassurance and education is provided as part of the treatment session, plus whatever modality is being offered, be it massage/mobilisation/manipulation or needling. We consider these to be conceptually sufficiently close in their mode of action that it is unlikely there will be distinctions in how the potential moderators included in our analyses might affect outcomes. They are, however, distinctly different from the active physical or psychological interventions in how treatment moderation might operate.
In organising the data we also identified combined interventions but there were too few data points for it to be worthwhile pursuing these analyses. For this reason these were excluded from our final analyses.
Chapter 4 Creating the repository database and data control
Typographical conventions
This chapter presents the methods we used to create the repository database. To distinguish database vocabulary and commands from regular text, different fonts are used. Database object-class vocabulary is printed in sans-serif font [like this] and commands for mapping and transformation procedures are printed in monospaced typewriter font. In addition, coloured command fonts in the text are used for ease of cross-referencing between the program commands shown in figures and the text explanations.
Background
Clinical trial data sets can be stored in a tabular format, for example Microsoft Excel® or SPSS (Statistical Package for the Social Sciences). A tabular format typically uses each row to represent data from a participant and each column to represent an item from a case report form (CRF).
Tabular formats have the advantage of being intuitive, relatively simple to create and machine readable. However, this format can be susceptible to excessive growth, especially when clinical and non-clinical items are measured across multiple time points. Data collected for withdrawn participants or non-responders would still require columns for all variables irrespective of whether or not they were used. Repeating questions pose a similar problem, whereby storage space must be allocated across the whole domain to accommodate all responses. For example, asking for a participant’s medical history of prescribed drugs would require a new column to be added for every drug listed. If only one participant documented a long list of drugs then many columns would have to be created for all participants.
Tabular formats are effective for only the smallest of trials and quickly become inefficient and difficult to maintain when the range of data collected increases. For larger trials, a more robust solution is to use a relational database. The relational database model allows individual tables to be created for each CRF and for repeating sets of questions. Normalisation rules are often applied to define the columns for each table and the logical relationships are used to create table joins.146
Figure 7 shows sample data in a tabular format and the normalised equivalent in a relational database. The sample data consist of the subject identification, recruitment date, demographic data and the RMDQ scores taken at baseline and at 3-month follow-up. The data are normalised into four tables, namely SUBJECT, DEMOGRAPHICS, RMDQ (for the RMDQ measurement) and FU (follow-up); the last is used to store the time points for each follow-up visit.
Each table has a primary key (PKey) column for storing a unique record identifier that is used as the basis for creating relationships between tables (see Figure 7b). The relationship between SUBJECT and DEMOGRAPHICS is one-to-zero-or-one, that is, a subject can have zero or one demographic record. The PKey from the SUBJECT table is copied to the DEMOGRAPHICS table, thereby creating a join using a shared value.
The relationship between SUBJECT and RMDQ is one-to-zero-or-many, that is, a subject can have zero or many completed RMDQ questionnaires. The FU table is joined to the RMDQ table using a one-to-zero-or-many relationship. This join allows an RMDQ score to be associated with either a baseline or a 3-month follow-up time point.
To create the relationships to the RMDQ table, the PKeys from both the SUBJECT and FU tables are added as foreign keys (FKeys). This has the result of allowing a subject to have either zero or many RMDQ scores at all time points. A composite unique constraint is applied to the Subject FKey and FU FKey columns to prevent a subject from having duplicate RMDQ scores for the same time point.
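The key structure described above can be sketched in a few lines of Python using the standard-library sqlite3 module. This is an illustrative sketch only, not the schema used in the project: the column names (subject_pk, subject_fk, fu_fk and so on) and the sample values are assumptions, and SQLite stands in for whichever database engine is used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKeys only when enabled
conn.executescript("""
CREATE TABLE subject (
    subject_pk   INTEGER PRIMARY KEY,
    recruit_date TEXT NOT NULL
);
CREATE TABLE demographics (
    -- reusing the subject PKey as this table's PKey gives one-to-zero-or-one
    subject_pk INTEGER PRIMARY KEY REFERENCES subject(subject_pk),
    age        INTEGER,
    sex        INTEGER
);
CREATE TABLE fu (
    fu_pk INTEGER PRIMARY KEY,
    label TEXT NOT NULL UNIQUE
);
CREATE TABLE rmdq (
    rmdq_pk    INTEGER PRIMARY KEY,
    subject_fk INTEGER NOT NULL REFERENCES subject(subject_pk),
    fu_fk      INTEGER NOT NULL REFERENCES fu(fu_pk),
    score      INTEGER,
    UNIQUE (subject_fk, fu_fk)  -- composite constraint: one score per subject per time point
);
""")

# Hypothetical sample data
conn.execute("INSERT INTO subject VALUES (1000, '2011-01-10')")
conn.executemany("INSERT INTO fu VALUES (?, ?)", [(1, "baseline"), (2, "3 months")])
conn.execute("INSERT INTO rmdq (subject_fk, fu_fk, score) VALUES (1000, 1, 12)")
conn.execute("INSERT INTO rmdq (subject_fk, fu_fk, score) VALUES (1000, 2, 8)")


def insert_duplicate():
    """Try to add a second baseline score for the same subject."""
    try:
        conn.execute("INSERT INTO rmdq (subject_fk, fu_fk, score) VALUES (1000, 1, 9)")
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


duplicate_result = insert_duplicate()
print(duplicate_result)
```

The second baseline score is refused by the composite unique constraint, whereas the two scores at different time points are stored without complaint.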
The repository differs from a typical clinical trial database in that it is not possible to predetermine requirements by using annotated CRFs. The repository relies on data from multiple trials to be periodically reviewed and classified, and must be frequently altered to accommodate new discoveries. The relational database is not a suitable model for such a scenario because modifications to the schema can be time-consuming and complex, often requiring the expertise of information technology specialists. Thus, the database for this project needs to be flexible so that the end users, namely, statisticians and health economists, can carry out modifications without having to change the database schema.
Our solution is to create a hybrid database that is a cross between an entity-attribute-value (EAV) open schema model and a relational database. This hybrid database has the flexibility of storing sparse heterogeneous data, which allows dynamic changes while enforcing data integrity.
The next section describes the architecture of the hybrid database. The rules used to map and transform the original source data to the repository standard are described below in Mapping and transformation. Using entity-attribute-value data shows how the repository database is manipulated, such that the data can be viewed in an analysis-friendly format from any statistical program that supports Open Database Connectivity (ODBC). Extract, transform and load describes how data from multiple RCTs were extracted, transformed and harmonised to the repository standard and, finally, loaded to the repository database.
System architecture
Tables and columns in a relational database can be represented as classes and attributes in an EAV model. 147 In the subsequent text the terms ‘class’ and ‘attribute’ will be used to conform to the EAV vocabulary. The term ‘entity’ is interchangeable with the term ‘object’ and can be thought of as providing a similar role to a table row but with the significant difference of storing only a pointer to the data and not the actual data itself. The entity–relationship diagram for the hybrid database is shown in Figure 8.
We anticipated that there would be some consistent data present in all of the RCTs for describing the trial and for identifying the trial’s subjects. The two tables Primary Source and Subject were created with fixed schemas to store these data (see Figure 8). The Primary Source table stores the name of the RCT (prms_TrialName), a brief description of the trial (prms_Description) and the date on which the data were imported into the repository (prms_ImportDate). The Subject table stores the original identifier assigned to the trial participant (subj_OriginalID), the date the participant enrolled into the trial (subj_EDate), the date the participant was randomised (subj_RDate) and a unique identifier generated by the system (subj_ID). A foreign key relationship is created to link each subject to the Primary Source.
The EAV model uses a subschema consisting of tables for classes, attributes, objects and the EAV data. The Class table is used to hold a list of all the identified domains, for example RMDQ and demographics. These domains generally map to a CRF but can also be used to describe a subset of repeating questions, for example repeated medical prescriptions.
The Attribute table is used to hold a list of all identified variables that typically map to a CRF question. The Attribute table has columns for storing a short name, a verbose name, a reference to the containing class and data type details. The short name is used to store a standardised version of the original CRF question.
The Object table stores a unique identifier for each instance of a class and a reference to the class itself. A foreign key relationship is created to link each Object to a Subject. This relationship essentially makes the EAV model subject centric, that is, all of the data stored in the Object and EAV tables must be directly related to an imported subject. Relationship between objects is possible by using an ‘ancestor column’ to store the unique identifier of a related object. For example, an object used for repeated medical prescriptions will store the unique identifier of the related follow-up object in the ‘ancestor column’.
The EAV data table has three columns and is used to store all of the repository’s RCT data. Two columns hold references to the related objects and attributes, and the third stores the actual value of each object–attribute combination. The references to the objects and attributes take the form of foreign keys to the Object and Attribute tables. The value is coerced into a string regardless of the intended data type. Details of the intended data type (for example binary data, small integers or strings) are stored in the related Attribute table.
A simplification of how tabular data are represented in an EAV table is shown in Figure 9. In this example, the tabular data have one row for each subject (see Figure 9a). When the data are shown in the EAV table there are four rows for subject #1000, three rows for subject #1001 and three rows for subject #1002. For each populated cell in the tabular data a row is created in the EAV table. Subject #1000 has all cells populated and, therefore, has a row for each entry. Only three rows are entered for the other subjects because there was no RMDQ baseline score for #1001 and age was not recorded for #1002 (see Figure 9c).
In reality the EAV table will use the column Attribute ID to store the unique attribute identifier and not the text value as shown in Figure 9c. In addition, the column Object ID stores a reference to the object and not the subject ID. It is the related object that links back to the subject and to the class.
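The conversion from a tabular layout to EAV rows can be illustrated with a short Python sketch. It follows the simplified picture of Figure 9 (subject identifier and attribute name stored directly, rather than object and attribute identifiers); the column names and all of the values are hypothetical.

```python
# Hypothetical tabular data: one row per subject, None marks an empty cell
wide_rows = [
    {"subject": 1000, "age": 45, "sex": 1, "rmdq_baseline": 12, "rmdq_3mo": 8},
    {"subject": 1001, "age": 52, "sex": 2, "rmdq_baseline": None, "rmdq_3mo": 10},
    {"subject": 1002, "age": None, "sex": 1, "rmdq_baseline": 15, "rmdq_3mo": 9},
]


def to_eav(rows, id_column="subject"):
    """Create one (entity, attribute, value) row per populated cell."""
    eav = []
    for row in rows:
        for name, value in row.items():
            if name == id_column or value is None:
                continue  # empty cells produce no EAV row
            eav.append((row[id_column], name, str(value)))  # value coerced to a string
    return eav


eav_rows = to_eav(wide_rows)

# Count the EAV rows generated for each subject
rows_per_subject = {}
for entity, _, _ in eav_rows:
    rows_per_subject[entity] = rows_per_subject.get(entity, 0) + 1
print(rows_per_subject)
```

With these sample values, subject #1000 yields four rows and the other two subjects yield three rows each, mirroring the sparse storage described above.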
Mapping and transformation
Early evaluation of data sets from various RCTs in the project identified large variations between variable naming and coding conventions. For example, the RMDQ was used to measure back pain disability and the participant would tick all of the items that were applicable to him/her on that day. There are 24 items in the questionnaire and the score is the sum of all of the ticked items. One trial might name each column ‘rm1’, ‘rm2’ and so on until ‘rm24’ for all 24 individual items and ‘rmscore’ as the RMDQ score measured at baseline, ‘rm1_3mo’, ‘rm2_3mo’, . . ., ‘rm24_3mo’ and ‘rmscore_3mo’, for the 3-month follow-up data, and so on. Another trial might name them ‘rdq1’, ‘rdq2’, . . ., ‘rdq24’ and ‘rdq’ for items measured at baseline, ‘rdq11fu’, ‘rdq21fu’, . . ., ‘rdq241fu’ and ‘rdq1fu’ for items measured at the first follow-up, which could have been 1 or 3 months post randomisation, depending on the protocol. In addition, some trials might use the numerical value ‘1’ to represent a tick for that item and ‘0’ if it was not ticked. Other trials might use ‘1’ as ticked and ‘2’ as not.
Pilot mapping and transformation
A system was required to efficiently extract, transform and load (ETL) the original trial data sets into the repository. After evaluating a number of commercial and open-source ETL software packages, a prototype was developed using Microsoft SQL Server Integration Services (SSIS; SQL Server 2005 Enterprise Edition) and spreadsheets for documenting mapping and transformation instructions. The spreadsheet instructions were passed from the statisticians and health economists to the programmer who in turn created the SSIS program.
The pilot was deemed to be an inadequate solution. The versatility of SSIS as a data integration and transformation tool became a hindrance when attempting to customise a solution specifically for the repository. Setting up and configuring SSIS was found to be a laborious task, which was made even more difficult by frequent change requests and the manual interpretation of the mapping and transformation instructions. It became apparent that using SSIS was not viable and a decision was made to develop a bespoke ETL application.
XML and XSD for mapping and transforming
The method used to store mapping and transformation instructions was vastly improved by using extensible mark-up language (XML). XML is a free and open standard governed by the World Wide Web Consortium (W3C) and can be used to define a set of rules for encoding documents in a format that is both human readable and machine readable. 148 The mapping and transformation XML document is made up of simple and intuitive keywords that both statisticians and health economists can easily interpret and apply. Having non-programmers directly enter the mapping and transformation rules forgoes the requirement to pass these instructions on to a programmer, which, in turn, saves resources and reduces misinterpretation errors.
To ensure that all mapping and transformation rules were specified in the correct format and the correct order, an XML schema definition (XSD) was applied to validate the XML document. The XSD is a separate document that defines the permitted structure of the XML document.
Mapping clinical data
Figure 9b shows an example of the XML mark-up to map the original data to the equivalent repository attributes. The standard attributes AGE and SEX from the DEMOGRAPHICS class are mapped to the original variables age and gender. RMDQ scores for baseline and 3-month follow-up are both mapped to the RDQ attribute from the RMDQ class (see Table 11).
The XML mapping element accepts values for the original variable name and the follow-up time point as XML attributes. The value of the XML element is set to the name of the repository attribute. In the example for class RMDQ the attribute name is RDQ.
Unlike in the original tabular data, the repository does not store different attribute names for each time point. Instead each time point will trigger a new object to be created. The time point XML attribute is used to track to which time point an original variable belongs.
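The idea of mapping rules that create one object per class and time point can be sketched as follows. The XML vocabulary in this sketch (the mapping, class and map elements, and the original and timepoint attributes) is invented for illustration and is not the repository's actual mark-up; the source variable names and values are likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping document; tag and attribute names are assumptions
MAPPING_XML = """
<mapping>
  <class name="DEMOGRAPHICS">
    <map original="age" timepoint="0">AGE</map>
    <map original="gender" timepoint="0">SEX</map>
  </class>
  <class name="RMDQ">
    <map original="rmscore" timepoint="0">RDQ</map>
    <map original="rmscore_3mo" timepoint="3">RDQ</map>
  </class>
</mapping>
"""


def build_objects(xml_text, source_row):
    """Group mapped values into one object per (class, time point)."""
    objects = {}
    for cls in ET.fromstring(xml_text).findall("class"):
        for rule in cls.findall("map"):
            value = source_row.get(rule.get("original"))
            if value is None:
                continue  # nothing recorded, so nothing to map
            key = (cls.get("name"), int(rule.get("timepoint")))
            objects.setdefault(key, {})[rule.text.strip()] = value
    return objects


row = {"age": 45, "gender": "M", "rmscore": 12, "rmscore_3mo": 8}
objects = build_objects(MAPPING_XML, row)
for key, attributes in objects.items():
    print(key, attributes)
```

Both RMDQ variables map to the same attribute name, RDQ, but land in separate objects because their time points differ.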
Transforming clinical data
The original demographics and RMDQ scores have to be transformed into the repository standard before the data can be loaded into the repository database. Table 11 shows that, for the attribute SEX, the repository standard represents male numerically as 1 and female as 2. In the same example (see Figure 9a), the original data used different codes for male and female. The transformation for the attribute SEX therefore uses two match rules, one for each original code. When the original code for male is matched, the rule updates the attribute’s value to 1. Likewise, when the original code for female is matched, the attribute’s value is updated to 2. There is no transformation rule for the AGE attribute, as the repository accepts any valid integer value.
| Class | Attribute short name | Attribute long name | Data type | Value | Label |
|---|---|---|---|---|---|
| DEMOGRAPHICS | SEX | Participant’s sex | Integer | 1 | Male |
| | | | | 2 | Female |
| DEMOGRAPHICS | AGE | Participant’s age | Integer | > 0 | |
| RMDQ | RDQ | RMDQ score | Integer | Range 0–24 | |
| HE | RP | Recall period | Integer | > 0 | |
| HE | TYPE | Types of resource | String | 1a | Primary care doctor |
| | | | | 3a | Physiotherapist |
| | | | | 4M01 | NSAIDs |
| | | | | 6 | Aids and adaptations |
| HE | REASON | Resource reason | Integer | 2 | LBP |
| | | | | 4 | Any condition |
| HE | LOCATION | Resource location | Integer | 1 | Primary care clinic |
| | | | | 3 | Private clinic |
| | | | | 4 | Community clinic |
| HE | UNIT | Resource units | Integer | 1 | Visit |
| | | | | 3 | Prescription |
| | | | | 4 | Item |
| HE | QUANTITY | | Integer | > 0 | |
| HE | COST | | Integer | > 0 | |
| HE | PAYER | Resource payer | Integer | 1 | Public health service |
| | | | | 4 | Individual |
In the example for class RMDQ, the transformation uses a range rule to allow only values between 0 and 24 to be imported. If any value falls outside this range, the system will transform the value to an empty value.
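The two kinds of rule described in this section, match rules and range rules, can be sketched in Python as below. This is a minimal illustration, not the ETL application's implementation: the original codes "M" and "F" are assumed, as is the choice to return an empty value when a match rule finds no match.

```python
SEX_CODES = {"M": 1, "F": 2}  # assumed original codes mapped to the Table 11 standard


def apply_match(value, codes):
    """Match rule: replace a recognised original code with the standard code.

    Unrecognised codes become None (empty); this behaviour is an assumption.
    """
    return codes.get(value)


def apply_range(value, low, high):
    """Range rule: keep values inside [low, high], otherwise return None (empty)."""
    if value is None or not (low <= value <= high):
        return None
    return value


# Hypothetical source record with an out-of-range RMDQ score
source = {"gender": "F", "age": 52, "rmscore": 30}
transformed = {
    "SEX": apply_match(source["gender"], SEX_CODES),
    "AGE": source["age"],  # no transformation rule for age
    "RDQ": apply_range(source["rmscore"], 0, 24),
}
print(transformed)
```

Here the invalid score of 30 is emptied rather than loaded, whereas the sex code is translated to the repository standard.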
Mapping and transforming health-care resource-use data
Mapping health-care resource-use variables was more challenging because the different types of resources used across all RCTs do not conform to any standard and are completely variable. However, each question and answer in a typical health-care resource-use questionnaire can be broken down to the recall period, the type of resource, the reason for using the resource, the location of the resource, the unit of measurement, the quantity, the cost or expenses incurred and the payer.
Figure 10 shows a simplified version of a typical health-care resource-use questionnaire. In this example, participants were asked to record all of the health-care resources that they used at the 3-month follow-up time point (see Figure 10a). The answers provided by the participants were stored in a tabular format, which used 12 columns to capture all of the responses to the five questions (see Figure 10b). By using this format, the number of required columns to accommodate the data would grow in line with the maximum number of responses provided by any one individual. For example, if only one participant listed three items he/she bought over the counter to treat his/her LBP, the number of columns required would have to be increased from 12 to 13.
Figure 10c shows a view of the repository health-care resources data, generated from the EAV tables. This view displays the eight standard repository health-care resource-use attributes (table columns) and an additional attribute called ‘Text’, which is used to store all of the characters that are captured as comments in the CRF.
The process for creating the transformed health-care resource-use data involves splitting the original questions into a number of derived parts that will map to the standard attributes. For example, question 1 asked how many times the participant had consulted his/her doctor or any primary care doctor, for any reason, in the last 3 months. Using the information contained in the question, the recall period is set to ‘3’, the type of resource is ‘GP’, the reason for using the resource is ‘Any condition’, the location of the resource is ‘Primary Care Setting’, the unit of measurement is ‘Visit’ and the payer is ‘Public Health Service’. All of these values are derived solely from the information contained in the original question as opposed to the value of the variable. Only the attribute ‘Quantity’ is directly mapped to the original variable’s value.
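The split between values derived from the question and the one value taken from the data can be sketched as follows. The field names and the ResourceUse type are hypothetical; the labels for question 1 are those given in the text above, and the rule that an unanswered question produces no record reflects the general principle that the repository stores only populated data.

```python
from dataclasses import dataclass


@dataclass
class ResourceUse:
    """One health-care resource-use record (field names are illustrative)."""
    recall_period: int
    resource_type: str
    reason: str
    location: str
    unit: str
    payer: str
    quantity: int


# Values derived solely from the wording of question 1
GP_QUESTION = {
    "recall_period": 3,
    "resource_type": "GP",
    "reason": "Any condition",
    "location": "Primary Care Setting",
    "unit": "Visit",
    "payer": "Public Health Service",
}


def derive(question_constants, original_value):
    """Combine the question's fixed values with the participant's answer."""
    if original_value is None:
        return None  # no answer recorded, so no record is created
    return ResourceUse(quantity=original_value, **question_constants)


record = derive(GP_QUESTION, 1)
print(record)
```

Only quantity varies between participants; every other field is fixed once per question.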
The health-care resource-use data are stored in the EAV tables by creating relationships between objects. For each time point, one or many resource-use objects can be created. The HE class is used to define the time points for collecting only the health-care resource-use data. The actual resource-use data are defined in the HE-DATA class and the time point value is used to link an HE-DATA object to an HE object. The XML schema was modified to allow related classes to be described; the system interprets these descriptions to create the relationships in the Object table.
Figure 11 shows the HE-DATA class being used as a child class, that is, it has the HE class as its parent. Creating child classes signifies to the system that a relationship exists between two classes. The linkedValue attribute is used to specify a shared value between the parent and child classes. In a relational database, this shared value would be created as a foreign key constraint. In the example shown in Figure 11, an HE class has been defined for the 3-month follow-up time point using the attribute . A child HE-DATA class has been defined and linked to the parent HE class by specifying the value for the linkedValue: . This corresponds with the 3-month follow-up time point specified in the HE class.
Child classes in the XML use elements to signify the number of objects that need to be created. In a relational database, this would result in adding a new element for every table row to be inserted. The value for the element has no significance except that it must be unique. In the example shown in Figure 11, six groups have been created for the 3-month resource-use data, namely , , , , and . These groups represent each question in the CRF shown in Figure 10a and the data shown in Figure 10b.
The original tabular data required 13 columns across three rows to store all of the data for the three participants. Instead of creating a new column for every resource, the repository creates a new object. The seven groups are used to create objects for GP visit (), NHS physiotherapist visit (), private physiotherapist visit (), two instances of prescribed medicine (, ) and two instances of aids or medications bought over the counter (, ). Although seven groups have been defined in this example, the ETL system will create objects only where data exist. For example, subject #1000 will create only four objects for GP visit (), NHS physiotherapist visit (), private physiotherapist visit () and medicine prescribed by GP ().
Once all resources have been identified and a group has been defined, the mapping rules are used to populate the repository’s standard resource-use attributes. Within the structure, the is used to allow the system to locate the correct object to process and the is used to store the name of the original variable. The element stores the name of the mapped repository attribute.
The original variable stores the quantity of doctor visits and hence is mapped to the repository attribute for the group . The other information required to make sense of this value are hard coded to the repository standard within the structure, which is within the structure. For example, the recall period (), the type (), the reason (), the location (), the unit () and the payer () of the resource allocated in group is hard coded to , , , , and , respectively (see Table 11 for list of values and corresponding labels). These values can be hard coded in the XML because they are known to be based on the CRF and do not affect the original data. When the system processes this mapping instruction, subject #1000 would have a health-care resource-use object that shows that there was one GP visit made during the 3-month follow-up time point (see Figure 10b).
Other rules can be applied to manipulate the original health-care resource-use data. For example, the original prescribed medicines have to be transformed to the repository’s standardised drug coding. Figure 11 shows a transformation that uses a match rule to check for an original medicine value; if it is matched, the rule updates the attribute’s value to the corresponding standardised code.
The XML mapping and transformation instructions shown in Figure 11 were based on only one follow-up time point. To map data from more than one follow-up time point, more HE objects are created, and the health-care resource data are mapped and transformed within the child class HE-DATA that is linked to each follow-up time point.
Using entity-attribute-value data
Using the EAV data, with their classes and relationships, in their raw state for any kind of analysis work would be extremely difficult because of the fragmented nature of the EAV schema. For analysis purposes, it is therefore necessary to piece together the data to form complete data sets that are comparable to the data sets output from relational or tabular data sources. This task is achieved by processing the EAV table to derive a table for each class, a column for each attribute and a row for every object. For a class such as RMDQ, an SQL statement joins the various tables and selects the required data items by the class identifier.
The statement produces a table in a long format, which was subsequently pivoted to produce a row for each object and a column for every attribute. The outcome of this query is a data set that resembles a tabular structure that can easily be processed for further analysis.
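The join-then-pivot step can be illustrated with the following Python sketch, which uses an in-memory SQLite database. It is not the project's SQL statement: the table names (eav_class, eav_attribute, eav_object, eav), the identifiers and the sample values are all assumptions, the example class is DEMOGRAPHICS rather than RMDQ so that more than one column appears, and the pivot is done in Python rather than on the server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eav_class     (class_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE eav_attribute (attribute_id INTEGER PRIMARY KEY, class_id INTEGER, short_name TEXT);
CREATE TABLE eav_object    (object_id INTEGER PRIMARY KEY, class_id INTEGER, subject_id INTEGER);
CREATE TABLE eav           (object_id INTEGER, attribute_id INTEGER, value TEXT);
INSERT INTO eav_class VALUES (1, 'DEMOGRAPHICS');
INSERT INTO eav_attribute VALUES (10, 1, 'AGE'), (11, 1, 'SEX');
INSERT INTO eav_object VALUES (100, 1, 1000), (101, 1, 1002);
INSERT INTO eav VALUES (100, 10, '45'), (100, 11, '1'), (101, 11, '1');
""")

# Long format: one row per object-attribute combination for the chosen class
LONG_QUERY = """
SELECT o.object_id, o.subject_id, a.short_name, e.value
FROM eav AS e
JOIN eav_object    AS o ON o.object_id = e.object_id
JOIN eav_attribute AS a ON a.attribute_id = e.attribute_id
WHERE o.class_id = ?
ORDER BY o.object_id, a.attribute_id
"""


def pivot(long_rows):
    """Turn long-format rows into one dictionary (row) per object."""
    wide = {}
    for object_id, subject_id, name, value in long_rows:
        wide.setdefault(object_id, {"subject_id": subject_id})[name] = value
    return list(wide.values())


table = pivot(conn.execute(LONG_QUERY, (1,)))
for row in table:
    print(row)
```

Values come back as strings because the EAV table stores every value as a string; a fuller version would cast each column using the data type recorded for its attribute. An attribute with no stored value, such as age for subject #1002 here, is simply absent from that row.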
Although this solution provides a means for generating a usable tabular format, its scalability is severely limited. Server performance was found to decrease as the volume of data increased, and multiple pivot operations were needed for transforming object relationships. Querying the derived data sets directly was also impractical because of the huge volume of data that can be generated in the server’s temporary database, causing the server to become unstable.
An initial solution used to overcome these issues was to disconnect from the actual query by using the in-built functionality of the statistical analysis software to create a copy of the query results. A more permanent solution, which is the current practice, is to periodically create a copy of the query results into actual tables within the database.
Extract, transform and load
The bespoke ETL application was required to read the original source data, automatically apply mapping and transformation rules from an XML document, and load the processed data into the repository. In addition to these basic functions, the ETL application was also required to permit end users to set up new RCTs for import, create new classes and attributes and make changes to existing ones, and to switch between a testing and live environment.
The bespoke ETL application was distributed as a Microsoft Windows® desktop application. It works by first uploading the original data set and the XML mapping and transformation rules. The instructions defined in the XML file are applied to the original data set and the transformed data are loaded into the repository database. The ETL application allows the statistician and health economist to execute these steps from their desktop computers. The ability to switch between a test and live environment gives the users the flexibility and convenience of checking whether or not the instructions that they have delineated in the XML file are correct before loading the data sets into the live database.
Data validation
Data integrity is vital throughout the repository ETL process. To check that the mapping and transformation procedures were carried out correctly, the repository data were routinely checked against the original data sets. To achieve this, at each time point (baseline and all follow-ups), a random sample of data was extracted and manually cross-checked against the source data. Any inconsistencies were flagged and, if required, the XML instructions were amended. This process was repeated until the data were deemed to have been transformed correctly.
Storage
As a condition of our data sharing agreements to hold the RCT data sets, and to meet local governance and standard operating procedures, the repository database server is held in a secure data centre with robust disaster recovery policies in place.
The appeal of this hybrid system architecture is that the structure takes up very little space on the server and the time needed to query and retrieve data is short. Naturally, the disk space needed to store the data in this repository will grow in proportion to the number of data points.
Future data sharing
At the end of this programme of work we would like to make the pooled data available for future analyses. We will go back to all of the principal investigators (PIs)/data custodians with a new data sharing agreement to enable us to share their pooled data. Once these agreements have been signed, we will set up a website with details of how to apply for the data. All requests will be:
- forwarded to the study statistician who will carry out internal checks to ensure that the data being requested can be provided; the response from the study statistician will be supplied with the original request for the independent committee consideration
- sent via e-mail to an independent committee, who will review the application and make a final decision on data sharing; for the data requested, if a PI/data custodian has:
  - agreed to sharing the data but has asked to see a copy of the request, a copy will be sent to them via e-mail for information purposes only
  - not agreed to sharing their data, this data set will be removed from the pooled data before providing the requested data to the applicant.
Chapter 5 Crosswalking between disability questionnaire scores
This chapter presents our methodological development, exploring how to most accurately map multiple participant-reported outcome measures (PROMs), which measure the same domain, to a common scale (crosswalking). This work has now been published in Spine. 149 We sought to develop a ‘crosswalk’ of values from multiple measures of the same domain to a common single outcome score. This would allow us to pool measures more accurately than normalising to a single scale (e.g. 0–100) or expressing values as a proportion of their SD. The first step in this work is to ensure that changes in outcomes from two measures in the same individuals are both correlated and similarly responsive to change. The results from this work would inform us how, and if, we could pool various back pain-related disability outcomes into a single outcome for the main analyses (see Chapter 6).
Background
There are six PROMs that have been used in one or more studies within the repository that aim to measure back pain-related disability, namely the Chronic Pain Grade Scale (CPG) disability score (CPG-DS), which is one of the two domains in the CPG that aims to grade chronic pain status,150 Hannover Functional Ability Questionnaire for measuring back pain-related functional limitations (Funktionsbeeintrachtigung durch Ruckenschmerzen) (FFbHR),151 Oswestry Disability Index (ODI),152 Pain Disability Index (PDI),153 Patient-Specific Functional Scale (PSFS)154 and RMDQ. 28 Some trials also included generic health-related quality-of-life instruments, such as the Short Form questionnaire-12 items (SF-12)155 or the Short Form questionnaire-36 items (SF-36),156 for which the physical component score (PCS) measures the physical functioning. As mentioned later in Chapter 6 (see Outcome variables), no common instrument was used by the trials that were included in the repository. We sought to assess the agreement of these instruments by determining their correlation and responsiveness at a trial level, in order to decide whether or not data pooling was feasible. After we had completed this work, a National Institutes of Health task force identified developing crosswalking values for ‘legacy’ measures of back pain outcome as a key priority for back pain research. 157
Data
We used data from 11 trials that had used at least two of the following measurements: CPG, FFbHR, PCS, PSFS, PDI, ODI and RMDQ. For all of these analyses we used the short-term change score, as this is where any treatment effects are likely to be greatest. For the purposes of this report we have defined a short-term follow-up as a measurement taken at between 2 and 3 months post randomisation or entry to the trial. The short-term change score is the difference between the baseline and the short-term follow-up (see Chapter 6, Follow-up time point). In each case we have standardised the reporting so that a positive change score is interpreted as an improvement. Where appropriate, we used the standardised response, that is, the change score divided by the SD of the change. We used this in preference to the standardised effect size (change score divided by the SD of the measure at baseline), so that all of the standardised scores had an SD of one. This enables visual comparisons to be made between all of the scatterplots.
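The standardisation described above can be sketched in a few lines of Python. The scores are hypothetical, the function names are illustrative, and the sample SD (n − 1 denominator) is assumed because the text does not state which SD was used.

```python
from statistics import stdev


def change_scores(baseline, follow_up, higher_is_worse=True):
    """Change from baseline, signed so that a positive value is an improvement."""
    sign = 1 if higher_is_worse else -1
    return [sign * (b - f) for b, f in zip(baseline, follow_up)]


def standardise(changes):
    """Standardised response: each change score divided by the SD of the changes."""
    sd = stdev(changes)  # sample SD; an assumption
    return [c / sd for c in changes]


# Hypothetical RMDQ scores, where a higher score means more disability
baseline = [12, 15, 9, 20]
short_term = [8, 15, 11, 10]

changes = change_scores(baseline, short_term)
standardised = standardise(changes)
print(changes)
print(stdev(standardised))
```

Dividing by the SD of the change scores, rather than by the baseline SD, is what gives every standardised outcome an SD of one and makes the scatterplots directly comparable.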
Outcome conversion
All comparisons between instruments were carried out at an individual trial level. A simple linear regression model was fitted to each pair of outcome measures. Denoting the change scores for the two outcome measures by x and y, the simple linear model was:
y = α + βx + ϵ
where the intercept, α, and the coefficient, β, are parameters to be estimated and ϵ is the error term. For the conversion to be meaningful, the standardised change scores have to be correlated and have similar responsiveness; the latter is explained below. 158
Correlation
Correlation was assessed by scatterplots and Pearson’s correlation coefficient, with a correlation coefficient considered to be at least moderately high if it was > 0.5.
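For a single pair of outcome measures, the least-squares estimates of α and β and Pearson’s correlation coefficient can be computed as in the sketch below. The data are hypothetical standardised change scores and the function names are illustrative; the calculation is the textbook one and is not taken from the analysis code used in the project.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares estimates (alpha, beta) for y = alpha + beta * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    return alpha, beta


def pearson_r(x, y):
    """Pearson's correlation coefficient between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sqrt(sxx * syy)


# Hypothetical change scores for two outcome measures in the same participants
x = [4, 0, -2, 10, 3]
y = [3, 1, -1, 6, 2]

alpha, beta = fit_line(x, y)
r = pearson_r(x, y)
print(alpha, beta, r)
print("at least moderately correlated:", r > 0.5)
```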
Responsiveness
Responsiveness is the ability to detect a change in condition; if a participant’s condition improves or worsens over time then this should be reflected by a change in the participant’s score. If two outcome measures do not have similar responsiveness then combining them in a meta-analysis may introduce heterogeneity that could be falsely attributed to other sources, such as the treatment effect.
Similarity of responsiveness of two outcome measures was examined by categorising the change scores as negative change (change score of < 0), no change (change score = 0) or positive change (change score of > 0), and applying Cohen’s kappa (κ) to these categorisations. 159 We considered κ > 0.4 to indicate sufficiently similar responsiveness. 160 These broad categories were chosen to demonstrate whether or not the outcome measures had similar responsiveness in the most basic sense (improved, worsened or no change). We also planned to examine narrower categorisations in the event that the agreements within these three categories were good (κ > 0.4). However, as there was no standard on the levels of categorisations, a few would be examined.
For it to be acceptable to pool two measures, they needed to meet two criteria: to be at least moderately correlated (correlation > 0.5) and to have at least moderately similar responsiveness (κ > 0.4).
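The responsiveness check and the two pooling criteria can be sketched as follows. The three categories and the thresholds are those stated above; the function names and the sample change scores are hypothetical, and the kappa formula is the standard unweighted one.

```python
from collections import Counter


def categorise(change):
    """Classify a change score as positive, negative or no change."""
    if change > 0:
        return "positive"
    if change < 0:
        return "negative"
    return "none"


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two sets of labels on the same participants.

    Undefined (division by zero) if chance agreement equals 1, for example
    when both measures place every participant in the same single category.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


def poolable(correlation, kappa):
    """Both criteria must hold: correlation > 0.5 and kappa > 0.4."""
    return correlation > 0.5 and kappa > 0.4


# Hypothetical change scores on two measures for the same four participants
changes_a = [1, -1, 1, -1]
changes_b = [1, 1, -1, -1]
kappa = cohen_kappa([categorise(c) for c in changes_a],
                    [categorise(c) for c in changes_b])
print(kappa)
```

In this toy example the two measures agree on half of the participants, which is exactly what chance would produce, so kappa is zero.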
Results
Eleven trials31,33,50,76,101,103,104,107,131,132,134 (n = 6089) and seven instruments were included in these analyses (Table 12). There were 21 within-trial pairwise comparisons between two outcomes. Figures 12–16 show scatterplots of standardised change scores for each such pair of outcome measures. See Appendix 8 for scatterplots of the raw change scores. It is clear from these plots that the outcomes were positively correlated. Note also that the standardised change scores were widely scattered around the reference line, suggesting that there was a lack of agreement between the outcomes.
Trial | n | Outcome measures | ||
---|---|---|---|---|
BeST33 | 426 | RMDQ | CPG | PCSa |
Brinkhaus101 | 281 | PCS | FFbHR | PDI |
Haake132 | 1110 | CPG | FFbHR | PCS |
Hancock131 | 235 | RMDQ | PSFS | |
HullExPro76 | 203 | RMDQ | PCS | |
Macedo134 | 158 | RMDQ | PCS | PSFS |
Pengel103 | 232 | RMDQ | PSFS | |
UK BEAM31 | 885 | RMDQ | CPG | PCS |
VKBIA104 | 227 | RMDQ | CPG | |
Witt50 | 2229 | PCS | FFbHR | |
YACBAC107 | 206 | PCS | ODI | |
The correlations between outcomes ranged from 0.21 to 0.70, implying that the linear associations between them ranged from weak to moderately strong (Table 13). Three trials50,101,132 had both SF-12/36 PCS and FFbHR data, and their correlations were very similar, about 0.58. Another three trials31,33,132 had both SF-12/36 PCS and CPG, and the correlations were reasonably similar, ranging from 0.41 to 0.56, and four trials31,33,76,134 had both an SF-12/36 PCS and an RMDQ score with range 0.38–0.52, again similar. However, correlations between other outcomes were quite wide ranging: between CPG and RMDQ scores (m = 3 trials;31,33,104 range 0.21–0.47) and between PSFS and RMDQ scores (m = 3;103,131,134 range 0.40–0.70).
Outcome measure 1 | Outcome measure 2 | Trial | Pearson’s correlation coefficient | Cohen’s kappa |
---|---|---|---|---|
CPG | RMDQ | BeST33 | 0.44 | 0.22 |
CPG | RMDQ | UK BEAM31 | 0.47 | 0.27 |
CPG | RMDQ | VKBIA104 | 0.21 | 0.12 |
CPG | FFbHR | Haake132 | 0.48 | 0.25 |
PCSa | RMDQ | BeST33 | 0.38 | 0.17 |
PCS | RMDQ | HullExPro76 | 0.45 | 0.29 |
PCS | RMDQ | Macedo134 | 0.52 | 0.27 |
PCS | RMDQ | UK BEAM31 | 0.51 | 0.33 |
PCS | CPG | BeST33 | 0.41 | 0.27 |
PCS | CPG | Haake132 | 0.49 | 0.27 |
PCS | CPG | UK BEAM31 | 0.56 | 0.31 |
PCS | FFbHR | Brinkhaus101 | 0.59 | 0.30 |
PCS | FFbHR | Haake132 | 0.58 | 0.29 |
PCS | FFbHR | Witt50 | 0.59 | 0.27 |
PCS | PSFS | Macedo134 | 0.36 | 0.17 |
PCS | ODI | YACBAC107 | 0.60 | 0.28 |
RMDQ | PSFS | Hancock131 | 0.70 | 0.38 |
RMDQ | PSFS | Macedo134 | 0.40 | 0.26 |
RMDQ | PSFS | Pengel103 | 0.53 | 0.18 |
PDI | FFbHR | Brinkhaus101 | 0.55 | 0.32 |
PDI | PCS | Brinkhaus101 | 0.54 | 0.31 |
Cohen’s kappa was < 0.4 for all 21 comparisons. Some values were similar between trials, namely for PCS and FFbHR (range 0.27–0.30) and for PCS and CPG (range 0.27–0.31). However, the level of agreement was never more than fair.160 As Cohen’s kappa was not > 0.4 for any comparison, narrower categorisations were not investigated.
There were no pairs of outcomes that satisfied both criteria of at least moderate correlation (correlation > 0.5) and at least moderately similar responsiveness (Cohen’s kappa > 0.4). Therefore, it was not meaningful to convert any outcome to another one.
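As a cross-check, the two thresholds can be applied directly to the values reported in Table 13. The sketch below only re-reads the published figures; the variable names are ours and the citation numbers are omitted from the trial names.

```python
# (outcome 1, outcome 2, trial, Pearson's r, Cohen's kappa) from Table 13
TABLE_13 = [
    ("CPG", "RMDQ", "BeST", 0.44, 0.22),
    ("CPG", "RMDQ", "UK BEAM", 0.47, 0.27),
    ("CPG", "RMDQ", "VKBIA", 0.21, 0.12),
    ("CPG", "FFbHR", "Haake", 0.48, 0.25),
    ("PCS", "RMDQ", "BeST", 0.38, 0.17),
    ("PCS", "RMDQ", "HullExPro", 0.45, 0.29),
    ("PCS", "RMDQ", "Macedo", 0.52, 0.27),
    ("PCS", "RMDQ", "UK BEAM", 0.51, 0.33),
    ("PCS", "CPG", "BeST", 0.41, 0.27),
    ("PCS", "CPG", "Haake", 0.49, 0.27),
    ("PCS", "CPG", "UK BEAM", 0.56, 0.31),
    ("PCS", "FFbHR", "Brinkhaus", 0.59, 0.30),
    ("PCS", "FFbHR", "Haake", 0.58, 0.29),
    ("PCS", "FFbHR", "Witt", 0.59, 0.27),
    ("PCS", "PSFS", "Macedo", 0.36, 0.17),
    ("PCS", "ODI", "YACBAC", 0.60, 0.28),
    ("RMDQ", "PSFS", "Hancock", 0.70, 0.38),
    ("RMDQ", "PSFS", "Macedo", 0.40, 0.26),
    ("RMDQ", "PSFS", "Pengel", 0.53, 0.18),
    ("PDI", "FFbHR", "Brinkhaus", 0.55, 0.32),
    ("PDI", "PCS", "Brinkhaus", 0.54, 0.31),
]

R_MIN, KAPPA_MIN = 0.5, 0.4  # thresholds stated in the methods

meets_r = [row for row in TABLE_13 if row[3] > R_MIN]
meets_kappa = [row for row in TABLE_13 if row[4] > KAPPA_MIN]
meets_both = [row for row in meets_r if row[4] > KAPPA_MIN]

print(len(meets_r), len(meets_kappa), len(meets_both))
```

Several comparisons pass the correlation threshold, but none passes the kappa threshold, so no pair passes both.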
Conclusion
In view of the weak correlations and the lack of similar responsiveness, we do not recommend mapping any of the physical disability outcome measures considered in this investigation on to another.
For each of our subsequent analyses we have pooled data only where the same participant-reported outcomes are available from multiple trials. The one exception is that the SF-12 and SF-36 are explicitly designed to have similar measurement properties when converted into their physical and mental component scores. We have therefore pooled the mental component score (MCS) and PCS from studies using SF-12 or SF-36.
Chapter 6 Preliminary statistical analyses and results
Background
In this chapter we present the results of preliminary statistical analyses performed on the individual participant data, specifically the analysis of covariance (ANCOVA) comparing all treatments with all controls (usual care plus sham) to identify individual potential moderators to take forward into our main analyses. The methodological development work to identify combinations of baseline characteristics that moderate treatment effect is presented in later chapters (see Chapters 7–10). We do not, in this preliminary analysis, seek to define subgroups using multiple parameters.
Statistical analysis plan
In accordance with the standard operating procedure in the Warwick Clinical Trials Unit, a detailed statistical analysis plan was written by the study’s statistician (SWH) and health economist (JJ). The plan was subsequently reviewed and approved by the study team and members of the repository oversight committee (see Appendix 9). An overview of the plan is given in the following sections.
Definitions
Treatment arms
Treatments are broadly classified into intervention, sham control and control. The intervention grouping may be further classified into three broad categories, namely active physical, passive physical and psychological. Exercise and graded activity are considered as active physical; acupuncture, manual therapy and individual physiotherapy are considered as passive physical; and advice or education, and a cognitive–behavioural approach or cognitive–behavioural therapy, are considered as psychological interventions. Sham control may be sham acupuncture, sham electrotherapy, mock TENS or sham advice or education. The control arm is the non-active usual care, namely GP treatment or a waiting list control. Sham acupuncture may be a special case of a sham intervention. If it is the sensation of needling that is the active ingredient of acupuncture then the location of any needling, whether or not skin penetration takes place, or depth of any needling might have little effect on outcomes seen. Thus, sham acupuncture might be considered to be a ‘true’ intervention and is included in our analyses of passive physical treatments.
Follow-up time point
The follow-up times are classified into short term, mid term and long term. A short-term follow-up is a measurement taken between 2 and 3 months post randomisation or entry to the trial. A mid-term follow-up is a measurement taken at 6 months post randomisation or entry to the trial. A long-term follow-up is a measurement taken at 12 months post randomisation or entry to the trial. Data collected at immediate follow-up (< 2 months post randomisation or entry to the trial) and beyond the long-term follow-up (> 12 months post randomisation or entry to the trial) were also entered into the repository but were not considered for analysis.
Selection of follow-up time points
Some RCTs collected weekly data. For the short-term follow-up, data from the 3-month follow-up were considered for analysis. If these data were missing (non-response), data from the week nearest to the 3-month follow-up were used, provided that the time point fell between 2 and 3 months post randomisation or entry to the trial.
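The selection rule for weekly data can be illustrated as follows. This is a sketch under stated assumptions: 3 months is taken as week 13 and the 2- to 3-month window as weeks 8 to 13, and the function name and data layout are hypothetical.

```python
def select_short_term(weekly_scores, target_week=13, window=(8, 13)):
    """Pick the short-term observation from weekly data.

    weekly_scores maps week number -> score, with None for non-response.
    Returns (week, score) for the non-missing observation closest to
    target_week inside the window, or None if there is none.
    """
    first_week, last_week = window  # assumed week equivalents of 2-3 months
    candidates = [
        (abs(week - target_week), week)
        for week, score in weekly_scores.items()
        if score is not None and first_week <= week <= last_week
    ]
    if not candidates:
        return None
    _, chosen_week = min(candidates)
    return chosen_week, weekly_scores[chosen_week]
```

For example, a participant with a missing week 13 score but observed scores at weeks 10 and 12 would contribute the week 12 score.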
Outcome variables
Clinical outcomes
The response for each of the outcome variables of interest is presented as change score and standardised change score. The change score is the change from baseline to the follow-up time point. A positive change score is interpreted as an improvement.
Health-economic outcomes
For the initial economic analysis presented here, the outcome of QALYs was used. Estimated QALY gains from treatment were compared with the mean estimated costs of treatment to assess cost-effectiveness. Individual participant data on resource use or costs were available for some trials but, after allowing for availability of the European Quality of Life-5 Dimensions (EQ-5D) or SF-12/36 scores (required to calculate QALYs) and of a common set of moderator variables, no two studies provided both individual-level cost and QALY data for a common comparison. We were, therefore, unable to generate pooled cost/QALY data.
The heterogeneous nature of the trials posed some challenges for the economic analysis. To pool the data across trials, a consistent health outcome measure over time was required. The QALY is a standardised measure of health outcomes used for economic analysis, which summarises patients’ profiles of health-related quality of life (‘utility’) over time. The QALY score for each patient was estimated using the EQ-5D, which is a generic measure of quality of life, suitable for calculation of QALYs. The EQ-5D index score, calculated using the UK Tariff, measures an individual’s health state at a single time point. 161 EQ-5D index scores can be integrated over time to estimate QALYs. QALYs were calculated for trial participants over 1 year of follow-up, using the area under the curve (AUC) method. For each participant the AUC was calculated from the EQ-5D index scores that were captured at each follow-up point for that participant, from baseline to 52 weeks (with linear interpolation between observations). Trials with more follow-up points arguably have greater resolution and therefore the QALY estimated will be more precise. However, in all regression analyses the differences between trials were controlled for, so this potential issue was mitigated.
For one trial,132 EQ-5D data were not available, but full data on patient responses to the SF-12 instrument were recorded. The SF-12 is a generic measure of health, similar to the EQ-5D, and a number of methods to estimate a utility index score from the SF-12 instrument have been published. To ensure that the index scores provided by the SF-12 are comparable to those obtained for the other trials using the EQ-5D, a mapping approach was applied. This mapped the SF-12 item responses on to the EQ-5D index scores. The specific mapping approach applied was based on the work of Gray et al.;162 in this study, a multinomial logit model was used to estimate the probability that a particular EQ-5D level would be chosen, based on the participants’ SF-12 responses. The authors have made available an algorithm applying this method as an add-on programme in Stata 12 (StataCorp LP, College Station, TX, USA). This mapping approach was compared with other published methods by Rowen et al.,163 who found similar levels of performance across the alternative approaches. In our analysis, the mapped SF-12 index scores were integrated over time in the same manner as the EQ-5D scores to estimate an individual-level QALY. Use of SF-6D (Short Form questionnaire-6 Dimensions) to EQ-5D mapping might have introduced additional errors or bias, although the method was well developed and has been subject to validation. The potential for bias should also have been mitigated by the method of analysis, with a mixed model accounting for differences between trials. Furthermore, the outcomes of interest were the treatment subgroup coefficients rather than the magnitude of main effects per se.
One trial132 had data only up to 26 weeks. For this trial,132 it was assumed that the quality-of-life score measured at 26 weeks persisted up to 52 weeks, which allowed QALYs over 1 year to be estimated in the same way as for the other trials. This assumption might be seen as a limitation but, again, the potential for bias from this source should have been reduced through the inclusion of the trial as a random effect and the focus on treatment–subgroup interactions.
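The area under the curve calculation, including the carry-forward assumption for a trial with data only to 26 weeks, can be sketched as below. The function name and arguments are hypothetical; linear interpolation between observations is implemented with the trapezoidal rule, and utilities are index scores on the usual scale where 1 is full health.

```python
WEEKS_PER_YEAR = 52


def qaly_auc(weeks, utilities, horizon_weeks=52, carry_forward=False):
    """QALYs over the horizon by trapezoidal area under the utility curve.

    weeks and utilities are parallel sequences of observation times (in
    weeks from baseline) and utility index scores. If the last observation
    falls before the horizon, it is carried forward to the horizon only
    when carry_forward is True; otherwise None is returned.
    """
    points = sorted(
        (w, u) for w, u in zip(weeks, utilities) if w <= horizon_weeks
    )
    if not points or points[0][0] != 0:
        raise ValueError("a baseline (week 0) utility is required")
    if points[-1][0] < horizon_weeks:
        if not carry_forward:
            return None
        # Assume the last observed utility persists up to the horizon.
        points.append((horizon_weeks, points[-1][1]))
    area = 0.0
    for (w0, u0), (w1, u1) in zip(points, points[1:]):
        area += (u0 + u1) / 2.0 * (w1 - w0)  # trapezoid between visits
    return area / WEEKS_PER_YEAR
```

As a worked case, utilities of 0.6 at baseline and 0.8 at 26 weeks, carried forward to 52 weeks, give (0.7 × 26 + 0.8 × 26) / 52 = 0.75 QALYs.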
It is important to adjust for any baseline differences in EQ-5D scores when comparing QALY estimates between treatment groups. There are two ways of making this adjustment: by calculating a ‘change from baseline’ QALY at the individual level or adding the baseline EQ-5D score as a covariate in regression analysis. The latter approach has been used in the analyses presented here, as it is recommended as more efficient. 164
Selection of instrument
Clinical outcomes are classified broadly into physical disability, pain, psychological distress and non-utility quality of life. Nine instruments in the repository have been identified as measurements for physical disability and four instruments for pain (see Appendix 9). No single instrument was used by all RCTs to measure physical disability and hence we explored how to map some of these instruments to one single outcome. The mapping methodology is described in Chapter 5. We concluded that it was not possible to map to one single outcome. Therefore, analyses were undertaken on common outcomes only.
Most of the RCTs in the repository had asked the participant to rate or mark on a numerical rating scale or a visual analogue scale (VAS) that described either their average or worst pain at the present time or over defined weeks or months. This item was presented either as a single stand-alone instrument or as an item that was part of a collective pain measurement; for example, in the McGill Pain Questionnaire a VAS was presented as a line that anchors with ‘no pain’ at one end and ‘worst possible pain’ at the other end. 165 For the analyses of average pain, one of the following instruments from each trial, where available, was chosen (in descending order):
- individual VAS on average pain today
- average pain over the past 1 week
- average pain over the past 2 weeks, average pain over the past 1 month
- average pain over the past 3 months
- the individual item of the CPG pain intensity score (CPG-PS) that is equivalent to the VAS if it is available150
- the summary score of the CPG-PS otherwise.
Where a numerical rating scale (range 0–10) was used, scores were rescaled to a range of 0–100 to match the analogue scale.
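The instrument hierarchy and the rescaling step can be expressed as a short sketch. The instrument keys are hypothetical labels for the items listed above, and we assume that where two recall periods share one list entry (2 weeks and 1 month) they are tried in the order written.

```python
# Hypothetical keys, in the descending order of preference listed above.
PAIN_PRIORITY = [
    "vas_average_today",
    "average_past_1_week",
    "average_past_2_weeks",
    "average_past_1_month",
    "average_past_3_months",
    "cpg_ps_vas_item",
    "cpg_ps_summary",
]


def choose_pain_instrument(available):
    """Return the highest-priority instrument a trial collected, or None."""
    for name in PAIN_PRIORITY:
        if name in available:
            return name
    return None


def nrs_to_0_100(score, nrs_max=10):
    """Rescale a 0-to-nrs_max numerical rating to the 0-100 range."""
    if not 0 <= score <= nrs_max:
        raise ValueError("score outside the rating scale range")
    return score * 100 / nrs_max
```

A trial that collected both the CPG-PS summary score and average pain over the past week would therefore contribute the latter, and a rating of 7 on a 0–10 scale becomes 70.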
There are two dimensions of psychological distress that are of interest: depression and anxiety. Six and four instruments have been identified to measure depression and anxiety, respectively (see Appendix 9). Within each instrument there is usually a classification system that is widely used to classify participants into ordinal categories, for example with a minimal, a moderate or a severe level of depression. Thus, all instruments were mapped into a single ordinal categorical variable. Instruments with no threshold guideline to discriminate level of risk or severity were categorised into tertiles to discriminate the low- and high-risk or low- and high-severity groups from the moderate-risk or moderate-severity group. Other psychosocial measures – catastrophising, coping and fear avoidance – were handled in the same manner. In each case, the reference standard for comparison was the tertile with the least favourable score.
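For instruments without published thresholds, the tertile categorisation might look like the following sketch. The cut points come from Python's `statistics.quantiles` with its default method, and scores on a boundary are assigned to the lower group; both choices are assumptions, as the source does not specify them. The labels are ours.

```python
from statistics import quantiles


def tertile_labels(scores):
    """Label each score 'low', 'moderate' or 'high' by sample tertile."""
    lower_cut, upper_cut = quantiles(scores, n=3)  # two tertile cut points
    labels = []
    for score in scores:
        if score <= lower_cut:
            labels.append("low")
        elif score <= upper_cut:
            labels.append("moderate")
        else:
            labels.append("high")
    return labels
```

Whether 'low' or 'high' is the least favourable group, and hence the reference category, depends on the direction of the instrument.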
Data sets
Individual participant data without treatment assignment were excluded from the repository. This applies to participants whose records were present in a trial data set but for whom no treatment allocation was available; as we could not allocate these participants to a treatment group, they were excluded.
Clinical analysis
The main analysis, which is to confirm proof of concept, was based on complete case analysis. Missing data due to non-responders or withdrawals were not imputed. Missing items were imputed and the method for imputation is as described in the statistical analysis plan (see Appendix 9). When available, individual items were used to obtain the composite score for each measurement, otherwise the composite score provided to the repository was used for all analyses.
For the overall exploration of moderation by single variables, the sham control was grouped with non-active usual care. All direct analyses were based on pairwise comparisons, that is, only two treatment arms were compared each time. For the overall analysis, intervention was compared against control/placebo arm, where intervention was any therapist-delivered intervention given either singly or in combination with another intervention, and the control/placebo arm was either the non-active usual care control or sham treatment. Other pairwise comparisons considered were active physical against non-active usual care control, passive physical against non-active usual care control, psychological against non-active usual care control, and sham against non-active usual care control. For all of these pairwise comparisons we kept sham and usual care controls separate, as this reflects the clinical choice more accurately than adding an intervention to a sham control.
Direct analyses were performed if the individual participant data were from at least two trials, that is, no direct analysis was performed if the individual participant data were from one single trial.
Health-economic analysis
The health-economic analysis focused on the QALY score as the outcome measure. QALYs were calculated for individuals, using the estimated EQ-5D index scores or a mapped SF-12 outcome at multiple follow-up points. This means that missing data can be more of a problem than for outcomes measured at a single time point. If data are missing at any follow-up point, the QALY cannot be estimated and the entire observation is lost. An observation was also lost if data on the moderator at baseline were missing. All analyses were based on complete cases only; therefore, caution must be taken in interpretation of the results, as the missing data may be a source of bias.
To simplify the analysis, it was split into four overarching comparisons: all interventions collectively against non-active usual care, active physical interventions against non-active usual care, passive physical interventions against non-active usual care and active physical against passive physical. For each analysis, the treatment arms for the included trials were pooled appropriately by the type of treatment and used collectively as the intervention group for each of the respective analyses. Seven trials in total were included in the analysis. The first three analyses were limited to a maximum of six trials, namely those that included non-active usual care as the control arm and reported EQ-5D outcomes or a mapped SF-12 outcome. The comparison between active physical and passive physical allowed the inclusion of one additional trial. Data for comparisons against a sham treatment arm were excluded from this analysis, as these are not plausible choices for a health-economic analysis.
Methods
Descriptive summary
The baseline data were summarised by treatment arm (non-active usual care, active physical, passive physical, psychological, combination or sham control). The continuous data were summarised as mean and SD, and the categorical data were summarised as the number of participants and percentage.
One-step meta-analysis
In a one-step meta-analysis, individual participant data from all studies are modelled simultaneously in a single model that adjusts for the study effect.166 This is analogous to the analysis of a multicentre study, with trials taking the place of centres. The one-step meta-analysis was performed to compare efficacy between treatment arms. A mixed-effects model was used, in which the intercept and the interaction between treatment arm and trial were modelled as random effects, and treatment arm as a fixed effect.
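One way to write the model described above is sketched below. The notation is ours rather than the report's: y_ij is the outcome for participant i in trial j and T_ij is 1 for the intervention arm and 0 for the comparator. The normal distributions and the unstructured covariance of the random effects are assumptions, as the source does not state them.

```latex
y_{ij} = \beta_0 + \beta_1 T_{ij} + u_{0j} + u_{1j} T_{ij} + \varepsilon_{ij},
\qquad
(u_{0j}, u_{1j})^{\top} \sim N(\mathbf{0}, \Sigma),
\quad
\varepsilon_{ij} \sim N(0, \sigma^{2})
```

Here \beta_1 is the pooled treatment effect (fixed), u_{0j} is the random intercept for trial j and u_{1j} is the random treatment-by-trial interaction.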
Moderator identification
Systematic review
We identified potential moderators from the literature via a systematic review. Details of this review and the outcomes are presented in Chapter 2.
Analysis of covariance
Analysis of covariance was performed to identify any covariate that moderates outcomes. As before, the one-step meta-analysis approach was used, that is, all of the available individual participant data were pooled into a single mixed-effects model for which the intercept and the interaction between treatment and trial were modelled as random effects. The treatment arm (intervention against control), covariate and the interaction between treatment and covariate were modelled as fixed effects. For analysis with QALYs as the outcome measure, the baseline EQ-5D score was also included as a fixed effect in the mixed-effects model described above.
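The moderator model described above can be written as follows. This is a sketch in our own notation, not taken from the report: y_ij is the outcome for participant i in trial j, T_ij is the treatment indicator, x_ij is the candidate moderator at baseline and b_ij is the baseline EQ-5D score, which enters only when the outcome is QALYs. The distributional form of the random effects is assumed.

```latex
y_{ij} = \beta_0 + \beta_1 T_{ij} + \beta_2 x_{ij} + \beta_3 T_{ij} x_{ij}
       + \underbrace{\beta_4 b_{ij}}_{\text{QALY outcome only}}
       + u_{0j} + u_{1j} T_{ij} + \varepsilon_{ij}
```

The coefficient of interest is \beta_3, the treatment-by-covariate interaction: a non-zero value indicates that the treatment effect differs according to the baseline covariate.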
As stated in the statistical analysis plan, covariates were declared weakly statistically significant at the two-sided 20% level and statistically significant at the two-sided 5% level. This ensured that covariates that approach the conventional statistical significance at 5% level would not be missed for the final clinical and health-economic prediction rule analyses. All moderators identified from the systematic review and ANCOVA were considered for the clinical and health-economic prediction rule analyses. The prediction rule analyses were to determine which participant characteristics at baseline were optimal to different treatments and associated with the end points of interest, namely disability or pain, or cost-effective treatments for LBP. The methodology of identifying a combination of characteristics is presented in detail in Chapters 7–10.
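The two significance levels amount to a simple screening rule for each interaction test. The sketch below is illustrative only: the function name and labels are ours, and we assume strict inequalities at the thresholds.

```python
def classify_moderator(p_value):
    """Classify a two-sided interaction p-value using the 5% and 20% levels."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < 0.05:
        return "statistically significant"
    if p_value < 0.20:
        return "weakly statistically significant"
    return "not significant"
```

Under this rule a covariate with an interaction p-value of 0.12 would still be carried forward to the prediction rule analyses, whereas one with a p-value of 0.35 would not, unless it had been identified by the systematic review.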
As seen in the results from the one-step meta-analysis, the estimated efficacy between intervention and control/placebo arm for most of the outcomes at mid-term and long term were not statistically significant. Therefore, the ANCOVA was not performed for the mid- and long-term outcomes. In addition, the short-term outcomes were those in which the maximum clinical effects were observed between intervention and control/placebo. This is where the largest differential subgroup effects are likely to be seen. In the absence of substantial short-term effect moderation there is little point in exploring mid- and long-term effect moderation.
The list of moderators assessed for each of the short-term clinical outcomes and for QALYs is presented. As not all trials have the same moderators, the sample size varied depending on which moderator was being assessed and for which outcome.
Results
Descriptive
Table 14 shows the response rates for each of the outcomes of interest by treatment group at different time points. Most trials collected data 3 months post randomisation or entry to the trial, and this is recorded as 13 weeks; one RCT specified in its protocol that data would be collected at 12 weeks, and this was recorded as per protocol.
Outcomes | Follow-up (weeks) | Active physical (m = 7; n = 914) | Passive physical (m = 12; n = 3270) | Psychological (m = 7; n = 1120) | Combination (m = 3; n = 451) | Sham (m = 6; n = 688) | Control (m = 10; n = 2885) | All (m = 19; n = 9326) |
---|---|---|---|---|---|---|---|---|
Physical disability | ||||||||
CPG-DSa | 0 | m = 1; n = 284 | m = 2; n = 721 | m = 2; n = 572 | m = 1; n = 312 | m = 1; n = 387 | m = 4; n = 1052 | m = 5; n = 3328 |
CPG-DS | 4 | m = 1; n = 228 | m = 1; n = 315 | – | m = 1; n = 280 | – | m = 1; n = 262 | m = 4; n = 1085 |
CPG-DS | 8 | – | – | m = 1; n = 109 | – | – | m = 1; n = 120 | m = 2; n = 229 |
CPG-DS | 13 | m = 1; n = 214 | m = 2; n = 653 | m = 1; n = 345 | m = 1; n = 252 | m = 1; n = 376 | m = 3; n = 797 | m = 5; n = 2637 |
CPG-DS | 26 | – | m = 1; n = 377 | m = 2; n = 491 | – | m = 1; n = 376 | m = 3; n = 656 | m = 2; n = 1900 |
CPG-DS | 52 | m = 1; n = 212 | m = 1; n = 267 | m = 2; n = 473 | m = 1; n = 254 | – | m = 3; n = 530 | m = 5; n = 1736 |
CPG-DS | 104 | – | – | m = 1; n = 94 | – | – | m = 1; n = 92 | m = 2; n = 186 |
FFbHR | 0 | – | m = 3; n = 1927 | – | – | m = 2; n = 460 | m = 3; n = 1789 | m = 3; n = 4176 |
FFbHR | 6 | – | m = 1; n = 370 | – | – | m = 1; n = 375 | m = 1; n = 362 | m = 1; n = 1107 |
FFbHR | 8 | – | m = 1; n = 140 | – | – | m = 1; n = 70 | m = 1; n = 74 | m = 1; n = 284 |
FFbHR | 13 | – | m = 2; n = 1723 | – | – | m = 1; n = 376 | m = 2; n = 1605 | m = 2; n = 3704 |
FFbHR | 26 | – | m = 3; n = 1825 | – | – | m = 2; n = 446 | m = 3; n = 1620 | m = 3; n = 3891 |
FFbHR | 52 | – | m = 1; n = 137 | – | – | m = 1; n = 68 | m = 1; n = 70 | m = 1; n = 275 |
ODI | 0 | – | m = 1; n = 159 | – | – | – | m = 1; n = 80 | m = 1; n = 239 |
ODI | 13 | – | m = 1; n = 146 | – | – | – | m = 1; n = 71 | m = 1; n = 217 |
ODI | 52 | – | m = 1; n = 136 | – | – | – | m = 1; n = 57 | m = 1; n = 193 |
ODI | 104 | – | m = 1; n = 114 | – | – | – | m = 1; n = 50 | m = 1; n = 164 |
PDI | 0 | – | m = 1; n = 146 | – | – | m = 1; n = 73 | m = 1; n = 79 | m = 1; n = 298 |
PDI | 8 | – | m = 1; n = 140 | – | – | m = 1; n = 70 | m = 1; n = 74 | m = 1; n = 284 |
PDI | 26 | – | m = 1; n = 138 | – | – | m = 1; n = 70 | m = 1; n = 73 | m = 1; n = 281 |
PDI | 52 | – | m = 1; n = 137 | – | – | m = 1; n = 66 | m = 1; n = 69 | m = 1; n = 272 |
PSFS | 0 | m = 2; n = 150 | m = 1; n = 119 | m = 2; n = 148 | m = 1; n = 62 | m = 2; n = 188 | – | m = 3; n = 667 |
PSFS | 1 | – | m = 1; n = 119 | – | – | m = 1; n = 118 | – | m = 2; n = 237 |
PSFS | 2 | – | m = 1; n = 119 | – | – | m = 1; n = 119 | – | m = 1; n = 238 |
PSFS | 4 | – | m = 1; n = 118 | – | – | m = 1; n = 117 | – | m = 1; n = 235 |
PSFS | 6 | m = 1; n = 58 | – | m = 1; n = 54 | m = 1; n = 57 | m = 1; n = 59 | – | m = 1; n = 228 |
PSFS | 8 | m = 1; n = 82 | – | m = 1; n = 76 | – | – | – | m = 1; n = 158 |
PSFS | 12 | m = 1; n = 57 | – | m = 1; n = 56 | m = 1; n = 58 | m = 1; n = 61 | – | m = 1; n = 232 |
PSFS | 13 | – | m = 1; n = 118 | – | – | m = 1; n = 117 | – | m = 1; n = 235 |
PSFS | 26 | m = 1; n = 81 | – | m = 1; n = 74 | – | – | – | m = 1; n = 155 |
PSFS | 52 | m = 2; n = 136 | – | m = 2; n = 132 | m = 1; n = 56 | m = 1; n = 56 | – | m = 2; n = 380 |
RMDQ | 0 | m = 7; n = 907 | m = 7; n = 1087 | m = 7; n = 1120 | m = 3; n = 446 | m = 3; n = 212 | m = 6; n = 938 | m = 14; n = 4710 |
RMDQ | 1 | – | m = 1; n = 119 | – | – | m = 1; n = 118 | – | m = 1; n = 237 |
RMDQ | 2 | – | m = 2; n = 119 | – | – | m = 1; n = 118 | – | m = 1; n = 237 |
RMDQ | 4 | m = 1; n = 234 | m = 2; n = 436 | – | m = 1; n = 283 | m = 1; n = 117 | m = 1; n = 264 | m = 2; n = 1334 |
RMDQ | 6 | m = 2; n = 144 | m = 1; n = 23 | m = 1; n = 55 | m = 1; n = 58 | m = 2; n = 81 | m = 1; n = 94 | m = 3; n = 455 |
RMDQ | 8 | m = 1; n = 82 | – | m = 2; n = 186 | – | – | m = 1; n = 120 | m = 2; n = 388 |
RMDQ | 10 | m = 1; n = 107 | – | – | m = 1; n = 55 | – | m = 1; n = 50 | m = 1; n = 212 |
RMDQ | 12 | m = 1; n = 58 | – | m = 1; n = 58 | m = 1; n = 59 | m = 1; n = 61 | – | m = 1; n = 236 |
RMDQ | 13 | m = 3; n = 433 | m = 7; n = 963 | m = 4; n = 670 | m = 1; n = 255 | m = 2; n = 135 | m = 3; n = 537 | m = 9; n = 2993 |
RMDQ | 26 | m = 4; n = 371 | m = 2; n = 262 | m = 5; n = 706 | m = 1; n = 53 | – | m = 5; n = 474 | m = 8; n = 1866 |
RMDQ | 52 | m = 7; n = 722 | m = 5; n = 771 | m = 7; n = 903 | m = 3; n = 365 | m = 1; n = 56 | m = 6; n = 690 | m = 12; n = 3507 |
RMDQ | 104 | m = 1; n = 83 | m = 1; n = 95 | m = 1; n = 94 | – | – | m = 1; n = 92 | m = 2; n = 364 |
Troublesomeness | 0 | m = 2; n = 344 | m = 3; n = 556 | m = 1; n = 426 | m = 1; n = 312 | – | m = 3; n = 604 | m = 4; n = 2242 |
Troublesomeness | 4 | m = 1; n = 225 | m = 1; n = 313 | – | m = 1; n = 279 | – | m = 1; n = 262 | m = 1; n = 1079 |
Troublesomeness | 13 | m = 2; n = 280 | m = 3; n = 494 | – | m = 1; n = 253 | – | m = 2; n = 318 | m = 3; n = 1345 |
Troublesomeness | 52 | m = 2; n = 302 | m = 3; n = 493 | – | m = 1; n = 252 | – | m = 2; n = 297 | m = 8; n = 1344 |
Troublesomeness | 104 | – | m = 1; n = 113 | – | – | – | m = 1; n = 50 | m = 3; n = 162 |
Pain | ||||||||
CPG-PSb | 0 | m = 1; n = 283 | m = 2; n = 721 | m = 2; n = 582 | m = 1; n = 312 | m = 1; n = 387 | m = 4; n = 1054 | m = 4; n = 3339 |
CPG-PS | 4 | m = 1; n = 228 | m = 1; n = 316 | – | m = 1; n = 281 | – | m = 1; n = 261 | m = 1; n = 1086 |
CPG-PS | 6 | – | m = 1; n = 370 | – | – | m = 1; n = 375 | m = 1; n = 362 | m = 1; n = 1107 |
CPG-PS | 8 | – | – | m = 1; n = 110 | – | – | m = 1; n = 120 | m = 1; n = 230 |
CPG-PS | 13 | m = 1; n = 214 | m = 2; n = 653 | m = 1; n = 354 | m = 1; n = 252 | m = 1; n = 376 | m = 3; n = 799 | m = 3; n = 2648 |
CPG-PS | 26 | – | m = 1; n = 377 | m = 2; n = 497 | – | m = 1; n = 376 | m = 3; n = 661 | m = 3; n = 1911 |
CPG-PS | 52 | m = 1; n = 211 | m = 1; n = 269 | m = 2; n = 491 | m = 1; n = 253 | – | m = 4; n = 536 | m = 3; n = 1760 |
CPG-PS | 104 | – | – | m = 1; n = 94 | – | – | m = 1; n = 92 | m = 1; n = 186 |
VAS | ||||||||
Average pain today | 0 | m = 2; n = 253 | m = 3; n = 461 | m = 1; n = 196 | m = 1; n = 61 | m = 1; n = 120 | m = 1; n = 51 | m = 3; n = 1142 |
Average pain today | 1 | – | m = 1; n = 119 | – | – | m = 1; n = 119 | – | m = 1; n = 238 |
Average pain today | 2 | – | m = 1; n = 119 | – | – | m = 1; n = 119 | – | m = 1; n = 238 |
Average pain today | 3 | – | m = 1; n = 118 | – | – | m = 1; n = 118 | – | m = 1; n = 236 |
Average pain today | 4 | m = 1; n = 83 | m = 1; n = 118 | m = 1; n = 80 | – | m = 1; n = 118 | – | m = 2; n = 399 |
Average pain today | 6 | – | m = 1; n = 36 | – | – | m = 1; n = 38 | – | m = 1; n = 74 |
Average pain today | 8 | m = 1; n = 81 | m = 1; n = 24 | m = 1; n = 79 | – | m = 1; n = 23 | – | m = 2; n = 207 |
Average pain today | 10 | m = 1; n = 107 | m = 1; n = 16 | – | m = 1; n = 55 | m = 1; n = 18 | m = 1; n = 49 | m = 2; n = 245 |
Average pain today | 11 | – | m = 1; n = 15 | – | – | m = 1; n = 17 | – | m = 1; n = 32 |
Average pain today | 12 | – | m = 1; n = 15 | – | – | m = 1; n = 17 | – | m = 1; n = 32 |
Average pain today | 13 | m = 1; n = 81 | m = 1; n = 153 | m = 2; n = 231 | – | – | – | m = 1; n = 465 |
Average pain today | 17 | m = 1; n = 79 | – | m = 1; n = 75 | – | – | – | m = 1; n = 154 |
Average pain today | 21 | m = 1; n = 81 | – | m = 1; n = 76 | – | – | – | m = 1; n = 157 |
Average pain today | 26 | m = 2; n = 186 | – | m = 1; n = 75 | m = 1; n = 53 | – | – | m = 2; n = 314 |
Average pain today | 30 | m = 1; n = 79 | – | m = 1; n = 72 | – | – | – | m = 1; n = 151 |
Average pain today | 34 | m = 1; n = 81 | – | m = 1; n = 73 | – | – | – | m = 1; n = 154 |
Average pain today | 39 | m = 1; n = 80 | – | m = 1; n = 74 | – | – | – | m = 1; n = 154 |
Average pain today | 43 | m = 1; n = 78 | – | m = 1; n = 74 | – | – | – | m = 1; n = 152 |
Average pain today | 47 | m = 1; n = 76 | – | m = 1; n = 71 | – | – | – | m = 1; n = 147 |
Average pain today | 52 | m = 2; n = 183 | m = 1; n = 164 | m = 2; n = 238 | m = 1; n = 53 | – | – | m = 6; n = 638 |
Average pain over past 1 week | 0 | m = 2; n = 150 | m = 2; n = 235 | m = 3; n = 349 | m = 1; n = 63 | m = 2; n = 84 | – | m = 4; n = 881 |
Average pain over past 1 week | 1 | – | m = 1; n = 235 | – | – | m = 1; n = 119 | – | m = 1; n = 238 |
Average pain over past 1 week | 2 | – | m = 1; n = 235 | – | – | m = 1; n = 119 | – | m = 1; n = 238 |
Average pain over past 1 week | 3 | – | m = 1; n = 235 | – | – | m = 1; n = 118 | – | m = 1; n = 237 |
Average pain over past 1 week | 4 | m = 1; n = 82 | m = 2; n = 152 | m = 1; n = 80 | – | m = 2; n = 134 | – | m = 3; n = 448 |
Average pain over past 1 week | 6 | m = 1; n = 59 | m = 1; n = 49 | m = 1; n = 55 | m = 1; n = 58 | m = 2; n = 97 | – | m = 2; n = 306 |
Average pain over past 1 week | 8 | m = 1; n = 81 | m = 1; n = 24 | m = 1; n = 79 | – | m = 1; n = 24 | – | m = 2; n = 208 |
Average pain over past 1 week | 10 | – | m = 1; n = 16 | – | – | m = 1; n = 19 | – | m = 1; n = 35 |
Average pain over past 1 week | 11 | – | m = 1; n = 11 | – | – | m = 1; n = 17 | – | m = 1; n = 33 |
Average pain over past 1 week | 12 | m = 1; n = 58 | m = 1; n = 15 | m = 1; n = 58 | m = 1; n = 59 | m = 2; n = 78 | – | m = 2; n = 268 |
Average pain over past 1 week | 13 | m = 1; n = 81 | m = 2; n = 180 | m = 2; n = 231 | – | m = 1; n = 9 | – | m = 3; n = 501 |
Average pain over past 1 week | 17 | m = 1; n = 79 | – | m = 1; n = 75 | – | – | – | m = 1; n = 154 |
Average pain over past 1 week | 21 | m = 1; n = 81 | – | m = 1; n = 76 | – | – | – | m = 1; n = 157 |
Average pain over past 1 week | 26 | m = 1; n = 81 | m = 1; n = 21 | m = 1; n = 75 | – | m = 1; n = 6 | – | m = 2; n = 183 |
Average pain over past 1 week | 30 | m = 1; n = 79 | – | m = 1; n = 72 | – | – | – | m = 1; n = 151 |
Average pain over past 1 week | 34 | m = 1; n = 81 | – | m = 1; n = 73 | – | – | – | m = 1; n = 154 |
Average pain over past 1 week | 39 | m = 1; n = 80 | – | m = 1; n = 74 | – | – | – | m = 1; n = 154 |
Average pain over past 1 week | 43 | m = 1; n = 78 | – | m = 1; n = 74 | – | – | – | m = 1; n = 152 |
Average pain over past 1 week | 47 | m = 1; n = 77 | – | m = 1; n = 71 | – | – | – | m = 1; n = 148 |
Average pain over past 1 week | 52 | m = 2; n = 140 | m = 1; n = 163 | m = 3; n = 297 | m = 1; n = 57 | m = 1; n = 56 | – | m = 3; n = 713 |
Average pain over past 1 month | 0 | – | m = 1; n = 24 | – | – | m = 1; n = 24 | – | m = 1; n = 48 |
Average pain over past 1 month | 6 | – | m = 1; n = 23 | – | – | m = 1; n = 22 | – | m = 1; n = 45 |
Average pain over past 1 month | 13 | – | m = 1; n = 22 | – | – | m = 1; n = 18 | – | m = 1; n = 40 |
Worst pain today | 0 | m = 1; n = 111 | – | – | m = 1; n = 61 | – | m = 1; n = 51 | m = 1; n = 223 |
Worst pain today | 10 | m = 1; n = 107 | – | – | m = 1; n = 53 | – | m = 1; n = 49 | m = 1; n = 209 |
Worst pain today | 26 | m = 1; n = 103 | – | – | m = 1; n = 53 | – | – | m = 1; n = 156 |
Worst pain today | 52 | m = 1; n = 103 | – | – | m = 1; n = 52 | – | – | m = 1; n = 155 |
Worst pain over past 1 month | 0 | – | m = 2; n = 24 | – | – | m = 1; n = 24 | – | m = 2; n = 48 |
Worst pain over past 1 month | 6 | – | m = 1; n = 23 | – | – | m = 1; n = 22 | – | m = 2; n = 45 |
Worst pain over past 1 month | 13 | – | m = 1; n = 22 | – | – | m = 1; n = 18 | – | m = 2; n = 40 |
Quality of life | ||||||||
SF-12/SF-36 PCSa | 0 | m = 4; n = 617 | m = 7; n = 2544 | m = 2; n = 507 | m = 1; n = 305 | m = 2; n = 460 | m = 6; n = 2262 | m = 9; n = 6695 |
SF-12/SF-36 PCS | 4 | m = 1; n = 214 | m = 1; n = 300 | – | m = 1; n = 264 | – | m = 1; n = 249 | m = 1; n = 1027 |
SF-12/SF-36 PCS | 8 | m = 1; n = 82 | m = 1; n = 139 | m = 1; n = 76 | – | m = 1; n = 69 | m = 1; n = 73 | m = 2; n = 439 |
SF-12/SF-36 PCS | 13 | m = 3; n = 415 | m = 6; n = 2276 | m = 1; n = 332 | m = 1; n = 243 | m = 1; n = 376 | m = 5; n = 2006 | m = 7; n = 5648 |
SF-12/SF-36 PCS | 26 | m = 2; n = 185 | m = 4; n = 1850 | m = 2; n = 436 | – | m = 2; n = 444 | m = 4; n = 1711 | m = 6; n = 4626 |
SF-12/SF-36 PCS | 52 | m = 4; n = 469 | m = 5; n = 719 | m = 2; n = 449 | m = 1; n = 235 | m = 1; n = 68 | m = 4; n = 545 | m = 7; n = 2485 |
SF-12/SF-36 PCS | 104 | m = 1; n = 83 | m = 2; n = 206 | – | – | – | m = 1; n = 49 | m = 2; n = 338 |
SF-12/SF-36 MCSb | 0 | m = 4; n = 617 | m = 7; n = 2544 | m = 2; n = 507 | m = 1; n = 305 | m = 2; n = 460 | m = 6; n = 2262 | m = 9; n = 6695 |
SF-12/SF-36 MCS | 4 | m = 1; n = 214 | m = 1; n = 300 | – | m = 1; n = 264 | – | m = 1; n = 249 | m = 1; n = 1027 |
SF-12/SF-36 MCS | 8 | m = 1; n = 82 | m = 1; n = 139 | m = 1; n = 76 | – | m = 1; n = 69 | m = 1; n = 73 | m = 2; n = 439 |
SF-12/SF-36 MCS | 13 | m = 3; n = 415 | m = 6; n = 2276 | m = 1; n = 332 | m = 1; n = 243 | m = 1; n = 376 | m = 5; n = 2006 | m = 7; n = 5648 |
SF-12/SF-36 MCS | 26 | m = 2; n = 185 | m = 4; n = 1850 | m = 2; n = 436 | – | m = 2; n = 444 | m = 4; n = 1711 | m = 6; n = 4626 |
SF-12/SF-36 MCS | 52 | m = 4; n = 469 | m = 5; n = 719 | m = 2; n = 449 | m = 1; n = 235 | m = 1; n = 68 | m = 4; n = 545 | m = 7; n = 2485 |
SF-12/SF-36 MCS | 104 | m = 1; n = 83 | m = 2; n = 206 | – | – | – | m = 1; n = 49 | m = 2; n = 338 |
Health utility | ||||||||
EQ-5D-3L | 0 | m = 1; n = 85 | – | – | – | – | m = 1; n = 94 | m = 1; n = 179 |
EQ-5D-3L | 6 | m = 1; n = 85 | – | – | – | – | m = 1; n = 94 | m = 1; n = 179 |
EQ-5D-3L | 26 | m = 1; n = 77 | – | – | – | – | m = 1; n = 86 | m = 1; n = 163 |
EQ-5D-3L | 52 | m = 1; n = 82 | – | – | – | – | m = 1; n = 88 | m = 1; n = 170 |
Most of the RCTs collected short- and mid-term outcomes and some collected more immediate outcomes (typically measured within 6 weeks post randomisation or entry to the trial) (see Table 14). Two RCTs collected longer-term effects (outcomes measured at or after 12 months post randomisation or entry to the trial). Each of the RCTs was designed with a unique protocol, as was apparent from the different instruments used to measure the physical disability, pain and psychological distress outcomes, and from the different time points at which they were measured.
There were 9328 participants in the trials included in the repository. Table 15 shows the demographic and clinical characteristics at baseline by treatment arm. All of the trials were able to provide information on sex and age. Of the 9326 participants with these data (data were missing for two participants), 5349 (57%) were female. The proportions of males and females were similar across all treatment arms. The average age of the participants in the repository was 49 years (SD 14 years). The average age of participants from trials that had active physical therapies (APTs) was slightly lower [44 years (n = 914; SD 12 years)] than the average age from trials that had passive and psychological treatments [49 years (n = 3270; SD 14 years) and 50 years (n = 1118; SD 14 years), respectively]. This difference is mainly due to the inclusion criteria of the trials.
Characteristics | Active physical (m = 7; n = 914) | Passive physical (m = 12; n = 3270) | Psychological (m = 7; n = 1120) | Combination (m = 3; n = 451) | Sham (m = 6; n = 688) | Control (m = 10; n = 2885) | All (m = 19; n = 9328) |
---|---|---|---|---|---|---|---|
Demographics | |||||||
Age, years | |||||||
Number of trials, m | 7 | 12 | 7 | 3 | 6 | 10 | 19 |
n | 914 | 3270 | 1118 | 451 | 688 | 2885 | 9326 |
Mean | 43.67 | 49.39 | 50.08 | 43.77 | 48.54 | 50.51 | 48.92 |
SD | 11.74 | 14.13 | 14.22 | 12.51 | 15.22 | 13.37 | 13.88 |
Sex | |||||||
Number of trials, m | 7 | 12 | 7 | 3 | 6 | 10 | 19 |
Female (%) | 497 (54.4) | 1907 (58.3) | 655 (58.5) | 237 (52.6) | 412 (59.9) | 1641 (56.9) | 5349 (57.4) |
Male (%) | 417 (45.6) | 1363 (41.7) | 464 (41.5) | 214 (47.5) | 276 (40.1) | 1243 (43.1) | 3977 (42.6) |
Ethnicity | |||||||
Number of trials, m | 1 | 1 | 4 | – | – | 4 | 5 |
White (%) | 65 (75.6) | 159 (100.0) | 667 (87.8) | – | – | 478 (89.4) | 1369 (88.9) |
Mixed (%) | – | – | 4 (0.5) | – | – | 3 (0.6) | 7 (0.5) |
Black (%) | – | – | 26 (3.4) | – | – | 21 (3.9) | 47 (3.1) |
Asian (Indian, Pakistani, Bangladeshi, others) (%) | 7 (8.1) | – | 37 (4.9) | – | – | 17 (3.2) | 61 (4.0) |
Chinese (%) | 1 (1.2) | – | 1 (0.1) | – | – | 1 (0.2) | 3 (0.2) |
Others (%) | 13 (15.1) | – | 25 (3.3) | – | – | 15 (2.8) | 53 (3.4) |
Smoking status | |||||||
Number of trials, m | 5 | 3 | 3 | 1 | 1 | 1 | 6 |
No (%) | 333 (66.7) | 211 (52.4) | 167 (76.3) | 52 (82.5) | 54 (79.4) | 69 (70.4) | 886 (65.6) |
Yes (%) | 167 (33.3) | 192 (47.6) | 52 (23.7) | 11 (17.5) | 14 (20.6) | 29 (29.6) | 465 (34.4) |
Employment status | |||||||
Number of trials, m | 5 | 6 | 5 | 1 | 1 | 6 | 11 |
Full-time employment (%) | 307 (51.3) | 424 (51.7) | 360 (42.2) | 165 (64.7) | 4 (25.0) | 485 (54.3) | 1745 (50.8) |
Part-time employment (%) | 120 (20.0) | 130 (15.9) | 132 (15.5) | 60 (23.5) | – | 190 (21.3) | 632 (18.4) |
No employment (%) | 172 (28.7) | 266 (32.4) | 362 (42.4) | 30 (11.8) | 12 (75.0) | 218 (24.4) | 1060 (30.8) |
BMI | |||||||
Number of trials, m | 2 | 4 | 2 | – | 2 | 2 | 5 |
n | 222 | 811 | 156 | – | 453 | 462 | 2104 |
Mean | 27.03 | 26.60 | 26.52 | – | 26.45 | 26.42 | 26.57 |
SD | 5.31 | 4.60 | 5.22 | – | 4.73 | 4.48 | 4.73 |
Physical disability | |||||||
CPG-DS (0–100; 100 = worst)a | |||||||
Number of trials, m | 1 | 2 | 2 | 1 | 1 | 5 | 4 |
n | 284 | 721 | 572 | 312 | 387 | 1052 | 3328 |
Mean | 47.44 | 51.82 | 49.38 | 44.76 | 55.36 | 49.87 | 50.16 |
SD | 22.66 | 20.9 | 23.77 | 21.86 | 18.92 | 22.14 | 21.99 |
FFbHR (0–100; 100 = best) | |||||||
Number of trials, m | – | 3 | – | – | 2 | 3 | 3 |
n | – | 1927 | – | – | 460 | 1789 | 4176 |
Mean | – | 58.33 | – | – | 48.01 | 59.38 | 57.64 |
SD | – | 20.63 | – | – | 16.14 | 20.69 | 20.5 |
ODI (0–100; 100 = worst) | |||||||
Number of trials, m | – | 1 | – | – | – | 1 | 1 |
n | – | 159 | – | – | – | 80 | 239 |
Mean | – | 33.72 | – | – | – | 31.36 | 32.93 |
SD | – | 15.40 | – | – | – | 14.24 | 15.03 |
PDI (0–70; 70 = worst) | |||||||
Number of trials, m | – | 1 | – | – | 1 | 1 | 1 |
n | – | 146 | – | – | 73 | 79 | 298 |
Mean | – | 28.92 | – | – | 31.53 | 30.95 | 30.10 |
SD | – | 11.12 | – | – | 11.14 | 13.27 | 11.75 |
PSFS (0–10; 10 = best) | |||||||
Number of trials, m | 2 | 1 | 2 | 1 | 2 | – | 3 |
n | 150 | 119 | 148 | 62 | 188 | – | 667 |
Mean | 3.57 | 3.78 | 3.76 | 3.83 | 3.97 | – | 3.79 |
SD | 1.79 | 1.60 | 1.67 | 1.94 | 1.84 | – | 1.76 |
RMDQ (0–24; 24 = worst) | |||||||
Number of trials, m | 7 | 7 | 7 | 3 | 3 | 6 | 14 |
n | 907 | 1087 | 1120 | 446 | 212 | 938 | 4710 |
Mean | 10.07 | 10.89 | 9.85 | 9.59 | 11.09 | 8.57 | 9.91 |
SD | 5.08 | 5.03 | 5.33 | 4.33 | 5.95 | 4.69 | 5.09 |
Troublesomeness | |||||||
Number of trials, m | 2 | 3 | 1 | 1 | – | 3 | 4 |
Not at all troublesome (%) | 3 | 4 | – | – | – | 4 | 11 |
Slightly troublesome (%) | 41 | 62 | 26 | 29 | – | 51 | 209 |
Moderately troublesome (%) | 146 | 213 | 211 | 154 | – | 284 | 1008 |
Very troublesome (%) | 115 | 205 | 151 | 107 | – | 211 | 789 |
Extremely troublesome (%) | 39 | 72 | 38 | 22 | – | 54 | 225 |
Pain | |||||||
CPG-PS (0–100; 100 = worst)a | |||||||
Number of trials, m | 1 | 2 | 3 | 1 | 1 | 5 | 5 |
n | 283 | 721 | 582 | 312 | 387 | 1054 | 3339 |
Mean | 60.82 | 64.93 | 58.93 | 59.91 | 67.60 | 62.65 | 62.66 |
SD | 17.62 | 16.79 | 18.53 | 17.91 | 13.16 | 17.41 | 17.31 |
Average pain (0–100; 100 = worst)b | |||||||
Number of trials, m | 4 | 6 | 6 | 3 | 5 | 6 | 12 |
n | 472 | 922 | 969 | 380 | 493 | 1118 | 4354 |
Mean | 52.42 | 59.79 | 48.20 | 50.63 | 65.54 | 52.53 | 54.40 |
SD | 22.49 | 20.96 | 24.74 | 21.50 | 15.20 | 24.64 | 23.18 |
Quality of life | |||||||
SF-12/SF-36 PCSa (0–100; 100 = best) | |||||||
Number of trials, m | 4 | 7 | 2 | 1 | 2 | 6 | 9 |
n | 617 | 2544 | 507 | 305 | 460 | 2262 | 6695 |
Mean | 37.14 | 36.03 | 37.15 | 38.14 | 32.87 | 36.30 | 36.19 |
SD | 7.42 | 8.05 | 9.06 | 7.46 | 7.09 | 8.74 | 8.29 |
SF-12/SF-36 MCSb (0–100; 100 = best) | |||||||
Number of trials, m | 4 | 7 | 2 | 1 | 2 | 6 | 9 |
n | 617 | 2544 | 507 | 305 | 460 | 2262 | 6695 |
Mean | 43.94 | 44.89 | 44.38 | 44.84 | 46.61 | 45.89 | 45.22 |
SD | 11.66 | 12.23 | 11.28 | 10.84 | 11.42 | 11.90 | 11.90 |
Health utility | |||||||
EQ-5D-3L (–0.11 to 1;1 = best) | |||||||
Number of trials, m | 4 | 4 | 2 | 2 | – | 5 | 7 |
n | 593 | 740 | 652 | 371 | – | 724 | 3080 |
Mean | 0.57 | 0.61 | 0.6 | 0.58 | – | 0.59 | 0.59 |
SD | 0.27 | 0.27 | 0.29 | 0.25 | – | 0.26 | 0.27 |
Depression (DE) | |||||||
DASS–DE (0–42; 42 = worst) | |||||||
Number of trials, m | 1 | – | 1 | 1 | 1 | – | 1 |
n | 65 | – | 62 | 63 | 68 | – | 258 |
Mean | 7.11 | – | 7.55 | 7.08 | 7.06 | – | 7.19 |
SD | 7.84 | – | 7.67 | 8.79 | 7.61 | – | 7.94 |
DRAM | |||||||
Number of trials, m | 2 | 1 | – | 1 | – | 2 | 2 |
Type N (%) | 135 (36.49) | 122 (36.75) | – | 116 (37.54) | – | 184 (44.88) | 557 (39.20) |
Type R (%) | 147 (39.73) | 147 (44.28) | – | 120 (38.83) | – | 158 (38.54) | 572 (40.25) |
Type DD (%) | 55 (14.86) | 41 (12.35) | – | 46 (14.89) | – | 49 (11.95) | 191 (13.44) |
Type DS (%) | 33 (8.92) | 22 (6.63) | – | 27 (8.74) | – | 19 (4.63) | 101 (7.11) |
HADS–DE (0–21; 21 = worst) | |||||||
Number of trials, m | – | – | 1 | – | – | 1 | 1 |
n | – | – | 464 | – | – | 231 | 695 |
Mean | – | – | 6.04 | – | – | 5.54 | 5.87 |
SD | – | – | 3.81 | – | – | 3.6 | 3.75 |
MZDI (0–69; 69 = worst) | |||||||
Number of trials, m | 2 | 2 | 1 | 1 | – | 2 | 3 |
n | 411 | 485 | 148 | 309 | – | 411 | 1724 |
Mean | 19.77 | 21.44 | 22.41 | 21.24 | – | 19.77 | 21.06 |
SD | 10.75 | 10.55 | 9.37 | 10.93 | – | 10.75 | 10.70 |
Anxiety (AN) | |||||||
DASS–AN (0–42; 42 = worst) | |||||||
Number of trials, m | 1 | – | 1 | 1 | 1 | – | 1 |
n | 65 | – | 62 | 63 | 68 | – | 258 |
Mean | 6.22 | – | 5.23 | 4.76 | 5.35 | – | 5.40 |
SD | 7.57 | – | 7.44 | 6.68 | 6.92 | – | 7.14 |
HADS–AN (0–21; 21 = worst) | |||||||
Number of trials, m | – | – | 1 | – | – | 1 | 1 |
n | – | – | 458 | – | – | 230 | 688 |
Mean | – | – | 8.22 | – | – | 7.49 | 7.98 |
SD | – | – | 4.3 | – | – | 4.43 | 4.35 |
Fear avoidance | |||||||
ALBPSQ–FA (0–30; 30 = worst) | |||||||
Number of trials, m | 2 | – | 2 | 1 | 1 | – | 2 |
n | 121 | – | 117 | 36 | 33 | – | 307 |
Mean | 18.14 | – | 18.58 | 17.14 | 18.42 | – | 18.22 |
SD | 6.91 | – | 6.16 | 5.97 | 5.90 | – | 6.40 |
FABQ–PC (0–24; 24 = worst) | |||||||
Number of trials, m | 2 | 3 | 1 | 1 | 2 | 4 | 5 |
n | 366 | 840 | 443 | 311 | 506 | 1016 | 3482 |
Mean | 14.70 | 16.65 | 13.59 | 14.96 | 17.79 | 15.85 | 15.84 |
SD | 5.27 | 5.24 | 6.34 | 5.30 | 4.87 | 5.65 | 5.61 |
TSK (16–68; 68 = worst) | |||||||
Number of trials, m | 2 | 1 | 4 | 2 | 1 | 3 | 5 |
n | 176 | 177 | 472 | 124 | 68 | 285 | 1302 |
Mean | 39.08 | 44.05 | 41.64 | 39.33 | 38.07 | 39.71 | 40.79 |
SD | 7.44 | 7.09 | 8.14 | 7.51 | 8.16 | 8.58 | 8.12 |
Catastrophising (CAT) | |||||||
CSQ–CAT (0–36; 36 = worst) | |||||||
Number of trials, m | 1 | 1 | 2 | – | – | – | 2 |
n | 86 | 193 | 282 | – | – | – | 561 |
Mean | 10.84 | 7.83 | 9.62 | – | – | – | 9.19 |
SD | 7.61 | 6.65 | 7.22 | – | – | – | 7.16 |
PRSS–CAT (0–45; 45 = worst) | |||||||
Number of trials, m | 1 | 1 | 1 | 1 | 2 | – | 2 |
n | 65 | 119 | 62 | 63 | 188 | – | 497 |
Mean | 17.92 | 16.43 | 17.9 | 17.29 | 17.23 | – | 17.22 |
SD | 8.61 | 8.12 | 10.55 | 9.05 | 8.53 | – | 8.77 |
Coping (CSS) | |||||||
CSQ–CSS (0–36; 36 = best) | |||||||
Number of trials, m | – | 1 | 1 | – | – | – | 1 |
n | – | 198 | 196 | – | – | – | 394 |
Mean | – | 25.13 | 25.33 | – | – | – | 25.23 |
SD | – | 6.23 | 6.64 | – | – | – | 6.43 |
PRSS–CSS (0–45; 45 = best) | |||||||
Number of trials, m | 1 | 2 | 1 | 1 | 2 | – | 2 |
n | 65 | 119 | 62 | 63 | 188 | – | 497 |
Mean | 30.18 | 31.26 | 30.06 | 30.37 | 31.97 | – | 31.13 |
SD | 7.34 | 6.95 | 8.36 | 6.81 | 6.85 | – | 7.15 |
PSEQ (0–60; 60 = best) | |||||||
Number of trials, m | 3 | 1 | 3 | 1 | 1 | 1 | 4 |
n | 268 | 117 | 601 | 63 | 67 | 223 | 1339 |
Mean | 40.49 | 36.85 | 40.12 | 44.38 | 43.70 | 41.15 | 40.46 |
SD | 12.93 | 10.94 | 13.17 | 12.77 | 13.38 | 12.54 | 12.90 |
Somatic perception | |||||||
MSPQ (0–39; 39 = worst) | |||||||
Number of trials, m | 2 | 2 | 1 | 1 | – | 2 | 3 |
n | 372 | 526 | 195 | 310 | – | 411 | 1814 |
Mean | 6.78 | 6.43 | 5.58 | 7.07 | – | 6.14 | 6.45 |
SD | 5.52 | 5.38 | 4.29 | 5.43 | – | 5.34 | 5.32 |
Sensory index (SE) | |||||||
McGill–SE (0–33; 33 = worst) | |||||||
Number of trials, m | – | 1 | 1 | – | – | – | 1 |
n | – | 185 | 170 | – | – | – | 355 |
Mean | – | 14.21 | 14.26 | – | – | – | 14.24 |
SD | – | 6.10 | 6.36 | – | – | – | 6.22 |
SES–SE (10–40; 40 = worst) | |||||||
Number of trials, m | – | 1 | – | – | 1 | 1 | 1 |
n | – | 146 | – | – | 73 | 79 | 298 |
Mean | – | 49.7 | – | – | 49.11 | 49.77 | 49.57 |
SD | – | 9.05 | – | – | 8.39 | 11.06 | 9.45 |
Affective index (AF) | |||||||
McGill–AF (0–12; 12 = worst) | |||||||
Number of trials, m | – | 1 | 1 | – | – | – | 1 |
n | – | 192 | 187 | – | – | – | 379 |
Mean | – | 4.21 | 4.25 | – | – | – | 4.23 |
SD | – | 3.31 | 3.36 | – | – | – | 3.33 |
SES–AF (14–56; 56 = worst) | |||||||
Number of trials, m | – | 1 | – | – | 1 | 1 | 1 |
n | – | 146 | – | – | 73 | 79 | 298 |
Mean | – | 50.19 | – | – | 50.88 | 50.01 | 50.31 |
SD | – | 8.38 | – | – | 8.17 | 9.34 | 8.57 |
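As a consistency check on Table 15, the arm-level age summaries can be combined into overall figures. The sketch below is illustrative only: it assumes the 'All' column is a simple pooling of individual participants across arms, which the report does not state explicitly, and the function name `pool` is hypothetical.

```python
from math import sqrt

# Age summaries by treatment arm, copied from Table 15: (n, mean, SD).
ARMS = {
    "Active physical": (914, 43.67, 11.74),
    "Passive physical": (3270, 49.39, 14.13),
    "Psychological": (1118, 50.08, 14.22),
    "Combination": (451, 43.77, 12.51),
    "Sham": (688, 48.54, 15.22),
    "Control": (2885, 50.51, 13.37),
}


def pool(arms):
    """Combine per-group (n, mean, SD) summaries into an overall n, mean and SD."""
    arms = list(arms)
    total_n = sum(n for n, _, _ in arms)
    grand_mean = sum(n * m for n, m, _ in arms) / total_n
    # Total sum of squares = within-group part + between-group part.
    within = sum((n - 1) * sd ** 2 for n, _, sd in arms)
    between = sum(n * (m - grand_mean) ** 2 for n, m, _ in arms)
    pooled_sd = sqrt((within + between) / (total_n - 1))
    return total_n, grand_mean, pooled_sd


n, mean_age, sd_age = pool(ARMS.values())
print(n, round(mean_age, 2), round(sd_age, 2))
```

Under this assumption the pooled values agree with the 'All' column for age (n = 9326, mean 48.92, SD 13.88) to the reported precision.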
Most of the participants with data in the repository had similar physical disability or functional limitation at baseline. One trial88 (n = 239) used the ODI as its outcome measure and the average baseline score was 33 (SD 15), which was somewhere between no disability and moderate disability. Three trials50,101,132 (n = 4176) used the FFbHR and the average baseline score was 58 (SD 21), which was slightly above moderate functional limitation. Fourteen trials31,33,65,70,76,102–106,131,133,134,136 (n = 4710) used the RMDQ as their outcome measure and the average baseline score was 10 (SD 5), which was slightly below moderate disability.
Nine trials31,33,50,76,101,102,107,132,134 (n = 6695) collected quality-of-life information with either the SF-12 or SF-36 instrument. The mean PCS at baseline was 36 (SD 8) and the mean MCS at baseline was 45 (SD 12). The mean values were similar across treatment arms.
Only a minority of the RCTs provided information on psychological distress at baseline, and these data were insufficient to allow any qualitative comparison across treatment arms.
One-step meta-analysis
Box plots of the change in outcome measures from baseline to short-, mid- and long-term follow-up by treatment arm showed that participants in all groups behaved as expected, with all groups improving over time (data not shown). This observation was examined further in the one-step meta-analysis (adjusting for study effects) and the results are shown in Figures 17–19 and Table 16. There was a statistically significant difference between control and intervention for all outcomes at the short-term follow-up.
Outcomes | Number of trials, m | Intervention | Controlb | Differencec | p-value |
---|---|---|---|---|---|
FFbHR | 3 | n = 1841 | n = 2118 | | 0.0165 |
| | | 13.88 | 5.80 | 8.08 | |
| | | 1.24 to 26.51 | –6.93 to 18.53 | 3.46 to 12.69 | |
RMDQ | 8 | n = 1778 | n = 897 | | < 0.0001 |
| | | 4.43 | 2.97 | 1.46 | |
| | | 1.56 to 7.29 | 0.10 to 5.84 | 1.10 to 1.81 | |
Average paind | 10 | n = 2061 | n = 1546 | | < 0.0001 |
| | | 18.03 | 11.57 | 6.46 | |
| | | 8.65 to 27.41 | 2.18 to 20.97 | 4.86 to 8.06 | |
PCSe | 6 | n = 2793 | n = 2415 | | 0.0006 |
| | | 6.86 | 3.72 | 3.15 | |
| | | 4.90 to 8.83 | 1.75 to 5.68 | 1.99 to 4.30 | |
MCSf | 6 | n = 2793 | n = 2415 | | 0.0044 |
| | | 2.69 | 0.62 | 2.07 | |
| | | 1.54 to 3.84 | –0.55 to 1.79 | 0.93 to 3.20 | |
EQ-5D | 4 | n = 1271 | n = 503 | | < 0.0001 |
| | | 0.1065 | 0.03422 | 0.072 | |
| | | 0.008 to 0.205 | –0.059 to 0.127 | 0.04538 to 0.099 | |
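To illustrate what 'adjusting for study effects' can mean in a one-step analysis, the following minimal sketch pools individual-level data with a separate intercept for each trial and a single common treatment effect. This is an assumed, simplified model for illustration (the report does not give the exact model specification here), and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def one_step_treatment_effect(rows):
    """Common treatment effect with a fixed intercept per study.

    rows: iterable of (study, treated, change), with treated coded 0/1.
    Centring treatment and outcome within each study removes the study
    intercepts, so the least-squares slope of the centred data is the
    pooled treatment effect.
    """
    by_study = defaultdict(list)
    for study, treated, change in rows:
        by_study[study].append((treated, change))

    numerator = denominator = 0.0
    for members in by_study.values():
        t_bar = sum(t for t, _ in members) / len(members)
        y_bar = sum(y for _, y in members) / len(members)
        for t, y in members:
            numerator += (t - t_bar) * (y - y_bar)
            denominator += (t - t_bar) ** 2
    return numerator / denominator


# Hypothetical data: two trials with different background improvement
# (3 and 8 points) but the same true treatment effect of 2 points.
toy_rows = [
    ("trial A", 0, 3.0), ("trial A", 0, 3.0), ("trial A", 1, 5.0),
    ("trial B", 0, 8.0), ("trial B", 1, 10.0), ("trial B", 1, 10.0),
]
adjusted = one_step_treatment_effect(toy_rows)

# Naive comparison that ignores which trial each participant came from.
treated = [y for _, t, y in toy_rows if t == 1]
control = [y for _, t, y in toy_rows if t == 0]
naive = sum(treated) / len(treated) - sum(control) / len(control)
print(adjusted, naive)
```

In this toy example the study-adjusted estimate recovers the effect of 2 points, whereas the naive difference in arm means is about 3.67 because the two trials have different allocation ratios and different background improvement.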
Analyses of covariance
Table 17 shows the list of potential moderators for each of the outcomes of interest at short-term follow-up, namely FFbHR, RMDQ, average pain, PCS, MCS, EQ-5D and QALYs. There were three trials50,101,132 with FFbHR short-term outcomes, and the explanatory variables provided by these trials that might be treatment moderators were age, sex, and baseline FFbHR, SF-12/36 PCS and SF-12/36 MCS. For the change in FFbHR from baseline to short-term follow-up, the age-by-treatment interaction was at the margin of our threshold for weak statistical significance (p = 0.2018), with younger participants tending to have a larger treatment effect. Participants with a lower FFbHR value at baseline (more physical disability) had a larger treatment effect and this was statistically significant (p < 0.0001). Similarly, participants with a lower PCS at baseline (substantial physical limitations) had a larger treatment effect (p < 0.0001). Therefore, age and the baseline values of FFbHR and PCS were considered for inclusion in further analyses.
Outcome | Covariates | Number of trials, m | Number of participants, AT : UC | Estimate (interaction term) | LCI | UCI | p-value |
---|---|---|---|---|---|---|---|
FFbHR | Age | 3 | 1841 : 2118 | –0.051 | –0.131 | 0.028 | 0.2018 |
| | Sex (male vs. female)b | 3 | 1841 : 2118 | –0.684 | –2.851 | 1.483 | 0.5361 |
| | FFbHR | 3 | 1841 : 2118 | –0.177 | –0.229 | –0.125 | < 0.0001 |
| | PCS (< 50 vs. ≥ 50)c | 3 | 1718 : 2000 | 2.521 | –2.361 | 7.403 | 0.3114 |
| | PCS (continuous) | 3 | 1718 : 2000 | –0.318 | –0.451 | –0.186 | < 0.0001 |
| | MCS (< 50 vs. ≥ 50) | 3 | 1718 : 2000 | 0.612 | –1.618 | 2.842 | 0.5903 |
| | MCS (continuous) | 3 | 1718 : 2000 | –0.039 | –0.130 | 0.051 | 0.3949 |
RMDQ | Age | 8 | 1778 : 897 | –0.009 | –0.036 | 0.018 | 0.514 |
| | Sex (male vs. female) | 8 | 1778 : 896 | 0.136 | –0.591 | 0.863 | 0.7133 |
| | RMDQ | 8 | 1778 : 897 | –0.017 | –0.085 | 0.050 | 0.6176 |
| | Average pain | 8 | 1649 : 790 | –0.003 | –0.018 | 0.011 | 0.6548 |
| | PCS (continuous) | 2 | 1009 : 401 | –0.016 | –0.076 | 0.044 | 0.594 |
| | PCS (< 50 vs. ≥ 50) | 2 | 1009 : 401 | 0.546 | –1.463 | 2.556 | 0.5939 |
| | MCS (continuous) | 2 | 1009 : 401 | –0.002 | –0.046 | 0.042 | 0.9177 |
| | MCS (< 50 vs. ≥ 50) | 2 | 1009 : 401 | –0.423 | –1.435 | 0.589 | 0.4123 |
| | EQ-5D | 3 | 1201 : 460 | –0.366 | –2.162 | 1.429 | 0.6892 |
| | Anxiety | 4 | 1388 : 523 | | | | 0.3332 |
| | Low riskd | | | –0.295 | –1.713 | 1.123 | 0.6832 |
| | Moderate riske | | | 0.452 | –1.089 | 1.994 | 0.5649 |
| | Depression | 4 | 1387 : 525 | | | | 0.5684 |
| | Low risk | | | 0.078 | –1.337 | 1.492 | 0.9143 |
| | Moderate risk | | | 0.559 | –0.933 | 2.051 | 0.4622 |
| | Catastrophising | 2 | 293 : 178 | | | | 0.2360 |
| | Positivef | | | 0.387 | –2.271 | 3.046 | 0.7747 |
| | Moderateg | | | 2.030 | –0.461 | 4.521 | 0.1099 |
| | Coping | 3 | 620 : 348 | | | | 0.6797 |
| | Positiveh | | | 0.428 | –1.127 | 1.982 | 0.5895 |
| | Moderatei | | | 0.729 | –0.904 | 2.362 | 0.3813 |
| | Fear avoidance | 7 | 1706 : 858 | | | | 0.1933 |
| | Positivej | | | 0.786 | –0.125 | 1.697 | 0.0907 |
| | Moderatek | | | 0.714 | –0.225 | 1.653 | 0.1361 |
Average painl | Age | 10 | 2061 : 1546 | –0.047 | –0.162 | 0.068 | 0.4216 |
| | Sex (male vs. female) | 10 | 2061 : 1545 | 0.784 | –2.381 | 3.950 | 0.6272 |
| | RMDQ | 8 | 1657 : 794 | 0.156 | –0.293 | 0.604 | 0.497 |
| | Average pain | 10 | 2061 : 1546 | 0.047 | –0.017 | 0.111 | 0.1451 |
| | PCS (continuous) | 3 | 1390 : 1144 | –0.167 | –0.400 | 0.066 | 0.1587 |
| | PCS (< 50 vs. ≥ 50) | 3 | 1390 : 1144 | 1.569 | –8.473 | 11.610 | 0.7594 |
| | MCS (continuous) | 3 | 1390 : 1144 | 0.111 | –0.047 | 0.268 | 0.1677 |
| | MCS (< 50 vs. ≥ 50) | 3 | 1390 : 1144 | –1.270 | –4.942 | 2.403 | 0.498 |
| | EQ-5D | 3 | 1208 : 464 | –3.192 | –13.603 | 7.219 | 0.5477 |
| | Anxiety | 4 | 1394 : 528 | | | | 0.2488 |
| | Low risk | | | –6.939 | –15.111 | 1.233 | 0.096 |
| | Moderate risk | | | –5.509 | –14.423 | 3.405 | 0.2256 |
| | Depression | 4 | 1394 : 530 | | | | 0.9355 |
| | Low risk | | | –1.519 | –9.809 | 6.772 | 0.7195 |
| | Moderate risk | | | –1.076 | –9.843 | 7.692 | 0.8099 |
| | Catastrophising | 2 | 198 : 85 | | | | 0.9797 |
| | Positive | | | –0.400 | –19.050 | 18.250 | 0.9664 |
| | Moderate | | | –1.573 | –17.280 | 14.133 | 0.8438 |
| | Coping | 3 | 544 : 264 | | | | 0.4009 |
| | Positive | | | –6.107 | –14.999 | 2.786 | 0.178 |
| | Moderate | | | –2.864 | –11.995 | 6.266 | 0.5382 |
| | Fear avoidance | 8 | 1991 : 1505 | | | | 0.3577 |
| | Positive | | | 1.396 | –2.525 | 5.317 | 0.4851 |
| | Moderate | | | 2.808 | –1.031 | 6.646 | 0.1516 |
SF-12/36 PCSm | Age | 6 | 2793 : 2415 | –0.034 | –0.068 | 0.001 | 0.0538 |
| | Sex (male vs. female) | 6 | 2793 : 2414 | –0.176 | –1.106 | 0.755 | 0.7111 |
| | FFbHR | 3 | 1675 : 1955 | –0.016 | –0.045 | 0.013 | 0.2766 |
| | RMDQ | 2 | 966 : 383 | 0.012 | –0.210 | 0.234 | 0.9187 |
| | Average pain | 3 | 1346 : 1125 | –0.011 | –0.044 | 0.023 | 0.5313 |
| | PCS (continuous) | 6 | 2793 : 2415 | –0.057 | –0.109 | –0.005 | 0.0313 |
| | PCS (< 50 vs. ≥ 50) | 6 | 2793 : 2415 | 1.995 | 0.018 | 3.973 | 0.048 |
| | MCS (continuous) | 6 | 2793 : 2415 | 0.023 | –0.015 | 0.060 | 0.2395 |
| | MCS (< 50 vs. ≥ 50) | 6 | 2793 : 2415 | –0.913 | –1.827 | 0.002 | 0.0504 |
| | EQ-5D | 3 | 1046 : 425 | 1.216 | –2.364 | 4.795 | 0.5054 |
| | Anxiety | 3 | 1051 : 428 | | | | 0.6537 |
| | Low risk | | | 1.315 | –1.638 | 4.267 | 0.3826 |
| | Moderate risk | | | 1.398 | –1.750 | 4.545 | 0.3839 |
| | Depression | 3 | 1053 : 430 | | | | 0.6277 |
| | Low risk | | | 1.261 | –1.640 | 4.163 | 0.3939 |
| | Moderate risk | | | 1.462 | –1.559 | 4.483 | 0.3427 |
| | Fear avoidance | 3 | 1332 : 1114 | | | | 0.8438 |
| | Positive | | | –0.311 | –2.029 | 1.408 | 0.7229 |
| | Moderate | | | 0.211 | –1.435 | 1.857 | 0.8019 |
| | Somatic symptoms | 2 | 805 : 365 | | | | 0.9147 |
| | Positiven | | | 0.542 | –1.989 | 3.072 | 0.6746 |
| | Moderateo | | | 0.249 | –1.907 | 2.405 | 0.8206 |
SF-12/36 MCSp | Age | 6 | 2793 : 2415 | 0.008 | –0.035 | 0.050 | 0.7273 |
| | Sex (male vs. female) | 6 | 2793 : 2414 | –0.324 | –1.470 | 0.822 | 0.579 |
| | FFbHR | 3 | 1675 : 1955 | –0.046 | –0.081 | –0.011 | 0.0093 |
| | RMDQ | 2 | 966 : 383 | –0.011 | –0.298 | 0.276 | 0.9395 |
| | Average pain | 3 | 1346 : 1125 | –0.007 | –0.048 | 0.034 | 0.7423 |
| | PCS (continuous) | 6 | 2793 : 2415 | –0.035 | –0.102 | 0.033 | 0.3133 |
| | PCS (< 50 vs. ≥ 50) | 6 | 2793 : 2415 | 0.649 | –1.821 | 3.118 | 0.6067 |
| | MCS (continuous) | 6 | 2793 : 2415 | –0.052 | –0.093 | –0.011 | 0.0128 |
| | MCS (< 50 vs. ≥ 50) | 6 | 2793 : 2415 | 1.490 | 0.442 | 2.539 | 0.0054 |
| | EQ-5D | 3 | 1046 : 425 | –0.059 | –4.576 | 4.458 | 0.9795 |
| | Anxiety | 3 | 1051 : 428 | | | | 0.4267 |
| | Low risk | | | –1.201 | –4.918 | 2.517 | 0.5265 |
| | Moderate risk | | | 0.406 | –3.558 | 4.369 | 0.8409 |
| | Depression | 3 | 1053 : 430 | | | | 0.863 |
| | Low risk | | | –0.334 | –3.983 | 3.314 | 0.8573 |
| | Moderate risk | | | 0.343 | –3.456 | 4.142 | 0.8594 |
| | Fear avoidance | 3 | 1332 : 1114 | | | | 0.7926 |
| | Positive | | | 0.732 | –1.378 | 2.843 | 0.4964 |
| | Moderate | | | 0.278 | –1.744 | 2.299 | 0.7877 |
| | Somatic symptoms | 2 | 805 : 365 | | | | 0.575 |
| | Least | | | –0.978 | –4.351 | 2.395 | 0.5695 |
| | Moderate | | | 0.789 | –2.087 | 3.665 | 0.5906 |
EQ-5D | Age | 4 | 1271 : 503 | 0.001 | –0.001 | 0.003 | 0.503 |
| | Sex (male vs. female) | 4 | 1271 : 502 | –0.040 | –0.094 | 0.015 | 0.1543 |
| | RMDQ | 3 | 1177 : 455 | 0.007 | 0.001 | 0.013 | 0.0219 |
| | Average pain | 3 | 1183 : 459 | 0.002 | 0.000 | 0.003 | 0.0094 |
| | PCS (continuous) | 3 | 1068 : 439 | –0.004 | –0.008 | –0.001 | 0.0128 |
| | PCS (< 50 vs. ≥ 50) | 3 | 1068 : 439 | 0.045 | –0.072 | 0.162 | 0.4494 |
| | MCS (continuous) | 3 | 1068 : 439 | –0.002 | –0.004 | 0.001 | 0.1834 |
| | MCS (< 50 vs. ≥ 50) | 3 | 1068 : 439 | 0.024 | –0.034 | 0.082 | 0.4102 |
| | EQ-5D | 4 | 1271 : 503 | –0.054 | –0.144 | 0.035 | 0.2358 |
| | Anxiety | 4 | 1269 : 500 | | | | 0.0032 |
| | Low risk | | | –0.143 | –0.232 | –0.055 | 0.0015 |
| | Moderate risk | | | –0.086 | –0.180 | 0.009 | 0.0753 |
| | Depression | 4 | 1265 : 500 | | | | 0.5331 |
| | Low risk | | | –0.033 | –0.120 | 0.054 | 0.4573 |
| | Moderate risk | | | –0.003 | –0.094 | 0.088 | 0.9511 |
| | Fear avoidance | 3 | 1163 : 450 | | | | 0.0533 |
| | Positive | | | –0.001 | –0.072 | 0.071 | 0.9856 |
| | Moderate | | | 0.073 | –0.002 | 0.147 | 0.0565 |
QALY | Age | 6 | 1539 : 814 | 0.001 | –0.0003 | 0.002 | 0.1850 |
| | RMDQ | 4 | 1092 : 422 | 0.003 | –0.001 | 0.008 | 0.1270 |
| | PCS (continuous) | 4 | 1273 : 715 | –0.001 | –0.003 | 0.0004 | 0.1160 |
| | MCS (continuous) | 4 | 1273 : 715 | –0.0001 | –0.002 | 0.001 | 0.8340 |
| | EQ-5D | 4 | 1273 : 715 | –0.018 | –0.082 | 0.045 | 0.5730 |
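The 'Estimate (interaction term)' column in Table 17 can be read as the coefficient on a treatment-by-covariate product term. The sketch below shows one way such a term might be estimated: an ANCOVA-style linear model with a separate intercept per trial, fitted by ordinary least squares. The model form, the function names (`solve`, `fit_interaction`) and the toy data are assumptions made for illustration, not the authors' specification.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_interaction(rows, studies):
    """Least-squares fit of: change = study intercept + treatment + baseline
    + treatment x baseline.

    rows: (study, treated, baseline, change), with treated coded 0/1.
    """
    design, outcome = [], []
    for study, treated, baseline, change in rows:
        intercepts = [1.0 if study == s else 0.0 for s in studies]
        design.append(intercepts + [treated, baseline, treated * baseline])
        outcome.append(change)
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, outcome)) for i in range(p)]
    coef = solve(xtx, xty)
    return {"treatment": coef[-3], "baseline": coef[-2], "interaction": coef[-1]}


# Hypothetical noise-free data: the treatment effect is 12 - 0.2 * baseline,
# so participants with lower baseline scores benefit more.
studies = ["A", "B"]
intercept = {"A": 1.0, "B": 4.0}
toy_rows = [
    (s, t, x, intercept[s] + 12.0 * t + 0.5 * x - 0.2 * t * x)
    for s in studies for t in (0, 1) for x in (20.0, 40.0, 60.0)
]
fit = fit_interaction(toy_rows, studies)
print(fit)
```

With a negative interaction coefficient on a scale where higher scores are better (as for FFbHR in Table 17), the estimated treatment effect shrinks as the baseline score rises. In the toy data the implied treatment effect is 8 points at a baseline of 20 and 0 points at a baseline of 60.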
Roland–Morris Disability Questionnaire
There were eight trials31,33,70,103–105,131,136 with RMDQ score as a short-term outcome, and the explanatory covariates provided by them were age, sex, RMDQ score, average pain, PCS, MCS, EQ-5D, anxiety level, depression level, catastrophising, coping strategy and fear avoidance at baseline. Seven trials31,33,70,103–105,131 provided information on fear avoidance at baseline and the original values were mapped to a single ordinal categorical variable. This covariate was weakly statistically significant, at our lower threshold for inclusion in further analyses (p < 0.20), in moderating the change in RMDQ score over the short term (p = 0.1933): those with either a positive or a moderate attitude (lower fear avoidance score) had a greater treatment effect than those with a negative attitude (higher fear avoidance score). Although the covariate catastrophising was not statistically significant overall (p = 0.236) in predicting the change in RMDQ score in the short term, there was a weakly statistically significant difference between the moderate and negative statements (mean difference = 2.03; p = 0.1099), that is, those with a moderate attitude towards catastrophising had a greater treatment effect than those with a negative attitude. Therefore, both fear avoidance and catastrophising were considered for the prediction rule analyses.
Pain
Ten trials31,33,70,103–105,131,132,135,136 provided an average pain short-term outcome. The covariates considered in the ANCOVA were age, sex, RMDQ score, average pain, PCS, MCS, EQ-5D, anxiety level, depression level, catastrophising, coping strategy and fear avoidance at baseline. Similar to the results seen for the change in RMDQ score in the short term, anxiety level, coping strategy and fear avoidance were not statistically significant overall, but there were weakly significant differences between the low and high risk of anxiety (p = 0.0960), between the positive and negative statements of coping strategy (p = 0.1780), and between the moderate and negative statements of fear avoidance (p = 0.1516). As above, those with a moderate fear avoidance belief had a greater treatment effect than those with a negative attitude. However, those with a low risk of anxiety had a smaller treatment effect than those with a high risk of anxiety. Similarly, those with a positive attitude towards coping had a smaller treatment effect than those with a negative attitude. The estimated treatment effect increased with baseline average pain, that is, participants with worse average pain gained a greater treatment effect, and this was weakly significant (p = 0.1451). The estimated treatment effect decreased as PCS increased, that is, participants with a worse physical functioning score had a greater treatment effect (p = 0.1587). The interaction between treatment and MCS was also weakly statistically significant (p = 0.1677), with participants with a higher (better) MCS having a larger treatment effect. Therefore, average pain, PCS, MCS, anxiety level, coping strategy and fear avoidance at baseline were considered for the prediction rule analyses.
Mental component score and physical component score
There were six trials31,33,50,88,101,132 with PCS and MCS short-term outcomes, and the covariates considered were age, sex, FFbHR, RMDQ, average pain, PCS, MCS, EQ-5D, anxiety level, depression level, fear avoidance and somatic symptoms. Psychological distress at baseline, measured by the continuous MCS, was not significant in predicting the change in PCS in the short term; however, when the score was dichotomised to < 50 against ≥ 50, that is, ‘below the norm’ against ‘at or above the general population norm’, participants with more psychological distress (score of < 50) had a worse treatment effect and this was possibly statistically significant (p = 0.0504). In addition, age and PCS at baseline were significant, in that those who were younger and those with substantial physical limitations had a larger treatment effect. Therefore, age, PCS and MCS at baseline were included in the prediction rule analyses for the change in SF-12/36 PCS in the short term.
For the short-term MCS outcome, only FFbHR and MCS at baseline were found to be statistically significant in predicting the change of SF-12/36 MCS. Those with higher physical disability and more psychological distress had a greater treatment effect. Therefore, both FFbHR score and MCS at baseline were included for the prediction rule analyses for the change of SF-12/36 MCS in the short term.
European Quality of Life-5 Dimensions (EQ-5D)
Four trials31,33,70,105 provided health utility measured by EQ-5D over the short term. The covariates examined in the ANCOVA were age, sex, RMDQ score, average pain, PCS, MCS, EQ-5D, anxiety level, depression level and fear avoidance. Seven of these were statistically or weakly significant in predicting the change in EQ-5D in the short term: sex, RMDQ score, average pain, PCS, MCS, anxiety level and fear avoidance at baseline. Females had a greater treatment effect (p = 0.1543), as did those with worse physical disability (RMDQ score, p = 0.0219; average pain, p = 0.0094; PCS, p = 0.0128). Participants with more psychological distress at baseline, high risk of anxiety, high risk of depression, negative beliefs about physical activity affecting their LBP (fear avoidance) or frequent psychological distress (MCS) had a larger treatment effect. Therefore, these were considered for the prediction rule analyses.
Quality-adjusted life-years
There were six trials31,33,70,105,132,133 with QALY data. Age, baseline RMDQ score and baseline PCS were possibly statistically significant in moderating QALYs. The age-by-treatment interaction was possibly significant, with a coefficient of 0.001 and a p-value of 0.19. The coefficient was positive, suggesting that older participants within this sample achieved a higher treatment effect. The RMDQ score-by-treatment interaction was significant (p = 0.13) at our prespecified level of 0.2. The coefficient of 0.003 was positive. The scale of the RMDQ is such that lower scores denote better health states; therefore, participants with better (lower) RMDQ scores should be peeled off first in the health-economic prediction rule analyses (see Chapter 9). The coefficient of the PCS-by-treatment interaction was –0.001 (p = 0.12). The negative coefficient indicates that participants with a worse physical functioning score at baseline achieved a greater treatment effect than those with better physical functioning scores at baseline. The baseline EQ-5D score and MCS were not significant moderators. The EQ-5D-by-treatment interaction had a coefficient of –0.018 (p = 0.57). The coefficient was negative, suggesting that participants with worse baseline EQ-5D scores achieved better treatment outcomes; however, this result should not be considered reliable, given that it was not statistically significant. The coefficient of the MCS-by-treatment interaction was –0.0001 (p = 0.83).
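The screening step described above, in which covariates are carried forward when their interaction p-value is below the prespecified level of 0.2, can be sketched as follows. The p-values are copied from the QALY rows of Table 17; the function name `screen` is hypothetical and the code is illustrative rather than the authors' implementation.

```python
# Interaction p-values for the QALY outcome, copied from Table 17.
QALY_P_VALUES = {
    "Age": 0.1850,
    "RMDQ": 0.1270,
    "PCS (continuous)": 0.1160,
    "MCS (continuous)": 0.8340,
    "EQ-5D": 0.5730,
}


def screen(p_values, threshold=0.2):
    """Return the covariates whose interaction p-value is below the threshold."""
    return [name for name, p in p_values.items() if p < threshold]


retained = screen(QALY_P_VALUES)
print(retained)
```

Applied to the QALY rows, the screen retains age, RMDQ score and PCS, and drops MCS and EQ-5D, in line with the text.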
Summary
This work provides the largest analysis of possible treatment moderation in LBP. Overall, these analyses do not provide strong evidence for substantial effect moderation. Using conventional criteria for statistical significance we can conclude, overall, only that back pain disability moderates effect size on back pain disability outcomes (FFbHR moderates FFbHR); physical state and back pain moderate effect size on physical outcomes (PCS and FFbHR moderate PCS); psychological state moderates effect size on psychological outcomes (MCS moderates MCS); overall physical state and anxiety moderate effect size on quality of life (PCS and anxiety moderate EQ-5D); and back pain severity moderates effect size on psychological outcomes (FFbHR moderates MCS).
Age, gender, back pain disability, pain severity, MCS, PCS, anxiety, catastrophising and coping were all at least weakly statistically significant (p < 0.2) in one, or more, ANCOVA and were considered further for our main analyses.
Chapter 7 Methodology and statistical developments 1: subgroup identification with recursive partitioning
In Chapter 2 we concluded that current approaches using tests for interactions on single potential moderators may not be the best way to identify subgroups. We reached this conclusion specifically for LBP, but it may be generalisable to other disorders. We argued that new statistical methods may be needed to improve subgroup identification. In the succeeding chapters we describe our exploration of the different methods that we have applied to address this problem. In particular, we were interested in how subgroups might be defined using multiple parameters. We first describe two recursive partitioning approaches, then an adaptive peeling approach and, finally, an indirect meta-analytical approach.
This chapter presents the two methodological developments, using recursive partitioning, to identify subgroup characteristics that moderate response to treatment. Both methods were the works of a PhD project that was part of this programme grant. 167 The other methods are described in later chapters (see Chapters 8–10).
Background
Two methods were considered as suitable and appropriate to perform subgroup analyses using a recursive partitioning approach. They are the interaction tree (IT) and subgroup identification based on a differential effect search (SIDES). 97,99 These methods were initially developed and implemented in a single-trial setting. Therefore, they have to be extended so that they can be applied in an individual patient data (IPD) meta-analysis framework. The extended IT and SIDES methods are known as IPD-IT and IPD-SIDES, respectively. Details of each of these methods are given below.
Both IT and SIDES are tree-based methods that rely on a technique referred to as recursive partitioning. This technique recursively forms binary splits of the covariate space in order to grow a tree-like structure. An example of a tree structure is displayed in Figure 20. In this example, we start off with the root node of the tree, which consists of the entire data set. The method then searches all possible binary splits for every covariate to find the best split that maximises some splitting criterion. Suppose that ‘sex’ is identified as the first best split. The method, therefore, splits the root node using the sex covariate to form two child nodes: females (left child node) and males (right child node). The newly formed child nodes are also referred to as internal nodes. The same search process is then conducted on all of the internal nodes of the tree, that is, the two child nodes, to try to identify the next best split. No additional splits are identified for the left child node and hence the node is not split any further. This node is thus referred to as a terminal node, as it cannot be split any further and is represented by a square box in Figure 20. For the right child node, the method identifies age of ≤ 50 years as the next best split and thus forms two new child nodes accordingly. In the same manner, this search process is repeated until a full tree is grown.
The objectives of the IPD-IT and IPD-SIDES methods are somewhat different. The aim of the IPD-IT method is to identify moderators of treatment effect whereas the aim of the IPD-SIDES method is to identify candidate subgroups with enhanced treatment effect. In other words, the IPD-IT method is driven by identifying the split that results in the largest interaction effect whereas the IPD-SIDES method is driven by identifying the split that maximises the overall treatment benefit in one of the subgroups formed from the split.
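The tree described in this example can be written down as a small data structure. The following is a minimal sketch only: the class name `Node`, the function `terminal_node` and the participant fields `sex` and `age` are hypothetical names chosen for illustration and are not taken from the programme's software.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """One node of a binary tree; a node without a rule is a terminal node."""
    label: str
    rule: Optional[Callable[[dict], bool]] = None  # True sends a participant left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_terminal(self) -> bool:
        return self.rule is None


def terminal_node(node: Node, participant: dict) -> str:
    """Follow the binary splits from the root until a terminal node is reached."""
    while not node.is_terminal():
        node = node.left if node.rule(participant) else node.right
    return node.label


# Illustrative tree: the root is split on sex, then males are split on age <= 50.
tree = Node(
    label="root",
    rule=lambda p: p["sex"] == "female",
    left=Node("females"),
    right=Node(
        label="males",
        rule=lambda p: p["age"] <= 50,
        left=Node("males aged <= 50"),
        right=Node("males aged > 50"),
    ),
)
```

Each participant ends up in exactly one terminal node, which is why the terminal nodes can be read as the subgroups defined by the tree.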
Individual patient data interaction tree
The IPD-IT method primarily consists of three steps:
-
growing an initial tree
-
pruning the initial tree
-
selecting the best tree.
The third and final step in the process will result in either a tree structure with just the root node (i.e. no moderators identified) or a larger tree structure that stems from the root node (i.e. some moderators identified). In the latter case, the subgroups identified by the final selected tree are interpreted using its terminal nodes.
Growing an initial tree
The first iteration of the procedure starts at the root node and evaluates a splitting criterion that assesses the interaction effect for every possible binary split of each covariate in order to identify an optimal split. For a continuous or discrete ordered covariate, the total number of binary split points is just one fewer than the total number of distinct values. For example, a discrete ordered covariate with 10 distinct values will have 10 – 1 = 9 possible split points. For a categorical covariate with k different categories, there are $2^{k-1} - 1$ different split points. For example, a categorical covariate such as ethnicity with four different categories (white, Asian/Asian British, black/African/Caribbean/black British, and other) will have $2^{3} - 1 = 7$ possible ways of forming two groups using a binary split.
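The two counting rules can be checked by enumerating the candidate splits directly. This is an illustrative sketch; the function names `ordered_split_points` and `categorical_splits` are hypothetical.

```python
from itertools import combinations


def ordered_split_points(values):
    """Cut-offs c for splits of the form x <= c: every distinct value except the largest."""
    distinct = sorted(set(values))
    return distinct[:-1]


def categorical_splits(categories):
    """All ways of dividing k categories into two non-empty groups.

    The first category is always placed in the left group so that each
    partition is listed once rather than twice.
    """
    cats = sorted(set(categories))
    anchor, rest = cats[0], cats[1:]
    splits = []
    for size in range(len(rest)):  # the left group never takes every category
        for extra in combinations(rest, size):
            left = {anchor, *extra}
            right = set(cats) - left
            splits.append((left, right))
    return splits


ethnicity = ["white", "Asian/Asian British",
             "black/African/Caribbean/black British", "other"]
n_ordered = len(ordered_split_points(range(1, 11)))
n_categorical = len(categorical_splits(ethnicity))
```

With 10 distinct ordered values the enumeration yields nine cut-offs, and with four categories it yields seven two-group partitions.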
The splitting criterion is used to evaluate the interaction effect for any particular split. The original IT method used a splitting criterion that was equivalent to the square of the t-test statistic of the interaction term in a linear regression model consisting of a treatment indicator variable T, a covariate indicator representing a particular split X and the interaction between T and X. As we are now applying this method to IPD from different trials, we extended the original method so that the splitting criterion adjusts for the between-trial variability when evaluating the interaction. This was done by fitting the same linear regression model but also including dummy variables for each trial, that is, fitting a fixed-effects model. 167 A split with a larger splitting criterion value indicates a larger interaction effect. Therefore, an optimal split is defined as the split that maximises the splitting criterion having searched every possible split point of each covariate. Having defined the splitting criterion, the algorithm for growing a full tree can be applied as follows:
-
Start at the root node consisting of the entire data set.
-
Iteration:
-
Step 1 Evaluate the splitting criterion for all possible splits for every single covariate.
-
Step 2 Select the optimal split from step 1 and form a split to create two new child nodes.
-
Step 3 Repeat steps 1 and 2 for each of the newly formed child nodes.
-
Step 4 Repeat steps 1–3 until either a full tree is grown or some stopping criterion is satisfied, for example, a minimum of 30 observations in a node.
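The splitting criterion and the growing steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the programme's implementation (which was written in R): it uses ordinary least squares from NumPy, handles only ordered covariates with splits of the form x ≤ c, and stops on a minimum node size and a hypothetical `max_depth` argument. All function names are invented for the example.

```python
import numpy as np


def interaction_t2(y, treat, split, trial):
    """Squared t-statistic of the treatment-by-split interaction.

    The model contains one dummy variable per trial (fixed trial effects),
    the treatment indicator, the split indicator and their product.
    """
    trials = np.unique(trial)
    dummies = (trial[:, None] == trials[None, :]).astype(float)
    X = np.column_stack([dummies, treat, split, treat * split])
    n, p = X.shape
    if n <= p or np.linalg.matrix_rank(X) < p:
        return -np.inf  # interaction not estimable in this node
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[-1] ** 2 / cov[-1, -1])


def best_split(y, treat, trial, covariates, min_node=30):
    """Steps 1 and 2: search every cut-off of every covariate for the largest criterion."""
    best = (-np.inf, None, None)
    for name, x in covariates.items():
        for cut in np.unique(x)[:-1]:
            left = x <= cut
            if left.sum() < min_node or (~left).sum() < min_node:
                continue
            score = interaction_t2(y, treat, left.astype(float), trial)
            if score > best[0]:
                best = (score, name, cut)
    return best


def grow(y, treat, trial, covariates, min_node=30, max_depth=2, depth=0):
    """Steps 3 and 4: split recursively until a stopping rule applies."""
    node = {"n": int(len(y))}
    if depth >= max_depth:
        return node
    score, name, cut = best_split(y, treat, trial, covariates, min_node)
    if name is None:
        return node
    left = covariates[name] <= cut
    for side, mask in (("left", left), ("right", ~left)):
        sub = {k: v[mask] for k, v in covariates.items()}
        node[side] = grow(y[mask], treat[mask], trial[mask], sub,
                          min_node, max_depth, depth + 1)
    node["split"] = f"{name} <= {cut:.3g}"
    node["score"] = score
    return node
```

Because the search always takes the best available split, a tree grown this way will also split on noise, which is why the pruning and validation steps are needed.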
Pruning the initial tree
The fully grown tree is well fitted to the available data; however, it would be quite poorly fitted and unstable if applied to new data. For this reason, a pruning procedure is applied to the full tree to sequentially remove any branches of the tree that least contribute to the overall predictive accuracy of the tree. The procedure continues until we are just left with the root node and thus have a sequence of subtrees from which the optimal final subtree will be chosen. A more detailed description of the pruning procedure can be found elsewhere. 97,167,168
Selecting the best tree
Once the sequence of subtrees has been determined, an interaction–complexity measure is used to evaluate the quality of each tree. The interaction–complexity measure is essentially the total amount of interaction of the internal nodes of a tree. Although the measure is computed for each of the subtrees, these estimates are known to be over-optimistic and thus need to be validated to obtain more reliable estimates. To validate the tree selection, the method applies a bootstrapping procedure used by LeBlanc and Crowley.169 As a guideline, LeBlanc and Crowley169 suggested that around 25–100 bootstrap samples are sufficient. The subtree with the largest interaction–complexity measure estimated from the bootstrapping procedure is chosen as the best tree. Conclusions can then be drawn from the best tree by simply computing the treatment effect in each of the terminal nodes of the tree.
Individual patient data subgroup identification based on a differential effect search
The IPD-SIDES method consists of two key steps:
-
growing an initial tree
-
selecting the final candidate subgroups.
The tree-growing procedure for the IPD-SIDES method (step 1) relies on two different criteria: a splitting criterion to help search the covariate space for the best splits and a continuation criterion to control the complexity of the tree. Details are given below. Unlike the IPD-IT procedure, the IPD-SIDES method does not require a pruning step, as the tree complexity is controlled using the continuation criterion. Ultimately, after step 2, the method outputs a list of candidate subgroups that have an enhanced treatment effect.
Growing an initial tree
We first describe the algorithm for the IPD-SIDES procedure followed by a more detailed description of the splitting criterion and the continuation criterion. The algorithm for growing the tree is as follows:
-
Start at the root node consisting of the entire data set.
-
Iteration:
-
Step 1 Evaluate the splitting criterion for all splits of every covariate, excluding any covariates that have already been used to define the parent node, retaining only the best split for each covariate. Order the covariates from smallest adjusted p-value to largest adjusted p-value, where the adjusted p-values are computed using the Sidak-based multiplicity adjustment.
-
Step 2 Select the best M covariates from the ordered best splits. The value of M is specified by the user where the recommended value is 5. For each of the M splits, form the split creating two child nodes and retain the child node with the larger positive treatment effect, providing that it satisfies the continuation criterion. The retained nodes now become parent nodes for the next iteration.
-
Step 3 Repeat steps 1 and 2 for the newly formed parent nodes.
-
Step 4 Repeat steps 1–3 until either a prespecified maximum number of levels is reached or no more splits can be formed, that is, the continuation criterion is not satisfied. In both cases, the previously formed parent nodes become terminal nodes.
The IPD-SIDES procedure starts at the root node consisting of the entire data set. The method then evaluates the splitting criterion for all splits for every covariate, retaining only the single best split for each covariate. The original SIDES method used a splitting criterion in a single-trial setting, which tested the difference in the treatment effect precision between two child nodes with the aim of identifying the subgroup or child node with the most significant treatment effect. This objective is different from what we require the method to do; we require the method to test the differential treatment effect between the two groups in an IPD meta-analysis setting. For this reason, a new splitting criterion was proposed, which uses the same fixed-effects model described earlier for the IPD-IT method but instead uses the p-value of the interaction effect, for which a smaller p-value is indicative of a larger interaction effect. If a covariate has more than two distinct cut-off points, the p-value computed using the splitting criterion is adjusted to overcome variable selection bias, a well-known issue with recursive partitioning-based methods whereby covariates with a larger number of splits have a greater probability of being chosen as the splitting variable.170,171 The method adjusts the p-value by applying a Sidak-based multiplicity adjustment, as described in the original SIDES method paper.99
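The effect of the multiplicity adjustment on covariate ranking can be illustrated with the standard Šidák formula, 1 – (1 – p)^G for G candidate splits. This is a sketch under an assumption: the adjustment in the original SIDES paper is described only as ‘Sidak-based’, so its exact form may differ from the plain formula used here. The function names and the example p-values are hypothetical.

```python
def sidak_adjust(p, n_splits):
    """Plain Sidak adjustment of the best p-value among n_splits candidate splits."""
    return 1.0 - (1.0 - p) ** n_splits


def rank_covariates(best_splits, m=5):
    """Order covariates by adjusted p-value and keep the best m.

    best_splits maps a covariate name to (raw p-value of its best split,
    number of candidate splits searched for that covariate).
    """
    adjusted = {name: sidak_adjust(p, g) for name, (p, g) in best_splits.items()}
    return sorted(adjusted.items(), key=lambda kv: kv[1])[:m]


# Hypothetical example: age has nine candidate cut-offs, sex has only one split.
ranking = rank_covariates({"age": (0.01, 9), "sex": (0.04, 1)})
```

Before adjustment, age would be ranked first (0.01 against 0.04); after adjustment its p-value rises to about 0.086, so sex is ranked first. This is the sense in which the adjustment counteracts the advantage of covariates with many possible splits.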
Continuation criterion
In step 2 of the IPD-SIDES iteration algorithm, a child node with a large positive treatment effect is retained only if it satisfies the continuation criterion. The continuation criterion is given by Equation 2:
$$p_c \le \gamma \, p_p \qquad (2)$$
where $p_c$ is the treatment effect p-value of the child node, $p_p$ is the treatment effect p-value of the parent node and $\gamma$ is the relative improvement parameter that controls the complexity of the tree. Prior to running the method, the user must specify the maximum number of covariates, L, that defines a subgroup, for which the recommended value is ‘3’. This means that any identified subgroups will at most be defined by L covariates and hence the tree will have at most L levels. Each level of the tree has a relative improvement parameter value that ranges from 0 to 1, for which a smaller value makes the procedure more selective. The values for each level can be either user specified or optimally selected using a cross-validation procedure as described by the authors.99 Hence, once the relative improvement parameter values are in place, a child node is retained only if its treatment effect p-value is less than or equal to the right-hand side of the continuation criterion.
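As a small illustration, the continuation check can be written as a one-line comparison. The sketch assumes that the criterion reads ‘child p-value ≤ γ × parent p-value’, which is consistent with the description above; the function name `retain_child` and the example p-values are hypothetical.

```python
def retain_child(p_child, p_parent, gamma):
    """Return True if the child node satisfies the continuation criterion.

    Assumed form: p_child <= gamma * p_parent, with gamma in [0, 1].
    A smaller gamma demands a larger improvement over the parent node.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie between 0 and 1")
    return p_child <= gamma * p_parent


kept = retain_child(p_child=0.01, p_parent=0.04, gamma=0.5)     # 0.01 <= 0.02
dropped = retain_child(p_child=0.03, p_parent=0.04, gamma=0.5)  # 0.03 > 0.02
```

With γ = 1 any child whose p-value is no larger than that of its parent is retained, whereas with γ = 0 essentially no child can be retained.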
Selecting the final candidate subgroups
The first step of the IPD-SIDES procedure grows the tree and produces a list of candidate subgroups. Many of these subgroups may be spurious findings and thus need to be removed. To control for this, the authors of the original SIDES method proposed a resampling-based procedure that computes an adjusted treatment effect p-value for each of the identified candidate subgroups to control the overall type I error in the weak sense. 99 Comparing the unadjusted p-value to the adjusted p-value gives a good indication of whether or not the identified subgroups are spurious.
Analyses
Two sets of analyses were performed using the repository data. In the first analyses (analysis 1), we grouped all of the interventions together as being one arm, and grouped the non-active usual care and sham control together as being the comparator arm. We then sought to identify subgroups within these data by applying the IPD-IT and IPD-SIDES methods. These analyses were performed for all of the following absolute change from baseline to short-term follow-up outcome variables: average pain, EQ-5D, FFbHR, MCS of SF-12/36, PCS of SF-12/36 and RMDQ.
In addition to the above outcome measures, we also looked at the QALYs health-economics outcome. This analysis provides proof of principle that the analytical techniques are robust when used with real data rather than simply in the simulated data sets in which we originally developed our techniques. 167
In the second set of analyses (analysis 2), the following interventions against the non-active usual care comparisons were investigated for subgroups:
-
active physical against non-active usual care
-
passive physical against non-active usual care
-
psychological against non-active usual care
-
sham against non-active usual care.
Both the IPD-IT and IPD-SIDES methods were applied to the above for each of the short-term outcomes common to all trials. For example, active physical against non-active usual care may consist of three trials with RMDQ, MCS and PCS as common short-term outcome measures. Thus the analyses would be applied to only these three outcome measures.
Prior to performing each of the analyses, any observations with missing data were removed from the data set. A mixed-effects model was then applied to adjust for the clustering inherent within the data and thus obtain an estimate of the overall treatment effect. In both sets of analyses, the potential moderator variables identified from the univariate analyses as well as those moderators identified in systematic review 1 (see Chapter 2) were considered. From this set of moderator variables, only the variables that were most common across all trials were entered into each of the analyses in order to retain as much data as possible.
The IPD-IT and IPD-SIDES methods both require certain parameters to be prespecified to aid or control the methods when applied to the data. For both methods, the minimum number of participants in any given node of a tree was set to r = 1/20 of the population being analysed. The maximum number of splits for the fully grown IPD-IT tree was set as 15. For the IPD-SIDES methods, the maximum number of levels, that is, the maximum number of covariates defining any particular subgroup, was set as being the number of potential moderators being considered. Moreover, the maximum number of best splits to consider for each node during the IPD-SIDES procedure was set to ‘3’, with a restriction of p ≤ 0.20 placed on the splitting criterion. This is the same constraint that we set in the identification of a promising moderator.
Before applying the IPD-SIDES method, we performed a grid search to obtain an optimal sequence of complexity control parameters for the first three levels of the tree. The grid search considered all permutations from 0.2 to 1, in steps of 0.2 at the first level and then from 0 to 1 in steps of 0.2 at levels two and three. When validating or selecting the final subgroups, we used 500 bootstraps for the IPD-IT procedure and used 1000 repetitions of the resampling procedure for the IPD-SIDES procedure. Any identified subgroups from the analyses were then summarised using the treatment effect and 95% CI. All analyses were performed using R version 3.0.3 (The R Foundation for Statistical Computing, Vienna, Austria).
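The size of the grid described above can be seen by enumerating it. This sketch only lists the candidate sequences of complexity control parameters; how each sequence was scored during the grid search is not reproduced here, and the variable names are hypothetical.

```python
from itertools import product

# Level 1: 0.2 to 1 in steps of 0.2; levels 2 and 3: 0 to 1 in steps of 0.2.
level_1 = [round(0.2 * i, 1) for i in range(1, 6)]
level_2_3 = [round(0.2 * i, 1) for i in range(0, 6)]

# Every sequence (gamma_1, gamma_2, gamma_3) considered by the grid search.
grid = list(product(level_1, level_2_3, level_2_3))
```

There are five candidate values at the first level and six at each of the second and third levels, giving 5 × 6 × 6 = 180 sequences to evaluate.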
Results
Analysis 1
The intervention (active physical, passive physical or psychological, given either singly or as a combined regimen with the other interventions) against control/placebo data were searched for subgroups for the first set of analyses. Table 18 provides a summary of the trials included and the variables used to search for subgroups for each short-term outcome measure. The number of participants included from each trial depends on the number of complete cases available for each analysis.
Outcome | Trials | Variables |
---|---|---|
Average pain | m = 2; n = 1377: UK BEAM31 (n = 910), BeST33 (n = 467) | Age, sex, anxiety, fear avoidance, MCS, PCS, average pain and RMDQ score at baseline |
EQ-5D | m = 2; n = 1339: UK BEAM31 (n = 883), BeST33 (n = 456) | Age, sex, anxiety, fear avoidance, MCS, PCS, RMDQ and average pain at baseline |
FFbHR | m = 3; n = 3718: Brinkhaus101 (n = 284), Haake132 (n = 1110), Witt50 (n = 2324) | Age, sex, PCS, FFbHR and MCS at baseline |
MCS | m = 3; n = 3630: Brinkhaus101 (n = 281), Haake132 (n = 1110), Witt50 (n = 2239) | Age, sex, FFbHR, MCS and PCS at baseline |
PCS | m = 6; n = 5208: UK BEAM31 (n = 893), BeST33 (n = 470), Brinkhaus101 (n = 281), Haake132 (n = 1110), Witt50 (n = 2248), YACBAC107 (n = 206) | Age, sex, MCS and PCS at baseline |
RMDQ | m = 7; n = 2564: UK BEAM31 (n = 951), BeST33 (n = 488), Hancock131 (n = 235), Pengel103 (n = 236), Smeets70 (n = 212), VKBIA104 (n = 229), VKSC2105 (n = 213) | Age, sex, fear avoidance and RMDQ score at baseline |
QALY | m = 4; n = 1514: UK BEAM31 (n = 728), BeST33 (n = 468), Smeets70 (n = 151), York BP133 (n = 167) | Age and RMDQ score at baseline |
Subgroups identified by the individual patient data interaction tree method
The IPD-IT method did not identify any subgroups that moderate treatment effect when comparing any intervention with usual care control/sham.
Subgroups identified by the individual patient data subgroup identification based on a differential effect search method
The application of the IPD-SIDES method for the first set of analyses found candidate subgroups for three of the short-term outcome measures when comparing intervention with control/placebo (Table 19), namely short-term FFbHR (Figure 21), SF-12/36 MCS (Figure 22) and SF-12/36 PCS (Figure 23). No candidate subgroups were identified for the average pain, EQ-5D and RMDQ short-term outcomes, or for the QALY health outcome measure.
Subgroups | n | Treatment effect (95% CI) | Interaction effect | Unadjusted p-value |
---|---|---|---|---|
Outcome: short-term FFbHR | ||||
Overall treatment effect (95% CI): 8.93 (7.81 to 10.05) | ||||
Candidate 1 | ||||
FFbHR ≤ 54.2 | 1709 | 11.31 (9.38 to 13.23) | 4.69 | < 0.001 |
FFbHR > 54.2 | 2009 | 6.62 (5.46 to 7.78) | ||
Candidate 2 | ||||
FFbHR ≤ 54.2 and Age ≤ 60 | 1043 | 13.17 (10.56 to 15.77) | 5.03 | 0.019 |
FFbHR ≤ 54.2 and Age > 60 | 666 | 8.14 (5.47 to 10.80) | ||
Candidate 3 | ||||
FFbHR ≤ 54.2 and Age ≤ 66 | 1367 | 12.26 (10.06 to 14.46) | 5.14 | 0.043 |
FFbHR ≤ 54.2 and Age > 66 | 342 | 7.12 (3.42 to 10.82) | ||
Outcome: short-term MCS | ||||
Overall treatment effect (95% CI): 2.61 (1.92 to 3.29) | ||||
Candidate 1 | ||||
MCS ≤ 54.4 | 2541 | 3.46 (2.62 to 4.30) | 2.62 | 0.002 |
MCS > 54.4 | 1089 | 0.84 (0.01 to 1.67) | ||
Outcome: short-term PCS | ||||
Overall treatment effect (95% CI): 3.48 (3.01 to 3.96) | ||||
Candidate 1 | ||||
MCS > 50.9 | 2082 | 4.09 (3.32 to 4.87) | 0.97 | 0.033 |
MCS ≤ 50.9 | 3126 | 3.12 (2.54 to 3.71) | ||
Candidate 2 | ||||
MCS > 50.9 and Sex = Female | 1125 | 4.72 (3.67 to 5.78) | 1.38 | 0.097 |
MCS > 50.9 and Sex = Male | 957 | 3.34 (2.20 to 4.48) | ||
Candidate 3 | ||||
MCS > 50.9 and PCS ≤ 43.2 | 1666 | 4.62 (3.75 to 5.49) | 2.61 | 0.020 |
MCS > 50.9 and PCS > 43.2 | 416 | 2.01 (0.69 to 3.33) | ||
Candidate 4 | ||||
MCS > 50.9 and PCS ≤ 40.0 | 1457 | 4.89 (3.96 to 5.82) | 2.61 | 0.007 |
MCS > 50.9 and PCS > 40.0 | 625 | 2.28 (1.12 to 3.44) |
Short-term Hannover Functional Ability Questionnaire for measuring back pain-related functional limitations outcome
For the short-term FFbHR outcome, five variables were included in the IPD-SIDES analyses. The overall treatment effect for the FFbHR outcome was 8.93 (95% CI 7.81 to 10.05). Three candidate subgroups with enhanced treatment effect were identified by the IPD-SIDES procedure. Those with baseline FFbHR score of ≤ 54.2 had a treatment effect of 11.31 (95% CI 9.38 to 13.23), those with baseline FFbHR score of ≤ 54.2 and age ≤ 60 years had a treatment effect of 13.17 (95% CI 10.56 to 15.77) and those with FFbHR score of ≤ 54.2 and age ≤ 66 years had a treatment effect of 12.26 (95% CI 10.06 to 14.46).
-
Those with more disability at baseline and who are younger are likely to gain a greater benefit on disability.
Short-term mental component scale of SF-12/36 outcome
For the short-term MCS outcome, five variables were included in the IPD-SIDES analyses. The overall treatment effect for the MCS outcome was 2.61 (95% CI 1.92 to 3.29). Only one candidate subgroup was identified for MCS outcome. Those with baseline MCS of ≤ 54.4 had a treatment effect of 3.46 (95% CI 2.62 to 4.30).
-
Those with more psychological distress at baseline will get better outcomes on psychological distress.
Short-term physical component scale of SF-12/36 outcome
For the short-term PCS outcome, four variables were included in the analyses and four candidate subgroups were identified. The overall treatment effect for the PCS outcome was 3.48 (95% CI 3.01 to 3.96). Those with baseline MCS of > 50.9 had a treatment effect of 4.09 (95% CI 3.32 to 4.87), those with baseline MCS of > 50.9 and female had a treatment effect of 4.72 (95% CI 3.67 to 5.78), those with baseline MCS of > 50.9 and baseline PCS of ≤ 43.2 had a treatment effect of 4.62 (95% CI 3.75 to 5.49) and, finally, those with baseline MCS of > 50.9 and baseline PCS of ≤ 40.0 had a treatment effect of 4.89 (95% CI 3.96 to 5.82).
-
Those with less psychological distress and worse physical status will get better outcomes on physical status.
-
Women with low levels of psychological distress will get better outcomes on physical status.
These analyses do not consider any differences between different treatment approaches.
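The interaction effects reported for each candidate subgroup appear to equal the difference between the treatment effects in the two complementary groups formed by the split. The sketch below reproduces that difference for the first FFbHR candidate and adds a rough Wald-type check based on the reported 95% CIs. The check is an approximation introduced here for illustration: it assumes normality and independent groups, and it will not in general reproduce the model-based p-values reported in the tables. The function names are hypothetical.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Approximate standard error recovered from a 95% confidence interval."""
    return (upper - lower) / (2 * z)


def interaction_from_subgroups(est_a, ci_a, est_b, ci_b):
    """Difference in treatment effect between two complementary subgroups,
    with an approximate z-statistic and two-sided p-value."""
    diff = est_a - est_b
    se = math.hypot(se_from_ci(*ci_a), se_from_ci(*ci_b))
    z = diff / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return diff, z, p


# FFbHR candidate 1: baseline FFbHR <= 54.2 against > 54.2.
diff, z, p = interaction_from_subgroups(11.31, (9.38, 13.23), 6.62, (5.46, 7.78))
```

For this candidate the difference is 4.69, matching the tabulated interaction effect, and the approximate p-value is well below 0.001.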
Analysis 2: pairwise comparisons
Each of the subgrouped interventions (active physical, passive physical, psychological or sham) against non-active usual care data were searched for subgroups for the second set of analyses. Table 20 provides a summary of the trials included and the variables used to search for subgroups for each short-term outcome measure analysed for the different comparisons.
Comparison | Short-term outcome measures | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
 | FFbHR: trials | FFbHR: variables | RMDQ: trials | RMDQ: variables | MCS: trials | MCS: variables | PCS: trials | PCS: variables | QALY: trials | QALY: variables |
Active vs. non-active usual care | – | – | m = 2; n = 576: UK BEAM31 (n = 421), Smeets70 (n = 155) | Fear avoidance, age, sex, RMDQ, average pain today, EQ-5D, HADS anxiety, HADS depression | – | – | – | – | m = 2; n = 496: UK BEAM31 (n = 329), York BP133 (n = 167) | Age, RMDQ |
Passive vs. non-active usual care | m = 3; n = 3272: Brinkhaus101 (n = 214), Haake132 (n = 734), Witt50 (n = 2324) | Age, PCS, FFbHR, sex, MCS | – | – | m = 5; n = 3879: UK BEAM31 (n = 479), Brinkhaus101 (n = 212), Haake132 (n = 734), Witt50 (n = 2248), YACBAC107 (n = 206) | MCS, age, sex, PCS | m = 5; n = 3879: UK BEAM31 (n = 479), Brinkhaus101 (n = 212), Haake132 (n = 734), Witt50 (n = 2248), YACBAC107 (n = 206) | Age, MCS, PCS, sex | m = 3; n = 1209: UK BEAM31 (n = 379), Haake132 (n = 716), YACBAC107 (n = 114) | Age, PCS |
Psychological vs. non-active usual care | – | – | m = 3; n = 928: BeST33 (n = 487), VKBIA104 (n = 229), VKSC2105 (n = 212) | Fear avoidance, age, sex, RMDQ, average pain today | – | – | – | – | – | – |
Sham vs. non-active usual care | m = 2; n = 881: Brinkhaus101 (n = 144), Haake132 (n = 737) | Age, PCS, FFbHR, sex, MCS | – | – | m = 2; n = 879: Brinkhaus101 (n = 142), Haake132 (n = 737) | MCS, age, sex, PCS | m = 2; n = 879: Brinkhaus101 (n = 142), Haake132 (n = 737) | Age, MCS, PCS, sex | – | – |
Subgroups identified by the ‘individual patient data interaction tree’ method
The IPD-IT method did not identify any subgroups that moderate treatment effect when comparing any of the subgrouped interventions against non-active usual care.
Subgroups identified by the ‘individual patient data subgroup identification based on a differential effect search’ method
The application of the IPD-SIDES method for the second set of analyses found candidate subgroups for one or more short-term outcome measures for the passive physical against non-active usual care (Table 21), psychological against non-active usual care (Table 22) and sham against non-active usual care (Table 23). No candidate subgroups were identified for the active physical against non-active usual care comparison.
Subgroups | n | Treatment effect (95% CI) | Interaction effect | Unadjusted p-value |
---|---|---|---|---|
Outcome: short-term FFbHR | ||||
Overall treatment effect (95% CI): 9.95 (8.80 to 11.11) | ||||
Candidate 1 | ||||
FFbHR score of ≤ 54.2 | 1424 | 12.86 (10.81 to 14.91) | 5.45 | < 0.001 |
FFbHR score of > 54.2 | 1848 | 7.41 (6.23 to 8.59) | ||
Candidate 2 | ||||
FFbHR score of ≤ 54.2 and age ≤ 57 years | 731 | 15.86 (12.80 to 18.92) | 6.63 | 0.002 |
FFbHR score of ≤ 54.2 and age > 57 years | 693 | 9.23 (6.64 to 11.82) | ||
Candidate 3 | ||||
FFbHR score of ≤ 54.2 and age ≤ 53 years | 571 | 16.67 (13.16 to 20.18) | 6.85 | 0.001 |
FFbHR score of ≤ 54.2 and age > 53 years | 853 | 9.83 (7.43 to 12.22) | ||
Candidate 4 | ||||
FFbHR score of ≤ 41.7 | 792 | 15.03 (12.06 to 18.01) | 6.71 | < 0.001 |
FFbHR score of > 41.7 | 2480 | 8.32 (7.19 to 9.45) | ||
Outcome: short-term MCS | ||||
Overall treatment effect (95% CI): 2.96 (2.31 to 3.61) | ||||
Candidate 1 | ||||
MCS of ≤ 54.3 | 2714 | 3.76 (2.97 to 4.55) | 2.82 | < 0.001 |
MCS of > 54.3 | 1165 | 0.93 (0.10 to 1.76) | ||
Candidate 2 | ||||
MCS of ≤ 54.3 and PCS ≤ 43.9 | 2171 | 4.27 (3.39 to 5.15) | 2.43 | 0.019 |
MCS of ≤ 54.3 and PCS > 43.9 | 543 | 1.85 (0.11 to 3.59) | ||
Candidate 3 | ||||
MCS of ≤ 51.3 | 2327 | 3.83 (2.96 to 4.70) | 2.57 | < 0.001 |
MCS of > 51.3 | 1552 | 1.26 (0.52 to 1.99) | ||
Outcome: short-term PCS | ||||
Overall treatment effect (95% CI): 4.10 (3.56 to 4.63) | ||||
Candidate 1 | ||||
PCS of ≤ 43.6 | 3103 | 4.39 (3.78 to 4.99) | 1.61 | 0.013 |
PCS of > 43.6 | 776 | 2.77 (1.87 to 3.67) | ||
Candidate 2 | ||||
PCS of ≤ 43.6 and age ≤ 44 years | 942 | 5.35 (4.21 to 6.49) | 1.45 | 0.040 |
PCS of ≤ 43.6 and age > 44 years | 2161 | 3.90 (3.20 to 4.60) | ||
Candidate 3 | ||||
PCS of ≤ 37.8 | 2326 | 4.61 (3.90 to 5.32) | 1.23 | 0.025 |
PCS of > 37.8 | 1553 | 3.37 (2.66 to 4.09) | ||
Candidate 4 | ||||
PCS of ≤ 37.8 and age ≤ 62 years | 1682 | 5.08 (4.21 to 5.94) | 1.97 | 0.016 |
PCS of ≤ 37.8 and age > 62 years | 644 | 3.11 (1.94 to 4.28) | ||
Candidate 5 | ||||
PCS of ≤ 37.8 and MCS > 44.0 | 1396 | 5.48 (4.55 to 6.41) | 1.80 | 0.011 |
PCS of ≤ 37.8 and MCS ≤ 44.0 | 930 | 3.68 (2.64 to 4.71) | ||
Candidate 6 | ||||
PCS of ≤ 37.8 and MCS > 51.8 | 932 | 5.77 (4.63 to 6.91) | 1.78 | 0.012 |
PCS of ≤ 37.8 and MCS ≤ 51.8 | 1394 | 3.99 (3.11 to 4.87) | ||
Candidate 7 | ||||
PCS of ≤ 37.8 and MCS > 51.8 and sex = female | 520 | 6.64 (5.12 to 8.16) | 1.73 | 0.167 |
PCS of ≤ 37.8 and MCS > 51.8 and sex = male | 412 | 4.91 (3.17 to 6.65) | ||
Candidate 8 | ||||
PCS of ≤ 40.3 | 2715 | 4.51 (3.85 to 5.16) | 1.61 | 0.006 |
PCS of > 40.3 | 1164 | 2.90 (2.11 to 3.68) | ||
Candidate 9 | ||||
PCS of ≤ 40.3 and MCS > 51.5 | 1086 | 5.43 (4.37 to 6.48) | 1.38 | 0.042 |
PCS of ≤ 40.3 and MCS ≤ 51.5 | 1629 | 4.05 (3.24 to 4.85) |
Subgroups | n | Treatment effect (95% CI) | Interaction effect | Unadjusted p-value |
---|---|---|---|---|
Outcome: short-term RMDQ | ||||
Overall treatment effect (95% CI): 1.40 (0.89 to 1.91) | ||||
Candidate 1 | ||||
RMDQ score of > 4 | 697 | 1.72 (1.12 to 2.31) | 1.07 | 0.038 |
RMDQ score of ≤ 4 | 231 | 0.65 (–0.11 to 1.40) |
Subgroups | n | Treatment effect (95% CI) | Interaction effect | Unadjusted p-value |
---|---|---|---|---|
Outcome: short-term MCS | ||||
Overall treatment effect (95% CI): 2.59 (1.13 to 4.04) | ||||
Candidate 1 | ||||
Age ≤ 65 years | 705 | 3.42 (1.80 to 5.04) | 4.32 | 0.019 |
Age > 65 years | 174 | –0.90 (–4.16 to 2.35) | ||
Candidate 2 | ||||
PCS of ≤ 42.0 | 791 | 3.10 (1.55 to 4.65) | 4.99 | 0.043 |
PCS of > 42.0 | 88 | –1.89 (–6.07 to 2.28) |
Passive physical results compared with non-active usual care results
Short-term Hannover Functional Ability Questionnaire for measuring back pain-related functional limitations outcome
The overall treatment effect for the FFbHR short-term outcome was 9.95 (95% CI 8.80 to 11.11). Four candidate subgroups were identified for the FFbHR short-term outcome. Those with a baseline FFbHR score of ≤ 54.2 had a treatment effect of 12.86 (95% CI 10.81 to 14.91), those with a baseline FFbHR score of ≤ 54.2 and age ≤ 57 years had a treatment effect of 15.86 (95% CI 12.80 to 18.92), those with a FFbHR score of ≤ 54.2 and age ≤ 53 years had a treatment effect of 16.67 (95% CI 13.16 to 20.18) and those with a baseline FFbHR score of ≤ 41.7 had a treatment effect of 15.03 (95% CI 12.06 to 18.01).
-
Overall, those with more disability and who are younger are likely to gain a greater benefit on disability from passive physical treatments.
Short-term mental component score of SF-12/36 outcome
The overall treatment effect for the SF-12/36 MCS short-term outcome was 2.96 (95% CI 2.31 to 3.61). Three candidate subgroups were identified for the MCS short-term outcome. Those with a baseline MCS of ≤ 54.3 had a treatment effect of 3.76 (95% CI 2.97 to 4.55), those with a MCS of ≤ 54.3 and PCS of ≤ 43.9 had a treatment effect of 4.27 (95% CI 3.39 to 5.15) and those with a MCS of ≤ 51.3 had a treatment effect of 3.83 (95% CI 2.96 to 4.70).
-
These results suggest that those with more psychological distress and worse physical status at baseline will get better outcomes on psychological distress from passive physical treatments.
Short-term physical component score of SF-12/36 outcome
The overall treatment effect for the SF-12/36 PCS short-term outcome was 4.10 (95% CI 3.56 to 4.63). Nine candidate subgroups were identified for the PCS short-term outcome. Those with a baseline PCS of ≤ 43.6 had a treatment effect of 4.39 (95% CI 3.78 to 4.99), those with a baseline PCS of ≤ 43.6 and age ≤ 44 years had a treatment effect of 5.35 (95% CI 4.21 to 6.49), those with a baseline PCS of ≤ 37.8 had a treatment effect of 4.61 (95% CI 3.90 to 5.32), those with a PCS of ≤ 37.8 and age ≤ 62 years had a treatment effect of 5.08 (95% CI 4.21 to 5.94), those with a baseline PCS of ≤ 37.8 and a MCS of > 44.0 had a treatment effect of 5.48 (95% CI 4.55 to 6.41), those with a PCS of ≤ 37.8 and a MCS of > 51.8 had a treatment effect of 5.77 (95% CI 4.63 to 6.91), those with a PCS of ≤ 37.8, a MCS of > 51.8 and female had a treatment effect of 6.64 (95% CI 5.12 to 8.16), those with PCS ≤ 40.3 had a treatment effect of 4.51 (95% CI 3.85 to 5.16) and, finally, those with a PCS of ≤ 40.3 and a MCS of > 51.5 had a treatment effect of 5.43 (95% CI 4.37 to 6.48). Broadly speaking, these results suggest that:
-
Younger patients with worse physical status at baseline will get better outcomes on physical status from passive physical treatments.
-
Those with worse physical status but less psychological distress at baseline will get better outcomes on physical status from passive physical treatments.
-
Females with worse physical status and less psychological distress at baseline will get better outcomes on physical status from passive physical treatments.
Psychological results compared with non-active usual care results
Short-term Roland–Morris Disability Questionnaire outcome
The overall treatment effect for the RMDQ short-term outcome was 1.40 (95% CI 0.89 to 1.91). One candidate subgroup was identified for the RMDQ short-term outcome. Those with a baseline RMDQ score of > 4 had a treatment effect of 1.72 (95% CI 1.12 to 2.31).
-
This suggests that those with worse disability at baseline gain more benefit from psychological treatment on disability than usual care control.
Sham results compared with non-active usual care results
Short-term mental component score of SF-12/36 outcome
Two trials were included in the analyses and the sham treatment in both was sham acupuncture. The overall treatment effect for the MCS short-term outcome was 2.59 (95% CI 1.13 to 4.04). Two candidate subgroups were identified for the MCS short-term outcome. Those with age ≤ 65 years at baseline had a treatment effect of 3.42 (95% CI 1.80 to 5.04) and those with a baseline PCS of ≤ 42.0 had a treatment effect of 3.10 (95% CI 1.55 to 4.65). No candidate subgroups were identified for the FFbHR and PCS short-term outcomes.
-
This suggests that younger people and those with worse physical status at baseline have a greater benefit from sham treatment on psychological distress than usual care control.
Chapter 8 Methodology and statistical developments 2: subgroup identification using an adaptive refinement by directed peeling algorithm
Background
The adaptive risk group refinement introduced by LeBlanc et al. 172 aims to identify subgroups of participants with poor prognosis, whereby the subgroups are defined by cut-offs for the covariates, resulting in box-shaped subgroups that are easy to interpret. The approach is based on a so-called ‘adaptive refinement by directed peeling’ (ARDP) algorithm. Starting with the whole data set, the algorithm peels off fractions of the data in a series of locally optimal steps, optimising a prognostic indicator (e.g. median survival in the paper by LeBlanc et al. 172). Our aim is different: we seek to identify subgroups of participants who respond particularly well to a specific treatment. The approach to subgroup identification presented in this chapter builds on the work by LeBlanc et al. 172 and extends it in two ways: (1) the criterion for optimisation is now based on the interaction effects between treatment and subgroup, and (2) data from multiple trials can now be analysed, allowing for between-trial heterogeneity in the treatment-by-subgroup interactions, thereby generalising the ARDP algorithm from a single-study setting to an individual participant data meta-analysis setting. With regard to the latter, this is similar to the IT and SIDES methods (see Chapter 7). In the following sections we describe the modified ARDP algorithm for individual participant data meta-analysis.
Adaptive refinement by directed peeling in individual patient data meta-analysis
The ARDP in individual patient data meta-analysis (ARDP-MA) algorithm to construct a region that predicts the best or worst response to treatment consists of the following steps:
-
To determine the covariates to be included and their direction of peeling, run regression analyses on the entire data set to investigate interactions of covariates with treatment. For the identified moderators, the sign of the interaction effect determines the direction of peeling. If larger values of a covariate lead to larger treatment effects then peel off the cases with a smaller value of this covariate. Correspondingly, if smaller values of the covariate lead to larger treatment effects then peel off the larger values of the covariate.
-
Start with a ‘subgroup’ B0 that includes all observations, n.
-
The proportion of data to be removed in one step is denoted by α and the minimum number of observations to be peeled off is denoted by nmin. For each variable, we move the threshold so that max(αn, nmin) observations are removed; we denote the resulting subgroups for the L covariates by Bjm, j = 1, …, L. For each subgroup Bjm, calculate the treatment-by-subgroup interaction effect and select the Bjm that gives the largest improvement in the interaction effect in comparison with the previous iteration, standardised by the change in subgroup size. In the setting of data from multiple trials, the interaction effects estimated from the individual trials are combined in a random-effects meta-analysis (two-stage procedure); alternatively, an equivalent hierarchical model can be fitted (one-stage procedure).
-
The selected subgroup is then called Bm+1.
-
Estimate the treatment effects for the outcome of interest for subgroup Bm+1.
-
Repeat steps 3–5 for as long as the size of the remaining region is not smaller than r, the prespecified minimum subgroup size.
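The steps above can be sketched as a short greedy loop. The Python sketch below is illustrative only and is not the code used in this programme. It handles a single trial, uses the within-subgroup difference in mean outcome as a stand-in for the treatment-by-subgroup interaction effect, takes α as a fraction of the current subgroup, and ignores ties at the threshold. The function names, the data layout and the toy data are all assumptions.

```python
from statistics import mean

def treatment_effect(rows):
    """Mean outcome in treated minus mean outcome in control (None if an arm is empty)."""
    treated = [r["y"] for r in rows if r["treat"] == 1]
    control = [r["y"] for r in rows if r["treat"] == 0]
    if not treated or not control:
        return None
    return mean(treated) - mean(control)

def peel(rows, var, direction, k):
    """Drop the k rows with the smallest (direction > 0) or largest (direction < 0) values of var."""
    ordered = sorted(rows, key=lambda r: r[var], reverse=(direction < 0))
    return ordered[k:]

def ardp(rows, directions, alpha=0.1, n_min=5, r=0.1):
    """Greedy directed peeling; returns the final subgroup and the (size, effect) trajectory."""
    n_total = len(rows)
    current = list(rows)
    effect = treatment_effect(current)
    if effect is None:
        raise ValueError("both arms must be present in the starting subgroup B0")
    trajectory = [(1.0, effect)]
    while True:
        k = max(int(alpha * len(current)), n_min)   # step 3: remove max(alpha * n, n_min)
        if len(current) - k < r * n_total:          # step 6: stop before falling below r
            break
        best = None
        for var, direction in directions.items():
            candidate = peel(current, var, direction, k)
            cand_effect = treatment_effect(candidate)
            if cand_effect is None:
                continue
            # improvement standardised by the change in subgroup size
            gain = (cand_effect - effect) / (len(current) - len(candidate))
            if best is None or gain > best[0]:
                best = (gain, candidate, cand_effect)
        if best is None:
            break
        _, current, effect = best                   # step 4: the selected subgroup
        trajectory.append((len(current) / n_total, effect))  # step 5: record its effect
    return current, trajectory

# Toy data (hypothetical): the benefit of treatment grows with x, whereas z is
# given a deliberately unhelpful peeling direction so that it is never chosen.
rows = [{"x": i, "z": 99 - i, "treat": i % 2, "y": (i % 2) * i / 10}
        for i in range(100)]
subgroup, trajectory = ardp(rows, directions={"x": 1, "z": 1})
```

Plotting the second element of each `trajectory` entry against the first gives a trajectory plot of the kind shown in the figures of this chapter. In a multi-trial setting, `treatment_effect` would be replaced by a pooled interaction estimate.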
Figure 24 illustrates the ARDP algorithm for the identification of subgroups of treatment responders. Expecting a large number of covariates to be included in the analyses, we developed this algorithm earlier on in the project. However, it turned out that situations with a small number of covariates were most relevant for the data sets to be analysed. By restricting the number of covariates to four, we could do far more extensive searches by considering all of the possible combinations of boxes described in the ARDP algorithm above. This allowed us to interrogate the data sets more thoroughly.
Note that this algorithm can be applied to various kinds of end points, as we assume only that an appropriate regression model can be fitted to the outcome. For instance, Gaussian linear models could be applied to continuous outcomes, logistic regression to binary outcomes, and Cox’s proportional hazards models to time-to-event data. No distributional assumption regarding the covariates is required, but they should be ordinal and have a sufficient number of possible values so that peeling in several steps makes sense. If a covariate is not ordinal then an order could be imposed on it by ordering its categories by the regression coefficients estimated in step 1 of the algorithm. 172
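The two-stage procedure mentioned in step 3 combines trial-level interaction estimates in a random-effects meta-analysis. The report does not state which estimator of the between-trial variance was used, so the sketch below assumes the common DerSimonian–Laird moment estimator. The function name and the input values are hypothetical.

```python
def pool_random_effects(estimates, variances):
    """DerSimonian-Laird random-effects pooling of per-trial estimates.

    Returns (pooled estimate, standard error, between-trial variance tau^2).
    """
    w = [1.0 / v for v in variances]                      # inverse-variance weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran's Q
    df = len(estimates) - 1
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0       # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, se, tau2

# Hypothetical interaction estimates from two trials, each with variance 1.
pooled, se, tau2 = pool_random_effects([1.0, 3.0], [1.0, 1.0])
```

In the peeling loop, a function of this kind would be called once per candidate subgroup, after the interaction effect has been estimated separately within each trial.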
Analyses
The minimum sample size of the subpopulation was defined as r = 0.10 of the population analysed. The appeal of the ARDP-MA method is the ability to remove a small proportion of participants at each iteration. Categorical covariates that delineate participants into three or fewer categories would cause the ARDP-MA method to remove a large proportion of participants, an unappealing feature. As all the categorical covariates identified in the analyses of covariance have three or fewer categories, none of them was considered in the ARDP-MA analyses.
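The reason for excluding covariates with few categories can be seen with a small calculation. The sketch below is illustrative, with made-up counts and a hypothetical function name: peeling the lowest level of a three-level covariate removes that whole level at once, whereas a covariate with many distinct values allows much smaller steps.

```python
from collections import Counter

def smallest_peel_fraction(values):
    """Fraction of participants removed when the lowest level of a covariate is peeled off."""
    counts = Counter(values)
    lowest = min(counts)
    return counts[lowest] / len(values)

# Hypothetical covariates for 1000 participants.
three_level = [1] * 300 + [2] * 400 + [3] * 300   # e.g. an ordered categorical covariate
continuous = list(range(1000))                    # every participant has a distinct value

frac_categorical = smallest_peel_fraction(three_level)
frac_continuous = smallest_peel_fraction(continuous)
```

With these made-up counts a single peel of the categorical covariate removes 30% of participants, far more than a small peeling proportion α would allow.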
Similar to the analyses seen in Chapter 7 (see Analyses), two sets of analyses were performed. The first was to confirm proof of concept: all interventions (active physical, passive physical and psychological, delivered singly or in combination with the others) were grouped together as one arm, and non-active usual care was grouped with sham as a control/placebo arm. Analyses were performed for these measurements: average pain, EQ-5D, FFbHR, MCS of SF-12/36, PCS of SF-12/36 and RMDQ score. The outcome was the absolute change from baseline to short-term follow-up. In the second set of analyses, two treatments were similarly compared; the pairwise comparisons investigated were active physical against non-active usual care, passive physical against non-active usual care, psychological against non-active usual care, and sham against non-active usual care.
Results
We programmed the ARDP-MA method to do a full search, but this limits the number of covariates that can be included. As the number of covariates increased, the computational time and the resources needed to store the data increased exponentially, placing a heavy strain on the server. Therefore, the analyses were restricted to a maximum of four covariates.
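The exponential growth follows from the way a full search enumerates boxes: every combination of one cut-off per covariate is a candidate. The sketch below is a minimal illustration, not the programme's code. The function names are hypothetical, and the assumption of 10 cut-offs per covariate is for illustration only.

```python
from itertools import product

def enumerate_boxes(cutoffs):
    """Yield every box, i.e. every combination of one cut-off per covariate."""
    names = list(cutoffs)
    for combo in product(*(cutoffs[name] for name in names)):
        yield dict(zip(names, combo))

def n_boxes(n_cutoffs, n_covariates):
    """Number of boxes when each covariate has the same number of cut-offs."""
    return n_cutoffs ** n_covariates

# Two covariates with a few illustrative cut-offs each.
cutoffs = {"age": [50, 67, 91], "pcs": [31.34, 33.62]}
boxes = list(enumerate_boxes(cutoffs))

# Assumed 10 cut-offs per covariate: the count multiplies by 10 with each added covariate.
growth = [n_boxes(10, p) for p in range(1, 7)]
```

Each box requires its own interaction estimate, so the cost of the search scales with the number of boxes, which is why capping the number of covariates keeps the search feasible.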
Analysis 1: overall comparison treatment compared with control
Table 24 shows the summary of the trials and continuous variables used in the ARDP-MA algorithm to construct a region that predicts the best or worst response for each of the short-term outcome measures.
Outcomeb | Trials | Variables |
---|---|---|
Average pain | m = 3; n = 2534: UK BEAM31 (n = 926), BeST33 (n = 498), Haake (n = 1110) | Age, average pain, PCS and MCS at baseline |
EQ-5D | m = 2; n = 1365: UK BEAM31 (n = 890), BeST33 (n = 475) | RMDQ, average pain, PCS and MCS at baseline |
FFbHR | m = 3; n = 3718: Brinkhaus101 (n = 284), Haake132 (n = 1110), Witt50 (n = 2324) | Age, FFbHR, PCS and MCS at baseline |
MCSc | m = 3; n = 3630: Brinkhaus101 (n = 281), Haake132 (n = 1110), Witt50 (n = 2239) | Age, FFbHR, PCS and MCS at baseline |
PCSd | m = 6; n = 5208: UK BEAM31 (n = 893), BeST33 (n = 470), Brinkhaus101 (n = 281), Haake132 (n = 1110), Witt50 (n = 2248), YACBAC107 (n = 206) | Age, PCS and MCS at baseline |
RMDQ | m = 8; n = 2675: UK BEAM31 (n = 995), BeST33 (n = 514), Hancock131 (n = 235), Kennedy136 (n = 40), Pengel103 (n = 236), Smeets70 (n = 212), VKBIA104 (n = 230), VKSC2105 (n = 213) | Age and RMDQ score at baseline |
Short-term average pain outcome
Figure 25 shows the trajectory plot for the treatment effect for the short-term outcome of average pain. The treatment effect increased as more and more participants were excluded from the subgroup. However, Table 25 shows that age and average pain might not be important covariates in improving the treatment effect, as their thresholds fluctuate. Of note, participants with substantial physical limitation (low PCS) seemed to gain greater benefit in short-term average pain.
Subgroup size | Age, years (<) | Pain (>) | PCSa (<) | MCSb (>) | Treatment effect |
---|---|---|---|---|---|
0.106c | 50 | 50 | 33.62 | 38.21 | 14.04 |
0.206 | 67 | 50 | 31.34 | 28.93 | 13.18 |
0.217 | 67 | 40 | 31.34 | 28.93 | 12.10 |
0.227 | 67 | 0 | 31.34 | 28.93 | 13.48 |
0.238 | 62 | 50 | 33.62 | 28.93 | 11.49 |
0.247 | 91 | 50 | 31.34 | 28.93 | 13.22 |
0.255 | 91 | 0 | 31.34 | 34.18 | 11.86 |
0.262 | 91 | 40 | 31.34 | 28.93 | 12.38 |
0.275 | 91 | 0 | 31.34 | 28.93 | 13.08 |
0.285 | 91 | 40 | 31.34 | 9.46 | 10.81 |
0.300 | 91 | 0 | 31.34 | 9.46 | 12.23 |
0.307 | 67 | 30 | 33.62 | 28.93 | 10.11 |
0.402 | 91 | 50 | 35.66 | 28.93 | 11.20 |
0.414 | 67 | 50 | 47.59 | 38.21 | 9.39 |
0.426 | 91 | 20 | 40.45 | 42.95 | 9.77 |
0.434 | 67 | 20 | 43.62 | 42.95 | 10.07 |
0.442 | 67 | 0 | 43.62 | 42.95 | 10.34 |
0.459 | 91 | 30 | 35.66 | 28.93 | 9.30 |
0.501 | 91 | 0 | 43.62 | 42.95 | 9.58 |
0.600 | 91 | 0 | 40.45 | 34.18 | 8.76 |
0.710 | 67 | 40 | 47.59 | 9.46 | 7.72 |
0.804 | 91 | 30 | 47.59 | 28.93 | 8.23 |
Short-term European Quality of Life-5 Dimensions outcome
Figure 26 shows the trajectory plot for the short-term outcome of health utility measured by the EQ-5D. As seen in Table 26, approximately 90% of the initial 1365 participants (corresponding to PCS of < 68 and MCS of < 60, regardless of the average pain and RMDQ score at baseline) had an average treatment effect of 0.073. The treatment effect increased sharply to 0.100 after approximately 30% of the participants were excluded from the model. From then on the treatment effect was quite ‘stable’, despite a further 40% of participants being excluded from the analysis. There was a marked increase in treatment effect for about 20% of the population (corresponding to PCS of < 31, MCS of < 72, average pain > 0 and RMDQ score of > 6), for whom the average treatment effect was about 0.160.
Subgroup size | PCSa (<) | MCSb (<) | Pain (>) | RMDQ (>) | Treatment effect |
---|---|---|---|---|---|
0.101c | 35.66 | 60.35 | 0.00 | 14 | 0.208 |
0.119 | 38.01 | 60.35 | 0.00 | 14 | 0.196 |
0.127 | 38.01 | 72.11 | 0.00 | 14 | 0.185 |
0.136 | 47.59 | 60.35 | 0.00 | 14 | 0.185 |
0.144 | 47.59 | 72.11 | 0.00 | 14 | 0.174 |
0.151 | 31.34 | 56.82 | 0.00 | 0 | 0.170 |
0.166 | 31.34 | 60.35 | 0.00 | 6 | 0.158 |
0.171 | 31.34 | 60.35 | 0.00 | 0 | 0.153 |
0.188 | 31.34 | 72.11 | 20.00 | 6 | 0.157 |
0.190 | 31.34 | 72.11 | 0.00 | 6 | 0.160 |
0.210 | 33.62 | 56.82 | 0.00 | 6 | 0.134 |
0.219 | 40.45 | 47.17 | 20.00 | 10 | 0.125 |
0.221 | 40.45 | 47.17 | 0.00 | 10 | 0.127 |
0.233 | 33.62 | 60.35 | 0.00 | 6 | 0.126 |
0.244 | 38.01 | 47.17 | 0.00 | 6 | 0.124 |
0.259 | 33.62 | 72.11 | 30.00 | 6 | 0.122 |
0.267 | 33.62 | 72.11 | 0.00 | 6 | 0.124 |
0.303 | 40.45 | 47.17 | 0.00 | 6 | 0.123 |
0.407 | 67.75 | 72.11 | 57.00 | 0 | 0.106 |
0.415 | 43.62 | 50.61 | 20.00 | 6 | 0.095 |
0.429 | 40.45 | 56.82 | 30.00 | 6 | 0.099 |
0.437 | 38.01 | 72.11 | 20.00 | 6 | 0.099 |
0.446 | 40.45 | 56.82 | 0.00 | 6 | 0.106 |
0.451 | 47.59 | 50.61 | 30.00 | 6 | 0.094 |
0.464 | 47.59 | 72.11 | 50.00 | 6 | 0.102 |
0.477 | 40.45 | 60.35 | 20.00 | 6 | 0.102 |
0.482 | 40.45 | 60.35 | 0.00 | 6 | 0.103 |
0.498 | 43.62 | 56.82 | 30.00 | 6 | 0.093 |
0.505 | 40.45 | 72.11 | 30.00 | 6 | 0.098 |
0.512 | 47.59 | 56.82 | 20.00 | 7 | 0.099 |
0.530 | 40.45 | 72.11 | 0.00 | 6 | 0.100 |
0.540 | 47.59 | 53.87 | 0.00 | 6 | 0.095 |
0.541 | 67.75 | 60.35 | 40.00 | 6 | 0.095 |
0.552 | 47.59 | 56.82 | 30.00 | 6 | 0.099 |
0.570 | 43.62 | 60.35 | 0.00 | 6 | 0.100 |
0.574 | 67.75 | 56.82 | 30.00 | 6 | 0.097 |
0.581 | 47.59 | 56.82 | 20.00 | 6 | 0.102 |
0.593 | 47.59 | 56.82 | 0.00 | 6 | 0.103 |
0.610 | 67.75 | 56.82 | 20.00 | 6 | 0.099 |
0.704 | 47.59 | 60.35 | 20.00 | 5 | 0.085 |
0.803 | 47.59 | 60.35 | 0.00 | 0 | 0.080 |
0.909 | 67.75 | 60.35 | 0.00 | 0 | 0.073 |
Short-term Hannover Functional Ability Questionnaire outcome
Figure 27 shows the trajectory plot for the treatment effect against the size of the constructed region for the change of FFbHR score between baseline and short-term follow-up. In the first iteration, approximately 10% of the initial 3718 participants were excluded from the subgroup box and these participants had a high value of PCS at baseline, that is, the remaining 90% in the subgroup correspond to any age, FFbHR score of < 100, PCS of < 48 and MCS of < 72. The average treatment effect was 8.5 (Table 27). The average treatment effect increased as more participants were excluded from the subgroup box. The average treatment effect for the last 10% of the participants (corresponding to any age, FFbHR score of < 29, PCS of < 68 and MCS of < 57) was 16.8. Although an increase of 8 units of the FFbHR score may be of clinical importance, the proportion of participants who would benefit from such improvement is very small. Nevertheless, those with more functional limitation (greater disability) and more psychological distress would benefit more on the FFbHR disability outcome at short term. If we were interested in an improvement from an average of 8.5 to at least 12 then approximately 30% of the participants (age < 67 years, FFbHR score of < 54, PCS of < 40 and MCS of < 72) would benefit more on the disability outcome at short term, a similar result to that observed in the IPD-SIDES Analysis 1, for which participants with FFbHR score of ≤ 54.2 and age ≤ 66 years had an enhanced treatment effect (see Chapter 7, Subgroups identified by the individual patient data subgroup identification based on a differential effect search method). It is of note that results from both methods suggest that MCS may not be an essential covariate in improving treatment effect.
-
Those with more functional limitation at baseline and who were younger would gain greater improvement in short-term functional ability as measured by the FFbHR.
Subgroup size | Age, years (<) | FFbHR (<) | PCSa (<) | MCSb (<) | Treatment effect |
---|---|---|---|---|---|
0.102c | 91 | 29.17 | 67.75 | 56.82 | 16.79 |
0.118 | 54 | 58.33 | 40.45 | 47.17 | 16.35 |
0.121 | 54 | 45.83 | 67.75 | 60.35 | 16.07 |
0.132 | 54 | 45.83 | 67.75 | 72.11 | 15.97 |
0.150 | 54 | 62.50 | 33.62 | 72.11 | 14.92 |
0.155 | 54 | 54.17 | 40.45 | 56.82 | 14.43 |
0.163 | 58 | 45.83 | 40.45 | 72.11 | 14.49 |
0.171 | 54 | 54.17 | 40.45 | 60.35 | 14.06 |
0.190 | 54 | 54.17 | 40.45 | 72.11 | 14.35 |
0.200 | 54 | 54.17 | 43.62 | 72.11 | 13.74 |
0.206 | 54 | 54.17 | 67.75 | 72.11 | 14.18 |
0.308 | 62 | 54.17 | 67.75 | 72.11 | 12.72 |
0.314 | 67 | 54.17 | 40.45 | 60.35 | 11.90 |
0.327 | 62 | 58.33 | 40.45 | 72.11 | 12.05 |
0.340 | 67 | 54.17 | 67.75 | 60.35 | 11.70 |
0.345 | 58 | 62.50 | 67.75 | 72.11 | 11.82 |
0.352 | 67 | 54.17 | 40.45 | 72.11 | 12.03 |
0.361 | 62 | 58.33 | 67.75 | 72.11 | 11.76 |
0.378 | 67 | 54.17 | 67.75 | 72.11 | 11.82 |
0.385 | 91 | 54.17 | 40.45 | 60.35 | 11.33 |
0.400 | 62 | 70.83 | 40.45 | 60.35 | 11.32 |
0.402 | 67 | 58.33 | 40.45 | 72.11 | 11.20 |
0.509 | 67 | 62.50 | 67.75 | 72.11 | 10.36 |
0.513 | 62 | 100.00 | 40.45 | 72.11 | 10.30 |
0.528 | 91 | 75.00 | 40.45 | 56.82 | 9.99 |
0.535 | 91 | 58.33 | 67.75 | 72.11 | 10.16 |
0.548 | 91 | 62.50 | 40.45 | 72.11 | 10.37 |
0.553 | 91 | 83.33 | 40.45 | 56.82 | 9.82 |
0.570 | 67 | 75.00 | 40.45 | 72.11 | 9.95 |
0.573 | 91 | 70.83 | 40.45 | 60.35 | 10.22 |
0.582 | 91 | 62.50 | 43.62 | 72.11 | 9.96 |
0.599 | 67 | 75.00 | 47.59 | 60.35 | 9.37 |
0.602 | 91 | 75.00 | 40.45 | 60.35 | 9.96 |
0.702 | 91 | 75.00 | 47.59 | 60.35 | 9.14 |
0.808 | 91 | 100.00 | 47.59 | 60.35 | 8.59 |
0.906 | 91 | 100.00 | 47.59 | 72.11 | 8.47 |
Short-term Short Form questionnaire-12 items/-36 items mental component score outcome
Figure 28 shows the trajectory plot for the treatment effect for the short-term outcome of MCS. Table 28 shows a selection of constructed regions and the corresponding thresholds for covariates age, FFbHR score, PCS and MCS. The average treatment effect for approximately 90% of the initial 3630 participants (corresponding to age > 16 years, FFbHR score of < 100, PCS of < 48 and MCS of < 72) was 2.23, and this increased to 5.98 for approximately 10% of the participants (corresponding to age > 16 years, FFbHR score of < 100, PCS of < 29 and MCS of < 51). Approximately 55% of the participants (corresponding to age > 31 years, FFbHR score of < 63, PCS of < 44 and MCS of < 72) had an average treatment effect of 3 units. A smaller region consisting of 30% of the participants (corresponding to age > 54 years, FFbHR score of < 75, PCS of < 44 and MCS of < 57) would gain greater improvement in psychological outcome, that is, an average treatment effect of 4 units. Of interest are the conflicting cut-offs suggested by FFbHR and PCS at baseline in constructing these regions: the former seemed not to play a critical role, whereas the latter suggested that those with poor physical status would gain greater improvement.
-
Those with more psychological distress and who were younger would gain greater improvement in the short-term psychological outcome as measured by the SF-12/36 MCS.
Subgroup size | Age, years (>) | FFbHR (<) | PCSa (<) | MCSb (<) | Treatment effect |
---|---|---|---|---|---|
0.108c | 16 | 100.00 | 28.84 | 50.61 | 5.98 |
0.159 | 58 | 75.00 | 35.66 | 53.87 | 5.23 |
0.163 | 58 | 83.33 | 35.66 | 53.87 | 5.11 |
0.176 | 58 | 70.83 | 38.01 | 53.87 | 4.90 |
0.181 | 58 | 75.00 | 38.01 | 53.87 | 5.16 |
0.194 | 31 | 75.00 | 31.34 | 53.87 | 4.76 |
0.207 | 31 | 45.83 | 43.62 | 50.61 | 4.72 |
0.301 | 54 | 75.00 | 43.62 | 56.82 | 4.05 |
0.317 | 31 | 54.17 | 47.59 | 53.87 | 4.08 |
0.328 | 31 | 54.17 | 40.45 | 56.82 | 3.93 |
0.334 | 45 | 62.50 | 38.01 | 60.35 | 3.84 |
0.341 | 31 | 54.17 | 43.62 | 56.82 | 3.93 |
0.351 | 31 | 54.17 | 67.75 | 56.82 | 3.86 |
0.365 | 45 | 62.50 | 40.45 | 60.35 | 3.81 |
0.373 | 31 | 70.83 | 38.01 | 53.87 | 3.64 |
0.384 | 45 | 62.50 | 43.62 | 60.35 | 3.86 |
0.401 | 45 | 62.50 | 67.75 | 60.35 | 3.64 |
0.505 | 31 | 75.00 | 38.01 | 60.35 | 3.37 |
0.515 | 45 | 75.00 | 67.75 | 60.35 | 3.27 |
0.526 | 31 | 83.33 | 38.01 | 60.35 | 3.28 |
0.535 | 31 | 100.00 | 38.01 | 60.35 | 3.29 |
0.541 | 31 | 100.00 | 67.75 | 50.61 | 3.25 |
0.551 | 31 | 62.50 | 43.62 | 72.11 | 3.03 |
0.568 | 37 | 75.00 | 43.62 | 60.35 | 3.10 |
0.577 | 31 | 100.00 | 47.59 | 53.87 | 3.05 |
0.582 | 31 | 70.83 | 43.62 | 60.35 | 3.17 |
0.597 | 31 | 100.00 | 43.62 | 56.82 | 2.96 |
0.604 | 45 | 100.00 | 67.75 | 60.35 | 2.94 |
0.701 | 16 | 75.00 | 47.59 | 60.35 | 2.75 |
0.807 | 16 | 100.00 | 47.59 | 60.35 | 2.55 |
0.907 | 16 | 100.00 | 47.59 | 72.11 | 2.23 |
Short-term Short Form questionnaire-12 items/-36 items physical component score outcome
Figure 29 shows the trajectory plot for the treatment effect for the short-term outcome of PCS. Although it shows a general trend towards a higher treatment effect as participants were removed from the initial pool of 5208 participants, the increase was not monotonic and was too small to be of clinical importance. Table 29 shows a selection of constructed regions and the corresponding thresholds for covariates age, PCS and MCS. We thus conclude that there was no subgroup that would gain greater benefit in short-term SF-12/36 PCS.
Subgroup size | Age, years (<) | PCSa (<) | MCSb (>) | Treatment effect |
---|---|---|---|---|
0.110c | 54 | 40.45 | 56.82 | 5.30 |
0.153 | 54 | 35.66 | 47.17 | 5.14 |
0.169 | 67 | 31.34 | 47.17 | 5.29 |
0.176 | 91 | 31.34 | 50.61 | 4.95 |
0.189 | 67 | 40.45 | 56.82 | 5.15 |
0.193 | 67 | 33.62 | 50.61 | 4.89 |
0.202 | 91 | 31.34 | 47.17 | 5.03 |
0.211 | 58 | 35.66 | 42.95 | 4.76 |
0.224 | 62 | 35.66 | 47.17 | 4.98 |
0.233 | 67 | 35.66 | 50.61 | 4.87 |
0.245 | 62 | 43.62 | 53.87 | 4.47 |
0.253 | 67 | 40.45 | 53.87 | 4.82 |
0.263 | 58 | 40.45 | 47.17 | 4.79 |
0.270 | 67 | 35.66 | 47.17 | 4.98 |
0.289 | 67 | 43.62 | 53.87 | 4.42 |
0.292 | 91 | 40.45 | 53.87 | 4.38 |
0.307 | 62 | 43.62 | 50.61 | 4.67 |
0.316 | 67 | 40.45 | 50.61 | 4.78 |
0.326 | 67 | 47.59 | 53.87 | 4.15 |
0.334 | 54 | 40.45 | 34.18 | 4.14 |
0.348 | 62 | 47.59 | 50.61 | 4.23 |
0.360 | 58 | 43.62 | 42.95 | 4.39 |
0.366 | 62 | 40.45 | 42.95 | 4.58 |
0.372 | 67 | 40.45 | 47.17 | 4.77 |
0.385 | 67 | 35.66 | 34.18 | 3.85 |
0.391 | 62 | 67.75 | 50.61 | 4.14 |
0.409 | 91 | 43.62 | 50.61 | 4.29 |
0.413 | 62 | 47.59 | 47.17 | 4.21 |
0.427 | 91 | 40.45 | 47.17 | 4.50 |
0.430 | 67 | 40.45 | 42.95 | 4.47 |
0.443 | 58 | 40.45 | 28.93 | 3.86 |
0.459 | 91 | 47.59 | 50.61 | 4.05 |
0.467 | 62 | 67.75 | 47.17 | 3.93 |
0.471 | 58 | 67.75 | 42.95 | 3.75 |
0.486 | 91 | 43.62 | 47.17 | 4.17 |
0.496 | 91 | 40.45 | 42.95 | 4.22 |
0.508 | 91 | 67.75 | 47.17 | 3.85 |
0.609 | 67 | 40.45 | 28.93 | 3.73 |
0.703 | 91 | 40.45 | 28.93 | 3.59 |
0.802 | 91 | 43.62 | 28.93 | 3.37 |
0.903 | 91 | 47.59 | 28.93 | 3.26 |
Short-term Roland–Morris Disability Questionnaire outcome
As seen in Figure 30, the non-monotonic trajectory plot for the short-term outcome of RMDQ score suggested that there was no subgroup that would gain greater improvement in short-term disability outcome as measured by the RMDQ.
Table 30 shows the selection of a subgroup of participants with thresholds for covariate age and RMDQ score at baseline and their treatment effects.
Subgroup size | Age, years (<) | RMDQ (<) | Treatment effect |
---|---|---|---|
0.110a | 45 | 5 | 1.13 |
0.111 | 41 | 6 | 1.29 |
0.123 | 31 | 24 | 0.88 |
0.138 | 37 | 9 | 1.15 |
0.144 | 45 | 6 | 1.27 |
0.152 | 37 | 10 | 1.10 |
0.169 | 54 | 5 | 1.18 |
0.178 | 45 | 7 | 1.30 |
0.184 | 50 | 6 | 1.36 |
0.199 | 37 | 14 | 1.56 |
0.216 | 37 | 16 | 1.35 |
0.225 | 50 | 7 | 1.35 |
0.242 | 37 | 24 | 1.56 |
0.250 | 58 | 6 | 1.26 |
0.310 | 50 | 9 | 1.37 |
0.318 | 91 | 6 | 1.13 |
0.322 | 45 | 12 | 1.34 |
0.335 | 41 | 24 | 1.46 |
0.341 | 62 | 7 | 1.56 |
0.405 | 50 | 12 | 1.37 |
0.416 | 54 | 10 | 1.29 |
0.426 | 58 | 9 | 1.33 |
0.443 | 45 | 24 | 1.55 |
0.460 | 50 | 14 | 1.48 |
0.506 | 50 | 16 | 1.48 |
0.523 | 62 | 10 | 1.39 |
0.539 | 91 | 9 | 1.30 |
0.626 | 54 | 16 | 1.51 |
0.645 | 58 | 14 | 1.46 |
0.707 | 58 | 16 | 1.47 |
0.903 | 91 | 16 | 1.46 |
Analysis 2: pairwise comparisons
Similar to the analyses seen in Chapter 7 (see Analysis 2: Pairwise comparisons), a further examination of the treatment effect between active physical and non-active usual care (usual care/GP or waiting list only), between passive physical and non-active usual care, between psychological and non-active usual care, and between sham and non-active usual care arms, was performed for selected short-term outcomes. Table 31 summarises the trials and variables considered in the construction of a region that predicts the best or worst response for each pairwise comparison for selected short-term outcome measures.
Comparison | FFbHR: trials | FFbHR: variables | RMDQ: trials | RMDQ: variables | MCSa: trials | MCSa: variables | PCSb: trials | PCSb: variables |
---|---|---|---|---|---|---|---|---|
Active physical vs. non-active usual carec | | | m = 2; n = 622: UK BEAM31 (n = 465), Smeets70 (n = 157) | Age and RMDQ score at baseline | | | | |
Passive physical vs. non-active usual carec | m = 3; n = 3272: Brinkhaus101 (n = 214), Haake132 (n = 734), Witt50 (n = 2324) | Age, FFbHR score, PCS and MCS at baseline | | | m = 5; n = 3879: UK BEAM31 (n = 479), Brinkhaus101 (n = 212), Haake132 (n = 734), Witt50 (n = 2248), YACBAC107 (n = 206) | Age, PCS and MCS at baseline | m = 5; n = 3879: UK BEAM31 (n = 479), Brinkhaus101 (n = 212), Haake132 (n = 734), Witt50 (n = 2248), YACBAC107 (n = 206) | Age, PCS and MCS at baseline |
Psychological vs. non-active usual carec | | | m = 3; n = 957: BeST33 (n = 514), VKBIA104 (n = 230), VKSC2105 (n = 213) | Age and RMDQ score at baseline | | | | |
Sham vs. non-active usual carec | m = 2; n = 881: Brinkhaus101 (n = 144), Haake132 (n = 737) | Age, FFbHR score, PCS and MCS at baseline | | | m = 2; n = 879: Brinkhaus101 (n = 142), Haake132 (n = 737) | Age, PCS and MCS at baseline | m = 2; n = 879: Brinkhaus101 (n = 142), Haake132 (n = 737) | Age, PCS and MCS at baseline |
Active physical versus non-active usual care
Short-term Roland–Morris Disability Questionnaire outcome
Figure 31 shows the trajectory plot for the treatment effect between active physical and non-active usual care for the short-term RMDQ outcome. The figure shows a similar result to the one seen above (see Analysis 1: overall comparison treatment compared with control/Short-term Roland–Morris Disability Questionnaire outcome), that is, there was no subgroup that would have a substantial improvement in treatment effect. Table 32 shows the average treatment effect for selected constructed regions with the corresponding thresholds.
Subgroup size | Age, years (>) | RMDQ (>) | Treatment effect |
---|---|---|---|
0.109 | 45 | 14 | 3.54 |
0.190 | 33 | 14 | 2.66 |
0.211 | 52 | 6 | 2.63 |
0.291 | 43 | 10 | 2.09 |
0.314 | 33 | 12 | 2.26 |
0.405 | 43 | 7 | 2.22 |
0.495 | 43 | 5 | 2.14 |
0.527 | 43 | 4 | 2.14 |
0.592 | 40 | 5 | 1.90 |
0.605 | 33 | 7 | 1.87 |
0.807 | 19 | 6 | 1.76 |
0.908 | 19a | 5 | 1.73 |
Passive physical care compared with non-active usual care
Short-term Hannover Functional Ability Questionnaire outcome
Figure 32 shows the trajectory plot for the treatment effect between passive physical and non-active usual care against the size of the constructed region for the short-term outcome of FFbHR. Table 33 shows that the average treatment effect for approximately 90% of the population (corresponding to a FFbHR score of < 86, regardless of age, PCS and MCS values at baseline) was 10.41. This was slightly higher than the average treatment effect of 8.5 between any therapist-delivered intervention (active, passive, psychological or any combination treatment) and control/placebo (usual care/GP and sham treatment). Approximately 20% of the population (corresponding to age < 59 years, FFbHR score of < 50, PCS of < 68 and MCS of < 72) gained an average treatment effect of at least 16 units. Younger participants with substantial physical disability (low FFbHR score) gained the most benefit. The PCS and MCS at baseline did not play an influential role in improving treatment effect.
Subgroup size | Age, years (<) | FFbHR (<) | PCSa (<) | MCSb (<) | Treatment effect |
---|---|---|---|---|---|
0.101 | 55 | 41.67 | 67.75 | 72.11 | 18.42 |
0.196 | 68 | 41.67 | 67.75 | 72.11 | 16.18 |
0.207 | 59 | 50.00 | 67.75 | 72.11 | 16.14 |
0.306 | 68 | 50.00 | 67.75 | 72.11 | 14.57 |
0.407 | 91 | 54.17 | 40.41 | 72.11 | 12.97 |
0.503 | 63 | 86.36 | 40.41 | 72.11 | 12.08 |
0.602 | 91 | 79.17 | 40.41 | 60.38 | 11.62 |
0.702 | 68 | 79.17 | 47.80 | 72.11 | 11.10 |
0.807 | 91 | 100.00 | 43.73 | 72.11 | 10.64 |
0.904 | 91 | 86.36 | 67.75 | 72.11 | 10.41 |
Short-term Short Form questionnaire-12 items/-36 items mental component score outcome
Figure 33 shows the trajectory plot for the treatment effect between passive physical and non-active usual care, which is quite similar to the one seen above (see Analysis 1: Overall comparison treatment compared with control/Short-term Short Form questionnaire-12 items/-36 items mental component score outcome). Approximately 90% of the initial 3879 participants (corresponding to age < 68 years, PCS of < 68 and MCS of < 71) had an average treatment effect of 3.06 (Table 34). The treatment effect increased as more participants were excluded from the region, reaching a clinically important difference of 6.3, but this was applicable to only a small proportion of participants, approximately 10% of them (corresponding to age < 51 years, PCS of < 44 and MCS of < 38). That is, only younger participants with substantial physical limitations and psychological distress would gain greater improvement from passive physical treatment than from control.
Subgroup size | Age, years (<) | PCSa (<) | MCSb (<) | Treatment effect |
---|---|---|---|---|
0.105 | 51 | 43.50 | 37.86 | 6.33 |
0.193 | 68 | 35.54 | 47.60 | 4.38 |
0.208 | 63 | 47.65 | 37.86 | 5.26 |
0.296 | 91c | 67.75 | 37.86 | 4.45 |
0.307 | 63 | 43.50 | 47.60 | 4.05 |
0.392 | 91 | 43.50 | 47.60 | 4.21 |
0.403 | 91 | 37.84 | 54.15 | 3.99 |
0.496 | 91 | 67.75 | 47.60 | 3.77 |
0.500 | 63 | 47.65 | 54.15 | 3.27 |
0.594 | 91 | 67.75 | 51.02 | 3.67 |
0.603 | 55 | 67.75 | 71.32 | 2.88 |
0.706 | 91 | 43.50 | 60.37 | 3.57 |
0.802 | 91 | 47.65 | 60.37 | 3.22 |
0.904 | 68 | 67.75 | 71.32 | 3.06 |
Short-term Short Form questionnaire-12 items/-36 items physical component score outcome
The trajectory plot for the treatment effect between passive physical and non-active usual care is shown in Figure 34. The trajectory indicates increasing improvement as the regions narrowed, but the fluctuation of the treatment effect suggests that there might be no definite subgroup that would gain a substantially greater treatment effect. Table 35 summarises the average treatment effect for selected constructed regions, with the corresponding thresholds, for the comparison seen in Figure 34.
Subgroup size | Age, years (<) | PCSa (<) | MCSb (>) | Treatment effect |
---|---|---|---|---|
0.107 | 63 | 31.19 | 51.02 | 6.17 |
0.192 | 68 | 35.54 | 51.02 | 5.84 |
0.205 | 91c | 31.19 | 43.02 | 5.99 |
0.292 | 68 | 43.50 | 51.02 | 5.30 |
0.310 | 55 | 40.28 | 33.48 | 5.09 |
0.394 | 68 | 35.54 | 28.47 | 4.56 |
0.406 | 91 | 43.50 | 47.60 | 4.93 |
0.495 | 91 | 40.28 | 37.86 | 5.02 |
0.503 | 68 | 43.50 | 37.86 | 4.95 |
0.599 | 91 | 37.84 | 9.46 | 4.45 |
0.604 | 91 | 67.75 | 43.02 | 4.33 |
0.709 | 68 | 43.50 | 9.46 | 4.47 |
0.802 | 91 | 67.75 | 33.48 | 4.14 |
0.904 | 68 | 67.75 | 9.46 | 3.88 |
Psychological versus non-active usual care
Short-term Roland–Morris Disability Questionnaire outcome
Figure 35 shows the trajectory plot for the treatment effect between psychological and non-active usual care for the short-term RMDQ outcome, and Table 36 shows the average treatment effect for selected constructed regions with the corresponding thresholds. The results are very similar to those seen above (see Analysis 1: Short-term Roland–Morris Disability Questionnaire outcome), that is, there was no subgroup that would gain a substantial improvement in treatment effect.
Subgroup size | Age, years (<) | RMDQ (>) | Treatment effect |
---|---|---|---|
0.107 | 41 | 7 | 2.84 |
0.197 | 49 | 8 | 2.58 |
0.214 | 69 | 13 | 1.46 |
0.295 | 45 | 0 | 1.81 |
0.305 | 49 | 5 | 2.52 |
0.400 | 52 | 4 | 2.19 |
0.493 | 56 | 4 | 2.02 |
0.528 | 85a | 8 | 1.39 |
0.591 | 60 | 4 | 1.90 |
0.606 | 63 | 5 | 1.79 |
0.809 | 63 | 0 | 1.48 |
0.909 | 69 | 0 | 1.39 |
Sham care compared with non-active usual care
Short-term Hannover Functional Ability Questionnaire outcome
Three trials50,101,132 were included in the comparison between passive physical and non-active usual care. All three trials50,101,132 had acupuncture as the therapist-delivered intervention; of these, two trials101,132 also had sham acupuncture. Figure 36 shows the trajectory plot for the treatment effect between sham acupuncture and non-active usual care. The average treatment effect was slightly lower than that between passive physical (acupuncture) and non-active usual care. However, the treatment effect increased as more and more participants were excluded by the ARDP-MA algorithm. Table 37 shows the average treatment effect between sham acupuncture and non-active usual care for selected constructed regions with the corresponding thresholds.
Subgroup size | Age, years (<) | FFbHR (<) | PCSa (<) | MCSb (<) | Treatment effect |
---|---|---|---|---|---|
0.103 | 52 | 41.67 | 44.78 | 51.61 | 12.64 |
0.199 | 52 | 54.17 | 60.47 | 51.61 | 12.58 |
0.208 | 62 | 45.83 | 44.78 | 51.61 | 12.26 |
0.301 | 62 | 45.83 | 60.47 | 72.11 | 9.85 |
0.402 | 52 | 95.83 | 60.47 | 57.68 | 7.53 |
0.510 | 68 | 58.33 | 41.50 | 61.38 | 6.49 |
0.605 | 87c | 62.50 | 41.50 | 57.68 | 6.84 |
0.700 | 68 | 66.67 | 44.78 | 61.38 | 6.00 |
0.806 | 68 | 95.83 | 44.78 | 72.11 | 5.95 |
Short-term SF-12/36 MCS outcome
Figure 37 shows the trajectory plot for the treatment effect between sham and non-active usual care. The two trials included in this pairwise analysis had sham acupuncture. The figure shows that the average treatment effect did not improve much with the exclusion of the first 70% of participants (Table 38). Nevertheless, there was a markedly higher treatment effect of 6.22 for approximately 20% of the participants (corresponding to a PCS of < 36 and an MCS of < 39, regardless of age).
Subgroup size | Age, years (<) | PCSa (<) | MCSb (<) | Treatment effect |
---|---|---|---|---|
0.104 | 43 | 36.48 | 51.97 | 7.86 |
0.199 | 43 | 39.17 | 61.54 | 6.43 |
0.201 | 87 | 36.48 | 39.07 | 6.22 |
0.296 | 87 | 57.59 | 39.07 | 5.06 |
0.300 | 65 | 42.29 | 44.25 | 4.01 |
0.396 | 87c | 39.17 | 48.42 | 4.40 |
0.410 | 52 | 42.29 | 61.54 | 4.57 |
0.501 | 61 | 57.59 | 55.18 | 3.09 |
0.709 | 70 | 39.17 | 70.46 | 3.59 |
0.809 | 70 | 42.29 | 70.46 | 3.67 |
0.902 | 70 | 57.59 | 70.46 | 3.09 |
Short-term physical component score outcome
The trajectory plot for the treatment effect between sham and non-active usual care is shown in Figure 38, and Table 39 summarises the average treatment effect for selected constructed regions with the corresponding thresholds. The treatment effect increased as the regions narrowed, but its fluctuation suggests that there might be no definite subpopulation that would gain a substantial treatment effect.
Subgroup size | Age, years (<) | PCSa (>) | MCSb (<) | Treatment effect |
---|---|---|---|---|
0.100 | 70 | 39.17 | 48.42 | 6.26 |
0.195 | 52 | 32.56 | 51.97 | 6.04 |
0.206 | 70 | 36.48 | 55.18 | 5.37 |
0.296 | 52 | 30.95 | 58.10 | 5.59 |
0.303 | 70 | 30.95 | 48.42 | 5.43 |
0.398 | 87c | 34.31 | 70.46 | 4.46 |
0.403 | 65 | 32.56 | 61.54 | 4.86 |
0.495 | 87 | 26.96 | 51.97 | 4.55 |
0.503 | 87 | 30.95 | 58.10 | 4.66 |
0.598 | 87 | 30.95 | 70.46 | 4.06 |
0.602 | 70 | 29.16 | 61.54 | 3.71 |
0.801 | 65 | 14.41 | 70.46 | 3.46 |
0.902 | 70 | 14.41 | 70.46 | 3.56 |
Chapter 9 Methodology and statistical developments 3: identification of cost-effective subgroups by directed peeling
Introduction
The economic analyses sought to identify the most cost-effective treatments for subgroups of patients with LBP. A search algorithm, similar to that used in the previous chapter, was used to identify subgroups to maximise the expected QALY gain from treatment. Although some of the trials in the database provided individual-level data on use of health-care resources, these data were not used in the analyses presented in this chapter. Instead, a threshold approach was used to assess the cost-effectiveness of treatment for defined groups of patients. This was done by comparing estimates of treatment cost from the literature with the maximum cost required to stay below the cost-effectiveness threshold (£20,000–30,000 per QALY, as recommended by NICE), given the estimated QALY gain from treatment.173
The use of the QALY outcome reduced the available data for analysis more than for the short-term clinical outcomes in the previous chapter. We therefore used a search algorithm that is suited to data with a lower signal–noise ratio: the directed peeling approach of LeBlanc et al.,172 which works by ‘peeling’ a fraction of patients (with the least favourable effect) from the subgroup in a series of steps. This differs from the full search algorithm described in the previous chapter, as each successively smaller subgroup is constrained to be a subset of the previous one. Both approaches use a ‘directed’ peeling approach, designed to provide simpler descriptions of groups for variables with a monotonic relationship with the outcome of interest. The LeBlanc et al.172 algorithm was developed for analysis of data from a single trial, and so it was adapted here for IPD meta-analysis by incorporating random trial effects into the model.
The analysis was split into four overarching comparisons: all interventions collectively compared with best care; active physical interventions compared with best care; passive physical interventions compared with best care; and active physical interventions compared with passive physical interventions. Psychological interventions were not included in the comparison, as only one trial had the EQ-5D data that was necessary to calculate a QALY and a control arm. Data for comparisons against a ‘sham’ treatment arm were also excluded from this analysis.
Methods
Quality-adjusted life-years
The outcome used for the analysis was the QALY. We calculated QALYs for individuals based on EQ-5D utility scores at baseline and short-, medium- and long-term follow-up (up to 1 year). For trials with SF-36/SF-12 outcomes but no EQ-5D, we used a mapping algorithm162 to estimate EQ-5D scores. QALYs were estimated using an AUC approach, adjusting for baseline EQ-5D scores (see Chapter 6, Health-economic outcomes).
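The AUC calculation can be illustrated with a minimal sketch. It assumes the trapezoidal rule and a hypothetical follow-up schedule (baseline, 3, 6 and 12 months); neither is specified in the text above, and the function name `qaly_auc` is illustrative only. The adjustment for baseline EQ-5D is not made here; in the analyses it enters through the regression model as a covariate.

```python
from typing import Sequence


def qaly_auc(times_years: Sequence[float], utilities: Sequence[float]) -> float:
    """Area under the utility curve by the trapezoidal rule (time in years)."""
    if len(times_years) != len(utilities) or len(times_years) < 2:
        raise ValueError("need matching time and utility sequences of length >= 2")
    total = 0.0
    for k in range(1, len(times_years)):
        width = times_years[k] - times_years[k - 1]
        if width <= 0:
            raise ValueError("time points must be strictly increasing")
        # Average utility over the interval multiplied by its duration.
        total += width * (utilities[k] + utilities[k - 1]) / 2.0
    return total


# Hypothetical schedule (assumption): baseline, 3, 6 and 12 months.
times = [0.0, 0.25, 0.5, 1.0]
# Hypothetical EQ-5D utility scores for one participant.
utilities = [0.50, 0.60, 0.70, 0.70]
qalys = qaly_auc(times, utilities)
```

A participant missing any one of the utility scores cannot be given a value by this function, which is why the QALY outcome is more exposed to missing data than a single follow-up measurement.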
Moderator identification
The specification of the search algorithm required an initial analysis to identify moderating variables, and to determine the direction of peeling. A mixed-effects model was used to identify moderators with a significant interaction with treatment effect on the QALY outcome. The model was specified with moderator, treatment and treatment-by-moderator interaction as fixed effects, and trial and treatment-by-trial interaction as random effects (see Chapter 6, Outcome variables). The sign of the treatment-by-moderator interaction coefficient dictated whether the algorithm should peel from the top or the bottom of the moderator range. A positive relationship with treatment effect suggested that peeling away individuals with lower values of the moderator would yield higher average treatment benefits. A negative relationship suggested that peeling individuals with higher values of the moderator would be best.
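The rule linking the sign of the interaction to the direction of peeling can be sketched as follows. This is a simplified illustration, not the model used in the analyses: it fits an ordinary least-squares regression with NumPy and omits the random trial and treatment-by-trial effects. The function name `peel_direction` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def peel_direction(moderator, treated, outcome):
    """Return ('bottom' or 'top', interaction coefficient).

    Simplification (assumption): ordinary least squares stands in for the
    mixed-effects model; no random trial effects are fitted.
    """
    m = np.asarray(moderator, dtype=float)
    t = np.asarray(treated, dtype=float)
    y = np.asarray(outcome, dtype=float)
    mc = m - m.mean()  # centre the moderator so the treatment term is the mean effect
    design = np.column_stack([np.ones_like(mc), mc, t, mc * t])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    interaction = float(coef[3])
    # Positive interaction: low values benefit least, so peel from the bottom.
    # Negative interaction: high values benefit least, so peel from the top.
    return ("bottom" if interaction > 0 else "top"), interaction


# Synthetic data (assumption): QALY gain from treatment rises with age.
rng = np.random.default_rng(0)
age = rng.uniform(18, 87, 400)
treated = rng.integers(0, 2, 400)
qaly = 0.6 + treated * (0.05 + 0.002 * (age - age.mean())) + rng.normal(0, 0.05, 400)
direction, slope = peel_direction(age, treated, qaly)
```

Because the synthetic gain is built to rise with age, the fitted interaction should be positive and the suggested peel should start from the bottom of the age range.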
Peeling algorithm
The peeling algorithm started by setting the subgroup indicator (B) to ‘1’ for all individuals. Incremental QALY gain from treatment for the whole patient sample was estimated using a mixed-effects model with baseline EQ-5D score and treatment as fixed effects, and trial and treatment-by-trial interaction as random effects.
The algorithm then looped through the following steps until the stopping criteria were met:
- For each moderator, a small proportion of the data was peeled off, taking out the individuals with the highest (lowest) value of the moderator (depending on the direction of the moderator treatment interaction effect). The subgroup indicator (B) was set to ‘1’ for the remaining individuals (the ‘in’ group) and ‘0’ for the peeled individuals (the ‘out’ group).
- The difference in incremental QALY gain was estimated for those inside the subgroup compared with those outside using a mixed-effects model: with baseline EQ-5D, treatment effect, subgroup identifier and treatment-by-subgroup interaction as fixed effects, and trial and treatment-by-trial interaction as random effects.
- The magnitude of the treatment by subgroup interaction effect was compared for each moderator. The peel decision was then based on the moderator with the greatest effect.
- Summary statistics were calculated, including the incremental QALY gain within the subgroup, the incremental QALY gain outside the subgroup and the weighted mean incremental QALY across the whole sample.
- If the subgroup contained fewer individuals than a preset minimum number (nmin) then the algorithm stopped. Otherwise, the above steps were repeated.
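The loop described in the steps above can be sketched in a few lines. This is an illustration under stated assumptions rather than the implementation used in the analyses: a simple treated-minus-control difference in means stands in for the mixed-effects model (so there is no baseline EQ-5D adjustment and no random trial effect), the candidate peels are ranked by the signed inside-minus-outside difference, and the names `effect`, `directed_peel`, `alpha` and `n_min` are hypothetical.

```python
import numpy as np


def effect(y, t, mask):
    """Treated-minus-control difference in mean outcome within `mask`.

    Crude stand-in (assumption) for the mixed-effects estimate.
    Returns None if either arm is empty.
    """
    treated, control = mask & (t == 1), mask & (t == 0)
    if not treated.any() or not control.any():
        return None
    return float(y[treated].mean() - y[control].mean())


def directed_peel(y, t, moderators, directions, alpha=0.05, n_min=50):
    """moderators: name -> array; directions: name -> 'top' or 'bottom'."""
    in_group = np.ones(len(y), dtype=bool)  # subgroup indicator B = 1 for everyone
    trace = [(0, None, 1.0, effect(y, t, in_group))]
    iteration = 0
    while True:
        best = None
        for name, x in moderators.items():
            # Step 1: trial peel of a fraction alpha from one end of the range.
            if directions[name] == "bottom":
                keep = in_group & (x > np.quantile(x[in_group], alpha))
            else:
                keep = in_group & (x < np.quantile(x[in_group], 1 - alpha))
            # Step 2: effect inside versus outside the candidate subgroup.
            e_in, e_out = effect(y, t, keep), effect(y, t, ~keep)
            if e_in is None or e_out is None:
                continue
            # Step 3: retain the moderator with the largest interaction.
            if best is None or e_in - e_out > best[0]:
                best = (e_in - e_out, name, keep, e_in)
        if best is None:
            break  # no admissible peel left
        iteration += 1
        _, name, in_group, e_in = best
        # Step 4: record summary statistics for the peeling trace.
        trace.append((iteration, name, float(in_group.mean()), e_in))
        # Step 5: stop once the subgroup is smaller than n_min.
        if in_group.sum() < n_min:
            break
    return trace


# Synthetic data (assumption): the gain from treatment rises with age only.
rng = np.random.default_rng(1)
n = 2000
age, pcs = rng.uniform(18, 87, n), rng.uniform(7, 61, n)
t = rng.integers(0, 2, n)
y = 0.6 + t * 0.002 * (age - 18) + rng.normal(0, 0.1, n)
trace = directed_peel(y, t, {"age": age, "pcs": pcs}, {"age": "bottom", "pcs": "top"})
```

Each entry of `trace` holds the iteration number, the moderator peeled, the proportion of the sample remaining in the subgroup and the estimated effect inside the subgroup. Because every subgroup is a subset of the previous one, the proportions in the trace can only decrease.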
Cost-effectiveness
Individual patient data on health-care resource use were available for some trials in the repository. An initial analysis was conducted using the data from the UK BEAM trial31 using individual-level estimates of costs (C) and QALYs (Q) over the 12-month follow-up period. From these data, the net monetary benefit (NMB) was calculated for each individual: NMB = λ × Q – C, where λ is a set cost-effectiveness threshold (£20,000 per QALY). This NMB variable was then used as an outcome in the above search algorithm. However, we found that the addition of the cost data increased variation without increasing predictive power. The results of this analysis are not presented here, as one condition of use of the repository data is that all results must include at least two trials to avoid re-analysis of the original trial data. Given that the addition of the individual-level costs was not advantageous in the UK BEAM31 analysis, and also the heterogeneity in the resource-use items recorded across those studies with data, we decided to focus on QALYs as the outcome for the economic analysis, and to use a threshold approach to assess cost-effectiveness.
The threshold analysis presents the maximum incremental cost of intervention in order for a treatment subgroup to be deemed cost-effective based on the lower and upper limits of the NICE-recommended threshold (£20,000–30,000 per QALY). For example, if a treatment yields an average incremental QALY gain for a treatment population of 0.05, one would pay up to £1000 (0.05 × £20,000) for the treatment, using the lower threshold or £1500 (0.05 × £30,000) at the upper threshold.
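The threshold arithmetic, and its link to the net monetary benefit (NMB = λ × Q – C), can be written out as a short sketch. The function names and the £800 treatment cost are hypothetical and are used only for illustration; the QALY gain of 0.05 is the worked example from the text.

```python
def max_incremental_cost(qaly_gain: float, threshold_per_qaly: float) -> float:
    """Largest incremental cost at which treatment stays within the threshold."""
    return qaly_gain * threshold_per_qaly


def net_monetary_benefit(qaly_gain: float, cost: float, threshold_per_qaly: float) -> float:
    """NMB = lambda x Q - C."""
    return threshold_per_qaly * qaly_gain - cost


def is_cost_effective(cost: float, qaly_gain: float, threshold_per_qaly: float) -> bool:
    """True when the cost does not exceed the maximum incremental cost."""
    return cost <= max_incremental_cost(qaly_gain, threshold_per_qaly)


LOWER, UPPER = 20_000, 30_000  # threshold range, pounds per QALY

# Worked example from the text: an average incremental QALY gain of 0.05.
lower_cap = max_incremental_cost(0.05, LOWER)  # about 1000 pounds
upper_cap = max_incremental_cost(0.05, UPPER)  # about 1500 pounds

# Hypothetical incremental treatment cost of 800 pounds (assumption).
nmb_at_lower = net_monetary_benefit(0.05, 800.0, LOWER)
```

Checking the cost against the maximum incremental cost is equivalent to checking that the NMB is not negative, so the threshold approach reaches the same decision as an NMB comparison whenever a single average cost is used.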
Published literature was used to provide indicative costs of treatment for comparison with the estimated thresholds. The incremental cost of passive treatment over 1 year was estimated at £541 (SD £768) from the UK BEAM31 economic analysis: £147 for the intervention and £394 relating to other health-care costs. Estimates for other treatments varied, ranging from £422 (£187 for the intervention and £235 for other health-care costs) for a psychological intervention (BeST33) to £486 (SD £907; £41 for the intervention and £445 relating to other health-care costs) for active therapies (UK BEAM31).
Results
Six analyses were run (Table 40), dictated by the moderators with significant treatment interaction terms in the QALY ANCOVA. These included the following comparisons: all interventions compared with control; active physical interventions compared with control; passive physical interventions compared with control; and active physical interventions compared with passive physical interventions. As noted above, analyses of psychological intervention and sham were omitted, as in each case only one study provided data for QALY calculation.
As shown in Table 40, not all trials had data for all three potential moderators. We therefore conducted three analyses for the intervention against control comparison: the first included as many trials as possible with QALY data (age and PCS as moderators), the second used age and RMDQ as moderators, and the third used all three moderators (age, PCS and RMDQ).
Analysis | Outcome variable | Moderators included | Trials included | Sample size: intervention, control |
---|---|---|---|---|
All interventions vs. control | ||||
9.3.1 | QALY | Age, PCSa | UK BEAM,31 BeST,33 YACBAC,107 Haake132 | 1273, 715 |
9.3.2 | QALY | Age, RMDQ | UK BEAM,31 BeST,33 York BP,133 Smeets70 | 1092, 422 |
9.3.3 | QALY | Age, PCS, RMDQ | UK BEAM,31 BeST33 | 827, 323 |
Active physical interventions vs. control | ||||
9.3.4 | QALY | Age, RMDQ | UK BEAM,31 York BP133 | 232, 264 |
Passive physical interventions vs. control | ||||
9.3.5 | QALY | Age, PCS | UK BEAM,31 YACBAC,107 Haake132 | 643, 566 |
Active physical vs. passive physical interventions | ||||
9.3.6 | QALY | Age, RMDQ | UK BEAM,31 HullExPro76 | 232, 288 |
All interventions versus control: moderators – age and physical component score
The algorithm trace is shown in Figure 39. The y-axis shows the estimated treatment effect for the subgroup, that is, the ‘incremental QALYs’ gained from treatment compared with the control arm. The x-axis is the proportion of the starting population peeled away from the treatment group. Figure 40 shows the mean incremental QALYs for the whole sample, both inside and outside the treatment group. It can be seen that for the full sample, the incremental QALY is declining as a function of the treatment subgroup size. This suggests that those being peeled from the subgroup had a net QALY gain from treatment. However, there is no strong signal in these data. The peeling trace in Figure 39 shows no notable increase in QALY gain from treatment when up to 80% of the sample are removed from the treatment group. Full details of the peeling trace are available in Table 41. Both age and PCS were used for peeling, although over the trace the algorithm favoured peeling that was based on PCS. There is a small rise in QALY gain at the point where 90% of the sample had been removed; the subgroup comprising 10% of the sample included participants aged between 54 and 84 years with a PCS of between 7 and 28. The estimated QALY gain from treating only this subgroup was 0.0852, whereas the estimated mean QALY gain from treating the whole population was lower, at 0.0624.
Iteration | Moderator | Direction peeled | Proportion in subgroup | ≈n | Incremental QALYs: subgroup | Incremental QALYs: all | Age: minimum | Age: maximum | PCSa: minimum | PCSa: maximum |
---|---|---|---|---|---|---|---|---|---|---|
0 | – | – | 1.00 | 1988 | 0.0624 | 0.0624 | 18 | 87 | 7 | 61 |
1 | PCS | Top | 0.95 | 1889 | 0.0642 | 0.0610 | 18 | 87 | 7 | 50 |
2 | Age | Bottom | 0.90 | 1795 | 0.0648 | 0.0585 | 28 | 87 | 7 | 50 |
3 | Age | Bottom | 0.86 | 1706 | 0.0685 | 0.0588 | 32 | 87 | 7 | 50 |
4 | PCS | Top | 0.82 | 1621 | 0.0700 | 0.0571 | 32 | 87 | 7 | 47 |
5 | PCS | Top | 0.77 | 1540 | 0.0700 | 0.0542 | 32 | 87 | 7 | 45 |
6 | PCS | Top | 0.74 | 1463 | 0.0718 | 0.0529 | 32 | 87 | 7 | 43 |
7 | PCS | Top | 0.70 | 1390 | 0.0722 | 0.0505 | 32 | 87 | 7 | 42 |
8 | Age | Bottom | 0.66 | 1319 | 0.0718 | 0.0476 | 34 | 87 | 7 | 42 |
9 | PCS | Top | 0.63 | 1254 | 0.0688 | 0.0434 | 34 | 87 | 7 | 41 |
10 | PCS | Top | 0.60 | 1192 | 0.0695 | 0.0417 | 34 | 87 | 7 | 40 |
11 | PCS | Top | 0.57 | 1133 | 0.0677 | 0.0386 | 34 | 87 | 7 | 39 |
12 | PCS | Top | 0.54 | 1077 | 0.0706 | 0.0383 | 34 | 87 | 7 | 38 |
13 | PCS | Top | 0.52 | 1024 | 0.0668 | 0.0344 | 34 | 87 | 7 | 38 |
14 | PCS | Top | 0.49 | 973 | 0.0674 | 0.0330 | 34 | 87 | 7 | 37 |
15 | PCS | Top | 0.47 | 925 | 0.0679 | 0.0316 | 34 | 87 | 7 | 36 |
16 | PCS | Top | 0.44 | 879 | 0.0664 | 0.0294 | 34 | 87 | 7 | 36 |
17 | PCS | Top | 0.42 | 836 | 0.0645 | 0.0271 | 34 | 87 | 7 | 35 |
18 | PCS | Top | 0.40 | 795 | 0.0663 | 0.0265 | 34 | 87 | 7 | 35 |
19 | PCS | Top | 0.38 | 756 | 0.0696 | 0.0265 | 34 | 87 | 7 | 34 |
20 | Age | Bottom | 0.36 | 719 | 0.0686 | 0.0248 | 36 | 87 | 7 | 34 |
21 | Age | Bottom | 0.34 | 683 | 0.0652 | 0.0224 | 39 | 87 | 7 | 34 |
22 | PCS | Top | 0.33 | 649 | 0.0652 | 0.0213 | 39 | 87 | 7 | 34 |
23 | PCS | Top | 0.31 | 617 | 0.0688 | 0.0213 | 39 | 87 | 7 | 33 |
24 | PCS | Top | 0.30 | 587 | 0.0691 | 0.0204 | 39 | 87 | 7 | 33 |
25 | PCS | Top | 0.28 | 558 | 0.0682 | 0.0191 | 39 | 87 | 7 | 33 |
26 | Age | Bottom | 0.27 | 531 | 0.0655 | 0.0175 | 41 | 87 | 7 | 33 |
27 | Age | Bottom | 0.25 | 505 | 0.0698 | 0.0177 | 43 | 87 | 7 | 33 |
28 | Age | Bottom | 0.24 | 480 | 0.0716 | 0.0173 | 45 | 87 | 7 | 33 |
29 | Age | Bottom | 0.23 | 456 | 0.0687 | 0.0158 | 47 | 87 | 7 | 33 |
30 | Age | Bottom | 0.22 | 434 | 0.0694 | 0.0151 | 49 | 87 | 7 | 33 |
31 | Age | Bottom | 0.21 | 413 | 0.0671 | 0.0139 | 50 | 87 | 7 | 33 |
32 | Age | Bottom | 0.20 | 393 | 0.0652 | 0.0129 | 51 | 87 | 7 | 33 |
46 | PCS | Top | 0.10 | 196 | 0.0852 | 0.0084 | 54 | 84 | 7 | 28 |
Depending on the cost of intervention, and NHS ‘willingness-to-pay’ per QALY, it might be cost-effective for all patients to be offered treatment or for treatment to be limited to a selected subgroup. For example, at a cost-effectiveness threshold of £20,000 per QALY, the maximum that the NHS would pay for the ‘intervention’ reflected here would be £1248 (per patient over the course of a year) if all patients were to be offered treatment or £1704 if only patients in the 10% subgroup were to be offered treatment. If the threshold of £30,000 were applied, these figures would be £1872 and £2556, respectively. However, these results do not incorporate any measure of uncertainty and should be considered as only illustrative of the method.
- Older patients with relatively worse physical functioning as measured using the PCS at baseline appear to have moderately better response to treatment.
All interventions versus control: moderators – age and Roland–Morris Disability Questionnaire
Figures 41 and 42 illustrate the peeling trace with age and RMDQ as moderators. The inclusion of the RMDQ limited the sample to four trials (see Table 40). As shown by Figure 41, the peeling algorithm did achieve small but consistent gains in treatment effect within the subgroup, as participants with better (lower) baseline RMDQ scores and who were younger were removed from the treatment group. The algorithm favoured peeling that was based on RMDQ score during the earlier iterations. The apparent monotonicity of RMDQ with respect to treatment effect (as measured in QALYs) is consistent with the regression analysis used for moderator identification (see Table 17), as the RMDQ had a more significant relationship with treatment effect compared with age. Owing to some correlation between RMDQ score and age, some older patients were removed from the treatment subgroup as the algorithm peeled based on RMDQ score.
The peeling trace for analysis 2 is shown in Table 42. The subgroup at 20% of the initial sample comprised participants aged ≥ 34 years with a RMDQ score of ≥ 13. A modest improvement in QALYs gained from treatment can be seen for this subgroup: from 0.043 if the whole population were to be offered treatment to 0.076 for the subgroup. As described previously, the maximum willingness to pay for an intervention yielding these QALY gains would be £860 and £1520, respectively, for the whole population and for the subgroup, where a threshold of £20,000 is applied, or £1290 and £2280, respectively, at a threshold of £30,000 per QALY. As there is no estimation of uncertainty, this result should be seen as illustrative.
- Older patients with worse baseline physical functioning as measured by the RMDQ score at baseline appear to achieve a moderately better response to treatment.
Iteration | Moderator | Direction peeled | Proportion in subgroup | ≈n | Incremental QALYs: subgroup | Incremental QALYs: all | Age: minimum | Age: maximum | RMDQ score: minimum | RMDQ score: maximum |
---|---|---|---|---|---|---|---|---|---|---|
0 | – | – | 1.00 | 1514 | 0.0431 | 0.0431 | 18 | 85 | 0 | 24 |
1 | RMDQ | Bottom | 0.95 | 1435 | 0.0463 | 0.0439 | 18 | 85 | 3 | 24 |
2 | RMDQ | Bottom | 0.82 | 1245 | 0.0517 | 0.0425 | 19 | 84 | 5 | 24 |
3 | RMDQ | Bottom | 0.73 | 1105 | 0.0525 | 0.0383 | 19 | 84 | 6 | 24 |
4 | Age | Bottom | 0.69 | 1050 | 0.0631 | 0.0437 | 28 | 84 | 6 | 24 |
5 | Age | Bottom | 0.66 | 998 | 0.0661 | 0.0436 | 32 | 84 | 6 | 24 |
6 | RMDQ | Bottom | 0.58 | 881 | 0.0661 | 0.0385 | 32 | 84 | 7 | 24 |
7 | RMDQ | Bottom | 0.50 | 756 | 0.0563 | 0.0281 | 32 | 84 | 8 | 24 |
8 | Age | Bottom | 0.47 | 719 | 0.0608 | 0.0289 | 34 | 84 | 8 | 24 |
9 | RMDQ | Bottom | 0.41 | 625 | 0.0608 | 0.0251 | 34 | 84 | 9 | 24 |
10 | RMDQ | Bottom | 0.35 | 537 | 0.0639 | 0.0227 | 34 | 84 | 10 | 24 |
11 | RMDQ | Bottom | 0.31 | 466 | 0.0794 | 0.0244 | 34 | 82 | 11 | 24 |
12 | RMDQ | Bottom | 0.26 | 387 | 0.0728 | 0.0186 | 34 | 82 | 12 | 24 |
13 | RMDQ | Bottom | 0.20 | 304 | 0.0760 | 0.0153 | 34 | 82 | 13 | 24 |
14 | RMDQ | Bottom | 0.15 | 232 | 0.0726 | 0.0111 | 34 | 79 | 14 | 24 |
15 | Age | Bottom | 0.14 | 217 | 0.1041 | 0.0149 | 38 | 79 | 14 | 24 |
16 | Age | Bottom | 0.14 | 206 | 0.1109 | 0.0151 | 39 | 79 | 14 | 24 |
17 | Age | Bottom | 0.13 | 194 | 0.1143 | 0.0146 | 41 | 79 | 14 | 24 |
18 | Age | Bottom | 0.12 | 179 | 0.1168 | 0.0138 | 44 | 79 | 14 | 24 |
19 | Age | Bottom | 0.11 | 170 | 0.1206 | 0.0135 | 44 | 79 | 14 | 24 |
20 | Age | Bottom | 0.11 | 161 | 0.1265 | 0.0134 | 46 | 79 | 14 | 24 |
All interventions versus control: moderators – age, physical component score and Roland–Morris Disability Questionnaire
Figures 43 and 44 illustrate the peeling results for the analysis with age, PCS and RMDQ score. As some trials did not have available PCSs and others did not have RMDQ scores, the sample was restricted to two trials.31,33 The results of the peeling trace are very similar to those of analysis 2 (shown in Table 42). The algorithm chose to peel almost exclusively on RMDQ score and age. PCS was used for the first iteration only. As the algorithm reduced the size of the treatment subgroup, the results showed that generally, older patients with worse (higher) RMDQ scores achieved better QALY gains from treatment. Although PCS was not much used for peeling, as the sample size was reduced participants with higher (better) PCSs were removed from the treatment subgroup; this is unsurprising as RMDQ score and PCS are correlated.
As shown in Table 43, at the point where 19% of the starting sample was left in the treatment subgroup, the subgroup comprised participants aged 44 to 82 years with a RMDQ score of 12 or more and a PCS of between 7 and 49. At this point the treatment subgroup achieved a QALY gain of 0.0981 from treatment. When the whole population was treated, the mean QALY gain was lower at 0.0504. At a £20,000 per QALY cost-effectiveness threshold, the maximum willingness to pay for an intervention yielding these QALY gains would be £1008 and £1962 for the whole population and the refined subgroup, respectively. At £30,000 per QALY, these figures are £1512 and £2943, respectively. However, as there is no measure of uncertainty reflected in these results, they should only be seen as illustrative.
- Older patients with worse physical functioning as measured using the RMDQ score at baseline appear to have a moderately better response to treatment.
Iteration | Moderator | Direction peeled | Proportion in subgroup | ≈n | Incremental QALYs: subgroup | Incremental QALYs: all | Age: minimum | Age: maximum | PCSa: minimum | PCSa: maximum | RMDQ score: minimum | RMDQ score: maximum |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | – | – | 1.00 | 1150 | 0.0504 | | 18 | 85 | 7 | 61 | 0 | 24 |
1 | PCS | Top | 0.95 | 1093 | 0.0534 | –0.0086 | 18 | 85 | 7 | 51 | 0 | 24 |
2 | RMDQ | Bottom | 0.90 | 1034 | 0.0533 | 0.0037 | 18 | 85 | 7 | 51 | 4 | 24 |
3 | RMDQ | Bottom | 0.82 | 941 | 0.0574 | –0.0133 | 19 | 84 | 7 | 51 | 5 | 24 |
4 | Age | Bottom | 0.78 | 894 | 0.0624 | 0.0187 | 29 | 84 | 7 | 51 | 5 | 24 |
5 | Age | Bottom | 0.74 | 850 | 0.0669 | 0.0087 | 32 | 84 | 7 | 51 | 5 | 24 |
6 | RMDQ | Bottom | 0.65 | 748 | 0.0669 | 0.0087 | 32 | 84 | 7 | 51 | 6 | 24 |
7 | RMDQ | Bottom | 0.56 | 648 | 0.0733 | 0.0100 | 32 | 84 | 7 | 51 | 7 | 24 |
8 | RMDQ | Bottom | 0.48 | 554 | 0.0629 | 0.0354 | 32 | 84 | 7 | 51 | 8 | 24 |
9 | RMDQ | Bottom | 0.41 | 472 | 0.0653 | 0.0410 | 32 | 84 | 7 | 51 | 9 | 24 |
10 | RMDQ | Bottom | 0.35 | 397 | 0.0684 | 0.0438 | 32 | 84 | 7 | 49 | 10 | 24 |
11 | Age | Bottom | 0.33 | 378 | 0.0751 | 0.0429 | 35 | 84 | 7 | 49 | 10 | 24 |
12 | RMDQ | Bottom | 0.28 | 321 | 0.0751 | 0.0429 | 35 | 82 | 7 | 49 | 11 | 24 |
13 | Age | Bottom | 0.27 | 305 | 0.0762 | 0.0394 | 38 | 82 | 7 | 49 | 11 | 24 |
14 | Age | Bottom | 0.25 | 290 | 0.0855 | 0.0367 | 40 | 82 | 7 | 49 | 11 | 24 |
15 | Age | Bottom | 0.24 | 276 | 0.0899 | 0.0363 | 42 | 82 | 7 | 49 | 11 | 24 |
16 | Age | Bottom | 0.23 | 263 | 0.0981 | 0.0343 | 44 | 82 | 7 | 49 | 11 | 24 |
17 | RMDQ | Bottom | 0.19 | 213 | 0.0981 | 0.0343 | 44 | 82 | 7 | 49 | 12 | 24 |
Active physical intervention versus control: moderators – age and Roland–Morris Disability Questionnaire
Analysis so far has pooled all treatment modalities and compared these collectively with control. For analysis 9.3.4 (see Table 40) the intervention considered is made up of only active physical interventions, in this case ‘exercise’. The comparator arm is still control. This approach limited the data set to two trials.31,133 Figure 45 shows the peeling trace, with RMDQ score and age included as moderators within the algorithm.
The algorithm peeled almost exclusively based on the RMDQ score. As the algorithm reduced the sample size, patients with lower (better) RMDQ scores were removed, suggesting that patients with worse baseline RMDQ scores achieve better treatment outcomes. At iteration 10, age was peeled on, removing patients who were younger.
As can be seen in Figure 45, improvements in the mean incremental treatment effect for the subgroup were very small; no relevant subgroup could be identified from APT in these analyses.
Passive physical intervention versus control: moderators – age and physical component score
Analysis 9.3.5 (see Table 40) follows the same approach as analysis 9.3.4 (see Table 40); however, in this instance the treatment arm comprised only passive interventions, including manipulation and acupuncture treatments; the comparator remained as a control. These conditions limited the data set to three trials.31,105,132 The peeling algorithm was set to peel based on age and PCS. RMDQ score was not available for all of the trials included in this analysis.
As can be seen in Figure 46, there was very little change in the incremental treatment effect as the algorithm refined the treatment subgroup. No relevant subgroup could be identified correlating age and/or PCS with an above-average treatment effect from passive physical treatment in these analyses.
Active physical interventions compared with passive physical interventions: moderators – age and Roland–Morris Disability Questionnaire
Analysis 9.3.6 (see Table 40) was a comparison of active physical interventions and passive physical interventions. The analysis includes data from two trials. The active treatment was made up of exercise and the passive treatment was made up of manual therapy. For the analysis, passive treatment was considered the reference case for all of the incremental estimates. The peel algorithm was set to refine the subgroup based on the age and RMDQ moderators. The algorithm elected to peel predominantly on the RMDQ score, removing patients with lower (better) RMDQ scores from the treatment group. As can be seen in Figure 47, the incremental effect of changing between these two treatment modalities was near zero. The result of the analysis suggests there is no difference in these two treatment modalities across the whole sample, or for any subgroup explored within the analysis of these data.
Discussion
The application of the peeling algorithm was successful in identifying potentially interesting subgroups for the interventions against control comparison. These subgroups comprised patients who were older, with relatively worse physical functioning at baseline. The gain in treatment effect for the subgroup was small; therefore, given the relatively low cost of the intervention treatment, it is likely to be cost-effective to offer treatment to the whole patient group. The algorithm, however, was not successful in finding any convincing subgroup in the pairwise comparison of active and passive physical treatment. This may be caused by lack of power or simply because there is no subgroup to be found.
The QALY has some key advantages over the other available clinical outcomes. It is a holistic measure of health-related quality of life designed to encompass both physical and mental aspects of a patient’s health state. Constructed using EQ-5D responses over time, the QALY also takes account of a patient’s recovery profile, integrating short- and long-term treatment response into a single measure. The EQ-5D is scored using the UK social tariff, which is validated and standardised allowing direct comparison of the treatment response for different interventions and diseases. The QALY estimated using the EQ-5D tariff is the accepted measure used by NICE for assessing the cost-effectiveness of new treatments for approval in the NHS. The QALY did, however, raise some particular challenges for the analysis. The use of repeated measures to estimate the QALY restricted the size of the sample, as more observations were lost to missing data when compared with the point estimates used in the clinical analysis. This reduced the power of statistical analyses.
The same approach was taken for moderator identification for the economic component of the analysis as for the clinical analyses. Three potential moderators (age, PCS, RMDQ) of treatment response were identified for the economic analysis. However, the relationship of the QALY with the moderators differed in some cases from that of the clinical outcome measures. For the short-term clinical outcome of PCS, the age-by-treatment interaction was found to be negative and significant (p < 0.2), suggesting that younger patients had a better treatment effect. For the outcome of FFbHR, the age-by-treatment interaction was also negative but was just outside the significance threshold of p < 0.2. For the other included clinical outcomes, age was not significant. When the QALY was used as the outcome measure, the age-by-treatment interaction was significant at p < 0.2 but the relationship was positive, indicating that older patients had a better treatment effect. The EQ-5D at short-term follow-up also exhibited a positive relationship with age, although this relationship was not significant. It may not be surprising that the relationship of the moderators with the different outcomes differed, as they measure different aspects of patient health. Furthermore, the QALY differs by construction from the other outcome measures, as it is calculated as the AUC for a sequence of follow-up points. However, it is also possible that the results are susceptible to missing data bias. Patients with missing EQ-5D data at one or more follow-up points were on average 4 years younger than patients with complete EQ-5D data (p < 0.05). One could speculate that younger patients with better expected outcomes might have been excluded from our complete case analysis, as they failed to return follow-up questionnaires. This could bias the treatment response down for younger patients.
Four trials had short-term EQ-5D data, comprising 1774 patients (1271 intervention, 503 control) for which there were complete data. Of the 1774 patients, 1467 (1093 intervention, 374 control) had complete data at all of the EQ-5D follow-up points that were necessary to calculate a QALY estimate. This equates to an additional 17% missing data for QALYs compared with short-term outcomes. This might possibly explain the difference in direction of relationship between age and treatment response by outcome measure, as the short-term measures were less prone to missing data than the QALY.
Chapter 10 Methodology and statistical developments 4: subgroup identification with individual participant data indirect network meta-analysis
Background
The recursive partitioning and adaptive peeling approaches described in our analysis plan, although technically of a high standard, failed to identify clinically useful subgroups for whom treatment choices might be prioritised. We therefore also did an exploratory network meta-analysis (NWMA) to identify groups that may gain the greatest benefit from different treatment choices from a Bayesian, rather than a frequentist, perspective.
Methods
We carried out NWMAs of the repository trials to explore how the optimal choice of treatment for LBP might vary across subgroups. NWMA is an extension of standard pairwise meta-analysis, applicable in situations in which we have multiple treatments and an evidence base of trials that individually provide evidence on different subsets of all possible pairwise treatment combinations. 174 NWMA involves analysing this network as a whole, by assuming consistency across treatment effects, so that a given pairwise comparison B against C can be derived from trials against a common comparator (A vs. B and A vs. C trials), even if no B versus C trials exist. 175 NWMA has become increasingly popular in decision-making contexts because choosing among more than two treatments requires all pairwise treatment effects to be consistent in this way (the true treatment effects in the decision problem will always be consistent176,177). Given its widespread use in Health Technology Assessment, NWMA commonly uses aggregate data, although there are examples illustrating the value of this approach when IPD are available, particularly in understanding participant-level effect modification. 178,179
The standard model for pairwise meta-analysis involving a continuous normally distributed outcome with linear effect modification can be written as equations:
in which yit is the outcome for participant i in trial t, µit is the expected outcome for participant i if he/she had been given the control treatment for that trial, Δit is the expected impact of the treatment participant i received, Iit takes value ‘0’ if participant i is in the control arm of trial t and ‘1’ if they are in the intervention arm, dt is the impact of the intervention for a reference participant, Xit is a vector of covariate values for participant i, X̄ is a vector of covariate values for the reference participant, and βt is a vector of coefficients determining how the effect of the intervention evaluated in trial t varies as a function of the covariates of interest. It is possible to further allow for µit to vary by participants, as shown by Equation 5:
where µt is the expected outcome in the control arm of trial t for the reference participant, and b is a vector of coefficients determining how the control outcome varies as a function of the covariates of interest.
Network meta-analysis extends this analysis by introducing the consistency assumption as shown by Equation 6:
in which d1,j is defined as the treatment effect of any treatment j in the network compared with a reference treatment (such as standard care), and active(t) and control(t) are the active and control treatments in trial t, respectively. The consistency assumption can further be applied to the βt parameters as shown by Equation 7:
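The model described in this section can be sketched in LaTeX as follows. This is a reconstruction from the symbol definitions given in the text, not the authors' typeset equations: the common residual variance σ² and the exact form of the first two (unnumbered) lines are assumptions, and only the labels 5 to 7 are taken from the text.

```latex
\begin{align*}
y_{it} &\sim \mathrm{N}\!\left(\mu_{it} + \Delta_{it},\ \sigma^{2}\right) \\
\Delta_{it} &= I_{it}\left[\, d_{t} + \beta_{t}^{\top}\left(X_{it} - \bar{X}\right) \right] \\
\mu_{it} &= \mu_{t} + b^{\top}\left(X_{it} - \bar{X}\right) \tag{5} \\
d_{t} &= d_{1,\mathrm{active}(t)} - d_{1,\mathrm{control}(t)} \tag{6} \\
\beta_{t} &= \beta_{1,\mathrm{active}(t)} - \beta_{1,\mathrm{control}(t)} \tag{7}
\end{align*}
```

Under this reading, a participant in the control arm has Iit = 0 and so contributes only to the estimation of µt and b, whereas a participant in the intervention arm with covariates equal to the reference profile has an expected treatment effect of exactly dt.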
We carried out three separate NWMAs for the outcomes of interest – short-term change in RMDQ score, short-term change in PCS of SF-12/36 and short-term change in MCS of SF-12/36. All models explore age, sex and baseline PCS/MCS as covariates for both control outcome variation and effect modification. RMDQ models also include baseline RMDQ score for both adjustments. Model estimation involved Bayesian Markov Chain Monte Carlo simulation carried out using WinBUGS 1.4.3 (MRC Biostatistics Unit, Cambridge, UK), using NWMA models that were adapted for IPD analysis from aggregate-data NWMA models that were developed for NICE. 180
Results
Short-term Roland–Morris Disability Questionnaire outcome
Thirteen trials31,33,65,70,76,102–106,131,134,136 (n = 3447) in the repository reported this outcome. The resulting network of evidence is illustrated in Figure 48.
Table 44 gives the predicted treatment effects from the NWMA of these trials for any pairwise comparison of the five treatment classes in the network, assuming a participant profile representing a typical (male) participant. This shows that, for the paradigmatic case of a male aged 50 years, with baseline values of RMDQ = 10, PCS = 40 and MCS = 40, all treatment choices are superior to usual care control treatment. For sham treatment, however, the 95% credible interval around the RMDQ point estimate does include zero. In addition, the differences between any two treatment approaches can be estimated. For example, in this paradigmatic case there does not seem to be a meaningful difference between sham treatment and psychological treatment.
Intervention | Comparator: control | Comparator: active physical | Comparator: passive physical | Comparator: psychological |
---|---|---|---|---|
Active physical | 1.94 (1.17 to 2.72) | | | |
Passive physical | 2.17 (1.39 to 1.95) | 0.23 (–0.61 to 1.07) | | |
Psychological | 1.45 (0.74 to 2.15) | –0.49 (–1.31 to 0.32) | –0.72 (–1.52 to 0.08) | |
Sham | 1.60 (–1.07 to 4.11) | –0.34 (–2.95 to 2.1) | –0.57 (–3.2 to 1.9) | 0.15 (–2.47 to 2.63) |
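The consistency assumption can be checked against the point estimates in Table 44: each comparison between two interventions should equal the difference of their effects against control. The sketch below uses only the first-column point estimates; the dictionary and function names are illustrative, and credible intervals cannot be derived this way because they depend on the joint posterior.

```python
# Point estimates versus control for the reference participant (Table 44, first column).
vs_control = {
    "active physical": 1.94,
    "passive physical": 2.17,
    "psychological": 1.45,
    "sham": 1.60,
}


def indirect_effect(intervention, comparator):
    """Effect of `intervention` versus `comparator` implied by consistency."""
    return round(vs_control[intervention] - vs_control[comparator], 2)


passive_vs_active = indirect_effect("passive physical", "active physical")
psych_vs_passive = indirect_effect("psychological", "passive physical")
sham_vs_psych = indirect_effect("sham", "psychological")
```

The three values computed here reproduce the corresponding off-diagonal point estimates in Table 44 (0.23, –0.72 and 0.15).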
Table 45 presents coefficient values reflecting the degree of effect modification for the participant characteristics of interest. The evidence for effect modification appears strongest for baseline RMDQ score: it is the only characteristic for which the coefficient credible intervals for all three non-sham interventions exclude zero (for sham treatment the interval includes zero). This analysis suggests that each 1-point increase in baseline RMDQ score yields an additional 0.17- to 0.26-point benefit from active treatments and a 0.43-point benefit from sham treatment. However, the 95% credible intervals suggest that the evidence for effect modification related to other covariates is less strong. To quantify the strength of evidence for effect modification, we calculated ‘Bayesian Probabilities of effect modification’ (BP), defined as the greater of two probabilities: that an increase in the characteristic predicts an increase in treatment effect or that it predicts a decrease. A BP of 0.8, for example, suggests that we are 80% sure of the direction in which a change in the characteristic alters the effect of treatment. For RMDQ score, the BPs are all > 0.99 (except for sham, with a BP of 0.92), which is overwhelming evidence that the effect of treatment depends on baseline scores.
Participant characteristics | Active physical | Passive physical | Psychological | Sham |
---|---|---|---|---|
Agea | –0.02 (–0.05 to 0.02); BP = 0.83 | 0.00 (–0.03 to 0.03); BP = 0.60 | –0.02 (–0.05 to 0.01); BP = 0.91 | –0.01 (–0.08 to 0.07); BP = 0.56 |
Sexb | –0.22 (–1 to 0.56); BP = 0.71 | –0.38 (–1.16 to 0.4); BP = 0.83 | –0.01 (–0.78 to 0.77); BP = 0.51 | –1.12 (–2.74 to 0.49); BP = 0.91 |
RMDQa | 0.18 (0.06 to 0.31); BP > 0.99 | 0.26 (0.14 to 0.39); BP > 0.99 | 0.17 (0.05 to 0.29); BP > 0.99 | 0.43 (–0.11 to 0.93); BP = 0.92 |
MCSa | –0.01 (–0.06 to 0.05); BP = 0.59 | 0 (–0.05 to 0.05); BP = 0.51 | 0.03 (–0.03 to 0.08); BP = 0.85 | –0.06 (–0.35 to 0.24); BP = 0.59 |
PCSa | 0.05 (–0.03 to 0.13); BP = 0.89 | 0.04 (–0.04 to 0.12); BP = 0.84 | 0.03 (–0.04 to 0.11); BP = 0.81 | –0.04 (–0.53 to 0.41); BP = 0.52 |
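The BP defined above can be computed directly from posterior draws of an effect-modification coefficient. The following is a minimal sketch under that assumption; the function name and the example draws are hypothetical and do not come from the WinBUGS output of this analysis.

```python
def bayesian_probability(draws):
    """Greater of P(coefficient > 0) and P(coefficient < 0) across posterior draws."""
    if not draws:
        raise ValueError("at least one posterior draw is required")
    n = len(draws)
    p_increase = sum(1 for d in draws if d > 0) / n
    p_decrease = sum(1 for d in draws if d < 0) / n
    return max(p_increase, p_decrease)


# Hypothetical draws: three of four suggest the treatment effect grows with the covariate.
mostly_positive = [0.10, 0.20, -0.10, 0.30]
# Hypothetical draws: four of five suggest the treatment effect shrinks with the covariate.
mostly_negative = [-0.4, -0.1, 0.2, -0.3, -0.2]

bp_positive = bayesian_probability(mostly_positive)
bp_negative = bayesian_probability(mostly_negative)
```

Because the BP takes the larger of the two probabilities, it cannot fall below 0.5 for a continuous posterior, so values near 0.5 (such as those for sex under psychological treatment) indicate no evidence of direction rather than evidence of no effect modification.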
The BPs indicate some, possibly important, differences in benefit by other baseline variables. For example, it is at least 70% likely that men respond more strongly than women to sham treatments and physical treatment but it is equally likely that men respond more or less strongly than women following psychological treatments. On the other hand, baseline MCS has a BP of 85% of positively influencing response to psychological treatments (i.e. those with low levels of psychological distress respond more strongly to psychological treatments than those with high levels of psychological distress), but is almost equally likely to be positively or negatively related to outcomes following physical treatments or sham treatment.
All treatment effects increase, but at different rates, so that the optimal treatment changes as RMDQ score varies. Passive physical therapy is the optimal therapy for the participant as described in Table 45, whose RMDQ score is 10. However, sham therapy becomes the optimal treatment if the RMDQ score increases beyond 14 points, whereas APT becomes optimal if the RMDQ score decreases below 7 points.
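The way the optimal treatment changes with baseline RMDQ score can be sketched from the point estimates alone. The example below assumes a linear effect centred on the reference participant (RMDQ = 10, with all other covariates at their reference values), using the effects versus control from Table 44 and the RMDQ coefficients from Table 45. It ignores posterior uncertainty, and the function names are illustrative.

```python
REFERENCE_RMDQ = 10

# Effect versus control at the reference profile (Table 44, first column).
effect_at_reference = {
    "active physical": 1.94,
    "passive physical": 2.17,
    "psychological": 1.45,
    "sham": 1.60,
}
# Change in treatment effect per 1-point increase in baseline RMDQ (Table 45).
rmdq_slope = {
    "active physical": 0.18,
    "passive physical": 0.26,
    "psychological": 0.17,
    "sham": 0.43,
}


def predicted_effect(treatment, rmdq):
    """Predicted benefit over control at a given baseline RMDQ score."""
    return effect_at_reference[treatment] + rmdq_slope[treatment] * (rmdq - REFERENCE_RMDQ)


def optimal_treatment(rmdq):
    """Treatment with the largest predicted benefit at this baseline RMDQ score."""
    return max(effect_at_reference, key=lambda t: predicted_effect(t, rmdq))


def crossover(a, b):
    """Baseline RMDQ score at which treatments a and b have equal predicted benefit."""
    gap = effect_at_reference[a] - effect_at_reference[b]
    return REFERENCE_RMDQ + gap / (rmdq_slope[b] - rmdq_slope[a])


passive_sham_threshold = crossover("passive physical", "sham")
active_passive_threshold = crossover("active physical", "passive physical")
```

With these point estimates the crossovers fall at roughly 13.4 and 7.1 RMDQ points, close to the thresholds of 14 and 7 quoted above; small differences are to be expected because the reported thresholds come from the full model rather than from rounded point estimates.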
These thresholds depend on values for other effect modifiers, although their influence is less certain. The only other characteristics with a BP of > 0.90 are age (psychological therapy) and sex (sham therapy). There is evidence, albeit inconclusive, that, as age decreases, active physical and psychological therapies are relatively more effective. Figures 49 and 50 show how this relationship can be used to define age–RMDQ zones in which each treatment is optimal. Broadly speaking, passive physical therapy is optimal for older participants with a mild to moderate RMDQ score at baseline, APT is optimal for participants with a low RMDQ score at baseline, and sham therapy is optimal for participants with a high RMDQ score at baseline. If we disregard sham treatments as an inappropriate choice for clinical guidelines, passive physical therapies would be optimal for all but the youngest participants with high RMDQ baseline scores (the division would be determined by extending the active–passive equal line into the right-hand side of the graphs). There are no participant profiles for which no intervention is the optimal treatment.
To quantify the strength of evidence for these optimal zones, we calculated the probability that each treatment is optimal for a representative participant profile in each zone. The results (Table 46) show that there is considerable uncertainty around the optimal treatment: participant profile 1, for example, is in the passive physical optimal zone, but there is a 54% chance that this is not the optimal treatment for this profile. However, suboptimal treatments can be identified with a greater degree of certainty: psychological therapies, for example, are highly unlikely to be optimal for older participants, or those with a high RMDQ score at baseline (i.e. participant profiles 1, 3, 4 and 6).
Participant profile | Active physical (% probability optimal) | Passive physical (% probability optimal) | Psychological (% probability optimal) | Sham (% probability optimal) |
---|---|---|---|---|
1. Male, RMDQ score of 10, age 50 years | 18 | 46 | < 1 | 35 |
2. Male, RMDQ score of 6, age 30 years | 57 | 11 | 19 | 13 |
3. Male, RMDQ score of 16, age 40 years | 8 | 34 | < 1 | 57 |
4. Female, RMDQ score of 14, age 50 years | 11 | 46 | 2 | 41 |
5. Female, RMDQ score of 10, age 30 years | 53 | 14 | 27 | 6 |
6. Female, RMDQ score of 20, age 40 years | 8 | 35 | 2 | 54 |
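Probabilities of this kind are typically obtained by counting, across posterior draws, how often each treatment has the largest predicted effect for a given profile. The sketch below illustrates that counting step only. The means and standard deviations used to simulate draws are hypothetical, and the draws are simulated independently, whereas draws from a fitted NWMA would be correlated across treatments.

```python
import random


def probability_optimal(draws_by_treatment):
    """Share of posterior draws in which each treatment has the largest effect."""
    treatments = list(draws_by_treatment)
    n_draws = len(draws_by_treatment[treatments[0]])
    if any(len(draws_by_treatment[t]) != n_draws for t in treatments):
        raise ValueError("all treatments need the same number of draws")
    wins = {t: 0 for t in treatments}
    for s in range(n_draws):
        best = max(treatments, key=lambda t: draws_by_treatment[t][s])
        wins[best] += 1
    return {t: wins[t] / n_draws for t in treatments}


# Tiny deterministic example: A is best in draws 0 and 2, B in draws 1 and 3.
toy = probability_optimal({"A": [1, 2, 3, 0], "B": [0, 3, 1, 1], "C": [-1, -1, -1, -1]})

# Hypothetical (mean, sd) of predicted effects for one profile; not values from this analysis.
rng = random.Random(1)
hypothetical = {
    "active physical": (2.0, 0.4),
    "passive physical": (2.2, 0.4),
    "psychological": (1.5, 0.4),
    "sham": (1.6, 1.3),
}
simulated = {
    name: [rng.gauss(mean, sd) for _ in range(5000)]
    for name, (mean, sd) in hypothetical.items()
}
probabilities = probability_optimal(simulated)
```

A treatment with a wide posterior, such as sham in this illustration, can have a sizeable probability of being optimal even when its mean effect is not the largest, which is one reason the optimal treatment is identified with less certainty than the suboptimal ones.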
Short-term Short Form questionnaire-12 items/-36 items physical component summary outcome
Nine trials31,33,50,76,101,102,107,132,134 (n = 5574) in the repository reported this outcome. The resulting network of evidence is illustrated in Figure 51.
Table 47 gives the predicted treatment effects from the NWMA of these trials for any pairwise comparison of the five treatment classes in the network, assuming a participant profile representing a typical (male) participant. Table 48 presents coefficient values reflecting the degree of effect modification for the participant characteristics of interest. All characteristics, except for age, have at least one effect modification coefficient with a BP of > 0.95.
Intervention | Comparator: control | Comparator: active physical | Comparator: passive physical | Comparator: psychological |
---|---|---|---|---|
Active physical | 3.93 (2.55 to 5.32) | | | |
Passive physical | 3.16 (2.4 to 3.92) | –0.77 (–2.13 to 0.58) | | |
Psychological | 2.58 (0.85 to 4.29) | –1.36 (–3.36 to 0.63) | –0.58 (–2.33 to 1.18) | |
Sham | 1.64 (–0.03 to 3.32) | –2.29 (–4.33 to –0.25) | –1.52 (–3.18 to 0.15) | –0.93 (–3.23 to 1.38) |
Participant characteristics | Active physical | Passive physical | Psychological | Sham |
---|---|---|---|---|
Agea | 0.02 (–0.05 to 0.08); BP = 0.68 | –0.01 (–0.04 to 0.03); BP = 0.71 | –0.04 (–0.1 to 0.03); BP = 0.87 | 0.00 (–0.06 to 0.06); BP = 0.52 |
Sexb | 0.25 (–1.25 to 1.75); BP = 0.63 | 0.95 (0.04 to 1.87); BP = 0.98 | 0.29 (–1.43 to 2.01); BP = 0.63 | 1.55 (–0.15 to 3.23); BP = 0.96 |
MCS 0a | –0.01 (–0.07 to 0.06); BP = 0.59 | 0.01 (–0.02 to 0.05); BP = 0.76 | 0.03 (–0.04 to 0.11); BP = 0.80 | –0.07 (–0.14 to 0.00); BP = 0.97 |
PCS 0a | –0.05 (–0.15 to 0.05); BP = 0.85 | –0.07 (–0.13 to –0.02); BP > 0.99 | –0.03 (–0.13 to 0.06); BP = 0.76 | –0.10 (–0.22 to 0.02); BP = 0.95 |
Figures 52 and 53 show how effect modification can be used to define PCS/MCS zones in which each treatment is optimal with short-term PCS as the outcome of interest. Broadly speaking, passive physical therapy is optimal for participants with low PCSs and high MCSs, whereas APT is optimal for participants with high PCSs and low MCSs. Sham appears optimal for participants with low PCSs and MCSs at baseline. If we disregard sham as a valid optimal treatment, the optimal non-sham treatment zones can be identified by extending the active–passive equal line, as with the RMDQ-based zones. Again, there are no participant profiles for which no intervention is optimal.
To quantify the strength of evidence for these optimal zones, we calculated the probability that each treatment is optimal for a representative participant profile in each zone.
The results (Table 49) show that, as with the RMDQ score, there is greater certainty around which treatments are suboptimal than around which treatments are optimal. For the paradigmatic cases in Figures 52 and 53, it is unlikely that psychological treatments would be the best choice for either gender, but there is a clear indication that there might be differences in proportions who might benefit from active or passive physical treatments if PCS/MCS and sex were the only parameters used for decision-making.
Participant profile | Active physical (% probability optimal) | Passive physical (% probability optimal) | Psychological (% probability optimal) | Sham (% probability optimal) |
---|---|---|---|---|
1. Male, MCS 40 and PCS 40 | 81 | 11 | 7 | < 1 |
2. Male, MCS 70 and PCS 20 | 42 | 43 | 15 | < 1 |
3. Female, MCS 30 and PCS 50 | 55 | 18 | 6 | 21 |
4. Female, MCS 60 and PCS 30 | 23 | 68 | 9 | < 1 |
5. Female, MCS 20 and PCS 20 | 20 | 11 | 1 | 68 |
Short-term Short Form questionnaire-12 items/-36 items mental component score outcome
The network of evidence for this outcome is the same as for the SF-12/36 PCS. Table 50 gives the predicted treatment effects from the NWMA of these trials for any pairwise comparison of the five treatment classes in the network, assuming a participant profile representing a typical (male) participant. Table 51 presents coefficient values reflecting the degree of effect modification for the participant characteristics of interest. All characteristics, except for sex, have at least one effect modification coefficient with a BP of > 0.95. It is, perhaps, worth noting here that, for short-term MCS as an outcome, passive physical therapy has the largest effect size for our paradigmatic case. At least for the comparison with active physical therapy, the 95% credible interval does not cross zero.
Intervention | Comparator: control | Comparator: active physical | Comparator: passive physical | Comparator: psychological |
---|---|---|---|---|
Active physical | 1.53 (0.04 to 3.02) | | | |
Passive physical | 3.04 (2.23 to 3.85) | 1.50 (0.05 to 2.96) | | |
Psychological | 2.59 (0.80 to 4.39) | 1.06 (–1.04 to 3.17) | –0.44 (–2.26 to 1.39) | |
Sham | 2.13 (0.44 to 3.82) | 0.60 (–1.53 to 2.73) | –0.90 (–2.59 to 0.79) | –0.46 (–2.83 to 1.90) |
Participant characteristics | Active physical | Passive physical | Psychological | Sham |
---|---|---|---|---|
Agea | –0.02 (–0.09 to 0.05); BP = 0.74 | –0.03 (–0.07 to 0.01); BP = 0.93 | 0.00 (–0.06 to 0.07); BP = 0.53 | –0.09 (–0.15 to –0.03); BP > 0.99 |
Sexb | 0.36 (–1.23 to 1.96); BP = 0.67 | –0.20 (–1.18 to 0.78); BP = 0.66 | –0.47 (–2.26 to 1.34); BP = 0.70 | 0.73 (–0.99 to 2.44); BP = 0.63 |
MCSa | –0.06 (–0.13 to 0.01); BP = 0.97 | –0.10 (–0.14 to –0.06); BP > 0.99 | –0.05 (–0.13 to 0.03); BP > 0.90 | –0.17 (–0.24 to –0.09); BP > 0.99 |
PCSa | –0.03 (–0.13 to 0.08); BP = 0.68 | –0.08 (–0.14 to –0.02); BP > 0.99 | 0.05 (–0.04 to 0.15); BP > 0.86 | –0.15 (–0.27 to –0.03); BP > 0.99 |
Figures 54 and 55 show how effect modification can be used to define PCS/MCS zones in which each treatment is optimal. Broadly speaking, psychological therapy is optimal for participants with high PCSs (low levels of disability) and moderate to high MCSs (low levels of psychological distress). Passive physical therapy is optimal for participants with low PCSs and high MCSs, and sham therapy is optimal for participants with low PCSs and MCSs (high disability and high levels of psychological distress). If we disregard sham as a feasible recommendation, passive physical therapy becomes optimal for these participants (there are no participant profiles for which no intervention is optimal). To quantify the strength of evidence for these optimal zones, we calculated the probability that each treatment is optimal for a representative participant profile in each zone. The results (Table 52) show that, as with the RMDQ score, there is greater certainty around which treatments are suboptimal than around which treatments are optimal. However, the evidence for effect modification appears strongest on this outcome. It is perhaps of note that for some participant groups (those with high disability and high levels of psychological distress) it appears that sham treatment is highly likely to be the most effective option.
Participant profile | Active physical (% probability optimal) | Passive physical (% probability optimal) | Psychological (% probability optimal) | Sham (% probability optimal) |
---|---|---|---|---|
1. Male, MCS 60 and PCS 60 | 6 | < 1 | 91 | < 1 |
2. Male, MCS 70 and PCS 20 | 11 | 65 | 13 | 10 |
3. Male, MCS 30 and PCS 30 | < 1 | 31 | < 1 | 68 |
4. Female, MCS 60 and PCS 60 | 12 | < 1 | 82 | < 1 |
5. Female, MCS 80 and PCS 20 | 26 | 32 | 15 | 11 |
6. Female, MCS 80 and PCS 20 | < 1 | 13 | < 1 | 87 |
Chapter 11 Discussion
Introduction
This work is grounded in the pressing need to improve the outcomes for people living with LBP. The targeting of treatments of proven but modest average effectiveness at those who are likely to gain the greatest benefit holds promise. The driver for this research is the considerable uncertainty over which patients are most likely to benefit from which treatment strategy. In particular, it is unclear how individual patient factors, including duration and severity of the back pain, and physical, social and psychological factors, might affect both adherence and treatment response. Improved matching of patients to individual treatments has the potential to improve the overall health gain from, and cost-effectiveness of, treatments for LBP. There is much published work on predictors of poor outcome for people with LBP, for example the psychosocial ‘yellow flags’181 or the STarT Back tool. 182 None of this work has, however, addressed how these risk factors affect response to treatment. Without explicitly addressing whether a particular patient characteristic moderates treatment outcome, targeting treatments at those who are perceived to be at high risk may not be an appropriate choice. During this programme of work we have explored in considerable detail, in two systematic reviews, what is already known about identifying subgroups of people with LBP. This work has demonstrated that the existing work to identify subgroups of patients with LBP within RCTs is generally of poor methodological quality, and even the high-quality studies do not present evidence to support treatment choices at an individual patient level. Importantly, in this work we have moved beyond using data from single trials and the use of single parameters to define subgroups. A large part of this work has been very technical, focusing on how best to address the challenge of pooling very complex data sets and how best to define subgroups using multiple parameters.
To do this we made a series of methodological developments, including three novel methods for subgroup identification: two algorithmic approaches (recursive partitioning, and adaptive risk group refinement) and individual participant data indirect NWMA.
Within the limits of the data that were suitable for pooled analysis, we have identified exploratory subgroups of people who might gain a greater benefit from different treatment approaches in a consistent manner. Interestingly, the groups that we identified as possibly gaining greater benefit from therapist-delivered interventions rather than usual care were typically the converse of expectations. So far as the evidence goes, it seems that younger people with less psychological distress are likely to gain the greatest benefit from these treatments. Although the findings are not strong enough to support these as parameters to prioritise treatment, they do challenge conventional wisdom that people with psychological distress should be targeted for treatment.
Summary of key findings
Systematic reviews (see Chapter 2)
Notwithstanding the perceived importance of performing research to identify subgroups of people living with chronic LBP, there is a paucity of high-quality research in this area. We have identified that nearly all papers reporting analyses of subgroup effects provide no more than exploratory evidence, and that only one study reporting treatment moderation was adequately powered for this analysis. Although it is the identification of differential subgroup effects that is of interest, we failed to identify any robust research that considered subgroups defined by multiple parameters. Rather, we found studies that tested the effect of single potential effect moderators. We have previously found that the available data do not support the use of clinical prediction rules in the management of LBP. 93
Age, employment status, education level, back pain status, narcotic use and treatment expectations each moderated treatment effect (p < 0.05) in one or more studies. The exploratory nature of nearly all of the comparisons, the inconsistent findings across the four included studies and the large number of comparisons made mean that these findings cannot, in themselves, be used to inform management. Notwithstanding the limitations of the existing research, we were able to identify some potential moderators to include in our final analyses. The overall weakness of the underpinning data meant that we included potential moderators in our analyses that did not meet conventional criteria for statistical significance. By including moderators found to be significant at the 20% level, our pool of potential moderators became age, gender, employment status, education, back pain status, pain-related disability, narcotic use, treatment expectations, quality of life and psychosocial status.
Analyses of covariance (see Chapter 6)
Our ANCOVAs replicate the conventional approach to moderator identification in a pooled data set. The main purpose of these analyses was to inform selection of potential moderators for our main analysis based on identifying variables significant at the 20% level. In our analyses, we were restricted by the pool of trials using a common set of baseline covariates and outcomes. In this analysis, comparing all intervention groups with all control groups (‘non-active usual care plus sham’ for clinical outcomes or ‘usual care’ for health-economic outcomes), we identified some moderators that reached conventional statistical significance for some outcomes. In summary, these data suggest that those who are worse on a measure of physical function (FFbHR/SF-12/36 PCS) have the most to gain from treatment on physical outcomes, and those who are worse on the SF-12/36 MCS at baseline gain the most on this outcome measure. For the outcome of EQ-5D, its baseline value did not moderate treatment response, but pain, physical function (SF-12/36 PCS) and anxiety, which are arguably components of the EQ-5D, did moderate response. The exception to the observation that it is severity at baseline that predicts response to treatment on that measure is that a less favourable baseline FFbHR score moderates outcome on the SF-12/36 MCS. Anxiety, but not catastrophising, coping strategies or depression, moderated treatment response at p < 0.05 in the analyses for the outcome of EQ-5D: those with a lower risk of anxiety had less treatment effect than those with a higher risk of anxiety. This is the first meta-analysis to assess effect moderation in the treatment of LBP and hence gives a far more robust assessment than any previous work in this area. The numbers in our analyses mean that if there were true moderation effects in this comparison of all treatments against control then they should have been identified.
Although these observations are of some interest, the main purpose of these analyses was to select potential moderators that were significant at the 20% level to take forward for our main analyses. We were able to take forward FFbHR, RMDQ, SF-12/36 PCS and MCS, age, gender, pain, fear avoidance and coping as variables with a possible signal in one or more analyses.
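The quantity tested in a moderator analysis of this kind is the treatment-by-covariate interaction. As a minimal sketch, and not the programme's actual ANCOVA specification (which pooled trials and adjusted for further covariates), the interaction coefficient in a model with an intercept, one baseline covariate, a treatment indicator and their product equals the difference between the within-arm regression slopes. The function names and the noise-free data below are hypothetical.

```python
def ols_slope(x, y):
    """Slope of a simple least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def interaction_coefficient(baseline, outcome, treated):
    """Treatment-by-baseline interaction: treated-arm slope minus control-arm slope."""
    control = [(b, o) for b, o, t in zip(baseline, outcome, treated) if t == 0]
    active = [(b, o) for b, o, t in zip(baseline, outcome, treated) if t == 1]
    slope_control = ols_slope([b for b, _ in control], [o for _, o in control])
    slope_active = ols_slope([b for b, _ in active], [o for _, o in active])
    return slope_active - slope_control


# Hypothetical noise-free data: the benefit of treatment grows by 0.2 per baseline point.
baseline = [4, 8, 12, 16, 4, 8, 12, 16]
treated = [0, 0, 0, 0, 1, 1, 1, 1]
outcome = [2 + 0.5 * b + t * (1 + 0.2 * (b - 10)) for b, t in zip(baseline, treated)]

estimate = interaction_coefficient(baseline, outcome, treated)
```

Selecting moderators at the 20% level additionally requires a standard error and p-value for this coefficient, which the sketch does not compute.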
Recursive partitioning (see Chapter 7)
We successfully adapted two recursive partitioning approaches to identify subgroups in an individual participant data meta-analysis. There are important distinctions in the way they work. The IPD-IT method is seeking to maximise the size of the interaction term when making splits, whereas the IPD-SIDES method is seeking to detect groups with the largest treatment effects. 167 The choice of approach in any future analyses using a recursive partitioning approach will depend on the primary outcome of interest. For our current purpose we prefer the IPD-SIDES approach, as we think it is more likely to identify clinically useful subgroups with large effect sizes. The IPD-IT approach may be more suitable for more exploratory analyses for which maximising any moderation is the outcome of interest. We have presented both analyses here to explore how they perform on a real data set. The IPD-SIDES approach appears to be more sensitive, as it has successfully identified some subgroups within our data, whereas the IPD-IT method did not (Tables 53–57 and see Chapter 7). Our overall analysis of all interventions compared with control (usual care or sham control) provides evidence that the IPD-SIDES method functions well; we found candidate subgroups in a real data set, as well as in the simulation in which it was originally tested. For the choice of treatment compared with control (sham plus usual care) using the full data set, there are some clusters of characteristics with different treatment outcomes. For example, for the outcome FFbHR (score range from 0 = great limitation to 100 = no limitation), the overall treatment effect of 8.93 (95% CI 7.81 to 10.05) increases to 13.17 (95% CI 10.56 to 15.77) in those with a FFbHR score of ≤ 54.2 and aged ≤ 60 years. Similarly, for the SF-12/36 PCS (range 0–100, with 100 best), the overall treatment effect increases from 3.48 (95% CI 3.01 to 3.96) to 4.89 (95% CI 3.96 to 5.82) in those with a SF-12/36 PCS of ≤ 40.0 and a SF-12/36 MCS of > 54.2.
It is, however, the pairwise comparisons with usual care control that might be usable to inform clinical practice.
METHOD (section) / OUTCOMEa | Physical health: FFbHR | Physical health: RMDQ | Physical health: PCS | Pain: average pain | Mental health: MCS | Quality of life: EQ-5D | Quality of life: QALYb |
---|---|---|---|---|---|---|---|
ANCOVA (see Chapter 6, Analyses of covariance) |
Positive moderator | None found | Moderate catastrophising;c positive fear avoidancec | None found | Pain;c MCS;c moderate fear avoidancec | None found | Female;c RMDQd pain;d moderate fear avoidanced | Age;c RMDQc |
Negative moderators | Age;c FFbHR;d PCSd | None found | Age;c PCS;d MCS < 50c | PCS;c low anxiety;c positive copingc | FFbHR;d MCSd | PCS;d MCS;c low/moderate anxietyd | PCSc |
Recursive partitioning IPD-SIDES See Chapter 7 (Subgroups identified by the IPD-SIDES method) |
Subgroups | Younger with worse FFbHR | None found | | None found | Worse MCS | None found | None found |
Directed searche See Chapter 8 (Analysis 1: Overall comparison treatment compared with control) and Chapter 9 (Results) |
Subgroups | | None found | None found | None found | Younger with worse MCS | None found | |
METHOD (section) / OUTCOMEa | Physical health: FFbHR | Physical health: RMDQ | Physical health: PCS | Pain: average pain | Mental health: MCS | Quality of life: EQ-5D | Quality of life: QALYb |
---|---|---|---|---|---|---|---|
Recursive partitioning IPD-SIDES See Chapter 7 (Subgroups identified by the IPD-SIDES method) |
Subgroups | None found | None found | None found | None found | None found | None found | None found |
Directed searchc See Chapter 8 (Analysis 1: Overall comparison treatment compared with control) and Chapter 9 (Results) |
Subgroups | None found | None found | None found | None found | None found | None found | None found |
NWMAd See Chapter 10 (Results) |
Positive moderators | Not conducted | RMDQ; PCS | None found | Not conducted | None found | Not conducted | Not conducted |
Negative moderators | Not conducted | Age | PCS | Not conducted | MCS | Not conducted | Not conducted |
METHOD (section) / OUTCOMEa | Physical health: FFbHR | Physical health: RMDQ | Physical health: PCS | Pain: average pain | Mental health: MCS | Quality of life: EQ-5D | Quality of life: QALYb |
---|---|---|---|---|---|---|---|
Recursive partitioning IPD-SIDES See Chapter 7 (Subgroups identified by the IPD-SIDES method) |
Subgroups | Younger with worse FFbHR | None found | | None found | Worse MCS and worse PCS | None found | None found |
Directed searchc See Chapter 8 (Analysis 1: Overall comparison treatment compared with control) and Chapter 9 (Results) |
Subgroups | Younger with worse FFbHR | Not conducted | None found | Not conducted | Younger with worse PCS and worse MCS | Not conducted | None found |
NWMAd See Chapter 10 (Results) |
Positive moderators | Not conducted | Men; RMDQ; PCS | Women | Not conducted | None found | Not conducted | Not conducted |
Negative moderators | Not conducted | None found | PCS | Not conducted | Age; PCS; MCS | Not conducted | Not conducted |
METHOD (section) / OUTCOMEa | Physical health: FFbHR | Physical health: RMDQ | Physical health: PCS | Pain: average pain | Mental health: MCS | Quality of life: EQ-5D | Quality of life: QALYb |
---|---|---|---|---|---|---|---|
Recursive partitioning IPD-SIDES See Chapter 7 (Subgroups identified by the IPD-SIDES method) |
Subgroups | None found | Worse RMDQ | None found | None found | None found | None found | None found |
Directed searchc See Chapter 8 (Analysis 1: Overall comparison treatment compared with control) and Chapter 9 (Results) |
Subgroups | Not conducted | None found | Not conducted | Not conducted | Not conducted | Not conducted | Not conducted |
NWMAd See Chapter 10 (Results) |
Positive moderators | Not conducted | RMDQ; PCS; MCS | MCS | Not conducted | PCS | Not conducted | Not conducted |
Negative moderators | Not conducted | Age | Age | Not conducted | MCS | Not conducted | Not conducted |
METHOD (section) / OUTCOMEa | Physical health: FFbHR | Physical health: RMDQ | Physical health: PCS | Pain: average pain | Mental health: MCS | Quality of life: EQ-5D | Quality of life: QALY |
---|---|---|---|---|---|---|---|
Recursive partitioning IPD-SIDES See Chapter 7 (Subgroups identified by the IPD-SIDES method) |
Subgroups | None found | None found | None found | None found | Younger with worse PCS | None found | None found |
Directed searchb See Chapter 8 (Analysis 1: Overall comparison treatment compared with control) and Chapter 9 (Results) |
Subgroups | Younger with either worse FFbHR or PCS | Not conducted | Not conducted | Not conducted | Any age; worse PCS; worse MCS | Not conducted | Not conducted |
NWMAc See Chapter 10 (Results) |
Positive moderator | Not conducted | Men RMDQ | Women | Not conducted | None found | Not conducted | Not conducted |
Negative moderator | Not conducted | None found | MCS; PCS | Not conducted | Age; PCS; MCS | Not conducted | Not conducted |
Passive physical therapy
For passive physical therapy we identified subgroups for the outcomes of FFbHR, plus SF-12/36 MCS/PCS. The results for FFbHR, which represent just acupuncture trials, find a maximal effect of 16.67 (95% CI 13.16 to 20.18) compared with an overall treatment effect of 9.95 (95% CI 8.80 to 11.11) in those aged ≤ 53 years and with a FFbHR score of ≤ 54.2. Thus acupuncture is likely to be more effec