Illness severity scoring systems such as the Acute Physiological and Chronic Health Evaluation (APACHE) have become important tools for the evaluation and planning of intensive care practice patterns. These systems objectively estimate patient risk of mortality from acute physiological and chronic health status. They are not, however, tools for deciding treatment for individual patients; they are group measurements used for patients who have similar disease processes. Their origin in the late 1970s and early 1980s was driven by the need to relate such practice patterns to patient outcomes.

In the modern setting, tools such as the APACHE scoring system allow researchers and clinicians to quantify patient illness severity with a greater degree of accuracy and precision, which is essential for benchmarking and programme evaluation. The interest in illness severity scoring systems is evidenced by the extensive body of literature that continues to advance both the technical aspects of the systems themselves and the applications for which they are used.

Middlemore Hospital was one of the earlier facilities in New Zealand to implement APACHE II scoring in clinical settings. The routine scoring of patients began in the Intensive Care Unit (ICU) in 1986. This advance was facilitated in no small part by the availability of one of the developers of the APACHE system, Dr Jack Zimmerman, who spent an extended sabbatical in New Zealand, some of which was at Middlemore Hospital.

Despite enthusiastic support for APACHE II scoring by international opinion leaders at the time, its relevance and utility in New Zealand have been questioned from an early stage. The limited external validity of the system in a population so different from that in which it was developed was acknowledged by Zimmerman et al:1

The NZ hospitals designated 1.7% of their total beds for intensive care compared to 5.6% in the US hospitals. The average age for NZ admissions was 42 compared to 55 in the US (p<0.0001). The NZ ICUs admitted fewer patients with severe chronic failing health (NZ 8.7%, US 18%) and following elective surgery (NZ 8%, US 40%). Approximately half the NZ admissions were for trauma, drug overdose, and asthma, while these diagnoses accounted for 11% of US admissions. When controlled for differences in casemix and severity of illness, hospital mortality rates in NZ were comparable to the US. This study demonstrates substantial differences in patient selection between these US and NZ ICUs.

Furthermore, after more than two decades of use, it is unclear whether the performance of the APACHE II system has been maintained. Patient casemix in New Zealand has changed from earlier times, and the improvements in supportive care that are now available may have decreased mortality for any given illness severity. International opinion leaders are in general moving towards more recently developed scoring systems such as APACHE versions III and IV, which have been shown to outperform older versions in studies of North American and European ICU populations.2 This paradigm is slowly translating to clinical practice in this part of the world: the Australian and New Zealand Intensive Care Society adult patient database now collects data sufficient to model both APACHE II and III scores.3

There are three aims of this study. We aim to:

1. Assess change in APACHE II scores and hospital standardised mortality ratio at our ICU over a 9-year period from 1 January 1997 to 31 December 2005;
2. Assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and
3. Assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise.

Methods

Study population and setting—Middlemore Hospital is the main hospital within the Counties Manukau District Health Board (CMDHB). The hospital serves a large urban population.
The district catchment includes Manukau City, which is rapidly expanding: the population has grown from 356,006 in 1996 to 454,655 at the last census in 2006. The population can be summarily characterised as young, multi-ethnic, and of low socioeconomic status compared with the rest of New Zealand.4 Middlemore Hospital is a tertiary referral centre for plastic surgery, burns, orthopaedics, and a range of medical sub-specialities. Any patient requiring neurosurgical or cardiothoracic surgical intervention is referred on to Auckland City Hospital, as Middlemore Hospital does not have these facilities; all other patient categories remain at Middlemore Hospital. Although there is a specialist regional paediatric hospital in the area, Middlemore Hospital is also a paediatric hospital; the Middlemore Hospital ICU therefore cares for children down to 2 kg in weight who require intensive care, accounting for approximately 120 paediatric admissions per year. The hospital is academically affiliated and thus a teaching institution. Middlemore Hospital has had between 700 and 900 acute beds over the time in which this research was done, and now also includes a satellite surgical centre which caters for the majority of elective cases apart from those that are particularly high risk. Currently, the Middlemore ICU is nominally a seven-bed funded Level 3 facility.

Since the inception of the Middlemore Hospital ICU in the late 1960s, the unit has been structurally modified on several occasions. As a result of both national and local changes in healthcare strategy, the unit has at times had nominated HDU beds, and at other times not. Since 2004, there has been a four-bed funded Level I intensive care unit at a satellite surgical centre, which currently shares clinical governance, staff, policies, and procedures with the main ICU at Middlemore Hospital. These patients were not included in this study.
Data source—All data were sourced from a single-centre relational database that has been in continuous use at the Middlemore Hospital ICU since January 1986. The database contains information on all patients admitted to the ICU during this period, using data that are prospectively collected, collated, and agreed upon by senior specialists and the charge nurse at the time. Data collection was progressively expanded during this period to ultimately include demographic information, APACHE II score, diagnostic information, ventilatory and inotropic support, procedures performed, and patient outcome. Patients who were less than 15 years of age, or who had been admitted solely for the purpose of a procedure such as a difficult central venous line or endoscopy, were not scored, as the system was not devised for these groups. The database specifically records patient death at both ICU and hospital discharge.

The database includes locally developed diagnostic codes ("adclasses" and "subclasses") in addition to the APACHE II ones, which were developed to better reflect and discriminate disease categories related to the local population (see Appendix). Generic APACHE II diagnostic codes do not necessarily provide a realistic reflection of the local disease categories and population outcomes. They can be 'localised' by adjustments to either the disease categorisation and/or the category weights subsequently used with the APACHE II scores for calculating risk of death, an approach supported in the case of Middlemore Hospital by Zimmerman et al, who emphasised differences between North American and New Zealand ICU patient populations.1 Data were prospectively stored in Microsoft Access (Microsoft Corporation, Redmond, WA, USA), and retrospectively abstracted for analyses over a 9-year period from 1 January 1997 to 31 December 2005.
Calculation of APACHE II scores and risk of death—All APACHE II scores and risks of death were calculated at patient hospital discharge using the prospectively stored data and the logistic regression equation developed by Knaus et al.5 The data for calculation of the APACHE II score included physiological measurements in the first 24 hours of ICU admission, age, and chronic health status. The APACHE II risk of death is calculated not only from scores but also from diagnostic categories, which were rigorously and continuously evaluated by the senior ICU medical staff during the process of prospective data collection. Such minimisation of misclassification was necessary to avoid error arising from the heavy reliance of the APACHE II risk of death formula on the reason for ICU admission.

Statistics—Standard statistics were used to describe data, making particular use of the median and interquartile range to avoid assumptions around data distribution. Hypothesis testing was undertaken using the Kruskal-Wallis equality-of-populations rank test for continuous variables, and Pearson's Chi-squared test for categorical ones. Risk-adjusted mortality by year was assessed by hospital standardised mortality ratios and 95% confidence intervals (regarding observed mortality as a binomial variable), which were obtained by dividing the number of observed hospital deaths in each year by the number predicted using the APACHE II system.6 Overall predictive performance of the APACHE II scoring system by year was gauged through discrimination (the ability to discriminate between patients who will die or survive at hospital discharge) and calibration (the ability to predict the mortality rate over classes of risk).
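The risk-of-death and standardised mortality ratio calculations described above can be sketched in code. This is a minimal illustration only, not the study's actual analysis scripts: the shape of the logistic equation follows the published APACHE II formula of Knaus et al,5 the confidence interval treats observed deaths as binomial as in Morris and Gardner,6 and the diagnostic category weight is left as an input because the locally adjusted weights are not reproduced here.

```python
import math

def apache2_risk(score, category_weight, emergency_postop=False):
    """Predicted hospital mortality from the APACHE II logistic equation
    (after Knaus et al): ln(R/(1-R)) = -3.517 + 0.146*score
    + 0.603 (emergency post-operative admissions only)
    + diagnostic category weight (supplied by the caller here)."""
    logit = (-3.517 + 0.146 * score
             + (0.603 if emergency_postop else 0.0)
             + category_weight)
    return 1.0 / (1.0 + math.exp(-logit))

def smr_with_ci(observed_deaths, predicted_risks, z=1.96):
    """Hospital standardised mortality ratio: observed hospital deaths
    divided by the sum of predicted risks, with an approximate 95% CI
    obtained by treating the observed deaths as a binomial variable."""
    n = len(predicted_risks)
    expected = sum(predicted_risks)
    p = observed_deaths / n                 # observed mortality rate
    se = math.sqrt(p * (1.0 - p) / n)       # binomial standard error
    lower = (p - z * se) * n / expected
    upper = (p + z * se) * n / expected
    return observed_deaths / expected, (lower, upper)
```

For example, a year with 10 observed deaths among 100 patients whose predicted risks sum to 10 gives an SMR of 1.0, with the confidence interval width reflecting the small number of events.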
Discrimination was assessed using receiver operating characteristic (ROC) curves, which plot the true positive rate (sensitivity: the proportion of observed hospital deaths correctly predicted) against the false positive rate (1-specificity: the proportion of observed hospital survivors incorrectly predicted to die). The predictive performance is indicated in this method by the ROC area under the curve (AUC), with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination. The slope of the curve indicates the ratio of true positives to false positives, which is also known as the likelihood ratio.7 For the analyses in this article, the equality of the ROC AUC for each year of study was compared.8

Calibration was assessed using the correspondence between the number of observed hospital deaths and the number of predicted hospital deaths within each 10% stratum (decile) of the cohort's expected risk of death. The predictive performance is indicated in this method by goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.9 The predictive performance of the APACHE II scoring system in major clinical subgroups was assessed using hospital standardised mortality ratios within each of the major "adclasses". All analyses were performed using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) and Intercooled Stata 9.2 (StataCorp, College Station, TX, USA) software.

Ethics—The need for formal approval for the research process was waived by the National (New Zealand) Health and Disability Ethics Committee under the provisions made for clinical audit.

Results

Data from 7703 patients were available for analysis. Baseline patient characteristics are presented in Table 1. The number of patients admitted to the ICU increased steadily from 686 in 1997 to 730 in 2005. The demographic characteristics of patients changed over the period of observation, with a trend to older and more Māori patients.
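The discrimination and calibration measures described in the Methods can be sketched in plain Python. This is a hedged illustration rather than the Stata routines actually used: the AUC is computed through its rank (Mann-Whitney) interpretation, and the Hosmer-Lemeshow statistic over strata of predicted risk.

```python
def roc_auc(risks, outcomes):
    """ROC AUC via the Mann-Whitney interpretation: the probability that
    a randomly chosen death was assigned a higher predicted risk than a
    randomly chosen survivor (ties count one half)."""
    deaths = [r for r, died in zip(risks, outcomes) if died]
    survivors = [r for r, died in zip(risks, outcomes) if not died]
    wins = sum(1.0 if d > s else 0.5 if d == s else 0.0
               for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))

def hosmer_lemeshow(risks, outcomes, groups=10):
    """Hosmer-Lemeshow chi-squared over strata of predicted risk
    (deciles by default); the statistic is compared against a
    chi-squared distribution with groups - 2 degrees of freedom."""
    pairs = sorted(zip(risks, outcomes))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        stratum = pairs[g * n // groups:(g + 1) * n // groups]
        if not stratum:
            continue
        observed = sum(died for _, died in stratum)  # observed deaths
        expected = sum(r for r, _ in stratum)        # predicted deaths
        m = len(stratum)
        variance = expected * (1.0 - expected / m)
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    return stat
```

A perfectly discriminating model yields an AUC of 1.0; a well calibrated one yields a small Hosmer-Lemeshow statistic (and hence a large P value).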
There has also been a change in the casemix of patients, with a reduction in the number of patients with diagnoses of poisoning and trauma, and an increase in the number of patients admitted after elective or emergency surgery. Patient length of stay has progressively reduced, as has the proportion of patients requiring mechanical ventilation. Overall hospital mortality decreased from approximately 19% at the beginning of the period of observation to approximately 12% at the end.

Figure 1. APACHE II scores and risk scores by year, presented as boxplots
Note: In these plots, the middle horizontal line represents the median; the box the second and third quartiles; and the whiskers the upper and lower extreme values, which are no more than 1.5 × the interquartile range beyond the middle quartiles.

Figure 2. Hospital-standardised mortality ratio and 95% confidence intervals, by year

The APACHE II score decreased marginally over the period of observation, as illustrated in Figure 1, with a median value of 14 in 1997 (IQR 9–21) and a corresponding value of 13 in 2005 (IQR 9–21). Although this reduction did achieve statistical significance (p=0.0001), it cannot be regarded as clinically important. The APACHE II predicted risk of death remained stable over the period of observation, with a minor trend to reduction that did not achieve statistical significance (p=0.11). The hospital-standardised mortality ratio decreased over the period of observation, as illustrated in Figure 2, with a value of 0.94 (95% confidence interval 0.82–1.06) in 1997 and a corresponding value of 0.66 (95% confidence interval 0.55–0.76) in 2005.

Model adequacy for discrimination by APACHE II score is illustrated by year in Figures 3 and 4. In general, the APACHE II score performs adequately in each year, with ROC curve AUCs of >0.8. However, there is deteriorating accuracy of mortality predictions over time (otherwise known as 'model fade'10) that approaches statistical significance.
Corresponding model adequacy for discrimination by APACHE II predicted risk of death is illustrated in Figures 5 and 6. The risk model performs similarly to the APACHE II score, showing a like degree of 'model fade'.

Figure 3. ROC curves for APACHE II score, by year
Note: The predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 4. ROC curve AUC (95%CI) for APACHE II score, by year, as shown in Figure 3
Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 5. ROC curves for the APACHE II risk score, by year
Note: The predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 6. ROC curve AUC (95%CI) for APACHE II risk score, by year, as shown in Figure 5
Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 7. Calibration curves for APACHE II predicted risk of death, by year, showing the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death. Predictive performance is assessed by the Hosmer-Lemeshow statistic (see Table 2)

Table 2. Model adequacy for calibration by APACHE II predicted risk of death, by year, as indicated by the Hosmer-Lemeshow goodness-of-fit statistic for each of the calibration curves in Figure 7. A high Hosmer-Lemeshow statistic and a P value <0.05 indicate poor correspondence between the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death

Year | Hosmer-Lemeshow goodness-of-fit statistic (P value)
1997 | 9.82 (0.132)
1998 | 11.14 (0.084)
1999 | 5.49 (0.482)
2000 | 3.31 (0.769)
2001 | 10.24 (0.111)
2002 | 8.04 (0.235)
2003 | 25.31 (0.0003)
2004 | 21.49 (0.001)
2005 | 19.41 (0.004)

Figure 8. Hospital-standardised mortality ratio (observed/predicted hospital deaths) for clinical diagnostic subgroups ("adclasses" as described in the Appendix)

Model adequacy for calibration by APACHE II predicted risk of death is illustrated by year in Figure 7. There is progressively poorer goodness-of-fit as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between the predicted and observed mortality from 2003 onwards, as shown in Table 2. Figure 8 illustrates model adequacy for discrimination by APACHE II predicted risk of death, according to clinical diagnostic subgroup. Although model adequacy was poorest in patients with neurological failure, there were only a small number of patients in this group. In contrast, the large number of patients with sepsis, respiratory failure, postoperative status, and circulatory failure makes the moderately poor model adequacy in these clinical subgroups more clinically relevant.

Discussion

Our data show that there has generally been a change in the overall casemix of patients admitted to the Middlemore Hospital ICU, with a decrease in the number of patients with poisonings and trauma over the period of observation, and an increase in those with complications as a result of surgery. APACHE II scores have remained fairly constant over the period of observation, with only a subtle trend to decreasing patient illness severity that did not achieve statistical significance.
The data also show that there has been a reduction in crude and risk-adjusted mortality, as assessed by mortality rates and by hospital standardised mortality ratios. At the same time, there has been a steady drop in the proportion of patients receiving mechanical ventilation over the period of observation, and in the average length of patient stay. A correlation between mechanical ventilation and increments in length of patient ICU stay has been noted in other studies.3 This change in outcomes and practice pattern may reflect the benefits of clinical pathways within our hospital, and the earlier detection and correction of physiological derangements that occurs in the modern, more pro-active approach to the provision of intensive care. An alternative, more pessimistic view is that this scenario may reflect earlier discharges from our ICU to accommodate increasing demand in a setting of increasingly limited resources. Reassuringly, if this latter scenario is the true one, then outcomes appear to have been maintained despite it.

The data are in general terms consistent with a recent paper by Moran et al reporting on intensive care outcomes using an international Australian and New Zealand ICU database (ANZICS database), which to date has not included data from Middlemore Hospital and can therefore be regarded as independent. These investigators reported an improvement in overall risk-adjusted mortality over the last 11 years, which they did not attribute to any one specific factor.3 Most medical administrators and practitioners would consider these improved outcomes to be in some part causally related to corresponding improvements in clinical care and therapeutic interventions. It would, however, take a more complex minimum dataset than both the ANZICS database and our local one to study this question appropriately.

There are two major findings of this study relating to the predictive performance of the APACHE II system.
The first is that there has been progressive deterioration in model adequacy in terms of both discrimination and calibration. Predictive performance is generally acceptable when ROC curve AUCs are >0.8, and using these and similar criteria it seems that continuing use of this system in our current practice may be unreasonable. The second is that the performance of the APACHE II system has been better sustained in some clinical diagnostic subgroups than in others. As is common to most ICUs, the largest clinical diagnostic subgroups in our dataset are sepsis and post-surgical complications, and the APACHE II system has moderately poor model adequacy in these subgroups, with a prediction error of between 25% and 50%. Of note, the subgroups with the largest prediction error in our dataset constitute only ~10% of the entire Middlemore ICU population.

The finding of 'model fade' over time is also consistent with those of Moran et al, who demonstrated deteriorating model adequacy for the APACHE II system over time, in terms of both discrimination and calibration. This was the case even after the authors recalibrated the APACHE II model by re-estimating coefficients for the Australasian population, thereby optimising discrimination and calibration. This is an important subtlety, since the performance of all illness severity scoring models is well known to be poorer in populations that are different from those in which they were developed. This simple recalibration adjusts for geographical differences in measured patient characteristics (physiology and diagnosis), although it does not consider ICU characteristics and the different organisational characteristics of healthcare systems as predictive variables. The Intensive Care National Audit and Research Centre (ICNARC) model is in essence an adaptation of the APACHE model that was developed by Rowan et al in the 1990s in the United Kingdom,10 but over the years it has evolved into a completely independent model that is widely used in the UK.11 Opinion leaders now recommend regular recalibration of illness scoring systems to local and more contemporary cohorts,12 although to our knowledge there is no consensus, or even a proposition, concerning thresholds for model performance that would trigger the recalibration process, or a standardised methodology for the recalibration itself.

'Model fade' and poor model performance in diagnostic subgroups have led to the evolution of existing scores into third and fourth generations of illness severity scoring systems, such as SAPS III and APACHE III and IV.2,12 The evolution of these scores did not involve simple recalibration of models by re-estimating coefficients; instead, it involved the application of new statistical methods, the addition of new variables, an increase in the number of diagnostic groups, and a change to the measurement of certain physiological and diagnostic variables. These scores can be expected to perform better as a result of their development in a cohort that is more contemporary and externally valid in terms of casemix, and also through the use of clinical information that was not initially taken into consideration during the development of the earlier systems. There is a widespread move amongst ICUs to this newer generation of illness scoring systems, although their performance is only marginally better than that of earlier versions of the scores that have been more simply recalibrated by re-estimating coefficients.13 Notwithstanding this, the APACHE III system is currently used more widely in the USA, with demonstrably greater discrimination and calibration than the original APACHE II system.2 It is too early to say whether more recent evolutions of these systems, such as the APACHE IV and SAPS III systems, will demonstrate continued improvement.
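The idea of "recalibration by re-estimating coefficients" can be illustrated with a simple intercept-and-slope logistic recalibration: refit a two-parameter logistic model of the observed outcome on the logit of the original predicted risk. This sketch uses plain gradient descent and is an assumption-laden toy, not the actual method of Moran et al or the ICNARC model; values of a = 0 and b = 1 leave the original model unchanged.

```python
import math

def _logit(p):
    return math.log(p / (1.0 - p))

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def recalibrate(risks, outcomes, iters=2000, lr=0.5):
    """Fit a and b so that sigmoid(a + b * logit(risk)) matches the
    observed outcomes. Plain gradient descent on the logistic
    log-likelihood; a = 0, b = 1 is the uncalibrated starting point."""
    x = [_logit(r) for r in risks]
    n = len(x)
    a, b = 0.0, 1.0
    for _ in range(iters):
        grad_a = grad_b = 0.0
        for xi, yi in zip(x, outcomes):
            err = _sigmoid(a + b * xi) - yi   # prediction error
            grad_a += err / n
            grad_b += err * xi / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def recalibrated_risk(risk, a, b):
    """Apply the fitted recalibration to one original predicted risk."""
    return _sigmoid(a + b * _logit(risk))
```

If the original model systematically over-predicts mortality, the fitted intercept moves downward, shrinking every recalibrated risk; this is the one-population analogue of re-estimating coefficients for a new cohort.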
The findings of our study do not address one of the conundrums of illness severity scoring: the interpretation of changes in scores and outcomes over time. As with other studies, it is impossible to tell from our data whether our results are due to improved patient care and access to care, or alternatively to the deteriorating performance of scoring systems because of changing patient casemix. Our cumulative clinical experience

Summary

Abstract

Aim

The Acute Physiological and Chronic Health Evaluation (APACHE) II score is a popular illness severity scoring system for intensive care units. Scoring systems such as the APACHE II allow researchers and clinicians to quantify patient illness severity with a greater degree of accuracy and precision, which is critical when evaluating practice patterns and outcomes, both within and between intensive care units. The study aims to: assess changes in APACHE II scores and hospital-standardised mortality ratio at our ICU over a nine-year period from 1 January 1997 to 31 December 2005; assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise.

Method

Retrospective audit of a single-centre relational database, with evaluation of the APACHE II scoring system by year through discrimination (the ability to discriminate between patients who will die or survive at hospital discharge) using receiver operating characteristic (ROC) curves, and calibration (the ability to predict the mortality rate over classes of risk) using goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.

Results

Data from 7703 patients were available for analysis. There was a decrease in overall hospital mortality, from approximately 19% at the beginning of the period of observation to approximately 12% at the end. There was also a decrease in the hospital standardised mortality ratio from 0.94 (95%CI 0.82-1.06) to 0.66 (95%CI 0.55-0.76). In general, both the APACHE II score and risk of death model performed adequately in each year with ROC curve AUCs of >0.8, albeit with progressively poorer performance over time ('model fade') that approached statistical significance. There was progressively poorer calibration with the APACHE II risk of death model as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between the predicted and observed mortality from 2003 onwards. Overall, there was moderately poor model performance in the diagnostic groups with the largest number of patients (sepsis and post-surgical complications).

Conclusion

This study shows the progressively worse performance of the APACHE II illness severity scoring system over time due to model fade. This is especially so in common diagnostic categories, making this a clinically relevant finding. Future approaches to illness severity scoring should be tested and compared, such as re-estimating coefficients of the APACHE II diagnostic categories or using locally developed ones, moving to later evolutions of the system such as the APACHE III or APACHE IV, or developing novel artificial intelligence approaches.

Author Information

Susan L Mann, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland; Mark R Marshall, Nephrologist, Department of Internal Medicine, Counties Manukau District Health Board, Manukau, Auckland; Alec Holt, Director Health Informatics Programme, Department of Information Science, University of Otago, Dunedin; Brendon Woodford, Department of Information Science, University of Otago, Dunedin; Anthony B Williams, Intensivist, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland

Acknowledgements

The authors thank Mr Mpatisi Moyo (Decision Support, Middlemore Hospital) and Mr Gary Jackson (Public Health Physician, Counties Manukau District Health Board).

Correspondence

Susan L Mann, PO Box 25-075, St Heliers, Auckland 1740, New Zealand. Fax: +64 (0)9 2760034

Correspondence Email

smann@xtra.co.nz

Competing Interests

None known.

References

1. Zimmerman JE, Knaus WA, Judson JA, et al. Patient selection for intensive care: a comparison of New Zealand and United States hospitals. Crit Care Med. 1988;16(4):318-26.
2. Zimmerman J, Kramer A. Outcome prediction in critical care: the APACHE physiology and chronic health evaluation models. Curr Opin Crit Care. 2008;14:491-7.
3. Moran JL, Bristow P, Solomon P, et al. Mortality and length-of-stay outcomes, 1993-2003, in the binational Australian and New Zealand intensive care adult patient database. Crit Care Med. 2008;36(1):46-60.
4. http://www.cmdhb.govt.nz/About_CMDHB/Overview/population-profile.htm
5. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985 Oct;13(10):818-29.
6. Morris J, Gardner M. Calculating confidence intervals for relative risks (odds ratios) and standardised ratios and rates. BMJ. 1988;296:1313-6.
7. Hanley JA, McNeil B. The meaning and use of a ROC curve. Radiology. 1982;143:29-36.
8. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating curves: a nonparametric approach. Biometrics. 1988;44:837-45.
9. Lemeshow S, Hosmer D. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982;115:92-106.
10. Rowan K, Kerr J, Major K, et al. Intensive Care Society's APACHE II study in Britain and Ireland-II: outcome comparisons of intensive care units after adjustment for case-mix by the American APACHE II method. BMJ. 1993;307:977-81.
11. Harrison D, Parry G, Carpenter J, et al. A new risk prediction model for critical care: the Intensive Care National Audit & Research Centre (ICNARC) model. Crit Care Med. 2007;35:1091-8.
12. Capuzzo M, Moreno R, LeGall J. Outcome prediction in critical care: the simplified acute physiology score models. Curr Opin Crit Care. 2008;14:485-90.
13. Harrison D, Brady AR, Parry GJ, et al. Recalibration of risk prediction models in a large multicentre cohort of admissions to adult, general critical care units in the United Kingdom. Crit Care Med. 2006;34(5):1378-88.
14. Moreno R. Outcome prediction in intensive care: why we need to reinvent the wheel. Curr Opin Crit Care. 2008;14:483-4.
15. Frize M, Walker R. Clinical decision-support systems for intensive care units using case-based reasoning. Med Eng Phys. 2000 Nov;22(9):671-7.
16. Clermont G. Artificial neural networks as prediction tools in the critically ill. Crit Care. 2005 Apr;9(2):153-4.
17. Clermont G, Angus DC, DiRusso SM, et al. Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models. Crit Care Med. 2001 Feb;29(2):291-6.
18. Holt A, Bichindaritz I, Schmidt R, Perner P. Medical applications in case-based reasoning. The Knowledge Engineering Review. 2006;20(3):289-92.

For the PDF of this article,
contact nzmj@nzma.org.nz

View Article PDF

Illness severity scoring systems such as the Acute Physiological and Chronic Health Evaluation (APACHE) have become important tools for the evaluation and planning of intensive care practice patterns. These systems objectively estimate patient risk for mortality from acute physiological and chronic health status. They are not, however, a tool used for deciding treatment for individual patients; they are a group measurement used for patients who have similar disease processes. Their origin in the late 1970s and early 1980s was driven by the need to relate such practice patterns to patient outcomes.In the modern setting, tools such as the APACHE scoring system allow researchers and clinicians to quantify patient illness severity with a greater degree of accuracy and precision, which is essential for benchmarking and program evaluation. The interest in illness severity scoring systems is evidenced by the extensive body of literature that continues to advance both technical aspects of the systems themselves, and the applications for which they are used.Middlemore Hospital was one the earlier facilities in New Zealand to implement APACHE II scoring in clinical settings. The routine scoring of patients began in the Intensive Care Unit (ICU) in 1986. This advance was facilitated in no small part by the availability of one of the developers of the APACHE system, Dr Jack Zimmerman, who spent an extended sabbatical in New Zealand, some of which was at Middlemore Hospital.Despite enthusiastic support for APACHE II scoring by international opinion leaders at the time, the relevance and utility in New Zealand has been questioned from an early stage. The external validity of the system in such a different population from that in is developed was acknowledged by Zimmerman et al:1 The NZ hospitals designated 1.7% of their total beds for intensive care compared to 5.6% in the US hospitals. The average age for NZ admissions was 42 compared to 55 in the US (p<0.0001). 
The NZ ICUs admitted fewer patients with severe chronic failing health (NZ 8.7%, US 18%) and following elective surgery (NZ 8%, US 40%). Approximately half the NZ admissions were for trauma, drug overdose, and asthma while these diagnoses accounted for 11% of US admissions. When controlled for differences in casemix and severity of illness, hospital mortality rates in NZ were comparable to the US. This study demonstrates substantial differences in patient selection among these US and NZ. Furthermore, after more than two decades of use, it is unclear whether the performance of the APACHE II has been maintained. Patient casemix in New Zealand has changed from earlier times, and the improvements in supportive care that are now available may have decreased mortality for any given illness severity. International opinion leaders are in general moving towards the more recently developed scoring systems such as APACHE versions III and IV, which have been shown to outperform older versions in studies of North American and European ICU populations.2 This paradigm is slowly translating to clinical practice in this part of the world: the Australian and New Zealand Intensive Care Society adult patient database now collect data sufficient to model both APACHE II and III scores.3 There are three aims of this study. We aim to: Assess change in APACHE II scores and hospital standardised mortality ratio at our ICU over a 9-year period from 1 January 1997 to 31 December 2005; Assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and Assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise. Methods Study population and setting—Middlemore Hospital is the main hospital within the Counties Manukau District Health Board (CMDHB). The hospital serves a large urban population. 
The district catchment includes Manukau City, which is rapidly expanding: the population has grown from 356,006 in 1996 to 454,655 at the last census in 2006. The population can be summarily characterised as young, multi-ethnic, and of low socioeconomic status compared with the rest of New Zealand.4 Middlemore Hospital is a tertiary referral centre for plastic surgery, burns, orthopaedics, and a range of medical sub-specialities. Any patient requiring neurosurgical or cardiothoracic surgical intervention is referred on to Auckland City Hospital, as Middlemore Hospital does not have these facilities; all other patient categories remain at Middlemore Hospital. Although there is a specialist regional paediatric hospital in the area, Middlemore Hospital also provides paediatric services; the Middlemore Hospital ICU therefore cares for children down to 2 kg in weight who require intensive care, accounting for approximately 120 paediatric admissions per year. The hospital is academically affiliated and thus a teaching institution. Middlemore Hospital has had between 700 and 900 acute beds over the time in which this research was done, and now also includes a satellite surgical centre which caters for the majority of elective cases apart from those that are particularly high risk.

Currently, the Middlemore ICU is nominally a seven funded-bed Level 3 facility. Since the inception of the Middlemore Hospital ICU in the late 1960s, the unit has been structurally modified on several occasions. As a result of both national and local changes in healthcare strategy, the unit has at times had nominated HDU beds, and at other times not. Since 2004, there has been a four funded-bed Level 1 intensive care unit at a satellite surgical centre, which currently shares clinical governance, staff, policies and procedures with the main ICU at Middlemore Hospital. Patients admitted to this satellite unit were not included in this study. 
Data source—All data were sourced from a single-centre relational database that has been in continuous use at the Middlemore Hospital ICU since January 1986. The database contains information on all patients admitted to the ICU during this period, using data that were prospectively collected, collated, and agreed upon by senior specialists and the charge nurse at the time. Data collection was progressively expanded during this period to ultimately include demographic information, APACHE II score, diagnostic information, ventilatory and inotropic support, procedures performed, and patient outcome. Patients who were less than 15 years of age, or who had been admitted solely for the purpose of a procedure such as difficult central venous line insertion or endoscopy, were not scored, as the system was not devised for these groups. The database specifically records patient death at both ICU and hospital discharge.

The database includes locally developed diagnostic codes ("adclasses" and "subclasses") in addition to the APACHE II ones, which were developed to better reflect and discriminate disease categories related to the local population (see Appendix). Generic APACHE II diagnostic codes do not necessarily provide a realistic reflection of the local disease categories and population outcomes. They can be ‘localised' by adjustments to the disease categorisation and/or the category weights subsequently used with the APACHE II scores for calculating risk of death, an approach supported in the case of Middlemore Hospital by Zimmerman et al, who emphasised differences between North American and New Zealand ICU patient populations.1 Data were prospectively stored in Microsoft Access (Microsoft Corporation, Redmond, WA, USA), and retrospectively abstracted for analyses over a 9-year period from 1 January 1997 to 31 December 2005. 
Calculation of APACHE II scores and risk of death—All APACHE II scores and risks of death were calculated at patient hospital discharge using the prospectively stored data and the logistic regression equation developed by Knaus et al.5 The data for calculation of the APACHE II score included physiological measurements in the first 24 hours of ICU admission, age, and chronic health status. The APACHE II risk of death is calculated not only from scores but also from diagnostic categories, which were rigorously and continuously evaluated by the senior ICU medical staff during the process of prospective data collection. Such minimisation of misclassification was necessary to avoid error arising from the heavy reliance of the APACHE II risk of death formula on the reason for ICU admission.

Statistics—Standard statistics were used to describe data, making particular use of the median and interquartile range to avoid assumptions about data distribution. Hypothesis testing was undertaken using the Kruskal-Wallis equality-of-populations rank test for continuous variables, and Pearson's Chi-squared test for categorical ones. Risk-adjusted mortality by year was assessed by hospital standardised mortality ratios and 95% confidence intervals (regarding observed mortality as a binomial variable), which were obtained by dividing the number of observed hospital deaths in each year by the number predicted by the APACHE II system.6 Overall predictive performance of the APACHE II scoring system by year was gauged through discrimination (the ability to discriminate between patients who will die or survive at hospital discharge) and calibration (the ability to predict mortality rate over classes of risk). 
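The risk-of-death calculation and the standardised mortality ratio described above can be sketched in a few lines. This is a minimal illustration rather than the study's code: the coefficients are those published for APACHE II by Knaus et al,5 the diagnostic category weight is taken as an input, and the function names are ours.

```python
import math

def apache2_risk_of_death(score, diagnostic_weight, emergency_surgery=False):
    """APACHE II predicted risk of death (Knaus et al, 1985):
    ln(R / (1 - R)) = -3.517 + 0.146 * score + 0.603 (if post-emergency
    surgery) + diagnostic category weight."""
    logit = -3.517 + 0.146 * score + diagnostic_weight
    if emergency_surgery:
        logit += 0.603
    return 1.0 / (1.0 + math.exp(-logit))

def smr_with_ci(observed_deaths, predicted_risks, z=1.96):
    """Hospital standardised mortality ratio: observed deaths divided by the
    sum of predicted risks, with a binomial (normal-approximation) 95% CI
    obtained by treating the observed death count as binomial."""
    n = len(predicted_risks)
    expected = sum(predicted_risks)
    p = observed_deaths / n
    half_width = z * math.sqrt(p * (1.0 - p) / n) * n  # CI half-width, in deaths
    return (observed_deaths / expected,
            (observed_deaths - half_width) / expected,
            (observed_deaths + half_width) / expected)
```

For example, 40 observed deaths against an expected 50 gives an SMR of 0.8, with a confidence interval that narrows as the number of patients grows.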
Discrimination was assessed using receiver operating characteristic (ROC) curves, which plot the true positive rate (sensitivity: correctly predicted hospital deaths / observed hospital deaths) against the false positive rate (1 − specificity: incorrectly predicted hospital deaths / observed hospital survivors). Predictive performance is indicated in this method by the ROC area under the curve (AUC), with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination. The slope of the curve indicates the ratio of the true positive rate to the false positive rate, which is also known as the likelihood ratio.7 For the analyses in this article, the ROC AUCs for each year of the study were tested for equality.8 Calibration was assessed using the correspondence between the number of observed hospital deaths and the number of predicted hospital deaths within each 10% stratum (decile) of the cohort's expected risk of death. Predictive performance is indicated in this method by goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.9 The predictive performance of the APACHE II scoring system in major clinical subgroups was assessed using hospital standardised mortality ratios within each of the major "adclasses". All analyses were performed using Microsoft Excel (Microsoft Corporation, Redmond, WA, USA) and Intercooled Stata 9.2 (StataCorp, College Station, TX, USA) software.

Ethics—The need for formal approval for the research process was waived by the National (New Zealand) Health and Disability Ethics Committee under the provisions made for clinical audit.

Results

Data from 7703 patients were available for analysis. Baseline patient characteristics are presented in Table 1. The number of patients admitted to the ICU increased steadily from 686 in 1997 to 730 in 2005. The demographic characteristics of patients changed over the period of observation, with a trend to older and more Māori patients. 
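The two performance measures used in this study (the ROC AUC for discrimination, and the Hosmer-Lemeshow statistic for calibration over deciles of predicted risk) can be sketched as follows. This is an illustrative sketch only; the study's analyses were run in Stata, and the function names below are ours. The AUC is computed nonparametrically as the probability that a randomly chosen death received a higher predicted risk than a randomly chosen survivor.

```python
def roc_auc(outcomes, risks):
    """Nonparametric ROC AUC: the fraction of (death, survivor) pairs in
    which the death has the higher predicted risk, counting ties as half."""
    deaths = [r for r, y in zip(risks, outcomes) if y == 1]
    survivors = [r for r, y in zip(risks, outcomes) if y == 0]
    wins = sum(1.0 if d > s else 0.5 if d == s else 0.0
               for d in deaths for s in survivors)
    return wins / (len(deaths) * len(survivors))

def hosmer_lemeshow(outcomes, risks, groups=10):
    """Hosmer-Lemeshow statistic over strata of predicted risk, comparing
    observed with expected deaths in each stratum; the result is compared
    against a chi-squared distribution with (groups - 2) degrees of freedom."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    n = len(order)
    stat = 0.0
    for g in range(groups):
        idx = order[g * n // groups:(g + 1) * n // groups]
        observed = sum(outcomes[i] for i in idx)
        expected = sum(risks[i] for i in idx)
        mean_risk = expected / len(idx)
        stat += (observed - expected) ** 2 / (len(idx) * mean_risk * (1.0 - mean_risk))
    return stat
```

A well-calibrated model yields observed counts close to expected in every stratum and hence a small statistic; systematic over- or under-prediction inflates it.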
There has also been a change in the casemix of patients, with a reduction in the number of patients with diagnoses of poisoning and trauma, and an increase in the number of patients admitted after elective or emergency surgery. Patient length of stay has progressively reduced, as has the proportion of patients requiring mechanical ventilation. Overall hospital mortality decreased from approximately 19% at the beginning of the period of observation to approximately 12% at the end.

Figure 1. APACHE II scores and risk scores by year, presented as boxplots. Note: In these plots, the middle horizontal line represents the median; the box the second and third quartiles; and the whiskers the upper and lower extreme values, which are no more than 1.5 × the interquartile range beyond the middle quartiles.

Figure 2. Hospital-standardised mortality ratio and 95% confidence intervals, by year

The APACHE II score decreased marginally over the period of observation as illustrated in Figure 1, with a median value of 14 in 1997 (IQR 9–21) and a corresponding value of 13 in 2005 (IQR 9–21). Although this reduction achieved statistical significance (p=0.0001), it cannot be regarded as clinically important. APACHE II predicted risk of death remained stable over the period of observation, with a minor trend to reduction that did not achieve statistical significance (p=0.11). The hospital-standardised mortality ratio decreased over the period of observation as illustrated in Figure 2, with a value of 0.94 (95% confidence interval 0.82–1.06) in 1997 and a corresponding value of 0.66 (95% confidence interval 0.55–0.76) in 2005.

Model adequacy for discrimination by APACHE II score is illustrated by year in Figures 3 and 4. In general, the APACHE II score performs adequately in each year, with ROC curve AUCs of >0.8. However, there is deteriorating accuracy of mortality predictions over time (otherwise known as ‘model fade'10) that approaches statistical significance. 
Corresponding model adequacy for discrimination by APACHE II predicted risk of death is illustrated in Figures 5 and 6. The risk model performs similarly to the APACHE II score, showing a like degree of ‘model fade'.

Figure 3. ROC curves for APACHE II score, by year. Note: Predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 4. ROC curve AUC (95%CI) for APACHE II score, by year, as shown in Figure 3. Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 5. ROC curves for the APACHE II risk score, by year. Note: Predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 6. ROC curve AUC (95%CI) for APACHE II risk score, by year, as shown in Figure 5. Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 7. Calibration curves for APACHE II predicted risk of death, by year, showing the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death. Predictive performance is assessed by the Hosmer-Lemeshow statistic (see Table 2).

Table 2. Model adequacy for calibration by APACHE II predicted risk of death, by year, as indicated by the Hosmer-Lemeshow goodness-of-fit statistic for each of the calibration curves in Figure 7. 
A high Hosmer-Lemeshow statistic and a P value <0.05 indicate poor correspondence between the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death.

Year    Hosmer-Lemeshow goodness-of-fit statistic (P value)
1997    9.82 (0.132)
1998    11.14 (0.084)
1999    5.49 (0.482)
2000    3.31 (0.769)
2001    10.24 (0.111)
2002    8.04 (0.235)
2003    25.31 (0.0003)
2004    21.49 (0.001)
2005    19.41 (0.004)

Figure 8. Hospital-standardised mortality ratio (observed/predicted hospital deaths) for clinical diagnostic subgroups ("adclasses" as described in the Appendix)

Model adequacy for calibration by APACHE II predicted risk of death is illustrated by year in Figure 7. There is progressively poorer goodness-of-fit as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between predicted and observed mortality from 2003 onwards, as shown in Table 2. Figure 8 illustrates model adequacy for APACHE II predicted risk of death according to clinical diagnostic subgroup. Although model adequacy was poorest in patients with neurological failure, there were only a small number of patients in this group. In contrast, the large number of patients with sepsis, respiratory failure, postoperative status, and circulatory failure makes the moderately poor model adequacy in these clinical subgroups more clinically relevant.

Discussion

Our data show that there has been a change in the overall casemix of patients admitted to the Middlemore Hospital ICU, with a decrease in the number of patients with poisonings and trauma over the period of observation, and an increase in those with complications as a result of surgery. APACHE II scores have remained fairly constant over the period of observation, with only a subtle trend to decreasing patient illness severity that did not achieve statistical significance. 
The data also show that there has been a reduction in crude and risk-adjusted mortality, as assessed by mortality rates and by hospital standardised mortality ratios. At the same time, there has been a steady drop in the proportion of patients receiving mechanical ventilation over the period of observation, and in the average length of patient stay. A correlation between mechanical ventilation and increments in length of patient ICU stay has been noted in other studies.3 This change in outcomes and practice pattern may reflect the benefits of clinical pathways within our hospital, and the earlier detection and correction of physiological derangements that occurs in the modern, more pro-active approach to the provision of intensive care. An alternative, more pessimistic view is that this scenario may reflect earlier discharges from our ICU to accommodate increasing demand in a setting of increasingly limited resources. Reassuringly, if this latter scenario is the true one, then outcomes appear to have been maintained despite it.

The data are in general terms consistent with a recent paper by Moran et al reporting on intensive care outcomes using the binational Australian and New Zealand ICU database (ANZICS database), which to date has not included data from Middlemore Hospital and can therefore be regarded as independent. These investigators reported an improvement in overall risk-adjusted mortality over the last 11 years, which they did not attribute to any one specific factor.3 Most medical administrators and practitioners would consider these improved outcomes to be in some part causally related to corresponding improvements in clinical care and therapeutic interventions. It would, however, take a more complex minimum dataset than both the ANZICS database and our local one to study this question appropriately.

There are two major findings of this study relating to the predictive performance of the APACHE II system. 
The first is that there has been progressive deterioration in model adequacy in terms of both discrimination and calibration. Predictive performance is generally acceptable when ROC curve AUCs are >0.8, and by these and similar criteria it seems that continuing to use this system in our current practice may be unreasonable. The second is that the performance of the APACHE II system has been better sustained in some clinical diagnostic subgroups than in others. As is common to most ICUs, the largest clinical diagnostic subgroups in our dataset are sepsis and post-surgical complications, and the APACHE II system has moderately poor model adequacy in these subgroups, with prediction errors of between 25% and 50%. Of note, the subgroups with the largest prediction error in our dataset constitute only ~10% of the entire Middlemore ICU population.

The finding of ‘model fade' over time is also consistent with that of Moran et al, who demonstrated deteriorating model adequacy for the APACHE II system over time, in terms of both discrimination and calibration. This was the case even after the authors recalibrated the APACHE II model by re-estimating coefficients for the Australasian population, thereby optimising discrimination and calibration. This is an important subtlety, since the performance of all illness severity scoring models is well known to be poorer in populations that are different from those in which they were developed. Such simple recalibration adjusts for geographical differences in measured patient characteristics (physiology and diagnosis), although it does not consider ICU characteristics or the differing organisational characteristics of healthcare systems as predictive variables. The Intensive Care National Audit and Research Centre (ICNARC) model is in essence an adaptation of the APACHE model that was developed by Rowan et al. 
in the 1990s in the United Kingdom,10 but over the years it has evolved into a completely independent model that is widely used in the UK.11 Opinion leaders now recommend regular recalibration of illness scoring systems to local and more contemporary cohorts,12 although to our knowledge there is no consensus, or even a proposition, concerning thresholds of model performance that would trigger the recalibration process, or a standardised methodology for the recalibration itself.

‘Model fade' and poor model performance in diagnostic subgroups have led to the evolution of existing scores into third and fourth generations of illness severity scoring systems, such as SAPS III and APACHE III and IV.2,12 The evolution of these scores did not involve simple recalibration of models by re-estimating coefficients; instead it involved the application of new statistical methods, the addition of new variables, an increase in the number of diagnostic groups, and changes to the measurement of certain physiological and diagnostic variables. These scores can be expected to perform better as a result of their development in a cohort that is more contemporary and externally valid in terms of casemix, and also because they use clinical information that was not taken into consideration during the development of the earlier systems. There is a widespread move amongst ICUs to this newer generation of illness scoring systems, although their performance is only marginally better than that of earlier versions that have been more simply recalibrated by re-estimating coefficients.13 Notwithstanding this, the APACHE III system is currently used more widely in the USA, with demonstrably greater discrimination and calibration than the original APACHE II system.2 It is too early to say whether more recent evolutions of these systems, such as APACHE IV and SAPS III, will demonstrate continued improvement. 
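Recalibration by re-estimating coefficients, as discussed above, can be illustrated with a small logistic refit that maps the original model's logit onto local outcomes through a calibration intercept and slope. This is a generic sketch of the idea under our own assumptions, not the method of any cited paper; a perfectly calibrated model returns an intercept near 0 and a slope near 1, and other values both quantify and correct the local miscalibration.

```python
import math

def recalibrate(logits, outcomes, iterations=25):
    """Fit outcomes ~ sigmoid(a + b * logit) by Newton's method, where
    `logits` are the original model's linear predictors and `outcomes` are
    the locally observed deaths (1) and survivals (0). Returns (a, b), the
    calibration intercept and slope."""
    a, b = 0.0, 1.0
    for _ in range(iterations):
        p = [1.0 / (1.0 + math.exp(-(a + b * x))) for x in logits]
        # Gradient of the log-likelihood and 2x2 observed information matrix
        g0 = sum(y - pi for y, pi in zip(outcomes, p))
        g1 = sum((y - pi) * x for y, pi, x in zip(outcomes, p, logits))
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, logits))
        h11 = sum(wi * x * x for wi, x in zip(w, logits))
        det = h00 * h11 - h01 * h01
        # Newton step: theta += information^-1 * gradient
        a += (h11 * g0 - h01 * g1) / det
        b += (-h01 * g0 + h00 * g1) / det
    return a, b
```

Richer recalibration strategies refit the full set of model coefficients rather than just these two parameters, at the cost of needing a much larger local cohort.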
The findings of our study do not address one of the conundrums of illness severity scoring: the interpretation of changes in scores and outcomes over time. As with other studies, it is impossible to tell from our data whether our results are due to improved patient care and access to care, or instead reflect the deteriorating performance of scoring systems in the face of changing patient casemix. Our cumulative clinical experience

Summary

Abstract

Aim

The Acute Physiological and Chronic Health Evaluation (APACHE) II score is a popular illness severity scoring system for intensive care units. Scoring systems such as the APACHE II allow researchers and clinicians to quantify patient illness severity with a greater degree of accuracy and precision, which is critical when evaluating practice patterns and outcomes, both within and between intensive care units. The study aims to: assess changes in APACHE II scores and hospital-standardised mortality ratio at our ICU over a nine-year period from 1 January 1997 to 31 December 2005; assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise.

Method

Retrospective audit of a single-centre relational database, with evaluation of the APACHE II scoring system by year through discrimination (the ability to discriminate between patients who will die or survive at hospital discharge) using receiver operating characteristic (ROC) curves, and calibration (the ability to predict mortality rate over classes of risk) using goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.

Results

Data from 7703 patients were available for analysis. There was a decrease in overall hospital mortality, from approximately 19% at the beginning of the period of observation to approximately 12% at the end. There was also a decrease in the hospital standardised mortality ratio from 0.94 (95%CI 0.82–1.06) to 0.66 (95%CI 0.55–0.76). In general, both the APACHE II score and risk of death model performed adequately in each year with ROC curve AUCs of >0.8, albeit with progressively poorer performance over time and ‘model fade' that approached statistical significance. There was progressively poorer calibration with the APACHE II risk of death model as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between the predicted and observed mortality from 2003 onwards. Overall, there was moderately poor model performance in the diagnostic groups with the largest number of patients (sepsis and post-surgical complications).

Conclusion

This study shows the progressively worse performance of the APACHE II illness severity scoring system over time due to model fade. This is especially so in common diagnostic categories, making this a clinically relevant finding. Future approaches to illness severity scoring should be tested and compared, such as re-estimating coefficients of the APACHE II diagnostic categories or using locally developed ones, moving to later evolutions of the system such as the APACHE III or APACHE IV, or developing novel artificial intelligence approaches.

Author Information

Susan L Mann, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland; Mark R Marshall, Nephrologist, Department of Internal Medicine, Counties Manukau District Health Board, Manukau, Auckland; Alec Holt, Director Health Informatics Programme, Department of Information Science, University of Otago, Dunedin; Brendon Woodford, Department of Information Science, University of Otago, Dunedin; Anthony B Williams, Intensivist, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland

Acknowledgements

The authors thank Mr Mpatisi Moyo (Decision Support, Middlemore Hospital) and Mr Gary Jackson (Public Health Physician, Counties Manukau District Health Board).

Correspondence

Susan L Mann, PO Box 25-075, St Heliers, Auckland 1740, New Zealand. Fax: +64 (0)9 2760034

Correspondence Email

smann@xtra.co.nz

Competing Interests

None known.

References

1. Zimmerman JE, Knaus WA, Judson JA, et al. Patient selection for intensive care: a comparison of New Zealand and United States hospitals. Crit Care Med. 1988;16(4):318-26.
2. Zimmerman J, Kramer A. Outcome prediction in critical care: the Acute Physiology and Chronic Health Evaluation models. Curr Opin Crit Care. 2008;14:491-7.
3. Moran JL, Bristow P, Solomon P, et al. Mortality and length-of-stay outcomes, 1993-2003, in the binational Australian and New Zealand intensive care adult patient database. Crit Care Med. 2008;36(1):46-60.
4. Counties Manukau District Health Board population profile. http://www.cmdhb.govt.nz/About_CMDHB/Overview/population-profile.htm
5. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818-29.
6. Morris J, Gardner M. Calculating confidence intervals for relative risks (odds ratios) and standardised ratios and rates. BMJ. 1988;296:1313-6.
7. Hanley JA, McNeil B. The meaning and use of a ROC curve. Radiology. 1982;143:29-36.
8. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837-45.
9. Lemeshow S, Hosmer D. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982;115:92-106.
10. Rowan K, Kerr J, Major K, et al. Intensive Care Society's APACHE II study in Britain and Ireland-II: outcome comparisons of intensive care units after adjustment for case-mix by the American APACHE II method. BMJ. 1993;307:977-81.
11. Harrison D, Parry G, Carpenter J, et al. A new risk prediction model for critical care: the Intensive Care National Audit & Research Centre (ICNARC) model. Crit Care Med. 2007;35:1091-8.
12. Capuzzo M, Moreno R, LeGall J. Outcome prediction in critical care: the simplified acute physiology score models. Curr Opin Crit Care. 2008;14:485-90.
13. Harrison D, Brady AR, Parry GJ, et al. Recalibration of risk prediction models in a large multicentre cohort of admissions to adult, general critical care units in the United Kingdom. Crit Care Med. 2006;34(5):1378-88.
14. Moreno R. Outcome prediction in intensive care: why we need to reinvent the wheel. Curr Opin Crit Care. 2008;14:483-4.
15. Frize M, Walker R. Clinical decision-support systems for intensive care units using case-based reasoning. Med Eng Phys. 2000;22(9):671-7.
16. Clermont G. Artificial neural networks as prediction tools in the critically ill. Crit Care. 2005;9(2):153-4.
17. Clermont G, Angus DC, DiRusso SM, et al. Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models. Crit Care Med. 2001;29(2):291-6.
18. Holt A, Bichindaritz I, Schmidt R, Perner P. Medical applications in case-based reasoning. The Knowledge Engineering Review. 2006;20(3):289-92.

For the PDF of this article,
contact nzmj@nzma.org.nz

View Article PDF

Illness severity scoring systems such as the Acute Physiological and Chronic Health Evaluation (APACHE) have become important tools for the evaluation and planning of intensive care practice patterns. These systems objectively estimate patient risk for mortality from acute physiological and chronic health status. They are not, however, a tool used for deciding treatment for individual patients; they are a group measurement used for patients who have similar disease processes. Their origin in the late 1970s and early 1980s was driven by the need to relate such practice patterns to patient outcomes.In the modern setting, tools such as the APACHE scoring system allow researchers and clinicians to quantify patient illness severity with a greater degree of accuracy and precision, which is essential for benchmarking and program evaluation. The interest in illness severity scoring systems is evidenced by the extensive body of literature that continues to advance both technical aspects of the systems themselves, and the applications for which they are used.Middlemore Hospital was one the earlier facilities in New Zealand to implement APACHE II scoring in clinical settings. The routine scoring of patients began in the Intensive Care Unit (ICU) in 1986. This advance was facilitated in no small part by the availability of one of the developers of the APACHE system, Dr Jack Zimmerman, who spent an extended sabbatical in New Zealand, some of which was at Middlemore Hospital.Despite enthusiastic support for APACHE II scoring by international opinion leaders at the time, the relevance and utility in New Zealand has been questioned from an early stage. The external validity of the system in such a different population from that in is developed was acknowledged by Zimmerman et al:1 The NZ hospitals designated 1.7% of their total beds for intensive care compared to 5.6% in the US hospitals. The average age for NZ admissions was 42 compared to 55 in the US (p<0.0001). 
The NZ ICUs admitted fewer patients with severe chronic failing health (NZ 8.7%, US 18%) and following elective surgery (NZ 8%, US 40%). Approximately half the NZ admissions were for trauma, drug overdose, and asthma while these diagnoses accounted for 11% of US admissions. When controlled for differences in casemix and severity of illness, hospital mortality rates in NZ were comparable to the US. This study demonstrates substantial differences in patient selection among these US and NZ. Furthermore, after more than two decades of use, it is unclear whether the performance of the APACHE II has been maintained. Patient casemix in New Zealand has changed from earlier times, and the improvements in supportive care that are now available may have decreased mortality for any given illness severity. International opinion leaders are in general moving towards the more recently developed scoring systems such as APACHE versions III and IV, which have been shown to outperform older versions in studies of North American and European ICU populations.2 This paradigm is slowly translating to clinical practice in this part of the world: the Australian and New Zealand Intensive Care Society adult patient database now collect data sufficient to model both APACHE II and III scores.3 There are three aims of this study. We aim to: Assess change in APACHE II scores and hospital standardised mortality ratio at our ICU over a 9-year period from 1 January 1997 to 31 December 2005; Assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and Assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise. Methods Study population and setting—Middlemore Hospital is the main hospital within the Counties Manukau District Health Board (CMDHB). The hospital serves a large urban population. 
The district catchment includes Manukau City which is rapidly expanding: the population has grown from 356,006 in 1996 to 454,655 at last census in 2006. The population can be summarily characterised as being young , multi-ethnic, and of low socioeconomic status compared with the rest of New Zealand.4 Middlemore Hospital is a tertiary referral centre for plastic surgery, burns, orthopaedics, and a range of medical sub-specialities. Any patient requiring neurosurgical or cardiothoracic surgical intervention is referred on to Auckland City Hospital as Middlemore Hospital does not have these facilities; all other patient categories remain at Middlemore Hospital. Although there is a specialist regional paediatric hospital in the area, Middlemore Hospital is also a paediatric hospital; the Middlemore Hospital ICU therefore cares for those children down to 2 kg weight requiring intensive care accounting for approximately 120 paediatric admissions per year. The hospital is academically affiliated and thus a teaching institution. Middlemore Hospital has had between 700 and 900 acute beds over the time in which this research was done, and now also includes a satellite surgical centre which caters for the majority of elective cases apart from those that are particularly high risk. Currently, the Middlemore ICU is nominally a seven funded-bed Level 3 facility. Since the inception of the Middlemore Hospital ICU in the late 1960s, the unit has been structurally modified on several occasions. As a result of the both national and local changes in healthcare strategy, the unit had at times had nominated HDU beds, and at other times not. Since 2004, there has been a four-funded bed Level I intensive care unit at a satellite surgical centre, which currently shares clinical governance, staff, policies and procedures with the main ICU at Middlemore Hospital. These patients were not included in this study. 
Data source—All data were sourced from a single-centre relational database that has been in continuous use at the Middlemore Hospital ICU since January 1986. The database contains information on all patients admitted to ICU during this period, using data that is prospectively collected, collated, and agreed upon by senior specialists and the charge nurse at the time. Data collection was progressively expanded during this period to ultimately include demographic information, APACHE II score, diagnostic information, ventilatory and inotropic support, procedures performed, and patient outcome. Patients who were less than 15 years of age, or who had been admitted solely for the purpose of a procedure such as difficult central venous line or endoscopy were not scored, as the system was not devised for these groups. The database specifically includes both patient death at both ICU and hospital discharge. The database includes locally developed diagnostic codes ("adclasses" and "subclasses") in addition to the APACHE II ones, which were developed to better reflect and discriminate disease categories related to the local population (see Appendix). Generic APACHE II diagnostic codes do necessarily provide a realistic reflection of the local disease categories and population outcomes. They can be ‘localised' by adjustments to either disease categorisation and / or the category weights subsequently used with the APACHE II scores for calculating risk of death supported in the case of Middlemore Hospital by Zimmerman et al who emphasised differences between North American and New Zealand ICU patient populations.1 Data were prospectively stored in Microsoft Access (Microsoft Corporation, Seattle, WA, USA), and retrospectively abstracted for analyses from a 9-year period from 1 January 1997 to 31 December 2005. 
Calculation of APACHE II scores and risk of death—All APACHE II scores and risks of death were calculated at patient hospital discharge using the prospectively stored data and the logistic regression equation developed by Knaus et al.5 The data for calculation of the APACHE II score included physiological measurements in the first 24 hours of ICU admission, age, and chronic health status. The APACHE II risk of death is calculated not only from scores but also from diagnostic categories, which were rigorously and continuously evaluated by the senior ICU medical staff during the process of prospective data collection. Such minimisation of misclassification was necessary to avoid error arising from the heavy reliance of the APACHE II risk of death formula on the reason for ICU admission.

Statistics—Standard statistics were used to describe the data, making particular use of the median and interquartile range to avoid assumptions about data distribution. Hypothesis testing was undertaken using the Kruskal-Wallis equality-of-populations rank test for continuous variables, and Pearson's Chi-squared test for categorical ones. Risk-adjusted mortality by year was assessed by hospital standardised mortality ratios and 95% confidence intervals (regarding observed mortality as a binomial variable), which were obtained by dividing the number of observed hospital deaths in each year by the number predicted using the APACHE II system.6 Overall predictive performance of the APACHE II scoring system by year was gauged through discrimination (ability to discriminate between the patients who will die or survive at hospital discharge) and calibration (ability to predict mortality rate over classes of risk).
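The quantities used here can be sketched in a few lines of code. The Python fragment below is illustrative only (the analyses in this study were performed in Excel and Stata, and the function names are ours): the risk equation uses the published APACHE II logistic coefficients of Knaus et al, the confidence interval treats observed mortality as a binomial variable with a normal approximation, and the AUC and Hosmer-Lemeshow forms are the standard textbook ones; the DeLong comparison of ROC areas across years is not reproduced.

```python
import math

def apache2_risk(score, category_weight, emergency_surgery=False):
    """Predicted hospital risk of death from the APACHE II logistic equation
    of Knaus et al: logit = -3.517 + 0.146 * score (+ 0.603 after emergency
    surgery) + diagnostic category weight."""
    x = -3.517 + 0.146 * score + category_weight
    if emergency_surgery:
        x += 0.603
    return 1.0 / (1.0 + math.exp(-x))

def smr_with_ci(observed_deaths, predicted_risks, z=1.96):
    """Standardised mortality ratio (observed / predicted deaths) with a 95%
    CI that treats observed mortality as a binomial variable."""
    n = len(predicted_risks)
    expected = sum(predicted_risks)          # predicted number of deaths
    p = observed_deaths / n                  # crude mortality rate
    hw = z * math.sqrt(p * (1.0 - p) / n)    # normal-approximation half-width
    return (observed_deaths / expected,
            ((p - hw) * n / expected, (p + hw) * n / expected))

def roc_auc(risks, died):
    """Discrimination: AUC as the Mann-Whitney probability that a random
    non-survivor received a higher predicted risk than a random survivor."""
    pos = [r for r, d in zip(risks, died) if d]
    neg = [r for r, d in zip(risks, died) if not d]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def hosmer_lemeshow(risks, died, groups=10):
    """Calibration: chi-squared comparing observed with predicted deaths (and
    survivals) within each decile of predicted risk; the result is referred
    to a chi-squared distribution with groups - 2 degrees of freedom."""
    order = sorted(range(len(risks)), key=lambda i: risks[i])
    chi2 = 0.0
    for g in range(groups):
        idx = order[g * len(order) // groups:(g + 1) * len(order) // groups]
        if not idx:
            continue
        n, obs = len(idx), sum(died[i] for i in idx)
        exp = sum(risks[i] for i in idx)
        if 0.0 < exp < n:
            chi2 += (obs - exp) ** 2 / exp + (exp - obs) ** 2 / (n - exp)
    return chi2
```

As a worked example, a cohort with 10 observed deaths against 10 predicted deaths yields an SMR of 1.0, with the width of the confidence interval driven mainly by cohort size.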
Discrimination was assessed using receiver operating characteristic (ROC) curves, which plot the true positive rate (sensitivity: the proportion of observed hospital deaths correctly predicted) against the false positive rate (1-specificity: the proportion of observed hospital survivors incorrectly predicted to die). The predictive performance is indicated in this method by the ROC area under the curve (AUC), with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination. The slope of the curve indicates the ratio of true positives to false positives, which is also known as the likelihood ratio.7 For the analyses in this article, the equality of the ROC AUC for each year of study was compared.8 Calibration was assessed using the correspondence between the number of observed hospital deaths and the number of predicted hospital deaths within each 10% stratum (decile) of the cohort's expected risk of death. The predictive performance is indicated in this method by goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.9 The predictive performance of the APACHE II scoring system in major clinical subgroups was assessed using hospital standardised mortality ratios within each of the major "adclasses". All analyses were performed using Microsoft Excel (Microsoft Corporation, Seattle, WA, USA) and Intercooled Stata 9.2 (Statacorp, College Station, TX, USA) software.

Ethics—The need for formal approval for the research process was waived by the National (New Zealand) Health and Disability Ethics Committee under the provisions made for clinical audit.

Results

Data from 7703 patients were available for analysis. Baseline patient characteristics are presented in Table 1. The number of patients admitted to the ICU increased steadily from 686 in 1997 to 730 in 2005. The demographic characteristics of patients changed over the period of observation, with a trend to older and more Māori patients.
There has also been a change in the casemix of patients, with a reduction in the number of patients with diagnoses of poisoning and trauma, and an increase in the number of patients admitted after elective or emergency surgery. Patient length of stay has progressively reduced, as has the proportion of patients requiring mechanical ventilation. Overall hospital mortality decreased from approximately 19% at the beginning of the period of observation to approximately 12% at the end.

Figure 1. APACHE II scores and risk scores by year, presented as boxplots

Note: In these plots, the middle horizontal line represents the median; the box the second and third quartiles; and the whiskers the upper and lower extreme values which are no more than 1.5 × the interquartile range beyond the middle quartiles.

Figure 2. Hospital-standardised mortality ratio and 95% confidence intervals, by year

The APACHE II score decreased marginally over the period of observation as illustrated in Figure 1, with a median value of 14 in 1997 (IQR 9–21) and a corresponding value of 13 in 2005 (IQR 9–21). Although this reduction did achieve statistical significance (p=0.0001), it cannot be regarded as clinically important. The APACHE II predicted risk of death remained stable over the period of observation, with a minor trend to reduction that did not achieve statistical significance (p=0.11). The hospital-standardised mortality ratio decreased over the period of observation as illustrated in Figure 2, with a value of 0.94 (95% confidence interval 0.82–1.06) in 1997 and a corresponding value of 0.66 (95% confidence interval 0.55–0.76) in 2005. Model adequacy for discrimination by APACHE II score is illustrated by year in Figures 3 and 4. In general, the APACHE II score performs adequately in each year, with ROC curve AUCs of >0.8. However, there is deteriorating accuracy of mortality predictions over time (otherwise known as ‘model fade'10) that approaches statistical significance.
Corresponding model adequacy for discrimination by APACHE II predicted risk of death is illustrated in Figures 5 and 6. The risk model performs similarly to the APACHE II score, showing a like degree of ‘model fade'.

Figure 3. ROC curves for APACHE II score, by year

Note: The predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 4. ROC curve AUC (95%CI) for APACHE II score, by year, as shown in Figure 3

Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 5. ROC curves for the APACHE II risk score, by year

Note: The predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 6. ROC curve AUC (95%CI) for APACHE II risk score, by year, as shown in Figure 5

Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 7. Calibration curves for APACHE II predicted risk of death, by year, showing the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death. Predictive performance is assessed by the Hosmer-Lemeshow statistic (see Table 2)

Table 2. Model adequacy for calibration by APACHE II predicted risk of death, by year, as indicated by the Hosmer-Lemeshow goodness-of-fit statistic for each of the calibration curves in Figure 7
Note: A high Hosmer-Lemeshow statistic and a P value <0.05 indicate poor correspondence between the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death.

Year | Hosmer-Lemeshow goodness-of-fit statistic (P value)
1997 | 9.82 (0.132)
1998 | 11.14 (0.084)
1999 | 5.49 (0.482)
2000 | 3.31 (0.769)
2001 | 10.24 (0.111)
2002 | 8.04 (0.235)
2003 | 25.31 (0.0003)
2004 | 21.49 (0.001)
2005 | 19.41 (0.004)

Figure 8. Hospital-standardised mortality ratio (observed/predicted hospital deaths) for clinical diagnostic subgroups ("adclasses", as described in the Appendix)

Model adequacy for calibration by APACHE II predicted risk of death is illustrated by year in Figure 7. There is progressively poorer goodness-of-fit as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between the predicted and observed mortality from 2003 onwards, as shown in Table 2. Figure 8 illustrates model adequacy for discrimination by APACHE II predicted risk of death, according to clinical diagnostic subgroup. Although model adequacy was poorest in patients with neurological failure, there were only a small number of patients in this group. In contrast, the large number of patients with sepsis, respiratory failure, postoperative status, and circulatory failure makes the moderately poor model adequacy in these clinical subgroups more clinically relevant.

Discussion

Our data show that there has been a general change in the overall casemix of patients admitted to the Middlemore Hospital ICU, with a decrease in the number of patients with poisonings and trauma over the period of observation, and an increase in those with complications as a result of surgery. APACHE II scores have remained fairly constant over the period of observation, with only a subtle trend to decreasing patient illness severity that did not achieve statistical significance.
The data also show that there has been a reduction in crude and risk-adjusted mortality, as assessed by mortality rates and by hospital standardised mortality ratios. At the same time, there has been a steady drop in the proportion of patients receiving mechanical ventilation over the period of observation, and in the average length of patient stay. A correlation between mechanical ventilation and increments in length of patient ICU stay has been noted in other studies.3 This change in outcomes and practice pattern may reflect the benefits of clinical pathways within our hospital, and the earlier detection and correction of physiological derangements that occurs in the modern, more pro-active approach to the provision of intensive care. An alternative, more pessimistic view is that this scenario may reflect earlier discharges from our ICU to accommodate increasing demand in a setting of increasingly limited resources. Reassuringly, if this latter scenario is the true one, then outcomes appear to have been maintained despite it. The data are in general terms consistent with a recent paper by Moran et al reporting on intensive care outcomes using the binational Australian and New Zealand ICU database (ANZICS database), which to date has not included data from Middlemore Hospital and can therefore be regarded as independent. These investigators reported an improvement in overall risk-adjusted mortality over the last 11 years, which they did not attribute to any one specific factor.3 Most medical administrators and practitioners would consider these improved outcomes to be in some part causally related to corresponding improvements in clinical care and therapeutic interventions. It would, however, take a more complex minimum dataset than either the ANZICS database or our local one to study this question appropriately. There are two major findings of this study relating to the predictive performance of the APACHE II system.
The first is that there has been progressive deterioration in model adequacy, in terms of both discrimination and calibration. Predictive performance is generally acceptable when ROC curve AUCs are >0.8, and by these and similar criteria it seems that continuing use of this system in our current practice may be unreasonable. The second is that the performance of the APACHE II system has been better sustained in some clinical diagnostic subgroups than in others. As is common to most ICUs, the largest clinical diagnostic subgroups in our dataset are sepsis and post-surgical complications, and the APACHE II system has moderately poor model adequacy in these subgroups, with a prediction error of between 25% and 50%. Of note, the subgroups with the largest prediction error in our dataset constitute only ~10% of the entire Middlemore ICU population. The finding of ‘model fade' over time is also consistent with that of Moran et al, who demonstrated deteriorating model adequacy for the APACHE II system over time, in terms of both discrimination and calibration. This was the case even after the authors recalibrated the APACHE II model by re-estimating coefficients for the Australasian population, thereby optimising discrimination and calibration. This is an important subtlety, since the performance of all illness severity scoring models is well known to be poorer in populations that are different from those in which they were developed. Such simple recalibration adjusts for geographical differences in measured patient characteristics (physiology and diagnosis), although it does not consider ICU characteristics or the differing organisational characteristics of healthcare systems as predictive variables. The Intensive Care National Audit and Research Centre (ICNARC) model is in essence an adaptation of the APACHE model that was developed by Rowan et al.
in the 1990s in the United Kingdom,10 but over the years it has evolved into a completely independent model that is widely used in the UK.11 Opinion leaders now recommend regular recalibration of illness scoring systems to local and more contemporary cohorts,12 although to our knowledge there is no consensus, or even a proposition, concerning thresholds for model performance that would trigger the recalibration process, or standardised methodology around the recalibration itself. ‘Model fade' and poor model performance in diagnostic subgroups have led to the evolution of existing scores into third and fourth generations of illness severity scoring systems, such as SAPS III and APACHE III and IV.2,12 The evolution of these scores did not involve simple recalibration of models by re-estimating coefficients; instead it involved the application of new statistical methods, the addition of new variables, an increase in the number of diagnostic groups, and a change to the measurement of certain physiological and diagnostic variables. These scores can be expected to perform better as a result of their development in a cohort that is more contemporary and externally valid in terms of casemix, and also through their use of clinical information that was not initially taken into consideration during the development of the earlier systems. There is a widespread move amongst ICUs to this newer generation of illness scoring systems, although their performance is only marginally better than that of earlier versions of the scores that have been more simply recalibrated by re-estimating coefficients.13 Notwithstanding this, the APACHE III system is currently used more widely in the USA, with demonstrably greater discrimination and calibration than the original APACHE II system.2 It is too early to say whether more recent evolutions of these systems, such as APACHE IV and SAPS III, will demonstrate continued improvement.
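The simplest form of the recalibration described above, re-estimating coefficients against a local and contemporary cohort, can be illustrated by refitting an intercept and slope on the logit of the original predicted risk: this shifts calibration while leaving the ranking of patients, and hence discrimination, unchanged. The sketch below is an illustrative Python fragment under that assumption; it is not the method used by Moran et al or by ICNARC, and the function names are hypothetical.

```python
import math

def recalibrate(old_risks, died, iters=2000, lr=0.1):
    """Logistic recalibration: refit intercept a and slope b in
    P(death) = 1 / (1 + exp(-(a + b * logit(old_risk)))) on a local cohort
    by gradient descent on the log-loss. a = 0, b = 1 leaves risks unchanged,
    so the fitted values show how far the old model has drifted locally."""
    x = [math.log(r / (1.0 - r)) for r in old_risks]  # logit of old risk
    a, b, n = 0.0, 1.0, len(x)
    for _ in range(iters):
        ga = gb = 0.0
        for xi, yi in zip(x, died):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            ga += p - yi              # gradient of the log-loss w.r.t. a
            gb += (p - yi) * xi       # gradient of the log-loss w.r.t. b
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

def recalibrated_risk(old_risk, a, b):
    """Apply the refitted coefficients to an original predicted risk."""
    logit = math.log(old_risk / (1.0 - old_risk))
    return 1.0 / (1.0 + math.exp(-(a + b * logit)))
```

On a cohort where the original model is already well calibrated, the fit returns approximately a = 0 and b = 1; a fitted intercept below zero would indicate that the old model now over-predicts mortality, which is the pattern of ‘model fade' reported here.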
The findings of our study do not address one of the conundrums of illness severity scoring: the interpretation of changes in scores and outcomes over time. As with other studies, it is impossible to tell from our data whether our results are due to improved patient care and access to care, or alternatively to the deteriorating performance of scoring systems because of changing patient casemix. Our cumulative clinical experienc

Summary

Abstract

Aim

The Acute Physiological and Chronic Health Evaluation (APACHE) II score is a popular illness severity scoring system for intensive care units. Scoring systems such as the APACHE II allow researchers and clinicians to quantify patient illness severity with a greater degree of accuracy and precision, which is critical when evaluating practice patterns and outcomes, both within and between intensive care units. The study aims to: assess changes in APACHE II scores and the hospital-standardised mortality ratio at our ICU over a 9-year period from 1 January 1997 to 31 December 2005; assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise.

Method

Retrospective audit of a single-centre relational database, with evaluation of the APACHE II scoring system by year through discrimination (ability to discriminate between the patients who will die or survive at hospital discharge) using receiver operating characteristic (ROC) curves, and calibration (ability to predict mortality rate over classes of risk) using goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.

Results

Data from 7703 patients were available for analysis. There was a decrease in overall hospital mortality, from approximately 19% at the beginning of the period of observation to approximately 12% at the end. There was also a decrease in the hospital standardised mortality ratio from 0.94 (95%CI 0.82–1.06) to 0.66 (95%CI 0.55–0.76). In general, both the APACHE II score and risk of death model performed adequately in each year with ROC curve AUCs of >0.8, albeit with progressively poorer performance over time and model fade that approached statistical significance. There was progressively poorer calibration with the APACHE II risk of death model as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between the predicted and observed mortality from 2003 onwards. Overall, there was moderately poor model performance in the diagnostic groups with the largest number of patients (sepsis and post-surgical complications).

Conclusion

This study shows the progressively worse performance of the APACHE II illness severity scoring system over time due to model fade. This is especially so in common diagnostic categories, making this a clinically relevant finding. Future approaches to illness severity scoring should be tested and compared, such as re-estimating coefficients of the APACHE II diagnostic categories or using locally developed ones, moving to later evolutions of the system such as the APACHE III or APACHE IV, or developing novel artificial intelligence approaches.

Author Information

Susan L Mann, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland; Mark R Marshall, Nephrologist, Department of Internal Medicine, Counties Manukau District Health Board, Manukau, Auckland; Alec Holt, Director Health Informatics Programme, Department of Information Science, University of Otago, Dunedin; Brendon Woodford, Department of Information Science, University of Otago, Dunedin; Anthony B Williams, Intensivist, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland

Acknowledgements

The authors thank Mr Mpatisi Moyo (Decision Support, Middlemore Hospital) and Mr Gary Jackson (Public Health Physician, Counties Manukau District Health Board).

Correspondence

Susan L Mann, PO Box 25-075, St Heliers, Auckland 1740, New Zealand. Fax: +64 (0)9 2760034

Correspondence Email

smann@xtra.co.nz

Competing Interests

None known.

References

1. Zimmerman JE, Knaus WA, Judson JA, et al. Patient selection for intensive care: a comparison of New Zealand and United States hospitals. Crit Care Med. 1988;16(4):318-26.
2. Zimmerman J, Kramer A. Outcome prediction in critical care: the APACHE physiology and chronic health evaluation models. Curr Opin Crit Care. 2008;14:491-7.
3. Moran JL, Bristow P, Solomon P, et al. Mortality and length-of-stay outcomes, 1993-2003, in the binational Australian and New Zealand intensive care adult patient database. Crit Care Med. 2008;36(1):46-60.
4. http://www.cmdhb.govt.nz/About_CMDHB/Overview/population-profile.htm
5. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818-29.
6. Morris J, Gardner M. Calculating confidence intervals for relative risks (odds ratios) and standardised ratios and rates. BMJ. 1988;296:1313-6.
7. Hanley JA, McNeil B. The meaning and use of a ROC curve. Radiology. 1982;143:29-36.
8. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837-45.
9. Lemeshow S, Hosmer D. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982;115:92-106.
10. Rowan K, Kerr J, Major K, et al. Intensive Care Society's APACHE II study in Britain and Ireland-II: outcome comparisons of intensive care units after adjustment for case-mix by the American APACHE II method. BMJ. 1993;307:977-81.
11. Harrison D, Parry G, Carpenter J, et al. A new risk prediction model for critical care: the Intensive Care National Audit & Research Centre (ICNARC) model. Crit Care Med. 2007;35:1091-8.
12. Capuzzo M, Moreno R, LeGall J. Outcome prediction in critical care: the simplified acute physiology score models. Curr Opin Crit Care. 2008;14:485-90.
13. Harrison D, Brady AR, Parry GJ, et al. Recalibration of risk prediction models in a large multicentre cohort of admissions to adult, general critical care units in the United Kingdom. Crit Care Med. 2006;34(5):1378-88.
14. Moreno R. Outcome prediction in intensive care: why we need to reinvent the wheel. Curr Opin Crit Care. 2008;14:483-4.
15. Frize M, Walker R. Clinical decision-support systems for intensive care units using case-based reasoning. Med Eng Phys. 2000;22(9):671-7.
16. Clermont G. Artificial neural networks as prediction tools in the critically ill. Crit Care. 2005;9(2):153-4.
17. Clermont G, Angus DC, DiRusso SM, et al. Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models. Crit Care Med. 2001;29(2):291-6.
18. Holt A, Bichindaritz I, Schmidt R, Perner P. Medical applications in case-based reasoning. The Knowledge Engineering Review. 2006;20(3):289-92.


The NZ ICUs admitted fewer patients with severe chronic failing health (NZ 8.7%, US 18%) and following elective surgery (NZ 8%, US 40%). Approximately half the NZ admissions were for trauma, drug overdose, and asthma, while these diagnoses accounted for 11% of US admissions. When controlled for differences in casemix and severity of illness, hospital mortality rates in NZ were comparable to those in the US. This study demonstrated substantial differences in patient selection between these US and NZ hospitals. Furthermore, after more than two decades of use, it is unclear whether the performance of the APACHE II has been maintained. Patient casemix in New Zealand has changed from earlier times, and the improvements in supportive care that are now available may have decreased mortality for any given illness severity. International opinion leaders are in general moving towards the more recently developed scoring systems such as APACHE versions III and IV, which have been shown to outperform older versions in studies of North American and European ICU populations.2 This paradigm is slowly translating to clinical practice in this part of the world: the Australian and New Zealand Intensive Care Society adult patient database now collects data sufficient to model both APACHE II and III scores.3 There are three aims of this study. We aim to: assess change in APACHE II scores and the hospital standardised mortality ratio at our ICU over a 9-year period from 1 January 1997 to 31 December 2005; assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise.

Methods

Study population and setting—Middlemore Hospital is the main hospital within the Counties Manukau District Health Board (CMDHB). The hospital serves a large urban population.
The district catchment includes Manukau City which is rapidly expanding: the population has grown from 356,006 in 1996 to 454,655 at last census in 2006. The population can be summarily characterised as being young , multi-ethnic, and of low socioeconomic status compared with the rest of New Zealand.4 Middlemore Hospital is a tertiary referral centre for plastic surgery, burns, orthopaedics, and a range of medical sub-specialities. Any patient requiring neurosurgical or cardiothoracic surgical intervention is referred on to Auckland City Hospital as Middlemore Hospital does not have these facilities; all other patient categories remain at Middlemore Hospital. Although there is a specialist regional paediatric hospital in the area, Middlemore Hospital is also a paediatric hospital; the Middlemore Hospital ICU therefore cares for those children down to 2 kg weight requiring intensive care accounting for approximately 120 paediatric admissions per year. The hospital is academically affiliated and thus a teaching institution. Middlemore Hospital has had between 700 and 900 acute beds over the time in which this research was done, and now also includes a satellite surgical centre which caters for the majority of elective cases apart from those that are particularly high risk. Currently, the Middlemore ICU is nominally a seven funded-bed Level 3 facility. Since the inception of the Middlemore Hospital ICU in the late 1960s, the unit has been structurally modified on several occasions. As a result of the both national and local changes in healthcare strategy, the unit had at times had nominated HDU beds, and at other times not. Since 2004, there has been a four-funded bed Level I intensive care unit at a satellite surgical centre, which currently shares clinical governance, staff, policies and procedures with the main ICU at Middlemore Hospital. These patients were not included in this study. 
Data source—All data were sourced from a single-centre relational database that has been in continuous use at the Middlemore Hospital ICU since January 1986. The database contains information on all patients admitted to ICU during this period, using data that is prospectively collected, collated, and agreed upon by senior specialists and the charge nurse at the time. Data collection was progressively expanded during this period to ultimately include demographic information, APACHE II score, diagnostic information, ventilatory and inotropic support, procedures performed, and patient outcome. Patients who were less than 15 years of age, or who had been admitted solely for the purpose of a procedure such as difficult central venous line or endoscopy were not scored, as the system was not devised for these groups. The database specifically includes both patient death at both ICU and hospital discharge. The database includes locally developed diagnostic codes ("adclasses" and "subclasses") in addition to the APACHE II ones, which were developed to better reflect and discriminate disease categories related to the local population (see Appendix). Generic APACHE II diagnostic codes do necessarily provide a realistic reflection of the local disease categories and population outcomes. They can be ‘localised' by adjustments to either disease categorisation and / or the category weights subsequently used with the APACHE II scores for calculating risk of death supported in the case of Middlemore Hospital by Zimmerman et al who emphasised differences between North American and New Zealand ICU patient populations.1 Data were prospectively stored in Microsoft Access (Microsoft Corporation, Seattle, WA, USA), and retrospectively abstracted for analyses from a 9-year period from 1 January 1997 to 31 December 2005. 
Calculation of APACHE II scores and risk of death—All APACHE II scores and risk of death were calculated at patient hospital discharge using the prospectively stored data and the logistic regression equation developed by Knaus et al.5 The data for calculation of the APACHE II score included physiological measurements in the first 24 hours of ICU admission, age and chronic health status. The APACHE II risk of death is calculated not only from scores but also diagnostic categories, which were rigorously and continuously evaluated by the senior ICU medical staff during the process of prospective data collection. Such minimisation of misclassification was necessary to avoid error arising from the heavy reliance of the APACHE II risk of death formula on reason for ICU admission. Statistics—Standard statistics were used to describe data, making particular use of median and interquartile range to avoid assumptions around data distribution. Hypothesis testing was undertaken using Kruskal-Wallis equality-of-populations rank test for continuous variables, and the Pearson's Chi-squared test for categorical ones. Risk-adjusted mortality by year was assessed by hospital standardised mortality ratios and 95% confidence intervals (regarding observed mortality as a binomial variable), which were obtained by dividing the number of observed hospital deaths in each year by the number of predicted ones using the APACHE II system.6 Overall predictive performance of the APACHE II scoring system by year was gauged through discrimination (ability to discriminate between the patients who will die or survive at hospital discharge) and calibration (ability to predict mortality rate over classes of risk). 
Discrimination was assessed using receiver operating characteristic (ROC) curves, which plot the true positive rate (sensitivity, or predicted hospital deaths / observed hospital deaths) against the false positive rate (1-specificity, or 1-predicted hospital deaths / observed hospital deaths). The predictive performance is indicated in this method by the ROC area under the curve (AUC), with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination. The slope the curve indicates ratio of true positives and false positives, which also is known as the likelihood ratio.7 For the analyses in this article, equality of ROC AUC for each year of study was compared.8 Calibration was assessed using the correspondence between the number of observed hospital deaths and the number of predicted hospital deaths within each 10% stratum (decile) of the cohort's expected risk of death. The predictive performance is indicated in this method by goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.9 The predictive performance of the APACHE II scoring system in major clinical subgroups was assessed by discrimination using hospital standardised mortality ratios within each of the major "adclasses". All analyses were performed using Microsoft Excel (Microsoft Corporation, Seattle, WA, USA) and Intercooled Stata 9.2 (Statacorp, College Station, TX, USA) software. Ethics—The need for formal approval for the research process was waived by the National (New Zealand) Health and Disability Ethics Committee under the provisions made for clinical audit. Results Data from 7703 patients were available for analysis. Baseline patient characteristics are presented in Table 1. Numbers of patients admitted to the ICU increased steadily from 686 in 1997 to 730 in 2005. The demographic characteristics of patients changed over the period of observation, with a trend to older and more Māori patients. 
There has also been a change in the casemix of patients, with a reduction in the number of patients with diagnoses of poisoning and trauma, and an increase in the number of patients admitted after elective or emergency surgery. Patient length of stay has progressively reduced, as has the proportion of patients requiring mechanical ventilation. Overall hospital mortality decreased from approximately 19% at the beginning of the period of observation to approximately 12% at the end.

Figure 1. APACHE II scores and risk scores by year, presented as boxplots. Note: In these plots, the middle horizontal line represents the median; the box the second and third quartiles; and the whiskers the upper and lower extreme values that are no more than 1.5 × the interquartile range beyond the middle quartiles.

Figure 2. Hospital-standardised mortality ratio and 95% confidence intervals, by year

The APACHE II score decreased marginally over the period of observation as illustrated in Figure 1, with a median value of 14 in 1997 (IQR 9–21) and a corresponding value of 13 in 2005 (IQR 9–21). Although this reduction did achieve statistical significance (p=0.0001), it cannot be regarded as clinically important. The APACHE II predicted risk of death remained stable over the period of observation, with a minor trend to reduction that did not achieve statistical significance (p=0.11). The hospital-standardised mortality ratio decreased over the period of observation as illustrated in Figure 2, with a value of 0.94 (95% confidence interval 0.82–1.06) in 1997 and a corresponding value of 0.66 (95% confidence interval 0.55–0.76) in 2005. Model adequacy for discrimination by APACHE II score is illustrated by year in Figures 3 and 4. In general, the APACHE II score performs adequately in each year, with ROC curve AUCs of >0.8. However, there is deteriorating accuracy of mortality predictions over time (otherwise known as ‘model fade'10) that approaches statistical significance.
Corresponding model adequacy for discrimination by APACHE II predicted risk of death is illustrated in Figures 5 and 6. The risk model performs similarly to the APACHE II score, showing a like degree of ‘model fade'.

Figure 3. ROC curves for APACHE II score, by year. Note: The predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 4. ROC curve AUC (95%CI) for APACHE II score, by year, as shown in Figure 3. Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 5. ROC curves for the APACHE II risk score, by year. Note: The predictive performance is indicated by the ROC area, with a value of 0.5 equating to random prediction and a value of 1.0 equating to perfect discrimination.

Figure 6. ROC curve AUC (95%CI) for APACHE II risk score, by year, as shown in Figure 5. Note: Marker labels indicate the P value for the test of equality of ROC areas relative to the reference year of 1997.

Figure 7. Calibration curves for APACHE II predicted risk of death, by year, showing the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death. Predictive performance is assessed by the Hosmer-Lemeshow statistic (see Table 2).

Table 2. Model adequacy for calibration by APACHE II predicted risk of death, by year, as indicated by the Hosmer-Lemeshow goodness-of-fit statistic for each of the calibration curves in Figure 7.
A high Hosmer-Lemeshow statistic with a P value <0.05 indicates poor correspondence between the number of observed and predicted deaths within each 10% stratum (decile) of the cohort's expected risk of death.

Year    Hosmer-Lemeshow goodness-of-fit statistic (P value)
1997    9.82 (0.132)
1998    11.14 (0.084)
1999    5.49 (0.482)
2000    3.31 (0.769)
2001    10.24 (0.111)
2002    8.04 (0.235)
2003    25.31 (0.0003)
2004    21.49 (0.001)
2005    19.41 (0.004)

Figure 8. Hospital-standardised mortality ratio (observed/predicted hospital deaths) for clinical diagnostic subgroups ("adclasses" as described in Appendix)

Model adequacy for calibration by APACHE II predicted risk of death is illustrated by year in Figure 7. There is progressively poorer goodness-of-fit as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between the predicted and observed mortality from 2003 onwards, as shown in Table 2. Figure 8 illustrates model adequacy for discrimination by APACHE II predicted risk of death, according to clinical diagnostic subgroup. Although model adequacy was poorest in patients with neurological failure, there were only a small number of patients in this group. In contrast, the large number of patients with sepsis, respiratory failure, postoperative status, and circulatory failure makes the moderately poor model adequacy in these clinical subgroups more clinically relevant.

Discussion

Our data show that there has generally been a change in the overall casemix of patients admitted to the Middlemore Hospital ICU, with a decrease in the number of patients with poisonings and trauma over the period of observation, and an increase in those with complications as a result of surgery. APACHE II scores have remained fairly constant over the period of observation, with only a subtle trend to decreasing patient illness severity that did not achieve statistical significance.
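For readers unfamiliar with the calibration measure reported in Table 2, the decile-based Hosmer-Lemeshow statistic can be sketched in outline as follows. This is an illustrative reconstruction of the standard C statistic, not the code used for the study's analyses:

```python
import math

def hosmer_lemeshow(risks, died, groups=10):
    """Hosmer-Lemeshow C statistic over risk strata (illustrative
    sketch).  Patients are sorted by predicted risk and split into
    roughly equal groups (deciles by default); the statistic sums the
    squared gaps between observed and expected deaths per group.
    Compare against a chi-squared distribution with groups - 2 degrees
    of freedom to obtain a P value."""
    paired = sorted(zip(risks, died))
    size = math.ceil(len(paired) / groups)
    stat = 0.0
    for i in range(0, len(paired), size):
        chunk = paired[i:i + size]
        expected = sum(r for r, _ in chunk)   # predicted deaths in stratum
        observed = sum(d for _, d in chunk)   # observed deaths in stratum
        n = len(chunk)
        p_bar = expected / n
        if 0 < p_bar < 1:                     # skip degenerate strata
            stat += (observed - expected) ** 2 / (n * p_bar * (1 - p_bar))
    return stat
```

Perfectly calibrated predictions give a statistic near zero; large values (with P < 0.05) indicate the poor observed-versus-predicted correspondence seen from 2003 onwards in Table 2.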
The data also show that there has been a reduction in crude and risk-adjusted mortality, as assessed by mortality rates and by hospital standardised mortality ratios. Despite this, there has been a steady drop in the proportion of patients receiving mechanical ventilation over the period of observation, and in the average length of patient stay. A correlation between mechanical ventilation and increments in length of patient ICU stay has been noted in other studies.3 This change in outcomes and practice pattern may reflect the benefits of clinical pathways within our hospital, and the earlier detection and correction of physiological derangements that occur in the modern, more pro-active approach to the provision of intensive care. An alternative, more pessimistic view is that this scenario may reflect earlier discharges from our ICU to accommodate increasing demand in a setting of increasingly limited resources. Reassuringly, if this latter scenario is the true one, outcomes appear to have been maintained despite this. The data are in general terms consistent with a recent paper by Moran et al reporting on intensive care outcomes using the binational Australian and New Zealand ICU database (ANZICS database), which to date has not included data from Middlemore Hospital and can therefore be regarded as independent. These investigators reported an improvement in overall risk-adjusted mortality over the last 11 years, which they did not attribute to any one specific factor.3 Most medical administrators and practitioners would consider these improved outcomes to be in some part causally related to corresponding improvements in clinical care and therapeutic interventions. It would, however, take a more complex minimum dataset than either the ANZICS database or our local one to study this question appropriately. There are two major findings of this study relating to the predictive performance of the APACHE II system.
The first is that there has been a progressive deterioration in model adequacy, in terms of both discrimination and calibration. Predictive performance is generally acceptable when ROC curve AUCs are >0.8, and using these and similar criteria it seems that continuing use of this system in our current practice may be unreasonable. The second is that the performance of the APACHE II system has been better sustained in some clinical diagnostic subgroups than in others. As is common to most ICUs, the largest clinical diagnostic subgroups in our dataset are sepsis and post-surgical complications, and the APACHE II system has moderately poor model adequacy in these subgroups, with prediction error of between 25% and 50%. Of note, the subgroups with the largest prediction error in our dataset constitute only ~10% of the entire Middlemore ICU population. The finding of ‘model fade' over time is also consistent with those of Moran et al, who demonstrated deteriorating model adequacy for the APACHE II system over time, in terms of both discrimination and calibration. This was the case even after the authors recalibrated the APACHE II model by re-estimating coefficients for the Australasian population, thereby optimising discrimination and calibration. This is an important subtlety, since the performance of all illness severity scoring models is well known to be poorer in populations that differ from those in which they were developed. Such simple recalibration adjusts for geographical differences in measured patient characteristics (physiology and diagnosis), although it does not consider ICU characteristics or the different organisational characteristics of healthcare systems as predictive variables. The Intensive Care National Audit and Research Centre (ICNARC) model is in essence an adaptation of the APACHE model that was developed by Rowan et al.
in the 1990s in the United Kingdom,10 but over the years it has evolved into a completely independent model that is widely used in the UK.11 Opinion leaders now recommend regular recalibration of illness scoring systems to local and more contemporary cohorts,12 although to our knowledge there is no consensus, or even any proposition, concerning thresholds for model performance that would trigger the recalibration process, or standardised methodology around the recalibration itself. ‘Model fade' and poor model performance in diagnostic subgroups have led to the evolution of existing scores into third and fourth generations of illness severity scoring systems, such as SAPS III and APACHE III and IV.2,12 The evolution of these scores did not involve simple recalibration by re-estimating coefficients; instead it involved the application of new statistical methods, the addition of new variables, an increase in the number of diagnostic groups, and changes to the measurement of certain physiological and diagnostic variables. These scores can be expected to perform better as a result of their development in cohorts that are more contemporary and externally valid in terms of casemix, and also through the use of clinical information that was not taken into consideration during the development of the earlier systems. There is a widespread move amongst ICUs to this newer generation of illness scoring systems, although their performance is only marginally better than that of earlier versions that have been more simply recalibrated by re-estimating coefficients.13 Notwithstanding, the APACHE III system is currently used more widely in the USA, with demonstrably greater discrimination and calibration than the original APACHE II system.2 It is too early to say whether more recent evolutions of these systems, such as APACHE IV and SAPS III, will demonstrate continued improvement.
The findings of our study do not address one of the conundrums of illness severity scoring: the interpretation of changes in scores and outcomes over time. As with other studies, it is impossible to tell from our data whether our results are due to improved patient care and access to care, or alternatively to the deteriorating performance of scoring systems because of changing patient casemix. Our cumulative clinical experience

Summary

Abstract

Aim

The Acute Physiological and Chronic Health Evaluation (APACHE) II score is a popular illness severity scoring system for intensive care units. Scoring systems such as APACHE II allow researchers and clinicians to quantify patient illness severity with a greater degree of accuracy and precision, which is critical when evaluating practice patterns and outcomes, both within and between intensive care units. This study aims to: assess changes in APACHE II scores and the hospital-standardised mortality ratio at our ICU over a nine-year period from 1 January 1997 to 31 December 2005; assess for changes in the performance of the APACHE II scoring system in predicting patient hospital mortality over the same period; and assess for any clinical subgroups in which APACHE II scoring was particularly inaccurate or imprecise.

Method

Retrospective audit of a single centre relational database, with evaluation of the APACHE II scoring system by year through discrimination (ability to discriminate between the patients who will die or survive at hospital discharge) using receiver operating characteristic (ROC) curves, and calibration (ability to predict mortality rate over classes of risk) using goodness-of-fit as assessed by the Hosmer-Lemeshow statistic.

Results

Data from 7703 patients were available for analysis. There was a decrease in overall hospital mortality, from approximately 19% at the beginning of the period of observation to approximately 12% at the end. There was also a decrease in the hospital standardised mortality ratio from 0.94 (95%CI 0.82-1.06) to 0.66 (95%CI 0.55-0.76). In general, both the APACHE II score and risk of death model performed adequately in each year with ROC curve AUCs of >0.8, albeit with progressively poorer performance over time and model fade that approached statistical significance. There was progressively poorer calibration with the APACHE II risk of death model as indicated by the Hosmer-Lemeshow statistic, with a statistically significant difference between the predicted and observed mortality from 2003 onwards. Overall, there was moderately poor model performance in the diagnostic groups with the largest number of patients (sepsis and post-surgical complications).

Conclusion

This study shows the progressively worse performance of the APACHE II illness severity scoring system over time due to model fade. This is especially so in common diagnostic categories, making this a clinically relevant finding. Future approaches to illness severity scoring should be tested and compared, such as re-estimating coefficients of the APACHE II diagnostic categories or using locally developed ones, moving to later evolutions of the system such as the APACHE III or APACHE IV, or developing novel artificial intelligence approaches.

Author Information

Susan L Mann, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland; Mark R Marshall, Nephrologist, Department of Internal Medicine, Counties Manukau District Health Board, Manukau, Auckland; Alec Holt, Director Health Informatics Programme, Department of Information Science, University of Otago, Dunedin; Brendon Woodford, Department of Information Science, University of Otago, Dunedin; Anthony B Williams, Intensivist, Department of Intensive Care Medicine, Counties Manukau District Health Board, Manukau, South Auckland

Acknowledgements

The authors thank Mr Mpatisi Moyo (Decision Support, Middlemore Hospital) and Mr Gary Jackson (Public Health Physician, Counties Manukau District Health Board).

Correspondence

Susan L Mann, PO Box 25-075, St Heliers, Auckland 1740, New Zealand. Fax: +64 (0)9 2760034

Correspondence Email

smann@xtra.co.nz

Competing Interests

None known.

References

1. Zimmerman JE, Knaus WA, Judson JA, et al. Patient selection for intensive care: a comparison of New Zealand and United States hospitals. Crit Care Med. 1988;16(4):318-26.
2. Zimmerman J, Kramer A. Outcome prediction in critical care: the Acute Physiology and Chronic Health Evaluation models. Curr Opin Crit Care. 2008;14:491-7.
3. Moran JL, Bristow P, Solomon P, et al. Mortality and length-of-stay outcomes, 1993-2003, in the binational Australian and New Zealand intensive care adult patient database. Crit Care Med. 2008;36(1):46-60.
4. http://www.cmdhb.govt.nz/About_CMDHB/Overview/population-profile.htm
5. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. APACHE II: a severity of disease classification system. Crit Care Med. 1985;13(10):818-29.
6. Morris J, Gardner M. Calculating confidence intervals for relative risks (odds ratios) and standardised ratios and rates. BMJ. 1988;296:1313-6.
7. Hanley JA, McNeil B. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology. 1982;143:29-36.
8. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44:837-45.
9. Lemeshow S, Hosmer D. A review of goodness of fit statistics for use in the development of logistic regression models. Am J Epidemiol. 1982;115:92-106.
10. Rowan K, Kerr J, Major K, et al. Intensive Care Society's APACHE II study in Britain and Ireland-II: outcome comparisons of intensive care units after adjustment for case mix by the American APACHE II method. BMJ. 1993;307:977-81.
11. Harrison D, Parry G, Carpenter J, et al. A new risk prediction model for critical care: the Intensive Care National Audit & Research Centre (ICNARC) model. Crit Care Med. 2007;35:1091-8.
12. Capuzzo M, Moreno R, LeGall J. Outcome prediction in critical care: the Simplified Acute Physiology Score models. Curr Opin Crit Care. 2008;14:485-90.
13. Harrison D, Brady AR, Parry GJ, et al. Recalibration of risk prediction models in a large multicentre cohort of admissions to adult, general critical care units in the United Kingdom. Crit Care Med. 2006;34(5):1378-88.
14. Moreno R. Outcome prediction in intensive care: why we need to reinvent the wheel. Curr Opin Crit Care. 2008;14:483-4.
15. Frize M, Walker R. Clinical decision-support systems for intensive care units using case-based reasoning. Med Eng Phys. 2000;22(9):671-7.
16. Clermont G. Artificial neural networks as prediction tools in the critically ill. Crit Care. 2005;9(2):153-4.
17. Clermont G, Angus DC, DiRusso SM, et al. Predicting hospital mortality for patients in the intensive care unit: a comparison of artificial neural networks with logistic regression models. Crit Care Med. 2001;29(2):291-6.
18. Holt A, Bichindaritz I, Schmidt R, Perner P. Medical applications in case-based reasoning. The Knowledge Engineering Review. 2006;20(3):289-92.
