Re-analysis and generation of Overstay2 model
This page describes the development of the model that generates scores/colours for Project Overstay2. Because our data collection and the healthcare system have changed since the first iteration, we re-analyzed the data and regenerated the model, resulting in the Overstay2 scoring models that generate the Overstay2 colour. Also see the Overstay2 Overview.
Defining the contributing factors data
The model depends on a regression analysis of a number of possible factors in our regularly collected data. Our data structure had changed since the original project, so we cleaned up our definitions, resulting in the Data definition for factor candidates for the Overstay2 project.
Still needs:
Model dataset and date range
- Dataset: We used the file 2025-2-3_13.56.31_Centralized_data.accdb as the basis for the project. A copy for future reference is at:
  - \\ad.wrha.mb.ca\WRHA\HSC\shared\MED\MED_CCMED\Julie\MedProjects\Overstay_Project_2025
- Reference Admit DtTm: We based the date range on the first medicine admit date during a hospitalization (see Data definition for factor candidates for the Overstay2 project#Hospitalization), using the earliest Boarding Loc dttm.
- Dataset inclusion criteria: all of the following (a hedged SAS sketch of these filters appears after the table below)
  - Reference Admit DtTm >= 2020-11-01 and < 2025-01-01
  - RecordStatus = Vetted
  - the final disposition of the hospitalization (see Data definition for factor candidates for the Overstay2 project#Hospitalization) is to a destination outside the hospital of the admission (can be to another hospital)
  - HOBS: include the record only if:
    - the first medicine admission during a hospitalization is on a HOBS unit, and
    - there is a Transfer_Ready_Dttm associated with that unit, and
    - the patient is discharged from that unit to a destination outside the hospital of the admission (can be to another hospital)
- This resulted in a dataset with the following:
  - Total hospitalizations: 42,078
Site | Data Set | Total | Overstay >= 10 days | Overstay < 10 days |
---|---|---|---|---|
All | All | 42,078 | 1,741 (4.1%) | 40,337 (95.9%) |
All | Training | 21,054 | 859 (4.1%) | 20,195 (95.9%) |
All | Validation | 21,024 | 882 (4.2%) | 20,142 (95.8%) |
HSC | All | 16,813 | 616 (3.7%) | 16,197 (96.3%) |
HSC | Training | 8,371 | 295 (3.5%) | 8,076 (96.5%) |
HSC | Validation | 8,442 | 321 (3.8%) | 8,121 (96.2%) |
SBGH | All | 13,762 | 398 (2.9%) | 13,364 (97.1%) |
SBGH | Training | 6,905 | 204 (3.0%) | 6,701 (97.0%) |
SBGH | Validation | 6,857 | 194 (2.8%) | 6,663 (97.2%) |
GGH | All | 11,503 | 727 (6.3%) | 10,776 (93.7%) |
GGH | Training | 5,778 | 360 (6.2%) | 5,418 (93.8%) |
GGH | Validation | 5,725 | 367 (6.4%) | 5,358 (93.6%) |
The SAS code defining this dataset can be found in S:\MED\MED_CCMED\Julie\MedProjects\Overstay_Project_2025\Data\prepdata_7Feb2025.sas
The CFE code defining this dataset
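For orientation, here is a minimal SAS sketch of the inclusion filters. All dataset and variable names (centralized_data, Ref_Admit_DtTm, RecordStatus, Final_Dispo_Outside_Site) are assumptions for illustration; the authoritative logic is in prepdata_7Feb2025.sas above.

```sas
/* Hedged sketch of the dataset inclusion criteria.
   All names are illustrative assumptions; see prepdata_7Feb2025.sas
   for the real definitions. */
data overstay2_cohort;
    set centralized_data;
    where datepart(Ref_Admit_DtTm) >= '01NOV2020'd   /* reference admit date range */
      and datepart(Ref_Admit_DtTm) <  '01JAN2025'd
      and RecordStatus = 'Vetted'                    /* vetted records only */
      and Final_Dispo_Outside_Site = 1;              /* final dispo outside the admitting hospital */
    /* HOBS records would additionally require: first medicine admission
       on a HOBS unit, a Transfer_Ready_Dttm for that unit, and discharge
       from that unit to a destination outside the admitting hospital. */
run;
```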
Specific decisions were discussed and made:
JM found n=226 Vetted cases with a Last discharge DtTm (in ICU or Med) after 2024, up to Feb 3, 2025. Only 13 did not leave their own site, 19 expired, and 194 left the site. Of the 213, some are long-stay patients admitted in Aug: 1, Sept: 3, Oct: 8, Nov: 18, Dec: 196. (DR agreed in the meeting with JM, Feb 10.)
Model development: Inclusion/Exclusion of "Green" admissions
If we plan to generate overstay colours as we did last time, the one group the model would not be applied to is the "greens", since the decision tree turns them green before the model would be applied. If we were able to determine who these greens would have been, would we want to exclude them from the model?
There is no way to determine in advance who the greens would be, so we will not try to exclude them from the model.
Analysis and model generation
Parameter candidates
See Data definition for factor candidates for the Overstay2 project for the definitions.
Location Grouping considerations
- Age
- PCH/Chronic Care
- Other Location / living arrangement
- ADL components, and:
  - ADL_Adlmean_NH: among those who came from PCH/CHF
  - ADL_Adlmean_age: interaction with Age
- Glasgow Coma Scale
- Location / living arrangement Postal Code (also see [[#Location Grouping for Postal Code is N/A]])
- Charlson Diagnoses (Categories and Total Score)
  - MI, CHF, PVD, CVA, Pulmonary, Connective, Ulcer, Renal
- Charlson Comorbidity Index
  - Charlson Score * NH: among those who came from PCH/CHF
- Diagnoses that might prevent/delay meeting PCH/Home Care criteria
- Homeless
Location Grouping for Postal Code is N/A
Analysis notes: JM found 2,759 records with postal code N/A; JM used R_Province, Pre_inpt_Location, and Previous Location instead to define the 5 categories above. JM also encountered records with no match in the Postal_Code_Master list but was able to categorize them based on the first 3 characters (N=273); the list was given to Pagasa to add. (DR agreed in the meeting with JM, Feb 10.)
Dataset split into training and validation data
We separated the population into two datasets based on the odd/even status of the last digit of the Chart number (see the sketch below):
- Even: training set
- Odd: validation set
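A minimal SAS sketch of this split, assuming the chart number is held in a character variable (Chart_Number is an illustrative name):

```sas
/* Hedged sketch: route each record by the parity of the last digit
   of the chart number. Chart_Number is an assumed variable name. */
data training validation;
    set overstay2_cohort;                            /* cohort from the inclusion step */
    last_digit = input(substr(strip(Chart_Number),
                        length(strip(Chart_Number)), 1), 1.);
    if mod(last_digit, 2) = 0 then output training;  /* even -> training */
    else output validation;                          /* odd  -> validation */
run;
```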
Model generation and testing
See \\ad.wrha.mb.ca\WRHA\HSC\shared\MED\MED_CCMED\Julie\MedProjects\Overstay_Project_2025 and emails between Julie, Tina and Dan Roberts ~2025-02
Decision on a model
- For each site's training set and validation set, perform a chi-square test of independence between the variable OS (Overstay >= 10 days vs. Overstay < 10 days) and each factor listed in Data definition for factor candidates for the Overstay2 project, to identify the factors that may individually affect overstay. (A hedged SAS sketch of the steps in this list follows it.)
- Training data set: the methodology to find the best model involves
  - Basic plan for selecting the variables for the model:
    - Fit a logistic model with OS as the dependent variable, beginning with the independent variables identified by the univariable analysis above, and
    - then perform a multivariable analysis using all independent variables (full model), selecting via a stepwise procedure with both forward and backward selection.
    - Examine the importance of each included variable based on the p-value of its coefficient.
    - Variables not contributing to the model are eliminated and a new model is fitted. The process of deleting, refitting, and verifying continues until all important variables appear to be included.
  - Assess the adequacy of the model, both in terms of the individual variables and its overall fit, by the following:
    - Variables whose estimated coefficients have p-values < 0.05, or that have clinical relevance with p-values at or near 0.05, are included in the model.
    - The association of the predicted probabilities with the observed responses is measured by the concordance (C) index, i.e. the area under the curve (AUC) between the true positive rate (sensitivity) and the false positive rate (1 - specificity). A value > 0.5 implies some ability to discriminate positive from negative outcomes, while a value of 1 implies perfect classification; this quantity indicates how well the model ranks predictions.
    - The Hosmer-Lemeshow goodness-of-fit test is used to assess how well the logistic regression model fits the data. A high p-value (usually > 0.05) means the model fits well, while a low p-value (<= 0.05) indicates poor fit of the model to the data.
- Validation data set: the methodology involves:
  - Using the candidate models from the training data set, fit each model on the validation data set.
  - From the predicted values, determine the concordance (C) index and the area under the curve (AUC) between the true positive rate (sensitivity) and the false positive rate (1 - specificity). The resulting values should be close to 1.
  - Group the predictions into deciles (10 groups); for each group, compare the observed number of events to the expected number of events predicted by the model. The resulting chi-square statistic (summed over the 10 groups, with 8 degrees of freedom) must have a p-value > 0.05 to denote good fit.
- If both the training data set and the validation data set give good results on all tests, then the model is a candidate for selection. If there is more than one candidate model, the one with more clinical relevance is chosen.
- This resulted in Overstay2 scoring models by site.
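A hedged SAS sketch of the workflow above. The factor names (Age, NH_flag, Homeless, GCS_grp, Charlson_score) and the candidate model in step 3 are placeholders for illustration, not the variables actually selected; the real analysis code is on the project S: drive.

```sas
/* 1. Univariable screen: chi-square test of OS against each candidate factor */
proc freq data=training;
    tables OS*(NH_flag Homeless GCS_grp) / chisq;
run;

/* 2. Multivariable logistic regression with stepwise selection.
      LACKFIT requests the Hosmer-Lemeshow test; the c statistic (AUC)
      is reported by default. */
proc logistic data=training;
    class NH_flag Homeless GCS_grp / param=ref;
    model OS(event='1') = Age NH_flag Homeless GCS_grp Charlson_score
          / selection=stepwise slentry=0.05 slstay=0.05 lackfit;
run;

/* 3. Refit a candidate model, store it, and score the validation set;
      FITSTAT reports fit statistics (including AUC) for the scored data. */
proc logistic data=training outmodel=os_model;
    model OS(event='1') = Age NH_flag Charlson_score;
run;
proc logistic inmodel=os_model;
    score data=validation out=val_scored fitstat;
run;
```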
Decision on a probability threshold
The predictive models we established are used to stratify the patient population for the different Overstay2 processes on the units, with the aim of reducing discharge delay. Details about establishing a threshold for the probabilities of the Overstay2 scoring models are in
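For illustration, a minimal sketch of how a chosen threshold would stratify patients. P_1 is the predicted-probability variable that PROC LOGISTIC's SCORE statement writes for a 0/1 response; the 0.10 cutoff is purely a placeholder, not the project's decided threshold.

```sas
/* Hedged sketch: flag patients whose predicted overstay probability
   meets a placeholder threshold of 0.10. */
data overstay2_stratified;
    length Overstay2_risk $4;
    set val_scored;                  /* scored output from the validation step */
    if P_1 >= 0.10 then Overstay2_risk = 'High';
    else Overstay2_risk = 'Low';
run;
```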