Re-analysis and generation of Overstay2 model: Difference between revisions
| Line 167: | Line 167: | ||
=== Decision on a model === | === Decision on a model === | ||
*For each site's training set and validation set, a chi-square test for independence was performed between the variable OS (Overstay >= 10 days vs. Overstay < 10 days) and each factor listed in [[Data definition for contributing factors for the Overstay2 project]], to identify the factors that may individually affect overstay. | *For each site's training set and validation set, a chi-square test for independence was performed between the variable OS (Overstay >= 10 days vs. Overstay < 10 days) and each factor listed in [[Data definition for contributing factors for the Overstay2 project]], to identify the factors that may individually affect overstay. | ||
*Methodology to find the '''best''' model involves | *Training data set - Methodology to find the '''best''' model involves | ||
** Basic plan for selecting the variables for the model - perform logistic model with the OS as the dependent variable and the independent variables beginning with the results from univariable analysis above and then by multivariable analysis using all independent variables (full model) and select via stepwise procedure both forward and backward selection. Examine the importance of each variable included based on the probability result of its coefficient. Those not contributing to the model are eliminated and new model is fitted. The process of deleting, refitting and verifying continues until it appears that all important variables are already included. | ** Basic plan for selecting the variables for the model - | ||
** Assess the adequacy of the model both in terms of the individual variables and its overall fit | *** Fit a logistic regression model with OS as the dependent variable, starting with the independent variables identified by the univariable analysis above, and | ||
*** then fit a multivariable model using all independent variables (full model) and select variables via stepwise procedures, both forward and backward selection. | |||
* the | *** Examine the importance of each included variable based on the p-value of its coefficient. | ||
* the | *** Variables that do not contribute to the model are eliminated and a new model is fitted. The process of deleting, refitting and verifying continues until all important variables appear to be included. | ||
* | ** Assess the adequacy of the model, both in terms of the individual variables and its overall fit, by the following: | ||
***Variables whose estimated coefficients have p-values < 0.05, or that are clinically relevant despite p-values close to or above 0.05, are included in the model. | |||
***The association between the predicted probabilities and the observed responses is measured by the Concordance (C) index, which is the area under the curve (AUC) of the true positive rate (sensitivity) plotted against the false positive rate (1-specificity). A value > 0.5 implies a better-than-chance ability to discriminate between positive and negative outcomes, while a value of 1 implies perfect classification. This quantity indicates how well the model ranks predictions. | |||
***The Hosmer-Lemeshow Goodness-of-fit test is used to assess how well the logistic regression model fits the data. A high p-value (usually > 0.05) means the model fits well while a low p-value (≤ 0.05) indicates poor fit of the model to the data. | |||
*Validation data set involves: | |||
** Using the candidate models from the training data set - fit the model using the validation data set. | |||
** From the predicted values, determine the Concordance (C) index and area under the curve (AUC) of the true positive rate (sensitivity) against the false positive rate (1-specificity). The values should be close to 1. | |||
** Group the predicted data into deciles (10 groups) and, for each group, compare the observed number of events to the expected number of events predicted by the model. The discrepancies are summed over the 10 groups into a Chi-square statistic with 8 degrees of freedom, whose p-value must be > 0.05 to denote good fit. | |||
* If both the training data set and the validation data set give good results in all tests, then the model is selected. | |||
This resulted in [[Overstay2 scoring models]]. | * This resulted in [[Overstay2 scoring models]] by site. | ||
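The two checks applied to the predicted probabilities above (the C index / AUC and the decile-based Chi-square comparison of observed and expected events) can be sketched as follows. This is a minimal illustration only, not the code used for the Overstay2 analysis: the page does not state which software was used, the function names <code>c_index</code>, <code>chi2_sf_even_df</code> and <code>hosmer_lemeshow</code> are hypothetical, and the sketch assumes binary outcomes coded 0/1, predicted probabilities strictly between 0 and 1, and 10 groups of (nearly) equal size with 8 degrees of freedom as described above.

```python
import math


def c_index(y_true, p_hat):
    """Concordance (C) index, equal to the AUC for a binary outcome.

    Fraction of (event, non-event) pairs in which the event has the
    higher predicted probability; tied pairs count as one half.
    """
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    score = 0.0
    for pp in pos:
        for pn in neg:
            if pp > pn:
                score += 1.0
            elif pp == pn:
                score += 0.5
    return score / (len(pos) * len(neg))


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable.

    Uses the closed form that holds only for even degrees of freedom
    (an assumption of this sketch; df = 8 qualifies).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch supports even, positive df only")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def hosmer_lemeshow(y_true, p_hat, groups=10):
    """Decile-based comparison of observed and expected events.

    Returns (chi-square statistic, p-value) with groups - 2 degrees
    of freedom, i.e. 8 for the 10 groups described on this page.
    """
    pairs = sorted(zip(p_hat, y_true))  # order subjects by predicted risk
    n = len(pairs)
    if n < groups:
        raise ValueError("need at least one subject per group")
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        observed = sum(y for _, y in chunk)   # observed events in the group
        expected = sum(p for p, _ in chunk)   # events predicted by the model
        # contribution of events and of non-events
        stat += (observed - expected) ** 2 / expected
        stat += (observed - expected) ** 2 / (size - expected)
    return stat, chi2_sf_even_df(stat, groups - 2)
```

Under these assumptions, a candidate model would pass the checks described above when <code>c_index</code> is close to 1 and the p-value returned by <code>hosmer_lemeshow</code> is > 0.05, on both the training and the validation data set.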
=== Decision on a probability threshold === | === Decision on a probability threshold === | ||