Data Meaning Layer
This page introduces the concept of the Data Meaning Layer, a shared framework for defining, tracking, and communicating the meaning of data elements and metrics in CCMDB. It serves as a central reference point for how we manage evolving definitions, support consistent interpretation, and connect meaning across tools like Access/SQL, SAS, Power BI, and the wiki.
The Problem
- We have a complex legacy dataset that has had significant data encoding changes over time; we use code to combine the different parts and make sense of them for current analysis use
- Analysts and tools may apply inconsistent logic without realizing it: Julie has code for this in SAS, but Tina also has code for this in Access and Power Query/M/Power BI; there have been instances where results differed because of differing interpretations
- No formal, versioned system exists to track or expose what our metrics mean or when that meaning has changed.
- Users may be unaware that the logic behind metrics (e.g. Transfer Ready DtTm tmp entry) has evolved.
- Current workflows require tacit knowledge, increasing fragility and rework; even with significant organizational history at the table, we struggle to untangle this.
What We're Trying to Solve
- Define and govern shared meaning of metrics, derived fields, and data elements.
- We have a good start for this on the wiki, with the structured data encoding of Indicators, Reports, and the Auto Data Dictionary, but those only encode the intent, not the function
- Capture when and how those meanings change over time (semantic versioning).
- We have some of this knowledge as "Log" entries on the relevant wiki pages, more of it in meeting minutes and page histories (harder to access), but again, not in a format one could use directly to make sense of the data
- Enable different tools (SAS, Power BI, Wiki) to interpret data consistently, without re-implementing logic.
- Ensure end-users:
- Don’t need to know how logic works if the layer can abstract it
- Do need to know when meaning shifts affect interpretation
- change notification functionality is available via wiki watch lists or RSS feeds, but our users are not tied into those
- Julie applies business logic in SAS, but it's not exposed, versioned, or referenced elsewhere.
- Power BI metrics may not match SAS outputs due to hidden drift in logic.
- Wiki is used as a source of truth, but doesn't currently integrate or surface semantic shifts.
- Enabling data cleaning as encoded data or logic in the presentation layer rather than by changing the source data
- we often find inconsistencies in the data; sometimes they are blatant and should possibly be corrected in the core data, but often they are more subtle, with a risk of false updates or meaning loss if corrected at the source
- currently Pagasa updates the master data to fix these, which takes time
- encoding the fix in the presentation layer would make the change transparent and traceable, and could be done without causing work for Pagasa
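As an illustration of what such a presentation-layer fix could look like, here is a minimal Python sketch; the field names and the rule itself are hypothetical, not taken from CCMDB. Each cleaning rule is a small, documented object applied at read time, so the master data stays untouched and the fix is transparent and traceable:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

@dataclass(frozen=True)
class CleaningRule:
    """A presentation-layer fix: applied at read time, never written to the master data."""
    name: str
    description: str               # why the fix exists, for traceability
    introduced: date               # when the rule took effect
    apply: Callable[[dict], dict]  # pure function: record in, cleaned record out

def trim_service_location(record: dict) -> dict:
    cleaned = dict(record)  # copy -- the source record is never mutated
    cleaned["Service/Location"] = record["Service/Location"].strip()
    return cleaned

RULES = [
    CleaningRule(
        name="trim-service-location",
        description="Hypothetical example: stray whitespace caused false mismatches",
        introduced=date(2024, 1, 1),
        apply=trim_service_location,
    ),
]

def load_cleaned(records: list[dict]) -> list[dict]:
    """Apply every rule to a copy of each record; the source list is left alone."""
    for rule in RULES:
        records = [rule.apply(r) for r in records]
    return records

raw = [{"D_ID": 1, "Service/Location": " MED "}]
print(load_cleaned(raw))  # [{'D_ID': 1, 'Service/Location': 'MED'}]
print(raw)                # source untouched: [{'D_ID': 1, 'Service/Location': ' MED '}]
```

Because each rule carries its own description and effective date, the list of rules doubles as the audit trail that changing the source data directly would not leave.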
Examples from Our Context
- Transfer Ready DtTm tmp entry changed to include judgment from Allied Health, changing the meaning of Transfer Delay (Medicine)
- The meaning of the Service/Location field has changed over time: it used to identify both the service and the location of an admission, but now does neither, serving only as an indicator of Program and a way to ensure uniqueness of D_IDs; in that context, the definition of a Medicine Program Admission changed.
- The definition of a Critical Care Program Admission changed over time, causing confusion about ICUotherService
People Involved
- Julie: Maintains logic in SAS, sometimes modular, generally only understood by her
- Tina: Maintains the logic in MS Access
- Dan/Bojan/Users: Maintain Power BI and documentation pipelines; observe fragility
- Others (e.g., clinicians, reporting users): Consumers of data who may not see upstream changes
Current State
- Metric definitions live in:
- Microsoft Access SQL/VBA
- implicit logic, partly documented, partly explicitly diff-versioned
- single maintainer and builder: Tina
- SAS code
- implicit logic, partly documented, not explicitly diff-versioned
- single maintainer and builder Julie
- Wiki
- contains only descriptive definitions for almost all of our data; not connected to source logic
- maintained and used by all users of the data
- capable of containing code, and already providing diff-able version tracking
- Reports
- explicit explanations and caveats for meanings in the reports documents
- generated via Microsoft Access SQL/VBA
- No standard format for:
- Declaring metric definitions
- Describing when logic changed and why
- Ensuring alignment across tools
Tool Constraints + Possibilities
MS Access
- currently contains master dataset, is getting too large
- is quite slow for sending data and for transformations
- is not a sanctioned tool any more
- <needs more detail>
SAS
- licensing constraints
- can consume data in most formats, but works best with columnar data
- organization is moving to free alternative #R instead
Wiki
- Can serve as human-readable source of truth
- Can reference diffs between versions of definitions (manual or linked)
- Could act as index or router to canonical definitions stored elsewhere
Potential / Emergent tools
Power BI
- reporting tool that has been used for some ad-hoc reporting from CCMDB
- would be capable of turning our quarterly reports into real-time tabular, visual and/or dashboard reporting
- can consume data in most formats, but works best with columnar data
R
- free, open-source alternative the organization is moving to as a replacement for SAS (see above)
Risks / Gaps
- Bus factor risk is high: critical logic embedded in SAS is understood only by Julie
- Without structured control, meaning drift will continue unnoticed
- No audit trail or accountability for changes to definitions
- Users have no reliable way to understand or challenge what a metric means at a point in time
Strategic Opportunity
- Introduce a Data Meaning Layer to standardize, expose, and make auditable the shared meaning
- Use the wiki to contain (or index a git repository of) the code that implements the transformation for a given data element/indicator, exposing it transparently to anyone who wants full context (this can and should be set up so it does not needlessly bloat page presentation)
- Automatically retrieve the definitions from there as needed; the wiki can expose most recent change of definitions to streamline this
- Define a tool chain that can retrieve and use the definitions to generate data that downstream tools can ingest, and possibly maintain relational intermediate data representations, and columnar versions based on those, that can be versioned
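A minimal sketch of the retrieval step, assuming definitions live on ordinary wiki pages and the wiki exposes the standard MediaWiki API; the base URL and page title below are placeholders. The standard `action=query&prop=revisions` call returns both the page content and the timestamp of its most recent change, which is what a toolchain would need to detect definition updates:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen  # used only when actually fetching from a live wiki

# Placeholder base URL -- the real wiki's api.php endpoint would go here.
API = "https://wiki.example.org/api.php"

def definition_url(page: str) -> str:
    """Build a standard MediaWiki API query for the latest revision of a page."""
    return API + "?" + urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": page,
        "rvslots": "main",
        "rvprop": "content|timestamp",
        "format": "json",
        "formatversion": "2",
    })

def parse_definition(payload: str) -> tuple[str, str]:
    """Return (timestamp, wikitext) of the newest revision from an API response."""
    data = json.loads(payload)
    rev = data["query"]["pages"][0]["revisions"][0]
    return rev["timestamp"], rev["slots"]["main"]["content"]

# Live use would be: parse_definition(urlopen(definition_url("Some Definition")).read())
# Below is an illustrative response, abbreviated to the fields we actually read:
sample = json.dumps({"query": {"pages": [{"revisions": [
    {"timestamp": "2024-05-01T00:00:00Z",
     "slots": {"main": {"content": "...definition code..."}}}]}]}})
print(parse_definition(sample))  # ('2024-05-01T00:00:00Z', '...definition code...')
```

Comparing the returned timestamp against the one recorded at the last pipeline run would be one simple way for SAS, Power BI, or R jobs to notice that a definition has shifted since their data was generated.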
Possible Extensions
- cross-checks: our cross-checks are currently hard-coded as queries in MS Access, which are also not transparent to users; depending on implementation details, this framework would allow us to include them
- test data and validation: this framework could also include test data and validation or unit-test functionality
- branching, merging and code review - depending on the repository of the code, and the integration into the tool, this could introduce git-like code version management
