Data Meaning Layer
This page introduces the concept of the Data Meaning Layer, a shared framework for defining, tracking, and communicating the meaning of data elements and metrics in CCMDB. It serves as a central reference point for how we manage evolving definitions, support consistent interpretation, and connect meaning across tools like Access/SQL, SAS, Power BI, and the wiki.
The Problem
- We have a complex legacy dataset that has had significant data encoding changes over time; we use code to combine the different parts and make sense of them for current analysis use
- Analysts and tools may apply inconsistent logic without realizing it: Julie has code for this in SAS, but Tina also has code for this in Access and Power Query/M/Power BI; there have been instances where results differed because of differing interpretations
- No formal, versioned system exists to track or expose what our metrics mean or when that meaning has changed.
- Users may be unaware that the logic behind metrics (e.g. Transfer Ready DtTm tmp entry) has evolved.
- Current workflows require tacit knowledge, increasing fragility and rework; even with significant organizational history at the table, we struggle to untangle this.
What We're Trying to Solve
- Define and govern shared meaning of metrics, derived fields, and data elements.
- We have a good start for this on the wiki, with the structured data encoding of Indicators, Reports, and the Auto Data Dictionary, but those only encode the intent, not the function
- Capture when and how those meanings change over time (semantic versioning).
- We have some of this knowledge as "Log" entries on the relevant wiki pages, more of it in meeting minutes and page histories (harder to access), but again, not in a format one could use directly to make sense of the data
- Enable different tools (SAS, Power BI, Wiki) to interpret data consistently, without re-implementing logic.
- Ensure end-users:
- Don’t need to know how logic works if the layer can abstract it
- Do need to know when meaning shifts affect interpretation
- change notification functionality is available via wiki watch lists or RSS feeds, but our users are not tied into those
- Julie applies business logic in SAS, but it's not exposed, versioned, or referenced elsewhere.
- Power BI metrics may not match SAS outputs due to hidden drift in logic.
- Wiki is used as a source of truth, but doesn't currently integrate or surface semantic shifts.
- Enabling data cleaning as encoded data or logic in the presentation layer rather than by changing the source data
- we often find inconsistencies in the data; sometimes they are blatant and should possibly be corrected in the core data, but often they are more subtle, with a risk of false updates or meaning loss if corrected at the source
- currently Pagasa updates the master data to fix these, which takes time
- encoding the fix in the presentation layer would make the change transparent and traceable, and could be done without causing work for Pagasa
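As an illustration of what such a presentation-layer fix could look like, here is a minimal Python sketch; the field names and the rule itself are hypothetical, not taken from CCMDB. Each cleaning rule is a small, documented object applied at read time, so the master data stays untouched and the fix is transparent and traceable:

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable

@dataclass(frozen=True)
class CleaningRule:
    """A presentation-layer fix: applied at read time, never written to the master data."""
    name: str
    description: str               # why the fix exists, for traceability
    introduced: date               # when the rule took effect
    apply: Callable[[dict], dict]  # pure function: record in, cleaned record out

def trim_service_location(record: dict) -> dict:
    cleaned = dict(record)  # copy -- the source record is never mutated
    cleaned["Service/Location"] = record["Service/Location"].strip()
    return cleaned

RULES = [
    CleaningRule(
        name="trim-service-location",
        description="Hypothetical example: stray whitespace caused false mismatches",
        introduced=date(2024, 1, 1),
        apply=trim_service_location,
    ),
]

def load_cleaned(records: list[dict]) -> list[dict]:
    """Apply every rule to a copy of each record; the source list is left alone."""
    for rule in RULES:
        records = [rule.apply(r) for r in records]
    return records

raw = [{"D_ID": 1, "Service/Location": " MED "}]
print(load_cleaned(raw))  # [{'D_ID': 1, 'Service/Location': 'MED'}]
print(raw)                # source untouched: [{'D_ID': 1, 'Service/Location': ' MED '}]
```

Because each rule carries its own description and effective date, the list of rules doubles as the audit trail that changing the source data directly would not leave.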
Examples from Our Context
- Transfer Ready DtTm tmp entry changed to include judgment from Allied Health, changing the meaning of Transfer Delay (Medicine)
- The meaning of the Service/Location field has changed over time: it used to identify both the service and the location of an admission, but now does neither, serving only as an indicator of Program and a way to ensure uniqueness of D_IDs; in that context, the definition of a Medicine Program Admission changed.
- The definition of a Critical Care Program Admission changed over time, causing confusion about ICUotherService
People Involved
- Julie: Maintains logic in SAS, sometimes modular, generally only understood by her
- Tina: Maintains the logic in MS Access
- Dan/Bojan/Users: Maintain Power BI and documentation pipelines; observe fragility
- Others (e.g., clinicians, reporting users): Consumers of data who may not see upstream changes
Current State
- Metric definitions live in:
- Microsoft Access SQL/VBA
- implicit logic, partly documented, partly explicitly diff-versioned
- single maintainer and builder: Tina
- SAS code
- implicit logic, partly documented, not explicitly diff-versioned
- single maintainer and builder Julie
- Wiki
- contains only descriptive definitions for almost all of our data; not connected to source logic
- maintained and used by all users of the data
- capable of containing code, and already providing diff-able version tracking
- Reports
- explicit explanations and caveats for meanings in the reports documents
- generated via Microsoft Access SQL/VBA
- No standard format for:
- Declaring metric definitions
- Describing when logic changed and why
- Ensuring alignment across tools
Tool Constraints + Possibilities
MS Access
- currently contains master dataset, is getting too large
- is quite slow for sending data and for transformations
- is not a sanctioned tool any more
- <needs more detail>
SAS
- licensing constraints
- can consume data in most formats, but works best with columnar data
- organization is moving to free alternative #R instead
Wiki
- Can serve as human-readable source of truth
- Can reference diffs between versions of definitions (manual or linked)
- Could act as index or router to canonical definitions stored elsewhere
Potential / Emergent tools
Power BI
- reporting tool that has been used for some ad-hoc reporting from CCMDB
- would be capable of turning our quarterly reports into real-time tabular, visual and/or dashboard reporting
- can consume data in most formats, but works best with columnar data
R
- free, open-source alternative the organization is moving to as a replacement for SAS (see above)
Risks / Gaps
- Bus factor risk is high: critical logic embedded in SAS is understood only by Julie
- Without structured control, meaning drift will continue unnoticed
- No audit trail or accountability for changes to definitions
- Users have no reliable way to understand or challenge what a metric means at a point in time
Strategic Opportunity
- Introduce a Data Meaning Layer to standardize, expose, and make auditable the shared meaning
- Use the wiki to contain (or index a git repository of) the code that implements the transformation for a given data element/indicator, exposing it transparently to anyone who wants full context (this can and should be set up so it does not needlessly bloat page presentation)
- Automatically retrieve the definitions from there as needed; the wiki can expose most recent change of definitions to streamline this
- Define a tool chain that can retrieve and use the definitions to generate data that downstream tools can ingest, and possibly maintain relational intermediate data representations, and columnar versions based on those, that can be versioned
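A minimal sketch of the retrieval step, assuming definitions live on ordinary wiki pages and the wiki exposes the standard MediaWiki API; the base URL and page title below are placeholders. The standard `action=query&prop=revisions` call returns both the page content and the timestamp of its most recent change, which is what a toolchain would need to detect definition updates:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen  # used only when actually fetching from a live wiki

# Placeholder base URL -- the real wiki's api.php endpoint would go here.
API = "https://wiki.example.org/api.php"

def definition_url(page: str) -> str:
    """Build a standard MediaWiki API query for the latest revision of a page."""
    return API + "?" + urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": page,
        "rvslots": "main",
        "rvprop": "content|timestamp",
        "format": "json",
        "formatversion": "2",
    })

def parse_definition(payload: str) -> tuple[str, str]:
    """Return (timestamp, wikitext) of the newest revision from an API response."""
    data = json.loads(payload)
    rev = data["query"]["pages"][0]["revisions"][0]
    return rev["timestamp"], rev["slots"]["main"]["content"]

# Live use would be: parse_definition(urlopen(definition_url("Some Definition")).read())
# Below is an illustrative response, abbreviated to the fields we actually read:
sample = json.dumps({"query": {"pages": [{"revisions": [
    {"timestamp": "2024-05-01T00:00:00Z",
     "slots": {"main": {"content": "...definition code..."}}}]}]}})
print(parse_definition(sample))  # ('2024-05-01T00:00:00Z', '...definition code...')
```

Comparing the returned timestamp against the one recorded at the last pipeline run would be one simple way for SAS, Power BI, or R jobs to notice that a definition has shifted since their data was generated.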
Possible Extensions
- cross-checks: our cross-checks are currently hard-coded as queries in MS Access, which are also not transparent to users; depending on implementation details, this framework would allow us to include them
- test data and validation: this framework could also include test data and validation or unit-test functionality
- branching, merging and code review - depending on the repository of the code, and the integration into the tool, this could introduce git-like code version management
