Provenance model¶
This page specifies mfgQC's auditability machinery: the data model, the
immutability guarantees, the digest/hash-chain algorithm, and exactly what
verify_provenance() does and does not prove. It is the reason mfgQC exists over
qcc or Minitab, so its scope is stated honestly rather than oversold.
The data model¶
Every analysis consumes a QCData and returns a frozen result. Both carry a
history: an immutable tuple[Step, ...]. A Step records one operation that
derived or transformed data:
| Field | Meaning |
|---|---|
| operation | what happened: load, spec, transform, capability, assumption:normality, … |
| params | the parameters of that operation (e.g. the Box-Cox λ, the spec limits) |
| n_affected | how many rows the step touched |
| timestamp | wall-clock time the step ran |
The chain is reconstructable end to end:
import mfgqc

# df: a pandas DataFrame holding the measurements in column "y"
qc = mfgqc.load(df, measure="y").spec(lower=0.1, upper=8)
cap = qc.transform("boxcox").capability()
[s["operation"] for s in cap.lineage()]
# ['load', 'spec', 'transform', 'capability', 'assumption:normality']
lineage() returns each step as a dict, together with the running digest
computed up to and including that step.
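The record shape described above can be sketched with a frozen dataclass. This is a minimal illustration, not mfgQC's actual definition: the Step class below is a hypothetical stand-in, and the row counts are made-up example values.

```python
from dataclasses import dataclass, field, FrozenInstanceError
from datetime import datetime, timezone
from typing import Any, Mapping


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for one history entry (not mfgQC's own class)."""
    operation: str
    params: Mapping[str, Any]
    n_affected: int
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# A history is an immutable tuple of steps, oldest first.
history = (
    Step("load", {"measure": "y"}, 120),            # 120 rows: illustrative only
    Step("spec", {"lower": 0.1, "upper": 8}, 120),
)

operations = [s.operation for s in history]
print(operations)  # ['load', 'spec']

rejected = False
try:
    history[0].operation = "edited"  # frozen dataclass: assignment is refused
except FrozenInstanceError:
    rejected = True
print(rejected)  # True
```

The tuple fixes the order of the steps and the frozen dataclass blocks plain attribute assignment, which is the combination the history relies on.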
Immutability guarantee (append-only by construction)¶
QCData and every result are frozen dataclasses; the history is an immutable
tuple of frozen Steps. As a consequence, the history cannot be reordered,
inserted into, or edited in place. Three boundaries close the obvious escape hatches:
- Ingest defensive-copies the input frame, so later mutation of the original DataFrame cannot reach a recorded result.
- .frame hands back a copy, so callers cannot mutate the stored frame through the accessor.
- .values() is read-only.
Every transform returns a new QCData via an internal _with_step(...) that
appends to a copy of the history. Nothing mutates in place.
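The append-only pattern above can be sketched without pandas. In this sketch the Data class, its rows field, and the scale method are hypothetical stand-ins for QCData, its stored frame, and a transform. Only the idea of a _with_step helper that returns a new object comes from the text.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Step:
    operation: str
    params: tuple      # kept as a tuple of pairs so the sketch stays immutable
    n_affected: int


@dataclass(frozen=True)
class Data:
    """Hypothetical stand-in for QCData; `rows` plays the role of the frame."""
    rows: tuple
    history: tuple = ()

    @classmethod
    def load(cls, values):
        rows = tuple(values)  # defensive copy: later edits to `values` cannot reach us
        return cls(rows, (Step("load", (), len(rows)),))

    def _with_step(self, rows, step):
        # Build a new object; the existing history tuple is never modified.
        return replace(self, rows=tuple(rows), history=self.history + (step,))

    def scale(self, factor):
        new_rows = [r * factor for r in self.rows]
        step = Step("transform", (("factor", factor),), len(new_rows))
        return self._with_step(new_rows, step)


raw = [1.0, 2.0, 3.0]
a = Data.load(raw)
raw.append(99.0)   # mutate the caller's list after ingest
b = a.scale(2)

print(a.rows)                             # (1.0, 2.0, 3.0)
print([s.operation for s in a.history])   # ['load']
print([s.operation for s in b.history])   # ['load', 'transform']
```

The original object a is unchanged after the transform, so an earlier result can never be altered by later work on the same data.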
The digest (hash-chained, verifiable)¶
Each step folds into a running SHA-256. The timestamp is deliberately excluded from the hashed content, so the digest is reproducible run-to-run — it pins the computation, not the wall clock.
The integrity-bearing content of a step ("the canonical step") is exactly:
{
"operation": step.operation,
"params": step.params, # JSON-normalized
"n_affected": step.n_affected,
}
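The chain folds each canonical step into the previous digest. Schematically, for
canonical steps \(c_1, \dots, c_n\):
\[ d_i = \operatorname{SHA256}\bigl(d_{i-1} \,\Vert\, \operatorname{JSON}(c_i)\bigr), \qquad d_{\text{final}} = d_n \]
where \(\Vert\) denotes combining the previous digest with the serialized step, the
JSON is serialized with sorted keys and compact separators, and
\(d_{\text{final}}\) is what provenance_digest() returns and what to_dict() stamps
in. The seed \(d_0\) and the exact byte layout of the fold are defined by the
implementation. Editing the operation, params, or n_affected of any recorded step
changes \(d_{\text{final}}\).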
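The fold described above can be sketched with the standard library. This is an illustration under stated assumptions, not the code in mfgqc/_result.py: the function names, the empty-string seed, and string concatenation as the way of combining digest and payload are all choices made for the sketch.

```python
import hashlib
import json


def canonical_step(step):
    """Integrity-bearing fields only; the timestamp is left out on purpose."""
    return {
        "operation": step["operation"],
        "params": step["params"],
        "n_affected": step["n_affected"],
    }


def chain_digest(history, seed=""):
    """Fold each canonical step into the previous digest (assumed layout)."""
    digest = seed
    for step in history:
        payload = json.dumps(canonical_step(step), sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256((digest + payload).encode("utf-8")).hexdigest()
    return digest


# Illustrative history; row counts and timestamps are made up.
history = [
    {"operation": "load", "params": {"measure": "y"}, "n_affected": 120,
     "timestamp": "2024-01-01T00:00:00"},
    {"operation": "spec", "params": {"lower": 0.1, "upper": 8}, "n_affected": 120,
     "timestamp": "2024-01-01T00:00:01"},
]
later_run = [dict(s, timestamp="2030-06-30T12:00:00") for s in history]
edited = [history[0], dict(history[1], params={"lower": 0.1, "upper": 9})]

d_final = chain_digest(history)
print(len(d_final))                          # 64
print(chain_digest(later_run) == d_final)    # True: timestamps are not hashed
print(chain_digest(edited) == d_final)       # False: a param changed
```

Because each digest feeds the next, reordering steps also changes the result, not only editing a field.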
digest = cap.provenance_digest() # store this alongside the reported Cpk
cap.verify_provenance(digest) # True now; False if any step was edited later
verify_provenance(expected) simply recomputes the digest over the current history
and compares it to expected.
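The recompute-and-compare check can be sketched as follows. The recompute and verify functions are hypothetical, the fold layout is assumed rather than taken from mfgQC, and the archived record (including its Cpk value) is invented for the example.

```python
import hashlib
import hmac
import json


def recompute(history):
    # Assumed fold over canonical JSON; not mfgQC's exact byte layout.
    digest = ""
    for step in history:
        core = {k: step[k] for k in ("operation", "params", "n_affected")}
        payload = json.dumps(core, sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256((digest + payload).encode("utf-8")).hexdigest()
    return digest


def verify(history, expected):
    """True only if the history still hashes to the digest recorded earlier."""
    return hmac.compare_digest(recompute(history), expected)


archived = {
    "cpk": 1.41,  # illustrative value only
    "history": [
        {"operation": "load", "params": {"measure": "y"}, "n_affected": 120},
        {"operation": "capability", "params": {}, "n_affected": 120},
    ],
}
stored_digest = recompute(archived["history"])   # recorded at reporting time

ok_before = verify(archived["history"], stored_digest)
archived["history"][0]["n_affected"] = 118       # post-hoc edit to the archive
ok_after = verify(archived["history"], stored_digest)
print(ok_before, ok_after)  # True False
```

The check is only as trustworthy as the place the expected digest is kept: if the stored digest lives next to the record and can be rewritten with it, the comparison proves little.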
What verify_provenance() proves — and what it does not¶
Scope, stated honestly
The digest is a content hash, not a cryptographic signature: code running in the same process could edit a step and recompute the digest. It defends against accidental corruption and post-hoc edits to a stored result, not against an adversary who controls the interpreter.
Put concretely: verify_provenance() gives you verifiable integrity of a
recorded result — it detects tampering with an archived analysis. It does
not by itself prevent a runtime-controlling operator from recomputing the
whole analysis over fabricated inputs. Closing that gap requires external
anchoring or signing of the head digest (e.g. signing \(d_{\text{final}}\) with a
key the operator does not hold, or writing it to an append-only external log).
That is out of scope for the core library and intentionally left to the
deployment.
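One way a deployment might anchor the head digest is to have a separate party tag it with a key the operator never sees. The sketch below uses HMAC from the standard library as a stand-in for a real signature scheme; the anchor and check_anchor functions, the key, and the digests are all hypothetical.

```python
import hashlib
import hmac


def anchor(head_digest: str, notary_key: bytes) -> str:
    """Tag computed by a separate party holding the key (stand-in for a signature)."""
    return hmac.new(notary_key, head_digest.encode("ascii"), hashlib.sha256).hexdigest()


def check_anchor(head_digest: str, tag: str, notary_key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(anchor(head_digest, notary_key), tag)


notary_key = b"held-by-the-notary-not-the-operator"          # placeholder key material
head = hashlib.sha256(b"example head digest").hexdigest()    # stands in for d_final
tag = anchor(head, notary_key)

# An operator who reruns the analysis over fabricated inputs gets a new head
# digest, but cannot produce a matching tag without the key.
forged_head = hashlib.sha256(b"recomputed over fabricated inputs").hexdigest()

genuine_ok = check_anchor(head, tag, notary_key)
forged_ok = check_anchor(forged_head, tag, notary_key)
print(genuine_ok, forged_ok)  # True False
```

HMAC is symmetric, so whoever verifies the tag must also hold the key. An asymmetric signature or an append-only external log avoids that limitation and is likely the better fit for a real audit trail.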
One boundary is also explicit at the chart layer: once you extract the
matplotlib Figure from .view(), edits to that Figure are outside the lineage.
Using it in an audit¶
The end-to-end workflow — capturing a digest, exporting a result with its lineage, and tracing an exported number back to raw data — is walked through with a full example in the User Guide: The audit workflow.
Source: the algorithm above is implemented in mfgqc/_result.py
(history_digest, history_lineage, _canonical_step, _chain) and exposed on
both QCData and every result object.