The audit workflow
This is the workflow that sets mfgQC apart from a calculator: record a result's provenance digest when you report a number, export the full result (numbers, assumption checks, and lineage) as JSON, then verify and trace that number back to raw data later. Every command below was run against the current code, and the output — including the real digest strings — is pasted verbatim.
For the algorithm and the formal scope statement, see Provenance model. This page is the hands-on side of it.
The example
We use a small, strictly-positive dataset and apply a Box-Cox transform so the lineage has something interesting in it:
import json

import numpy as np
import pandas as pd
import mfgqc

rng = np.random.default_rng(11)  # fixed seed
df = pd.DataFrame({
    "cycles": np.round(rng.lognormal(mean=1.2, sigma=0.35, size=80), 3),
})
qc = mfgqc.load(df, measure="cycles").spec(lower=0.5, upper=12.0)
cap = qc.transform("boxcox").capability()
1. Run the analysis and read its lineage
Every result carries the full chain of operations that produced it. lineage()
returns one dict per step; pull the operation names to see the shape of the
computation:
That is the whole derivation: the frame was loaded, spec limits were attached, the measure was Box-Cox transformed, capability was computed, and a normality assumption check ran. Nothing happened that is not on this list.
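As a minimal sketch of pulling the operation names, the snippet below uses a literal list as a stand-in for `cap.lineage()` so that it runs without mfgQC installed. The keys and values are the ones this page prints in step 5 (digests truncated as shown there, `params` omitted); the variable names are illustrative.

```python
# Stand-in for cap.lineage(): one dict per step, copied from the step 5 output.
# Digests are truncated to 16 characters and params are left out for brevity.
lineage = [
    {"operation": "load", "n_affected": 80, "digest": "a69636b71f0bddc7"},
    {"operation": "spec", "n_affected": None, "digest": "72c30c2486ecf8c3"},
    {"operation": "transform", "n_affected": 80, "digest": "d64b236be9447d2b"},
    {"operation": "capability", "n_affected": 80, "digest": "4aae3812003203cd"},
    {"operation": "assumption:normality", "n_affected": None, "digest": "7cb845af09aa053b"},
]

# Pull just the operation names to see the shape of the computation.
operations = [step["operation"] for step in lineage]
print(operations)
```

With a real result, the same list comprehension would run over `cap.lineage()` directly.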
2. Record the digest when you report the number
When you write the reported value down — into a report, a LIMS, a Certificate of Analysis — capture the provenance digest next to it:
That SHA-256 string pins the computation that produced the number — the operations, their parameters (including the fitted Box-Cox λ), and how many rows each step touched. The timestamp is deliberately not in the digest, so it is reproducible run-to-run.
Store the digest with the value, not instead of it
The digest is a fingerprint, not the data. Store it as a sibling field of the
reported Cpk — cpk = 0.723, provenance = 7cb845af… — so anyone re-deriving the
number later has something to check against.
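A minimal sketch of such a record, assuming the report is stored as JSON. The field names `cpk` and `provenance` follow the example in the note; the record layout itself is hypothetical and not something mfgQC prescribes.

```python
import json

# Hypothetical report record: the reported value and its digest as sibling fields.
record = {
    "cpk": 0.723,
    "provenance": "7cb845af09aa053b023f88fb972d8901ee1d6eaca6123919eca0b7ffd8279a07",
}

# Serialize for storage, then read it back the way a later reviewer would.
stored = json.dumps(record, sort_keys=True)
restored = json.loads(stored)
print(restored["cpk"], restored["provenance"][:8])
```

Keeping both fields in one record means the value and its fingerprint cannot drift apart in storage.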
3. Export the full result as JSON
to_dict() is the canonical payload. It carries the fields, the flat summary, the
assumption checks, and the lineage plus the digest — everything a downstream
report builder needs, with no report() text to parse:
The two provenance keys are history (the lineage, each step carrying its running
digest) and provenance_digest (the head digest from step 2):
d["provenance_digest"]
# '7cb845af09aa053b023f88fb972d8901ee1d6eaca6123919eca0b7ffd8279a07'
list(d["history"][0].keys())
# ['operation', 'params', 'n_affected', 'digest']
Here is the transform step from history, showing that the fitted λ and its CI are
recorded in the provenance — not buried in a log:
{
"operation": "transform",
"params": {
"method": "boxcox",
"lambda": 0.5379697633151592,
"lambda_ci": [
-0.14218010515761142,
1.2255339555066445
]
},
"n_affected": 80,
"digest": "d64b236be9447d2bfa7672f3f56b9e12fda50bf93daad8f5406cfe0948535125"
}
And the assumption checks ride along too — for example the normality check:
{
"name": "normality",
"test": "Anderson-Darling",
"statistic": 0.3432987956436193,
"p_value": 0.4812437156608955,
"passed": true,
"magnitude": 0.15081585539601827,
"magnitude_label": "est. Cpk impact",
"reliability": "ok",
"n": 80,
"recommendation": null
}
Write it to a file and you have a self-describing, archivable record:
The whole payload here is ~3.5 KB. The same digest you recorded in step 2 is stamped
into the file's provenance_digest, so the export and the reported number agree by
construction.
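One way the archive step could look, sketched with a stand-in payload so it runs without mfgQC. The dict below is not a full `to_dict()` result: it contains only the two provenance keys named on this page, with the transform step as the single history entry. The file name is hypothetical.

```python
import json
import tempfile
from pathlib import Path

# Digest recorded alongside the reported number in step 2.
recorded_digest = "7cb845af09aa053b023f88fb972d8901ee1d6eaca6123919eca0b7ffd8279a07"

# Stand-in for cap.to_dict(); a real payload carries more keys and more steps.
payload = {
    "provenance_digest": recorded_digest,
    "history": [
        {
            "operation": "transform",
            "params": {
                "method": "boxcox",
                "lambda": 0.5379697633151592,
                "lambda_ci": [-0.14218010515761142, 1.2255339555066445],
            },
            "n_affected": 80,
            "digest": "d64b236be9447d2bfa7672f3f56b9e12fda50bf93daad8f5406cfe0948535125",
        }
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "capability_result.json"  # hypothetical file name
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")
    archived = json.loads(path.read_text(encoding="utf-8"))

# The archived file and the reported number should carry the same digest.
matches = archived["provenance_digest"] == recorded_digest
print(matches)
```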
Note
Frontends and report builders must consume to_dict() (or the flat
summary()) — never parse report() text. The JSON is the stable contract; the
text report is for humans. See the API reference for the
full result surface.
4. Verify later
Months later, someone reopens the archived result — or recomputes it from the same inputs — and checks it against the digest you recorded:
verify_provenance(expected) recomputes the digest over the current history and
compares it to the one you pass in. True means the recorded computation is intact.
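To make "recomputes the digest over the current history" concrete, here is a generic hash-chain sketch. It is not mfgQC's algorithm (that is specified in Provenance model): the serialization, the chaining rule, the step contents, and the function names `chain_digests` and `verify` are all assumptions chosen for illustration.

```python
import hashlib
import json


def chain_digests(steps):
    """Return the running SHA-256 digest after each step of a simple hash chain."""
    running = ""
    digests = []
    for step in steps:
        # Canonical JSON so that key order does not affect the hash.
        body = json.dumps(
            {k: step[k] for k in ("operation", "params", "n_affected")},
            sort_keys=True,
            separators=(",", ":"),
        )
        # Fold the previous running digest into the next one.
        running = hashlib.sha256((running + body).encode("utf-8")).hexdigest()
        digests.append(running)
    return digests


def verify(steps, expected):
    """True when the recomputed head digest equals the recorded one."""
    digests = chain_digests(steps)
    return bool(digests) and digests[-1] == expected


# Illustrative steps; the params here are assumptions, not mfgQC's recorded params.
steps = [
    {"operation": "load", "params": {"measure": "cycles"}, "n_affected": 80},
    {"operation": "spec", "params": {"lower": 0.5, "upper": 12.0}, "n_affected": None},
]
recorded = chain_digests(steps)[-1]
intact = verify(steps, recorded)

# Editing one parameter of one step changes the head digest.
edited = [dict(steps[0]), {**steps[1], "params": {"lower": 0.5, "upper": 13.0}}]
still_matches = verify(edited, recorded)
print(intact, still_matches)
```

Because each running digest feeds the next, a change in any earlier step propagates to the head, which is the property the verification check relies on.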
Tamper-evidence (demonstrated honestly)
The chain is tamper-evident: changing the operation, params, or n_affected of
any recorded step changes the head digest, so verification fails. Here we take the
real result, alter one field of the recorded transform step — bump the fitted λ by
1.0 — and re-verify against the original digest:
import dataclasses as dc

# `digest` holds the original head digest captured in step 2.
hist = list(cap.history)
for i, s in enumerate(hist):
    if s.operation == "transform":
        bad = dict(s.params)
        bad["lambda"] = bad["lambda"] + 1.0  # alter a recorded parameter
        hist[i] = dc.replace(s, params=bad)

tampered = dc.replace(cap, history=tuple(hist))
print(tampered.provenance_digest())
print(tampered.verify_provenance(digest))
The digest moved from 7cb845af… to e5e9dc33… and verification returns False.
One altered parameter in one step, three steps deep, and the recorded result no
longer matches its fingerprint. (Note we had to use dataclasses.replace to build a
new object — the result and its history are frozen, so there is no in-place edit to
make in the first place. See Provenance model → Immutability.)
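The frozen-object behaviour can be seen with plain standard-library dataclasses. In this sketch, `Step` is a hypothetical stand-in with the three fields this page names, not mfgQC's actual class.

```python
import dataclasses as dc
from typing import Optional


@dc.dataclass(frozen=True)
class Step:  # hypothetical stand-in for a recorded history step
    operation: str
    params: dict
    n_affected: Optional[int]


step = Step("transform", {"method": "boxcox", "lambda": 0.5379697633151592}, 80)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    step.operation = "load"
    in_place_edit_blocked = False
except dc.FrozenInstanceError:
    in_place_edit_blocked = True

# dataclasses.replace builds a new object and leaves the original untouched.
new_params = {**step.params, "lambda": step.params["lambda"] + 1.0}
changed = dc.replace(step, params=new_params)
print(in_place_edit_blocked, changed is step)
```

In plain Python, `frozen=True` blocks attribute assignment only; it does not freeze a dict stored in a field, which is why the sketch copies `params` before changing it.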
5. Trace a number back to raw data
lineage() is the audit trail. Each step gives you its operation, its params,
its n_affected, and the running digest folded in up to and including that step:
for s in cap.lineage():
    print(s["operation"], "| n_affected:", s["n_affected"], "| digest:", s["digest"][:16], "...")
load | n_affected: 80 | digest: a69636b71f0bddc7 ...
spec | n_affected: None | digest: 72c30c2486ecf8c3 ...
transform | n_affected: 80 | digest: d64b236be9447d2b ...
capability | n_affected: 80 | digest: 4aae3812003203cd ...
assumption:normality | n_affected: None | digest: 7cb845af09aa053b ...
Read it bottom-up to walk the reported Cpk back to the raw frame:
| Step | What it records (params) |
|---|---|
| assumption:normality | Anderson-Darling AD=0.343, p=0.481, passed; est. Cpk impact 15.1% |
| capability | method=normal, sigma_used=overall, cpk=0.723, pp=3.344, cpm=null |
| transform | method=boxcox, lambda=0.538, lambda_ci=[−0.142, 1.226] |
| spec | lower=0.5, upper=12.0, target=null |
| load | measure="cycles", 80 rows, no roles/units/subgroup |
So the reported Cpk = 0.723 was computed on the overall sigma, after a Box-Cox
transform with λ≈0.538, against spec limits [0.5, 12.0], on 80 loaded rows — and the
normality check that justifies the normal-method capability is right there in the
chain, passing. No step is hidden, and each one's digest lets you confirm where in
the chain a difference first appears.
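Locating the first differing step can be sketched as a pairwise comparison of running digests. The helper `first_divergence` is hypothetical, not part of mfgQC. The recorded digests are the truncated values printed above; the "recomputed" digests from the transform step onward are made-up placeholders standing in for a re-run that fitted a different λ.

```python
def first_divergence(recorded, recomputed):
    """Return the index of the first step whose running digest differs, or None."""
    for i, (a, b) in enumerate(zip(recorded, recomputed)):
        if a["digest"] != b["digest"]:
            return i
    # Same digests so far, but one chain has extra steps.
    if len(recorded) != len(recomputed):
        return min(len(recorded), len(recomputed))
    return None


ops = ["load", "spec", "transform", "capability", "assumption:normality"]
recorded_digests = [
    "a69636b71f0bddc7",
    "72c30c2486ecf8c3",
    "d64b236be9447d2b",
    "4aae3812003203cd",
    "7cb845af09aa053b",
]
recorded = [{"operation": o, "digest": d} for o, d in zip(ops, recorded_digests)]

# Placeholder digests (not real output) for the steps after the chains part ways.
placeholders = ["0" * 16, "1" * 16, "2" * 16]
recomputed = [dict(s) for s in recorded[:2]] + [
    {"operation": o, "digest": d} for o, d in zip(ops[2:], placeholders)
]

idx = first_divergence(recorded, recomputed)
print(idx, recorded[idx]["operation"])
```

Every step after the first mismatch also differs in a chained digest, so only the first index is informative.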
The running digest also lets you cross-check intermediate state. The QCData after
the transform exposes the same provenance surface, and its digest equals the
transform step's running digest in the result's lineage:
qct = qc.transform("boxcox")
qct.provenance_digest()
# 'd64b236be9447d2bfa7672f3f56b9e12fda50bf93daad8f5406cfe0948535125' # == transform step digest above
lineage(), provenance_digest(), and verify_provenance() exist on both
QCData and every result object — the trail is continuous from the loaded frame
through to the final number.
What passing and failing verify actually mean
Read this before relying on it
A passing verify_provenance() means the recorded result is intact: the
archived analysis has not been edited since the digest was captured. A failing
one means the history no longer matches — something in the recorded chain changed.
What it does not do, on its own: it does not stop an actor who controls the Python interpreter at runtime from recomputing the whole analysis over fabricated inputs and stamping a fresh, self-consistent digest. The digest is a content hash, not a cryptographic signature. It defends against accidental corruption and post-hoc tampering with a stored result — not against an adversary who controls the process that produces it.
Closing that gap requires anchoring the head digest outside the process — signing
it with a key the operator does not hold, or writing it to an append-only external
log. That is out of scope for the core library and intentionally left to the
deployment. The full scope statement is in
Provenance model → What verify_provenance() proves.
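As one illustration of anchoring outside the process, the head digest could be authenticated by a separate service using a keyed MAC. This is a deployment-side sketch, not an mfgQC feature: the function names and the key are hypothetical, and HMAC is a symmetric stand-in, so it only helps if the key lives somewhere the operator cannot reach. An asymmetric signature or an append-only log would be alternatives.

```python
import hashlib
import hmac


def sign_digest(head_digest: str, key: bytes) -> str:
    """Return an HMAC-SHA256 tag over the head digest."""
    return hmac.new(key, head_digest.encode("ascii"), hashlib.sha256).hexdigest()


def check_signature(head_digest: str, tag: str, key: bytes) -> bool:
    """Constant-time check that the tag was produced for this digest with this key."""
    return hmac.compare_digest(sign_digest(head_digest, key), tag)


# Hypothetical key, assumed to be held by a signing service and not by the operator.
service_key = b"held-by-a-separate-signing-service"

head = "7cb845af09aa053b023f88fb972d8901ee1d6eaca6123919eca0b7ffd8279a07"
tag = sign_digest(head, service_key)

genuine = check_signature(head, tag, service_key)
# A freshly stamped digest from a re-run over different inputs would not carry a valid tag.
forged = check_signature("e5e9dc33" + head[8:], tag, service_key)
print(genuine, forged)
```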
See also
- Provenance model: the data model, the hash-chain algorithm, and the honest scope of the guarantee.
- Quickstart: the load → spec → analysis flow this page builds on.
- API reference: the full result surface (to_dict(), summary(), lineage(), provenance_digest(), verify_provenance()).