
Provenance model

This page specifies mfgQC's auditability machinery: the data model, the immutability guarantees, the digest/hash-chain algorithm, and exactly what verify_provenance() does and does not prove. It is the reason mfgQC exists over qcc or Minitab, so its scope is stated honestly rather than oversold.

The data model

Every analysis consumes a QCData and returns a frozen result. Both carry a history: an immutable tuple[Step, ...]. A Step records one operation that derived or transformed data:

Field        Meaning
operation    what happened: load, spec, transform, capability, assumption:normality, …
params       the parameters of that operation (e.g. the Box-Cox λ, the spec limits)
n_affected   how many rows the step touched
timestamp    wall-clock time the step ran

The chain is reconstructable end to end:

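import mfgqc

# df is the input DataFrame with the measurements in column "y"
qc  = mfgqc.load(df, measure="y").spec(lower=0.1, upper=8)
cap = qc.transform("boxcox").capability()

# One entry per recorded step, in the order the steps were appended
[s["operation"] for s in cap.lineage()]
# ['load', 'spec', 'transform', 'capability', 'assumption:normality']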

lineage() returns each step as a dict that also carries the running digest computed up to and including that step.

Immutability guarantee (append-only by construction)

QCData and every result are frozen dataclasses; the history is an immutable tuple of frozen Steps. As a consequence, the history cannot be reordered, inserted into, or edited in place. Three boundaries close the obvious escape hatches:

  • Ingest defensive-copies the input frame, so later mutation of the original DataFrame cannot reach a recorded result.
  • .frame hands back a copy, so callers cannot mutate the stored frame through the accessor.
  • .values() is read-only.

Every transform returns a new QCData via an internal _with_step(...) that appends to a copy of the history. Nothing mutates in place.
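The append-only pattern described above can be sketched with standard-library dataclasses. This is a minimal illustration, not mfgQC's source: SketchData, make_step and with_step are hypothetical names standing in for QCData and the internal _with_step(...), and the field values are made up. Because frozen=True is shallow, the sketch also wraps params in a read-only mapping view, which is an assumption about how one might close that gap.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from types import MappingProxyType
from typing import Any, Mapping, Tuple


@dataclass(frozen=True)
class Step:
    """One recorded operation; fields mirror the table in 'The data model'."""
    operation: str
    params: Mapping[str, Any]
    n_affected: int
    timestamp: float


def make_step(operation, params, n_affected, timestamp=0.0) -> Step:
    # Hypothetical helper: copy the params, then expose them through a
    # read-only view so the caller's dict cannot alter the recorded step.
    return Step(operation, MappingProxyType(dict(params)), n_affected, timestamp)


@dataclass(frozen=True)
class SketchData:
    """Hypothetical stand-in for QCData: values plus an immutable history."""
    values: Tuple[float, ...]
    history: Tuple[Step, ...] = ()

    def with_step(self, step: Step) -> "SketchData":
        # Build a NEW object whose history is the old tuple plus one step;
        # the original object is left untouched.
        return replace(self, history=self.history + (step,))


base = SketchData(values=(1.2, 3.4))
loaded = base.with_step(make_step("load", {"measure": "y"}, 2))

try:
    loaded.history = ()  # frozen dataclass: attribute assignment is rejected
except FrozenInstanceError:
    print("history cannot be reassigned")
```

Here base keeps its empty history while loaded carries one step, which is the property that lets every derived result be traced back without any earlier object changing underneath it.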

The digest (hash-chained, verifiable)

Each step folds into a running SHA-256 digest. The timestamp is deliberately excluded from the hashed content, so the digest is reproducible run-to-run: it pins the computation, not the wall clock.

The integrity-bearing content of a step ("the canonical step") is exactly:

{
    "operation": step.operation,
    "params":    step.params,      # JSON-normalized
    "n_affected": step.n_affected,
}

The chain folds each canonical step into the previous digest:

\[ d_0 = \text{""}, \qquad d_i = \mathrm{SHA256}\big(d_{i-1} \,\Vert\, \mathrm{json}(\text{canon}_i)\big) \]

where the JSON is serialized with sorted keys and compact separators, and \(d_{\text{final}}\) is what provenance_digest() returns and what to_dict() stamps in. Editing the operation, params, or n_affected of any recorded step changes \(d_{\text{final}}\).

digest = cap.provenance_digest()   # store this alongside the reported Cpk
cap.verify_provenance(digest)      # True now; False if any step was edited later

verify_provenance(expected) simply recomputes the digest over the current history and compares it to expected.
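The fold and the verification step can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the code in mfgqc/_result.py: the function names are hypothetical, steps are plain dicts, the previous digest is assumed to be concatenated as a hex string, and the payload is assumed to be UTF-8 encoded. The step values are invented for the example.

```python
import hashlib
import json


def canonical_json(step: dict) -> str:
    # Only the integrity-bearing fields are hashed; the timestamp is left out.
    canon = {
        "operation": step["operation"],
        "params": step["params"],
        "n_affected": step["n_affected"],
    }
    # Sorted keys and compact separators give a stable serialization.
    return json.dumps(canon, sort_keys=True, separators=(",", ":"))


def running_digests(history: list) -> list:
    digest = ""  # d_0 is the empty string
    out = []
    for step in history:
        # d_i = SHA256(d_{i-1} || json(canon_i)); hex encoding is an assumption.
        payload = (digest + canonical_json(step)).encode("utf-8")
        digest = hashlib.sha256(payload).hexdigest()
        out.append(digest)
    return out


def head_digest(history: list) -> str:
    digests = running_digests(history)
    return digests[-1] if digests else ""


def verify(history: list, expected: str) -> bool:
    # Recompute over the current history and compare with the stored value.
    return head_digest(history) == expected


# Illustrative history (made-up values).
history = [
    {"operation": "load", "params": {"measure": "y"}, "n_affected": 120, "timestamp": 1.0},
    {"operation": "spec", "params": {"lower": 0.1, "upper": 8}, "n_affected": 120, "timestamp": 2.0},
]
stored = head_digest(history)

# Same steps at a different wall-clock time: the digest is unchanged.
rerun = [dict(s, timestamp=s["timestamp"] + 3600) for s in history]

# Same steps with one edited parameter: the digest no longer matches.
edited = [history[0], dict(history[1], params={"lower": 0.1, "upper": 9})]
```

In this sketch verify(rerun, stored) holds and verify(edited, stored) does not, which mirrors the two properties claimed above: reproducibility across runs and sensitivity to any change in operation, params, or n_affected.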

What verify_provenance() proves — and what it does not

Scope, stated honestly

The digest is a content hash, not a cryptographic signature: code running in the same process could edit a step and recompute the digest. It defends against accidental corruption and post-hoc edits to a stored result, not against an adversary who controls the interpreter.

Put concretely: verify_provenance() gives you verifiable integrity of a recorded result — it detects tampering with an archived analysis. It does not by itself prevent a runtime-controlling operator from recomputing the whole analysis over fabricated inputs. Closing that gap requires external anchoring or signing of the head digest (e.g. signing \(d_{\text{final}}\) with a key the operator does not hold, or writing it to an append-only external log). That is out of scope for the core library and intentionally left to the deployment.
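One way a deployment might anchor the head digest is to have a separate party tag it with a secret the operator never sees. The sketch below uses an HMAC from the Python standard library as a simple symmetric stand-in; it is not part of mfgQC, the function names and the key are hypothetical, and a real deployment would more likely use an asymmetric signature or an append-only external log so that verifiers do not need the secret.

```python
import hashlib
import hmac


def anchor(head_digest: str, key: bytes) -> str:
    # Tag the head digest with a key held outside the operator's control.
    return hmac.new(key, head_digest.encode("utf-8"), hashlib.sha256).hexdigest()


def check_anchor(head_digest: str, tag: str, key: bytes) -> bool:
    # Constant-time comparison of the recomputed tag against the stored one.
    return hmac.compare_digest(anchor(head_digest, key), tag)


# Illustrative values only: a placeholder key and a placeholder digest.
auditor_key = b"held-by-the-auditor-not-the-operator"
reported_digest = hashlib.sha256(b"example head digest").hexdigest()
tag = anchor(reported_digest, auditor_key)

# A digest recomputed over fabricated inputs does not match the stored tag.
forged_digest = hashlib.sha256(b"recomputed over fabricated inputs").hexdigest()
```

With the tag stored by the auditor, an operator who reruns the analysis over different inputs can produce a new, internally consistent digest but cannot produce a matching tag for it.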

One further boundary applies at the chart layer: once you extract the matplotlib Figure from .view(), any edits to that Figure fall outside the lineage.

Using it in an audit

The end-to-end workflow — capturing a digest, exporting a result with its lineage, and tracing an exported number back to raw data — is walked through with a full example in the User Guide: The audit workflow.

Source: the algorithm above is implemented in mfgqc/_result.py (history_digest, history_lineage, _canonical_step, _chain) and exposed on both QCData and every result object.