Cross-Corpus Taxonomy Alignment: Global Category Taxonomy, Per-Corpus Alignment Table, Alignment-Drift Detection Rules, and Cross-Corpus Migration Cascade

Hero image of a deep teal cross-corpus alignment chamber showing four orbiting per-corpus taxonomy ring stacks rendered in copper around a central global category taxonomy core in ivory, orchid alignment ribbons bridging each corpus's category nodes onto the corresponding global category nodes, sage cascade arrows flowing outwards from a single per-corpus migration event to the alignment review queue at the bottom of the chamber, and a faint cross-corpus reconciliation mesh threading the entire structure to show how per-corpus rollups compose into a multi-corpus rollup, all rendered in the deep teal copper ivory orchid sage cluster palette consistent with blogs 178 through 198

Introduction

The first time a per-corpus taxonomy migration produced a silent inconsistency in the cross-corpus rollup was the Tuesday after the 2025-Q4 quarter-boundary review, four corpora into the multi-corpus deployment described in the prior post. One of the four corpora had run a clean merge_categories operation against its local taxonomy at quarter close, collapsing two near-duplicate failure-mode categories into a single category whose decision rule the on-call cohort had been treating as the operationally dominant phrasing for the prior two quarters. The local migration passed every validation rule the per-corpus protocol checks: the snapshot anchor hash matched the prior snapshot, the forward-projection mapping covered every source-category event row, the merge target's decision-rule embedding sat well below the near-duplicate threshold against the rest of the corpus's taxonomy, and the corpus facilitator had a clean rationale row in the migration audit table. The cross-corpus rollup the platform team ran at the end of that week, however, produced a category-share rollup that disagreed with the per-corpus rollup the merging corpus had produced for the same quarter. The disagreement was small, three to four percentage points across two of the global taxonomy's category buckets, but the disagreement was real and the platform team could not explain it from the per-corpus migration audit rows alone.

The forensic trace took most of a sprint and surfaced a cross-corpus alignment problem the per-corpus protocol had not, by itself, surfaced. The merging corpus's local merger had been clean against its own taxonomy, but the two pre-merger categories had been mapped to two distinct global categories in the cross-corpus alignment table the platform team maintained, and the post-merger combined category had no clean mapping onto either global category because the merged decision rule's embedding sat closer to a third global category than to either of the two original mappings. The local migration had been correct against the local taxonomy and silently incorrect against the global taxonomy, and the cross-corpus rollup had projected the post-merger rows onto the third global category without surfacing the projection-shift to the platform team. The post that follows walks through the cross-corpus alignment protocol the resolution required: the global category taxonomy the platform team maintains, the per-corpus alignment table that maps each corpus's category ids onto the global taxonomy, the alignment-drift detection rules that flag a corpus whose local taxonomy is diverging from the global taxonomy, and the cascade rules by which a per-corpus migration operation triggers a cross-corpus alignment review before the next cross-corpus rollup runs.

The Problem

The per-corpus taxonomy versioning protocol from the prior post is a per-corpus protocol by design. Each corpus runs its own taxonomy snapshot table, its own drift-detection rules, its own migration audit trail, and its own cross-quarter trend-pass query primitive. The protocol is correct against the per-corpus rollup contract: the per-corpus rollup output is interpretable across taxonomy migrations, the per-corpus migration audit trail is auditable for four years, and the per-corpus drift-detection rules surface migration candidates ahead of operational urgency. The protocol does not, however, address the cross-corpus rollup contract: the multi-corpus reconciliation pass that composes the per-corpus quarterly rollups into a global category-share rollup that the platform team's quarterly review consumes.

The cross-corpus rollup contract has a different load-bearing property than the per-corpus contract. The per-corpus contract guarantees interpretability of a single corpus's rollup across that corpus's own taxonomy migrations. The cross-corpus contract has to guarantee interpretability of the combined rollup across multiple corpora's taxonomy migrations, where each corpus's local taxonomy is evolving independently and the global category taxonomy the platform team's rollup uses is itself evolving on a separate cadence. The combined rollup has to be correct against four sources of taxonomy drift simultaneously: the four corpora's per-corpus migrations, the global taxonomy's own quarterly migrations, the alignment table's evolution as new corpora come online or new global categories are added, and the cascade interactions between a per-corpus migration and the global taxonomy at quarter boundaries.

The naive approach is to flatten the per-corpus taxonomies into a single global taxonomy at rollup time using the most recent per-corpus snapshot for each corpus, and then run the multi-corpus rollup against that flattened taxonomy. The approach has the property that the rollup output is internally consistent, which is operationally useful for the platform team's review. The approach has two failure modes. The first failure mode is the cross-corpus alignment table is not explicit, so the per-corpus categories that map to the same global category are joined implicitly by name or embedding similarity at rollup time, which is fragile and irreproducible. The second failure mode is the per-corpus migrations cascade silently into the cross-corpus rollup without surfacing the migration-driven projection-shift to the platform team, which is the failure mode the 2025-Q4 quarter-boundary review surfaced.

flowchart LR
  A[corpus 1<br/>local taxonomy<br/>snapshot] -->|alignment row| Z[global category<br/>taxonomy snapshot]
  B[corpus 2<br/>local taxonomy<br/>snapshot] -->|alignment row| Z
  C[corpus 3<br/>local taxonomy<br/>snapshot] -->|alignment row| Z
  D[corpus 4<br/>local taxonomy<br/>snapshot] -->|alignment row| Z
  Z --> E[cross-corpus<br/>rollup pass]
  F[per-corpus<br/>migration audit<br/>row] -->|cascade trigger| G[cross-corpus<br/>alignment review<br/>queue]
  G --> H[platform-team<br/>alignment-row<br/>update]
  H --> Z

The right approach treats the cross-corpus alignment as a versioned artefact in its own right, with a global category taxonomy table the platform team maintains, a per-corpus alignment table that maps each corpus's local category ids onto the global taxonomy with explicit provenance, alignment-drift detection rules that flag local taxonomies that are diverging from the global taxonomy, and cascade rules that the per-corpus migration audit trail triggers when a local migration changes the alignment topology. The protocol's load-bearing property is that no per-corpus migration can change the cross-corpus rollup without first triggering an alignment review, and no alignment-table update can happen without a structured audit row that the cross-corpus rollup pass can replay against historical quarters.

How It Works

The protocol has four pieces. The first is the global category taxonomy snapshot table the platform team maintains, with the same shape as the per-corpus taxonomy snapshot but at the platform-team scope rather than the per-corpus scope. The second is the per-corpus alignment table that maps each corpus's local category ids onto the global taxonomy, with explicit alignment-confidence scores, provenance, and the platform-team authorisation chain. The third is the alignment-drift detection rules that surface alignment-review candidates to the platform team at quarter boundaries. The fourth is the cascade-trigger rules that wire the per-corpus migration audit trail into the cross-corpus alignment review queue.

Global Category Taxonomy Snapshot

The global category taxonomy table is maintained by the platform team and has the same shape as the per-corpus taxonomy snapshot. Each row carries a (quarter_id, global_category_id, global_category_name, global_category_description, parent_global_category_id, decision_rule_revision, snapshot_finalised_at) tuple, with the (quarter_id, global_category_id) composite as the primary key. The decision-rule revision references a separate global_decision_rule_versions table the platform team maintains, with the same audit shape as the per-corpus equivalent. The snapshot is append-only on the quarter_id, finalised at quarter boundaries by the platform team's review pass, and used by the cross-corpus rollup pass as the rollup target for that quarter. The global taxonomy itself runs a per-quarter migration cadence with the same five operation types as the per-corpus protocol (add_global_category, deprecate_global_category, merge_global_categories, split_global_category, rename_global_category); the cross-corpus migration audit trail is a separate table from the per-corpus migration audit trails, and its retention window is six years rather than the per-corpus four-year window because the global taxonomy is the audit anchor for the platform team's annual review.

CREATE TABLE global_taxonomy_snapshot (
  quarter_id              text NOT NULL,
  global_category_id      uuid NOT NULL,
  global_category_name    text NOT NULL,
  global_category_description text NOT NULL,
  parent_global_category_id uuid,
  decision_rule_revision  uuid NOT NULL REFERENCES global_decision_rule_versions(revision_id),
  decision_rule_embedding vector(1536) NOT NULL,
  snapshot_finalised_at   timestamptz,
  PRIMARY KEY (quarter_id, global_category_id)
);

CREATE INDEX global_taxonomy_snapshot_finalised_idx
  ON global_taxonomy_snapshot (snapshot_finalised_at)
  WHERE snapshot_finalised_at IS NOT NULL;

CREATE INDEX global_taxonomy_snapshot_embedding_idx
  ON global_taxonomy_snapshot
  USING hnsw (decision_rule_embedding vector_cosine_ops);

Per-Corpus Alignment Table

The per-corpus alignment table is the protocol's load-bearing piece. Each row carries a (quarter_id, corpus_id, local_category_id, global_category_id, alignment_confidence, alignment_provenance, platform_authorisation_id, finalised_at) tuple, with the (quarter_id, corpus_id, local_category_id) composite as the primary key. The alignment-confidence column is a numeric score in [0, 1]; the four-corpus deployment uses a calibrated combination of the decision-rule embedding similarity, the platform-team's manual review score, and the operational consistency score from the prior quarter's rollup. The alignment-provenance column is a structured JSON value carrying the calibration breakdown, the prior-quarter alignment row's id (for chain traceability), and the alignment-drift signals that surfaced the alignment-review candidate, if any. The platform-authorisation column references a separate platform_authorisations table that records the platform-team review pass that authorised the alignment row, with the reviewing engineer's identity, review timestamp, and rationale.

The alignment table is append-only on the (quarter_id, corpus_id, local_category_id) composite, with the same finalisation semantics as the per-corpus and global taxonomy snapshots. Once a quarter's alignment is finalised, no row with that quarter is ever updated or deleted; alignment-row changes for the next quarter are written as new rows that reference the prior-quarter row's id in the provenance column. The append-only property is what guarantees that the historical cross-corpus rollups remain interpretable indefinitely against the alignment table state at the time the rollup was produced.

A subtle property worth calling out is the alignment table's cardinality model. Each per-corpus local category maps to exactly one global category in any given quarter; the alignment is a function from local categories to global categories, not a many-to-many relation. The constraint is enforced at table-level by the primary key. The model is deliberate: a many-to-many alignment would let two local categories be jointly aligned to two global categories with weighted shares, which is more expressive but produces an alignment-table topology the cross-corpus rollup pass cannot reconcile without re-introducing the implicit-join failure mode the protocol is designed to avoid. The constraint that each local category maps to exactly one global category in a given quarter is the property the cross-corpus rollup pass relies on to compose the per-corpus rollups into the global rollup arithmetically rather than statistically.

CREATE TABLE corpus_alignment (
  quarter_id              text NOT NULL,
  corpus_id               uuid NOT NULL,
  local_category_id       uuid NOT NULL,
  global_category_id      uuid NOT NULL,
  alignment_confidence    numeric(4,3) NOT NULL CHECK (alignment_confidence BETWEEN 0 AND 1),
  alignment_provenance    jsonb NOT NULL,
  platform_authorisation_id uuid NOT NULL REFERENCES platform_authorisations(authorisation_id),
  finalised_at            timestamptz,
  PRIMARY KEY (quarter_id, corpus_id, local_category_id),
  FOREIGN KEY (quarter_id, global_category_id)
    REFERENCES global_taxonomy_snapshot (quarter_id, global_category_id)
);

CREATE INDEX corpus_alignment_global_idx
  ON corpus_alignment (quarter_id, global_category_id);

CREATE INDEX corpus_alignment_finalised_idx
  ON corpus_alignment (finalised_at)
  WHERE finalised_at IS NOT NULL;

Alignment-Drift Detection Rules

The alignment-drift detection rules surface alignment-review candidates to the platform team in advance of the next quarter's cross-corpus rollup. Three rules carry the load.

The first is the alignment-confidence-decay rule: an alignment row whose alignment confidence drops by more than 0.10 between consecutive quarters is flagged as an alignment-review candidate. The threshold is calibrated against the four-corpus deployment's prior eighteen months of alignment audit history; rows whose confidence drops by less than 0.10 between quarters have, in that history, never warranted an alignment update, while rows whose confidence drops by more than 0.10 have, in roughly seventy-five percent of cases, warranted a closer review (a global-taxonomy migration candidate, an alignment-row swap, or an alignment-confidence recalibration).

The second is the embedding-divergence rule: a local-to-global alignment whose decision-rule embedding cosine similarity drops below a tunable threshold (the four-corpus deployment uses 0.78 against the prior quarter's similarity) is flagged as an alignment-review candidate. The embedding similarity is computed at finalisation time by passing the local category's current decision rule and the global category's current decision rule through the platform-team's standard embedding model. The rule's threshold is calibrated against the same eighteen-month history; pairs that drop below 0.78 have, in that history, been re-aligned in roughly sixty-five percent of cases.

The third is the cascade-from-per-corpus-migration rule: a per-corpus migration audit row that touches a local category whose alignment row points at a global category is flagged as an alignment-review candidate by default, regardless of the alignment-confidence and embedding-divergence signals. The rule is the cascade-trigger the introduction's failure mode would have surfaced: a per-corpus merge that combines two locally near-duplicate categories into a single category may produce a combined decision rule that no longer maps cleanly onto either of the two prior global-category alignments, and the alignment-review queue is the surface that catches the projection-shift before the next cross-corpus rollup runs.

The drift-detection function's output is a corpus_alignment_drift_candidates table that the platform team reviews at the end of each quarter, in advance of the next quarter's cross-corpus rollup. The table's rows persist for a six-quarter retention window so the platform team can cross-reference patterns across multiple quarters; rows that result in an alignment update have their alignment_audit_id column filled in, while rows that the platform team dismisses after review have their dismissed_at timestamp filled in with a free-text rationale.

flowchart TD
  A[platform team<br/>reviews corpus_alignment_drift_candidates] --> B{candidate type?}
  B -->|confidence-decay| C[review alignment<br/>confidence breakdown]
  B -->|embedding-divergence| D[review decision-rule<br/>embedding pair]
  B -->|cascade-from-migration| E[review per-corpus<br/>migration audit row]
  C --> F{recalibrate or<br/>re-align?}
  D --> G{re-align or<br/>flag for global<br/>migration?}
  E --> H{re-align or<br/>defer to next<br/>global review?}
  F -->|recalibrate| I[alignment_recalibrate<br/>operation]
  F -->|re-align| J[alignment_swap<br/>operation]
  G -->|re-align| J
  G -->|flag for global<br/>migration| K[global_migration<br/>candidate row]
  H -->|re-align| J
  H -->|defer| L[dismissed_at<br/>filled in]
  I --> M[next-quarter<br/>alignment row]
  J --> M
  K --> N[global taxonomy<br/>migration audit]

Cascade-Trigger Rules

The cascade-trigger rules are the connective tissue between the per-corpus migration audit trail and the cross-corpus alignment review queue. The protocol defines three cascade rules. The first is the direct-cascade rule: a merge_categories, split_category, or rename_category operation in any per-corpus migration audit row writes one or more rows into the corpus_alignment_drift_candidates table at the next quarter boundary, with the candidate-type set to cascade-from-migration and the source migration audit row referenced in the provenance column. The second is the inverse-cascade rule: a global taxonomy migration audit row writes alignment-review candidates for every per-corpus alignment row that points at the affected global category, with the candidate-type set to cascade-from-global-migration and the global migration audit row referenced in the provenance column. The third is the bilateral-cascade rule: a quarter that has both a per-corpus migration touching a global-aligned category and a global migration touching that global category writes a higher-priority alignment-review candidate that the platform team's quarterly review processes ahead of the unilateral cascades.

The cascade-trigger function runs as a Postgres trigger on the per-corpus and global migration audit tables, with a single deferred-execution pass at quarter close that consolidates the candidate rows and computes the priority ordering for the platform-team review queue. The deferred-execution pass is the right shape because the within-quarter migrations may chain (a per-corpus migration may itself trigger a follow-on global migration in the same quarter, and the cascade-trigger has to fire against the consolidated state at quarter close rather than against the intermediate state mid-quarter).

Implementation Guide

The implementation has three load-bearing pieces: the global taxonomy snapshot DDL the platform team maintains, the per-corpus alignment table DDL each corpus's deployment runs, and the cascade-trigger function that wires the migration audit tables into the alignment review queue.

The global taxonomy snapshot DDL is shown above. The decision-rule embedding column uses a 1536-dimensional vector with an HNSW index for fast nearest-neighbour queries across the global taxonomy at alignment time; the four-corpus deployment runs the platform team's standard text-embedding-3-large embedding at finalisation time and stores the embedding alongside the snapshot row.

The per-corpus alignment table DDL is shown above. The foreign-key constraint on (quarter_id, global_category_id) is the property that prevents an alignment row from pointing at a global category that does not exist in that quarter's snapshot, which is the cross-corpus equivalent of the per-corpus protocol's snapshot anchor hash.

The cascade-trigger function is the third piece. The function consolidates the per-corpus and global migration audit rows for a given quarter at quarter close, computes the alignment-review candidates the cascade rules would write, and writes the candidate rows into corpus_alignment_drift_candidates with the priority ordering computed from the bilateral-cascade rule. The function is idempotent against re-runs (rows are written with a deterministic candidate id derived from the source migration audit row's id) and runs once per quarter close. A reference implementation in Python with a Postgres connection wrapper sits in the companion repo.

CREATE TABLE corpus_alignment_drift_candidates (
  candidate_id            uuid PRIMARY KEY,
  quarter_id              text NOT NULL,
  corpus_id               uuid NOT NULL,
  local_category_id       uuid,
  global_category_id      uuid,
  candidate_type          text NOT NULL CHECK (candidate_type IN (
                            'confidence-decay',
                            'embedding-divergence',
                            'cascade-from-migration',
                            'cascade-from-global-migration',
                            'bilateral-cascade')),
  cascade_priority        smallint NOT NULL DEFAULT 1,
  source_migration_audit_id uuid,
  candidate_provenance    jsonb NOT NULL,
  alignment_audit_id      uuid,
  dismissed_at            timestamptz,
  dismissed_rationale     text,
  created_at              timestamptz NOT NULL DEFAULT now(),
  CHECK (
    (alignment_audit_id IS NOT NULL AND dismissed_at IS NULL) OR
    (alignment_audit_id IS NULL AND dismissed_at IS NOT NULL) OR
    (alignment_audit_id IS NULL AND dismissed_at IS NULL)
  )
);

CREATE INDEX corpus_alignment_drift_candidates_open_idx
  ON corpus_alignment_drift_candidates (quarter_id, cascade_priority)
  WHERE alignment_audit_id IS NULL AND dismissed_at IS NULL;

The platform-team review pass at quarter close walks the open rows in priority order, applies the alignment update operations against the next quarter's alignment table, and either fills in the alignment_audit_id column (alignment update authorised) or the dismissed_at and dismissed_rationale columns (alignment update declined with a structured rationale). The closure semantics match the per-corpus drift-candidate semantics from the prior post and run on the same quarterly cadence.

Comparison and Tradeoffs

The cross-corpus alignment protocol takes a place in a wider design space of approaches to cross-corpus taxonomy reconciliation. Four approaches are worth comparing.

The naive implicit-join approach is the failure mode the introduction's anecdote described: per-corpus categories that map to the same global category are joined implicitly by name or embedding similarity at rollup time, with no explicit alignment table and no cascade-trigger. The approach has zero storage overhead and is operationally simple to set up, but it is fragile against per-corpus migrations because the migrations cascade into the cross-corpus rollup silently, and is irreproducible against the historical rollup record because the implicit join's decision boundary changes as the embeddings drift.

The explicit-alignment-without-cascade approach has a per-corpus alignment table but no cascade-trigger from per-corpus migrations into the alignment review queue. The approach is reproducible at any single quarter (the alignment table records the join) but is fragile across quarters because per-corpus migrations may produce alignment-row inconsistencies the platform team only catches at the next cross-corpus rollup, by which point the rollup has already shipped. The 2025-Q4 quarter-boundary review's failure mode is exactly this approach's failure mode.

The protocol-without-drift-detection approach has a per-corpus alignment table and the cascade-trigger from per-corpus migrations but no alignment-drift detection rules. The approach catches the per-corpus-migration cascade reliably but does not surface the slow drift in the alignment-confidence and embedding-divergence signals, which means the alignment table accumulates stale rows whose confidence has decayed without any review having been triggered.

The protocol-as-described approach has all four pieces (global taxonomy snapshot, per-corpus alignment table, alignment-drift detection rules, cascade-trigger from per-corpus migrations). The trade-offs are summarised in the comparison table below.

Property Implicit-join Explicit-no-cascade Protocol-no-drift Protocol (full)
Alignment storage overhead None Low Low Low
Per-corpus migration cascade catch None None Reliable Reliable
Slow alignment drift catch None None None Reliable
Cross-corpus rollup reproducibility None Per-quarter Per-quarter Per-quarter and across quarters
Audit-traceability None Partial Full Full
Operational cost None Low Low Moderate
flowchart TD
  A[cross-corpus<br/>rollup pass] --> B{alignment shape?}
  B -->|implicit-join| C[fragile;<br/>no audit;<br/>silent migration cascade]
  B -->|explicit no cascade| D[per-quarter<br/>reproducibility<br/>only]
  B -->|protocol no drift| E[migration cascade<br/>reliable;<br/>slow drift missed]
  B -->|protocol full| F[migration cascade<br/>and slow drift<br/>both reliable]
  F --> G[platform-team<br/>review queue]
  G --> H[next-quarter<br/>alignment table]
  H --> I[next-quarter<br/>cross-corpus<br/>rollup]
Comparison image showing the pre-protocol implicit-join state on the left side rendered as four corpus taxonomy stacks with dotted ad-hoc lines joining them by embedding similarity to a faint global taxonomy core, fragile and irreproducible, versus the post-protocol explicit-alignment state on the right side rendered as the same four corpus taxonomy stacks now connected to a solid global taxonomy core via labelled alignment ribbons in orchid with confidence scores and platform-authorisation tags, with cascade-trigger arrows in sage flowing from per-corpus migration audit rows into the alignment review queue at the bottom, and three drift-detection signal regions highlighted across the alignment ribbons (confidence-decay, embedding-divergence, cascade-from-migration), all rendered in the deep teal copper ivory orchid sage cluster palette

Production Considerations

Three production considerations are worth surfacing for any team running the cross-corpus alignment protocol against a live multi-corpus deployment.

The first is alignment-table cardinality at scale. The protocol's per-quarter alignment-table row count is the product of the corpus count and the per-corpus category count. The four-corpus deployment averages a few hundred local categories per corpus, which produces a few thousand alignment rows per quarter, a comfortable working set for Postgres at any reasonable hardware scale. At the sixteen-corpus deployment from the prior post the row count rises to the low ten-thousands per quarter, still comfortable. At the sixty-corpus-plus deployment the row count rises to the low hundred-thousands per quarter and the platform team will want to revisit the per-corpus alignment-confidence calibration cost, which is the per-row cost driver because it requires recomputing the decision-rule embedding similarity at finalisation time. The mitigation is to amortise the embedding-similarity recompute across the quarter rather than running it in a single batch at finalisation, and to cache the embedding values across quarters where the underlying decision rules have not changed.

The second is platform-team review-queue throughput. The alignment-review queue is the platform-team's quarterly bottleneck; the platform team has to review every open candidate before the next cross-corpus rollup runs, and the review pass is more time-consuming than the per-corpus drift-candidate review because the alignment review involves cross-referencing the per-corpus migration audit, the global taxonomy snapshot, and the alignment-confidence breakdown for every candidate. The four-corpus deployment averages twelve to twenty open candidates per quarter, which the platform team can review in a single afternoon. The sixteen-corpus deployment's open-candidate count rises to forty to seventy per quarter, which the platform team handles across two review sessions. The sixty-corpus-plus scale will require the platform team to invest in a review-tooling layer that surfaces the highest-priority bilateral-cascade candidates first and groups the unilateral-cascade candidates by global-category cluster.

The third is alignment-table interaction with the cross-quarter trend-pass primitive. The cross-quarter trend-pass primitive from the prior post has to be extended to the cross-corpus case: a cross-corpus cross-quarter rollup has to project per-corpus rows against per-corpus snapshots, then project the per-corpus rollups against per-corpus alignment tables, then compose the per-corpus rollups against the global taxonomy snapshot. The composition is straightforward when no migrations have happened; the composition is non-trivial when both per-corpus migrations and global migrations have happened in the same quarter range, because the composition has to walk both migration chains in the right order. The protocol's right answer is to walk the per-corpus migration chain first (projecting per-corpus rows onto the per-corpus target-quarter snapshot), then walk the per-corpus alignment chain (projecting the per-corpus target-quarter rows onto the per-corpus target-quarter alignment), then walk the global migration chain (projecting the alignment-projected rows onto the global target-quarter snapshot). The order is deterministic and reproducible against the audit trail, and the composition's per-row cost is bounded by the longest migration chain length across the quarter range.

Conclusion

The cross-corpus taxonomy alignment protocol is the layer that keeps multi-corpus rollups interpretable across the independent evolution of per-corpus and global taxonomies. The global category taxonomy snapshot table is append-only, finalised at quarter boundaries, and joined to the per-corpus alignment table by (quarter_id, global_category_id), so the historical alignment state remains queryable indefinitely. The per-corpus alignment table records the local-to-global mapping with explicit confidence scores, provenance, and platform-team authorisation, and is constrained by table-level keys to the per-local-category exactly-one-global-category cardinality the cross-corpus rollup pass relies on. The alignment-drift detection rules surface alignment-review candidates to the platform team via three mechanical signals (confidence-decay, embedding-divergence, cascade-from-migration) calibrated against the eighteen-month operational history. The cascade-trigger rules wire the per-corpus migration audit trail into the alignment review queue with three cascade types (direct, inverse, bilateral), and the cascade-trigger function runs as a Postgres trigger on the migration audit tables with a single deferred-execution pass at quarter close.

The next post in this cluster will walk through the taxonomy-aware quarterly review pass the engineering manager runs at the end of each quarter, building on the per-corpus drift-candidate review and the cross-corpus alignment review. The post will cover how the engineering manager's review surfaces the highest-priority migration candidates ahead of the quarter boundary, how the migration audit trails surface the prior quarter's migrations during the review, how the alignment-drift signals feed into the engineering manager's input to the cross-corpus trend-layer pass, and how the cross-quarter trend-pass primitive's projection choices interact with the engineering manager's cross-corpus rollup review. The post after that will close this sub-cluster by walking through the annual taxonomy-evolution rollup, which is the platform team's annual review of how the global taxonomy itself has evolved across four quarters and how the per-corpus alignment tables have evolved against it.

The companion repo's adlc-eval-contracts/cross-corpus-alignment/ directory has been updated with the global taxonomy snapshot DDL, the per-corpus alignment table DDL, the alignment-drift candidate table DDL, the cascade-trigger function's reference implementation in Python, the alignment-confidence calibration script, and a worked example over a four-quarter migration chain that exercises the bilateral-cascade rule against a per-corpus merge plus a concurrent global rename.

Architecture image showing the cross-corpus taxonomy alignment protocol as a layered diagram with four per-corpus taxonomy snapshot rings on the left feeding into the per-corpus alignment table in the centre, the alignment table feeding the cross-corpus rollup pass on the right via the global taxonomy snapshot core in ivory, the per-corpus and global migration audit trails running across the bottom feeding into the corpus_alignment_drift_candidates queue at the centre-bottom via the cascade-trigger function, the platform-team review pass on the right consuming the candidate queue and writing back to the next-quarter alignment table, and the cross-quarter trend-pass primitive on the top consuming the alignment table chain and the global taxonomy chain to produce per-quarter, forward-mapped, or chain-of-migrations cross-corpus labels depending on the projection choice, all rendered in the deep teal copper ivory orchid sage cluster palette consistent with blogs 178 through 198

Sources

  • Postgres Documentation. Foreign Keys. https://www.postgresql.org/docs/current/tutorial-fk.html
  • Postgres Documentation. Triggers. https://www.postgresql.org/docs/current/trigger-definition.html
  • pgvector. HNSW Indexes. https://github.com/pgvector/pgvector
  • Martin Fowler. Bounded Context. https://martinfowler.com/bliki/BoundedContext.html
  • Pat Helland. Data on the Outside versus Data on the Inside. ACM CIDR 2005. https://www.cidrdb.org/cidr2005/papers/CIDR05_Paper8.pdf
  • Confluent. Schema Registry Compatibility Modes. https://docs.confluent.io/platform/current/schema-registry/avro.html
  • OpenAI. Text Embedding Models. https://platform.openai.com/docs/guides/embeddings
  • Google SRE Workbook. Postmortem Culture: Learning from Failure. https://sre.google/workbook/postmortem-culture/
  • Datadog. State of AI Engineering Report 2026. April 2026. https://www.datadoghq.com/state-of-ai-engineering/
  • Anthropic. Engineering Operations at Scale. 2026. https://www.anthropic.com/engineering

About the Author

Toc Am

Founder of AmtocSoft. Writing practical deep-dives on AI engineering, cloud architecture, and developer tooling. Previously built backend systems at scale. Reviews every post published under this byline.

LinkedIn X / Twitter

Published: 2026-05-09 · Written with AI assistance, reviewed by Toc Am.

Buy Me a Coffee · 🔔 YouTube · 💼 LinkedIn · 🐦 X/Twitter

Comments

Popular posts from this blog

29 Million Secrets Leaked: The Hardcoded Credentials Crisis

What is an LLM? A Beginner's Guide to Large Language Models

AI Agent Security: Prompt Injection, Poisoning, and How to Defend Against Both