RFC-0009: semantic document IR v2

Version: 0.2.3 | Status: normative | Phase: test


1. Summary

[RFC-0009:C-SUMMARY] Summary (Informative)

This RFC defines the normative semantic document IR v2 conformance surface for typub.

Scope:

  • Document-rooted semantic IR model (Document, Block, Inline, assets, footnotes, metadata).
  • Determinism and validation requirements for IR production and consumption.
  • Controlled downgrade semantics (Raw vs Unknown) and adapter policy declaration requirements.
  • Explicit semantic separation between math nodes and non-math SVG nodes.

Out of scope:

  • Parser implementation internals.
  • Adapter-specific target syntax details.
  • Asset backend implementation details (storage drivers, upload protocols).
  • Cross-version persisted IR compatibility (IR v2 is an in-process pipeline contract, not a long-term storage format).

Relationship to other RFCs:

  • The deprecated predecessor IR RFC MUST NOT be used as the IR v2 conformance surface.
  • RFC-0002 and RFC-0004 define pipeline and materialization contracts that consume this IR.

Since: v0.2.0


2. Specification

[RFC-0009:C-SEMANTIC-BOUNDARY] Semantic-only IR boundary (Normative)

The IR MUST be semantic-first. It MUST NOT contain execution-state fields such as absolute filesystem paths, preview-only URLs, temporary upload placeholders, adapter/runtime handles, or other publish-run context.

Preview-only URL resolution data MUST be carried in a non-conformance sidecar or context (for example, the materialization context), not in the IR conformance surface. If an implementation uses transient in-memory overlays during preview, those overlays MUST NOT participate in conformance serialization, equivalence checks, or persisted IR artifacts.

Derived cache payloads MAY appear only when all of the following are true:

  1. The payload is derivable from canonical semantic source (Document semantic content plus stable renderer config).
  2. The payload is explicitly optional and safely droppable without semantic loss.
  3. The payload is not required for semantic validation, semantic equivalence, or adapter capability negotiation.

Pipeline state needed for provisioning/materialization MUST live outside the IR conformance surface in context/pipeline metadata.

Since: v0.2.0
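
Informative sketch (not part of the clause): the three cache conditions above can be illustrated with a node whose derived payload is optional, droppable, and ignored by equivalence. The type `SketchNode` and its fields are hypothetical names chosen for this example and are not defined by this RFC; the node is deliberately simplified to a single semantic field.

```rust
/// Hypothetical, simplified node: `source` is semantic content,
/// `cached_render` is a derived cache payload.
#[derive(Debug, Clone)]
pub struct SketchNode {
    /// Canonical semantic content.
    pub source: String,
    /// Optional derived payload; droppable without semantic loss.
    pub cached_render: Option<String>,
}

impl SketchNode {
    /// Returns a copy with the derived cache payload removed.
    pub fn without_cache(&self) -> SketchNode {
        SketchNode {
            source: self.source.clone(),
            cached_render: None,
        }
    }

    /// Semantic equivalence compares semantic fields only;
    /// the cache payload never participates.
    pub fn semantically_eq(&self, other: &SketchNode) -> bool {
        self.source == other.source
    }
}
```

Under these assumptions, dropping the cache before serialization or comparison does not change the outcome of an equivalence check.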

[RFC-0009:C-ASSET-REFERENCE] Asset references are document-indexed (Normative)

Image and binary resources MUST be referenced by stable asset identifiers from content nodes. The document root MUST provide an asset index mapping asset identifiers to source metadata and resolved variants. Emitters MUST resolve resources through this index and MUST NOT infer resource identity from inline path strings.

Rendered binary artifacts produced by enrich/materialize pipelines (including math or SVG raster outputs) MUST also resolve through asset identifiers and Document.assets.

Document.assets resolved variants MUST contain only publish-conformance data that is reproducible across runs for identical inputs and configuration. Preview-only resolution data MUST remain in preview sidecar/context and MUST NOT be persisted or serialized as IR conformance-surface data.

Strategy-specific transport forms (for example data URI emission for Embed) are materialization/serialization concerns and MUST NOT redefine semantic identity in content nodes.

Since: v0.2.0
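
Informative sketch (not part of the clause): one possible shape for a document-indexed asset lookup. The names `AssetId`, `AssetEntry`, `UnresolvedAsset`, and `resolve`, as well as the `media_type` and `variants` fields, are assumptions made for illustration; the RFC only requires that content nodes carry stable identifiers and that resolution goes through Document.assets.

```rust
use std::collections::BTreeMap;

/// Stable asset identifier carried by content nodes (hypothetical newtype).
#[derive(Debug, Clone, PartialEq, Eq, PartialOrd, Ord)]
pub struct AssetId(pub String);

/// Source metadata plus reproducible resolved variants (hypothetical fields).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AssetEntry {
    pub media_type: String,
    /// Variant name -> publish-conformance location; BTreeMap keeps ordering deterministic.
    pub variants: BTreeMap<String, String>,
}

#[derive(Debug, Default)]
pub struct Document {
    pub assets: BTreeMap<AssetId, AssetEntry>,
}

/// Error returned when an identifier has no entry in the index.
#[derive(Debug, PartialEq, Eq)]
pub struct UnresolvedAsset(pub AssetId);

impl Document {
    /// Emitters look resources up by identifier only; no path inference.
    pub fn resolve(&self, id: &AssetId) -> Result<&AssetEntry, UnresolvedAsset> {
        self.assets.get(id).ok_or_else(|| UnresolvedAsset(id.clone()))
    }
}
```

A missing identifier surfaces as an explicit error here rather than falling back to an inline path string.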

[RFC-0009:C-MATH-MODEL] Math must use explicit inline/block nodes (Normative)

Math MUST be represented by explicit inline and block math node variants. Implementations MUST NOT infer block math from structural heuristics such as a paragraph containing a single SVG fragment.

Math nodes MUST remain semantically distinct from non-math SVG nodes. Implementations MUST NOT classify generic SVG graphics as math for pipeline convenience.

Math nodes MAY carry canonical source. When canonical source is present, it MUST include an explicit source kind (Typst or LaTeX) and the source text.

Math nodes MUST carry at least one of:

  1. Canonical source.
  2. Rendered payload.

Rendered math payloads MAY be carried as optional derived cache/enrich data, but only if they satisfy RFC-0009:C-SEMANTIC-BOUNDARY (optional, droppable, and non-semantic for equivalence/validation).

Binary rendered payloads (for example PNG) MUST be represented through asset references resolved via Document.assets, not as inline strategy-specific transport encodings in semantic math nodes.

Since: v0.2.0
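
Informative sketch (not part of the clause): a possible encoding of the math model with explicit inline math and SVG variants, optional canonical source, and binary rendered output referenced by asset identifier. All type, variant, and method names below are hypothetical.

```rust
/// Closed set of canonical source kinds named by this clause.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MathSourceKind {
    Typst,
    Latex,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MathSource {
    pub kind: MathSourceKind,
    pub text: String,
}

/// Rendered payloads (hypothetical shape). Binary output is referenced by
/// asset identifier and is never inlined as a transport encoding.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Rendered {
    Svg(String),
    Asset(String),
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct MathPayload {
    pub source: Option<MathSource>,
    pub rendered: Option<Rendered>,
}

impl MathPayload {
    /// At least one of canonical source or rendered payload must be present.
    pub fn has_required_content(&self) -> bool {
        self.source.is_some() || self.rendered.is_some()
    }
}

/// Math and generic SVG stay separate variants; neither is inferred from the other.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Inline {
    Text(String),
    MathInline(MathPayload),
    SvgInline(String),
}
```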

[RFC-0009:C-DOCUMENT-ROOT] Document root and ownership (Normative)

The IR root MUST be a document object that owns blocks, asset index, footnote definitions, and document metadata. Footnote references in inline content MUST resolve to document-scoped footnote definitions. Cross-node resources and references MUST be addressable from this single root.

Since: v0.2.0
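
Informative sketch (not part of the clause): resolving footnote references against document-scoped definitions from a single root. The structure is heavily simplified (each block is modeled as a flat list of inlines), and the names `Inline`, `Document`, and `dangling_footnote_refs` are assumptions for this example.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Inline {
    Text(String),
    /// Reference to a footnote definition owned by the document root.
    FootnoteRef(String),
}

#[derive(Debug, Default)]
pub struct Document {
    /// Simplified: every block is a flat run of inline content.
    pub blocks: Vec<Vec<Inline>>,
    /// Document-scoped footnote definitions, keyed by identifier.
    pub footnotes: BTreeMap<String, Vec<Inline>>,
}

impl Document {
    /// Returns identifiers that are referenced in content but not defined
    /// at the root, in source reading order.
    pub fn dangling_footnote_refs(&self) -> Vec<&str> {
        self.blocks
            .iter()
            .flatten()
            .filter_map(|inline| match inline {
                Inline::FootnoteRef(id) if !self.footnotes.contains_key(id) => Some(id.as_str()),
                _ => None,
            })
            .collect()
    }
}
```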

[RFC-0009:C-ATTRS-LAYERING] Typed attrs and passthrough layering (Normative)

Attributes MUST be modeled as typed fields plus passthrough attributes. Semantically significant attributes used by pipeline logic or emitters MUST be represented by typed fields. Unknown or target-specific attributes MAY be preserved in a passthrough map. Passthrough maps in the conformance surface MUST use deterministic ordering.

Since: v0.2.0
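
Informative sketch (not part of the clause): splitting raw key/value attributes into typed fields and a deterministically ordered passthrough map. `ImageAttrs`, its `alt` and `title` fields, and `from_pairs` are hypothetical; which attributes count as semantically significant is an implementation decision that this example does not settle.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Default, PartialEq, Eq)]
pub struct ImageAttrs {
    // Typed fields: attributes that pipeline logic or emitters rely on.
    pub alt: Option<String>,
    pub title: Option<String>,
    // Everything else is preserved verbatim; BTreeMap iterates in key order.
    pub passthrough: BTreeMap<String, String>,
}

impl ImageAttrs {
    /// Routes known keys to typed fields and all other keys to passthrough.
    pub fn from_pairs<I>(pairs: I) -> Self
    where
        I: IntoIterator<Item = (String, String)>,
    {
        let mut attrs = ImageAttrs::default();
        for (key, value) in pairs {
            if key == "alt" {
                attrs.alt = Some(value);
            } else if key == "title" {
                attrs.title = Some(value);
            } else {
                attrs.passthrough.insert(key, value);
            }
        }
        attrs
    }
}
```

Because the passthrough map is ordered by key, two inputs that differ only in attribute order produce the same serialized order.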

[RFC-0009:C-RAW-UNKNOWN-POLICY] Raw and Unknown handling policy (Normative)

The IR MUST distinguish Raw nodes from Unknown nodes. Raw nodes carry literal source payload under explicit trust/origin metadata. Unknown nodes represent unmodeled structures without implicit execution.

Each adapter MUST provide a machine-readable Raw/Unknown policy declaration in its capability declaration surface.

The declared policy MUST include, at minimum, one action for each category (Raw, Unknown) from this closed set: pass, sanitize, drop, error.

Validation timing requirements:

  1. Adapter registration/loading MUST fail if the declaration is missing or contains invalid actions.
  2. Publish-time adapter selection MUST fail if no valid declaration is available for the selected adapter.
  3. CI/governance checks MUST verify declaration presence for first-party adapters.

Adapter execution MUST apply the declared policy consistently.

Since: v0.2.0
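
Informative sketch (not part of the clause): validating a Raw/Unknown policy declaration at adapter registration time. The declaration is modeled here as an optional pair of strings; the actual machine-readable format is not specified by this example, and `PolicyAction`, `RawUnknownPolicy`, `RegistrationError`, and `register_policy` are hypothetical names.

```rust
/// The closed action set: pass, sanitize, drop, error.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PolicyAction {
    Pass,
    Sanitize,
    Drop,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RawUnknownPolicy {
    pub raw: PolicyAction,
    pub unknown: PolicyAction,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RegistrationError {
    MissingDeclaration,
    InvalidAction { category: &'static str, value: String },
}

/// Parses one declared action, rejecting anything outside the closed set.
fn parse_action(category: &'static str, value: &str) -> Result<PolicyAction, RegistrationError> {
    match value {
        "pass" => Ok(PolicyAction::Pass),
        "sanitize" => Ok(PolicyAction::Sanitize),
        "drop" => Ok(PolicyAction::Drop),
        "error" => Ok(PolicyAction::Error),
        other => Err(RegistrationError::InvalidAction {
            category,
            value: other.to_string(),
        }),
    }
}

/// `declared` is `(raw_action, unknown_action)`; `None` models a missing declaration.
pub fn register_policy(declared: Option<(&str, &str)>) -> Result<RawUnknownPolicy, RegistrationError> {
    let (raw, unknown) = declared.ok_or(RegistrationError::MissingDeclaration)?;
    Ok(RawUnknownPolicy {
        raw: parse_action("raw", raw)?,
        unknown: parse_action("unknown", unknown)?,
    })
}
```

In this sketch, both failure modes named in the validation timing list (missing declaration, invalid action) are distinct error values, so registration cannot succeed silently.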

[RFC-0009:C-DETERMINISM] Deterministic normalization requirements (Normative)

IR production and serialization MUST be deterministic for identical inputs and configuration.

Implementations MUST canonicalize ordered collections where semantic order is not input-dependent, including passthrough attribute maps and style sets.

For IR v2, a style set is the value carried by inline styled content to represent one or more text styles. Canonicalization requirements for style sets are:

  1. Deduplicate styles by style identity.
  2. Serialize styles in this canonical order: Bold, Italic, Strikethrough, Underline, Mark, Superscript, Subscript, Kbd.
  3. Preserve equivalence such that semantically identical style sets produce byte-identical conformance serialization.

Non-deterministic map iteration in conformance output is prohibited.

Since: v0.2.0
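
Informative sketch (not part of the clause): style-set canonicalization implemented by declaring the enum variants in the canonical order and relying on the derived ordering. The function name `canonicalize_styles` is hypothetical; the variant names follow the canonical order listed above.

```rust
use std::collections::BTreeSet;

/// Variants are declared in canonical order, so the derived `Ord`
/// (which follows declaration order) yields the canonical ordering.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum TextStyle {
    Bold,
    Italic,
    Strikethrough,
    Underline,
    Mark,
    Superscript,
    Subscript,
    Kbd,
}

/// Deduplicates by style identity and returns styles in canonical order.
pub fn canonicalize_styles(styles: &[TextStyle]) -> Vec<TextStyle> {
    let set: BTreeSet<TextStyle> = styles.iter().copied().collect();
    set.into_iter().collect()
}
```

One consequence of this design is that reordering the enum declaration would change conformance output, so a real implementation might prefer an explicit rank function.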

[RFC-0009:C-VALIDATION] IR validation invariants (Normative)

Implementations MUST validate IR invariants before adapter specialization. Validation failures MUST be surfaced as explicit errors and MUST NOT silently degrade core semantics.

At minimum, this validation MUST enforce:

  1. Heading levels are within 1..=6.
  2. All asset references resolve to existing entries in Document.assets.
  3. List nesting is structurally valid for the unified list model.
  4. Math payload validity.
  5. SVG payload validity for explicit SVG nodes.

For clause (4), the minimum interoperable validity rule is:

  • A math node MUST contain at least one of canonical source or rendered payload.
  • If canonical source is present, source kind MUST be one of the RFC-defined kinds (Typst or LaTeX).
  • If canonical source is present, source text MUST be non-empty after trimming ASCII whitespace.

For clause (5), the minimum interoperable validity rule is:

  • An explicit SVG node MUST contain at least one of canonical SVG source or rendered payload.

Implementations MAY add stricter parser-based validation (for example full Typst/LaTeX parse checks), but such strictness MUST be explicitly documented and MUST produce deterministic outcomes for fixed parser version and configuration.

Since: v0.2.0
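
Informative sketch (not part of the clause): a minimal validator covering invariants (1), (2), and (4). The `Node` enum is a deliberately flattened stand-in for the real Block/Inline model, and every name below is an assumption. The source-kind check is omitted because a closed enum would enforce it at the type level.

```rust
use std::collections::BTreeSet;

/// Flattened stand-in for IR nodes (hypothetical).
#[derive(Debug, Clone)]
pub enum Node {
    Heading { level: u8 },
    Image { asset_id: String },
    Math { source_text: Option<String>, has_rendered: bool },
}

#[derive(Debug, PartialEq, Eq)]
pub enum ValidationError {
    HeadingLevelOutOfRange(u8),
    UnresolvedAsset(String),
    EmptyMathNode,
    BlankMathSource,
}

/// Returns the first violated invariant as an explicit error.
pub fn validate(nodes: &[Node], asset_ids: &BTreeSet<String>) -> Result<(), ValidationError> {
    for node in nodes {
        match node {
            Node::Heading { level } => {
                if !(1..=6).contains(level) {
                    return Err(ValidationError::HeadingLevelOutOfRange(*level));
                }
            }
            Node::Image { asset_id } => {
                if !asset_ids.contains(asset_id) {
                    return Err(ValidationError::UnresolvedAsset(asset_id.clone()));
                }
            }
            Node::Math { source_text, has_rendered } => match source_text {
                // Neither canonical source nor rendered payload is present.
                None if !*has_rendered => return Err(ValidationError::EmptyMathNode),
                // Source is present but empty after trimming ASCII whitespace.
                Some(text) if text.trim_matches(|c: char| c.is_ascii_whitespace()).is_empty() => {
                    return Err(ValidationError::BlankMathSource)
                }
                _ => {}
            },
        }
    }
    Ok(())
}
```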

[RFC-0009:C-IR-TYPE-SURFACE] IR type surface and conformance (Normative)

The conformance surface of IR v2 MUST be explicitly defined in this RFC and MUST NOT depend on external, non-RFC schema documents.

At minimum, a conforming implementation MUST provide the following semantic structure:

  1. Document root

    • A single Document root object.
    • Document MUST own: blocks, footnotes, assets, and meta.
    • blocks is ordered and preserves source reading order.
    • footnotes and assets MUST be key-addressable maps with deterministic ordering.
  2. Block and Inline model

    • Block and Inline MUST be closed tagged unions (or equivalent sum types) with explicit variant tags.
    • Block MUST include variants covering at least: heading, paragraph, quote, code block, divider, list, definition list, table, figure, admonition, details, math block, SVG block, unknown block, raw block.
    • Inline MUST include variants covering at least: text, code, soft break, hard break, styled span, link, image, footnote ref, math inline, SVG inline, unknown inline, raw inline.
  3. Attribute layering

    • Conforming attrs MUST be split into typed fields and a passthrough map.
    • Passthrough maps MUST use deterministic key ordering.
  4. Style set surface

    • Styled inline content MUST carry a style set.
    • A style set MUST be represented as a collection of TextStyle values.
    • Conformance serialization MUST use the canonical ordering rule defined in RFC-0009:C-DETERMINISM.
  5. Assets and references

    • Image/binary references inside content MUST use stable asset identifiers.
    • Asset metadata and resolved variants MUST be stored in Document.assets, not embedded as ad-hoc runtime fields in content nodes.
    • Document.assets resolved variants are limited to reproducible publish-conformance data; preview-only resolution data MUST remain outside the conformance surface.
  6. Renderable payload model

    • Math nodes and SVG nodes MUST both use an explicit renderable payload model.
    • The renderable payload model MUST support canonical source (optional) and rendered payloads (optional), but at least one MUST be present per node.
    • Binary rendered artifacts MUST be representable by asset identifier reference so Embed/Upload/External can be resolved in materialize/serialize without changing semantic node identity.
  7. Math and SVG semantics

    • Math MUST be represented with explicit inline/block math nodes.
    • Non-math SVG content MUST be represented with explicit inline/block SVG nodes.
    • Implementations MUST NOT collapse generic SVG nodes into math nodes.
  8. List semantics

    • List structure MUST be represented as a unified list model with explicit list kind and recursive list items.
    • Per-item marker semantics (including task checked state) MUST be representable without relying on serializer heuristics.
  9. Controlled downgrade

    • Unknown and Raw nodes MUST be distinct variants with explicit handling semantics.
    • Raw variants MUST carry trust/origin metadata sufficient for policy enforcement.
  10. Legacy exclusion

    • Legacy pre-v2 HtmlElement/InlineFragment/ImageMarker shapes MUST NOT be used as the conformance surface for IR v2.

Implementations MAY add internal helper fields or transient representations, but externally visible IR conformance (serialization contracts, adapter interfaces, and validation input/output) MUST satisfy this clause.

Since: v0.2.0
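
Informative sketch (not part of the clause): one way to express the unified list model from item 8, with an explicit list kind, recursive items, and per-item task state. The names `ListKind`, `ListItem`, `List`, and `depth`, and the choice to model item content as a plain string, are assumptions made to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ListKind {
    Bullet,
    Ordered { start: u64 },
    Task,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ListItem {
    /// `Some(checked)` for task items, `None` for all other items.
    pub checked: Option<bool>,
    /// Simplified item content.
    pub text: String,
    /// Nested lists owned by this item.
    pub children: Vec<List>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct List {
    pub kind: ListKind,
    pub items: Vec<ListItem>,
}

impl List {
    /// Nesting depth of this list, counting itself as level 1.
    pub fn depth(&self) -> usize {
        1 + self
            .items
            .iter()
            .flat_map(|item| item.children.iter())
            .map(List::depth)
            .max()
            .unwrap_or(0)
    }
}
```

Because the checked state lives on the item, a serializer does not need to re-derive it from marker text.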


Changelog

v0.2.3 (2026-02-21)

Clarify shared SVG and math render-payload semantics

Added

  • Separate explicit SVG nodes from math nodes in conformance type surface
  • Allow math canonical source to be optional while requiring source-or-rendered presence
  • Require binary rendered artifacts to resolve through Document.assets for Embed/Upload/External workflows

v0.2.2 (2026-02-21)

Clarify Document.assets resolved-variant boundary

Added

  • Constrain Document.assets resolved variants to reproducible publish-conformance data
  • Require preview-only resolution data to remain sidecar/context and out of conformance IR

v0.2.1 (2026-02-21)

Close audit gaps on conformance interoperability

Added

  • Clarify preview-only URL boundary between conformance IR and preview sidecar context
  • Pin heading level validation range to 1..=6
  • Define interoperable canonical style-set ordering
  • Strengthen first-party Raw/Unknown governance checks to MUST

v0.2.0 (2026-02-21)

Clarify conformance and validation for audit closure

Added

  • Add C-SUMMARY scope/out-of-scope
  • Define IR type surface in C-IR-TYPE-SURFACE
  • Clarify semantic boundary for optional derived caches
  • Define minimum interoperable math validation rule
  • Make Raw/Unknown policy declaration machine-readable and enforceable
  • Define style set structure and canonicalization requirements

v0.1.0 (2026-02-21)

Initial draft