RFC-0004: external asset storage

Version: 0.2.3 | Status: normative | Phase: impl


1. Summary

[RFC-0004:C-SUMMARY] Summary (Informative)

RFC-0004 defines the normative contract for external asset storage in typub, extending the asset strategy system defined in RFC-0002.

This RFC specifies:

  • The External asset strategy variant for S3-compatible object storage.
  • Configuration requirements for external storage backends.
  • Asset upload tracking and caching semantics.
  • Integration with the publish pipeline’s Materialize stage.

This RFC does not define:

  • Specific storage provider implementations (AWS S3, Cloudflare R2, MinIO).
  • Image optimization or transformation logic.
  • Asset garbage collection strategies.

Scope: This specification applies when an adapter declares External as its asset strategy and external storage is configured.

Rationale: Platforms like HashNode and Dev.to strip or ignore embedded data URIs and cannot access local file paths. External object storage provides publicly accessible URLs that work universally across publishing platforms.

Since: v0.1.0


2. Specification

[RFC-0004:C-EXTERNAL-STRATEGY] External Asset Strategy (Normative)

The asset strategy system MUST support an External variant in addition to the existing Copy, Embed, and Upload variants.

When an adapter uses the External strategy:

  1. The system MUST upload local asset files to a configured external object storage service before the Publish stage.
  2. The system MUST replace local asset references in the finalized payload with publicly accessible URLs from the external storage.
  3. The system MUST NOT proceed to the Publish stage if any asset upload fails.

Strategy declaration: An adapter MAY declare External as a supported strategy in its ImageStrategyPolicy.

Unsupported strategy handling: If a user configures asset_strategy = "external" for an adapter that does not declare External in its ImageStrategyPolicy, the system MUST fail at configuration validation with an error that:

  • Names the adapter.
  • States that External is not supported.
  • Lists the strategies that the adapter does support.

The system MUST NOT silently fall back to another strategy.
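
The fail-fast rule above can be sketched as follows. This is a minimal illustration, not the typub implementation: the `supported` field on `ImageStrategyPolicy`, the `ConfigValidationError` type, and the function name are all hypothetical, since this RFC names the policy type but not its shape.

```python
from dataclasses import dataclass


class ConfigValidationError(ValueError):
    """Raised at configuration validation; there is no fallback path."""


@dataclass(frozen=True)
class ImageStrategyPolicy:
    # Hypothetical shape: strategies the adapter declares as supported.
    supported: tuple[str, ...]


def validate_asset_strategy(adapter: str, requested: str,
                            policy: ImageStrategyPolicy) -> str:
    """Return the requested strategy, or fail naming the adapter,
    the unsupported strategy, and the supported alternatives."""
    if requested not in policy.supported:
        supported = ", ".join(policy.supported)
        raise ConfigValidationError(
            f"adapter '{adapter}': asset strategy '{requested}' is not "
            f"supported; supported strategies: {supported}"
        )
    return requested
```

Raising instead of returning a default keeps the "no silent fallback" requirement visible in the control flow.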

Distinction from Upload: The External strategy MUST be treated as distinct from Upload. The Upload strategy indicates platform-native upload APIs (for example, Confluence attachments or Notion File Upload API), while External indicates third-party object storage independent of the target platform.

Rationale: Some platforms (HashNode, Dev.to) lack native asset upload APIs but accept external URLs. The External strategy provides a universal fallback for platforms that cannot host assets natively. Fail-fast validation prevents publishing with broken images.

Since: v0.1.0

[RFC-0004:C-STORAGE-CONFIG] Storage Configuration (Normative)

External storage configuration MUST support both global and per-platform scopes, with deterministic precedence rules.

Precedence ladder: When resolving a configuration field, the system MUST apply the following precedence order (highest to lowest):

  1. Platform-specific environment variable (for example, HASHNODE_S3_BUCKET).
  2. Platform-specific configuration file value.
  3. Global environment variable (for example, S3_BUCKET).
  4. Global configuration file value.

The first non-empty value in this order wins. The system MUST NOT merge partial values from different levels for a single field.
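
A minimal sketch of the ladder for a single field, under stated assumptions: the environment variable naming pattern (`{PLATFORM}_S3_{FIELD}` and `S3_{FIELD}`) is generalized from the two examples above and is not confirmed by this RFC, and "non-empty" is taken to mean "not `None` and not the empty string". The function and parameter names are illustrative.

```python
from typing import Mapping, Optional


def resolve_field(
    field: str,
    platform: str,
    env: Mapping[str, str],
    platform_file: Mapping[str, str],
    global_file: Mapping[str, str],
) -> Optional[str]:
    """Return the first non-empty value on the precedence ladder, or None."""
    candidates = (
        env.get(f"{platform.upper()}_S3_{field.upper()}"),  # 1. platform env
        platform_file.get(field),                           # 2. platform file
        env.get(f"S3_{field.upper()}"),                     # 3. global env
        global_file.get(field),                             # 4. global file
    )
    for value in candidates:
        if value:  # skips both None and ""
            return value
    return None
```

Each field is resolved independently and taken whole from one level, so no partial values are merged within a field.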

Global configuration: The system MUST support a global storage configuration that applies to all platforms using the External strategy.

Per-platform override: A platform MAY override specific storage configuration fields. Platform-specific values take precedence over global values at the same source level (environment or file).

Required configuration fields: The storage configuration MUST include at minimum:

  • Storage type identifier (for example, “s3” for S3-compatible storage).
  • Bucket or container name.
  • Public URL prefix for constructing accessible URLs.

Optional configuration fields:

  • Endpoint URL (for S3-compatible services). If absent, the system MUST use empty string for identifier computation.
  • Region (for example, “us-east-1”). If absent, the system MUST use empty string for identifier computation.

Credential fields: The storage configuration MUST support:

  • Access key identifier.
  • Secret access key.

Credential values MUST NOT be included in the storage configuration identifier used for cache invalidation.

Storage configuration identifier: The system MUST compute a deterministic storage configuration identifier by:

  1. Collecting these fields: type, endpoint, bucket, region, public_url_prefix.
  2. Normalizing each field:
    • type: lowercase, trimmed.
    • endpoint: if absent, use empty string. Otherwise: parse as URL; lowercase the scheme and host only; preserve path case; remove trailing slash from path; remove default port (:443 for https, :80 for http).
    • bucket: as-is (case-sensitive per S3 spec).
    • region: lowercase, trimmed. If absent, use empty string.
    • public_url_prefix: parse as URL; lowercase the scheme and host only; preserve path case; remove trailing slash from path.
  3. Concatenating as: {type}|{endpoint}|{bucket}|{region}|{public_url_prefix}.
  4. Computing SHA-256 hash of the concatenated string (UTF-8 encoded).
  5. Using the full 64 hex characters as the identifier.

Examples:

  • Input: type=s3, endpoint=https://S3.us-east-1.amazonaws.com/MyPath/, bucket=my-bucket, region=us-east-1, public_url_prefix=https://CDN.example.com/Assets/

  • Normalized: s3|https://s3.us-east-1.amazonaws.com/MyPath|my-bucket|us-east-1|https://cdn.example.com/Assets

  • Identifier: full 64-char SHA-256 hex of that string

  • Input: type=s3, endpoint=(absent), bucket=my-bucket, region=(absent), public_url_prefix=https://cdn.example.com/

  • Normalized: s3||my-bucket||https://cdn.example.com

  • Identifier: full 64-char SHA-256 hex of that string
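
The identifier computation can be sketched with the Python standard library. The function names are illustrative, and a few edge cases the clause does not spell out are handled by assumption: userinfo in a URL is dropped, repeated trailing slashes are all removed, and the default port is stripped only for `endpoint`, since the clause mentions port removal only there.

```python
import hashlib
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

_DEFAULT_PORTS = {"https": 443, "http": 80}


def normalize_url(url: str, strip_default_port: bool) -> str:
    """Lowercase scheme and host, keep path case, drop trailing slashes."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit returns the hostname lowercased
    if ":" in host:              # IPv6 literal: restore the brackets
        host = f"[{host}]"
    port = parts.port
    is_default = _DEFAULT_PORTS.get(scheme) == port
    if port is not None and not (strip_default_port and is_default):
        host = f"{host}:{port}"
    path = parts.path.rstrip("/")
    return urlunsplit((scheme, host, path, parts.query, parts.fragment))


def canonical_config_string(type_: str, bucket: str, public_url_prefix: str,
                            endpoint: Optional[str] = None,
                            region: Optional[str] = None) -> str:
    """Build {type}|{endpoint}|{bucket}|{region}|{public_url_prefix}."""
    return "|".join([
        type_.strip().lower(),
        normalize_url(endpoint, True) if endpoint else "",
        bucket,  # as-is: bucket names are case-sensitive
        region.strip().lower() if region else "",
        normalize_url(public_url_prefix, False),
    ])


def storage_config_id(*args, **kwargs) -> str:
    """Full 64-character SHA-256 hex digest of the canonical string."""
    canonical = canonical_config_string(*args, **kwargs)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Credentials never enter `canonical_config_string`, so rotating keys leaves the identifier, and therefore the cache, unchanged.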

Validation: The system MUST validate storage configuration before attempting any upload. If required fields are missing or invalid, the system MUST fail with a descriptive error before the Materialize stage.

If an adapter uses the External strategy but no storage configuration is present, the system MUST fail with a configuration error rather than falling back to another strategy.

Rationale: A single precedence ladder eliminates ambiguity when environment variables and file values conflict at different scopes. Lowercasing only the scheme and host (not the path) preserves case-sensitive path semantics per RFC 3986. Using the full SHA-256 hash makes accidental collisions negligible for safety-critical cache keys. Explicit empty-string handling for absent optional fields ensures cross-implementation determinism. Excluding secrets from the identifier avoids cache invalidation on credential rotation.

Since: v0.1.0

[RFC-0004:C-UPLOAD-TRACKING] Asset Upload Tracking (Normative)

The system MUST persist asset upload records in local storage to enable caching and deduplication.

Two-index model: The system MUST maintain two logical indices:

  1. Content index: (storage_config_id, content_hash, extension) → (remote_key, remote_url). This enables cross-path deduplication for files with identical content and extension.
  2. Path index: (local_path, storage_config_id) → (content_hash, extension). This tracks the last-uploaded state for each local path.

Record fields: Each upload record MUST include at minimum:

  • Local asset path (relative to content directory).
  • Content hash of the uploaded file (SHA-256, lowercase hex, 64 characters).
  • Normalized extension (lowercase, alphanumeric only, or empty string).
  • Remote object key.
  • Public URL of the uploaded asset.
  • Upload timestamp.
  • Storage configuration identifier (as defined in RFC-0004:C-STORAGE-CONFIG).

Caching semantics: Before uploading an asset, the system MUST:

  1. Compute the content hash of the local file.
  2. Compute the normalized extension (see below).
  3. Check the content index for (storage_config_id, content_hash, extension):
    • If found, reuse the existing remote_url without uploading. Update the path index.
    • If not found, proceed to upload.
  4. After successful upload, atomically persist to both indices.

Per-asset atomicity: Each successful asset upload MUST be persisted atomically and independently. A failure uploading asset N MUST NOT roll back records for assets 1 through N-1.

This enables efficient retry: on retry, already-uploaded assets are skipped via the content index lookup.
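
The caching steps and per-asset atomicity can be sketched with SQLite from the Python standard library. The table layout, the `ensure_uploaded` name, and the `upload` callback (which takes an object key and returns the public URL) are assumptions for illustration; this RFC only requires the two logical indices, not a particular schema.

```python
import sqlite3
from typing import Callable

SCHEMA = """
CREATE TABLE content_index (
    storage_config_id TEXT, content_hash TEXT, extension TEXT,
    remote_key TEXT, remote_url TEXT, uploaded_at TEXT,
    PRIMARY KEY (storage_config_id, content_hash, extension));
CREATE TABLE path_index (
    local_path TEXT, storage_config_id TEXT,
    content_hash TEXT, extension TEXT,
    PRIMARY KEY (local_path, storage_config_id));
"""


def ensure_uploaded(db: sqlite3.Connection, config_id: str, local_path: str,
                    content_hash: str, extension: str,
                    upload: Callable[[str], str]) -> str:
    """Reuse a cached remote URL or upload once, then update both indices."""
    row = db.execute(
        "SELECT remote_url FROM content_index "
        "WHERE storage_config_id=? AND content_hash=? AND extension=?",
        (config_id, content_hash, extension),
    ).fetchone()
    with db:  # one transaction per asset: commit on success, roll back on error
        if row is None:
            key = f"{content_hash}.{extension}" if extension else content_hash
            remote_url = upload(key)  # may raise; nothing is persisted then
            db.execute(
                "INSERT INTO content_index VALUES (?,?,?,?,?,datetime('now'))",
                (config_id, content_hash, extension, key, remote_url),
            )
        else:
            remote_url = row[0]  # cache hit: no upload
        db.execute(
            "INSERT OR REPLACE INTO path_index VALUES (?,?,?,?)",
            (local_path, config_id, content_hash, extension),
        )
    return remote_url
```

Because each call commits its own transaction, a failure on a later asset leaves the records of earlier assets in place, and a retry resolves them from the content index without uploading again.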

Extension normalization: The extension MUST be normalized as follows:

  1. Extract the file extension from the original filename (characters after the last .).
  2. Convert to lowercase.
  3. Remove all characters not matching [a-z0-9].
  4. If the result is empty (no extension, or all characters removed), use empty string.

Examples:

  • image.PNG → png
  • photo.JPEG → jpeg
  • data.tar.gz → gz
  • README → (empty)
  • file.MP3! → mp3
  • weird.??? → (empty, all characters removed)
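
A minimal sketch of the four normalization steps. The function name is illustrative, and taking the basename first is an assumption so that a dot in a directory name is not mistaken for an extension separator.

```python
import os
import re


def normalize_extension(filename: str) -> str:
    """Lowercase the text after the last '.', keeping only [a-z0-9]."""
    name = os.path.basename(filename)
    _, dot, ext = name.rpartition(".")
    if not dot:  # no '.' at all: no extension
        return ""
    return re.sub(r"[^a-z0-9]", "", ext.lower())
```

Read literally, the rule treats a dotfile such as `.gitignore` as having the extension `gitignore`; the sketch follows that reading.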

Object key format: The remote object key MUST be constructed as:

  • If extension is non-empty: {content_hash}.{extension}
  • If extension is empty: {content_hash}

Where:

  • content_hash is the lowercase hex-encoded SHA-256 hash of the file content (64 characters).
  • extension is the normalized extension (lowercase alphanumeric only).

Examples:

  • a1b2c3d4...64chars.png
  • e5f6a7b8...64chars.jpg
  • f9a8b7c6...64chars (no extension)

This format is purely content-addressable. Identical content with identical normalized extension MUST produce the same remote object key, regardless of original filename or path.
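
As a short illustration of the key format, assuming the extension has already been normalized (the function names are hypothetical):

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Lowercase hex SHA-256 of the file content (64 characters)."""
    return hashlib.sha256(data).hexdigest()


def object_key(digest: str, extension: str) -> str:
    """{content_hash}.{extension}, or just {content_hash} if no extension."""
    return f"{digest}.{extension}" if extension else digest
```

Neither function takes the original path or filename as input, which is what makes the key purely content-addressable.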

Overwrite semantics: Because object keys are derived from content hash, the key itself proves content identity. If the remote object already exists with the same key:

  1. The upload operation SHOULD succeed idempotently (overwrite or no-op).
  2. The system MUST treat AlreadyExists, PreconditionFailed, or equivalent responses as success.
  3. No remote content verification is required; the content-addressable key guarantees equivalence.
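
One way to express the idempotent-success rule, assuming a hypothetical `StorageError` type that carries the service's error code as a string. Real storage clients report these conditions differently, so the codes below stand in for whatever "equivalent responses" the backend produces.

```python
from typing import Callable


class StorageError(Exception):
    """Hypothetical error type wrapping a storage service error code."""

    def __init__(self, code: str, message: str = ""):
        super().__init__(message or code)
        self.code = code


# Responses that mean "an object with this key is already there".
IDEMPOTENT_CODES = frozenset({"AlreadyExists", "PreconditionFailed"})


def put_idempotent(put: Callable[[str, bytes], None],
                   key: str, data: bytes) -> None:
    """Upload, treating already-exists style responses as success."""
    try:
        put(key, data)
    except StorageError as err:
        if err.code not in IDEMPOTENT_CODES:
            raise  # genuine failure: propagate to the caller
```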

Rationale: Including extension in the content index key ensures that identical bytes with different file types (for example, a file served as both .bin and .dat) are stored separately, preserving MIME type inference from extension. The two-index model separates concerns: content deduplication uses hash+extension lookup, while path tracking enables efficient change detection. Per-asset persistence avoids redundant uploads on retry without requiring cleanup of partial remote state. Content-addressable keys eliminate the need for remote checksum verification.

Since: v0.1.0

[RFC-0004:C-PIPELINE-INTEGRATION] Pipeline Integration (Normative)

External asset upload MUST occur during the Materialize stage (Stage 7) as defined in RFC-0002:C-PIPELINE-STAGES.

Scope: This clause applies to both External and Upload asset strategies. Both strategies require deferred asset processing after specialization.

Semantic IR-centric design: Per RFC-0009:C-DOCUMENT-ROOT and RFC-0009:C-ASSET-REFERENCE, the publish pipeline maintains a semantic document IR root from Parse through Materialize. Content nodes MUST reference assets by stable asset identifiers. Materialize MUST resolve those identifiers via the document asset index and strategy context, not by replacing inline string placeholders.

Stage ordering:

  1. The Specialize stage (Stage 5) MUST produce payload/state containing:
    • Semantic IR with asset references by stable identifiers.
    • Pending asset set derived from referenced asset identifiers.
    • Effective asset strategy configuration.
  2. The Materialize stage (Stage 7) MUST:
    • Upload assets to storage (external S3 for External, platform-native for Upload) when required.
    • Resolve each referenced asset identifier to final delivery metadata (for example, remote URL variants) in the document asset index and/or specialization context.
  3. The Serialize stage (Stage 8) MUST convert resolved semantic IR to target format.
  4. The Publish stage (Stage 9) MUST receive payload with all required asset references resolved per target policy.

Shared infrastructure: Implementations SHOULD provide shared utilities for:

  1. Collecting referenced asset identifiers and building pending asset sets during Specialize.
  2. Resolving asset URLs/variants during Materialize.
  3. Serializing resolved semantic IR during Serialize.

Per-asset processing: For each referenced asset identifier, the system MUST:

  1. Resolve source metadata from the asset index.
  2. For External strategy:
    a. Compute content hash (SHA-256, lowercase hex, 64 characters).
    b. Compute normalized extension per RFC-0004:C-UPLOAD-TRACKING.
    c. Check content index for cache hit; upload if not found.
  3. For Upload strategy:
    a. Upload to platform-native storage API.
  4. On successful upload, persist tracking records as appropriate.
  5. Record resolved delivery data so serialization can emit target-consumable references.

The system MAY process assets in any order, including in parallel, provided all required resolutions are completed before Serialize.

Failure handling: If an asset upload fails during Materialize:

  1. The system MUST NOT proceed to Serialize.
  2. The system MUST surface a descriptive error identifying the failed asset (by identifier and source path when available).
  3. Successfully uploaded assets from the current batch MUST remain in tracking records.
  4. Successfully uploaded assets MAY remain in remote storage.

Idempotency: Retry of a failed publish operation MUST be safe and efficient. The system MUST:

  1. Skip upload for assets already present in tracking records (for External, content index hit).
  2. Handle remote AlreadyExists or overwrite responses as success per RFC-0004:C-UPLOAD-TRACKING.

The system is not required to maintain or resume from any particular ordering across retries.
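
The failure-handling and retry rules can be illustrated with a deliberately simplified Materialize loop. Everything here is hypothetical: the tracking store is a plain dict keyed by an opaque cache key, the pending set is a list of tuples, and `upload` returns a URL. The point is only the control flow: persist per asset, stop before Serialize on the first failure, and skip tracked assets on retry.

```python
from typing import Callable, Dict, Iterable, Tuple


class MaterializeError(RuntimeError):
    """Identifies the failed asset by identifier and source path."""

    def __init__(self, asset_id: str, source_path: str, cause: Exception):
        super().__init__(
            f"asset '{asset_id}' ({source_path}) failed to upload: {cause}")
        self.asset_id = asset_id
        self.source_path = source_path


def materialize(pending: Iterable[Tuple[str, str, str]],
                tracking: Dict[str, str],
                upload: Callable[[str], str]) -> Dict[str, str]:
    """Resolve each (asset_id, source_path, cache_key) to a remote URL."""
    resolved: Dict[str, str] = {}
    for asset_id, source_path, cache_key in pending:
        url = tracking.get(cache_key)
        if url is None:
            try:
                url = upload(source_path)
            except Exception as cause:
                # Raising here means Serialize is never reached.
                raise MaterializeError(asset_id, source_path, cause) from cause
            tracking[cache_key] = url  # recorded now; survives later failures
        resolved[asset_id] = url
    return resolved
```

A retry with the same `tracking` store uploads only the assets that were not recorded before the failure.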

Preview flow: For preview operations that do not require remote upload:

  1. Materialize MAY resolve asset identifiers to local preview URLs or project-relative references in preview sidecar/context.
  2. Preview-only URLs MUST NOT be committed into IR conformance-surface fields (including persisted Document.assets variants used for publish conformance).
  3. Serialize MAY emit preview-compatible references directly from preview sidecar/context or unresolved source metadata when target policy permits.

Rationale: Identifier/index-based processing preserves semantic IR purity, avoids string-collision classes of bugs, and keeps materialization deterministic and adapter-agnostic.

Since: v0.1.0

[RFC-0004:C-ERROR-SEMANTICS] Error Semantics (Normative)

The system MUST provide clear, actionable error messages for external storage failures.

Configuration errors: When storage configuration is missing or invalid, the error MUST:

  • Identify which configuration field is missing or invalid.
  • Indicate whether the error is at global or platform-specific scope.
  • Suggest corrective action (for example, “set S3_ACCESS_KEY_ID environment variable or add access_key_id to storage configuration”).

Upload errors: When an asset upload fails, the error MUST:

  • Identify the local asset path that failed.
  • Include the underlying storage service error message.
  • Indicate whether the failure is retryable (for example, network timeout vs. access denied).

Credential errors: When storage credentials are rejected, the error MUST NOT expose credential values. The error SHOULD indicate which credential source was used (environment variable or configuration file).
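
As a sketch of how these requirements might map onto error types, the following uses three small dataclasses. The type names, field names, and message wording are illustrative assumptions; only the information each message carries comes from this clause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigError:
    field_name: str
    scope: str       # "global" or a platform name such as "hashnode"
    suggestion: str  # corrective action shown to the user

    def render(self) -> str:
        return (f"storage configuration ({self.scope} scope): "
                f"'{self.field_name}' is missing or invalid; {self.suggestion}")


@dataclass(frozen=True)
class UploadError:
    local_path: str
    service_message: str  # underlying storage service error, verbatim
    retryable: bool       # e.g. True for a timeout, False for access denied

    def render(self) -> str:
        hint = "retryable" if self.retryable else "not retryable"
        return (f"upload failed for '{self.local_path}': "
                f"{self.service_message} ({hint})")


@dataclass(frozen=True)
class CredentialError:
    source: str  # "environment variable" or "configuration file"

    def render(self) -> str:
        # Only the source is stored, so the value cannot leak into the message.
        return f"storage credentials were rejected (source: {self.source})"
```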

Rationale: Clear error messages reduce debugging time and prevent users from publishing with broken images.

Since: v0.1.0

[RFC-0004:C-URL-CONSTRUCTION] URL Construction (Normative)

The system MUST construct public URLs deterministically from the configured prefix and object key.

Prefix normalization: The system MUST normalize public_url_prefix before use:

  1. Parse as URL.
  2. Lowercase the scheme and host only; preserve path case.
  3. Remove any trailing / characters from the path.
  4. The normalized prefix is stored and used for all URL construction.

URL join algorithm: Given the normalized public_url_prefix and object_key, the public URL MUST be: {public_url_prefix}/{object_key}.

Object key construction: The object key MUST be constructed per RFC-0004:C-UPLOAD-TRACKING:

  • If normalized extension is non-empty: {content_hash}.{extension}
  • If normalized extension is empty: {content_hash}

Where:

  • content_hash: lowercase hex-encoded SHA-256 (64 characters).
  • extension: normalized extension per RFC-0004:C-UPLOAD-TRACKING (lowercase alphanumeric only).

The object key requires no percent-encoding because it contains only hex characters, dots, and lowercase alphanumerics.

Examples:

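| Original filename | Normalized extension | Object key         |
|-------------------|----------------------|--------------------|
| image.PNG         | png                  | a1b2...64chars.png |
| photo.JPEG        | jpeg                 | a1b2...64chars.jpeg |
| README            | (empty)              | a1b2...64chars     |
| data.tar.gz       | gz                   | a1b2...64chars.gz  |
| file.???          | (empty)              | a1b2...64chars     |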

Full URL example:

  • public_url_prefix (configured): https://CDN.example.com/Assets/
  • public_url_prefix (normalized): https://cdn.example.com/Assets
  • Content hash: a1b2c3d4e5f6... (64 chars)
  • Original filename: my image.png
  • Normalized extension: png
  • Object key: a1b2c3d4e5f6...64chars.png
  • Public URL: https://cdn.example.com/Assets/a1b2c3d4e5f6...64chars.png
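
The prefix normalization and join can be sketched as below. The function names are illustrative. Dropping userinfo, query, and fragment from the prefix is an assumption, since the clause does not say how a prefix containing them should be treated.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_prefix(prefix: str) -> str:
    """Lowercase scheme and host, keep path case, strip trailing slashes."""
    parts = urlsplit(prefix.strip())
    host = parts.hostname or ""  # urlsplit returns the hostname lowercased
    if ":" in host:              # IPv6 literal: restore the brackets
        host = f"[{host}]"
    if parts.port is not None:
        host = f"{host}:{parts.port}"
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), host, path, "", ""))


def public_url(prefix: str, content_hash: str, extension: str) -> str:
    """Join the normalized prefix and the object key with a single '/'."""
    key = f"{content_hash}.{extension}" if extension else content_hash
    return f"{normalize_prefix(prefix)}/{key}"
```

With the configured prefix from the example above, `public_url("https://CDN.example.com/Assets/", h, "png")` yields `https://cdn.example.com/Assets/` followed by the hash `h` and `.png`.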

Rationale: Normalizing only the scheme and host (not path) preserves case-sensitive path semantics per RFC 3986. Using only safe characters in object keys avoids encoding complexity and URL parsing issues. Explicit handling of empty extension ensures consistent behavior for extensionless files.

Since: v0.1.0

[RFC-0004:C-ASSET-LOCATION] Asset Location Constraints (Normative)

All assets referenced in content MUST be located within the project root as defined in RFC-0005:C-PROJECT-ROOT.

Validation: Before processing an asset, the system MUST verify that:

  1. The asset path resolves to a location within the project root.
  2. The asset file exists and is readable.

If an asset path resolves outside the project root, the system MUST fail with an error that:

  • Identifies the offending asset path.
  • States that assets must be within the project directory.
  • Suggests moving the asset into the project or using a symlink within the project tree.

Storage: Per RFC-0005:C-PROJECT-ROOT, asset paths stored in the status database MUST be relative to the project root. This enables project portability.
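
A minimal sketch of the location check, assuming a POSIX-style filesystem. The names are hypothetical. The path is normalized lexically rather than by following symlinks; that is an assumption made so that the "symlink within the project tree" remedy suggested above stays usable, and an implementation may define "resolves" differently.

```python
import os
from pathlib import Path


class AssetLocationError(ValueError):
    """Raised when an asset is outside the project root or unreadable."""


def check_asset_location(project_root: str, asset_path: str) -> str:
    """Validate the asset and return its path relative to the project root."""
    root = os.path.abspath(project_root)
    # Lexical normalization: collapses '..' without following symlinks.
    candidate = os.path.normpath(os.path.join(root, asset_path))
    if os.path.commonpath([root, candidate]) != root:
        raise AssetLocationError(
            f"asset '{asset_path}' is outside the project directory; assets "
            f"must be within the project directory: move the asset into the "
            f"project or use a symlink within the project tree"
        )
    if not (os.path.isfile(candidate) and os.access(candidate, os.R_OK)):
        raise AssetLocationError(
            f"asset '{asset_path}' does not exist or is not readable")
    # Stored paths are relative to the project root for portability.
    return Path(os.path.relpath(candidate, root)).as_posix()
```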

Rationale: Constraining assets to the project tree ensures:

  1. The entire project can be moved, synced, or version-controlled as a unit.
  2. Relative paths in the status database remain valid across machines.
  3. Content cannot accidentally reference system files or files in unrelated directories.

Since: v0.2.1


Changelog

v0.2.3 (2026-02-21)

Preview flow boundary alignment

Added

  • Require preview URL resolution to stay in sidecar/context, not conformance IR fields

v0.2.2 (2026-02-21)

Semantic IR pipeline integration alignment

Added

  • Update C-PIPELINE-INTEGRATION to asset identifier and document asset-index resolution model

v0.2.1 (2026-02-13)

v0.2.0 (2026-02-12)

Amended C-PIPELINE-INTEGRATION to specify placeholder token mechanism for asset references, avoiding fragile regex-based URL replacement

Added

  • Add C-ASSET-LOCATION clause requiring assets within project root

v0.1.0 (2026-02-12)

Initial draft