Corpus Analysis & Compression

EVERMORE

Vox Audita Perit · Litera Scripta Manet

Every token stored once. Every work perfectly preserved. A corpus that grows stronger as it grows larger.

99.93% Shannon Efficiency
O(1) Lookup Cost
8 Specialized Databases
v17 Current Specification

What is Evermore?

Evermore is a corpus analysis and compression toolkit. It ingests large collections of text-based content — ebooks, web pages, documents, scanned PDFs — and stores them as structured references into a layered token database rather than as raw text.

The result is a corpus that is fully lossless and trivially renderable, but radically smaller than its source, with rich semantic metadata generated as a natural byproduct of the import process.

"The word 'the' might appear two billion times across a million documents. In conventional storage, it is stored two billion times. In Evermore, it is stored once and referenced two billion times."

Every Token Once, Every Work Lossless

Evermore's architecture is built around a single structural invariant: each unique token is stored exactly once. Every imported work is stored as a flat sequence of references to that shared vocabulary. Deduplication emerges from the design — it is not applied as a post-processing step.
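The invariant above can be sketched in a few lines. This is a minimal illustration, not Evermore's implementation: the names `intern_tokens` and `render` are hypothetical, IDs are plain integers in arrival order, and whitespace is treated as an ordinary token so that the round trip is exact.

```python
def intern_tokens(works):
    """Store each unique token once; store each work as a list of token IDs."""
    vocab = []    # token ID -> token string (each string appears exactly once)
    ids = {}      # token string -> token ID
    corpus = {}   # work name -> flat sequence of token IDs
    for name, tokens in works.items():
        seq = []
        for tok in tokens:
            if tok not in ids:          # first sighting: allocate the next ID
                ids[tok] = len(vocab)
                vocab.append(tok)
            seq.append(ids[tok])        # every sighting: store a reference
        corpus[name] = seq
    return vocab, corpus


def render(vocab, seq):
    """Reconstruct the original token sequence from its references."""
    return [vocab[i] for i in seq]


# Hypothetical pre-tokenized input; whitespace is kept as its own token.
works = {
    "a": ["the", " ", "cat", " ", "saw", " ", "the", " ", "dog", "."],
    "b": ["the", " ", "dog", " ", "ran", "."],
}
vocab, corpus = intern_tokens(works)
```

Deduplication here is a side effect of how references are allocated, which mirrors the claim that it is not a separate post-processing step.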

I
Tokens are the atomic unit

A token is any discrete lexical unit — word, punctuation, markup element. Every token receives an ID. What the token means is not the database's concern.

II
IDs are addresses, not labels

A token's ID is its address in the layered database. Address length encodes the layer depth directly. Lookup cost is fixed regardless of corpus size.
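As a rough picture of an ID acting as an address, the sketch below keeps one table per layer and selects the layer from the address length alone. The class `LayeredTokenTable` and its methods are hypothetical, and the one-table-per-layer layout is an assumption; the source does not describe the physical layout.

```python
class LayeredTokenTable:
    """Toy layered store: the address length selects the layer directly."""

    def __init__(self):
        # layer depth (address length in bytes) -> {address bytes -> token}
        self.layers = {}

    def put(self, address: bytes, token: str) -> None:
        if not address or 0xFF in address:
            raise ValueError("address must be non-empty and must not contain 0xFF")
        self.layers.setdefault(len(address), {})[address] = token

    def get(self, address: bytes) -> str:
        # Two keyed lookups, however many tokens or works exist.
        return self.layers[len(address)][address]


table = LayeredTokenTable()
table.put(b"\x00", "the")          # layer 1: one-byte address
table.put(b"\x00\x00", "zephyr")   # layer 2: two-byte address
```

The work done by `get` depends only on the address, not on how many works reference the token.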

III
The databases are independent

Each can be stored, transferred, and backed up separately. The Token DB is the keystone, but all others can stand alone in their own right.

IV
Derived databases are always regenerable

The Corpus Analysis DB and Index DB contain no data that cannot be recomputed from the Corpus DB at any time. They can be deleted and regenerated without any loss.

Eight Databases, One Pipeline

The system is organized around eight purpose-built databases and a single processing component. Each has a specific, non-overlapping role.

Token DB
Primary · Keystone

The vocabulary. Maps token IDs to (token string, profile ID) pairs. Every other database references it. This is the compatibility surface — two Corpus DBs built against the same Token DB are directly interchangeable.

Delta Token DB
Primary · Distribution

A partial Token DB containing only the tokens not present in a referenced stable base. Enables self-contained corpus bundles without retransmitting vocabulary the recipient already holds.
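A delta of this kind amounts to a set difference over token IDs. The sketch below assumes vocabularies are plain `{token_id: token_string}` dicts and that the full vocabulary is a superset of the base; `make_delta` and `apply_delta` are hypothetical names.

```python
def make_delta(full_vocab: dict, base_vocab: dict) -> dict:
    """Keep only the entries the recipient's stable base does not already hold."""
    return {tid: tok for tid, tok in full_vocab.items() if tid not in base_vocab}


def apply_delta(base_vocab: dict, delta: dict) -> dict:
    """Rebuild the sender's vocabulary from the shared base plus the delta."""
    merged = dict(base_vocab)
    merged.update(delta)
    return merged


base = {0: "the", 1: "of", 2: "and"}
full = {0: "the", 1: "of", 2: "and", 3: "evermore", 4: "zipfian"}
delta = make_delta(full, base)   # only IDs 3 and 4 travel with the bundle
```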

Corpus DB
Primary · Content

The works as compressed token ID sequences. Pure content — no raw text, no bibliographic metadata. Each work is a flat byte stream of variable-width token IDs in source order.

Works Meta DB
Primary · Catalog

The library catalog. Bibliographic data, provenance, storage statistics, and image position records for reconstruction. Queryable entirely independently of the Token DB and Corpus DB.

Image Blob DB
Primary · Optional

Content-addressed binary store for images, keyed by SHA-256 hash. Works imported without image processing have no entries here — a valid and non-corrupt state.
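Content addressing by SHA-256 can be illustrated with Python's standard `hashlib`. The in-memory class below is a hypothetical stand-in for the real store; only the keying scheme is taken from the description above.

```python
import hashlib


class ImageBlobStore:
    """Toy content-addressed store: the key is the SHA-256 of the bytes."""

    def __init__(self):
        self._blobs = {}   # hex digest -> image bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs.setdefault(key, data)   # identical images are stored once
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self) -> int:
        return len(self._blobs)


store = ImageBlobStore()
k1 = store.put(b"fake image bytes")
k2 = store.put(b"fake image bytes")   # same content, same key, no new entry
```

An empty store is simply a store with no keys, which matches the note that works imported without image processing leave no entries.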

Corpus Analysis DB
Derived · Post-Import

Derived entirely post-import by running analysis operations over the Corpus DB. Frequency counters, neighbor maps, and future analysis outputs. A semantic byproduct of the import architecture.

Index DB
Derived · Search

Inverted index mapping token IDs to locations across the corpus, enabling phrase search and prefix queries. Generated post-import, regenerable at any time, and optional for archival use.
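A positional inverted index and a phrase lookup over it might look like the following. This is a sketch under simple assumptions: the corpus is a dict of integer ID lists, and `build_index` and `phrase_search` are hypothetical names. Prefix queries would also need the token strings and are not shown.

```python
from collections import defaultdict


def build_index(corpus):
    """Map each token ID to every (work, position) where it occurs."""
    index = defaultdict(list)
    for work, seq in corpus.items():
        for pos, tid in enumerate(seq):
            index[tid].append((work, pos))
    return index


def phrase_search(index, phrase):
    """Return (work, start) for every place the ID sequence occurs in order."""
    if not phrase:
        return []
    # Postings for the 2nd, 3rd, ... tokens, as sets for fast membership tests.
    later = [set(index.get(tid, ())) for tid in phrase[1:]]
    hits = []
    for work, pos in index.get(phrase[0], ()):
        if all((work, pos + k) in s for k, s in enumerate(later, start=1)):
            hits.append((work, pos))
    return hits


corpus = {"a": [0, 1, 2, 0, 1], "b": [1, 2, 0]}
index = build_index(corpus)
```

Because the index is built only from the sequences, deleting it and calling `build_index` again reproduces it exactly.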

Working DB
Transient · Pipeline

Tracks every document's in-flight state during import. The recovery record for the pipeline. Not backed up, not bundled. An empty Working DB after a successful import is the normal state.

Near-Shannon-Limit Encoding

Evermore's addressing scheme — SMA — achieves near-theoretical maximum information density over a 255-symbol alphabet. By reserving exactly one byte value (0xFF) as an out-of-band stream terminator, SMA uses 255 content slots per table out of 256 — losing a mere 0.07% of the Shannon limit.

IDs are variable-width: a single byte addresses the first 255 tokens, two bytes extend the addressable vocabulary to 65,280 tokens, and three bytes to roughly 16.7 million. At Zipfian corpus scale, the highest-frequency tokens receive the shortest IDs, so the compression compounds: the savings concentrate on exactly the most-referenced vocabulary.
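One way to realize variable-width IDs over 255 byte values is to give rank 0 the shortest ID and count upward, layer by layer. The sketch below is an assumption about how such a mapping could work, not the SMA specification: `id_for_rank`, `rank_for_id`, and `capacity` are hypothetical names, and how boundaries between consecutive IDs are recovered in a stream is not addressed here.

```python
BASE = 255  # byte values 0x00..0xFE; 0xFF is reserved and never used in an ID


def capacity(max_len: int) -> int:
    """Total number of IDs of length 1..max_len bytes."""
    return sum(BASE ** k for k in range(1, max_len + 1))


def id_for_rank(rank: int) -> bytes:
    """Rank 0 gets the first one-byte ID; higher ranks spill into longer IDs."""
    length, span = 1, BASE
    while rank >= span:          # skip past every shorter layer
        rank -= span
        length += 1
        span *= BASE
    digits = []
    for _ in range(length):      # write the remainder as base-255 digits
        rank, d = divmod(rank, BASE)
        digits.append(d)
    return bytes(reversed(digits))


def rank_for_id(addr: bytes) -> int:
    """Inverse of id_for_rank."""
    value = 0
    for b in addr:
        if b == 0xFF:
            raise ValueError("0xFF cannot appear inside an ID")
        value = value * BASE + b
    return capacity(len(addr) - 1) + value
```

Under these assumptions the cumulative capacities are 255, 65,280, 16,646,655, and 4,244,897,280 IDs for lengths of one to four bytes.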

Vocabulary Size     ID Length   Scale
≤ 255 tokens        1 byte      Tiny / embedded
≤ 65,280 tokens     2 bytes     GPT-scale BPE (~50k)
≤ 16.7M tokens      3 bytes     Large multilingual
≤ 4.3B tokens       4 bytes     Anna's Archive scale

99.93% of Shannon Limit

log₂(255) / log₂(256) ≈ 7.994 / 8 ≈ 0.9993
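The ratio can be checked directly. The variable names below are illustrative only.

```python
import math

ALPHABET = 255   # usable byte values per position
BYTE = 256       # all byte values

# Information carried per byte when only 255 of 256 values are available,
# relative to the 8 bits a full byte could carry.
efficiency = math.log2(ALPHABET) / math.log2(BYTE)
loss_percent = (1 - efficiency) * 100

print(f"efficiency: {efficiency:.4%}, loss: {loss_percent:.2f}%")
```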

0xFF Reserved Terminator

The only byte value that never appears inside a valid content ID — enabling self-delimiting streams at zero structural cost.
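Because 0xFF never occurs inside an ID, a reader can find the end of a stream without a length prefix. The sketch below assumes each work's ID bytes are followed by a single 0xFF and that several works may be concatenated; `write_stream` and `split_streams` are hypothetical names.

```python
TERMINATOR = 0xFF


def write_stream(ids) -> bytes:
    """Concatenate ID byte strings and close the stream with 0xFF."""
    body = b"".join(ids)
    if TERMINATOR in body:
        raise ValueError("0xFF is reserved and cannot appear in content")
    return body + bytes([TERMINATOR])


def split_streams(blob: bytes) -> list[bytes]:
    """Recover each stream body from a concatenation of terminated streams."""
    parts = blob.split(bytes([TERMINATOR]))
    if parts[-1] != b"":
        raise ValueError("final stream is not terminated")
    return parts[:-1]


first = write_stream([b"\x00", b"\x01\x02", b"\x03"])
second = write_stream([b"\xfe\xfe"])
bodies = split_streams(first + second)
```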

Optimization as a Natural Cascade

All optimization passes run after ingest. This is not a limitation — it is architecturally correct. Optimization requires corpus-wide frequency statistics, and by the time passes run, all content has been converted to compact integer sequences. The inner loop is integer comparison, not string handling.

Ingest

Format auto-detection, text extraction, tokenization, and resolution. Produces Token DB + Corpus DB in arrival-order ID space. The corpus is immediately usable.

Frequency Analysis

A single pass over the Corpus DB builds a ranked (token → frequency) table. This is the prerequisite for all subsequent optimization.
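With works held as integer ID sequences, the frequency pass reduces to counting integers. A minimal sketch using the standard library; `frequency_table` is a hypothetical name and the dict-of-lists corpus is an assumption.

```python
from collections import Counter


def frequency_table(corpus):
    """Single pass over every work; returns (token_id, count) pairs, most frequent first."""
    counts = Counter()
    for seq in corpus.values():
        counts.update(seq)
    return counts.most_common()


corpus = {"a": [0, 1, 0, 2, 0], "b": [1, 0, 3]}
ranked = frequency_table(corpus)
```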

Token Remap

Assigns the shortest IDs to the highest-frequency tokens. Both deduplication and ID compression now concentrate savings on exactly the same tokens — correlated gains on the Zipfian head.
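The remap can be pictured as a renumbering that sorts tokens by descending frequency. The sketch below is illustrative: the function names are hypothetical, ties are broken by the old ID as an arbitrary choice, and ID widths assume 255 usable values per byte.

```python
from collections import Counter


def id_width(token_id: int) -> int:
    """Bytes needed for an ID, assuming 255 usable values per byte position."""
    width, span = 1, 255
    while token_id >= span:
        token_id -= span
        width += 1
        span *= 255
    return width


def remap_by_frequency(vocab, corpus):
    """Renumber tokens so the most frequent ones receive the smallest IDs."""
    counts = Counter(t for seq in corpus.values() for t in seq)
    order = sorted(range(len(vocab)), key=lambda t: (-counts[t], t))
    old_to_new = {old: new for new, old in enumerate(order)}
    new_vocab = [vocab[old] for old in order]
    new_corpus = {w: [old_to_new[t] for t in seq] for w, seq in corpus.items()}
    return new_vocab, new_corpus


def stored_bytes(corpus) -> int:
    return sum(id_width(t) for seq in corpus.values() for t in seq)


# A frequent token that arrived late (ID 299) costs two bytes until remapped.
vocab = [f"t{i}" for i in range(300)]
corpus = {"w": [299] * 10 + [0]}
new_vocab, new_corpus = remap_by_frequency(vocab, corpus)
```

In this toy corpus the stored size falls from 21 bytes to 11, and the rendered token strings are unchanged.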

Phrase Analysis

Identifies candidate bigrams and n-grams whose collapse would yield net storage savings. Thresholds are evaluated against actual post-remap ID costs, so the savings arithmetic is accurate and the commit decisions are correct.
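A net-savings test for bigrams could be sketched as follows. Everything specific here is an assumption: the cost model (`id_cost`, `phrase_id_cost`, `entry_overhead`) and the function name are hypothetical, and adjacent pairs are counted with overlap, so the estimate is an upper bound for self-overlapping phrases.

```python
from collections import Counter


def bigram_savings(corpus, id_cost, phrase_id_cost, entry_overhead):
    """Return {(a, b): net_bytes_saved} for bigrams worth collapsing.

    id_cost(token_id) -> bytes per reference at current (post-remap) IDs
    phrase_id_cost    -> bytes per reference to the new phrase ID
    entry_overhead    -> one-time bytes to record the phrase in the vocabulary
    """
    pairs = Counter()
    for seq in corpus.values():
        pairs.update(zip(seq, seq[1:]))   # every adjacent pair of IDs
    savings = {}
    for (a, b), n in pairs.items():
        net = n * (id_cost(a) + id_cost(b) - phrase_id_cost) - entry_overhead
        if net > 0:
            savings[(a, b)] = net
    return savings


corpus = {"w": [0, 1, 0, 1, 0, 1, 2]}
candidates = bigram_savings(
    corpus, id_cost=lambda t: 1, phrase_id_cost=1, entry_overhead=2
)
```

Only the pair (0, 1), seen three times, clears the threshold in this example; a pair seen twice merely breaks even and is not committed.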

Phrase Remap

Commits qualifying phrase IDs, physically shortening sequence files. Progressive: as the corpus deepens, more phrases cross the threshold. Compression improves organically with scale.
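Committing a phrase shortens each sequence while staying reversible. The sketch below collapses one bigram greedily from left to right and expands it again; `collapse`, `expand`, and the phrase table layout are hypothetical.

```python
def collapse(seq, pair, phrase_id):
    """Replace each non-overlapping occurrence of `pair` with `phrase_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(phrase_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def expand(seq, phrases):
    """Undo collapse: phrase IDs expand to their member IDs, others pass through."""
    out = []
    for t in seq:
        out.extend(phrases.get(t, (t,)))
    return out


original = [0, 1, 0, 1, 2]
shorter = collapse(original, (0, 1), phrase_id=9)
restored = expand(shorter, {9: (0, 1)})
```

The phrase ID must not collide with an existing token ID; the sketch leaves that allocation to the caller.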

Currently in Development

Evermore is an active design-and-implementation project. The specification is mature — at version 17 — and the addressing scheme has reached formal specification at v1.2. The architecture is fully defined; implementation is underway.

System Spec      v17            Mature
SMA Spec         v1.2           Formalized
Implementation   Active         In Progress
Distribution     evirmare.org   Home

The corpus is in a valid, lossless, fully reconstructable state at every point in the pipeline. Optimization is never a prerequisite for reconstruction — it is an improvement to compression efficiency on top of an already-correct system.