Corpus Analysis & Compression
Vox Audita Perit · Litera Scripta Manet
Every token stored once. Every work perfectly preserved. A corpus that grows stronger as it grows larger.
Evermore is a corpus analysis and compression toolkit. It ingests large collections of text-based content — ebooks, web pages, documents, scanned PDFs — and stores them as structured references into a layered token database rather than as raw text.
The result is a corpus that is fully lossless and trivially renderable, but radically smaller than its source, with rich semantic metadata generated as a natural byproduct of the import process.
"The word 'the' might appear two billion times across a million documents. In conventional storage, it is stored two billion times. In Evermore, it is stored once and referenced two billion times."
Evermore's architecture is built around a single structural invariant: each unique token is stored exactly once. Every imported work is stored as a flat sequence of references to that shared vocabulary. Deduplication emerges from the design — it is not applied as a post-processing step.
A token is any discrete lexical unit — word, punctuation, markup element. Every token receives an ID. What the token means is not the database's concern.
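The store-once invariant can be illustrated with a small interning table. This is a minimal sketch, not Evermore's implementation: `TokenStore`, `TOKEN_RE`, and the decision to treat whitespace runs as tokens (so that rendering is an exact round trip) are all assumptions made for the example.

```python
import re

# Hypothetical tokenizer: words, whitespace runs, and single punctuation marks.
# Every character falls into exactly one class, so joining the tokens
# reproduces the input exactly.
TOKEN_RE = re.compile(r"\w+|\s+|[^\w\s]")


class TokenStore:
    """Stores each unique token once and hands out integer IDs."""

    def __init__(self):
        self.id_of = {}   # token string -> ID
        self.tokens = []  # ID -> token string

    def intern(self, token):
        tid = self.id_of.get(token)
        if tid is None:
            tid = len(self.tokens)  # arrival-order ID
            self.id_of[token] = tid
            self.tokens.append(token)
        return tid

    def encode(self, text):
        """Turn a work into a flat sequence of token IDs."""
        return [self.intern(t) for t in TOKEN_RE.findall(text)]

    def render(self, ids):
        """Reconstruct the original text from its ID sequence."""
        return "".join(self.tokens[i] for i in ids)


store = TokenStore()
text_a = "The cat sat. The cat ran."
text_b = "The dog sat."
work_a = store.encode(text_a)
work_b = store.encode(text_b)
```

Both works share one vocabulary: the second work adds only `dog` to the store, and `store.render(work_a)` returns `text_a` unchanged.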
A token's ID is its address in the layered database. Address length encodes the layer depth directly. Lookup cost is fixed regardless of corpus size.
The system is organized around eight purpose-built databases and a single processing component. Each has a specific, non-overlapping role.
Each database can be stored, transferred, and backed up separately. The Token DB is the keystone, but all others can stand alone in their own right.
The Corpus Analysis DB and Index DB contain no data that cannot be recomputed from the Corpus DB at any time. They can be deleted and regenerated without any loss.
The vocabulary. Maps token IDs to (token string, profile ID) pairs. Every other database references it. This is the compatibility surface — two Corpus DBs built against the same Token DB are directly interchangeable.
A partial Token DB containing only the tokens not present in a referenced stable base. Enables self-contained corpus bundles without retransmitting vocabulary the recipient already holds.
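As a rough illustration of the partial-vocabulary idea, the sketch below collects only the tokens a bundle uses that the base vocabulary lacks. The function name `build_delta` and the use of plain Python sets and lists are assumptions for the example, not part of the Evermore specification.

```python
def build_delta(corpus_tokens, base_vocab):
    """Return the tokens used by the corpus that are absent from the base
    vocabulary, in first-seen order and without duplicates."""
    seen = set()
    delta = []
    for tok in corpus_tokens:
        if tok not in base_vocab and tok not in seen:
            seen.add(tok)
            delta.append(tok)
    return delta


base = {"the", "cat", "sat"}  # vocabulary the recipient already holds
bundle_tokens = ["the", "cat", "sat", "on", "the", "mat", "on"]
delta = build_delta(bundle_tokens, base)
```

Here only `on` and `mat` would need to travel with the bundle.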
The works as compressed token ID sequences. Pure content — no raw text, no bibliographic metadata. Each work is a flat byte stream of variable-width token IDs in source order.
The library catalog. Bibliographic data, provenance, storage statistics, and image position records for reconstruction. Queryable entirely independently of the Token DB and Corpus DB.
Content-addressed binary store for images, keyed by SHA-256 hash. Works imported without image processing have no entries here — a valid and non-corrupt state.
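A content-addressed store of this kind can be sketched in a few lines. The `ImageStore` class and its in-memory dictionary are hypothetical stand-ins; only the SHA-256 keying comes from the description above.

```python
import hashlib


class ImageStore:
    """In-memory stand-in for a content-addressed binary store."""

    def __init__(self):
        self.blobs = {}  # hex digest -> image bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        # Identical bytes always hash to the same key, so a repeated
        # image is stored only once.
        self.blobs.setdefault(key, data)
        return key

    def get(self, key: str) -> bytes:
        return self.blobs[key]


images = ImageStore()
k1 = images.put(b"\x89PNG fake image bytes")
k2 = images.put(b"\x89PNG fake image bytes")
```

An empty `images.blobs` is a perfectly valid state, matching the case of works imported without image processing.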
Derived entirely post-import by running analysis operations over the Corpus DB. Frequency counters, neighbor maps, and future analysis outputs. A semantic byproduct of the import architecture.
Inverted index mapping token IDs to locations across the corpus, enabling phrase search and prefix queries. Generated post-import, regenerable at any time, and optional for archival use.
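The phrase-search side of such an index can be sketched as follows. This is an illustrative layout only: the function names and the `(work_id, position)` posting format are assumptions, and prefix queries (which need the token strings) are left out.

```python
from collections import defaultdict


def build_index(works):
    """works: {work_id: [token_id, ...]} -> {token_id: [(work_id, pos), ...]}"""
    index = defaultdict(list)
    for work_id, ids in works.items():
        for pos, tid in enumerate(ids):
            index[tid].append((work_id, pos))
    return index


def phrase_search(index, phrase_ids):
    """Return (work_id, start) for every occurrence of the ID sequence."""
    if not phrase_ids:
        return []
    # Postings of the 2nd, 3rd, ... phrase tokens as sets for fast lookup.
    later = [set(index.get(tid, ())) for tid in phrase_ids[1:]]
    hits = []
    for work_id, pos in index.get(phrase_ids[0], ()):
        if all((work_id, pos + k + 1) in s for k, s in enumerate(later)):
            hits.append((work_id, pos))
    return hits


works = {"w1": [1, 2, 3, 1, 2], "w2": [2, 1, 2]}
index = build_index(works)
hits = phrase_search(index, [1, 2])
```

Because the index is built purely from the ID sequences, it can be discarded and rebuilt at any time.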
Tracks every document's in-flight state during import. The recovery record for the pipeline. Not backed up, not bundled. An empty Working DB after a successful import is the normal state.
Evermore's addressing scheme, SMA, comes close to the maximum information density a byte stream allows. By reserving exactly one byte value (0xFF) as an out-of-band stream terminator, SMA uses 255 content slots per table out of 256, so each byte carries log₂(255) ≈ 7.994 of a possible 8 bits: a loss of about 0.07% against the Shannon limit.
IDs are variable-width: a single byte addresses the first 255 tokens, two bytes the next 65,025 (65,280 in total), and three bytes the next 16.6 million. Token frequencies in a large corpus follow a Zipfian distribution, and the highest-frequency tokens receive the shortest IDs, so the compression compounds: savings concentrate on exactly the most-referenced vocabulary.
| Vocabulary Size | ID Length | Scale |
|---|---|---|
| ≤ 255 tokens | 1 byte | Tiny / embedded |
| ≤ 65,280 tokens | 2 bytes | GPT-scale BPE (~50k) |
| ≤ 16.6M tokens | 3 bytes | Large multilingual |
| ≤ 4.2B tokens | 4 bytes | Anna's Archive scale |
log₂(255) / log₂(256) ≈ 7.994 / 8 ≈ 0.9993
0xFF is the only byte value that never appears inside a valid content ID, which makes streams self-delimiting at zero structural cost.
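The capacity and density arithmetic can be checked with a short script. It assumes that every byte of an ID takes one of 255 values (0x00 to 0xFE) and that each ID width forms its own address space, so capacities add up across widths. The function names are illustrative, not taken from the SMA specification.

```python
import math

ALPHABET = 255  # byte values 0x00..0xFE; 0xFF is reserved as the terminator


def capacity(width):
    """Number of distinct IDs that are exactly `width` bytes long."""
    return ALPHABET ** width


def cumulative_capacity(width):
    """Number of tokens addressable with IDs of up to `width` bytes."""
    return sum(capacity(w) for w in range(1, width + 1))


def id_width(rank):
    """Bytes needed for the token at 0-based frequency rank `rank`."""
    width = 1
    while rank >= cumulative_capacity(width):
        width += 1
    return width


# Fraction of the 8 bits per byte that remains usable for content.
efficiency = math.log2(ALPHABET) / 8

for w in range(1, 5):
    print(w, "byte(s):", f"{cumulative_capacity(w):,}", "tokens")
```

Under these assumptions the two-byte tier ends at 255 + 255² = 65,280 tokens, and the token at rank 65,280 is the first to need three bytes.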
All optimization passes run after ingest. This is deliberate rather than a limitation: optimization requires corpus-wide frequency statistics, which exist only once ingest is complete. By the time the passes run, all content has been converted to compact integer sequences, so the inner loop is integer comparison, not string handling.
Format auto-detection, text extraction, tokenization, and resolution. Produces Token DB + Corpus DB in arrival-order ID space. The corpus is immediately usable.
A single pass over the Corpus DB builds a ranked (token → frequency) table. This is the prerequisite for all subsequent optimization.
Assigns the shortest IDs to the highest-frequency tokens. Both deduplication and ID compression now concentrate savings on exactly the same tokens — correlated gains on the Zipfian head.
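The frequency pass and the remap that follows it can be sketched together. This is a simplified model that treats IDs as plain integers, where a smaller integer stands for a shorter ID. The function names and the tie-breaking rule are assumptions for the example.

```python
from collections import Counter


def frequency_table(works):
    """Single pass over all ID sequences: token ID -> occurrence count."""
    counts = Counter()
    for ids in works:
        counts.update(ids)
    return counts


def build_remap(counts):
    """Map old IDs to new IDs so the most frequent token gets ID 0."""
    # Ties are broken by old ID so the result is deterministic.
    ranked = sorted(counts, key=lambda tid: (-counts[tid], tid))
    return {old: new for new, old in enumerate(ranked)}


def apply_remap(works, remap):
    """Rewrite every sequence in the new ID space."""
    return [[remap[t] for t in ids] for ids in works]


works = [[5, 9, 5, 7], [9, 5]]  # arrival-order IDs
counts = frequency_table(works)
remap = build_remap(counts)
remapped = apply_remap(works, remap)
```

Token 5 occurs three times and becomes ID 0. The sequences keep their length and order, so the remapped corpus is as reconstructable as before, provided the Token DB is updated with the same mapping.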
Identifies candidate bigrams and n-grams whose collapse would yield net storage savings. Thresholds are evaluated against actual post-remap ID costs, so the savings arithmetic is accurate and the commit decisions are correct.
Commits qualifying phrase IDs, physically shortening sequence files. Progressive: as the corpus deepens, more phrases cross the threshold. Compression improves organically with scale.
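The net-savings test for a bigram might look like the sketch below. The cost model is an assumption: a phrase costs one new ID per occurrence plus a one-time definition entry the size of the pair it replaces, and all candidates are priced at the width of the next free ID. Counts of overlapping pairs are not reconciled, so this only identifies candidates and does not perform the commit.

```python
from collections import Counter


def id_width(token_id):
    """Assumed cost model: 255 IDs fit in 1 byte, the next 255**2 in 2, etc."""
    width, limit = 1, 255
    while token_id >= limit:
        width += 1
        limit += 255 ** width
    return width


def bigram_candidates(works, next_free_id, min_saving=1):
    """Return [((a, b), count, net_saving_bytes), ...], best saving first."""
    counts = Counter()
    for ids in works:
        counts.update(zip(ids, ids[1:]))  # adjacent pairs
    phrase_cost = id_width(next_free_id)
    results = []
    for (a, b), n in counts.items():
        pair_cost = id_width(a) + id_width(b)
        # Bytes saved across all occurrences, minus the one-time cost of
        # storing the phrase definition (assumed equal to pair_cost).
        saving = n * (pair_cost - phrase_cost) - pair_cost
        if saving >= min_saving:
            results.append(((a, b), n, saving))
    return sorted(results, key=lambda r: -r[2])


works = [[0, 1, 0, 1, 0, 1, 300]]
candidates = bigram_candidates(works, next_free_id=2)
```

In this toy corpus only the pair `(0, 1)` clears the threshold. The pair `(1, 0)` occurs twice, which exactly pays for its definition and yields no net saving. As a corpus grows, more pairs accumulate enough occurrences to cross the line.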
Evermore is an active design-and-implementation project. The specification is mature — at version 17 — and the addressing scheme has reached formal specification at v1.2. The architecture is fully defined; implementation is underway.
The corpus is in a valid, lossless, fully reconstructable state at every point in the pipeline. Optimization is never a prerequisite for reconstruction — it is an improvement to compression efficiency on top of an already-correct system.