Engineering deep-dive
How Cairn works.
Every other geocoder is a stack: parse PBF, push into Postgres,
build an Elasticsearch index, layer libpostal on top, glue it together
with a Node service, deploy a daemon graveyard. Cairn is one binary
reading memory-mapped tile blobs. This page explains why
that's enough — and how each layer earns its place.
00
Vocabulary
Performance and indexing terms used throughout the rest of
this page. Skim once and move on; every term below shows up
repeatedly in the build / search / reverse / perf sections.
- p50 / p95 / p99 latency: Sort every request's response time
  low-to-high. p50 is the median: half of requests beat it. p99
  is the slowest-1 % cut-off: only 1 in 100 requests is worse.
  We report p99 because real systems are judged by their bad
  days, not their averages.
- RPS (requests per second): Throughput. Peak RPS is the
  highest sustained load the server handles before tail latency
  degrades past target. Reported alongside latency so a number
  can't lie about one without admitting the other.
- RSS (resident set size): The amount of physical RAM the
  process is actively using. Smaller is better for fitting on
  cheap hosts and packing multiple shards onto one box.
- mmap: Memory-mapping a file: ask the kernel to expose disk
  bytes as if they were already in RAM, with the OS paging them
  in on demand. Reads avoid a copy and benefit from the page
  cache; cold queries touch disk, hot ones don't.
- BM25: The classic relevance score for text search — like
  TF-IDF but with diminishing returns on term frequency.
  Tantivy (the search engine library Cairn uses) ranks results
  with BM25 by default.
- PIP (point-in-polygon): Given a coordinate and a set of
  polygons, decide which polygons contain the point. Reverse
  geocoding's core operation: "I'm at lat 47.14, lon 9.52 —
  what city is that inside?"
- k-NN (k nearest neighbors): Given a query point, return the
  k closest items. Used as a fallback for reverse geocoding
  when no polygon contains the query coordinate (e.g. you're a
  kilometre off the coast of Liechtenstein — return the closest
  few villages).
- R*-tree: A spatial index that organizes bounding boxes
  hierarchically so PIP and k-NN queries skip irrelevant
  regions instead of scanning every polygon. Used to sort tile
  envelopes, then to decide which tile blobs to mmap.
- tile: A bounded geographic cell. Cairn stores Place data
  partitioned per Valhalla-style tile (4° at level 0, 1° at
  level 1, 0.25° at level 2) so a query touches only the tiles
  its bounding box overlaps — not the whole planet.
- rkyv: A Rust serialization format that lays bytes out exactly
  as the program sees them in memory. Combined with mmap, it
  lets queries read structured data with zero parsing — the
  file is the in-memory representation.
- blake3: A modern cryptographic hash. The bundle's
  manifest.toml records a blake3 digest of every tile and index
  file, so verification is one fast pass over the directory.
- Place / AdminFeature / BuildingFootprint: The three on-disk
  row types. Place = point (city centre, POI, address).
  AdminFeature = polygon (country, region, district).
  BuildingFootprint = polygon + height (Microsoft Building
  Footprints output).
- gid / place_id: Two identifiers per Place. place_id is a
  packed u64 — cheap, but rebuilds change it. gid is a
  structured string of shape <source>:<type>:<id>
  (e.g. osm:way:12345) that survives bundle rebuilds. Bookmark
  the gid; reload via /v1/place?ids=….
01
Build pipeline
From a 4.7 GB Geofabrik PBF to a 1.5 GB on-disk bundle in
~8 minutes on a single laptop. No Postgres, no Elasticsearch,
no daemon. Three parallel passes, one tantivy commit, one
rkyv serialization, one ed25519 signature.
Stage 1 · Parse PBF
Cairn reads OSM PBF blocks via osmpbf's
BlobReader, bridged into a rayon worker pool with
par_bridge. Each PBF blob (~32 KB of compressed
protobuf) decodes in parallel; the per-blob workers emit
independent (Vec<Place>, Counters) tuples that
reduce into the global stream.
Why parallel decode matters: at planet scale PBF
decompression dominates wall-clock. Block-level parallelism
scales nearly linearly with rayon worker count.
Stage 2 · Node-coord cache (Phase 6f)
Way centroids and admin polygons need every referenced node's
(lon, lat). The cache strategy decides build RSS:
| Strategy | Storage | Per-entry | RSS at planet | When to pick |
| --- | --- | --- | --- | --- |
| inline | HashMap<i64, [f64;2]> | ~48 B | ~430 GB | ≤ 5 GB PBF |
| sorted-vec | Vec<(i64, [i32;2])> sorted, binary search | ~16 B | ~140 GB | 5–30 GB PBF |
| flatnode | mmap'd dense [i32;2] array, sparse file | 8 B / id slot | ~2–4 GB working set | ≥ 30 GB PBF |
Coord quantization to degrees * 1e7 is lossless at
OSM's native precision. Flatnode relies on POSIX sparse-file
semantics: set_len(N) reserves a 110 GB logical
range without allocating disk pages; only the slots actually
written are backed by physical storage. Switzerland's flatnode
logical size is 103 GB; real disk usage is 16 GB.
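A minimal std-only sketch of the flatnode idea: set_len reserves the logical range, positional writes fill only the slots that exist, and coordinates are quantized to degrees * 1e7. Type and method names here are illustrative, not Cairn's actual code.

```rust
use std::fs::OpenOptions;
use std::io::{Read, Seek, SeekFrom, Write};

/// Dense 8-byte slot per node id, backed by a sparse file.
struct FlatNodes {
    file: std::fs::File,
}

impl FlatNodes {
    fn create(path: &str, max_id: u64) -> std::io::Result<Self> {
        let file = OpenOptions::new()
            .read(true).write(true).create(true).truncate(true)
            .open(path)?;
        // Reserve the full logical range; unwritten pages stay
        // holes on POSIX filesystems and cost no disk blocks.
        file.set_len(max_id * 8)?;
        Ok(FlatNodes { file })
    }

    /// Store (lon, lat) quantized to degrees * 1e7 at the id's slot.
    fn put(&mut self, id: u64, lon: f64, lat: f64) -> std::io::Result<()> {
        let mut buf = [0u8; 8];
        buf[..4].copy_from_slice(&(((lon * 1e7).round() as i32).to_le_bytes()));
        buf[4..].copy_from_slice(&(((lat * 1e7).round() as i32).to_le_bytes()));
        self.file.seek(SeekFrom::Start(id * 8))?;
        self.file.write_all(&buf)
    }

    fn get(&mut self, id: u64) -> std::io::Result<(f64, f64)> {
        let mut buf = [0u8; 8];
        self.file.seek(SeekFrom::Start(id * 8))?;
        self.file.read_exact(&mut buf)?;
        let lon = i32::from_le_bytes(buf[..4].try_into().unwrap()) as f64 / 1e7;
        let lat = i32::from_le_bytes(buf[4..].try_into().unwrap()) as f64 / 1e7;
        Ok((lon, lat))
    }
}
```

The id-times-8 offset is the whole lookup: no hash, no binary search, one page fault at worst.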
Stage 3 · WayNodes pre-filter (Phase 6g)
Way centroid + admin polygon assembly need way_id →
Vec<node_id> for the small subset of ways that are
relation members. A pre-pass collects the relevant
HashSet<i64>; the way-place pass only
registers way_nodes for that set. Switzerland: 6.2 M ways
scanned, 790 K registered (87 % drop). At Europe scale the
difference is ~50 GB of saved RSS.
Stage 4 · Multipolygon ring assembly (Phase 7a-K)
Admin relations carry outer + inner ways. Cairn stitches
endpoints into closed rings, then validates and pairs them
into geo::Polygon values. Four invariants are
enforced:
- Consecutive-duplicate vertex collapse.
Way-stitching at a shared join node frequently produces
[…, A, A, …] sequences; the dedup pass
collapses them before the ≥4-distinct-vertex floor check.
- Self-intersection rejection. An
O(n²)
segment-pair check catches figure-8s from bad OSM edits or
stitching errors. geo::Polygon makes no
contains/centroid guarantee on self-intersecting input;
the offending ring is dropped (counter skipped_relation_invalid_ring).
- Ring orientation. OSM convention is
counter-clockwise outer + clockwise inner. The hardener
flips wrong-winding rings using
geo::Area::signed_area and increments a
rings_reoriented diagnostic counter.
- Smallest-enclosing outer. When multiple
outer rings contain an inner's representative point (rare
but real for nested admin relations), the inner binds to
the smallest enclosing outer instead of the first one
tested.
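The first invariant can be sketched in a few lines. The exact vertex floor Cairn enforces is its own; treat the 4-entry minimum here (a closed triangle) as an illustrative choice.

```rust
/// Collapse consecutive duplicate vertices produced by
/// way-stitching at shared join nodes (…, A, A, …).
fn collapse_consecutive(ring: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let mut out: Vec<(f64, f64)> = Vec::with_capacity(ring.len());
    for &v in ring {
        if out.last() != Some(&v) {
            out.push(v);
        }
    }
    out
}

/// A ring is usable only if, after dedup, it still has enough
/// entries to be a closed polygon — at minimum [A, B, C, A].
fn passes_vertex_floor(ring: &[(f64, f64)]) -> bool {
    collapse_consecutive(ring).len() >= 4
}
```

Dedup must run first: a stitched [A, B, B, A] looks like four entries but collapses to a degenerate three.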
Stage 5 · Parallel admin assembly (Phase 7a-pass-2b2)
Admin polygon assembly used to be a sequential
ElementReader::for_each walk pinned to a single
core. parallel_admin_relations now runs
block-level workers with their own per-tile local PlaceId
counters; collisions across blocks resolve via the same
deterministic renumber pass that the way- and node-place
passes use.
DE: pass 2b2 wall-clock ~120 s → ~17 s
(~7×). Switzerland: 6 s → 0.9 s.
Stage 6 · Tantivy index + tile blob write
Places stream into a 256 MiB-buffer
tantivy::IndexWriter (Phase 7a, was 64 MiB).
Larger buffer collapses 113 in-build segments to ~17 after
merge — fewer per-query segment iter ops at serve time.
writer.commit() is followed by
writer.wait_merging_threads() so the blake3
hash-pass that follows sees a stable on-disk segment set.
Place vectors get bucketed into tile coords
and serialized to ZSTD-compressed rkyv blobs (one per
non-empty (level, tile_id)). Admin layer is rkyv-archived
per-tile with mmap'd zero-copy reads at serve time. Point
layer stays bincode (rkyv was 68 % larger on String-heavy
points; the linear scan doesn't gain from zero-copy).
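The bucketing step can be sketched as a row-major grid id per level — Valhalla-style in shape, though Cairn's exact id packing may differ from this assumption.

```rust
/// Tile sizes per level, matching the 4° / 1° / 0.25° scheme.
const TILE_DEG: [f64; 3] = [4.0, 1.0, 0.25];

/// Row-major tile id over a [-180, 180] x [-90, 90] grid.
fn tile_id(level: usize, lon: f64, lat: f64) -> u32 {
    let size = TILE_DEG[level];
    let cols = (360.0 / size) as u32;
    let col = ((lon + 180.0) / size).floor() as u32;
    let row = ((lat + 90.0) / size).floor() as u32;
    row * cols + col
}
```

A query's bounding box maps to a small set of (level, tile_id) keys, and only those blobs are opened.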
Stage 7 · Manifest + signature + SBOM
manifest.toml records every artifact's blake3
hash, byte size, and place count. sbom.json
emits a CycloneDX 1.5 inventory of every Cargo.lock
package (purl + SHA-256) plus every input dataset. Optional
cairn-build sign appends an ed25519 detached
signature; cairn-build sign-verify fails non-zero
on tamper.
02
Search pipeline
A single /v1/search?q=… call traverses 8
progressively filtered passes — each layer is opt-in, each
protects p99, and the order is engineered to drop disqualified
candidates as early as possible.
query string
│
▼
[ trim + stop-word filter (E) ] drop interior "de", "la", "der"
│
▼
[ build_text_query ] fields: name | name_prefix | name_cjk | name_translit
│
├─ fuzzy>0 → FuzzyTermQuery + trigram pre-filter (A)
│ Must-AND: query trigrams ⊆ name_trigrams
│
▼
[ apply_phonetic_orclause ] DoubleMetaphone codes OR'd
[ apply_layer_filter ] kind allowlist
[ apply_categories_filter ] ?categories= AND
[ apply_valid_at_filter (Q) ] start_year ≤ Y ≤ end_year
│
▼
[ tantivy.search(combined, TopDocs) ] BM25 scoring on candidate set
│
▼
[ over-fetch K * 5 (rerank pool) ] so reranks promote without truncation
│
▼
[ rerank pipeline ]
│ ├─ exact_name_boost ×4.0
│ ├─ edit_distance_boost (N) Myers bit-parallel, linear decay
│ ├─ population_boost 1 + log10(1+pop) · 0.05
│ ├─ language_boost ?lang= match
│ ├─ semantic_cosine 32-dim char-trigram (or v2 ONNX if enabled)
│ └─ geo_bias 1 / (1 + focus_weight · km)
│
▼
[ sort by final_score, truncate ]
│
▼
[ format_label (admin_path → "Vaduz, Liechtenstein") ]
│
▼
JSON response
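The boost shapes named in the diagram compose multiplicatively over the BM25 base score. A reduced sketch covering three of the six stages, using the constants quoted above; the full pipeline also applies edit-distance, language, and semantic boosts.

```rust
/// Multiplicative rerank over a BM25 base score. Constants are
/// the ones quoted in the diagram; function shape is a sketch.
fn rerank_score(
    bm25: f32,
    exact_name: bool,
    population: u64,
    focus_weight: f32,
    km_from_focus: f32,
) -> f32 {
    // exact_name_boost ×4.0
    let exact = if exact_name { 4.0 } else { 1.0 };
    // population_boost 1 + log10(1+pop) · 0.05
    let pop = 1.0 + (1.0 + population as f32).log10() * 0.05;
    // geo_bias 1 / (1 + focus_weight · km)
    let geo = 1.0 / (1.0 + focus_weight * km_from_focus);
    bm25 * exact * pop * geo
}
```

Because every stage is a multiplier on a shared base, ?explain=true can report each factor independently.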
Phase 7a-A · Trigram pre-filter
FuzzyTermQuery walks tantivy's term dictionary
for every term within edit distance N of each query token.
At country / planet scale the term-dict scan dominates fuzzy
latency. Trigram pre-filter shrinks the candidate doc set
before BM25 scoring runs.
Each name variant is char-trigram extracted with
boundary-padded markers — vienna →
{·vi, vie, ien, enn, nna, na·}. Trigrams index
as a STRING multi-value field; the query-side filter is a
Should-OR over query trigrams, AND'd with the fuzzy clause.
Edit-distance-1 typos retain ≥ N-2 of N original trigrams
(pinned by a unit test) so recall is preserved.
Bypass for queries shorter than 4 chars — too few trigrams to
avoid false negatives on edit-distance-1 typos.
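Extraction itself is a sliding window over the boundary-padded, lowercased name; a sketch:

```rust
/// Boundary-padded char trigrams: "vienna" ->
/// ["·vi", "vie", "ien", "enn", "nna", "na·"].
/// Serve-side, names shorter than ~4 chars bypass the filter
/// entirely — too few trigrams to avoid false negatives.
fn name_trigrams(name: &str) -> Vec<String> {
    let padded: Vec<char> = std::iter::once('·')
        .chain(name.to_lowercase().chars())
        .chain(std::iter::once('·'))
        .collect();
    padded.windows(3).map(|w| w.iter().collect()).collect()
}
```

An edit-distance-1 typo disturbs at most three windows, which is why at least N-2 of N trigrams survive.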
Phase 7a-N · Myers bit-parallel Levenshtein
Eugene Myers' 1999 bit-parallel algorithm. For patterns up to
64 chars (covering ~99 % of geocoder names — country / city /
street / POI labels rarely exceed 50 chars), the inner loop
runs entirely in u64 register arithmetic: 4
bitwise ops per text char, no allocation, no branch.
Used as a multiplicative score boost in the rerank pipeline:
smaller edit distance ⇒ higher boost. Linear decay from
EDIT_DIST_BOOST_MAX = 1.6× at distance 0 down to
1.0× at distance ≥ 4. Wagner-Fischer fallback
handles patterns over 64 chars; differential property test
in the fuzz harness pins Myers and Wagner-Fischer agreement.
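The ≤64-char fast path can be sketched from the paper directly; this is the textbook formulation of Myers' bit-vector recurrence, not Cairn's source.

```rust
use std::collections::HashMap;

/// Myers (1999) bit-parallel Levenshtein distance for patterns
/// of up to 64 chars: a handful of bitwise ops per text char,
/// no allocation in the inner loop.
fn myers_distance(pattern: &str, text: &str) -> usize {
    let p: Vec<char> = pattern.chars().collect();
    let m = p.len();
    assert!((1..=64).contains(&m), "longer patterns fall back to Wagner-Fischer");
    // Per-character match bitmask: bit i set where pattern[i] == c.
    let mut peq: HashMap<char, u64> = HashMap::new();
    for (i, &c) in p.iter().enumerate() {
        *peq.entry(c).or_insert(0) |= 1u64 << i;
    }
    let (mut pv, mut mv): (u64, u64) = (!0, 0);
    let mut score = m;
    let high = 1u64 << (m - 1);
    for c in text.chars() {
        let eq = *peq.get(&c).unwrap_or(&0);
        let xv = eq | mv;
        let xh = (((eq & pv).wrapping_add(pv)) ^ pv) | eq;
        let ph = mv | !(xh | pv);
        let mh = pv & xh;
        if ph & high != 0 { score += 1; }
        if mh & high != 0 { score -= 1; }
        let ph = (ph << 1) | 1;
        let mh = mh << 1;
        pv = mh | !(xv | ph);
        mv = ph & xv;
    }
    score
}
```

The rerank boost then maps distance 0 → 1.6× down to distance ≥ 4 → 1.0× linearly.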
Phase 4a · Phonetic match (DoubleMetaphone)
?phonetic=true ORs an extra clause against a
STRING field of pre-encoded DoubleMetaphone codes. Each name
contributes its primary + alternate codes (ASCII-folded
first; CJK skipped). Recovers Smyth → Smith,
Mueller → Müller,
Smythsonian → Smithsonian, etc. — single-handedly
rescues 99 % of the noisy CH bench typo set.
Phase 4b · Semantic v1 (32-dim char-trigram)
At index time each name's trigrams are FNV-1a hashed into a
32-dim float vector, L2-normalized. Stored as
DIM*4 packed bytes per place. Query embedding
runs the same hashing function; cosine vs hit's vector
applies a thresholded multiplicative boost (≥ 0.35 sim →
up to 1.6× score). Catches morphological variants
(Vienna → Viennese) and partial matches that
fuzzy + phonetic don't reach.
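The v1 shape — hashed trigrams into a small dense vector, cosine on L2-normalized output — can be sketched in std-only Rust. The bucketing details below are assumptions; only the overall pipeline (FNV-1a over trigrams, 32 dims, normalize, cosine) is from the text.

```rust
const DIM: usize = 32;

/// FNV-1a 64-bit hash.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// Hash each boundary-padded trigram into one of 32 buckets,
/// accumulate counts, L2-normalize.
fn embed(name: &str) -> [f32; DIM] {
    let chars: Vec<char> = format!("·{}·", name.to_lowercase()).chars().collect();
    let mut v = [0f32; DIM];
    for w in chars.windows(3) {
        let tri: String = w.iter().collect();
        v[(fnv1a(tri.as_bytes()) % DIM as u64) as usize] += 1.0;
    }
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() { *x /= norm; }
    }
    v
}

/// Cosine similarity of two normalized vectors is their dot product.
fn cosine(a: &[f32; DIM], b: &[f32; DIM]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Morphological variants share most trigrams, so their vectors land close even when fuzzy matching's edit-distance budget is exhausted.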
Phase 7a-B · Semantic v2 (fastembed-rs ONNX)
Feature-flagged behind cargo build --features
semantic-onnx. Loads a multilingual MiniLM-class
model (MultilingualE5Small, 384-dim output) via
ONNX Runtime through fastembed-rs. Cosine reranks the
candidate set with real cross-language semantics —
Munich ↔ München ↔ Monaco di Baviera. WASM
target stays on v1 (ORT does not compile cleanly on wasm32).
03
Reverse pipeline
Default · admin PIP → nearest-K fallback
AdminIndex stores admin polygons in an R*-tree
of bounding boxes plus a per-tile rkyv blob. Query path:
bbox-prefilter for tiles intersecting the probe, then a
geo::Contains point-in-polygon test against
each candidate's polygon. Returns finest-first by admin
level (city → region → country).
When PIP returns nothing (probe falls outside every admin
polygon — ocean, disputed territory, gap in OSM coverage),
the fallback walks NearestIndex's R*-tree of
place centroids, ranks tile slots by squared distance, and
returns the K closest places by haversine.
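Stripped of the R*-tree slot ranking, the fallback's final step is a haversine sort over candidate centroids. The place tuples here are illustrative stand-ins for NearestIndex's slots.

```rust
/// Great-circle distance in km (haversine, Earth radius 6371 km).
fn haversine_km(a: (f64, f64), b: (f64, f64)) -> f64 {
    let (lat1, lon1) = (a.0.to_radians(), a.1.to_radians());
    let (lat2, lon2) = (b.0.to_radians(), b.1.to_radians());
    let dlat = lat2 - lat1;
    let dlon = lon2 - lon1;
    let h = (dlat / 2.0).sin().powi(2)
        + lat1.cos() * lat2.cos() * (dlon / 2.0).sin().powi(2);
    2.0 * 6371.0 * h.sqrt().asin()
}

/// Rank candidates by true haversine distance, keep the K closest.
fn nearest_k<'a>(
    probe: (f64, f64),
    places: &[(&'a str, f64, f64)],
    k: usize,
) -> Vec<(&'a str, f64)> {
    let mut ranked: Vec<(&'a str, f64)> = places.iter()
        .map(|&(name, lat, lon)| (name, haversine_km(probe, (lat, lon))))
        .collect();
    ranked.sort_by(|a, b| a.1.partial_cmp(&b.1).unwrap());
    ranked.truncate(k);
    ranked
}
```

The real index ranks tile slots by cheap squared distance first and only computes haversine on the survivors.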
Phase 7a-H · Context-aware reverse
?context=full returns a structured payload
combining four signals in one round-trip — no incumbent
ships this:
{
"lat": 47.142,
"lon": 9.521,
"admin": [
{ "name": "Vaduz", "level": 8, "distance_km": 0.05 },
{ "name": "Liechtenstein","level": 0, "distance_km": 5.20 }
],
"nearest_road": { "name": "Städtle", "distance_km": 0.04 },
"nearest_poi": { "name": "Vaduz Castle", "distance_km": 0.42 },
"nearest_address": { "name": "Städtle 28", "distance_km": 0.06 }
}
Implementation hooks
NearestIndex::nearest_k_filtered(probe, k, |p|
classify_kind(&p.kind) == Some(target)) three
times — once per class. Slot collection widens with the
predicate's selectivity (the more slots the filter rejects,
the more are gathered), so "nearest road" doesn't degenerate
when the closest tile slots happen to contain only POIs.
04
Augmenters
v0.3 introduced a second class of pipeline alongside the
ground-up build: augmenters rewrite an existing
bundle in place with new sources. Same on-disk layout, same
manifest, same signature contract — just additional layers
and richer Place tags. Two ship today: building footprints
(lane A) and Wikidata enrichment (lane I).
Lane A · Microsoft Building Footprints
~1.4 B polygons under ODbL on Source Cooperative. The
importer (cairn-import-buildings) reads
GeoParquet via parquet + arrow-array's
ParquetRecordBatchReaderBuilder and decodes WKB
polygon geometries with a hand-rolled cursor. Why hand-rolled:
the Place loader's decode_wkb_point only handles
type 1; polygons need type 3 (single ring iteration) and
type 6 (MultiPolygon recursion into N type-3 sub-polygons).
Both paths reject Z/M-flagged inputs (PostGIS
1000+type / 2000+type /
3000+type) outright — the per-vertex reader
assumes 16-byte (x, y) records, so silently accepting PolygonZ
would mis-align every coordinate after the first.
Per-row output: BuildingFootprint { id, centroid, bbox,
outer_ring, height }. Centroid is the area-weighted
shoelace centroid of the largest polygon's outer ring (or the
bbox center as fallback for degenerate rings). Holes are
dropped at parse time — they appear in < 0.1 % of rows in
the wild, and keeping them would roughly double the on-disk
footprint for marginal geometry fidelity.
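The centroid rule — shoelace centroid with a bbox-center fallback for degenerate rings — sketches as follows; this is the textbook formula, not the importer's source.

```rust
/// Area-weighted (shoelace) centroid of a closed ring, with a
/// bbox-center fallback when the signed area is ~zero.
fn ring_centroid(ring: &[(f64, f64)]) -> (f64, f64) {
    let mut a2 = 0.0; // twice the signed area
    let (mut cx, mut cy) = (0.0, 0.0);
    for w in ring.windows(2) {
        let ((x0, y0), (x1, y1)) = (w[0], w[1]);
        let cross = x0 * y1 - x1 * y0;
        a2 += cross;
        cx += (x0 + x1) * cross;
        cy += (y0 + y1) * cross;
    }
    if a2.abs() < 1e-12 {
        // Degenerate ring: fall back to the bbox center.
        let (mut min_x, mut min_y) = (f64::MAX, f64::MAX);
        let (mut max_x, mut max_y) = (f64::MIN, f64::MIN);
        for &(x, y) in ring {
            min_x = min_x.min(x); min_y = min_y.min(y);
            max_x = max_x.max(x); max_y = max_y.max(y);
        }
        return ((min_x + max_x) / 2.0, (min_y + max_y) / 2.0);
    }
    (cx / (3.0 * a2), cy / (3.0 * a2))
}
```

Unlike a plain vertex average, the shoelace centroid isn't biased toward densely-digitized edges.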
Per-tile BuildingLayer writes mirror the existing
AdminLayer rkyv pattern: 16-byte aligned header
+ rkyv body, magic CRBL. Replication factor is
~1.0 (most buildings are smaller than a tile by orders of
magnitude). Runtime BuildingIndex is an R*-tree
of tile envelopes with an LRU cache (default 256 entries,
smaller than the admin/nearest caches' 1024, because building
tiles are denser and pinning every loaded tile would balloon
RSS at planet scale). Two query modes: at(coord) returns
rooftop-containing buildings finest-first by area;
nearest_k(coord, k) ranks tile slots by
bbox-distance, gathers k * 4 candidates, and
re-sorts globally.
Containment is a three-stage filter. Stage 1: tile R*-tree
narrows to per-tile candidates. Stage 2: per-feature bbox
check (min_lon ≤ x ≤ max_lon, same for y) rejects
the dense majority — building bboxes are tight enough that
the bbox alone would already be ~70-90 % accurate at urban
density. Stage 3: Sunday-style ray-cast point-in-polygon on
the closed outer_ring confirms true rooftop
containment, catching the 10-30 % of L- / U-shaped footprints
where the bbox covers an inner notch the polygon doesn't.
The bbox-only fast path is exposed as at_bbox(coord)
/ ?strict=false for callers that already plan to
refine downstream.
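Stage 3 can be sketched with a textbook even-odd ray cast; the importer's exact edge-case handling may differ.

```rust
/// Even-odd ray-cast containment against a closed outer ring.
/// Casts a horizontal ray to +x and counts edge crossings.
fn point_in_ring(pt: (f64, f64), ring: &[(f64, f64)]) -> bool {
    let (px, py) = pt;
    let mut inside = false;
    for w in ring.windows(2) {
        let ((x0, y0), (x1, y1)) = (w[0], w[1]);
        // Does the ray's y sit strictly between the edge endpoints?
        if (y0 > py) != (y1 > py) {
            let x_cross = x0 + (py - y0) / (y1 - y0) * (x1 - x0);
            if px < x_cross {
                inside = !inside;
            }
        }
    }
    inside
}
```

This is exactly the L-/U-shape case: a point in the notch passes the stage-2 bbox check but fails the ring test.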
Spatial join during cairn-build augment --buildings
walks every Place tile, queries the in-memory BuildingIndex
for each Place's centroid, and stamps building_id
+ building_height tags on hits. Idempotent —
the apply step dedups (key, value) before
insertion, so re-running on the same data leaves Place
tiles byte-identical.
Lane I · Wikidata enrichment
The canonical Wikidata JSONL dump is ~120 GB compressed.
Streaming it once is cheap; streaming it twice or holding
even 1 % of it in memory is not. The augmenter
(cairn-augment-wikidata) does a single
memory-bounded pass with two stages:
- Pass 1 — collect Q-ids. Walk every place
tile, scan Place.tags for
wikidata=Q… entries, deposit into a
HashSet<String>. OSM imports already
carry the tag for ~1-3 M Places per planet bundle. RSS
stays at ~30-200 MB depending on coverage.
- Pass 2 — stream the dump.
BufReader with an 8 MB buffer wraps a
GzDecoder (or a plain File when the
dump is .json rather than .json.gz).
For each line, a substring pre-filter looks for
"id":"Q…" near the front — entire entity bodies
are skipped without invoking serde_json when
the Q-id isn't in the wanted set. With ~3 M tracked Q-ids
out of ~100 M entities total, only ~3 % of lines hit the
full parser, saving ~40 minutes of wall-clock on a planet dump.
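The pre-filter amounts to a bounded string search before any JSON parsing; a sketch, where the 256-byte head window is an illustrative choice, not Cairn's constant:

```rust
use std::collections::HashSet;

/// Pull the entity id out of the head of a Wikidata JSONL line
/// with plain string search; only hand the line to a full JSON
/// parser when the id is in the wanted set.
fn wanted_qid<'a>(line: &'a str, wanted: &HashSet<&str>) -> Option<&'a str> {
    // The id sits near the front of each line in the dump;
    // ignore matches deeper in the (possibly multi-KB) body.
    let start = line.find("\"id\":\"Q").filter(|&i| i < 256)? + 6;
    let end = start + line[start..].find('"')?;
    let qid = &line[start..end];
    wanted.contains(qid).then_some(qid)
}
```

A caller would invoke serde_json only when this returns Some, which is the ~3 % path.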
Distill output per matched entity: labels by language,
aliases by language, and a curated set of cross-ref claims
(P1566 GeoNames ID, P300 ISO 3166-2, P901 FIPS, P131 admin
parent, P31 instance-of). The P31 list is conservatively
mapped to PlaceKind via a static table —
Q515 → City, Q484170 → County, Q34442 → Street, etc. The
table only covers ~25 well-known Q-ids on purpose:
Wikidata's P31 graph has > 100 K types and a wider table
would let typo-Q-ids reclassify Places.
Apply step rewrites only modified Place tiles, preserving
each tile's existing compression scheme (zstd or raw).
Aliases land under lang = "<canonical>_alt"
so they remain searchable but stay separable from the
canonical label for future per-language ranking weights.
PlaceKind promotion is one-way: Places already
classified more specifically by upstream (e.g. an OSM admin
polygon → Country) are trusted; we never demote.
Idempotency contract
Both lanes guarantee a re-run on the same inputs produces a
byte-identical bundle. The augmenter dedups labels by
(lang, value) and tags by
(key, value) before insertion, hashes the
per-tile content signature pre/post, and skips the
encode_tile + fs::write roundtrip
when the signature matches. Re-running
cairn-build augment on a fully-augmented bundle
is a no-op — same blake3, same byte_size, same signature.
05
Bundle format
Everything Cairn needs to serve queries lives in one
directory. No catalog database, no service registry, no
runtime config-fetch. The format is content-addressed,
verifiable, and reproducible.
<bundle>/
├── manifest.toml schema_v4, blake3 of every artifact, bundle_id
├── manifest.toml.sig ed25519 detached signature (optional)
├── sbom.json CycloneDX 1.5 — Cargo.lock + datasets
├── tiles/
│ └── <level>/<id>.bin rkyv-archived Place vector + ZSTD
├── index/
│ └── text/ tantivy segments + admin_names sidecar
│ (gid field indexed for /v1/place lookups)
└── spatial/
├── admin/<tile>.rk rkyv AdminFeature, magic CRAD, mmap zero-copy
├── points/<tile>.bin bincode PlacePoint, linear scan
└── buildings/<tile>.bin rkyv ArchivedBuilding, magic CRBL (v0.3 lane A)
Tile blobs are 16-byte aligned with a stable header that
carries the rkyv archive size + payload offset. Read path
mmaps the file, validates via rkyv::check_archived_root,
then hands an Arc<AdminTileArchive> to the
per-tile slot. PIP iterates the archived form directly via
pip_archived — 128 ns inside-bbox, 2 ns
outside-bbox (the bbox prefilter rejects in cache).
Building tiles share the rkyv-archive pattern with a distinct
magic (CRBL) so the loader can't confuse them
with admin tiles, and a per-feature bbox so the runtime
BuildingIndex.at(coord) can prefilter without
touching ring vertices for the 99 % of buildings whose bboxes
miss the query.
Each Place carries a stable global identifier (gid)
tag of shape <source>:<type>:<id>
(osm:way:12345,
wof:locality:101748479,
overture:place:08bb…) stamped at import time
by the source-specific importer. Sources without an upstream
stable id (OpenAddresses, generic CSV) get a deterministic
hash of (kind, normalized name, ~100 m centroid)
so bookmarks survive rebuilds. The text index registers
gid as a STRING (whole-token) field for direct
TermQuery lookup via /v1/place?ids=….
Manifest schema is at version 4 as of v0.3
(the bump signals the additive building_tiles
field). v3 readers tolerate v4 manifests via
#[serde(default)]; any future breaking change
bumps the version and ships a reader for both shapes.
06
Differentiators
Federated multi-bundle (Phase 2e)
cairn-serve --bundles a/,b/,c/ loads each
bundle independently and wraps the indices in
FederatedText /
FederatedAdmin /
FederatedNearest. Queries fan out, hits merge
by score; PIP fans out, sorts finest-first; nearest-K
fans out, re-sorts by haversine. Each Hit's
admin_names label renders inside its own
bundle before merge so Place-ID collisions across bundles
don't pollute labels.
Operational pattern: split a planet into continental
shards without standing up multiple processes.
Hot reload (Phase 7a-U)
POST /admin/reload swaps to a new bundle in
place. AppState's mutable surface is backed
by arc_swap::ArcSwapOption so installs are a
single atomic store. In-flight requests finish on the
previous snapshot they captured at handler entry; new
requests pick up the new bundle on next call.
No process restart. No dropped connections. K8s rolling
update without rolling.
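arc_swap isn't needed to see the snapshot semantics; a std-only sketch with RwLock<Arc<…>> shows the same capture-at-entry pattern (the real load via ArcSwapOption is lock-free, and Bundle here is a stand-in type):

```rust
use std::sync::{Arc, RwLock};

/// Stand-in for a loaded bundle.
struct Bundle {
    name: String,
}

struct AppState {
    current: RwLock<Arc<Bundle>>,
}

impl AppState {
    /// Handler entry: capture a snapshot. The request keeps using
    /// it even if a reload lands mid-flight.
    fn snapshot(&self) -> Arc<Bundle> {
        self.current.read().unwrap().clone()
    }

    /// POST /admin/reload: install the new bundle atomically;
    /// old Arc is dropped when the last in-flight request finishes.
    fn install(&self, next: Bundle) {
        *self.current.write().unwrap() = Arc::new(next);
    }
}
```

The invariant is that a request never observes a half-installed bundle: it sees the whole old snapshot or the whole new one.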
Time-aware names (Phase 7a-Q)
?valid_at=YYYY filters by OSM
start_date / end_date tags.
Year-only parse handles plain digits, ISO dates, BC suffix
("44 BC"), and qualifier prefixes
("before 1989", "circa 1850").
Sentinels for unbounded validity. RangeQuery filter
joined with the rest of the search clause.
?valid_at=1939 resolves Königsberg;
?valid_at=1980 resolves Kaliningrad.
No incumbent ships temporal validity at all.
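The year-only parse can be sketched as prefix/suffix stripping around a year component. The qualifier list is as quoted above; the handling details below are illustrative.

```rust
/// Year-only parse for OSM start_date / end_date shapes:
/// "1939", "1939-09-01", "44 BC", "before 1989", "circa 1850".
/// BC years come back negative.
fn parse_year(raw: &str) -> Option<i32> {
    let s = raw.trim().to_lowercase();
    let s = s.strip_prefix("before ")
        .or_else(|| s.strip_prefix("circa "))
        .unwrap_or(s.as_str());
    let bc = s.ends_with(" bc");
    let s = s.strip_suffix(" bc").unwrap_or(s);
    // First dash-delimited component of an ISO date is the year.
    let year_part = s.split('-').next()?;
    let y: i32 = year_part.trim().parse().ok()?;
    Some(if bc { -y } else { y })
}
```

The filter then keeps a hit when start_year ≤ Y ≤ end_year, with sentinels standing in for unbounded sides.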
WASM autocomplete (Phase 4b)
cairn-wasm compiles to a ~250 KB
cdylib for wasm32-unknown-unknown.
Exposes Autocompleter::new(Vec<u8>) +
complete(prefix, limit) → Vec<String>
driven by FST prefix iteration. Tantivy is excluded from
the WASM path so the bundle stays small.
Use cases: country-bundle splash autocomplete, embedded
form widgets, PWA / offline-first apps.
Signed bundles + SBOM (Phase 4c, 4d)
cairn-build sign emits an ed25519 detached
signature over manifest.toml; the manifest
already carries blake3 of every artifact, so signing the
manifest signs the whole bundle. Pure-Rust ed25519-dalek
keeps static-musl self-contained.
sbom.json ships a CycloneDX 1.5 inventory.
/v1/sbom serves the file with
application/vnd.cyclonedx+json so
dependency-track / grype / cyclonedx-cli pick it up
without sniffing.
Stable global identifier (v0.4)
Two identifiers per Place — both intentional, both useful:
- place_id: u64 — the bundle-local packed id
(level: 3 | tile: 22 | local: 39). Cheap
to hash, suitable for in-process maps. Not stable across
rebuilds because local is a per-tile insertion
counter.
- gid: String — global identifier of shape
<source>:<type>:<id>
(osm:way:12345,
wof:locality:101748479,
overture:place:08bb…). Built at import time
from upstream stable IDs; sources without one
(OpenAddresses, generic CSV) get a deterministic hash of
(kind, name, ~100 m centroid). Survives
rebuilds + federation hops.
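The packed layout splits the u64 as 3 + 22 + 39 bits; a sketch assuming level occupies the top bits (the actual field order is Cairn's internal detail):

```rust
/// Pack level (3 bits) | tile (22 bits) | local (39 bits) into a u64.
fn pack_place_id(level: u64, tile: u64, local: u64) -> u64 {
    assert!(level < (1u64 << 3));
    assert!(tile < (1u64 << 22));
    assert!(local < (1u64 << 39));
    (level << 61) | (tile << 39) | local
}

/// Recover (level, tile, local) from a packed id.
fn unpack_place_id(id: u64) -> (u64, u64, u64) {
    (
        id >> 61,
        (id >> 39) & ((1u64 << 22) - 1),
        id & ((1u64 << 39) - 1),
    )
}
```

Because local is just an insertion counter, two builds of the same PBF can pack the same Place differently — hence the gid.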
Stamped on a reserved gid tag at import time,
indexed as a tantivy STRING field, surfaced
on every Hit and resolvable via
/v1/place?ids=…. Format converges with what
several open-source geocoders already emit, so existing
client code that already deserializes a gid
field works unchanged.
Query explain (Phase 4e)
?explain=true populates
Hit.explain with bm25,
exact_match_boost,
population_boost,
language_boost, geo_bias,
final_score. Each rerank stage records its
multiplier in place. Off by default;
skip_serializing_if = Option::is_none keeps
payloads lean for normal callers.
Useful for debugging ranking surprises and for clients
that want to surface "why this hit ranked here". No
incumbent ships this.
07
Performance characteristics
Microbench (Apple Silicon, criterion)
| Operation | Mean | Notes |
| --- | --- | --- |
| myers/short identical | ~80 ns | 5-char same-string |
| myers/short typo | ~120 ns | 1 substitution |
| myers/64-char boundary | ~250 ns | fast-path tail |
| wagner_fischer/short typo | ~150 ns | O(m·n) reference |
| trigram_indexed/short | ~350 ns | 5-char Latin |
| trigram_indexed/CJK | ~800 ns | UTF-8 walk + alphanumeric filter |
| pip_archived/inside | ~94 ns | rkyv zero-copy ring scan |
| pip_archived/outside-bbox | ~1.5 ns | bbox prefilter rejects in cache |
| nearest_k=10/4096pts | ~375 µs | R*-tree slot rank + linear |
Live numbers in benchmarks/perf-baseline.json;
a CI workflow gates regressions beyond +15 % per benchmark.
End-to-end (Switzerland)
| Metric | Cairn | Pelias | Nominatim | Photon |
| --- | --- | --- | --- | --- |
| Build wall | 24 s | 3 m 46 s | 3 h 13 m | 2 m 1 s |
| Bundle disk | 195 MB | 3.5 GB | 9.2 GB | 1.3 GB |
| Hot RSS | 80 MB | 2.7 GB | 2.4 GB | 2.1 GB |
| p50 | 0.51 ms | 13.76 ms | 9.51 ms | 5.88 ms |
| p99 | 0.74 ms | 57.23 ms | 23.00 ms | 25.18 ms |
| Peak RPS | 57 554 | 362 | 1 109 | 2 406 |
Recall on noisy queries (1 153 typo / fold variants)
| Configuration | Hits | Recall |
| --- | --- | --- |
| baseline | 253 / 1 153 | 21.9 % |
| ?fuzzy=1 | 865 / 1 153 | 75.0 % |
| ?phonetic=true | 1 142 / 1 153 | 99.0 % |
| ?semantic=true | 253 / 1 153 | 21.9 % |
| all flags on | 1 153 / 1 153 | 100.0 % |
DoubleMetaphone single-handedly rescues 99 % of typos.
Semantic boost is for morphological variants (Vienna →
Viennese) and doesn't fire on character-level
perturbations. No incumbent geocoder ships a
?phonetic= toggle today.