A workflow for identifying features in AI hyperspace — uniting Goodchild's general theory of geographic representation, the IEEE P2874 Universal Domain Graph requirements, and the convergent-features evidence from modern mechanistic interpretability research.
Artificial intelligence is changing the definition of space. The same mathematical machinery that embeds words, pixels, molecules, and locations in high-dimensional vector spaces is now used to embed the world itself — a development the IEEE P2874 Spatial Web community calls World-to-Vec.
This page synthesizes three threads. First, the foundations of spatial embedding in modern AI. Second, the IEEE / Spatial Web Foundation Universal Domain Graph (UDG) requirements for Hyperspace Reference Systems and entity embeddings. Third, Goodchild, Yuan and Cova's general theory of geographic representation — which we propose as the appropriate ontological lens for features in AI hyperspace.
Recent AI research — the Platonic Representation Hypothesis, the Linear Representation Hypothesis, and the scaling of sparse autoencoders to frontier models — demonstrates that independently trained models converge on common features. These features, accessible as directions or sparse latents in hyperspace, are the AI analog of geo-objects as defined by Goodchild. Treating them as first-class Spatial Web entities is what makes World-to-Vec engineerable.
Part 1 · Theoretical Basis
Five strands of theory converge to give us the basis for feature identification in AI hyperspace.
Word2Vec showed that meaning has geometric structure: high-dimensional vectors whose offsets capture syntactic and semantic regularities. The textbook example, King − Man + Woman ≈ Queen, remains the canonical demonstration. Every modern foundation model rests on this primitive.
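The offset arithmetic can be reproduced in a few lines. A minimal sketch, assuming nothing beyond numpy: the 4-dimensional "embeddings" below are hand-picked so the regularity holds, standing in for real learned word2vec vectors (typically 100–300 dimensions).

```python
import numpy as np

# Toy embeddings chosen so that the offset king - man ≈ queen - woman holds.
# Real word2vec vectors are learned from corpus co-occurrence, not hand-set.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.6]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9, 0.1]),
    "queen": np.array([0.9, 0.1, 0.9, 0.6]),
    "river": np.array([0.0, 0.2, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The classic analogy: king - man + woman should land nearest queen.
target = emb["king"] - emb["man"] + emb["woman"]
nearest = max((w for w in emb if w not in ("king", "man", "woman")),
              key=lambda w: cosine(target, emb[w]))
print(nearest)  # queen
```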
PyTorch-BigGraph — internally named World2Vec — extended the idea to entities of any kind. The UDG Functional Requirements treat hyperspace as the vector-space version of the world: every Spatial Web ENTITY (actor, activity, domain, place, norm) has an embedding. Reasoning, retrieval and governance operate on those embeddings.
Computation now routinely occurs in spaces of extraordinarily high dimension whose structure is not tied to any physical coordinate frame. Video and agency bring time into the same hyperspace, with actors and activities as first-class objects. Tobler's First Law generalizes to telecoupling — relatedness in hyperspace independent of Earth-surface distance.
Independently trained models discover the same features. The Platonic Representation Hypothesis (Huh et al. ICML 2024) shows representational alignment grows with scale and task diversity. The Linear Representation Hypothesis (Park, Choe, Veitch ICML 2024) shows concepts live as linear directions. Scaling Monosemanticity (Anthropic 2024) extracts tens of millions of features from Claude 3 Sonnet.
Goodchild's atomic form, the geo-atom <location, property, value>, produces geo-fields (continuous properties) and geo-objects (aggregations of points satisfying membership rules). The phase-space construction c(x) = f(z₁,…,zₘ) over m fields induces bona fide objects. This is exactly the construction AI feature identification performs: an embedding space is a phase space; its coherent regions are features.
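The field-to-object move is mechanical enough to sketch directly. Below, two synthetic geo-fields (the elevation and moisture grids are illustrative values, not real data) are classified through a phase-space rule and the classified cells aggregated into discrete geo-objects by 4-connectivity:

```python
import numpy as np
from collections import deque

# Two synthetic geo-fields on a 6x6 grid (values are illustrative).
elevation = np.array([
    [80, 80, 40, 40, 80, 80],
    [80, 80, 40, 40, 80, 80],
    [80, 80, 80, 80, 80, 80],
    [80, 30, 30, 80, 80, 80],
    [80, 30, 30, 80, 80, 80],
    [80, 80, 80, 80, 80, 80],
])
moisture = np.full((6, 6), 0.8)

# Phase-space classification c(x) = f(z1, z2): a cell is "wetland"
# where elevation is low and moisture is high.
wetland = (elevation < 50) & (moisture > 0.6)

def connected_components(mask):
    """Aggregate adjacent classified cells into discrete geo-objects."""
    labels = np.zeros(mask.shape, dtype=int)
    count = 0
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            if mask[i, j] and labels[i, j] == 0:
                count += 1
                queue = deque([(i, j)])
                labels[i, j] = count
                while queue:  # flood fill one geo-object
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1]
                                and mask[ny, nx] and labels[ny, nx] == 0):
                            labels[ny, nx] = count
                            queue.append((ny, nx))
    return labels, count

labels, n_objects = connected_components(wetland)
print(n_objects)  # 2 — two distinct wetland geo-objects emerge from the fields
```

Replace the grid with an activation matrix and the threshold rule with a learned direction, and the same pipeline is SAE feature discovery.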
An HRS is the AI analog of a Coordinate Reference System: a Coordinate System (CS — dimensions, units) plus a Datum that anchors the abstract CS to real-world semantics. Once an HRS is defined, Hyperspace Entity Locators (HELs) address features the way URLs address documents.
Side by side
GIS concepts and their AI-hyperspace counterparts, mapped one-to-one following Goodchild's general theory and the UDG requirements.
| Concept | Geographic Information Systems | AI / Spatial Web hyperspace |
|---|---|---|
| Atomic form | Geo-atom: <x, property, value> on Earth surface | Activation tuple: <v, property, value> in hyperspace |
| Reference frame | Coordinate Reference System (CS + Datum, e.g. WGS-84) | Hyperspace Reference System (CS + semantic datum) |
| Locator | URL, URI, geo-URI | Hyperspace Entity Locator (HEL) |
| Continuous representation | Geo-field (six standard discretizations) | Embedding manifold; activation field over data |
| Discrete representation | Geo-object / Feature (OGC Simple Features) | Sparse-autoencoder feature; linear concept direction |
| Aggregation principle | Tobler's First Law (proximity → similarity) | Platonic / distributional hypothesis (training proximity → representational proximity) |
| Phase-space construction | c(x) = f(z₁, …, zₘ) over m fields | Concept = partition of activation space along learned directions |
| Boundary indeterminacy | Membership function m(x); fuzzy classes | Polysemantic neurons; sparse-feature activation strength |
| Catalog standard | ISO 19110 Feature Catalog; OGC API – Features | (Emerging) SAE feature catalogs, Neuronpedia, Embedding Atlas |
Part 2 · Tools, Datasets, Models
Feature identification draws on two complementary toolchains: mechanistic interpretability from AI safety, and geospatial foundation models from GeoAI.
| Tool / method | What it does | Theory fit |
|---|---|---|
| Sparse Autoencoders (SAE) | Decompose model activations into sparse, monosemantic latents. Foundational to Anthropic's Scaling Monosemanticity. | Strong. Directly produces feature candidates — the AI analog of geo-objects. |
| TransformerLens | Library for accessing internal activations; standard substrate for SAE and circuit work. | Strong. Provides the activation tuples that SAEs and probes operate on. |
| Neuronpedia | Open platform hosting >50M SAE latents across many models, with autointerp explanations, search, steering, circuit-tracing demos. | Strong. A working analog of an ISO 19110 feature catalog for hyperspace. |
| Goodfire (Ember API) | Commercial platform turning SAE features into steering / analysis for production models. Backed by Anthropic. | Strong as a production realization; closed parts limit fully open verification. |
| Embedding Atlas (Apple, 2025) | Open-source interactive viewer for million-point embeddings with density clustering, automated labels, metadata filtering. | Strong for human-in-the-loop cartographic visualization (UDG §6.2.2). |
| UMAP / t-SNE / PCA | Dimensionality reduction for visual exploration and preprocessing. | Moderate. Useful but lossy; should never be the sole basis for feature definition. |
| Linear probes | Train a linear classifier from activations to a labeled concept. | Strong validator under the Linear Representation Hypothesis. |
| CKA / SVCCA / Model stitching | Quantify representational similarity between models; test whether one model's layer substitutes for another's. | Strong. Empirical evidence for or against convergence on each candidate feature. |
| Natural Language Autoencoders | Map activations to natural language and back; verbalize internal states. | Useful for automated labeling of discovered features. |
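As a concrete instance of the linear-probe row above, a sketch with synthetic activations and a planted concept direction (all data here is simulated; real probes run on activations captured via TransformerLens or similar):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n = 32, 2000

# Synthetic "activations" with a planted linear concept direction,
# as the Linear Representation Hypothesis posits for real models.
concept = rng.normal(size=d_model)
concept /= np.linalg.norm(concept)
acts = rng.normal(size=(n, d_model))
labels = acts @ concept > 0                 # ground-truth concept labels

# Linear probe: least-squares weights from activations to +/-0.5 targets.
w, *_ = np.linalg.lstsq(acts, labels.astype(float) - 0.5, rcond=None)
acc = float(((acts @ w > 0) == labels).mean())

# The recovered probe direction should align with the planted concept.
alignment = float((w / np.linalg.norm(w)) @ concept)
print(round(acc, 2), round(alignment, 2))
```

High probe accuracy plus high alignment is exactly the validator role the table assigns: the probe recovers the planted direction from activations alone.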
| Model / library | What it provides | Theory fit |
|---|---|---|
| Prithvi (NASA / IBM) | Earth Observation foundation model. Variants Prithvi-EO (6-band multispectral) and Prithvi-WxC (weather/climate). 100M – 2B parameters. | Strong. Embeddings of Earth tiles are simultaneously geo-objects and hyperspace features. |
| SatMAE / SatMAE++ / SpectralGPT / CROMA / SkySense | Self-supervised vision transformers for satellite imagery; spectral and temporal awareness. | Strong for raw embedding generation; less interpretable out-of-the-box. |
| SatCLIP (Microsoft Research) | Global, general-purpose location encoder. Contrastively aligns satellite imagery with coordinates using spherical-harmonics encoding. | Strong. A direct World-to-Vec primitive — raw location into a vector usable downstream. |
| GeoCLIP | CLIP-based location-image alignment for geo-localization. Random Fourier feature encoder. | Moderate. Tuned for geo-localization; less general than SatCLIP. |
| MOSAIKS / CSP / S2Vec | Alternative location/image embedding methods. S2Vec (2025) is self-supervised geospatial. | Moderate. Useful baselines and ablation comparators. |
| TorchGeo | PyTorch domain library: geospatial datasets, samplers, transforms, pretrained models. | Strong foundation for any geospatial embedding pipeline. |
| TerraTorch (IBM, 2025) | Fine-tuning toolkit for geospatial foundation models on TorchGeo + PyTorch Lightning. HPO, benchmarking, full workflows. | Strong for operationalizing GeoFMs. |
| GeoAI Python package | Higher-level wrapper that generates geospatial embeddings via TorchGeo foundation models for similarity search, clustering, change detection. | Strong as an applied-side runtime. |
| H3 (Uber) / DGGS | Hexagonal hierarchical spatial index; basis for the OGC AI-DGGS Disaster Pilot (2025) exposing DGGS-indexed data via OGC APIs for AI agents. | Strong as a cellular indexing layer beneath any HRS — directly cited in UDG §6.1. |
Part 3 · Recommended Workflow
Nine stages that operationalize World-to-Vec. Each stage produces an output that maps to a concept in either GIS practice (after Goodchild) or the Spatial Web UDG.
Stage 1 · Scope. Define the entities, properties, and relations of interest. State the use case and the questions discovered features must answer. In UDG terms, choose which ENTITIES will be embedded; in Goodchild terms, choose which geo-atom properties matter.
Stage 2 · Ingest. Assemble data: text corpora, knowledge graphs, satellite imagery, sensor streams, OSM. Normalize, deduplicate, and index. For geographic data, index with H3 or another DGGS so the cellular structure of UDG §6.1 is preserved end-to-end.
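In production this step would call Uber's H3 directly (h3-py v4's `latlng_to_cell`). To keep the sketch dependency-free, the stand-in below implements the same prefix-hierarchical idea with Bing-style quadkeys: nearby points share a cell-id prefix, distant points do not, which is the property the end-to-end indexing relies on.

```python
import math

def latlng_to_quadkey(lat, lng, level):
    """Map a WGS-84 point to a hierarchical cell id (Bing-style quadkey).

    A stand-in for a real DGGS such as Uber's H3; same principle:
    shorter prefixes name coarser cells, so prefix matching groups
    spatial neighbors at every resolution."""
    lat = max(min(lat, 85.05112878), -85.05112878)  # Web-Mercator clamp
    x = (lng + 180.0) / 360.0
    s = math.sin(math.radians(lat))
    y = 0.5 - math.log((1 + s) / (1 - s)) / (4 * math.pi)
    n = 1 << level
    tx = min(n - 1, max(0, int(x * n)))
    ty = min(n - 1, max(0, int(y * n)))
    key = ""
    for i in range(level, 0, -1):  # interleave x/y bits, high to low
        digit = 0
        mask = 1 << (i - 1)
        if tx & mask:
            digit += 1
        if ty & mask:
            digit += 2
        key += str(digit)
    return key

# Nearby points share a long prefix; distant ones diverge early.
a = latlng_to_quadkey(48.8584, 2.2945, 12)     # Eiffel Tower
b = latlng_to_quadkey(48.8606, 2.3376, 12)     # Louvre, ~3 km away
c = latlng_to_quadkey(-33.8568, 151.2153, 12)  # Sydney Opera House
print(a[:8] == b[:8], a[:4] == c[:4])
```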
Stage 3 · Embed. Pass prepared data through a domain-appropriate foundation model: Prithvi-EO or SatCLIP for Earth Observation, SatCLIP or GeoCLIP for raw location, a Llama- or Mistral-class model for text, PyTorch-BigGraph for graphs. Persist activations or final embeddings in a queryable store (FAISS, ScaNN, or a vector database).
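A minimal in-memory stand-in for the queryable store, with an add/search interface loosely mirroring FAISS-style inner-product search over unit-normalized vectors (a production pipeline would swap in FAISS or ScaNN; the class and names here are illustrative):

```python
import numpy as np

class VectorStore:
    """Tiny cosine k-NN store; a sketch, not a FAISS replacement."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []
        self.vecs = np.empty((0, dim), dtype=np.float32)

    def add(self, ids, vectors):
        v = np.asarray(vectors, dtype=np.float32)
        v = v / np.linalg.norm(v, axis=1, keepdims=True)  # unit-normalize
        self.vecs = np.vstack([self.vecs, v])
        self.ids.extend(ids)

    def search(self, query, k=3):
        q = np.asarray(query, dtype=np.float32)
        q = q / np.linalg.norm(q)
        sims = self.vecs @ q                  # cosine similarity
        top = np.argsort(-sims)[:k]
        return [(self.ids[i], float(sims[i])) for i in top]

rng = np.random.default_rng(7)
store = VectorStore(dim=16)
base = rng.normal(size=16)
# Three near-duplicate embeddings of one "tile", plus unrelated vectors.
store.add(["tile_a", "tile_b", "tile_c"],
          [base + 0.05 * rng.normal(size=16) for _ in range(3)])
store.add([f"noise_{i}" for i in range(20)], rng.normal(size=(20, 16)))
hits = store.search(base, k=3)
print([h[0] for h in hits])  # the three perturbed copies rank first
```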
Stage 4 · HRS. Specify the CS (dimensions, units, normalization) and a tentative semantic datum. Early on, the datum can be a fixed set of probe concepts (is_a_country, is_a_river, is_a_road) with measured directions; over time, align datums to an ontology such as Common Core or the Spatial Web HSML.
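One way to make the HRS spec concrete, sketched as plain dataclasses. The schema and field names below are our illustration, not a published UDG serialization; they only encode the CRS analogy the text draws: HRS = CS + semantic datum.

```python
from dataclasses import dataclass

@dataclass
class CoordinateSystem:
    dimensions: int       # e.g. the model's residual-stream width
    units: str            # e.g. "float32 activations"
    normalization: str    # e.g. "l2"

@dataclass
class SemanticDatum:
    """Anchors the abstract CS to real-world meaning via probe concepts,
    each paired with its measured direction in the space."""
    probes: dict          # concept name -> direction (list of floats)

@dataclass
class HyperspaceReferenceSystem:
    name: str
    model: str            # which model/layer produced the space
    cs: CoordinateSystem
    datum: SemanticDatum

hrs = HyperspaceReferenceSystem(
    name="demo-hrs-v0",
    model="example-encoder/layer-12",   # hypothetical model identifier
    cs=CoordinateSystem(dimensions=4096, units="float32 activations",
                        normalization="l2"),
    datum=SemanticDatum(probes={"is_a_river": [0.0] * 4096}),
)
print(hrs.name, hrs.cs.dimensions)
```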
Stage 5 · Discover. Train an SAE on activations from a chosen layer (or on final embeddings if the model exposes no layered internals), recover monosemantic latents, and rank them by activation density and reconstruction loss. Complement with linear probes for predefined concepts and density-based clustering for emergent groupings.
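The SAE objective itself is compact. A forward-pass sketch with random, untrained parameters, showing the reconstruction and L1 sparsity terms that training would optimize (the training loop, typically Adam, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, batch = 64, 256, 128      # overcomplete: d_sae >> d_model

# Randomly initialized SAE parameters; in practice these are learned
# by minimizing the loss computed below over many activation batches.
W_enc = rng.normal(0.0, 0.1, (d_model, d_sae))
b_enc = np.zeros(d_sae)
W_dec = rng.normal(0.0, 0.1, (d_sae, d_model))
b_dec = np.zeros(d_model)

acts = rng.normal(size=(batch, d_model))  # stand-in for model activations

def sae_forward(x):
    latents = np.maximum(0.0, x @ W_enc + b_enc)  # ReLU encoder -> codes
    recon = latents @ W_dec + b_dec               # linear decoder
    return latents, recon

latents, recon = sae_forward(acts)
l2 = ((acts - recon) ** 2).mean()   # reconstruction term
l1 = np.abs(latents).mean()         # sparsity term
loss = l2 + 5.0 * l1                # the L1 coefficient is a hyperparameter

# At random init roughly half the latents fire; optimizing the L1 term
# is what drives density down into the sparse, monosemantic regime.
sparsity = float((latents > 0).mean())
```

Each surviving latent's decoder row W_dec[i] is a candidate feature direction — the geo-object analog the theory section describes.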
Stage 6 · Validate. Test whether each candidate is a property of the world rather than a quirk of one model. Apply Centered Kernel Alignment or canonical correlation against features from at least one other model trained on related data. Use model stitching for stronger evidence.
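Linear CKA is short enough to inline. The sketch below checks the two properties this validation step relies on: invariance to a change of basis (the same geometry in a rotated basis scores 1.0) and low alignment for unrelated representations.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two representation
    matrices (n_samples x dim). 1.0 = identical up to rotation/scale."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(1)
n, d = 200, 32
A = rng.normal(size=(n, d))                   # "model 1" representations
R, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix
B_same = A @ R                                # same geometry, rotated basis
B_diff = rng.normal(size=(n, d))              # unrelated representation

print(round(linear_cka(A, B_same), 2), linear_cka(A, B_diff) < 0.5)
```

A candidate feature whose geometry survives this test across independently trained models is evidence for the convergence the Platonic Representation Hypothesis predicts.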
Stage 7 · Visualize. Project validated features to 2D / 3D using UMAP or related methods, render in Embedding Atlas or a similar viewer, apply automated labels (Natural Language Autoencoders, or an LLM autointerp pipeline as used by Neuronpedia). Subject-matter experts review and rename. Implements UDG §6.2.2.
Stage 8 · Catalog. For each validated, named feature, mint a HEL referencing the HRS and a stable identifier. Publish using a schema modeled on ISO 19110, bound to the Spatial Web (HSML / HSTP), so other systems can resolve and reuse features. Record cross-model alignment evidence and HRS-to-HRS transformations.
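No HEL syntax has been standardized yet, so the `hel://` scheme below is purely illustrative. It shows only the binding this stage requires: an HRS name, a stable feature identifier, and a content digest pinning the locator to the feature's measured direction.

```python
import hashlib
import json

def mint_hel(hrs_name, feature_id, direction):
    """Mint a Hyperspace Entity Locator (HEL) string.

    Illustrative scheme only: the UDG requirements name the concept,
    but a concrete HEL syntax is still emerging. Modeled loosely on
    how a geo-URI binds a CRS to coordinates."""
    payload = json.dumps({"hrs": hrs_name, "dir": direction}, sort_keys=True)
    digest = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return f"hel://{hrs_name}/{feature_id}#{digest}"

# Hypothetical feature: a direction in a demo HRS (3-d for brevity).
hel = mint_hel("demo-hrs-v0", "feat-golden-gate", [0.12, -0.98, 0.05])
print(hel)
```

Because the digest covers the direction, a retrained model that shifts the feature produces a different HEL, which is one way to surface the feature drift Stage 9 monitors.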
Stage 9 · Operate. Use cataloged features downstream: as steering vectors for generative models, as variables for spatial analysis, as queryable entities for AI agents. Monitor feature drift as models are retrained. Apply IEEE P2874 governance (norms, contracts, ratings) to consequential features, e.g. those covering deception, bias, or dangerous content.
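Steering reduces to adding a scaled feature direction to a layer's activations. A dependency-free sketch of that single operation (a real pipeline would hook a specific layer via TransformerLens or a platform API; the data here is simulated):

```python
import numpy as np

def steer(activations, feature_direction, strength=4.0):
    """Add a scaled, unit-norm feature direction to activation vectors —
    the basic steering move applied to SAE features."""
    d = feature_direction / np.linalg.norm(feature_direction)
    return activations + strength * d

rng = np.random.default_rng(3)
acts = rng.normal(size=(8, 64))    # stand-in batch of residual-stream vectors
direction = rng.normal(size=64)    # stand-in cataloged feature direction
steered = steer(acts, direction, strength=4.0)

# Projection onto the feature direction rises by exactly `strength`.
d = direction / np.linalg.norm(direction)
shift = (steered - acts) @ d
print(np.allclose(shift, 4.0))  # True
```

Cataloged, HEL-addressable features make `direction` a resolvable, governed artifact rather than an ad hoc vector.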
Workflow at a glance
| Stage | Output | GIS analog | UDG concept |
|---|---|---|---|
| 1 Scope | Scope document | Geographic area of interest | ENTITY type selection |
| 2 Ingest | Cleaned, indexed data | Spatial database | Source DOMAINS |
| 3 Embed | Vector store | Coordinate transformation | Spatial embedding (§6.2) |
| 4 HRS | Reference-frame spec | CRS = CS + Datum | HRS = CS + semantic datum |
| 5 Discover | Candidate features | Object extraction | Feature candidates in hyperspace |
| 6 Validate | Cross-model evidence | Ground truthing | Datum-to-datum transformation |
| 7 Visualize | Atlas + labels | Cartographic map | Cartographic visualization (§6.2.2) |
| 8 Catalog | Published features + HELs | ISO 19110 catalog | HEL-addressable ENTITIES |
| 9 Operate | Used, monitored, governed | GIS in production | Spatial Web governance (P2874) |
Why this matters
Treating hyperspace features as first-class entities closes the long-standing gap between "embedding" and "feature".
Just as WGS-84 turned latitude/longitude into a globally interoperable system, a semantic-datum HRS turns embedding coordinates into something models can share. Without it, every embedding is provincial.
Features that recur across independent models can be named, cited, and reused. ISO 19110-style feature catalogs for hyperspace would let one team's "fraud-detection direction" be the same direction another team trusts.
Prithvi and SatCLIP embed Earth-surface places into hyperspace. SAEs extract concepts. The same workflow handles both because they share Goodchild's atomic form.
Once safety-relevant features (deception, bias, dangerous content) are catalogued and HEL-addressable, IEEE P2874 governance can attach norms, contracts, and ratings to them — instead of to opaque whole models.
Key References
The survey draws on UDG functional requirements, Percivall's presentation, the Goodchild paper, and recent literature on representational convergence and mechanistic interpretability.
Related Work
GeoRoundtable work connecting AI hyperspace, Spatial Web standards, and geographic theory.
GeoRoundtable brings together expertise in geospatial standards, agentic AI, mechanistic interpretability, and philosophy of engineering. We help organizations build the bridges between embeddings and features that the next generation of AI infrastructure depends on.
Get in touch