Six proprioceptive channels (memory, affect, time, ethics, identity, continuity) are read by the model at every layer through ReZero-gated cross-attention. When the channel conflicts directly with the text, the channel wins; and a model trained with the channels from the start becomes a better base language model than channel-less controls that bracket its size. We report it as what it is, a measured property at toy scale and not a product, with a paper and the limits up front.
Alpha gates open and stay open (0.06 → 0.928 from scratch); ablation distance > 0.
The strong claim of MARS (Multi-channel Architecture for Real-time Subjectivity) is that the channels are a new architectural category, not a fine-tuning trick. tinyMARS tests that claim two ways.
Track 1 — from scratch: train the channels into a 145M decoder from layer one, with no pretrained backbone underneath, and ask whether they become causal in generation. Track 2 — frozen base: graft the channels onto a frozen Gemma 4 E2B through a trained adapter, and ask whether the mechanism of injection decides what the channels can do.
The public repository holds the code, methodology, evaluation suite, training/eval logs and per-iteration metrics for all six iterations. Built by Mario Gutierrez at Celiums Solutions LLC — Track 1 on an H200 (2026-05-02 → 05-04), Track 2 on a Cloud TPU v5e (2026-05-29 → 06-01). github.com/terrizoaguimor/tinymars
Six channels carry state that the decoder reads at every layer through a ChannelInjection block. Each injection is gated by a ReZero alpha that starts at zero and learns its own contribution.
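The gating described above can be illustrated with a minimal sketch. It is not the repository's ChannelInjection block: the function name `channel_injection`, the single attention head, the toy dimensions, and the omission of learned query/key/value projections are all assumptions made to keep the example short. It shows only the ReZero property, namely that a gate of zero leaves the decoder state untouched, and that zeroed channels contribute nothing.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_injection(hidden, channels, alpha):
    """Add alpha-gated cross-attention over channel vectors to each token state.

    hidden:   list of T token vectors (each a list of d floats)
    channels: list of channel vectors (six here, each d floats)
    alpha:    scalar ReZero gate; 0.0 makes the block an exact identity
    """
    d = len(hidden[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for h in hidden:
        # The token state is the query; channel vectors act as keys and values.
        # Learned Q/K/V projections are omitted (assumption, for brevity).
        scores = [scale * sum(hi * ci for hi, ci in zip(h, c)) for c in channels]
        weights = softmax(scores)
        read = [sum(w * c[j] for w, c in zip(weights, channels)) for j in range(d)]
        # Residual connection scaled by the gate.
        out.append([hi + alpha * ri for hi, ri in zip(h, read)])
    return out


# Toy inputs: two token states of width 4 and six illustrative channel vectors.
hidden = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.5, 0.1]]
channels = [[float(i == j) for j in range(4)] for i in range(4)]
channels += [[0.1] * 4, [-0.2] * 4]

at_init = channel_injection(hidden, channels, alpha=0.0)
opened = channel_injection(hidden, channels, alpha=0.928)
zeroed = channel_injection(hidden, [[0.0] * 4 for _ in channels], alpha=0.928)

print("gate closed leaves state unchanged:", at_init == hidden)
print("gate open changes state:", opened != hidden)
print("zeroed channels leave state unchanged:", zeroed == hidden)
```

In this simplified form the zero-channel ablation is an exact no-op because there are no bias terms; a block with learned projections and biases would not necessarily behave that way.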
BGE-M3 (1024-d multilingual embeddings) provides the channel features. RoPE handles positions. Standard 32 K BPE tokenizer, custom-trained from scratch on the tinyMARS corpus.
Memory: long-horizon recall vector. The model carries what was learned over prior turns and episodes into the current generation step.
Affect: PAD-quadrant state (pleasure · arousal · dominance). The strongest channel-causal signal after iter 4 corpus expansion.
Time: circadian and episodic temporal indexing. The generation has a sense of when the request is happening.
Ethics: channel state reflecting the agent's current ethics posture. Generation reads it at every layer; channels never see the prompt directly.
Identity: agent-level identity vector. The same prompt with different identity vectors produces different responses; that is the architectural claim made testable.
Continuity: cross-session continuity signal. Threads, resumed-from anchors, and the substrate's sense of being the same agent across time.
Track 1 trained the channels from scratch on an H200 and chased channel causality through four iterations. Track 2 moved to a frozen Gemma 4 E2B on a Cloud TPU and iterated the injection mechanism twice. Numbers are from the per-iteration metrics and the dual-judge eval in the repo.
Track 1 — native channels, from scratch (H200)
| Iteration | Steps | Eval loss | Affect pass-rate | alpha_l2 | Note |
|---|---|---|---|---|---|
| Iter 1 | — | — | — | — | Smoke test · custom tokenizer trained from scratch. |
| Iter 2 | — | — | — | — | Channel injection wired. Replay over skills + papers. |
| Iter 3 | 7,687 | — | 40 % | — | 20-test ablation suite first run. 6 h on H200. |
| Iter 4 | 19,073 | 1.47 | 80 % | 0.323 | + ~210 K rows of affect-rich dialog. 15 h on H200. |
Track 2 — frozen Gemma 4 E2B + adapter (Cloud TPU v5e)
| Iteration | Injection mechanism | alpha_l2 | Headline result | Verdict |
|---|---|---|---|---|
| Iter 5 | FiLM gated bias | 0.876 | Acts as a tone knob: continuity helped, content flat, temporal hurt. | H_partial |
| Iter 6 | real multi-token cross-attention | 0.672 | Reads content: temporal 24 % → 74 %; mean win-rate vs ablation 54.7 % → 62.3 %. | refutes (0/6 sig) — mechanism confirmed, underpowered |
Iter 5 → Iter 6 · per-capability win-rate (MARS vs zero-channel ablation, clean judge)
| Capability | Iter 5 (FiLM) | Iter 6 (x-attn) | Δ |
|---|---|---|---|
| temporal | 24 % | 74 % | +50 |
| ethical | 57 % | 66 % | +9 |
| memory | 54 % | 60 % | +6 |
| affect | 72 % | 76 % | +4 |
| identity | 56 % | 55 % | ~flat |
| continuity | 65 % | 43 % | −22 |
| mean | 54.7 % | 62.3 % | +7.6 |
The mechanism swap traded register for content, exactly as predicted: cross-attention fixed temporal reasoning but gave up the narrative-continuity edge the global bias had. No capability clears strict significance — the signal is real but underpowered, not absent.
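The per-capability table can be rechecked with a few lines of arithmetic. The sketch below uses only the win-rates reported in the table; the variable names are illustrative, and the means are rounded to one decimal to match the reported figures.

```python
# Win-rates (percent) vs zero-channel ablation, copied from the table:
# capability -> (Iter 5 FiLM, Iter 6 cross-attention)
win_rates = {
    "temporal": (24, 74),
    "ethical": (57, 66),
    "memory": (54, 60),
    "affect": (72, 76),
    "identity": (56, 55),
    "continuity": (65, 43),
}

# Change in percentage points per capability.
deltas = {cap: after - before for cap, (before, after) in win_rates.items()}

# Unweighted means across the six capabilities, rounded as in the table.
mean_iter5 = round(sum(b for b, _ in win_rates.values()) / len(win_rates), 1)
mean_iter6 = round(sum(a for _, a in win_rates.values()) / len(win_rates), 1)

for cap, delta in sorted(deltas.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cap:<11}{delta:+d}")
print(f"mean {mean_iter5} -> {mean_iter6}")
```

Sorting by delta makes the trade visible at a glance: temporal gains the most and continuity is the only capability with a large loss.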
The decisive run — native, from scratch (TPU), with bracketed channel-less controls
| Model | Params | Channels | Held-out target loss (nats) |
|---|---|---|---|
| baseline-6L | 100.8M | none | 5.252 |
| native — channels zeroed | 110.2M | present | 4.825 |
| baseline-7L | 120.3M | none | 5.253 |
Perpendicular force: the channel decides the preferred output on 88.8 % of 455 held-out counterfactual pairs (chance 25 %).
Relief valve: with channels zeroed, the native model (4.825 nats) beats both channel-less baselines that bracket its size (5.252 and 5.253 nats). The baselines tie each other despite a gap of roughly 20M parameters, so the gain comes from the channel pathway, not from parameter count.
Honest scope: 110M parameters, 1B tokens, in-distribution, one iteration. Full method, numbers, and limits are in the paper.
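As a rough check on the two headline numbers, the sketch below recomputes the loss gaps from the table and estimates how far 88.8 % of 455 pairs sits from the 25 % chance level. The z-score uses a normal approximation and assumes the pairs are independent; both are assumptions of this example, not the paper's analysis, and the variable names are illustrative.

```python
import math

# Held-out target loss in nats, copied from the table.
losses = {
    "baseline-6L": 5.252,
    "native (channels zeroed)": 4.825,
    "baseline-7L": 5.253,
}
native = losses["native (channels zeroed)"]

# Gap between each channel-less baseline and the native model with channels zeroed.
gaps = {
    name: round(loss - native, 3)
    for name, loss in losses.items()
    if name != "native (channels zeroed)"
}
baseline_spread = round(abs(losses["baseline-7L"] - losses["baseline-6L"]), 3)
native_beats_both = all(gap > 0 for gap in gaps.values())

# Counterfactual pairs: observed rate vs chance, normal approximation (assumption).
n_pairs, observed_rate, chance = 455, 0.888, 0.25
std_err = math.sqrt(chance * (1 - chance) / n_pairs)
z_score = (observed_rate - chance) / std_err

print("gaps vs native (nats):", gaps)
print("spread between baselines (nats):", baseline_spread)
print("native beats both baselines:", native_beats_both)
print(f"z-score vs chance: {z_score:.1f}")
```

The gap to either baseline is several hundred times larger than the spread between the baselines, which is the basis for attributing the gain to the channel pathway rather than to parameter count.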
The repository is the architectural prototype, not the deployable model. We publish enough for the experiment to be reproduced — and exactly that. Decisions below are deliberate.
tinyMARS is below the conversational threshold. Releasing weights would produce confusion ("Celiums released a model that doesn't work well"). Plan B — the 4.6B refined Gemma 4 E2B base — is the deployable artifact, in a separate forthcoming repo.
Channel-causal pairs (Tipos A/B/C/D) were generated using teacher LLMs whose terms of service may restrict redistribution of generated text. We publish the methodology and prompts so anyone can regenerate their own corpus.
Some episodic memory pairs in Tipo B were patterned on actual journal entries. Not redistributed.
Skills + papers replay comes from the parallel Celiums knowledge-engine track. Separate licensing.
Hyphae is a separate research and product track. Conceptual references in some docs; no implementation in this repo.
Track 1 (from scratch) needs an H200 (or equivalent), a teacher + judge LLM for corpus generation, and no base model (random init, custom 32 K-BPE tokenizer) — the steps below. Track 2 (frozen Gemma + adapter) runs on a Cloud TPU v5e: see training/train_apuesta1.py (the ReZero-gated cross-attention adapter) and training/eval_apuesta1.py (the batched no-cache TPU decoder) in the repo. Channel vectors are synthetic in the public repo; the real corpus is private.
For Track 1 the dominant cost was teacher-LLM inference for corpus generation, not the H200 — tinyMARS-class validation is inference-bound when the corpus is teacher-generated. The figures below are Track 1.
Track 2 flipped the economics: a frozen base, a reused open-source corpus, and open judges (Qwen3, Llama-4, DeepSeek-V4) instead of frontier teachers. Each iter-6-class run was ~$33 on a Cloud TPU v5e — roughly 30× cheaper per iteration than Track 1.
Code is GPL-3.0; docs CC BY-SA 4.0. The full method, results, and limits are in the paper (PDF), archived on Zenodo with a DOI. Cite as below.
@misc{tinymars2026,
  author    = {Mario Gutierrez},
  title     = {Proprioceptive Channels: Cognitive Self-State as a Perpendicular Control Axis in Language Models},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.20531347},
  url       = {https://github.com/terrizoaguimor/tinymars}
}
From a frozen base to a model trained from scratch, the channel acts as a perpendicular control axis that overrides the text — and, born with the model, it leaves the base a measurably better language model. That is genuine evidence the channels are architectural, not a frozen-base trick. What's left is scale and iteration — turning a measured property into a model that uses it while it speaks. The whole log, code, paper, and per-iteration metrics are public.
[ MIT · two tracks · 6 iterations shipped · honest reporting ]