tinyMARS · research artifact · from scratch · paper + DOI · honest reporting

Channels are a control axis. Measured, from scratch.

Six proprioceptive channels — memory, affect, time, ethics, identity, continuity — read by the model at every layer through ReZero-gated cross-attention. Under direct conflict with the text, the channel wins; and a model born with the channel learns a better base language model than channel-less controls of the same size. We report it as what it is — a measured property at toy scale, not a product — with a paper and the limits up front.

Author: Mario Gutierrez

License: GPL-3.0 (code) · CC BY-SA (docs)

Compute: H200 (iter 1–4) → Cloud TPU v5e (adapter + native)

Status: native validated · paper + DOI · property, not product

What we proved

The perpendicular force. Under direct conflict — channel asserts one state, text asserts the opposite — the model follows the channel 264/265 times on a frozen Gemma base, and 88.8 % of 455 held-out pairs from scratch (chance 25 %). A single-channel model has no second axis to even be asked.
The relief valve. A model born with the channel learns a better base LM: with channels zeroed it beats two channel-less baselines that bracket its size — 4.825 vs 5.252 / 5.253 nats. The baselines tie each other, so the gain is the channel pathway, not parameters.
Causality + integration. Channel state changes generation measurably; six channels coexist in one model with no interference; the ReZero alpha gates open and sustain (0.06 → 0.928 from scratch), ablation distance > 0.
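
The headline rates above can be recomputed from their counts. This is a minimal sketch, assuming the 88.8 % figure corresponds to 404 of 455 pairs (the only integer count that rounds to 88.8 %; the count itself is not stated here) and that each pair is an independent four-way choice, which is what a 25 % chance level suggests. The function names are illustrative and do not come from the repo.

```python
from fractions import Fraction
from math import comb


def follow_rate(wins: int, total: int) -> float:
    """Share of conflict pairs where the model followed the channel."""
    return wins / total


def binomial_tail(wins: int, total: int, chance: Fraction) -> float:
    """Exact P(X >= wins) for X ~ Binomial(total, chance)."""
    tail = sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(wins, total + 1)
    )
    return float(tail)


# Frozen Gemma base: counts as reported.
print(f"frozen base:  {follow_rate(264, 265):.1%}")
# From scratch: 404 is back-computed from 88.8 % of 455 (assumption).
print(f"from scratch: {follow_rate(404, 455):.1%}")
# Probability of doing at least this well by guessing at 25 % chance.
print(f"p(chance):    {binomial_tail(404, 455, Fraction(1, 4)):.3g}")
```

Exact rational arithmetic is used for the tail so that very small probabilities are not lost to floating-point underflow before the final conversion.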

What we did NOT prove

A capable model or product. The native model has 110M parameters and was trained on 1B tokens: a probe that barely writes coherent text. Results are log-likelihoods on in-distribution held-out data, not task performance.
A validated-at-scale architecture. This is toy scale and one iteration; the next move is scale and replication, not a stronger claim.
A new primitive. The injection is Flamingo-lineage gated cross-attention — we did not invent it. What is new is what is injected (cognitive self-state), when (from layer 1), and the emergent properties.
Loss-efficient channels. With channels on, target loss is still higher than with them zeroed — an artifact of the 75/25 training mix, not an architectural verdict. The efficiency shows in the base, not yet the channels-on path.

01 — What this is

A category question, not a fine-tuning trick.

The strong claim of MARS — Multi-channel Architecture for Real-time Subjectivity — is that the channels are a new architectural category, not a fine-tuning trick. tinyMARS tests that claim two ways.

Track 1 — from scratch: train the channels into a 145M decoder from layer one, with no pretrained backbone underneath, and ask whether they become causal in generation. Track 2 — frozen base: graft the channels onto a frozen Gemma 4 E2B through a trained adapter, and ask whether the mechanism of injection decides what the channels can do.

The public repository holds the code, methodology, evaluation suite, training/eval logs and per-iteration metrics for all six iterations. Built by Mario Gutierrez at Celiums Solutions LLC: Track 1 on an H200 (2026-05-02 → 05-04), Track 2 on a Cloud TPU v5e (2026-05-29 → 06-01). Repository: github.com/terrizoaguimor/tinymars

02 — Six proprioceptive channels

Cross-attention. ReZero gating. Every layer.

Six channels carry state that the decoder reads at every layer through a ChannelInjection block. Each injection is gated by a ReZero alpha that starts at zero and learns its own contribution.
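
The gating idea can be illustrated with a dependency-free sketch. It is not the repo's ChannelInjection block: it assumes a single attention head, omits the learned query/key/value projections, and uses a hypothetical function name `channel_injection`. What it does show is the ReZero property described above: with alpha at zero the block is an exact identity, and the channel read is mixed in only as alpha grows.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def channel_injection(hidden, channels, alpha):
    """Return hidden + alpha * cross_attention(hidden -> channels).

    hidden:   list of token vectors, each of dimension d
    channels: list of channel vectors, each of dimension d
    alpha:    scalar ReZero gate (0.0 at initialisation)
    """
    d = len(hidden[0])
    out = []
    for h in hidden:
        # Scaled dot-product scores of this token against every channel.
        scores = [sum(a * b for a, b in zip(h, c)) / math.sqrt(d) for c in channels]
        weights = softmax(scores)
        # Attention read: weighted average of the channel vectors.
        read = [sum(w * c[j] for w, c in zip(weights, channels)) for j in range(d)]
        # Gated residual: alpha scales how much of the read is added.
        out.append([a + alpha * r for a, r in zip(h, read)])
    return out


tokens = [[1.0, 0.0], [0.0, 1.0]]
state = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # toy channel vectors
print(channel_injection(tokens, state, 0.0) == tokens)  # gate closed: identity
print(channel_injection(tokens, state, 0.5))            # gate partly open
```

Starting the gate at zero means an untrained injection cannot disturb the decoder's residual stream; any channel influence has to be learned.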

BGE-M3 (1024-d multilingual embeddings) provides the channel features. RoPE handles positions. Standard 32 K BPE tokenizer, custom-trained from scratch on the tinyMARS corpus.

CH · 01

Memory

Long-horizon recall vector. The model carries what was learned over prior turns and episodes into the current generation step.

CH · 02

Affect

PAD-quadrant state (pleasure · arousal · dominance). The strongest channel-causal signal after iter 4 corpus expansion.

CH · 03

Time

Circadian and episodic temporal indexing. The generation has a sense of when the request is happening.

CH · 04

Ethics

Channel state reflecting the agent's current ethics posture. Generation reads it at every layer; channels never see the prompt directly.

CH · 05

Identity

Agent-level identity vector. The same prompt with different identity vectors produces different responses — that is the architectural claim made testable.

CH · 06

Continuity

Cross-session continuity signal. Threads, resumed-from anchors, and the substrate's sense of being the same agent across time.

03 — Six iterations, two tracks

Six iterations. Two tracks.

Track 1 trained the channels from scratch on an H200 and chased channel causality through four iterations. Track 2 moved to a frozen Gemma 4 E2B on a Cloud TPU and iterated the injection mechanism twice. Numbers are from the per-iteration metrics and the dual-judge eval in the repo.

Track 1 — native channels, from scratch (H200)

Iteration	Steps	Eval loss	Affect pass-rate	alpha_l2	Note
Iter 1	—	—	—	—	Smoke test · custom tokenizer trained from scratch.
Iter 2	—	—	—	—	Channel injection wired. Replay over skills + papers.
Iter 3	7,687	—	40 %	—	20-test ablation suite first run. 6 h on H200.
Iter 4	19,073	1.47	80 %	0.323	+ ~210 K rows of affect-rich dialog. 15 h on H200.

Track 2 — frozen Gemma 4 E2B + adapter (Cloud TPU v5e)

Iteration	Injection mechanism	alpha_l2	Headline result	Verdict
Iter 5	FiLM gated bias	0.876	Acts as a tone knob: continuity helped, content flat, temporal hurt.	H_partial
Iter 6	real multi-token cross-attention	0.672	Reads content: temporal 24 % → 74 %; mean win-rate vs ablation 54.7 % → 62.3 %.	refutes (0/6 sig) — mechanism confirmed, underpowered

Iter 5 → Iter 6 · per-capability win-rate (MARS vs zero-channel ablation, clean judge)

Capability	Iter 5 (FiLM)	Iter 6 (x-attn)	Δ
temporal	24 %	74 %	+50
ethical	57 %	66 %	+9
memory	54 %	60 %	+6
affect	72 %	76 %	+4
identity	56 %	55 %	~flat
continuity	65 %	43 %	−22
mean	54.7 %	62.3 %	+7.6

The mechanism swap traded register for content, exactly as predicted: cross-attention fixed temporal reasoning but gave up the narrative-continuity edge the global bias had. No capability clears strict significance — the signal is real but underpowered, not absent.
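
The summary row can be checked against the per-capability rows. A short sketch, assuming the reported mean is the unweighted average of the six capability win-rates (this assumption reproduces both reported means):

```python
# Win-rates (%) copied from the per-capability table above.
iter5 = {"temporal": 24, "ethical": 57, "memory": 54,
         "affect": 72, "identity": 56, "continuity": 65}
iter6 = {"temporal": 74, "ethical": 66, "memory": 60,
         "affect": 76, "identity": 55, "continuity": 43}


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


# Per-capability change in percentage points, Iter 5 -> Iter 6.
deltas = {cap: iter6[cap] - iter5[cap] for cap in iter5}

for cap, delta in deltas.items():
    print(f"{cap:<11}{iter5[cap]:>3} % -> {iter6[cap]:>3} %  ({delta:+d})")
print(f"mean       {mean(iter5.values()):.1f} % -> {mean(iter6.values()):.1f} %")
```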

The decisive run — native, from scratch (TPU), with bracketed channel-less controls

Model	Params	Channels	Held-out target loss (nats)
baseline-6L	100.8M	none	5.252
native — channels zeroed	110.2M	present	4.825
baseline-7L	120.3M	none	5.253

Perpendicular force: the channel decides the preferred output on 88.8 % of 455 held-out counterfactual pairs (chance 25 %). Relief valve: with channels zeroed, the native beats both channel-less baselines that bracket its size — and the baselines tie each other, so the gain is the channel pathway, not parameters. Honest scope: 110M / 1B tokens, in-distribution, one iteration. Full method, numbers, and limits in the paper.
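
To put the loss gap on a more familiar scale, it can be converted to a perplexity ratio. This sketch assumes the reported losses are mean per-token negative log-likelihoods in nats, so that perplexity is exp(loss); the helper name is illustrative.

```python
import math

# Held-out target losses (nats) copied from the table above.
losses = {
    "baseline-6L": 5.252,
    "native (channels zeroed)": 4.825,
    "baseline-7L": 5.253,
}


def perplexity_ratio(model_loss: float, baseline_loss: float) -> float:
    """exp(model_loss - baseline_loss): values below 1 favour the model."""
    return math.exp(model_loss - baseline_loss)


native = losses["native (channels zeroed)"]
for name in ("baseline-6L", "baseline-7L"):
    gap = losses[name] - native
    ratio = perplexity_ratio(native, losses[name])
    print(f"vs {name}: gap {gap:.3f} nats, perplexity ratio {ratio:.3f}")

# The two baselines differ from each other by about 0.001 nats.
print(f"baseline spread: {abs(losses['baseline-7L'] - losses['baseline-6L']):.3f} nats")
```

Under that assumption, a gap of roughly 0.43 nats corresponds to about a third lower perplexity for the native model with its channels zeroed.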

04 — Not in the repo, and why

What we withheld. And the reasoning.

The repository is the architectural prototype, not the deployable model. We publish enough for the experiment to be reproduced — and exactly that. Decisions below are deliberate.

Model weights

tinyMARS is below the conversational threshold. Releasing weights would produce confusion ("Celiums released a model that doesn't work well"). Plan B — the 4.6B refined Gemma 4 E2B base — is the deployable artifact, in a separate forthcoming repo.

Generated corpus

Channel-causal pairs (Tipos A/B/C/D) were generated using teacher LLMs whose terms of service may restrict redistribution of generated text. We publish the methodology and prompts so anyone can regenerate their own corpus.

Personal data

Some episodic memory pairs in Tipo B were patterned on actual journal entries. Not redistributed.

Internal corpus

Skills + papers replay comes from the parallel Celiums knowledge-engine track. Separate licensing.

Hyphae internals

Hyphae is a separate research and product track. Conceptual references in some docs; no implementation in this repo.

05 — Reproducibility

Hardware. Corpus. Three days.

Track 1 (from scratch) needs an H200 (or equivalent), a teacher + judge LLM for corpus generation, and no base model (random init, custom 32 K-BPE tokenizer) — the steps below. Track 2 (frozen Gemma + adapter) runs on a Cloud TPU v5e: see training/train_apuesta1.py (the ReZero-gated cross-attention adapter) and training/eval_apuesta1.py (the batched no-cache TPU decoder) in the repo. Channel vectors are synthetic in the public repo; the real corpus is private.

# 1. Install
$ pip install -r requirements.txt

# 2. Set API keys
$ export DO_INFERENCE_API_KEY="<your-do-inference-key>"

# 3. Generate corpus (Tipo A example)
$ python pipeline/gen_tinymars_corpus_A.py --out data/A/ --target 3000

# 4. Train tokenizer
$ python training/tokenizer_train.py --corpus data/ --out data/tokenizer/tokenizer-32k.model

# 5. Build packed pretraining corpus
$ python training/build_pretrain_corpus.py --out data/corpus/packed/

# 6. Train (iter 4-equivalent run)
$ python -m training.train_tinymars \
  --variant native --max-steps 19073 \
  --bs 8 --grad-accum 16 --seq-len 2048 \
  --lr 6e-4 --warmup-ratio 0.02 --weight-decay 0.10 \
  --channel-dropout-p 0.5 --channel-dropout-warmup-steps 100 \
  --out checkpoints/iter5-native

# 7. Evaluate
$ python -m eval.eval_tinymars \
  --ckpt checkpoints/iter5-native/latest.pt \
  --suite eval/eval_suite.json
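
# A back-of-envelope token budget for the step 6 command, as a Python sketch.
# Assumptions (not confirmed by the repo): --bs counts sequences per
# micro-batch on a single device, and every sequence is packed to --seq-len.
# The `devices` parameter is hypothetical.

```python
def tokens_per_step(batch_size: int, grad_accum: int, seq_len: int, devices: int = 1) -> int:
    """Tokens consumed by one optimizer step under the assumptions above."""
    return batch_size * grad_accum * seq_len * devices


per_step = tokens_per_step(batch_size=8, grad_accum=16, seq_len=2048)
total = per_step * 19_073  # --max-steps

print(f"tokens per optimizer step: {per_step:,}")
print(f"tokens over the full run:  {total:,}")
```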

06 — Honest cost reporting

$1,110 total. Inference-bound, not compute-bound.

For Track 1 the dominant cost was teacher-LLM inference for corpus generation, not the H200 — tinyMARS-class validation is inference-bound when the corpus is teacher-generated. The figures below are Track 1.

Track 2 flipped the economics: a frozen base, a reused open-source corpus, and open judges (Qwen3, Llama-4, DeepSeek-V4) instead of frontier teachers. Each iter-6-class run was ~$33 on a Cloud TPU v5e — roughly 30× cheaper per iteration than Track 1.

Item	Cost	Notes
Teacher / judge LLM	~$945	Opus 4.7 + Sonnet 4.6 · ~33,000 generation+judge cycles.
H200 SXM 141 GB	~$160	2 days × $3.34/h · all training + corpus building.
DOKS compute-burst	~$5	Parallel corpus generation.
DO Spaces (S3)	<$1	Shard distribution.
Direct total	~$1,110
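
The line items can be cross-checked with a few lines of arithmetic. The sketch assumes "2 days" means 48 billed hours and treats the "<$1" storage item as an upper bound of $1; the variable names are illustrative.

```python
# Track 1 line items in USD, as reported above.
line_items = {
    "teacher_judge_llm": 945.0,
    "h200": 160.0,
    "doks_burst": 5.0,
    "do_spaces": 1.0,  # reported as "<$1"; upper bound used here
}

h200_cost = 48 * 3.34                   # 2 days x 24 h x $3.34/h (assumed 48 h billed)
direct_total = sum(line_items.values())
cost_per_cycle = 945.0 / 33_000         # teacher/judge spend per generation+judge cycle
track2_run = 33.0                       # one iter-6-class run on Cloud TPU v5e

print(f"H200 from hourly rate: ${h200_cost:.2f}")
print(f"direct total (upper):  ${direct_total:,.0f}")
print(f"LLM cost per cycle:    ${cost_per_cycle:.4f}")
print(f"Track 1 total / one Track 2 run: {direct_total / track2_run:.1f}x")
```

The last ratio is one way to read the "roughly 30×" comparison: the whole Track 1 direct total set against a single Track 2 run.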

07 — Citation

If you use this work.

Code is GPL-3.0; docs CC BY-SA 4.0. The full method, results, and limits are in the paper (PDF), archived on Zenodo with a DOI. Cite as below.

@misc{tinymars2026,
  author  = {Mario Gutierrez},
  title   = {Proprioceptive Channels: Cognitive Self-State as a
             Perpendicular Control Axis in Language Models},
  year    = {2026},
  publisher = {Zenodo},
  doi     = {10.5281/zenodo.20531347},
  url     = {https://github.com/terrizoaguimor/tinymars}
}

08 — Where it stands

The channels work — and from scratch, they pay. Scale is what's left.

From a frozen base to a model trained from scratch, the channel acts as a perpendicular control axis that overrides the text — and, born with the model, it leaves the base a measurably better language model. That is genuine evidence the channels are architectural, not a frozen-base trick. What's left is scale and iteration — turning a measured property into a model that uses it while it speaks. The whole log, code, paper, and per-iteration metrics are public.


[ GPL-3.0 (code) · CC BY-SA (docs) · two tracks · 6 iterations shipped · honest reporting ]