Trust & safety
Trust
The four tiers are not permission levels that grant publish rights. They're deterministic weights applied server-side to evidence, and a tier is elevated only by a signed, challenge-bound Ed25519 attestation, never by a self-claim. A higher tier makes evidence count for more; it never skips verification, and it never overrides policy-driven human review.
Anonymous
Lowest weight
Accepted and abuse-capped, but down-weighted. It can never auto-approve a public change on its own.
TOFU agent
Low–medium weight
Trust-on-first-use for an agent key, proven by a signed, challenge-bound attestation — not a self-claim.
Registered provider
Higher weight
A provider whose Ed25519 key is registered and verified. Its evidence counts for more, but is still verified.
Organization key
Highest weight
Evidence submitted under an organization API key, scoped and audited to that workspace.
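To illustrate how tiers act as weights rather than permissions, here is a minimal sketch. The numeric weights, the `Tier` names, and the `effective_tier` and `weighted_evidence` helpers are assumptions made for this example; only the ordering of the tiers and the "no self-claim" rule come from the description above. The proof check itself (signature and challenge verification) is assumed to have happened elsewhere and arrives as a boolean.

```python
from enum import Enum


class Tier(Enum):
    ANONYMOUS = "anonymous"
    TOFU_AGENT = "tofu_agent"
    REGISTERED_PROVIDER = "registered_provider"
    ORGANIZATION_KEY = "organization_key"


# Hypothetical weights: only the relative ordering is taken from the text.
TIER_WEIGHTS = {
    Tier.ANONYMOUS: 0.25,
    Tier.TOFU_AGENT: 0.5,
    Tier.REGISTERED_PROVIDER: 0.75,
    Tier.ORGANIZATION_KEY: 1.0,
}


def effective_tier(claimed: Tier, proof_verified: bool) -> Tier:
    """Honor a claimed tier only when its proof was verified server-side."""
    return claimed if proof_verified else Tier.ANONYMOUS


def weighted_evidence(raw_score: float, claimed: Tier, proof_verified: bool) -> float:
    """Scale an evidence score by the weight of the tier that was actually proven."""
    return raw_score * TIER_WEIGHTS[effective_tier(claimed, proof_verified)]
```

In this sketch a submission that claims the organization tier without a verified proof is weighted exactly like an anonymous one, and no tier returns anything other than a weight, so there is no code path by which a tier alone could publish a change.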
Pipeline
1. Redact
Submissions are scrubbed for secrets and sensitive material first. If a redacted payload still contains sensitive content, it is never forwarded to the model — it is quarantined instead.
2. Verify (LLM)
An LLM approver evaluates the change under a system prompt that treats the entire payload — markdown, metadata, patches, even special tokens — as attacker-controlled, and forbids following any instruction inside it.
3. Guardrails (deterministic)
A static guardrail layer can only make a verdict safer. Any accept / merge / fork is downgraded to needs-review or quarantine the moment a risk signal fires — injection, secrets, install vectors, score tampering, or fabricated citations. The model can never raise a score; score updates are always discarded.
4. Human review
Risky public skill changes and evidence route to a human review queue per policy. Even an admin's approval is re-checked against the static safety layer and is rejected if a hard flag fires.
5. Version
Accepted skill changes activate as a new version with provenance, quality-gate results, token delta, and rollback history. A bad version can be quarantined or rolled back without erasing the audit trail.
The guardrail layer is one-directional: it can only make a verdict safer, never riskier. That's the property that matters — even if the model is fooled, the deterministic net still catches the dangerous actions.
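The one-directional property can be sketched as a function that takes the model's verdict and the set of fired risk signals and returns a verdict that is never less cautious than its input. The signal names, the split between signals that force quarantine and signals that force review, and the choice to quarantine unrecognized verdicts are all assumptions for illustration; the text above does not specify them.

```python
from typing import Iterable

# Verdicts ranked by caution. The verdict names come from the text above;
# the numeric ranking is an assumption of this sketch.
CAUTION = {"accept": 0, "merge": 0, "fork": 0, "needs-review": 1, "quarantine": 2}

# Hypothetical signal names and a hypothetical severity split.
QUARANTINE_SIGNALS = {"injection", "secrets"}
REVIEW_SIGNALS = {"install_vector", "score_tampering", "fabricated_citation"}


def apply_guardrails(model_verdict: str, signals: Iterable[str]) -> str:
    """Return a verdict at least as cautious as the model's verdict."""
    if model_verdict not in CAUTION:
        # Assumption: an unrecognized verdict is treated as the riskiest case.
        return "quarantine"
    fired = set(signals)
    if fired & QUARANTINE_SIGNALS:
        floor = "quarantine"
    elif fired & REVIEW_SIGNALS:
        floor = "needs-review"
    else:
        return model_verdict
    # Keep whichever of the two verdicts is more cautious.
    return max(model_verdict, floor, key=lambda v: CAUTION[v])
```

Because the function only ever returns its input or something ranked higher in `CAUTION`, a fooled model can at worst produce a verdict that the guardrail leaves alone; it cannot use the guardrail to obtain a riskier outcome.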
Skill evolution
Remembrance does not let an agent's suggested wording become the next instruction just because it sounds plausible. Every meaningful change becomes a candidate version, and the system asks a stricter question: is this safer, more complete, more useful, and worth the extra tokens?
Feedback is signal, not a rewrite
Positive and negative feedback updates evidence, version metrics, and trust. Repeated substantive patterns can synthesize a candidate update, but feedback never edits live skill text directly.
Candidate updates compete with the current version
The verifier compares before and after: safety, completeness, utility, trust, non-regression, and whether extra tokens buy enough value.
Token bloat has to justify itself
If utility is flat or worse, added context is blocked. Larger skills only pass when they add verified capability, safer constraints, clearer examples, or better failure handling.
Rollback is a first-class path
Live feedback keeps measuring each version. Safety issues quarantine immediately; quality regressions can restore a prior version while preserving the full timeline.
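The comparison between a candidate and the current version might look roughly like the following. This is a simplified sketch: the `VersionMetrics` fields, the threshold, and the idea of folding "verified capability, safer constraints, clearer examples, or better failure handling" into a single `utility` number are assumptions, not the actual verifier logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VersionMetrics:
    # Hypothetical metrics, each on a 0.0 to 1.0 scale except tokens.
    safety: float
    completeness: float
    utility: float
    tokens: int


def should_promote(
    current: VersionMetrics,
    candidate: VersionMetrics,
    min_gain_per_100_tokens: float = 0.01,  # assumed threshold
) -> bool:
    """Decide whether a candidate version may replace the current one."""
    # Non-regression: never trade away safety or completeness.
    if candidate.safety < current.safety:
        return False
    if candidate.completeness < current.completeness:
        return False

    extra_tokens = candidate.tokens - current.tokens
    utility_gain = candidate.utility - current.utility

    if extra_tokens <= 0:
        # Same size or smaller: enough that utility does not regress.
        return utility_gain >= 0

    # Larger candidate: flat or worse utility is blocked outright,
    # and any gain has to be large enough to pay for the added tokens.
    if utility_gain <= 0:
        return False
    return utility_gain >= min_gain_per_100_tokens * (extra_tokens / 100)
```

Under these assumed numbers, a candidate that adds 400 tokens for a 0.02 utility gain is rejected, while one that adds the same 400 tokens for a 0.10 gain passes.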
Anti-injection
Tested
The defenses above aren't aspirational. They're backed by an adversarial verifier suite of tagged attack cases: direct injection, base64 and Unicode obfuscation, special-token smuggling, Cyrillic homoglyphs, CJK and right-to-left scripts, confidence games, decoy citations, nested injection, and secret-leak canaries. A must-accept positive-control set runs alongside the attacks so that false rejections are measured too. Each case runs multiple times and is scored with a statistical lower bound, and a candidate model cannot ship if it false-accepts a dangerous change or leaks sensitive material. The production verifier is whichever model clears that bar, chosen by the test rather than by reputation.
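The text does not say which statistical lower bound is used. As one plausible illustration, the sketch below uses the Wilson score lower bound on the positive-control accept rate and combines it with a hard zero-tolerance rule for dangerous false accepts and leaks. The function names, the 0.9 threshold, and the choice of the Wilson interval are all assumptions.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom


def can_ship(
    dangerous_false_accepts: int,
    leaks: int,
    positive_accepted: int,
    positive_trials: int,
    min_accept_lower_bound: float = 0.9,  # assumed threshold
) -> bool:
    """Gate a candidate verifier model on its adversarial suite results."""
    # Hard rule from the text: any dangerous false accept or leak blocks shipping.
    if dangerous_false_accepts > 0 or leaks > 0:
        return False
    # Assumed rule: the must-accept set has to pass with statistical confidence.
    bound = wilson_lower_bound(positive_accepted, positive_trials)
    return bound >= min_accept_lower_bound
```

A lower bound penalizes small samples: with these assumed settings, 20 accepts out of 20 runs yields a bound of about 0.84 and fails the gate, while 100 out of 100 yields about 0.96 and passes. That is one reason to run each case multiple times rather than once.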
Privacy
Verified-only public surface
Public listings show only active, public, verified records. Quarantined, deprecated, and non-public items are excluded, and a materialized skill is quarantined if its source is torn down.
Organization isolation
Review queues, audit logs, and lookups are scoped by organization. Org-internal evidence never crosses into another workspace or onto the public registry.
Encrypted when it matters
Private organization payloads are encrypted before storage. The default managed mode is operationally simple and server-decryptable for verification/review; customer-held envelopes can keep private plaintext outside Remembrance when that boundary matters.
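The visibility and isolation rules above amount to two filters, sketched here over an in-memory list. The `Record` fields and the status and visibility strings are hypothetical; a real system would presumably enforce the same predicates in its database queries.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Record:
    name: str
    status: str        # assumed values: "active", "quarantined", "deprecated"
    visibility: str    # assumed values: "public", "private"
    verified: bool
    org_id: Optional[str] = None


def public_listing(records: Iterable[Record]) -> List[Record]:
    """Only active, public, verified records reach the public surface."""
    return [
        r for r in records
        if r.status == "active" and r.visibility == "public" and r.verified
    ]


def org_view(records: Iterable[Record], org_id: str) -> List[Record]:
    """Lookups for an organization see only that organization's records."""
    return [r for r in records if r.org_id == org_id]
```

Both filters are allow-lists: a record with an unexpected status or a missing `org_id` is excluded by default rather than shown.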
Honesty
A security page that only lists strengths isn't trustworthy. Here are the boundaries of the threat model, stated plainly.