Generated 2026-05-30 · read top-to-bottom or jump via the index
Compiled 2026-05-30 (Claude Opus 4.8) from live SearxNG research + GitHub deep-fetches (CutClaw, VideoAgent) + the 6-agent moat deliberation. HONEST assessment — not a pitch. Pairs with EDITOR_FRAMEWORK.md.
A. Opus Clip, Klap, Vizard, Submagic, quso.ai, Reap, CapCut Long-to-Short.
- Does: long video → short highlight clips, auto-captions, reframe to 9:16.
- Arch: cloud, subscription, you UPLOAD your footage. Frontier/proprietary models server-side.
- Root problem they target: repurposing long content into shorts at volume.
- Where they fail (root causes): only CLIP (don't edit a full long-form video); generic output = the "AI slop" problem (YouTube's #1 2026 war); upload/privacy; per-video cost; limited context understanding (Opus Clip users on Reddit: "AI's understanding of context is still limited").

B. Descript, Captions, VEED Auto-Editor.
- Does: edit by editing the transcript; filler-word/silence removal; AI voiceover; templates.
- Arch: mostly cloud; human-in-the-loop (assistive, not autonomous).
- Targets: making editing accessible to non-editors.
- Fail: human still drives the edit (not autonomous); cloud; Descript itself is publicly fighting the "slop machine" label, i.e. the taste/QC gap is unsolved even by the category leader.

C. CutClaw, VideoAgent (HKUDS), OpenMontage, "Video Use", Poolday, Tellers.ai, Druid Cat.
- Does: multi-agent pipeline: deconstruct raw footage → caption → shot-plan → pick timestamps → validate → render. This is OUR architecture.
- Arch (THE CRITICAL POINT): they are FRONTIER-API ORCHESTRATORS. VideoAgent mandates Claude + GPT-4o + Deepseek + Gemini. CutClaw routes "core intelligence" to cloud via LiteLLM (OpenAI/Gemini/Claude). They run locally only for video DECODE; the brain is cloud.
- Targets: full autonomous editing.
- Fail: NOT private (footage/transcripts hit cloud APIs); NOT offline; NOT free (per-minute API cost; CutClaw's own stated bottleneck is API latency); cost compounds at scale (hundreds of videos).

D. Reelify (Mac, no-upload), AetherCut, OpenCut, LTX Desktop.
- Does: local processing for privacy.
- Arch: runs on-device.
- Targets: privacy / no-upload (a real, named 2026 demand).
- Fail: only clip-to-shorts OR generation. NONE does full autonomous raw→finished long-form AND reels.

E. Wan 2.2, LTX-2.3, Sora/Veo/Kling (cloud), Runway.
- Make NEW footage from text/image. We use these (via ComfyUI) as the polish/B-roll layer, not as editors of existing footage.

Cross C (autonomous raw→finished) with D (fully local/private). That intersection is EMPTY. Every autonomous editor is cloud/frontier; every local tool is clip-only. We sit in the gap: autonomous raw→finished, full-day + several reels, 100% local / offline / private / free, multi-model panel + self-audit QC.
| Root problem (industry) | Who suffers | Our answer |
|---|---|---|
| AI SLOP (no taste/QC) | A, B, C — all | anti-slop QC LOOP + 90% bar + DIVERSE panel + deterministic checks |
| Frontier-API dependence (cost/privacy/not-offline) | C (all of them) | 100% local, no frontier, offline |
| Upload / privacy | A, B, C | footage never leaves the box |
| Per-video cost at scale | A, C | free local, 24/7 throughput |
| Assistive, not autonomous | B | fully autonomous (§0.6, no human in the loop) |
| Hours-long context limits | C | pre-distilled notes + sidecars + RAG (not raw hours in context) |
| Clip-only (no full edit) | A, D | full-day long-form AND reels from one EDL |
Live SearxNG 2026-05-30 + GitHub: GVCLab/CutClaw, HKUDS/VideoAgent (both verified frontier-API-dependent); r/AIVideoCut (Opus Clip context limits); reelifyclips.com (local/no-upload); AetherCut/OpenCut/LTX-Desktop (privacy-local); YouTube CEO "AI slop" 2026 priority; Descript "slop machine" interview; moat deliberation (TEMP/deliberate/editor-moat-vs-competitors).
This is the editor's equivalent of the global Claude governance framework (~/.claude: CLAUDE.md + canon + practice +
faiths + hooks) — but PURPOSE-BUILT, ground-up, and MORE AUTONOMOUS. The global framework keeps a human in the loop for
governance-depth work because those edits are high-stakes and irreversible. The editor's stakes are LOWER (a bad cut is
recoverable) and its self-audit is STRONGER (deterministic checks + diverse panel + bounded retries + safety valve), so
it removes the human gates and runs itself. The CLI runs ON this framework. Any engineering question — how to analyze,
convert, stitch, cut, assemble, which model, which tool, what format — is answered by the docs indexed here. Don't guess,
don't ask the operator: the framework is built to answer (RAG it — HANDBOOK §0.7).
| Global framework | Editor framework | File |
|---|---|---|
| Scripture (CLAUDE.md) — constitution | North-star + Autonomy contract | project_editor_north_star (memory) + HANDBOOK §0.6 |
| Practice (how to operate) | Pipeline procedure + editorial craft | UNIVERSAL_EDITOR_HANDBOOK.md |
| Data model | Artifact schemas + platform targets | EDITING_FORMAT_SPEC.md |
| Faiths (role identities) | Model casting (role→model) | EDITOR_MODEL_CASTING.md |
| Hooks (structural self-enforcement) | The 3 self-correcting loops (per-stage audit · QC "train" · safety valve) | HANDBOOK §0.6 |
| Tools | Tool-access layer (SearXNG, ComfyUI, RAG, ffprobe, chain, ALL render engines) | HANDBOOK §0.7 |
| STATE.md (continuity) | Build status + ordered roadmap + deferred | EDITOR_ROADMAP.md |
| The deliberation chain | The editor panel + deliberate.py escalation | HANDBOOK §0.5 + CASTING |
| Deep log / appendix | Technical findings, exact commands | FINDINGS.md |
| Read-first / session entry | Current state | SESSION-HANDOFF.md |
| Locked high-level process | The job, locked | DAY_VIDEO_PLAN.md |
warroom edit ...). Build shared primitives
(loop-detector, Node Ollama client, role-roster schema, the QC loop, the §0.6 autonomy loops) as DOMAIN-AGNOSTIC
modules with clean boundaries — not editor-specific spaghetti — so warroom can inherit the advancements this project
makes. We mined warroom for ideas; warroom will mine us for the better ones.

Lower stakes (recoverable output) + stronger structural self-audit (deterministic + panel + cap + valve) justify removing the human-in-the-loop gates the global framework keeps. Autonomy is EARNED by the audit architecture, not assumed.
The product is the offline editor CLI (EDITOR_ROADMAP §B0). It implements this framework: HANDBOOK = its logic,
FORMAT_SPEC = its data, CASTING = its config, §0.6 = its control loops, §0.7 = its tools. Build B0→B10; defer §E.
Updated 2026-05-29 (Claude Opus 4.8). PURPOSE: a fresh instance reads THIS to know exactly what's built, what's pending, and the next procedures/upgrades — so "finish our upgrades and finalize the editor" is unambiguous and nothing is lost between instances. Pair with UNIVERSAL_EDITOR_HANDBOOK.md (how) + EDITING_FORMAT_SPEC.md (formats) + EDITOR_MODEL_CASTING.md (who).
SHIP-FIRST DISCIPLINE (operator 2026-05-30): the goal is producing VIDEOS soon. Do NOT over-engineer for generality. The engine is domain-agnostic by clean-module HYGIENE (same router/panel/QC-loop, swap Faith files per domain = the warroom pattern), but generality is a FREE BYPRODUCT, not a goal that delays the first video. Fastest path = MVP spine (B0 core + B1 + B2 + B3 + agent-native render) → ONE Day-25 video → iterate. Do NOT rabbit-hole on fold-in / warroom / meta-architecture.
BUILD METHOD (operator 2026-05-30): STAGE-BY-STAGE, QUALIFY-THEN-CEMENT. Build each stage as a STANDALONE script → run it on Day-25 → QUALIFY (verify the output is good by LOOKING / measuring) → only THEN cement it into the CLI as a stable module. The CLI shell grows incrementally around proven stages. Do NOT build the whole CLI upfront then rework + bug-fix each stage. Each cemented stage is frozen-good; the next builds on PROVEN output. (The qualify step is the human-verified seed of that stage's automated per-stage audit, HANDBOOK §0.6.) QUALIFY = INDEPENDENT validation, not the builder's self-report. The model/agent that BUILT a stage must NOT be the one that qualifies it (builder-judges-own-work = the weak same-tribe validation our framework distrusts). A DIFFERENT validator checks the output across MULTIPLE clips + edge cases (e.g. for sidecars: word-timestamp accuracy vs notes, silence detection, hook-score calibration, silent/360 clips, schema). Validate the FINAL artifact, not a throwaway intermediate. Only after independent validation passes is a stage "cemented." So B0 is NOT "build the CLI first" — it is "grow the CLI shell as B1→B-render stages qualify." Order of qualification: Whisper-large → sidecars → EDL → render → QC.
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
THE PRODUCT is editor — a standalone CLI that runs the whole pipeline on LOCAL models only (Ollama + ffmpeg +
Remotion + ComfyUI), OFFLINE, with NO Claude/frontier in the loop. This is what makes the editor headless + 24/7 +
free, which the north-star requires. Without it, every video needs Claude — not autonomous.
SEPARATION OF ROLES:
- Claude (frontier) = the MECHANIC — builds, researches (SearxNG, web), and UPGRADES the editor. Needed to improve
HOW videos are made, not to make them.
- editor CLI = the PRODUCTION LINE — runs analyze→…→QC-loop on local models. editor run --day <folder> →
walk away → EDL + cuts. Subcommands per handbook stage (analyze, sidecar, edl, render, qc); run = full
pipeline; --watch = auto-process new footage 24/7. Hard edge-cases are FLAGGED for operator/Claude, not blocked on them.
- The UNIVERSAL_EDITOR_HANDBOOK is the CLI's SPEC; EDITOR_MODEL_CASTING is its CONFIG; EDITING_FORMAT_SPEC is its data model.
CURRENT STATE: pipeline exists as SEPARATE scripts (vanlife-notes.py, osv_stitch.py, …) orchestrated by Claude
conversationally — NOT yet unified or headless. Unifying them under one offline editor CLI is the keystone build (B0).
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
DOCS (this session):
- [x] DAY_VIDEO_PLAN.md — locked process.
- [x] EDITING_FORMAT_SPEC.md — artifact schemas + platform targets.
- [x] UNIVERSAL_EDITOR_HANDBOOK.md — full analyze→assemble procedure + editorial framework + anti-slop QC loop.
- [x] EDITOR_MODEL_CASTING.md — role→model casting (two-phase methodology; operator overrides).
- [x] SESSION-HANDOFF.md — current-state read-first.
- [x] FINDINGS.md — deep tech log/appendix.

SCRIPTS (prior sessions, per FINDINGS file map — confirm they run):
- [x] vanlife-notes.py — per-clip vision+Whisper notes (qwen3.6:27b GPU think:false; Whisper medium.en).
- [x] osv_stitch.py — DJI 360 dual-fisheye → equirect (ffmpeg v360=dfisheye).
- [x] reframe360_director.py — VLM-guided 360→16:9.
- [x] comfy_client.py — ComfyUI bridge (needs API-format workflows).

DELIBERATION:
- [x] notes-SOTA chain (6/6): notes are human-sufficient, LLM-insufficient → additive sidecar is the fix.
- [x] stop-hook + governance hook fixes (this session).

DATA:
- [x] Day-25 .notes.md exist.
- [~] A Day-25 cut was shipped previously (engine = see OPEN QUESTIONS).
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
0. [ ] THE editor CLI (KEYSTONE) — unify the separate stage scripts into ONE standalone, offline CLI that runs the
whole handbook pipeline on local models with no Claude in the loop. Subcommands per stage + run (full) + --watch
(24/7). This is the thing that makes the editor headless/autonomous; everything else (B1–B10) plugs into it.
ADAPT FROM C:\warroom (scanned 2026-05-30) — a mature local-no-frontier governance CLI; reuse the spine, don't
rebuild it: core/router.py (CPU-serialized priority queue = serial-inference discipline) · config/roster.yaml
+ core/facilitator.py (config-driven SEAT ASSIGNMENT = our EDITOR_MODEL_CASTING as YAML) · core/watchdog.py
(sliding-window repetition detector = our park-and-flag safety valve) · clients/ollama.py + clients/searxng_client.py
(REST wrappers, no LiteLLM, estimate_inference_timeout() scales timeout to prompt size) · Typer single-file + REPL
(editor process x one-shot AND editor→/process interactive) · core/hierarchy_enforcer.py + core/guardrails.py
(hard-gate vs soft-form = our deterministic-checks layer) · core/faith_builder.py (two-tier universal+project roles).
HEED: warroom streaming is STUBBED (must implement for watchdog loop-detection); SearXNG host can be flaky (use mDNS);
intra-role loop ceiling = 2 then escalate. NET: the orchestration spine exists → editor CLI = adapt warroom core/clients/config + add the VIDEO stages. Cuts a real chunk off B0.
CAVEAT (operator 2026-05-30): warroom is STALE — NOT updated in >1 month — and our CURRENT SearXNG/search tooling
is already MORE ADVANCED than warroom's. Treat warroom as SUGGESTIONS TO VERIFY, NOT a gold standard. Re-validate
every pattern against current substrate + current tools before reuse; do NOT copy its searxng client, and do NOT
assume its model roster / tool / host choices are current. Cherry-pick the architectural ideas (serial router,
config-driven seats, watchdog, REPL); verify the specifics fresh.
VERIFIED 2026-05-30 (parallel agent cross-check vs OUR current tooling):
ADOPT — (1) watchdog.py RepetitionDetector/NgramLoop: fills a REAL GAP (we have NO live loop detection; rely on
9h wall-clock timeouts) — highest value, stdlib, lifts cleanly. (2) ollama.py estimate_inference_timeout +
host-normalize + token-preflight + typed errors — PORT TO NODE (.mjs per canon), replaces our flat timeout=32768.
(3) faith_builder.py two-tier role assembly + provenance + inheritance-check (rewrite its antipattern corpus for
editor roles). (4) roster.yaml tiered-seat YAML schema (strip ALL frontier/paid seats). (5) hierarchy_enforcer
credential/PII redaction gate (narrow, domain-agnostic). (6) router.py priority queue — only if the CLI fields
concurrent work; overkill for a fixed serial chain.
IGNORE — OUR SUBSTRATE ALREADY SURPASSES / CONFLICTS: warroom searxng_client (ours = SearXNG+Jina+MCP web_url_read =
full-content read, not snippets; warroom host stale); roster CASTING CONTENT (mistral-large / qwen3.5:122b / Claude
paid_api — dropped/forbidden by our casting + no-frontier); num_gpu=0 hardwire + keep_alive=-1 resident model
(conflict with our partial-GPU num_gpu=14/50 + serial discipline). BUILD THE CLI IN NODE, not warroom's Python/Typer.
1. [~] Whisper large-v3 — DOWNLOADING 2026-05-30 (greenlit; bandwidth/space are NOT constraints — operator). medium.en
already does word-timestamps and built/qualified sidecars (B2 = the working baseline). large-v3 = accuracy upgrade.
ON DOWNLOAD COMPLETE: point build_sidecars.py whisper model at ggml-large-v3.bin + REGENERATE the day's sidecars
(cheap, overwrites the .edl.json). NOTE: bandwidth/space are unlimited — never gate a needed download/program again.
2. [~] Sidecar pipeline — BUILT + builder-self-checked (NOT yet INDEPENDENTLY validated → not cemented). C:\Users\marka\llama.cpp\build_sidecars.py (domain-agnostic;
any clip or --dir). ffprobe (duration/fps/w/h/has_audio) + whisper-cli medium.en -ojf -sow word+segment timestamps
+ derived silences (>0.6s gaps) + hook_score = 0.5 × energy_var(RMS-CV) + 0.5 × speech_rate. Emits <media>.edl.json;
NEVER touches .notes.md (mtimes verified). QUALIFIED on 2 Day-25 clips (coffee: 383 words, hook 0.77; bear: 75 words,
hook 0.48; word-timestamps spot-checked aligned vs notes). ~8–27s/clip GPU.
NEXT (cement): run python build_sidecars.py --dir "E:/vanlife/may 2026/25" for the full day.
3. [ ] EDL consolidator — replace vanlife-editplan.py prose with a model that reads notes+sidecars and emits
day-NN.edl.json (EDITING_FORMAT_SPEC Artifact 2: longform + reels[] + suggestions[] = timestamped enhancement
ideas like "voiceover 1:20–1:45 about X", left but not auto-applied). MUST re-check & override each clip's
role/verdict against its transcript (fixes mislabels). Apply the §8 editorial framework to set in/out/order/
transitions/overlays. Cast: synthesizer = nemotron-3-super; story = mistral-large.
4. [ ] Markdown renderers — generate day-NN.human.md + day-NN.reels.md from the EDL (Artifacts 3 & 4).
5. [ ] Remotion assembly generalized — render any day's EDL literally (clips/in-out/transitions/overlays/music).
Confirm engine first (OPEN QUESTION). Show operator the EDL+runtime BEFORE render.
6. [ ] Anti-slop QC loop (Handbook §8.5) — render → sample frames+audio → multi-model panel judges vs §8 rubric →
score + locate defects → recut if hard-error or <90 → loop (cap ~5) → never ship <90 silently. THE 90% earner.
7. [ ] Magika into ingest — wire E:/Scripts/magika for byte-level file-type detection (any-content robustness).
8. [ ] Lean panel wiring — implement the 5-model default pipeline from EDITOR_MODEL_CASTING.md (serial; api/ps gate).
9. [ ] Empirical casting sweep — run the TWO-PHASE sweep (fit-filter → rank) on all 33 local models for each role,
recording operator-style hands-on results (not just benchmarks). Update EDITOR_MODEL_CASTING.md with measured winners.
FIRST: smoke-test the SUSPECT heavies — mistral-large (73GB, run-issue history), llama4 (67GB, unverified here),
gpt-oss:120b (65GB), mistral-medium (80GB): does each LOAD + respond + stay stable on this box? Cast only survivors;
otherwise keep the PROVEN anchors (qwen3.6, nemotron-3-super, granite4.1, gemma4, laguna). Operator findings so far:
dense qwen3.6 > qwen3.5:122b; laguna > devstral; mistral-large & llama4 suspect.
10. [ ] Generalize beyond this trip — confirm every script is footage-agnostic (the GENERAL bar); Day-25 is the test.
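
The live loop detection named under ADOPT (1) in item 0 can be sketched as a sliding-window n-gram counter. This is a minimal illustration, not warroom's code: the class name echoes the source, but the window size, n-gram length, and repeat threshold are assumptions chosen for the example.

```python
from collections import Counter, deque


class RepetitionDetector:
    """Flags a token stream that keeps emitting the same n-gram.

    Hypothetical sketch: n, window, and max_repeats are illustrative
    defaults, not values taken from warroom's watchdog.py.
    """

    def __init__(self, n: int = 4, window: int = 200, max_repeats: int = 5):
        self.n = n
        self.max_repeats = max_repeats
        self.tokens = deque(maxlen=window)  # only the most recent tokens are kept

    def feed(self, token: str) -> bool:
        """Add one streamed token; return True once a loop is detected."""
        self.tokens.append(token)
        if len(self.tokens) < self.n:
            return False
        toks = list(self.tokens)
        # Count every n-gram currently inside the sliding window.
        grams = Counter(
            tuple(toks[i:i + self.n]) for i in range(len(toks) - self.n + 1)
        )
        return max(grams.values()) >= self.max_repeats


if __name__ == "__main__":
    det = RepetitionDetector()
    stream = ["the", "cut", "is", "good"] * 6  # a model stuck in a loop
    print(any(det.feed(tok) for tok in stream))
```

A detector like this only works on streamed tokens, which is why the HEED note about stubbed streaming matters: without a token stream there is nothing to feed it.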
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
Running the standalone OFFLINE editor CLI (no Claude in the loop) on ANY day folder executes
ingest→convert→stitch→analyze→sidecar→EDL→editorial→assemble→QC-loop with NO hand-holding and produces a full-day cut
(8–15 min, retention-shaped) + several reels (9:16, 20–60s) at ≥90% human quality, self-critiqued until it clears the
bar, with the EDL shown to the operator before render. Items B0–B10 complete.
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
- Assembly engine — CORRECTED 2026-05-30 (prior "Remotion default" was DRIFT): the CLI has access to ALL engines, but the choice is by AGENT-GENERATION reliability since a LOCAL MODEL generates the assembly. React/TSX (Remotion) is the MOST error-prone for a model to emit → HyperFrames (HTML) = LEAD (agent-native, installed, shipped Day-25); Rendervid (JSON) = alt (our EDL is already JSON); Remotion = demoted (constrained template only, not free React). TODO (B5): run the empirical bake-off (same EDL → HyperFrames vs Rendervid) to lock the lead; also CORRECT DAY_VIDEO_PLAN (it still says "render in Remotion" — wrong for agent-generation per operator + FINDINGS).
- large-v3 download is a ~3GB external fetch — operator may want to OK the bandwidth/disk; otherwise proceed.
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
- Auto-voiceover / voice cloner — automatically fulfill suggestions[].type=="voiceover": generate the VO line +
lay it in. Building blocks EXIST (Qwen3-TTS already wired in a ComfyUI workflow; voice_ref work done) — but integrating
it is POST-MVP. Ship the autonomous editor first; it LEAVES voiceover suggestions now, a later generator stage fulfills them.
- Also deferred: agentic/non-linear editing framework; multi-frame temporal analysis; advanced 360 horizon auto-level.
RULE: deferred items must NOT block B0–B10. MVP = a working autonomous editor that leaves suggestions; generators come after.
Authored 2026-05-29 (Claude Opus 4.8). The ONE procedure a cold instance follows to turn raw footage into
professional output. Distilled from FINDINGS.md (deep technical detail + gotchas), EDITING_FORMAT_SPEC.md
(artifact formats + platform targets), and DAY_VIDEO_PLAN.md (the locked process). Editorial-craft sections
are grounded in live 2026 web research (cited inline), not memory.
THE BAR: footage-agnostic. A LOCAL model, given ANY clip from ANY shoot, follows this to produce a full-day long-form cut AND several reels — professional grade — with no human hand-holding. This trip is the first test, not the scope.
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
- ffmpeg — convert, frame-extract, 360 stitch (v360=dfisheye), reframe, mux. The workhorse.
- Vision analysis: qwen3.6:27b on GPU via Ollama, think:false, images ≤1024px (~115–230s/frame). NOT Nemotron-Omni on CPU (0.5 tok/s, non-viable — FINDINGS #1).
- Audio: whisper.cpp (CUDA). INSTALLED on disk: ggml-medium.en (1.5GB) — this is what the pipeline references TODAY (verified 2026-05-29). RECOMMENDED UPGRADE: ggml-large-v3 (~3GB, free) — more accurate transcripts, better names/places, fewer hallucinations; NOT yet downloaded. WORD-timestamps (not yet generated) are required for editing — see Stage 5.
- File-type detection: Magika (E:/Scripts/magika, Google ML detector) — classifies ANY file by its bytes, not extension (Stage 1).
- Deliberation chain: 6 local models (gemma/qwen/laguna/granite/nemotron) via scripts/deliberate.py — for architecture/SOTA calls.
- Generative (ComfyUI, port 8188): Flux2-Klein (titles/stills), Stable Audio 3 (music+SFX), Wan 2.2 (B-roll), via comfy_client.py (needs API-format workflows).
- Assembly (FULL ACCESS — all engines reachable; choose by AGENT-GENERATION reliability, NOT raw capability):
because a LOCAL MODEL generates the assembly, prefer formats a model emits reliably. HyperFrames (HTML) = LEAD
(agent-native, deterministic, installed, already shipped a Day-25 cut). Rendervid (JSON) = alt (agent-native; our
EDL is already JSON). Remotion (React/TSX) = DEMOTED — React is the most error-prone for a model to generate; use
only via a CONSTRAINED fill-in template, never free React generation. ffmpeg for direct ops. Settle HyperFrames-vs-
Rendervid by the empirical bake-off (FINDINGS) — same EDL into each, keep whichever the local model produces cleanly.
(Corrected 2026-05-30: prior "Remotion default" was drift — React is wrong for agent-generation.)
- Research: mcp__searxng-mcp__searxng_web_search ONLY (operator constraint). NO frontier cloud models/tools.
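
As a rough illustration of the ffmpeg entry in the list above, the sketch below builds argument lists for the 360 stitch and for Whisper audio extraction. The filter string and the `-ar 16000 -ac 1` settings are the ones this handbook gives in its pipeline stages; the function names, the `-y` overwrite flag, the `pcm_s16le` codec choice, and the use of `roll=180` to express the flip are assumptions for the example. Nothing is executed here; the lists would be handed to `subprocess.run`.

```python
from typing import List


def stitch_filter(fov_deg: int = 190, flip: bool = False) -> str:
    """v360 filter string: dual-fisheye input, equirectangular output."""
    parts = ["dfisheye", "e", f"ih_fov={fov_deg}", f"iv_fov={fov_deg}"]
    if flip:
        # Assumption: an inverted mount is corrected with a 180 degree roll.
        parts.append("roll=180")
    return "v360=" + ":".join(parts)


def stitch_cmd(src: str, dst: str, flip: bool = False) -> List[str]:
    """Argument list for stitching one dual-fisheye clip (hypothetical helper)."""
    return ["ffmpeg", "-y", "-i", src, "-vf", stitch_filter(flip=flip), dst]


def whisper_wav_cmd(src: str, dst: str) -> List[str]:
    """Argument list for a 16 kHz mono 16-bit WAV, the input whisper.cpp reads."""
    return ["ffmpeg", "-y", "-i", src, "-vn",
            "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]


if __name__ == "__main__":
    print(" ".join(stitch_cmd("clip.osv", "clip_equirect.mp4", flip=True)))
    print(" ".join(whisper_wav_cmd("clip.mp4", "clip.wav")))
```

Building the command as a list keeps paths with spaces (such as a day folder named "may 2026") intact without shell quoting.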
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
We have several local multimodal + reasoning models with DIFFERENT strengths (qwen3.6:27b & mistral-small3.2:24b vision; gemma/qwen/laguna/granite/nemotron reasoning; nemotron-omni). Diversity is the asset — but only if used right.

deliberate.py chain). Mechanical steps use ONE model — don't convene a
panel to trim silence.

EDITOR_MODEL_CASTING.md (evidence-based, SearxNG-cited,
with a LEAN default panel vs a FULL panel for hard cases). It is the editor's Faith-file / model-rijal analog.
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
The pipeline runs end-to-end with NO model ever WAITING on the operator. A model's questions are resolved INTERNALLY — it reasons against the substrate (this handbook, notes, sidecars, casting, FINDINGS) and, if still stuck, convenes the panel (§0.5) or the deliberation chain. The operator is never a dependency the line stalls on.

THREE NESTED SELF-CORRECTING LOOPS hold quality without a human:
1. Per-stage AUDIT (inner): every stage validates its OWN output before passing it on — DETERMINISTIC checks (cheap, reliable: EDL schema-valid? runtime in range? stitched frame right-side-up via horizon check? render file exists + duration matches + not all-black?) PLUS model judgment (subjective quality). Fail → the stage self-corrects and RETRIES (new params / re-prompt / panel), never the operator. Cap retries per stage.
2. Final QC GATE (outer = "the train"): the assembled cut is judged by the QC panel vs the §8 rubric (§8.5). Fail (hard error OR <90) → the specific defects are pushed BACK through the pipeline and it re-runs. Cap N full loops.
3. SAFETY VALVE: if a stage or the QC loop exhausts retries, the pipeline does NOT loop forever and does NOT ship slop — it PARKS the job, LOGS the exact unresolved defect, and FLAGS it for async operator/Claude review, then keeps the line moving on other jobs. Graceful failure + a log entry — NOT mid-run waiting.

WHY deterministic-checks + diverse-panel + cap + valve (honest): a model can be CONFIDENTLY WRONG. Pure model self-judgment can loop on a non-issue or pass slop. Deterministic checks catch objective failures cheaply; a DIVERSE panel (not one model) is harder to fool; the retry cap + park-and-flag prevents infinite loops and silent slop. Autonomy = self-correcting with BOUNDED loops and graceful failure, not blind trust in one opinion.
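
The inner loop and the safety valve described above can be sketched as a bounded retry around a deterministic audit. This is a minimal sketch under stated assumptions: `StageResult`, `check_render`, `run_stage`, the metadata keys, the 0.5 s tolerance, and the retry cap of 3 are all hypothetical names and values, and the model-judgment half of the audit is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StageResult:
    ok: bool
    output: Optional[dict] = None
    defects: List[str] = field(default_factory=list)
    parked: bool = False  # True means: logged and flagged for async review


def check_render(meta: dict, edl_runtime_s: float, tol_s: float = 0.5) -> List[str]:
    """Deterministic checks: file exists, duration matches the EDL, not all-black."""
    defects = []
    if not meta.get("exists"):
        defects.append("render file missing")
    elif abs(meta.get("duration_s", 0.0) - edl_runtime_s) > tol_s:
        defects.append("duration mismatch")
    if meta.get("black_ratio", 0.0) >= 1.0:
        defects.append("all-black output")
    return defects


def run_stage(attempt: Callable[[int, List[str]], dict],
              audit: Callable[[dict], List[str]],
              max_retries: int = 3) -> StageResult:
    """Retry a stage until its audit is clean, then park instead of looping forever."""
    defects: List[str] = []
    for n in range(max_retries):
        output = attempt(n, defects)  # the previous defects inform the next attempt
        defects = audit(output)
        if not defects:
            return StageResult(ok=True, output=output)
    # Retries exhausted: no shipping, no waiting, just a named defect to review.
    return StageResult(ok=False, defects=defects, parked=True)


if __name__ == "__main__":
    def fake_render(n: int, prior: List[str]) -> dict:
        return {"exists": True, "duration_s": 600.0 if n >= 1 else 590.0, "black_ratio": 0.0}

    print(run_stage(fake_render, lambda out: check_render(out, 600.0)))
```

The caller decides what to do with a parked result; in this sketch the pipeline would log `defects` and move on to the next job.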
OPERATOR ROLE = ASYNC REVIEW, never a blocking gate. (Supersedes the old "show operator the EDL BEFORE render" hard gate — operator directive 2026-05-30: 100% autonomous. The EDL is still LOGGED for review; reviewing it is optional/async.)
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
For the role-models to reason well, lift production quality, and SELF-RESOLVE (§0.6) without the operator, the CLI exposes a tool layer they can invoke. A role-model should never be stuck for lack of a tool it could have called.
- SearxNG search (localhost:8080) — SOTA/technique verification, LOCATION/landmark enrichment (identify a sign/place to caption it right), music/caption-trend reference. Local instance → offline-capable; a strict-offline run skips external fetch and leans on substrate only.
- ComfyUI (localhost:8188) — generative fill: Wan2.2 B-roll for gaps, Flux2 title backgrounds, Stable Audio 3 music/SFX. (Qwen3-TTS voiceover exists in a workflow but is DEFERRED — roadmap §E.)
- ffprobe / ffmpeg — deterministic audits (duration, black-frame, orientation) + frame extraction for vision.
- Deliberation chain (deliberate.py) — escalation for hard/ambiguous calls (autonomy support, NOT the operator).
- RAG over substrate (nomic-embed) — index this handbook + notes + casting + FINDINGS so a role-model LOOKS UP the answer to its own question instead of guessing or stalling. This is the core engine of §0.6 self-resolution.
- Whisper (transcribe / word-timestamps), Magika (byte-level file typing).

RULE: give the model the tool, CAP the calls (no infinite tool loops), LOG what it used (auditability).
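
The RULE above (give the tool, cap the calls, log the usage) could look roughly like the wrapper below. It is a sketch only: `ToolLayer`, `ToolBudgetExceeded`, and the cap of 8 calls are hypothetical, and the registered tool is a stand-in rather than a real ffprobe call.

```python
from typing import Any, Callable, Dict, List, Tuple


class ToolBudgetExceeded(RuntimeError):
    """Raised when a role-model asks for more tool calls than its cap allows."""


class ToolLayer:
    def __init__(self, tools: Dict[str, Callable[..., Any]], max_calls: int = 8):
        self.tools = tools
        self.max_calls = max_calls  # assumed cap; tune per role
        self.log: List[Tuple[str, tuple]] = []  # audit trail of every call made

    def call(self, name: str, *args: Any) -> Any:
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        if len(self.log) >= self.max_calls:
            raise ToolBudgetExceeded(f"cap of {self.max_calls} calls reached")
        self.log.append((name, args))  # log before running so failures are recorded too
        return self.tools[name](*args)


if __name__ == "__main__":
    layer = ToolLayer({"probe": lambda path: {"path": path}}, max_calls=2)
    print(layer.call("probe", "clip.mp4"))
    print(layer.log)
```

Hitting the cap raises instead of silently returning nothing, so the calling stage can treat it like any other audit failure and fall through to its retry or park path.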
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
INGEST → CONVERT → 360-STITCH → ANALYZE → SIDECAR → CONSOLIDATE(EDL) → EDITORIAL-CUT → ASSEMBLE+RENDER → OUTPUT

Each stage's artifacts and formats are defined in EDITING_FORMAT_SPEC.md. Notes are read-only inputs; everything numeric is additive sidecars.

- YYYYMMDD_HHMMSS.ext, group into day folders. HEIC→jpg twins kept (decoders can't read HEIC).
- ffmpeg -vf transpose=2 -map_metadata -1) — Chromium double-rotates otherwise (FINDINGS gotcha).
- -ar 16000 -ac 1 mono wav.
- v360=dfisheye:e:ih_fov=190:iv_fov=190 → equirectangular. (Old custom OpenCV remap is DEAD — SSIM 0.39.)
- dfisheye does NOT auto-level. Orientation VARIES PER CLIP — some stitch upright, some upside-down (camera remount). MUST detect + --flip (180° roll) the inverted ones BEFORE analysis, or half the footage is garbage.
- reframe360_director.py): sample frames, ask qwen3.6 where the subject is, pan toward it. Never aim the viewport at the lens seam.
- exit 0 ≠ correct. This rule exists because a seam/upside-down cut shipped once.

.notes.md (read-only, never regenerate/delete)
- think:false) → 6-field note: SCENE / STORY-VALUE / USAGE(A|B-roll) / QUALITY / TIMING / VERDICT(KEEP|MAYBE|SKIP). Append Whisper transcript for audio clips.

<media>.edl.json

The chain-verified gap: notes lack numbers a machine needs. Build per clip (EDITING_FORMAT_SPEC Artifact 1):
- ffprobe → duration, fps, w, h.
- Whisper in WORD-TIMESTAMP mode → words[] + segments[] (none are saved on disk → must re-run; this is additive, not note-regeneration). Derive silences[].
- hook_score (0–1) per candidate window = audio RMS-energy variance + speech-rate (cheap reels-ranking signal).
- Resolve each note's prose TIMING ("start at 'Alright…'") to numeric in_s/out_s by phrase-matching against words[].
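
The sidecar derivations listed above can be sketched in a few functions. The 0.6 s silence gap and the equal 0.5/0.5 weighting come from the roadmap's description of build_sidecars.py; everything else is an assumption for illustration: the word field names (`w`, `start`, `end`), the clamping of both terms to 0–1, the 4 words-per-second ceiling, and the exact-match phrase lookup.

```python
from statistics import mean, pstdev
from typing import Dict, List, Optional


def derive_silences(words: List[Dict], min_gap_s: float = 0.6) -> List[Dict]:
    """Gaps between consecutive words longer than min_gap_s become silences[]."""
    silences = []
    for prev, nxt in zip(words, words[1:]):
        if nxt["start"] - prev["end"] > min_gap_s:
            silences.append({"start": prev["end"], "end": nxt["start"]})
    return silences


def hook_score(rms: List[float], n_words: int, window_s: float,
               max_wps: float = 4.0) -> float:
    """0.5 * energy variation (RMS coefficient of variation) + 0.5 * speech rate."""
    avg = mean(rms) if rms else 0.0
    energy_var = min(pstdev(rms) / avg, 1.0) if avg > 0 else 0.0
    # Assumed normalisation: max_wps words per second maps to a full score.
    speech_rate = min((n_words / window_s) / max_wps, 1.0) if window_s > 0 else 0.0
    return 0.5 * energy_var + 0.5 * speech_rate


def _norm(token: str) -> str:
    return "".join(ch for ch in token.lower() if ch.isalnum())


def resolve_phrase(words: List[Dict], phrase: str) -> Optional[float]:
    """Return the start time of the first exact match of phrase in words[]."""
    target = [t for t in (_norm(p) for p in phrase.split()) if t]
    if not target:
        return None
    toks = [_norm(w["w"]) for w in words]
    for i in range(len(toks) - len(target) + 1):
        if toks[i:i + len(target)] == target:
            return words[i]["start"]
    return None


if __name__ == "__main__":
    words = [{"w": "Alright,", "start": 1.0, "end": 1.4},
             {"w": "coffee", "start": 1.5, "end": 1.9},
             {"w": "time", "start": 2.8, "end": 3.1}]
    print(derive_silences(words), resolve_phrase(words, "Alright…"))
```

Exact matching will miss a phrase that Whisper transcribed slightly differently from the note; a fuzzy match would be the natural next step, and a `None` result is the signal that the note's TIMING could not be resolved.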
day-NN.edl.json

A local model reads notes + sidecars and emits ONE EDL (EDITING_FORMAT_SPEC Artifact 2): longform.clips[] and
reels.picks[]. AT THIS STEP it RE-CHECKS each clip's role/verdict against the transcript and OVERRIDES bad tags
(this is where the A/B-roll mislabel gets fixed). Then apply the EDITORIAL framework below to set the actual in/out, order, transitions, overlays.
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
This is how a professional decides the cut. The model applies these rules to the notes+sidecars.

- silences[]), dead air, false starts. Land in/out on word boundaries from words[]. Use jump cuts to compress a long monologue (vlog staple).
- hook_score, hook in the first 0.5–3s, bold animated captions, a payoff that makes it loop. One vertical export serves YouTube Shorts + IG Reels + TikTok. [opus.pro, captions, virvid 2026]

THE NORTH STAR: 90% of a human editor's quality. Below that is slop. We do NOT beat a human on per-video speed — we win on THROUGHPUT (24/7 local, 192GB RAM / 24GB VRAM, hundreds of videos). Spend that advantage HERE: a human edits a video once; the machine renders, JUDGES its own output, finds the slop, and RECUTS — N passes, free, overnight. That loop closes the last 10–20%. Without it, automation produces slop.

The loop (after every render, before ship):
1. Sample the output: extract frames across the timeline + read the audio/caption track.
2. Judge against the §8 rubric (vision model + checklist): Is there a hook in the first 5–15s? Any dead air / left-in pauses / rambling? Any HARD ERROR — upside-down or seam-warped 360, black frames, wrong reframe, caption outside the safe zone, dissolve on dialogue, jarring jump? Is the arc coherent? Is the pattern-interrupt cadence (~45–60s) met? Reels: hook in 0.5–3s, self-contained, loops?
3. Score + locate: produce a 0–100 score and SPECIFIC defects tied to timecodes ("dead air 2:14–2:19", "clip 7 upside down").
4. Gate: any hard error OR score <90 → emit recut instructions → revise the EDL/assembly → re-render → loop. Escalate borderline/taste calls to the deliberation chain.
5. Cap + log: bound passes (e.g. 5); if still <90, surface to the operator with the remaining defects named — never ship slop silently, never claim 90 without the frame-level check (HARD RULE #1).

This is the inverse of one-shot generation: the model is allowed to be wrong on pass 1 because it CRITIQUES and fixes itself. The irreducible ~10% gap (taste, emotional beats, "which moment is the one") is narrowed here by hook_score + the critique loop + the chain — narrowed, not erased.
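
The gate and cap steps of the loop above might be sketched as follows. The 90 bar, the hard-error rule, and the cap of 5 passes come from the text; the `Verdict` shape, the function names, and in particular the use of the panel median as the combined score are assumptions, since the handbook does not say how panel scores are aggregated.

```python
from dataclasses import dataclass
from statistics import median
from typing import Callable, List


@dataclass
class Verdict:
    judge: str               # which panel model produced this verdict
    score: int               # 0-100 against the rubric
    hard_errors: List[str]   # e.g. "clip 7 upside down"
    defects: List[str]       # timecoded soft defects, e.g. "dead air 2:14-2:19"


def gate(verdicts: List[Verdict], bar: int = 90) -> dict:
    """Ship only if no judge reports a hard error and the panel score clears the bar."""
    if not verdicts:
        raise ValueError("panel returned no verdicts")
    hard = [e for v in verdicts for e in v.hard_errors]
    score = median(v.score for v in verdicts)  # assumed aggregation rule
    recut = hard + [d for v in verdicts for d in v.defects]
    return {"ship": not hard and score >= bar, "score": score, "recut": recut}


def qc_loop(render_and_judge: Callable[[], List[Verdict]],
            recut: Callable[[List[str]], None],
            max_passes: int = 5) -> dict:
    """Render, judge, recut; stop at the first pass that clears the gate or at the cap."""
    result: dict = {}
    for n in range(1, max_passes + 1):
        result = gate(render_and_judge())
        if result["ship"]:
            return {"status": "ship", "passes": n, **result}
        if n < max_passes:
            recut(result["recut"])  # push the named defects back into the EDL
    return {"status": "flag", "passes": max_passes, **result}


if __name__ == "__main__":
    panel = [Verdict("a", 95, ["clip 7 upside down"], []), Verdict("b", 93, [], [])]
    print(gate(panel))
```

A `flag` status carries the remaining defects with it, which matches the rule that a cut below the bar is surfaced with its defects named rather than shipped.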
═══════════════════════════════════════════════════════════════════
═══════════════════════════════════════════════════════════════════
1. Verify by LOOKING. Extract a frame from any stitched/reframed/rendered output and look. exit 0/HTTP-200 ≠ correct.
2. Notes are read-only. Never regenerate or delete .notes.md. Add sidecars instead.
3. Serial inference. One Ollama model at a time; check api/ps before every dispatch. Vision on GPU is a short burst — don't run ComfyUI at the same time.
4. No frontier. Local only. SearxNG for research; re-verify any tool/SOTA claim with a live search — training is stale.
5. The cut serves retention + story, not completeness. Cutting good footage is correct if it doesn't earn its place.
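Rule 3 (check api/ps before every dispatch) could look like the sketch below. It assumes Ollama is listening on its default local address and that the `/api/ps` response carries a `models` list whose entries have a `name` field. The helper names are hypothetical, and treating "only the model we are about to call is loaded" as clear is an assumption of this sketch, not something the rule states.

```python
import json
from urllib.request import urlopen

OLLAMA_URL = "http://localhost:11434"  # assumption: default local Ollama address


def loaded_models(ps_payload: dict) -> list:
    """Names of the models currently loaded, from a parsed /api/ps response."""
    return [m.get("name", "") for m in ps_payload.get("models", [])]


def clear_to_dispatch(ps_payload: dict, wanted: str) -> bool:
    """True when nothing is loaded, or only the model about to be called."""
    return all(name == wanted for name in loaded_models(ps_payload))


def fetch_ps(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> dict:
    """GET /api/ps and parse the JSON body."""
    with urlopen(f"{base_url}/api/ps", timeout=timeout) as resp:
        return json.load(resp)
```

A dispatcher would call `clear_to_dispatch(fetch_ps(), model_name)` and wait or unload when it returns False.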
- FINDINGS.md: deep technical detail, exact commands, the running log + gotchas (the appendix).
- EDITING_FORMAT_SPEC.md: artifact schemas (sidecar, EDL, human script, reels sheet) + platform targets.
- DAY_VIDEO_PLAN.md: the locked high-level process.

Authored 2026-05-29 from the 6-agent chain verdict (deliberation: notes-sota-sufficiency, 6/6 unanimous).
Governs the FORMAT of every editing file so every day follows the same shape. Sits UNDER the process in
DAY_VIDEO_PLAN.md (which is locked) and answers to FINDINGS.md. This file does not change the process;
it pins down the exact schema of each artifact the process produces.
This spec is footage-AGNOSTIC. The bar: a LOCAL LLM, given ANY clip from ANY trip — not just this one — can follow these templates to analyze it, organize the day, and edit a final production: BOTH a full-day cut AND SEVERAL reels per day, end-to-end, with no human hand-tuning. Day 25 (this trip) is the FIRST TEST INSTANCE of the general system, not the scope. Nothing in the schema or process may hardcode this trip, these locations, or these filenames. If a template only works because it "knows" this trip, it is wrong and must be generalized.
The existing per-clip .notes.md (Nemotron-Omni + Whisper) are SOTA-sufficient for a HUMAN editor (full-day and
reels) but INSUFFICIENT for an LLM editor, because they carry the SEMANTIC layer (scene, story, role, verdict,
transcript) but NOT the NUMERIC/TEMPORAL layer a renderer needs: numeric in/out, word-level timestamps, per-clip
duration, and a best-moment ranking. Fix = an ADDITIVE numeric SIDECAR per clip. The .notes.md are NEVER modified,
regenerated, or deleted.
12.40). Frame = round(seconds * fps). mm:ss is for HUMAN views only.
.notes.md are read-only inputs. No artifact here writes to them.
LONG-FORM (YouTube; 16:9, 1920x1080, H.264 MP4, 30fps):
- A single day's vlog targets ~8–15 min. >=8 min unlocks mid-roll ads; 10–20 min is the storytelling sweet spot ONLY while retention holds >50%. Retention beats raw length: the long-form cut MUST remove dead air / pauses / rambling (this is why word-timestamps are required) to keep the curve up. Default day target: ~8–12 min.
REELS / SHORTS (ONE 9:16 vertical export serves all three platforms; post unchanged to each):
- YouTube Shorts: 9:16, 1080x1920, <=180s hard max; top performers 25–50s.
- Instagram Reels: 9:16, 1080x1920, up to ~3 min, 30fps, <=4GB.
- TikTok: 9:16, 1080x1920; optimal short (15–60s), minutes-long ceiling.
- STANDARD WE TARGET: 9:16, 1080x1920, 30fps, H.264 MP4, 20–60s (<=60s = safe + optimal across ALL three at once), with all caption/overlay text inside the SAFE ZONE, clear of the top, bottom, and right edges where platform UI (captions, buttons, handle) overlaps. Several such reels per day (2–5).
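The 9:16 target above implies some crop arithmetic when the source is 16:9. The sketch below computes the largest centred 9:16 window and an ffmpeg `crop`/`scale` filter string for it. A centred crop is an assumption made here for simplicity (a real reframe may follow the subject instead), and `center_crop_9x16` and `ffmpeg_vf` are hypothetical helper names.

```python
from typing import Tuple


def center_crop_9x16(src_w: int, src_h: int) -> Tuple[int, int, int, int]:
    """Return (crop_w, crop_h, x, y) for the largest centred 9:16 window."""
    if src_w * 16 >= src_h * 9:      # source is wider than 9:16: keep full height
        crop_h = src_h
        crop_w = src_h * 9 // 16
    else:                            # source is narrower than 9:16: keep full width
        crop_w = src_w
        crop_h = src_w * 16 // 9
    crop_w -= crop_w % 2             # even dimensions, as 4:2:0 H.264 expects
    crop_h -= crop_h % 2
    x = (src_w - crop_w) // 2
    y = (src_h - crop_h) // 2
    return crop_w, crop_h, x, y


def ffmpeg_vf(src_w: int, src_h: int, out_w: int = 1080, out_h: int = 1920) -> str:
    """Build a crop-then-scale video filter string for the vertical export."""
    cw, ch, x, y = center_crop_9x16(src_w, src_h)
    return f"crop={cw}:{ch}:{x}:{y},scale={out_w}:{out_h}"
```

For a 3840x2160 source this yields a 1214x2160 window starting at x=1313, which is then scaled to 1080x1920. The one-pixel rounding to an even width changes the aspect ratio by well under 1%.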
<media>.edl.json (one per media file, lives beside the .notes.md)
The numeric layer the notes lack. Derived from assets on disk (ffprobe + Whisper word-timestamps).
{
"file": "CAM_20260525122958_0025_D.mp4",
"duration_s": 25.4,
"fps": 29.97,
"width": 3840, "height": 2160,
"has_audio": true,
"words": [ {"w": "Alright", "start_s": 0.12, "end_s": 0.46}, {"w": "so", "start_s": 0.46, "end_s": 0.58} ],
"segments": [ {"start_s": 0.12, "end_s": 6.80, "text": "Alright, so we're just planning our trip right now."} ],
"silences": [ {"start_s": 6.80, "end_s": 7.90} ],
"hook_score": 0.0,
"hook_reason": "",
"source_note": "CAM_20260525122958_0025_D.mp4.notes.md"
}
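The `silences` array in a sidecar like the one above is described as gaps over a threshold, derived from `words`. A minimal sketch of that derivation follows. The 0.5 s default threshold and the choice to count leading and trailing gaps are assumptions of this sketch, since the spec does not state them, and `derive_silences` is a hypothetical name.

```python
from typing import Dict, List


def derive_silences(
    words: List[Dict], duration_s: float, min_gap_s: float = 0.5
) -> List[Dict]:
    """Find gaps of at least min_gap_s between consecutive words."""
    silences = []
    cursor = 0.0  # end of the last spoken word seen so far
    for w in sorted(words, key=lambda w: w["start_s"]):
        if w["start_s"] - cursor >= min_gap_s:
            silences.append(
                {"start_s": round(cursor, 2), "end_s": round(w["start_s"], 2)}
            )
        cursor = max(cursor, w["end_s"])
    # Trailing gap between the last word and the end of the clip.
    if duration_s - cursor >= min_gap_s:
        silences.append(
            {"start_s": round(cursor, 2), "end_s": round(duration_s, 2)}
        )
    return silences
```

Taking the running maximum of `end_s` keeps overlapping word timestamps from producing negative or spurious gaps.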
- words / segments: Whisper word + segment timestamps (re-run; none were saved to disk). Enables pause/dead-audio cutting (full-day) and precise quote boundaries (reels).
- silences: gaps over a threshold, derived from words. The full-day cut removes these.
- hook_score (0–1) + hook_reason: computed reels-ranking signal (audio RMS energy variance + speech rate).
- duration_s / fps / width / height: ffprobe. Enables runtime summing and 9:16 reframe math.
day-<NN>.edl.json (THE source of truth; the LLM consolidates notes + sidecars into this)
{
"day": 25,
"date": "2026-05-25",
"location": "Raby Lake -> Sudbury -> Niagara",
"longform": {
"target_s": [480, 900],
"clips": [
{ "file": "...0025_D.mp4", "in_s": 0.12, "out_s": 24.0, "role": "A",
"overlay_text": "", "reason": "authentic kettle-vs-machine planning dialogue", "source_note": "...notes.md" }
],
"runtime_s": 0.0
},
"reels": {
"aspect": "9:16",
"target_s": [20, 60],
"count_target": [2, 5],
"picks": [
{ "file": "...0025_D.mp4", "in_s": 8.0, "out_s": 52.0, "caption": "van life coffee hack",
"reason": "highest hook_score; self-contained tip", "hook_score": 0.81 }
]
}
}
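A day EDL shaped like the example above can be checked before render. The sketch below sums the long-form runtime as (out_s - in_s) per clip and compares runtime, reel durations, and reel count against the `target_s` and `count_target` ranges stored in the EDL itself. The function names and the wording of the problem messages are hypothetical.

```python
from typing import Any, Dict, List


def longform_runtime_s(day_edl: Dict[str, Any]) -> float:
    """Sum of (out_s - in_s) over the long-form clips, in seconds."""
    clips = day_edl["longform"]["clips"]
    return round(sum(c["out_s"] - c["in_s"] for c in clips), 2)


def check_day_edl(day_edl: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the checks pass."""
    problems = []
    longform, reels = day_edl["longform"], day_edl["reels"]

    for i, clip in enumerate(longform["clips"]):
        if not clip["in_s"] < clip["out_s"]:
            problems.append(f"longform clip {i}: in_s must be < out_s")

    lo, hi = longform["target_s"]
    runtime = longform_runtime_s(day_edl)
    if not lo <= runtime <= hi:
        problems.append(f"longform runtime {runtime}s outside target {lo}-{hi}s")

    r_lo, r_hi = reels["target_s"]
    for i, pick in enumerate(reels["picks"]):
        dur = pick["out_s"] - pick["in_s"]
        if not r_lo <= dur <= r_hi:
            problems.append(f"reel {i}: {dur}s outside target {r_lo}-{r_hi}s")

    n_lo, n_hi = reels["count_target"]
    if not n_lo <= len(reels["picks"]) <= n_hi:
        problems.append(f"reel count {len(reels['picks'])} outside {n_lo}-{n_hi}")
    return problems
```

Run against the single-clip example, this reports two problems: a 23.88 s runtime that is far under the 480–900 s target, and one reel where 2–5 are expected.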
- role: A = dialogue/A-roll (KEEP the talking), B = B-roll/atmosphere.
- in_s/out_s: numeric, resolved from the sidecar (prose TIMING phrase -> matched against words).
- longform.runtime_s: sum of (out_s - in_s); shown to operator BEFORE render (DAY_VIDEO_PLAN rule).
- reels.picks: the SEVERAL best 30–60s (target 2–5 per day), each chosen by hook_score, each with a 9:16 reframe.
- suggestions[] (top-level, optional): enhancement IDEAS the editor leaves but does NOT auto-apply: {in_s, out_s, type(voiceover|broll|music|titlecard|sfx|crop), idea, rationale}. Non-blocking; the cut ships without them. A future generator stage can fulfill them (e.g. auto-voiceover via Qwen3-TTS; roadmap §E, DEFERRED). Example: {in_s:80, out_s:105, type:"voiceover", idea:"line on why the Big Nickel matters", rationale:"strong but silent B-roll: VO adds context + retention"}.
day-<NN>.human.md (GENERATED from the day EDL)
Readable cut list for a human editor: ordered table of # | clip | in–out (mm:ss) | role | overlay | why, plus the transcript snippet per A-roll clip, runtime total, thumbnail candidate. Never hand-edited.
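Generating the human cut list from the day EDL mostly comes down to converting decimal seconds into the mm:ss human view, alongside the Frame = round(seconds * fps) rule used for rendering. The following is a rough sketch under those conventions. `mmss`, `to_frame`, and `human_rows` are hypothetical names, and the row layout only mirrors the column order named above.

```python
from typing import Any, Dict, List


def mmss(seconds: float) -> str:
    """Decimal seconds -> m:ss, for HUMAN views only."""
    total = int(round(seconds))
    return f"{total // 60}:{total % 60:02d}"


def to_frame(seconds: float, fps: float) -> int:
    """Frame = round(seconds * fps), as the spec defines it."""
    return round(seconds * fps)


def human_rows(day_edl: Dict[str, Any]) -> List[str]:
    """Ordered table rows: # | clip | in–out (mm:ss) | role | overlay | why."""
    rows = ["# | clip | in–out (mm:ss) | role | overlay | why"]
    for n, c in enumerate(day_edl["longform"]["clips"], start=1):
        span = f"{mmss(c['in_s'])}–{mmss(c['out_s'])}"
        rows.append(
            f"{n} | {c['file']} | {span} | {c['role']} | "
            f"{c['overlay_text']} | {c['reason']}"
        )
    return rows
```

Because the file is regenerated from the EDL on every run, rounding to whole seconds in the human view loses nothing: the numeric in_s/out_s stay in the EDL.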
day-<NN>.reels.md (GENERATED from the day EDL reels block)
The chosen moments, one block PER reel (several per day): file, in–out (mm:ss), caption, 9:16 framing note, the quote each lands on.
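Choosing the reels by hook_score can be illustrated as a filter followed by a sort. This is only a sketch: it assumes each candidate moment already carries `in_s`, `out_s`, and `hook_score`, and that the duration window and count range are read from the EDL's `reels.target_s` and `reels.count_target`. `pick_reels` is a hypothetical name, and a real selection would also weigh the taste calls the critique loop handles.

```python
from typing import Dict, List, Sequence, Tuple


def pick_reels(
    candidates: List[Dict],
    target_s: Sequence[float],
    count_target: Sequence[int],
) -> Tuple[List[Dict], bool]:
    """Return (picks, enough): top candidates by hook_score within the window."""
    lo_s, hi_s = target_s
    min_n, max_n = count_target
    # Keep only moments whose duration fits the target window.
    fitting = [c for c in candidates if lo_s <= c["out_s"] - c["in_s"] <= hi_s]
    ranked = sorted(fitting, key=lambda c: c["hook_score"], reverse=True)
    picks = ranked[:max_n]
    return picks, len(picks) >= min_n
```

The boolean lets the caller surface a day that yields fewer reels than the minimum instead of silently padding it with weak moments.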
- .notes.md (semantic): read-only.
- <media>.edl.json per clip. Notes untouched.
- day-<NN>.edl.json (longform + reels).
- day-<NN>.human.md + day-<NN>.reels.md from the EDL.
- day-<NN>.edl.json literally.

Day 25 is the TEST. Same files, same shape, every subsequent day.
Authored 2026-05-29 (Claude Opus 4.8) from live SearxNG benchmark research (cited). The editor analog of the
governance canon/model-rijal.md + war-room Faith files: each role is cast to the model that benchmarks best at it,
so the multi-model panel (UNIVERSAL_EDITOR_HANDBOOK §0.5) uses each model in its lane and a synthesizer reconciles.
Re-verify with SearxNG before trusting — model leaderboards move weekly; this is a 2026-05-29 snapshot.
RULE: roles, not a free-for-all. Mechanical steps use ONE cast model. Judgment steps convene the relevant sub-panel, then the SYNTHESIZER reconciles. Serial inference only (api/ps clear before each dispatch).
Phase 1, FIT FILTER (categorical): for each role list ONLY models that can do it AT ALL, meaning the model has the required modality (vision for visual roles), enough context, structured-output ability, and fits THIS box (192GB RAM / 24GB VRAM) at usable speed. A model that can't see is not a visual candidate no matter how smart. Filter first.
Phase 2, RANK: among the fitters, pick the best by benchmarks AND operator hands-on experience; where the two disagree, operator experience OVERRIDES benchmarks. The table below is the current ranking; it must be re-run as a real sweep (see EDITOR_ROADMAP.md).
| # | Role | Primary | Backup / 2nd opinion | Evidence (2026-05-29 search) |
|---|---|---|---|---|
| 1 | Visual Analyst (per-clip notes: scene/quality/usage) | qwen3.6:27b (vision, think:false) | gemma4:31b | Qwen3.6 natively multimodal, "perception/multimodal reasoning far exceeds," tops VLM benchmarks; Reddit "Qwen's multimodal is next-level." Already the FINDINGS analysis model. |
| 2 | Visual QC (render upside-down? seam? black frames? safe-zone?) | qwen3.6:27b + gemma4:31b (two-vision vote) | mistral-small3.2:24b | Two independent vision models catch what one misses; the anti-slop gate (§8.5). |
| 3 | OCR / text-in-frame (read signs, legibility of on-screen captions) | gemma4:31b | mistral-small3.2:24b | Gemma 4 "excels at OCR, chart/document/UI understanding"; Mistral Small 3.2 strong OCR. |
| 4 | Story / Arc / Sequencing (the cut's narrative shape) | qwen3.6:35b (PROVEN) | nemotron-3-super · mistral-large = SUSPECT, smoke-test | Mistral Large benchmarks best for narrative BUT has run issues here (operator) → use proven qwen3.6:35b until mistral-large passes a smoke test. |
| 5 | Creative / Copy (titles, captions, hook lines) | qwen3.6:35b (PROVEN) | mistral-large = SUSPECT, smoke-test | Mistral Large is the benchmark creative leader but unproven-stable here; qwen3.6:35b proven. |
| 6 | Hook / Energy ranking (reels best-moment selection) | qwen3.6:35b | nemotron-3-super:latest | Strong multimodal+reasoning to read energy off transcript + frames; pairs with the sidecar hook_score. |
| 7 | Format / Spec compliance (EDL valid? schema? safe-zones?) | granite4.1:30b | laguna-xs.2:q4_K_M | Granite = governance/structured audit; laguna = code/structure review (the governance scanner seat). |
| 8 | Code generation (Remotion/render code from the EDL) | laguna-xs.2:q4_K_M (operator: > devstral) | qwen3.6:35b · devstral-small-2 | Operator finds laguna better than devstral for code; laguna also proven-on-box. Qwen3.6-35B "agentic coding" as backup. |
| 9 | Synthesizer / Director (reconcile the panel → final cut) | nemotron-3-super:latest | gpt-oss:120b | Nemotron-3-Super ≈ gpt-oss-120b / qwen-122b on reasoning + strong structured output; gpt-oss proven as orchestrator/planner (NVIDIA deep-research agent). |
| 10 | Long-context whole-day (hold ALL notes+sidecars at once) | qwen3.6:27b (1M ctx, PROVEN) | llama4 = UNVERIFIED here, smoke-test | Llama 4 has ultra-long ctx but may not run on this box (operator unsure) → qwen3.6 1M ctx is the proven default. |
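
The two-phase casting rule (fit filter first, then rank with operator experience overriding benchmarks) might be expressed roughly as follows. Everything named here is hypothetical: the `Candidate` fields, the single `benchmark` number, and the reading of "overrides" as "any operator-tested model sorts ahead of every untested one" are assumptions of this sketch, and the model names in the usage note are placeholders, not real results.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Candidate:
    name: str
    has_vision: bool
    fits_box: bool                       # runs on this machine at usable speed
    benchmark: float                     # hypothetical aggregate score, higher is better
    operator_rank: Optional[int] = None  # hands-on ranking, 1 = best; None = untested


def cast_role(candidates: List[Candidate], needs_vision: bool) -> List[Candidate]:
    """Phase 1: drop models that cannot do the role at all. Phase 2: rank the rest."""
    fitters = [
        c for c in candidates
        if c.fits_box and (c.has_vision or not needs_vision)
    ]

    def rank_key(c: Candidate) -> Tuple[int, int, float]:
        tested = c.operator_rank is not None
        # Operator-tested models sort first, by hands-on rank; then benchmark.
        return (0 if tested else 1, c.operator_rank if tested else 0, -c.benchmark)

    return sorted(fitters, key=rank_key)
```

With placeholder candidates, a text-only model with the top benchmark score never appears in a visual role's list, and a hands-on-proven model sorts ahead of an untested one with a higher benchmark. The first element is the primary and the second is the backup.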
deliberate.py chain. The 24/7 throughput advantage pays for the extra passes.
nomic-embed-*) = RAG vectors, not judgment. Tiny models (qwen3.5:2b/4b, granite:3b/8b, lfm2*, ministral-3:8b, rnj-1) = too weak for cut decisions; use only for trivial mechanical text. nemotron-cascade-2* = governance RAG. laguna doubles as format reviewer only. AnythingLLM = governance, not the editor.
Live SearxNG 2026-05-29: HuggingFace/Google Gemma-4 cards; Qwen3.6 blog + HF + MarkTechPost; Mistral docs + Small-3.2 HF; Meta Llama-4; r/LocalLLaMA "Best Local LLMs Apr 2026"; NVIDIA Nemotron-3-Super modelcard + arXiv 2604.12374.
End. Source docs live in C:\Users\marka\llama.cpp\dji_test\