COMPETITIVE_LANDSCAPE.md

COMPETITIVE LANDSCAPE — market products vs. what we're building

Compiled 2026-05-30 (Claude Opus 4.8) from live SearxNG research + GitHub deep-fetches (CutClaw, VideoAgent) + the 6-agent moat deliberation. HONEST assessment — not a pitch. Pairs with EDITOR_FRAMEWORK.md.

The 5 product categories on the market

A. Cloud clip-to-shorts SaaS (the crowded, funded tier)

Opus Clip, Klap, Vizard, Submagic, quso.ai, Reap, CapCut Long-to-Short.
- Does: long video → short highlight clips, auto-captions, reframe to 9:16.
- Arch: cloud, subscription, you UPLOAD your footage. Frontier/proprietary models server-side.
- Root problem they target: repurposing long content into shorts at volume.
- Where they fail (root causes): only CLIP (don't edit a full long-form video); generic output = the "AI slop" problem (YouTube's #1 2026 war); upload/privacy; per-video cost; limited context understanding (Opus Clip users on Reddit: "AI's understanding of context is still limited").

B. Transcript-based AI editors (assistive)

Descript, Captions, VEED Auto-Editor.
- Does: edit by editing the transcript; filler-word/silence removal; AI voiceover; templates.
- Arch: mostly cloud; human-in-the-loop (assistive, not autonomous).
- Targets: making editing accessible to non-editors.
- Fail: human still drives the edit (not autonomous); cloud; Descript itself is publicly fighting the "slop machine" label, i.e. the taste/QC gap is unsolved even by the category leader.

C. Agentic raw→finished editors (THE closest to us — and the key finding)

CutClaw, VideoAgent (HKUDS), OpenMontage, "Video Use", Poolday, Tellers.ai, Druid Cat.
- Does: multi-agent pipeline: deconstruct raw footage → caption → shot-plan → pick timestamps → validate → render. This is OUR architecture.
- Arch (THE CRITICAL POINT): they are FRONTIER-API ORCHESTRATORS. VideoAgent mandates Claude + GPT-4o + Deepseek + Gemini. CutClaw routes "core intelligence" to cloud via LiteLLM (OpenAI/Gemini/Claude). They run locally only for video DECODE; the brain is cloud.
- Targets: full autonomous editing.
- Fail: NOT private (footage/transcripts hit cloud APIs); NOT offline; NOT free (per-minute API cost; CutClaw's own stated bottleneck is API latency); cost compounds at scale (hundreds of videos).

D. Local / privacy editors (emerging, but narrow)

Reelify (Mac, no-upload), AetherCut, OpenCut, LTX Desktop.
- Does: local processing for privacy.
- Arch: runs on-device.
- Targets: privacy / no-upload (a real, named 2026 demand).
- Fail: only clip-to-shorts OR generation; NONE does full autonomous raw→finished long-form AND reels.

E. AI video GENERATORS (different category — a tool we USE, not a competitor)

Wan 2.2, LTX-2.3, Sora/Veo/Kling (cloud), Runway.
- Make NEW footage from text/image. We use these (via ComfyUI) as the polish/B-roll layer; they are not editors of existing footage.

The empty square in the market (= our lane)

Cross C (autonomous raw→finished) with D (fully local/private). That intersection is EMPTY. Every autonomous editor is cloud/frontier; every local tool is clip-only. We sit in the gap: autonomous raw→finished, full-day + several reels, 100% local / offline / private / free, multi-model panel + self-audit QC.

Root causes — theirs vs. our answer

| Root problem (industry) | Who suffers | Our answer |
| --- | --- | --- |
| AI SLOP (no taste/QC) | A, B, C (all) | anti-slop QC LOOP + 90% bar + DIVERSE panel + deterministic checks |
| Frontier-API dependence (cost/privacy/not-offline) | C (all of them) | 100% local, no frontier, offline |
| Upload / privacy | A, B, C | footage never leaves the box |
| Per-video cost at scale | A, C | free local, 24/7 throughput |
| Assistive, not autonomous | B | fully autonomous (§0.6, no human in the loop) |
| Hours-long context limits | C | pre-distilled notes + sidecars + RAG (not raw hours in context) |
| Clip-only (no full edit) | A, D | full-day long-form AND reels from one EDL |

HONEST SOTA verdict (where we are / are NOT state-of-the-art)

SOTA — yes, on ONE axis: the local-autonomous-full-edit combination. No shipping product does autonomous raw→finished, fully local, private, free, at scale. For the privacy/volume/cost buyer, nothing on the market matches it. Our architecture (multi-model panel + deterministic+diverse QC + substrate-RAG + agent-native render) is SOTA-aligned for that lane.
NOT SOTA — honestly: raw per-cut QUALITY. Frontier-backed tools (C, and A's proprietary models) produce a better single cut than local weights — the moat chain + research confirm a real capability gap (GPT-5.5/Opus 4.6 lead). We are also behind on MATURITY (they ship; we build), SPEED (local serial < cloud parallel), and POLISH/ease.
The honest synthesis: our PLAN/architecture is genuinely SOTA for the local-autonomous niche. Whether our OUTPUT is SOTA depends entirely on whether the QC LOOP closes the local-model quality gap to ≥90% — UNPROVEN until built. So: "SOTA plan for a real niche; output quality is the open question, and the QC loop is the bet."
Framing that's true: this is not "the universal editor product that beats funded teams." It is "the only fully-local autonomous studio" — and for the operator's use case (hundreds of private videos, free, offline, at scale), that is genuinely best-in-class because the alternatives structurally cannot do it.

Sources

Live SearxNG 2026-05-30 + GitHub: GVCLab/CutClaw, HKUDS/VideoAgent (both verified frontier-API-dependent); r/AIVideoCut (Opus Clip context limits); reelifyclips.com (local/no-upload); AetherCut/OpenCut/LTX-Desktop (privacy-local); YouTube CEO "AI slop" 2026 priority; Descript "slop machine" interview; moat deliberation (TEMP/deliberate/editor-moat-vs-competitors).


| Global framework | Editor framework | File |
| --- | --- | --- |
| Scripture (CLAUDE.md): constitution | North-star + Autonomy contract | `project_editor_north_star` (memory) + HANDBOOK §0.6 |
| Practice (how to operate) | Pipeline procedure + editorial craft | `UNIVERSAL_EDITOR_HANDBOOK.md` |
| Data model | Artifact schemas + platform targets | `EDITING_FORMAT_SPEC.md` |
| Faiths (role identities) | Model casting (role→model) | `EDITOR_MODEL_CASTING.md` |
| Hooks (structural self-enforcement) | The 3 self-correcting loops (per-stage audit · QC "train" · safety valve) | HANDBOOK §0.6 |
| Tools | Tool-access layer (SearXNG, ComfyUI, RAG, ffprobe, chain, ALL render engines) | HANDBOOK §0.7 |
| STATE.md (continuity) | Build status + ordered roadmap + deferred | `EDITOR_ROADMAP.md` |
| The deliberation chain | The editor panel + `deliberate.py` escalation | HANDBOOK §0.5 + CASTING |
| Deep log / appendix | Technical findings, exact commands | `FINDINGS.md` |
| Read-first / session entry | Current state | `SESSION-HANDOFF.md` |
| Locked high-level process | The job, locked | `DAY_VIDEO_PLAN.md` |

EDITOR_ROADMAP.md

EDITOR ROADMAP — build status + ordered upgrade plan (resume point)

Updated 2026-05-29 (Claude Opus 4.8). PURPOSE: a fresh instance reads THIS to know exactly what's built, what's pending, and the next procedures/upgrades — so "finish our upgrades and finalize the editor" is unambiguous and nothing is lost between instances. Pair with UNIVERSAL_EDITOR_HANDBOOK.md (how) + EDITING_FORMAT_SPEC.md (formats) + EDITOR_MODEL_CASTING.md (who).

SHIP-FIRST DISCIPLINE (operator 2026-05-30): the goal is producing VIDEOS soon. Do NOT over-engineer for generality. The engine is domain-agnostic by clean-module HYGIENE (same router/panel/QC-loop, swap Faith files per domain = the warroom pattern), but generality is a FREE BYPRODUCT, not a goal that delays the first video. Fastest path = MVP spine (B0 core + B1 + B2 + B3 + agent-native render) → ONE Day-25 video → iterate. Do NOT rabbit-hole on fold-in / warroom / meta-architecture.

BUILD METHOD (operator 2026-05-30): STAGE-BY-STAGE, QUALIFY-THEN-CEMENT. Build each stage as a STANDALONE script → run it on Day-25 → QUALIFY (verify the output is good by LOOKING / measuring) → only THEN cement it into the CLI as a stable module. The CLI shell grows incrementally around proven stages. Do NOT build the whole CLI upfront then rework + bug-fix each stage. Each cemented stage is frozen-good; the next builds on PROVEN output. (The qualify step is the human-verified seed of that stage's automated per-stage audit, HANDBOOK §0.6.) QUALIFY = INDEPENDENT validation, not the builder's self-report. The model/agent that BUILT a stage must NOT be the one that qualifies it (builder-judges-own-work = the weak same-tribe validation our framework distrusts). A DIFFERENT validator checks the output across MULTIPLE clips + edge cases (e.g. for sidecars: word-timestamp accuracy vs notes, silence detection, hook-score calibration, silent/360 clips, schema). Validate the FINAL artifact, not a throwaway intermediate. Only after independent validation passes is a stage "cemented." So B0 is NOT "build the CLI first" — it is "grow the CLI shell as B1→B-render stages qualify." Order of qualification: Whisper-large → sidecars → EDL → render → QC.

═══════════════════════════════════════════════════════════════════

ARCHITECTURE — the deliverable is a STANDALONE CLI, not docs, not Claude

═══════════════════════════════════════════════════════════════════

THE PRODUCT is `editor`, a standalone CLI that runs the whole pipeline on LOCAL models only (Ollama + ffmpeg + Remotion + ComfyUI), OFFLINE, with NO Claude/frontier in the loop. This is what makes the editor headless + 24/7 + free, which the north-star requires. Without it, every video needs Claude, which is not autonomous.

SEPARATION OF ROLES:
- Claude (frontier) = the MECHANIC: builds, researches (SearxNG, web), and UPGRADES the editor. Needed to improve HOW videos are made, not to make them.
- editor CLI = the PRODUCTION LINE: runs analyze→…→QC-loop on local models. `editor run --day <folder>` → walk away → EDL + cuts. Subcommands per handbook stage (analyze, sidecar, edl, render, qc); run = full pipeline; --watch = auto-process new footage 24/7. Hard edge-cases are FLAGGED for operator/Claude, not blocked on them.
- The UNIVERSAL_EDITOR_HANDBOOK is the CLI's SPEC; EDITOR_MODEL_CASTING is its CONFIG; EDITING_FORMAT_SPEC is its data model.

CURRENT STATE: pipeline exists as SEPARATE scripts (vanlife-notes.py, osv_stitch.py, …) orchestrated by Claude conversationally; NOT yet unified or headless. Unifying them under one offline editor CLI is the keystone build (B0).

A. BUILT (done — verify on disk before trusting)

═══════════════════════════════════════════════════════════════════

DOCS (this session):
- [x] DAY_VIDEO_PLAN.md: locked process.
- [x] EDITING_FORMAT_SPEC.md: artifact schemas + platform targets.
- [x] UNIVERSAL_EDITOR_HANDBOOK.md: full analyze→assemble procedure + editorial framework + anti-slop QC loop.
- [x] EDITOR_MODEL_CASTING.md: role→model casting (two-phase methodology; operator overrides).
- [x] SESSION-HANDOFF.md: current-state read-first.
- [x] FINDINGS.md: deep tech log/appendix.

SCRIPTS (prior sessions, per FINDINGS file map; confirm they run):
- [x] vanlife-notes.py: per-clip vision+Whisper notes (qwen3.6:27b GPU think:false; Whisper medium.en).
- [x] osv_stitch.py: DJI 360 dual-fisheye → equirect (ffmpeg v360=dfisheye).
- [x] reframe360_director.py: VLM-guided 360→16:9.
- [x] comfy_client.py: ComfyUI bridge (needs API-format workflows).

DELIBERATION:
- [x] notes-SOTA chain (6/6): notes are human-sufficient, LLM-insufficient → additive sidecar is the fix.
- [x] stop-hook + governance hook fixes (this session).

DATA:
- [x] Day-25 .notes.md exist.
- [~] A Day-25 cut was shipped previously (engine = see OPEN QUESTIONS).

B. PENDING UPGRADES & BUILD — DO IN THIS ORDER (this is "finish our upgrades")

═══════════════════════════════════════════════════════════════════

0. [ ] THE editor CLI (KEYSTONE): unify the separate stage scripts into ONE standalone, offline CLI that runs the whole handbook pipeline on local models with no Claude in the loop. Subcommands per stage + run (full) + --watch (24/7). This is the thing that makes the editor headless/autonomous; everything else (B1–B10) plugs into it.
   - ADAPT FROM C:\warroom (scanned 2026-05-30), a mature local-no-frontier governance CLI; reuse the spine, don't rebuild it: core/router.py (CPU-serialized priority queue = serial-inference discipline) · config/roster.yaml + core/facilitator.py (config-driven SEAT ASSIGNMENT = our EDITOR_MODEL_CASTING as YAML) · core/watchdog.py (sliding-window repetition detector = our park-and-flag safety valve) · clients/ollama.py + clients/searxng_client.py (REST wrappers, no LiteLLM, estimate_inference_timeout() scales timeout to prompt size) · Typer single-file + REPL (editor process x one-shot AND editor→/process interactive) · core/hierarchy_enforcer.py + core/guardrails.py (hard-gate vs soft-form = our deterministic-checks layer) · core/faith_builder.py (two-tier universal+project roles).
   - HEED: warroom streaming is STUBBED (must implement for watchdog loop-detection); SearXNG host can be flaky (use mDNS); intra-role loop ceiling = 2 then escalate.
   - NET: the orchestration spine exists → editor CLI = adapt warroom core/clients/config + add the VIDEO stages. Cuts a real chunk off B0.
   - CAVEAT (operator 2026-05-30): warroom is STALE (NOT updated in >1 month) and our CURRENT SearXNG/search tooling is already MORE ADVANCED than warroom's. Treat warroom as SUGGESTIONS TO VERIFY, NOT a gold standard. Re-validate every pattern against current substrate + current tools before reuse; do NOT copy its searxng client, and do NOT assume its model roster / tool / host choices are current. Cherry-pick the architectural ideas (serial router, config-driven seats, watchdog, REPL); verify the specifics fresh.
   - VERIFIED 2026-05-30 (parallel agent cross-check vs OUR current tooling). ADOPT:
     - (1) watchdog.py RepetitionDetector/NgramLoop: fills a REAL GAP (we have NO live loop detection; rely on 9h wall-clock timeouts). Highest value, stdlib, lifts cleanly.
     - (2) ollama.py estimate_inference_timeout + host-normalize + token-preflight + typed errors: PORT TO NODE (.mjs per canon), replaces our flat timeout=32768.
     - (3) faith_builder.py two-tier role assembly + provenance + inheritance-check (rewrite its antipattern corpus for editor roles).
     - (4) roster.yaml tiered-seat YAML schema (strip ALL frontier/paid seats).
     - (5) hierarchy_enforcer credential/PII redaction gate (narrow, domain-agnostic).
     - (6) router.py priority queue: only if the CLI fields concurrent work; overkill for a fixed serial chain.
   - IGNORE (OUR SUBSTRATE ALREADY SURPASSES / CONFLICTS): warroom searxng_client (ours = SearXNG+Jina+MCP web_url_read = full-content read, not snippets; warroom host stale); roster CASTING CONTENT (mistral-large / qwen3.5:122b / Claude paid_api are dropped/forbidden by our casting + no-frontier); num_gpu=0 hardwire + keep_alive=-1 resident model (conflict with our partial-GPU num_gpu=14/50 + serial discipline).
   - BUILD THE CLI IN NODE, not warroom's Python/Typer.
1. [~] Whisper large-v3: DOWNLOADING 2026-05-30 (greenlit; bandwidth/space are NOT constraints, per operator). medium.en already does word-timestamps and built/qualified sidecars (B2 = the working baseline). large-v3 = accuracy upgrade. ON DOWNLOAD COMPLETE: point build_sidecars.py whisper model at ggml-large-v3.bin + REGENERATE the day's sidecars (cheap, overwrites the .edl.json). NOTE: bandwidth/space are unlimited; never gate a needed download/program again.
2. [~] Sidecar pipeline: BUILT + builder-self-checked (NOT yet INDEPENDENTLY validated → not cemented). C:\Users\marka\llama.cpp\build_sidecars.py (domain-agnostic; any clip or --dir). ffprobe (duration/fps/w/h/has_audio) + whisper-cli medium.en -ojf -sow word+segment timestamps + derived silences (>0.6s gaps) + hook_score = 0.5 × energy_var (RMS-CV) + 0.5 × speech_rate. Emits <media>.edl.json; NEVER touches .notes.md (mtimes verified). QUALIFIED on 2 Day-25 clips (coffee: 383 words, hook 0.77; bear: 75 words, hook 0.48; word-timestamps spot-checked aligned vs notes). ~8–27s/clip GPU. NEXT (cement): run python build_sidecars.py --dir "E:/vanlife/may 2026/25" for the full day.
3. [ ] EDL consolidator: replace vanlife-editplan.py prose with a model that reads notes+sidecars and emits day-NN.edl.json (EDITING_FORMAT_SPEC Artifact 2: longform + reels[] + suggestions[] = timestamped enhancement ideas like "voiceover 1:20–1:45 about X", left but not auto-applied). MUST re-check & override each clip's role/verdict against its transcript (fixes mislabels). Apply the §8 editorial framework to set in/out/order/transitions/overlays. Cast: synthesizer = nemotron-3-super; story = mistral-large.
4. [ ] Markdown renderers: generate day-NN.human.md + day-NN.reels.md from the EDL (Artifacts 3 & 4).
5. [ ] Remotion assembly generalized: render any day's EDL literally (clips/in-out/transitions/overlays/music). Confirm engine first (OPEN QUESTION). Show operator the EDL+runtime BEFORE render.
6. [ ] Anti-slop QC loop (Handbook §8.5): render → sample frames+audio → multi-model panel judges vs §8 rubric → score + locate defects → recut if hard-error or <90 → loop (cap ~5) → never ship <90 silently. THE 90% earner.
7. [ ] Magika into ingest: wire E:/Scripts/magika for byte-level file-type detection (any-content robustness).
8. [ ] Lean panel wiring: implement the 5-model default pipeline from EDITOR_MODEL_CASTING.md (serial; api/ps gate).
9. [ ] Empirical casting sweep: run the TWO-PHASE sweep (fit-filter → rank) on all 33 local models for each role, recording operator-style hands-on results (not just benchmarks). Update EDITOR_MODEL_CASTING.md with measured winners. FIRST: smoke-test the SUSPECT heavies: mistral-large (73GB, run-issue history), llama4 (67GB, unverified here), gpt-oss:120b (65GB), mistral-medium (80GB): does each LOAD + respond + stay stable on this box? Cast only survivors; otherwise keep the PROVEN anchors (qwen3.6, nemotron-3-super, granite4.1, gemma4, laguna). Operator findings so far: dense qwen3.6 > qwen3.5:122b; laguna > devstral; mistral-large & llama4 suspect.
10. [ ] Generalize beyond this trip: confirm every script is footage-agnostic (the GENERAL bar); Day-25 is the test.
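The watchdog entry in the list above describes a sliding-window repetition detector for catching a model that loops on its own output. A minimal sketch of that idea is shown here in Python for readability, even though the roadmap calls for a Node build. The class name echoes the one mentioned above, but the parameters (`n`, `window`, `max_repeats`) and the `feed` method are illustrative assumptions, not warroom's actual interface.

```python
from collections import Counter, deque


class RepetitionDetector:
    """Flags a token stream that keeps repeating the same n-gram.

    Hypothetical sketch: n, window and max_repeats are assumed defaults,
    not values taken from warroom's watchdog.py.
    """

    def __init__(self, n: int = 4, window: int = 64, max_repeats: int = 3):
        self.n = n
        self.max_repeats = max_repeats
        # Only the most recent `window` tokens are kept (the sliding window).
        self.tokens = deque(maxlen=window)

    def feed(self, token: str) -> bool:
        """Add one streamed token; return True if a loop is visible in the window."""
        self.tokens.append(token)
        toks = list(self.tokens)
        if len(toks) < self.n:
            return False
        # Count every n-gram currently inside the window.
        grams = Counter(
            tuple(toks[i:i + self.n]) for i in range(len(toks) - self.n + 1)
        )
        return max(grams.values()) >= self.max_repeats
```

In a streaming client, `feed` would be called once per generated token, and a `True` result would abort the generation and hand the job to the park-and-flag path. This only works if streaming is actually implemented, which the HEED note above says is still stubbed in warroom.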

C. DEFINITION OF DONE — "finalize the editor"

═══════════════════════════════════════════════════════════════════

Running the standalone OFFLINE editor CLI (no Claude in the loop) on ANY day folder executes ingest→convert→stitch→analyze→sidecar→EDL→editorial→assemble→QC-loop with NO hand-holding and produces a full-day cut (8–15 min, retention-shaped) + several reels (9:16, 20–60s) at ≥90% human quality, self-critiqued until it clears the bar, with the EDL shown to the operator before render. Items B0–B10 complete.

D. OPEN QUESTIONS (operator/decision needed — do not silently pick)

═══════════════════════════════════════════════════════════════════

- Assembly engine, CORRECTED 2026-05-30 (prior "Remotion default" was DRIFT): the CLI has access to ALL engines, but the choice is by AGENT-GENERATION reliability since a LOCAL MODEL generates the assembly. React/TSX (Remotion) is the MOST error-prone for a model to emit → HyperFrames (HTML) = LEAD (agent-native, installed, shipped Day-25); Rendervid (JSON) = alt (our EDL is already JSON); Remotion = demoted (constrained template only, not free React). TODO (B5): run the empirical bake-off (same EDL → HyperFrames vs Rendervid) to lock the lead; also CORRECT DAY_VIDEO_PLAN (it still says "render in Remotion", which is wrong for agent-generation per operator + FINDINGS).
- large-v3 download is a ~3GB external fetch; operator may want to OK the bandwidth/disk; otherwise proceed.

E. DEFERRED FUTURE (post-MVP — do NOT start here; would delay the end product)

═══════════════════════════════════════════════════════════════════

- Auto-voiceover / voice cloner: automatically fulfill suggestions[].type=="voiceover" by generating the VO line and laying it in. Building blocks EXIST (Qwen3-TTS already wired in a ComfyUI workflow; voice_ref work done), but integrating it is POST-MVP. Ship the autonomous editor first; it LEAVES voiceover suggestions now, a later generator stage fulfills them.
- Also deferred: agentic/non-linear editing framework; multi-frame temporal analysis; advanced 360 horizon auto-level.

RULE: deferred items must NOT block B0–B10. MVP = a working autonomous editor that leaves suggestions; generators come after.


UNIVERSAL_EDITOR_HANDBOOK.md

UNIVERSAL EDITOR HANDBOOK — analyze → assemble, end to end

Authored 2026-05-29 (Claude Opus 4.8). The ONE procedure a cold instance follows to turn raw footage into professional output. Distilled from FINDINGS.md (deep technical detail + gotchas), EDITING_FORMAT_SPEC.md (artifact formats + platform targets), and DAY_VIDEO_PLAN.md (the locked process). Editorial-craft sections are grounded in live 2026 web research (cited inline), not memory.

THE BAR: footage-agnostic. A LOCAL model, given ANY clip from ANY shoot, follows this to produce a full-day long-form cut AND several reels — professional grade — with no human hand-holding. This trip is the first test, not the scope.

0. TOOLBOX (what's installed — verify paths in FINDINGS.md before use)

═══════════════════════════════════════════════════════════════════

- ffmpeg: convert, frame-extract, 360 stitch (v360=dfisheye), reframe, mux. The workhorse.
- Vision analysis: qwen3.6:27b on GPU via Ollama, think:false, images ≤1024px (~115–230s/frame). NOT Nemotron-Omni on CPU (0.5 tok/s, non-viable; FINDINGS #1).
- Audio: whisper.cpp (CUDA). INSTALLED on disk: ggml-medium.en (1.5GB); this is what the pipeline references TODAY (verified 2026-05-29). RECOMMENDED UPGRADE: ggml-large-v3 (~3GB, free), with more accurate transcripts, better names/places, fewer hallucinations; NOT yet downloaded. WORD-timestamps (not yet generated) are required for editing (see Stage 5).
- File-type detection: Magika (E:/Scripts/magika, Google ML detector), which classifies ANY file by its bytes, not extension (Stage 1).
- Deliberation chain: 6 local models (gemma/qwen/laguna/granite/nemotron) via scripts/deliberate.py, for architecture/SOTA calls.
- Generative (ComfyUI, port 8188): Flux2-Klein (titles/stills), Stable Audio 3 (music+SFX), Wan 2.2 (B-roll), via comfy_client.py (needs API-format workflows).
- Assembly (FULL ACCESS: all engines reachable; choose by AGENT-GENERATION reliability, NOT raw capability): because a LOCAL MODEL generates the assembly, prefer formats a model emits reliably. HyperFrames (HTML) = LEAD (agent-native, deterministic, installed, already shipped a Day-25 cut). Rendervid (JSON) = alt (agent-native; our EDL is already JSON). Remotion (React/TSX) = DEMOTED: React is the most error-prone for a model to generate; use only via a CONSTRAINED fill-in template, never free React generation. ffmpeg for direct ops. Settle HyperFrames-vs-Rendervid by the empirical bake-off (FINDINGS): same EDL into each, keep whichever the local model produces cleanly. (Corrected 2026-05-30: prior "Remotion default" was drift; React is wrong for agent-generation.)
- Research: mcp__searxng-mcp__searxng_web_search ONLY (operator constraint). NO frontier cloud models/tools.

0.5 MULTI-MODEL PANEL (how diverse local models combine into judgment)

═══════════════════════════════════════════════════════════════════

We have several local multimodal + reasoning models with DIFFERENT strengths (qwen3.6:27b & mistral-small3.2:24b vision; gemma/qwen/laguna/granite/nemotron reasoning; nemotron-omni). Diversity is the asset, but only if used right.

Role-assigned panel, NOT blind averaging. Each model judges the thing it is best at — one rates VISUAL quality, one the STORY/arc, one the AUDIO/transcript, one the HOOK/energy, one FORMAT/spec compliance — then a SYNTHESIZER reconciles (the Seat-3 pattern from the deliberation chain). Diversity + synthesis beats any single model; a naive average of all models regresses to mediocre. Give each model a role, then reconcile.
Where the panel is used: ambiguous editorial calls (Stage 7), the anti-slop QC judgment (Stage 8.5), and SOTA/architecture decisions (run the 6-agent deliberate.py chain). Mechanical steps use ONE model — don't convene a panel to trim silence.
Substrate is the precondition (operator's law): the panel AMPLIFIES substrate quality — it cannot rescue thin substrate. Garbage notes/sidecars → confident garbage verdict. Thorough, detailed, SOTA-VERIFIED substrate (this handbook, the notes, the sidecars) is what makes the panel produce greatness instead of consensus slop. Re-verify any "SOTA" claim with SearxNG before trusting it — training is months stale.
Serial only: on this box the panel runs models ONE AT A TIME (api/ps clear before each). The throughput advantage (24/7, overnight) pays the time cost — that is the trade we are making.
Casting: which of the 33 local models plays which role is in EDITOR_MODEL_CASTING.md (evidence-based, SearxNG-cited, with a LEAN default panel vs a FULL panel for hard cases). It is the editor's Faith-file / model-rijal analog.

0.6 AUTONOMY MODEL (100% self-running; operator is NEVER a mid-run gate)

═══════════════════════════════════════════════════════════════════

The pipeline runs end-to-end with NO model ever WAITING on the operator. A model's questions are resolved INTERNALLY: it reasons against the substrate (this handbook, notes, sidecars, casting, FINDINGS) and, if still stuck, convenes the panel (§0.5) or the deliberation chain. The operator is never a dependency the line stalls on.

THREE NESTED SELF-CORRECTING LOOPS hold quality without a human:
1. Per-stage AUDIT (inner): every stage validates its OWN output before passing it on, using DETERMINISTIC checks (cheap, reliable: EDL schema-valid? runtime in range? stitched frame right-side-up via horizon check? render file exists + duration matches + not all-black?) PLUS model judgment (subjective quality). Fail → the stage self-corrects and RETRIES (new params / re-prompt / panel), never the operator. Cap retries per stage.
2. Final QC GATE (outer = "the train"): the assembled cut is judged by the QC panel vs the §8 rubric (§8.5). Fail (hard error OR <90) → the specific defects are pushed BACK through the pipeline and it re-runs. Cap N full loops.
3. SAFETY VALVE: if a stage or the QC loop exhausts retries, the pipeline does NOT loop forever and does NOT ship slop. It PARKS the job, LOGS the exact unresolved defect, and FLAGS it for async operator/Claude review, then keeps the line moving on other jobs. Graceful failure + a log entry, NOT mid-run waiting.
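The deterministic half of the per-stage audit can be sketched for the EDL stage as follows. The field names `longform.clips[]`, `in_s` and `out_s` come from this handbook, and the default runtime bounds reflect the 8–15 min full-day target in the roadmap's definition of done. The function name, the defect strings, and the exact checks are assumptions, since the real schema lives in EDITING_FORMAT_SPEC.md.

```python
def audit_edl(edl: dict, min_runtime_s: float = 480, max_runtime_s: float = 900) -> list[str]:
    """Return a list of objective defects; an empty list means the EDL passes.

    Hypothetical sketch: checks only structure and runtime, not editorial quality.
    """
    clips = edl.get("longform", {}).get("clips")
    if not isinstance(clips, list) or not clips:
        return ["longform.clips missing or empty"]

    defects: list[str] = []
    runtime = 0.0
    for i, clip in enumerate(clips):
        in_s, out_s = clip.get("in_s"), clip.get("out_s")
        if not isinstance(in_s, (int, float)) or not isinstance(out_s, (int, float)):
            defects.append(f"clip {i}: in_s/out_s not numeric")
            continue
        if out_s <= in_s:
            defects.append(f"clip {i}: out_s <= in_s")
            continue
        runtime += out_s - in_s

    # Runtime is summed from valid clips only, so a broken clip also shows up here.
    if not min_runtime_s <= runtime <= max_runtime_s:
        defects.append(f"runtime {runtime:.1f}s outside [{min_runtime_s}, {max_runtime_s}]")
    return defects
```

Returning defects as data rather than raising lets the stage feed them straight into its own retry prompt, which is what the self-correct step needs.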

WHY deterministic-checks + diverse-panel + cap + valve (honest): a model can be CONFIDENTLY WRONG. Pure model self-judgment can loop on a non-issue or pass slop. Deterministic checks catch objective failures cheaply; a DIVERSE panel (not one model) is harder to fool; the retry cap + park-and-flag prevents infinite loops and silent slop. Autonomy = self-correcting with BOUNDED loops and graceful failure, not blind trust in one opinion.
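A minimal sketch of the bounded-retry and park-and-flag behaviour described above, assuming each stage can be expressed as a `produce` callable plus an `audit` callable that returns a list of defects. The names `run_stage`, `produce`, `audit` and `parked_log` are hypothetical and chosen for illustration only.

```python
from typing import Any, Callable, Optional


def run_stage(
    name: str,
    produce: Callable[[int, list], Any],
    audit: Callable[[Any], list],
    max_retries: int = 3,
    parked_log: Optional[list] = None,
) -> tuple:
    """Retry a stage until its audit passes or the cap is hit; never wait on a human.

    Returns ("ok", output) on success, or ("parked", None) after logging the
    unresolved defects so the line can move on to other jobs.
    """
    defects: list = []
    for attempt in range(1, max_retries + 1):
        # The previous attempt's defects are passed back in so the stage can self-correct.
        output = produce(attempt, defects)
        defects = audit(output)
        if not defects:
            return "ok", output
    if parked_log is not None:
        parked_log.append({"stage": name, "defects": defects})
    return "parked", None
```

The caller treats `"parked"` as a normal outcome: the job is skipped, the log entry is left for async review, and the next job starts.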

OPERATOR ROLE = ASYNC REVIEW, never a blocking gate. (Supersedes the old "show operator the EDL BEFORE render" hard gate — operator directive 2026-05-30: 100% autonomous. The EDL is still LOGGED for review; reviewing it is optional/async.)

0.7 TOOL ACCESS (capabilities the autonomous models can call — propels reasoning, quality, autonomy)

═══════════════════════════════════════════════════════════════════

For the role-models to reason well, lift production quality, and SELF-RESOLVE (§0.6) without the operator, the CLI exposes a tool layer they can invoke. A role-model should never be stuck for lack of a tool it could have called.
- SearxNG search (localhost:8080): SOTA/technique verification, LOCATION/landmark enrichment (identify a sign/place to caption it right), music/caption-trend reference. Local instance → offline-capable; a strict-offline run skips external fetch and leans on substrate only.
- ComfyUI (localhost:8188): generative fill: Wan2.2 B-roll for gaps, Flux2 title backgrounds, Stable Audio 3 music/SFX. (Qwen3-TTS voiceover exists in a workflow but is DEFERRED; see roadmap §E.)
- ffprobe / ffmpeg: deterministic audits (duration, black-frame, orientation) + frame extraction for vision.
- Deliberation chain (deliberate.py): escalation for hard/ambiguous calls (autonomy support, NOT the operator).
- RAG over substrate (nomic-embed): index this handbook + notes + casting + FINDINGS so a role-model LOOKS UP the answer to its own question instead of guessing or stalling. This is the core engine of §0.6 self-resolution.
- Whisper (transcribe / word-timestamps), Magika (byte-level file typing).

RULE: give the model the tool, CAP the calls (no infinite tool loops), LOG what it used (auditability).
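The rule of capping and logging tool calls can be sketched as a thin wrapper around whatever callables the CLI exposes. `ToolLayer`, `ToolBudgetExceeded` and the per-run `max_calls` budget are hypothetical names; the actual tool layer is not specified in this handbook.

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when a role-model asks for more tool calls than its budget allows."""


class ToolLayer:
    """Hypothetical wrapper: exposes named tools, caps total calls, logs each use."""

    def __init__(self, tools: dict, max_calls: int):
        self.tools = tools
        self.max_calls = max_calls
        self.log: list = []

    def call(self, name: str, *args, **kwargs):
        if len(self.log) >= self.max_calls:
            raise ToolBudgetExceeded(f"budget of {self.max_calls} calls exhausted")
        if name not in self.tools:
            raise KeyError(f"unknown tool: {name}")
        # Log before invoking so a call that raises still counts against the cap.
        self.log.append({"tool": name, "args": args, "kwargs": kwargs})
        return self.tools[name](*args, **kwargs)
```

A budget exception is a natural trigger for the §0.6 safety valve: the job is parked with the call log attached instead of looping on the same search.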

1. PIPELINE OVERVIEW (8 stages)

═══════════════════════════════════════════════════════════════════

INGEST → CONVERT → 360-STITCH → ANALYZE → SIDECAR → CONSOLIDATE(EDL) → EDITORIAL-CUT → ASSEMBLE+RENDER → OUTPUT

Each stage's artifacts and formats are defined in EDITING_FORMAT_SPEC.md. Notes are read-only inputs; everything numeric is additive sidecars.

2. STAGE 1 — INGEST & INVENTORY

Normalize names: YYYYMMDD_HHMMSS.ext, group into day folders. HEIC→jpg twins kept (decoders can't read HEIC).
OSV (DJI raw) renamed to .mp4. Identify per file: photo / flat-16:9 video / DJI-360-raw (1:1 dual-fisheye) / equirect-2:1.
Inventory each clip's type — it routes everything downstream.
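The naming and routing rules above can be sketched as two small helpers. The `YYYYMMDD_HHMMSS.ext` pattern, the OSV→.mp4 rename and the four clip types come from this stage; the function names, the returned labels and the aspect-ratio tolerance are assumptions for illustration.

```python
from datetime import datetime
from pathlib import PurePath


def normalized_name(captured: datetime, original: str) -> str:
    """Build the YYYYMMDD_HHMMSS.ext name; DJI .osv files become .mp4."""
    ext = PurePath(original).suffix.lower()
    if ext == ".osv":
        ext = ".mp4"
    return captured.strftime("%Y%m%d_%H%M%S") + ext


def classify(width: int, height: int, is_video: bool, tol: float = 0.05) -> str:
    """Route a file by aspect ratio (tolerance is an assumed value).

    1:1 video is treated as DJI dual-fisheye raw, 2:1 as equirectangular,
    and any other video as flat footage.
    """
    if not is_video:
        return "photo"
    ratio = width / height
    if abs(ratio - 1.0) <= tol:
        return "dji-360-raw"
    if abs(ratio - 2.0) <= tol:
        return "equirect"
    return "flat-video"
```

For example, `classify(3840, 3840, True)` routes a clip to the stitch stage, while a 1920×1080 clip skips it.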

3. STAGE 2 — CONVERT

Phone photos: bake EXIF rotation into pixels + strip tag (ffmpeg -vf transpose=2 -map_metadata -1) — Chromium double-rotates otherwise (FINDINGS gotcha).
Audio for transcription: -ar 16000 -ac 1 mono wav.

4. STAGE 3 — 360 STITCH (only for DJI dual-fisheye raw)

Method: combine the two HEVC fisheye streams then v360=dfisheye:e:ih_fov=190:iv_fov=190 → equirectangular. (Old custom OpenCV remap is DEAD — SSIM 0.39.)
dfisheye does NOT auto-level. Orientation VARIES PER CLIP — some stitch upright, some upside-down (camera remount). MUST detect + --flip (180° roll) the inverted ones BEFORE analysis, or half the footage is garbage.
Reframe 360→flat 16:9 with the VLM-director (reframe360_director.py): sample frames, ask qwen3.6 where the subject is, pan toward it. Never aim the viewport at the lens seam.
HARD RULE: never accept a stitched/reframed clip until you EXTRACT A FRAME AND LOOK at it (sky up, horizon level, subject in view). exit 0 ≠ correct. This rule exists because a seam/upside-down cut shipped once.
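As a sketch of the two mechanical pieces of this stage, the helpers below build the `v360` filter string quoted above and an ffmpeg command that extracts one frame for the look-check. Expressing the flip as `roll=180` inside the same filter is an assumption about how `--flip` could be implemented, not a description of osv_stitch.py, and both function names are hypothetical.

```python
def v360_filter(flip: bool = False, fov: int = 190) -> str:
    """Return the dual-fisheye → equirect filter; optionally add a 180° roll.

    Assumption: an inverted clip is corrected with v360's roll option.
    """
    base = f"v360=dfisheye:e:ih_fov={fov}:iv_fov={fov}"
    return base + ":roll=180" if flip else base


def frame_check_cmd(src: str, t_s: float, out_jpg: str) -> list:
    """Build (not run) an ffmpeg command that grabs one frame at t_s seconds."""
    return ["ffmpeg", "-y", "-ss", str(t_s), "-i", src, "-frames:v", "1", out_jpg]
```

The extracted frame still has to be inspected, by a person or a vision model, for sky up, level horizon and subject in view. A zero exit code from either command says nothing about orientation.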

5. STAGE 4 — ANALYZE → the `.notes.md` (read-only, never regenerate/delete)

Per clip: extract a frame (video) → qwen3.6:27b vision (think:false) → 6-field note: SCENE / STORY-VALUE / USAGE(A|B-roll) / QUALITY / TIMING / VERDICT(KEEP|MAYBE|SKIP). Append Whisper transcript for audio clips.
These carry the SEMANTIC layer. They are GOOD; do not redo them. Judgment slips (e.g. an A-roll clip mislabeled B-roll) are corrected at Stage 6, not by re-analysis.

6. STAGE 5 — SIDECARS (additive numeric layer; notes untouched) — `<media>.edl.json`

The chain-verified gap: notes lack numbers a machine needs. Build per clip (EDITING_FORMAT_SPEC Artifact 1):
- ffprobe → duration, fps, w, h.
- Whisper in WORD-TIMESTAMP mode → words[] + segments[] (none are saved on disk → must re-run; this is additive, not note-regeneration). Derive silences[].
- hook_score (0–1) per candidate window = audio RMS-energy variance + speech-rate (cheap reels-ranking signal).
- Resolve each note's prose TIMING ("start at 'Alright…'") to numeric in_s/out_s by phrase-matching against words[].
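The derived fields can be sketched from a list of word timestamps. The 0.6 s silence threshold and the equal 0.5/0.5 weighting are the values recorded in the roadmap's sidecar item. The word-dict keys (`text`, `start_s`, `end_s`), the normalisation caps in `hook_score`, and all function names are assumptions, since the real schema is defined in EDITING_FORMAT_SPEC.md.

```python
import re
import statistics


def derive_silences(words: list, min_gap_s: float = 0.6) -> list:
    """Return gaps between consecutive words that exceed min_gap_s."""
    return [
        {"start_s": prev["end_s"], "end_s": nxt["start_s"]}
        for prev, nxt in zip(words, words[1:])
        if nxt["start_s"] - prev["end_s"] > min_gap_s
    ]


def _norm(token: str) -> str:
    # Lowercase and drop punctuation so "Alright…" matches "alright".
    return re.sub(r"[^\w']+", "", token.lower())


def find_phrase_start(words: list, phrase: str):
    """Return start_s of the first exact occurrence of phrase in words, else None."""
    target = [_norm(t) for t in phrase.split()]
    tokens = [_norm(w["text"]) for w in words]
    if not target:
        return None
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return words[i]["start_s"]
    return None


def hook_score(rms: list, word_count: int, duration_s: float,
               cv_cap: float = 1.0, rate_cap_wps: float = 4.0) -> float:
    """0.5 * energy variation + 0.5 * speech rate, each clamped to 0..1.

    cv_cap and rate_cap_wps are assumed normalisation constants.
    """
    mean = statistics.fmean(rms)
    cv = statistics.pstdev(rms) / mean if mean > 0 else 0.0
    energy = min(cv / cv_cap, 1.0)
    rate = min((word_count / duration_s) / rate_cap_wps, 1.0) if duration_s > 0 else 0.0
    return 0.5 * energy + 0.5 * rate
```

Exact token matching is the simplest way to resolve a quoted phrase. A production version would likely need fuzzy matching, because a note may quote a phrase slightly differently from the Whisper transcript.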

7. STAGE 6 — CONSOLIDATE → the day EDL (source of truth) — `day-NN.edl.json`

A local model reads notes + sidecars and emits ONE EDL (EDITING_FORMAT_SPEC Artifact 2): longform.clips[] and reels.picks[]. AT THIS STEP it RE-CHECKS each clip's role/verdict against the transcript and OVERRIDES bad tags (this is where the A/B-roll mislabel gets fixed). Then apply the EDITORIAL framework below to set the actual in/out, order, transitions, overlays.

8. STAGE 7 — EDITORIAL DECISION FRAMEWORK (the craft; web-researched 2026-05-29)

═══════════════════════════════════════════════════════════════════

This is how a professional decides the cut. The model applies these rules to the notes+sidecars.

8.1 Retention model (the spine of every decision)

The first 30 seconds decide everything — 70%+ retention there drives algorithmic promotion. Open with a HOOK in the first 5–15s (short-form: 0.5–3s): the day's most striking shot or a curiosity gap ("we did NOT expect this"). Do not open with slow logos or "hey guys." [edicionvideopro, virvid.ai, velio 2026]
Beginning and end hold attention; the middle sags. Insert a pattern interrupt every 45–60s — a visual or audio change (new location, B-roll burst, music shift, text pop) — to reset attention. [johnisaacson, velio 2026]
Cut the fluff. Remove dead air, long pauses, rambling, repeated takes. This is why word-timestamps exist: trim silences and "umm" gaps to keep the curve up. Retention beats runtime. [jamesmasi, vloglikepro]
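The pattern-interrupt cadence above can be checked mechanically. This is a minimal sketch, assuming the interrupt timestamps (seconds) are already known from the EDL; `cadence_gaps` is a hypothetical helper and the 60s ceiling is the upper end of the 45–60s range stated above.

```python
def cadence_gaps(interrupts_s: list, runtime_s: float, max_gap_s: float = 60.0) -> list:
    """Return (start, end) stretches longer than max_gap_s with no interrupt."""
    # Treat the start and end of the video as boundaries.
    marks = [0.0] + sorted(interrupts_s) + [runtime_s]
    return [(a, b) for a, b in zip(marks, marks[1:]) if b - a > max_gap_s]
```

For example, `cadence_gaps([50, 100, 200], 240)` reports the single stretch from 100s to 200s as too long.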

8.2 A-roll vs B-roll (what carries the story)

A-roll = the narrative spine: dialogue/talking that carries meaning + essential audio. KEEP it; it sets the timeline.
B-roll = supplemental cutaways (scenery, hands, the thing being described). Use it to (a) show what A-roll talks about, (b) hide a cut — cover a removed stumble/pause in the A-roll while the audio runs continuously, (c) add texture/pacing. [knowlify, studiobinder, storyblocks 2026]
Rule: when you trim a pause out of A-roll, lay B-roll over the seam so the audio stays unbroken and the visual jump is invisible.

8.3 Clip vs trim vs crop (the three reductions)

CLIP (select): keep only segments with a KEEP verdict and real story value; drop SKIP. Decided from notes + hook_score.
TRIM (tighten): cut into the segment — remove leading/trailing silence (from silences[]), dead air, false starts. Land in/out on word boundaries from words[]. Use jump cuts to compress a long monologue (vlog staple).
CROP / punch-in: reframe within the frame for emphasis or to fix composition — a slow push-in on a key line adds energy; a crop salvages a wide/empty shot. For 360, the reframe-director already chooses the viewport.
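The TRIM step can be sketched as splitting a clip's in/out window around the sidecar's silences[]. This is an illustration under stated assumptions: `keep_ranges` is a hypothetical name, silences are removed in full, and each boundary between kept spans becomes a jump cut (or a seam to cover with B-roll).

```python
def keep_ranges(in_s: float, out_s: float, silences: list) -> list:
    """Split [in_s, out_s] into kept (start, end) spans with silences removed."""
    kept = []
    cursor = in_s
    for s in sorted(silences, key=lambda x: x["start_s"]):
        # Clamp each silence to the clip window.
        start = max(s["start_s"], in_s)
        end = min(s["end_s"], out_s)
        if end <= start:
            continue  # silence lies outside the window
        if start > cursor:
            kept.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < out_s:
        kept.append((cursor, out_s))
    return kept
```

With the sidecar example (one silence at 6.80–7.90) and a window of 0.12–24.0, this yields two kept spans: 0.12–6.80 and 7.90–24.0.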

8.4 Transition selection (default to the hard cut)

Hard/straight cut — ~90% of edits. Dialogue, action, momentum. When in doubt, cut. [adobe, quora]
Jump cut — skip ahead in time / delete dead space within one shot. Tightens monologue. [adobe]
J-cut (audio of NEXT shot starts before its visual) — anticipation; smooth entry into a new scene/location. [studiobinder, adobe]
L-cut (audio of CURRENT shot lingers over next visual) — let a line finish over a cutaway; natural for dialogue. [backstage, adobe]
Match cut — bridge two shots by shared motion/shape; polished time/location jump.
Crossfade/dissolve — soft, signals TIME PASSING or mood; use for B-roll montages and day-section breaks. NEVER on dialogue or action (looks amateur). [quora]
Cut on action — cut during a movement to hide the edit (seamless).
Music sets pacing: faster cuts = energy, longer holds = mood. Cut on the beat where it helps. [vloggingpro]

8.5 Title cards & text overlays

Animated title card opens the piece (the trip/day + a hook line). Lower-thirds name locations/people.
On-screen text: short, one claim per beat, mobile-first, animated (movement aids retention). Keep inside the SAFE ZONE (clear of platform UI at top/bottom/right). [digitizer, captions 2026]

8.6 Sequencing / story arc

Shape a narrative arc, don't just dump chronologically: pick a theme, open with the hook (often the best moment teased), build through the day, land a payoff at the end (arrival/reflection). Use establishing B-roll to bridge travel gaps. [arkon, sagepub travel-vlog study]

8.7 Full-day vs reels (same EDL, different selection)

Full-day (YouTube, 16:9, ~8–15 min): the whole arc, aggressively trimmed for retention, micro-hooks throughout, mid-roll-friendly (≥8 min).
Reels (9:16, 20–60s, 2–5 per day): each is ONE self-contained moment with the highest hook_score, hook in the first 0.5–3s, bold animated captions, a payoff that makes it loop. One vertical export serves YouTube Shorts + IG Reels + TikTok. [opus.pro, captions, virvid 2026]
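Reel selection by hook_score might be sketched like this. The candidate shape (dicts with in_s, out_s, hook_score) follows the EDL pick fields; the ranking rule itself (filter to 20–60s, then take the top scores, at most 5) is an assumption for illustration, and `pick_reels` is a hypothetical name.

```python
def pick_reels(candidates: list, min_s: float = 20.0, max_s: float = 60.0,
               max_count: int = 5) -> list:
    """Keep candidates whose duration fits the reel range, ranked by hook_score."""
    eligible = [c for c in candidates if min_s <= c["out_s"] - c["in_s"] <= max_s]
    ranked = sorted(eligible, key=lambda c: c["hook_score"], reverse=True)
    return ranked[:max_count]
```

This covers only the numeric filter. Whether a moment is self-contained and loops remains a judgment call for the panel and the QC loop.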

9. STAGE 8 — ASSEMBLE + RENDER (Remotion)

Render the EDL LITERALLY: clips in order, in/out as given, transitions/overlays as specified, music bed under.
SHOW THE OPERATOR the EDL + computed runtime BEFORE rendering (locked-plan rule).
Optional SOTA polish (operator-discretion, ComfyUI/GPU): Stable Audio 3 music/SFX, Flux2 title backgrounds, Wan 2.2 B-roll for gaps.

9.5 STAGE 8.5 — ANTI-SLOP QC LOOP (this is where the 90% is earned)

THE NORTH STAR: 90% of a human editor's quality. Below that is slop. We do NOT beat a human on per-video speed — we win on THROUGHPUT (24/7 local, 192GB RAM / 24GB VRAM, hundreds of videos). Spend that advantage HERE: a human edits a video once; the machine renders, JUDGES its own output, finds the slop, and RECUTS — N passes, free, overnight. That loop closes the last 10–20%. Without it, automation produces slop.

The loop (after every render, before ship):
1. Sample the output: extract frames across the timeline + read the audio/caption track.
2. Judge against the §8 rubric (vision model + checklist):
   - Is there a hook in the first 5–15s?
   - Any dead air / left-in pauses / rambling?
   - Any HARD ERROR: upside-down or seam-warped 360, black frames, wrong reframe, caption outside the safe zone, dissolve on dialogue, jarring jump?
   - Is the arc coherent?
   - Is the pattern-interrupt cadence (~45–60s) met?
   - Reels: hook in 0.5–3s, self-contained, loops?
3. Score + locate: produce a 0–100 score and SPECIFIC defects tied to timecodes ("dead air 2:14–2:19", "clip 7 upside down").
4. Gate: any hard error OR score <90 → emit recut instructions → revise the EDL/assembly → re-render → loop. Escalate borderline/taste calls to the deliberation chain.
5. Cap + log: bound passes (e.g. 5); if still <90, surface to the operator with the remaining defects named. Never ship slop silently, never claim 90 without the frame-level check (HARD RULE #1).

This is the inverse of one-shot generation: the model is allowed to be wrong on pass 1 because it CRITIQUES and fixes itself. The irreducible ~10% gap (taste, emotional beats, "which moment is the one") is narrowed here by hook_score + the critique loop + the chain — narrowed, not erased.
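The control flow of the loop can be sketched as below. The `render`, `judge`, and `recut` callables are placeholders for the real render, vision-judging, and EDL-revision steps, and the `Verdict` shape is hypothetical; only the gate (any hard error or score below 90) and the pass cap come from the text above.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Verdict:
    score: int                                        # 0-100 rubric score
    hard_errors: list = field(default_factory=list)   # e.g. "clip 7 upside down"
    defects: list = field(default_factory=list)       # e.g. "dead air 2:14-2:19"


def passes_gate(v: Verdict, threshold: int = 90) -> bool:
    """Ship only with zero hard errors AND a score at or above the threshold."""
    return not v.hard_errors and v.score >= threshold


def qc_loop(edl: dict, render: Callable, judge: Callable, recut: Callable,
            max_passes: int = 5, threshold: int = 90) -> dict:
    """Render, judge, and recut until the gate passes or the pass cap is hit."""
    log = []
    for n in range(1, max_passes + 1):
        verdict = judge(render(edl))
        log.append(verdict)
        if passes_gate(verdict, threshold):
            return {"shipped": True, "passes": n, "edl": edl, "log": log}
        if n < max_passes:
            edl = recut(edl, verdict)  # revise the EDL from the named defects
    last = log[-1]
    # Cap reached: surface the remaining defects instead of shipping silently.
    return {"shipped": False, "passes": max_passes, "edl": edl, "log": log,
            "remaining": last.hard_errors + last.defects}
```

Keeping every verdict in `log` means the operator sees how the score moved across passes, not just the final number.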

10. STAGE 9 — OUTPUT (targets — EDITING_FORMAT_SPEC §Platform targets)

Long-form: YouTube 16:9 1920×1080, 8–15 min. Reels: 9:16 1080×1920, 20–60s, H.264, captions in safe zone.

11. HARD RULES (violating these is how prior instances shipped garbage)

1. Verify by LOOKING. Extract a frame from any stitched/reframed/rendered output and look. exit 0/HTTP-200 ≠ correct.
2. Notes are read-only. Never regenerate or delete .notes.md. Add sidecars instead.
3. Serial inference. One Ollama model at a time; check api/ps before every dispatch. Vision on GPU is a short burst, so don't run ComfyUI at the same time.
4. No frontier. Local only. SearxNG for research; re-verify any tool/SOTA claim with a live search, because training is stale.
5. The cut serves retention + story, not completeness. Cutting good footage is correct if it doesn't earn its place.
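The serial-inference check in rule 3 could be sketched as follows. It assumes Ollama's default local address and that `GET /api/ps` returns a JSON object with a `models` list whose entries carry a `name`; the helper names are hypothetical, and the "clear" policy (nothing loaded, or only the model about to be used) is one possible reading of the rule.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumption: default Ollama address


def fetch_ps(base_url: str = OLLAMA_URL, timeout_s: float = 5.0) -> dict:
    """Ask the local Ollama server which models are currently loaded."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=timeout_s) as resp:
        return json.load(resp)


def loaded_models(ps_payload: dict) -> list:
    """Extract model names from an /api/ps response body."""
    return [m.get("name", "?") for m in ps_payload.get("models", [])]


def clear_to_dispatch(ps_payload: dict, wanted: str) -> bool:
    """True if nothing is loaded, or only the model we are about to call."""
    return all(name == wanted for name in loaded_models(ps_payload))
```

A dispatcher would call `clear_to_dispatch(fetch_ps(), "qwen3.6:27b")` and wait or unload before proceeding when it returns False.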

12. SOURCES THIS HANDBOOK DISTILLS

FINDINGS.md — deep technical detail, exact commands, the running log + gotchas (the appendix).
EDITING_FORMAT_SPEC.md — artifact schemas (sidecar, EDL, human script, reels sheet) + platform targets.
DAY_VIDEO_PLAN.md — the locked high-level process.
Editorial craft: live SearxNG research 2026-05-29 (cited inline in §8).


EDITING_FORMAT_SPEC.md

EDITING FORMAT SPEC — master format for all day-editing artifacts

Authored 2026-05-29 from the 6-agent chain verdict (deliberation: notes-sota-sufficiency, 6/6 unanimous). Governs the FORMAT of every editing file so every day follows the same shape. Sits UNDER the process in DAY_VIDEO_PLAN.md (which is locked) and answers to FINDINGS.md. This file does not change the process; it pins down the exact schema of each artifact the process produces.

Scope — GENERAL + AUTONOMOUS (the bar this spec must clear)

This spec is footage-AGNOSTIC. The bar: a LOCAL LLM, given ANY clip from ANY trip — not just this one — can follow these templates to analyze it, organize the day, and edit a final production: BOTH a full-day cut AND SEVERAL reels per day, end-to-end, with no human hand-tuning. Day 25 (this trip) is the FIRST TEST INSTANCE of the general system, not the scope. Nothing in the schema or process may hardcode this trip, these locations, or these filenames. If a template only works because it "knows" this trip, it is wrong and must be generalized.

Chain finding this encodes (why this spec exists)

The existing per-clip .notes.md (Nemotron-Omni + Whisper) are SOTA-sufficient for a HUMAN editor (full-day and reels) but INSUFFICIENT for an LLM editor, because they carry the SEMANTIC layer (scene, story, role, verdict, transcript) but NOT the NUMERIC/TEMPORAL layer a renderer needs: numeric in/out, word-level timestamps, per-clip duration, and a best-moment ranking. Fix = an ADDITIVE numeric SIDECAR per clip. The .notes.md are NEVER modified, regenerated, or deleted.

Format law

JSON is the single source of truth. The sidecars and the day EDL are JSON. A renderer or LLM follows them literally.
Markdown views are GENERATED from the JSON, never hand-written. The human script and the reels sheet are rendered from the day EDL. Same data, two readable renderings — they cannot drift because they are derived.
HTML is deferred. Only added later if an interactive (clickable-timecode) preview is wanted. Render from the JSON.
All times are seconds as floats (e.g. 12.40). Frame = round(seconds * fps). mm:ss is for HUMAN views only.
The .notes.md are read-only inputs. No artifact here writes to them.
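The time convention in the format law can be sketched in a few lines: JSON stores float seconds, the renderer converts to frames, and human views convert to mm:ss. The function names are hypothetical. Note that Python's `round` rounds exact halves to the nearest even integer.

```python
def to_frame(seconds: float, fps: float) -> int:
    """Frame index for a time in seconds: round(seconds * fps)."""
    return round(seconds * fps)


def to_mmss(seconds: float) -> str:
    """Human-view timecode; never written back into the JSON."""
    total = int(round(seconds))
    return f"{total // 60:02d}:{total % 60:02d}"
```

For instance, 12.40s at 29.97 fps lands on frame 372, and 134.0s reads as 02:14 in a human view.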

Platform targets (web-researched 2026-05; re-verify periodically — specs drift)

LONG-FORM (YouTube; 16:9, 1920x1080, H.264 MP4, 30fps):
- A single day's vlog targets ~8–15 min. >=8 min unlocks mid-roll ads; 10–20 min is the storytelling sweet spot ONLY while retention holds >50%. Retention beats raw length: the long-form cut MUST remove dead air / pauses / rambling (this is why word-timestamps are required) to keep the curve up. Default day target: ~8–12 min.

REELS / SHORTS (ONE 9:16 vertical export serves all three platforms; post unchanged to each):
- YouTube Shorts: 9:16, 1080x1920, <=180s hard max; top performers 25–50s.
- Instagram Reels: 9:16, 1080x1920, up to ~3 min, 30fps, <=4GB.
- TikTok: 9:16, 1080x1920; optimal short (15–60s), minutes-long ceiling.
- STANDARD WE TARGET: 9:16, 1080x1920, 30fps, H.264 MP4, 20–60s (<=60s = safe + optimal across ALL three at once), with all caption/overlay text inside the SAFE ZONE, clear of the top, bottom, and right edges where platform UI (captions, buttons, handle) overlaps. Several such reels per day (2–5).
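The 9:16 reframe math for a landscape source can be sketched as a full-height crop window that is later scaled to 1080x1920. `crop_9x16` is a hypothetical helper; keeping the width even (friendlier to H.264 encoders) and centering on an optional `subject_x` are assumptions, and safe-zone margins are not encoded here because the spec gives no pixel values for them.

```python
from typing import Optional


def crop_9x16(src_w: int, src_h: int, subject_x: Optional[int] = None) -> tuple:
    """Return (crop_w, crop_h, x, y) for a full-height 9:16 window."""
    crop_h = src_h
    crop_w = (src_h * 9 // 32) * 2  # src_h * 9/16, floored to an even width
    if crop_w > src_w:
        raise ValueError("source is narrower than 9:16 at full height")
    center = src_w // 2 if subject_x is None else subject_x
    # Center on the subject, clamped so the window stays inside the frame.
    x = min(max(center - crop_w // 2, 0), src_w - crop_w)
    return crop_w, crop_h, x, 0
```

For the 3840x2160 sidecar example this gives a 1214x2160 window, which then scales to the 1080x1920 export.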

Artifact 1 — clip SIDECAR `<media>.edl.json` (one per media file, lives beside the .notes.md)

The numeric layer the notes lack. Derived from assets on disk (ffprobe + Whisper word-timestamps).

{
  "file": "CAM_20260525122958_0025_D.mp4",
  "duration_s": 25.4,
  "fps": 29.97,
  "width": 3840, "height": 2160,
  "has_audio": true,
  "words": [ {"w": "Alright", "start_s": 0.12, "end_s": 0.46}, {"w": "so", "start_s": 0.46, "end_s": 0.58} ],
  "segments": [ {"start_s": 0.12, "end_s": 6.80, "text": "Alright, so we're just planning our trip right now."} ],
  "silences": [ {"start_s": 6.80, "end_s": 7.90} ],
  "hook_score": 0.0,
  "hook_reason": "",
  "source_note": "CAM_20260525122958_0025_D.mp4.notes.md"
}

words / segments — Whisper word + segment timestamps (re-run; none were saved to disk). Enables pause/dead-audio cutting (full-day) and precise quote boundaries (reels).
silences — gaps over a threshold, derived from words. The full-day cut removes these.
hook_score (0–1) + hook_reason — computed reels-ranking signal (audio RMS energy variance + speech rate).
duration_s/fps/w/h — ffprobe. Enables runtime summing and 9:16 reframe math.
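Deriving silences[] from words[] can be sketched as a scan over the gaps between consecutive words. The spec only says "gaps over a threshold", so the 0.5s default here is an assumption, and `derive_silences` is a hypothetical name.

```python
def derive_silences(words: list, min_gap_s: float = 0.5) -> list:
    """Emit a silence for each gap between consecutive words >= min_gap_s."""
    silences = []
    for prev, nxt in zip(words, words[1:]):
        gap = nxt["start_s"] - prev["end_s"]
        if gap >= min_gap_s:
            silences.append({"start_s": prev["end_s"], "end_s": nxt["start_s"]})
    return silences
```

This sketch only sees gaps between words. Leading and trailing silence would need the clip's duration_s as well.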

Artifact 2 — day EDL `day-<NN>.edl.json` (THE source of truth; the LLM consolidates notes + sidecars into this)

{
  "day": 25,
  "date": "2026-05-25",
  "location": "Raby Lake -> Sudbury -> Niagara",
  "longform": {
    "target_s": [480, 900],
    "clips": [
      { "file": "...0025_D.mp4", "in_s": 0.12, "out_s": 24.0, "role": "A",
        "overlay_text": "", "reason": "authentic kettle-vs-machine planning dialogue", "source_note": "...notes.md" }
    ],
    "runtime_s": 0.0
  },
  "reels": {
    "aspect": "9:16",
    "target_s": [20, 60],
    "count_target": [2, 5],
    "picks": [
      { "file": "...0025_D.mp4", "in_s": 8.0, "out_s": 52.0, "caption": "van life coffee hack",
        "reason": "highest hook_score; self-contained tip", "hook_score": 0.81 }
    ]
  }
}

role: A = dialogue/A-roll (KEEP the talking), B = B-roll/atmosphere.
in_s/out_s: numeric, resolved from the sidecar (prose TIMING phrase -> matched against words).
longform.runtime_s: sum of (out_s - in_s); shown to operator BEFORE render (DAY_VIDEO_PLAN rule).
reels.picks: the SEVERAL best 20–60s moments (target 2–5 per day, matching reels.target_s), each chosen by hook_score, each with a 9:16 reframe.
suggestions[] (top-level, optional): enhancement IDEAS the editor leaves but does NOT auto-apply — {in_s, out_s, type(voiceover|broll|music|titlecard|sfx|crop), idea, rationale}. Non-blocking; the cut ships without them. A future generator stage can fulfill them (e.g. auto-voiceover via Qwen3-TTS — roadmap §E, DEFERRED). Example: {in_s:80, out_s:105, type:"voiceover", idea:"line on why the Big Nickel matters", rationale:"strong but silent B-roll — VO adds context + retention"}.
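The longform.runtime_s rule above is a plain sum, sketched here together with a check against longform.target_s. The function names are hypothetical, and the sum ignores any time that transitions might overlap.

```python
def longform_runtime_s(edl: dict) -> float:
    """Sum of (out_s - in_s) over longform.clips[]."""
    return sum(c["out_s"] - c["in_s"] for c in edl["longform"]["clips"])


def within_target(edl: dict) -> bool:
    """True if the computed runtime falls inside longform.target_s [min, max]."""
    lo, hi = edl["longform"]["target_s"]
    return lo <= longform_runtime_s(edl) <= hi
```

Computing the value rather than trusting the stored runtime_s field keeps the number shown to the operator in step with the clips actually listed.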

Artifact 3 — human script `day-<NN>.human.md` (GENERATED from the day EDL)

Artifact 4 — reels sheet `day-<NN>.reels.md` (GENERATED from the day EDL `reels` block)

The chosen moments — one block PER reel (several per day): file, in–out (mm:ss), caption, 9:16 framing note, the quote each lands on.
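Generating the reels sheet from the EDL `reels` block might look like the sketch below. The markdown layout and the name `render_reels_sheet` are hypothetical; only fields present in the EDL example (file, in_s, out_s, caption, hook_score) are rendered, so the framing note and landing quote are left out.

```python
def mmss(seconds: float) -> str:
    """mm:ss for human views only; the EDL keeps float seconds."""
    total = int(round(seconds))
    return f"{total // 60:02d}:{total % 60:02d}"


def render_reels_sheet(edl: dict) -> str:
    """Render one markdown block per reel pick, derived from the EDL."""
    lines = [f"# Day {edl['day']} reels", ""]
    for n, pick in enumerate(edl["reels"]["picks"], start=1):
        lines += [
            f"## Reel {n}",
            f"- file: {pick['file']}",
            f"- in-out: {mmss(pick['in_s'])}-{mmss(pick['out_s'])}",
            f"- caption: {pick['caption']}",
            f"- hook_score: {pick['hook_score']:.2f}",
            "",
        ]
    return "\n".join(lines)
```

Because the sheet is a pure function of the EDL, regenerating it after every EDL change is what keeps the two from drifting.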

Pipeline (concrete, under DAY_VIDEO_PLAN.md)

INPUT: the day's .notes.md (semantic) — read-only.
BUILD SIDECARS: ffprobe + Whisper word-timestamps + silence + hook score -> <media>.edl.json per clip. Notes untouched.
CONSOLIDATE: a LOCAL model reads notes + sidecars -> emits day-<NN>.edl.json (longform + reels).
RENDER VIEWS: generate day-<NN>.human.md + day-<NN>.reels.md from the EDL.
SHOW OPERATOR the EDL + runtime BEFORE rendering. Operator approves.
REMOTION renders long-form + 9:16 reels by following day-<NN>.edl.json literally.

Day 25 is the TEST. Same files, same shape, every subsequent day.


EDITOR_MODEL_CASTING.md

EDITOR MODEL CASTING — which local model plays which role

Authored 2026-05-29 (Claude Opus 4.8) from live SearxNG benchmark research (cited). The editor analog of the governance canon/model-rijal.md + war-room Faith files: each role is cast to the model that benchmarks best at it, so the multi-model panel (UNIVERSAL_EDITOR_HANDBOOK §0.5) uses each model in its lane and a synthesizer reconciles. Re-verify with SearxNG before trusting — model leaderboards move weekly; this is a 2026-05-29 snapshot.

RULE: roles, not a free-for-all. Mechanical steps use ONE cast model. Judgment steps convene the relevant sub-panel, then the SYNTHESIZER reconciles. Serial inference only (api/ps clear before each dispatch).

Casting methodology (two-phase) — operator-directed

Phase 1 — FIT FILTER (categorical): for each role list ONLY models that can do it AT ALL — has the required modality (vision for visual roles), enough context, structured-output ability, and fits THIS box (192GB RAM / 24GB VRAM) at usable speed. A model that can't see is not a visual candidate no matter how smart. Filter first. Phase 2 — RANK: among the fitters, pick the best by benchmarks AND operator hands-on experience, which OVERRIDES benchmarks. The table below is the current ranking; it must be re-run as a real sweep (see EDITOR_ROADMAP.md).

OPERATOR EMPIRICAL OVERRIDES (trust over leaderboards)

Dense qwen3.6 > qwen3.5:122b (operator, 2026-05-29): the dense 3.6 outperforms the larger older 3.5-122B in practice AND is far lighter/faster. Prefer qwen3.6 (27b/35b) everywhere; do NOT cast qwen3.5:122b (drop it as a backup).
laguna-xs.2 > devstral-small-2 for code (operator, 2026-05-29): laguna is primary code/structure; devstral backup.
mistral-large = SUSPECT (operator: history of run issues on this box). Do NOT depend on it. Needs a smoke test before casting load-bearing. Proven fallback for story/creative = qwen3.6:35b / nemotron-3-super.
llama4 = UNVERIFIED on this box (operator unsure it has ever run here). Don't depend on it; proven long-context fallback = qwen3.6 (1M ctx). Smoke-test before casting.
PREFER PROVEN-ON-THIS-BOX. The deliberation-chain models — gemma4:31b, qwen3.6:27b, laguna-xs.2, granite4.1:30b, nemotron-3-super — demonstrably run here (they ran the chain this session). Heavy unproven models (mistral-large 73GB, llama4 67GB, gpt-oss:120b 65GB, mistral-medium 80GB) must pass a SMOKE TEST (roadmap B9) before being cast load-bearing. A model that benchmarks great but won't run on this station scores ZERO. Fit (Phase 1) is gate-keeper.

The casting (role → model → why)

| # | Role | Primary | Backup / 2nd opinion | Evidence (2026-05-29 search) |
|---|------|---------|----------------------|------------------------------|
| 1 | Visual Analyst (per-clip notes: scene/quality/usage) | `qwen3.6:27b` (vision, think:false) | `gemma4:31b` | Qwen3.6 natively multimodal, "perception/multimodal reasoning far exceeds," tops VLM benchmarks; Reddit "Qwen's multimodal is next-level." Already the FINDINGS analysis model. |
| 2 | Visual QC (render upside-down? seam? black frames? safe-zone?) | `qwen3.6:27b` + `gemma4:31b` (two-vision vote) | `mistral-small3.2:24b` | Two independent vision models catch what one misses; the anti-slop gate (Stage 8.5, handbook §9.5). |
| 3 | OCR / text-in-frame (read signs, legibility of on-screen captions) | `gemma4:31b` | `mistral-small3.2:24b` | Gemma 4 "excels at OCR, chart/document/UI understanding"; Mistral Small 3.2 strong OCR. |
| 4 | Story / Arc / Sequencing (the cut's narrative shape) | `qwen3.6:35b` (PROVEN) | `nemotron-3-super` · mistral-large = SUSPECT, smoke-test | Mistral Large benchmarks best for narrative BUT has run issues here (operator) → use proven qwen3.6:35b until mistral-large passes a smoke test. |
| 5 | Creative / Copy (titles, captions, hook lines) | `qwen3.6:35b` (PROVEN) | mistral-large = SUSPECT, smoke-test | Mistral Large is the benchmark creative leader but unproven-stable here; qwen3.6:35b proven. |
| 6 | Hook / Energy ranking (reels best-moment selection) | `qwen3.6:35b` | `nemotron-3-super:latest` | Strong multimodal+reasoning to read energy off transcript + frames; pairs with the sidecar `hook_score`. |
| 7 | Format / Spec compliance (EDL valid? schema? safe-zones?) | `granite4.1:30b` | `laguna-xs.2:q4_K_M` | Granite = governance/structured audit; laguna = code/structure review (the governance scanner seat). |
| 8 | Code generation (Remotion/render code from the EDL) | `laguna-xs.2:q4_K_M` (operator: > devstral) | `qwen3.6:35b` · `devstral-small-2` | Operator finds laguna better than devstral for code; laguna also proven-on-box. Qwen3.6-35B "agentic coding" as backup. |
| 9 | Synthesizer / Director (reconcile the panel → final cut) | `nemotron-3-super:latest` | `gpt-oss:120b` | Nemotron-3-Super ≈ gpt-oss-120b / qwen-122b on reasoning + strong structured output; gpt-oss proven as orchestrator/planner (NVIDIA deep-research agent). |
| 10 | Long-context whole-day (hold ALL notes+sidecars at once) | `qwen3.6:27b` (1M ctx, PROVEN) | llama4 = UNVERIFIED here, smoke-test | Llama 4 has ultra-long ctx but may not run on this box (operator unsure) → qwen3.6 1M ctx is the proven default. |

Lean default vs full panel (cost discipline)

LEAN (every video): #1 qwen3.6:27b (analyze) → #4/#5 qwen3.6:35b (story+copy; mistral-large only after it passes a smoke test) → #7 granite4.1 (format) → #9 nemotron-3-super (synthesize/direct) → #2 qwen3.6+gemma4 (QC). ~5 models, serial.
FULL (ambiguous/flagship cut, or QC fails twice): add #3 OCR, #6 hook panel, #8 codegen, #10 long-context, and route the disputed call to the 6-agent deliberate.py chain. The 24/7 throughput advantage pays for the extra passes.

Not cast (and why)

Embeddings (nomic-embed-*) = RAG vectors, not judgment. Tiny models (qwen3.5:2b/4b, granite:3b/8b, lfm2*, ministral-3:8b, rnj-1) = too weak for cut decisions; use only for trivial mechanical text. nemotron-cascade-2* = governance RAG. laguna doubles as format reviewer only. AnythingLLM = governance, not the editor.

Sources

Live SearxNG 2026-05-29: HuggingFace/Google Gemma-4 cards; Qwen3.6 blog + HF + MarkTechPost; Mistral docs + Small-3.2 HF; Meta Llama-4; r/LocalLLaMA "Best Local LLMs Apr 2026"; NVIDIA Nemotron-3-Super modelcard + arXiv 2604.12374.


Autonomous Local Video Editor — Briefing

COMPETITIVE LANDSCAPE — market products vs. what we're building

The 5 product categories on the market

A. Cloud clip-to-shorts SaaS (the crowded, funded tier)

B. Transcript-based AI editors (assistive)

C. Agentic raw→finished editors (THE closest to us — and the key finding)

D. Local / privacy editors (emerging, but narrow)

E. AI video GENERATORS (different category — a tool we USE, not a competitor)

The empty square in the market (= our lane)

Root causes — theirs vs. our answer

HONEST SOTA verdict (where we are / are NOT state-of-the-art)

Sources

EDITOR FRAMEWORK — the autonomous editor's self-governing repo (constitution + index)

Layer map (editor framework ↔ global Claude framework analog)

Governing principles (the editor's "directives")

Why this is MORE autonomous than the global framework (honest)

Build order

EDITOR ROADMAP — build status + ordered upgrade plan (resume point)

ARCHITECTURE — the deliverable is a STANDALONE CLI, not docs, not Claude

A. BUILT (done — verify on disk before trusting)

B. PENDING UPGRADES & BUILD — DO IN THIS ORDER (this is "finish our upgrades")

C. DEFINITION OF DONE — "finalize the editor"

D. OPEN QUESTIONS (operator/decision needed — do not silently pick)

E. DEFERRED FUTURE (post-MVP — do NOT start here; would delay the end product)

UNIVERSAL EDITOR HANDBOOK — analyze → assemble, end to end

0. TOOLBOX (what's installed — verify paths in FINDINGS.md before use)

0.5 MULTI-MODEL PANEL (how diverse local models combine into judgment)

0.6 AUTONOMY MODEL (100% self-running; operator is NEVER a mid-run gate)

0.7 TOOL ACCESS (capabilities the autonomous models can call — propels reasoning, quality, autonomy)

1. PIPELINE OVERVIEW (8 stages)

2. STAGE 1 — INGEST & INVENTORY

3. STAGE 2 — CONVERT

4. STAGE 3 — 360 STITCH (only for DJI dual-fisheye raw)

5. STAGE 4 — ANALYZE → the .notes.md (read-only, never regenerate/delete)

6. STAGE 5 — SIDECARS (additive numeric layer; notes untouched) — <media>.edl.json

7. STAGE 6 — CONSOLIDATE → the day EDL (source of truth) — day-NN.edl.json

8. STAGE 7 — EDITORIAL DECISION FRAMEWORK (the craft; web-researched 2026-05-29)

8.1 Retention model (the spine of every decision)

8.2 A-roll vs B-roll (what carries the story)

8.3 Clip vs trim vs crop (the three reductions)

8.4 Transition selection (default to the hard cut)

8.5 Title cards & text overlays

8.6 Sequencing / story arc

8.7 Full-day vs reels (same EDL, different selection)

9. STAGE 8 — ASSEMBLE + RENDER (Remotion)

9.5 STAGE 8.5 — ANTI-SLOP QC LOOP (this is where the 90% is earned)

10. STAGE 9 — OUTPUT (targets — EDITING_FORMAT_SPEC §Platform targets)

11. HARD RULES (violating these is how prior instances shipped garbage)

12. SOURCES THIS HANDBOOK DISTILLS

EDITING FORMAT SPEC — master format for all day-editing artifacts

Scope — GENERAL + AUTONOMOUS (the bar this spec must clear)

Chain finding this encodes (why this spec exists)

Format law

Platform targets (web-researched 2026-05; re-verify periodically — specs drift)

Artifact 1 — clip SIDECAR <media>.edl.json (one per media file, lives beside the .notes.md)

Artifact 2 — day EDL day-<NN>.edl.json (THE source of truth; the LLM consolidates notes + sidecars into this)

Artifact 3 — human script day-<NN>.human.md (GENERATED from the day EDL)

Artifact 4 — reels sheet day-<NN>.reels.md (GENERATED from the day EDL reels block)

Pipeline (concrete, under DAY_VIDEO_PLAN.md)

EDITOR MODEL CASTING — which local model plays which role

Casting methodology (two-phase) — operator-directed

OPERATOR EMPIRICAL OVERRIDES (trust over leaderboards)

The casting (role → model → why)

Lean default vs full panel (cost discipline)

Not cast (and why)

Sources
