Autonomous Local Video Editor — Briefing

Generated 2026-05-30 · read top-to-bottom or jump via the index

COMPETITIVE_LANDSCAPE.md

COMPETITIVE LANDSCAPE — market products vs. what we're building

Compiled 2026-05-30 (Claude Opus 4.8) from live SearxNG research + GitHub deep-fetches (CutClaw, VideoAgent) + the 6-agent moat deliberation. HONEST assessment — not a pitch. Pairs with EDITOR_FRAMEWORK.md.

The 5 product categories on the market

A. Cloud clip-to-shorts SaaS (the crowded, funded tier)

Opus Clip, Klap, Vizard, Submagic, quso.ai, Reap, CapCut Long-to-Short. - Does: long video → short highlight clips, auto-captions, reframe to 9:16. - Arch: cloud, subscription, you UPLOAD your footage. Frontier/proprietary models server-side. - Root problem they target: repurposing long content into shorts at volume. - Where they fail (root causes): only CLIP (don't edit a full long-form video); generic output = the "AI slop" problem (YouTube's #1 2026 war); upload/privacy; per-video cost; limited context understanding (Opus Clip users on Reddit: "AI's understanding of context is still limited").

B. Transcript-based AI editors (assistive)

Descript, Captions, VEED Auto-Editor. - Does: edit by editing the transcript; filler-word/silence removal; AI voiceover; templates. - Arch: mostly cloud; human-in-the-loop (assistive, not autonomous). - Targets: making editing accessible to non-editors. - Fail: human still drives the edit (not autonomous); cloud; Descript itself is publicly fighting the "slop machine" label — i.e. the taste/QC gap is unsolved even by the category leader.

C. Agentic raw→finished editors (THE closest to us — and the key finding)

CutClaw, VideoAgent (HKUDS), OpenMontage, "Video Use", Poolday, Tellers.ai, Druid Cat. - Does: multi-agent pipeline: deconstruct raw footage → caption → shot-plan → pick timestamps → validate → render. This is OUR architecture. - Arch — THE CRITICAL POINT: they are FRONTIER-API ORCHESTRATORS. VideoAgent mandates Claude + GPT-4o + Deepseek + Gemini. CutClaw routes "core intelligence" to cloud via LiteLLM (OpenAI/Gemini/Claude). They run locally only for video DECODE; the brain is cloud. - Targets: full autonomous editing. - Fail: NOT private (footage/transcripts hit cloud APIs); NOT offline; NOT free (per-minute API cost — CutClaw's own stated bottleneck is API latency); cost compounds at scale (hundreds of videos).

D. Local / privacy editors (emerging, but narrow)

Reelify (Mac, no-upload), AetherCut, OpenCut, LTX Desktop. - Does: local processing for privacy. - Arch: runs on-device. - Targets: privacy / no-upload (a real, named 2026 demand). - Fail: only clip-to-shorts OR generation — NONE does full autonomous raw→finished long-form AND reels.

E. AI video GENERATORS (different category — a tool we USE, not a competitor)

Wan 2.2, LTX-2.3, Sora/Veo/Kling (cloud), Runway. - Make NEW footage from text/image. We use these (via ComfyUI) as the polish/B-roll layer — not editors of existing footage.

The empty square in the market (= our lane)

Cross C (autonomous raw→finished) with D (fully local/private). That intersection is EMPTY. Every autonomous editor is cloud/frontier; every local tool is clip-only. We sit in the gap: autonomous raw→finished, full-day + several reels, 100% local / offline / private / free, multi-model panel + self-audit QC.

Root causes — theirs vs. our answer

Root problem (industry) Who suffers Our answer
AI SLOP (no taste/QC) A, B, C — all anti-slop QC LOOP + 90% bar + DIVERSE panel + deterministic checks
Frontier-API dependence (cost/privacy/not-offline) C (all of them) 100% local, no frontier, offline
Upload / privacy A, B, C footage never leaves the box
Per-video cost at scale A, C free local, 24/7 throughput
Assistive, not autonomous B fully autonomous (§0.6, no human in the loop)
Hours-long context limits C pre-distilled notes + sidecars + RAG (not raw hours in context)
Clip-only (no full edit) A, D full-day long-form AND reels from one EDL

HONEST SOTA verdict (where we are / are NOT state-of-the-art)

Sources

Live SearxNG 2026-05-30 + GitHub: GVCLab/CutClaw, HKUDS/VideoAgent (both verified frontier-API-dependent); r/AIVideoCut (Opus Clip context limits); reelifyclips.com (local/no-upload); AetherCut/OpenCut/LTX-Desktop (privacy-local); YouTube CEO "AI slop" 2026 priority; Descript "slop machine" interview; moat deliberation (TEMP/deliberate/editor-moat-vs-competitors).

^ top

EDITOR_FRAMEWORK.md

EDITOR FRAMEWORK — the autonomous editor's self-governing repo (constitution + index)

This is the editor's equivalent of the global Claude governance framework (~/.claude: CLAUDE.md + canon + practice + faiths + hooks) — but PURPOSE-BUILT, ground-up, and MORE AUTONOMOUS. The global framework keeps a human in the loop for governance-depth work because those edits are high-stakes and irreversible. The editor's stakes are LOWER (a bad cut is recoverable) and its self-audit is STRONGER (deterministic checks + diverse panel + bounded retries + safety valve), so it removes the human gates and runs itself. The CLI runs ON this framework. Any engineering question — how to analyze, convert, stitch, cut, assemble, which model, which tool, what format — is answered by the docs indexed here. Don't guess, don't ask the operator: the framework is built to answer (RAG it — HANDBOOK §0.7).

Layer map (editor framework ↔ global Claude framework analog)

Global framework Editor framework File
Scripture (CLAUDE.md) — constitution North-star + Autonomy contract project_editor_north_star (memory) + HANDBOOK §0.6
Practice (how to operate) Pipeline procedure + editorial craft UNIVERSAL_EDITOR_HANDBOOK.md
Data model Artifact schemas + platform targets EDITING_FORMAT_SPEC.md
Faiths (role identities) Model casting (role→model) EDITOR_MODEL_CASTING.md
Hooks (structural self-enforcement) The 3 self-correcting loops (per-stage audit · QC "train" · safety valve) HANDBOOK §0.6
Tools Tool-access layer (SearXNG, ComfyUI, RAG, ffprobe, chain, ALL render engines) HANDBOOK §0.7
STATE.md (continuity) Build status + ordered roadmap + deferred EDITOR_ROADMAP.md
The deliberation chain The editor panel + deliberate.py escalation HANDBOOK §0.5 + CASTING
Deep log / appendix Technical findings, exact commands FINDINGS.md
Read-first / session entry Current state SESSION-HANDOFF.md
Locked high-level process The job, locked DAY_VIDEO_PLAN.md

Governing principles (the editor's "directives")

  1. 100% autonomous — no model waits on the operator (§0.6). Operator = launcher + async reviewer only.
  2. Self-audit every stage; QC the final cut; re-run the train on failure; park-and-flag — never loop forever, never ship slop.
  3. 90% human quality or it's slop. Won in the QC loop, not at render.
  4. Substrate is the precondition — thorough/detailed/SOTA, re-verified by live search; models RAG it to self-resolve.
  5. Full tool access — give models every tool (ALL render engines, ComfyUI, SearXNG, ffmpeg); they reach for what they need.
  6. Proven-on-box first — a benchmark winner that won't run on this station scores zero. Fit before rank.
  7. Footage-agnostic — any clip, any shoot. This trip is the test, not the scope.
  8. The CLI is the product — Claude builds it; it runs offline, local, 24/7. Claude is the mechanic, not the operator.
  9. Built to fold into warroom (learning flows BACK). The editor is a standalone self-governing CLI NOW, but architected as a portable SUB-CLI a future warroom can absorb (warroom edit ...). Build shared primitives (loop-detector, Node Ollama client, role-roster schema, the QC loop, the §0.6 autonomy loops) as DOMAIN-AGNOSTIC modules with clean boundaries — not editor-specific spaghetti — so warroom can inherit the advancements this project makes. We mined warroom for ideas; warroom will mine us for the better ones.
  10. PRIORITY: the editor takes PRECEDENCE over finishing warroom. Build the editor first; the fold-in is later. If a future instance must choose, the editor wins.

Why this is MORE autonomous than the global framework (honest)

Lower stakes (recoverable output) + stronger structural self-audit (deterministic + panel + cap + valve) justify removing the human-in-the-loop gates the global framework keeps. Autonomy is EARNED by the audit architecture, not assumed.

Build order

The product is the offline editor CLI (EDITOR_ROADMAP §B0). It implements this framework: HANDBOOK = its logic, FORMAT_SPEC = its data, CASTING = its config, §0.6 = its control loops, §0.7 = its tools. Build B0→B10; defer §E.

^ top

EDITOR_ROADMAP.md

EDITOR ROADMAP — build status + ordered upgrade plan (resume point)

Updated 2026-05-29 (Claude Opus 4.8). PURPOSE: a fresh instance reads THIS to know exactly what's built, what's pending, and the next procedures/upgrades — so "finish our upgrades and finalize the editor" is unambiguous and nothing is lost between instances. Pair with UNIVERSAL_EDITOR_HANDBOOK.md (how) + EDITING_FORMAT_SPEC.md (formats) + EDITOR_MODEL_CASTING.md (who).

SHIP-FIRST DISCIPLINE (operator 2026-05-30): the goal is producing VIDEOS soon. Do NOT over-engineer for generality. The engine is domain-agnostic by clean-module HYGIENE (same router/panel/QC-loop, swap Faith files per domain = the warroom pattern), but generality is a FREE BYPRODUCT, not a goal that delays the first video. Fastest path = MVP spine (B0 core + B1 + B2 + B3 + agent-native render) → ONE Day-25 video → iterate. Do NOT rabbit-hole on fold-in / warroom / meta-architecture.

BUILD METHOD (operator 2026-05-30): STAGE-BY-STAGE, QUALIFY-THEN-CEMENT. Build each stage as a STANDALONE script → run it on Day-25 → QUALIFY (verify the output is good by LOOKING / measuring) → only THEN cement it into the CLI as a stable module. The CLI shell grows incrementally around proven stages. Do NOT build the whole CLI upfront then rework + bug-fix each stage. Each cemented stage is frozen-good; the next builds on PROVEN output. (The qualify step is the human-verified seed of that stage's automated per-stage audit, HANDBOOK §0.6.) QUALIFY = INDEPENDENT validation, not the builder's self-report. The model/agent that BUILT a stage must NOT be the one that qualifies it (builder-judges-own-work = the weak same-tribe validation our framework distrusts). A DIFFERENT validator checks the output across MULTIPLE clips + edge cases (e.g. for sidecars: word-timestamp accuracy vs notes, silence detection, hook-score calibration, silent/360 clips, schema). Validate the FINAL artifact, not a throwaway intermediate. Only after independent validation passes is a stage "cemented." So B0 is NOT "build the CLI first" — it is "grow the CLI shell as B1→B-render stages qualify." Order of qualification: Whisper-large → sidecars → EDL → render → QC.

═══════════════════════════════════════════════════════════════════

ARCHITECTURE — the deliverable is a STANDALONE CLI, not docs, not Claude

═══════════════════════════════════════════════════════════════════ THE PRODUCT is editor — a standalone CLI that runs the whole pipeline on LOCAL models only (Ollama + ffmpeg + Remotion + ComfyUI), OFFLINE, with NO Claude/frontier in the loop. This is what makes the editor headless + 24/7 + free, which the north-star requires. Without it, every video needs Claude — not autonomous.

SEPARATION OF ROLES: - Claude (frontier) = the MECHANIC — builds, researches (SearxNG, web), and UPGRADES the editor. Needed to improve HOW videos are made, not to make them. - editor CLI = the PRODUCTION LINE — runs analyze→…→QC-loop on local models. editor run --day <folder> → walk away → EDL + cuts. Subcommands per handbook stage (analyze, sidecar, edl, render, qc); run = full pipeline; --watch = auto-process new footage 24/7. Hard edge-cases are FLAGGED for operator/Claude, not blocked on them. - The UNIVERSAL_EDITOR_HANDBOOK is the CLI's SPEC; EDITOR_MODEL_CASTING is its CONFIG; EDITING_FORMAT_SPEC is its data model. CURRENT STATE: pipeline exists as SEPARATE scripts (vanlife-notes.py, osv_stitch.py, …) orchestrated by Claude conversationally — NOT yet unified or headless. Unifying them under one offline editor CLI is the keystone build (B0).

═══════════════════════════════════════════════════════════════════

A. BUILT (done — verify on disk before trusting)

═══════════════════════════════════════════════════════════════════ DOCS (this session): - [x] DAY_VIDEO_PLAN.md — locked process. [x] EDITING_FORMAT_SPEC.md — artifact schemas + platform targets. - [x] UNIVERSAL_EDITOR_HANDBOOK.md — full analyze→assemble procedure + editorial framework + anti-slop QC loop. - [x] EDITOR_MODEL_CASTING.md — role→model casting (two-phase methodology; operator overrides). - [x] SESSION-HANDOFF.md — current-state read-first. [x] FINDINGS.md — deep tech log/appendix. SCRIPTS (prior sessions, per FINDINGS file map — confirm they run): - [x] vanlife-notes.py — per-clip vision+Whisper notes (qwen3.6:27b GPU think:false; Whisper medium.en). - [x] osv_stitch.py — DJI 360 dual-fisheye → equirect (ffmpeg v360=dfisheye). [x] reframe360_director.py — VLM-guided 360→16:9. - [x] comfy_client.py — ComfyUI bridge (needs API-format workflows). DELIBERATION: - [x] notes-SOTA chain (6/6): notes are human-sufficient, LLM-insufficient → additive sidecar is the fix. - [x] stop-hook + governance hook fixes (this session). DATA: [x] Day-25 .notes.md exist. [~] A Day-25 cut was shipped previously (engine = see OPEN QUESTIONS).

═══════════════════════════════════════════════════════════════════

B. PENDING UPGRADES & BUILD — DO IN THIS ORDER (this is "finish our upgrades")

═══════════════════════════════════════════════════════════════════ 0. [ ] THE editor CLI (KEYSTONE) — unify the separate stage scripts into ONE standalone, offline CLI that runs the whole handbook pipeline on local models with no Claude in the loop. Subcommands per stage + run (full) + --watch (24/7). This is the thing that makes the editor headless/autonomous; everything else (B1–B10) plugs into it. ADAPT FROM C:\warroom (scanned 2026-05-30) — a mature local-no-frontier governance CLI; reuse the spine, don't rebuild it: core/router.py (CPU-serialized priority queue = serial-inference discipline) · config/roster.yaml + core/facilitator.py (config-driven SEAT ASSIGNMENT = our EDITOR_MODEL_CASTING as YAML) · core/watchdog.py (sliding-window repetition detector = our park-and-flag safety valve) · clients/ollama.py + clients/searxng_client.py (REST wrappers, no LiteLLM, estimate_inference_timeout() scales timeout to prompt size) · Typer single-file + REPL (editor process x one-shot AND editor/process interactive) · core/hierarchy_enforcer.py + core/guardrails.py (hard-gate vs soft-form = our deterministic-checks layer) · core/faith_builder.py (two-tier universal+project roles). HEED: warroom streaming is STUBBED (must implement for watchdog loop-detection); SearXNG host can be flaky (use mDNS); intra-role loop ceiling = 2 then escalate. NET: the orchestration spine exists → editor CLI = adapt warroom core/clients/config + add the VIDEO stages. Cuts a real chunk off B0. CAVEAT (operator 2026-05-30): warroom is STALE — NOT updated in >1 month — and our CURRENT SearXNG/search tooling is already MORE ADVANCED than warroom's. Treat warroom as SUGGESTIONS TO VERIFY, NOT a gold standard. Re-validate every pattern against current substrate + current tools before reuse; do NOT copy its searxng client, and do NOT assume its model roster / tool / host choices are current. Cherry-pick the architectural ideas (serial router, config-driven seats, watchdog, REPL); verify the specifics fresh. VERIFIED 2026-05-30 (parallel agent cross-check vs OUR current tooling): ADOPT — (1) watchdog.py RepetitionDetector/NgramLoop: fills a REAL GAP (we have NO live loop detection; rely on 9h wall-clock timeouts) — highest value, stdlib, lifts cleanly. (2) ollama.py estimate_inference_timeout + host-normalize + token-preflight + typed errors — PORT TO NODE (.mjs per canon), replaces our flat timeout=32768. (3) faith_builder.py two-tier role assembly + provenance + inheritance-check (rewrite its antipattern corpus for editor roles). (4) roster.yaml tiered-seat YAML schema (strip ALL frontier/paid seats). (5) hierarchy_enforcer credential/PII redaction gate (narrow, domain-agnostic). (6) router.py priority queue — only if the CLI fields concurrent work; overkill for a fixed serial chain. IGNORE — OUR SUBSTRATE ALREADY SURPASSES / CONFLICTS: warroom searxng_client (ours = SearXNG+Jina+MCP web_url_read = full-content read, not snippets; warroom host stale); roster CASTING CONTENT (mistral-large / qwen3.5:122b / Claude paid_api — dropped/forbidden by our casting + no-frontier); num_gpu=0 hardwire + keep_alive=-1 resident model (conflict with our partial-GPU num_gpu=14/50 + serial discipline). BUILD THE CLI IN NODE, not warroom's Python/Typer. 1. [~] Whisper large-v3 — DOWNLOADING 2026-05-30 (greenlit; bandwidth/space are NOT constraints — operator). medium.en already does word-timestamps and built/qualified sidecars (B2 = the working baseline). large-v3 = accuracy upgrade. ON DOWNLOAD COMPLETE: point build_sidecars.py whisper model at ggml-large-v3.bin + REGENERATE the day's sidecars (cheap, overwrites the .edl.json). NOTE: bandwidth/space are unlimited — never gate a needed download/program again. 2. [~] Sidecar pipeline — BUILT + builder-self-checked (NOT yet INDEPENDENTLY validated → not cemented). C:\Users\marka\llama.cpp\build_sidecars.py (domain-agnostic; any clip or --dir). ffprobe (duration/fps/w/h/has_audio) + whisper-cli medium.en -ojf -sow word+segment timestamps + derived silences (>0.6s gaps) + hook_score = 0.5energy_var(RMS-CV) + 0.5speech_rate. Emits <media>.edl.json; NEVER touches .notes.md (mtimes verified). QUALIFIED on 2 Day-25 clips (coffee: 383 words, hook 0.77; bear: 75 words, hook 0.48; word-timestamps spot-checked aligned vs notes). ~8–27s/clip GPU. NEXT (cement): run python build_sidecars.py --dir "E:/vanlife/may 2026/25" for the full day. 3. [ ] EDL consolidator — replace vanlife-editplan.py prose with a model that reads notes+sidecars and emits day-NN.edl.json (EDITING_FORMAT_SPEC Artifact 2: longform + reels[] + suggestions[] = timestamped enhancement ideas like "voiceover 1:20–1:45 about X", left but not auto-applied). MUST re-check & override each clip's role/verdict against its transcript (fixes mislabels). Apply the §8 editorial framework to set in/out/order/ transitions/overlays. Cast: synthesizer = nemotron-3-super; story = mistral-large. 4. [ ] Markdown renderers — generate day-NN.human.md + day-NN.reels.md from the EDL (Artifacts 3 & 4). 5. [ ] Remotion assembly generalized — render any day's EDL literally (clips/in-out/transitions/overlays/music). Confirm engine first (OPEN QUESTION). Show operator the EDL+runtime BEFORE render. 6. [ ] Anti-slop QC loop (Handbook §8.5) — render → sample frames+audio → multi-model panel judges vs §8 rubric → score + locate defects → recut if hard-error or <90 → loop (cap ~5) → never ship <90 silently. THE 90% earner. 7. [ ] Magika into ingest — wire E:/Scripts/magika for byte-level file-type detection (any-content robustness). 8. [ ] Lean panel wiring — implement the 5-model default pipeline from EDITOR_MODEL_CASTING.md (serial; api/ps gate). 9. [ ] Empirical casting sweep — run the TWO-PHASE sweep (fit-filter → rank) on all 33 local models for each role, recording operator-style hands-on results (not just benchmarks). Update EDITOR_MODEL_CASTING.md with measured winners. FIRST: smoke-test the SUSPECT heavies — mistral-large (73GB, run-issue history), llama4 (67GB, unverified here), gpt-oss:120b (65GB), mistral-medium (80GB): does each LOAD + respond + stay stable on this box? Cast only survivors; otherwise keep the PROVEN anchors (qwen3.6, nemotron-3-super, granite4.1, gemma4, laguna). Operator findings so far: dense qwen3.6 > qwen3.5:122b; laguna > devstral; mistral-large & llama4 suspect. 10.[ ] Generalize beyond this trip** — confirm every script is footage-agnostic (the GENERAL bar); Day-25 is the test.

═══════════════════════════════════════════════════════════════════

C. DEFINITION OF DONE — "finalize the editor"

═══════════════════════════════════════════════════════════════════ Running the standalone OFFLINE editor CLI (no Claude in the loop) on ANY day folder executes ingest→convert→stitch→analyze→sidecar→EDL→editorial→assemble→QC-loop with NO hand-holding and produces a full-day cut (8–15 min, retention-shaped) + several reels (9:16, 20–60s) at ≥90% human quality, self-critiqued until it clears the bar, with the EDL shown to the operator before render. Items B0–B10 complete.

═══════════════════════════════════════════════════════════════════

D. OPEN QUESTIONS (operator/decision needed — do not silently pick)

═══════════════════════════════════════════════════════════════════ - Assembly engine — CORRECTED 2026-05-30 (prior "Remotion default" was DRIFT): the CLI has access to ALL engines, but the choice is by AGENT-GENERATION reliability since a LOCAL MODEL generates the assembly. React/TSX (Remotion) is the MOST error-prone for a model to emit → HyperFrames (HTML) = LEAD (agent-native, installed, shipped Day-25); Rendervid (JSON) = alt (our EDL is already JSON); Remotion = demoted (constrained template only, not free React). TODO (B5): run the empirical bake-off (same EDL → HyperFrames vs Rendervid) to lock the lead; also CORRECT DAY_VIDEO_PLAN (it still says "render in Remotion" — wrong for agent-generation per operator + FINDINGS). - large-v3 download is a ~3GB external fetch — operator may want to OK the bandwidth/disk; otherwise proceed.

═══════════════════════════════════════════════════════════════════

E. DEFERRED FUTURE (post-MVP — do NOT start here; would delay the end product)

═══════════════════════════════════════════════════════════════════ - Auto-voiceover / voice cloner — automatically fulfill suggestions[].type=="voiceover": generate the VO line + lay it in. Building blocks EXIST (Qwen3-TTS already wired in a ComfyUI workflow; voice_ref work done) — but integrating it is POST-MVP. Ship the autonomous editor first; it LEAVES voiceover suggestions now, a later generator stage fulfills them. - Also deferred: agentic/non-linear editing framework; multi-frame temporal analysis; advanced 360 horizon auto-level. RULE: deferred items must NOT block B0–B10. MVP = a working autonomous editor that leaves suggestions; generators come after.

^ top

UNIVERSAL_EDITOR_HANDBOOK.md

UNIVERSAL EDITOR HANDBOOK — analyze → assemble, end to end

Authored 2026-05-29 (Claude Opus 4.8). The ONE procedure a cold instance follows to turn raw footage into professional output. Distilled from FINDINGS.md (deep technical detail + gotchas), EDITING_FORMAT_SPEC.md (artifact formats + platform targets), and DAY_VIDEO_PLAN.md (the locked process). Editorial-craft sections are grounded in live 2026 web research (cited inline), not memory.

THE BAR: footage-agnostic. A LOCAL model, given ANY clip from ANY shoot, follows this to produce a full-day long-form cut AND several reels — professional grade — with no human hand-holding. This trip is the first test, not the scope.

═══════════════════════════════════════════════════════════════════

0. TOOLBOX (what's installed — verify paths in FINDINGS.md before use)

═══════════════════════════════════════════════════════════════════ - ffmpeg — convert, frame-extract, 360 stitch (v360=dfisheye), reframe, mux. The workhorse. - Vision analysis: qwen3.6:27b on GPU via Ollama, think:false, images ≤1024px (~115–230s/frame). NOT Nemotron-Omni on CPU (0.5 tok/s, non-viable — FINDINGS #1). - Audio: whisper.cpp (CUDA). INSTALLED on disk: ggml-medium.en (1.5GB) — this is what the pipeline references TODAY (verified 2026-05-29). RECOMMENDED UPGRADE: ggml-large-v3 (~3GB, free) — more accurate transcripts, better names/places, fewer hallucinations; NOT yet downloaded. WORD-timestamps (not yet generated) are required for editing — see Stage 5. - File-type detection: Magika (E:/Scripts/magika, Google ML detector) — classifies ANY file by its bytes, not extension (Stage 1). - Deliberation chain: 6 local models (gemma/qwen/laguna/granite/nemotron) via scripts/deliberate.py — for architecture/SOTA calls. - Generative (ComfyUI, port 8188): Flux2-Klein (titles/stills), Stable Audio 3 (music+SFX), Wan 2.2 (B-roll), via comfy_client.py (needs API-format workflows). - Assembly (FULL ACCESS — all engines reachable; choose by AGENT-GENERATION reliability, NOT raw capability): because a LOCAL MODEL generates the assembly, prefer formats a model emits reliably. HyperFrames (HTML) = LEAD (agent-native, deterministic, installed, already shipped a Day-25 cut). Rendervid (JSON) = alt (agent-native; our EDL is already JSON). Remotion (React/TSX) = DEMOTED — React is the most error-prone for a model to generate; use only via a CONSTRAINED fill-in template, never free React generation. ffmpeg for direct ops. Settle HyperFrames-vs- Rendervid by the empirical bake-off (FINDINGS) — same EDL into each, keep whichever the local model produces cleanly. (Corrected 2026-05-30: prior "Remotion default" was drift — React is wrong for agent-generation.) - Research: mcp__searxng-mcp__searxng_web_search ONLY (operator constraint). NO frontier cloud models/tools.

═══════════════════════════════════════════════════════════════════

0.5 MULTI-MODEL PANEL (how diverse local models combine into judgment)

═══════════════════════════════════════════════════════════════════ We have several local multimodal + reasoning models with DIFFERENT strengths (qwen3.6:27b & mistral-small3.2:24b vision; gemma/qwen/laguna/granite/nemotron reasoning; nemotron-omni). Diversity is the asset — but only if used right.

═══════════════════════════════════════════════════════════════════

0.6 AUTONOMY MODEL (100% self-running; operator is NEVER a mid-run gate)

═══════════════════════════════════════════════════════════════════ The pipeline runs end-to-end with NO model ever WAITING on the operator. A model's questions are resolved INTERNALLY — it reasons against the substrate (this handbook, notes, sidecars, casting, FINDINGS) and, if still stuck, convenes the panel (§0.5) or the deliberation chain. The operator is never a dependency the line stalls on.

THREE NESTED SELF-CORRECTING LOOPS hold quality without a human: 1. Per-stage AUDIT (inner): every stage validates its OWN output before passing it on — DETERMINISTIC checks (cheap, reliable: EDL schema-valid? runtime in range? stitched frame right-side-up via horizon check? render file exists + duration matches + not all-black?) PLUS model judgment (subjective quality). Fail → the stage self-corrects and RETRIES (new params / re-prompt / panel), never the operator. Cap retries per stage. 2. Final QC GATE (outer = "the train"): the assembled cut is judged by the QC panel vs the §8 rubric (§8.5). Fail (hard error OR <90) → the specific defects are pushed BACK through the pipeline and it re-runs. Cap N full loops. 3. SAFETY VALVE: if a stage or the QC loop exhausts retries, the pipeline does NOT loop forever and does NOT ship slop — it PARKS the job, LOGS the exact unresolved defect, and FLAGS it for async operator/Claude review, then keeps the line moving on other jobs. Graceful failure + a log entry — NOT mid-run waiting.

WHY deterministic-checks + diverse-panel + cap + valve (honest): a model can be CONFIDENTLY WRONG. Pure model self-judgment can loop on a non-issue or pass slop. Deterministic checks catch objective failures cheaply; a DIVERSE panel (not one model) is harder to fool; the retry cap + park-and-flag prevents infinite loops and silent slop. Autonomy = self-correcting with BOUNDED loops and graceful failure, not blind trust in one opinion.

OPERATOR ROLE = ASYNC REVIEW, never a blocking gate. (Supersedes the old "show operator the EDL BEFORE render" hard gate — operator directive 2026-05-30: 100% autonomous. The EDL is still LOGGED for review; reviewing it is optional/async.)

═══════════════════════════════════════════════════════════════════

0.7 TOOL ACCESS (capabilities the autonomous models can call — propels reasoning, quality, autonomy)

═══════════════════════════════════════════════════════════════════ For the role-models to reason well, lift production quality, and SELF-RESOLVE (§0.6) without the operator, the CLI exposes a tool layer they can invoke. A role-model should never be stuck for lack of a tool it could have called. - SearxNG search (localhost:8080) — SOTA/technique verification, LOCATION/landmark enrichment (identify a sign/place to caption it right), music/caption-trend reference. Local instance → offline-capable; a strict-offline run skips external fetch and leans on substrate only. - ComfyUI (localhost:8188) — generative fill: Wan2.2 B-roll for gaps, Flux2 title backgrounds, Stable Audio 3 music/SFX. (Qwen3-TTS voiceover exists in a workflow but is DEFERRED — roadmap §E.) - ffprobe / ffmpeg — deterministic audits (duration, black-frame, orientation) + frame extraction for vision. - Deliberation chain (deliberate.py) — escalation for hard/ambiguous calls (autonomy support, NOT the operator). - RAG over substrate (nomic-embed) — index this handbook + notes + casting + FINDINGS so a role-model LOOKS UP the answer to its own question instead of guessing or stalling. This is the core engine of §0.6 self-resolution. - Whisper (transcribe / word-timestamps), Magika (byte-level file typing). RULE: give the model the tool, CAP the calls (no infinite tool loops), LOG what it used (auditability).

═══════════════════════════════════════════════════════════════════

1. PIPELINE OVERVIEW (8 stages)

═══════════════════════════════════════════════════════════════════ INGEST → CONVERT → 360-STITCH → ANALYZE → SIDECAR → CONSOLIDATE(EDL) → EDITORIAL-CUT → ASSEMBLE+RENDER → OUTPUT Each stage's artifacts and formats are defined in EDITING_FORMAT_SPEC.md. Notes are read-only inputs; everything numeric is additive sidecars.

2. STAGE 1 — INGEST & INVENTORY

3. STAGE 2 — CONVERT

4. STAGE 3 — 360 STITCH (only for DJI dual-fisheye raw)

5. STAGE 4 — ANALYZE → the .notes.md (read-only, never regenerate/delete)

6. STAGE 5 — SIDECARS (additive numeric layer; notes untouched) — <media>.edl.json

The chain-verified gap: notes lack numbers a machine needs. Build per clip (EDITING_FORMAT_SPEC Artifact 1): - ffprobe → duration, fps, w, h. - Whisper in WORD-TIMESTAMP mode → words[] + segments[] (none are saved on disk → must re-run; this is additive, not note-regeneration). Derive silences[]. - hook_score (0–1) per candidate window = audio RMS-energy variance + speech-rate (cheap reels-ranking signal). - Resolve each note's prose TIMING ("start at 'Alright…'") to numeric in_s/out_s by phrase-matching against words[].

7. STAGE 6 — CONSOLIDATE → the day EDL (source of truth) — day-NN.edl.json

A local model reads notes + sidecars and emits ONE EDL (EDITING_FORMAT_SPEC Artifact 2): longform.clips[] and reels.picks[]. AT THIS STEP it RE-CHECKS each clip's role/verdict against the transcript and OVERRIDES bad tags (this is where the A/B-roll mislabel gets fixed). Then apply the EDITORIAL framework below to set the actual in/out, order, transitions, overlays.

═══════════════════════════════════════════════════════════════════

8. STAGE 7 — EDITORIAL DECISION FRAMEWORK (the craft; web-researched 2026-05-29)

═══════════════════════════════════════════════════════════════════ This is how a professional decides the cut. The model applies these rules to the notes+sidecars.

8.1 Retention model (the spine of every decision)

8.2 A-roll vs B-roll (what carries the story)

8.3 Clip vs trim vs crop (the three reductions)

8.4 Transition selection (default to the hard cut)

8.5 Title cards & text overlays

8.6 Sequencing / story arc

8.7 Full-day vs reels (same EDL, different selection)

9. STAGE 8 — ASSEMBLE + RENDER (Remotion)

9.5 STAGE 8.5 — ANTI-SLOP QC LOOP (this is where the 90% is earned)

THE NORTH STAR: 90% of a human editor's quality. Below that is slop. We do NOT beat a human on per-video speed — we win on THROUGHPUT (24/7 local, 192GB RAM / 24GB VRAM, hundreds of videos). Spend that advantage HERE: a human edits a video once; the machine renders, JUDGES its own output, finds the slop, and RECUTS — N passes, free, overnight. That loop closes the last 10–20%. Without it, automation produces slop.

The loop (after every render, before ship): 1. Sample the output: extract frames across the timeline + read the audio/caption track. 2. Judge against the §8 rubric (vision model + checklist): Is there a hook in the first 5–15s? Any dead air / left-in pauses / rambling? Any HARD ERROR — upside-down or seam-warped 360, black frames, wrong reframe, caption outside the safe zone, dissolve on dialogue, jarring jump? Is the arc coherent? Is the pattern-interrupt cadence (~45–60s) met? Reels: hook in 0.5–3s, self-contained, loops? 3. Score + locate: produce a 0–100 score and SPECIFIC defects tied to timecodes ("dead air 2:14–2:19", "clip 7 upside down"). 4. Gate: any hard error OR score <90 → emit recut instructions → revise the EDL/assembly → re-render → loop. Escalate borderline/taste calls to the deliberation chain. 5. Cap + log: bound passes (e.g. 5); if still <90, surface to the operator with the remaining defects named — never ship slop silently, never claim 90 without the frame-level check (HARD RULE #1).

This is the inverse of one-shot generation: the model is allowed to be wrong on pass 1 because it CRITIQUES and fixes itself. The irreducible ~10% gap (taste, emotional beats, "which moment is the one") is narrowed here by hook_score + the critique loop + the chain — narrowed, not erased.

10. STAGE 9 — OUTPUT (targets — EDITING_FORMAT_SPEC §Platform targets)

═══════════════════════════════════════════════════════════════════

11. HARD RULES (violating these is how prior instances shipped garbage)

═══════════════════════════════════════════════════════════════════ 1. Verify by LOOKING. Extract a frame from any stitched/reframed/rendered output and look. exit 0/HTTP-200 ≠ correct. 2. Notes are read-only. Never regenerate or delete .notes.md. Add sidecars instead. 3. Serial inference. One Ollama model at a time; check api/ps before every dispatch. Vision on GPU is a short burst — don't run ComfyUI at the same time. 4. No frontier. Local only. SearxNG for research; re-verify any tool/SOTA claim with a live search — training is stale. 5. The cut serves retention + story, not completeness. Cutting good footage is correct if it doesn't earn its place.

12. SOURCES THIS HANDBOOK DISTILLS

^ top

EDITING_FORMAT_SPEC.md

EDITING FORMAT SPEC — master format for all day-editing artifacts

Authored 2026-05-29 from the 6-agent chain verdict (deliberation: notes-sota-sufficiency, 6/6 unanimous). Governs the FORMAT of every editing file so every day follows the same shape. Sits UNDER the process in DAY_VIDEO_PLAN.md (which is locked) and answers to FINDINGS.md. This file does not change the process; it pins down the exact schema of each artifact the process produces.


Scope — GENERAL + AUTONOMOUS (the bar this spec must clear)

This spec is footage-AGNOSTIC. The bar: a LOCAL LLM, given ANY clip from ANY trip — not just this one — can follow these templates to analyze it, organize the day, and edit a final production: BOTH a full-day cut AND SEVERAL reels per day, end-to-end, with no human hand-tuning. Day 25 (this trip) is the FIRST TEST INSTANCE of the general system, not the scope. Nothing in the schema or process may hardcode this trip, these locations, or these filenames. If a template only works because it "knows" this trip, it is wrong and must be generalized.

Chain finding this encodes (why this spec exists)

The existing per-clip .notes.md (Nemotron-Omni + Whisper) are SOTA-sufficient for a HUMAN editor (full-day and reels) but INSUFFICIENT for an LLM editor, because they carry the SEMANTIC layer (scene, story, role, verdict, transcript) but NOT the NUMERIC/TEMPORAL layer a renderer needs: numeric in/out, word-level timestamps, per-clip duration, and a best-moment ranking. Fix = an ADDITIVE numeric SIDECAR per clip. The .notes.md are NEVER modified, regenerated, or deleted.


Format law

  1. JSON is the single source of truth. The sidecars and the day EDL are JSON. A renderer or LLM follows them literally.
  2. Markdown views are GENERATED from the JSON, never hand-written. The human script and the reels sheet are rendered from the day EDL. Same data, two readable renderings — they cannot drift because they are derived.
  3. HTML is deferred. Only added later if an interactive (clickable-timecode) preview is wanted. Render from the JSON.
  4. All times are seconds as floats (e.g. 12.40). Frame = round(seconds * fps). mm:ss is for HUMAN views only.
  5. The .notes.md are read-only inputs. No artifact here writes to them.

Platform targets (web-researched 2026-05; re-verify periodically — specs drift)

LONG-FORM (YouTube; 16:9, 1920x1080, H.264 MP4, 30fps): - A single day's vlog targets ~8–15 min. >=8 min unlocks mid-roll ads; 10–20 min is the storytelling sweet spot ONLY while retention holds >50%. Retention beats raw length — the long-form cut MUST remove dead air / pauses / rambling (this is why word-timestamps are required) to keep the curve up. Default day target: ~8–12 min.

REELS / SHORTS (ONE 9:16 vertical export serves all three platforms — post unchanged to each): - YouTube Shorts: 9:16, 1080x1920, <=180s hard max; top performers 25–50s. - Instagram Reels: 9:16, 1080x1920, up to ~3 min, 30fps, <=4GB. - TikTok: 9:16, 1080x1920; optimal short (15–60s), minutes-long ceiling. - STANDARD WE TARGET: 9:16, 1080x1920, 30fps, H.264 MP4, 20–60s (<=60s = safe + optimal across ALL three at once), with all caption/overlay text inside the SAFE ZONE — clear of the top, bottom, and right edges where platform UI (captions, buttons, handle) overlaps. Several such reels per day (2–5).

Artifact 1 — clip SIDECAR <media>.edl.json (one per media file, lives beside the .notes.md)

The numeric layer the notes lack. Derived from assets on disk (ffprobe + Whisper word-timestamps).

{
  "file": "CAM_20260525122958_0025_D.mp4",
  "duration_s": 25.4,
  "fps": 29.97,
  "width": 3840, "height": 2160,
  "has_audio": true,
  "words": [ {"w": "Alright", "start_s": 0.12, "end_s": 0.46}, {"w": "so", "start_s": 0.46, "end_s": 0.58} ],
  "segments": [ {"start_s": 0.12, "end_s": 6.80, "text": "Alright, so we're just planning our trip right now."} ],
  "silences": [ {"start_s": 6.80, "end_s": 7.90} ],
  "hook_score": 0.0,
  "hook_reason": "",
  "source_note": "CAM_20260525122958_0025_D.mp4.notes.md"
}

Artifact 2 — day EDL day-<NN>.edl.json (THE source of truth; the LLM consolidates notes + sidecars into this)

{
  "day": 25,
  "date": "2026-05-25",
  "location": "Raby Lake -> Sudbury -> Niagara",
  "longform": {
    "target_s": [480, 900],
    "clips": [
      { "file": "...0025_D.mp4", "in_s": 0.12, "out_s": 24.0, "role": "A",
        "overlay_text": "", "reason": "authentic kettle-vs-machine planning dialogue", "source_note": "...notes.md" }
    ],
    "runtime_s": 0.0
  },
  "reels": {
    "aspect": "9:16",
    "target_s": [20, 60],
    "count_target": [2, 5],
    "picks": [
      { "file": "...0025_D.mp4", "in_s": 8.0, "out_s": 52.0, "caption": "van life coffee hack",
        "reason": "highest hook_score; self-contained tip", "hook_score": 0.81 }
    ]
  }
}

Artifact 3 — human script day-<NN>.human.md (GENERATED from the day EDL)

Readable cut list for a human editor: ordered table of # | clip | in–out (mm:ss) | role | overlay | why plus the transcript snippet per A-roll clip, runtime total, thumbnail candidate. Never hand-edited.

Artifact 4 — reels sheet day-<NN>.reels.md (GENERATED from the day EDL reels block)

The chosen moments — one block PER reel (several per day): file, in–out (mm:ss), caption, 9:16 framing note, the quote each lands on.


Pipeline (concrete, under DAY_VIDEO_PLAN.md)

  1. INPUT: the day's .notes.md (semantic) — read-only.
  2. BUILD SIDECARS: ffprobe + Whisper word-timestamps + silence + hook score -> <media>.edl.json per clip. Notes untouched.
  3. CONSOLIDATE: a LOCAL model reads notes + sidecars -> emits day-<NN>.edl.json (longform + reels).
  4. RENDER VIEWS: generate day-<NN>.human.md + day-<NN>.reels.md from the EDL.
  5. SHOW OPERATOR the EDL + runtime BEFORE rendering. Operator approves.
  6. REMOTION renders long-form + 9:16 reels by following day-<NN>.edl.json literally.

Day 25 is the TEST. Same files, same shape, every subsequent day.

^ top

EDITOR_MODEL_CASTING.md

EDITOR MODEL CASTING — which local model plays which role

Authored 2026-05-29 (Claude Opus 4.8) from live SearxNG benchmark research (cited). The editor analog of the governance canon/model-rijal.md + war-room Faith files: each role is cast to the model that benchmarks best at it, so the multi-model panel (UNIVERSAL_EDITOR_HANDBOOK §0.5) uses each model in its lane and a synthesizer reconciles. Re-verify with SearxNG before trusting — model leaderboards move weekly; this is a 2026-05-29 snapshot.

RULE: roles, not a free-for-all. Mechanical steps use ONE cast model. Judgment steps convene the relevant sub-panel, then the SYNTHESIZER reconciles. Serial inference only (api/ps clear before each dispatch).

Casting methodology (two-phase) — operator-directed

Phase 1 — FIT FILTER (categorical): for each role list ONLY models that can do it AT ALL — has the required modality (vision for visual roles), enough context, structured-output ability, and fits THIS box (192GB RAM / 24GB VRAM) at usable speed. A model that can't see is not a visual candidate no matter how smart. Filter first. Phase 2 — RANK: among the fitters, pick the best by benchmarks AND operator hands-on experience, which OVERRIDES benchmarks. The table below is the current ranking; it must be re-run as a real sweep (see EDITOR_ROADMAP.md).

OPERATOR EMPIRICAL OVERRIDES (trust over leaderboards)

The casting (role → model → why)

# Role Primary Backup / 2nd opinion Evidence (2026-05-29 search)
1 Visual Analyst (per-clip notes: scene/quality/usage) qwen3.6:27b (vision, think:false) gemma4:31b Qwen3.6 natively multimodal, "perception/multimodal reasoning far exceeds," tops VLM benchmarks; Reddit "Qwen's multimodal is next-level." Already the FINDINGS analysis model.
2 Visual QC (render upside-down? seam? black frames? safe-zone?) qwen3.6:27b + gemma4:31b (two-vision vote) mistral-small3.2:24b Two independent vision models catch what one misses; the anti-slop gate (§8.5).
3 OCR / text-in-frame (read signs, legibility of on-screen captions) gemma4:31b mistral-small3.2:24b Gemma 4 "excels at OCR, chart/document/UI understanding"; Mistral Small 3.2 strong OCR.
4 Story / Arc / Sequencing (the cut's narrative shape) qwen3.6:35b (PROVEN) nemotron-3-super · mistral-large = SUSPECT, smoke-test Mistral Large benchmarks best for narrative BUT has run issues here (operator) → use proven qwen3.6:35b until mistral-large passes a smoke test.
5 Creative / Copy (titles, captions, hook lines) qwen3.6:35b (PROVEN) mistral-large = SUSPECT, smoke-test Mistral Large is the benchmark creative leader but unproven-stable here; qwen3.6:35b proven.
6 Hook / Energy ranking (reels best-moment selection) qwen3.6:35b nemotron-3-super:latest Strong multimodal+reasoning to read energy off transcript + frames; pairs with the sidecar hook_score.
7 Format / Spec compliance (EDL valid? schema? safe-zones?) granite4.1:30b laguna-xs.2:q4_K_M Granite = governance/structured audit; laguna = code/structure review (the governance scanner seat).
8 Code generation (Remotion/render code from the EDL) laguna-xs.2:q4_K_M (operator: > devstral) qwen3.6:35b · devstral-small-2 Operator finds laguna better than devstral for code; laguna also proven-on-box. Qwen3.6-35B "agentic coding" as backup.
9 Synthesizer / Director (reconcile the panel → final cut) nemotron-3-super:latest gpt-oss:120b Nemotron-3-Super ≈ gpt-oss-120b / qwen-122b on reasoning + strong structured output; gpt-oss proven as orchestrator/planner (NVIDIA deep-research agent).
10 Long-context whole-day (hold ALL notes+sidecars at once) qwen3.6:27b (1M ctx, PROVEN) llama4 = UNVERIFIED here, smoke-test Llama 4 has ultra-long ctx but may not run on this box (operator unsure) → qwen3.6 1M ctx is the proven default.

Lean default vs full panel (cost discipline)

Not cast (and why)

Sources

Live SearxNG 2026-05-29: HuggingFace/Google Gemma-4 cards; Qwen3.6 blog + HF + MarkTechPost; Mistral docs + Small-3.2 HF; Meta Llama-4; r/LocalLLaMA "Best Local LLMs Apr 2026"; NVIDIA Nemotron-3-Super modelcard + arXiv 2604.12374.

^ top

End. Source docs live in C:\Users\marka\llama.cpp\dji_test\