Teach by recording

Teaching by recording is Agivar's most distinctive ability: demonstrate a workflow once, and the AI can learn it, remember it, and follow it later.

A picture beats a thousand words; a demo beats a thousand pictures. Instead of laboriously describing "click here, then click there", record a screen demo: how the mouse moves, what you type, and how the screen changes are all captured precisely, so the AI doesn't have to guess.

What problem it solves

Teaching a GUI workflow with text usually gets stuck in two places:

Hard to write precisely: "click that submit button", but which one? What color? In which area? Text easily leaves room for ambiguity.
Hard to write completely: small steps the demonstrator takes for granted (clicking a field first to activate it, scrolling to the bottom of the page…) often get left out.

A recording solves both at once:

A precise action sequence: while recording, mouse clicks, drags, and scrolls, keyboard input, and hotkeys are captured in sync on a single timeline, so nothing is missed.
The matching frames: after recording, the system automatically extracts keyframes. It keeps the frame just before every action that visibly changes the screen (click, double-click, drag, pressing Enter to submit…), plus periodic samples during quiet stretches. The AI can both "see" what the screen looked like before an action and compare it with what the screen became afterward.
Your extra explanations: add text annotations (Alt+I on Windows, Cmd+Shift+I on macOS), or use the green waveform button for voice annotations; both are merged into the recording's timeline.
Distilled into reusable know-how: the recording is processed into a structured "operation description" (Overview / Initial state / Step-by-step / a "stage result" after each step), which goes straight into the memory base. The entry also records the ID and key clips of the demo recording, so Task Mode can later ask targeted questions about specific details in it.
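The keyframe rule above (keep the frame just before each screen-changing action, plus periodic samples in quiet stretches) can be sketched roughly as follows. This is a minimal illustration, not Agivar's implementation: the action names, the 0.1 s frame interval, the 5 s sampling period, and treating "scroll" as non-triggering are all assumptions.

```python
from dataclasses import dataclass

# Assumed set of action kinds that visibly change the screen. The guide lists
# click, double-click, drag, and Enter; "scroll" is treated as non-triggering here.
SCREEN_CHANGING = {"click", "double_click", "drag", "enter"}


@dataclass
class Action:
    t: float   # seconds since recording start
    kind: str  # e.g. "click", "scroll", "key"


def pick_keyframes(actions, duration, frame_interval=0.1, quiet_sample=5.0):
    """Return the timestamps (in seconds) of the frames to keep (hypothetical)."""
    triggers = sorted(a.t for a in actions if a.kind in SCREEN_CHANGING)

    # 1. The frame just before each screen-changing action.
    keep = {round(max(0.0, t - frame_interval), 3) for t in triggers}

    # 2. Periodic samples inside quiet stretches between trigger points.
    bounds = [0.0] + triggers + [duration]
    for start, end in zip(bounds, bounds[1:]):
        t = start + quiet_sample
        while t < end:
            keep.add(round(t, 3))
            t += quiet_sample
    return sorted(keep)
```

For example, `pick_keyframes([Action(1.0, "click"), Action(2.0, "scroll"), Action(12.0, "enter")], duration=20.0)` keeps frames at 0.9, 6.0, 11.0, 11.9, and 17.0 seconds: one just before each click or Enter, and samples in the quiet stretches between them.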

The full flow

Step 1: Start recording

Start a new conversation and turn on Teach Mode.
Click Record → Start screen recording in the input bar, or click Start recording from Select recording file.
A floating recorder control bar appears on screen — semi-transparent, draggable, always on top, with a timer, voice annotation, text annotation, history, pause / continue, and End.

The control bar isn't recorded

By default, the control bar and annotation box are excluded from the captured footage, so the video contains whatever is behind them. They only become visible in recordings when Streamer Mode is enabled.

📷 Screenshot

Place a screenshot of the "floating recorder control bar" here (static/img/recorder-bar.png).

Step 2: Demonstrate (add text or voice explanations)

Just go do the operations you want to demonstrate. Recommendations:

Slow down a bit — leave a small pause between steps so keyframes can reliably catch the pre-action screen.
Demonstrate one complete workflow at a time — from a clear starting point (e.g. "browser is open") to a clear endpoint ("coins given successfully"). Don't record several unrelated things together.

When you want to add a text note for a step, click Annotate on the control bar or use the shortcut (Alt+I on Windows, Cmd+Shift+I on macOS). An annotation input opens beside the floating bar:

Type your note, e.g. "wait for the page to fully load before clicking here, otherwise the button won't respond".
Enter sends, Shift+Enter inserts a newline, Esc cancels.
The note is merged into the recording's action timeline at the moment it happened.
While the annotation input is open, what you type into it is not recorded as an action (the shortcut that opened it, plus the following modifier-key releases, are filtered out too).
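Conceptually, merging notes into the timeline while ignoring what you type into the annotation box could look like the sketch below. It is a hypothetical illustration rather than Agivar's code: the `Event` type, the `merge_annotations` name, and timestamping a note at the moment the input opened are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float        # seconds since recording start
    kind: str       # "key", "click", "note", ... (hypothetical kinds)
    data: str = ""  # key name or note text


def merge_annotations(actions, notes, input_open_spans):
    """Merge text notes into the action timeline (hypothetical sketch).

    actions: recorded input events
    notes: Event(kind="note") items, assumed timestamped when the input opened
    input_open_spans: (start, end) intervals while the annotation input was open
    """
    def typed_into_annotation(e):
        # Key events inside an open span are note text, not demo actions. If a span
        # starts when the shortcut is pressed, the shortcut itself is dropped too.
        return e.kind == "key" and any(s <= e.t <= end for s, end in input_open_spans)

    kept = [e for e in actions if not typed_into_annotation(e)]
    # sorted() is stable, so events sharing a timestamp keep their list order.
    return sorted(kept + list(notes), key=lambda e: e.t)
```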

If you want to talk while demonstrating, click the green waveform button or use the shortcut (Alt+D on Windows, Cmd+Shift+D on macOS) to start voice annotation; click the button or press the shortcut again to stop. Agivar turns your speech into timestamped text notes and shows them in the annotation history.

When should you annotate? For anything "you can't tell from the footage": why you waited, why you picked this and not that, what the step is for, what the gotcha is.

📷 Screenshot

Place a screenshot of the "annotation input" here (static/img/explain-dialog.png).

Step 3: End recording

To finish, click End on the control bar. If you need a break, click Pause, then Continue when you're ready; paused time is not captured and does not count toward the recording length.
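Because paused time is neither captured nor counted, both the recording length and the position of later actions on the timeline shift by the total paused time. A minimal sketch of that arithmetic, assuming pauses are tracked as `(paused_at, continued_at)` wall-clock pairs; the function names are hypothetical.

```python
def recorded_length(start, end, pauses):
    """Seconds of captured footage (hypothetical helper).

    start, end: wall-clock times when recording began and when End was clicked
    pauses: non-overlapping (paused_at, continued_at) wall-clock pairs
    """
    paused = sum(cont - p for p, cont in pauses)
    return (end - start) - paused


def to_recording_time(wall_t, start, pauses):
    """Map a wall-clock time (outside any pause) to a position in the footage."""
    # Only pauses that have already ended before wall_t shift it back.
    offset = sum(cont - p for p, cont in pauses if cont <= wall_t)
    return wall_t - start - offset
```

With `pauses = [(130, 150), (170, 175)]`, a recording from 100 to 200 yields 75 seconds of footage, and an action at wall-clock 160 lands at 40 seconds into the recording.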

After you stop, the system packages what you recorded and starts background processing. This step requires a network connection and typically takes from tens of seconds to a few minutes, depending on the recording length. During processing:

you can keep typing in the input bar and add other attachments;
but the Send button is temporarily disabled until processing finishes, so you can't send the message (with the recording) yet;
the recording attachment shows elapsed time / estimated time remaining;
if you do not want to send this recording with the current message, click X on the attachment card. The recording can still be found from Select recording file, and you can delete it from history if you truly want to discard it.
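The guide does not say how the remaining time is estimated. One simple possibility, shown here purely as an assumption, is a linear extrapolation from the fraction of work done so far; `progress_label` is a hypothetical helper, not part of Agivar.

```python
def progress_label(elapsed_s, fraction_done):
    """Format an "elapsed / remaining" label (hypothetical).

    Assumes a naive linear estimate: remaining = elapsed * (1 - f) / f.
    """
    def mmss(seconds):
        seconds = int(round(seconds))
        return f"{seconds // 60}:{seconds % 60:02d}"

    if fraction_done <= 0:
        # No progress yet, so there is nothing to extrapolate from.
        return f"{mmss(elapsed_s)} elapsed / estimating..."
    remaining = elapsed_s * (1 - fraction_done) / fraction_done
    return f"{mmss(elapsed_s)} elapsed / ~{mmss(remaining)} remaining"
```

For instance, 30 seconds in at 25% done gives `"0:30 elapsed / ~1:30 remaining"`.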

📷 Screenshot

Place a screenshot of the "recording attachment card while processing" here (static/img/recording-processing.png).

Step 4: Preview (optional)

Whether or not processing has finished, you can open the recording preview dialog to play back what you just recorded and confirm the demo is complete:

▶ Start / ⏸ Stop / ⟲ Replay.
Not happy with it? Delete the attachment and re-record.

📷 Screenshot

Place a screenshot of the "recording preview dialog" here (static/img/recording-preview.png).

Step 5: Send it — over to the Teach Agent

Once processing is done and the Send button is enabled again, send the message (you can also type a few notes alongside it). When the Teach Agent receives it:

it first sees a summary of the recording and the structured operation description — so it knows the whole workflow without any extra lookups;
if it needs more detail from inside the recording, it can ask targeted questions about the recording content (internally via a dedicated "video Q&A" capability, e.g. "what does the popup in that frame say");
when something is incomplete or ambiguous, it asks you to confirm.

Step 6: Distilling into a memory entry

The Teach Agent organizes the workflow into the memory base (filed as platform/topic). The entry looks roughly like:

# Overview
This demonstrates the user opening the "非十科技" video on Bilibili and giving it two coins.

# Initial state
The browser is open.

# Steps

## Step 1: Open a new tab and navigate to Bilibili
1. Click the "+" button to the right of the browser tab bar to open a new tab.
2. Click the address bar at the top of the browser to activate input.
3. Type bilibili.com in the address bar.
4. Press Enter to navigate.
Stage result: successfully reached the bilibili.com homepage, showing the recommended video list and the top search bar.

## Step 2: …
…
Stage result: …

# Related demo recording
Recording ID: 00046; summary: …; key clips: …

A few details to note:

The workflow is split into several high-level steps, each with its concrete operations listed underneath and ending with a "stage result"; when Task Mode follows along, it can use the stage result to judge whether the current stage is complete.
On-screen targets are written unambiguously — "click the blue 'Confirm' button at the bottom-right of the popup", "click the magnifier icon to the right of the search box", rather than just "click confirm".
The entry records the demo recording's ID at the end, so later, when Task Mode runs a related task, it can pull up this recording and ask about details.
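Each entry therefore pairs every step with a checkable stage result. As a hypothetical sketch (Agivar's internal schema is not documented here, so the class and field names are assumptions), the structure can be modeled and rendered back into the layout shown above like this:

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    title: str             # high-level step, e.g. "Open a new tab and navigate to Bilibili"
    operations: list[str]  # concrete operations under this step
    stage_result: str      # what should be true once the step is done


@dataclass
class OperationDescription:
    overview: str
    initial_state: str
    steps: list[Step] = field(default_factory=list)
    recording_id: str = ""  # ID of the related demo recording, if any

    def to_markdown(self) -> str:
        """Render in the Overview / Initial state / Steps layout (hypothetical)."""
        lines = ["# Overview", self.overview, "",
                 "# Initial state", self.initial_state, "",
                 "# Steps", ""]
        for i, step in enumerate(self.steps, start=1):
            lines.append(f"## Step {i}: {step.title}")
            lines += [f"{n}. {op}" for n, op in enumerate(step.operations, start=1)]
            lines += [f"Stage result: {step.stage_result}", ""]
        if self.recording_id:
            lines += ["# Related demo recording", f"Recording ID: {self.recording_id}"]
        return "\n".join(lines)
```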

What actually happens during processing

For reference, "background processing" consists of roughly four stages (no action is needed on your part):

Upload keyframes — the keyframes picked from the recording are uploaded to the cloud.
Annotate frame by frame — the AI looks at these keyframes, combines them with the mouse/keyboard action timeline, and produces a detailed description with frame references.
Generate a summary — condenses the detailed description into a short video summary.
Generate the operation description — produces, in the "Overview / Initial state / Step-by-step / stage result" structure, the "operation description" that can go straight into a memory entry.
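The four stages form a simple chain in which each stage consumes the previous one's output. The sketch below only illustrates that ordering: the stage functions are passed in as hypothetical stand-ins for the cloud services, and feeding both the detailed description and the summary into the last stage is an assumption.

```python
def process_recording(frames, timeline, upload, annotate, summarize, describe):
    """Run the four processing stages in order (hypothetical stage callables).

    frames: keyframes picked from the recording
    timeline: the mouse/keyboard action timeline
    """
    # 1. Upload keyframes; returns references to the uploaded frames.
    frame_refs = upload(frames)
    # 2. Frame-by-frame annotation: a detailed description citing frame references.
    detailed = annotate(frame_refs, timeline)
    # 3. Condense the detailed description into a short video summary.
    summary = summarize(detailed)
    # 4. Overview / Initial state / Step-by-step / stage results (assumed inputs).
    operation = describe(detailed, summary)
    return {
        "keyframes": frame_refs,
        "summary": summary,
        "operation_description": operation,
    }
```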

The recording's raw frames and mouse/keyboard events stay locally under ~/.agivar/; the processed result (keyframes, summary, operation description) is stored in the cloud, tied to your account.

Voice annotations (optional)

The green waveform button on the control bar records voice annotations. Turn it on and speak naturally; Agivar transcribes your speech in real time and places the recognized text into the annotation history at the right timestamps. When generating the operation description, the AI prefers the terminology and structure you spoke out loud, so for longer workflows you do not need to type every note manually.
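Placing spoken notes "at the right timestamps" amounts to shifting each transcribed segment by the moment voice annotation was switched on. A minimal sketch, assuming the transcriber reports each segment's start as an offset from that moment; `Segment` and `voice_to_notes` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds since voice annotation was switched on (assumed)
    text: str


def voice_to_notes(voice_on_at, segments, history):
    """Merge transcribed segments into the annotation history (hypothetical).

    voice_on_at: recording-timeline position (s) when voice annotation started
    history: existing list of (timestamp, text) notes; it is not modified
    """
    # Shift each segment onto the recording timeline and drop empty transcripts.
    notes = [(round(voice_on_at + s.start, 2), s.text.strip())
             for s in segments if s.text.strip()]
    return sorted(history + notes, key=lambda n: n[0])
```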

The first time you use it, your system may request microphone permission. If microphone access is unavailable, screen recording continues normally, just without voice annotations.

Best practices for teaching by recording

✅ One workflow per recording, with clear start and end.
✅ Slow down, leave a pause between steps.
✅ Use text or voice annotations at key moments to explain the "why" — that's the information the recording can't give and only you can.
✅ Set up the environment before demonstrating (log in / open what's needed beforehand), and state it clearly in the "Initial state".
✅ Preview once before sending.
⚠️ Don't record sensitive screens like CAPTCHAs or passwords — when a step like that is truly needed, pause there, add a text annotation such as "user logs in here", and skip the actual input.
⚠️ With multiple monitors, only the primary monitor is recorded, so do the demo on the primary screen.

Next

The memory base — see what a recording-taught workflow ends up stored as
Screen Recording and Voice Notes — every floating-bar and voice feature
Task Mode — how this know-how gets reused when running tasks
FAQ
