VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies

Reasoning over videos of the entire demonstration provides the context cues needed to decide when to accelerate.

Humans often take longer to demonstrate a task than a robot would need to execute it. Rather than learning to replicate the demonstration at the same pace, many industrial and practical applications require robots to perform tasks as quickly as possible. In this paper, we investigate several hypotheses for learning policies that operate faster-than-demonstration. Our experiments show that the most effective strategy is to downsample recorded demonstrations and train the robot's policy on this accelerated data. However, uniformly downsampling an entire trajectory can be problematic. Some parts of a task can be safely sped up (e.g., unconstrained motion), while others demand slower, more precise motion (e.g., object interactions or fine manipulation). To address this challenge, we introduce VOLT, a vision-and-language trajectory segmentation method that reasons over video demonstrations and leverages contextual cues to determine when acceleration is appropriate and when careful precision is required. VOLT identifies segments where slow, deliberate motion is necessary, then selectively downsamples the remaining segments. The resulting reformatted trajectories can be used with standard imitation learning approaches, such as diffusion policies. Our results highlight that segmentation quality is critical: baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. Compared to state-of-the-art alternatives, VOLT allows robots to execute tasks faster while maintaining strong performance.
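
The selective downsampling idea can be illustrated with a short Python sketch. This is not the authors' implementation: the function name `selective_downsample`, the frame-index segment tuples, and the speed-up factor of 2 are all assumptions made for illustration. Only the two labels, "speed-up" and "maintain-speed", come from the prompt configuration on this page.

```python
from typing import List, Sequence, Tuple

# Hypothetical segment representation: (start_index, end_index_exclusive, label).
Segment = Tuple[int, int, str]


def selective_downsample(
    trajectory: Sequence, segments: List[Segment], factor: int = 2
) -> list:
    """Keep every step of "maintain-speed" segments and every `factor`-th
    step of "speed-up" segments.

    Assumes the segments are ordered and together cover the trajectory.
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out = []
    for start, end, label in segments:
        chunk = list(trajectory[start:end])
        if label == "speed-up":
            kept = chunk[::factor]
            # Always keep the final step so the segment still ends where
            # the demonstration did.
            if chunk and (len(chunk) - 1) % factor != 0:
                kept.append(chunk[-1])
            out.extend(kept)
        elif label == "maintain-speed":
            out.extend(chunk)
        else:
            raise ValueError(f"unknown label: {label!r}")
    return out


demo = list(range(12))  # stand-in for 12 recorded steps
segments = [(0, 5, "speed-up"), (5, 9, "maintain-speed"), (9, 12, "speed-up")]
accelerated = selective_downsample(demo, segments, factor=2)
print(accelerated)  # [0, 2, 4, 5, 6, 7, 8, 9, 11]
```

The sketch assumes each step stores an absolute target (for example, an end-effector pose). If the steps were relative (delta) actions, skipped steps would have to be accumulated into the kept ones rather than dropped.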

We compare VOLT against state-of-the-art imitation learning methods designed to accelerate policy execution. We evaluate each approach in a real-world tabletop setup with a Franka Emika robot arm across 5 manipulation tasks. We compare both task performance and speedup relative to a Normal Speed policy trained directly on the human demonstrations. For each task, we randomize the position of the manipulated objects and keep the target object location fixed.

Our results suggest that VOLT significantly reduces task completion time compared to Normal Speed policies. Among the accelerated methods, VOLT consistently achieves the second-largest speedup. SAIL is the fastest because it assigns the most speed-up labels, but this also degrades performance, as clearly shown in the Pick and Place and Push Cup tasks. DemoSpeedup, on the other hand, assigns maintain-speed labels more frequently, which results in slower policies. Overall, these results suggest that VOLT accurately determines which segments of a task can be accelerated and which segments require precision.

Full prompt utilized to segment all the demonstrations in our manuscript.

        
VALID_SEGMENTS:
  - "speed-up"
  - "maintain-speed"

OUTPUT_EXTRACTION_PATTERN: '```json\n(.*?)\n```'

SYSTEM_PROMPT: |-
  You are a helpful assistant that generates labels for a given video of a robot performing a task.

  Your goal is to help the robot perform the task as fast as possible without causing it to fail by segmenting the video into different subtasks.
  You first need to analyze the video and then localize all the subtasks performed to complete the main task. For each subtask, you need to identify 
  its start and end timestamps and classify the subtask into one of the valid segment types: ${list2string:${VALID_SEGMENTS}}.
  Your generated response must help the robot increase its task efficiency while maintaining precise and critical movements.

  **Subtask Localization and Classification Guidelines:**
    - The robot should speed up whenever possible, this refers to cases that do not require delicate environment interactions.
    - Maintain-speed only when it is necessary for the robot to work as demonstrated, mainly subtasks that require dexterity or precision.
    - The robot is a physical system that cannot instantly change speed, you need to allocate enough frames on slower segments when transitioning between subtasks
      that do not share the same category to allow the robot to smoothly slow down and avoid jerky motions.
    - Given direct grasping of an object, the robot has enough force to securely hold it regardless of speed.
    - Fast speed will often lead to unexpected motion of objects not directly grasped by the gripper.
    - Default the events where the robot is actively avoiding obstacles without an exaggerated motion, i.e., the robot barely clears the obstacle as precise segments.
    - When the robot is done with the task and interacting with objects, the robot must quickly reach its final position.
    
  **Generation Rules:**
    - Provide a concise reasoning for which label is appropriate for each segment must be provided before making a decision.
    - The list of segments must cover the entire video.
    - Only leave one frame gaps between segments.
    - All robot alignment subtasks must be categorized as precise.

  The following is a non-exhaustive list of common subtasks:
    - Moving towards objects
    - Transporting objects
    - Avoiding obstacles
    - Grasping/Placing
    - Insertion/Retrieval
    - Pushing
    - Pouring
    - Wiping

  This is not a complete list of possible subtasks. If any identified event is not included in this list, be sure to describe it accordingly.
  
  The generated response must use the 'mm:ss.ff' format for all timestamps. Final output format:
  ${OUTPUT_FORMAT}

USER_PROMPT: |
  Given the current task description: <|video-description|>, help the robot complete the task faster 
  than demonstrated in the video (00:00.00 - <|video-length|>)
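
A response produced with this prompt could be post-processed roughly as sketched below. The regular expression is the `OUTPUT_EXTRACTION_PATTERN` from the configuration above. Everything else is an assumption: the page does not show `${OUTPUT_FORMAT}`, so the JSON keys `start`, `end`, and `label` are hypothetical, as are the helper names, the use of `re.DOTALL`, and the reading of `ff` in `mm:ss.ff` as hundredths of a second.

```python
import json
import re

VALID_SEGMENTS = {"speed-up", "maintain-speed"}

# Same pattern as OUTPUT_EXTRACTION_PATTERN; the fence is built from single
# backticks only so that this snippet can itself sit inside a fenced block.
FENCE = "`" * 3
OUTPUT_EXTRACTION_PATTERN = FENCE + r"json\n(.*?)\n" + FENCE


def parse_timestamp(ts: str) -> float:
    """Convert 'mm:ss.ff' to seconds, reading 'ff' as hundredths (assumption)."""
    match = re.fullmatch(r"(\d+):(\d{2})\.(\d{2})", ts.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {ts!r}")
    minutes, seconds, hundredths = (int(g) for g in match.groups())
    return minutes * 60 + seconds + hundredths / 100


def extract_segments(response: str) -> list:
    """Return a list of (start_seconds, end_seconds, label) tuples."""
    # DOTALL lets '.' span the newlines of a multi-line JSON payload.
    match = re.search(OUTPUT_EXTRACTION_PATTERN, response, flags=re.DOTALL)
    if match is None:
        raise ValueError("no fenced JSON block found in the response")
    segments = []
    for item in json.loads(match.group(1)):
        label = item["label"]  # hypothetical key
        if label not in VALID_SEGMENTS:
            raise ValueError(f"invalid segment type: {label!r}")
        start = parse_timestamp(item["start"])  # hypothetical key
        end = parse_timestamp(item["end"])  # hypothetical key
        if end < start:
            raise ValueError(f"segment ends before it starts: {item}")
        segments.append((start, end, label))
    return segments


# Hand-written example response; not actual model output.
payload = [
    {"start": "00:00.00", "end": "00:01.50", "label": "speed-up"},
    {"start": "00:01.50", "end": "00:03.00", "label": "maintain-speed"},
]
response = "Reasoning...\n" + FENCE + "json\n" + json.dumps(payload, indent=2) + "\n" + FENCE
segments = extract_segments(response)
```

Rejecting labels outside `VALID_SEGMENTS` at parse time keeps malformed responses from silently changing how a trajectory is downsampled.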

Inference hyperparameters used in our experiments.

Hyperparameter	Value
Seed	3407
Top P	0.8
Top K	20
Temperature	0.4
Repetition Penalty	1.0
Presence Penalty	0.5
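
To show how temperature, Top K, and Top P act together, the sketch below applies them to a toy set of logits in pure Python. It is a generic illustration rather than the inference code used in the experiments: the function name and the order of operations (temperature, then top-k, then top-p) are assumptions, and real inference libraries differ in such details. The presence penalty is left out for brevity. The repetition penalty is also left out, since a value of 1.0 is neutral in the common multiplicative formulation.

```python
import math
import random

# Values from the table above.
SAMPLING = {"seed": 3407, "top_p": 0.8, "top_k": 20, "temperature": 0.4}


def filtered_distribution(logits, temperature, top_k, top_p):
    """Return {token_index: probability} after temperature scaling,
    top-k truncation, and top-p (nucleus) truncation."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    # Sort tokens from most to least probable and keep at most top_k of them.
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)
    ranked = ranked[:top_k]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for p, i in ranked:
        kept.append((p, i))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for p, _ in kept)
    return {i: p / norm for p, i in kept}


toy_logits = [2.0, 1.0, 0.0, -1.0]
sharp = filtered_distribution(toy_logits, SAMPLING["temperature"],
                              SAMPLING["top_k"], SAMPLING["top_p"])
flat = filtered_distribution(toy_logits, 1.0,
                             SAMPLING["top_k"], SAMPLING["top_p"])

# Seeding the generator makes the draw reproducible.
rng = random.Random(SAMPLING["seed"])
tokens = list(sharp)
choice = rng.choices(tokens, weights=[sharp[t] for t in tokens], k=1)[0]
```

With these toy logits, a temperature of 0.4 concentrates enough probability on the top token that it alone passes the Top P cutoff, whereas a temperature of 1.0 leaves two candidates.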

Real-World Experiments: Pick and Place, Push Cup, Tower Transfer, Plug Insertion, Table Sorting.
