Generating workouts from YouTube videos using Whisper & GPT-4

I often watch follow-along workout videos for a variety of exercises: stretches, sports drills, strength training. Unlike workout journals or apps, where I can log entries, videos force me to replay them constantly to remember details, and I need a separate tool to log completed workouts.

To solve this problem, I built an iOS app that combines two machine learning models, Whisper & GPT-4, to generate workouts from a YouTube link.

You can try it out yourself using TestFlight: https://www.video2workouts.ai

System Design Overview

The workflow looks like this:

  1. Submit a POST request with a YouTube URL to /workouts.
  2. Upon receiving the URL, initialize a task using Celery. Celery is a distributed task queue that handles the generation task asynchronously in a separate process.
  3. Celery runs three tasks in order:
    • #1: Use yt-dlp to download the video as an .mp3 file.
    • #2: Run OpenAI’s Whisper on the .mp3 file to generate a text transcript.
    • #3: Prompt GPT-4 to return a list of exercises as JSON from the transcript.
  4. Once the task succeeds, store the generated JSON in the database as exercise and workout data.
  5. While the Celery task is running, poll for progress every couple of seconds using the task ID.
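
Steps 3 to 5 can be sketched in plain Python. This is a minimal sketch, not the app's actual code: the three step functions are injected as callables so the example runs without yt-dlp, Whisper, or the OpenAI API, and the state names and the in-memory `PROGRESS` dict are assumptions standing in for Celery's task states and result backend.

```python
import time
from typing import Callable, Dict, List

# Stand-in for Celery's result backend: task ID -> latest state.
# The state names used below are hypothetical.
PROGRESS: Dict[str, str] = {}


def run_pipeline(
    task_id: str,
    url: str,
    download: Callable[[str], str],        # URL -> path of the .mp3 file
    transcribe: Callable[[str], str],      # .mp3 path -> transcript text
    extract: Callable[[str], List[dict]],  # transcript -> list of exercises
) -> List[dict]:
    """Run the three steps in order, recording a state before each one."""
    try:
        PROGRESS[task_id] = "DOWNLOADING"
        audio_path = download(url)
        PROGRESS[task_id] = "TRANSCRIBING"
        transcript = transcribe(audio_path)
        PROGRESS[task_id] = "GENERATING"
        exercises = extract(transcript)
    except Exception:
        PROGRESS[task_id] = "FAILURE"
        raise
    PROGRESS[task_id] = "SUCCESS"
    return exercises


def poll(
    task_id: str,
    interval: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Check the task state every `interval` seconds until it finishes."""
    for _ in range(max_polls):
        state = PROGRESS.get(task_id, "PENDING")
        if state in ("SUCCESS", "FAILURE"):
            return state
        sleep(interval)
    return "TIMEOUT"
```

In a real deployment the pipeline would run inside a worker process while the client polls over HTTP; here both share one dict only to keep the sketch self-contained.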

Handling longer videos

A challenge early in the project was handling videos of 15-20 minutes or longer. GPT-3.5 worked well for 10-minute videos, but for 15-20 minute videos the transcript exceeded the model's maximum number of prompt tokens.

There are two approaches to solve this issue:

  1. Split the video into audio chunks of about 5 minutes, cutting at silent regions, and generate a mini summary of each chunk using GPT. Then pass the summaries in together for a final output.
  2. Use GPT-4, which can handle a maximum of 8192 tokens, double GPT-3.5-Turbo's 4096 tokens. This comes at an additional cost: $0.06 per 1K completion tokens and $0.03 per 1K prompt tokens, versus $0.002 per 1K for both completion and prompt tokens with GPT-3.5.
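
To make the trade-off in #2 concrete, here is a small sketch that checks whether a request fits a model and estimates its cost. The limits and prices are the ones quoted above; the token counts in the usage note are made-up illustrative numbers, and the sketch assumes the limit applies to prompt and completion tokens combined.

```python
# Limits and prices (USD per 1K tokens) as quoted in this post.
MODELS = {
    "gpt-3.5-turbo": {"max_tokens": 4096, "prompt": 0.002, "completion": 0.002},
    "gpt-4": {"max_tokens": 8192, "prompt": 0.03, "completion": 0.06},
}


def fits(model: str, prompt_tokens: int, completion_tokens: int) -> bool:
    """True if prompt plus completion stay within the model's token limit."""
    return prompt_tokens + completion_tokens <= MODELS[model]["max_tokens"]


def cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated cost of a single request in US dollars."""
    prices = MODELS[model]
    total = (
        prompt_tokens * prices["prompt"]
        + completion_tokens * prices["completion"]
    )
    return total / 1000
```

For a hypothetical transcript of 5,000 prompt tokens with a 1,000-token response, `fits("gpt-3.5-turbo", 5000, 1000)` is false, while GPT-4 handles it for roughly $0.21 per video.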

Because the output is structured data rather than a text summary, accuracy would take a severe hit if videos were broken into chunks. This is especially true for short videos, where the first fifth may be just an introduction, whereas in a longer video a fifth can contain a complete exercise. The cost of upgrading to GPT-4 was reasonable and it handled 20-minute videos perfectly, so the upgrade was well worth it.

To support longer workouts or lengthier transcriptions in the future, a combination of #1 and #2 will be necessary.
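
If approach #1 is revisited, much depends on where the audio is cut. Below is a minimal sketch of choosing cut points, assuming silent regions have already been detected by some other tool and arrive as a list of timestamps in seconds. The function names, the input format, and the `tolerance` parameter are illustrative assumptions, not part of the app.

```python
from typing import List, Tuple


def choose_cuts(
    silences: List[float],
    duration: float,
    target: float = 300.0,
    tolerance: float = 60.0,
) -> List[float]:
    """Pick one cut near each multiple of `target` seconds, snapped to silence.

    For every ideal cut (target, 2 * target, ...) use the closest silence
    within `tolerance` seconds, or the ideal cut itself if none is close.
    """
    if tolerance >= target / 2:
        # Keeps the search windows disjoint so cuts stay in increasing order.
        raise ValueError("tolerance must be smaller than target / 2")
    cuts: List[float] = []
    ideal = target
    while ideal < duration:
        nearby = [s for s in silences if abs(s - ideal) <= tolerance]
        cuts.append(min(nearby, key=lambda s: abs(s - ideal)) if nearby else ideal)
        ideal += target
    return cuts


def to_chunks(cuts: List[float], duration: float) -> List[Tuple[float, float]]:
    """Turn cut points into (start, end) pairs covering the whole audio."""
    bounds = [0.0, *cuts, duration]
    return list(zip(bounds[:-1], bounds[1:]))
```

Each `(start, end)` pair could then be transcribed and summarized separately before a final pass combines the results.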

Function Calling

It was crucial that GPT return a consistent data format (title, description, sets, reps, units). OpenAI provides a way to do this through function calling: you describe a function in the request to their completion API, and the model responds with arguments for that function instead of free-form text. You can define which parameters are required as well as their property types. Pretty cool!
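
A function definition for this might look like the sketch below. The field names come from the list above; the function name `save_workout`, the property types, and the choice to mark every field as required are assumptions for illustration. The model returns its arguments as a JSON string that is not guaranteed to be valid, so the sketch also parses defensively.

```python
import json
from typing import List

REQUIRED_FIELDS = ["title", "description", "sets", "reps", "units"]

# Hypothetical function definition; "parameters" is ordinary JSON Schema.
SAVE_WORKOUT_FUNCTION = {
    "name": "save_workout",
    "description": "Save the exercises found in a workout video transcript.",
    "parameters": {
        "type": "object",
        "properties": {
            "exercises": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "description": {"type": "string"},
                        "sets": {"type": "integer"},
                        "reps": {"type": "integer"},
                        "units": {"type": "string"},
                    },
                    "required": REQUIRED_FIELDS,
                },
            }
        },
        "required": ["exercises"],
    },
}


def parse_exercises(arguments: str) -> List[dict]:
    """Parse the function-call arguments, dropping incomplete exercises."""
    try:
        payload = json.loads(arguments)
    except json.JSONDecodeError:
        return []
    exercises = payload.get("exercises") if isinstance(payload, dict) else None
    if not isinstance(exercises, list):
        return []
    return [
        e for e in exercises
        if isinstance(e, dict)
        and all(e.get(field) is not None for field in REQUIRED_FIELDS)
    ]
```

The definition is sent along with the chat request, and `parse_exercises` is applied to the arguments string that comes back in the model's function call.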

Defensive UX

Since the generation task involved multiple steps, it was important to relay progress back to the user. Each step can also fail for various reasons (especially prompting) and sometimes workout data is hallucinated. Some things that helped:

  • Moving to GPT-4, which noticeably reduced hallucinations
  • Implementing custom states in Celery and retry logic with exponential backoff
  • Improving the prompt by adding the following text:

Each workout should have the mandatory fields defined in the function. If an exercise doesn't have any of the required fields, leave it out. If no exercises can be determined, return an empty array.
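
The retry logic from the list above can be sketched independently of Celery. This is a minimal illustration of exponential backoff, not the app's implementation; the function name, the default delays, and the injectable `sleep` are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on any exception with a delay that doubles each time."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # Out of retries: surface the last error to the caller.
            # Waits of 1s, 2s, 4s, ... capped at `max_delay`.
            sleep(min(max_delay, base_delay * 2 ** attempt))
            attempt += 1
```

Wrapping the prompt step as `retry_with_backoff(lambda: extract(transcript))`, where `extract` is whatever function calls the model, gives a flaky call a few chances before the task is reported as failed.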

What’s next?

  • Retrieve start timestamps for each exercise
  • Use optimized versions of Whisper to speed up transcription
  • Vertically scale the Heroku server for faster generation

Try Now

Thanks for reading! Check out the app here: https://www.video2workouts.ai

