Creating Skip2Smooth started from pure frustration. You know that feeling when you're trying to send a video to someone and it's taking forever because the file is massive? Or when you're traveling with limited data and want to share a moment but WhatsApp just won't cooperate? Yeah, I was there too many times.
I started thinking: "Why am I sending every single frame when half of them barely change from the previous one?" If you think about a typical video, most frames are pretty similar to the ones right before them. Your face doesn't teleport across the screen between frames – it moves smoothly. So why not just send the important moments and let the computer fill in the gaps?
That's when I discovered Google's FILM (Frame Interpolation for Large Motion) model, and everything clicked. What if I could build a system that intelligently picks out the frames that matter, sends just those, and then uses AI to recreate the smooth motion on the other end? That became Skip2Smooth.
The Problem Space
Let's be real for a second. Video files are HUGE. A 1-minute 1080p video at 30fps can easily be 100MB or more. Now multiply that by however long your video is, and you're looking at gigabytes of data that need to travel across the internet.
Traditional codecs like H.264 and H.265 help, but they have limits: they squeeze all that data into a smaller package, quietly shaving off detail they hope you won't notice. Skip2Smooth takes a different approach: why not just send less data in the first place?
The challenge was figuring out which frames are actually important. You can't just randomly pick frames – the video would look choppy and weird. You need to be smart about it.
How I Figured Out Which Frames Matter
This was the fun part. I needed a way to measure how different two consecutive frames are. If they're nearly identical, maybe we don't need both. If there's a big change, that's a keyframe we definitely want to keep.
I ended up using three different metrics, because each one catches different things:
MSE (Mean Squared Error): This is the simple one. It literally compares every pixel between two frames and measures the difference. High MSE means lots of change. The math nerds love this one because it's straightforward, but it's not always perceptually accurate.
SSIM (Structural Similarity Index): This one's smarter. Instead of just looking at raw pixel differences, it considers the structure of the image. It tries to measure what humans actually perceive as "similar." It's great for catching things like brightness changes that don't actually change the content much.
LPIPS (Learned Perceptual Image Patch Similarity): This is the AI-powered metric. It uses a neural network trained on human perception to measure how different two images look to actual humans. It's computationally expensive but super accurate for what matters visually.
I combine all three into a single "difference score" and use that to decide which frames to keep. The cool part is you can tune how aggressive the compression is – want to keep more quality? Keep more frames. Need maximum compression? Be more selective.
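To make that concrete, here's a rough sketch of what a combined score can look like, using OpenCV, scikit-image, and the lpips package. The weights are illustrative, not the exact ones from the repo:

```python
import cv2
import numpy as np
import torch
import lpips
from skimage.metrics import structural_similarity as ssim

# Illustrative weights; Skip2Smooth's actual blend may differ.
lpips_model = lpips.LPIPS(net="alex")  # neural perceptual metric

def to_lpips_tensor(frame_bgr):
    """Convert an OpenCV BGR uint8 frame to the (1, 3, H, W) [-1, 1] tensor LPIPS expects."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    t = torch.from_numpy(rgb).permute(2, 0, 1).float() / 127.5 - 1.0
    return t.unsqueeze(0)

def frame_difference(frame_a, frame_b, w_mse=0.2, w_ssim=0.3, w_lpips=0.5):
    """Blend the three metrics into one score; higher means more change."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    # MSE over raw pixels, normalized to 0-1 for 8-bit frames
    mse = np.mean((gray_a.astype(np.float32) - gray_b.astype(np.float32)) ** 2) / 255.0 ** 2

    # SSIM measures similarity, so invert it to get a difference
    ssim_diff = 1.0 - ssim(gray_a, gray_b, data_range=255)

    # LPIPS: how different the frames look to a human, per a trained network
    with torch.no_grad():
        lpips_diff = lpips_model(to_lpips_tensor(frame_a), to_lpips_tensor(frame_b)).item()

    return w_mse * mse + w_ssim * ssim_diff + w_lpips * lpips_diff
```

LPIPS is by far the slowest of the three, so downscaling frames before scoring is a reasonable speed/accuracy tradeoff if you're processing long videos.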
The Compression Magic
Once I know which frames to keep, the compression itself is pretty straightforward:
- Analyze every pair of consecutive frames
- Score them based on how different they are
- Use adaptive thresholding to pick keyframes (this adjusts based on how much motion is in the video)
- Create a new video with just those keyframes
- Save a CSV file that remembers which frame numbers we kept
The adaptive thresholding was crucial. In a slow scene where nothing's moving much, you can skip way more frames. In an action sequence, you need to keep more. The algorithm figures this out automatically.
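To give a feel for it, here's a minimal sketch of that kind of adaptive selection. This illustrates the idea, not the exact algorithm in the repo:

```python
import numpy as np

def select_keyframes(scores, target_keep_ratio=0.3, max_gap=60):
    """
    Illustrative adaptive keyframe selection (not the repo's exact algorithm).
    'scores' holds the difference score between each frame and the previous one.
    Change accumulates between keyframes, so slow scenes get skipped hard while
    busy scenes trigger keyframes often; 'max_gap' caps how far apart they drift.
    """
    # Trigger level drawn from the score distribution, roughly tuned to the keep ratio
    trigger = max(np.quantile(scores, 1.0 - target_keep_ratio), 1e-6)

    keyframes = [0]          # always keep the first frame
    accumulated = 0.0
    for i in range(1, len(scores)):
        accumulated += scores[i]
        if accumulated >= trigger or i - keyframes[-1] >= max_gap:
            keyframes.append(i)
            accumulated = 0.0
    return keyframes
```

The resulting index list is exactly what ends up in the CSV of kept frame numbers that rides along with the compressed video.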
The Reconstruction Wizardry
This is where it gets really cool. On the receiving end, you get this choppy video with frames missing. The system needs to fill in the gaps, and this is where Google's FILM model shines.
Here's what happens:
- Extract all the keyframes from the compressed video
- Load the CSV file to see which frame numbers we originally had
- Calculate how many frames are missing between each keyframe pair
- For each gap, use FILM to generate the missing frames
FILM is incredible. You give it two frames and a timestamp (like 0.5 for exactly halfway between them), and it generates a frame that looks completely natural. It understands motion, depth, and even complex movements like hair blowing in the wind.
The tricky part was handling different gap sizes. Sometimes consecutive keyframes are only 5 frames apart, sometimes 50. I process each segment individually, generating all the intermediate frames needed, then stitch everything back together into a smooth 30fps video.
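Here's roughly what filling one gap looks like with the TF Hub release of FILM. The "x0"/"x1"/"time" inputs and the "image" output follow Google's published example; everything else is a sketch rather than the repo's exact code:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# TF Hub release of FILM; expects float32 RGB frames with values in [0, 1]
film = hub.load("https://tfhub.dev/google/film/1")

def fill_gap(keyframe_a, keyframe_b, n_missing):
    """Generate n_missing frames between two keyframes."""
    # Equally spaced timestamps strictly inside (0, 1); the endpoints are the keyframes
    times = np.linspace(0.0, 1.0, n_missing + 2)[1:-1]
    generated = []
    for t in times:  # one timestamp per call, i.e. batch_size=1, to keep GPU memory sane
        inputs = {
            "x0": tf.expand_dims(keyframe_a, 0),
            "x1": tf.expand_dims(keyframe_b, 0),
            "time": tf.constant([[t]], dtype=tf.float32),
        }
        generated.append(film(inputs)["image"][0].numpy())
    return generated
```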
The User Experience Journey
I built the interface with Streamlit because I wanted something clean and simple. No one wants to mess with command-line arguments when they're just trying to send a video.
On the sender side, you upload your video, and the system shows you a real-time graph of frame differences. It's actually pretty mesmerizing to watch – you can literally see where the action happens in your video. Then you use a slider to pick how much compression you want. Want 70% size reduction? 90%? Your call. The system finds the optimal settings to hit that target.
When you hit "Compress," it processes the video and shows you side-by-side comparisons. You can see exactly what the compressed version looks like before sending it. When you're happy, click "Send" and it generates a UUID that you share with whoever needs the video.
On the receiver side, it's even simpler. Paste the UUID, click "Retrieve," then "Reconstruct." The system does all the heavy lifting – downloading the files, running the AI interpolation, stitching the video back together. Progress bars keep you updated so you're not sitting there wondering if it crashed.
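For flavor, the sender side boils down to a handful of Streamlit widgets, something along these lines (the labels and flow here are illustrative, not copied from the actual app):

```python
import uuid
import streamlit as st

# Minimal sender-side sketch of the upload -> slider -> compress flow
uploaded = st.file_uploader("Upload a video", type=["mp4", "mov"])
target = st.slider("Target size reduction (%)", min_value=10, max_value=90, value=70)

if uploaded is not None and st.button("Compress"):
    progress = st.progress(0)
    # ... run frame scoring, keyframe selection, and the upload here,
    #     bumping `progress` as each stage finishes ...
    share_id = str(uuid.uuid4())
    st.success(f"Share this code with the receiver: {share_id}")
```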
The Technical Deep Dive
Let me geek out for a moment on some of the more interesting technical challenges:
Memory Management: FILM can use a LOT of GPU memory, especially with high-resolution videos. I initially tried batch sizes of 10-20 frames at once, but it kept crashing. Dropping to batch_size=1 made it way more stable, even if slightly slower. Better slow than crashed.
Frame Alignment: The FILM model needs frames with dimensions divisible by 64. I added automatic padding and cropping to handle any resolution. This was one of those details that took forever to debug because it only failed on certain video sizes.
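The fix itself is small: pad up to the next multiple of 64 before interpolating, crop back afterwards. A sketch, assuming edge padding:

```python
import numpy as np

def pad_to_multiple(frame, multiple=64):
    """Pad height and width up to the next multiple of 64 (edge padding),
    keeping the original size so interpolated frames can be cropped back."""
    h, w = frame.shape[:2]
    pad_h = (-h) % multiple   # e.g. 1080 -> 8 extra rows, giving 1088
    pad_w = (-w) % multiple   # e.g. 1920 -> already aligned, 0 extra columns
    padded = np.pad(frame, ((0, pad_h), (0, pad_w), (0, 0)), mode="edge")
    return padded, (h, w)

def crop_back(frame, original_size):
    h, w = original_size
    return frame[:h, :w]
```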
Timestamp Calculation: When you need to interpolate, say, 15 frames between two keyframes, you need to calculate 15 equally-spaced timestamps between 0 and 1. I use numpy's linspace for this, excluding the endpoints (since those are your keyframes). Getting this math right was crucial for smooth playback.
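For that 15-frame example, the endpoint-excluding linspace looks like this:

```python
import numpy as np

# 15 missing frames -> 17 evenly spaced points, then drop the endpoints,
# since 0.0 and 1.0 are the keyframes themselves
timestamps = np.linspace(0.0, 1.0, 15 + 2)[1:-1]
# array([0.0625, 0.125, ..., 0.9375])
```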
Supabase Integration: I chose Supabase for storage because it's basically "Postgres + Storage as a Service." The database tracks metadata (file names, upload times, identifiers) while the storage bucket holds the actual video files. The Python client makes it super easy to work with.
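A sketch of what the upload path can look like with the Python client. The bucket name, table name, and column names here are assumptions, not the project's actual schema:

```python
import uuid
from supabase import create_client

# Placeholder project URL and key
supabase = create_client("https://your-project.supabase.co", "your-anon-key")

def upload_compressed(video_path, csv_path):
    share_id = str(uuid.uuid4())
    # Compressed video and the keyframe CSV go into a storage bucket
    with open(video_path, "rb") as f:
        supabase.storage.from_("videos").upload(f"{share_id}/compressed.mp4", f.read())
    with open(csv_path, "rb") as f:
        supabase.storage.from_("videos").upload(f"{share_id}/keyframes.csv", f.read())
    # Metadata row so the receiver can look everything up by the shared UUID
    supabase.table("uploads").insert({"id": share_id, "source_name": video_path}).execute()
    return share_id
```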
Real-World Performance
In my testing, Skip2Smooth typically achieves:
- 60-80% compression for talking-head videos (like vlogs or presentations)
- 40-60% compression for moderate motion (like walking around)
- 20-40% compression for high-action content (sports, action scenes)
The quality retention is usually in the 85-95% range perceptually. Sometimes the reconstructed video actually looks smoother than the original because FILM is so good at motion interpolation.
The biggest bottleneck is the FILM model inference. On a decent GPU, you can process about 2-5 frames per second during reconstruction. On CPU only, drop that to 0.2-0.5 fps. So yeah, GPU recommended if you're doing this regularly.
The Features That Make It Special
Adaptive Compression: The system doesn't just blindly skip frames. It analyzes the content and adjusts how aggressive it is based on what's happening in the video. A static interview? Super aggressive compression. A parkour video? More conservative.
Visual Feedback: You can literally see which parts of your video have the most motion through the metrics graph. It's like having X-ray vision for video content.
Cloud-Based Sharing: No need to email massive files or use Dropbox. Upload once, share a simple code. The receiver doesn't need any special software beyond a web browser.
Quality Control: You're always in control of the compression-quality tradeoff. Don't like the preview? Adjust the slider and try again. No guessing required.
Lessons Learned
Building this taught me so much:
- Perceptual metrics matter more than mathematical ones: MSE alone is terrible for video quality assessment. Humans don't see differences the way algorithms do.
- AI models are amazing but finicky: FILM works incredibly well, but you need to handle its quirks (alignment, memory, preprocessing).
- User experience is everything: The underlying tech is complex, but users should never have to think about it. Upload, compress, share – that's it.
- Trade-offs are inevitable: You can't have maximum compression, perfect quality, and instant speed. Pick two. I optimized for compression + quality and accepted that reconstruction takes time.
What's Next?
I have a bunch of ideas for future versions:
Audio preservation: Right now, this is video-only. Adding audio back in after reconstruction would be huge.
Real-time processing: What if you could stream the reconstruction process? As keyframes arrive, start interpolating immediately.
Custom model training: FILM is great, but training a specialized model on the types of videos you care about could improve quality.
Mobile support: Imagine doing this all on your phone. The compression could happen locally, saving bandwidth before upload.
Peer-to-peer transfer: Cut out the cloud middleman for even faster sharing between devices on the same network.
Why This Matters
In a world where video is eating the internet, we need smarter ways to handle it. Not everyone has gigabit fiber. Lots of people have data caps. Some folks are trying to share videos from areas with terrible connectivity.
Skip2Smooth isn't trying to replace professional video editing or streaming services. It's a tool for those moments when you just need to get a video from point A to point B efficiently, without sacrificing quality.
Plus, it's just really cool to watch AI recreate motion that never existed in the compressed file. Every time I see FILM generate a smooth pan or interpolate someone's facial expression, I'm amazed all over again.
Try It Yourself
The code is open source (MIT license), so feel free to fork it, break it, improve it, or just learn from it. If you build something cool on top of it, let me know – I'd love to see what you come up with.
And if you're dealing with terrible upload speeds or data caps, give Skip2Smooth a try. It might just save you a few hours of waiting and a chunk of your data plan.
Built with Python, TensorFlow, Streamlit, and way too much coffee. Powered by Google's FILM model and the frustration of slow internet.