Google Cloud Platform Blog

Cloud Spin, Part 3: processing video using Google Cloud Platform services

Friday, October 2, 2015

Cloud Spin, Part 1 and Part 2 introduced the Google Cloud Spin project, an exciting demo built for Google Cloud Platform Next, and how we built the mobile applications that orchestrated 19 Android phones to record simultaneous video.
And now the last step is to retrieve the videos from each phone, find the frame corresponding to an audio cue in each video, and compile those images into a 180-degree animated GIF. This post explains the design decisions we made for the Cloud Spin back-end processing and how we built it.
The following figure shows a high-level view of the back-end design:

1. A mobile app running on each phone uploads the raw video to a Google Cloud Storage bucket.

Video taken by one of the cameras

2. An extractor process running on a Google Compute Engine instance finds and extracts the single frame corresponding to the audio cue.

3. A stitcher process running on an App Engine Managed VM combines the individual frames into a video that pans across a 180-degree view of an instant in time, and then generates a corresponding animated GIF.

How we built the Cloud Spin back-end services
After we’d designed what the back-end services would do, there were several challenges to solve as we built them. We had to figure out how to:

Store large amounts of video, frame, and GIF data

Extract the frame in each video that corresponds to the audio cue

Merge frames into an animated GIF

Make the video processing run quickly

Storing video, frame, and GIF data
We decided the best place to store incoming raw videos, extracted frames, and the resulting animated GIFs was Google Cloud Storage. It’s easy to use, integrates well with mobile devices, provides strong consistency, and automatically scales to handle large amounts of traffic, should our demo become popular.
We also configured the Cloud Storage buckets with Object Change Notifications that kicked off the back-end video processing when the Recording app uploaded new video from the phones.
Extracting a frame that corresponds to an audio cue
Finding the frame corresponding to the audio cue, or beep, poses challenges. Audio and video are recorded with different qualities and at different sample rates, so it takes some work to match them up. We needed to find the frame that matched the noisiest section of the audio. To do so, we grouped the audio frames into frame intervals, each interval containing the audio that roughly corresponded to a single video frame. We computed the average noise of each interval by calculating the average of the squared amplitude of the samples. Once we identified the interval with the largest average noise, we extracted the corresponding video frame as a PNG file.
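To make the interval search concrete, here is a minimal sketch in plain Python. It is not the extractor's actual code: the function name `loudest_frame_index` and its parameters are assumptions made for illustration, and it takes the audio as a simple list of amplitude values rather than reading it through MoviePy.

```python
def loudest_frame_index(samples, sample_rate, fps):
    """Return the index of the video frame whose audio interval is loudest.

    samples:     list of audio amplitudes (hypothetical input format)
    sample_rate: audio samples per second
    fps:         video frames per second
    """
    # Number of audio samples that roughly correspond to one video frame.
    # This can be fractional, so interval bounds are rounded down below.
    samples_per_frame = sample_rate / fps
    n_frames = int(len(samples) / samples_per_frame)

    best_index, best_energy = 0, -1.0
    for i in range(n_frames):
        start = int(i * samples_per_frame)
        end = int((i + 1) * samples_per_frame)
        interval = samples[start:end]
        if not interval:
            continue
        # Average of the squared amplitude over the interval.
        energy = sum(s * s for s in interval) / len(interval)
        if energy > best_energy:
            best_index, best_energy = i, energy
    return best_index
```

Calling `loudest_frame_index(samples, 44100, 30)` on audio sampled at 44.1 kHz alongside 30 fps video would return the index of the frame to export as a PNG.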
We wrote the extractor process in Python and used MoviePy, a module for video editing that uses the FFmpeg framework to handle video encoding and decoding.
Merging frames into an animated GIF
The process of generating an animated GIF from a set of video frames can be done with only four FFmpeg commands, run by a Bash script. First we generate a video by stitching all the frames together in order, then we extract a color palette from it, and finally we use that palette to generate a lower-resolution GIF to upload to Twitter.
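The exact four commands from our Bash script aren't reproduced here, but the sketch below shows one plausible shape for the frames, palette, and GIF steps, written in Python so the arguments are easy to inspect. The file names, frame rate, and output width are assumptions; `palettegen` and `paletteuse` are standard FFmpeg filters.

```python
def gif_commands(frame_pattern, video, palette, gif, fps=10, width=320):
    """Build FFmpeg argument lists for frames -> video -> palette -> GIF.

    All file names and the fps/width defaults are illustrative assumptions.
    """
    return [
        # 1. Stitch the numbered PNG frames into a video, in order.
        ["ffmpeg", "-y", "-framerate", str(fps), "-i", frame_pattern, video],
        # 2. Extract a color palette from that video.
        ["ffmpeg", "-y", "-i", video, "-vf", "palettegen", palette],
        # 3. Scale the video down and apply the palette to produce the GIF.
        ["ffmpeg", "-y", "-i", video, "-i", palette, "-filter_complex",
         f"scale={width}:-1:flags=lanczos[x];[x][1:v]paletteuse", gif],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where FFmpeg is installed, for example with `gif_commands("frame_%02d.png", "spin.mp4", "palette.png", "spin.gif")`.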
Making the video processing run quickly
Processing the videos one by one on a single machine would take longer than we wanted Next participants to have to wait to see their animated GIF. Cloud Spin takes 19 videos for each demo, one from each phone in the 180-degree arc. If extracting the synchronized frame from each video takes 5 seconds, and merging the frames takes 10 seconds, then with serial processing the time between taking the demo shot and the final animated video would be 19 * 5s + 10s = 105s, almost two minutes!
We can make the process faster by parallelizing the frame extraction. If we use 19 virtual machines, one to process each video, the time between the demo shot and the animated GIF is only 15 seconds. To make this improvement work, we had to modify our design to handle synchronization of multiple machines.
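The arithmetic behind these estimates is easy to check. The helper below is only a back-of-the-envelope model (the function name is made up, and it ignores upload and messaging overhead): extraction runs in rounds across the available workers, and stitching happens once at the end.

```python
import math

def total_seconds(videos, extract_s, stitch_s, workers):
    """Estimate time from demo shot to GIF for a given number of extractors."""
    # Each worker handles one video at a time, so extraction takes
    # ceil(videos / workers) rounds; stitching runs once afterwards.
    rounds = math.ceil(videos / workers)
    return rounds * extract_s + stitch_s

serial = total_seconds(videos=19, extract_s=5, stitch_s=10, workers=1)
parallel = total_seconds(videos=19, extract_s=5, stitch_s=10, workers=19)
```

With the numbers above, `serial` is 19 * 5 + 10 = 105 seconds and `parallel` is 5 + 10 = 15 seconds.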
Parallelizing the workload
We developed the extraction and stitching process as independent applications. This made it easy to parallelize the frame extraction. We can run 19 extractors and one stitcher, each as a Docker container on Google Compute Engine.
But how do we make sure that each video is processed by one, and only one, extractor? Google Cloud Pub/Sub is a messaging system that solves this problem in a performant and scalable way. Using Cloud Pub/Sub, we can create a communication channel that is loosely coupled across subscribers. This means that the extractor and stitcher applications interact through Cloud Pub/Sub, with no assumptions about the underlying implementation of either application. This makes future evolutions of the infrastructure easier to implement.

The preceding diagram shows two Cloud Pub/Sub topics that act as processing queues for the extractor and the stitcher applications. Each time a mobile app uploads a new video, Cloud Pub/Sub publishes a message on the videos topic. The extractors subscribe to the videos topic. When a new message is published, the first extractor to pull it down has a lease on the message during which it processes the video in order to extract the frame corresponding to the audio cue. If the processing completes successfully, the extractor acknowledges the videos message to Cloud Pub/Sub, which causes Cloud Pub/Sub to publish a new message to the frames topic. If the extractor process fails, the lease on the videos message expires and Cloud Pub/Sub republishes the message, where it can be handled by another extractor.
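To illustrate the lease-and-acknowledge behavior described above, here is a toy in-memory model. It is not the Cloud Pub/Sub client API: the `LeaseQueue` class and its methods are invented for this sketch, and time is passed in explicitly so the behavior is easy to follow.

```python
class LeaseQueue:
    """Toy model of pull delivery with leases and acknowledgements."""

    def __init__(self, lease_seconds):
        self.lease_seconds = lease_seconds
        self._messages = {}      # message id -> payload
        self._leased_until = {}  # message id -> time the current lease expires
        self._next_id = 0

    def publish(self, payload):
        msg_id = self._next_id
        self._next_id += 1
        self._messages[msg_id] = payload
        self._leased_until[msg_id] = 0.0  # not leased yet
        return msg_id

    def pull(self, now):
        """Lease the first available message, or return None if all are leased."""
        for msg_id, payload in self._messages.items():
            if self._leased_until[msg_id] <= now:
                self._leased_until[msg_id] = now + self.lease_seconds
                return msg_id, payload
        return None

    def ack(self, msg_id):
        """Acknowledge a message so that it is never delivered again."""
        self._messages.pop(msg_id, None)
        self._leased_until.pop(msg_id, None)
```

While one extractor holds the lease, other calls to `pull` skip that message. If the extractor never calls `ack`, the lease runs out and the next `pull` hands the same video to another extractor.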
When a message is published on the frames topic, the stitcher pulls it down and waits until all of the frames of a single session are ready to be stitched together into an animated GIF. In order for the stitcher application to detect when it has all the frames, it needs a way to check the real-time status of all of the frames in a session.
Managing frame status with Firebase
Part 2 discussed how we managed orchestrating the phones to take simultaneous video using a Firebase database that provides real-time synchronization.
We also used Firebase to track the process of extracting the frame from each camera that corresponds to the audio cue. To do so, we added a status field to each extracted frame in the session as shown in the following screenshot.

When the Android phone takes the video, it sets this status to RECORDING, and then to UPLOADING when it uploads the video to Cloud Storage. The extractor process sets the status to READY when the frame matching the audio cue has been extracted. When all of the frames in a session are set to READY, the stitcher process combines the extracted frames into an animated GIF, stores the GIF in Cloud Storage, and records its path in Firebase.
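The stitcher's readiness check can be sketched as a small function. The data shape here is an assumption: a session is modeled as a dictionary mapping a camera ID to a record with a `status` field, which may not match the actual Firebase layout.

```python
def session_ready(frames, expected=19):
    """Return True when every expected frame in the session is READY.

    frames:   dict of camera id -> {"status": ...} (hypothetical shape)
    expected: number of cameras in the arc
    """
    if len(frames) != expected:
        return False
    return all(frame.get("status") == "READY" for frame in frames.values())
```

A session with 18 frames marked READY and one still UPLOADING returns False, so the stitcher keeps waiting until the last extractor finishes.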
Having the status stored in Firebase made it possible for us to create a dashboard that showed, in real time, each step of the processing of the video taken by the Android phones and the resulting animated GIF.

We finished development of the Cloud Spin backend in time for the Google Cloud Platform Next events, and together with the mobile apps that captured the video, ran a successful demo.
You can find more Cloud Spin animations on our Twitter feed: @googlecloudspin. We plan to release the complete demo code. Watch Cloudspin on GitHub for updates.
This is the final post in the three-part series on Google Cloud Spin. I hope you had as much fun discovering this demo as we did building it. Our goal was to demonstrate the possibilities of Google Cloud Platform in a fun way, and inspire you to build something awesome!
- Posted by Francesc Campoy Flores, Google Cloud Platform