From Text to Blockbuster in Seconds
A Hollywood studio has put its $800-million expansion plans on hold after watching the stunning hyper-realistic videos produced by an OpenAI innovation that leverages a transformer architecture operating on spacetime patches of video and image latent codes.
You’ve probably seen those viral videos recently where people type in a text prompt and an AI creates a hyper-realistic video based on the description. OpenAI, the company that launched ChatGPT on 30 November 2022, is once again behind the stunning videos, this time with a model that leverages a transformer architecture operating on spacetime patches of video and image latent codes. The promise and peril of those stunning sample videos generated by Sora have sent shockwaves through the movie industry.
While one Hollywood studio has put its $800-million expansion plans on hold after watching those videos, many others have welcomed the technology, since it would let movie-makers create pilots in a matter of seconds, investing only their imagination. Instead of dumping tens of millions of dollars into shooting a pilot it might wind up passing on, the maker of a Game of Thrones sequel could now use a generative artificial intelligence system trained on its library of shows to create a rough cut in the style of the original.
OpenAI’s largest model, Sora, is capable of generating a minute of high-fidelity video. Results suggest that scaling video generation models is a promising path towards building general-purpose simulators of the physical world. The concept of turning visual data into patches (small regions or segments of an image that serve as the model’s units for processing visual data) was inspired by large language models (LLMs), which gain broad capabilities through training on vast amounts of internet data. The success of LLMs is attributed in part to the use of tokens that can handle different types of text, such as code, math, and various natural languages.
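To make the patch idea more concrete, here is a minimal sketch (in Python, using NumPy) of how a short clip could be chopped into spacetime patches. The clip size, the patch sizes, and the use of raw pixels are all illustrative assumptions; Sora actually works on compressed latent codes, and this is not OpenAI’s code.

```python
import numpy as np

# Illustrative clip: 16 frames of 64x64 pixels with 3 colour channels.
video = np.random.rand(16, 64, 64, 3)

def to_spacetime_patches(video, t_patch=4, s_patch=16):
    """Split a (frames, height, width, channels) array into spacetime patches.

    Each patch covers t_patch consecutive frames and an s_patch x s_patch
    spatial region, flattened into one vector, much like a text token in an LLM.
    """
    T, H, W, C = video.shape
    blocks = video.reshape(T // t_patch, t_patch,
                           H // s_patch, s_patch,
                           W // s_patch, s_patch, C)
    # Group axes so each (time block, row block, column block) becomes one patch.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    return blocks.reshape(-1, t_patch * s_patch * s_patch * C)

patches = to_spacetime_patches(video)
print(patches.shape)  # (64, 3072): 64 patch "tokens", each a 3072-value vector
```

A transformer then attends over this sequence of patch vectors in much the same way a language model attends over a sequence of text tokens.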
Keep guessing about the training data
While LLMs use text tokens, Sora uses visual patches instead. Patches have proven to be a useful, scalable, and effective representation for training generative models on a wide range of videos and images. The trillion-dollar question is what dataset Sora has been trained on. OpenAI has barely shared anything about its training data, but in order to create a model this advanced, Sora needed lots of video data, so we can assume it was trained on video scraped from all corners of the internet.
So how does it actually work?
Sora is trained on a massive dataset of existing videos and images. By analysing huge numbers of videos frame by frame, Sora learns to recognise patterns related to objects, motion, animals, people, and so on.
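As a rough idea of what “analysing videos frame by frame” means in code, the snippet below reads a clip into an array of frames that a model could then learn from. OpenCV is used purely as an example library, the file name is made up, and this is of course not OpenAI’s training pipeline.

```python
import cv2          # OpenCV, a common library for reading video files
import numpy as np

def load_frames(path, max_frames=64):
    """Read up to max_frames frames from a video file as RGB arrays."""
    capture = cv2.VideoCapture(path)
    frames = []
    while len(frames) < max_frames:
        ok, frame_bgr = capture.read()
        if not ok:                      # end of file or unreadable frame
            break
        # OpenCV returns BGR channel order; convert to the usual RGB.
        frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    capture.release()
    return np.stack(frames) if frames else np.empty((0,))

frames = load_frames("puppy_clip.mp4")  # hypothetical file name
print(frames.shape)                     # e.g. (64, height, width, 3)
```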
When you give Sora a text prompt, it generates a more detailed description of what the video should contain. For example, “a cute puppy” might become “a golden retriever puppy playing with a ball in a grassy backyard on a sunny day.” This detailed description acts as instructions for what Sora should create.
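Here is a small sketch of that prompt-expansion step. The expand_prompt helper and the stand-in “language model” are hypothetical placeholders meant to show the flow of data, not Sora’s real interface.

```python
def expand_prompt(short_prompt, language_model):
    """Turn a terse user prompt into a detailed description for the video model.

    language_model is any callable that maps an instruction string to text;
    in Sora this role is played by a captioning/language-model component.
    """
    instruction = (
        "Rewrite the following video idea as one richly detailed description, "
        "naming the subject, setting, lighting, and camera movement:\n"
        + short_prompt
    )
    return language_model(instruction)

# Toy stand-in so the example runs without calling any external service.
fake_llm = lambda text: ("A golden retriever puppy playing with a ball in a "
                         "grassy backyard on a sunny day, handheld camera.")

print(expand_prompt("a cute puppy", fake_llm))
```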
Sora breaks the description down into smaller pieces and decides which visual elements need to be generated and arranged. Drawing on what it learned during training, Sora’s AI “imagination” gets to work creating individual video frames containing the necessary pieces (the puppy, the backyard, and so on).
Finally, Sora smoothly stitches all the generated frames together to create a seamless, realistic-looking video bringing your textual description to life! Of course, this is just a simple overview; Sora uses extremely advanced machine learning techniques under the hood. But hopefully this gives you an idea of how text prompts get turned into fully formed videos as if by magic! As Sora learns from more and more data, its creative capabilities will only continue to improve.
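In reality Sora generates the whole clip jointly with a diffusion process over latent patches rather than gluing together independently made frames, but the final “stitching” step in the overview boils down to something like the sketch below: writing a stack of generated frames out as one playable video. imageio is used purely as an example (writing MP4 needs its ffmpeg plugin), and the random frames stand in for real model output.

```python
import numpy as np
import imageio.v2 as imageio   # example library for writing video files

# Stand-in for Sora's output: 48 frames of 256x256 RGB "video"
# (random noise here; real frames would come from the generative model).
generated_frames = (np.random.rand(48, 256, 256, 3) * 255).astype(np.uint8)

# "Stitch" the frames into a single clip at 24 frames per second.
writer = imageio.get_writer("generated_clip.mp4", fps=24)  # hypothetical output name
for frame in generated_frames:
    writer.append_data(frame)
writer.close()
```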
How Sora’s AI Brain creates videos
To understand how Sora works, think of it like a little movie studio inside a computer program. First, Sora takes the text description you give it and turns that into a script. Then, its different AI systems get to work on making the actual video:
- The Costume Designer AI
Sora has an AI system that has studied millions of images and videos to learn what different objects, animals, and people can look like. This Costume Designer AI picks out clothes, landscapes, buildings, etc. to match the descriptions in your text prompt.
- The 3D Set Constructor AI
Next, Sora uses another AI system to convert all those visual elements into 3D models and arrange them in a full 3D environment. So, if your text mentions a beach, this AI builds a 3D beach setting. It can create basic 3D scenes based on what it has learned from analysing videos frame-by-frame.
- The Videography AI
Now Sora uses its Videography AI to move a virtual camera through the generated 3D scene, figuring out shots and angles that help bring your video to life. This is how it creates realistic motion and camera angles in its videos.
- The Rendering System
Finally, Sora’s Rendering System takes all the 3D assets and camera motion and renders out high-quality 2D video frames. It blends together the visual elements, sets lighting and colours, and makes the finished frames look realistic.
- The Editing Suite AI
To finish it off, Sora’s Editing Suite AI smooths over any rough edges and stitches all the frames together into one seamless, flowing video matching what you described!
And there you have it: that’s roughly how Sora’s different AIs team up to transform text into full-blown videos. Each of those steps involves sophisticated machine-learning techniques under the hood. As Sora is trained on more data, all those AI systems keep getting better at producing creative videos to match what you imagine!
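To make the hand-offs between those stages easier to picture, here is a toy sketch of the “movie studio” metaphor written as plain function calls. Every helper name is a made-up placeholder for one of the AI systems described above; Sora’s real implementation is a single generative model, not five separate programs.

```python
def make_video(prompt):
    """Toy end-to-end pipeline mirroring the movie-studio metaphor above."""
    script = write_script(prompt)          # expand the prompt into a detailed script
    assets = design_assets(script)         # "Costume Designer": choose looks for objects and people
    scene = build_scene(assets)            # "3D Set Constructor": lay out the environment
    shots = plan_camera(scene)             # "Videography": pick camera paths and angles
    frames = render_frames(scene, shots)   # "Rendering System": produce 2D frames
    return edit_together(frames)           # "Editing Suite": smooth and stitch into one clip

# Minimal stand-ins so the sketch runs end to end.
write_script  = lambda prompt: "Detailed script for: " + prompt
design_assets = lambda script: ["puppy", "ball", "backyard"]
build_scene   = lambda assets: {"layout": assets}
plan_camera   = lambda scene: ["wide shot", "close-up"]
render_frames = lambda scene, shots: ["frame for " + shot for shot in shots]
edit_together = lambda frames: " | ".join(frames)

print(make_video("a cute puppy"))
```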