Clowning around with Sora 2 while waiting in the drive-thru.
The clown video to the right started as a little experiment, in between other projects. An iPhone, a few free apps, and about an hour of actual work spread across a day. The result was a 60-second “horror” piece that would have required a production crew, makeup artists, and location scouts just a few years ago.
Welcome to Sora 2. I'm addicted to it—even though it frustrates me half the time.
What Is Sora, Actually?
OpenAI just released Sora 2 as a free app for iOS and Android. Type a prompt describing a scene, and it generates a 10-second video. But to understand why this matters—and why the clown video was even possible—you need to understand what's happening under the hood.
Sora is a diffusion transformer—a hybrid architecture combining two of the most powerful approaches in modern AI. Traditional diffusion models work by learning to reverse a noise process: they're trained on millions of videos that have been progressively corrupted with random static, and they learn to clean up that noise step by step.
When you give Sora a prompt, it starts with pure visual chaos and gradually refines it into coherent motion, guided by your text description. What makes Sora different from earlier video generators is the transformer architecture layered on top. Transformers—the same technology behind ChatGPT—excel at understanding relationships across long sequences.
In video, that means Sora can maintain consistency across frames in ways previous models couldn't. A character's face stays recognizable as they turn their head. A camera movement follows physically plausible trajectories. Objects don't randomly appear and disappear between cuts.
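OpenAI hasn't published Sora 2's internals, but the core denoising loop is easy to caricature. Here's a toy sketch in Python; predict_denoised and embed_prompt are made-up stand-ins for the learned video denoiser and text encoder, not anything from OpenAI's stack:

```python
import numpy as np

# Made-up stand-ins: in a real diffusion transformer these would be a learned
# video denoiser and a text encoder. Here they exist only to show the loop.
def embed_prompt(prompt: str, shape) -> np.ndarray:
    """Hypothetical text encoder: maps a prompt to a target in 'frame space'."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.normal(size=shape)

def predict_denoised(frames: np.ndarray, condition: np.ndarray | None) -> np.ndarray:
    """Pretend denoiser: guesses a cleaner version of the noisy frames,
    pulled toward the condition if one is given."""
    target = np.zeros_like(frames) if condition is None else condition
    return frames + 0.5 * (target - frames)

# A tiny grayscale "video": (frames, height, width), starting as pure noise.
frames = np.random.normal(size=(10, 16, 16))
cond = embed_prompt("a clown emerges slowly from deep shadows", frames.shape)

guidance = 3.0                       # how strongly the text steers each step
for _ in range(50):                  # reverse the noise process, step by step
    uncond = predict_denoised(frames, None)        # the model's guess with no prompt
    texted = predict_denoised(frames, cond)        # its guess when guided by the prompt
    guess = uncond + guidance * (texted - uncond)  # classifier-free-guidance-style blend
    frames = 0.9 * frames + 0.1 * guess            # small step toward the guided guess

print(frames.shape)  # still (10, 16, 16): noise gradually refined toward the prompt
```

The real model is of course far more sophisticated, but the shape of the process is the same: start from static and keep stepping toward footage that better matches the prompt.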
This matters for character-driven narratives. Horror in particular depends on building tension through consistency: you need to believe the clown is real and persistent from scene to scene.
Earlier AI video tools couldn't maintain that coherence. Sora can, most of the time.
Building the Clown: Character Creation Across Platforms
The clown didn't start in Sora. I built the character in Google's Imagen—another diffusion model, this one optimized for still images. The prompt was simple: a creepy clown on a white background, visual cues inspired by Pennywise. Pale face, unsettling smile, costume details that read clearly even in motion.
The white background matters technically. When you import a character into Sora, the model needs to isolate what's “character” from what's “environment.” Clean backgrounds make that separation cleaner.
What happens next is called image-to-video conditioning. Instead of starting from pure noise, Sora uses your reference image as an anchor point. The model generates motion that's consistent with both the image and your text prompt. The character's appearance becomes a constraint the diffusion process must satisfy.
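Schematically, the reference image just adds a second pull on the same toy loop from the sketch above. This is an illustration of the idea only, not OpenAI's actual conditioning mechanism:

```python
import numpy as np

ref_image = np.random.normal(size=(16, 16))      # stand-in for the white-background clown render
text_pull = np.random.normal(size=(10, 16, 16))  # stand-in for the text-guidance target

frames = np.random.normal(size=(10, 16, 16))     # start from noise, as before
for _ in range(50):
    toward_image = ref_image[None] - frames      # keep every frame consistent with the character
    toward_text = text_pull - frames             # and consistent with the described action
    frames += 0.08 * toward_image + 0.02 * toward_text

# The character's appearance has become a constraint the result had to satisfy.
print(float(np.abs(frames - ref_image).mean()))
```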
I gave the clown a personality profile within Sora: how it moves (deliberate, predatory), how it sounds (I'd add audio later), how it behaves in space.
“Ronald McDamaged is an eerie clown who glides silently, always clutching a different balloon. He speaks in slow, haunting riddles and rhymes, with an unsettling, melodic voice, playfully inviting onlookers into his world of carnival mischief and chilling whimsy.”
Here are some example character videos of Ronald in various situations — straight outta Sora:
From that point on, I could drop this character into any scene and maintain visual consistency.
Without consistent characters, you're just making tech demos. With them, you can build narrative tension across scenes—the character becomes recognizable, their actions carry weight, and suddenly you're storytelling instead of generating random clips.
Six Scenes, One Nightmare: Prompt Engineering for Horror
Each Sora video maxes out at 10 seconds, but you can stitch up to six together within the app for a 60-second piece. The clown video uses all six slots. Making them feel like one coherent piece—rather than six random experiments—required understanding how diffusion models interpret language.
These models don't “understand” prompts the way humans do. They map text to visual concepts through patterns learned from training data. Certain words and phrases reliably trigger certain visual qualities. Film terminology often works better than abstract descriptions because the training data included countless videos tagged with professional vocabulary.
For the clown piece, every prompt started with identical visual constraints:
“Cinematic horror, 1980s VHS aesthetic, desaturated colors, heavy film grain, single-source dramatic lighting, 35mm anamorphic lens distortion.”
That first sentence never changed across all six clips. It told Sora to explore the same visual territory each time. Underneath, I'd describe what happens in that specific scene:
“A clown emerges slowly from deep shadows in an abandoned warehouse, turning deliberately to face the camera. Dust particles visible in the light beam.”
The consistency comes from repetition. By keeping those visual constraints identical, you're telling the model to sample from the same region of possibility space. The natural variation in diffusion models—which usually creates jarring inconsistency—becomes an asset. Each clip looks slightly different, but those differences feel like natural movement between scenes rather than glitches.
For horror specifically, I leaned into terminology the model associates with the genre: “ominous,” “lurking,” “emerging from shadows,” “deliberate movement.” These words have strong associations in the training data with specific visual treatments.
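If you write your prompts outside the app, the repetition is easy to make mechanical. A minimal sketch: the style block and the first scene are quoted above; the second scene is invented here purely to show the pattern.

```python
# Fixed style block: identical across all six clips, so every generation
# samples from the same region of the model's possibility space.
STYLE = (
    "Cinematic horror, 1980s VHS aesthetic, desaturated colors, heavy film grain, "
    "single-source dramatic lighting, 35mm anamorphic lens distortion."
)

# Per-scene action, leaning on vocabulary with strong horror associations.
scenes = [
    "A clown emerges slowly from deep shadows in an abandoned warehouse, turning "
    "deliberately to face the camera. Dust particles visible in the light beam.",
    "The clown lurks at the far end of a flickering corridor, clutching a red balloon, "
    "ominous and motionless as the camera slowly pushes in.",  # illustrative only
]

prompts = [f"{STYLE} {scene}" for scene in scenes]

for i, prompt in enumerate(prompts, start=1):
    print(f"--- Scene {i} ---\n{prompt}\n")
```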
The Chaos You Can't Control
Let me be honest: every render is a roll of the dice.
I generated probably 20 versions of the six scenes before I had takes I liked (Sora 2 allows 30 video generations per day). Same prompts, wildly different results. The clown's expression would shift between menacing and almost comedic. The lighting would nail the mood in one render and flatten it in the next.
Sometimes the character would move exactly as I'd envisioned; sometimes the physics would go subtly wrong in ways that broke the look. Tell-tale AI wonkiness destroys the illusion of reality and takes the viewer out of the story, making them focus on weird technical flaws instead of the narrative.
This is the mathematical reality of diffusion models. Each generation samples from a probability distribution. You're not retrieving stored footage—you're asking the model to create something new that satisfies your constraints. The same constraints can be satisfied in many different ways.
For the clown video, I embraced this. I generated multiple versions of each scene and picked the ones where the randomness worked in my favor—where the slight variations added to the unsettling quality rather than undermining it. A flicker of movement that wasn't in my prompt. A shadow that fell unexpectedly. These artifacts of the generation process became features, not bugs.
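The working loop, written out as Python: generate_clip and score are placeholders (Sora is an app, not an API I'm calling, and the scoring is really just me watching takes), and the 30-per-day cap is the limit mentioned above.

```python
import random

DAILY_CAP = 30            # Sora 2's daily generation limit mentioned above
VARIANTS_PER_SCENE = 4    # roll the dice several times per scene, then curate

def generate_clip(prompt: str, seed: int) -> dict:
    """Placeholder for a generation step; every seed is a fresh sample
    from the model's probability distribution, not a retrieval."""
    return {"prompt": prompt, "seed": seed}

def score(clip: dict) -> float:
    """Placeholder for the human part: watching each take and judging
    whether the randomness worked in the scene's favor."""
    return random.random()

prompts = [
    "STYLE BLOCK + scene one description",   # structured like the template above
    "STYLE BLOCK + scene two description",
]

budget, keepers = DAILY_CAP, []
for prompt in prompts:
    n = min(VARIANTS_PER_SCENE, budget)
    if n == 0:
        break                                 # out of generations for the day
    takes = [generate_clip(prompt, seed) for seed in range(n)]
    budget -= n
    keepers.append(max(takes, key=score))     # keep the take where the chaos helped

print(f"Kept {len(keepers)} clips; {budget} generations left today.")
```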
You cannot use Sora for client work that requires predictable, repeatable results. Not yet. But for creative projects where you can iterate and curate? The chaos becomes raw material.
Sound Design: The Rest of the Story
Video without audio is a tech demo. The clown piece needed sound that matched the 1980s VHS aesthetic I'd built visually.
I used Suno, another AI tool that generates custom music from text prompts. Suno applies a broadly similar generative approach to audio: it learns patterns of how different genres, instruments, and moods combine, then generates new compositions matching your description.
My prompt asked for something specific: deteriorated analog synth, the kind of score John Carpenter might have composed if the tape had been left in a hot car for a decade. Unsettling sustained tones. Occasional dissonant stings. The sound of something… off.
Three minutes later, I had a soundtrack. I brought everything into CapCut—the six Sora clips and the Suno audio—and built the final edit: adding transitions, rearranging scenes, fine-tuning the color grade to push the VHS aesthetic further, and syncing the audio stings to the clown's movements.
The whole production stack happened on my phone. Character creation in Imagen, video generation in Sora, audio creation in Suno, editing in CapCut. All free or cheap apps. All while doing other things throughout the day.
What This Means for Visual Storytelling
The clown video is a proof of concept, not a masterpiece. But it demonstrates something important: the barrier between imagination and execution just collapsed.
Five years ago, making even a rough version of this piece would have required a location (found or built), a performer, prosthetic makeup or a high-quality mask, lighting equipment, a camera capable of the aesthetic I wanted, and either the skills to do it myself or the budget to hire people who could. The logistics alone would have killed most ideas before they started.
Today, the constraint is imagination and iteration. Can you describe what you see in your head clearly enough for the model to approximate it? Can you generate enough variations to find the ones that work? Can you assemble the pieces into something coherent?
These are different skills than traditional filmmaking, but they're still skills. Prompt engineering for visual media is a craft that improves with practice. Learning which words trigger which visual associations. Understanding how to structure constraints for consistency. Knowing when to fight the model's tendencies and when to lean into them.
The photographers and filmmakers who will thrive aren't necessarily the ones who adopt these tools fastest. They're the ones who understand them deeply enough to know when AI generation serves a project and when it doesn't—and who can integrate new capabilities without losing what makes their work valuable.
Your eye for composition, your instinct for story, your ability to see what others miss—none of that becomes less valuable. If anything, it becomes more valuable because the tools to execute your vision keep getting faster and cheaper. The creative vision was always the scarce resource; execution is becoming abundant.
The Trade-Offs
The app is free, but you're the product. Every video you generate, every prompt you write—it's training data for OpenAI's models. The output you're getting today, they'll refine and sell back to you via subscription fees tomorrow.
The content restrictions feel arbitrary for horror work. Certain imagery is off-limits in ways that aren't always clear until you hit the invisible “Content Guidelines” boundary. I couldn't push the clown piece as dark as I might have wanted. The model refused prompts that seemed reasonable to me, for reasons it wouldn't fully explain. I attempted to communicate a fix for this to Sam Altman via Sora 2 — I got no response.
This is a fundamental tension in generative AI: the capabilities that enable creative expression also enable harm. Companies navigate this through restrictions that inevitably frustrate legitimate creative use. For horror—a genre that often needs to transgress boundaries to work—this is a real limitation.
And the inconsistency I mentioned means you can't promise clients specific outcomes. Every render is probabilistic. This is a creative tool, not a production tool. Not yet.
Try It Yourself
Download Sora from Apple's App Store or from Google Play. It's free.
Start with character creation. Build someone—or something—in an image generator first. Keep the background clean. Import that image into Sora and give it behavioral constraints. Then drop it into scenes and see what happens.
Pay attention to which prompt structures produce more consistent results. Learn the vocabulary that triggers the visual qualities you want. Generate many versions and curate ruthlessly.
You don't have to love it. You don't have to use it professionally tomorrow. But you should understand it, because this is where visual storytelling is heading whether we're comfortable with it or not.
Your inner clown is waiting. What will you make? 🤡
