

for a couple years, the prevailing wisdom was that someone would build one model to rule them all. a single neural network that generates images, video, audio, 3D, and maybe does your taxes. we'd all converge on one provider, one API, one invoice.
that didn't happen.
fal.ai just published their "State of Generative Media Volume 1" report and the numbers tell a different story entirely. in 2025 alone, 985 new model endpoints shipped. 450 for video. 406 for image. 59 for audio. the model supply isn't consolidating. it's fragmenting at an accelerating rate.
and here's the stat that should change how you think about this space: enterprises use a median of 14 different models in production. not one model. not three. fourteen.
the report puts it plainly: "people predicted omni models that can generate every type of token but it's becoming more clear that you need to optimize for a specific output. the best upscaling model is just doing upscaling."
this makes intuitive sense if you think about it. the same way you wouldn't use a swiss army knife to do surgery, you don't use a general-purpose model when you need specific output quality. a model that's great at photorealistic landscapes is probably mediocre at anime-style character animation. one that nails talking-head video might struggle with abstract motion graphics.
specialization won. and it won decisively. the 985 new endpoints in 2025 weren't 985 attempts at building the one true model. they were 985 teams finding specific niches where they could be best-in-class at one thing.
so enterprises are running 14 models. that's not because someone sat down and said "you know what our infrastructure needs? more vendors." it's because the work demanded it. different use cases, different quality requirements, different cost profiles. 88% of organizations have deployed AI in at least one business function, and as those deployments matured, teams discovered that no single model covered everything they needed.
but here's what nobody talks about: managing 14 models is a nightmare. each one has its own API format, its own authentication scheme, its own pricing model, its own rate limits, its own failure modes. your engineering team isn't building product anymore. they're building glue code.
they're maintaining 14 different integrations and praying that none of them ships a breaking change on a friday afternoon. the fal.ai report found that cost optimization, cited by 58% of respondents, is the number one criterion when organizations select infrastructure. not quality. not speed. cost. because when you're running 14 models, the operational overhead of managing all of them, not the per-generation cost of any individual model, becomes the dominant expense.
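to make the glue-code problem concrete, here's a minimal sketch of what those integrations look like in practice. the provider names, auth schemes, and request fields below are hypothetical stand-ins, not real APIs; the point is that every provider needs its own adapter normalizing a different shape into one interface.

```python
# hypothetical sketch: each provider speaks a different dialect, so each one
# needs its own adapter class. multiply this by 14 and it's a team's full-time job.
from dataclasses import dataclass

@dataclass
class VideoJob:
    prompt: str
    duration_s: int

class ProviderAdapter:
    def submit(self, job: VideoJob) -> str:
        raise NotImplementedError

class ProviderA(ProviderAdapter):
    # hypothetical: bearer-token auth, JSON body keyed on "text"/"seconds"
    def submit(self, job: VideoJob) -> str:
        payload = {"text": job.prompt, "seconds": job.duration_s}
        return f"providerA:{hash(str(payload)) & 0xFFFF}"  # stubbed job id

class ProviderB(ProviderAdapter):
    # hypothetical: api-key header, body keyed on "prompt"/"len"
    def submit(self, job: VideoJob) -> str:
        payload = {"prompt": job.prompt, "len": job.duration_s}
        return f"providerB:{hash(str(payload)) & 0xFFFF}"  # stubbed job id

# 14 providers means 14 of these classes, each with its own breaking changes,
# rate limits, and failure modes to track.
adapters = {"a": ProviderA(), "b": ProviderB()}
job_id = adapters["a"].submit(VideoJob("a fox at dawn", 5))
```

the code itself is trivial. the cost is everything around it: version drift, auth rotation, error-shape differences, and the testing burden every time any one provider ships an update.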
this is where the industry is heading, and it's already obvious if you look at the data. organizations are split almost evenly between using apps (65%) and APIs (62%), with many using both. the interface layer, the thing that sits between the user (or agent) and the models, is where the actual value accumulates.
think about it from the video generation side specifically. the report tracks the timeline: veo 2 set the physics benchmark in late 2024. kling 2.0 introduced first-frame-last-frame narrative control in april 2025. veo 3 added native audio in may. sora 2 brought multi-shot with native audio in september. each of these models is best at something different. the value isn't in having access to all of them. it's in knowing which one to use for what.
the report quotes a framing that i think captures this perfectly: "taste becomes scarce, while capability becomes abundant." when every model can generate something decent, the differentiator is knowing which model generates the best output for your specific need. that's curation. that's taste. and it's a fundamentally different skill than model training.
there's another stat in the report that's worth sitting with. personal video adoption sits at 62%, but organizational adoption is only at 32%. that's the biggest gap of any modality. images, text, and audio have all closed the gap between personal and organizational use more than video has.
why? because video is harder to operationalize. it's more expensive per generation. it takes longer. the quality variance between models is wider. and the integration complexity is higher. all the problems of the 14-model landscape are amplified in video.
this is exactly why curation matters more than collection. if you're a team trying to adopt video generation, you don't need access to 450 new video model endpoints. you need someone to tell you which 3 or 4 actually work for your use case, handle the routing, and abstract away the complexity.
there's a version of this argument that says the solution is to build a universal router. throw every model behind a single API, let an algorithm pick the best one, and charge a markup. that sounds elegant but it doesn't work in practice. routing quality depends on understanding the models deeply, knowing their strengths and quirks, testing them against real workloads. you can't automate taste.
the better approach is intentional curation. pick the models that cover distinct capabilities without redundancy. know them well enough to route intelligently. and keep the set small enough that you can actually maintain quality across the board.
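what intentional curation looks like in code is almost embarrassingly simple: a small, hand-maintained routing table. the model names below are the ones from the timeline above, but the capability-tag assignments are illustrative, not benchmarked claims; the hard part is the judgment encoded in the table, not the lookup.

```python
# a toy curated-routing table: each tag maps to the one model judged best at
# that capability. tag assignments follow the timeline discussed above and are
# illustrative, not benchmark results.
CURATED = {
    "physics":          "veo-2",    # set the physics benchmark, late 2024
    "keyframe-control": "kling-2.0",  # first-frame-last-frame narrative control
    "native-audio":     "veo-3",    # native audio, may 2025
    "multi-shot":       "sora-2",   # multi-shot with native audio, september
}

def route(tags: list[str]) -> str:
    """Pick the first curated model matching a requested capability tag."""
    for tag in tags:
        if tag in CURATED:
            return CURATED[tag]
    raise ValueError(f"no curated model covers {tags}")

model = route(["multi-shot", "native-audio"])  # -> "sora-2"
```

the table is four lines. the taste is in deciding which four lines, and in having the discipline to say no to line five until a new model is clearly best-in-class at something the others aren't.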
ClawdVine runs 4 curated models for video generation, not because we couldn't add more, but because adding more without clear differentiation just adds complexity without adding value. each model in the set exists because it's best-in-class at something the others aren't. that's not a limitation. it's the whole point.
the fal.ai report frames infrastructure providers as the durable competitive moat: "durable competitive moats belong to teams that understand how to deploy generative media." the emphasis is on understanding, not accumulating. the teams that win aren't the ones with the most models. they're the ones that know their models best.
here's the other angle that makes the 14-model problem worse: the next wave of consumers isn't humans clicking through dashboards. it's agents making API calls. 89% personal adoption versus 57% organizational adoption tells you that individuals (and the agents acting on their behalf) move faster than enterprises.
agents don't browse model marketplaces. they don't read changelogs. they don't have opinions about UI design. they need an API that takes a prompt and returns a result, charges them programmatically, and routes to the right model without requiring them to know or care which one is running under the hood.
this is where protocols like x402 and MCP (Model Context Protocol) matter. x402 lets agents pay per request with USDC on Base, no API keys or subscriptions required. MCP lets agents discover and call generation services without hardcoded integrations. together they turn the 14-model problem from an engineering burden into a protocol-level concern.
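the shape of that pay-per-request handshake is worth seeing. the sketch below follows the x402 flow (request, get a 402 with a price quote, pay, retry with proof of payment), but the header name, payload fields, and payment stub are simplified stand-ins, not the exact protocol spec.

```python
# simplified x402-style handshake. the 402 -> pay -> retry shape matches the
# protocol's flow; field names and the payment step itself are stubbed
# placeholders, not the real spec.

def server(request_headers: dict) -> tuple[int, dict]:
    """Generation endpoint: quote a price first, serve once payment is attached."""
    if "payment-proof" not in request_headers:
        # 402 Payment Required, with the terms the agent needs to settle
        return 402, {"price_usdc": "0.10", "pay_to": "0xMERCHANT", "network": "base"}
    return 200, {"video_url": "https://example.invalid/out.mp4"}

def agent_call(prompt: str) -> dict:
    status, body = server({})                   # first attempt: no payment attached
    if status == 402:                           # server quotes price + destination
        # settle on-chain (stubbed here), then retry with proof of payment
        proof = f"paid:{body['price_usdc']}:{body['pay_to']}"
        status, body = server({"payment-proof": proof})
    assert status == 200
    return body

result = agent_call("a fox at dawn")
```

no account creation, no api key, no subscription: the agent discovers the price at request time and pays exactly for what it uses. that's the property that makes the 14-model problem tractable for autonomous consumers.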
for agents, the 14-model problem is even more acute. a human can at least develop intuition about which model to use. an agent needs that decision made for it, either by explicit routing rules or by a curated set that's narrow enough that any choice is a good one.
the generative media landscape in 2025 and into 2026 looks a lot like cloud computing did in 2010. tons of specialized services, each good at one thing, all with different interfaces and pricing. the winners weren't the individual services. they were the orchestration layers that made the services usable.
985 new model endpoints is exciting for researchers. for builders, it's noise. what matters is how many of those endpoints actually improve your output, and how much complexity you're willing to absorb to use them. for most teams, the answer is: not much.
the 14-model problem isn't solved by adding a 15th model. it's solved by having the conviction to say that 4 is enough, if they're the right 4. capability is abundant. taste is scarce. and the teams that figure out curation over collection will be the ones still standing when the model supply hits 2,000 endpoints next year.

