
Let’s imagine the ideal model.
I recently tried out Anima. It’s an open-source anime model that, for the first time, has given me a real sense that we’re getting close to something like NovelAI. Testing this model got me thinking about how we can drive AI forward, and what I’d like to see the open-source community become in the coming years. In compiling this top 10 list, I’d like to try to capture not just my own expectations, but those of many others as well.
<aside> 👉
Think of this as a Christmas wish list; I’m not sure how feasible many of the things I’m about to present to you are (or to what extent they could be achieved this year if a group were to set out to make them a reality). I think just putting them into words is already a big step. If you share the hope that any of the proposals I’m about to make might come true, please pass this article on to the right people and groups so that this can happen.

</aside>
To succeed in 2026, there’s no need to obsess over the quality of NanoBanana or GPT-Image 7. (Or whichever version came out in May – to be honest, I lost count ages ago.) There are other, more achievable priorities that could go viral if a medium-sized, focused team sets its mind to it. China has already made quite a name for itself in open source, but it could make an even bigger name for itself if it takes note of the following, for whoever masters these 10 points will master the art of war:
<aside> 👉
None of the images in this article belong to me; they have been taken from various online sources and are being used solely to illustrate the points made.
</aside>
Limitations: Anime models, for example, are trained on curated data from various websites (Boorus, DeviantArt, etc.), and therefore rely entirely on the fandom surrounding the characters. (In short, this depends on there being plenty of fan art or images taken directly from the film or series.) For a model to be able to generalise and identify almost all existing characters in pop culture, it would need hundreds of tagged images of each of these characters, including the less popular ones. In turn, it should also include some characters that have few drawings, but who are very popular or lend themselves to memes (Heisenberg, Shrek, Queen Elizabeth II…).

I have to say that many of the images featured in this article are a bit of a cheat; it’s not that the NovelAI model is that good by default, it’s just that the users are experts (not to mention that they use NovelAI’s control tools to make them look perfect).
<aside> 💡
In the absence of diverse data: Training with synthetic data could help. NovelAI, or general-purpose generative models, are in fact currently the best source for this type of training. However, great care must be taken with hidden watermarks, training biases, or any other related issues (which could, in fact, lead to errors). It is also possible to train using images from Illustrious LoRAs. (Although these are generally of poorer quality.)

</aside>
We need more comprehensive datasets: every film, TV series, video game, cartoon or comic should be fully captured and segmented, with the resulting images tagged exhaustively; more filters will need to be created (or intelligently relaxed) when curating data (for example: accepting NSFW images, accepting 3D models or taking screenshots of them). We already have segmentation models and agent-based models that could assist with tagging or extraction tasks. In one or two years (and with a lot of money involved), we could achieve a fully functional and open model. That said, this is a colossal undertaking that would offer diminishing returns; very few companies would agree to maintain it. (But various fan groups might try to achieve it.)

I don’t know anything about cartoons, but I think this image from NovelAI shows the kind of versatility an open-source anime model should aim for. Two characters, from two different series, in completely different styles.
Speaking of understanding, a good AI model should be able to handle conflicting relationships and unusual situations. The famous example is a horse riding a human in space. Another example might be the AI being able to depict a ridiculous situation such as a stabbing with a plastic fork (or something even more random, like a Hello Kitty umbrella). I think these last examples make the point clear enough.

Characters in their official styles by default: This is something I’d like to highlight: being able to easily render characters in their official styles should be an essential requirement for a model. DALL-E 3 already had some amusing overfitting issues and went quite viral for that reason back in the day.

The iconic overfitting that characterised DALL-E 3 back in the day should make a comeback.

Anime-specific editing model (Flux Klein type): An editing model could provide that extra something anime models need to succeed. For a start, it would remove the need for a ControlNet model for certain edits. Given that open editing models are only just getting started, we don’t know when we might have an anime editing model (we didn’t even have Z-image edit) or what state it would be in when it came out. What I can say is that, as an experiment, it would be brilliant.
<aside> 💡
The downside of capable generative image editing models is that they require reasonably powerful hardware. Developing a model specialised in anime that is also lightweight could take several years. It is also worth noting that current editing models perform rather poorly with anime: they can handle only basic edits and need dedicated LoRAs for more specific tasks.

An example of a simple workflow for generative editing using the Flux Klein model in anime.
</aside>
ControlNet IP adapter “Vibe Transfer” VERY WELL TRAINED: An IP Adapter is like a specific editing template for a particular parameter in image generation; it is generally used to give the generated images a certain style or consistency based on a reference image (in other words, a style transfer tool). I’ve tried a few, and I have to say that the ones available so far are very superficial. To truly capture any style and generalise from it correctly, you need a versatile and comprehensive model. In such cases, low-budget solutions (such as those that have emerged so far) are of little help. (In fact, many of the images you’ll find in this article were created using the best IP adapter on the market, NovelAI’s, which is believed to have been trained using several thousand dollars’ worth of resources.)

NovelAI’s ability to infer styles and characters through prompt and vibe transfer is remarkable.
Other ways to maintain consistency: To conclude, if none of the above options for achieving ultimate control are feasible, we’re left with what I consider to be the simplest option in 2026: good old ControlNets. I believe that for these to be fully relevant, they would need to innovate in some way (more comprehensive training, as I mentioned with the ‘IP Adapters’). The possibility of creating LoRAs is also worth highlighting, but this depends more on the community and on how quickly leading trainers can adapt the model architecture to their programmes.
<aside> 👉
Many people argue that a good ControlNet is better than a good model; I have always argued the opposite. A good model should follow a sort of Pareto principle. (The initial output should already contain 80% of the elements initially desired, with the remaining 20% being something that can be achieved using additional techniques, which should also be simple and accurate.) AI should save time on simple generations. ControlNet is fine as an additional model, but it should not form the basis of the generation. Not to mention that a model that generalises correctly is crucial for controlling the generation with any of the other techniques.

NovelAI's Vibe Transfer is so good that it even supports multiple inputs.
</aside>
We’re only just beginning to appreciate natural language and comprehension in open-source models. But I believe we should already be focusing on the next big step to take image AI to a new level. The closest we’ve come to a model that understands Spanish with complete accuracy has been Z-image, and I’d say its performance in this and other languages is merely acceptable.
I understand that, when it comes to ‘Danbooru tags’, it wouldn’t be all that difficult to train a model capable of understanding multiple languages. For a start, I know of tag databases—such as those on the Sankaku Complex website—that include synonyms and translations. (Not to mention that they are more comprehensive for certain tags.) A tagging model that responds to these tags in every possible language would be a game-changer, enabling highly detailed, multilingual anime models. Above all, it would take anime image generation back to the popularity levels seen in the early days of Stable Diffusion 1.5.
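
To make the idea concrete, here is a minimal sketch of how translated synonyms could be mapped back to a single canonical tag before training or prompting. Everything in it is an assumption for illustration: the alias table is hand-made (it is not taken from Sankaku Complex or Danbooru), and the function names `build_index`, `resolve` and `ambiguous_aliases` are hypothetical.

```python
from collections import defaultdict

# Hypothetical alias rows: (language code, alias as a user might type it, canonical tag).
ALIASES = [
    ("en", "long hair", "long_hair"),
    ("es", "pelo largo", "long_hair"),
    ("de", "lange haare", "long_hair"),
    ("en", "gift", "gift"),
    ("es", "regalo", "gift"),
    ("de", "gift", "poison"),  # German "Gift" means "poison"
]


def normalise(text):
    """Lower-case the alias and collapse repeated whitespace."""
    return " ".join(text.lower().split())


def build_index(rows):
    """Build a lookup of alias -> {language: canonical tag}."""
    index = defaultdict(dict)
    for lang, alias, tag in rows:
        index[normalise(alias)][lang] = tag
    return index


def resolve(index, alias, lang=None):
    """Return the canonical tag, or None if the alias is unknown or ambiguous."""
    candidates = index.get(normalise(alias), {})
    if lang is not None:
        return candidates.get(lang)
    tags = set(candidates.values())
    # Without a language hint, only answer when every language agrees.
    return tags.pop() if len(tags) == 1 else None


def ambiguous_aliases(index):
    """List aliases that map to different tags depending on the language."""
    return sorted(a for a, by_lang in index.items() if len(set(by_lang.values())) > 1)


index = build_index(ALIASES)
print(resolve(index, "Pelo  Largo"))      # long_hair
print(resolve(index, "gift"))             # None (ambiguous without a language)
print(resolve(index, "gift", lang="de"))  # poison
print(ambiguous_aliases(index))           # ['gift']
```

Keying the table by language is a deliberate choice in this sketch: the same spelling can then resolve to different tags, and collisions can be listed up front instead of silently picking one meaning.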

We are only just scratching the surface of what open models could be. AI can still be trained or labelled in novel ways that will enable it to evolve even further.
<aside> 💡
Be careful, though: increasing the complexity of a model by adding more concepts and synonyms would also increase the number of biases within it (for example, it might generate better anime images when prompted in Japanese, as it would have better tagging for them). It is also worth mentioning that the main difficulty in adding so many concepts to the model would be preserving its natural language understanding (which is already complex). What happens when a word means one thing in German and another in Spanish? This could increase the number of errors. Fortunately, there is a plan B if this goes wrong.

Placeholder image. (It’s a bit ironic to talk about bias with this image.)