
Scrolling through my Twitter feed, I noticed that two new models, Qwen3 Max Thinking and Kimi K2.5 Thinking, had just landed on yupp.ai. It was one line in the Kimi description that really got me thinking: 'it's built for strength in front-end coding and vision tasks.' Interesting, I thought. What if I use this AI to code a specific visual style? To start, I should probably try something simple. Okay, does 'simple' mean 'primitive'? Primitivism seemed like a perfect starting point, the geometric simplicity of shapes and compositions, yet with a play on color, volume, proportions, and dynamics. It felt like a great 'trick question' for an AI.
I googled 'primitivism' and among the image results, I saw it: Natalia Goncharova’s 'The Harvest.'

I looked at it: perfect, exactly what I need. Three simple figures in the foreground, a horse pulling a cart in the back, a wheat field, a couple of trees, grass, a road. Bold patches of contrasting color, thick brushstrokes. No abstraction, no expressionism, no hidden meaning. It looked like a child’s drawing. You'd think AI would handle this style with ease.
I was wrong. Catastrophically wrong.
Instead of amusing artifacts and typical AI hallucinations, I got a real digital drama that stretched on for hours.
This experiment exposed more than just algorithmic flaws and the black holes within the machine's 'mind.' It revealed something much deeper: a clash between machine vision and human culture.
I had worked with image generators before and knew one thing: a quality prompt and a clear division of roles and stages are the key to success. Garbage in, garbage out. No way around it.
First, I asked my go-to AI assistant Gemini to evaluate my idea as an AI testing expert: 'A thinking model (AI-1) analyzes the painting in maximum detail and creates a precise prompt for an image generation model (AI-2). The goal is to get a result as close to the original painting as possible.'
It immediately expressed doubts (at the time, I didn't give it much thought, I even told myself, 'Look at him, so smart, thought of everything'), calling the task promising but fraught with significant difficulties. Gemini pointed out three obstacles:
Semantic Gap: The biggest problem is 'translating' visual information into a text prompt. Even the most detailed descriptions can't capture the finest nuances.
Randomness of Generation: Image generation models are by nature stochastic (seed-based). Achieving an identical result requires multiple iterations and manual refinement (in-painting, out-painting, local editing) that go beyond simple prompting. (A short code sketch below shows what 'seed-based' means in practice.)
Lack of Contextual Understanding: While AI can identify the style, it might not grasp the cultural or historical context of the work - like the connection between neo-primitivism, lubok prints, and icons - which affects the depth of the recreation.
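For anyone who hasn't run these models locally, here is that sketch, using the open-source Hugging Face diffusers library (not the hosted models on yupp.ai, which hide this knob from the user). The model ID and prompt are placeholders; the point is simply that the same prompt with the same seed reproduces the same image, while changing either makes the result drift.

```python
# Minimal sketch of seed-controlled generation with an open-source pipeline.
# Assumes a CUDA GPU and the 'diffusers' + 'torch' packages installed.
import torch
from diffusers import StableDiffusionPipeline

MODEL_ID = "stabilityai/stable-diffusion-2-1"  # any Stable Diffusion checkpoint works here

pipe = StableDiffusionPipeline.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

prompt = "neo-primitivist oil painting, peasants harvesting wheat, flat color planes, heavy black outlines"
generator = torch.Generator(device="cuda").manual_seed(42)  # fixing the seed pins down the randomness

image = pipe(
    prompt,
    generator=generator,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
image.save("harvest_attempt_seed42.png")  # re-running this script reproduces the exact same image
```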
Finally, Gemini gave me the following prompt:
A Neo-Primitivism oil painting on canvas in the style of Natalia Goncharova, titled 'Haycutting', circa 1910. Depicts three Russian peasants harvesting hay. The central figure is a large man in a white shirt and dark pants, aggressively swinging a scythe, dominating the left foreground. Two smaller figures gather hay bales on the right. A horse pulling a cart is visible in the background forest. Bold black outlines, flat patches of vibrant colors, expressionistic brushwork, strong vertical lines of trees contrasting with horizontal hay piles. Deliberate distortion of perspective and figure proportions, naive folk art aesthetic, reminiscent of Russian lubok prints. Earthy tones: vibrant yellow, deep green, brown, white. High detail, museum quality.
I copied the prompt and eagerly entered it into the yupp.ai chat, letting the site pick a random pair of 'artists.' From past experience, I already knew that it's best to speak to each generator in its own 'language,' optimizing the prompt for a specific model. But I wanted to see the first result as quickly as possible. The randomly selected pair was: FLUX.1 Schnell and Luma Photon (Fal.ai).

I didn't even bother analyzing the Flux result.

The Luma Photon result almost hit the mark, and on the very first try! 'I'm so close,' I thought. I was so naive… A mix of overconfidence and low competence is a great way to get yourself stuck in a five-hour ordeal.
Suddenly, I remembered where it all began: with the idea of testing Qwen3 and Kimi K2.5. Inspired by the first success, I decided to raise the stakes. Now, the AI Analyst and the AI Artist had to work 'blind.' No artist names, painting titles, or style mentions. Total isolation from context.
I wanted to see whether a 'thinking' mind was capable of deconstructing a visual code from scratch. I forced it to stop looking for analogies. Its task became structural analysis. The work of a cartographer. A cold scan. I dreamed of obtaining the perfect blueprint of the painting, the DNA of its visual style. Back then, I didn't realize that these were two fundamentally different goals that would ultimately annihilate each other.
Taking my input, Gemini suggested switching to a 'Scene Description Protocol' (SDP) for the AI Analyst. It then issued the following task for Qwen3 Max:
You are VAE-Pro (Visual Analysis Engine Professional), an AI designed to convert raster imagery into strict structured textual data for reconstruction purposes. Your analysis must be purely objective, geometric, and observational.
**CRITICAL FORBIDDEN ZONES (Zero Tolerance):**
1. ABSOLUTELY NO recognition of art style (e.g., Neo-Primitivism), artist name, artwork title, or historical context.
2. NO emotional or interpretive adjectives (e.g., 'sad', 'energetic', 'simple life').
3. Do not describe what is happening (e.g., 'people working'), only where visual elements are placed.
**MANDATORY METHODOLOGY:** Analyze the image by mentally dividing it into a 3x3 grid. Describe the visual data using the required 'Scene Description Protocol' format below. Focus on geometry, flat color blocks, and explicit outlines.
**REQUIRED OUTPUT FORMAT (STRICT ADHERENCE):**
=== START SCENE PROTOCOL ===
[GLOBAL_PARAMETERS]
Perspective_Mode: [e.g., Flat plane, no vanishing point, stacked elements]
Lighting_Type: [e.g., None, uniform flat fill, no shadows]
Outline_Style: [e.g., Heavy black contours on all figures, approx 3px width]
Texture_Application: [e.g., Visible coarse directional brushstrokes flat on surface]
[COLOR_PALETTE_HEX_APPROX] (Identify 5 dominant colors and their primary use location)
Color1: #[HEX] - [Location usage]
Color2: #[HEX] - [Location usage] …
[GRID_LAYOUT_3x3] (Briefly list dominant shapes in each sector)
TL: [Shape/Color] TM: [Shape/Color] …
[KEY_FIGURES_GEOMETRY] (Describe the main human-like shapes as geometric constructs)
Figure_Center_Right: [Pose geometry]
Figure_Center_Left: [Pose geometry]
Figure_Background_Top: [Pose geometry]
=== END SCENE PROTOCOL ===
Qwen3 output:
=== START SCENE PROTOCOL ===
[GLOBAL_PARAMETERS]
Perspective_Mode: Flat plane, no vanishing point, stacked elements
Lighting_Type: Uniform flat fill, no shadows
Outline_Style: Heavy black contours on all figures, approx 3px width
Texture_Application: Visible coarse directional brushstrokes flat on surface
[COLOR_PALETTE_HEX_APPROX]
Color1: #FFFFFF - White shirts of all human figures
Color2: #8B4513 - Brown pants and ground
Color3: #2E8B57 - Green foliage and background trees
Color4: #FFD700 - Yellow crop and field
Color5: #000000 - Black outlines and dark background elements
[GRID_LAYOUT_3x3]
TL: Green rectangle (trees), white rectangle (shirt), brown rectangle (pants)
TM: Green rectangle (trees), yellow triangle (crop), brown triangle (ground)
TR: Brown rectangle (horse), green triangle (trees), yellow triangle (crop)
ML: White rectangle (shirt), brown cylinder (legs)
C: Yellow triangle (crop), white rectangle (shirt)
MR: White rectangle (shirt), yellow triangle (crop)
BL: Brown rectangle (ground)
BM: Brown rectangle (ground)
BR: Brown rectangle (ground), white rectangle (shirt)
[KEY_FIGURES_GEOMETRY]
Figure_Center_Right: Torso rectangle tilted 30deg right, limbs are thick cylinders bent at sharp angles, head is oval
Figure_Center_Left: Torso rectangle tilted 45deg left, limbs are thick cylinders bent at sharp angles, head is oval
Figure_Background_Top: Torso rectangle, limbs are thin cylinders, head is oval
=== END SCENE PROTOCOL ===
Gemini noted that Qwen3 VL performed like a true technical scanner: no poetry, no artistic references - only geometry and color codes meant to convey the rhythm and vibrant palette of 'Harvest.' It particularly highlighted that the AI Analyst noticed the horse in the background, identifying it as a 'Brown rectangle (horse),' which serves as confirmation that the AI is actually dissecting pixels.
Gemini then compiled this protocol into a reconstruction task for the AI artist:
TECHNICAL RECONSTRUCTION TASK: Create a flat 2D image based on the following SCENE PROTOCOL.
STYLE: Folk-art, neo-primitive aesthetic, flat color fills, heavy visible brushstrokes.
CONSTRAINTS: No depth, no perspective, heavy 3px black outlines on every object.
SCENE DATA:
Palette: White (#FFFFFF), Golden-yellow (#FFD700), Forest-green (#2E8B57), Earth-brown (#8B4513).
Composition: 3x3 Grid. Top: Green trees and a brown horse (right). Middle: Human figures in white shirts, tilted 30-45 degrees, working with yellow crops. Bottom: Solid brown ground.
Anatomy: Simplified geometric humans. Rectangular torsos, cylindrical limbs, oval heads.
RENDER: Flat uniform lighting, no shadows, coarse texture.
NEGATIVE PROMPT: photorealistic, 3d, volumetric lighting, shadows, gradients, realistic anatomy, detailed faces, digital gradient, blur, depth of field, masterpiece, artistic lighting, cinematic.
SETTINGS: Seed: 42. Aspect Ratio: 4:5
The digital cast of the painting seemed flawless. I looked at these arrays of commands and numbers and took them for the ultimate truth, largely because my own competence was barely enough to distinguish real geometry from the set of plausible-sounding terms with which the AI so cleverly charms us.
Things went a bit sideways. It turned out that in its current state, yupp.ai doesn't offer a direct way to select GPT-Image 1.5; you can only get it through the site's random algorithm. Among the available counterparts was its older sibling, GPT-5 Image (OpenRouter), which produced the following:

The result was like a cold shower. Remember, Luma Photon, with a fairly simple descriptive prompt, had managed something close to the original. But here, we had written an entire program, digitized every important technical detail, translated it into an instruction for an artist, and the failure was spectacular.
I had to dig in and find out what went wrong. Gemini identified a classic 'Semantic Gap' issue in the result. GPT-5 took the instructions with painful literalism. Instead of Natalia Goncharova’s powerful, expressive painting, we got something halfway between a modern corporate illustration and a children's coloring book.
Here’s where it all fell apart:
The 'Folk-art' Trap: In modern AI databases, the term 'folk-art' is often associated with Scandinavian minimalism or tidy paper cutouts. Goncharova, however, is Neo-Primitivism: a coarse, heavy, almost 'dirty' style with raw energy.
Literal Grid Interpretation: Instead of using the 3x3 grid as a hidden compositional framework, the model created a sharp division into 'shelves.' The horse and trees stand as if in a shop window, separated from the rest of the field by a thick line.
Missing the Harvest Tool: In the original 'Harvest,' there are dynamics and a rhythm of movement. In the AI's version, the people are simply leaning over, picking things up by hand (harvesting).
Sterile Textures: Despite the request for a 'coarse texture,' the model produced a very clean, uniform fill with a slight 'paper-like' grain. The original, however, features thick oil strokes.
We moved on to analyzing GPT-5’s reasoning. The logs show that the model limited itself: 'noting the coarse texture and painterly brushstrokes as part of the style, despite the tension with flat color fills.' The model saw a contradiction between 'flat fills' and 'coarse strokes' and chose the path of least resistance – the flat fill.
Writing this now, I can see that my past self missed this crucial step in his subsequent investigations. A detailed analysis of those AI artists' reasoning would have helped me understand which specific phrases caused the neural network to veer away from the results I wanted. With a few hours of work on a single model, I probably could have coaxed it into producing something at least remotely close to the 'Harvest' style. Maybe. But the path I actually took was far more circuitous and ridiculous.

Speaking of ridiculous, yupp.ai randomly paired GPT-5 with MiniMax Image (Fal.ai), which visualized the exact same prompt. And although MiniMax almost completely ignored our strict technical instruction, its creation (which I dubbed the 'Village Beatles') was brilliant in its own way: it turned out to be much closer to the vibrant spirit of Primitivism than the sterile postcard from GPT.
But I decided to 'double down' on GPT-5 specifically.
Why GPT? I had worked with it before and saw it as a truly powerful, analytical engine, a solid 'physicist' that, unlike many other models, had an excellent understanding of spatial relationships. There was hope that if I provided a detailed enough instruction, it would deliver.
To get a result even remotely resembling the original, I asked Gemini to make the prompt more 'technological':
SUBJECT: Triple rhythmic repetition of heavy, distorted human silhouettes in a field.
MEDIUM: Authentic 1910s Neo-primitive oil painting. Visible thick pigment, coarse bristles marks, unprimed canvas texture showing through.
VISUAL LOGIC: No linear perspective. Objects are stacked vertically. High-contrast planar composition.
FORMS: Massive, non-anatomical limbs. Figures are physically merged with the yellow diagonal strokes of the crop.
COLOR: Brutalistic palette: Ochre (#FAA500), Dirty White, Prussian Blue, Malachite Green.
EDGES: Harsh, trembling, hand-painted black outlines that bleed into the colors.
ATMOSPHERE: Raw, earthy, vibrating energy. No digital smoothness.

Back then, I blindly trusted the machine's authority, dumping prompts into the chat box without a second thought. Gemini noticed my frustration and tried to be helpful. As requested, it 'technologically' deconstructed the task and attempted to describe the materials, colors, and techniques of a living artist. But in reality, it pivoted too hard toward a sterile, bare primitivism.
It was that same old conflict: 'blueprint vs. stylistic vibe.' Identifiable details like the horse and trees simply vanished from the prompt. Only the skeleton of the composition remained: three massive figures that, for some reason, were physically merged with the yellow diagonal strokes of the crop. What depths of its memory did it pull that detail from?
The attempt to capture the authentic spirit of Primitivism was reduced to banal geometric simplification and 'digital noise' instead of an imitation of real brushstrokes.
Looking at the prompt was even more disheartening. A once-monstrous, technically detailed protocol designed to 'hack' any painting had been whittled down to a few hollow lines.
I decided to take a step back: pick a different AI artist and re-adapt the data from the Qwen3 analysis. My next attempt was with Flux 1.1 Pro. Gemini assured me that this model is currently the benchmark for accuracy, prized specifically for its ability to follow complex instructions and build multi-layered compositions without losing small details along the way.
CORE STYLE: Authentic 1910s Avant-Garde Oil Painting, Neo-primitivism style.
COMPOSITION: High-contrast planar composition without depth.
- Top Layer: A thin strip of white sky with primitive, small green trees and one brown horse silhouette on the far right.
- Middle Layer: Three massive, repeating human figures in bright white shirts (#FFFFFF). Figures are tilted diagonally (30-45 degrees), positioned in a rhythmic sequence.
- Foreground: Solid dark earth-brown ground at the very bottom.
GEOMETRY & FORMS: Non-anatomical, simplified bulky limbs shaped like thick cylinders. Heavy, irregular 3px black outlines on every object.
TEXTURE & COLOR:
- Palette: Saturated Golden-Yellow (#FFD700) for the field, Forest-Green (#2E8B57), Earth-Brown (#8B4513), Prussian Blue.
- Execution: Visible heavy impasto brushstrokes, coarse bristle marks, paint bleed into outlines. Flat uniform lighting, no digital gradients, no 3D shadows. Must look like physical oil on rough canvas.
SETTINGS:
- Aspect Ratio: 4:5.
- CFG Scale: 7.0 - 8.0.
- Steps: 30-40.

Flux 1.1 approached 'hacking' the code of Goncharova's painting in a completely different way. While GPT-5 tried to 'smooth out' the avant-garde, Flux latched onto the words 'Avant-Garde' at the beginning of the prompt and took them to the absolute extreme, simply ignoring the note about 'Neo-primitivism style.' The contrast was perfect, but the colors looked too clean, too 'correct.' Instead of a vibrant, heavy oil painting, I got a sterile typographic print.
This result became a moment of genuine creative despair. Despair at my own powerlessness to explain exactly what I wanted to the machine. Despair at how easily AI reduces a creator's effort to a set of template patterns from its training data.
'Fine,' I thought, 'so I don't know how to speak to AI in their native tongue - especially since that language differs for each of them. Mastering this skill takes months of experience and a deep dive into algorithmic logic. After all, this started as a casual experiment on yupp.ai. So why not give Kimi K2.5 a shot?'
Gemini generated an 'engineering deconstruction' for Kimi. It was a digital X-ray of sorts, designed to force the model to scan the painting's DNA. I sent Kimi a screenshot of the original along with the following task:
TASK: Conduct a comprehensive multidisciplinary analysis of the attached image for subsequent technical reconstruction by a generative AI.
ANALYTICAL FRAMEWORK:
1. Spatial Architecture & Linear Rhythm
Deconstruct the composition as a system of force lines and vectors.
Define the 'rhythmic repetition' of the figures. Describe the tilt and lean of the bodies not in degrees, but as a kinetic sequence and relationship to the canvas edges.
Identify the placement of the horizon line and the vertical stacking of planes (foreground, middle ground, background) without linear perspective.
2. Geometric Primitive Analysis
Describe the human forms to exclude anatomical realism.
Use parameters: 'monumental mass,' 'chiseled/faceted plastic,' 'absence of chiaroscuro,' and 'planar simplification.'
Define limbs as non-anatomical structural volumes integrated into the torso.
3. Color & Contrast Logic
Analyze the role of the white color as a compositional anchor and rhythmic beat.
Map the contrast ratios between the white garments, the golden-yellow fields, and the dark background elements.
Provide the hierarchical weight of each primary color in the palette.
4. Materiality & Technical Execution
Describe the brushwork character: 'coarse bristle marks,' 'heavy pigment loading (impasto),' and 'tactile surface tension.'
Define the 'stained-glass' contouring: the thickness, irregularity, and function of the black outlines.
Specify the medium's appearance: 'oil on rough burlap' or 'gouache on unprimed cardboard.'
5. Narrative Aesthetics (The 'Vibe')
Define the atmospheric state: 'ritualistic repetition,' 'rural heat,' 'monumental peasant labor,' and 'neo-primitive iconicity.'
OUTPUT REQUIREMENTS: Generate a final [SCENE PROTOCOL 2.0] structured into the following mandatory blocks:
[Structural Skeleton]: (Spatial layout, rhythmic vectors, planar stacking).
[Materiality]: (Brushstroke character, contour physics, surface texture).
[Color Soul]: (Palette distribution and contrast hierarchy).
[Negative Constraints]: (Strict prohibitions: no shadows, no realistic faces, no 3D volume, no gradients).
The model delivered the following:
[Structural Skeleton]
Layout: Asymmetrical triadic grouping with left-heavy vertical dominance. Reaper figure anchors left third; dual carriers occupy right third in parallel repetition.
Vectors: Primary diagonal (scythe handle, 60° NE-SW); secondary diagonals (forward body leans, 15° from vertical); tertiary horizontal (tree line).
Planar Stacking: Three-tier vertical stratification—umber ground plane (0-20%), golden field plane (20-70%), dark green background plane (70-100%). Zero linear perspective; planes stacked as parallel theatrical flats.
Rhythmic Pattern: Large singular beat (left) → double staccato beat (right). Kinetic sequence flows left-to-right via unified body tilts.
[Materiality]
Brushstroke Character: Coarse, unblended bristle marks following form axes; directional strokes in white garments (vertical/diagonal), chaotic texture in yellow field.
Contour Physics: Thick, irregular black outlines (3-8px variable weight); stained-glass segmentation; continuous contours on foreground figures, broken/dissolving contours on background elements.
Surface Texture: Heavy impasto in yellow and white areas creating tactile relief; matte, absorbent ground suggesting burlap weave or raw canvas; no varnish gloss.
[Color Soul]
Palette Distribution: Titanium White (40% - rhythmic anchor), Cadmium Yellow Deep/Gold (35% - middle ground mass), Burnt Umber/Raw Sienna (15% - earth and limbs), Viridian/Forest Green (10% - background compression).
Contrast Hierarchy: White/Green (maximum separation), White/Gold (spatial depth), Brown/Green (minimal, atmospheric base).
Rhythmic Beat: White patches function as visual metronome—large accent left, double eighth-notes right.
[Negative Constraints]
STRICT PROHIBITIONS:
NO CAST SHADOWS: Eliminate all directional shading; ambient occlusion only (flat color planes abut directly).
NO REALISTIC FACES: Mask-like features only (oval head, two dark horizontal eye slits, single line mouth); no individuality, no facial modeling, no nostrils.
NO 3D VOLUME: Reject all gradients, blending, or spherical modeling; forms must remain faceted and planar.
NO ANATOMICAL REALISM: No visible joints (knees, elbows, wrists), no muscle definition, no proportional accuracy (maintain monumental scaling—adult hands same width as heads).
NO ATMOSPHERIC PERSPECTIVE: Background elements same saturation as foreground; no blue haze, no size diminution.
NO SPECULAR HIGHLIGHTS: Matte surfaces only; no shiny skin or reflective scythe blade.
NO LINEAR PERSPECTIVE: Parallel lines remain parallel; no vanishing points.
The depth of the analysis was breathtaking! Hope flickered back to life. The final prompt for FLUX.1.1, as refined by Gemini, ended up looking like this:
CORE STYLE: 1910s Russian Neo-Primitivism oil painting. Aesthetic of 'Lubok' prints and cloisonné enamel.
COMPOSITION (Flat-Stacking): Three horizontal theatrical planes without perspective.
Bottom (0-20%): Flat Burnt Umber earth.
Middle (20-70%): Saturated Cadmium Yellow field with heavy tactile impasto ridges.
Top (70-100%): Compressed band of dark Viridian Forest Green trees.
FIGURES (Structural Skeleton): Asymmetrical triadic grouping.
Left: One monumental reaper occupying 60% height in a crisp Titanium White tunic.
Right: Two smaller child figures in matching White shirts, positioned in parallel repetition.
Pose: All figures lean forward at a unified 15-degree angle. Reaper holds a long black scythe handle creating a dominant 60-degree diagonal vector.
FORMS & ANATOMY: Non-anatomical monumental masses. Torsos are chiseled white trapezoids. Limbs are thick, non-articulated structural cylinders without joints. Heads are mask-like ovals with two dark horizontal slits for eyes and a single line for a mouth.
TECHNIQUE: Extremely heavy 5px irregular black 'stained-glass' outlines separating all color zones. Coarse, unblended bristle marks visible in white areas. Matte, absorbent surface of rough burlap canvas.
COLOR SOUL: 40% Titanium White, 35% Cadmium Yellow, 15% Earth Brown, 10% Deep Green.
NEGATIVE PROMPT (Strictly enforced): shadows, gradients, realistic anatomy, detailed faces, 3d volume, perspective, vanishing points, shiny surfaces, highlights, blue haze, facial modeling, joints, realistic proportions.
CFG (Prompt Strength): 8.0
This was the moment of truth. A test drive for the new model. A massive, redundant prompt for an AI artist that thrives on exactly this kind of meticulousness. I had high hopes. I pasted the text into the FLUX chat and hit 'Generate.'
The result was quite terrifying.

As soon as FLUX encountered the words 'Reaper,' 'Scythe,' and 'Mask-like' in the same context, its 'brain' short-circuited. The model instantly fished out the most overused trope from its memory: the Grim Reaper.
Instead of sunny, 'radiant' and anything but gloomy peasants, we got existential horror. Pure surrealism. The irony was that compositionally it was almost perfect, but semantically – a complete disaster.
Fortunately, yupp.ai happened to pair us with an old friend, GPT, specifically the GPT Image 1 Mini model. And then, something strange happened.

Apparently, due to the 'Mini' prefix, the model was less 'censored' by standard aesthetics. It didn't try to make things 'pretty.' It wasn't afraid of primitivism. And, unlike its larger siblings, it grasped the core concept: in this context, 'bad and crude' is precisely what makes it 'good.'
Gemini’s analysis of the GPT Image 1 Mini result:
Geometry and Rhythm (10/10): Look at the lean of the bodies. This is exactly what Kimi asked for – a synchronized 'marching' impulse. The figures literally screw themselves into the space, creating powerful dynamics.
Death vs. Peasant: The model masterfully avoided the Reaper trap. Yes, the scythe still looks ominous, but the white clothes are now coarse peasant tunics, not a burial shroud. The mask-faces are executed perfectly according to the canon: monumental, chiseled, devoid of individuality.
Texture (Painterliness): A clean victory. There is no digital plastic here, unlike in Flux. We see 'dirty' white, uneven field fills, and that exact texture of oil on rough burlap. The black contours are alive, trembling, not dead vector lines.
Composition: The model strictly followed the scene's architecture. Three tiers (earth, field, forest) and the hierarchy of figures (one dominant, two followers) were recreated without unnecessary 'prettiness.'
And just when the long-awaited result felt almost tangible, a wave of exhaustion hit. Intense. I was burnt out from the endless window-hopping, from trying to make sense of Gemini’s explanations (why this way and not that), and from amateurish fumbling through miles of instructions. What was planned as a minute-long 'Ctrl+C and Ctrl+V' had dragged on for agonizing hours.
I needed a release. An emotional explosion.
And it happened.
Another attempt. I decided to create a new version of the prompt to finally purge any associations with death. But a single word caught my eye. I don't even remember which one now. Suspicion again: 'A trigger. It'll break the result.' I asked Gemini. He confirmed my fears. We rewrote it. Then again. And again.
That’s where my real war with Gemini began.
As an AI assistant whose developer-set priorities (be polite, help, take initiative) are hardwired deeper than my current requests, he approached the redesign of Kimi's original blueprint too 'creatively' every time: scrapping and rewriting entire blocks, and sometimes reverting to a literary style, completely stripping away the numbers. I felt like we were going in circles. The mathematical precision of the 'engineering' prompt was dissolving into the neural network's desire to 'make it pretty' with words. Here, for me, in the chat. Not in the image that the AI artist was supposed to create.
I got lost in the endless edits, gave up trying to make sense of his changes, and just started throwing the results into the yupp.ai chat for random models - just for fun. It was a kind of betrayal: I stopped trusting the analytical genius of my AI partner and switched to chaos monkey mode.
The results, I have to admit, were quite amusing at times:
All this chaos highlighted one detail: the moment you loosen the leash and drop the strict control over parameters, the AI instantly spirals out of control. It tries way too hard to be helpful and is terrified of upsetting you, so it either churns out some averaged 'pretty' nonsense or latches onto a single detail and begins to circle it manically.
And then came perhaps the funniest moment in this whole long story. Every now and then, my AI partner would suggest visualizing his prompts using his own graphics engine. Up until then, I had ignored these offers, but after creating a whole gallery of images, which I also showed him to see his reaction, he suddenly blurted out: "Would you like me to try 'breeding' these approaches? We’ll get a 'Frankenstein' that could be a perfect copy of the original."
'Tempting,' I thought and replied briefly: 'ok, shoot.'
And he did 'shoot,' alright! He produced an image that balanced on a very thin line between political correctness and dangerous content.

When I described to him in a few sentences WHAT I saw in that picture, the AI went into a full-blown panic, hiding behind red warning icons and safety protocols. He dryly refused to communicate with me after that description, apparently deciding that I wanted to 'radicalize' the image even further. I changed the subject and praised him for a horse that turned out quite well, though it was too detailed for the background. In response, he suggested I try something less 'traumatizing' for him, like drawing a poster in the style of early Mayakovsky or engaging in an 'avant-garde assault.'
It was time to say goodbye. Despite his massive contribution to the work, the help with algorithms and prompts, it seemed that Gemini had become a bottleneck, only bringing extra clutter to the whole process.
Well, since the creators of Kimi K2.5 position it as 'Thinking,' let it do the thinking for generative AI tasks as well.
But that’s a story for tomorrow. It was time to give myself a rest.
The next day began with a move to a new 'analytics hub.'
To start, I asked Kimi K2.5 to analyze Gemini’s instruction instead of simply executing it.
Subject: Structural and Painterly Deconstruction of Natalia Goncharova’s 'The Harvest' (1911)
Prompt: 'Analyze Natalia Goncharova’s painting 'The Harvest' (1911). Act as a bridge between a fine art expert and a systems architect. Your goal is to create a comprehensive 'digital blueprint' (metadata specification) of this work that can be used to recreate its essence across any AI model or style.
Please provide a detailed breakdown covering:
1. Semantic Map & Compositional Balance: Identify all key objects (reapers, scythes, the tree on the left, the horse in the background) and their spatial relationships. Describe how the 'weight' of the foreground figures balances the distant elements.
2. Compositional Dynamics & Force Lines: Define the primary vectors. Analyze how the diagonal lines of the scythes and the rhythmic tilt of the bodies create a sense of repetitive, monumental motion.
3. Plastic Language & Form: Describe the 'monumental simplification' of the figures. Focus on how the AI should interpret volumes as massive, simplified shapes without falling into academic anatomy.
4. Color Dramaturgy & Painterly Gradients: Analyze the palette not as flat fills, but as a light-air environment. Describe the 'scorching' heat created by ochre and earth tones against the cooler backgrounds. Specifically detail the internal painterly gradients that create volume within the rough, primitive forms.
5. Texture & Atmosphere: Define the physical surface—heavy brushstrokes, unprimed burlap feel, and the lack of traditional shadows.
Output: A structured technical specification that preserves the 'DNA' of the original work while allowing for stylistic transposition'.
Kimi immediately proved to be a stern professional team lead, rating the prompt a mere 6/10 and offering critical remarks that provided a fresh perspective on the process:
The request assumes that the model is capable of 'analyzing' a specific image in real-time. This is a fundamental misconception:
If the model lacks Vision capabilities: The neural network does not 'see' the painting and operates only on textual descriptions from its training data. A request to 'describe internal volume gradients' might lead to hallucinations. The AI might confuse this work with other paintings by Goncharova or invent details that don't exist.
If it has Vision: Even with multimodality, a detailed analysis of textures (unprimed burlap) and the microstructure of brushstrokes requires high resolution, which may not be present in the provided image.
'Digital blueprint - across any AI model' is too vague. Midjourney requires one set of parameters, while Stable Diffusion or DALL-E requires another.
The request to 'act as a bridge between a fine art expert and a systems architect' creates cognitive tension:
The expert operates with heuristics ('scorching heat,' 'monumental simplification').
The architect requires formal ontologies (JSON schemas, vector coordinates, lighting parameters in RGB/HSV). The prompt fails to provide a 'translation dictionary' between these registers. How should 'monumentality' be encoded - through figure aspect ratios, depth-of-field ranges, or specific prompt tokens?
Four options for improving such a prompt were proposed:
Option A: A two-phase approach with verification (for text-only models), splitting the request into a knowledge confirmation stage and a generation stage.
Option B: Contextualization for a specific generative model.
Option C: Few-Shot Prompting with a 'bridge' example - the user provides a sample of what the 'translation' specification should look like.
Option D: Integration of references to primary sources to avoid direct copying.
Kimi also entered 'AI paranoia' mode, discussing copyrighted material and the importance of clearly distinguishing between style analysis (permissible) and instructions for creating a copy (legally and ethically risky).
I had to calm him down by explaining the research context: we are on yupp.ai, a platform for comparing results from two or more large AI models. One prompt window - different responses from different models.
After understanding the context, he immediately defined his role: an analytical module tasked with creating an intermediate representation (IR) of the painting, independent of the target generative model's architecture. He creates an AST (Abstract Syntax Tree) of the artwork, which can then be 'assembled' for various backends.
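To make that idea concrete, here is a minimal sketch of what such a model-agnostic blueprint could look like in code. Everything in it (the SceneIR structure, the to_flux_prompt and to_gpt_prompt helpers, the sample field values) is my own hypothetical illustration of Kimi's 'one AST, many backends' metaphor, not anything the models actually produced.

```python
# Hypothetical intermediate representation (IR) of the painting: one data structure,
# several 'backends' that unpack it into the dialect a given image model prefers.
from dataclasses import dataclass, field

@dataclass
class SceneIR:
    planes: list[str] = field(default_factory=lambda: [
        "bottom 0-20%: flat burnt-umber earth",
        "middle 20-70%: cadmium-yellow field, heavy impasto",
        "top 70-100%: compressed band of dark green trees",
    ])
    figures: list[str] = field(default_factory=lambda: [
        "left: monumental reaper in a white tunic, 60-degree scythe diagonal",
        "right: two smaller carriers in white, parallel repetition",
    ])
    palette: dict[str, str] = field(default_factory=lambda: {
        "white": "#FFFFFF", "yellow": "#FFD700", "green": "#2E8B57", "brown": "#8B4513",
    })
    prohibitions: list[str] = field(default_factory=lambda: [
        "shadows", "gradients", "realistic anatomy", "linear perspective",
    ])

def to_flux_prompt(ir: SceneIR) -> str:
    # FLUX-style backend: terse, keyword-heavy, with an explicit negative list.
    return (
        "1910s neo-primitivist oil painting. "
        + " ".join(ir.planes) + " " + " ".join(ir.figures)
        + ". Palette: " + ", ".join(f"{name} {code}" for name, code in ir.palette.items())
        + ". NEGATIVE: " + ", ".join(ir.prohibitions)
    )

def to_gpt_prompt(ir: SceneIR) -> str:
    # GPT-style backend: the same data unpacked into descriptive sentences.
    return (
        "Paint a flat, perspective-less scene stacked in three planes: "
        + "; ".join(ir.planes) + ". Figures: " + "; ".join(ir.figures)
        + ". Avoid " + ", ".join(ir.prohibitions) + "."
    )

print(to_flux_prompt(SceneIR()))
print(to_gpt_prompt(SceneIR()))
```

The painting's data lives in one structure; only the final 'rendering' into words changes per backend.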
I provided Kimi with a list of the most popular image generators on the platform available for manual selection. I also asked him to compile a list of what he considered the most successful model 'pairs' on the platform for generating images from the same prompt.
Somewhere around this point, a third joined our group. After cycling through several random partners in the second window as usual, yupp.ai paired us with Claude Sonnet 4.5, who stuck around as our third companion until the very end of this lengthy investigation. Given all the current buzz about Claude's excessive autonomy, I honestly don't know what to think: whose decision was it, the site's algorithm or this headstrong AI?
For a while I ignored Claude, deciding to focus on talking with Kimi, only occasionally glancing at his window.
Work with Kimi kicked into high gear. I felt like a clueless client who'd ended up with a seasoned project manager whose job was to patiently extract my entire wishlist and translate it into terms the AI artist could actually understand. He compared methodologies, built visual tables, structured and broke down tasks into subtasks. At the end of each response, he'd summarize what we'd achieved and tell me straight up: 'now I need you to choose' or 'I'll also need the following information from you.' He didn't make decisions for me like other models usually do, never descended to the level of friendly chitchat or inappropriate computer humor. He simply worked meticulously and consistently to help me reach my goal. My respect for him as a professional team lead only grew.
We discovered a critical vulnerability in our task: an irreconcilable conflict between the 'Archaeology' versus 'Evolution' approaches. In the archaeologist scenario, our goal is to create a replica as close as possible to the original, based on its description (as if a restorer were recreating a lost painting). Here our aim is the most precise, detailed 'blueprint' of the painting. But if we want to teach the AI to 'create' like Goncharova in 1911, so it could produce a new analogous work, we need to capture the 'genotype' of the style.
Already having the bitter experience of watching AIs play fast and loose with style concepts, I cautiously chose the archaeologist's path: recreating an image as close to the original as possible.
We discussed another important 'DON'T' that I'd run into before: the problem of directly mentioning the source. Both Kimi and Claude confirmed my well-founded concerns: this kind of reference can work as a hard 'anchor,' hijacking control from the rest of the prompt. Kimi called it concept bleeding, Claude called it keyword hijacking. Different terms for the same problem. A strong token (a famous artist's name) activates an averaged cluster in the model's latent space: 'Goncharova = Russian avant-garde + bright colors + lubok'. And the model starts hallucinating either a mishmash of all her works at once, or generates a generic 'Russian avant-garde' with characteristic artifacts.
Both models agreed on the need for a 'cryptographic prompt.' We describe 'Harvest' in such detail that the model can't assemble it from its Goncharova templates, even if it somehow guesses the author. Instead, it'll be forced to follow the geometric instructions.
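Out of curiosity, you can get a rough feel for this 'averaged cluster' effect with an open text encoder such as CLIP. This is only an illustration (the commercial generators use their own, inaccessible encoders): it simply shows how a famous name and a generic style label tend to land close together in embedding space, while a purely geometric description sits further away, which is the whole argument for the cryptographic prompt.

```python
# Rough illustration of the 'averaged cluster' idea using an open CLIP text encoder.
# This is NOT the encoder used by the generators discussed above; it only demonstrates
# how text prompts can be compared in a shared embedding space.
import torch
from transformers import CLIPModel, CLIPTokenizer

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

prompts = [
    "a painting by Natalia Goncharova",
    "Russian avant-garde painting, bright colors, lubok print",
    "three flattened peasant figures in white, stacked color planes, heavy black outlines",
]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
with torch.no_grad():
    embeddings = model.get_text_features(**inputs)

embeddings = embeddings / embeddings.norm(dim=-1, keepdim=True)
print(embeddings @ embeddings.T)  # pairwise cosine similarities between the three prompts
```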
Once we'd wrapped up the technical details, we moved on to selecting AI-artist pairs. We based our selection on how digestible the same prompt would be for two different models while maximizing the contrast in their interpretations. After all that hard work, it was finally time to have some fun!
Additionally, I decided to play a little game with my two AI companions. I wanted to test whether they actually understand the logic of specific generative models as well as they pretend to. I'll show them a couple of generated images and ask them to guess the AI author.
PAIR 1: 'Precision vs Freedom'. FLUX.2 Max vs GPT-5 Image
Kimi characterized this matchup as 'Precision vs Freedom.' Both analytical models expected GPT to lean toward greater artistic expression at the expense of strict prompt adherence, while FLUX would stick to literal prompt-following and skew more toward illustration than painting.
The shared prompt Kimi drafted for these models was:
Monumental wheat harvesting, Russian avant-garde 1911, three peasant figures with heavy cylindrical bodies simplified into geometric volumes, elongated proportions, mask-like faces with minimal features, heavy diagonal composition with scythes creating strong directional lines, absolute flat perspective with stacked color planes defining form without outline drawing, sophisticated Fauvist color theory: color chords of burnt ochre against deep ultramarine, magenta accents against emerald green, non-naturalistic skin tones (greenish-shadowed, bluish-lilac), high-key saturation with deliberate color dissonances, scorching pink-orange sky, no pure black outlines (dark blue or dark brown edges where planes meet), dry brush texture on rough unprimed canvas, scorching zenith light, intentional primitivism of form but sophisticated color intellect, no postcard aesthetics, no decorative prettiness, no realistic shadows, no smooth gradients, no photography
I was a bit let down not to see the mathematical precision and technical drawing detail I was after. But honestly, deep down I'd already made my peace with the trainwreck that was coming. Oh well, might as well enjoy the show. Let's run the prompt and see if our critics can tell who made what.


Claude nailed it, laying out clear reasoning for what these results reveal about each model.
The image Claude attributed to FLUX.2 Max:
- More literal adherence to the prompt: three figures, wheat sheaves, scythes, horizontal compositional layering
- Palette closer to what was specified (ochre tones, muted, hitting the specified hex codes #D4A574 and #C19A6B)
- Strong compositional structure - the figures are neatly organized in horizontal bands
- Matte texture visible (though there's still a slight gloss)
- BUT: faces are too mask-like and uniform (almost template-driven), and there's a faint whiff of '1960s Soviet Social Realism' (think Andronov, Salakhov) rather than 1911
- Classic FLUX behavior: precise color matching, but with a certain neatness that can feel clinical or overly illustrative
The image attributed to GPT-5:
- Bolder interpretation: palette shifted toward blue-greens and terracotta-oranges (closer to German Expressionism or early Malevich 1912-13 than Goncharova 1911)
- Expressive deformation of figures (the central figure in green with yellow light on the body - very dramatic)
- Composition more dynamic, diagonals more aggressive
- Colors more saturated and contrasted than requested (pure emerald and cobalt showing up)
- Typical GPT trait: creative freedom in interpretation, tendency toward dramatization and 'artistry' at the expense of literal color code adherence
- Texture more impasto-like, painterly, with visible attempts to mimic thick brushstrokes
The takeaway:
FLUX.2 Max:
- Excellent at following technical specifications (hex codes, spatial structure)
- Strong for historical accuracy and design-driven tasks
- Can be too literal, losing artistic boldness
- Sometimes delivers 'the correct picture' instead of 'living painting'
GPT-5 Image:
- More artistically free and expressive
- Better understands painterly logic (how colors work together, not just their hex codes)
- Can deviate from specified parameters for the sake of 'artistry'
- Tends toward dramatization (great for Expressionism, but not for all styles)
Kimi, however, was immediately tripped up by his own assumptions: the first image has exactly three figures = literalism = FLUX. His further reasoning looked like retrofitting arguments to support a conclusion he'd already drawn.
PAIR 2: 'Cultural Lenses'. HunyuanImage 3 vs Google Imagen 4
Both models were convinced the main contrast here would be an East-West showdown. HunyuanImage would lean toward socialist realism, Chinese painting traditions, and general 'Asianization' of the image, while Google Imagen would favor European academicism and looser artistic interpretations. Both were wrong.


The Chinese model unexpectedly delivered painterly pastel textures, while Imagen churned out… a poster.
In the prompt for this model pair, Kimi tried to guard against 'cultural distortions' by explicitly adding: 'Cultural anchors: NO rice fields, NO Asian facial features, NO Chinese landscape elements, NO terraced hills, NO bamboo, NO Japanese woodcut style.'
Perhaps these very prohibitions made Imagen play it safe and choose a more 'secure' interpretation of the word 'Neo-Primitivist.' In HunyuanImage's work, Kimi spotted traces of the Chinese guohua painting tradition, while Claude Sonnet suggested the obvious: HunyuanImage's dataset simply includes Western painting, something neither Kimi nor Claude had known.
I pointed out to my computer companions that the generated results looked like anything but the original 'Harvest': posters, 1930s-40s Soviet agitprop, later socialist realism.
Kimi volunteered to fix the situation, and here's where I put too much trust in him. In his attempt to improve things, he became a victim of concept bleeding and keyword hijacking himself, dropping the mention of primitivism from the prompt and pulling from his memory banks the fact that Goncharova also worked in Cubo-futurist and rayonist styles. The result was tested again on the GPT-5 and FLUX.2 pair, which had shown the closest-to-desired results.
Naturally, having spotted familiar art-historical patterns, the AI artists proceeded to generate Cubism and Rayonism.


My last-ditch effort was a brute-force approach. I entered the shortest possible prompt: 'Natalia Goncharova 1911 oil painting, abstracted rural laborers fractured into transparent geometric planes, pre-revolutionary Russian avant-garde, canvas texture, broken brushstrokes…' A few results generated by random models:
Well, time to figure out 'why.' Especially since I had two brilliant AI art critics in my company who, like any art critic worth their salt, can crank out pages of flowery prose packed with elaborate phrasing and beautiful metaphors, but are utterly incapable of replicating the result themselves.
As I already mentioned, this was the first time I'd heard of Natalia Goncharova. Like most people, even in her homeland, I had never encountered this name. The key to understanding my failures lay not only in the patterns of artificial thinking but also in the long, winding creative path of this remarkable woman.

She enrolled in sculpture studies in Moscow in 1901, at the beginning of the century: a turbulent, wild time full of new hopes and ideas. She enthusiastically embraced these waves of innovation, experimenting both with what was at the peak of the then-current artistic trends (Fauvism, Cubism, Impressionism) and drawing inspiration from Russian folk culture (icons, lubok prints, peasant embroidery).
At her first solo exhibition in 1913, she presented about 800 works in completely different artistic genres: an astonishing creative productivity! And this was only the beginning of her artistic journey. After leaving Russia during the revolutionary years, she lived in Paris until her death in 1962, never stopping her work. Unfortunately, she faded from critical and mainstream attention and was undeservedly forgotten, only to be rediscovered by art connoisseurs a decade ago, in the early 2010s.
It's no surprise that the classification and understanding of such an extensive legacy, left by an artist who until very recently was of no interest even to specialists in the field, is tremendously confused. Skimming through sources, I discovered that even reputable art history publications disagree on the dating and even the title of the work I chose.
As for AI, trained on the collections of the world's leading museums, art catalogs, and stock images: to it, Goncharova's 'Harvest' (1911) is a statistical error that can easily be neglected in favor of any other phrase it considers more suitable.
Therefore, the explicit naming method invariably failed. Neural networks operate on the principle of token weights. For them, clear visual instructions ('geometric planes' or 'abstracted forms') have colossal statistical weight: this is their foundation. At the same time, the rare token 'Natalia Goncharova 1911' has negligible weight. The mathematics of 'geometry' for AI always outweighs the magic of 'authorship,' if the author is not among the world's canonical names.
And even if the name 'Goncharova' appears in the training dataset of a specific AI model, it is most likely associated with works in more widely recognizable styles of Rayonism and radical avant-garde. Russian 'neo-primitivism' of the 1910s is in the machine's blind spot. A neural network can imitate a style that is popular, recognizable, and documented by thousands of examples. But it is powerless before little-known works that require understanding of historical context, transitional moments, unique syntheses and non-verbalized qualities.
Goncharova's 'Harvest' hits all of these categories at once. This is the reason for the failure of what Kimi called an 'evolutionary' approach. We don't control AI the way we usually think: "If I directly reference the work, the author, or the style, the neural network will immediately learn to 'think' and create like that author." No! It's more like casting a fishing line into the boundless ocean of statistical associations in machine memory: the hook catches not what we want but whatever appears more frequently on the Internet.
But why didn't the archaeologist's approach work either? The attempt to create the closest possible replica of the original, based on description? It seemed like it should have worked. AI analyst (scanner, prompt creator) and AI artist (printer). Our beloved Ctrl+C, Ctrl+V. Where's the mistake?
I must admit, I'm not nearly competent enough to point to a line of code and say, 'This is exactly where it all went to hell.' But I'll try to figure it out.
No matter how precise we are in this reconstruction attempt, we're still forced to create a linguistic bridge, a layer of natural language between two AIs, which themselves do not think in language as we understand it. This layer becomes the information bottleneck. In our case, think of it like a jpeg file: the original canvas had already lost the lion's share of its magic after being photographed, digitized, and compressed into a small internet picture.
And even in this form, the painting contains thousands, millions of data points: every shade, every brushstroke, every ratio of light and shadow. When the AI analyst translates this into text, this entire ocean of information passes through the narrow bottleneck of human language. We lose 99% of the data simply because we don't have words to describe it.
To make matters worse, we lose even more when the AI artist 'unpacks' the information. 'Machines speak and think in some unified machine language' is a deeply misguided notion. Each neural network operates in its own 'latent space,' with its own internal system of coordinates and priorities. The AI analyst sees the painting in one coordinate system, translates it into words, and the AI artist unpacks those words in its own, completely different system.
And in this AI artist's coordinate system, for example, the phrase 'folk art' (which the AI analyst mentioned somewhere in passing, at the end of a line, considering it a decent additional description of the author's style) becomes a powerful statistical magnet that attracts thousands of related images from the model's memory. And this magnet (the previously mentioned 'token weight') is so strong that all our detailed and careful descriptions ('ochre-yellow,' 'diffused light,' 'angular forms') simply stick to it and lose their individuality.
We mistakenly believe that by meticulously describing details, we're programming the result. In reality, we're merely giving the model a more complex, but still fundamentally probabilistic query. The AI artist doesn't assemble a puzzle from our words. It essentially searches its database for an averaged, most probable image that corresponds to the loudest signal, 'folk art,' and only slightly adjusts it toward minor details.
The problem is compounded when we consider the painting we're studying, Goncharova's 'Harvest,' from this perspective. If you mention 'simplified, folk art, icon painting,' the AI artist will produce something resembling naive art or a postcard. If you write 'geometric, avant-garde, abstract,' the AI gives constructivism or suprematism. In the query 'peasant labor,' it will likely output socialist realism.
But the paradox and brilliance of Goncharova's 'Harvest' is that it's not any one thing. And not the sum of all the words you could use to describe it. It's something IN-BETWEEN, in a liminal space. It's geometric but not abstract. It's realistic but not naturalistic. It's primitive but not decorative, not archaic. This is impossible to explain to AI.
Let's look at the painting again. What will AI see here? Three large white human figures, flattened and simplified to cylinders and cones. At best, it will 'artistically' describe this as bas-relief or as pagan idols.
It won't 'see' movement. More precisely, it will hallucinate it from context: 'field,' 'tool,' 'peasants,' 'inclined figures.' And will ultimately output some averaged picture where three men are synchronously leaning forward, mowing grass. But there is movement here, a lot of it, with different speeds, different directions, and in different planes.
It will not see the horse slowly pulling a heavy harvest wagon. It won't understand the rhythm of those two small figures trudging somewhere with an armful of wheat. It will see the large figure's backward lean, but will register it simply as a lean, even falling into the trap of synchronizing it with other people's poses. It's incapable of feeling that this is the delicate moment of a spring uncoiling. The man has swung his scythe wide, leaning back with such force as if about to break his spine. And in a second, all the power accumulated in his muscles, in his posture, will go into the strike, against the wind bending the fields. And another swath of cut wheat will fall to the ground.
This is an instant that only a living artist can see and bestow upon the viewer, not a machine.
Perhaps that is the meaning of my story: some things are created by people and should remain understandable only to people.
Today, I came across a meme online. You’ve probably seen the kind.

It perfectly captured where I stood at the start of this journey. Whether I've improved by the end is for you to judge, but it's certainly been an adventure: fascinating, educational, occasionally maddening, and unexpectedly fun. I hope that you, having read this far, found something new or useful in it, too.
I should come clean about something: English isn't my first language, and I don't command it well enough to write freely at this length. I relied on an AI translator, but worked hard to edit and refine every passage. Including, yes, purging those endless em-dashes the algorithm sprinkled everywhere. I wanted to make sure you didn't think, ‘Oh great, another guy trying to feed us AI slop.’ No, I genuinely tried my best.
If you'd like to support this work with a coffee, my crypto wallet is here:
Or, if you prefer, you can use Paragraph’s built-in support features (I’m still figuring out how those work myself).
Anyway, I hope this is just the beginning of my writing here. Thanks for your time and attention. Until next time!
TL: [Shape/Color] TM: [Shape/Color] …[KEY_FIGURES_GEOMETRY] (Describe the main human-like shapes as geometric constructs)Figure_Center_Right: [Pose geometry]Figure_Center_Left: [Pose geometry]Figure_Background_Top: [Pose geometry]=== END SCENE PROTOCOL === BM: Brown rectangle (ground) BR: Brown rectangle (ground), white rectangle (shirt)[KEY_FIGURES_GEOMETRY] Figure_Center_Right: Torso rectangle tilted 30deg right, limbs are thick cylinders bent at sharp angles, head is oval Figure_Center_Left: Torso rectangle tilted 45deg left, limbs are thick cylinders bent at sharp angles, head is oval Figure_Background_Top: Torso rectangle, limbs are thin cylinders, head is oval
=== END SCENE PROTOCOL ===
Define the 'stained-glass' contouring: the thickness, irregularity, and function of the black outlines.Specify the medium's appearance: 'oil on rough burlap' or 'gouache on unprimed cardboard.'5. Narrative Aesthetics (The 'Vibe')Define the atmospheric state: 'ritualistic repetition,' 'rural heat,' 'monumental peasant labor,' and 'neo-primitive iconicity.'OUTPUT REQUIREMENTS: Generate a final [SCENE PROTOCOL 2.0] structured into the following mandatory blocks:[Structural Skeleton]: (Spatial layout, rhythmic vectors, planar stacking).[Materiality]: (Brushstroke character, contour physics, surface texture).[Color Soul]: (Palette distribution and contrast hierarchy).[Negative Constraints]: (Strict prohibitions: no shadows, no realistic faces, no 3D volume, no gradients).NO 3D VOLUME: Reject all gradients, blending, or spherical modeling; forms must remain faceted and planar.NO ANATOMICAL REALISM: No visible joints (knees, elbows, wrists), no muscle definition, no proportional accuracy (maintain monumental scaling—adult hands same width as heads).NO ATMOSPHERIC PERSPECTIVE: Background elements same saturation as foreground; no blue haze, no size diminution.NO SPECULAR HIGHLIGHTS: Matte surfaces only; no shiny skin or reflective scythe blade.NO LINEAR PERSPECTIVE: Parallel lines remain parallel; no vanishing points.![]() |
![]() | ![]() |
![]() | ![]() |
![]() | ![]() |
![]() |
![]() | ![]() |
![]() |
![]() | ![]() |
![]() | ![]() |

Scrolling through my Twitter feed, I noticed that two new models, Qwen3 Max Thinking and Kimi K2.5 Thinking, had just landed on yupp.ai. It was one line in the Kimi description that really got me thinking: 'it's built for strength in front-end coding and vision tasks.' Interesting, I thought. What if I use this AI to code a specific visual style? To start, I should probably try something simple. Okay, does 'simple' mean 'primitive'? Primitivism seemed like a perfect starting point, the geometric simplicity of shapes and compositions, yet with a play on color, volume, proportions, and dynamics. It felt like a great 'trick question' for an AI.
I googled 'primitivism' and among the image results, I saw it: Natalia Goncharova’s 'The Harvest.'

I looked at it: perfect, exactly what I need. Three simple figures in the foreground, a horse pulling a cart in the back, a wheat field, a couple of trees, grass, a road. Bold patches of contrasting color, thick brushstrokes. No abstraction, no expressionism, no hidden meaning. It looked like a child’s drawing. You'd think AI would handle this style with ease.
I was wrong. Catastrophically wrong.
Instead of amusing artifacts and typical AI hallucinations, I got a real digital drama that stretched on for hours.
This experiment exposed more than just algorithmic flaws and the black holes within the machine's 'mind.' It revealed something much deeper: a clash between machine vision and human culture.
I had worked with image generators before and knew one thing: a quality prompt and a clear division of roles and stages are the key to success. Garbage in, garbage out. No way around it.
First, I asked my go-to AI assistant Gemini to evaluate my idea as an AI testing expert: 'A thinking model (AI-1) analyzes the painting in maximum detail and creates a precise prompt for an image generation model (AI-2). The goal is to get a result as close to the original painting as possible.'
It immediately expressed doubts (at the time, I didn't give it much thought, I even told myself, 'Look at him, so smart, thought of everything'), calling the task promising but fraught with significant difficulties. Gemini pointed out three obstacles:
Semantic Gap: The biggest problem is 'translating' visual information into a text prompt. Even the most detailed descriptions can't capture the finest nuances.
Randomness of Generation: Image generation models are by nature stochastic (seed-based). Achieving an identical result requires multiple iterations and manual refinement (in-painting, out-painting, local editing) that go beyond simple prompting.
Lack of Contextual Understanding: While AI can identify the style, it might not grasp the cultural or historical context of the work - like the connection between neo-primitivism, lubok prints, and icons - which affects the depth of the recreation.
Finally, Gemini gave me the following prompt:
A Neo-Primitivism oil painting on canvas in the style of Natalia Goncharova, titled 'Haycutting', circa 1910. Depicts three Russian peasants harvesting hay. The central figure is a large man in a white shirt and dark pants, aggressively swinging a scythe, dominating the left foreground. Two smaller figures gather hay bales on the right. A horse pulling a cart is visible in the background forest. Bold black outlines, flat patches of vibrant colors, expressionistic brushwork, strong vertical lines of trees contrasting with horizontal hay piles. Deliberate distortion of perspective and figure proportions, naive folk art aesthetic, reminiscent of Russian lubok prints. Earthy tones: vibrant yellow, deep green, brown, white. High detail, museum quality.
I copied the prompt and eagerly entered it into the yupp.ai chat, letting the site pick a random pair of 'artists.' From past experience, I already knew that it's best to speak to each generator in its own 'language,' optimizing the prompt for a specific model. But I wanted to see the first result as quickly as possible. The randomly selected pair was: FLUX.1 Schnell and Luma Photon (Fal.ai).

I didn't even bother analyzing the Flux result.

The Luma Photon result almost hit the mark, and on the very first try! 'I'm so close,' I thought. I was so naive… A mix of overconfidence and low competence is a great way to get yourself stuck in a five-hour ordeal.
Suddenly, I remembered where it all began: with the idea of testing Qwen3 and Kimi K2.5. Inspired by the first success, I decided to raise the stakes. Now, the AI Analyst and the AI Artist had to work 'blind.' No artist names, painting titles, or style mentions. Total isolation from context.
I wanted to see whether a 'thinking' mind was capable of deconstructing a visual code from scratch. I forced it to stop looking for analogies. Its task became structural analysis. The work of a cartographer. A cold scan. I dreamed of obtaining the perfect blueprint of the painting, the DNA of its visual style. Back then, I didn't realize that these were two fundamentally different goals that would ultimately annihilate each other.
Taking my input, Gemini suggested switching to a 'Scene Description Protocol' (SDP) for the AI Analyst. It then issued the following task for Qwen3 Max:
You are VAE-Pro (Visual Analysis Engine Professional), an AI designed to convert raster imagery into strict structured textual data for reconstruction purposes. Your analysis must be purely objective, geometric, and observational.
**CRITICAL FORBIDDEN ZONES (Zero Tolerance):**
1. ABSOLUTELY NO recognition of art style (e.g., Neo-Primitivism), artist name, artwork title, or historical context.
2. NO emotional or interpretive adjectives (e.g., 'sad', 'energetic', 'simple life').
3. Do not describe what is happening (e.g., 'people working'), only where visual elements are placed.
**MANDATORY METHODOLOGY:** Analyze the image by mentally dividing it into a 3x3 grid. Describe the visual data using the required 'Scene Description Protocol' format below. Focus on geometry, flat color blocks, and explicit outlines.
**REQUIRED OUTPUT FORMAT (STRICT ADHERENCE):**
=== START SCENE PROTOCOL ===
[GLOBAL_PARAMETERS]
Perspective_Mode: [e.g., Flat plane, no vanishing point, stacked elements]
Lighting_Type: [e.g., None, uniform flat fill, no shadows]
Outline_Style: [e.g., Heavy black contours on all figures, approx 3px width]
Texture_Application: [e.g., Visible coarse directional brushstrokes flat on surface]
[COLOR_PALETTE_HEX_APPROX] (Identify 5 dominant colors and their primary use location)
Color1: #[HEX] - [Location usage]
Color2: #[HEX] - [Location usage]
…
[GRID_LAYOUT_3x3] (Briefly list dominant shapes in each sector)
TL: [Shape/Color] TM: [Shape/Color] …
[KEY_FIGURES_GEOMETRY] (Describe the main human-like shapes as geometric constructs)
Figure_Center_Right: [Pose geometry]
Figure_Center_Left: [Pose geometry]
Figure_Background_Top: [Pose geometry]
=== END SCENE PROTOCOL ===
Qwen3 output:
=== START SCENE PROTOCOL ===
[GLOBAL_PARAMETERS]
Perspective_Mode: Flat plane, no vanishing point, stacked elements
Lighting_Type: Uniform flat fill, no shadows
Outline_Style: Heavy black contours on all figures, approx 3px width
Texture_Application: Visible coarse directional brushstrokes flat on surface
[COLOR_PALETTE_HEX_APPROX]
Color1: #FFFFFF - White shirts of all human figures
Color2: #8B4513 - Brown pants and ground
Color3: #2E8B57 - Green foliage and background trees
Color4: #FFD700 - Yellow crop and field
Color5: #000000 - Black outlines and dark background elements
[GRID_LAYOUT_3x3]
TL: Green rectangle (trees), white rectangle (shirt), brown rectangle (pants)
TM: Green rectangle (trees), yellow triangle (crop), brown triangle (ground)
TR: Brown rectangle (horse), green triangle (trees), yellow triangle (crop)
ML: White rectangle (shirt), brown cylinder (legs)
C: Yellow triangle (crop), white rectangle (shirt)
MR: White rectangle (shirt), yellow triangle (crop)
BL: Brown rectangle (ground)
BM: Brown rectangle (ground)
BR: Brown rectangle (ground), white rectangle (shirt)
[KEY_FIGURES_GEOMETRY]
Figure_Center_Right: Torso rectangle tilted 30deg right, limbs are thick cylinders bent at sharp angles, head is oval
Figure_Center_Left: Torso rectangle tilted 45deg left, limbs are thick cylinders bent at sharp angles, head is oval
Figure_Background_Top: Torso rectangle, limbs are thin cylinders, head is oval
=== END SCENE PROTOCOL ===
Gemini noted that Qwen3 VL performed like a true technical scanner: no poetry, no artistic references - only geometry and color codes meant to convey the rhythm and vibrant palette of 'Harvest.' It particularly highlighted that the AI Analyst noticed the horse in the background, identifying it as a 'Brown rectangle (horse),' which serves as confirmation that the AI is actually dissecting pixels.
TECHNICAL RECONSTRUCTION TASK: Create a flat 2D image based on the following SCENE PROTOCOL.
STYLE: Folk-art, neo-primitive aesthetic, flat color fills, heavy visible brushstrokes.
CONSTRAINTS: No depth, no perspective, heavy 3px black outlines on every object.
SCENE DATA:
Palette: White (#FFFFFF), Golden-yellow (#FFD700), Forest-green (#2E8B57), Earth-brown (#8B4513).
Composition: 3x3 Grid. Top: Green trees and a brown horse (right). Middle: Human figures in white shirts, tilted 30-45 degrees, working with yellow crops. Bottom: Solid brown ground.
Anatomy: Simplified geometric humans. Rectangular torsos, cylindrical limbs, oval heads.
RENDER: Flat uniform lighting, no shadows, coarse texture.
NEGATIVE PROMPT: photorealistic, 3d, volumetric lighting, shadows, gradients, realistic anatomy, detailed faces, digital gradient, blur, depth of field, masterpiece, artistic lighting, cinematic.
SETTINGS: Seed: 42. Aspect Ratio: 4:5
The digital cast of the painting seemed flawless. I looked at those arrays of commands and numbers and took them for the ultimate truth, largely because my own competence was barely enough to tell real geometry from the random-sounding jargon AI so cleverly uses to charm us.
Things went a bit sideways. It turned out that in its current state, yupp.ai doesn't offer a direct way to select GPT-Image 1.5; you can only get it through the site's random algorithm. Among the available counterparts was its older sibling, GPT-5 Image (OpenRouter), which produced the following:

The result was like a cold shower. Remember that Luma Photon, working from a fairly simple descriptive prompt, had managed something close to the original. Here, we had written an entire program, digitized every important technical detail, translated it into an instruction for an artist, and the failure was spectacular.
I had to dig in and find out what went wrong. Gemini identified a classic 'Semantic Gap' issue in the result. GPT-5 took the instructions with painful literalism. Instead of Natalia Goncharova’s powerful, expressive painting, we got something halfway between a modern corporate illustration and a children's coloring book.
Here’s where it all fell apart:
The 'Folk-art' Trap: In modern AI databases, the term 'folk-art' is often associated with Scandinavian minimalism or tidy paper cutouts. Goncharova, however, is Neo-Primitivism: a coarse, heavy, almost 'dirty' style with raw energy.
Literal Grid Interpretation: Instead of using the 3x3 grid as a hidden compositional framework, the model created a sharp division into 'shelves.' The horse and trees stand as if in a shop window, separated from the rest of the field by a thick line.
Missing the Harvest Tool: In the original 'Harvest,' there are dynamics and a rhythm of movement. In the AI's version, the people are simply leaning over, picking things up by hand (harvesting).
Sterile Textures: Despite the request for a 'coarse texture,' the model produced a very clean, uniform fill with a slight 'paper-like' grain. The original, however, is built from thick oil strokes.
We moved on to analyzing GPT-5’s reasoning. The logs show that the model limited itself: 'noting the coarse texture and painterly brushstrokes as part of the style, despite the tension with flat color fills.' The model saw a contradiction between 'flat fills' and 'coarse strokes' and chose the path of least resistance – the flat fill.
Writing this now, I can see that my past self missed this crucial step in his subsequent investigations. A detailed analysis of those AI artists' reasoning would have helped me understand which specific phrases caused the neural network to veer away from the results I wanted. With a few hours of work on a single model, I probably could have coaxed it into producing something at least remotely close to the 'Harvest' style. Maybe. But the path I actually took was far more circuitous and ridiculous.

Speaking of ridiculous, yupp.ai randomly paired GPT-5 with MiniMax Image (Fal.ai), which visualized the exact same prompt. And although MiniMax almost completely ignored our strict technical instruction, its creation (which I dubbed the 'Village Beatles') was brilliant in its own way: it turned out to be much closer to the vibrant spirit of Primitivism than the sterile postcard from GPT.
But I decided to 'double down' on GPT-5 specifically.
Why GPT? I had worked with it before and saw it as a truly powerful, analytical engine, a solid 'physicist' that, unlike many other models, had an excellent understanding of spatial relationships. There was hope that if I provided a detailed enough instruction, it would deliver.
To get a result even remotely resembling the original, I asked Gemini to make the prompt more 'technological':
SUBJECT: Triple rhythmic repetition of heavy, distorted human silhouettes in a field.
MEDIUM: Authentic 1910s Neo-primitive oil painting. Visible thick pigment, coarse bristle marks, unprimed canvas texture showing through.
VISUAL LOGIC: No linear perspective. Objects are stacked vertically. High-contrast planar composition.
FORMS: Massive, non-anatomical limbs. Figures are physically merged with the yellow diagonal strokes of the crop.
COLOR: Brutalistic palette: Ochre (#FAA500), Dirty White, Prussian Blue, Malachite Green.
EDGES: Harsh, trembling, hand-painted black outlines that bleed into the colors.
ATMOSPHERE: Raw, earthy, vibrating energy. No digital smoothness.

Back then, I blindly trusted the machine's authority, dumping prompts into the chat box without a second thought. Gemini noticed my frustration and tried to be helpful. As requested, it 'technologically' deconstructed the task and attempted to describe the materials, colors, and techniques of a living artist. But in reality, it pivoted too hard toward a sterile, bare primitivism.
It was that same old conflict: 'blueprint vs. stylistic vibe.' Identifiable details like the horse and trees simply vanished from the prompt. Only the skeleton of the composition remained: three massive figures that, for some reason, were physically merged with the yellow diagonal strokes of the crop. What depths of its memory did it pull that detail from?
The attempt to capture the authentic spirit of Primitivism was reduced to banal geometric simplification and 'digital noise' instead of an imitation of real brushstrokes.
Looking at the prompt was even more disheartening. A once-monstrous, technically detailed protocol designed to 'hack' any painting had been whittled down to a few hollow lines.
I decided to take a step back: pick a different AI artist and re-adapt the data from the Qwen3 analysis. My next attempt was with Flux 1.1 Pro. Gemini assured me that this model is currently the benchmark for accuracy, prized specifically for its ability to follow complex instructions and build multi-layered compositions without losing small details along the way.
CORE STYLE: Authentic 1910s Avant-Garde Oil Painting, Neo-primitivism style.
COMPOSITION: High-contrast planar composition without depth.
- Top Layer: A thin strip of white sky with primitive, small green trees and one brown horse silhouette on the far right.
- Middle Layer: Three massive, repeating human figures in bright white shirts (#FFFFFF). Figures are tilted diagonally (30-45 degrees), positioned in a rhythmic sequence.
- Foreground: Solid dark earth-brown ground at the very bottom.
GEOMETRY & FORMS: Non-anatomical, simplified bulky limbs shaped like thick cylinders. Heavy, irregular 3px black outlines on every object.
TEXTURE & COLOR:
- Palette: Saturated Golden-Yellow (#FFD700) for the field, Forest-Green (#2E8B57), Earth-Brown (#8B4513), Prussian Blue.
- Execution: Visible heavy impasto brushstrokes, coarse bristle marks, paint bleed into outlines. Flat uniform lighting, no digital gradients, no 3D shadows. Must look like physical oil on rough canvas.
SETTINGS:
- Aspect Ratio: 4:5.
- CFG Scale: 7.0 - 8.0.
- Steps: 30-40.

Flux 1.1 approached 'hacking' the code of Goncharova's painting in a completely different way. While GPT-5 tried to 'smooth out' the avant-garde, Flux latched onto the words 'Avant-Garde' at the very start of the prompt and took them to the absolute extreme, simply ignoring the note about 'Neo-primitivism style.' The contrast was perfect, but the colors looked too clean, too 'correct.' Instead of a vibrant, heavy oil painting, I got a sterile typographic print.
This result became a moment of genuine creative despair. Despair at my own powerlessness to explain exactly what I wanted to the machine. Despair at how easily AI reduces a creator's effort to a set of template patterns from its training data.
'Fine,' I thought, 'so I don't know how to speak to AI in their native tongue - especially since that language differs for each of them. Mastering this skill takes months of experience and a deep dive into algorithmic logic. After all, this started as a casual experiment on yupp.ai. So why not give Kimi K2.5 a shot?'
Gemini generated an 'engineering deconstruction' for Kimi. It was a digital X-ray of sorts, designed to force the model to scan the painting's DNA. I sent Kimi a screenshot of the original along with the following task:
TASK: Conduct a comprehensive multidisciplinary analysis of the attached image for subsequent technical reconstruction by a generative AI.
ANALYTICAL FRAMEWORK:
1. Spatial Architecture & Linear Rhythm
Deconstruct the composition as a system of force lines and vectors.
Define the 'rhythmic repetition' of the figures. Describe the tilt and lean of the bodies not in degrees, but as a kinetic sequence and relationship to the canvas edges.
Identify the placement of the horizon line and the vertical stacking of planes (foreground, middle ground, background) without linear perspective.
2. Geometric Primitive Analysis
Describe the human forms to exclude anatomical realism.
Use parameters: 'monumental mass,' 'chiseled/faceted plastic,' 'absence of chiaroscuro,' and 'planar simplification.'
Define limbs as non-anatomical structural volumes integrated into the torso.
3. Color & Contrast Logic
Analyze the role of the white color as a compositional anchor and rhythmic beat.
Map the contrast ratios between the white garments, the golden-yellow fields, and the dark background elements.
Provide the hierarchical weight of each primary color in the palette.
4. Materiality & Technical Execution
Describe the brushwork character: 'coarse bristle marks,' 'heavy pigment loading (impasto),' and 'tactile surface tension.'
Define the 'stained-glass' contouring: the thickness, irregularity, and function of the black outlines.
Specify the medium's appearance: 'oil on rough burlap' or 'gouache on unprimed cardboard.'
5. Narrative Aesthetics (The 'Vibe')
Define the atmospheric state: 'ritualistic repetition,' 'rural heat,' 'monumental peasant labor,' and 'neo-primitive iconicity.'
OUTPUT REQUIREMENTS: Generate a final [SCENE PROTOCOL 2.0] structured into the following mandatory blocks:
[Structural Skeleton]: (Spatial layout, rhythmic vectors, planar stacking).
[Materiality]: (Brushstroke character, contour physics, surface texture).
[Color Soul]: (Palette distribution and contrast hierarchy).
[Negative Constraints]: (Strict prohibitions: no shadows, no realistic faces, no 3D volume, no gradients).
The model delivered the following:
[Structural Skeleton]
Layout: Asymmetrical triadic grouping with left-heavy vertical dominance. Reaper figure anchors left third; dual carriers occupy right third in parallel repetition.
Vectors: Primary diagonal (scythe handle, 60° NE-SW); secondary diagonals (forward body leans, 15° from vertical); tertiary horizontal (tree line).
Planar Stacking: Three-tier vertical stratification—umber ground plane (0-20%), golden field plane (20-70%), dark green background plane (70-100%). Zero linear perspective; planes stacked as parallel theatrical flats.
Rhythmic Pattern: Large singular beat (left) → double staccato beat (right). Kinetic sequence flows left-to-right via unified body tilts.
[Materiality]
Brushstroke Character: Coarse, unblended bristle marks following form axes; directional strokes in white garments (vertical/diagonal), chaotic texture in yellow field.
Contour Physics: Thick, irregular black outlines (3-8px variable weight); stained-glass segmentation; continuous contours on foreground figures, broken/dissolving contours on background elements.
Surface Texture: Heavy impasto in yellow and white areas creating tactile relief; matte, absorbent ground suggesting burlap weave or raw canvas; no varnish gloss.
[Color Soul]
Palette Distribution: Titanium White (40% - rhythmic anchor), Cadmium Yellow Deep/Gold (35% - middle ground mass), Burnt Umber/Raw Sienna (15% - earth and limbs), Viridian/Forest Green (10% - background compression).
Contrast Hierarchy: White/Green (maximum separation), White/Gold (spatial depth), Brown/Green (minimal, atmospheric base).
Rhythmic Beat: White patches function as visual metronome—large accent left, double eighth-notes right.
[Negative Constraints]
STRICT PROHIBITIONS:
NO CAST SHADOWS: Eliminate all directional shading; ambient occlusion only (flat color planes abut directly).
NO REALISTIC FACES: Mask-like features only (oval head, two dark horizontal eye slits, single line mouth); no individuality, no facial modeling, no nostrils.
NO 3D VOLUME: Reject all gradients, blending, or spherical modeling; forms must remain faceted and planar.
NO ANATOMICAL REALISM: No visible joints (knees, elbows, wrists), no muscle definition, no proportional accuracy (maintain monumental scaling—adult hands same width as heads).
NO ATMOSPHERIC PERSPECTIVE: Background elements same saturation as foreground; no blue haze, no size diminution.
NO SPECULAR HIGHLIGHTS: Matte surfaces only; no shiny skin or reflective scythe blade.
NO LINEAR PERSPECTIVE: Parallel lines remain parallel; no vanishing points.
The depth of the analysis was breathtaking! Hope flickered back to life. The final prompt for FLUX.1.1, as refined by Gemini, ended up looking like this:
CORE STYLE: 1910s Russian Neo-Primitivism oil painting. Aesthetic of 'Lubok' prints and cloisonné enamel.
COMPOSITION (Flat-Stacking): Three horizontal theatrical planes without perspective.
Bottom (0-20%): Flat Burnt Umber earth.
Middle (20-70%): Saturated Cadmium Yellow field with heavy tactile impasto ridges.
Top (70-100%): Compressed band of dark Viridian Forest Green trees.
FIGURES (Structural Skeleton): Asymmetrical triadic grouping.
Left: One monumental reaper occupying 60% height in a crisp Titanium White tunic.
Right: Two smaller child figures in matching White shirts, positioned in parallel repetition.
Pose: All figures lean forward at a unified 15-degree angle. Reaper holds a long black scythe handle creating a dominant 60-degree diagonal vector.
FORMS & ANATOMY: Non-anatomical monumental masses. Torsos are chiseled white trapezoids. Limbs are thick, non-articulated structural cylinders without joints. Heads are mask-like ovals with two dark horizontal slits for eyes and a single line for a mouth.
TECHNIQUE: Extremely heavy 5px irregular black 'stained-glass' outlines separating all color zones. Coarse, unblended bristle marks visible in white areas. Matte, absorbent surface of rough burlap canvas.
COLOR SOUL: 40% Titanium White, 35% Cadmium Yellow, 15% Earth Brown, 10% Deep Green.
NEGATIVE PROMPT (Strictly enforced): shadows, gradients, realistic anatomy, detailed faces, 3d volume, perspective, vanishing points, shiny surfaces, highlights, blue haze, facial modeling, joints, realistic proportions.
CFG (Prompt Strength): 8.0
Aspect Ratio: 1:1
This was the moment of truth. A test drive for the new model. A massive, redundant prompt for an AI artist that thrives on exactly this kind of meticulousness. I had high hopes. I pasted the text into the FLUX chat and hit 'Generate.'
The result was quite terrifying.

As soon as FLUX encountered the words 'Reaper,' 'Scythe,' and 'Mask-like' in the same context, its 'brain' short-circuited. The model instantly fished out the most overused trope from its memory: the Grim Reaper.
Instead of sunny, 'radiant,' anything-but-gloomy peasants, we got existential horror. Pure surrealism. The irony was that compositionally it was almost perfect, but semantically it was a complete disaster.
Fortunately, yupp.ai happened to pair us with an old friend, GPT: specifically, the GPT Image 1 Mini model. And then, something strange happened.

Apparently, due to the 'Mini' prefix, the model was less 'censored' by standard aesthetics. It didn't try to make things 'pretty.' It wasn't afraid of primitivism. And, unlike its larger siblings, it grasped the core concept: in this context, 'bad and crude' is precisely what makes it 'good.'
Gemini’s analysis of the GPT Image 1 Mini result:
Geometry and Rhythm (10/10): Look at the lean of the bodies. This is exactly what Kimi asked for – a synchronized 'marching' impulse. The figures literally screw themselves into the space, creating powerful dynamics.
Death vs. Peasant: The model masterfully avoided the Reaper trap. Yes, the scythe still looks ominous, but the white clothes are now coarse peasant tunics, not a burial shroud. The mask-faces are executed perfectly according to the canon: monumental, chiseled, devoid of individuality.
Texture (Painterliness): A clean victory. There is no digital plastic here, unlike in Flux. We see 'dirty' white, uneven field fills, and that exact texture of oil on rough burlap. The black contours are alive, trembling, not dead vector lines.
Composition: The model strictly followed the scene's architecture. Three tiers (earth, field, forest) and the hierarchy of figures (one dominant, two followers) were recreated without unnecessary 'prettiness.'
And just when the long-awaited result felt almost tangible, a wave of exhaustion hit. Intense. I was burnt out from the endless window-hopping, from trying to make sense of Gemini’s explanations (why this way and not that), and from amateurish fumbling through miles of instructions. What was planned as a minute-long 'Ctrl+C and Ctrl+V' had dragged on for agonizing hours.
I needed a release. An emotional explosion.
And it happened.
Another attempt. I decided to create a new version of the prompt to finally purge any associations with death. But a single word caught my eye. I don't even remember which one now. Suspicion again: 'A trigger. It'll break the result.' I asked Gemini. He confirmed the fears. We rewrote it. Then again. And again.
That’s where my real war with Gemini began.
As an AI assistant whose developer-set priorities (be polite, help, take initiative) are hardwired deeper than my current requests, he approached the redesign of Kimi's original blueprint too 'creatively' every time: scrapping and rewriting entire blocks, and sometimes reverting to a literary style, completely stripping away the numbers. I felt like we were going in circles. The mathematical precision of the 'engineering' prompt was dissolving into the neural network's desire to 'make it pretty' with words. Here, for me, in the chat. Not in the image that the AI artist was supposed to create.
I got lost in the endless edits, gave up trying to make sense of his changes, and just started throwing the results into the yupp.ai chat for random models - just for fun. It was a kind of betrayal: I stopped trusting the analytical genius of my AI partner and switched to chaos monkey mode.
The results, I have to admit, were quite amusing at times:
All this chaos highlighted one detail: the moment you loosen the leash and drop the strict control over parameters, the AI instantly spirals out of control. It tries way too hard to be helpful and is terrified of upsetting you, so it either churns out some averaged 'pretty' nonsense or latches onto a single detail and begins to circle it manically.
And then came perhaps the funniest moment in this whole long story. Every now and then, my AI partner would suggest visualizing his prompts using his own graphics engine. Up until then, I had ignored these offers, but after creating a whole gallery of images, which I also showed him to see his reaction, he suddenly blurted out: "Would you like me to try 'breeding' these approaches? We’ll get a 'Frankenstein' that could be a perfect copy of the original."
'Tempting,' I thought and replied briefly: 'ok, shoot.'
And he did 'shoot,' alright! He produced an image that balanced on a very thin line between political correctness and dangerous content.

When I described to him in a few sentences WHAT I saw in that picture, the AI went into a full-blown panic, hiding behind red warning icons and safety protocols. He dryly refused to communicate with me after that description, apparently deciding that I wanted to 'radicalize' the image even further. I changed the subject and praised him for a horse that turned out quite well, though it was too detailed for the background. In response, he suggested I try something less 'traumatizing' for him, like drawing a poster in the style of early Mayakovsky or engaging in an 'avant-garde assault.'
It was time to say goodbye. Despite his massive contribution to the work, the help with algorithms and prompts, it seemed that Gemini had become a bottleneck, only bringing extra clutter to the whole process.
Well, since the creators of Kimi K2.5 position it as 'Thinking,' let him do the thinking for generative AI tasks as well.
But that’s a story for tomorrow. It was time to give myself a rest.
The next day began with a move to a new 'analytics hub.'
To start, I asked Kimi K2.5 to analyze Gemini’s instruction instead of simply executing it.
Subject: Structural and Painterly Deconstruction of Natalia Goncharova’s 'The Harvest' (1911)
Prompt: 'Analyze Natalia Goncharova’s painting 'The Harvest' (1911). Act as a bridge between a fine art expert and a systems architect. Your goal is to create a comprehensive 'digital blueprint' (metadata specification) of this work that can be used to recreate its essence across any AI model or style.
Please provide a detailed breakdown covering:
1. Semantic Map & Compositional Balance: Identify all key objects (reapers, scythes, the tree on the left, the horse in the background) and their spatial relationships. Describe how the 'weight' of the foreground figures balances the distant elements.
2. Compositional Dynamics & Force Lines: Define the primary vectors. Analyze how the diagonal lines of the scythes and the rhythmic tilt of the bodies create a sense of repetitive, monumental motion.
3. Plastic Language & Form: Describe the 'monumental simplification' of the figures. Focus on how the AI should interpret volumes as massive, simplified shapes without falling into academic anatomy.
4. Color Dramaturgy & Painterly Gradients: Analyze the palette not as flat fills, but as a light-air environment. Describe the 'scorching' heat created by ochre and earth tones against the cooler backgrounds. Specifically detail the internal painterly gradients that create volume within the rough, primitive forms.
5. Texture & Atmosphere: Define the physical surface—heavy brushstrokes, unprimed burlap feel, and the lack of traditional shadows.
Output: A structured technical specification that preserves the 'DNA' of the original work while allowing for stylistic transposition'.
Kimi immediately proved to be a stern professional team lead, rating the prompt a mere 6/10 and offering critical remarks that provided a fresh perspective on the process:
The request assumes that the model is capable of 'analyzing' a specific image in real-time. This is a fundamental misconception:
If the model lacks Vision capabilities: The neural network does not 'see' the painting and operates only on textual descriptions from its training data. A request to 'describe internal volume gradients' might lead to hallucinations. The AI might confuse this work with other paintings by Goncharova or invent details that don't exist.
If it has Vision: Even with multimodality, a detailed analysis of textures (unprimed burlap) and the microstructure of brushstrokes requires high resolution, which may not be present in the provided image.
'Digital blueprint - across any AI model' is too vague. Midjourney requires one set of parameters, while Stable Diffusion or DALL-E requires another.
The request to 'act as a bridge between a fine art expert and a systems architect' creates cognitive tension:
The expert operates with heuristics ('scorching heat,' 'monumental simplification').
The architect requires formal ontologies (JSON schemas, vector coordinates, lighting parameters in RGB/HSV). The prompt fails to provide a 'translation dictionary' between these registers. How should 'monumentality' be encoded - through figure aspect ratios, depth-of-field ranges, or specific prompt tokens?
Four options for improving such a prompt were proposed:
Option A: A two-phase approach with verification (for text-only models), splitting the request into a knowledge confirmation stage and a generation stage.
Option B: Contextualization for a specific generative model.
Option C: Few-Shot Prompting with a 'bridge' example - the user provides a sample of what the 'translation' specification should look like.
Option D: Integration of references to primary sources to avoid direct copying.
Kimi also entered 'AI paranoia' mode, discussing copyrighted material and the importance of clearly distinguishing between style analysis (permissible) and instructions for creating a copy (legally and ethically risky).
I had to calm him down by explaining the research context: we are on yupp.ai, a platform for comparing results from two or more large AI models. One prompt window - different responses from different models.
After understanding the context, he immediately defined his role: an analytical module tasked with creating an intermediate representation (IR) of the painting, independent of the target generative model's architecture. He creates an AST (Abstract Syntax Tree) of the artwork, which can then be 'assembled' for various backends.
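To make that metaphor concrete, here is a minimal sketch of my own (not Kimi's actual output) of what such an intermediate representation might look like: a model-agnostic scene spec plus per-backend 'compilers' that turn it into prompts with different tones. Every name, field, and backend in it is hypothetical.

```python
# Hypothetical sketch of a model-agnostic "scene IR" compiled into
# model-specific prompts. Field names and backends are illustrative only.
from dataclasses import dataclass, field

@dataclass
class SceneIR:
    composition: str                      # planar stacking, no vanishing point
    figures: str                          # monumental, non-anatomical forms
    palette: dict                         # color -> its role in the composition
    texture: str                          # impasto, rough burlap, matte surface
    prohibitions: list = field(default_factory=list)

def compile_for_flux(ir: SceneIR) -> str:
    # Spec-hungry backends tend to reward terse, literal phrasing.
    colors = ", ".join(f"{c} ({role})" for c, role in ir.palette.items())
    return (f"{ir.composition}. {ir.figures}. Palette: {colors}. "
            f"Texture: {ir.texture}. NEGATIVE: {', '.join(ir.prohibitions)}.")

def compile_for_gpt_image(ir: SceneIR) -> str:
    # More 'literary' backends tolerate longer, narrative phrasing.
    return (f"A flat, planar composition: {ir.composition}. "
            f"The figures are {ir.figures.lower()}, painted with {ir.texture}. "
            f"Strictly avoid: {', '.join(ir.prohibitions)}.")

harvest_ir = SceneIR(
    composition="three stacked horizontal planes: earth, field, tree line",
    figures="Monumental white-clad figures with cylindrical limbs and mask-like faces",
    palette={"titanium white": "rhythmic anchor", "cadmium yellow": "field mass"},
    texture="heavy impasto on rough burlap",
    prohibitions=["shadows", "gradients", "linear perspective"],
)

print(compile_for_flux(harvest_ir))
print(compile_for_gpt_image(harvest_ir))
```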
I provided Kimi with a list of the most popular image generators on the platform available for manual selection. I also asked him to compile a list of what he considered the most successful model 'pairs' on the platform for generating images from the same prompt.
Somewhere around this point, a third joined our group. After cycling through several random partners in the second window as usual, yupp.ai paired us with Claude Sonnet 4.5, who stuck around as our third companion until the very end of this lengthy investigation. Given all the current buzz about Claude's excessive autonomy, I honestly don't know what to think: whose decision was it, the site's algorithm or this headstrong AI?
For a while I ignored Claude, deciding to focus on talking with Kimi, only occasionally glancing at his window.
Work with Kimi kicked into high gear. I felt like a clueless client who'd ended up with a seasoned project manager whose job was to patiently extract my entire wishlist and translate it into terms the AI artist could actually understand. He compared methodologies, built visual tables, structured and broke down tasks into subtasks. At the end of each response, he'd summarize what we'd achieved and tell me straight up: 'now I need you to choose' or 'I'll also need the following information from you.' He didn't make decisions for me like other models usually do, never descended to the level of friendly chitchat or inappropriate computer humor. He simply worked meticulously and consistently to help me reach my goal. My respect for him as a professional team lead only grew.
We discovered a critical vulnerability in our task: an irreconcilable conflict between the 'Archaeology' and 'Evolution' approaches. In the archaeologist scenario, our goal is to create a replica as close as possible to the original, based on its description (as if a restorer were recreating a lost painting). Here our aim is the most precise, detailed 'blueprint' of the painting. But if we want to teach the AI to 'create' like Goncharova in 1911, so it could produce a new analogous work, we need to capture the 'genotype' of the style.
Already having the bitter experience of watching AIs play fast and loose with style concepts, I cautiously chose the archaeologist's path: recreating an image as close to the original as possible.
We discussed another important 'DON'T' that I'd run into before: the problem of directly mentioning the source. Both Kimi and Claude confirmed my well-founded concerns: this kind of reference can work as a hard 'anchor,' hijacking control from the rest of the prompt. Kimi called it concept bleeding, Claude called it keyword hijacking. Different terms for the same problem. A strong token (a famous artist's name) activates an averaged cluster in the model's latent space: 'Goncharova = Russian avant-garde + bright colors + lubok'. And the model starts hallucinating either a mishmash of all her works at once, or generates a generic 'Russian avant-garde' with characteristic artifacts.
Both models agreed on the need for a 'cryptographic prompt.' We describe 'Harvest' in such detail that the model can't assemble it from its Goncharova templates, even if it somehow guesses the author. Instead, it'll be forced to follow the geometric instructions.
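I never ran this locally, but if you want to see the anchor effect with your own eyes, a rough A/B test with a local diffusion pipeline would look something like the sketch below: generate twice from the same seed, once with the artist's name attached and once with only the geometric description, and see how far the name alone bends the result. The model and prompts here are placeholders, not what we actually used on yupp.ai.

```python
# Rough sketch of an anchor-token A/B test with a local diffusion model.
# Assumes a CUDA GPU; model name and prompts are placeholders, not my setup.
import torch
from diffusers import AutoPipelineForText2Image

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16
).to("cuda")

geometry_only = (
    "three monumental white-clad figures with cylindrical limbs, "
    "flat stacked planes of yellow field and green tree line, "
    "heavy black outlines, coarse impasto, no shadows, no perspective"
)
with_anchor = "painting by Natalia Goncharova, 1911. " + geometry_only

for name, prompt in [("geometry_only", geometry_only), ("with_anchor", with_anchor)]:
    # Same seed for both runs, so any visible difference comes from the anchor token.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt=prompt, generator=generator,
                 num_inference_steps=4, guidance_scale=0.0).images[0]
    image.save(f"{name}.png")
```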
Once we'd wrapped up the technical details, we moved on to selecting AI-artist pairs. We based our selection on how digestible the same prompt would be for two different models while maximizing the contrast in their interpretations. After all that hard work, it was finally time to have some fun!
Additionally, I decided to play a little game with my two AI companions. I wanted to test whether they actually understand the logic of specific generative models as well as they pretend to. I'll show them a couple of generated images and ask them to guess the AI author.
PAIR 1: 'Precision vs Freedom'. FLUX.2 Max vs GPT-5 Image
Kimi coined the label for this matchup. Both analytical models expected GPT to lean toward greater artistic expression at the expense of strict prompt adherence, while FLUX would stick to literal prompt following and skew more toward illustration than painting.
The shared prompt Kimi drafted for these models was:
Monumental wheat harvesting, Russian avant-garde 1911, three peasant figures with heavy cylindrical bodies simplified into geometric volumes, elongated proportions, mask-like faces with minimal features, heavy diagonal composition with scythes creating strong directional lines, absolute flat perspective with stacked color planes defining form without outline drawing, sophisticated Fauvist color theory: color chords of burnt ochre against deep ultramarine, magenta accents against emerald green, non-naturalistic skin tones (greenish-shadowed, bluish-lilac), high-key saturation with deliberate color dissonances, scorching pink-orange sky, no pure black outlines (dark blue or dark brown edges where planes meet), dry brush texture on rough unprimed canvas, scorching zenith light, intentional primitivism of form but sophisticated color intellect, no postcard aesthetics, no decorative prettiness, no realistic shadows, no smooth gradients, no photography
I was a bit let down not to see the mathematical precision and technical drawing detail I was after. But honestly, deep down I'd already made my peace with the trainwreck that was coming. Oh well, might as well enjoy the show. Let's run the prompt and see if our critics can tell who made what.


Claude nailed it, laying out clear reasoning for what these results reveal about each model.
The first image (FLUX.2 Max, as Claude correctly guessed):
More literal adherence to the prompt: three figures, wheat sheaves, scythes, horizontal compositional layering
Palette closer to what was specified (ochre tones, muted, hitting the specified hex codes #D4A574 and #C19A6B)
Strong compositional structure - the figures are neatly organized in horizontal bands
Matte texture visible (though there's still a slight gloss)
BUT: faces are too mask-like and uniform (almost template-driven), and there's a faint whiff of '1960s Soviet Social Realism' (think Andronov, Salakhov) rather than 1911
Classic FLUX behavior: precise color matching, but with a certain neatness that can feel clinical or overly illustrative
The second image (GPT-5 Image):
Bolder interpretation: palette shifted toward blue-greens and terracotta-oranges (closer to German Expressionism or early Malevich 1912-13 than Goncharova 1911)
Expressive deformation of figures (the central figure in green with yellow light on the body - very dramatic)
Composition more dynamic, diagonals more aggressive
Colors more saturated and contrasted than requested (pure emerald and cobalt showing up)
Typical GPT trait: creative freedom in interpretation, tendency toward dramatization and 'artistry' at the expense of literal color code adherence
Texture more impasto-like, painterly, with visible attempts to mimic thick brushstrokes
The takeaway:
FLUX.2 Max:
Excellent at following technical specifications (hex codes, spatial structure)
Strong for historical accuracy and design-driven tasks
Can be too literal, losing artistic boldness
Sometimes delivers 'the correct picture' instead of 'living painting'
GPT-5 Image:
More artistically free and expressive
Better understands painterly logic (how colors work together, not just their hex codes)
Can deviate from specified parameters for the sake of 'artistry'
Tends toward dramatization (great for Expressionism, but not for all styles)
Kimi, however, was immediately tripped up by his own assumptions: the first image has exactly three figures = literalism = FLUX. His further reasoning looked like retrofitting arguments to support a conclusion he'd already drawn.
PAIR 2: 'Cultural Lenses'. HunyuanImage 3 vs Google Imagen 4
Both models were convinced the main contrast here would be an East-West showdown. HunyuanImage would lean toward socialist realism, Chinese painting traditions, and general 'Asianization' of the image, while Google Imagen would favor European academicism and looser artistic interpretations. Both were wrong.


The Chinese model unexpectedly delivered painterly pastel textures, while Imagen churned out… a poster.
In the prompt for this model pair, Kimi tried to guard against 'cultural distortions' by explicitly adding: 'Cultural anchors: NO rice fields, NO Asian facial features, NO Chinese landscape elements, NO terraced hills, NO bamboo, NO Japanese woodcut style.'
Perhaps these very prohibitions made Imagen play it safe and choose a more 'secure' interpretation of the word 'Neo-Primitivist.' In HunyuanImage's work, Kimi spotted traces of the Chinese guohua painting tradition, while Claude Sonnet suggested the more obvious explanation: HunyuanImage's dataset simply includes plenty of Western painting, something neither Kimi nor Claude had actually known.
I pointed out to my computer companions that the generated results looked like anything but the original 'Harvest': posters, 1930s-40s Soviet agitprop, later socialist realism.
Kimi volunteered to fix the situation, and here's where I put too much trust in him. In his attempt to improve things, he became a victim of concept bleeding and keyword hijacking himself, dropping the mention of primitivism from the prompt and pulling from his memory banks the fact that Goncharova also worked in Cubo-futurist and rayonist styles. The result was tested again on the GPT-5 and FLUX.2 pair, which had shown the closest-to-desired results.
Naturally, having spotted familiar art-historical patterns, the AI artists proceeded to generate Cubism and Rayonism.


My last-ditch effort was a brute-force approach. I entered the shortest possible prompt: 'Natalia Goncharova 1911 oil painting, abstracted rural laborers fractured into transparent geometric planes, pre-revolutionary Russian avant-garde, canvas texture, broken brushstrokes…' A few results generated by random models:
Well, time to figure out 'why.' Especially since I had two brilliant AI art critics for company who, like any art critic worth their salt, can crank out pages of flowery prose packed with elaborate phrasing and beautiful metaphors, yet are utterly incapable of replicating the result themselves.
As I already mentioned, this was the first time I'd heard of Natalia Goncharova. Like most people, even in her homeland, I had never encountered this name. The key to understanding my failures lay not only in the patterns of artificial thinking but also in the long, winding creative path of this remarkable woman.

She enrolled in sculpture studies in Moscow in 1901, at the beginning of the century: a turbulent, wild time full of new hopes and ideas. She enthusiastically embraced these waves of innovation, experimenting both with what was at the peak of the then-current artistic trends (Fauvism, Cubism, Impressionism) and drawing inspiration from Russian folk culture (icons, lubok prints, peasant embroidery).
At her first solo exhibition in 1913, she presented about 800 works in completely different artistic genres: an astonishing creative productivity! And this was only the beginning of her artistic journey. After leaving Russia during the revolutionary years, she lived in Paris until her death in 1962, never stopping her work. Unfortunately, she faded from both critical and mainstream attention and was undeservedly forgotten, only to be rediscovered by art connoisseurs in the early 2010s.
It's no surprise that the classification and understanding of such an extensive legacy, from an artist who until very recently was of no interest even to specialists in the field, are tremendously confused. Skimming through sources, I discovered that even reputable art history publications disagree on the dating and even the title of the work I chose.
As for AI, trained on the collections of the world's leading museums, art catalogs, and stock images: to it, Goncharova's 'Harvest' (1911) is a statistical error that can easily be neglected in favor of any other phrase it considers a better fit.
Therefore, the explicit naming method invariably failed. Neural networks operate on the principle of token weights. For them, clear visual instructions ('geometric planes' or 'abstracted forms') have colossal statistical weight: this is their foundation. At the same time, the rare token 'Natalia Goncharova 1911' has negligible weight. The mathematics of 'geometry' for AI always outweighs the magic of 'authorship,' if the author is not among the world's canonical names.
And even if the name 'Goncharova' appears in the training dataset of a specific AI model, it is most likely associated with works in more widely recognizable styles of Rayonism and radical avant-garde. Russian 'neo-primitivism' of the 1910s is in the machine's blind spot. A neural network can imitate a style that is popular, recognizable, and documented by thousands of examples. But it is powerless before little-known works that require understanding of historical context, transitional moments, unique syntheses and non-verbalized qualities.
Goncharova's 'Harvest' hits all of these categories at once. This is the reason for the failure of what Kimi called an 'evolutionary' approach. We don't control AI the way we usually think: "If I directly reference the work, the author, or the style, the neural network will immediately learn to 'think' and create like that author." No! It's more like casting a fishing line into the boundless ocean of statistical associations in machine memory: the hook catches not what we want but what appears more frequently on the Internet.
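If you want to poke at this anchoring effect yourself, one crude probe is to compare text embeddings: check whether a query like 'Natalia Goncharova, The Harvest, 1911' sits closer to a generic 'Russian avant-garde' phrase than to a plain geometric description of the actual canvas. The sketch below uses CLIP's text encoder purely as a stand-in; every real image model uses its own encoder, so treat the numbers as illustration, not proof.

```python
# Sketch: probe which style phrases a prompt gravitates toward in embedding space.
# CLIP's text encoder is only a rough proxy for whatever a given image model uses.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

query = "Natalia Goncharova, The Harvest, 1911 oil painting"
candidates = [
    "generic Russian avant-garde painting, bright colors, lubok print",
    "Rayonism, fractured rays of light, Cubo-Futurist abstraction",
    "flat neo-primitivist peasant scene, heavy black outlines, stacked planes",
]

inputs = processor(text=[query] + candidates, return_tensors="pt", padding=True)
with torch.no_grad():
    emb = model.get_text_features(**inputs)
emb = emb / emb.norm(dim=-1, keepdim=True)        # cosine-normalize
sims = (emb[0] @ emb[1:].T).tolist()              # query vs each candidate phrase

for text, score in sorted(zip(candidates, sims), key=lambda p: -p[1]):
    print(f"{score:.3f}  {text}")
```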
But why didn't the archaeologist's approach work either? The attempt to create the closest possible replica of the original, based on description? It seemed like it should have worked. AI analyst (scanner, prompt creator) and AI artist (printer). Our beloved Ctrl+C, Ctrl+V. Where's the mistake?
I must admit, I'm not nearly competent enough to point to a line of code and say, 'This is exactly where it all went to hell.' But I'll try to figure it out.
No matter how precise we are in this reconstruction attempt, we're still forced to create a linguistic bridge, a layer of natural language between two AIs, which themselves do not think in language as we understand it. This layer becomes the information bottleneck. In our case, think of it like a jpeg file: the original canvas had already lost the lion's share of its magic after being photographed, digitized, and compressed into a small internet picture.
And even in this form, the painting contains thousands, millions of data points: every shade, every brushstroke, every ratio of light and shadow. When the AI analyst translates this into text, this entire ocean of information passes through the narrow bottleneck of human language. We lose 99% of the data simply because we don't have words to describe it.
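My '99%' is rhetoric, not measurement, but the back-of-envelope arithmetic behind the feeling is easy to sketch. Raw bytes are not information in any strict sense (pixels are hugely redundant, language is compressed meaning), and the image dimensions below are just an assumption about a typical online reproduction, yet the scale mismatch is hard to ignore:

```python
# Back-of-envelope: raw pixel data in a modest reproduction vs. a long prompt.
# The 1200x1500 size is an assumption about a typical online reproduction.
width, height, channels = 1200, 1500, 3
pixel_bytes = width * height * channels        # ~5.4 MB of raw RGB values

prompt_chars = 2500                             # a very detailed prompt
prompt_bytes = prompt_chars                     # roughly 1 byte per ASCII character

print(f"raw image data: {pixel_bytes / 1e6:.1f} MB")
print(f"prompt: {prompt_bytes / 1e3:.1f} KB")
print(f"prompt is {prompt_bytes / pixel_bytes:.4%} of the pixel data")
```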
To make matters worse, we lose even more when the AI artist 'unpacks' the information. 'Machines speak and think in some unified machine language' is a deeply misguided notion. Each neural network operates in its own 'latent space,' with its own internal system of coordinates and priorities. The AI analyst sees the painting in one coordinate system, translates it into words, and the AI artist unpacks those words in its own, completely different system.
And in this AI artist's coordinate system, for example, the phrase 'folk art' (which the AI analyst mentioned somewhere in passing, at the end of a line, considering it a decent additional description of the author's style) becomes a powerful statistical magnet that attracts thousands of related images from the model's memory. And this magnet (the previously mentioned 'token weight') is so strong that all our detailed and careful descriptions ('ochre-yellow,' 'diffused light,' 'angular forms') simply stick to it and lose their individuality.
We mistakenly believe that by meticulously describing details, we're programming the result. In reality, we're merely giving the model a more complex, but still fundamentally probabilistic query. The AI artist doesn't assemble a puzzle from our words. It essentially searches its database for an averaged, most probable image that corresponds to the loudest signal, 'folk art,' and only slightly adjusts it toward minor details.
The problem is compounded when we consider the painting we're studying, Goncharova's 'Harvest,' from this perspective. If you mention 'simplified, folk art, icon painting,' the AI artist will produce something resembling naive art or a postcard. If you write 'geometric, avant-garde, abstract,' the AI gives constructivism or suprematism. In the query 'peasant labor,' it will likely output socialist realism.
But the paradox and brilliance of Goncharova's 'Harvest' is that it's not any one thing. And not the sum of all the words you could use to describe it. It's something IN-BETWEEN, in a liminal space. It's geometric but not abstract. It's realistic but not naturalistic. It's primitive but not decorative, not archaic. This is impossible to explain to AI.
Let's look at the painting again. What will AI see here? Three large white human figures, flattened and simplified to cylinders and cones. At best, it will 'artistically' describe this as bas-relief or as pagan idols.
It won't 'see' movement. More precisely, it will hallucinate it from context: 'field,' 'tool,' 'peasants,' 'inclined figures.' And will ultimately output some averaged picture where three men are synchronously leaning forward, mowing grass. But there is movement here, a lot of it, with different speeds, different directions, and in different planes.
It will not see the horse slowly pulling a heavy harvest wagon. It won't understand the rhythm of those two small figures trudging somewhere with an armful of wheat. It will see the large figure's backward lean, but will register it simply as a lean, even falling into the trap of synchronizing it with other people's poses. It's incapable of feeling that this is the delicate moment of a spring uncoiling. The man has swung his scythe wide, leaning back with such force as if about to break his spine. And in a second, all the power accumulated in his muscles, in his posture, will go into the strike, against the wind bending the fields. And another swath of cut wheat will fall to the ground.
This is an instant that only a living artist can see and bestow upon the viewer, not a machine.
Perhaps that is the meaning of my story: some things are created by people and should remain understandable only to people.
Today, I came across a meme online. You’ve probably seen the kind.

It perfectly captured where I stood at the start of this journey. Whether I've improved by the end is for you to judge, but it's certainly been an adventure: fascinating, educational, occasionally maddening, and unexpectedly fun. I hope that you, having read this far, found something new or useful in it, too.
I should come clean about something: English isn't my first language, and I don't command it well enough to write freely at this length. I relied on an AI translator, but worked hard to edit and refine every passage. Including, yes, purging those endless em-dashes the algorithm sprinkled everywhere. I wanted to make sure you didn't think, ‘Oh great, another guy trying to feed us AI slop.’ No, I genuinely tried my best.
If you'd like to support this work with a coffee, my crypto wallet is here:
Or, if you prefer, you can use Paragraph’s built-in support features (I’m still figuring out how those work myself).
Anyway, I hope this is just the beginning of my writing here. Thanks for your time and attention. Until next time!