The Stable Diffusion team has done it again! A new AI painting model that turns out posters directly, with pixel-level generation

Pixel-level image generation

DeepFloyd IF is still based on a diffusion model, but it differs from the earlier Stable Diffusion in two major ways.

The part responsible for understanding text has been changed from OpenAI's CLIP to Google's T5-XXL, which is combined with an additional attention layer in the super-resolution module to understand text more accurately.

The part responsible for generating images has been switched from a latent diffusion model to a pixel-level diffusion model.

That is, the diffusion process no longer acts on the latent space representing the image encoding, but directly on the pixels.
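
To make the distinction concrete, here is a minimal sketch of one forward (noising) step applied directly to pixel values rather than to an autoencoder's latents. It uses the standard DDPM closed form x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise. The function name, the flat-list image representation and the sample values are illustrative assumptions, not DeepFloyd IF's actual code.

```python
import math
import random


def noise_pixels(pixels, alpha_bar, rng):
    """Apply the closed-form DDPM forward step to raw pixel values.

    pixels    : flat list of floats scaled to [-1, 1] (one value per channel)
    alpha_bar : cumulative signal fraction in [0, 1]; 1.0 means no noise
    rng       : random.Random instance, so runs are reproducible
    """
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * p + noise * rng.gauss(0.0, 1.0) for p in pixels]


# A 64x64 RGB image is 64 * 64 * 3 = 12288 values. A pixel-level model
# noises and denoises exactly these values, with no encoder in between.
image = [0.5] * (64 * 64 * 3)
rng = random.Random(0)

clean = noise_pixels(image, alpha_bar=1.0, rng=rng)  # no noise added
noisy = noise_pixels(image, alpha_bar=0.5, rng=rng)  # equal parts signal and noise variance
```

In a latent diffusion model, the same step would instead be applied to the compressed encoding produced by an autoencoder, and a decoder would map the result back to pixels.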

A set of official visual comparisons between DeepFloyd IF and other AI painting models is also provided.

As you can see, Google's Parti and NVIDIA's eDiff-I, which also use T5 for text understanding, can draw text accurately as well, so the blame for AI being unable to write falls on CLIP.

However, NVIDIA's eDiff-I is not open source, and Google's models do not even have a public demo, which makes DeepFloyd IF the more practical choice.

The way DeepFloyd IF generates an image is consistent with earlier models: once the language model has interpreted the text, a small 64×64 image is generated first, and it is then enlarged step by step by successive diffusion and super-resolution models.
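
The cascade can be sketched as a chain of stages. The 64×64 base resolution comes from the description above, and the 256 and 1024 targets are those of the released IF stages. The stage functions below are hypothetical stand-ins (a constant image and nearest-neighbour upscaling) for what are, in the real system, diffusion models conditioned on T5 text embeddings.

```python
def base_stage(prompt, size=64):
    """Stand-in for the base diffusion model: returns a size x size image.

    A real base model would denoise random pixels conditioned on the text;
    here we only produce a grid of the right shape.
    """
    value = (len(prompt) % 256) / 255.0
    return [[value] * size for _ in range(size)]


def upscale_stage(image, factor=4):
    """Stand-in for a super-resolution diffusion stage (nearest neighbour)."""
    out = []
    for row in image:
        wide = [px for px in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def run_cascade(prompt):
    """Run base -> 4x -> 4x and record the resolution after each stage."""
    image = base_stage(prompt)
    sizes = [len(image)]
    for _ in range(2):
        image = upscale_stage(image)
        sizes.append(len(image))
    return image, sizes


final_image, sizes = run_cascade("a neon sign that says hello")
print(sizes)  # [64, 256, 1024]
```

Only the data flow is meaningful here: each stage receives the previous stage's output and returns a larger image.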

The same architecture also supports image-to-image generation: a given image is scaled back down to 64×64 and diffusion is run again with a new prompt, which adjusts the style, content and details.

This works directly, without fine-tuning the model.
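
A rough sketch of that image-to-image flow, under stated assumptions: the input is box-averaged down to 64×64, then partially re-noised so that the base model can denoise it again under a new prompt. The `strength` parameter and the helper names are hypothetical (borrowed from the common SDEdit-style formulation), and the denoising call itself is omitted because it needs the trained model.

```python
import math
import random


def downscale_box(image, target=64):
    """Average non-overlapping blocks so a square image becomes target x target.

    Assumes the input side length is a multiple of target.
    """
    block = len(image) // target
    out = []
    for by in range(target):
        row = []
        for bx in range(target):
            total = 0.0
            for y in range(by * block, (by + 1) * block):
                for x in range(bx * block, (bx + 1) * block):
                    total += image[y][x]
            row.append(total / (block * block))
        out.append(row)
    return out


def renoise(image, strength, rng):
    """Mix the image with Gaussian noise; strength 0 keeps it, 1 replaces it."""
    keep = math.sqrt(1.0 - strength)
    add = math.sqrt(strength)
    return [[keep * px + add * rng.gauss(0.0, 1.0) for px in row] for row in image]


source = [[0.25] * 256 for _ in range(256)]  # grayscale stand-in for a real photo
small = downscale_box(source)                # back to the 64x64 base resolution
start = renoise(small, strength=0.7, rng=random.Random(0))
# `start` would then be denoised by the base model using the new prompt.
```

A higher `strength` discards more of the original image and leaves more room for the new prompt to change it.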

Another advantage of DeepFloyd IF is that its IF-4.3B base model has the largest number of effective parameters in the U-Net of any current diffusion model.

In the team's experiments, IF-4.3B achieves the best FID score, reaching state of the art (a lower FID means higher image quality and better diversity).
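
For reference, FID is the Fréchet distance between two Gaussians fitted to Inception features of real and generated images: |mu_r - mu_g|^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The sketch below computes only the one-dimensional special case, where the formula reduces to (mu_r - mu_g)^2 + (sd_r - sd_g)^2. The feature values are made up for illustration; a real FID is computed on high-dimensional Inception-v3 activations.

```python
import statistics


def frechet_distance_1d(real, generated):
    """Fréchet distance between 1-D Gaussians fitted to two samples."""
    mu_r, mu_g = statistics.fmean(real), statistics.fmean(generated)
    sd_r, sd_g = statistics.pstdev(real), statistics.pstdev(generated)
    return (mu_r - mu_g) ** 2 + (sd_r - sd_g) ** 2


real_features = [0.0, 1.0, 2.0, 3.0]
same_features = [0.0, 1.0, 2.0, 3.0]
shifted_features = [1.0, 2.0, 3.0, 4.0]  # same spread, mean moved by 1

identical = frechet_distance_1d(real_features, same_features)  # 0.0
shifted = frechet_distance_1d(real_features, shifted_features)
```

The distance is zero only when both the mean and the spread match, which is why the score reflects both quality and diversity.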

Who is DeepFloyd

DeepFloyd AI Research is an independent R&D team under Stability AI. Influenced by the rock band Pink Floyd, it calls itself an "R&D band".

The team has only four members, all of whom, judging by their surnames, appear to have Eastern European backgrounds.

In addition to the open source code, the team also offered an online demo of the DeepFloyd IF model on HuggingFace.

We gave it a try, but unfortunately it does not support Chinese yet.

The reason may be that the training dataset, LAION-A, contains little Chinese content. Since the model is open source, though, we expect it will not be long before a good variant trained on a Chinese dataset appears.

One More Thing

DeepFloyd IF was not Stability AI's only open-source release last night.

On the language model side, they also launched StableVicuna, the first open-source chatbot trained with RLHF, built on the Vicuna-13B model.

The code and model weights are now available for download.

Full desktop and mobile interfaces will also be released soon.

DeepFloyd IF online demo:

https://huggingface.co/spaces/DeepFloyd/IF

Code:

https://github.com/deep-floyd/IF

StableVicuna online demo:

https://huggingface.co/spaces/CarperAI/StableVicuna

Model weights download:

https://huggingface.co/CarperAI/stable-vicuna-13b-delta