Full circle: An AI-Generated NFT's struggle to remove its own background

witrebel@newsletter.paragraph.com (_Witrebel) — Thu, 31 Mar 2022 21:11:31 GMT

Almost a year ago, Berk created an immersive NFT collection in which each image (or BGAN, as some call them) was generated by a GAN. This was one of the very first instances of an AI-generated NFT. I became intrigued with the project and collected quite a few of them after reading some articles about them and perusing the collection. Most of the BGANs have a solid background, but about a fifth have a patterned background. An even smaller subset are actually morphing GIFs. There has recently been a growing trend to integrate PFP NFTs [Profile Picture NFTs] into the metaverse. This integration allows metaverse participants to walk around using their NFTs as pseudonymous identities. To import an NFT into the metaverse, some platforms require that the API for a given collection offer an endpoint that serves the NFT image with the background removed or transparent.

Now in theory, this would be as easy as asking Berk for the set of original background-less images that the GAN initially produced. The problem was that most of the BGANs were trained with colored backgrounds, so the outputs already had colors merged with the figures, erasing the possibility of separating them easily! Only a set of ~4000 BGANs existed that had no background, and that set wasn't fully included in the final published BGAN collection. The community considered manually removing the backgrounds, but at 11305 images, not counting each frame of the Hypes, it was a monumental proposal. There had to be a better way. If only a machine could somehow learn… how to remove a background.

I did some reading and discovered that U2Net, an open source model for salient object detection, was already being employed by some tools for background removal.

Example Results of U2Net Pretrained Background Removal Before

Example Results of U2Net Pretrained Background Removal After

Simple! We can just run this tool and it should remove the backgrounds from our BGANs, easy peasy, right?

U2Net Pretrained Results On BGAN# 764

As you can see, the pre-trained model is amazing at real world background removal, but it faltered when presented with AI-generated pixel art. The challenge was clear: we needed to retrain U2Net specifically for this type of image. The problem? I had never dabbled in machine learning before, so this was starting from square one.

The basic concepts seemed manageable, and after some research I laid out a plan.

Develop an algorithm to match any of the 4600 unlabeled background-less images with their associated BGANs.
Use a small set of images that had already had backgrounds manually removed to validate the matching algorithm.
Run the algorithm to collect any pairs that exist and label them.
Use this data set to retrain the U2Net model.
Use this new model to automatically remove the backgrounds from the entire collection.

To accomplish the first step of correlating and labeling any pairs that are in our set of background-less images, I employed a Mean Squared Error (MSE) function. As always, I jumped straight into the problem when I should have thought things through. I was using 1024x1024 resolution images, and trying to stuff them all into GPU memory using CuPy for MSE analysis. Being somewhat unfamiliar with this field, instead of leveraging frameworks, I basically wrote a very inefficient mechanism to batch load arrays into GPU VRAM. Mostly because I hate working with package dependencies and can’t be bothered to RTFM. Learn from my mistakes.

You don’t need to compute the MSE on a 1024x1024 image if the pixel-art is really 24x24. Just resize the image down and work with those images instead.
You may be able to get away with computing the MSE of the greyscale image, saving time versus computing it for the RGB version.
You don’t need GPU acceleration if you are working sequentially with small images. Trying to load data into and out of the GPU just makes things slower.
If you do need GPU acceleration, RTFM and use frameworks made for dealing with GPUs and your specific type of problem.
Use Conda or any environment manager really.
Let me repeat. USE CONDA.
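
For anyone who wants to see the first few tips in code, here is a minimal plain-Python sketch. It is an illustration under assumptions, not the script used for this project: images are assumed to be nested lists of (R, G, B) tuples, the helper names `to_grey`, `downscale`, and `mse` are made up for this example, and the greyscale weights are the common BT.601 luma coefficients. In practice a library such as Pillow or NumPy would handle the resizing and the arithmetic.

```python
def to_grey(image):
    """Convert rows of (R, G, B) tuples to rows of greyscale floats."""
    # Assumption: standard BT.601 luma weights; any consistent weighting works.
    return [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in image]


def downscale(grey, factor):
    """Shrink a greyscale image by averaging each factor x factor block."""
    height, width = len(grey), len(grey[0])
    small = []
    for top in range(0, height - factor + 1, factor):
        row = []
        for left in range(0, width - factor + 1, factor):
            block = [grey[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def mse(a, b):
    """Mean squared error between two equally sized greyscale images."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("images must have the same dimensions")
    total = sum((pa - pb) ** 2
                for row_a, row_b in zip(a, b)
                for pa, pb in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))
```

Shrinking first means the comparison touches a few hundred values per image instead of a million, which is why a plain CPU loop is fast enough here.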

Eventually, I had a working algorithm and I started tweaking. It was close, but not perfect. It seemed to have a propensity for matching dark backgrounds. It dawned on me that all we really care about is the BGAN itself, not the background. I adjusted the array computation so that the MSE was only calculated on the center third of each image. After some manual review of the results, I determined that an MSE of less than 30 indicated an exact match. Anything over that was close, but no cigar.
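
A rough sketch of the center-crop matching idea, in plain Python. It is an illustration under assumptions rather than the actual script: images are nested lists of greyscale values that are already the same size, and `center_third`, `mse`, and `best_match` are hypothetical helper names. The threshold of 30 is the value from the manual review described above.

```python
def center_third(image):
    """Return roughly the middle third of a 2D image (rows of numbers)."""
    height, width = len(image), len(image[0])
    top, left = height // 3, width // 3
    return [row[left:width - left] for row in image[top:height - top]]


def mse(a, b):
    """Mean squared error between two equally sized 2D images."""
    total = sum((pa - pb) ** 2
                for row_a, row_b in zip(a, b)
                for pa, pb in zip(row_a, row_b))
    return total / (len(a) * len(a[0]))


def best_match(target, candidates, threshold=30.0):
    """Return (name, score) for the closest candidate, or None.

    `candidates` maps a name to an image. Only the center of each image
    is compared, so differing backgrounds do not affect the score.
    """
    if not candidates:
        return None
    target_center = center_third(target)
    score, name = min((mse(target_center, center_third(image)), name)
                      for name, image in candidates.items())
    return (name, score) if score < threshold else None
```

Because the crop ignores the outer ring of pixels, a candidate with the right figure on a different background can still score 0.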

Examples of the top 3 matches for a target image with MSE scores listed.

With great excitement, I found that we now had 2649 image-label pairs. Naively, I assumed this was all we needed. This must be the ground truth I kept reading about! I loaded the images into the U2Net_Train.py script that had been kindly included in the repository, edited what needed editing, and just like that, my machine was learning! After a few hours of training, I eagerly loaded the newly trained model into the image-background-removal-tool and tried a test folder of 100 images.

First pass training results…. something’s off

Clearly, there was something wrong. Anyone familiar with this topic will probably spot it right away. The hair and glasses being removed was a breadcrumb that led me to the truth. Turns out, what I thought was “ground truth” was not exactly what the U2Net model considered ground truth. The model is attempting to predict or infer a background MASK. The background removal tool uses this mask to do the actual background removal. Ground truth for this model is actually a black and white image, with white denoting fully opaque, and black denoting fully transparent.
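
To make the mask idea concrete, here is a tiny sketch of how a mask could be combined with an image. This is an assumption about the general technique, not the removal tool's actual code: `apply_mask` is a hypothetical helper, the image is a nested list of (R, G, B) tuples, and the mask is a nested list of values from 0 (transparent) to 255 (opaque).

```python
def apply_mask(image, mask):
    """Attach each mask value to its pixel as an alpha channel.

    White in the mask (255) keeps the pixel fully opaque,
    black (0) makes it fully transparent.
    """
    return [[(r, g, b, alpha)
             for (r, g, b), alpha in zip(image_row, mask_row)]
            for image_row, mask_row in zip(image, mask)]
```

The model never outputs the cut-out figure itself, only the mask, which is why training it on cut-out figures produced such odd results.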

Mask versus Image with no background

Actual Image, BGAN# 6941

To address this, I wrote a program to extract the alpha channel from each image and then save it as a greyscale image, giving us the masks we needed. Additionally, to help avoid overfitting, I had the program rotate every 7th image a few degrees off axis, as well as its associated mask. This rotation was in addition to a random 90, 180, or 270 degree base rotation, giving the training set a little bit of variance from the highly orthogonal pixel art.
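
Here is a simplified sketch of that mask extraction and augmentation step, in plain Python. Treat it as an illustration with stated assumptions: images are nested lists of (R, G, B, A) tuples, `alpha_to_mask`, `rotate90`, and `augment` are hypothetical names, and only the 90/180/270 degree base rotation is shown. The extra few-degree tilt needs pixel interpolation and is best left to an image library, so it is omitted here.

```python
import random


def alpha_to_mask(rgba_image):
    """Keep only the alpha channel of rows of (R, G, B, A) tuples."""
    return [[a for _r, _g, _b, a in row] for row in rgba_image]


def rotate90(grid, turns):
    """Rotate a 2D grid clockwise by 90 degrees, `turns` times."""
    for _ in range(turns % 4):
        grid = [list(column) for column in zip(*grid[::-1])]
    return grid


def augment(pairs, every=7, seed=0):
    """Rotate every `every`-th (image, mask) pair by 90, 180, or 270 degrees.

    The same rotation is applied to the image and its mask so the
    pair stays aligned.
    """
    rng = random.Random(seed)
    result = []
    for index, (image, mask) in enumerate(pairs):
        if index % every == every - 1:
            turns = rng.choice([1, 2, 3])
            image, mask = rotate90(image, turns), rotate90(mask, turns)
        result.append((image, mask))
    return result
```

The important detail is that an image and its mask always receive the identical transform. If they drift apart, the model is trained on labels that no longer line up with the pixels.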

I played with masks that had an alpha channel value that varied from 0–255, which I called soft masking, and also with masks where it was binary, either 0 or 255, which I called hard masking. Additionally, I generated removal results using a purely color distance equation that simply compared each pixel to the value of the 0,0 pixel in terms of color distance.
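
The difference between these approaches can be sketched in a few lines of plain Python. This is a hedged illustration, not the code behind the comparison images: `harden` and `color_distance_mask` are hypothetical names, the cutoff of 128 and tolerance of 40 are arbitrary example values, and Euclidean distance in RGB is just one possible choice of color distance.

```python
import math


def harden(mask, cutoff=128):
    """Turn a soft mask (values 0-255) into a hard one (only 0 or 255)."""
    return [[255 if value >= cutoff else 0 for value in row] for row in mask]


def color_distance_mask(image, tolerance=40.0):
    """Build a mask by comparing every pixel to the top-left (0, 0) pixel.

    Pixels within `tolerance` of that reference color are treated as
    background (0); everything else is kept (255).
    """
    reference = image[0][0]
    return [[0 if math.dist(pixel, reference) <= tolerance else 255
             for pixel in row]
            for row in image]
```

The color distance approach needs no training at all, but it assumes the corner pixel is background and that the figure never uses a similar color, which is exactly where patterned backgrounds cause trouble.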

After making these adjustments and checkpointing a few generations of the model, we finally had some usable results!

BGAN 26 BG Removal Model Comparison

BGAN 72 BG Removal Model Comparison

BGAN 422 BG Removal Model Comparison

As you can see, the manual and soft models seem to be the most consistently accurate, but for automatic application to 8000+ images there was no single model that I felt confident enough in to trust completely.

As a result, I am looking to the BGAN community and the internet at large to join me in crowdsourcing the final sort. I have waded into the surprisingly nontrivial world of deploying a full stack interactive website, and am somewhere between proud and embarrassed to ask anyone interested in contributing to head over to http://bgans.rocks and start voting on which background removal you feel is most accurate!

I have a lot going on IRL, but I will clean up my code and create a GitHub repo, as I plan to follow up this post with a more technical writeup so other groups can potentially reuse this workflow on their own collections. Part 2 publication date TBD….