
My first attempt at writing a vision model from scratch

It all began with my curiosity about how machine learning models work after joining a data science competition at my university. We were tasked with creating a fraud classifier, and my team decided to use an ensemble of XGBoost (XGB) and LightGBM (LGB). We achieved around 28% accuracy, but unfortunately my team didn’t make it to the finals. Later, I noticed announcements that similar models could be built with neural networks. I realized these models could be exported to a .pt file, which was familiar to me from my research on using YOLOv8 models for inference. This led me to conclude that I could take the same approach with a YOLO-like vision model, which prompted my research on yolo.cpp.

One of my attempts to run the raw YOLOV8s.pt on bluesim

It always begins with wondering how: how am I supposed to do this when I don't know anything at all? It's not like I suddenly got some revelation in a dream. So, I started by drawing a flowchart of how I assumed things worked. I concluded that there were two main parts of the code I needed to write:

  1. Data loader

  2. Neural Networks

I assumed the data loader would be relatively easy to make, since I had learned a bunch of file handling in C during my first semester. The main problem was the data processor, that is, the neural network itself. Since I wanted to create everything from scratch, I decided not to use any external libraries like TensorFlow, NumPy, or PyTorch, and I wanted it to run fast (spoiler: it didn't). So, I decided to write it in C++.

I also changed my approach to writing code. I noticed that experienced people always write functions with their descriptions below them, so I tried that too. Surprisingly, it made the code more readable than just writing the code without explanations.

First, I researched what dataset the YOLO pretrained model uses, and I found out it’s COCO. Without fully understanding what COCO is or how to use it, I tried to use it blindly. I soon discovered that the dataset is actually 25 GB, and I realized it wouldn't be possible to download it due to my limited resources.

Then, I searched for a more viable dataset and found an interesting one called CIFAR-10. It consists of 10 classes of 32x32 color images and is less than 500 MB. So, I began to write my CIFAR-10 loader, the implementation that goes with cifar10_loader.h:

#include "cifar10_loader.h"
#include <cstdio>
#include <cstdlib>
#include <vector>

// Loads one CIFAR-10 binary batch file. Each record is 1 label byte followed by
// 3072 pixel bytes stored as three 32x32 planes: red, then green, then blue.
std::vector<CIFAR10Image> load_cifar10_bin(const char* data_path) {
    std::vector<CIFAR10Image> dataset(NUM_IMAGES);
    FILE* file = fopen(data_path, "rb");

    if (!file) {
        fprintf(stderr, "Error opening file: %s\n", data_path);
        exit(1);
    }

    for (int i = 0; i < NUM_IMAGES; ++i) {
        unsigned char label;
        unsigned char buffer[IMAGE_BYTES];

        // Stop with an error if the file is shorter than expected
        if (fread(&label, LABEL_BYTES, 1, file) != 1 ||
            fread(buffer, IMAGE_BYTES, 1, file) != 1) {
            fprintf(stderr, "Unexpected end of file: %s\n", data_path);
            fclose(file);
            exit(1);
        }

        cv::Mat img(IMAGE_SIZE, IMAGE_SIZE, CV_8UC3);

        // OpenCV stores pixels as BGR, so the first (red) plane goes to channel 2
        for (int row = 0; row < IMAGE_SIZE; ++row) {
            for (int col = 0; col < IMAGE_SIZE; ++col) {
                img.at<cv::Vec3b>(row, col)[2] = buffer[row * IMAGE_SIZE + col]; // Red
                img.at<cv::Vec3b>(row, col)[1] = buffer[IMAGE_SIZE * IMAGE_SIZE + row * IMAGE_SIZE + col]; // Green
                img.at<cv::Vec3b>(row, col)[0] = buffer[2 * IMAGE_SIZE * IMAGE_SIZE + row * IMAGE_SIZE + col]; // Blue
            }
        }

        dataset[i].image = img;
        dataset[i].label = (int)label;
    }

    fclose(file);
    return dataset;
}
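
To make the file layout concrete, here is a small sketch that parses a single CIFAR-10 record without OpenCV. In the binary version of the dataset, each record is one label byte followed by 1024 red, 1024 green, and 1024 blue bytes. The names `Cifar10Record`, `parse_record`, and the `k...` constants are made up for this illustration and are not part of my loader; the output is interleaved as B, G, R because that is the channel order OpenCV expects.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Layout of one record in a CIFAR-10 binary batch file:
// 1 label byte, then three 32x32 planes (red, green, blue), each row-major.
constexpr std::size_t kImageSize   = 32;
constexpr std::size_t kPlaneBytes  = kImageSize * kImageSize;  // 1024
constexpr std::size_t kImageBytes  = 3 * kPlaneBytes;          // 3072
constexpr std::size_t kRecordBytes = 1 + kImageBytes;          // 3073

// Hypothetical container for this sketch (not the CIFAR10Image struct).
struct Cifar10Record {
    int label;
    std::vector<std::uint8_t> bgr;  // interleaved B, G, R for each pixel
};

// Converts one raw record of kRecordBytes bytes into interleaved BGR pixels.
Cifar10Record parse_record(const std::uint8_t* raw) {
    Cifar10Record rec;
    rec.label = raw[0];
    rec.bgr.resize(kImageBytes);

    const std::uint8_t* red   = raw + 1;
    const std::uint8_t* green = red + kPlaneBytes;
    const std::uint8_t* blue  = green + kPlaneBytes;

    for (std::size_t p = 0; p < kPlaneBytes; ++p) {
        rec.bgr[3 * p + 0] = blue[p];
        rec.bgr[3 * p + 1] = green[p];
        rec.bgr[3 * p + 2] = red[p];
    }
    return rec;
}
```

A whole batch file is just these records back to back, so record `i` starts at byte offset `i * kRecordBytes`.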

That worked great. With the loader in place, I could use it from main.cpp. Now it was time to face the second problem: the data processor.

I wrote these three main functions inside tiny_network.c, then sped them up with some OpenMP and added comments on the important parts of the neural network:


NeuralNetwork::NeuralNetwork(const std::vector<int>& layer_sizes) : layer_sizes(layer_sizes) {
    // Initialize weights and biases randomly
    for (size_t i = 1; i < layer_sizes.size(); ++i) {
        int rows = layer_sizes[i];
        int cols = layer_sizes[i - 1];
        std::vector<std::vector<float>> layer_weights(rows, std::vector<float>(cols));
        std::vector<float> layer_biases(rows);
        // Initialize weights with small random values and biases with zeros
        for (int j = 0; j < rows; ++j) {
            std::generate(layer_weights[j].begin(), layer_weights[j].end(), []() { return static_cast<float>(rand()) / RAND_MAX - 0.5f; });
            layer_biases[j] = 0.0f;
        }
        weights.push_back(layer_weights);
        biases.push_back(layer_biases);
    }
}

std::vector<float> NeuralNetwork::forward(const std::vector<float>& input) {
    activations.clear();
    z_values.clear();
    activations.push_back(input);

    std::vector<float> activation = input;
    for (size_t i = 0; i < weights.size(); ++i) {
        std::vector<float> z(layer_sizes[i + 1], 0.0f);

        #pragma omp parallel for // Parallelize the outer loop
        for (size_t j = 0; j < layer_sizes[i + 1]; ++j) {
            for (size_t k = 0; k < layer_sizes[i]; ++k) {
                z[j] += weights[i][j][k] * activation[k];
            }
            z[j] += biases[i][j];
            z[j] = 1.0 / (1.0 + std::exp(-z[j])); // Sigmoid activation
        }
        z_values.push_back(z);
        activation = z;
        activations.push_back(activation);
    }
    return activation;
}

void NeuralNetwork::backward(const std::vector<float>& input, const std::vector<float>& target, float learning_rate) {
    // `input` is not used here: forward() already stored it in activations[0]
    std::vector<float> output_gradients = activations.back();
    for (size_t i = 0; i < output_gradients.size(); ++i) {
        output_gradients[i] -= target[i];
    }

    std::vector<std::vector<float>> hidden_gradients(weights.size());
    for (int l = static_cast<int>(weights.size()) - 1; l >= 0; --l) {
        hidden_gradients[l].resize(layer_sizes[l + 1], 0.0f);

        // Error term of each neuron: incoming gradient times the sigmoid derivative
        #pragma omp parallel for // Parallelize the outer loop
        for (int j = 0; j < layer_sizes[l + 1]; ++j) {
            float gradient = output_gradients[j] * activations[l + 1][j] * (1.0f - activations[l + 1][j]);
            hidden_gradients[l][j] = gradient;
        }

        // Propagate the error to the previous layer BEFORE updating the weights,
        // so the weights from the forward pass are the ones used
        if (l > 0) {
            std::vector<float> next_output_gradients(layer_sizes[l], 0.0f);

            #pragma omp parallel for // Parallelize the outer loop
            for (int k = 0; k < layer_sizes[l]; ++k) {
                for (int j = 0; j < layer_sizes[l + 1]; ++j) {
                    next_output_gradients[k] += hidden_gradients[l][j] * weights[l][j][k];
                }
            }
            output_gradients = next_output_gradients;
        }

        // Gradient-descent update. Each thread owns one row j, so no atomics are needed
        #pragma omp parallel for // Parallelize the outer loop
        for (int j = 0; j < layer_sizes[l + 1]; ++j) {
            float gradient = hidden_gradients[l][j];
            for (int k = 0; k < layer_sizes[l]; ++k) {
                weights[l][j][k] -= learning_rate * gradient * activations[l][k];
            }
            biases[l][j] -= learning_rate * gradient;
        }
    }
}

Here is what the key lines do.

NeuralNetwork::NeuralNetwork(const std::vector<int>& layer_sizes) : layer_sizes(layer_sizes) {

This line defines the constructor for the NeuralNetwork class. It takes a vector of integers, one per layer, giving the number of neurons in each layer.

for (size_t i = 1; i < layer_sizes.size(); ++i) {

This loop initializes the weights and biases for every layer after the input layer.

std::vector<std::vector<float>> layer_weights(rows, std::vector<float>(cols));
std::vector<float> layer_biases(rows);

These lines declare the weights and biases for one layer: one row of weights and one bias per neuron, with one weight per neuron in the previous layer.

std::generate(layer_weights[j].begin(), layer_weights[j].end(), []() { return static_cast<float>(rand()) / RAND_MAX - 0.5f; });
layer_biases[j] = 0.0f;

These lines initialize the weights with random values between -0.5 and 0.5 and the biases with zeros.

weights.push_back(layer_weights);
biases.push_back(layer_biases);

These lines add the initialized weights and biases to the respective vectors.

std::vector<float> NeuralNetwork::forward(const std::vector<float>& input) {

This line defines the forward function, which performs a forward pass through the neural network.

std::vector<float> activation = input;

This line initializes the activation vector with the input.

#pragma omp parallel for // Parallelize the outer loop

This line uses OpenMP to parallelize the outer loop, so each thread computes a different set of neurons.

z[j] += weights[i][j][k] * activation[k];

This line accumulates the weighted sum of the inputs for neuron j.

z[j] = 1.0 / (1.0 + std::exp(-z[j])); // Sigmoid activation

This line applies the sigmoid activation function.

return activation;

This line returns the final activation vector, which is the output of the network.

void NeuralNetwork::backward(const std::vector<float>& input, const std::vector<float>& target, float learning_rate) {

This line defines the backward function, which performs backpropagation to update the weights and biases.

output_gradients[i] -= target[i];

This line calculates the output gradients by subtracting the target from the activations, which is the gradient of the squared error with respect to the output.

#pragma omp parallel for // Parallelize the outer loop

This line uses OpenMP to parallelize the outer loop for performance.

float gradient = output_gradients[j] * activations[l + 1][j] * (1.0f - activations[l + 1][j]);

This line calculates the error term for neuron j of the current layer (output or hidden): the incoming gradient multiplied by the sigmoid derivative a * (1 - a).

weights[l][j][k] -= learning_rate * gradient * activations[l][k];

This line updates the weights using the calculated gradient and the learning rate.

biases[l][j] -= learning_rate * gradient;

This line updates the biases using the calculated gradient and the learning rate.

next_output_gradients[k] += hidden_gradients[l][j] * weights[l][j][k];

This line propagates the error terms back through the weights to get the gradients for the previous layer, which is the next one the backward loop processes.

output_gradients = next_output_gradients;

This line makes those propagated gradients the incoming gradients for the next iteration of the loop.
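
One way to convince yourself that the gradient expression in backward() is right is to compare it with a numerical derivative on a single sigmoid neuron. The sketch below assumes a squared-error loss of 0.5 * (a - y)^2, which is the loss that matches subtracting the target from the activation. All function names here are made up for the illustration and are not part of the NeuralNetwork class.

```cpp
#include <cmath>

// Sigmoid activation, the same function forward() applies to each neuron.
double sigmoid(double z) { return 1.0 / (1.0 + std::exp(-z)); }

// Squared-error loss of one neuron: 0.5 * (a - y)^2 with a = sigmoid(w * x + b).
double neuron_loss(double w, double b, double x, double y) {
    double a = sigmoid(w * x + b);
    return 0.5 * (a - y) * (a - y);
}

// Analytic dL/dw, written the way backward() computes it:
// (a - y) is the output gradient, a * (1 - a) is the sigmoid derivative,
// and x plays the role of activations[l][k].
double analytic_grad_w(double w, double b, double x, double y) {
    double a = sigmoid(w * x + b);
    double delta = (a - y) * a * (1.0 - a);
    return delta * x;
}

// Numerical estimate of dL/dw using central differences.
double numeric_grad_w(double w, double b, double x, double y, double eps = 1e-5) {
    return (neuron_loss(w + eps, b, x, y) - neuron_loss(w - eps, b, x, y)) / (2.0 * eps);
}
```

Calling `analytic_grad_w(0.3, -0.1, 0.8, 1.0)` and `numeric_grad_w(0.3, -0.1, 0.8, 1.0)` should give values that agree to several decimal places. If they don't, the backward pass has a bug.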

As you can see, there is no GPU code here. Everything relies on the CPU, since I didn’t have a GPU to test on, and I assume I would need to learn some CUDA C to get the most out of one. Next is main.cpp, where I added comments so I don’t have to explain it line by line.

#include <algorithm>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>
#include <opencv2/opencv.hpp>
#include "cifar10_loader.h"
#include "TinyNn_optimized.h"
#include "preprocess.h"
#define NUM_DATA 5
#define TEST_PATH "dataset/test_batch.bin"
#define LEARNING_RATE 0.5
#define NUM_EPOCHS 1000
#define BATCH_SIZE 64 

int main() {
    //loading the dataset, boring stuffs
    std::vector<CIFAR10Image> combined_dataset;
    for(int i = 0; i < NUM_DATA; i++){
        std::string data_path = "dataset/data_batch_" + std::to_string(i + 1) + ".bin";
        
        // Example of using the data path
        std::cout << "Processing file: " << data_path << std::endl;

        std::vector<CIFAR10Image> dataset = load_cifar10_bin(data_path.c_str());

        // Append the loaded dataset to the combined_dataset vector
        if (!dataset.empty()) {
            combined_dataset.insert(combined_dataset.end(), dataset.begin(), dataset.end());
        } else {
            std::cerr << "Failed to load dataset from " << data_path << std::endl;
        }


    }
    std::cout << "Total number of images in the combined dataset: " << combined_dataset.size() << std::endl;
   

    //function demo
    //const char* data_path = "dataset/data_batch_1.bin";
    //std::vector<CIFAR10Image> dataset = load_cifar10_bin(data_path);

    //if (!combined_dataset.empty()) {
        
    //    cv::imshow("CIFAR-10 Image", combined_dataset[0].image);
    //    printf("Label: %d\n", combined_dataset[0].label);
    //    cv::waitKey(0);
    //}
   


    //pre-cooking || preprocess the dataset
    std::vector<cv::Mat> images;
    std::vector<int> labels;
    for (const auto& item : combined_dataset) {
        images.push_back(preprocess(item.image));
        labels.push_back(item.label);
    }
   
    // Split into train and validation sets (80% train, 20% validation)
    int num_train = static_cast<int>(0.8 * images.size());
    std::vector<cv::Mat> train_images(images.begin(), images.begin() + num_train);
    std::vector<int> train_labels(labels.begin(), labels.begin() + num_train);
    std::vector<cv::Mat> val_images(images.begin() + num_train, images.end());
    std::vector<int> val_labels(labels.begin() + num_train, labels.end());



    //cooking stuffs
    //layer_sizes lists, in order: input, first hidden layer, second hidden layer, output layer.
    //3072 => 32 x 32 x 3 values per CIFAR-10 image after flattening.
    //128 and 64 are just wild guesses for the hidden layers; there may be some math behind
    //choosing them that I don't understand yet.
    //10 => one confidence score per class, so with CIFAR-100 it would be 100 instead of 10.
    //my guess is that a hidden layer shouldn't be narrower than the output layer, so the hidden
    //sizes would need adjusting if I used a different dataset.
    std::vector<int> layer_sizes = {3072, 128, 64, 10}; //input -> hidden1 -> hidden2 -> output
    NeuralNetwork nn(layer_sizes);



    // Training loop
    for (int epoch = 0; epoch < NUM_EPOCHS; ++epoch) {
        int num_correct = 0;

        // Train on the preprocessed training split only. Weights are updated after
        // every sample; BATCH_SIZE just groups the samples.
        for (size_t start = 0; start < train_images.size(); start += BATCH_SIZE) {
            size_t end = std::min(start + BATCH_SIZE, train_images.size());

            std::vector<std::vector<float>> batch_inputs;
            std::vector<std::vector<float>> batch_targets;

            for (size_t i = start; i < end; ++i) {
                // Flatten the image into rows * cols * channels floats (3072 for CIFAR-10).
                // Assumes preprocess() returns a continuous CV_32F Mat.
                const cv::Mat& img = train_images[i];
                size_t num_values = img.total() * img.channels();
                std::vector<float> input(num_values);
                std::memcpy(input.data(), img.ptr<float>(), num_values * sizeof(float));
                batch_inputs.push_back(input);

                std::vector<float> target(layer_sizes.back(), 0.0f);
                target[train_labels[i]] = 1.0f; // One-hot encoding
                batch_targets.push_back(target);
            }

            for (size_t i = 0; i < batch_inputs.size(); ++i) {
                std::vector<float> output = nn.forward(batch_inputs[i]);
                nn.backward(batch_inputs[i], batch_targets[i], LEARNING_RATE);

                int predicted_label = std::max_element(output.begin(), output.end()) - output.begin();
                if (predicted_label == train_labels[start + i]) {
                    num_correct++;
                }
            }
        }

        std::cout << "Epoch " << epoch << ": Accuracy = " << (num_correct / static_cast<float>(train_images.size())) << std::endl;
    }

    // Loading and testing on the test set
    std::vector<CIFAR10Image> testset = load_cifar10_bin(TEST_PATH);
    std::vector<cv::Mat> test_images;
    std::vector<int> test_labels;
    for (const auto& item : testset) {
        test_images.push_back(preprocess(item.image));
        test_labels.push_back(item.label);
    }
 
    int num_correct_test = 0;
    for (size_t i = 0; i < test_images.size(); ++i) {
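        // Use the preprocessed test image (not the training data).
        // Assumes preprocess() returns a continuous CV_32F Mat.
        const cv::Mat& img = test_images[i];
        size_t num_values = img.total() * img.channels();
        std::vector<float> input(num_values);
        std::memcpy(input.data(), img.ptr<float>(), num_values * sizeof(float));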

        std::vector<float> output = nn.forward(input);

        int predicted_label = std::max_element(output.begin(), output.end()) - output.begin();
        if (predicted_label == test_labels[i]) {
            num_correct_test++;
        }
    }
    std::cout << "Test Accuracy = " << (num_correct_test / static_cast<float>(test_images.size())) << std::endl;

    return 0;
}

These pieces came together pretty well, but I can’t optimize much further using only the CPU.
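
For a rough sense of how much work the CPU has to do, this sketch counts the trainable parameters and the multiply-adds in one forward pass for a fully connected network like the {3072, 128, 64, 10} one above. The function names are made up for this illustration, and it is only an estimate: it ignores the sigmoid calls and the backward pass, which costs more than the forward pass.

```cpp
#include <cstddef>
#include <vector>

// Number of trainable parameters: one weight per connection plus one bias per neuron.
std::size_t count_parameters(const std::vector<int>& layer_sizes) {
    std::size_t total = 0;
    for (std::size_t i = 1; i < layer_sizes.size(); ++i) {
        std::size_t in  = static_cast<std::size_t>(layer_sizes[i - 1]);
        std::size_t out = static_cast<std::size_t>(layer_sizes[i]);
        total += in * out + out;
    }
    return total;
}

// Multiply-adds needed for one forward pass over one image (one per weight).
std::size_t forward_multiply_adds(const std::vector<int>& layer_sizes) {
    std::size_t total = 0;
    for (std::size_t i = 1; i < layer_sizes.size(); ++i) {
        total += static_cast<std::size_t>(layer_sizes[i - 1]) *
                 static_cast<std::size_t>(layer_sizes[i]);
    }
    return total;
}
```

For {3072, 128, 64, 10} this gives 402,250 parameters and 402,048 multiply-adds per image. Multiply that by the number of training images and by the number of epochs, and the count gets large quickly.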

running a training phase

See you all in my next project.