# My first attempt to write a vision model from scratch

By [kyomoto](https://paragraph.com/@kyomoto) · 2024-07-27

---

It all began with my curiosity about how machine learning models work after joining a data science competition at my university. We were tasked with creating a fraud classifier model, and my team decided to use an ensemble model consisting of XGBoost (XGB) and LightGBM (LGB). We achieved around 28% accuracy, but unfortunately, my team didn’t make it to the finals. Later, I noticed announcements that similar models could be built with neural networks. I realized these models could be exported to a .pt file, which was familiar to me from my research on using YOLOv8 models for inference. This led me to conclude that I could apply a similar approach to a vision model similar to YOLO, prompting my research on yolo.cpp.

![One of my attempts to run raw YOLOV8s.pt on bluesim](https://storage.googleapis.com/papyrus_images/a00d1a13d0fbe1e9dc14686c2beab62822dcd583e9e9741ed444bf9a729b6266.png)

One of my attempts to run raw YOLOV8s.pt on bluesim

It always begins with wondering how—how am I supposed to do this when I don't know anything at all? It's not like I suddenly got some revelation in a dream. So, I started by drawing a flowchart of how I assumed things worked. I concluded that there are two main parts of the code I need to use:

1.  Data loader
    
2.  Neural Networks
    

I assumed the data loader would be relatively easy to make since I learned a bunch of file handling in C during my first semester. However, the main problem was the data processor. Since I wanted to create everything from scratch, I decided not to use any external libraries like TensorFlow, NumPy, or PyTorch, and I wanted to make it run fast (spoiler: it didn't). So, I decided to make it in C++.

I also changed my approach to writing code. I noticed that experienced people always write functions with their descriptions below them, so I tried that too. Surprisingly, it made the code more readable than just writing the code without explanations.

First, I researched what dataset the YOLO pretrained model uses, and I found out it’s COCO. Without fully understanding what COCO is or how to use it, I tried to use it blindly. I soon discovered that the dataset is actually 25 GB, and I realized it wouldn't be possible to download it due to my limited resources.

Then, I searched for a more viable dataset and found an interesting one called CIFAR-10. It consists of 10 classes of 32x32 images and is less than 500 MB. So, I began to write my CIFAR-10 loader, the implementation behind the `cifar10_loader.h` header.
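
The loader depends on a few constants (`NUM_IMAGES`, `LABEL_BYTES`, `IMAGE_BYTES`, `IMAGE_SIZE`) whose definitions live in the header and aren't shown in this post. As a rough sketch, assuming the standard CIFAR-10 binary layout (10,000 records per batch file, each one a label byte followed by 1,024 red, 1,024 green, and 1,024 blue bytes), they would look something like this. `NUM_CHANNELS`, `RECORD_BYTES`, `Cifar10Record`, and `decode_record` are hypothetical names for illustration, and the sketch uses a plain byte vector instead of `cv::Mat` so it stands on its own:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Assumed values, based on the standard CIFAR-10 binary format.
constexpr int IMAGE_SIZE = 32;                                       // images are 32x32
constexpr int NUM_CHANNELS = 3;                                      // R, G, B planes
constexpr int LABEL_BYTES = 1;                                       // one label byte per record
constexpr int IMAGE_BYTES = IMAGE_SIZE * IMAGE_SIZE * NUM_CHANNELS;  // 3072
constexpr int RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES;              // 3073
constexpr int NUM_IMAGES = 10000;                                    // records per batch file

// Hypothetical OpenCV-free stand-in for CIFAR10Image.
struct Cifar10Record {
    int label;                          // class index, 0..9
    std::vector<unsigned char> pixels;  // interleaved B, G, R per pixel (OpenCV order)
};

// Converts one raw record (label + planar RGB) into interleaved BGR.
Cifar10Record decode_record(const std::vector<unsigned char>& raw) {
    if (raw.size() != static_cast<std::size_t>(RECORD_BYTES)) {
        throw std::invalid_argument("a CIFAR-10 record must be 3073 bytes");
    }
    const int plane = IMAGE_SIZE * IMAGE_SIZE;  // bytes per color plane
    Cifar10Record rec;
    rec.label = raw[0];
    rec.pixels.resize(IMAGE_BYTES);
    for (int idx = 0; idx < plane; ++idx) {
        const unsigned char r = raw[LABEL_BYTES + idx];
        const unsigned char g = raw[LABEL_BYTES + plane + idx];
        const unsigned char b = raw[LABEL_BYTES + 2 * plane + idx];
        rec.pixels[3 * idx + 0] = b;
        rec.pixels[3 * idx + 1] = g;
        rec.pixels[3 * idx + 2] = r;
    }
    return rec;
}
```

With these values one batch file should be 10,000 × 3,073 = 30,730,000 bytes, which makes for a quick sanity check on the download.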

    #include "cifar10_loader.h"
    #include <stdio.h>
    #include <stdlib.h>
    #include <vector>
    #include <opencv2/core.hpp>
    
    // Loads one CIFAR-10 binary batch. Each record is a label byte followed by
    // three 32x32 planes stored in R, G, B order.
    std::vector<CIFAR10Image> load_cifar10_bin(const char* data_path) {
        std::vector<CIFAR10Image> dataset(NUM_IMAGES);
        FILE* file = fopen(data_path, "rb");
    
        if (!file) {
            fprintf(stderr, "Error opening file: %s\n", data_path);
            exit(1);
        }
    
        const int plane = IMAGE_SIZE * IMAGE_SIZE; // bytes per color plane
        for (int i = 0; i < NUM_IMAGES; ++i) {
            unsigned char label;
            unsigned char buffer[IMAGE_BYTES];
    
            // Stop with an error if the file is shorter than expected
            if (fread(&label, LABEL_BYTES, 1, file) != 1 ||
                fread(buffer, IMAGE_BYTES, 1, file) != 1) {
                fprintf(stderr, "Unexpected end of file: %s\n", data_path);
                fclose(file);
                exit(1);
            }
    
            cv::Mat img(IMAGE_SIZE, IMAGE_SIZE, CV_8UC3);
    
            // OpenCV stores pixels as BGR, so the red plane (first in the file) goes to channel 2
            for (int row = 0; row < IMAGE_SIZE; ++row) {
                for (int col = 0; col < IMAGE_SIZE; ++col) {
                    const int idx = row * IMAGE_SIZE + col;
                    img.at<cv::Vec3b>(row, col)[0] = buffer[2 * plane + idx]; // Blue
                    img.at<cv::Vec3b>(row, col)[1] = buffer[plane + idx];     // Green
                    img.at<cv::Vec3b>(row, col)[2] = buffer[idx];             // Red
                }
            }
    
            dataset[i].image = img;
            dataset[i].label = (int)label;
        }
    
        fclose(file);
        return dataset;
    }
    

And it worked great. With that in place, I could use it in main.cpp. Now it was time to face the second problem: the data processor.

I wrote these three main functions inside tiny\_network.c, then sped them up with OpenMP and added comments explaining the important parts of the neural network:

    
    NeuralNetwork::NeuralNetwork(const std::vector<int>& layer_sizes) : layer_sizes(layer_sizes) {
        // Initialize weights and biases randomly
        for (size_t i = 1; i < layer_sizes.size(); ++i) {
            int rows = layer_sizes[i];
            int cols = layer_sizes[i - 1];
            std::vector<std::vector<float>> layer_weights(rows, std::vector<float>(cols));
            std::vector<float> layer_biases(rows);
            // Initialize weights with small random values and biases with zeros
            for (int j = 0; j < rows; ++j) {
                std::generate(layer_weights[j].begin(), layer_weights[j].end(), []() { return static_cast<float>(rand()) / RAND_MAX - 0.5f; });
                layer_biases[j] = 0.0f;
            }
            weights.push_back(layer_weights);
            biases.push_back(layer_biases);
        }
    }
    
    std::vector<float> NeuralNetwork::forward(const std::vector<float>& input) {
        activations.clear();
        z_values.clear();
        activations.push_back(input);
    
        std::vector<float> activation = input;
        for (size_t i = 0; i < weights.size(); ++i) {
            std::vector<float> z(layer_sizes[i + 1], 0.0f);
    
            #pragma omp parallel for // Parallelize the outer loop
            for (size_t j = 0; j < layer_sizes[i + 1]; ++j) {
                for (size_t k = 0; k < layer_sizes[i]; ++k) {
                    z[j] += weights[i][j][k] * activation[k];
                }
                z[j] += biases[i][j];
                z[j] = 1.0 / (1.0 + std::exp(-z[j])); // Sigmoid activation
            }
            z_values.push_back(z);
            activation = z;
            activations.push_back(activation);
        }
        return activation;
    }
    
    void NeuralNetwork::backward(const std::vector<float>& input, const std::vector<float>& target, float learning_rate) {
        std::vector<float> output_gradients = activations.back();
        for (size_t i = 0; i < output_gradients.size(); ++i) {
            output_gradients[i] -= target[i];
        }
    
        std::vector<std::vector<float>> hidden_gradients(weights.size());
        for (int l = weights.size() - 1; l >= 0; --l) {
            hidden_gradients[l].resize(layer_sizes[l + 1], 0.0f);
    
            // Gradient of each neuron: incoming gradient times the sigmoid derivative a * (1 - a)
            #pragma omp parallel for // Parallelize the outer loop
            for (size_t j = 0; j < layer_sizes[l + 1]; ++j) {
                float gradient = output_gradients[j] * activations[l + 1][j] * (1.0f - activations[l + 1][j]);
                hidden_gradients[l][j] = gradient;
            }
    
            // Propagate the gradient to the previous layer before the weights change,
            // so backpropagation uses the same weights as the forward pass
            std::vector<float> next_output_gradients(layer_sizes[l], 0.0f);
            if (l > 0) {
                #pragma omp parallel for // Parallelize the outer loop
                for (size_t k = 0; k < layer_sizes[l]; ++k) {
                    for (size_t j = 0; j < layer_sizes[l + 1]; ++j) {
                        next_output_gradients[k] += hidden_gradients[l][j] * weights[l][j][k];
                    }
                }
            }
    
            // Each thread owns its own row j of weights and biases, so no atomics are needed
            #pragma omp parallel for // Parallelize the outer loop
            for (size_t j = 0; j < layer_sizes[l + 1]; ++j) {
                const float gradient = hidden_gradients[l][j];
                for (size_t k = 0; k < layer_sizes[l]; ++k) {
                    weights[l][j][k] -= learning_rate * gradient * activations[l][k];
                }
                biases[l][j] -= learning_rate * gradient;
            }
    
            if (l > 0) {
                output_gradients = next_output_gradients;
            }
        }
    }
    

    NeuralNetwork::NeuralNetwork(const std::vector<int>& layer_sizes) : layer_sizes(layer_sizes) {

This line defines the constructor for the `NeuralNetwork` class. It takes a vector of integers, one per layer, giving the number of neurons in each layer of the network.

    for (size_t i = 1; i < layer_sizes.size(); ++i) {

This loop initializes the weights and biases for each layer. It starts at index 1 because the input layer has no weights of its own.

    std::vector<std::vector<float>> layer_weights(rows, std::vector<float>(cols));
    std::vector<float> layer_biases(rows);

These lines declare the weight matrix (one row per neuron, one column per input from the previous layer) and the bias vector for the current layer.

    std::generate(layer_weights[j].begin(), layer_weights[j].end(), []() { return static_cast<float>(rand()) / RAND_MAX - 0.5f; });
    layer_biases[j] = 0.0f;

These lines fill the weights with random values between -0.5 and 0.5 and set the biases to zero.

    weights.push_back(layer_weights);
    biases.push_back(layer_biases);

These lines add the initialized weights and biases to the respective vectors.
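
For comparison, here is a minimal sketch of a seeded initializer that scales the random range by the layer size (Xavier/Glorot uniform initialization). This is a swapped-in technique, not what the constructor in this post does, and `init_layer_weights` and its `seed` parameter are hypothetical names:

```cpp
#include <cmath>
#include <random>
#include <vector>

// Returns a rows x cols weight matrix drawn uniformly from [-limit, limit),
// where limit = sqrt(6 / (fan_in + fan_out)) (Xavier/Glorot uniform).
// Here fan_in is cols (inputs per neuron) and fan_out is rows (neurons).
std::vector<std::vector<float>> init_layer_weights(int rows, int cols, unsigned seed) {
    const float limit = std::sqrt(6.0f / static_cast<float>(rows + cols));
    std::mt19937 rng(seed);  // a fixed seed makes runs reproducible
    std::uniform_real_distribution<float> dist(-limit, limit);

    std::vector<std::vector<float>> weights(rows, std::vector<float>(cols));
    for (auto& row : weights) {
        for (auto& w : row) {
            w = dist(rng);
        }
    }
    return weights;
}
```

For the 3072 → 128 layer the limit works out to roughly 0.043, much narrower than ±0.5, which may help keep the sigmoid away from its flat regions at the start of training.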

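    std::vector<float> NeuralNetwork::forward(const std::vector<float>& input) {

This line defines the `forward` function, which performs a forward pass through the neural network.

    std::vector<float> activation = input;

This line initializes the activation vector with the input. It is overwritten with each layer's output as the loop moves through the network.

    #pragma omp parallel for // Parallelize the outer loop

This line uses OpenMP to split the loop over output neurons across CPU threads. Each thread writes to its own `z[j]`, so the iterations don't interfere with each other.

    z[j] += weights[i][j][k] * activation[k];

This line accumulates the weighted sum of the inputs for neuron `j`.

    z[j] = 1.0 / (1.0 + std::exp(-z[j])); // Sigmoid activation

This line applies the sigmoid activation function, squashing the sum into the range 0 to 1. After this line `z` holds the activated value, so `z_values` ends up storing activations rather than raw sums.

    return activation;

This line returns the final activation vector, which is the network's output.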
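
The forward pass for a single layer can be isolated into a small standalone function. This is a minimal sketch without OpenMP; `sigmoid` and `dense_forward` are hypothetical names, not functions from the class above:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid: squashes any real number into (0, 1).
float sigmoid(float x) {
    return 1.0f / (1.0f + std::exp(-x));
}

// One fully connected layer:
// output[j] = sigmoid(sum over k of weights[j][k] * input[k], plus biases[j]).
std::vector<float> dense_forward(const std::vector<std::vector<float>>& weights,
                                 const std::vector<float>& biases,
                                 const std::vector<float>& input) {
    std::vector<float> output(weights.size());
    for (std::size_t j = 0; j < weights.size(); ++j) {
        float z = biases[j];  // start from the bias, then add the weighted inputs
        for (std::size_t k = 0; k < input.size(); ++k) {
            z += weights[j][k] * input[k];
        }
        output[j] = sigmoid(z);
    }
    return output;
}
```

With weights `{{1, -1}}`, a zero bias, and input `{2, 2}`, the weighted sum is 0, so the output is 0.5, the midpoint of the sigmoid.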

    void NeuralNetwork::backward(const std::vector<float>& input, const std::vector<float>& target, float learning_rate) {

This line defines the `backward` function, which performs backpropagation to update the weights and biases.

    output_gradients[i] -= target[i];

This line calculates the output gradients by subtracting the target from the activations. That difference is the derivative of the squared-error loss `0.5 * (a - y)^2` with respect to the output `a`.

    #pragma omp parallel for // Parallelize the outer loop

This line uses OpenMP to parallelize the outer loop for performance.

    float gradient = output_gradients[j] * activations[l + 1][j] * (1.0f - activations[l + 1][j]);

This line calculates each neuron's gradient: the gradient arriving from the layer above multiplied by the sigmoid derivative, `a * (1 - a)`. It runs for the output layer as well as the hidden layers.

    weights[l][j][k] -= learning_rate * gradient * activations[l][k];

This line updates the weights using the calculated gradient, the input that fed each weight, and the learning rate.

    biases[l][j] -= learning_rate * gradient;

This line updates the biases using the calculated gradient and the learning rate.

    next_output_gradients[k] += hidden_gradients[l][j] * weights[l][j][k];

This line calculates the gradients passed back to the previous layer, the one closer to the input.

    output_gradients = next_output_gradients;

This line updates the output gradients for the next iteration of the loop.
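
A common way to sanity-check backpropagation formulas like these is to compare them against finite differences. Below is a minimal sketch for a single sigmoid neuron with squared-error loss, assuming the same gradient formula as `backward`; `neuron_loss`, `analytic_grad`, and `numeric_grad` are hypothetical helper names, and the sketch uses `double` to keep rounding noise low:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Loss of a single sigmoid neuron with squared error: 0.5 * (a - y)^2.
double neuron_loss(const std::vector<double>& w, double b,
                   const std::vector<double>& x, double y) {
    double z = b;
    for (std::size_t k = 0; k < w.size(); ++k) z += w[k] * x[k];
    const double a = 1.0 / (1.0 + std::exp(-z));
    return 0.5 * (a - y) * (a - y);
}

// Analytic gradient with respect to w[k], using the backpropagation formula:
// (a - y) * a * (1 - a) * x[k].
double analytic_grad(const std::vector<double>& w, double b,
                     const std::vector<double>& x, double y, std::size_t k) {
    double z = b;
    for (std::size_t i = 0; i < w.size(); ++i) z += w[i] * x[i];
    const double a = 1.0 / (1.0 + std::exp(-z));
    return (a - y) * a * (1.0 - a) * x[k];
}

// Central finite-difference approximation of the same gradient.
// w is taken by value so the caller's weights are left untouched.
double numeric_grad(std::vector<double> w, double b,
                    const std::vector<double>& x, double y, std::size_t k,
                    double eps = 1e-5) {
    w[k] += eps;
    const double plus = neuron_loss(w, b, x, y);
    w[k] -= 2.0 * eps;
    const double minus = neuron_loss(w, b, x, y);
    return (plus - minus) / (2.0 * eps);
}
```

If the two gradients agree to several decimal places for a handful of random weights, the formula is very likely right; a large gap usually points to a sign or indexing mistake.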

As you can see, there is no GPU optimization here. The code relies only on the CPU, since I didn't have a GPU to test on, and I assume I would need to learn some CUDA C to make the most of one. Now we can move on to main.cpp, where I added comments so I don't have to explain it line by line.

    #include <algorithm>
    #include <cstring>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <opencv2/opencv.hpp>
    #include "cifar10_loader.h"
    #include "TinyNn_optimized.h"
    #include "preprocess.h"
    #define NUM_DATA 5
    #define TEST_PATH "dataset/test_batch.bin"
    #define LEARNING_RATE 0.5
    #define NUM_EPOCHS 1000
    #define BATCH_SIZE 64 
    
    int main() {
        //loading the dataset, boring stuffs
        std::vector<CIFAR10Image> combined_dataset;
        for(int i = 0; i < NUM_DATA; i++){
            std::string data_path = "dataset/data_batch_" + std::to_string(i + 1) + ".bin";
            
            // Example of using the data path
            std::cout << "Processing file: " << data_path << std::endl;
    
            std::vector<CIFAR10Image> dataset = load_cifar10_bin(data_path.c_str());
    
            // Append the loaded dataset to the combined_dataset vector
            if (!dataset.empty()) {
                combined_dataset.insert(combined_dataset.end(), dataset.begin(), dataset.end());
            } else {
                std::cerr << "Failed to load dataset from " << data_path << std::endl;
            }
    
    
        }
        std::cout << "Total number of images in the combined dataset: " << combined_dataset.size() << std::endl;
       
    
        //function demo
        //const char* data_path = "dataset/data_batch_1.bin";
        //std::vector<CIFAR10Image> dataset = load_cifar10_bin(data_path);
    
        //if (!combined_dataset.empty()) {
            
        //    cv::imshow("CIFAR-10 Image", combined_dataset[0].image);
        //    printf("Label: %d\n", combined_dataset[0].label);
        //    cv::waitKey(0);
        //}
       
    
    
        //pre-cooking || preprocess the dataset
        std::vector<cv::Mat> images;
        std::vector<int> labels;
        for (const auto& item : combined_dataset) {
            images.push_back(preprocess(item.image));
            labels.push_back(item.label);
        }
       
        // Split into train and validation sets (80% train, 20% validation)
        int num_train = static_cast<int>(0.8 * images.size());
        std::vector<cv::Mat> train_images(images.begin(), images.begin() + num_train);
        std::vector<int> train_labels(labels.begin(), labels.begin() + num_train);
        std::vector<cv::Mat> val_images(images.begin() + num_train, images.end());
        std::vector<int> val_labels(labels.begin() + num_train, labels.end());
    
    
    
        //cooking stuffs
        //the numbers in layer_sizes are, in order: input, first hidden layer, second hidden layer, output.
        //3072 => 32 x 32 x 3, the size of a flattened CIFAR-10 image
        //128 and 64 => just a guess for the hidden layers; there may be some math behind
        //picking them that I don't understand yet
        //10 => one confidence value per class, so CIFAR-100 would need 100 here;
        //I guess the hidden layers shouldn't be narrower than the output layer, so they
        //would need adjusting for a different dataset
        std::vector<int> layer_sizes = {3072, 128, 64, 10}; //input -> hidden1 -> hidden2 -> output
        NeuralNetwork nn(layer_sizes);
    
    
    
        // Training loop (weights are updated after every sample; BATCH_SIZE only groups them)
        for (int epoch = 0; epoch < NUM_EPOCHS; ++epoch) {
            int num_correct = 0;
    
            for (size_t start = 0; start < train_images.size(); start += BATCH_SIZE) {
                size_t end = std::min<size_t>(start + BATCH_SIZE, train_images.size());
    
                std::vector<std::vector<float>> batch_inputs;
                std::vector<std::vector<float>> batch_targets;
    
                for (size_t i = start; i < end; ++i) {
                    // Use the preprocessed training image, not the raw 8-bit one.
                    // Assumes preprocess() returns a continuous CV_32F Mat.
                    const cv::Mat& img = train_images[i];
                    const size_t num_values = img.total() * img.channels(); // 32 * 32 * 3 = 3072
                    std::vector<float> input(num_values);
                    std::memcpy(input.data(), img.ptr<float>(), num_values * sizeof(float));
                    batch_inputs.push_back(input);
                    
                    std::vector<float> target(layer_sizes.back(), 0.0f);
                    target[train_labels[i]] = 1.0f; // One-hot encoding
                    batch_targets.push_back(target);
                }
    
                for (size_t i = 0; i < batch_inputs.size(); ++i) {
                    std::vector<float> output = nn.forward(batch_inputs[i]);
                    nn.backward(batch_inputs[i], batch_targets[i], LEARNING_RATE);
    
                    int predicted_label = std::max_element(output.begin(), output.end()) - output.begin();
                    if (predicted_label == train_labels[start + i]) {
                        num_correct++;
                    }
                }
            }
    
            std::cout << "Epoch " << epoch << ": Accuracy = " << (num_correct / static_cast<float>(train_images.size())) << std::endl;
        }
    
        // Loading and testing on the test set
        std::vector<CIFAR10Image> testset = load_cifar10_bin(TEST_PATH);
        std::vector<cv::Mat> test_images;
        std::vector<int> test_labels;
        for (const auto& item : testset) {
            test_images.push_back(preprocess(item.image));
            test_labels.push_back(item.label);
        }
     
        int num_correct_test = 0;
        for (size_t i = 0; i < test_images.size(); ++i) {
            // Read the preprocessed test image, not a training image.
            // Assumes preprocess() returns a continuous CV_32F Mat.
            const cv::Mat& img = test_images[i];
            const size_t num_values = img.total() * img.channels();
            std::vector<float> input(num_values);
            std::memcpy(input.data(), img.ptr<float>(), num_values * sizeof(float));
    
            std::vector<float> output = nn.forward(input);
    
            int predicted_label = std::max_element(output.begin(), output.end()) - output.begin();
            if (predicted_label == test_labels[i]) {
                num_correct_test++;
            }
        }
        std::cout << "Test Accuracy = " << (num_correct_test / static_cast<float>(test_images.size())) << std::endl;
    
        return 0;
    }
    

And these came together pretty well, though I can't optimize much further using only the CPU.
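
Two small steps inside the training loop, building the one-hot target and picking the predicted class, can be pulled out into helpers. This is a minimal sketch; `one_hot`, `argmax`, and `accuracy` are hypothetical names, not functions from the code in this post:

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Builds a one-hot target: all zeros except a 1 at the label's index.
std::vector<float> one_hot(int label, std::size_t num_classes) {
    std::vector<float> target(num_classes, 0.0f);
    target[static_cast<std::size_t>(label)] = 1.0f;
    return target;
}

// Returns the index of the largest output, i.e. the predicted class.
int argmax(const std::vector<float>& output) {
    return static_cast<int>(std::distance(
        output.begin(), std::max_element(output.begin(), output.end())));
}

// Fraction of predictions that match their labels (0 for an empty set).
float accuracy(const std::vector<int>& predicted, const std::vector<int>& labels) {
    if (labels.empty()) return 0.0f;
    std::size_t correct = 0;
    for (std::size_t i = 0; i < labels.size(); ++i) {
        if (predicted[i] == labels[i]) ++correct;
    }
    return static_cast<float>(correct) / static_cast<float>(labels.size());
}
```

Comparing `argmax(output)` directly against the integer label avoids searching the one-hot target for its maximum on every sample.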

![Running a training phase](https://storage.googleapis.com/papyrus_images/89f0989aa5e59b62049ded16c4f5d3c3c22bec9ff31a77ca519f54e72a0a4a67.png)

Running a training phase

See you all on my next project.

---

*Originally published on [kyomoto](https://paragraph.com/@kyomoto/my-first-attempt-to-write-vision-model-from-scratch)*
