
AI training: Reliable results with cross validation

To make statements about the performance of an artificial intelligence, various metrics can be considered. Often, accuracy, loss (which roughly translates to "error"), or the F1 score are used. However, regardless of the metric, mistakes in how the training is set up can significantly diminish the reliability of the results.
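
To make these metrics concrete, here is a small sketch using scikit-learn (which is also used later in this post); the predictions are made up purely for illustration:

from sklearn.metrics import accuracy_score, f1_score, log_loss
# made-up ground truth and predictions for five samples of a three-class problem
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]
# made-up predicted class probabilities for the same samples, needed for the loss
y_prob = [[0.90, 0.05, 0.05],
          [0.10, 0.80, 0.10],
          [0.20, 0.50, 0.30],
          [0.10, 0.20, 0.70],
          [0.05, 0.90, 0.05]]
print(accuracy_score(y_true, y_pred))             # fraction of correct predictions, here 0.8
print(f1_score(y_true, y_pred, average='macro'))  # F1 score averaged over all classes
print(log_loss(y_true, y_prob))                   # cross-entropy loss on the probabilities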




Training and Validation


The error (loss) of a neural network provides insight into how well it performs a task but is often not very intuitive. For this blog post, I would like to consider the accuracy of the network. This metric is only meaningful for classification applications, where, for example, an image is assigned to a category.


In this blog post, I want to illustrate the topic using the Digits dataset from the UCI Machine Learning Repository. It contains 1797 images of handwritten digits, each 8x8 pixels in size. This problem is well suited for a blog post because it can be computed efficiently on a CPU. The model required to classify the data is relatively small:


# imports needed to run this excerpt on its own
# (in the full source, get_model lives in digits/models/mlp.py)
from keras import layers, engine
from digits import config

def get_model():
    # we need a layer that acts as input.
    # shape of that input has to be known and depends on data.
    input_layer = layers.Input(shape=(config.NUM_FEATURES,))
    # hidden layers are the model's power to fit data.
    # number of neurons and type of layers are crucial.
    # idea behind decreasing number of units per layer:
    # increase the "abstraction" in each layer...
    # note: no activation is specified for these layers,
    # so Keras falls back to its default, a linear activation
    hidden_layer = layers.Dense(units=64)(input_layer)
    hidden_layer = layers.Dense(units=32)(hidden_layer)
    # last layer represents output.
    # activation of each neuron corresponds to the model's decision of
    # choosing that class.
    # softmax ensures that all activations summed up are equal to 1.
    # this lets one interpret that output as a probability
    output_layer = layers.Dense(units=config.NUM_DIGITS, activation='softmax')(hidden_layer)
    # actual creation of the model with in- and output layers
    model = engine.Model(inputs=[input_layer], outputs=[output_layer])
    # transform into a trainable model by specifying the optimizing function
    # (here stochastic gradient descent),
    # as well as the loss (e.g. how big of an error is produced by the model)
    # track the model's accuracy as an additional metric (only possible for classification)
    model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
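
To get a feeling for the size of this network, the model can be instantiated and its structure printed. Assuming NUM_FEATURES = 64 (8x8 pixels) and NUM_DIGITS = 10, it has only about 6,570 trainable parameters:

model = get_model()
# prints every layer with its output shape and number of parameters
model.summary()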


This model can now be trained on the Digits data set. The following code excerpt is somewhat shortened for clarity; the complete source code can be found on our GitHub.


# seed the random generator, for reproducible results
from numpy import random
random.seed(1337)
# same for tensorflow
import tensorflow as tf
tf.set_random_seed(42)
from datetime import datetime
from os import path
import numpy
from PIL import Image
from keras import callbacks
from sklearn import datasets
from digits.models.mlp import get_model
from digits.util import image_to_ndarray
from digits.config import NUM_DIGITS, MAX_FEATURE
# load prepared data set containing 1797 digits as 8x8 images
digit_features, digit_classes = datasets.load_digits(n_class=NUM_DIGITS, return_X_y=True)
num_samples = digit_classes.shape[0]
# normalize features, see documentation of sklearn.datasets.load_digits!
# neural networks work best with normalized data
digit_features /= MAX_FEATURE
# we need so called "one-hot" vectors
# one-hots are vectors, where all entries are 0 except the target class, which is 1
digit_labels = numpy.zeros(shape=(num_samples, NUM_DIGITS))
for index, digit_class in enumerate(digit_classes):
    digit_labels[index][digit_class] = 1.
# get a neural net, that can fit our problem
model = get_model()
# training the model
model.fit(
    digit_features, digit_labels,
    batch_size=32, epochs=30, 
    validation_split=.2
)
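
As a side note, the hand-written one-hot encoding above can also be produced with a Keras helper; a small equivalent sketch:

from keras.utils import to_categorical
# turns the vector of class indices into the same one-hot matrix as the loop above
digit_labels = to_categorical(digit_classes, num_classes=NUM_DIGITS)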


For the reproducibility of model training, random.seed(1337) and tf.set_random_seed(42) are crucial. Before training a neural network, all model parameters are initialized with random values, and the dataset is shuffled (randomly). Therefore, if the same model is trained multiple times on the same data without these two lines, different results would be obtained each time. This poses a problem when comparing different network architectures (see our article on searching for optimal hyperparameters using Grid Search).
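
To follow the training visually, a TensorBoard callback can be passed to model.fit. The exact logging setup is part of the complete source on our GitHub; a minimal sketch (the log directory name here is just an example) could look like this:

from keras import callbacks
# write training and validation metrics to a log directory that TensorBoard can read
tensorboard = callbacks.TensorBoard(log_dir='./logs/digits-mlp')
model.fit(
    digit_features, digit_labels,
    batch_size=32, epochs=30,
    validation_split=.2,
    callbacks=[tensorboard]
)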


When this training is run, the results can be visualized as follows (here using TensorBoard):




These graphs only show the values during training. More interesting for the actual performance of the network are the values from the validation phase, i.e. on data that the network has never seen before.




At first glance, these results look very promising, with the model's accuracy reaching 90% and above after a short time. However, they should be viewed critically: only one run was performed, and since the split of the data into training and validation sets was random, we cannot be certain that a different split would not yield worse results.




Reliable results with cross validation


The idea behind cross validation is to make several runs on the same data set, but to split it differently into training and validation sets each time. Depending on how the data is split, this is called k-fold cross validation or leave-one-out cross validation. In the first case, the data set is divided into k equal-sized blocks, also called folds.
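
To make the difference concrete, here is a small sketch with a toy data set of six samples, showing how scikit-learn splits the indices for both variants:

import numpy
from sklearn.model_selection import KFold, LeaveOneOut

data = numpy.arange(6)  # toy "data set" with six samples
# 3-fold cross validation: three runs, each fold is used exactly once for validation
for train, test in KFold(n_splits=3).split(data):
    print(train, test)   # e.g. [2 3 4 5] [0 1]
# leave-one-out: six runs, a single sample is held out for validation each time
for train, test in LeaveOneOut().split(data):
    print(train, test)   # e.g. [1 2 3 4 5] [0]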



The model is now trained k times, and in each run one of the folds is reserved for validation while the rest are used for training. This procedure is similar to bootstrapping, except that no data point is used twice within a run. Between runs, the model is reset to its original state. With these adjustments, the training process now looks like this:


# seed the random generator, for reproducible results
from numpy import random
random.seed(1337)
# same for tensorflow
import tensorflow as tf
tf.set_random_seed(42)
from datetime import datetime
from os import path
import numpy
from PIL import Image
from sklearn import datasets
from sklearn.model_selection import KFold
from digits.models.mlp import get_model
from digits.util import image_to_ndarray
from digits.config import NUM_DIGITS, MAX_FEATURE
# load prepared data set containing 1797 digits as 8x8 images
digit_features, digit_classes = datasets.load_digits(n_class=NUM_DIGITS, return_X_y=True)
num_samples = digit_classes.shape[0]
# normalize features, see documentation of sklearn.datasets.load_digits!
# neural networks work best with normalized data
digit_features /= MAX_FEATURE
# we need so called "one-hot" vectors
# one-hots are vectors, where all entries are 0 except the target class, which is 1
digit_labels = numpy.zeros(shape=(num_samples, NUM_DIGITS))
for index, digit_class in enumerate(digit_classes):
    digit_labels[index][digit_class] = 1.
# get a neural net, that can fit our problem and remember its initial weights
model = get_model()
initial_weights = model.get_weights()
# initialize the cross validation folds api
kfold = KFold(6, shuffle=True, random_state=42)
# iterate over all possible fold combinations
fold = 0
for train, test in kfold.split(digit_features):
    # split the data into features and labels depending on the fold
    train_x, train_y = digit_features[train], digit_labels[train]
    test_x, test_y = digit_features[test], digit_labels[test]
    # reset the model's weights
    model.set_weights(initial_weights)
    # training the model
    model.fit(
        train_x, train_y,
        batch_size=32, epochs=30, 
        validation_split=.0, 
        validation_data=(test_x, test_y)
    )
    fold += 1

For our example, k=6, so we get six training runs, which we can visualize exactly as in the first example.
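
Besides inspecting the individual curves, the per-fold results can also be condensed into a single number, for example the mean and standard deviation of the final validation accuracy. A minimal sketch, reusing the variables from the code above (depending on the Keras version, the history key is 'val_acc' or 'val_accuracy'):

fold_accuracies = []
for train, test in kfold.split(digit_features):
    train_x, train_y = digit_features[train], digit_labels[train]
    test_x, test_y = digit_features[test], digit_labels[test]
    # start every fold from the same initial weights again
    model.set_weights(initial_weights)
    history = model.fit(
        train_x, train_y,
        batch_size=32, epochs=30,
        validation_data=(test_x, test_y),
        verbose=0
    )
    # validation accuracy after the last epoch of this fold
    fold_accuracies.append(history.history['val_acc'][-1])
print('accuracy: %.3f +/- %.3f' % (numpy.mean(fold_accuracies), numpy.std(fold_accuracies)))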



None of the six training runs stands out with particularly poor accuracy or loss values, so we can now state with much greater confidence that the model represents the data well.


The choice of k is not based on fixed rules; however, k=10 is a value that has proven itself over the years, striking a good balance between computational effort and reliability.


Summary


K-fold cross validation can be used to remove the bias introduced by a particular split into training and validation data. This technique is particularly helpful for small and medium-sized neural networks. For huge networks like Inception V3, the additional training effort is usually impractical.


If you have any comments, questions, or ideas regarding our article, we always appreciate receiving an email at blog@neuroforge.de.
