n-fold cross-validation classification with LDA classifier

  1. For CoSMoMVPA's copyright information and license terms, #
  2. see the COPYING file distributed with CoSMoMVPA. #

Contents

Define data

config=cosmo_config();
data_path=fullfile(config.tutorial_data_path,'ak6','s01');

% Load the dataset with VT mask
ds = cosmo_fmri_dataset([data_path '/glm_T_stats_perrun.nii'], ...
                     'mask', [data_path '/vt_mask.nii']);

% remove constant features
ds=cosmo_remove_useless_data(ds);

set sample attributes

ds.sa.targets = repmat((1:6)',10,1);
ds.sa.chunks = floor(((1:60)-1)/6)'+1;

% Add labels as sample attributes
classes = {'monkey','lemur','mallard','warbler','ladybug','lunamoth'};
ds.sa.labels = repmat(classes,1,10)';

% this is good practice after setting attributes manually
cosmo_check_dataset(ds);

Part 1: 'manual' crossvalidation

nsamples=size(ds.samples,1); % should be 60 for this dataset

% allocate space for preditions for all 60 samples
all_pred=zeros(nsamples,1);

% safety check:
% to simplify this exercise, the code below assumes that .sa.chunks is in
% the range 1..10; if that is not the case, the code may not work properly.
% Therefore an 'assert' statement is used to verify that the chunks are as
% required for the remainder of this exercise.
assert(isequal(ds.sa.chunks,floor(((1:60)-1)/6)'+1));

nfolds=numel(unique(ds.sa.chunks)); % should be 10

% run n-fold cross-validation
% in the k-th fold (k ranges from 1 to 10), test the LDA classifier on
% samples with chunks==k and after training on all other samples.
%
% (in this exercise this is done manually, but easier solutions involve
% using cosmo_nfold_partitioner and cosmo_crossvalidation_measure)
for fold=1:nfolds
    % make a logical mask (of size 60x1) for the test set. It should have
    % the value true where ds.sa.chunks has the same value as 'fold', and
    % the value false everywhere else. Assign this to the variable
    % 'test_msk'
    % >@@>
    test_msk=ds.sa.chunks==fold;
    % <@@<

    % slice the input dataset 'ds' across samples using 'test_msk' so that
    % it has only samples in the 'fold'-th chunk. Assign the result to the
    % variable 'ds_test';
    % >@@>
    ds_test=cosmo_slice(ds,test_msk);
    % <@@<

    % now make another logical mask (of size 60x1) for the training set.
    % the value true where ds.sa.chunks has a different value as 'fold',
    % and the value false everywhere else. Assign this to the variable
    % 'train_msk'
    % >@@>
    train_msk=ds.sa.chunks~=fold;
    % (alternative: train_msk=~test_msk)
    % <@@<

    % slice the input dataset again using train_msk, and assign to the
    % variable 'ds_train'
    % >@@>
    ds_train=cosmo_slice(ds,train_msk);
    % <@@<

    % Use cosmo_classify_lda to get predicted targets for the
    % samples in 'ds_test'. To do so, use the samples and targets
    % from 'ds_train' for training (as first and second argument for
    % cosmo_classify_lda), and the samples from 'ds_test' for testing
    % (third argument for cosmo_classify_lda).
    % Assign the result to the variable 'fold_pred', which should be a 6x1
    % vector.
    % >@@>
    fold_pred=cosmo_classify_lda(ds_train.samples,ds_train.sa.targets,...
                                    ds_test.samples);
    % <@@<

    % store the predictions from 'fold_pred' in the 'all_pred' vector,
    % at the positions masked by 'test_msk'.
    % >@@>
    all_pred(test_msk)=fold_pred;
    % <@@<
end

% safety check:
% for this exercise, the following code tests whether the predicted classes
% is as they should be (i.e. the correct answer); if not an error is
% raised.

expected_pred=[ 1 1 1 2 1 1 2 1 1 1
                2 1 2 2 2 2 2 2 2 2
                4 3 1 3 3 4 3 3 3 3
                4 4 2 4 4 4 4 4 3 2
                5 5 5 5 5 6 5 5 5 5
                6 6 6 6 6 6 6 6 6 6];

% check that the output is as expected
if ~isequal(expected_pred(:),all_pred)
    error('expected predictions to be row-vector with [%s]''',...
            sprintf('%d ',all_pred_alt));
end

% Compute classification accuracy of all_pred compared to the targets in
% the input dataset 'ds', and assign to a variable 'accuracy'
% >@@>
accuracy=mean(all_pred==ds.sa.targets);
% <@@<

% print the accuracy
fprintf('\nLDA all categories n-fold: accuracy %.3f\n', accuracy);

% Visualize confusion matrix
% the cosmo_confusion_matrix convenience function is used to compute the
% confusion matrix
[confusion_matrix,classes]=cosmo_confusion_matrix(ds.sa.targets,all_pred);
nclasses=numel(classes);
% print confusion matrix to terminal window
fprintf('\nLDA n-fold cross-validation confusion matrix:\n')
disp(confusion_matrix);

% make a pretty figure
figure
imagesc(confusion_matrix,[0 10])
title('confusion matrix');
set(gca,'XTick',1:nclasses,'XTickLabel',classes);
set(gca,'YTick',1:nclasses,'YTickLabel',classes);
ylabel('target');
xlabel('predicted');
colorbar
LDA all categories n-fold: accuracy 0.833

LDA n-fold cross-validation confusion matrix:
     8     2     0     0     0     0
     1     9     0     0     0     0
     1     0     7     2     0     0
     0     2     1     7     0     0
     0     0     0     0     9     1
     0     0     0     0     0    10

Part 2: use cosmo_nfold_partitioner

% This exercise replicates the analysis done in Part 1, but now
% the the 'cosmo_nfold_partitioner' function is used to create a struct
% that defines the cross-validation scheme (it contains the indices
% for the train and set samples in each fold)
partitions=cosmo_nfold_partitioner(ds);

% Show partitions
fprintf('\nPartitions for n-fold cross-validation:\n');
cosmo_disp(partitions)

% Count how many folds there are in 'partitions', and assign to the
% variable 'nfolds'
% >@@>
nfolds=numel(partitions.train_indices); % should be 10
% <@@<

% allocate space for predictions of each sample (pattern) in 'ds'
all_pred=zeros(nsamples,1);

% As in part 1 (above), perform n-fold cross-validation using the LDA
% classifier
for fold=1:nfolds
    % implement cross-validation and store predicted labels in 'all_pred';
    % use the contents of partitions to slice the dataset in train and test
    % sets for each fold

    % from the 'partitions' struct, get the train indices for the
    % 'fold'-th fold and assign to a variable 'train_idxs'
    % >@@>
    train_idxs=partitions.train_indices{fold};
    % <@@<

    % do the same for the test indices, and assign to a variable
    % 'test_idxs'
    % >@@>
    test_idxs=partitions.test_indices{fold};
    % <@@<

    % slice the dataset twice:
    % - once using 'train_idxs'; assign the result to 'ds_train'
    % - once using 'test_idxs' ; assign the result to 'ds_test'
    % >@@>
    ds_train=cosmo_slice(ds,train_idxs);
    ds_test=cosmo_slice(ds,test_idxs);
    % <@@<

    % compute predictions for the samples in 'ds_test' after training
    % using the samples and targets in 'ds_train'
    % >@@>
    fold_pred=cosmo_classify_lda(ds_train.samples,ds_train.sa.targets,...
                                    ds_test.samples);
    % <@@<

     % store the predictions from 'fold_pred' in the 'all_pred' vector,
    % at the positions indexed by 'test_idxs'.
    % >@@>
    all_pred(test_idxs)=fold_pred;
    % <@@<
end

% Compute classification accuracy of all_pred compared to the targets in
% the input dataset 'ds', and assign to a variable 'accuracy'
% >@@>
accuracy=mean(all_pred==ds.sa.targets);
% <@@<
fprintf(['\nLDA all categories n-fold (with partitioner): '...
            'accuracy %.3f\n'], accuracy);

% Note: cosmo_crossvalidation_measure can perform the above operations as
% well (and in an easier way), but using that function is not part of
% this exercise.
Partitions for n-fold cross-validation:
.train_indices                                                                              
  { [  7         [  1         [  1        ... [  1         [  1         [  1                
       8            2            2               2            2            2                
       9            3            3               3            3            3                
       :            :            :               :            :            :                
      58           58           58              58           58           52                
      59           59           59              59           59           53                
      60 ]@54x1    60 ]@54x1    60 ]@54x1       60 ]@54x1    60 ]@54x1    54 ]@54x1   }@1x10
.test_indices                                                                               
  { [ 1    [  7    [ 13   ... [ 43    [ 49    [ 55                                          
      2       8      14         44      50      56                                          
      3       9      15         45      51      57                                          
      4      10      16         46      52      58                                          
      5      11      17         47      53      59                                          
      6 ]    12 ]    18 ]       48 ]    54 ]    60 ]   }@1x10                               

LDA all categories n-fold (with partitioner): accuracy 0.833