
function bal_partitions=cosmo_balance_partitions(partitions, ds, varargin)
% balances a partition so that each target occurs equally often in each
% training and test chunk
%
% bal_partitions=cosmo_balance_partitions(partitions, ds, ...)
%
% Inputs:
%   partitions        struct with fields:
%     .train_indices  } Each is a 1xN cell (for N chunks) containing the
%     .test_indices   } sample indices for each partition
%   ds                dataset struct with field .sa.targets.
%   'nrepeats',nr     Number of repeats (default: 1). The output will
%                     have nr times as many partitions as the input set.
%                     This option is not compatible with 'nmin'.
%   'nmin',nm         Ensure that each sample occurs at least nm times in
%                     each training set (some samples may be repeated more
%                     often than that). This option is not compatible with
%                     'nrepeats'.
%   'balance_test'    If set to false, indices in the test set are not
%                     necessarily balanced. The default is true.
%   'seed',sd         Use seed sd for pseudo-random number generation.
%                     Different values almost always lead to different
%                     pseudo-random orders. To disable using a seed - which
%                     causes this function to give different results upon
%                     subsequent calls with identical inputs - use sd=0.
%
% Output:
%   bal_partitions    struct similar to the input partitions, except that
%                     - each field is a 1x(N*nr) cell (with nr the number
%                       of repeats)
%                     - each unique target is represented about equally often
%                     - each target in each training chunk occurs equally
%                       often
%
% Examples:
%     % generate a simple dataset with unbalanced partitions
%     ds=struct();
%     ds.samples=zeros(9,2);
%     ds.sa.targets=[1 1 2 2 2 3 3 3 3]';
%     ds.sa.chunks=[1 2 2 1 1 1 2 2 2]';
%     p=cosmo_nfold_partitioner(ds);
%     %
%     % show original (unbalanced) partitioning
%     cosmo_disp(p);
%     %|| .train_indices
%     %||   { [ 2    [ 1
%     %||       3      4
%     %||       7      5
%     %||       8      6 ]
%     %||       9 ]        }
%     %|| .test_indices
%     %||   { [ 1    [ 2
%     %||       4      3
%     %||       5      7
%     %||       6 ]    8
%     %||              9 ] }
%     %
%     % make standard balancing (nrepeats=1); some samples are not used
%     q=cosmo_balance_partitions(p,ds);
%     cosmo_disp(q);
%     %|| .train_indices
%     %||   { [ 2    [ 1
%     %||       3      5
%     %||       7 ]    6 ] }
%     %|| .test_indices
%     %||   { [ 1    [ 2
%     %||       5      3
%     %||       6 ]    7 ] }
%     %
%     % make balancing where each sample in each training fold is used at
%     % least once
%     q=cosmo_balance_partitions(p,ds,'nmin',1);
%     cosmo_disp(q);
%     %|| .train_indices
%     %||   { [ 2    [ 2    [ 2    [ 1    [ 1
%     %||       3      3      3      5      4
%     %||       7 ]    9 ]    8 ]    6 ]    6 ] }
%     %|| .test_indices
%     %||   { [ 1    [ 1    [ 1    [ 2    [ 2
%     %||       5      4      5      3      3
%     %||       6 ]    6 ]    6 ]    7 ]    9 ] }
%     %
%     % triple the number of partitions and sample from training indices
%     q=cosmo_balance_partitions(p,ds,'nrepeats',3);
%     cosmo_disp(q);
%     %|| .train_indices
%     %||   { [ 2    [ 2    [ 2    [ 1    [ 1    [ 1
%     %||       3      3      3      5      4      5
%     %||       7 ]    9 ]    8 ]    6 ]    6 ]    6 ] }
%     %|| .test_indices
%     %||   { [ 1    [ 1    [ 1    [ 2    [ 2    [ 2
%     %||       5      4      5      3      3      3
%     %||       6 ]    6 ]    6 ]    7 ]    9 ]    8 ] }
%
% Notes:
% - this function is intended for datasets where samples are not
%   distributed equally across targets. A typical application is MEEG
%   datasets.
% - By default both the train and test indices are balanced, so that
%   chance accuracy is equal to the inverse of the number of unique
%   targets (1/C with C the number of classes).
%   Balancing is considered a *Good Thing*:
%   * Suppose the entire dataset has 75% samples of
%     class A and 25% samples of class B, but the data does not contain
%     any information that allows for discrimination between the classes.
%     A classifier trained on a subset may always predict the class that
%     occurred most often in the training set, which is class A. If the test
%     set also contains 75% of class A, then classification accuracy would
%     be 75%, which is higher than 1/2 (with 2 the number of classes).
%   * Balancing only the training set would address this issue, but it
%     may still be the case that a classifier consistently predicts one
%     class more often than other classes. While this may be unbiased with
%     respect to predictions of one particular class over many dataset
%     instances, it could lead to biases (either above or below chance)
%     in particular instances.
%
% See also: cosmo_nchoosek_partitioner, cosmo_nfold_partitioner
%
% #   For CoSMoMVPA's copyright information and license terms,   #
% #   see the COPYING file distributed with CoSMoMVPA.           #