function bal_partitions=cosmo_balance_partitions(partitions,ds, varargin)
% balances a partition so that each target occurs equally often in each
% training and test chunk
%
% bpartitions=cosmo_balance_partitions(partitions, ds, ...)
%
% Inputs:
% partitions struct with fields:
% .train_indices } Each is a 1xN cell (for N chunks) containing the
% .test_indices } sample indices for each partition
% ds dataset struct with field .sa.targets.
% 'nrepeats',nr Number of repeats (default: 1). The output will
% have nr times as many partitions as the input set. This
% option is not compatible with 'nmin'.
% 'nmin',nm Ensure that each sample occurs at least nm times
% across the balanced training sets generated from each
% input training set (some samples may be repeated more
% often than that). This option is not compatible with
% 'nrepeats'.
% 'balance_test' If set to false, indices in the test set are not
% necessarily balanced. The default is true.
% 'seed',sd Use seed sd for pseudo-random number generation.
% Different values almost always lead to different
% pseudo-random orders. To disable using a seed - which
% causes this function to give different results upon
% subsequent calls with identical inputs - use sd=0.
%
% Output:
% bpartitions struct similar to the input partitions, except that
% - each field is a 1x(N*nsets) cell
% - each unique target is represented about equally often
% - each target in each training chunk occurs equally
% often
%
% Examples:
% % generate a simple dataset with unbalanced partitions
% ds=struct();
% ds.samples=zeros(9,2);
% ds.sa.targets=[1 1 2 2 2 3 3 3 3]';
% ds.sa.chunks=[1 2 2 1 1 1 2 2 2]';
% p=cosmo_nfold_partitioner(ds);
% %
% % show original (unbalanced) partitioning
% cosmo_disp(p);
% %|| .train_indices
% %||   { [ 2    [ 1
% %||       3      4
% %||       7      5
% %||       8      6 ]
% %||       9 ]        }
% %|| .test_indices
% %||   { [ 1    [ 2
% %||       4      3
% %||       5      7
% %||       6 ]    8
% %||              9 ] }
% %
% % standard balancing (default 'nrepeats' of 1); some samples are not used
% q=cosmo_balance_partitions(p,ds);
% cosmo_disp(q);
% %|| .train_indices
% %||   { [ 2    [ 1
% %||       3      5
% %||       7 ]    6 ] }
% %|| .test_indices
% %||   { [ 1    [ 2
% %||       5      3
% %||       6 ]    7 ] }
% %
% % make balancing where each sample in each training fold is used at
% % least once
% q=cosmo_balance_partitions(p,ds,'nmin',1);
% cosmo_disp(q);
% %|| .train_indices
% %||   { [ 2    [ 2    [ 2    [ 1    [ 1
% %||       3      3      3      5      4
% %||       7 ]    9 ]    8 ]    6 ]    6 ] }
% %|| .test_indices
% %||   { [ 1    [ 1    [ 1    [ 2    [ 2
% %||       5      4      5      3      3
% %||       6 ]    6 ]    6 ]    7 ]    9 ] }
% %
% % triple the number of partitions and sample from training indices
% q=cosmo_balance_partitions(p,ds,'nrepeats',3);
% cosmo_disp(q);
% %|| .train_indices
% %||   { [ 2    [ 2    [ 2    [ 1    [ 1    [ 1
% %||       3      3      3      5      4      5
% %||       7 ]    9 ]    8 ]    6 ]    6 ]    6 ] }
% %|| .test_indices
% %||   { [ 1    [ 1    [ 1    [ 2    [ 2    [ 2
% %||       5      4      5      3      3      3
% %||       6 ]    6 ]    6 ]    7 ]    9 ]    8 ] }
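% %
% % a minimal sketch (not a verified doctest, so no expected output is
% % shown) that checks the balancing of the training sets when the test
% % sets are left unbalanced; it assumes ds and p as defined above, and
% % the variable names q, k, tr_targets, idx and counts are illustrative
% q=cosmo_balance_partitions(p,ds,'balance_test',false,'seed',1);
% for k=1:numel(q.train_indices)
%     tr_targets=ds.sa.targets(q.train_indices{k});
%     % count how often each unique target occurs in this training set
%     [unq,unused,idx]=unique(tr_targets);
%     counts=accumarray(idx(:),1);
%     assert(all(counts==counts(1)));
% end
% %
% % with the same non-zero seed, repeated calls with identical inputs
% % are expected to give identical partitions
% q1=cosmo_balance_partitions(p,ds,'seed',1);
% q2=cosmo_balance_partitions(p,ds,'seed',1);
% assert(isequal(q1,q2));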
%
% Notes:
% - This function is intended for datasets where the number of
% samples differs across targets. A typical
% application is MEEG datasets.
% - By default both the train and test indices are balanced, so that
% chance accuracy is equal to the inverse of the number of unique
% targets (1/C with C the number of classes).
% Balancing is considered a *Good Thing*:
% * Suppose the entire dataset has 75% samples of
% class A and 25% samples of class B, but the data does not contain
% any information that allows for discrimination between the classes.
% A classifier trained on a subset may always predict the class that
% occurred most often in the training set, which is class A. If the test
% set also contains 75% of class A, then classification accuracy would
% be 75%, which is higher than 1/2 (with 2 the number of classes).
% * Balancing the training set only would address this issue, but it
% may still be the case that a classifier consistently predicts one
% class more often than other classes. While this may be unbiased with
% respect to predictions of one particular class over many dataset
% instances, it could lead to biases (either above or below chance)
% in particular instances.
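% A minimal sketch of the first point above in plain MATLAB (the
% variable names are illustrative and not part of CoSMoMVPA); it
% assumes a classifier that always predicts the majority class A,
% coded here as 1, with class B coded as 2:
%     unbal_test=[ones(75,1); 2*ones(25,1)];  % 75% class A, 25% class B
%     bal_test=[ones(25,1); 2*ones(25,1)];    % 50% class A, 50% class B
%     acc_unbal=mean(ones(size(unbal_test))==unbal_test); % 0.75
%     acc_bal=mean(ones(size(bal_test))==bal_test);       % 0.50
% Only with the balanced test set does this uninformative classifier
% score at the chance level of 1/2.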
%
% See also: cosmo_nchoosek_partitioner, cosmo_nfold_partitioner
%
% # For CoSMoMVPA's copyright information and license terms, #
% # see the COPYING file distributed with CoSMoMVPA. #