Ubuntu Manpage: mlpack_preprocess_split

Provided by: mlpack-bin_4.1.0-1ubuntu1_amd64

NAME

       mlpack_preprocess_split - split data

SYNOPSIS

        mlpack_preprocess_split -i unknown [-I unknown] [-S bool] [-s int] [-z bool] [-r double] [-V bool] [-T unknown] [-L unknown] [-t unknown] [-l unknown] [-h -v]

DESCRIPTION

       This utility takes a dataset and optionally labels and splits them into a training set and
       a test set. Before the split, the points in the dataset are randomly reordered, unless the
       '--no_shuffle (-S)' option is given. The fraction of the dataset to be used as the test
       set can be specified with the '--test_ratio (-r)' parameter; the default is 0.2 (20%).

       The output training and test matrices may be saved with  the  '--training_file  (-t)'  and
       '--test_file (-T)' output parameters.

       Optionally, labels can also be split along with the data by specifying the
       '--input_labels_file (-I)' parameter. Splitting labels works the same way as splitting the
       data. The output training and test labels may be saved with the '--training_labels_file
       (-l)' and '--test_labels_file (-L)' output parameters, respectively.

       As a simple example, to split the dataset 'X.csv' into 'X_train.csv' and 'X_test.csv'
       with 60% of the data in the training set and 40% of the data in the test set, we could
       run

       $  mlpack_preprocess_split  --input_file  X.csv  --training_file  X_train.csv  --test_file
       X_test.csv --test_ratio 0.4
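
       The effect of the command above can be sketched in a few lines of Python. This is only
       an illustration and not mlpack's implementation: the function name split_rows is
       hypothetical, the dataset is treated as a plain list of rows, and rounding the test set
       size down is an assumption.

```python
import random

def split_rows(rows, test_ratio=0.2, shuffle=True, seed=None):
    """Split rows into (train, test); a sketch, not mlpack's implementation."""
    if not 0.0 <= test_ratio <= 1.0:
        raise ValueError("test_ratio must be between 0 and 1")
    rows = list(rows)  # copy so the caller's data is not reordered
    if shuffle:
        random.Random(seed).shuffle(rows)
    # Assumption: the test set size is rounded down.
    n_test = int(len(rows) * test_ratio)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

# Ten rows with a test ratio of 0.4 give 6 training rows and 4 test rows.
train, test = split_rows(range(10), test_ratio=0.4, seed=42)
print(len(train), len(test))  # 6 4
```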

       By default the dataset is shuffled before it is split. To avoid shuffling the data,
       provide the '--no_shuffle (-S)' option, for example:

       $  mlpack_preprocess_split  --input_file  X.csv  --training_file  X_train.csv  --test_file
       X_test.csv --test_ratio 0.4 --no_shuffle
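
       Without shuffling, the split depends only on the order of the input. The Python sketch
       below illustrates this under two assumptions that this page does not confirm: that the
       test set is taken from the end of the dataset, and that its size is rounded down. The
       function name split_in_order is hypothetical.

```python
def split_in_order(rows, test_ratio=0.2):
    """Split without shuffling; a sketch, not mlpack's implementation."""
    rows = list(rows)
    # Assumptions: the test set is taken from the end of the dataset
    # and its size is rounded down.
    n_test = int(len(rows) * test_ratio)
    n_train = len(rows) - n_test
    return rows[:n_train], rows[n_train:]

train, test = split_in_order(["r0", "r1", "r2", "r3", "r4"], test_ratio=0.4)
print(train)  # ['r0', 'r1', 'r2']
print(test)   # ['r3', 'r4']
```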

       If  we  had  a dataset 'X.csv' and associated labels 'y.csv', and we wanted to split these
       into 'X_train.csv', 'y_train.csv', 'X_test.csv', and 'y_test.csv', with 30% of the data in
       the test set, we could run

       $  mlpack_preprocess_split  --input_file  X.csv --input_labels_file y.csv --test_ratio 0.3
       --training_file  X_train.csv  --training_labels_file  y_train.csv  --test_file  X_test.csv
       --test_labels_file y_test.csv
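
       Splitting labels "the same way" as the data means that each point stays paired with its
       label. One way to sketch this in Python is to shuffle a single list of indices and apply
       it to both inputs. The function name split_with_labels and the sample data are
       hypothetical, and rounding the test set size down is an assumption; this is not mlpack's
       implementation.

```python
import random

def split_with_labels(rows, labels, test_ratio=0.2, seed=None):
    """Split rows and labels together; a sketch, not mlpack's implementation."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    # Shuffle one list of indices and apply it to both inputs, so that
    # every row stays paired with its label.
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    # Assumption: the test set size is rounded down.
    n_test = int(len(order) * test_ratio)
    n_train = len(order) - n_test
    train_idx, test_idx = order[:n_train], order[n_train:]
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])

# Hypothetical data: ten one-dimensional points, labelled by parity.
X = [[float(i)] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, y_train, X_test, y_test = split_with_labels(X, y, test_ratio=0.3, seed=1)
print(len(X_train), len(X_test))  # 7 3
```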

       To maintain the ratio of each class in the training and test sets, the '--stratify_data
       (-z)' option can be used:

       $  mlpack_preprocess_split  --input_file  X.csv  --training_file  X_train.csv  --test_file
       X_test.csv --test_ratio 0.4 --stratify_data
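
       The idea behind stratification can be sketched in Python by splitting each class
       separately with the same ratio. This is an illustration under assumptions, not mlpack's
       implementation: it assumes labels are available, rounds the per-class test size down,
       and uses the hypothetical function name stratified_split.

```python
import random
from collections import Counter, defaultdict

def stratified_split(rows, labels, test_ratio=0.2, seed=None):
    """Split so each class keeps roughly the same share in both sets (a sketch)."""
    if len(rows) != len(labels):
        raise ValueError("rows and labels must have the same length")
    rng = random.Random(seed)
    # Group the row indices by class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Assumption: the per-class test size is rounded down.
        n_test = int(len(indices) * test_ratio)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return ([rows[i] for i in train_idx], [labels[i] for i in train_idx],
            [rows[i] for i in test_idx], [labels[i] for i in test_idx])

# Hypothetical data: 10 points of class "a" and 5 of class "b" (ratio 2:1).
X = list(range(15))
y = ["a"] * 10 + ["b"] * 5
X_train, y_train, X_test, y_test = stratified_split(X, y, test_ratio=0.4, seed=0)
print(sorted(Counter(y_train).items()))  # [('a', 6), ('b', 3)]
print(sorted(Counter(y_test).items()))   # [('a', 4), ('b', 2)]
```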

REQUIRED INPUT OPTIONS

       --input_file (-i) [unknown]
              Matrix containing data.

OPTIONAL INPUT OPTIONS

       --help (-h) [bool]
              Default help info.

       --info [string]
              Print help on a specific option. Default value ''.

       --input_labels_file (-I) [unknown]
              Matrix containing labels.

       --no_shuffle (-S) [bool]
              Avoid shuffling the data before splitting.

       --seed (-s) [int]
              Random seed (0 for std::time(NULL)). Default value 0.

       --stratify_data (-z) [bool]
              Stratify the data according to labels.

       --test_ratio (-r) [double]
              Ratio of test set; if not set, the ratio defaults to 0.2. Default value 0.2.

       --verbose (-v) [bool]
              Display  informational  messages  and the full list of parameters and timers at the
              end of execution.

       --version (-V) [bool]
              Display the version of mlpack.

OPTIONAL OUTPUT OPTIONS

       --test_file (-T) [unknown]
              Matrix to save test data to.

       --test_labels_file (-L) [unknown]
              Matrix to save test labels to.

       --training_file (-t) [unknown]
              Matrix to save training data to.

       --training_labels_file (-l) [unknown]
              Matrix to save training labels to.

ADDITIONAL INFORMATION

       For further information, including relevant papers, citations,  and  theory,  consult  the
       documentation found at http://www.mlpack.org or included with your distribution of mlpack.