Provided by: mlpack-bin_4.6.0-1_amd64 

NAME
mlpack_preprocess_split - split data
SYNOPSIS
mlpack_preprocess_split -i unknown [-I unknown] [-S bool] [-s int] [-z bool] [-r double] [-V bool] [-T unknown] [-L unknown] [-t unknown] [-l unknown] [-h -v]
DESCRIPTION
This utility takes a dataset and optionally labels and splits them into a training set and a test set.
Before the split, the points in the dataset are randomly reordered. The fraction of the dataset to be
used as the test set can be specified with the '--test_ratio (-r)' parameter; the default is 0.2 (20%).
The output training and test matrices may be saved with the '--training_file (-t)' and '--test_file (-T)'
output parameters.
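The shuffle-then-split behaviour described above can be sketched in a few lines of Python. This is an
illustration only, not mlpack's implementation: the function name split_points is hypothetical, the
fixed seed is chosen just to make the run repeatable, and rounding the test-set size down to a whole
number of points is an assumption.

```python
import random


def split_points(points, test_ratio=0.2, seed=42):
    """Shuffle a copy of `points`, then cut it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the shuffle is repeatable
    shuffled = list(points)
    rng.shuffle(shuffled)
    # Assumption: the test-set size is rounded down to a whole number of points.
    n_test = int(len(shuffled) * test_ratio)
    n_train = len(shuffled) - n_test
    return shuffled[:n_train], shuffled[n_train:]


train, test = split_points(range(10), test_ratio=0.2)
print(len(train), len(test))  # 8 2
```

Every input point ends up in exactly one of the two outputs; only the order and the cut position vary.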
Optionally, labels can also be split along with the data by specifying the '--input_labels_file (-I)'
parameter. Splitting labels works the same way as splitting the data. The output training and test labels
may be saved with the '--training_labels_file (-l)' and '--test_labels_file (-L)' output parameters,
respectively.
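For labels to stay attached to their points, data and labels have to be reordered identically. One way
to picture this is a single shuffled index list applied to both. The sketch below is hypothetical
(split_with_labels is not an mlpack function, and the round-down of the test size is an assumption).

```python
import random


def split_with_labels(points, labels, test_ratio=0.2, seed=42):
    """Split points and labels with one shared permutation."""
    if len(points) != len(labels):
        raise ValueError("points and labels must have the same length")
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)         # one permutation for both inputs
    n_test = int(len(order) * test_ratio)      # assumption: round down
    cut = len(order) - n_test
    train_idx, test_idx = order[:cut], order[cut:]
    return ([points[i] for i in train_idx], [labels[i] for i in train_idx],
            [points[i] for i in test_idx], [labels[i] for i in test_idx])


X = [f"p{i}" for i in range(10)]
y = [f"label-of-p{i}" for i in range(10)]
X_train, y_train, X_test, y_test = split_with_labels(X, y, test_ratio=0.3)

# Each label still sits at the same position as its point.
print(all(lab == "label-of-" + pt for pt, lab in zip(X_test, y_test)))
```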
As a simple example, to split the dataset 'X.csv' into 'X_train.csv' and 'X_test.csv' with
60% of the data in the training set and 40% in the test set, we could run
$ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv \
    --test_ratio 0.4
By default the dataset is shuffled before it is split. To keep the points in their original order,
pass the '--no_shuffle (-S)' option, for example:
$ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv \
    --test_ratio 0.4 --no_shuffle
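Without shuffling, a split reduces to cutting the data at one position. The sketch below assumes the
leading points go to the training set and the trailing points to the test set, and that the test size
is rounded down; both are assumptions for illustration, and split_in_order is a hypothetical name.

```python
def split_in_order(points, test_ratio=0.2):
    """Cut `points` into (train, test) without reordering them."""
    points = list(points)
    n_test = int(len(points) * test_ratio)   # assumption: round down
    cut = len(points) - n_test
    # Assumption: training points come first, test points last.
    return points[:cut], points[cut:]


train, test = split_in_order(range(10), test_ratio=0.4)
print(train)  # [0, 1, 2, 3, 4, 5]
print(test)   # [6, 7, 8, 9]
```

An ordered split is only representative if the input file is not sorted by class or by time.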
If we had a dataset 'X.csv' and associated labels 'y.csv', and we wanted to split these into
'X_train.csv', 'y_train.csv', 'X_test.csv', and 'y_test.csv', with 30% of the data in the test set, we
could run
$ mlpack_preprocess_split --input_file X.csv --input_labels_file y.csv --test_ratio 0.3 \
    --training_file X_train.csv --training_labels_file y_train.csv \
    --test_file X_test.csv --test_labels_file y_test.csv
To maintain the ratio of each class in the training and test sets, the '--stratify_data (-z)' option can
be used:
$ mlpack_preprocess_split --input_file X.csv --training_file X_train.csv --test_file X_test.csv \
    --test_ratio 0.4 --stratify_data
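Stratification can be pictured as running the split separately inside each class and then merging the
results, so that each class contributes the same fraction of its points to the test set. The sketch
below is a generic illustration of that idea, not mlpack's code: stratified_split is a hypothetical
name, and the per-class round-down is an assumption.

```python
import random
from collections import defaultdict


def stratified_split(points, labels, test_ratio=0.2, seed=42):
    """Split so that every class keeps roughly the same train/test ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)        # group point indices by class

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = int(len(indices) * test_ratio)   # assumption: round down
        cut = len(indices) - n_test
        train_idx += indices[:cut]
        test_idx += indices[cut:]

    return ([points[i] for i in train_idx], [labels[i] for i in train_idx],
            [points[i] for i in test_idx], [labels[i] for i in test_idx])


labels = ["a"] * 10 + ["b"] * 5              # classes in a 2:1 ratio
points = list(range(15))
X_train, y_train, X_test, y_test = stratified_split(points, labels, test_ratio=0.4)
print(y_train.count("a"), y_train.count("b"))  # 6 3
print(y_test.count("a"), y_test.count("b"))    # 4 2
```

Both outputs keep the 2:1 class ratio of the input, which a plain random split does not guarantee.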
REQUIRED INPUT OPTIONS
--input_file (-i) [unknown]
Matrix containing data.
OPTIONAL INPUT OPTIONS
--help (-h) [bool]
Default help info.
--info [string]
Print help on a specific option. Default value ''.
--input_labels_file (-I) [unknown]
Matrix containing labels.
--no_shuffle (-S) [bool]
Avoid shuffling the data before splitting.
--seed (-s) [int]
Random seed (0 for std::time(NULL)). Default value 0.
--stratify_data (-z) [bool]
Stratify the data according to labels.
--test_ratio (-r) [double]
Ratio of the dataset to use as the test set. Default value 0.2.
--verbose (-v) [bool]
Display informational messages and the full list of parameters and timers at the end of execution.
--version (-V) [bool]
Display the version of mlpack.
OPTIONAL OUTPUT OPTIONS
--test_file (-T) [unknown]
Matrix to save test data to.
--test_labels_file (-L) [unknown]
Matrix to save test labels to.
--training_file (-t) [unknown]
Matrix to save training data to.
--training_labels_file (-l) [unknown]
Matrix to save training labels to.
ADDITIONAL INFORMATION
For further information, including relevant papers, citations, and theory, consult the documentation
found at http://www.mlpack.org or included with your distribution of mlpack.
mlpack-4.6.0 06 April 2025 mlpack_preprocess_split(1)