Data Providers

Interface

Data providers are wrappers that load external data, be it images, text, or general tensors, and split it into mini-batches so that the model can consume the data in a uniform way.

class AbstractDataProvider

The root type for all data providers. A data provider should implement the following interfaces:

get_batch_size(provider) → Int
Parameters:provider (AbstractDataProvider) – the data provider.
Returns:the mini-batch size of the provided data. All the provided data should have the same mini-batch size (i.e. the last dimension).
provide_data(provider) → Vector{Tuple{Base.Symbol, Tuple}}
Parameters:provider (AbstractDataProvider) – the data provider.
Returns:a vector of (name, shape) pairs describing the names of the data it provides, and the corresponding shapes.
provide_label(provider) → Vector{Tuple{Base.Symbol, Tuple}}
Parameters:provider (AbstractDataProvider) – the data provider.
Returns:a vector of (name, shape) pairs describing the names of the labels it provides, and the corresponding shapes.

The difference between data and label is that during the training stage, both data and labels are fed into the model, while during the prediction stage, only data is loaded. Otherwise, they can be anything, with any names and any shapes. The data and label names provided here should match the input names in the target SymbolicNode.
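The three metadata functions above can be sketched for a small in-memory provider. This is a minimal illustration written in Julia 1.x syntax, not code from the library: `ToyProvider` and its fields are hypothetical, and `AbstractDataProvider` is redefined locally as a stand-in so the snippet runs on its own.

```julia
# Stand-in for the abstract type documented above; in real code it comes from the library.
abstract type AbstractDataProvider end

# Hypothetical provider holding one data array and one label array in memory.
struct ToyProvider <: AbstractDataProvider
    data::Array{Float32,2}    # features x samples
    label::Vector{Float32}    # one label per sample
    batch_size::Int
end

get_batch_size(p::ToyProvider) = p.batch_size

# Shapes are reported per mini-batch, so the last dimension is the batch size.
provide_data(p::ToyProvider) = [(:data, (size(p.data, 1), p.batch_size))]
provide_label(p::ToyProvider) = [(:softmax_label, (p.batch_size,))]

p = ToyProvider(rand(Float32, 4, 10), rand(Float32, 10), 3)
println(provide_data(p))
println(provide_label(p))
```

The names `:data` and `:softmax_label` are chosen here only because they are the defaults mentioned for the built-in iterators; any names work as long as they match the inputs of the target SymbolicNode.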

A data provider should also implement the Julia iteration interface, in order to allow iterating through the data set. The provider will be called in the following way:

for batch in eachbatch(provider)
  data = get_data(provider, batch)
end

which will be translated by the Julia compiler into

iter = eachbatch(provider)
state = Base.start(iter)
while !Base.done(iter, state)
  (batch, state) = Base.next(iter, state)
  data = get_data(provider, batch)
end

By default, eachbatch() simply returns the provider itself, so the iterator interface is implemented on the provider type itself. But the extra layer of abstraction allows us to implement a data provider easily via a Julia Task coroutine. See the data provider defined in the char-lstm example for an example of using a coroutine to define a data provider.
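The coroutine idea can be illustrated with a `Channel`, which is how Julia 1.x exposes producer tasks as iterables. This is a sketch under that assumption and is not the code of the char-lstm example; `IndexBatch` and `eachbatch_sketch` are hypothetical names.

```julia
# Hypothetical batch type: the sample indices that make up one mini-batch.
struct IndexBatch
    indices::UnitRange{Int}
end

# Returns an iterable whose producer task yields one batch at a time.
function eachbatch_sketch(n_samples::Int, batch_size::Int)
    Channel() do ch
        for first_idx in 1:batch_size:n_samples
            last_idx = min(first_idx + batch_size - 1, n_samples)
            put!(ch, IndexBatch(first_idx:last_idx))
        end
    end
end

batches = collect(eachbatch_sketch(10, 4))
for batch in batches
    println(batch.indices)
end
```

The producer body reads like a plain loop over the data set, which is the convenience the extra eachbatch() layer is meant to allow.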

The detailed interface functions for the iterator API are listed below:

Base.eltype(provider) → AbstractDataBatch
Parameters:provider (AbstractDataProvider) – the data provider.
Returns:the specific subtype representing a data batch. See AbstractDataBatch.
Base.start(provider) → AbstractDataProviderState
Parameters:provider (AbstractDataProvider) – the data provider.

This function is always called before iterating over the dataset. It should initialize the iterator, reset the index, and shuffle the data if needed.

Base.done(provider, state) → Bool
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • state (AbstractDataProviderState) – the state returned by Base.start() or Base.next().
Returns:

true if there is no more data to iterate in this dataset.

Base.next(provider, state) → (AbstractDataBatch, AbstractDataProviderState)
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • state (AbstractDataProviderState) – the state returned by Base.start() or Base.next().
Returns:the current data batch, and the state for the next iteration.

Note that sometimes you are wrapping an existing data iterator (e.g. the built-in libmxnet data iterator) that is built with a different convention, and it might be difficult to adapt it to the interfaces stated here. In this case, you can safely assume that

  • Base.start() will always be called, and called only once, before the iteration starts.
  • Base.done() will always be called exactly once at the beginning of every iteration.
  • If Base.done() returns true, the iteration stops until the next round, which again starts with a call to Base.start().
  • Base.next() will be called at most once in each iteration, always after exactly one call to Base.done(); if Base.done() returns true, Base.next() will not be called.

With those assumptions, it will be relatively easy to adapt any existing iterator. See the implementation of the built-in MXDataProvider for an example.
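The guarantees listed above can be used as follows when the wrapped iterator follows an "advance, then read" convention. This is a sketch, not the MXDataProvider implementation: `CountingBackend`, `reset!`, `advance!`, `current` and `WrappedProvider` are hypothetical, and `start`, `done` and `next` are defined as plain local functions so that the snippet does not depend on a particular Julia version.

```julia
# Hypothetical backend iterator with an "advance, then read" convention.
mutable struct CountingBackend
    pos::Int
    n::Int
end
reset!(b::CountingBackend) = (b.pos = 0; nothing)
# Moves to the next item; returns false once the data is exhausted.
advance!(b::CountingBackend) = (b.pos += 1; b.pos <= b.n)
current(b::CountingBackend) = b.pos

struct WrappedProvider
    backend::CountingBackend
end
struct WrappedState end   # the backend owns the position, so the state is empty

# Local functions named after the documented protocol.
start(p::WrappedProvider) = (reset!(p.backend); WrappedState())
# Safe only because done() is called exactly once per iteration, before next().
done(p::WrappedProvider, ::WrappedState) = !advance!(p.backend)
next(p::WrappedProvider, s::WrappedState) = (current(p.backend), s)

# Driver that follows the documented call order.
function run_epoch(p)
    seen = Int[]
    state = start(p)
    while !done(p, state)
        item, state = next(p, state)
        push!(seen, item)
    end
    return seen
end

p = WrappedProvider(CountingBackend(0, 3))
first_epoch = run_epoch(p)
second_epoch = run_epoch(p)   # a second round restarts through start()
println(first_epoch)
println(second_epoch)
```

Putting the side effect in `done` would be wrong under a general iterator contract, where `done` may be called any number of times; it is acceptable here only because of the call-order assumptions stated above.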

Caution

Please do not use the same data provider simultaneously in two different places, either in parallel or in a nested loop. For example, the behavior of the following code is undefined:

for batch in data
  # updating the parameters

  # now let's test the performance on the training set
  for b2 in data
    # ...
  end
end

class AbstractDataProviderState

Base type for data provider states.

class AbstractDataBatch

Base type for a data mini-batch. It should implement the following interfaces:

count_samples(provider, batch) → Int
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • batch (AbstractDataBatch) – the data batch object.
Returns:the number of samples in this batch. This number should be greater than 0, but less than or equal to the batch size. It signals that, at the end of the data set, there might not be enough samples to fill a whole mini-batch.
get_data(provider, batch) → Vector{NDArray}
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • batch (AbstractDataBatch) – the data batch object.
Returns:

a vector of data in this batch, in the same order as declared in provide_data().

The last dimension of each NDArray should always match the batch_size, even when count_samples() returns a value less than the batch size. In this case, the data provider is free to pad the remaining contents with any value.
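The relationship between the padded array and the sample count can be sketched with plain Julia arrays in place of NDArray. The function name `padded_batch` and its arguments are hypothetical and serve only to illustrate the rule.

```julia
# Returns a (features x batch_size) array for the batch starting at `first_idx`,
# plus the number of real samples it contains.
function padded_batch(data::Matrix{Float32}, first_idx::Int, batch_size::Int;
                      padding::Float32 = 0f0)
    n_samples = size(data, 2)
    last_idx = min(first_idx + batch_size - 1, n_samples)
    n_real = last_idx - first_idx + 1
    out = fill(padding, size(data, 1), batch_size)   # full-size buffer
    out[:, 1:n_real] = data[:, first_idx:last_idx]   # real samples come first
    return out, n_real
end

data = reshape(Float32.(1:10), 2, 5)       # 5 samples with 2 features each
batch, n_real = padded_batch(data, 4, 3)   # last batch: only samples 4 and 5 exist
println(size(batch))
println(n_real)
```

Here the returned array keeps the full batch size on its last dimension, while `n_real` plays the role of the value count_samples() would report.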

get_label(provider, batch) → Vector{NDArray}
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • batch (AbstractDataBatch) – the data batch object.
Returns:

a vector of labels in this batch. Similar to get_data().

The following utility functions will be automatically defined.

get(provider, batch, name) → NDArray
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • batch (AbstractDataBatch) – the data batch object.
  • name (Base.Symbol) – the name of the data to get, should be one of the names provided in either provide_data() or provide_label().
Returns:

the data array corresponding to that name.
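One way such a name lookup could work is to search the declared names and index into the arrays stored in the same order. This is an illustrative sketch with plain Julia arrays, not the library's implementation; `NamedBatch`, `DATA_NAMES`, `LABEL_NAMES` and `get_by_name` are hypothetical.

```julia
# Hypothetical batch that stores arrays in the order the provider declares them.
struct NamedBatch
    data::Vector{Array{Float32}}
    label::Vector{Array{Float32}}
end

# Hypothetical name lists, standing in for provide_data() and provide_label().
const DATA_NAMES = [:data]
const LABEL_NAMES = [:softmax_label]

function get_by_name(batch::NamedBatch, name::Symbol)
    i = findfirst(isequal(name), DATA_NAMES)
    i !== nothing && return batch.data[i]
    i = findfirst(isequal(name), LABEL_NAMES)
    i !== nothing && return batch.label[i]
    error("no data or label named $name")
end

b = NamedBatch([ones(Float32, 2, 3)], [zeros(Float32, 3)])
println(size(get_by_name(b, :data)))
println(size(get_by_name(b, :softmax_label)))
```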

load_data!(provider, batch, targets)
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • batch (AbstractDataBatch) – the data batch object.
  • targets (Vector{Vector{SlicedNDArray}}) – the targets to load data into.

targets is a list with the same length as the number of data entries provided by this provider. Each element in the list is a list of SlicedNDArray. This list describes a scheme for splitting this data batch into different slices. Each slice is specified by a slice-ndarray pair, where slice specifies the range of samples in the mini-batch that should be loaded into the corresponding ndarray.

This utility function is used in data parallelization, where a mini-batch is split and computed on several different devices.
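The slicing scheme can be sketched with plain Julia matrices standing in for NDArray and for device buffers. `SlicedArray` and `load_sketch!` are hypothetical names that mirror SlicedNDArray and load_data!() only in shape.

```julia
# Stand-in for SlicedNDArray: a (sample range, destination array) pair.
const SlicedArray = Tuple{UnitRange{Int},Matrix{Float32}}

# Copies each sample range of every data entry into its paired destination.
function load_sketch!(batch_data::Vector{Matrix{Float32}},
                      targets::Vector{Vector{SlicedArray}})
    for (src, slices) in zip(batch_data, targets)
        for (samples, dest) in slices
            dest .= src[:, samples]   # samples live on the last dimension
        end
    end
end

src = reshape(Float32.(1:8), 2, 4)   # one data entry holding 4 samples
dev1 = zeros(Float32, 2, 2)          # buffer for a first "device"
dev2 = zeros(Float32, 2, 2)          # buffer for a second "device"
targets = [SlicedArray[(1:2, dev1), (3:4, dev2)]]
load_sketch!([src], targets)
println(dev1)
println(dev2)
```

The outer list has one element per data entry, and the inner list has one slice per device, so a batch of 4 samples is split into two halves here.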

load_label!(provider, batch, targets)
Parameters:
  • provider (AbstractDataProvider) – the data provider.
  • batch (AbstractDataBatch) – the data batch object.
  • targets (Vector{Vector{SlicedNDArray}}) – the targets to load label into.

The same as load_data!(), except that this is for loading labels.

class DataBatch

A basic subclass of AbstractDataBatch that implements the interface by accessing member fields.

class SlicedNDArray

An alias type of Tuple{UnitRange{Int},NDArray}.

Built-in data providers

class ArrayDataProvider

A convenient tool to iterate over NDArrays or Julia Arrays.

ArrayDataProvider(data[, label]; batch_size, shuffle, data_padding, label_padding)

Construct a data provider from NDArray or Julia Arrays.

Parameters:
  • data

    the data, could be

    • an NDArray, or a Julia Array. This is equivalent to :data => data.
    • a name-data pair, like :mydata => array, where :mydata is the name of the data and array is an NDArray or a Julia Array.
    • a list of name-data pairs.
  • label – the same as the data parameter. When this argument is omitted, the constructed provider will provide no labels.
  • batch_size (Int) – the batch size, default is 0, which means treating the whole array as a single mini-batch.
  • shuffle (Bool) – turn on if the data should be shuffled at every epoch.
  • data_padding (Real) – when the mini-batch goes beyond the dataset boundary, there might be fewer samples to include than a full mini-batch. This value specifies a scalar used to pad the contents of all the missing data points.
  • label_padding (Real) – the same as data_padding, except for the labels.

TODO: remove data_padding and label_padding, and implement rollover that copies the last or first several training samples to feed the padding.

libmxnet data providers

class MXDataProvider

A data provider that wraps built-in data iterators from libmxnet. See below for a list of built-in data iterators.

CSVIter(...)

Can also be called with the alias CSVProvider. Creates an iterator for a dataset in CSV format.

Parameters:
  • data_name (Base.Symbol) – keyword argument, default :data. The name of the data.
  • label_name (Base.Symbol) – keyword argument, default :softmax_label. The name of the label. Could be nothing if no label is present in this dataset.
  • data_csv (string, required) – Dataset Param: Data csv path.
  • data_shape (Shape(tuple), required) – Dataset Param: Shape of the data.
  • label_csv (string, optional, default='NULL') – Dataset Param: Label csv path. If NULL, all labels will be returned as 0.
  • label_shape (Shape(tuple), optional, default=(1,)) – Dataset Param: Shape of the label.
Returns:

the constructed MXDataProvider.

ImageRecordIter(...)

Can also be called with the alias ImageRecordProvider. Creates an iterator for a dataset packed in recordio.

Parameters:
  • data_name (Base.Symbol) – keyword argument, default :data. The name of the data.
  • label_name (Base.Symbol) – keyword argument, default :softmax_label. The name of the label. Could be nothing if no label is present in this dataset.
  • path_imglist (string, optional, default='') – Dataset Param: Path to image list.
  • path_imgrec (string, optional, default='./data/imgrec.rec') – Dataset Param: Path to image record file.
  • label_width (int, optional, default='1') – Dataset Param: How many labels for an image.
  • data_shape (Shape(tuple), required) – Dataset Param: Shape of each instance generated by the DataIter.
  • preprocess_threads (int, optional, default='4') – Backend Param: Number of threads used for preprocessing.
  • verbose (boolean, optional, default=True) – Auxiliary Param: Whether to output parser information.
  • num_parts (int, optional, default='1') – partition the data into multiple parts
  • part_index (int, optional, default='0') – the index of the part to read
  • shuffle (boolean, optional, default=False) – Augmentation Param: Whether to shuffle data.
  • seed (int, optional, default='0') – Augmentation Param: Random Seed.
  • batch_size (int (non-negative), required) – Batch Param: Batch size.
  • round_batch (boolean, optional, default=True) – Batch Param: Use round robin to handle overflow batch.
  • prefetch_buffer (, optional, default=4) – Backend Param: Number of prefetched parameters
  • rand_crop (boolean, optional, default=False) – Augmentation Param: Whether to random crop on the image
  • crop_y_start (int, optional, default='-1') – Augmentation Param: Where to nonrandom crop on y.
  • crop_x_start (int, optional, default='-1') – Augmentation Param: Where to nonrandom crop on x.
  • max_rotate_angle (int, optional, default='0') – Augmentation Param: rotated randomly in [-max_rotate_angle, max_rotate_angle].
  • max_aspect_ratio (float, optional, default=0) – Augmentation Param: denotes the max ratio of random aspect ratio augmentation.
  • max_shear_ratio (float, optional, default=0) – Augmentation Param: denotes the max random shearing ratio.
  • max_crop_size (int, optional, default='-1') – Augmentation Param: Maximum crop size.
  • min_crop_size (int, optional, default='-1') – Augmentation Param: Minimum crop size.
  • max_random_scale (float, optional, default=1) – Augmentation Param: Maximum scale ratio.
  • min_random_scale (float, optional, default=1) – Augmentation Param: Minimum scale ratio.
  • max_img_size (float, optional, default=1e+10) – Augmentation Param: Maximum image size after resizing.
  • min_img_size (float, optional, default=0) – Augmentation Param: Minimum image size after resizing.
  • random_h (int, optional, default='0') – Augmentation Param: Maximum value of H channel in HSL color space.
  • random_s (int, optional, default='0') – Augmentation Param: Maximum value of S channel in HSL color space.
  • random_l (int, optional, default='0') – Augmentation Param: Maximum value of L channel in HSL color space.
  • rotate (int, optional, default='-1') – Augmentation Param: Rotate angle.
  • fill_value (int, optional, default='255') – Augmentation Param: Maximum value of illumination variation.
  • inter_method (int, optional, default='1') – Augmentation Param: 0-NN 1-bilinear 2-cubic 3-area 4-lanczos4 9-auto 10-rand.
  • mirror (boolean, optional, default=False) – Augmentation Param: Whether to mirror the image.
  • rand_mirror (boolean, optional, default=False) – Augmentation Param: Whether to mirror the image randomly.
  • mean_img (string, optional, default='') – Augmentation Param: Mean Image to be subtracted.
  • mean_r (float, optional, default=0) – Augmentation Param: Mean value on R channel.
  • mean_g (float, optional, default=0) – Augmentation Param: Mean value on G channel.
  • mean_b (float, optional, default=0) – Augmentation Param: Mean value on B channel.
  • mean_a (float, optional, default=0) – Augmentation Param: Mean value on Alpha channel.
  • scale (float, optional, default=1) – Augmentation Param: Scale in color space.
  • max_random_contrast (float, optional, default=0) – Augmentation Param: Maximum ratio of contrast variation.
  • max_random_illumination (float, optional, default=0) – Augmentation Param: Maximum value of illumination variation.
Returns:

the constructed MXDataProvider.

MNISTIter(...)

Can also be called with the alias MNISTProvider. Creates an iterator for the MNIST handwritten digit recognition dataset.

Parameters:
  • data_name (Base.Symbol) – keyword argument, default :data. The name of the data.
  • label_name (Base.Symbol) – keyword argument, default :softmax_label. The name of the label. Could be nothing if no label is present in this dataset.
  • image (string, optional, default='./train-images-idx3-ubyte') – Dataset Param: Mnist image path.
  • label (string, optional, default='./train-labels-idx1-ubyte') – Dataset Param: Mnist label path.
  • batch_size (int, optional, default='128') – Batch Param: Batch Size.
  • shuffle (boolean, optional, default=True) – Augmentation Param: Whether to shuffle data.
  • flat (boolean, optional, default=False) – Augmentation Param: Whether to flatten the data into 1D.
  • seed (int, optional, default='0') – Augmentation Param: Random Seed.
  • silent (boolean, optional, default=False) – Auxiliary Param: Whether to print out data info.
  • num_parts (int, optional, default='1') – partition the data into multiple parts
  • part_index (int, optional, default='0') – the index of the part to read
  • prefetch_buffer (, optional, default=4) – Backend Param: Number of prefetched parameters
Returns:

the constructed MXDataProvider.