How to use the framework

Note

This framework is built upon TensorFlow 1.8.0

Main files

  • dataset.py : Implements dataset manipulation using the Dataset API
  • estimator_specs.py : Here it is defined the abstract class you should inherit from in order to create your own estimator.
  • train_estimator.py : The core of the framework. It contains the main training flow using Estimator API . It instanciates the model, input and hook functions for later usage during traning.
  • utils.py : It contains some utility functions to perform IO with Google Cloud Storage and tf-record manipulation

How to use

We do recommend to use the framework as a pip package. So after downloading the code:

cd /path/to/tf_image_classification
python setup sdist
pip install ./dist/tf_image_classification.3.0.0.tar.gz --upgrade

Once installed, all you need to do is to create a class that inherit from EstimatorSpec and implement its abstract methods.

Running locally - Example

python myEstimator.py --batch_size 64 --train_steps 10000 \
--train_metadata tfrecords_path/train* --eval_metadata tfrecords_path/eval* \
--checkpoint_path checkpoint_path/pretrained_ckpt.ckpt --model_dir ./models \
--eval_freq 10 --eval_throttle_secs 30 --learning_rate 0.00001

Running on Google ML Engine - Example

First, you must package your application as a pip package.

gcloud ml-engine jobs submit training JOB_ID --job-dir=gs://bucket/stagging_folder/ \
--module-name myEstimatorPkg.myEstimator \
--packages myEstimator.tar.gz,tf_image_classification-3.0.0.tar.gz,slim-0.1.tar.gz \
--region us-east1 --config cloud.yml --  --batch_size 128 --train_steps 1000 \
--train_metadata gs://bucket/tfrecords/train* \ --eval_metadata gs://bucket/tfrecords/eval* \
--checkpoint_path gs://bucket/pretrained_checkpoints/pretrained_model.ckpt \
--model_dir gs://bucket/trained-checkpoints/ --eval_freq 10 \
--eval_throttle_secs 120 --learning_rate 0.00001

Note

train() uses the method train_and_evaluate that runs seamlessly both locally and distributed training, so you don’t need to write a single line of code to run your model distributed into a ML Engine cluster.

FLAGS

Common

  • model_dir : Output directory for model and training stats
    • Default value: None
  • checkpoint_path : Checkpoint to load pre-trained model
    • Default value: None
  • train_metadata : Path to train metadata ( .csv or .tfrecord)
    • Default value: None
  • eval_metadata : Path to eval metadata ( .csv or .tfrecord)
    • Default value: None
  • batch_size : Batch size
    • Default value: 1
  • train_steps : Train steps
    • Default value: 20
  • image_size : Image size for resize on preprocessing
    • Default value: 299
  • eval_freq : How many eval batches to evaluate
    • Default value: 5
  • eval_throttle_secs : Evaluation every eval_throttle_secs seconds
    • Default value: 120
  • debug : Debug mode (does not shuffle dataset)
    • Default value: False

Optimizers

  • weight_decay : The weight decay on the model weights (_e.g._ batchnorm layers)
    • Defaut value: 0.00004
  • optimizer : Name of optimizer
  • adadelta_rho : The decay rate for adadelta
    • Default Value: 0.95
  • adagrad_initial_accumulator_value : Starting value for the AdaGrad accumulators
    • Default Value: 0.1
  • adam_beta1 : The exponential decay rate for the 1st moment estimates
    • Default Value: 0.9
  • adam_beta2 : The exponential decay rate for the 2nd moment estimates
    • Default Value: 0.999
  • opt_epsilon : Epsilon term for the optimizer
    • Default value: 1.0
  • ftrl_learning_rate_power : The learning rate power for ftrl optimizer
    • Default Value: -0.5
  • ftrl_initial_accumulator_value : Starting value for the FTRL accumulators
    • Default Value: 0.1
  • ftrl_l1 : The FTRL l1 regularization strength
    • Default Value: 0.0
  • ftrl_l2 : The FTRL l2 regularization strength
    • Default Value: 0.0
  • momentum : Momentum for MomentumOptimizer
    • Default Value: 0.9
  • rmsprop_momentum : Momentum for RMSPropOptimizer
    • Default Value: 0.9
  • rmsprop_decay : Decay term for RMSProp
    • Default Value: 0.9

Learning rate

  • learning_rate_decay_type : Specifies how the learning rate is decayed.
  • learning_rate : Initial learning rate
    • Default Value: 0.01
  • end_learning_rate : The minimal end learning rate used by a polynomial decay learning rate
    • Default Value: 0.0001
  • learning_rate_decay_factor : Learning rate decay factor
    • Default Value: 0.94
  • label_smoothing : The amount of label smoothing
    • Default Value: 0.0
  • num_epochs_per_decay : Number of epochs after which learning rate decays
    • Default Value: 2.0
  • sync_replicas : Whether or not to synchronize the replicas during training
    • Default Value: False
  • replicas_to_aggregate : The Number of gradients to collect before updating params
    • Default Value: 1

Fine Tuning

  • trainable_scopes : Comma-separated list of scopes to train. If None, all variables will be trained.
    • Default Value : None
  • checkpoint_exclude_scopes : Comma-separated list of scopes to exclude when loading checkpoint weights. If None, restore all variables.
    • Default Value : None
  • checkpoint_restore_scopes: Comma-separated list of scopes of variables to restore from a checkpoint.
    • Default Value : None

Checkpoint

  • save_summary_steps : Save summaries every this many steps
    • Default Value: 100
  • save_checkpoints_steps : Save checkpoints every this many steps. Can not be specified with save_checkpoints_secs
    • Default Value: None
  • save_checkpoints_secs : Save checkpoints every this many seconds. Can not be specified with save_checkpoints_steps
    • Default Value: None
  • keep_checkpoint_max : The maximum number of recent checkpoint files to keep. -1 to keep every checkpoints
    • Default Value: 5