Rashish Tandon edited this page Sep 7, 2017 · 30 revisions

Welcome to the gradient_coding wiki!

This page contains instructions for running the associated implementation on Amazon EC2. We use a cluster management toolkit called StarCluster (http://star.mit.edu/cluster/) to manage a cluster of EC2 machines.

StarCluster setup for Amazon EC2

  • Install the StarCluster toolkit (http://star.mit.edu/cluster/)

  • To configure StarCluster, edit the config file (found in the ~/.starcluster folder) as follows:

    • Add your AWS security keys in the fields AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY (see the StarCluster documentation for details).
    • Add your AWS user ID in the field AWS_USER_ID.
    • Generate an EC2 key pair (see the AWS documentation for details) and add its location to the StarCluster config file by defining a "key" section, e.g.
      [key myrandomkey]   
      KEY_LOCATION = ~/myrandomlocation/myrandomkey.pem
      
    • Define plugin templates in the config file to install MPI and mpi4py on the EC2 cluster machines; these are used in our implementation and are not part of the standard AMIs provided by StarCluster.
      [plugin mpich2]
      SETUP_CLASS = starcluster.plugins.mpich2.MPICH2Setup
      
      [plugin mpi4py]
      SETUP_CLASS = starcluster.plugins.pypkginstaller.PyPkgInstaller
      PACKAGES = mpi4py
      
    • Define volume templates in the config file to attach an EBS volume to the EC2 cluster via an NFS share; such a volume may be used to store data. You will need the EBS volume's ID (vol-...) to attach it.
      [volume mydata]
      VOLUME_ID = vol-123123123123
      MOUNT_PATH = /mydatapath
      
    • Define cluster templates in the config file to launch a cluster on EC2. Here is a sample configuration we used:
      [cluster myclusterconf]
      KEYNAME = myrandomkey
      CLUSTER_SIZE = 21
      CLUSTER_USER = sgeadmin
      CLUSTER_SHELL = bash
      NODE_IMAGE_ID = ami-6b211202
      NODE_INSTANCE_TYPE = t2.micro
      MASTER_IMAGE_ID = ami-3393a45a
      MASTER_INSTANCE_TYPE = m1.small
      PLUGINS = mpi4py, mpich2
      SPOT_BID = 0.5
      SUBNET_ID = subnet-9999a99b9
      PUBLIC_IPS = True
      VOLUMES = mydata
      
    • Some EC2 instance types can only be launched inside a VPC subnet; see the AWS documentation for how to create one. SPOT_BID specifies the maximum bid price (in USD per hour) when requesting spot instances.
  • Once you have edited the config file and added a cluster template (myclusterconf in the above example), you can launch an EC2 cluster as:

    starcluster start -c myclusterconf mynewcluster
    
  • You may SSH into the master node of your cluster as:

    starcluster sshmaster mynewcluster
    
  • Finally, you may terminate your cluster as:

    starcluster terminate mynewcluster
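
The configuration snippets above all live together in one ~/.starcluster/config file. Here is a minimal consolidated sketch reusing the example names from this page; the [aws info] section is standard StarCluster syntax, and all IDs, keys, and paths are placeholders you must replace with your own:

```ini
[aws info]
AWS_ACCESS_KEY_ID = <your access key>
AWS_SECRET_ACCESS_KEY = <your secret key>
AWS_USER_ID = <your user id>

[key myrandomkey]
KEY_LOCATION = ~/myrandomlocation/myrandomkey.pem

[plugin mpich2]
SETUP_CLASS = starcluster.plugins.mpich2.MPICH2Setup

[plugin mpi4py]
SETUP_CLASS = starcluster.plugins.pypkginstaller.PyPkgInstaller
PACKAGES = mpi4py

[volume mydata]
VOLUME_ID = vol-123123123123
MOUNT_PATH = /mydatapath

[cluster myclusterconf]
KEYNAME = myrandomkey
CLUSTER_SIZE = 21
CLUSTER_USER = sgeadmin
CLUSTER_SHELL = bash
NODE_IMAGE_ID = ami-6b211202
NODE_INSTANCE_TYPE = t2.micro
MASTER_IMAGE_ID = ami-3393a45a
MASTER_INSTANCE_TYPE = m1.small
PLUGINS = mpi4py, mpich2
SPOT_BID = 0.5
SUBNET_ID = subnet-9999a99b9
PUBLIC_IPS = True
VOLUMES = mydata
```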
    

Usage Instructions

  • Launch an EC2 cluster using StarCluster.
  • Clone this repository onto the master node of your cluster.
  • Edit the Makefile as needed. Below are some pointers:
    • Specify the folder containing data in the field DATA_FOLDER
    • Specify the data set path and size in the fields DATASET, N_ROWS and N_COLS, and set IS_REAL to 1.
    • You may use make arrange_real_data to preprocess the data and break it into partitions. (This may have to be rewritten for your specific use case, but some examples are provided.)
    • To work with random data instead, use make generate_random_data to generate an artificial dataset (of size specified in the Makefile, in the fields N_ROWS and N_COLS) and set IS_REAL to 0.
    • Specify the total number of workers and the number of stragglers in the fields N_PROCS and N_STRAGGLERS, respectively.
    • If using the partial coding schemes (see the paper for details), specify the number of partitions each worker processes in the field N_PARTITIONS, and set PARTIAL_CODED to 1.
  • Edit the number of iterations, the regularization coefficient, and the learning-rate schedule in the file main.py through the variables num_itrs, alpha and learning_rate_schedule, respectively.
  • Now, you can run (accelerated) gradient descent for various schemes as follows:
    • make naive for the Naive (uncoded) scheme
    • make avoidstragg for the Ignoring Stragglers scheme
    • make cyccoded for the Cyclic Repetition scheme
    • make repcoded for the Fractional Repetition scheme
    • make partialcyccoded for the Partial Cyclic Repetition scheme
    • make partialrepcoded for the Partial Fractional Repetition scheme
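
As a sanity check on the Makefile fields above: in gradient coding, tolerating s stragglers requires each worker to process s + 1 of the data partitions (see the paper). The sketch below illustrates this arithmetic; the variable names mirror the Makefile fields, and the assumption that N_PROCS counts the master plus the workers is ours, not stated on this page:

```shell
# Hypothetical sanity check for the coded schemes.
# Tolerating N_STRAGGLERS stragglers means each worker processes
# N_STRAGGLERS + 1 partitions (per the gradient coding paper).
N_PROCS=21        # example value; assumed to include the master process
N_STRAGGLERS=4    # example value: number of stragglers to tolerate

n_workers=$((N_PROCS - 1))      # worker processes, excluding the master
load=$((N_STRAGGLERS + 1))      # partitions each worker must process

echo "workers=${n_workers} partitions_per_worker=${load}"
```

Per the paper, the fractional repetition scheme additionally requires the number of workers to be divisible by N_STRAGGLERS + 1, so check that before picking these values.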
