Add dqn and ac for VM scheduling#358
Add dqn and ac for VM scheduling #358 — MicrosoftHam wants to merge 3 commits into microsoft:v0.2_rl_refinement from
Conversation
examples/vm_scheduling/reinforcement_learning/ac/agent/ac_net.py
Outdated
Show resolved
Hide resolved
|
|
||
| action_prob = Categorical(self.forward(states, critic=False)[0] * legal_action) # (batch_size, action_space_size) | ||
| action = action_prob.sample() | ||
| log_p = action_prob.log_prob(action) |
There was a problem hiding this comment.
I'm wondering whether we should use the action_prob without multiplying by legal_action to calculate the log_p.
Similar concern as the reward shaping case we discussed several days ago (when only the postpone is valid but model output prefers some PM).
examples/vm_scheduling/reinforcement_learning/ac/agent/models/combine_net.py
Show resolved
Hide resolved
examples/vm_scheduling/reinforcement_learning/ac/agent/models/combine_net.py
Outdated
Show resolved
Hide resolved
examples/vm_scheduling/reinforcement_learning/ac/agent/models/combine_net.py
Outdated
Show resolved
Hide resolved
examples/vm_scheduling/reinforcement_learning/ac/agent/models/combine_net.py
Outdated
Show resolved
Hide resolved
examples/vm_scheduling/reinforcement_learning/ac/agent/models/combine_net.py
Outdated
Show resolved
Hide resolved
| class SequenceNet(AbsBlock): | ||
| """Fully connected network with optional batch normalization, activation and dropout components. | ||
|
|
||
| Args: |
There was a problem hiding this comment.
The docstring does not correspond to the actual parameters. Absence is better than errors.
| from maro.rl import AbsBlock | ||
|
|
||
|
|
||
| class SequenceNet(AbsBlock): |
| from maro.rl import AbsBlock | ||
|
|
||
|
|
||
| class SequenceNet(AbsBlock): |
There was a problem hiding this comment.
There doesn't seem to be a need to subclass AbsBlock. Can you try inheriting from AbsCoreModel and use PM and VM as components?
| self.component["critic"](states) if critic else None | ||
| ) | ||
|
|
||
| def get_action(self, states, legal_action, training=True): |
There was a problem hiding this comment.
Can you make legal_action a part of states so we don't have to change the call interface? There is no restriction on what type states should be in forward.
There was a problem hiding this comment.
Use a tuple (states, legal_action) as the states input to the function.
| self._skip_connection = skip_connection | ||
|
|
||
| # build the pm sequence net | ||
| pm_dims = [self._pm_input_dim*self._pm_num] + self._hidden_dims[:2] |
There was a problem hiding this comment.
pm_dims = [self._pm_input_dim * self._pm_num] + self._hidden_dims[:2]
| self._name = name | ||
|
|
||
| def forward(self, x): | ||
| pm_info_input = x[:, :self._pm_state_dim].view(-1, self._pm_window_size, self._pm_num * self._pm_input_dim) |
There was a problem hiding this comment.
why previous _pm_state_dim? (why not x.view(...))
| # self._pm_sequence_rnn.flatten_parameters() | ||
| # pm_sequence_feature, _ = self._pm_sequence_rnn(pm_info_feature) | ||
|
|
||
| vm_info_input = x[:, -self._vm_state_dim:].view(-1, self._vm_window_size, self._vm_input_dim) |
There was a problem hiding this comment.
similar question to the pm one
| log_p_new = torch.clamp(log_p_new, min=-20) | ||
|
|
||
| if self.config.clip_ratio is not None: | ||
| ratio = torch.exp(log_p_new - log_p) |
There was a problem hiding this comment.
why design the ratio like this? what's the meaning of it?
There was a problem hiding this comment.
to use the PPO algorithm
| actor_loss = -(torch.min(ratio * advantages, clip_ratio * advantages)).mean() | ||
| else: | ||
| dist = Categorical(action_probs) | ||
| actor_loss = -(log_p_new * advantages + 10 * dist.entropy()).mean() |
There was a problem hiding this comment.
to encourage bigger entropy?
There was a problem hiding this comment.
To prevent the action probability from converging too fast.
examples/vm_scheduling/reinforcement_learning/ac/agent/vm_ac.py
Outdated
Show resolved
Hide resolved
examples/vm_scheduling/reinforcement_learning/common/__init__.py
Outdated
Show resolved
Hide resolved
| from collections import defaultdict | ||
|
|
||
| from maro.rl import ExperienceSet | ||
| from examples.vm_scheduling.refine_rl.common import VMEnvWrapper |
There was a problem hiding this comment.
Is it still runnable? examples.vm_scheduling.refine_rl.common seems not to exist (at least in this PR).
There was a problem hiding this comment.
It is runnable in my environment, since I have examples.vm_scheduling.refine_rl.common locally; that's why I didn't notice this.
| del buf["states"][:-1] | ||
| del buf["actions"][:-1] | ||
| del buf["rewards"][:-1] | ||
| del buf["info"][:-1] |
There was a problem hiding this comment.
return buf["info"][1:] but del buf["info"][:-1]?
There was a problem hiding this comment.
info stores the legal_action. In DQN, the exp_set should store the next_legal_action just like next_states, but legal_action is useless in the AC training process, so I used the same treatment as in DQN.
There was a problem hiding this comment.
If it is useless, why add it into buf["info"] then?
| __all__ = [ | ||
| "ACNet", | ||
| "VMActorCritic", | ||
| "CombineNet", "SequenceNet" |
There was a problem hiding this comment.
no CombineNet and SequenceNet anymore
| total_pm_info[:, 3] /= self._max_memory_capacity | ||
|
|
||
| # get the remaining cpu and memory of the pms | ||
| remain_cpu = (1 - total_pm_info[:, 2]).reshape(1, self._pm_num, 1) |
There was a problem hiding this comment.
For the situation where the PM capacities vary, this calculation method is wrong.
remain_cpu = (total_pm_info[:, 0] - total_pm_info[:, 2]) / max_cpu_capacity
remain_memory = (total_pm_info[:, 1] - total_pm_info[:, 3]) / max_memory_capacity
would be better
There was a problem hiding this comment.
You're right, I'll fix it
| self._pm_num = pm_num | ||
| self._durations = durations | ||
| self._vm_states = np.load(vm_state_path) | ||
| self._dim = (pm_num * 2) * pm_window_size + len(VM_ATTRIBUTES) * vm_window_size |
There was a problem hiding this comment.
Add comment for the PM attributes used (why * 2)
There was a problem hiding this comment.
I'll add a variable PM_DIM and explain the variable
| total_pm_info = np.concatenate((remain_cpu, remain_memory), axis=2) | ||
|
|
||
| # get the sequence pms' information | ||
| self._history_pm_state = np.concatenate((self._history_pm_state, total_pm_info), axis=0) |
There was a problem hiding this comment.
The history_pm_state will store all of the history information. Although some of it will never be used, it's difficult to remove data from the numpy array, so I chose to store it all.
| vm_info = np.array([ | ||
| decision_event.vm_cpu_cores_requirement, | ||
| decision_event.vm_memory_requirement, | ||
| min(self._durations - env.tick, decision_event.vm_lifetime) / 200, |
There was a problem hiding this comment.
200 or adjusted based on the duration in the config?
There was a problem hiding this comment.
I'll add a new variable to replace the 200
| vm_info[2] = (vm_info[2] * 1.0) / 200 | ||
| vm_info[3] = (self._durations - vm_info[3]) * 1.0 / 200 | ||
| else: | ||
| vm_info = np.zeros(len(VM_ATTRIBUTES), dtype=np.float) |
There was a problem hiding this comment.
since the total_vm_info already initialized as zeros, no need to assign zeros again
There was a problem hiding this comment.
I'll remove the else condition
|
|
||
| total_vm_info[self._vm_window_size - (idx - self._st + 1), :] = vm_info | ||
|
|
||
| self._st = (self._st + 1) % self._vm_states.shape[0] |
There was a problem hiding this comment.
Potential bug: there is no VM info/order checking. If the request info saved in self._vm_states covers fewer requests than the actual number of requests, wrong VM state info (restarting from index 0) would be used.
There was a problem hiding this comment.
I'll add a check to determine whether the VM states and the current simulation match.
| PM_ATTRIBUTES = ["cpu_cores_capacity", "memory_capacity", \ | ||
| "cpu_cores_allocated", "memory_allocated"] | ||
|
|
||
| VM_ATTRIBUTES = ["cpu_cores_requirement", "memory_requirement", "lifetime", "remain_time", "total_income"] |
There was a problem hiding this comment.
Since the latter 3 are actual information obtained by peeking, how about adding switches in the config to decide whether to use them or not?
There was a problem hiding this comment.
I'll add a switch to determine peeking or not
| import pandas as pd | ||
|
|
||
|
|
||
| PM_ATTRIBUTES = ["cpu_cores_capacity", "memory_capacity", \ |
There was a problem hiding this comment.
The PM_ATTRIBUTES here are not aligned with the VM_ATTRIBUTES.
The previous one is used to extract features from the snapshot_list, but the latter one is the list of features used.
There was a problem hiding this comment.
I will add PM_EXTRACTED_ATTRIBUTES to represent the features extracted from the snapshot_list, and PM_ATTRIBUTES to represent the list of features used.
| return total_vm_info | ||
|
|
||
| def _get_legal_pm(self, decision_event, total_pm_info): | ||
| # get the legal pm |
There was a problem hiding this comment.
# get the legal pm
legal_pm = np.zeros(self._pm_num + 1)
legal_pm[self.pm_num] = 1
if len(decision_event.valid_pms) > 0:
remain_cpu_dict = dict()
for pm in decision_event.valid_pms:
# if two pm has same remaining cpu, only choose the one which has smaller id
if total_pm_info[-1, pm, 0] not in remain_cpu_dict.keys():
remain_cpu_dict[total_pm_info[-1, pm, 0]] = 1
legal_pm[pm] = 1
would be enough
| TICKS_PER_HOUR: 12 | ||
|
|
||
| # Path of the vm table data. | ||
| VM_TABLE: "maro/tests/data/vm_scheduling/vmtable_short.bin" |
| VM_TABLE: "maro/tests/data/vm_scheduling/vmtable_short.bin" | ||
|
|
||
| # Path of the cpu readings file. | ||
| CPU_READINGS: "maro/tests/data/vm_scheduling/vm_cpu_readings-file-2-of-short.bin" |
|
|
||
| vm_table, vm, cpu = [], [], [] | ||
|
|
||
| vmtable_data_path = "data/vmtable_10k.csv" |
There was a problem hiding this comment.
where does this file come from?
| @@ -0,0 +1,66 @@ | |||
| import pandas as pd | |||
There was a problem hiding this comment.
Add comments for this file, used for what ?
Or you can add an README for this whole example, to introduce what's the files in this folder, what's the steps (generating data, get VM states, run ...)
| plt.switch_backend('agg') | ||
|
|
||
|
|
||
| class VMLearner: |
There was a problem hiding this comment.
So you implement a new Learner instead of using the original one?
| @staticmethod | ||
| def _get_td_errors( | ||
| q_values, next_q_values, | ||
| rewards, gamma, loss_func |
|
Closed since the RL examples for VM are added in another PR: #375 |
Description
Implement reinforcement learning algorithm for VM scheduling simulation.
Linked issue(s)/Pull request(s)
Type of Change
Related Component
Has Been Tested
Needs Follow Up Actions
Checklist