{"id":31065,"date":"2017-04-25T02:01:52","date_gmt":"2017-04-25T10:01:52","guid":{"rendered":"https:\/\/blogs.msdn.microsoft.com\/visualstudioalm\/?p=31065"},"modified":"2019-02-14T15:51:51","modified_gmt":"2019-02-14T23:51:51","slug":"how-we-use-rm-part-1","status":"publish","type":"post","link":"https:\/\/devblogs.microsoft.com\/devops\/how-we-use-rm-part-1\/","title":{"rendered":"How we use RM \u2013 Part 1"},"content":{"rendered":"<p>The teams that contribute to VSTS (TFS and other micro-services such as Release Management, Package Management, etc.) began using Release Management (RM) to deploy to production, as outlined by Buck Hodges in <a href=\"https:\/\/blogs.msdn.microsoft.com\/buckh\/2015\/04\/08\/how-we-deploy-visual-studio-online-using-release-management\/\">this<\/a> blog. However, in February this year, we received feedback that it was difficult to debug failed deployments using RM, and that engineers were being forced to use unnatural workarounds.<\/p>\n<p>We (the RM team) used that as an opportunity to take a fresh look at how we use RM, and to fix things so that it becomes easier to use. Along the way, we fixed some things in the product, and some things in the way we use the product. In this two-part blog, I will walk you through what we did. 
The first part covers the issues that we faced, and the fixes that have been rolled out.&nbsp; I will publish the second part once the remaining fixes, which are still in flight, have been rolled out.<\/p>\n<h2>The various things that didn\u2019t work very well in RM<\/h2>\n<p>As we walk through the issues, we will use the release \u201cTFS \u2013 Prod Update 538\u201d as an example of what didn\u2019t work very well in RM:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image734.png\"><img decoding=\"async\" title=\"image\" style=\"border-left-width: 0px;border-right-width: 0px;border-bottom-width: 0px;padding-top: 0px;padding-left: 0px;padding-right: 0px;border-top-width: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb720.png\" width=\"1034\" height=\"326\"><\/a><\/p>\n<p>Double-clicking on the release showed a set of Issues that was not actionable.&nbsp; We used to see this \u201cwrapper script\u201d text for all errors, and it wasn\u2019t very useful:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image735.png\"><img decoding=\"async\" title=\"image\" style=\"border-left-width: 0px;border-right-width: 0px;border-bottom-width: 0px;padding-top: 0px;padding-left: 0px;padding-right: 0px;border-top-width: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb721.png\" width=\"620\" height=\"641\"><\/a><\/p>\n<p>Further, the Release Summary indicated that the deployments to Ring 0, Ring 1, Ring 2, and Ring 3 succeeded.&nbsp; However, clicking on the Environments tab and looking closely at the number of tasks enabled in these rings told a different story.<\/p>\n<p><a 
href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image736.png\"><img decoding=\"async\" title=\"image\" style=\"border-left-width: 0px;border-right-width: 0px;border-bottom-width: 0px;padding-top: 0px;padding-left: 0px;padding-right: 0px;border-top-width: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb722.png\" width=\"466\" height=\"588\"><\/a><\/p>\n<p>There were, in fact, zero tasks enabled in Rings 0, 1, 2 and 3!&nbsp; The bits needed to be deployed only to Ring 4, but the release was taken through Rings 0, 1, 2 and 3 without doing anything meaningful in these environments.&nbsp; This used to mess up the traceability of the product, because RM thought that the current release on Ring 0 (and also on Rings 1, 2 and 3) was \u201cTFS \u2013 Prod Update 538\u201d (at least until the next release rolled out), whereas the bits actually on Ring 0 corresponded to an older release.&nbsp; The \u201ccurrent release in an environment\u201d concept is pretty important for RM: the product surfaces it in some of its views, and also uses it for some internal operations such as release retention, e.g. 
RM won\u2019t delete a release if it is the current release on an environment.<\/p>\n<p>The above design \u2013 of dragging the release through Rings 0, 1, 2 and 3 unnecessarily \u2013 raises the question, \u201cWhy?&nbsp; Why couldn\u2019t the release be created so that zero environments started automatically \u2018after release creation\u2019, and then Ring 4 was started manually?\u201d&nbsp; The reason was the requirement that there should be <strong>only one approval for the entire release<\/strong>, across all the environments.&nbsp; When the release was created, the approver wanted to check that everything was in order across all the environments, and then didn\u2019t want to be asked to approve again.&nbsp; Therefore, this requirement was modeled as an approval on Ring 0, and every release had to go through this environment before it reached its real destination.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image737.png\"><img decoding=\"async\" title=\"image\" style=\"border-left-width: 0px;border-right-width: 0px;border-bottom-width: 0px;padding-top: 0px;padding-left: 0px;padding-right: 0px;border-top-width: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb723.png\" width=\"442\" height=\"566\"><\/a><\/p>\n<p>Moving on \u2026 going to the Logs tab showed that the log file corresponding to the failed task wasn\u2019t even available in the browser.&nbsp; This sometimes happened when the log file was very large.&nbsp; The workaround was to log into the agent box and view the logs on the agent.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image738.png\"><img decoding=\"async\" title=\"image\" style=\"border-left-width: 0px;border-right-width: 0px;border-bottom-width: 
0px;padding-top: 0px;padding-left: 0px;padding-right: 0px;border-top-width: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb724.png\" width=\"861\" height=\"522\"><\/a><\/p>\n<p>Once we got to the logs, we found that they were not easy to understand.&nbsp; Reason: each environment corresponded to multiple Scale Units (SUs), e.g. Ring 3 corresponds to three scale units (WEU2, SU6 and WUS22).&nbsp; So the logs for the three scale units were interspersed.<\/p>\n<p>Finally, as part of aligning to Azure\u2019s Safe Deployment practices, there was a requirement to have each Ring \u201cbake\u201d for some time before proceeding to the next ring, so that issues were flushed out in the inner rings before moving to the outer (and more public) rings.&nbsp; The bake time was modeled as a \u201csleep\u201d task.&nbsp; This was less than ideal because it used up an agent unnecessarily while sleeping.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image739.png\"><img decoding=\"async\" title=\"image\" style=\"border-left-width: 0px;border-right-width: 0px;border-bottom-width: 0px;padding-top: 0px;padding-left: 0px;padding-right: 0px;border-top-width: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb725.png\" width=\"847\" height=\"376\"><\/a><\/p>\n<p>In the screenshot above, the Sleep task was disabled for this particular hotfix deployment, probably to get some fix out into prod early, but typically this task is enabled.<\/p>\n<h2>Fixes that we have rolled out<\/h2>\n<p>We fixed the following issues either by enhancing RM or by changing the way we used the product.<\/p>\n<p><strong>Problem statement<\/strong>: The Issues list in the release was 
not actionable.<\/p>\n<p><strong>Solution<\/strong>: [<font color=\"#0000ff\">Change in the way we use RM<\/font>] We changed the PowerShell script so that it logged the inner exception as the Error, as opposed to the outer exception.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image740.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb726.png\" width=\"732\" height=\"626\"><\/a><\/p>\n<p><strong>Problem statement<\/strong>: Incorrect traceability in the product, caused by taking each release through Ring 0 even if the bits were meant for a different Ring.<\/p>\n<p><strong>Solution<\/strong>: [<font color=\"#c0504d\"><strong>Enhanced RM<\/strong><\/font>] We enabled correct traceability in the product by adding support for a new feature, \u201cRelease-level-approvals\u201d, and then used this in the \u201cTFS \u2013 Prod Update\u201d Release Definition.&nbsp; This feature ensures that approval is required only once in the release \u2013 no matter which Ring is deployed first \u2013 as long as the approvers for all the Rings are the same:<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image741.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb727.png\" width=\"858\" height=\"473\"><\/a><\/p>\n<p>As you can see above, all Rings, except for the first Ring, have the following option selected: \u201cAutomatically 
approve auto-triggered deployments to this environment for users who have approved deployment to the previous environment\u201d.&nbsp; (That\u2019s quite a mouthful!)&nbsp; Ring 0 doesn\u2019t need to have this option selected since there is no \u201cprevious\u201d environment.<\/p>\n<p>This ensures clean traceability of the release, i.e. the bits are not unnecessarily dragged through rings where they are not meant to be deployed.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image742.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb728.png\" width=\"841\" height=\"217\"><\/a><\/p>\n<h2>Fixes still to be rolled out<\/h2>\n<p>The issues in this section have not been completely addressed.&nbsp; Once we address them, I will write up the details in part 2 of this blog.<\/p>\n<p><strong>Problem statement<\/strong>: The log file was sometimes not available in the browser \u2013 typically when it was very large.<\/p>\n<p><strong>Solution<\/strong>: [<font color=\"#0000ff\">TBD: Change<\/font><font color=\"#0000ff\"> in the way we use RM<\/font>] We will fix the log file upload reliability issue by moving from the 1.x agent to the 2.x agent.&nbsp; There is a known reliability issue with the 1.x agent with respect to uploading large log files.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image743.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" 
src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb729.png\" width=\"850\" height=\"212\"><\/a><\/p>\n<p>The upgrade of the agent is being delayed because we used a legacy variable, $(Agent.WorkingDirectory), which was available in the 1.x agent but is not available in the 2.x agent.&nbsp; So we need to rewrite the PowerShell scripts that used this variable, and replace its usage with $(System.DefaultWorkingDirectory).<\/p>\n<p><strong>Problem statement<\/strong>: The logs were difficult to understand, since each log file had interleaved information from multiple scale units.<\/p>\n<p><strong>Solution<\/strong>: [<font color=\"#0000ff\">TBD: Change<\/font><font color=\"#0000ff\"> in the way we use RM<\/font>] We will enable better traceability per environment by having one scale unit per environment.<\/p>\n<p>This design, however, gives rise to several new issues:<\/p>\n<p><em><u>Sub-problem statement<\/u><\/em>: The number of environments will go up from 5 to more than 15. Viewing the list of Releases will become a pain because of the need to constantly resize the Environments column.&nbsp; In addition, even with the Release-level-approval feature, approvals will still be problematic.&nbsp; Reason: each Ring will expand into several Environments, e.g. 
Ring 3 \u2013 which used to correspond to three scale units (WEU2, SU6 and WUS22) \u2013 would now correspond to three environments.&nbsp; Hence, starting Ring 3 would mean deploying three environments manually, and approving three times \u2013 once for each environment (since Release-level-approvals kick in only if the previous approval is completed by the time the next deployment starts).<\/p>\n<p><em><u>Solution<\/u><\/em>: We did some things to make this better, with some more work pending:<\/p>\n<ul>\n<li>[<font color=\"#c0504d\"><strong>Enhanced RM<\/strong><\/font>] We added support for \u201cremembering\u201d the width of the environments column per [User, Release Definition] combination per browser.<\/li>\n<li>[<font color=\"#c0504d\"><strong>Enhanced RM<\/strong><\/font>] We also made the environments smaller so that they took up less real estate.<\/li>\n<ol>\n<li><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image744.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb730.png\" width=\"828\" height=\"244\"><\/a><\/li>\n<\/ol>\n<li>[<font color=\"#c0504d\"><strong>TBD: Enhanced RM<\/strong><\/font>] Over the next few sprints, we will add support for bulk deployments and bulk approvals.&nbsp; After that, hopefully, we will be able to move to one environment per scale unit.<\/li>\n<\/ul>\n<p><strong>Problem statement<\/strong>: How do you model the scenario of \u201cwaiting for an environment to bake\u201d without using up an agent that just sleeps?<\/p>\n<p><strong>Solution<\/strong>: <\/p>\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (a) [<font color=\"#c0504d\"><strong>Enhanced RM<\/strong><\/font>] We introduced the feature \u201cResume 
task on timeout\u201d in the Manual Intervention task.&nbsp; When this is set to \u201cResume on timeout\u201d, it acts like a sleep, without using up any agent resources.<\/p>\n<p><a href=\"https:\/\/devblogs.microsoft.com\/wp-content\/uploads\/sites\/6\/2019\/05\/image745.png\"><img decoding=\"async\" title=\"image\" style=\"border-top: 0px;border-right: 0px;border-bottom: 0px;padding-top: 0px;padding-left: 0px;border-left: 0px;padding-right: 0px\" border=\"0\" alt=\"image\" src=\"https:\/\/devblogs.microsoft.com\/devops\/wp-content\/uploads\/sites\/6\/2017\/04\/image_thumb731.png\" width=\"873\" height=\"474\"><\/a><\/p>\n<p>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; (b) [<font color=\"#c0504d\"><strong>TBD: Enhanced RM<\/strong><\/font>] However, there is an additional requirement to make the timeout of the Manual Intervention task configurable through a variable, so that the timeouts of the environments can be easily managed through environment variables.&nbsp; Once we do that over the next few sprints, we will be able to replace the Sleep task in the Release Definition with the Manual Intervention task set to \u201cResume on timeout\u201d.<\/p>\n<h2>Conclusion<\/h2>\n<p>VSTS relies heavily on RM for its production deployments, and now you have some insight into how we use RM, and the improvements we are making in RM as we fine-tune this experience.&nbsp; Stay tuned for part 2 of this blog, as we iron out more of the issues that have come up during this dogfooding.<\/p>\n<p>Hopefully some of the techniques we used will apply to your releases too.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The teams that contribute to VSTS (TFS and other micro-services like Release Management, Package Management, etc) began using Release Management to deploy to production as outlined by Buck Hodges in this blog. 
However, in Feb this year, there was some feedback that it was difficult to debug failed deployments using RM, and that engineers [&hellip;]<\/p>\n","protected":false},"author":77,"featured_media":45953,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[226,1],"tags":[],"class_list":["post-31065","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ci","category-devops"],"acf":[],"blog_post_summary":"<p>The teams that contribute to VSTS (TFS and other micro-services like Release Management, Package Management, etc) began using Release Management to deploy to production as outlined by Buck Hodges in this blog. However, in Feb this year, there was some feedback that it was difficult to debug failed deployments using RM, and that engineers [&hellip;]<\/p>\n","_links":{"self":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts\/31065","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/users\/77"}],"replies":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/comments?post=31065"}],"version-history":[{"count":0,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/posts\/31065\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/media\/45953"}],"wp:attachment":[{"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/media?parent=31065"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/categories?post=31065"},{"taxonomy":"post_tag","embeddable":true
,"href":"https:\/\/devblogs.microsoft.com\/devops\/wp-json\/wp\/v2\/tags?post=31065"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}