{"@attributes":{"version":"2.0"},"channel":{"title":"Fractional","description":"by Lindsay Holmwood","link":"https:\/\/fractio.nl\/","pubDate":"Mon, 15 Jan 2024 11:17:16 +0000","lastBuildDate":"Mon, 15 Jan 2024 11:17:16 +0000","generator":"Jekyll v3.9.3","item":[{"title":"Using a first gen iPad mini as a grafana dashboard in 2024","description":"<p>This project had a very simple goal:<\/p>\n\n<blockquote>\n  <p><strong>Display local weather measurements on old tablet in our kitchen.<\/strong><\/p>\n<\/blockquote>\n\n<p>I have:<\/p>\n\n<ul>\n  <li>A <a href=\"https:\/\/en.wikipedia.org\/wiki\/IPad_Mini_(1st_generation)\">first generation iPad mini<\/a> that has been gathering dust since 2019.<\/li>\n  <li>An outdoor weather station:\n<img style=\"max-width: 100%; height: auto;\" src=\"\/images\/posts\/2024-01-14\/weather-station.png\" alt=\"photo of the weather station mounted on the edge of a deck\" \/><\/li>\n  <li>A Raspberry Pi with a DVB receiver:\n<img style=\"max-width: 100%; height: auto;\" src=\"\/images\/posts\/2024-01-14\/rpi-sdr.png\" alt=\"photo of a Raspberry Pi Zero with a software defined radio receiver attached by USB\" \/><\/li>\n  <li>And Grafana &amp; Prometheus on a <a href=\"https:\/\/vultr.com\/\">Vultr VPS<\/a> that scrapes the Raspberry Pi.<\/li>\n<\/ul>\n\n<p>There are obviously some resource constraints here \u2014 device and operating system age, memory limits \u2014 so it was interesting to solve this problem within these constraints.<\/p>\n\n<h3 id=\"trial-and-error\">Trial and error<\/h3>\n\n<p>First step after wiping the device was to try visiting the existing Grafana dashboards, to see if it would load.\nThis failed immediately because the certificates had expired.<\/p>\n\n<p>This uncovered the first problem:\nthe device hasn\u2019t received software updates since August 2016, and certificates issued by Let\u2019s Encrypt no longer work.<\/p>\n\n<p>Fortunately other people have dealt with this issue, back when the certs in the iOS 
trust store expired in 2021.\n<a href=\"https:\/\/community.letsencrypt.org\/t\/isrg-root-x1-not-supported-on-ios-9-3-5\/162193\/15\">This very helpful post<\/a> on the Let\u2019s Encrypt Community Support forum explains how to manually install the current Let\u2019s Encrypt certs from <a href=\"https:\/\/letsencrypt.org\/certs\/isrgrootx1.pem\">https:\/\/letsencrypt.org\/certs\/isrgrootx1.pem<\/a>.<\/p>\n\n<p>With the certificates fixed, I tried visiting the existing graphs dashboard again, but now it showed a new error:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>If you're seeing this Grafana has failed to load its application files\n<\/code><\/pre><\/div><\/div>\n\n<p><img style=\"max-width: 100%; height: auto;\" src=\"\/images\/posts\/2024-01-14\/grafana-wont-load.png\" alt=\"screenshot of grafana error page that says 'If you're seeing this Grafana has failed to load its application files'\" \/><\/p>\n\n<p>Not particularly helpful, especially when you don\u2019t have access to Safari\u2019s dev console on the iPad.<\/p>\n\n<p>But! 
I was able to use <a href=\"https:\/\/github.com\/google\/ios-webkit-debug-proxy\">ios-webkit-debug-proxy<\/a> to show the errors on a desktop browser.\nThis surfaced a bunch of JavaScript errors for unsupported browser features.<\/p>\n\n<p>Based on this, I stumbled across <a href=\"https:\/\/github.com\/grafana\/grafana\/issues\/27264\">this GitHub Issue<\/a> that showed Grafana 7.0.6 was the last version that supported Safari shipped with iOS 9.3.<\/p>\n\n<p>This left me with two choices.<\/p>\n\n<h3 id=\"crazy-or-annoying-pick-one\">Crazy or annoying: pick one<\/h3>\n\n<ol>\n  <li>\n    <p>Keep using the latest Grafana, but use <a href=\"https:\/\/github.com\/tenox7\/wrp\">wrp<\/a> (as suggested in <a href=\"https:\/\/www.reddit.com\/r\/homelab\/comments\/1028lbf\/old_ipad_as_a_grafana_dashboard\/\">this Reddit post<\/a>) to render the page as an image.<\/p>\n\n    <p>This is a wild approach, and it was crazy enough that I had to try it.<\/p>\n\n    <p>While I kinda got it working, I found there were too many moving parts, and I quickly ran into memory limits on the VPS (it\u2019s running a headless Chrome).<\/p>\n  <\/li>\n  <li>\n    <p>Set up a standalone instance of <a href=\"https:\/\/grafana.com\/grafana\/download\/7.0.6?edition=oss\">Grafana running 7.0.6<\/a>, the last version that worked with iOS 9.3.5.<\/p>\n\n    <p>This had a few downsides:<\/p>\n\n    <ul>\n      <li><strong>Grafana won\u2019t be patched when there are bugs.<\/strong>\nTo mitigate, I could use Nginx + basic auth to protect it from the public web.<\/li>\n      <li><strong>It\u2019s a separate set of dashboards to maintain.<\/strong>\nI couldn\u2019t export the dashboard from the newer Grafana and import into 7.0.6, because the dashboard schema had changed.\nSo I would have to recreate dashboards manually, and some of the panel types (like the stat panel) have fewer features.<\/li>\n    <\/ul>\n  <\/li>\n<\/ol>\n\n<p>At the end of the day, option 2 was the least terrible, and the end 
users (my family) don\u2019t need to know or care about how inelegant the setup is behind the scenes.<\/p>\n\n<p>The last software thing to set up was <a href=\"https:\/\/apps.apple.com\/au\/app\/kiosk-mode-for-ipad\/id986554705\">Kiosk mode for iPad<\/a>, to show the dashboard in full screen.\nFortunately Kiosk mode still works on older iOS \u2014 thanks to the maintainers!<\/p>\n\n<p>Finally, I had to safely mount the iPad to the wall.\nI used the <a href=\"https:\/\/dockem.com\/products\/koala-mount-2-wall-mount-tablet?variant=29903897100363\">Dockem Koala Mount 2.0<\/a>, mounted directly into the stud:<\/p>\n\n<p><img style=\"max-width: 100%; height: auto;\" src=\"\/images\/posts\/2024-01-14\/ipad-in-situ.png\" alt=\"photo of the iPad mini first gen mounted above a kitchen counter\" \/><\/p>\n\n<p>In conclusion, it\u2019s still possible in 2024 to use older iPads with a bit of work.\nMy recommendation is to stick to things in the browser, or use some of the <a href=\"https:\/\/www.reddit.com\/r\/ipad\/comments\/zzsdwl\/my_tips_for_using_a_2012era_ipad_3_with_ios_935\/\">few apps that still work<\/a> on the first gen iPads.<\/p>\n\n","pubDate":"14 Jan 2024","link":"https:\/\/fractio.nl\/2024\/01\/14\/first-gen-ipad-mini-grafana-dashboard-in-2024\/","guid":"https:\/\/fractio.nl\/2024\/01\/14\/first-gen-ipad-mini-grafana-dashboard-in-2024\/"},{"title":"Using MikroTik Netinstall on Linux","description":"<p>If you\u2019ve used <a href=\"https:\/\/mikrotik.com\/\">MikroTik<\/a> network gear long enough, you\u2019ve likely run into devices bricking themselves after <a href=\"https:\/\/mikrotik.com\/software\">RouterOS<\/a> software upgrades.\nMaybe you\u2019ve set some configuration that has inadvertently made your device unusable.\nOr maybe you\u2019ve inherited a device and want to start with a clean slate.<\/p>\n\n<p>How do you re-install RouterOS, and maybe reset the device\u2019s configuration too?<\/p>\n\n<p>MikroTik provide the <a 
href=\"https:\/\/help.mikrotik.com\/docs\/display\/ROS\/Netinstall\">Netinstall<\/a> tool to do network-based RouterOS installs.\nNetinstall was Windows-only until recently, when MikroTik released a Linux CLI version.<\/p>\n\n<p>As of writing, it\u2019s only been available for a few months, and it has quite a few rough edges, which I have attempted to document here.<\/p>\n\n<!-- excerpt -->\n\n<p>The Linux version of Netinstall is janky in several ways:<\/p>\n\n<ul>\n  <li>It only works on a single network interface, and you cannot control which one it selects.<\/li>\n  <li>It generally fails if you have multiple active network interfaces.<\/li>\n  <li>It fails with obscure messages if there is no default route on the interface it selects.<\/li>\n  <li>It often doesn\u2019t serve up the images correctly the first time.<\/li>\n<\/ul>\n\n<p>You need to set Linux networking up in a very particular way to make Netinstall work.<\/p>\n\n<p>But before we start, a little background on Netinstall, and another less well-known RouterBOARD subsystem called Etherboot.<\/p>\n\n<h2 id=\"netinstall-is-only-one-half-of-the-solution-the-other-is-etherboot\">Netinstall is only one half of the solution. 
The other is Etherboot.<\/h2>\n\n<p>Netinstall is a binary that rolls a BOOTP\/TFTP server into a single executable.\nThe other half of the equation is <a href=\"https:\/\/help.mikrotik.com\/docs\/display\/ROS\/Netinstall#Netinstall-Etherboot\">Etherboot<\/a>, which is a low-level system built into MikroTik devices for installing RouterOS onto the device\u2019s flash memory.<\/p>\n\n<p>Check the documentation for how to trigger Etherboot for your specific device, but it generally boils down to:<\/p>\n\n<ul>\n  <li>Power off the device<\/li>\n  <li>Hold the reset button<\/li>\n  <li>Power on the device<\/li>\n<\/ul>\n\n<p>Then watch the output of Netinstall to see the device fetch an image and reboot.<\/p>\n\n<p>I highly recommend running a packet sniffer like Wireshark or tcpdump when you\u2019re doing this, to identify any configuration errors.<\/p>\n\n<p>If everything is working correctly, you\u2019ll see Netinstall and Etherboot do a standard BOOTP\/TFTP dance.<\/p>\n\n<p>In my particular case, I was doing this on a <a href=\"https:\/\/mikrotik.com\/product\/cap_ac\">cAP ac<\/a> that had bricked itself after an automated upgrade, and was stuck in an Etherboot reboot loop.\nI have also used this process to reinstall RouterOS on a <a href=\"https:\/\/mikrotik.com\/product\/RB952Ui-5ac2nD-TC\">hAP ac lite<\/a> with corrupted configuration.<\/p>\n\n<h3 id=\"how-to-run-netinstall-on-linux\">How to run netinstall on Linux<\/h3>\n\n<p>Before you start:<\/p>\n\n<ul>\n  <li>Fetch <a href=\"https:\/\/mikrotik.com\/download\">the latest netinstall<\/a> binary for Linux.\nAt time of writing, I was using netinstall 7.1.1.<\/li>\n  <li>Fetch the appropriate RouterOS image for your device.\nYou can find the latest image linked from the product page of your device.\nPay attention to whether the image is MIPS or ARM.<\/li>\n<\/ul>\n\n<p>Once you\u2019ve downloaded these, you need to set up a wired network, with a default route.\nThe <code class=\"language-plaintext 
highlighter-rouge\">netinstall<\/code> Linux binary <em>will not work<\/em> if you do not have a default route set, and will output <code class=\"language-plaintext highlighter-rouge\">FAILED TO REPLY<\/code> which is an awesomely unhelpful error message.<\/p>\n\n<p>To set up the network on Ubuntu, configure <code class=\"language-plaintext highlighter-rouge\">\/etc\/netplan\/50-cloud-init.yaml<\/code>:<\/p>\n\n<div class=\"language-yaml highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"na\">network<\/span><span class=\"pi\">:<\/span>\n  <span class=\"na\">version<\/span><span class=\"pi\">:<\/span> <span class=\"m\">2<\/span>\n  <span class=\"na\">ethernets<\/span><span class=\"pi\">:<\/span>\n    <span class=\"na\">eno1<\/span><span class=\"pi\">:<\/span>\n      <span class=\"na\">addresses<\/span><span class=\"pi\">:<\/span>\n        <span class=\"pi\">-<\/span> <span class=\"s\">192.168.88.100\/24<\/span>\n      <span class=\"na\">routes<\/span><span class=\"pi\">:<\/span>\n        <span class=\"pi\">-<\/span> <span class=\"na\">to<\/span><span class=\"pi\">:<\/span> <span class=\"s\">0.0.0.0\/0<\/span>\n          <span class=\"na\">via<\/span><span class=\"pi\">:<\/span> <span class=\"s\">192.168.88.1<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>Then apply with:<\/p>\n\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nb\">sudo <\/span>netplan generate\n<span class=\"nb\">sudo <\/span>netplan apply\n<\/code><\/pre><\/div><\/div>\n\n<p>If you have other interfaces (like wifi) shut them down with something like:<\/p>\n\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nb\">sudo <\/span>ip <span class=\"nb\">link set <\/span>dev wlp0s20f3 down\n<\/code><\/pre><\/div><\/div>\n\n<p>Then start the netinstall server:<\/p>\n\n<div class=\"language-bash highlighter-rouge\"><div 
class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nb\">sudo<\/span> .\/netinstall <span class=\"nt\">-a<\/span> 192.168.88.1 routeros-mipsbe-7.1.1.npk\n<\/code><\/pre><\/div><\/div>\n\n<p>Replace <code class=\"language-plaintext highlighter-rouge\">routeros-mipsbe-7.1.1.npk<\/code> with your image name.<\/p>\n\n<p>The <code class=\"language-plaintext highlighter-rouge\">-a<\/code> flag says what IP address should be assigned to Etherboot clients when doing the Netinstall dance.<\/p>\n\n<p>You should see output that looks something like this:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>Using server IP: 192.168.88.100\nStarting PXE server\nWaiting for RouterBOARD...\nPXE client: 01:23:45:67:89:10\nSending image: mips\nDiscovered RouterBOARD...\nFormatting...\nSending package routeros-mipsbe-7.1.1.npk ...\nReady for reboot...\nSent reboot command\n<\/code><\/pre><\/div><\/div>\n\n<p>Remember that depending on what your device is, when the device comes back up after the RouterOS install, the default configuration may have a firewall on the ethernet interface, so you won\u2019t be able to connect to it.<\/p>\n\n<p>The default behaviour is to serve up a RouterOS image, but keep the existing configuration on the device.<\/p>\n\n<p>If you have uploaded broken configuration to the device, or the configuration has become corrupted, a RouterOS install via Netinstall\/Etherboot won\u2019t be enough.\nYou will need to wipe all config on the target device, by running the previous command with <code class=\"language-plaintext highlighter-rouge\">-r<\/code>.<\/p>\n\n<p>More detail about the Linux version of netinstall can be found <a href=\"https:\/\/help.mikrotik.com\/docs\/display\/ROS\/Netinstall#Netinstall-InstructionsforLinux\">on the MikroTik help site<\/a>.<\/p>\n\n<h2 id=\"you-cant-use-non-mikrotik-tools-like-dnsmasq-to-serve-up-the-routeros-images\">You can\u2019t use non-MikroTik tools 
(like dnsmasq) to serve up the RouterOS images<\/h2>\n\n<p>You might be thinking \u201cwhy use a proprietary tool like Netinstall when I can use open source tools like dnsmasq to serve up the RouterOS images?\u201d<\/p>\n\n<p>The short answer is: I\u2019ve tried this and it doesn\u2019t work.<\/p>\n\n<p>The longer answer is: Netinstall isn\u2019t serving up just the RouterOS image, it\u2019s also repackaging it in a way that the RouterBOARD on the other end can use.<\/p>\n\n<p>The hint to the magic it\u2019s doing is in these two lines of <code class=\"language-plaintext highlighter-rouge\">netinstall<\/code> output:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>Formatting...\nSending package routeros-mipsbe-6.48.1.npk ...\n<\/code><\/pre><\/div><\/div>\n\n<p>This suggests it\u2019s not just sending an image.\nAt the very least, it\u2019s also packaging up configuration to run on first boot (the <code class=\"language-plaintext highlighter-rouge\">-s<\/code> flag), or signalling to wipe existing configuration (the <code class=\"language-plaintext highlighter-rouge\">-r<\/code> flag).<\/p>\n\n<p>If you set up a dnsmasq instance and try serving up RouterOS images via TFTP, you\u2019ll find that the device will not install that RouterOS image.<\/p>\n\n<p>I have no interest in working out exactly what it\u2019s doing, nor maintaining a working open source-based alternative.<\/p>\n\n<p>If you don\u2019t value your time and you want to investigate how to go full open source, I saw one creative solution to this problem on a forum that boiled down to:<\/p>\n\n<ul>\n  <li>Set up a legit Netinstall server<\/li>\n  <li>Packet capture a valid Netinstall\/Etherboot session<\/li>\n  <li>Extract the binary served by Netinstall from the pcap<\/li>\n  <li>Then serve it up from dnsmasq<\/li>\n<\/ul>\n","pubDate":"30 Jan 
2022","link":"https:\/\/fractio.nl\/2022\/01\/30\/using-mikrotik-netinstall-on-linux\/","guid":"https:\/\/fractio.nl\/2022\/01\/30\/using-mikrotik-netinstall-on-linux\/"},{"title":"My philosophy on work","description":"<p>\ud83d\udc4b, I\u2019m Lindsay.<\/p>\n\n<h2 id=\"i-wrote-this-so-you-understand-my-philosophy-on-work\">I wrote this so you understand my philosophy on work.<\/h2>\n\n<p>You can use this as a quick debugging guide in case you see something in the wild that surprises you.<\/p>\n\n<p>This document is a set of promises I intend to keep. If I don\u2019t, I expect you to call me out.<\/p>\n\n<!-- excerpt -->\n\n<h2 id=\"im-here-to-route-information-remove-roadblocks-and-shield-the-team\">I\u2019m here to route information, remove roadblocks, and shield the team<\/h2>\n\n<p>At its core, I do three things for the team:<\/p>\n\n<ul>\n  <li><strong>Route information<\/strong> to the right places at the right time<\/li>\n  <li><strong>Remove roadblocks<\/strong> stopping us getting things done<\/li>\n  <li><strong>Shield the team<\/strong> from interruptions and distractions<\/li>\n<\/ul>\n\n<p>I have the responsibility for:<\/p>\n\n<ul>\n  <li><strong>People.<\/strong> This means your health (mental and physical) and wellbeing at work, your relationship with work (including the dark side of this \u2013 keeping burnout at bay), and creating opportunities for you to grow<\/li>\n  <li><strong>Systems.<\/strong> I am the single point of accountability for the upkeep, operations, and cost effectiveness of our socio-technical systems. I am the one on the hook for those systems. I will take responsibility when things go bad. I will ensure we work hard to make sure the likelihood of those bad things happening again is reduced.<\/li>\n  <li><strong>Delivery.<\/strong> Ensuring we have a good pipeline of work to get on with. Ensuring that work is well defined and well sized. 
This is the part of work I find really fun!<\/li>\n<\/ul>\n\n<p>I am here to leave this world a little better than I found it. If I\u2019m doing my job well, when I step away from the team for long periods of time, things will continue to function well, and adapt and improve.<\/p>\n\n<h2 id=\"i-value-fairness-context-and-pride-in-work\">I value fairness, context, and pride in work<\/h2>\n\n<h4 id=\"fairness\">Fairness<\/h4>\n\n<p>To be blunt \u2013 the main reason I made a career change into leadership was that I experienced a real mixed bag of bosses. I thought \u201cI can do a better job\u201d, and here we are.<\/p>\n\n<p>What gets me up every morning is creating a fair and just environment for the people I am responsible for.<\/p>\n\n<p>I will call out things I think are unfair both in the workplace and for our customers, and I won\u2019t hesitate to take a stand on principles.<\/p>\n\n<h4 id=\"context\">Context<\/h4>\n\n<p>Local rationality rules everything around me. People make what they consider to be the best decisions, given the information they have at the time.<\/p>\n\n<p>Good judgement comes from experience. And experience comes from bad judgement. 
In my experience, disagreements often come from seeing the same thing from multiple, sometimes conflicting, perspectives.<\/p>\n\n<p>My job is to facilitate building context for the team, so we can make more right decisions, and only make new mistakes.<\/p>\n\n<h4 id=\"pride-in-work\">Pride in work<\/h4>\n\n<p>I don\u2019t have high standards, I have <em>extreme<\/em> standards.<\/p>\n\n<p>I expect great work from the people around me, and I will push you to do the best work you have done in your career.<\/p>\n\n<p>If I make you feel like I\u2019m constantly disappointed in your work, that means I\u2019m doing a bad job of setting expectations about what those high standards are.<\/p>\n\n<h2 id=\"my-expectations-are-few-but-firm\">My expectations are few but firm<\/h2>\n\n<p><strong>I don\u2019t have all the answers, and I don\u2019t expect you to either.<\/strong> We\u2019ll work together to build context that is a better approximation of reality.<\/p>\n\n<p><strong>Family first, work second.<\/strong> I am a firm believer in working-to-live, not living-to-work. If things are not on an even keel at home, your ability to do work is compromised.<\/p>\n\n<p>Work must never trump your home responsibilities. If your partner asks you to do something important for them during work hours, I expect you to take time to do it. If you have kids, I expect you to take time off for special events.<\/p>\n\n<p>I lead by example \u2013 I take time off during the school holidays so my wife can continue to study.<\/p>\n\n<p><strong>You know how to manage your time.<\/strong> I blanket approve leave requests. You know what is best for you, and I trust you to make the right call for the team and yourself. 
I view people going on leave as a chaos monkey that tests the anti-fragility of the team.<\/p>\n\n<h2 id=\"feedback-will-be-direct-prompt-and-humane\">Feedback will be direct, prompt, and humane<\/h2>\n\n<p><strong>If there is a problem, you will hear about it directly, promptly, and humanely from me.<\/strong> I don\u2019t hold on to feedback. Delays in feedback create anxiety. My priority is to feed it back to you as quickly as possible.<\/p>\n\n<p>I\u2019ll provide feedback throughout the day, mostly through Slack. If it\u2019s something particularly sensitive, we will do a video call.<\/p>\n\n<p><strong>When you have negative feedback for me, I expect it directly, promptly, and privately.<\/strong> Making mistakes is part of being human, and I am no exception to this.<\/p>\n\n<p>I take a very dim view of hearing criticisms of me second hand. When I do hear second hand criticism, you\u2019ll be hearing from me pretty quickly. This goes back to one of my values \u2013 fairness.<\/p>\n\n<p>I go out of my way to take public responsibility for my mistakes. They are often teachable moments that are applicable to an audience bigger than me.<\/p>\n\n<p><strong>When you have positive feedback, please deliver it publicly.<\/strong> I like my work to speak for itself, and I appreciate when you say nice things about my work in public.<\/p>\n\n<h2 id=\"my-office-hours-are-1000-to-1730\">My office hours are 10.00 to 17.30<\/h2>\n\n<p>You won\u2019t have my full attention before 10.00.<\/p>\n\n<p>I have a young family, and my priority in the morning is getting them up and going for the day. If I\u2019m in meetings before 10.00, you won\u2019t be getting the best of me. Deal with that as you will. You\u2019ll have a &lt;50% hit rate if you schedule meetings with me before 10.00.<\/p>\n\n<h2 id=\"11s-are-the-most-important-conversations-i-have\">1:1s are the most important conversations I have<\/h2>\n\n<p>1:1s take priority in my calendar. 
This is where you have the opportunity to ask me anything. I will help you build context about what\u2019s happening more broadly across the organisation.<\/p>\n\n<p>We will use the full time. Sometimes there will be things we need to talk about, sometimes there won\u2019t be. Even if you don\u2019t think we have things to talk about, we will use the time.<\/p>\n\n<p>I will hold you accountable for actions that come out of our 1:1s, and I expect you to do the same for me.<\/p>\n\n<p>Scheduling wise, we can do 1:1s once a week, or once a fortnight.<\/p>\n\n<p>With direct reports who have people management responsibilities, I want to meet once a week. With direct reports who are individual contributors, once a fortnight is fine \u2013 but if you find it valuable, I will do them weekly.<\/p>\n\n<p>When I assume responsibility for you, I tend to do 1:1s weekly for two-to-three months, before we mutually decide to adjust the frequency.<\/p>\n\n<h2 id=\"slack-is-the-best-way-to-contact-me\">Slack is the best way to contact me<\/h2>\n\n<p>Given I work remotely, Slack is my lifeline to the team. I am super responsive during the day.<\/p>\n\n<p><strong>Calendar:<\/strong> best to hit me up on Slack before you book anything. I will blanket reject meeting invites that don\u2019t have agendas.<\/p>\n\n<p><strong>Email:<\/strong> This is where information goes to die.<\/p>\n\n<h2 id=\"i-have-some-quirks-im-working-on-them\">I have some quirks. I\u2019m working on them.<\/h2>\n\n<p><strong>I focus on eliminating the negatives way more than I focus on accentuating the positives.<\/strong> I see problems pretty much everywhere I look. I work relentlessly to eliminate those problems. This sometimes means I don\u2019t pay attention to the good things that are happening.<\/p>\n\n<p>It\u2019s something I\u2019m working on and getting better at. 
Pull me up if you think I\u2019m being too negative on something.<\/p>\n\n<p><strong>I spot dysfunction way sooner than most people.<\/strong> I\u2019m like a hound dog when it comes to sniffing out dysfunction. I have found myself the canary in a coal mine more than once in my career. This has taken a personal toll more than once, and I\u2019m very mindful of limiting its impact in the future. If you see me withdrawing, it\u2019s sometimes me pattern matching against past experiences that had Very Bad Outcomes.<\/p>\n\n<p><strong>I think holistically, which sometimes means I hold contradictory views on the same topic.<\/strong> To me, this is a strength when navigating complex systems. I am very good at navigating complex systems. To people around me, this can be frustrating! I can argue three contradictory points in the same number of minutes. I can appear to be hard to pin down on a position.<\/p>\n\n<p><strong>I actively seek contrarian views, sometimes to a fault.<\/strong> Healthy debate and conflict are the lifeblood of the team. I actively create opportunities for dissent. This can be uncomfortable for people who have never worked in environments like this! Apologies in advance \u2013 I will do everything I can to make your introduction to this as gentle as possible.<\/p>\n\n<p>I am very mindful that I am often wrong, and I want to hear what I am wrong about, and why I\u2019m wrong about it, as soon as possible. I don\u2019t have all the answers, but I will relentlessly question until we get a closer approximation of reality.<\/p>\n\n<p><strong>I have a preference for working on product and delivery problems over technical ones.<\/strong> Sometimes this means the team ends up focusing on shipping and going faster, and work to manage tech debt and tech growth gets de-prioritised.<\/p>\n\n<p>Sometimes this leaks into 1:1s (particularly if you are motivated by the same thing). 
I\u2019m aware of it, but sometimes I still get blindsided by it. When you see this happen, let me know.<\/p>\n\n<p><strong>I am an integrator in a segmenter\u2019s clothing.<\/strong> I have a natural tendency to blend the boundaries of home and work, and seamlessly transition between the two. Given I work from home, this can result in overwork and burnout.<\/p>\n\n<p>I have lots of strategies to create boundaries between the two, like:<\/p>\n\n<ul>\n  <li>Separate physical workspace<\/li>\n  <li>Separate devices (my work phone goes on top of the coffee machine at the end of every day, so if you slack me after hours, you won\u2019t get a response until 8am the next day)<\/li>\n  <li>Wearing different clothes for work and home<\/li>\n<\/ul>\n\n<p>Call me out if I\u2019m being naughty and working while sick.<\/p>\n\n<p><strong>I won\u2019t add you on Facebook.<\/strong> There is a power dynamic in our relationship, whether we choose to acknowledge it or not. Me adding you on Facebook puts you in an awkward position if you don\u2019t really want to share that information with me.<\/p>\n\n<p>The choice is up to you. If you friend me on Facebook, I will accept.<\/p>\n\n<p>Finally, <a href=\"https:\/\/fractio.nl\/2015\/07\/10\/think-talk-leadership\/\">I think to talk. 
I don\u2019t often talk to think.<\/a><\/p>\n\n<h2 id=\"this-document-like-me-is-a-work-in-progress\">This document, like me, is a work in progress<\/h2>\n\n<p>I try to update it frequently and appreciate your feedback.<\/p>\n","pubDate":"30 Jul 2018","link":"https:\/\/fractio.nl\/2018\/07\/30\/my-philosophy-on-work\/","guid":"https:\/\/fractio.nl\/2018\/07\/30\/my-philosophy-on-work\/"},{"title":"A simple proxy service for scrapers running on Morph","description":"<p>When I <a href=\"http:\/\/fractio.nl\/2008\/12\/12\/gotgastro-launched\/\">originally launched<\/a> Got Gastro back in 2008, New South Wales was the only Australian jurisdiction that <a href=\"http:\/\/foodauthority.nsw.gov.au\/penalty-notices\/\">published data<\/a> about food safety problems.<\/p>\n\n<p>Since then, several other Australian jurisdictions have started publishing their food safety data: <a href=\"https:\/\/www2.health.vic.gov.au\/public-health\/food-safety\/convictions-register\">Victoria<\/a>, <a href=\"http:\/\/ww2.health.wa.gov.au\/Articles\/F_I\/Food-offenders\/Publication-of-names-of-offenders-list\">WA<\/a>, <a href=\"http:\/\/www.health.act.gov.au\/sites\/default\/files\/\/Register%20of%20Food%20Offences.pdf\">ACT<\/a>, and <a href=\"http:\/\/www.sahealth.sa.gov.au\/wps\/wcm\/connect\/public+content\/sa+health+internet\/about+us\/legislation\/food+legislation\/food+prosecution+register\">SA<\/a>.<\/p>\n\n<p>As part of building the <a href=\"https:\/\/gotgastroagain.com\/\">new Got Gastro<\/a>, I have been <a href=\"https:\/\/github.com\/auxesis\/vic_health_register_of_convictions\">slowly<\/a> <a href=\"https:\/\/github.com\/auxesis\/wa_health_food_offenders\">adding<\/a> <a href=\"https:\/\/github.com\/auxesis\/nsw_food_authority_penalty_notices\">scrapers<\/a> for new data sets from across Australia and the UK.<\/p>\n\n<p>The low-hanging fruit has been picked \u2013 NSW and Victoria publish a page\/URL per notice, WA publishes a PDF per notice \u2013 now it\u2019s time for the 
harder data sources.<\/p>\n\n<!-- excerpt -->\n\n<p>South Australia Health <a href=\"http:\/\/www.sahealth.sa.gov.au\/wps\/wcm\/connect\/public+content\/sa+health+internet\/about+us\/legislation\/food+legislation\/food+prosecution+register\">publishes a register of food prosecutions<\/a>. The data is not structured. Every single entry is formatted differently. Business names, addresses, dates, and even field names are different for each entry.<\/p>\n\n<p>The clue to what\u2019s going on is in the class name on the content:<\/p>\n\n<div class=\"language-html highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nt\">&lt;div<\/span> <span class=\"na\">class=<\/span><span class=\"s\">\"wysiwyg\"<\/span><span class=\"nt\">&gt;<\/span>...<span class=\"nt\">&lt;\/div&gt;<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>This is a pretty common thing about public data: often the only reason it has been published is because legislation requires it.<\/p>\n\n<p>If the scale of the data being published is small, folk-systems spring up to handle the demand (in SA Health\u2019s case, a WYSIWYG field on a CMS to handle 6 notices). When the scale of reporting is big, you get a more structured, consistent approach (like the <a href=\"http:\/\/foodauthority.nsw.gov.au\/penalty-notices\/\">NSW Food Authority\u2019s ASP.NET application<\/a> that handles ~1500 notices).<\/p>\n\n<p>The challenge becomes: how do you build a scraper that handles the variations in data from artisanal, hand-crafted data sources?<\/p>\n\n<h2 id=\"the-scraper\">The scraper<\/h2>\n\n<p>But it turns out that\u2019s not even the most challenging problem with writing a scraper for this dataset. 
Sure, there are some annoying inconsistencies that require <a href=\"https:\/\/github.com\/auxesis\/sa_health_food_prosecutions_register\/blob\/d3a745ed62e4fb3730264ae043cd093a78aa8f29\/scraper.rb#L75-L136\">handling a few special cases<\/a>, but nothing impossible.<\/p>\n\n<p>The problem lies with how the scraper runs.<\/p>\n\n<p>The <a href=\"https:\/\/github.com\/auxesis\/sa_health_food_prosecutions_register\">sa_health_food_prosecutions_register<\/a> scraper runs on <a href=\"https:\/\/morph.io\/\">Morph<\/a>, a truly excellent scraping infrastructure run by the <a href=\"https:\/\/www.openaustraliafoundation.org.au\/\">Open Australia Foundation<\/a>.<\/p>\n\n<p>The scraper scrapes the food prosecutions register from the SA Health website. The SA health website sits behind some sort of Web Application Firewall. It\u2019s assumed this WAF is meant to block nasty requests to the website.<\/p>\n\n<p>Unfortunately, the WAF blocks legitimate requests from Morph, which means the scraper fails to run. The WAF sometimes returns a HTTP status code of 200 but with an error message in the body. Sometimes it just silently drops the TCP connection altogether. This behaviour only exhibits on Morph, not when running from within Australia.<\/p>\n\n<p>Bugs that only show up in production? The best.<\/p>\n\n<p>To make the scraper work on Morph, we can build a simple <a href=\"https:\/\/tinyproxy.github.io\/\">Tinyproxy<\/a>-based proxy service running in AWS to proxy requests from Morph to SA Health\u2019s website. The proxy is locked down to only accept requests originating from Morph.<\/p>\n\n<h3 id=\"designed-to-be-cheap-resilient-and-open\">Designed to be cheap, resilient, and open<\/h3>\n\n<p>The proxy service must be:<\/p>\n\n<ul>\n  <li>low cost<\/li>\n  <li>resilient to failure<\/li>\n  <li>open source and reproducible<\/li>\n<\/ul>\n\n<p>The last point is key.<\/p>\n\n<p>When I originally tested this proxying approach, I did it with a Digital Ocean droplet in Singapore. 
I forgot about it for a couple of weeks, then accidentally killed the droplet when I was cleaning up something else in my DO account. Aside from the fact that the proxy\u2019s existence and behaviour were opaque to anyone but me, I wanted other people to be able to use this proxying approach. More selfishly, I didn\u2019t want future Lindsay to have to remember how this house of cards was stacked.<\/p>\n\n<p>To keep costs low and the service resilient, the proxy service uses the AWS free tier and autoscaling groups.<\/p>\n\n<p>There is <a href=\"https:\/\/terraform.io\">Terraform<\/a> config <a href=\"https:\/\/github.com\/auxesis\/sa_health_food_prosecutions_register\/tree\/master\/proxy\">in the scraper\u2019s repo<\/a> to build a proxy instance and supporting environment. The Terraform config:<\/p>\n\n<ul>\n  <li>Sets up a single VPC, with a single public subnet, routing tables, and a single internet gateway.<\/li>\n  <li>Sets up an ELB to publicly terminate requests, locked down with a security group to only accept requests from Morph (don\u2019t want to be running an open proxy).<\/li>\n  <li>Sets up an autoscaling group of a single t2.micro (free tier) instance, with a launch config that boots the latest Ubuntu Xenial AMI, and links the ELB to the ASG.<\/li>\n<\/ul>\n\n<p>When the scraper runs on Morph with the <code class=\"language-plaintext highlighter-rouge\">MORPH_PROXY<\/code> environment variable set, it connects through the ELB to the Tinyproxy instance, which then proxies the request on to SA Health\u2019s website.<\/p>\n\n<h3 id=\"drive-changes-with-make-and-environment-variables\">Drive changes with <code class=\"language-plaintext highlighter-rouge\">make<\/code> and environment variables<\/h3>\n\n<p>Once you <a href=\"https:\/\/github.com\/auxesis\/sa_health_food_prosecutions_register\">clone the repo<\/a> and set some environment variables, you can start planning your changes:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div 
class=\"highlight\"><pre class=\"highlight\"><code>make plan\n<\/code><\/pre><\/div><\/div>\n\n<p>To apply the plan:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>make apply\n<\/code><\/pre><\/div><\/div>\n\n<p>To destroy the environment:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>make destroy\n<\/code><\/pre><\/div><\/div>\n\n<p>This Makefile approach was borrowed from <a href=\"https:\/\/github.com\/hectcastro\/terraform-aws-vpc\/\">hectcastro\/terraform-aws-vpc<\/a>, from which this Terraform config was forked.<\/p>\n\n<h2 id=\"wrap-it-with-a-continuous-deployment-pipeline\">Wrap it with a Continuous Deployment pipeline<\/h2>\n\n<p>To keep Terraform changes consistent, all changes to the proxy service are run through a <a href=\"https:\/\/travis-ci.org\/auxesis\/sa_health_food_prosecutions_register\">Continuous Deployment pipeline on Travis<\/a>. This means no changes to the \u201cproduction\u201d service are done locally. This is important for giving new contributors visibility into how the service runs and changes.<\/p>\n\n<p>Terraform relies on <code class=\"language-plaintext highlighter-rouge\">.tfstate<\/code> files to track state and changes between Terraform runs. 
Because Travis starts with a clean git clone every build (and thus no <code class=\"language-plaintext highlighter-rouge\">.tfstate<\/code>), <code class=\"language-plaintext highlighter-rouge\">terraform remote config<\/code> is used to push\/pull persistent state across builds.<\/p>\n\n<p>The <a href=\"https:\/\/github.com\/auxesis\/sa_health_food_prosecutions_register\/blob\/d3a745ed62e4fb3730264ae043cd093a78aa8f29\/.travis.yml\">pipeline is very simple<\/a> \u2013 it just runs <code class=\"language-plaintext highlighter-rouge\">proxy\/cibuild.sh<\/code> and <code class=\"language-plaintext highlighter-rouge\">proxy\/cideploy.sh<\/code>.<\/p>\n\n<p>These environment variables must be exported for <code class=\"language-plaintext highlighter-rouge\">proxy\/cibuild.sh<\/code> and <code class=\"language-plaintext highlighter-rouge\">proxy\/cideploy.sh<\/code> to work:<\/p>\n\n<ul>\n  <li><code class=\"language-plaintext highlighter-rouge\">BUCKET<\/code>, the name of the S3 bucket the state will be sync\u2019d with by <code class=\"language-plaintext highlighter-rouge\">terraform remote config<\/code><\/li>\n  <li><code class=\"language-plaintext highlighter-rouge\">AWS_ACCESS_KEY_ID<\/code>, access key for the IAM user, used by <code class=\"language-plaintext highlighter-rouge\">terraform remote config<\/code><\/li>\n  <li><code class=\"language-plaintext highlighter-rouge\">AWS_SECRET_ACCESS_KEY<\/code>, access key secret for the IAM user, used by <code class=\"language-plaintext highlighter-rouge\">terraform remote config<\/code><\/li>\n  <li><code class=\"language-plaintext highlighter-rouge\">TF_VAR_aws_access_key<\/code>, access key for the IAM user, used by <code class=\"language-plaintext highlighter-rouge\">terraform plan<\/code> and <code class=\"language-plaintext highlighter-rouge\">terraform apply<\/code><\/li>\n  <li><code class=\"language-plaintext highlighter-rouge\">TF_VAR_aws_secret_key<\/code>, access key secret for the IAM user, used by <code class=\"language-plaintext 
highlighter-rouge\">terraform plan<\/code> and <code class=\"language-plaintext highlighter-rouge\">terraform apply<\/code><\/li>\n<\/ul>\n\n<p>In the <code class=\"language-plaintext highlighter-rouge\">sa_health_food_prosecutions_register<\/code> proxy service case, these environment variables are exported as <a href=\"https:\/\/docs.travis-ci.com\/user\/environment-variables\/#Defining-encrypted-variables-in-.travis.yml\">encrypted environment variables<\/a> in <a href=\"https:\/\/github.com\/auxesis\/sa_health_food_prosecutions_register\/blob\/master\/.travis.yml\">.travis.yml<\/a>. This keeps the config and most of the data open, and easily reproducible.<\/p>\n\n<h2 id=\"civic-hacking-for-government-shortfalls\">Civic hacking for government shortfalls<\/h2>\n\n<p>This was a huge amount of work for a very small data set (6 notices!), but I believe it was worth it.<\/p>\n\n<p>The approach allows the scraper to reliably run on Morph, and behave in a way that\u2019s consistent with other scrapers. The costs are minimal, which is important if I\u2019m picking up the tab for poor government IT.<\/p>\n\n<p><em>(Side note: if you were a member of the public with an urgent enquiry for SA Health, but you were being silently dropped by their WAF, how would you contact them to let them know? Their contact numbers are on their website, after all)<\/em><\/p>\n\n<p>Most importantly, the service is <a href=\"https:\/\/github.com\/auxesis\/sa_health_food_prosecutions_register\/tree\/master\/proxy\">open source and reproducible<\/a>. 
When I asked on the Open Australia Slack about other cases of Morph scrapers failing because of active blocking of requests, nobody could think of any.<\/p>\n\n<p>I hope nobody ever has to do anything like this to make their scrapers run, but if they do, there\u2019s now a Terraform project to set up a proxy that costs less than $5\/month.<\/p>\n\n<p>Happy civic hacking!<\/p>\n","pubDate":"11 Dec 2016","link":"https:\/\/fractio.nl\/2016\/12\/11\/simple-proxy-service-for-morph-scrapers\/","guid":"https:\/\/fractio.nl\/2016\/12\/11\/simple-proxy-service-for-morph-scrapers\/"},{"title":"AWS in government: risks, myths, and misconceptions","description":"<p><em>The opinions in this post are my own, and do not represent my employer.<\/em><\/p>\n\n<p>To effectively meet user needs when undertaking digital transformation initiatives, we need infrastructure technology that can adapt as fast as we do.<\/p>\n\n<p>Because this type of technology is fairly new to government, there is a lot of uncertainty about how it can be used, as well as a lot of optimism about the opportunity it brings.<\/p>\n\n<p>Let\u2019s discuss some of the risks, myths, and misconceptions we\u2019ve encountered on our journey to fully cloud-based infrastructure.<\/p>\n\n<!-- excerpt -->\n\n<h2 id=\"myth-we-cant-store-data-securely\">Myth: We can\u2019t store data securely!<\/h2>\n\n<p>AWS is on Australian Signals Directorate\u2019s <a href=\"http:\/\/www.asd.gov.au\/infosec\/irap\/certified_clouds.htm\">Certified Cloud Services List<\/a> alongside several other Infrastructure as a Service (IaaS) providers like Azure. AWS has four services IRAP accredited by ASD up to <a href=\"https:\/\/www.protectivesecurity.gov.au\/informationsecurity\/Documents\/AustralianGovernmentclassificationsystem.pdf\">Unclassified DLM<\/a>: EBS, EC2, S3, and VPC. 
If you\u2019ve used AWS in the private sector you might find this catalogue limited, but there are a heap of workloads you can run on AWS with just these four services.<\/p>\n\n<p>It\u2019s worth noting that ASD acknowledges the risks that come with existing in house systems compared to cloud services. From ASD\u2019s <a href=\"http:\/\/www.asd.gov.au\/publications\/protect\/cloud-security-tenants.htm\">Cloud Computing Security for Tenants<\/a> guide:<\/p>\n\n<blockquote>\n  <p>Organisations need to perform a risk assessment and implement associated mitigations before using cloud services. Risks vary depending on factors such as the sensitivity and criticality of data to be stored or processed, how the cloud service is implemented and managed, how the organisation intends to use the cloud service, and challenges associated with the organisation performing timely incident detection and response. Organisations need to compare these risks against an objective risk assessment of using in house computer systems which might: be poorly secured; have inadequate availability; or, be unable to meet modern business requirements.<\/p>\n<\/blockquote>\n\n<p>While these AWS services are accredited up to Unclassified DLM, if you have protected data, there are some strategies you can use to make parts of this data available on AWS.<\/p>\n\n<p>Most data is classified at the row level in databases. While you can\u2019t put a protected row on AWS, you\u2019ll often find that individual columns in that protected row have a lower classification. This means you can put unclassified columns from protected rows on AWS, and work out a way to match up data between your public systems on AWS, and your private systems on protected networks.<\/p>\n\n<h2 id=\"misconception-well-run-it-like-physical-infrastructure\">Misconception: We\u2019ll run it like physical infrastructure!<\/h2>\n\n<p>Once you\u2019ve procured AWS, often you\u2019ll want to go for the biggest cost savings immediately. 
<a href=\"https:\/\/aws.amazon.com\/ec2\/pricing\/reserved-instances\/\">Reserved Instances<\/a> are a great way to achieve these cost savings, especially if you buy them for a three-year period.<\/p>\n\n<p>But the value of AWS to government is not low-cost compute, it\u2019s on-tap capacity. We can\u2019t extract this value unless we build and run services like AWS recommends. To do this, we have to think differently about our software architectures.<\/p>\n\n<p>The risk with buying RIs up front for three years is you don\u2019t know what your workloads are going to be three years from now, let alone what architecture you\u2019ll build to deal with them.<\/p>\n\n<p>You might optimise your code to run in parallel across many cheaper instances. You might shift your workloads to spot instances for ad-hoc calculations.<\/p>\n\n<p>To achieve a sustainable, controlled spend, you have to start with On-Demand instances, track your spend over several months, and identify instance types that are constantly used.<\/p>\n\n<p>Then buy RIs for a year.<\/p>\n\n<p>This works well both for lifting-and-shifting traditional workloads and for greenfield projects using cloud native architectures. 
If you\u2019re really keen, purchase RIs for three years, but beware the risk of premature investment in an architecture that may not match your workloads.<\/p>\n\n<p>If you do find that you\u2019re not using the RIs you\u2019ve purchased, you can sell them on the <a href=\"https:\/\/aws.amazon.com\/ec2\/purchasing-options\/reserved-instances\/marketplace\/\">RI marketplace<\/a>.<\/p>\n\n<h2 id=\"risk-our-spend-is-getting-out-of-control\">Risk: Our spend is getting out of control!<\/h2>\n\n<p>If you don\u2019t spend time monitoring and analysing your usage of IaaS services, you can very quickly find yourself spending more than you planned.<\/p>\n\n<p>Use multiple accounts to segment and control your spend.<\/p>\n\n<p><a href=\"http:\/\/docs.aws.amazon.com\/awsaccountbilling\/latest\/aboutv2\/consolidated-billing.html\">Consolidated Billing<\/a> allows you to logically separate services you\u2019re delivering across multiple accounts, but see costs in one place: on the parent account\u2019s billing page. Even better, you can grant your finance teams the ability to view billing information in the AWS console, so they can get straight to the information they need to make informed, financially prudent decisions.<\/p>\n\n<p>You can take this even further by using <a href=\"http:\/\/docs.aws.amazon.com\/awsaccountbilling\/latest\/aboutv2\/con-bill-blended-rates.html#Blended_CB\">blended rates<\/a> across your AWS accounts, where On-Demand and Reserved Instance rates are averaged across linked accounts using consolidated billing. 
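<\/p>\n\n<p>A blended rate is, roughly, a usage-weighted average. A toy illustration with made-up usage and prices (not real AWS rates):<\/p>\n\n
```python
# Toy blended-rate calculation: hypothetical usage and prices across linked accounts
ri_hours, ri_rate = 720, 0.04   # hours covered by reservations, at the RI rate
od_hours, od_rate = 280, 0.10   # on-demand hours, at the on-demand rate

# Every linked account is billed at the averaged (blended) hourly rate
blended = (ri_hours * ri_rate + od_hours * od_rate) / (ri_hours + od_hours)
print(f"Blended rate: ${blended:.4f}/hour")
```
\n\n<p>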
This allows you to make reservations once, but automatically get the cost savings across all linked accounts.<\/p>\n\n<p>Separate accounts are also useful if your service ever gets <a href=\"http:\/\/www.apsc.gov.au\/publications-and-media\/current-publications\/machinery-of-government\">mogged<\/a> \u2013 just unlink the account and re-link it to the new parent account.<\/p>\n\n<p>One technical approach to controlling spend is to automatically shut down non-production environments every night, and rebuild them in the morning. Because On-Demand instance pricing is calculated hourly, you\u2019ll halve your cost by only running instances for half the day.<\/p>\n\n<p>The side effect of this is incentivising a culture of technical resilience. When you\u2019re creating and destroying whole replicas of production systems every day, you become really good at creating and destroying environments, and more importantly <a href=\"https:\/\/en.wikipedia.org\/wiki\/Crash-only_software\">automatically handling failure<\/a>.<\/p>\n\n<p>There\u2019s also the benefit of better security posture through short-lived environments. By having ephemeral environments, we reduce the impact of one of the <a href=\"https:\/\/medium.com\/built-to-adapt\/the-three-r-s-of-enterprise-security-rotate-repave-and-repair-f64f6d6ba29d#.s0qrwdlc6\">three resources required for effective cyber attacks<\/a>: time. When we rebuild environments daily from fully patched base images, we time-limit the window of opportunity for attacks to take hold.<\/p>\n\n<h2 id=\"risk-our-stuff-is-getting-hacked\">Risk: Our stuff is getting hacked!<\/h2>\n\n<p>As we mentioned before, we can\u2019t extract the value from IaaS providers like AWS unless we build and run services like they recommend. One of the fastest ways to do this is to give your developers full AWS access, to experiment with different ways of building services. 
By using <a href=\"http:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/introduction.html\">IAM<\/a> <a href=\"http:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_users.html\">users<\/a>, <a href=\"http:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_groups.html\">groups<\/a>, and <a href=\"http:\/\/docs.aws.amazon.com\/IAM\/latest\/UserGuide\/id_roles.html\">roles<\/a>, you can selectively grant your developers the ability to create, update, and destroy environments, fostering that culture of technical resilience.<\/p>\n\n<p>One side effect of this delegation of responsibility is that services and data can be accidentally exposed to the world. You just need to take a look at the <a href=\"http:\/\/www.acma.gov.au\/theACMA\/aisi-open-services-statistics\">AISI daily observations per open service family<\/a> graphs to see how prevalent this problem is across the Australian address space:<\/p>\n\n<p><img src=\"\/images\/aisi-daily-observations-per-open-service-type.png\" alt=\"Graph of AISI daily observations per open service family\" \/><\/p>\n\n<p>One solution is to audit publicly exposed services on your AWS accounts hourly, and automatically notify your people when the <a href=\"https:\/\/speakerdeck.com\/garethr\/security-monitoring-with-open-source-penetration-testing-tools\">automated security monitoring<\/a> detects something amiss. This is surprisingly easy to do with IaaS: just query the APIs to get a list of IP addresses in use, run a scan against these addresses, and notify owners immediately, all based on tag information on vulnerable hosts, also exposed through the API.<\/p>\n\n<h2 id=\"misconception-we-arent-getting-the-reliability-benefits\">Misconception: We aren\u2019t getting the reliability benefits!<\/h2>\n\n<p>As we mentioned before, we can\u2019t extract the value from IaaS providers like AWS unless we build and run services like they recommend. 
On IaaS like AWS, we do this by building highly reliable systems from unreliable components.<\/p>\n\n<p>One very simple way of achieving this is with Auto Scaling Groups. You can use Auto Scaling Groups to ensure a minimum number of like-instances are running, and to scale the number of instances in the group up and down based on demand. This ensures that if any of the underlying instances that make up the Auto Scaling Group fail, they are automatically recreated.<\/p>\n\n<p>To get the full benefit of Auto Scaling Groups, you need to pre-bake your applications into your instances. Think of it as a frozen pizza \u2013 you automate the hard work up-front to get your application and environment ready to go, then the Auto Scaling Group warms them up at the last minute for consumption.<\/p>\n\n<p>The caveat for this to work: you need a strong continuous delivery capability that is highly automated \u2013 everything must go to production through the pipeline.<\/p>\n\n<p>The effect of this is that changes and releases become non-events. You can very quickly reach a point where you\u2019re deploying tens, if not hundreds, of changes a day \u2013 all with minimal human intervention.<\/p>\n\n<p>Whenever you make a change to the application or the underlying environment, your systems automatically build new images that can be used in your Auto Scaling Groups. This requires <a href=\"https:\/\/www.terraform.io\/intro\/hashicorp-ecosystem.html\">new tools and ways of addressing<\/a> your infrastructure, programmatically.<\/p>\n\n<p>Having a fully automated change process can make satisfying regulatory requirements easier, because all changes are highly controlled and logged, and each part of the automation has limited access, courtesy of IAM. 
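<\/p>\n\n<p>The self-healing Auto Scaling Group described above is only a few lines of Terraform. A minimal sketch \u2013 the names, AMI, and availability zone are placeholders, not any real project\u2019s config:<\/p>\n\n
```
# Launch config boots a pre-baked image (placeholder AMI)
resource "aws_launch_configuration" "app" {
  image_id      = "ami-12345678"
  instance_type = "t2.micro"
}

# ASG keeps at least one instance running, recreating any that fail
resource "aws_autoscaling_group" "app" {
  min_size             = 1
  max_size             = 2
  launch_configuration = "${aws_launch_configuration.app.name}"
  availability_zones   = ["ap-southeast-2a"]
}
```
\n\n<p>With <code class=\"language-plaintext highlighter-rouge\">min_size<\/code> set, a failed instance is terminated and replaced automatically from the pre-baked image.<\/p>\n\n<p>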
When you combine this with <a href=\"https:\/\/aws.amazon.com\/cloudtrail\/\">CloudTrail<\/a>, you get a pretty powerful combination of access control and auditing.<\/p>\n\n<p>Now you\u2019re getting multiple benefits: reliability, auditability, scalability, and pertinently, cost reduction \u2013 you don\u2019t pay for what you\u2019re not using, calculated by the hour.<\/p>\n\n<h2 id=\"conclusion\">Conclusion<\/h2>\n\n<p>The opportunity IaaS provides is immense.<\/p>\n\n<p>IaaS providers like AWS help make doing the right thing easy. IaaS eliminates classes of problems, freeing up your teams to focus on the bigger picture. Most importantly, it frees people up to help your organisation learn about modern technology practices for building highly reliable government services.<\/p>\n","pubDate":"12 Oct 2016","link":"https:\/\/fractio.nl\/2016\/10\/12\/aws-in-government-myths-risks-misconceptions\/","guid":"https:\/\/fractio.nl\/2016\/10\/12\/aws-in-government-myths-risks-misconceptions\/"},{"title":"Help! I\u2019ve just been made a manager","description":"<blockquote>\n  <p><em>Your boss calls you into her office.<\/em><\/p>\n\n  <p><em>\u201cCongratulations - I\u2019m promoting you to team lead!\u201d<\/em><\/p>\n\n  <p><em>Your mouth goes dry.<\/em><\/p>\n\n  <p><em>\u201cYou\u2019ve been doing such a great job on the last few projects, the leadership team thought you could help other people in your team perform just as good as you.\u201d<\/em><\/p>\n\n  <p><em>Your stomach turns to stone.<\/em><\/p>\n\n  <p><em>\u201cYour new role starts now. We\u2019ll see how it works out, and come back in a few weeks to review.\u201d<\/em><\/p>\n<\/blockquote>\n\n<hr \/>\n\n<p>This experience may feel too familiar \u2013 and perhaps painful \u2013 to you. 
You get thrown into the deep end with a life jacket\/anchor labeled \u201cteam lead\u201d\/\u201csupervisor\u201d\/\u201cacting manager\u201d.<\/p>\n\n<p>As we\u2019ve read before, moving into a management position is not a promotion, <a href=\"https:\/\/fractio.nl\/2014\/09\/19\/not-a-promotion-a-career-change\/\">it\u2019s a career change<\/a>. But fate (or your boss) may not agree.<\/p>\n\n<p>How do you survive your first few weeks in your management role?<\/p>\n\n<!-- excerpt -->\n\n<h3 id=\"get-a-job-description\">Get a job description<\/h3>\n\n<p>Getting a job description written down helps clarify your boss\u2019s expectations about what functions your role is meant to perform, and what is expected of you. They\u2019re ground rules that help you understand the parameters of your work.<\/p>\n\n<p>Make sure you have a conversation with your boss about each part of the job description, to clarify your boss\u2019s interpretation. Note down anything that was different or required clarification, then send an updated version to your boss, for both your records.<\/p>\n\n<p>If you can\u2019t get a job description, write your own.<\/p>\n\n<p>You\u2019re probably thinking right now <em>\u201cOh no, it\u2019s a trap!\u201d<\/em> or <em>\u201cIsn\u2019t this what my boss should be doing?\u201d<\/em>, and you\u2019re right, it is your boss\u2019s responsibility. There\u2019s a very real chance your boss doesn\u2019t have time \u2013 that\u2019s part of the reason why you\u2019re getting your \u201cpromotion\u201d. That\u2019s not a justification, it\u2019s just a fact. Your boss may not have been in this position before either, and is just making it up as they go along. A lot of companies don\u2019t have good processes for how this sort of role change is meant to work, and thus don\u2019t have any pre-canned job descriptions they can hand to new managers. 
Don\u2019t even ask about training.<\/p>\n\n<p>Writing your own job description is a fantastic opportunity to define what exactly you\u2019re going to be doing, in your own words, while demonstrating your communication and goal-setting abilities.<\/p>\n\n<p>Get 1-2 peers to review what you\u2019ve written, and consider getting your new team to review the job description. This can help build rapport with them, and get their buy-in on what the new team is going to look like. But beware: if you\u2019ve been given this new role over someone else who now reports to you, there may be tension \u2013 modify your technique to your audience.<\/p>\n\n<p>Once the job description is written, send it to your boss with an <em>\u201cI know you\u2019re busy, so if you don\u2019t see any problems with it, no need to reply\u201d<\/em>.<\/p>\n\n<h3 id=\"managing-your-workload\">Managing your workload<\/h3>\n\n<p>\u201cI\u2019m already overloaded\u201d you\u2019re thinking, \u201cHow am I supposed to look after all these people while doing my existing work?\u201d \u2013 this is the biggest fear, and biggest challenge, when moving into a management role.<\/p>\n\n<p>You have three options:<\/p>\n\n<ul>\n  <li><strong>Keep trying to do your own work while managing others<\/strong>. This almost certainly will end in you doing a mediocre job of both. You\u2019ll spend 30% of your time on engineering, 10% on people management, and the remaining 60% on context switching and self-loathing.<\/li>\n  <li><strong>Aggressively cut the scope of your personal technical workload<\/strong>, and manage the stakeholder expectations for those cuts. This frees you up to spend some of your time on people work.<\/li>\n  <li><strong>Make your old engineering workload your team\u2019s workload<\/strong>. Still cut the scope of the work, and manage your stakeholder expectations. Help the team become better at doing some of the work you were doing previously. 
Don\u2019t forget to manage the stakeholder relationships for their work too.<\/li>\n<\/ul>\n\n<p>Remember why this role change is happening: as a leader, you provide more value to the team as a multiplier than as an individual contributor. If you free up each person in your team to focus more clearly on their work, and complete that work more efficiently \u2013 that\u2019s greater than any engineering contribution you can make as an individual.<\/p>\n\n<p>The performance of the team will drop when the team is in this transition phase. The team is reconfiguring itself, working out what\u2019s important, what\u2019s not, how work gets done, who has what responsibilities.<\/p>\n\n<p>If you can successfully manage the transition, the team will be more productive than it is as a collection of individual contributors. That is your goal.<\/p>\n\n<h3 id=\"create-feedback-loops\">Create feedback loops<\/h3>\n\n<p>Part of being good at people work is creating strong feedback loops from the people in the team to you. Have a reliable, predictable avenue for them to report problems and suggest changes, then act on their information, building trust.<\/p>\n\n<p>Organise regular one-on-ones with your team. Once a week, half an hour, away from the office if you can.<\/p>\n\n<p>Be honest and upfront with them that you\u2019re new to this, and are trying to work it out.<\/p>\n\n<p>Be vulnerable about your limits and abilities. Ask them to raise problems with how you\u2019re managing them immediately, and show you\u2019re listening to them by acting on their feedback promptly. Build trust by listening and changing behaviour.<\/p>\n\n<p>Find out their biggest fears and anxieties about the new work situation. Ask simply:<\/p>\n\n<blockquote>\n  <p>\u201cIs there anything I can do to help make things better?\u201d<\/p>\n<\/blockquote>\n\n<p>Create a feedback loop from the team to the rest of the organisation by publicly praising good work from individuals. 
Raise the profile of their work to the broader audience in your org by publicly calling out good work or congratulating them on the successful delivery of features, projects, or quick bug fixes.<\/p>\n\n<p>Make sure you pass all the credit down to the team.<\/p>\n\n<h3 id=\"hard-truths\">Hard truths<\/h3>\n\n<p>In my first year as a manager I spent most of the time trying to make sense of the new context I found myself operating in. Expectations were re-calibrated (sometimes brutally), people were disappointed (some even left), deadlines were missed (occasionally by wide margins).<\/p>\n\n<p>These were realisations I came to that helped me cope in my first year:<\/p>\n\n<ul>\n  <li><strong>Demand will always exceed capacity.<\/strong> Doesn\u2019t matter how good you are at managing workload \u2013 team or personal \u2013 there will always be more to do. There will always be someone disappointed you\u2019re not doing the exact work they want, when they want it. Fuck the haters.<\/li>\n  <li><strong>Competence is rewarded with more work.<\/strong> If your boss or your boss\u2019s boss sees you doing a good job, they will want to see how much further you can go. Put more cynically: no good deed goes unpunished.<\/li>\n  <li><strong>There will always be a tension between doing technical work and doing people work.<\/strong> When you\u2019re not in a pure-management role (i.e. <em>team lead<\/em>, <em>supervisor<\/em>) you still have engineering work to do, and finding the balance will be messy \u2013 especially as you\u2019re new to this whole people management thing. You think you\u2019ve got a handle on it, and something will knock you for six. Keep going, reflect, try out different things, and you\u2019ll get there.<\/li>\n  <li><strong>Technical output is no longer the sole measurement of your job success.<\/strong> Your own technical output is a false measurement for your responsibilities to the team. 
You will always be disappointed if you measure your current self against your past self, that past self that only had technical delivery responsibilities. Your disappointment will lead you to prioritising technical work over people work, consequently screwing over the people who are looking to you for help and guidance. <em>Prioritise people work<\/em>.<\/li>\n<\/ul>\n\n<p>Finally: everyone else is just making it up. Nobody comes into a management role with all the answers. The leaders you look up to have made heaps of mistakes that shaped their leadership and management style.<\/p>\n\n<p>Get to it.<\/p>\n","pubDate":"25 Jan 2016","link":"https:\/\/fractio.nl\/2016\/01\/25\/help-i-have-just-been-made-a-manager\/","guid":"https:\/\/fractio.nl\/2016\/01\/25\/help-i-have-just-been-made-a-manager\/"},{"title":"PreAccident Investigation Podcast Highlights, Sep-Oct 2015","description":"<p>These are notes I\u2019ve taken while binge-listening to the last two months of the <a href=\"http:\/\/preaccidentpodcast.podbean.com\/\">PreAccident Investigation Podcast<\/a>, which you should <a href=\"https:\/\/itunes.apple.com\/au\/podcast\/preaccident-investigation\/id962990192?mt=2\">subscribe to<\/a>.<\/p>\n\n<!-- excerpt -->\n\n<h3 id=\"kent-whipple--the-power-of-the-story\">Kent Whipple \u2013 The power of the story<\/h3>\n\n<ul>\n  <li>When we investigate an accident, we don\u2019t tell the story of what happened, we tell a story about what didn\u2019t happen.<\/li>\n  <li>Identifying what didn\u2019t happen doesn\u2019t help you fix what did happen.<\/li>\n<\/ul>\n\n<p><a href=\"http:\/\/preaccidentpodcast.podbean.com\/e\/papod-41-comedian-kent-whipple-and-the-power-of-the-story\/\">Listen to the episode.<\/a><\/p>\n\n<h3 id=\"dr-alan-frankfurt---high-reliability-safety-and-delivering-babies\">Dr. 
Alan Frankfurt - High Reliability, Safety, and Delivering Babies<\/h3>\n\n<ul>\n  <li>Highly reliable teams don\u2019t realise they\u2019re highly reliable, they don\u2019t set off to become highly reliable, they set off to become more stable, safer, more effective, or to learn.<\/li>\n  <li>Destroy vertical silos, create horizontal integrations. The silos stop us from working together and becoming as good as we can get. Horizontal integrations help people take ownership.<\/li>\n  <li>Prepare for events with a pre-brief: use a template, identify what the threats are, verbalise and share concerns.<\/li>\n  <li>Hold a post-action review soon after the operation, schedule around the surgeon because of time demands.<\/li>\n  <li>There are rarely technical issues, but there are always communication issues that come up.<\/li>\n  <li>Everyone in the team needs a role, and to understand how that role fits into the goal of the team. \u201cI\u2019m gonna do my doctor thing, you\u2019re gonna do your nurse thing, but I\u2019m not any more important than you are.\u201d<\/li>\n<\/ul>\n\n<p><a href=\"http:\/\/preaccidentpodcast.podbean.com\/e\/papod-39-high-reliability-safety-and-delivering-babies-dr-alan-frankfurt\/\">Listen to the episode.<\/a><\/p>\n\n<h3 id=\"dr-jim-joy---critical-controls\">Dr Jim Joy - Critical Controls<\/h3>\n\n<ul>\n  <li>Risk registers end up being a list of problems on paper that is useless as a management tool.<\/li>\n  <li>Critical controls are a more effective management tool for dealing with risks and events.<\/li>\n  <li>Controls are anything that prevents or mitigates an unwanted event, that we can use to improve our resilience when things go wrong.<\/li>\n  <li>Controls can be acts, objects, and systems.<\/li>\n  <li>Acts are behaviours we mandate or encourage.<\/li>\n  <li>Objects are tools that work by themselves.<\/li>\n  <li>Systems are combinations of acts and objects.<\/li>\n  <li>Training is not a control, supervision is not a control. 
We can\u2019t measure it, we can\u2019t validate it, we can\u2019t audit it.<\/li>\n  <li>Once we have controls, we can define performance requirements (pressure valve is released at x pressure, the operator understands how to perform x task in context), measure, then validate those performance requirements are being met.<\/li>\n  <li>Once we have requirements, we can set targets to assess the reliability of the controls, which is more of an objective discussion around metrics.<\/li>\n  <li>We can then feed these metrics into the design of the controls based on what happens in our organisations.<\/li>\n  <li>We need to move beyond thinking about risks as <em>likelihood x consequence<\/em>.<\/li>\n  <li>Risk is the degree to which <em>your controls aren\u2019t working<\/em>.<\/li>\n  <li><a href=\"http:\/\/icmm.com\/publications\/health-and-safety-critical-control-management-good-practice-guide\">Health and Safety Critical Control Management Good Practice Guide<\/a> from ICMM publications<\/li>\n<\/ul>\n\n<p><a href=\"http:\/\/preaccidentpodcast.podbean.com\/e\/papod-37-dr-jim-joy-critical-controls\/\">Listen to the episode.<\/a><\/p>\n\n<h3 id=\"dr-jim-barker---complexity\">Dr Jim Barker - Complexity<\/h3>\n\n<ul>\n  <li>We don\u2019t manage complexity, we move with it.<\/li>\n  <li>Think about complexity as fluidity instead of non-linearity, because there are linear aspects to our complex systems (like time).<\/li>\n<\/ul>\n\n<p><a href=\"http:\/\/preaccidentpodcast.podbean.com\/e\/papod-9-repeat-by-request-lets-talk-about-complexity-and-organizations-with-jim-barker\/\">Listen to the episode.<\/a><\/p>\n\n<h3 id=\"martha-acosta---the-4-things-leaders-control\">Martha Acosta - The 4 Things Leaders Control<\/h3>\n\n<ul>\n  <li>When leaders say \u201ccome to me with solutions, not problems\u201d it seems like a great empowerment move, but they\u2019re creating a distance between workers and management.<\/li>\n  <li>If people come up with solutions, wouldn\u2019t it 
be more empowering to just let them go and implement them, and only come to you when they have problems they can\u2019t solve?<\/li>\n  <li>The value leaders provide to their organisations is helping the people at the pointy end ask the right questions, and helping them create a solution at the pointy end.<\/li>\n  <li>\u201cWhen significant change comes up against significant culture, culture always wins\u201d - Edgar Schein<\/li>\n  <li>Culture is something that arises from behaviour. That behaviour tells us what matters, how we do things around here, what works and what doesn\u2019t. The internalisation of that behaviour is what becomes culture.<\/li>\n  <li>Outsiders see culture. Insiders have difficulty seeing it.<\/li>\n  <li>Once we see culture externally, we think we can change culture externally.<\/li>\n  <li>The four things leaders control are Roles (what people do), Processes (how we do work), Norms (how we interact with one another), Metrics (what we measure and incentivise).<\/li>\n  <li>Anxiety in the leadership structure is contagious, and can turn into fear lower down in the org structure.<\/li>\n  <li>Social anthropologists see culture as a bunch of intertwining narratives. If you add an anxiety narrative to your culture\u2019s story, that changes the story.<\/li>\n  <li>When you get people talking about their narrative in your culture\u2019s story, that reflection produces surprises and uncovers how things work.<\/li>\n<\/ul>\n\n<p><a href=\"http:\/\/preaccidentpodcast.podbean.com\/e\/papod-36-martha-acosta-returns-the-4-things-leaders-control\/\">Listen to the episode.<\/a><\/p>\n\n<h3 id=\"dr-eric-young--patient-safety-surgery\">Dr. 
Eric Young \u2013 Patient Safety, Surgery<\/h3>\n\n<ul>\n  <li>Checklists have become overused to the point where they\u2019re causing more harm than good (Dr Young has seen 84 items on one list); they need to be kept to a page, per the Checklist Manifesto.<\/li>\n  <li>The best way to reduce error rates is to ensure a consistent team is working together to perform the surgeries.<\/li>\n  <li>This isn\u2019t always a possibility, so ensuring consistent skills across all team members is the next best thing.<\/li>\n  <li>Dr Young is surprised more patients don\u2019t get actively involved in their medical care by asking their doctors questions (e.g. why this brand of joint replacement over another?) and finding out more about their treatments.<\/li>\n<\/ul>\n\n<p><a href=\"http:\/\/preaccidentpodcast.podbean.com\/e\/papod-35-patient-safety-surgery-and-young-dr-young\/\">Listen to the episode.<\/a><\/p>\n","pubDate":"11 Nov 2015","link":"https:\/\/fractio.nl\/2015\/11\/11\/preaccident-investigation-highlights-09102015\/","guid":"https:\/\/fractio.nl\/2015\/11\/11\/preaccident-investigation-highlights-09102015\/"},{"title":"Blame. Language. Sharing.","description":"<p>Failure can lead to blame or inquiry in your organisation.<\/p>\n\n<p>When failure leads to blame, organisations subscribe to the old view of human error. They construct a narrative that\u2019s far worse than the reality, a narrative that focuses on a single root cause, which is inevitably human error. This reductionist and deconstructive process has us go down-and-in, treating people and systems as separate entities, with people at the root of the cause.<\/p>\n\n<p>When failure leads to inquiry, organisations subscribe to the new view of human error. People are part of the systems, inquiry is angled up-and-out, focused on understanding the relationships and bigger picture ideas at play. 
This is difficult, because it involves acknowledging and embracing complexity.<\/p>\n\n<p>When failure leads to inquiry, we embrace different perspectives, different stories, different interests - and often these contradict one another. By embracing these differences, we create an opportunity for learning for people inside the organisation, navigating the delta between how we imagine work is completed in our organisation, and how it is actually done.<\/p>\n\n<p>Learning organisations have three distinct advantages:<\/p>\n\n<!-- excerpt -->\n\n<ul>\n  <li>They have feedback loops that deliver high quality feedback from the front lines,<\/li>\n  <li>Which allows people performing the work to focus on quality and delivery,<\/li>\n  <li>Which reduces the amount of defending of decisions by practitioners.<\/li>\n<\/ul>\n\n<p>These three advantages minimise the likelihood of a Cover Your Arse culture emerging, where people focus more on implementing insulation against potential blowback from performing work, than actually performing the work itself.<\/p>\n\n<p>I posit there are three contributing factors that inhibit learning in organisations:<\/p>\n\n<ul>\n  <li>Language we use when talking about and contextualising failure<\/li>\n  <li>Blame and the tainted narrative we construct via cognitive biases<\/li>\n  <li>Sharing of experiences in our organisations to uncover understanding<\/li>\n<\/ul>\n\n<h2 id=\"language\">Language<\/h2>\n\n<p>The words we use when talking about events are really important.<\/p>\n\n<p>Words are framing devices that can both expand and limit the scope of inquiry. These words are used during your investigations, retrospectives, learning reviews, brainstorming sessions, and post-mortems. But most importantly they\u2019re used when having daily conversations with your colleagues.<\/p>\n\n<h3 id=\"why\">Why<\/h3>\n\n<p><em>Why<\/em> is used to force people to justify actions, to attribute and apportion blame. 
<em>Why<\/em> goes down-and-in, focuses the inquiry on people, and is often used to phrase counterfactuals that focus attention on a past that didn\u2019t happen \u2013 <em>\u201cwhy didn\u2019t you answer the page?\u201d<\/em>, <em>\u201cwhy didn\u2019t you check the backups?\u201d<\/em>.<\/p>\n\n<p><em>Why<\/em> plays right into the hands of the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Fundamental_attribution_error\">Fundamental Attribution Error<\/a>, where we explain other people\u2019s actions by their personality, not the context they find themselves in, but we explain our own actions by our context, not our personality.<\/p>\n\n<h3 id=\"how\">How<\/h3>\n\n<p><em>How<\/em> is about articulating the mechanics of a situation, which is helpful for distancing people from the actions they took. <em>How<\/em> clarifies technical details - <em>\u201chow did the site go down?\u201d<\/em>, <em>\u201chow did the team react?\u201d<\/em> \u2013 but it can also limit the scope of the inquiry, as we focus on the mechanics, not the relationships at play in the larger system.<\/p>\n\n<h3 id=\"what\">What<\/h3>\n\n<p><em>What<\/em> uncovers reasoning, which is important for building empathy with people in complex systems \u2013 <em>\u201cwhat did you think was happening?\u201d<\/em>, <em>\u201cwhat did you do next?\u201d<\/em>. <em>What<\/em> makes it easier to point our investigations up-and-out, on the bigger picture contributing factors to an outcome. <em>What<\/em> encourages explaining in terms of foresight, and helps us take into account local rationality:<\/p>\n\n<blockquote>\n  <p>\u201cpeople make what they consider to be the best decision given the information available to them at the time\u201d<\/p>\n<\/blockquote>\n\n<p>Dekker describes explaining an incident in terms of foresight as understanding what people inside the tunnel saw, as they journeyed through it during an incident. 
<em>What<\/em> helps us uncover what the inside of the tunnel looked like.<\/p>\n\n<h2 id=\"blame\">Blame<\/h2>\n\n<p>Blame assigns responsibility for an outcome to a person. Often we use blame to say that people were neglectful, inattentive, or derelict of duty. It plays into this idea of bad apples, amoral actors in our midst who are working against the sanctity of a pristine system that the dirty humans keep fucking up.<\/p>\n\n<p>But assigning responsibility for an outcome to a person ignores a truth \u2013 sometimes bad things happen, and nobody is to blame. Furthermore, things go right more often than they go wrong.<\/p>\n\n<p>There are two cognitive biases at play when assigning blame to people: confirmation bias, and hindsight bias.<\/p>\n\n<p>But what is a cognitive bias? Simply, a cognitive bias is a mental shortcut your brain unconsciously takes when processing information. Your brain optimises for timeliness over accuracy when processing information, and applies heuristics to make decisions and form judgements. If those heuristics produce an incorrect result, we say that\u2019s an example of a cognitive bias.<\/p>\n\n<h3 id=\"confirmation-bias\">Confirmation bias<\/h3>\n\n<p>With the confirmation bias, we seek information that reinforces existing positions, and ignore alternative explanations. Worse still, we interpret ambiguous information in favour of our existing assumptions.<\/p>\n\n<p>Simply put: if you are looking for a human to blame, you\u2019re going to find one, regardless of contrary information.<\/p>\n\n<p>We can counter the confirmation bias by <a href=\"https:\/\/en.wikipedia.org\/wiki\/Confirmation_bias#In_finance\">appointing people<\/a> to play the devil\u2019s advocate and take contrarian viewpoints during conversations and investigations.<\/p>\n\n<h3 id=\"hindsight-bias\">Hindsight bias<\/h3>\n\n<p>The hindsight bias alters our recollection of memories to fit a narrative of how we perceived and reacted to events. 
It\u2019s a type of memory distortion where we recall events to form a judgement, and talk about and contextualise events with knowledge of the outcome \u2013 often making ourselves look better in the process.<\/p>\n\n<p>The hindsight bias is dangerous because it can taint all your interactions with your team. It is your culture killer, altering how we recall our perception of events and actions in stressful situations, driving a self-defensive wedge between you and your colleagues.<\/p>\n\n<p>It\u2019s important to eliminate hindsight bias when conducting post-mortems and investigations if we want a just outcome. The simplest way to achieve this is to explain events in terms of foresight, and this is made easier by using questions that start with <em>\u201chow\u201d<\/em> and <em>\u201cwhat\u201d<\/em>. Start the review at a point before the incident, and work your way forward. Resist the urge to jump ahead to the outcome and work your way back from that.<\/p>\n\n<p>Doing this is hard and requires a lot of self-restraint and practice. You\u2019ll make a lot of mistakes, and it takes time to get good at it. Even when you\u2019re good at it, you\u2019ll still occasionally find yourself slipping into old habits. It\u2019s the responsibility of the whole team to call each other out when they see each other fall into the hindsight bias trap, using words like <em>why<\/em> and <em>who<\/em>.<\/p>\n\n<p>We can also harness hindsight bias to give us insights into how things might break in the future.<\/p>\n\n<p>Before you take a new service live, gather the team together and ask them to brainstorm on a whiteboard or post-it notes what they think will break when they go live. Then clear away any notes you\u2019ve collectively taken, and ask them to imagine themselves 5 minutes after the feature has gone live. 
Now ask <em>\u201cwhat has just broken?\u201d<\/em>.<\/p>\n\n<p>You\u2019ll find the answers you get can be quite different.<\/p>\n\n<h2 id=\"sharing\">Sharing<\/h2>\n\n<p>Sharing our experiences after an incident happens is vital for the organisation to learn from individual and shared experiences. By sharing our experiences we have the opportunity to embrace different and often contradictory perspectives, stories, and interests.<\/p>\n\n<p>From these we can better understand what our organisation\u2019s capabilities and weaknesses are, both when things go wrong and when things go right. This creates an opportunity to understand the delta between Work-as-Imagined and Work-as-Done in our organisations.<\/p>\n\n<p>We do this by holding retrospectives, investigations, post-mortems, or learning reviews \u2013 but the label we apply to the event is irrelevant.<\/p>\n\n<p>These events must be environments where people in your organisation feel they can speak their truth and experiences free of persecution or backlash. If you\u2019re in a leadership or management position, and people in your team are participating in these sharing experiences, be the shit umbrella you want to see in the world.<\/p>\n\n<p>Other people in your organisation will likely be skeptical of the findings (especially if there is a blameful culture of finding and singling out bad apples), so it\u2019s your responsibility to your people to shield them from the repercussions of being honest. 
Again, we are all locally rational:<\/p>\n\n<blockquote>\n  <p>\u201cpeople make what they consider to be the best decision given the information available to them at the time\u201d<\/p>\n<\/blockquote>\n\n<p>You have a limited window of opportunity to create an expectation that if you share there won\u2019t be blowback - if you fuck it up early on, people will be reluctant to share anything vaguely compromising about their experiences and actions in the future, and thus the organisation as a whole suffers from the missed opportunity to learn.<\/p>\n\n<p>Know the audience of the report you produce after you\u2019ve shared experiences. Sometimes this means you have to construct multiple reports, one for each audience. The story you tell across these reports should be the same, but alter the level of detail for the audience who is reading it. You may also need to omit different findings for different audiences so details don\u2019t get misconstrued.<\/p>\n\n<p>Beware of weasel words that show up in the report:<\/p>\n\n<ul>\n  <li><em>\u201cthe team should have\u2026\u201d<\/em> (counterfactual describing a past that never happened)<\/li>\n  <li><em>\u201cthe root cause of the outage was\u2026\u201d<\/em> (there is never one cause, there are many contributing factors)<\/li>\n  <li><em>\u201chuman error led to\u2026\u201d<\/em> (our world is humans <em>and<\/em> systems, not humans <em>or<\/em> systems)<\/li>\n<\/ul>\n\n<p>Creating opportunities for sharing our experiences of accidents, incidents, and outages is mandatory if we want to learn about what our organisation\u2019s capabilities and weaknesses are when things go wrong.<\/p>\n\n<p>To do this we have to hold retrospectives or learning reviews or post-mortems, start at the beginning, and relentlessly eliminate our own and collective cognitive biases when talking about events, by using <em>what<\/em> and <em>how<\/em>, not <em>why<\/em> and <em>who<\/em>.<\/p>\n\n<p>Things go right more often than they go wrong, and we 
owe it to ourselves and our colleagues to understand what made our course of action the right one at the time, in spite of the outcome.<\/p>\n\n<hr \/>\n\n<p><em>This piece is a writeup of <a href=\"http:\/\/velocityconf.com\/devops-web-performance-eu-2015\/public\/schedule\/detail\/44013\">the talk I gave at Velocity Amsterdam 2015<\/a>.<\/em><\/p>\n","pubDate":"30 Oct 2015","link":"https:\/\/fractio.nl\/2015\/10\/30\/blame-language-sharing\/","guid":"https:\/\/fractio.nl\/2015\/10\/30\/blame-language-sharing\/"},{"title":"Management skills for new leaders","description":"<p>At DevOpsDays Melbourne 2015 I facilitated an <a href=\"https:\/\/en.wikipedia.org\/wiki\/Open_Space_Technology\">Open Space<\/a> session on management and leadership.<\/p>\n\n<p>The purpose of the session was for experienced leaders to share their stories and discuss what skills are required for effective leadership, so new managers and people who are thinking about making the switch could learn from people who have been before them.<\/p>\n\n<p>You can find the <a href=\"https:\/\/soundcloud.com\/auxesisauxesis\/management-skills-for-new-leaders\">original audio on SoundCloud<\/a>.<\/p>\n\n<p>This is a writeup of what was discussed in the session.<\/p>\n\n<!-- excerpt -->\n\n<h3 id=\"reviews\">Reviews<\/h3>\n\n<p>At some stage you\u2019re going to have to do a performance review.<\/p>\n\n<p>Being honest in the review can be hard when you start, but the people in your team really appreciate feedback because it\u2019s personal.<\/p>\n\n<p>Reviews should contain no surprises. If you get to a review and your team member is surprised by something said, it\u2019s too late. When you give feedback, be specific and call out specific good work \u2013 it shows that you as a leader are noticing and appreciating the work that\u2019s being done. 
Generic platitudes of \u201cgood job\u201d aren\u2019t effective.<\/p>\n\n<p>But don\u2019t just wait for a review - force yourself to give feedback whenever you can. Showing that you notice what people in your team are doing is really powerful, and is a big influencer of behaviour. You might feel like you\u2019re giving too much feedback to your team, but every person on your team is only seeing a slice of the feedback.<\/p>\n\n<h3 id=\"11s\">1:1s<\/h3>\n\n<p>Sometimes 1:1s are the last thing you want to do if you\u2019re an introvert.<\/p>\n\n<p>It shows up in your calendar, and you think \u201cI wonder if I just don\u2019t show up today?\u201d. But when your boss doesn\u2019t show up, you both stop talking about the 1:1, and implicitly agree not to do more 1:1s. Force yourself to do them even when you don\u2019t want to \u2013 you\u2019ll often find within a minute that you\u2019re enjoying it, because you\u2019re feeding off the response you\u2019re getting: that it\u2019s important to them.<\/p>\n\n<p>People might do 1:1s because they feel they have to.<\/p>\n\n<p>Why should I do it? Why should I talk about family, weekend, work? We\u2019re engineers! We know what each other do!<\/p>\n\n<p>But people are more engaged if you know about their family, and they know about yours. It\u2019s surprising to know these things work, but also relieving \u2013 it\u2019s not black magic, it\u2019s just how humans work. We attach to each other more when we\u2019re socially connected.<\/p>\n\n<p>You should be meeting people a minimum of every fortnight, so there are no nasty surprises.<\/p>\n\n<p>But separate 1:1s-as-a-status-update from 1:1s-as-a-personal-update. It\u2019s important to do both, but separate the concerns.<\/p>\n\n<h3 id=\"mentoring\">Mentoring<\/h3>\n\n<p>Get team leads together in your organisation, and get them sharing experiences and lessons. 
Make sure new managers talk to one another, and bounce ideas around.<\/p>\n\n<p>It\u2019s also important to pair up new managers with an experienced manager to bounce ideas around.<\/p>\n\n<p>These pairings don\u2019t have to be in the same department, because they\u2019re dealing with people problems, not technical ones.<\/p>\n\n<p>Also think about finding someone to bounce ideas off outside of the company \u2013 talk to ex-bosses, find mentors, individuals. They give you a perspective because they\u2019re not trapped in your day-to-day.<\/p>\n\n<p>Best bit of advice ever given to David Spriggs, CEO of Infoxchange: \u201cTreat all your staff as if they\u2019re volunteers\u201d:<\/p>\n\n<p><img src=\"https:\/\/farm1.staticflickr.com\/301\/19823662656_31aac34e5b_c.jpg\" alt=\"David Spriggs delivering the quote\" class=\"img-responsive\" \/><\/p>\n\n<p>Listen to people, look after them. Don\u2019t try and do too much work as a manager \u2013 you\u2019re there to be a multiplier.<\/p>\n\n<h3 id=\"role-models\">Role models<\/h3>\n\n<p>Everyone was likely managed by a good manager at some point.<\/p>\n\n<p>We\u2019ve all seen and experienced good things when working for good management. We often forget what those good things are when we make the change to management. You\u2019ve got to re-learn it all. You\u2019re not receiving it, you\u2019re giving it. This is super hard.<\/p>\n\n<p>Conversely, a lot of people have never had good technical management, so they may imitate what they\u2019ve seen, perpetuating bad practices because it\u2019s all they\u2019ve ever known. 
For example: <em>\u201cI haven\u2019t seen my manager in 3 months\u201d<\/em>.<\/p>\n\n<h3 id=\"promotion-vs-career-change\">Promotion vs career change<\/h3>\n\n<p>There\u2019s a preconception that you have to do a lot of tasks as a manager.<\/p>\n\n<p>You\u2019re going up to get more responsibility, be paid more, and have greater input into the company direction.<\/p>\n\n<p>Often people will get to a point in their technical career where they are unable to advance any further, and there are few companies that help people go further on the technical career development track. Rackspace have the <a href=\"http:\/\/www.networkworld.com\/article\/2935936\/careers\/rackspace-creates-career-path-for-tech-execs-who-dont-want-to-manage-people.html\">technical<\/a> <a href=\"http:\/\/www.rackspace.com\/blog\/technical-career-track-2-0-allows-rackspace-tech-talent-to-lead\/\">career<\/a> <a href=\"http:\/\/www.rackspace.com\/talent\/tct\/\">track<\/a>, where engineers can be paid more and have more input in the company than some execs.<\/p>\n\n<p>As managers we need to create the opportunity for our people to grow in the direction they want.<\/p>\n\n<p>Traditionally, people in tech have been promoted due to technical ability and intelligence \u2013 less on emotional intelligence, and their ability to communicate with people. The higher up in the org structure you go, the more your ability to socially interact with other people becomes important.<\/p>\n\n<p>You have to critically assess how good you are at that. Make sure you\u2019re getting feedback, as well as giving feedback.<\/p>\n\n<h3 id=\"celebrating-successes\">Celebrating successes<\/h3>\n\n<p>Linda Rising\u2019s \u201cDo Food\u201d pattern from <a href=\"http:\/\/www.fearlesschangepatterns.com\">Fearless Change<\/a> talks about introducing food into celebration and retros, as a way of bringing the team together.<\/p>\n\n<p>Just a coffee goes a long way. 
Removing people from the office environment allows them to open up more about problems and successes.<\/p>\n\n<p>How do you make food or drink celebrations work with remote teams? Everyone buys their own food and drink, gets reimbursed, shares on the video conference.<\/p>\n\n<p>If you\u2019re going to cater celebrations, be aware of dietary requirements. If you\u2019re doing regular 1:1s, you\u2019ll know people\u2019s dietary requirements ahead of time.<\/p>\n\n<h3 id=\"distributed-teams-remote-workers-co-located-offices\">Distributed teams, remote workers, co-located offices<\/h3>\n\n<p>What works: dedicated chat rooms, per-product streams that people can opt into as they please. Non-technical folk should hang out in those rooms too. Also have general themed rooms, like \u201cdevops\u201d. Having management in those rooms is a great way to identify problems in the organisation before they spiral out of control.<\/p>\n\n<p>There\u2019s a certain etiquette when working remotely. People don\u2019t realise that they need to digitally note things that are said face-to-face, so other people who aren\u2019t in the same room are up to speed. To prime co-located teams for working remotely, spread the team across the office so they\u2019re not physically located together.<\/p>\n\n<p><img src=\"https:\/\/farm1.staticflickr.com\/399\/19227220254_d2a16f5746_c.jpg\" alt=\"Wayne Ingram asking questions about managing an inherited distributed team\" class=\"img-responsive\" \/><\/p>\n\n<p>Sometimes you need to pull people together into the same place regularly (e.g. every 3 months). The frequency and size of events will vary from team to team.<\/p>\n\n<p>If you don\u2019t have regular face to face communication, the co-located offices and people can fall into old patterns. Even if you send someone to the other office permanently, they adopt the culture of that office, become \u201cone of them\u201d. 
Exchanges need to be two-way.<\/p>\n\n<p>There are cultural concerns around remote work in different countries. In some cultures there is a certain level of prestige around having your manager come and visit you. If your boss doesn\u2019t come to visit, and other teams have their bosses visit, that\u2019s a negative mark against you.<\/p>\n\n<p>Teams that are partially distributed are harder to manage than fully distributed teams. As a manager you can send people that are together home to work, so they work like the people who are distributed. Consider walking around the office before\/during\/after standups to show the office environment beyond the normal meeting rooms. It makes people feel included.<\/p>\n\n<h3 id=\"when-youve-made-the-wrong-move\">When you\u2019ve made the wrong move<\/h3>\n\n<p>What happens when you\u2019ve made the career change, and you realise you don\u2019t want it any more? How do you address the impression that you aren\u2019t successful when you go back to your prior role?<\/p>\n\n<p>It\u2019s important for your organisation to recognise the challenges people are facing in their new roles and support them.<\/p>\n\n<blockquote>\n  <p>\u201cThe fact is, if you\u2019re miserable in a leadership role, you\u2019re probably not doing a good job. 
Save your team the pain, and change.\u201d \u2013 <a href=\"https:\/\/twitter.com\/HannahBrowne\">Hannah Browne<\/a><\/p>\n<\/blockquote>\n\n<p>Move towards a position you provide the most value in, and are happiest in.<\/p>\n\n<p><a href=\"\/2014\/09\/19\/not-a-promotion-a-career-change\/\">It\u2019s not a promotion, it\u2019s a career change<\/a> helped people realise that the move is not <em>\u201cI have all this responsibility and stuff I need to look after\u201d<\/em>, it\u2019s <em>\u201cI have different things to think about, I have new stuff to learn about people, communication, relationships, how to be the multiplier\u201d<\/em>.<\/p>\n\n<p>Leadership is a fundamentally different set of skills to engineering.<\/p>\n\n<p><img src=\"https:\/\/farm1.staticflickr.com\/339\/19228947803_4c8a7f5883_c.jpg\" alt=\"Discussing how to recover from a bad career change\" class=\"img-responsive\" \/><\/p>\n\n<p>We don\u2019t look at a hairdresser, or a carpenter, and say the hairdresser should be able to knock you up a house, and the carpenter can dye your hair.<\/p>\n\n<p>In our knowledge industry we expect that people can change the work they do at the drop of a hat \u2013 with minimal mentoring, support, training, guidance, advice \u2013 and be successful. We all have a role to play to help people who take the step and decide they don\u2019t love it to not feel like they\u2019re losing face.<\/p>\n\n<p>Go back to something you love and are passionate about. You have invested years of development in those skills.<\/p>\n\n<h3 id=\"leadership-vs-management\">Leadership vs Management<\/h3>\n\n<p>Aren\u2019t leadership and management separate things?<\/p>\n\n<p>The feeling in the room was that to be a good manager you have to be a good leader, but you can be a good leader without being a good manager.<\/p>\n\n<p>Leadership isn\u2019t a position, it\u2019s a function that anyone can adopt. 
David Marquet\u2019s <a href=\"http:\/\/www.amazon.com\/Turn-Ship-Around-Turning-Followers\/dp\/1591846404\">Turn The Ship Around<\/a> is a great case study of encouraging a culture of distributed responsibility, creating leaders at all levels of your organisation.<\/p>\n\n<p>Management has a bad name, leadership is the trendy alternative, but they are distinct things. When leading you have to look after your people, because you\u2019re working for them.<\/p>\n\n<p>We\u2019ve heard the term \u201cmanagement heavy\u201d, but you don\u2019t really hear the term \u201cleadership heavy\u201d - maybe that\u2019s an insight into the differences between the two functions?<\/p>\n\n<p>Leadership cuts across different jobs and industries. Management less so, because it can be focused on technical details. Leaders that are great at working with, encouraging, motivating, and collaborating with people are successful.<\/p>\n\n<p>Story time from <a href=\"https:\/\/au.linkedin.com\/in\/chrismadden\">Chris Madden<\/a>:<\/p>\n\n<blockquote>\n  <p>Startup during the dotcom, hired 25 engineers, had a great team. But there was nobody in the team that was suitable to lead.<\/p>\n\n  <p>Startup leadership went out, found someone who had run restaurants, and had herded cats in the film industry. Although he didn\u2019t have any technical experience, he was parachuted into this great team of engineers.<\/p>\n\n  <p>The engineers respected him because he was successfully doing work that they weren\u2019t good at.<\/p>\n<\/blockquote>\n\n<p>One thing new managers can be bad at is knowing when to be direct. 
You don\u2019t want to tell anyone what to do because \u201cyou should just decide\u201d.<\/p>\n\n<p>But eventually you\u2019ll get feedback from the team that sometimes they just want you to make a decision, set a direction, so they can get on with the job.<\/p>\n\n<p>Regardless of whether you\u2019re in a management or leadership position, make sure you have good feedback loops from the team.<\/p>\n\n<h3 id=\"understanding-your-team\">Understanding your team<\/h3>\n\n<p>Understand how your people are working.<\/p>\n\n<p>If you\u2019re in a leadership position but come from an operations background, spend time understanding how developers work and think.<\/p>\n\n<blockquote>\n  <p>\u201cNo-one gets out of bed in the morning with the express purpose of making your life miserable\u201d<\/p>\n<\/blockquote>\n\n<p>Sometimes people will drive you insane. But everyone has their own stuff going on. People behave how they do for a reason, so spend time understanding why.<\/p>\n\n<p>Recognise that everyone is different. A technique that works for one person won\u2019t be guaranteed to work with the next person in your team. By having social interaction with the people in your team, you can work out what\u2019s required of you to work with that person effectively.<\/p>\n\n<p>Devops and changing roles within a team are useful mechanisms for building empathy. There\u2019s a difference between understanding what someone is going through and actually living it.<\/p>\n\n<p>Dan Pink\u2019s <a href=\"http:\/\/www.danpink.com\/drive\/\">Drive<\/a> discusses a new model for understanding people\u2019s motivations. Once you remove money as a concern, people are driven by autonomy, mastery, purpose. 
When you have 1:1s, use Drive as a framework for working out which category they fit into.<\/p>\n\n<iframe width=\"853\" height=\"480\" src=\"https:\/\/www.youtube.com\/embed\/u6XAPnuFjJc\" frameborder=\"0\" class=\"img-responsive\" allowfullscreen=\"\"><\/iframe>\n\n<p>Involve the team in the decision making process. Don\u2019t be \u201cmy way or the highway\u201d.<\/p>\n\n<p>Talk about ideas, collectively own them, build consensus around decisions. The work we\u2019re all doing is hard, and you have to be pretty clever to do it well. As the team grows and gets better, you can start thinking \u201cI wouldn\u2019t be able to be part of this team\u201d.<\/p>\n\n<p>If you leave the team, there should be someone in place to take over your responsibilities.<\/p>\n\n<h3 id=\"the-value-you-provide-as-a-leader\">The value you provide as a leader<\/h3>\n\n<p>You know you\u2019re doing your job as a leader right when you realise that there\u2019s more value in the communication you facilitate than the tasks you\u2019re performing.<\/p>\n\n<p>You\u2019re not the leader because you\u2019re the best at every job. You\u2019re not delegating tasks because you\u2019ve run out of time to do the tasks.<\/p>\n\n<p>You\u2019re delegating because you genuinely believe the people you\u2019re giving the work to can do it better than you. Your responsibility is to create a context in which people in your team can succeed. 
You do this by talking to them, understanding their motivations, giving them purpose.<\/p>\n\n<p>Sometimes people think that management or leadership isn\u2019t something they can do, because they\u2019re engineers, and it\u2019s not their responsibility, and they\u2019ll need to change the org structure to achieve the outcome.<\/p>\n\n<p>But leadership is the responsibility of everyone in a team, and it\u2019s within all your abilities.<\/p>\n\n<h3 id=\"resources\">Resources<\/h3>\n\n<ul>\n  <li>Meetup: <a href=\"http:\/\/www.meetup.com\/mad-managers\/\">Melbourne Agile Dev Managers<\/a> on Meetup, \u201cMAD managers\u201d<\/li>\n  <li>Meetup: Once you\u2019re in a leadership position, or are aspiring, there\u2019s the <a href=\"http:\/\/www.meetup.com\/cto-school-melbourne\/\">Melbourne CTO<\/a> and <a href=\"http:\/\/www.meetup.com\/cto-school-sydney\/\">Sydney CTO<\/a> schools.<\/li>\n  <li>Podcast: <a href=\"https:\/\/www.manager-tools.com\/\">Manager Tools<\/a>. Two US-based management consultants talking about new topics every week.<\/li>\n  <li>Newsletter: <a href=\"http:\/\/softwareleadweekly.com\">Software Lead Weekly<\/a> is a free weekly newsletter of curated management and leadership articles. <a href=\"https:\/\/twitter.com\/orenellenbogen\">Oren Ellenbogen<\/a> maintains a Trello board of all the articles included in the newsletter over the last few years.<\/li>\n  <li>Book: Oren wrote a book called <a href=\"http:\/\/leadingsnowflakes.com\">Leading Snowflakes<\/a> off the back of the Software Lead Weekly newsletter. 
It\u2019s specific to people making the transition from engineer to engineering manager.<\/li>\n  <li>Book: <a href=\"https:\/\/leanpub.com\/talking-with-tech-leads\">Talking With Tech Leads<\/a> by Patrick Kua on Leanpub, lots of interviews with managers<\/li>\n  <li>Book: <a href=\"http:\/\/www.amazon.com\/Behind-Closed-Doors-Management-Programmers\/dp\/0976694026\">Behind Closed Doors<\/a> by Esther Derby &amp; Johanna Rothman is a good introduction to software development team management.<\/li>\n  <li>Book: <a href=\"http:\/\/www.amazon.com\/Turn-Ship-Around-Turning-Followers\/dp\/1591846404\">Turn The Ship Around<\/a> by David Marquet. A great case study of encouraging a culture of distributed responsibility, creating leaders at all levels of your organisation.<\/li>\n  <li>Tool: <a href=\"https:\/\/idonethis.com\">iDoneThis<\/a>, SaaS, sends an email to each person in your team asking \u201cwhat have you done?\u201d, you itemise everything you\u2019ve done, sends back to your team or your manager. 
At review time you can go over accomplishments in fine detail, and raise issues as they come up.<\/li>\n  <li>Photos in this post are from the <a href=\"https:\/\/www.flickr.com\/photos\/devopsaustralia\/sets\/72157656124332061\">DevOps Australia Flickr set<\/a>, under a <a href=\"https:\/\/creativecommons.org\/licenses\/by-nc\/2.0\/\">CC BY-NC 2.0<\/a> license.<\/li>\n<\/ul>\n\n","pubDate":"30 Jul 2015","link":"https:\/\/fractio.nl\/2015\/07\/30\/management-skills-for-new-leaders\/","guid":"https:\/\/fractio.nl\/2015\/07\/30\/management-skills-for-new-leaders\/"},{"title":"Talk-To-Think, Think-To-Talk, and leadership","description":"<blockquote>\n  <p>This cop threw me to the ground, cos hip hop is violent,\nSaid \u201cYou got freedom of speech, just choose to remain silent\u201d<\/p>\n\n  <p>\u2013 Hilltop Hoods, <a href=\"https:\/\/www.youtube.com\/watch?v=Xi2cdsyjFqQ\">Mic Felon<\/a><\/p>\n<\/blockquote>\n\n<p>Effectively communicating with people in and around your team is the most important skill you need to develop as a leader.<\/p>\n\n<p>How you communicate with people in your team defines how you build relationships and trust. Understanding how you communicate with people is key to being an effective leader and multiplier.<\/p>\n\n<p>There are two main communication styles: talk to think, and think to talk.<\/p>\n\n<!-- excerpt -->\n\n<h2 id=\"talk-to-think\">Talk-to-think<\/h2>\n\n<p>Talk-to-think values speed over accuracy.<\/p>\n\n<p>It is rapid fire brainstorming.<\/p>\n\n<p>You say what comes to mind, no filter.<\/p>\n\n<p>Don\u2019t hold back. Be bold. Agitate.<\/p>\n\n<p>It\u2019s messy, chaotic, beautiful.<\/p>\n\n<p>This communication style is excellent for covering lots of ground quickly, especially if you\u2019re trying to quickly sketch a picture of a problem domain and potential solutions amongst a group of people. 
You use approximations of terms and ideas - the details don\u2019t matter as much, as long as you communicate the gist.<\/p>\n\n<h2 id=\"think-to-talk\">Think-to-talk<\/h2>\n\n<p>Think-to-talk values accuracy over speed.<\/p>\n\n<p>It is measured, sometimes slow, but always methodical. You don\u2019t say what comes to mind immediately - you spend time thinking about articulating your ideas and arguments before saying them. You choose your words carefully, and embrace the ebbing silence.<\/p>\n\n<p>Think-to-talk is excellent for covering a smaller, sometimes sensitive topic area with depth and nuance.<\/p>\n\n<h2 id=\"your-communication-style\">Your communication style<\/h2>\n\n<p>You tend to use one style over the other, but the styles are not mutually exclusive.<\/p>\n\n<p>I am firmly in the Think-To-Talk camp. When I started my <a href=\"\/2014\/09\/19\/not-a-promotion-a-career-change\/\">career change<\/a> I spent a lot of time befuddled as to why some conversations flowed effortlessly, and others felt like a bucking horse I could barely hold onto.<\/p>\n\n<p>Realising that my experience was not universal and I needed to level up on my communication styles is one of the things I wish I was told when I became a manager.<\/p>\n\n<p>Everyone is different. Some people\u2019s communication is dominated by one style, others are somewhere in between. Some can fluidly move between styles, others take a while to transition, if at all. Fluidity and style are intersecting spectrums.<\/p>\n\n<p>The good news is that either style can be learned - they just require practice and patience.<\/p>\n\n<p>And you need to be adept at both if you\u2019re going to be an effective leader.<\/p>\n\n<h2 id=\"leadership-and-the-two-styles\">Leadership and the two styles<\/h2>\n\n<p>You are likely proficient with one of the styles. Now that you\u2019re in the midst of your career change, you need to start cultivating the other style.<\/p>\n\n<p>Why? 
Because your job will have you in situations where you\u2019ll need to pick and choose the style based on the problem you\u2019re dealing with. Also, it\u2019s not about you - the people you lead are going to employ a mix of styles that you\u2019ll need to match and adapt to.<\/p>\n\n<p>You\u2019ll often find that conversations stick to one communication style. Through experience you\u2019ll get better at predicting ahead of time what style is called for, based on the topics, and who you\u2019re talking to in your team.<\/p>\n\n<p>Always be cautious of what you think you know - the situation may change and you\u2019ll need to change your style. As Mark Twain said:<\/p>\n\n<blockquote>\n  <p>It ain\u2019t what you don\u2019t know that gets you into trouble. It\u2019s what you know for sure that just ain\u2019t so.<\/p>\n<\/blockquote>\n\n<p>Sometimes you\u2019ll need to switch styles mid-conversation. This can happen mid-1:1 when you\u2019re switching from The Idea to The Person. Making that switch can be hard, and you\u2019ll probably mess it up. That\u2019s cool, we\u2019ve all been there. You\u2019ll get better with practice, just keep at it.<\/p>\n\n<p>One of the most important things you can do to cultivate your skill is to spend time every day reflecting on the conversations you\u2019re having with your people. Do it at the end of every conversation, or at the end of the day. Just make sure you do it.<\/p>\n\n<p>The two questions you need to ask:<\/p>\n\n<ul>\n  <li>What style was in use by people in the room?<\/li>\n  <li>Was the style appropriate, given the topic?<\/li>\n<\/ul>\n\n<h2 id=\"what-style-should-i-use\">What style should I use?<\/h2>\n\n<p>The context of the conversation determines what style you use. It\u2019s your job to identify that context. Practice, practice, practice.<\/p>\n\n<p>When picking the style, ask yourself: What are the implications of the conversation for the people involved? 
Are you talking about <em>ideas<\/em>, or <em>people<\/em>?<\/p>\n\n<p><strong>Talk-to-think is brilliant for discussing ideas<\/strong>. You\u2019ll use it heavily for technical problem solving, when sketching out a problem and devising potential solutions as a team. Talk-to-think can also be used for organisational problem solving, when discussing org structure problems, organisational debt or inefficiencies. The caveat is that you need to be really fucking clear with the team that the conversation is a hypothetical brainstorm, and nothing is changing. It\u2019s risky, and I would avoid having those sorts of organisational problem solving discussions unless you know your team exceptionally well, and are confident in your ability to reroute the conversation when things get dicey. Tread with care.<\/p>\n\n<p><strong>Think-to-talk is brilliant for discussing people<\/strong>. This is the style you\u2019ll want to use in your 1:1s when talking about reporting lines, career development, rates and salaries. Slow, methodical, precise conversations are important for setting expectations and not creating confusion and uncertainty about people\u2019s positions within the company.<\/p>\n\n<p>You need to be aware that the people you\u2019re talking to may simply not be comfortable communicating in your preferred style. Sometimes the other person isn\u2019t that good yet at using your preferred style. This will feel like a drag to you, because you want to use the most efficient style for the situation.<\/p>\n\n<p>But it\u2019s not about you.<\/p>\n\n<p>It\u2019s your responsibility to identify what\u2019s going on and compromise. Look for cues that your style isn\u2019t working. When using Talk-To-Think, the other person will be talking less and less, withdrawing from the conversation. 
When Think-To-Talking, the other person can be frustrated that their conversational energy isn\u2019t being matched.<\/p>\n\n<h2 id=\"how-you-are-perceived\">How you are perceived<\/h2>\n\n<p>Sometimes you\u2019ll misjudge the conversation and pick the wrong style.<\/p>\n\n<p>If you\u2019re using Think-To-Talk with a Talk-To-Thinker you\u2019ll appear haughty, aloof, coldly calculating, surgical, and uncaring.<\/p>\n\n<p>Talk-to-Thinking with a Think-To-Talker will paint you as scatterbrained, flippant, irrationally vigorous, overbearing, interrupting, and uncaring.<\/p>\n\n<p>You\u2019ll note that uncaring is shared. An empathy gap is at the root of the miscalibration.<\/p>\n\n<p>Nobody wants to be perceived as any of these things. Watch for cues, be aware of how people are reacting to what you and others are saying.<\/p>\n\n<p>Are you slowing the conversation down by not engaging more vigorously? Are you getting too caught up in detail? Switch to Talk-To-Think.<\/p>\n\n<p>Are you confusing the other person by using lots of potentially conflicting ideas? Are they growing more concerned with every word that comes out of your mouth? Switch to Think-To-Talk.<\/p>\n\n<h2 id=\"be-the-talk-to-think-umbrella\">Be the Talk-To-Think umbrella<\/h2>\n\n<p>People spend a <em>lot<\/em> of time looking up at what the people above them in the org structure are doing, what they\u2019re saying, who they\u2019re saying it to, and how often they say it. 
Couple that with a Talk-To-Think communication style up the chain, and it constantly creates and cultivates concerning confusion and uncertainty.<\/p>\n\n<p>It is damaging as fuck to people in your team because they don\u2019t know how seriously to take ideas from people further up the chain, forcing them into a terrible feedback loop of watching for more cues that have them despairing further.<\/p>\n\n<p>If you see Talk-To-Think communication coming from above, especially around strategic direction, it\u2019s your duty to shield your team from that and turn that noise into signal. Distill those opinions into facts, create certainty for your team.<\/p>\n\n<h2 id=\"be-prepared\">Be prepared<\/h2>\n\n<p>Before you walk into a conversation, you owe it to your people to mentally prepare for the style you need to use. This doesn\u2019t have to be an ornate, time consuming ritual - <a href=\"https:\/\/en.wikipedia.org\/wiki\/Priming_(psychology)\">some simple priming<\/a> is enough.<\/p>\n\n<p>If you are a Think-To-Talker going into a Talk-To-Think melee, try listening to uptempo trigger music, or going for a walk or quick jog around the block.<\/p>\n\n<p>Talk-To-Thinker going to the Think-To-Talk doldrums? Listen to downtempo trigger music, or limit sensory inputs by nesting yourself in a quiet dark room.<\/p>\n\n<p>Have a routine. Prime yourself, have triggers, experiment and change. 
Where possible integrate some sort of physical activity into the trigger, and avoid screens.<\/p>\n\n<p>Understanding your communication strengths and weaknesses is one of the hardest but most rewarding things you can do in your management career change.<\/p>\n\n<p>Diligent and disciplined mastery of this alone puts you head and shoulders above the rest, and the people you lead will respect you for treating them how they want to be treated.<\/p>\n","pubDate":"10 Jul 2015","link":"https:\/\/fractio.nl\/2015\/07\/10\/think-talk-leadership\/","guid":"https:\/\/fractio.nl\/2015\/07\/10\/think-talk-leadership\/"},{"title":"CD for infrastructure services","description":"<p>For the last 6 months I\u2019ve been consulting on a project to build a monitoring metrics storage service to store several hundred thousand metrics that are updated every ten seconds. We decided to build the service in a way that could be continuously deployed and use as many existing Open Source tools as possible.<\/p>\n\n<p>There is a <a href=\"https:\/\/puppetlabs.com\/sites\/default\/files\/2014-state-of-devops-report.pdf\">growing body<\/a> of evidence to show that continuous deployment of applications lowers defect rates and improves software quality. However, the significant corpus of literature and talks on continuous delivery and deployment is primarily focused on applications - there is scant information available on applying these CD principles to the work that infrastructure engineers do every day.<\/p>\n\n<p>Through the process of building a monitoring service with a continuous deployment mindset, we\u2019ve learnt quite a bit about how to structure infrastructure services so they can be delivered and deployed continuously. 
In this article we\u2019ll look at some of the principles you can apply to your infrastructure to start delivering it continuously.<\/p>\n\n<!-- excerpt -->\n\n<h2 id=\"how-to-cd-your-infrastructure-successfully\">How to CD your infrastructure successfully<\/h2>\n\n<p>There are two key principles for doing CD with infrastructure services successfully:<\/p>\n\n<ol>\n  <li><strong>Optimise for fast feedback.<\/strong> This is essential for quickly validating your changes match the business requirements, and eliminating technical debt and sunk cost before it spirals out of control.<\/li>\n  <li><strong>Chunk your changes.<\/strong> A CD mindset forces you to think about creating the shortest <em>and smoothest<\/em> path to production for changes to go live. Anyone who has worked on public facing systems knows that many big changes made at once rarely result in happy times for anyone involved. Delivering infrastructure services continuously doesn\u2019t absolve you from good operational practice - it\u2019s an opportunity to create a structure that reinforces such practices.<\/li>\n<\/ol>\n\n<h2 id=\"definitions\">Definitions<\/h2>\n\n<ul>\n  <li>Continuous Delivery is different from Continuous Deployment in that in Continuous Delivery there is some sort of human intervention required to promote a change from one stage of the pipeline to the next. In Continuous Deployment no such breakpoint exists - changes are promoted automatically. The speed of Continuous Deployment comes at the cost of potentially pushing a breaking change live. Most discussion of \u201cCD\u201d rarely qualifies the terms.<\/li>\n  <li>An infrastructure service is a configuration of software and data that is consumed by other software - not by end users themselves. Think of them as \u201cthe gears of the internet\u201d. 
Examples of infrastructure services include DNS, databases, Continuous Integration systems, or monitoring.<\/li>\n<\/ul>\n\n<h2 id=\"what-the-pipeline-looks-like\">What the pipeline looks like<\/h2>\n\n<ol>\n  <li><strong>Push.<\/strong> An engineer makes a change to the service configuration and pushes it to a repository. There may be ceremony around how the changes are reviewed, or they could be pushed directly into <code class=\"language-plaintext highlighter-rouge\">master<\/code>.<\/li>\n  <li><strong>Detect and trigger.<\/strong> The CI system detects the change and triggers a build. This can be through polling the repository regularly, or a hosted version control system (like GitHub) may call out via a webhook.<\/li>\n  <li><strong>Build artifacts.<\/strong> The build sets up dependencies and builds any required software artifacts that will be deployed later.<\/li>\n  <li><strong>Build infrastructure.<\/strong> The build talks to an IaaS service to build the necessary network, storage, compute, and load balancing infrastructure. The IaaS service may be run by another team within the business, or an external provider like AWS.<\/li>\n  <li><strong>Orchestrate infrastructure.<\/strong> The build uses some sort of configuration management tool to string the provisioned infrastructure together to provide the service.<\/li>\n<\/ol>\n\n<p>There is a testing step between almost all of these steps. Automated verification of the changes about to be deployed and the state of the running service after the deployment is crucial to doing CD effectively. Without it, CD is just a framework for continuously shooting yourself in the foot faster and not learning to stop. <em>You will fail if you don\u2019t build feedback into every step of your CD pipeline.<\/em><\/p>\n\n<h2 id=\"defining-the-service-for-quality-feedback\">Defining the service for quality feedback<\/h2>\n\n<ul>\n  <li><strong>Decide what guarantees you are providing<\/strong> to your users. 
A good starting point for thinking about what those guarantees should be is the CAP theorem. Decide if the service you\u2019re building is an AP or CP system. Infrastructure services generally tend towards AP, but there are cases where CP is preferred (e.g. databases).<\/li>\n  <li><strong>Define your SLAs.<\/strong> This is where you quantify the guarantees you\u2019ve just made to your users. These SLAs will relate to service throughput, availability, and data consistency (note the overlap with CAP theorem). <em>95e response time for monitoring metric queries in a one hour window is &lt; 1 second<\/em>, and <em>a single storage node failure does not result in graph unavailability<\/em> are examples of SLAs.<\/li>\n  <li><strong>Codify your SLAs as tests and checks.<\/strong> Once you\u2019ve quantified your guarantees as SLAs, this is how you get automated feedback throughout your pipeline. These tests must be executed while you\u2019re making changes. Use your discretion as to whether you run all of the tests after every change, or a subset.<\/li>\n  <li><strong>Define clear interfaces.<\/strong> It\u2019s extremely rare that you have a service that is one monolithic component that does everything. Infrastructure services are made of multiple moving parts that work together to provide the service, e.g. multiple PowerDNS instances fronting a MySQL cluster. Having clear, well defined interfaces is important for verifying expected interactions between parts before and after changes, as well as during the normal operation of the service.<\/li>\n  <li><strong>Know your data.<\/strong> Understanding where the data lives in your service is vital to understanding how failures will cascade throughout your service when one part fails. Relentlessly eliminate state within your service by pushing it to one place and fronting access with horizontally scalable immutable parts. 
Your immutable infrastructure is then just a stateless application.<\/li>\n<\/ul>\n\n<h2 id=\"making-it-fast\">Making it fast<\/h2>\n\n<p><strong>Getting iteration times down<\/strong> is the most important goal for achieving fast feedback. Going from pushing a change to version control to having the change live should take less than 5 minutes (excluding cases where you\u2019ve gotta build compute resources). Track execution time on individual stages in your pipeline with <code class=\"language-plaintext highlighter-rouge\">time(1)<\/code>, logged out to your CI job\u2019s output. Analyse this data to determine the min, max, median and 95e execution time for each stage. Identify what steps are taking the longest and optimise them.<\/p>\n\n<p><strong>Get your CI system close to the action.<\/strong> One nasty aspect of working with infrastructure services is the latency between where you are making changes from, and where the service you\u2019re making changes to is hosted. By moving your CI system into the same point of presence as the service, you minimise latency between the systems.<\/p>\n\n<p>This is especially important when you\u2019re interacting with an IaaS API to inventory compute or storage resources at the beginning of a build. Before you can act on any compute resources to install packages or change configuration files you need to ensure those compute resources exist, either by building up an inventory of them or creating them and adding them to said inventory.<\/p>\n\n<p>Every time your CD runs it has to talk to your IaaS provider to do these three steps:<\/p>\n\n<ol>\n  <li>Does the thing exist?<\/li>\n  <li>Maybe make a change to create the thing<\/li>\n  <li>Get info about the thing<\/li>\n<\/ol>\n\n<p>Each of these steps requires sending and receiving often non-trivial amounts of data that will be affected by network and processing latency.<\/p>\n\n<p>By moving your CI close to the IaaS API, you get a significant boost in run time performance. 
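Those three steps can be sketched as one idempotent "ensure" helper. This is a minimal sketch, not the project's actual tooling: the `iaas_*` functions and `ensure_resource` are hypothetical stand-ins for real provider API calls, operating on a local inventory file so the example stays runnable. Each function body corresponds to one network round-trip in the real thing:

```shell
#!/bin/sh
# Sketch of the three IaaS round-trips each CD run makes. The iaas_* functions
# and ensure_resource are hypothetical stand-ins for real provider API calls;
# they operate on a local inventory file here so the sketch is runnable.
INVENTORY="${INVENTORY:-/tmp/cd-inventory.txt}"

iaas_exists() { grep -qx "$1" "$INVENTORY" 2>/dev/null; }  # 1. does the thing exist?
iaas_create() { echo "$1" >> "$INVENTORY"; }               # 2. maybe create the thing
iaas_info()   { grep -x "$1" "$INVENTORY"; }               # 3. get info about the thing

ensure_resource() {
  iaas_exists "$1" || iaas_create "$1"   # create only when missing
  iaas_info "$1"                         # always fetch the current state
}

ensure_resource "storage-node-01"
```

Run from a distant workstation, every one of those calls pays full WAN latency on every build; run from a CI system in the same point of presence as the IaaS API, they are close to free.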
By doing this on the monitoring metrics storage project we reduced the CD pipeline build time from 20 minutes to 5 minutes.<\/p>\n\n<p><strong>Push all your changes through CI.<\/strong> It\u2019s tempting when starting out your CD efforts to push some changes through the pipeline, but still make ad-hoc changes outside the pipeline, say from your local machine.<\/p>\n\n<p>This results in several problems:<\/p>\n\n<ul>\n  <li>You don\u2019t receive the latency reducing benefits of having your CI system close to the infrastructure.<\/li>\n  <li>You limit visibility to other people in your team as to what changes have actually been made to the service. That quick fix you pushed from your local machine might contribute to a future failure that your colleagues will have no idea about. The team as a whole benefits from having an authoritative log of all changes made.<\/li>\n  <li>You end up with divergent processes - one for ad-hoc changes and another for Real Changes\u2122. Now you\u2019re optimising two processes, and those optimisations will likely clobber one another. Have fun.<\/li>\n  <li>You reduce your confidence that changes made in one environment will apply cleanly to another. If you\u2019re pushing changes through multiple environments before they reach production, ad-hoc one-off changes in any single environment reduce your certainty that a change which passes there won\u2019t fail elsewhere.<\/li>\n<\/ul>\n\n<p>There\u2019s no point in lying: <em>pushing all changes through CI is hard<\/em> but worth it. It requires thinking about changes differently and embracing a different way of working.<\/p>\n\n<p>The biggest initial pushback you\u2019ll probably get is having to context switch between your terminal where you\u2019re making changes and the web browser where you\u2019re tracking the CI system output. 
This context switch sounds trivial but I dare you to try it for a few hours and not feel like you\u2019re working more slowly.<\/p>\n\n<p>Netflix Skunkworks\u2019 <a href=\"https:\/\/github.com\/Netflix-Skunkworks\/jenkins-cli\">jenkins-cli<\/a> is an absolute godsend here - it allows you to start, stop, and tail jobs from your command line. Your workflow for making changes now looks something like this:<\/p>\n\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>git push <span class=\"o\">&amp;&amp;<\/span> jenkins start <span class=\"nv\">$job<\/span> <span class=\"o\">&amp;&amp;<\/span> jenkins <span class=\"nb\">tail<\/span> <span class=\"nv\">$job<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>The <code class=\"language-plaintext highlighter-rouge\">tail<\/code> is the real killer feature here - you get the console output from Jenkins on your command line without the need to switch away to your browser.<\/p>\n\n<h2 id=\"chunking-your-changes\">Chunking your changes<\/h2>\n\n<p><strong>Change one, test one<\/strong> is a really important way of thinking about how to apply changes so they are more verifiable. When starting out with CD the easiest path is to make all your changes and then test them straight away, e.g.<\/p>\n\n<blockquote>\n  <ul>\n    <li>Change app<\/li>\n    <li>Change database<\/li>\n    <li>Change proxy<\/li>\n    <li>Test app<\/li>\n    <li>Test database<\/li>\n    <li>Test proxy<\/li>\n  <\/ul>\n<\/blockquote>\n\n<p>What happens when your changes cause multiple tests to fail? 
You\u2019re faced with having to debug multiple moving parts without solid information on what is contributing to the failure.<\/p>\n\n<p>There\u2019s a very simple solution to this problem - test immediately after each change:<\/p>\n\n<blockquote>\n  <ul>\n    <li>Change app<\/li>\n    <li>Test app<\/li>\n    <li>Change database<\/li>\n    <li>Test database<\/li>\n    <li>Change proxy<\/li>\n    <li>Test proxy<\/li>\n  <\/ul>\n<\/blockquote>\n\n<p>When you make changes to the app that fail the tests, you\u2019ll get fast feedback and automatically abort all the other changes until you debug and fix the problem in the app layer.<\/p>\n\n<p>If you were applying changes by hand you would likely be doing something like this anyway, so encode that good practice into your CD pipeline.<\/p>\n\n<p><strong>Tests must finish quickly<\/strong>. If you\u2019ve worked on a code base with good test coverage you\u2019ll know that slow tests are a huge productivity killer. Exactly the same here - the tests should be a help, not a hindrance. Aim to keep each test executing in under 10 seconds, preferably under 5 seconds.<\/p>\n\n<p>This means you must make compromises in what you test. Test for really obvious things like <em>\u201cIs the service running?\u201d<\/em>, <em>\u201cCan I do a simple query?\u201d<\/em>, <em>\u201cAre there any obviously bad log messages?\u201d<\/em>. You\u2019ll likely see the crossover here with \u201ctraditional\u201d monitoring checks. You know, those ones railed against as being bad practice because they don\u2019t sufficiently exercise the entire stack.<\/p>\n\n<p>In this case, they are a pretty good indication your change has broken something. 
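The change-one-test-one ordering can be encoded directly in the pipeline script. A minimal sketch, assuming hypothetical `apply_*` and `check_*` functions standing in for your real change and smoke-check steps:

```shell
#!/bin/sh
# "Change one, test one": apply each component's change, then immediately run
# its fast smoke check; set -e aborts the remaining changes on first failure.
set -e

# Hypothetical stand-ins for real change and check steps.
apply_app() { echo "app changed"; }
check_app() { echo "app ok"; }       # e.g. "is the service running?"
apply_db()  { echo "db changed"; }
check_db()  { echo "db ok"; }        # e.g. "can I do a simple query?"

for part in app db; do
  "apply_${part}"
  "check_${part}"    # keep each check fast - well under 10 seconds
done
```

A failing check for the app stops the loop before the database change is even attempted, which is exactly the fast-feedback behaviour described above.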
Aim for \u201cgood enough\u201d fast coverage in your CD pipeline which complements your longer running monitoring checks to verify things like end-to-end behaviour.<\/p>\n\n<p><a href=\"http:\/\/serverspec.org\">Serverspec<\/a> is your friend for quickly writing tests for your infrastructure.<\/p>\n\n<p><strong>Make the feedback visual<\/strong>. The raw data is cool, but graphs are better. If you\u2019re doing a simple threshold check and you\u2019re using something like Librato or Datadog, link to a dashboard.<\/p>\n\n<p>If you want to take your visualisation to the next level, use gnuplot\u2019s <a href=\"http:\/\/www.gnuplot.info\/scripts\/99bottles.gp\">dumb<\/a> <a href=\"http:\/\/www.maclife.com\/article\/columns\/terminal_101_graphing_gnuplot\">terminal<\/a> <a href=\"http:\/\/calliopesounds.blogspot.com.au\/2011\/12\/i-have-to-say-again-i-find-gnuplots.html\">output<\/a> to graph metrics on the command line:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>\n\n  1480 ++---------------+----------------+----------------+---------------**\n       +                +                +                + ************** +\n  1460 ++                                            *******              ##\n       |                                      *******                 #### |\n  1440 ++                    *****************                 #######    ++\n       |                  ***                                ##            |\n  1420 *******************                                  #             ++\n       |                                                   #               |\n  1400 ++                                                ##               ++\n       |                                             ####                  |\n       |                                          ###                      |\n  1380 ++                                      ###                        ++\n       |    
                                  ##                            |\n  1360 ++                               #####                             ++\n       |                            ####                                   |\n  1340 ++                    #######                                      ++\n       |                  ###                                              |\n  1320 ++          #######                                                ++\n       ############     +                +                +                +\n  1300 ++---------------+----------------+----------------+---------------++\n       0                5                10               15               20\n\n\nCRITICAL: Deviation (116.55) is greater than maximum allowed (100.00)\n<\/code><\/pre><\/div><\/div>\n\n<h2 id=\"conclusion\">Conclusion<\/h2>\n\n<p>CD of infrastructure services is possible provided you stick to the two guiding principles:<\/p>\n\n<ol>\n  <li>Optimise for fast feedback.<\/li>\n  <li>Chunk your changes.<\/li>\n<\/ol>\n\n<p>Focus on constantly identifying and eliminating bottlenecks in your CD pipeline to get your iteration time down.<\/p>\n","pubDate":"22 May 2015","link":"https:\/\/fractio.nl\/2015\/05\/22\/cd-for-infrastructure-services\/","guid":"https:\/\/fractio.nl\/2015\/05\/22\/cd-for-infrastructure-services\/"},{"title":"Why do you want to lead people?","description":"<p>Understanding your motivations for a career change into management is vitally important to understanding what kind of manager you want to be.<\/p>\n\n<p>When I made the transition into management, I didn\u2019t have a clear idea of what my motivations were. I had vague feelings of wanting to explore the challenges of managing people. I also wanted to test myself and see if I could do as good a job as role models throughout my career.<\/p>\n\n<p>But all of these were vague, unquantifiable feelings that took a while to get a handle on. 
Understanding, questioning, and clarifying my motivations was something I put a lot of thought into in the first year of my career change.<\/p>\n\n<p>People within your teams will spend much more time than you realise looking at and analysing what you are doing, and they will pick up on what your motivations are, and where your priorities lie.<\/p>\n\n<p>They will mimic these behaviours and motivations, both positive and negative. You are a signalling mechanism to the team about what\u2019s important and what\u2019s not.<\/p>\n\n<p>This is a huge challenge for people making the career change! You\u2019re still working all this shit out, and you\u2019ve got the ever-gazing eye of your team examining and dissecting all of your actions.<\/p>\n\n<p>These are some of the motivations I\u2019ve picked up on in myself and others when trying to understand what drew me to the management career change.<\/p>\n\n<!-- excerpt -->\n\n<h2 id=\"money\">Money<\/h2>\n\n<p>It is undeniable that there is a pay bump when moving to management. In most organisations, the pay ceiling is much higher in management than in engineering.<\/p>\n\n<p>Many engineers who rise through the ranks get to a point where the only way they will earn more is if they switch from engineering to management, so that becomes the primary motivation.<\/p>\n\n<p>The pay is higher for a good reason though - it\u2019s actually difficult to do the job well! Management looks easy from the outside, but it\u2019s difficult on the inside. 
Again, our friends <a href=\"http:\/\/en.wikipedia.org\/wiki\/Dunning%E2%80%93Kruger_effect\">Dunning and Kruger<\/a> posit that for a given skill, incompetent people will:<\/p>\n\n<blockquote>\n  <ul>\n    <li>tend to overestimate their own level of skill<\/li>\n    <li>fail to recognize genuine skill in others<\/li>\n    <li>fail to recognize the extremity of their inadequacy<\/li>\n    <li>recognize and acknowledge their own previous lack of skill, if they are exposed to training for that skill<\/li>\n  <\/ul>\n<\/blockquote>\n\n<p>Poor decisions are obvious and easy to criticise. Because we spend a lot of time looking at those in our organisation above us, we\u2019re finely attuned to mistakes and inadequacies, and tend to gloss over the good things they do.<\/p>\n\n<p>Understanding what about those decisions and behaviours makes sense to the people making them is difficult but vital to effectively working with others, regardless of whether you\u2019re in management, engineering, sales, finance, or operations.<\/p>\n\n<p>More often than not, there are good reasons behind bad decisions. We are all <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bounded_rationality\">locally rational<\/a>.<\/p>\n\n<p>The pay bump has strings attached - you\u2019re going to be making plenty of decisions, both good and bad, and wearing the consequences of them.<\/p>\n\n<p>You are being paid to be empathetic - to understand how people are feeling, how implementing change will affect people, how to keep them motivated and working towards the big picture goal. 
None of these tasks are simple!<\/p>\n\n<p>If you\u2019re primarily motivated to move into management by better pay, then you need to seriously consider how that motivation will affect the people that report to you, how mimicry of those motivations and behaviours by people in your team flows on to other teams you work with, and what you need to do to meet the commitments you have to your team.<\/p>\n\n<p>Will you be doing the bare minimum to collect your paycheck? What\u2019s stopping you from becoming an example of the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Peter_Principle\">Peter Principle<\/a>? What skills do you need to develop to meet your people\u2019s needs and expectations?<\/p>\n\n<p>The hard problems in tech are not technology, they\u2019re people. That is why management pays more.<\/p>\n\n<h2 id=\"influence\">Influence<\/h2>\n\n<p>Being in management grants you power and influence in your organisation to build and run things as you see fit.<\/p>\n\n<p>This is often a key motivation for people who want to transition from engineering to management - they have a clarity of vision and they want the power to mandate how things should be built, and implement that vision.<\/p>\n\n<p>The motivation is always rooted in good intentions (\u201cthings could be so much more efficient if everyone just listened and did what I said\u201d), and often results in an industrialist approach to managing people - \u201cmanager smart, worker stupid\u201d.<\/p>\n\n<h3 id=\"the-influence-trap\">The influence trap<\/h3>\n\n<p>Your influence can be wielded as a lever (leadership) or as a vise (management). Levers are useful at moving heavy objects but lack precision. Vises are very precise but a weight too heavy will slip from them.<\/p>\n\n<p>Vises are an alluring way for first-time managers to work. The vise management style is prescriptive, centrally co-ordinated, command and control. 
And if you watch carefully you\u2019ll soon realise it limits the potential of the team.<\/p>\n\n<p>Prescriptive, vise-like management assumes you are the smartest person in the room, and know best how things should be done.<\/p>\n\n<p>It doesn\u2019t multiply the team\u2019s effectiveness. The point of being a manager is to be a lever that multiplies the effectiveness of the team - to synthesise different and conflicting ideas to come to decisions and solutions nobody could have anticipated or come up with by themselves. This is near impossible if you solely wield your influence as a vise.<\/p>\n\n<p>Studies show people\u2019s individual problem-solving performance lifts after <a href=\"http:\/\/psnet.ahrq.gov\/public\/Weaver-JCJQPS-2010-ID-17607.pdf\">being exposed to teamwork situations and training<\/a>.<\/p>\n\n<p>Prescriptive management increases the gap between <em>Work As Imagined<\/em> vs <em>Work As Done<\/em>. While conceptually you may have a great idea about how to solve a problem or operate a system daily, the people implementing your plans always discover gaps between the concept and implementation. Over time these gaps become larger, to the point you have a distorted view of how work is being done compared to how it\u2019s actually being carried out.<\/p>\n\n<p>You optimise the effectiveness of the system by having tight feedback loops and open communication channels where people are rewarded for providing both negative and positive feedback about the design and operation of the system. As a manager, this means you need to be actively engaging with the people in your team - finding out what they think and feel about the work.<\/p>\n\n<p>Finally, prescriptive management is an empirically bad way of retaining creative talent. Constant overruling and minimisation of feedback is a great way to piss people off. 
If you hire creative, intelligent, capable people and keep them locked in a box, they\u2019re going to break out.<\/p>\n\n<h3 id=\"multiplying-trust-and-happiness\">Multiplying, trust, and happiness<\/h3>\n\n<p>Maybe you are the smartest person in the room, but others will bring knowledge and experience to the table you simply don\u2019t have.<\/p>\n\n<p>You get the best out of the team by creating a safe space for people to put forward ideas, argue them without recriminations, and build consensus.<\/p>\n\n<p>The goal for people leading high-performing teams should be to have the output of the team be greater than the sum of the individual efforts of people in the team.<\/p>\n\n<p>Your status as a manager grants you power within your organisation. That power must be wielded responsibly. You won\u2019t know if you\u2019re wielding that power responsibly in the first 12 months of the career change, at best.<\/p>\n\n<p>You must constantly assess whether the decisions you\u2019re making are the best for the people who report to you. It\u2019s a constant tightrope act to balance the needs of your people against the needs of the business.<\/p>\n\n<p>It\u2019s easy to pass the policy buck and say \u201cI\u2019m just following orders\u201d when implementing unpopular changes, but you do have a responsibility to identify and push back on change that negatively affects people before you roll it out, and minimise the unavoidable negative effects of that change.<\/p>\n\n<p>It does not take long for things to come apart when you take your eye off the ball and stop looking out for the team. Trust is hard to build, and easy to lose. People spend a lot of time looking at you and analysing your behaviour. 
They will notice much earlier than you realise when you take your eye off the ball.<\/p>\n\n<p>It takes <a href=\"http:\/\/bobsutton.typepad.com\/my_weblog\/2010\/05\/bad-is-stronger-than-good-the-5-to-1-rule.html\">at least 5 positive interactions<\/a> to start re-establishing trust after you\u2019ve breached it.<\/p>\n\n<p>Being in a management position grants you the power to shape how people within your organisation do their work. This means you have a direct influence over their happiness and wellbeing. Blindly implementing policy and not empathising with the people in your team can cause irreparable damage and create emotional scar tissue that will stay with people for years, if not decades.<\/p>\n\n<p>Your power must be wielded responsibly. Do not fuck this up. When you do (and don\u2019t worry, you will, we all have), own your mistakes, apologise, and rebuild the trust.<\/p>\n\n<h2 id=\"personal-development--career-change\">Personal development \/ Career change<\/h2>\n\n<p>Personal development is a pretty good motivation for a career change to management! You want to challenge yourself to do a better job than those before and around you.<\/p>\n\n<p>A huge personal motivation for me when moving into management was to treat others better than I had been treated throughout my career until that point.<\/p>\n\n<p>Working in environments where the happiness of people was not the primary concern of those in charge is not a fun experience. Shared negative and stressful experiences helped me form close bonds and develop a camaraderie with the people I worked with. 
I couldn\u2019t say the same about the people I worked for.<\/p>\n\n<p>Those relationships are something I value, but I wouldn\u2019t want anyone else to have to go through what we did just to obtain that sort of relationship.<\/p>\n\n<p>The challenge for me was clear: was it possible to develop that camaraderie within the team I lead through purely positive experiences?<\/p>\n\n<p>Looking back at how particular decisions and behaviours I experienced affected me and other people in the teams I worked in in these stressful environments, there were some obvious things that I could improve on.<\/p>\n\n<p>There were other decisions I considered to be poor at the time, but after finding myself in similar positions I made similar choices.<\/p>\n\n<p>I failed fairly terribly at the transition during the first 12 months of my career change. Someone in my team described my management style as \u201cabsent father\u201d. That really put into perspective that my priorities were misplaced, and I needed to focus on the team and not my own individual performance.<\/p>\n\n<p>My first experience working in tech was overwhelmingly positive. The working environment and management I experienced on a daily basis in the first 3 years of working in tech is the experience I aspire to create for people in the teams I lead every day.<\/p>\n\n<p>The times I had a \u201cgood boss\u201d are some of my best memories in my career. I was focused on the work, consistently delivered things I was excited about, and rarely worried about troubles elsewhere in the business (and it turned out there were a lot of them).<\/p>\n\n<p>The enduring attitude from that time is the feeling of working with, not for my manager. We worked as a team to solve problems together, not as individuals off doing our own thing. 
That\u2019s the feeling I want to create in the teams I lead.<\/p>\n\n<hr \/>\n\n<p>Understanding what motivates your career change is not an easy task.<\/p>\n\n<p>At the end of the first year of my career change, my motivations lay somewhere between influence and personal development.<\/p>\n\n<p>These motivations have morphed over time. Today, my focus is the happiness of the people I work with.<\/p>\n\n<p>You need to undertake a constant process of self-reflection, and create a space to develop your understanding of your motivations. It\u2019s important you create the time and space to do this!<\/p>\n\n<p>The simplest trap to fall into in your first year is to be focused on the daily grind, the tactical details, and not think about the bigger picture.<\/p>\n\n<p>This is something that affects experienced and novice managers alike, and it\u2019s important to establish good personal habits early on so you have time to reflect on what motivates you, and what sort of leader you\u2019re going to be.<\/p>\n","pubDate":"03 Oct 2014","link":"https:\/\/fractio.nl\/2014\/10\/03\/why-do-you-want-to-lead-people\/","guid":"https:\/\/fractio.nl\/2014\/10\/03\/why-do-you-want-to-lead-people\/"},{"title":"It's not a promotion - it's a career change","description":"<p>The biggest misconception engineers have when thinking about moving into management is they think it\u2019s a promotion.<\/p>\n\n<p>Management is not a promotion. It is a career change.<\/p>\n\n<p>If you want to do your leadership job effectively, you will be exercising a vastly different set of skills on a daily basis to what you are exercising as an engineer. Skills you likely haven\u2019t developed and are unaware of.<\/p>\n\n<p>Your job is not to be an engineer. Your job is not to be a manager. 
Your job is to <a href=\"https:\/\/www.youtube.com\/watch?v=jGPrU15GuSw\">be a multiplier<\/a>.<\/p>\n\n<p>You exist to remove roadblocks and eliminate interruptions for the people you work with.<\/p>\n\n<p>You exist to listen to people (not just hear them!), to build relationships and trust, to deliver bad news, to resolve conflict in a just way.<\/p>\n\n<p>You exist to think about the bigger picture, ask provoking and sometimes difficult questions, and relate the big picture back to something meaningful, tangible, and actionable to the team.<\/p>\n\n<p>You exist to advocate for the team, to promote the group and individual achievements, to gaze into unconstructive criticism and see underlying motivations, and sometimes even give up control and make sacrifices you are uncomfortable with or disagree with.<\/p>\n\n<p>You exist to make systemic improvements with the help of the people you work with.<\/p>\n\n<p>Does this sound like engineering work?<\/p>\n\n<p>The truth of the matter is this: you are woefully unprepared for a career in management, and you are unaware of how badly unprepared you are.<\/p>\n\n<p>There are two main contributing factors that have put you in this position:<\/p>\n\n<ul>\n  <li>The Dunning-Kruger effect<\/li>\n  <li>Systemic undervaluation of non-technical skills in tech<\/li>\n<\/ul>\n\n<!-- excerpt -->\n\n<h3 id=\"systemic-undervaluation-of-non-technical-skills\">Systemic undervaluation of non-technical skills<\/h3>\n\n<p>Technical skills are emphasised above all in tech. It is <a href=\"http:\/\/modelviewculture.com\/pieces\/the-startup-mythologies-trifecta\">part of our mythology<\/a>.<\/p>\n\n<p>Technical skill is the dominant currency within our industry. It is highly valued and sought after. 
If you haven\u2019t read all the posts on the Hacker News front page today, or you\u2019re not running the latest releases of all your software, or you haven\u2019t recently pulled all-nighter coding sessions to ship that killer feature, you\u2019re falling behind bro.<\/p>\n\n<p>Naturally, for an industry so unhealthily focused on technical skills, they tend to be the deciding factor for hiring people.<\/p>\n\n<p>Non-technical skills that are lacking, like teamwork, conflict resolution, listening, and co-ordination, are often overlooked and excused away in engineering circles. They are seen as being of <a href=\"http:\/\/modelviewculture.com\/pieces\/the-myth-of-the-non-technical-startup-employee\">lesser importance<\/a> than technical skills, and organisations frequently compensate for, minimise the effects of, and downplay the importance of these skills.<\/p>\n\n<p>If you really want to see where our industry places value, just think about the terms \u201chard\u201d and \u201csoft\u201d we use to describe and differentiate between the two groups of skills. What sort of connotations do each of those words have, and what implicit biases do they feed into and trigger?<\/p>\n\n<p>If you\u2019re an engineer thinking about going into management, you are a product of this culture.<\/p>\n\n<p>There are a handful of organisations that create cultural incentives to develop these non-technical skills in their engineers, but these organisations are, by and large, unicorns.<\/p>\n\n<p>And if you want to lead people, you\u2019re in for a rude shock if you haven\u2019t developed those non-technical skills.<\/p>\n\n<p>Because guess what - you can\u2019t lead people in the same way you write code or manage machines. 
If you could, management would have been automated a long time ago.<\/p>\n\n<h3 id=\"the-dunning-kruger-effect\">The Dunning-Kruger effect<\/h3>\n\n<p>The identification of the Dunning-Kruger effect is one of the most interesting developments of modern psychology, and one of the most revelatory insights available to our industry.<\/p>\n\n<p>In 1999 David Dunning and Justin Kruger started publishing the results of experiments on the ability of people to <a href=\"http:\/\/en.wikipedia.org\/wiki\/Dunning%E2%80%93Kruger_effect\">self-assess competence<\/a>:<\/p>\n\n<blockquote>\n  <p>Dunning and Kruger proposed that, for a given skill, incompetent people will:<\/p>\n\n  <ul>\n    <li>tend to overestimate their own level of skill<\/li>\n    <li>fail to recognize genuine skill in others<\/li>\n    <li>fail to recognize the extremity of their inadequacy<\/li>\n    <li>recognize and acknowledge their own previous lack of skill, if they are exposed to training for that skill<\/li>\n  <\/ul>\n<\/blockquote>\n\n<p>If you\u2019ve had a career in tech without any leadership responsibilities, you\u2019ve likely had thoughts like:<\/p>\n\n<ul>\n  <li>\u201cManaging people can\u2019t be that hard.\u201d<\/li>\n  <li>\u201cMy boss has no idea what they are doing.\u201d<\/li>\n  <li>\u201cI could do a better job than them.\u201d<\/li>\n<\/ul>\n\n<p>Congratulations! 
You\u2019ve been partaking in the Dunning-Kruger effect.<\/p>\n\n<p>The bad news: Dunning-Kruger is exacerbated by the systemic devaluation of non-technical skills within tech.<\/p>\n\n<p>The good news: soon after going into leadership, the scope of your lack of skill, and unawareness of your lack of skill, will become plain for you to see.<\/p>\n\n<p>Also, everyone else around you will see it.<\/p>\n\n<h3 id=\"multiplied-impact\">Multiplied impact<\/h3>\n\n<p>This is the heart of the matter: by being elevated into a position of leadership, you are being granted a responsibility over people\u2019s happiness and wellbeing.<\/p>\n\n<p>Mistakes made due to lack of skill and awareness can cause people irreparable damage and create emotional scar tissue that will stay with people for years, if not decades.<\/p>\n\n<p>Conversely, by developing skills and helping your team row in the same direction, you can also create positive experiences that will stay with people for their entire careers.<\/p>\n\n<p>The people in your team will spend a lot of time looking up at you - far more time than you realise. 
Everything you do will be analysed and dissected, sometimes fairly, sometimes not.<\/p>\n\n<p>If you\u2019re not willing to push yourself, develop the skills, and fully embrace the career change, maybe you should stay on the engineering career development track.<\/p>\n\n<p>But it\u2019s not all doom and gloom.<\/p>\n\n<p>By striving to be a multiplier, the effects of the hard work you and the team put in can be far greater than you can achieve individually.<\/p>\n\n<p>You only reap the benefits of this if you shift your measure of job satisfaction from your own performance to the group\u2019s.<\/p>\n\n<h3 id=\"real-work\">\u201cReal work\u201d<\/h3>\n\n<p>Many engineers who change into management feel disheartened because they\u2019re not getting as much \u201creal work\u201d done.<\/p>\n\n<p>If you dig deeper, \u201creal work\u201d is always linked to their own individual performance. Of course you\u2019re not going to perform to the same level as an engineer - you\u2019re working towards the same goals, but you are each working on fundamentally different tasks to get there!<\/p>\n\n<p>Focusing on your own skills and performance can be a tough loop to break out of - individual achievement is bound up in the same mythology as technical skills - it\u2019s something highly prized and disproportionately incentivised in much of our culture.<\/p>\n\n<p>If you\u2019ve decided to undertake this career change, it\u2019s important to treat your lack of skill as a learning opportunity, develop a hunger for learning more and developing your skills, routinely reflect on your experiences and compare yourself to your cohort.<\/p>\n\n<p>None of these things are easy - I struggled with feelings of inadequacy in meeting the obligations of my job for the first 3 years of being in a leadership position. 
Once I worked out that I was tying job satisfaction to engineering performance, it was a long and hard struggle to re-link my definition of success to group performance.<\/p>\n\n<p>If everything you\u2019ve read here hasn\u2019t scared you, and you\u2019ve committed to the change to management, there are three key things you can do to start skilling up:<\/p>\n\n<ol>\n  <li>Do professional training.<\/li>\n  <li>Get mentors.<\/li>\n  <li>Educate yourself.<\/li>\n<\/ol>\n\n<h3 id=\"training\">Training<\/h3>\n\n<p>Tech has a bias against professional training that doesn\u2019t come from universities. Engineering organisations tend to value on-the-job experience over training and certification. A big part of that comes from a lot of technical training outside of universities being a little bit shit.<\/p>\n\n<p>Our experience of bad training in the technical domain doesn\u2019t apply to management - there is plenty of quality short course management training available, whose development other industries have been financing for the last couple of decades.<\/p>\n\n<p>In Australia, <a href=\"http:\/\/www.aim.com.au\/\">AIM<\/a> provide several courses ranging from introductory to advanced management and leadership development.<\/p>\n\n<p>Do your research, ask around, find what people would recommend, then make the case for work to pay for it.<\/p>\n\n<h3 id=\"mentors\">Mentors<\/h3>\n\n<p>Find other people in your organisation you can talk to about the challenges you are facing developing your non-technical skills. 
Your mentor doesn\u2019t necessarily need to be your boss - in fact diversifying your mentors is important for developing skills to entertain multiple perspectives on the same situation.<\/p>\n\n<p>If you\u2019re lucky, your organisation assigns new managers a buddy to act as a mentor, but professional development maturity for management skills varies widely across organisations.<\/p>\n\n<p>If you don\u2019t have anyone in your organisation to act as a mentor or buddy, then seek out old bosses and see if they\u2019d be willing to chat for half an hour every few weeks.<\/p>\n\n<p>I have semi-regular breakfast catchups with a former boss from very early on in my career that are always a breath of fresh air - to the point where my wife actively encourages me to catch up because of how much less stressed I am afterwards.<\/p>\n\n<p>Another option is to find other people in your organisation also going through the same transition from engineer to manager as you. You won\u2019t have all the answers, but developing a safe space to bounce ideas around and talk about problems you\u2019re struggling with is a useful tool.<\/p>\n\n<h3 id=\"self-education\">Self-education<\/h3>\n\n<p>I spend a lot of time reading and sharing articles on management and leadership - far more time than I spend on any technical content.<\/p>\n\n<p>At the very beginning of your journey it\u2019s difficult to identify what is good and what is bad, what is gold and what is fluff. I have read a lot of crappy advice, but four years into the journey my barometer for advice is becoming more accurate.<\/p>\n\n<p>Also, be careful of only reading things that reinforce your existing biases and leadership knowledge. If there\u2019s a particular article I disagree with, I\u2019ll often spend 5 minutes jotting a brief critique. 
I\u2019ll either get better at articulating to others what about that idea is flawed, or my perspective will become more nuanced.<\/p>\n\n<p>It\u2019s also pertinent to note how the article made you feel, and reflect for a moment on what about the article made you feel that way.<\/p>\n\n<p>If you\u2019re scratching your head for where to start, I recommend Bob Sutton\u2019s \u201cThe No Asshole Rule\u201d, then \u201cGood Boss, Bad Boss\u201d. Sutton\u2019s work is rooted in evidence-based management (he\u2019s not talking out of his arse - he\u2019s been to literally thousands of companies and observed how they work), but he writes in an engaging and entertaining way.<\/p>\n\n<hr \/>\n\n<p>Almost four years into my career change, I can say that it\u2019s been worth it. It has not been easy. I have made plenty of mistakes, have prioritised incorrectly, and hurt people accidentally.<\/p>\n\n<p>But so has everyone else. Nobody else has this nailed. Even the best managers are constantly learning, adapting, improving.<\/p>\n\n<p>Think about it this way: you\u2019re going to accumulate leadership skills faster than people who have made the change because you\u2019re starting with nothing. 
The difference is nuance and tact that comes from experience, something you can develop by sticking with your new career.<\/p>\n\n<p>This will only happen when you fully commit to your new career, and you change your definition for success to meet your new responsibilities as a manager.<\/p>\n","pubDate":"19 Sep 2014","link":"https:\/\/fractio.nl\/2014\/09\/19\/not-a-promotion-a-career-change\/","guid":"https:\/\/fractio.nl\/2014\/09\/19\/not-a-promotion-a-career-change\/"},{"title":"Applying cardiac alarm management techniques to your on-call","description":"<blockquote>\n  <p>If alarms are more often false than true, a culture emerges on the unit in that staff may delay response to alarms, especially when staff are engaged in other patient care activities, and more important critical alarms may be missed.<\/p>\n<\/blockquote>\n\n<p>One of the most difficult challenges we face in the operations field right now is \u201calert fatigue\u201d. Alert fatigue is a term the tech industry has borrowed from a similar term used in the medical industry, \u201calarm fatigue\u201d - a phenomenon of people being so desensitised to the alarm noise from monitors that they fail to notice or react in time.<\/p>\n\n<p>In an on-call scenario, I posit two main factors contribute to alert fatigue:<\/p>\n\n<ul>\n  <li>The accuracy of the alert.<\/li>\n  <li>The volume of alerts received by the operator.<\/li>\n<\/ul>\n\n<p>Alert fatigue can manifest itself in many ways:<\/p>\n\n<ul>\n  <li>Operators delaying a response to an alert they\u2019ve seen before because \u201cit\u2019ll clear itself\u201d.<\/li>\n  <li>Impaired reasoning and creeping bias, due to physical or mental fatigue.<\/li>\n  <li>Poor decision making during incidents, due to an overload of alerts.<\/li>\n<\/ul>\n\n<p>Earlier this year a story <a href=\"http:\/\/www.npr.org\/blogs\/health\/2014\/01\/24\/265702152\/silencing-many-hospital-alarms-leads-to-better-health-care\">popped up<\/a> about a Boston hospital that 
silenced alarms to improve the standard of care. It sounded counter-intuitive, but in the context of the alert fatigue problems we\u2019re facing, I wanted to get a better understanding of <a href=\"http:\/\/journals.lww.com\/jcnjournal\/Abstract\/2014\/09000\/Novel_Approach_to_Cardiac_Alarm_Management_on.16.aspx\">what they actually did<\/a>, and how we could potentially apply it to our domain.<\/p>\n\n<!-- excerpt -->\n\n<h3 id=\"the-study\">The Study<\/h3>\n\n<p>When rolling out new cardiac telemetry monitoring equipment in 2008 to all adult inpatient clinical units at Boston Medical Center (BMC), a Telemetry Task Force (TTF) was convened to develop standards for patient monitoring. The TTF was a multidisciplinary team drawing people from senior management, cardiologists, physicians, nursing practitioners and directors, clinical instructors, and a quality and patient safety specialist.<\/p>\n\n<p>BMC\u2019s cardiac telemetry monitoring equipment provides configurable limit alarms (we know this as \u201cthresholding\u201d), with alarms for four levels: message, advisory, warning, crisis. These alarms can be either visual or auditory.<\/p>\n\n<p>As part of the rollout, TTF members observed nursing staff responding to alarms from equipment configured with factory default settings. 
The TTF members observed that alarms were frequently ignored by nursing staff, but for a good reason - the alarms would self-reset and stop firing.<\/p>\n\n<p>To frame this behaviour from an operations perspective, this is like a Nagios check passing a threshold for a <code class=\"language-plaintext highlighter-rouge\">CRITICAL<\/code> alert to fire, the on-call team member receiving the alert, sitting on it for a few minutes, and the alert recovering all by itself.<\/p>\n\n<p>When the nursing staff were questioned about this behaviour, they reported that more often than not the alarms self-reset, and answering every alarm pulled them away from looking after patients.<\/p>\n\n<p>Fast forward 3 years, and in 2011 BMC started an Alarm Management Quality Improvement Project that experimented with multiple approaches to reducing alert fatigue:<\/p>\n\n<ul>\n  <li>Widen the acceptable thresholds for patient vitals so alarms would fire less often.<\/li>\n  <li>Eliminate all levels of alarms except \u201cmessage\u201d and \u201ccrisis\u201d. Crisis alarms would emit an audible alert, while message history would build up on the unit\u2019s screen for the next nurse to review.<\/li>\n  <li>Alarms that had the ability to self-reset (recover on their own) were disabled.<\/li>\n  <li>If false positives were detected, nursing staff were required to tune the alarms as they occurred.<\/li>\n<\/ul>\n\n<p>The approaches were applied over the course of 6 weeks, with buy-in from all levels of staff, most importantly with nursing staff who were responding to the alarms.<\/p>\n\n<p>Results from the study were clear:<\/p>\n\n<ul>\n  <li>The number of total audible alarms decreased by 89%. This should come as no surprise, given the alarms were tuned to not fire as often.<\/li>\n  <li>The number of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hospital_emergency_codes#Code_Blue\">code blues<\/a> decreased by 50%. 
This indicates that the reduction of work from the elimination of constant alarms freed up nurses to provide more proactive care, and that lower-priority alarms for precursor problems for code blues are more likely to be responded to.<\/li>\n  <li>The number of Rapid Response Team activations on the unit stayed constant. It\u2019s reasonable to assert that the operational effectiveness of the unit was maintained even though alarms fired less often.<\/li>\n  <li>Anonymous surveys of nurses on the unit showed an increase in satisfaction with the level of noise on the unit, with night staff reporting they \u201ckept going back to the central station to reassure themselves that the central station was working\u201d. One anonymous comment stated \u201cI feel so much less drained going home at the end of my shift\u201d.<\/li>\n<\/ul>\n\n<p>At the conclusion of the study, the nursing staff requested that the previous alarm defaults not be restored.<\/p>\n\n<h3 id=\"analysis\">Analysis<\/h3>\n\n<p>The approach outlined in the study is pretty simple: change the default alarm thresholds so they don\u2019t fire unless action <em>must<\/em> be taken, and give the operator the power to tune the alarms if the alarm is inaccurate.<\/p>\n\n<p>Alerts should exist in two states: nothing is wrong, and the world is on fire.<\/p>\n\n<p>But the elimination of alarms that have the ability to recover is a really surprising solution. Can we apply that to monitoring in an operations domain?<\/p>\n\n<p>Two obvious methods to make this happen:<\/p>\n\n<ul>\n  <li>Remove checks that have the ability to self-recover.<\/li>\n  <li>Redesign checks so they can\u2019t self-recover.<\/li>\n<\/ul>\n\n<p>For redesigning checks, I\u2019ve yet to encounter a check designed to <em>not<\/em> recover when thresholds are no longer exceeded. That would be a very surprising alerting behaviour to stumble upon in the wild, that most operators, myself included, would likely attribute to a bug in the check. 
Socially, a check redesign like that would break many fundamental assumptions operators have about their tools.<\/p>\n\n<p>From a technical perspective, a non-recovering check would require the check to have some sort of memory about its previous states and acknowledgements, or at least have the alerting mechanism do this. This approach is totally possible in the realm of more <a href=\"http:\/\/flapjack.io\/\">modern tools<\/a>, but is not in any way commonplace.<\/p>\n\n<p>Regardless of the problems above, I believe adopting this approach in an operations domain would be achievable, and I would love to see data and stories from teams who try it.<\/p>\n\n<p>As for removing checks, that\u2019s actually pretty sane! The typical CPU\/memory\/disk utilisation alerts engineers receive can be handy diagnostics during outages, but in almost all modern environments they are terrible indicators of anomalous behaviour, let alone something you want to wake someone up about. If my site can take orders, why should I be woken up about a core being pegged on a server I\u2019ve never heard of?<\/p>\n\n<p>Looking deeper though, the point of removing alarms that self-recover is to <em>eliminate the background noise of alarms that are ignorable<\/em>. This ensures every alarm that fires actually requires action: it is investigated, acted upon, or tuned.<\/p>\n\n<p>This is only possible if the volume of alerts is low enough, or there are enough people to distribute the load of responding to alerts. Ops teams that meet these criteria do exist, but they\u2019re in the minority.<\/p>\n\n<p>Another consideration is that checks for operations teams are cheap, but physical equipment for nurses is not. I can go and provision a couple of thousand new monitoring checks in a few minutes and have them alert me on my phone, and do all that without even leaving my couch. 
There are capacity constraints on telemetry monitoring in hospitals - budgets limit the number of potential alarms that can be deployed and thus fire, and a person physically needs to move and act on an alarm to silence it.<\/p>\n\n<p>Also consider that hospitals are dealing with <a href=\"http:\/\/www.slideshare.net\/randybias\/architectures-for-open-and-scalable-clouds\/20\">pets, not cattle<\/a>. Each patient is a genuine snowflake, and the monitoring equipment has to be tuned for size, weight, and health. We are extremely lucky in that most modern infrastructure is built from standard, similarly sized components. The approach outlined in this study may be more applicable to organisations who are still looking after pets.<\/p>\n\n<p>There are constraints and variations in physical systems like hospitals that simply don\u2019t apply to the technical systems we\u2019re nurturing, but there is a commonality between the fields: thinking about the purpose of the alarm, and how people are expected to react to it firing, is an extremely important consideration when designing the interaction.<\/p>\n\n<p>One interesting anecdote from the study was that extracting alarm data was a barrier to entry, as manufacturers often don\u2019t provide mechanisms to easily extract data from their telemetry units. 
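In operations, the alarm-data analysis the study advocates is cheap to do. A toy sketch, assuming a made-up alert log format with one line per alert, that surfaces the noisiest checks first:

```ruby
# Sketch: count how often each check alerted, so the noisiest
# (and most ignorable?) checks surface first. The log format here
# is invented for illustration: "timestamp host/check STATE".
log = <<~LOG
  2014-08-26T10:00:00 web01/load CRITICAL
  2014-08-26T10:05:00 web01/load CRITICAL
  2014-08-26T10:07:00 db01/disk CRITICAL
  2014-08-26T10:12:00 web01/load CRITICAL
LOG

counts = Hash.new(0) # default each check's count to zero
log.each_line do |line|
  _timestamp, check, _state = line.split
  counts[check] += 1
end

# Print checks ordered from noisiest to quietest.
counts.sort_by { |_check, n| -n }.each do |check, n|
  puts format("%-12s %d alerts", check, n)
end
```

Even this crude tally is enough to start the conversation the study recommends: which alerts fire constantly, and does anyone ever act on them?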
We have a natural advantage in operations in that we tend to own our monitoring systems end-to-end and can extract that data, or have access to APIs to easily gather the data.<\/p>\n\n<p>The key takeaway the authors of the article make clear is this:<\/p>\n\n<blockquote>\n  <p>Review of actual alarm data, as well as observations regarding how nursing staff interact with cardiac monitor alarms, is necessary to craft meaningful quality alarm initiatives for decreasing the burden of audible alarms and clinical alarm fatigue.<\/p>\n<\/blockquote>\n\n<p>Regardless of whether you think any of the methods employed above make sense in the field of operations, it\u2019s difficult to argue against collecting and analysing alerting data.<\/p>\n\n<p>The thing that excites me so much about this study is that there is actual data to back the proposed techniques up! This is something we really lack in the field of operations, and it would be amazing to see more companies publish studies analysing different alert management techniques.<\/p>\n\n<p>Finally, the authors lay out some recommendations that other institutions can use to reduce alarm fatigue without requiring additional resources or technology.<\/p>\n\n<p>To adapt them to the field of operations:<\/p>\n\n<ul>\n  <li>Establish a multidisciplinary alerting work group (dev, ops, management).<\/li>\n  <li>Extract and analyse alerting data from your monitoring system.<\/li>\n  <li>Eliminate alerts that are inactionable, or are likely to recover themselves.<\/li>\n  <li>Standardise default thresholds, but allow local variations to be made by people responding to the alerts.<\/li>\n<\/ul>\n","pubDate":"26 Aug 2014","link":"https:\/\/fractio.nl\/2014\/08\/26\/cardiac-alarms-and-ops\/","guid":"https:\/\/fractio.nl\/2014\/08\/26\/cardiac-alarms-and-ops\/"},{"title":"Rethinking monitoring post-Monitorama PDX","description":"<p>The two key take-home messages from <a href=\"http:\/\/monitorama.com\/\">Monitorama PDX<\/a> are 
these:<\/p>\n\n<ul>\n  <li>We are mistakenly developing monitoring tools for ops people, not the developers who need them most.<\/li>\n  <li>Our over-reliance on strip charts as a method for visualising numerical data is hurting ops as a craft.<\/li>\n<\/ul>\n\n<h3 id=\"death-to-strip-charts\">Death to strip charts<\/h3>\n\n<p>Two years ago when I received my hard copy of William S. Cleveland\u2019s <a href=\"http:\/\/www.amazon.com\/Elements-Graphing-Data-William-Cleveland\/dp\/0963488414\">The Elements of Graphing Data<\/a>, I eagerly opened it and scoured its pages for content on how to better visualise time series data. There were a few interesting methods to improve the visual perception of data in strip charts (banking to 45\u02da, limiting the colour palette), but to my disappointment no more than ~30 pages of the 297-page tome addressed visualising time series data.<\/p>\n\n<!-- excerpt -->\n\n<p>In his talk at Monitorama PDX, <a href=\"https:\/\/twitter.com\/drqz\">Neil Gunther<\/a> <a href=\"http:\/\/www.slideshare.net\/RedRooz\/monrama14-melangenjg\">goes on a whirlwind tour<\/a> of visualising data used by ops daily with visual tools other than time series strip charts. 
By ignoring time, looking at the distribution, and applying various transformations to the axes (linear-log, log-log, log-linear), Neil demonstrates how you can expose patterns in data (like power law distributions) that were simply invisible in the traditional linear time series form.<\/p>\n\n<p>Neil\u2019s talk explains why Cleveland\u2019s <em>Elements<\/em> gives so little time to time series strip charts - they are a limited tool that obscures any data that doesn\u2019t match a very limited set of patterns.<\/p>\n\n<p>Strip charts are the PHP Hammer of monitoring.<\/p>\n\n<p><a href=\"https:\/\/www.flickr.com\/photos\/raindrift\/7095238893\/\">\n  <img src=\"https:\/\/farm8.staticflickr.com\/7226\/7095238893_5000f6e57d_c.jpg\" class=\"img-responsive\" alt=\"the infamous php hammer\" \/>\n<\/a><\/p>\n\n<p>We have been conditioned to accept strip charts as the One True Way to visualise time series data, and it is fucking us over without us even realising it. <strong>Time series strip charts are the single biggest engineering problem holding monitoring as a craft back.<\/strong><\/p>\n\n<p>It\u2019s time to shape our future by building new tools and extending existing ones to visualise data in different ways.<\/p>\n\n<p>This requires improving the statistical and visual literacy of tool developers (who are providing the generalised tools to visualise the data), and of the people who are using the graphs to solve problems.<\/p>\n\n<p>There is another problem here, which <a href=\"https:\/\/twitter.com\/rashidkpc\">Rashid Khan<\/a> touched on during his time on stage: many people are using <a href=\"http:\/\/www.elasticsearch.org\/overview\/logstash\/\">logstash<\/a> &amp; <a href=\"http:\/\/www.elasticsearch.org\/overview\/kibana\/\">Kibana<\/a> directly and avoid numerical metric summaries of log data because that numerical data is just an abstraction of an abstraction.<\/p>\n\n<p>The textual logs provide far more insight into what\u2019s happening than 
numbers:<\/p>\n\n<p><img src=\"http:\/\/i.imgur.com\/SMnllI6.jpg\" class=\"img-responsive\" alt=\"Stacktrace or GTFO\" \/><\/p>\n\n<p>As an ops team, you have one job: provide a platform that app developers can wire up logs, checks, and metrics to (in that order). Expose that to them in a meaningful way for analysis later on.<\/p>\n\n<h3 id=\"the-real-target-audience-for-monitoring-or-how-you-can-make-money-in-the-monitoring-space\">The real target audience for monitoring (or, How You Can Make Money In The Monitoring Space)<\/h3>\n\n<p><a href=\"https:\/\/twitter.com\/adrianco\">Adrian Cockcroft<\/a> made a great point in his keynote: we are building monitoring tools for ops people, not the developers who need them most. This is a piercing insight that fundamentally reframes the problem domain for people building monitoring tools.<\/p>\n\n<p>Building monitoring tools and clean integration points for developers is the most important thing we can do if we want to actually improve the quality of people\u2019s lives on a day to day basis.<\/p>\n\n<p>Help your developers ship a Sensu config &amp; checks as part of their app. You can even leverage <a href=\"http:\/\/www.slideshare.net\/m_richardson\/serverspec-and-sensu-testing-and-monitoring-collide\">existing testing frameworks<\/a> they are already familiar with.<\/p>\n\n<iframe src=\"http:\/\/www.slideshare.net\/slideshow\/embed_code\/33628339\" width=\"427\" height=\"356\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\" style=\"border:1px solid #CCC; border-width:1px 1px 0; margin-bottom:5px; max-width: 100%;\" allowfullscreen=\"\"> <\/iframe>\n\n<p>This puts the power &amp; responsibility of monitoring applications into the hands of people who are closest to the app. Ops still provide value: delivering a scalable monitoring platform, and working with developers to instrument &amp; check their apps. 
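A developer-shipped check can be tiny. This is a hedged sketch only, following the common Nagios/Sensu plugin convention of exit codes (0 OK, 1 warning, 2 critical); the /health endpoint, host, and port are invented for illustration:

```ruby
# Minimal app-owned health check following the Nagios/Sensu exit-code
# convention: 0 = OK, 1 = WARNING, 2 = CRITICAL.
# The /health endpoint, host, and port here are hypothetical.
require "net/http"

# Returns [exit_status, message] so the logic is testable separately
# from the process exit.
def check_app_health(host: "localhost", port: 8080)
  res = Net::HTTP.start(host, port, open_timeout: 2, read_timeout: 2) do |http|
    http.get("/health")
  end
  res.code == "200" ? [0, "OK: app is healthy"] : [2, "CRITICAL: HTTP #{res.code}"]
rescue StandardError => e
  # Connection refused, timeouts, DNS errors: all critical.
  [2, "CRITICAL: #{e.class}: #{e.message}"]
end
```

A real check script would finish with something like `status, message = check_app_health; puts message; exit status`, so Sensu or Nagios can interpret the result from the exit code.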
You are reducing duplication of effort and have time to educate non-ops people on how to get the best insight into what\u2019s happening.<\/p>\n\n<p>There is still room for monitoring tools as we\u2019ve traditionally used them, but that\u2019s mostly limited to providing insight into the platforms &amp; environments that ops are providing to developers to run their applications.<\/p>\n\n<p>The majority of application developers don\u2019t care about the internal functioning of the platform though, and they almost certainly don\u2019t want to be alerted about problems within the platform, other than \u201cthe platform has problems, we\u2019re working on fixing them\u201d.<\/p>\n\n<p>The money in the monitoring industry is in building monitoring tools that eliminate the friction developers face in getting better insight into how their applications are performing and behaving in the real world. New Relic is living proof of this, but the market is far larger than what New Relic is currently catering to, and it\u2019s a far larger market than the ops tools market because developers are much more willing to adopt new tools, experiment, and tinker.<\/p>\n\n<p>If you can provide a method for developers to expose application state in a meaningful way while lowering the barrier of entry, they will jump at it.<\/p>\n\n<p>So are you building monitoring tools for the future?<\/p>\n","pubDate":"10 May 2014","link":"https:\/\/fractio.nl\/2014\/05\/10\/rethinking-monitoring\/","guid":"https:\/\/fractio.nl\/2014\/05\/10\/rethinking-monitoring\/"},{"title":"Flapjack, heartbeating, and one-off events","description":"<p>Flapjack assumes a constant stream of events from upstream event producers, and this is fundamental to Flapjack\u2019s design.<\/p>\n\n<p><img src=\"http:\/\/media.giphy.com\/media\/yeUxljCJjH1rW\/giphy.gif\" alt=\"a beating heart\" \/><\/p>\n\n<p>Flapjack asks a fundamentally different question to other notification and alerting systems: \u201cHow long has a check been 
failing for?\u201d. Flapjack cares about the elapsed time, not the number of observed failures.<\/p>\n\n<p>Alerting systems that depend on counting the number of observed failures to decide whether to send an alert suffer problems when the observation interval is variable.<\/p>\n\n<p>Take this scenario with a LAMP stack running in a large cluster:<\/p>\n\n<!-- excerpt -->\n\n<blockquote>\n  <ol>\n    <li>Nagios detects a single failure in the database layer. It increments the soft state by 1.<\/li>\n    <li>Nagios detects every service that depends on the database layer is now failing due to timeouts. It increments the soft state by 1 for each of these services.<\/li>\n    <li>The timeouts for each of these services cause the next recheck of the original database layer check to be delayed (e.g. after an additional 3 minutes). When it is eventually checked, its soft state is incremented.<\/li>\n    <li>The timeouts for the other services get bigger, causing the database layer check to be delayed further.<\/li>\n    <li>Eventually the original database layer check enters a hard state and alerts.<\/li>\n  <\/ol>\n<\/blockquote>\n\n<p>The above example is a little exaggerated; however, the problems with using observed failure counts as a basis for alerting are obvious.<\/p>\n\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Control_theory\">Control theory<\/a> gives us a lot of practical tools for modelling scenarios like these, and the answer is never pretty - if you rely on the number of times you\u2019ve observed a failure to determine if you need to send an alert, your alerting effectiveness is limited by any latency in your checkers.<\/p>\n\n<p>By looking at how long something has been failing for, Flapjack limits the effects of latency in the observation interval, and provides alerts to humans about problems faster.<\/p>\n\n<p>This leads to an interesting question though - <strong>can I send a one-off event to Flapjack?<\/strong><\/p>\n\n<p>Technically you can 
- Flapjack just won\u2019t notify anyone until:<\/p>\n\n<ul>\n  <li>Two events (or more) have been received by Flapjack.<\/li>\n  <li>30 seconds have elapsed between the first event received by Flapjack and the latest.<\/li>\n<\/ul>\n\n<p>This is due to the aforementioned heartbeating behaviour that is baked into Flapjack\u2019s design.<\/p>\n\n<p>As more people use Flapjack, we are seeing increasing demand for one-off event submission. There are two key cases:<\/p>\n\n<ul>\n  <li>Arbitrary event submission via HTTP<\/li>\n  <li>Routing CloudWatch alarms via Flapjack<\/li>\n<\/ul>\n\n<p>One way to solve this would be to build a bridge that accepts one-off events, and periodically dispatches a cached value for these events to Flapjack.<\/p>\n\n<p>Flapjack will definitely close this gap in the future.<\/p>\n","pubDate":"09 May 2014","link":"https:\/\/fractio.nl\/2014\/05\/09\/flapjack-heartbeating\/","guid":"https:\/\/fractio.nl\/2014\/05\/09\/flapjack-heartbeating\/"},{"title":"Data driven alerting with Flapjack + Puppet + Hiera","description":"<p>On Monday I gave a talk at <a href=\"https:\/\/www.eventbrite.com\/e\/si-11891-puppet-camp-sydney-tickets-9778308183\">Puppet Camp Sydney 2014<\/a> about managing Flapjack data (specifically: contacts, notification rules) with Puppet + Hiera.<\/p>\n\n<p>There was a live demo of some new Puppet types I\u2019ve written to manage the data within Flapjack. 
This is incredibly useful if you want to configure how your on-call staff are notified from within Puppet.<\/p>\n\n<p><a href=\"http:\/\/www.youtube.com\/watch?v=pV-kv9J-w-Q\">Video<\/a>:<\/p>\n\n<iframe width=\"770\" height=\"577\" src=\"\/\/www.youtube.com\/embed\/pV-kv9J-w-Q?rel=0\" frameborder=\"0\" allowfullscreen=\"\"><\/iframe>\n\n<p><a href=\"https:\/\/speakerdeck.com\/auxesis\/data-driven-alerting-with-flapjack-plus-puppet-plus-hiera\">Slides<\/a>:<\/p>\n\n<script async=\"\" class=\"speakerdeck-embed\" data-id=\"a16e2070743d0131f1361a9a8f72571c\" data-ratio=\"1.33333333333333\" src=\"\/\/speakerdeck.com\/assets\/embed.js\"><\/script>\n\n<p>The code is a little rough around the edges, but you can try it out in the <a href=\"https:\/\/github.com\/flpjck\/vagrant-flapjack\/tree\/puppet-type\"><code class=\"language-plaintext highlighter-rouge\">puppet-type<\/code> branch on vagrant-flapjack<\/a>.<\/p>\n","pubDate":"12 Feb 2014","link":"https:\/\/fractio.nl\/2014\/02\/12\/data-driven-alerting-with-flapjack-puppet-hiera\/","guid":"https:\/\/fractio.nl\/2014\/02\/12\/data-driven-alerting-with-flapjack-puppet-hiera\/"},{"title":"The questions that should have been asked after the RMS outage","description":"<blockquote>\n  <h3 id=\"routine-error-caused-nsw-roads-and-maritime-outage\"><a href=\"http:\/\/www.itnews.com.au\/BlogEntry\/369805,routine-error-caused-nsw-roads-and-maritime-outage.aspx\">Routine error caused NSW Roads and Maritime outage<\/a><\/h3>\n\n  <p>The <a href=\"http:\/\/www.rms.nsw.gov.au\/\">NSW Roads and Maritime Services<\/a>\u2019 driver and vehicle registration service suffered a full-day outage on Wednesday due to human error during a routine exercise, an initial review has determined.<\/p>\n\n  <p>Insiders told ITnews that the outage, which affected services for most of Wednesday, was triggered by an error made by a database administrator employed by outsourced IT supplier, Fujitsu.<\/p>\n\n  <p>The technician had made changes to what was 
assumed to be the test environment for RMS\u2019 Driver and Vehicle system (DRIVES), which processes some 25 million transactions a year, only to discover the changes were being made to a production system, iTnews was told.<\/p>\n<\/blockquote>\n\n<p>There is a lot to digest here, so let\u2019s start our analysis with two simple and innocuous words in the opening paragraph: \u201croutine exercise\u201d.<\/p>\n\n<!-- excerpt -->\n\n<p>If the exercise is routine, how frequently is that routine followed? Once a day? Once a week? Once a month? Once a year?<\/p>\n\n<p>The article provides some insight into this:<\/p>\n\n<blockquote>\n  <p>\u201cThe activity on Tuesday night was carried out ahead of a standard quarterly release of the DRIVES system,\u201d a spokesman for Service NSW said.<\/p>\n<\/blockquote>\n\n<p>The statement suggests releases are being done every 3 months. By government standards, RMS\u2019s schedules are likely quite progressive, given their public track record for smooth IT operations and <a href=\"http:\/\/delimiter.com.au\/2013\/02\/15\/the-nsw-rtas-imacs-lasted-a-full-decade\/\">an innovative IT procurement strategy<\/a>.<\/p>\n\n<p>But it\u2019s still a large delta between releases. Given many organisations are moving to daily and even hourly releases to reduce the risk of failures, 3-month release cycles are relatively archaic.<\/p>\n\n<p>Think about everything that can change in three months. There will be big changes that are low impact. There will be small changes that are high impact. There will be everything in between. There will be changes that are determined to be low impact &amp; low risk, but in hindsight will be considered high impact &amp; high risk.<\/p>\n\n<p>Now think about releasing all those changes at once. The longer you wait between releases, the greater the risk something will go wrong.<\/p>\n\n<p>What are the organisational factors that make three-monthly releases acceptable? 
Are people within RMS aware of the pain their current practices are causing? Are those who are aware of the pain in management, or are they on the front line?<\/p>\n\n<p>Are either of these groups pushing for more frequent releases? Do any of those people have the power within RMS to make those changes happen? What are the channels for driving organisational change and process improvement?<\/p>\n\n<p>If those channels don\u2019t exist, or when those channels fail, how does the organisation react? How do people within the organisation react?<\/p>\n\n<p>These are all interesting questions that will go a long way to uncovering the extent of the problem, and give people a starting point to address those problems.<\/p>\n\n<p>But that\u2019s rarely the type of coverage you get in the media. The typical narrative in the media and organisations that aren\u2019t learning from their mistakes is very simple:<\/p>\n\n<ul>\n  <li><strong>Attribute<\/strong> blame to bad apples.<\/li>\n  <li><strong>Find<\/strong> the scapegoat.<\/li>\n  <li><strong>Excise<\/strong> the perpetrator.<\/li>\n<\/ul>\n\n<p>Discussion of these complex issues focuses exclusively on \u201chuman error\u201d.<\/p>\n\n<p>But what would happen if you replaced the human in that situation with another? The answer is almost certainly going to be \u201cexactly the same outcome\u201d.<\/p>\n\n<p>Humans are just actors in a complex system, or complex systems nested in other complex systems. They are locally rational. They are doing their best based on the information they have at hand. Nobody wakes up in the morning with the intention of causing an accident.<\/p>\n\n<p>We have the facts (or at least a media-manipulated interpretation of them). We know the outcome of the \u201cbad actions\u201d. Our knowledge of the outcome (a day-long outage) taints any interpretation of the events from an \u201cobjective\u201d point of view. 
This is <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hindsight_bias\">hindsight bias<\/a> in its rawest form.<\/p>\n\n<p>We use hindsight to pass judgement on people who were closest to the thing that went wrong. <a href=\"http:\/\/sidneydekker.com\/\">Dekker<\/a> says <em>\u201chindsight converts a once vague, unlikely future into an immediate, certain path\u201d<\/em>. After an accident we draw a line in the sand and say:<\/p>\n\n<blockquote>\n  <p><em>\u201cThere! They crossed the line! They should have known better!\u201d<\/em><\/p>\n<\/blockquote>\n\n<p>But in the fog of war those actors in a complex system were making what they considered to be rational decision based on the information they had at hand. Before the accident, the line is a band of grey, and people in the system are drifting within that band. After an accident, that band of gray rapidly consolidates into a thin dark line that the people closest to the accident are conveniently on the other side of.<\/p>\n\n<p>Being mindful of our own hindsight bias, it\u2019s critical we start at the very beginning: What was the operator thinking when they were performing this routine exercise? What information did the operator have at hand that informed their judgements?<\/p>\n\n<p>How many times had the operator performed that \u201croutine\u201d exercise?<\/p>\n\n<p>If they had performed that exercise before, what was different about this instance of the exercise?<\/p>\n\n<p>If they hadn\u2019t performed that exercise before, what training had they received? What support were they provided? Was someone double checking every item on their checklist?<\/p>\n\n<p>What types of behaviour does the organisation incentivise? Does it reward people who take risks, improve processes, and improvise to get the job done? 
Or does it reward people who don\u2019t rock the boat, who shut up and do their work \u2014 no questions asked?<\/p>\n\n<p>If the incentive is not to rock the boat, do people have the power to put up their hands when their workload is becoming unmanageable? How does the organisation react to people who identify and raise problems with workload? Is their workload managed to an achievable level, or are they told to suck it up?<\/p>\n\n<p>Are the powers to flag excessive workload extended to people who work with the organisation, but aren\u2019t necessarily members of the organisation \u2014 like contractors, or outsourced suppliers?<\/p>\n\n<p>And most importantly of all \u2014 after an accident, what message do the words and actions of those in management send to employees? What effect do they have on supplier relationships?<\/p>\n\n<p>The message in RMS\u2019s case is pretty clear:<\/p>\n\n<blockquote>\n  <p><strong>\u201cIf you make a mistake we\u2019ll publicly hang you out to dry.\u201d<\/strong><\/p>\n<\/blockquote>\n\n<p>A culture that prioritises blaming individuals over identifying and improving systemic flaws is not a culture I would choose to be part of.<\/p>\n","pubDate":"20 Jan 2014","link":"https:\/\/fractio.nl\/2014\/01\/20\/questions-that-should-have-been-asked-after-the-rms-outage\/","guid":"https:\/\/fractio.nl\/2014\/01\/20\/questions-that-should-have-been-asked-after-the-rms-outage\/"},{"title":"The How and Why of Flapjack","description":"<p>In October <a href=\"https:\/\/twitter.com\/rodjek\/\">@rodjek<\/a> <a href=\"https:\/\/twitter.com\/rodjek\/status\/395504983841329152\">asked on Twitter<\/a>:<\/p>\n\n<blockquote>\n  <p>\u201cI\u2019ve got a working Nagios (and maybe Pagerduty) setup at the moment. 
Why and how should I go about integrating Flapjack?\u201d<\/p>\n<\/blockquote>\n\n<p>Flapjack will be immediately useful to you if:<\/p>\n\n<ul>\n  <li>You want to <strong>identify failures faster<\/strong> by rolling up your alerts across multiple monitoring systems.<\/li>\n  <li>You monitor infrastructures that have <strong>multiple teams<\/strong> responsible for keeping them up.<\/li>\n  <li>Your monitoring infrastructure is <strong>multitenant<\/strong>, and each customer has a <strong>bespoke alerting strategy<\/strong>.<\/li>\n  <li>You want to dip your toe in the water and try alternative check execution engines like Sensu, Icinga, or cron in parallel to Nagios.<\/li>\n<\/ul>\n\n<!-- excerpt -->\n\n<h3 id=\"the-double-edged-nagios-sword-or-why-monolithic-monitoring-systems-hurt-you-in-the-long-run\">The double-edged Nagios sword (or why monolithic monitoring systems hurt you in the long run)<\/h3>\n\n<p>One short-term advantage of Nagios is how much it can do for you out of the box. Check execution, notification, downtime, acknowledgements, and escalations can all be handled by Nagios if you invest a small amount of time understanding how to configure it.<\/p>\n\n<p>This short-term advantage can turn into a long-term disadvantage: because Nagios does so much out of the box, you heavily invest in a single tool that does everything for you. 
When you hit cases that fit outside the scope of what Nagios can do for you easily, the cost of migrating away from Nagios can be quite high.<\/p>\n\n<p>The biggest killer when migrating away from Nagios is that you either have to:<\/p>\n\n<ul>\n  <li>Find a replacement tool that matches Nagios\u2019s feature set very closely (or at least the subset of features you\u2019re using)<\/li>\n  <li>Find a collection of tools that integrate well with one another<\/li>\n<\/ul>\n\n<p>Given the composable monitoring world we live in, the second option is preferable, but not always possible.<\/p>\n\n<h3 id=\"enter-flapjack\">Enter Flapjack<\/h3>\n\n<p><img src=\"http:\/\/farm6.staticflickr.com\/5538\/11716646205_cff25966aa_o.png\" alt=\"flapjack logo\" \/><\/p>\n\n<p>Flapjack aims to be a flexible notification system that handles:<\/p>\n\n<ul>\n  <li>Alert routing (determining who should receive alerts based on interest, time of day, scheduled maintenance, etc)<\/li>\n  <li>Alert summarisation (with per-user, per-media summary thresholds)<\/li>\n  <li>Your standard operational tasks (setting scheduled maintenance, acknowledgements, etc)<\/li>\n<\/ul>\n\n<p>Flapjack sits downstream of your check execution engine (like Nagios, Sensu, Icinga, or cron), processing events to determine if a problem has been detected, who should know about the problem, and how they should be told.<\/p>\n\n<h3 id=\"a-team-player-composable-monitoring-pipelines\">A team player (composable monitoring pipelines)<\/h3>\n\n<p>Flapjack aims to be composable - you should be able to easily integrate it with your existing monitoring check execution infrastructure.<\/p>\n\n<p>There are three immediate benefits you get from Flapjack\u2019s composability:<\/p>\n\n<ul>\n  <li><strong>You can experiment with different check execution engines<\/strong> without needing to reconfigure notification settings across all of them. 
This helps you be more responsive to customer demands and try out new tools without completely writing off your existing monitoring infrastructure.<\/li>\n  <li><strong>You can scale your Nagios horizontally.<\/strong> Nagios can be really performant if you don\u2019t use notifications, acknowledgements, downtime, or parenting. Nagios executes static groups of checks efficiently, so scale the machines you run Nagios on horizontally and use Flapjack to aggregate events from all your Nagios instances and send alerts.<\/li>\n  <li><strong>You can run multiple check execution engines in production.<\/strong> Nagios is well suited to some monitoring tasks. Sensu is well suited to others. Flapjack makes it easy for you to use both, and keep your notification settings configured in one place.<\/li>\n<\/ul>\n\n<p>While you\u2019re getting familiar with how Flapjack and Nagios play together, you can even do a side-by-side comparison of how Flapjack and Nagios alert by configuring them both to alert at the same time.<\/p>\n\n<h3 id=\"multitenant-monitoring\">Multitenant monitoring<\/h3>\n\n<p>If you work for a service provider, you almost certainly run shared infrastructure to monitor the status of the services you sell your customers.<\/p>\n\n<p>Exposing the observed state to customers from your monitoring system can be a real challenge - most monitoring tools simply aren\u2019t built for this particular requirement.<\/p>\n\n<p><a href=\"http:\/\/bulletproof.net.au\">Bulletproof<\/a> spearheaded the reboot of Flapjack because multitenancy is a core requirement of Bulletproof\u2019s monitoring platform - we run a shared monitoring platform, and we have very strict requirements about segregating customers and their data from one another.<\/p>\n\n<p>To achieve this, we keep the security model in Flapjack extraordinarily simple - if you can authenticate against Flapjack\u2019s HTTP APIs, you can perform any action.<\/p>\n\n<p>Flapjack pushes authorization complexity to the 
consumer, because every organisation is going to have very particular security requirements, and Flapjack wants to make zero assumptions about what those requirements are going to be.<\/p>\n\n<p>If you\u2019re serious about exposing this sort of data and functionality to your customers, you will need to do some grunt work to provide it through whatever customer portals you already run. We provide a <a href=\"https:\/\/github.com\/flpjck\/flapjack-diner\">very extensive Ruby API client<\/a> to help you integrate with Flapjack, and Bulletproof has been using this API client in production for over a year in our customer portal.<\/p>\n\n<p>One shortfall of Flapjack right now is we perhaps take multitenancy a little too seriously - the Flapjack user experience for single tenant users still needs a little work.<\/p>\n\n<p>In particular, there are some inconsistencies and behaviours in the Flapjack APIs that make sense in a multitenant context, but are pretty surprising for single tenant use cases.<\/p>\n\n<p>We\u2019re <a href=\"https:\/\/github.com\/flpjck\/flapjack\/issues\/381\">actively<\/a> <a href=\"https:\/\/github.com\/flpjck\/flapjack\/issues\/339\">improving<\/a> <a href=\"https:\/\/github.com\/flpjck\/flapjack\/issues\/396\">the single tenant user experience<\/a> for the Flapjack <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/Release-plan-for-1.0\">1.0 release<\/a>.<\/p>\n\n<p>One other killer feature of Flapjack that\u2019s worth mentioning: updating any setting via Flapjack\u2019s HTTP API doesn\u2019t require any sort of restart of Flapjack.<\/p>\n\n<p>This is a significant improvement over tools like Nagios that require full restarts for simple notification changes.<\/p>\n\n<h3 id=\"multiple-teams\">Multiple teams<\/h3>\n\n<p>Flapjack is useful for organisations who segregate responsibility for different systems across different teams, much in the same way Flapjack is useful in a multitenant context.<\/p>\n\n<p>For example:<\/p>\n\n<ul>\n  
<li>Your organisation has two on-call rosters - one for customer alerts, and one for internal infrastructure alerts.<\/li>\n  <li>Your organisation is product focused, with dedicated teams owning the availability of those products end-to-end.<\/li>\n<\/ul>\n\n<p>You can feed all your events into Flapjack so operationally you have a single aggregated source of truth of monitoring state, and use the same multitenancy features to create custom alerting rules for individual teams.<\/p>\n\n<p>We\u2019re starting to experiment with this at Bulletproof as development teams start owning the availability of products end-to-end.<\/p>\n\n<h3 id=\"summarisation\">Summarisation<\/h3>\n\n<p>Probably the most powerful Flapjack feature is alert summarisation. Alerts can be summarised on a per-media, per-contact basis.<\/p>\n\n<p>What on earth does that mean?<\/p>\n\n<p>Contacts (people) are associated with checks. When a check alerts, a contact can be notified on multiple media (Email, SMS, Jabber, PagerDuty).<\/p>\n\n<p>Each media has a summarisation threshold that allows a contact to specify when alerts should be \u201crolled up\u201d so the contact doesn\u2019t receive multiple alerts during incidents.<\/p>\n\n<p>If you\u2019ve used <a href=\"http:\/\/pagerduty.com\/\">PagerDuty<\/a> before, you\u2019ve almost certainly experienced similar behaviour when you have multiple alerts assigned to you at a time.<\/p>\n\n<p>Summarisation is particularly useful in multitenant environments where contacts only care about a subset of things being monitored, and don\u2019t want to be overwhelmed with alerts for each individual thing that has broken.<\/p>\n\n<p>To generalise, large numbers of alerts indicate either a total system failure of the thing being monitored, or false-positives in the monitoring system.<\/p>\n\n<p>In either case, nobody wants to receive a deluge of alerts.<\/p>\n\n<p>Mitigating the effects of monitoring false-positives is especially important when you consider 
how failures in the <a href=\"https:\/\/fractio.nl\/2013\/03\/25\/data-failures-compartments-pipelines\/\">monitoring pipeline<\/a> cascade into surrounding stages of the pipeline.<\/p>\n\n<p>Monitoring alert recipients generally don\u2019t care about the extent of a monitoring system failure (how many things are failing simultaneously, as evidenced by an alert for each thing), they care that the monitoring system can\u2019t be trusted right now (at least until the underlying problem is fixed).<\/p>\n\n<h3 id=\"what-flapjack-is-not\">What Flapjack is not<\/h3>\n\n<ul>\n  <li><strong>Check execution engine.<\/strong> Sensu, Nagios, and cron already do a fantastic job of this. You still need to configure a tool to run your monitoring checks - Flapjack just processes events generated elsewhere and does notification magic.<\/li>\n  <li><strong>PagerDuty replacement.<\/strong> Flapjack and PagerDuty <em>complement<\/em> one another. PagerDuty has excellent on-call scheduling and escalation support, which is something that Flapjack doesn\u2019t try to go near. Flapjack can trigger alerts in PagerDuty.<\/li>\n<\/ul>\n\n<p>At Bulletproof we use Flapjack to process events from Nagios, and work out if our on-call or customers should be notified about state changes. 
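(The flow of check results from Nagios into Flapjack can be sketched in a few lines. Flapjack v1 consumes events as JSON documents pushed onto a Redis list; the field names below follow Flapjack's documented event format, but treat this sketch as illustrative rather than authoritative.)

```ruby
require 'json'

# Build a check-result event in the shape Flapjack v1 expects
# (entity/check/type/state/summary/time). The values here are
# made up for illustration.
event = {
  'entity'  => 'web01.example.org',
  'check'   => 'HTTP',
  'type'    => 'service',
  'state'   => 'critical',
  'summary' => 'HTTP CRITICAL: connection refused',
  'time'    => Time.now.to_i,
}

payload = JSON.generate(event)

# In production a gateway (such as Flapjack's Nagios receiver) would push
# this onto the Redis list Flapjack watches, along the lines of:
#   Redis.new.lpush('events', payload)
puts payload
```

This is roughly what Flapjack's Nagios receiver does on your behalf as it reads check results out of Nagios.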
Our customers receive alerts directly from Flapjack, and our on-call receive alerts from PagerDuty, via Flapjack\u2019s PagerDuty gateway.<\/p>\n\n<p>The Flapjack PagerDuty gateway has a neat feature: it polls the PagerDuty API for alerts it knows are unacknowledged, and will update Flapjack\u2019s state if it detects alerts have been acknowledged in PagerDuty.<\/p>\n\n<p>This is super useful for eliminating the double handling of alerts, where an on-call engineer acknowledges an alert in PagerDuty, and then has to go and acknowledge the alert in Nagios.<\/p>\n\n<p>In the Flapjack world, the on-call engineer acknowledges the alert in PagerDuty, Flapjack notices the acknowledgement in PagerDuty, and Flapjack updates its own state.<\/p>\n\n<h3 id=\"how-do-i-get-started\">How do I get started?<\/h3>\n\n<p>Follow the <a href=\"http:\/\/flapjack.io\/quickstart\/\">quickstart guide<\/a> to get Flapjack running locally using Vagrant.<\/p>\n\n<p>The quickstart guide will take you through basic Flapjack configuration, pushing check results from Nagios into Flapjack as events, and configuring contacts and entities.<\/p>\n\n<p>Once you\u2019ve finished the tutorial, check out the <a href=\"https:\/\/github.com\/flpjck\/vagrant-flapjack\/tree\/master\/dist\/modules\/flapjack\">Flapjack Puppet module<\/a> and <a href=\"https:\/\/github.com\/flpjck\/vagrant-flapjack\/blob\/master\/dist\/manifests\/site.pp\">manifest that sets up<\/a> the Vagrant box.<\/p>\n\n<p>Examining the Puppet module will give you a good starting point for rolling out Flapjack into your monitoring environment.<\/p>\n\n<h3 id=\"where-to-next\">Where to next?<\/h3>\n\n<p>We\u2019re gearing up to release Flapjack 1.0.<\/p>\n\n<p>If you take a look at Flapjack in the next little while, please let us know any feedback you have on the <a href=\"https:\/\/groups.google.com\/forum\/#!forum\/flapjack-project\">Google group<\/a>, or ping <a href=\"https:\/\/twitter.com\/auxesis\">@auxesis<\/a> or <a 
href=\"https:\/\/twitter.com\/jessereynolds\">@jessereynolds<\/a> on Twitter.<\/p>\n\n<p><a href=\"https:\/\/twitter.com\/jessereynolds\">Jesse<\/a> and I are also <a href=\"http:\/\/linux.conf.au\/schedule\/30261\/view_talk?day=wednesday\">running a tutorial at linux.conf.au 2014<\/a> in Perth next Wednesday, and we\u2019ll make the slides available online.<\/p>\n\n<p>Happy Flapjacking!<\/p>\n","pubDate":"03 Jan 2014","link":"https:\/\/fractio.nl\/2014\/01\/03\/the-how-and-why-of-flapjack\/","guid":"https:\/\/fractio.nl\/2014\/01\/03\/the-how-and-why-of-flapjack\/"},{"title":"CLI testing with RSpec and Cucumber-less Aruba","description":"<p>At <a href=\"http:\/\/bulletproof.net\/\">Bulletproof<\/a>, we are increasingly finding home brew systems tools are critical to delivering services to customers.<\/p>\n\n<p>These tools are generally wrapping a collection of libraries and other general Open Source tools to solve specific business problems, like automating a service delivery pipeline.<\/p>\n\n<p>Traditionally these systems tools tend to lack good tests (or simply any tests) for a number of reasons:<\/p>\n\n<ul>\n  <li>The tools are quick and dirty<\/li>\n  <li>The tools model business processes that are often in flux<\/li>\n  <li>The tools are written by systems administrators<\/li>\n<\/ul>\n\n<p>Sysadmins don\u2019t necessarily have a strong background in software development. They are likely proficient in Bash, and have hacked a little Python or Ruby. 
If they\u2019ve really gotten into the <a href=\"http:\/\/stochasticresonance.wordpress.com\/2009\/07\/12\/infrastructure-renaissance\/\">infrastructure as code<\/a> thing, they might have delved into the innards of Chef and Puppet and been exposed to those projects\u2019 respective testing frameworks.<\/p>\n\n<p>In a lot of cases, testing is seen as <em>\u201csomething I\u2019ll get to when I become a real developer\u201d<\/em>.<\/p>\n\n<!-- excerpt -->\n\n<p>The success of technical businesses can be tied to the <a href=\"http:\/\/algeri-wong.com\/yishan\/engineering-management-tools-are-top-priority.html\">quality of their tools<\/a>.<\/p>\n\n<p>Ask any software developer how they\u2019ve felt inheriting an untested or undocumented code base, and you\u2019ll likely hear wails of horror. Working with such a code base is a painful exercise in frustration.<\/p>\n\n<p>And this is what many sysadmins are doing on a daily basis when hacking on their janky scripts that have <a href=\"http:\/\/www.codinghorror.com\/blog\/2007\/10\/why-does-software-spoil.html\">evolved to send and read email<\/a>.<\/p>\n\n<p>So let\u2019s build better systems tools:<\/p>\n\n<ul>\n  <li>We want to ensure our systems tools are of a consistent high quality<\/li>\n  <li>We want to ensure new functionality doesn\u2019t break old functionality<\/li>\n  <li>We want to verify we don\u2019t introduce regressions<\/li>\n  <li>We want to streamline peer review of changes<\/li>\n<\/ul>\n\n<p>We can achieve much of this by skilling up sysadmins on how to write tests, adopting a developer mindset to write system tools, and providing them a good framework that helps frame questions that can be answered with tests.<\/p>\n\n<p>We want our engineers to feel confident their changes are going to work, and that they are consistently meeting our quality standards.<\/p>\n\n<h2 id=\"but-what-do-you-test\">But what do you test?<\/h2>\n\n<p>We\u2019ve committed to testing, but what exactly do we 
test?<\/p>\n\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Unit_testing\">Unit<\/a> and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Integration_testing\">integration<\/a> tests are likely not relevant unless the cli tool is large and unwieldy.<\/p>\n\n<p><strong>The user of the tool doesn\u2019t care whether the tool is tested. The user cares whether they can achieve a goal.<\/strong> Therefore, the tests should verify that the user can achieve those goals.<\/p>\n\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Acceptance_testing\">Acceptance tests<\/a> are a good fit because we want to treat the cli tool as a black box and test what the user sees.<\/p>\n\n<p>Furthermore, we don\u2019t care how the tool is actually built.<\/p>\n\n<p>We can write a generic set of high level tests that are decoupled from the language the tool is implemented in, and refactor the tool to a more appropriate language once we\u2019re more familiar with the problem domain.<\/p>\n\n<h2 id=\"how-do-you-test-command-line-applications\">How do you test command line applications?<\/h2>\n\n<p><a href=\"https:\/\/github.com\/cucumber\/aruba\">Aruba<\/a> is a great extension to <a href=\"http:\/\/cukes.info\/\">Cucumber<\/a> that helps you write high level acceptance tests for command line applications, regardless of the language those cli apps are written in.<\/p>\n\n<p>There are actually two parts to Aruba:<\/p>\n\n<ol>\n  <li>Pre-defined Cucumber steps for running + verifying behaviour of command line applications locally<\/li>\n  <li>An API to perform the actual testing, that is called by the Cucumber steps<\/li>\n<\/ol>\n\n<div class=\"language-cucumber highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> create a file\n  <span class=\"err\">Given a file named \"foo\/bar\/example.txt\" with<\/span><span class=\"p\">:<\/span>\n    <span class=\"s\">\"\"\"\n    hello world\n    \"\"\"<\/span>\n  <span 
class=\"nf\">When <\/span>I run `cat foo\/bar\/example.txt`\n  <span class=\"nf\">Then <\/span>the output should contain exactly <span class=\"s\">\"hello world\"<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>The other player in the command line application testing game is <a href=\"http:\/\/serverspec.org\/\">serverspec<\/a>. It can do very similar things to Aruba, and provides some fancy <a href=\"http:\/\/rspec.info\/\">RSpec<\/a> matchers and helper methods to make the tests look neat and elegant:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"n\">describe<\/span> <span class=\"n\">package<\/span><span class=\"p\">(<\/span><span class=\"s1\">'httpd'<\/span><span class=\"p\">)<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">it<\/span> <span class=\"p\">{<\/span> <span class=\"n\">should<\/span> <span class=\"n\">be_installed<\/span> <span class=\"p\">}<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"n\">describe<\/span> <span class=\"n\">service<\/span><span class=\"p\">(<\/span><span class=\"s1\">'httpd'<\/span><span class=\"p\">)<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">it<\/span> <span class=\"p\">{<\/span> <span class=\"n\">should<\/span> <span class=\"n\">be_enabled<\/span>   <span class=\"p\">}<\/span>\n  <span class=\"n\">it<\/span> <span class=\"p\">{<\/span> <span class=\"n\">should<\/span> <span class=\"n\">be_running<\/span>   <span class=\"p\">}<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"n\">describe<\/span> <span class=\"n\">port<\/span><span class=\"p\">(<\/span><span class=\"mi\">80<\/span><span class=\"p\">)<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">it<\/span> <span class=\"p\">{<\/span> <span class=\"n\">should<\/span> <span class=\"n\">be_listening<\/span> <span class=\"p\">}<\/span>\n<span class=\"k\">end<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>The cool thing about serverspec that sets it apart from 
Aruba is it can test things locally <em>and<\/em> remotely via SSH.<\/p>\n\n<p>This is useful when testing automation that creates servers somewhere: run the tool, connect to the server created, verify conditions are met.<\/p>\n\n<p>But what happens when we want to test the behaviour of tools that create things both locally and remotely? For local testing Aruba is awesome. For remote testing, serverspec is a great fit.<\/p>\n\n<p>But Aruba is Cucumber, and serverspec is RSpec. Does this mean we have to write and maintain two separate test suites?<\/p>\n\n<p>Given we\u2019re trying to encourage people who have traditionally never written tests before to write tests, we want to remove extraneous tooling to make testing as simple as possible.<\/p>\n\n<p>A single test suite is a good start.<\/p>\n\n<p>This test suite should be able to run both local + remote tests, letting us use the powerful built-in tests from Aruba, and the great remote tests from serverspec.<\/p>\n\n<p>There are two obvious ways to slice this:<\/p>\n\n<ol>\n  <li>Use serverspec like Aruba - build common steps around serverspec matchers<\/li>\n  <li>Use the Aruba API without the Cucumber steps<\/li>\n<\/ol>\n\n<p>We opted for the second approach - use the Aruba API from within RSpec, sans the Cucumber steps.<\/p>\n\n<p>Opinions on Cucumber within Bulletproof R&amp;D are split between love and loathing. There\u2019s a reasonable argument to be made that Cucumber adds a layer of abstraction to tests that increases maintenance of tests and slows down development. 
On the other hand, Cucumber is great for capturing high level user requirements in a format those users are able to understand.<\/p>\n\n<p>Again, given we are trying to keep things as simple as possible, eliminating Cucumber from the testing setup to focus purely on RSpec seemed like a reasonable approach.<\/p>\n\n<p>The path was pretty clear:<\/p>\n\n<ol>\n  <li>Do a small amount of grunt work to allow the Aruba API to be used in RSpec<\/li>\n  <li>Provide a small amount of coaching to developers on workflow<\/li>\n  <li>Let the engineers run wild<\/li>\n<\/ol>\n\n<h2 id=\"how-do-you-make-aruba-work-without-cucumber\">How do you make Aruba work without Cucumber?<\/h2>\n\n<p>It turns out this was easier than expected.<\/p>\n\n<p>First, add Aruba to your Gemfile:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># Gemfile<\/span>\n<span class=\"n\">source<\/span> <span class=\"s1\">'https:\/\/rubygems.org'<\/span>\n\n<span class=\"n\">group<\/span> <span class=\"ss\">:development<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">gem<\/span> <span class=\"s1\">'rake'<\/span>\n  <span class=\"n\">gem<\/span> <span class=\"s1\">'rspec'<\/span>\n  <span class=\"n\">gem<\/span> <span class=\"s1\">'aruba'<\/span>\n<span class=\"k\">end<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>Run the obligatory <code class=\"language-plaintext highlighter-rouge\">bundle<\/code> to ensure all dependencies are installed locally:<\/p>\n\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>bundle\n<\/code><\/pre><\/div><\/div>\n\n<p>Add a default Rake task to execute tests, to speed up the developer\u2019s workflow, and make tests easy to run from CI:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># Rakefile<\/span>\n\n<span class=\"nb\">require<\/span> <span 
class=\"s1\">'rspec\/core\/rake_task'<\/span>\n\n<span class=\"no\">RSpec<\/span><span class=\"o\">::<\/span><span class=\"no\">Core<\/span><span class=\"o\">::<\/span><span class=\"no\">RakeTask<\/span><span class=\"p\">.<\/span><span class=\"nf\">new<\/span><span class=\"p\">(<\/span><span class=\"ss\">:spec<\/span><span class=\"p\">)<\/span>\n\n<span class=\"n\">task<\/span> <span class=\"ss\">:default<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"p\">[<\/span><span class=\"ss\">:spec<\/span><span class=\"p\">]<\/span>\n\n<\/code><\/pre><\/div><\/div>\n\n<p>Bootstrap the project with RSpec:<\/p>\n\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"nv\">$ <\/span>rspec <span class=\"nt\">--init<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>Require and include the Aruba API bits in the specs:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># spec\/template_spec.rb<\/span>\n\n<span class=\"nb\">require<\/span> <span class=\"s1\">'aruba'<\/span>\n<span class=\"nb\">require<\/span> <span class=\"s1\">'aruba\/api'<\/span>\n\n<span class=\"kp\">include<\/span> <span class=\"no\">Aruba<\/span><span class=\"o\">::<\/span><span class=\"no\">Api<\/span>\n\n<\/code><\/pre><\/div><\/div>\n\n<p>This pulls in <em>just<\/em> the API helper methods in the <code class=\"language-plaintext highlighter-rouge\">Aruba::Api<\/code> namespace. These are what we\u2019ll be using to run commands, test outputs, and inspect files. 
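(Conceptually, Aruba's <code class="language-plaintext highlighter-rouge">run_simple</code> shells out, captures output, and fails when the command exits non-zero. Here is a rough stand-in — my simplification for illustration, not Aruba's actual implementation:)

```ruby
require 'open3'

# Simplified stand-in for Aruba's run_simple: run a command, capture its
# output, and raise if it exits non-zero. Aruba's real helper also records
# the output so helpers like all_output can inspect it later.
def run_simple_sketch(cmd)
  stdout, stderr, status = Open3.capture3(cmd)
  unless status.success?
    raise "`#{cmd}` exited #{status.exitstatus}: #{stderr}"
  end
  stdout
end

puts run_simple_sketch('echo hello world')  # prints "hello world"
```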
The <code class=\"language-plaintext highlighter-rouge\">include Aruba::Api<\/code> makes those methods available in the current namespace.<\/p>\n\n<p>Then we set up <code class=\"language-plaintext highlighter-rouge\">PATH<\/code> so the tests know where executables are:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># spec\/template_spec.rb<\/span>\n<span class=\"nb\">require<\/span> <span class=\"s1\">'pathname'<\/span>\n\n<span class=\"n\">root<\/span> <span class=\"o\">=<\/span> <span class=\"no\">Pathname<\/span><span class=\"p\">.<\/span><span class=\"nf\">new<\/span><span class=\"p\">(<\/span><span class=\"kp\">__FILE__<\/span><span class=\"p\">).<\/span><span class=\"nf\">parent<\/span><span class=\"p\">.<\/span><span class=\"nf\">parent<\/span>\n\n<span class=\"c1\"># Allows us to run commands directly, without worrying about the CWD<\/span>\n<span class=\"no\">ENV<\/span><span class=\"p\">[<\/span><span class=\"s1\">'PATH'<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"s2\">\"<\/span><span class=\"si\">#{<\/span><span class=\"n\">root<\/span><span class=\"p\">.<\/span><span class=\"nf\">join<\/span><span class=\"p\">(<\/span><span class=\"s1\">'bin'<\/span><span class=\"p\">).<\/span><span class=\"nf\">to_s<\/span><span class=\"si\">}#{<\/span><span class=\"no\">File<\/span><span class=\"o\">::<\/span><span class=\"no\">PATH_SEPARATOR<\/span><span class=\"si\">}#{<\/span><span class=\"no\">ENV<\/span><span class=\"p\">[<\/span><span class=\"s1\">'PATH'<\/span><span class=\"p\">]<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span>\n\n<\/code><\/pre><\/div><\/div>\n\n<p>The <code class=\"language-plaintext highlighter-rouge\">PATH<\/code> environment variable is used by Aruba to find commands we want to run. 
We could specify a full path in each test, but by setting <code class=\"language-plaintext highlighter-rouge\">PATH<\/code> above we can just call the tool by its name, completely pathless, like we would be doing on a production system.<\/p>\n\n<h2 id=\"how-do-you-go-about-writing-tests\">How do you go about writing tests?<\/h2>\n\n<p>The workflow for writing stepless Aruba tests that still use the Aruba API is pretty straightforward:<\/p>\n\n<ol>\n  <li>Find the relevant step from <a href=\"https:\/\/github.com\/cucumber\/aruba\/blob\/master\/lib\/aruba\/cucumber.rb\">Aruba\u2019s <code class=\"language-plaintext highlighter-rouge\">cucumber.rb<\/code><\/a><\/li>\n  <li>Look at how the step is implemented (what methods are called, what arguments are passed to the method, how output is captured later on, etc.)<\/li>\n  <li>Take a quick look at how the method is implemented <a href=\"https:\/\/github.com\/cucumber\/aruba\/blob\/master\/lib\/aruba\/api.rb\">in Aruba::Api<\/a><\/li>\n  <li>Write your tests in pure-RSpec<\/li>\n<\/ol>\n\n<p>Here\u2019s an example test:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># spec\/template_spec.rb<\/span>\n\n<span class=\"c1\"># genud is the name of the tool we're testing<\/span>\n<span class=\"n\">describe<\/span> <span class=\"s2\">\"genud\"<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">describe<\/span> <span class=\"s2\">\"YAML templates\"<\/span> <span class=\"k\">do<\/span>\n    <span class=\"n\">it<\/span> <span class=\"s2\">\"should emit valid YAML to STDOUT\"<\/span> <span class=\"k\">do<\/span>\n      <span class=\"n\">fqdn<\/span>      <span class=\"o\">=<\/span> <span class=\"s1\">'bprnd-test01.bulletproof.net'<\/span>\n      <span class=\"n\">template<\/span>  <span class=\"o\">=<\/span> <span class=\"s1\">'templates\/test.yaml.erb'<\/span>\n\n      <span class=\"c1\"># Run the command with Aruba's run_simple helper<\/span>\n      <span class=\"n\">run_simple<\/span> <span class=\"s2\">\"genud --fqdn <\/span><span class=\"si\">#{<\/span><span 
class=\"n\">fqdn<\/span><span class=\"si\">}<\/span><span class=\"s2\"> --template <\/span><span class=\"si\">#{<\/span><span class=\"n\">template<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span>\n\n      <span class=\"c1\"># Test the YAML can be parsed<\/span>\n      <span class=\"nb\">lambda<\/span> <span class=\"p\">{<\/span>\n        <span class=\"n\">userdata<\/span> <span class=\"o\">=<\/span> <span class=\"no\">YAML<\/span><span class=\"p\">.<\/span><span class=\"nf\">parse<\/span><span class=\"p\">(<\/span><span class=\"n\">all_output<\/span><span class=\"p\">)<\/span>\n        <span class=\"n\">userdata<\/span><span class=\"p\">.<\/span><span class=\"nf\">should_not<\/span> <span class=\"n\">be_nil<\/span>\n      <span class=\"p\">}.<\/span><span class=\"nf\">should_not<\/span> <span class=\"n\">raise_error<\/span>\n      <span class=\"n\">assert_exit_status<\/span><span class=\"p\">(<\/span><span class=\"mi\">0<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">end<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span>\n\n<\/code><\/pre><\/div><\/div>\n\n<h2 id=\"multiple-inputs-and-drying-up-the-tests\">Multiple inputs, and DRYing up the tests<\/h2>\n\n<p>Testing multiple inputs and outputs of the tool is important for verifying the behaviour of the tool in the wild.<\/p>\n\n<p>Specifically, we want to know the same inputs create the same outputs if we make a change to the tool, and we want to know that new inputs we add are valid in multiple use cases.<\/p>\n\n<p>We also don\u2019t want to write test cases for each instance of test data - generating the tests automatically would be ideal.<\/p>\n\n<p>Our first approach at doing this was to glob a bunch of test data and test the behaviour of the tool for each instance of test data:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># spec\/template_spec.rb<\/span>\n\n<span 
class=\"n\">describe<\/span> <span class=\"s2\">\"genud\"<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">describe<\/span> <span class=\"s2\">\"YAML templates\"<\/span> <span class=\"k\">do<\/span>\n    <span class=\"n\">it<\/span> <span class=\"s2\">\"should emit valid YAML to STDOUT\"<\/span> <span class=\"k\">do<\/span>\n\n      <span class=\"c1\"># The inputs we want to test<\/span>\n      <span class=\"n\">templates<\/span> <span class=\"o\">=<\/span> <span class=\"no\">Dir<\/span><span class=\"p\">.<\/span><span class=\"nf\">glob<\/span><span class=\"p\">(<\/span><span class=\"n\">root<\/span> <span class=\"o\">+<\/span> <span class=\"s1\">'templates'<\/span> <span class=\"o\">+<\/span> <span class=\"s2\">\"*.yaml.erb\"<\/span><span class=\"p\">)<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">template<\/span><span class=\"o\">|<\/span>\n        <span class=\"n\">fqdn<\/span>     <span class=\"o\">=<\/span> <span class=\"s1\">'hello.example.org'<\/span>\n\n        <span class=\"c1\"># Run the command with Aruba's run_simple helper<\/span>\n        <span class=\"n\">run_simple<\/span> <span class=\"s2\">\"genud --fqdn <\/span><span class=\"si\">#{<\/span><span class=\"n\">fqdn<\/span><span class=\"si\">}<\/span><span class=\"s2\"> --template <\/span><span class=\"si\">#{<\/span><span class=\"n\">template<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span>\n\n        <span class=\"c1\"># Test the YAML can be parsed<\/span>\n        <span class=\"nb\">lambda<\/span> <span class=\"p\">{<\/span>\n          <span class=\"n\">userdata<\/span> <span class=\"o\">=<\/span> <span class=\"no\">YAML<\/span><span class=\"p\">.<\/span><span class=\"nf\">parse<\/span><span class=\"p\">(<\/span><span class=\"n\">all_output<\/span><span class=\"p\">)<\/span>\n          <span class=\"n\">userdata<\/span><span class=\"p\">.<\/span><span class=\"nf\">should_not<\/span> <span class=\"n\">be_nil<\/span>\n        <span 
class=\"p\">}.<\/span><span class=\"nf\">should_not<\/span> <span class=\"n\">raise_error<\/span>\n        <span class=\"n\">assert_exit_status<\/span><span class=\"p\">(<\/span><span class=\"mi\">0<\/span><span class=\"p\">)<\/span>\n      <span class=\"k\">end<\/span>\n    <span class=\"k\">end<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>This worked great provided all the tests were passing, but the tests themselves became a black box when one of the test data inputs caused a failure.<\/p>\n\n<p>The engineer would need to add a bunch of <code class=\"language-plaintext highlighter-rouge\">puts<\/code> statements all over the place to determine which input was causing the failure. And even worse, early test failures mask failures in later test data.<\/p>\n\n<p>To combat this, we DRY\u2019d up the tests by doing the Dir.glob once in the outer scope, rather than in each test:<\/p>\n\n<div class=\"language-ruby highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c1\"># spec\/template_spec.rb<\/span>\n\n<span class=\"n\">describe<\/span> <span class=\"s2\">\"genud\"<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">templates<\/span> <span class=\"o\">=<\/span> <span class=\"no\">Dir<\/span><span class=\"p\">.<\/span><span class=\"nf\">glob<\/span><span class=\"p\">(<\/span><span class=\"n\">root<\/span> <span class=\"o\">+<\/span> <span class=\"s1\">'templates'<\/span> <span class=\"o\">+<\/span> <span class=\"s2\">\"*.yaml.erb\"<\/span><span class=\"p\">)<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">template<\/span><span class=\"o\">|<\/span>\n    <span class=\"n\">describe<\/span> <span class=\"s2\">\"YAML templates\"<\/span> <span class=\"k\">do<\/span>\n      <span class=\"n\">describe<\/span> <span class=\"s2\">\"<\/span><span class=\"si\">#{<\/span><span class=\"no\">File<\/span><span class=\"p\">.<\/span><span
class=\"nf\">basename<\/span><span class=\"p\">(<\/span><span class=\"n\">template<\/span><span class=\"p\">)<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span> <span class=\"k\">do<\/span>\n        <span class=\"n\">it<\/span> <span class=\"s2\">\"should emit valid YAML to STDOUT\"<\/span> <span class=\"k\">do<\/span>\n\n          <span class=\"n\">fqdn<\/span>     <span class=\"o\">=<\/span> <span class=\"s1\">'hello.example.org'<\/span>\n\n          <span class=\"c1\"># Run the command with Aruba's run_simple helper<\/span>\n          <span class=\"n\">run_simple<\/span> <span class=\"s2\">\"genud --fqdn <\/span><span class=\"si\">#{<\/span><span class=\"n\">fqdn<\/span><span class=\"si\">}<\/span><span class=\"s2\"> --template <\/span><span class=\"si\">#{<\/span><span class=\"n\">template<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span>\n\n          <span class=\"c1\"># Test the YAML can be parsed<\/span>\n          <span class=\"nb\">lambda<\/span> <span class=\"p\">{<\/span>\n            <span class=\"n\">userdata<\/span> <span class=\"o\">=<\/span> <span class=\"no\">YAML<\/span><span class=\"p\">.<\/span><span class=\"nf\">parse<\/span><span class=\"p\">(<\/span><span class=\"n\">all_output<\/span><span class=\"p\">)<\/span>\n            <span class=\"n\">userdata<\/span><span class=\"p\">.<\/span><span class=\"nf\">should_not<\/span> <span class=\"n\">be_nil<\/span>\n          <span class=\"p\">}.<\/span><span class=\"nf\">should_not<\/span> <span class=\"n\">raise_error<\/span>\n          <span class=\"n\">assert_exit_status<\/span><span class=\"p\">(<\/span><span class=\"mi\">0<\/span><span class=\"p\">)<\/span>\n        <span class=\"k\">end<\/span>\n      <span class=\"k\">end<\/span>\n    <span class=\"k\">end<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span>\n<\/code><\/pre><\/div><\/div>\n\n<p>This produces a nice clean test output that decouples the tests from one another while providing the 
engineer more insight into what test data triggered a failure:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>$ be rake\n\ngenud\n  YAML templates\n    test.yaml.erb\n      should emit valid YAML to STDOUT\n  YAML templates\n    test2.yaml.erb\n      should emit valid YAML to STDOUT\n<\/code><\/pre><\/div><\/div>\n\n<h2 id=\"where-to-from-here\">Where to from here?<\/h2>\n\n<p>The above test rig is a good first pass at meeting our goals for building systems tools:<\/p>\n\n<ul>\n  <li>We want to ensure our systems tools are of a consistent high quality<\/li>\n  <li>We want to ensure new functionality doesn\u2019t break old functionality<\/li>\n  <li>We want to verify we don\u2019t introduce regressions<\/li>\n  <li>We want to streamline peer review of changes<\/li>\n<\/ul>\n\n<p>\u2026 but we want to take it to the next level: integrating serverspec into the same test suite.<\/p>\n\n<p>Having a quick feedback loop to verify local operation of the tool is essential to engineer productivity, especially when remote operations of these type of system tools can take upwards of 10 minutes to complete.<\/p>\n\n<p>But we have to verify the output of local operation actually creates the desired service at the other end. 
serverspec will help us do this.<\/p>\n","pubDate":"06 Dec 2013","link":"https:\/\/fractio.nl\/2013\/12\/06\/cli-testing-with-rspec-and-cucumber-less-aruba\/","guid":"https:\/\/fractio.nl\/2013\/12\/06\/cli-testing-with-rspec-and-cucumber-less-aruba\/"},{"title":"Just post mortems","description":"<p>Earlier this week I <a href=\"https:\/\/vimeo.com\/75321812\">gave a talk<\/a> at Monitorama EU on <a href=\"https:\/\/speakerdeck.com\/auxesis\/the-psychology-of-alert-design\">psychological factors that should be considered when designing alerts<\/a>.<\/p>\n\n<p><a href=\"https:\/\/twitter.com\/mindweather\">Dave Zwieback<\/a> pointed me to a great blog post of his on <a href=\"http:\/\/mindweather.com\/2013\/09\/13\/the-human-side-of-postmortems\/\">managing the human side of post mortems<\/a>, which bookends nicely with my talk:<\/p>\n\n<blockquote>\n  <p>Imagine you had to write a postmortem containing statements like these:<\/p>\n\n  <blockquote>\n    <p>We were unable to resolve the outage as quickly as we would have hoped because our decision making was impacted by extreme stress.<\/p>\n\n    <p>We spent two hours repeatedly applying the fix that worked during the previous outage, only to find out that it made no difference in this one.<\/p>\n\n    <p>We did not communicate openly about an escalating outage that was caused by our botched deployment because we thought we were about to lose our jobs.<\/p>\n  <\/blockquote>\n\n  <p>While the above scenarios are entirely realistic, it\u2019s hard to find many postmortem write-ups that even hint at these \u201chuman factors.\u201d Their absence is, in part, due to the social stigma associated with publicly acknowledging their contribution to outages.<\/p>\n<\/blockquote>\n\n<p>Dave\u2019s third example dovetails well with some of the examples in <a href=\"http:\/\/www.amazon.com\/Just-Culture-Balancing-Safety-Accountability\/dp\/1409440605\/\">Dekker\u2019s Just Culture<\/a>.<\/p>\n\n<!-- excerpt -->\n\n<p>Dekker 
posits that people fear the consequences of reporting mistakes because:<\/p>\n\n<ul>\n  <li>They don\u2019t know what the consequences will be<\/li>\n  <li>The consequences of reporting can be really bad<\/li>\n<\/ul>\n\n<p>The last point can be especially important when you consider how things like <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hindsight_bias\">hindsight bias<\/a> elevate the importance of proximity.<\/p>\n\n<p>Simply put: when looking at the consequences of an accident, we tend to blame people who were closest to the thing that went wrong.<\/p>\n\n<p>In the middle of an incident, unless you know your organisation has your back if you volunteer mistakes you have made or witnessed, you are more likely to withhold situationally helpful but professionally damaging information.<\/p>\n\n<p>This limits the team\u2019s operational effectiveness and perpetuates a culture of secrecy, thwarting any organisational learning.<\/p>\n\n<p>I think for Dave\u2019s first example to work effectively (<em>\u201cour decision making was impacted by extreme stress\u201d<\/em>), you would need to quantify what the causes and consequences of that stress are.<\/p>\n\n<p>At <a href=\"http:\/\/www.bulletproof.net\/\">Bulletproof<\/a> we are very open with customers in our problem analyses about the technical details of what fails, because our customers are deeply technical themselves, appreciate the detail, and would cotton on quickly if we were pulling the wool over their eyes.<\/p>\n\n<p>This works well for all parties because all parties have comparable levels of technical knowledge.<\/p>\n\n<p>There is risk when talking about stress in general terms because psychological knowledge is not evenly distributed.<\/p>\n\n<p>Because every man and his dog has experienced stress, every man and his dog feels qualified to talk about and comment on other people\u2019s reactions to stress. 
Furthermore, it\u2019s a natural reaction to distance yourself from bad qualities you recognise in yourself by attacking and ridiculing those qualities in others.<\/p>\n\n<p>I\u2019d wager that outsiders would be more reserved in passing judgement when unfamiliar concepts or terminology is used (e.g. talking about <a href=\"http:\/\/en.wikipedia.org\/wiki\/Confirmation_bias\">confirmation bias<\/a>, the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Semmelweis_reflex\">Semmelweis reflex<\/a>, etc).<\/p>\n\n<p>You could reasonably argue that by using those concepts or terminology you are deliberately using jargon to obfuscate information from those outsiders and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Cover_your_ass\">Cover Your Arse<\/a>; however, I would counter that it\u2019s a good opportunity to open a dialogue with those outsiders on building just cultures, eschewing the use of labels like human error, and how cognitive biases are amplified in stressful situations.<\/p>\n","pubDate":"25 Sep 2013","link":"https:\/\/fractio.nl\/2013\/09\/25\/just-post-mortems\/","guid":"https:\/\/fractio.nl\/2013\/09\/25\/just-post-mortems\/"},{"title":"Counters not DAGs","description":"<p>Monitoring dependency graphs are fine for small environments, but they are not a good fit for nested complex environments, like those that make up modern web infrastructures.<\/p>\n\n<p><a href=\"http:\/\/en.wikipedia.org\/wiki\/Directed_acyclic_graph\">DAGs<\/a> are a very alluring data structure to represent monitoring relationships, but they fall down once you start using them to represent relationships at scale:<\/p>\n\n<ul>\n  <li>\n    <p><strong>There is an assumption of a direct causal link between edges of the graph.<\/strong> It\u2019s very tempting to believe that you can trace failure from one edge of the graph to another. 
Failures in one part of a complex system all too often have weird effects and <a href=\"http:\/\/en.wikipedia.org\/wiki\/Butterfly_effect\">induce failure on other components<\/a> of the same system that are quite removed from one another.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Complex systems are almost impossible to model.<\/strong> With time and an endless stream of money you can sufficiently model the failure modes within complex systems in isolation, but fully understanding and predicting how complex systems interact and relate with one another is almost impossible. The only way to model this effectively is to <a href=\"http:\/\/www.fas.org\/man\/dod-101\/sys\/ship\/docs\/art7su98.htm\">have a closed system with very few external dependencies<\/a>, which is the opposite of the situation every web operations team is in.<\/p>\n  <\/li>\n  <li>\n    <p><strong>The cost of maintaining the graph is non-trivial.<\/strong> You could employ a team of extremely skilled engineers to understand and model the relationships between each component in your infrastructure, but their work would never be done. On top of that, given the sustained growth most organisations experience, whatever you model will likely change within 12-18 months. 
<em>Fundamentally it would not provide a good return on investment<\/em>.<\/p>\n  <\/li>\n<\/ul>\n\n<!-- excerpt -->\n\n<h3 id=\"check_check\">check_check<\/h3>\n\n<p>This isn\u2019t a new problem.<\/p>\n\n<p><a href=\"https:\/\/twitter.com\/jordansissel\">Jordan Sissel<\/a> wrote a great post as part of Sysadvent almost three years ago <a href=\"http:\/\/sysadvent.blogspot.com\/2010\/12\/day-6-aggregating-monitoring-checks-for.html\">about check_check<\/a>.<\/p>\n\n<p>His approach is simple and elegant:<\/p>\n\n<ul>\n  <li>Configure checks in Nagios, but configure a contact that drops the alerts<\/li>\n  <li>Read Nagios\u2019s state out of a file + parse it<\/li>\n  <li>Aggregate the checks by regex, and alert if a percentage is critical<\/li>\n<\/ul>\n\n<p>It\u2019s a godsend for people who manage large Nagios instances, but it starts falling down if you\u2019ve got multiple independent Nagios instances (shards) that are checking the same thing.<\/p>\n\n<p>You still end up with a situation where all of your shards alert if the shared entity they\u2019re monitoring fails.<\/p>\n\n<h3 id=\"flapjack\">Flapjack<\/h3>\n\n<p>This is the concrete use case behind why <a href=\"http:\/\/bulletproof.net\/\">we\u2019re<\/a> <a href=\"https:\/\/fractio.nl\/2013\/03\/15\/rebooting-flapjack\/\">rebooting Flapjack<\/a> - we want to stream the event data from all Nagios shards to Flapjack, and do smart things around notification.<\/p>\n\n<p>The approach we\u2019re looking at in Flapjack is pretty similar to <code class=\"language-plaintext highlighter-rouge\">check_check<\/code> - set thresholds on the number of failure events we see for particular entities - but we want to take it one step further.<\/p>\n\n<p>Entities in Flapjack <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/API#wiki-get_entities_id_tags\">can be tagged<\/a>, so we automatically create \u201cfailure counters\u201d for each of those tags.<\/p>\n\n<p>When checks on those entities fail, we simply 
increment each of those failure counters. Then we can set thresholds on each of those counters (based on absolute value like &gt; 30 entities, or percentage like &gt; 70% of entities), and perform intelligent actions like:<\/p>\n\n<ul>\n  <li>Send a single notification to on-call with a summary of the failing tag counters<\/li>\n  <li>Rate limit alerts and provide summary alerts to customers<\/li>\n  <li>Wake up the relevant owners of the infrastructure that is failing<\/li>\n  <li>Trigger a \u201cworkaround engine\u201d that attempts to resolve the problem in an automated way<\/li>\n<\/ul>\n\n<p>The result of this is that on-call aren\u2019t overloaded with alerts, we involve the people who can fix the problems sooner, and it all works across multiple event sources.<\/p>\n\n<p><strong>One note on complexity<\/strong>: I am not convinced that automated systems that try to derive meaning from relationships in a graph (or even tag counters) and present the operator with a conclusion are going to provide anything more than a best-guess abstraction of the problem. <em>In the real world, that best guess is most likely wrong<\/em>.<\/p>\n\n<p>We need to provide better rollup capabilities that give the operator a summarised view of the current facts, and allow the operator to do their own investigation untainted by the assumptions of the programmer who wrote the inaccurate heuristic.<\/p>\n\n<p>Flapjack\u2019s (and <code class=\"language-plaintext highlighter-rouge\">check_check<\/code>\u2019s) approach also minimises the maintenance burden, as tagging of entities becomes the only thing required to build smarter aggregation + analysis tools. 
This information can easily be pulled out of configuration management.<\/p>\n\n<p>More metadata == more granularity == faster resolution times.<\/p>\n","pubDate":"17 Jul 2013","link":"https:\/\/fractio.nl\/2013\/07\/17\/counters-not-dags\/","guid":"https:\/\/fractio.nl\/2013\/07\/17\/counters-not-dags\/"},{"title":"How we do Kanban","description":"<p>At my <a href=\"http:\/\/bulletproof.net\/\">day job<\/a>, I run a <a href=\"http:\/\/bob.mcwhirter.org\/blog\/2010\/09\/13\/remote-worker-distributed-team\/\">distributed team<\/a> of infrastructure coders spread across Australia + one in Vietnam. Our team is called the Software team, but we\u2019re more analogous to a product-focused <a href=\"http:\/\/en.wikipedia.org\/wiki\/Research_and_development\">Research &amp; Development<\/a> team.<\/p>\n\n<p>Other teams at Bulletproof are a mix of office and remote workers, but our team is a little unique in that we\u2019re fully distributed. We do daily standups using Google Hangouts, and try to do face-to-face meetups every few months at Bulletproof\u2019s offices in Sydney.<\/p>\n\n<p>Intra-team communication is something we\u2019re good at, but I\u2019ve been putting a lot of effort lately into improving how our team communicates with others in the business.<\/p>\n\n<p>This is a post I wrote on our internal company blog explaining how we schedule work, and why we work this way.<\/p>\n\n<hr \/>\n\n<p><img src=\"http:\/\/farm3.staticflickr.com\/2819\/8757261526_b02aa4d973_c.jpg\" alt=\"our physical wallboard in the office\" \/><\/p>\n\n<h3 id=\"what-on-earth-is-this\">What on earth is this?<\/h3>\n\n<p>This is a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Kanban_board\">Kanban board<\/a>.<\/p>\n\n<!-- excerpt -->\n\n<p>A Kanban board is a tool for implementing Kanban. 
<a href=\"http:\/\/en.wikipedia.org\/wiki\/Kanban\">Kanban<\/a> is a scheduling system developed at Toyota in the 70\u2019s as part of the broader <a href=\"http:\/\/en.wikipedia.org\/wiki\/Toyota_Production_System\">Toyota Production System<\/a>.<\/p>\n\n<p>Applied to <a href=\"http:\/\/en.wikipedia.org\/wiki\/Kanban_\\(development\\)\">software development<\/a>, the top three things Kanban aims to achieve are:<\/p>\n\n<ul>\n  <li><strong>Visualise<\/strong> the flow of work<\/li>\n  <li><strong>Limit<\/strong> the Work-In-Progress (WIP)<\/li>\n  <li><strong>Manage<\/strong> and optimise the flow of work<\/li>\n<\/ul>\n\n<h3 id=\"how-does-kanban-work-for-the-software-team\">How does Kanban work for the Software team?<\/h3>\n\n<p>In practical terms, work tends to be tracked in:<\/p>\n\n<ul>\n  <li><strong><a href=\"http:\/\/bestpractical.com\/rt\/\">RT tickets<\/a><\/strong>, as created using the standard request process, or escalated from other teams<\/li>\n  <li><strong><a href=\"https:\/\/github.com\/features\/projects\/issues\">GitHub issues<\/a><\/strong>, for product improvements, and work discovered while doing other work<\/li>\n  <li><strong>Ad-hoc requests<\/strong>, through informal communication channels (IM, email)<\/li>\n<\/ul>\n\n<p>Because Software deals with requests from many audiences, we use a Kanban board to visualise work from request to completion across all these systems.<\/p>\n\n<h3 id=\"managing-flow\">Managing flow<\/h3>\n\n<p>As of writing, we have 5 stages a task progresses through:<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8118\/8757262454_ffddc8d41e_c.jpg\" alt=\"the board\" \/><\/p>\n\n<ul>\n  <li><strong>To Do<\/strong> - tasks <a href=\"http:\/\/en.wikipedia.org\/wiki\/Triage\">triaged<\/a>, and scheduled to be worked on next<\/li>\n  <li><strong>Doing<\/strong> - tasks being worked on right now<\/li>\n  <li><strong>Deployable<\/strong> - completed tasks that need to be released to production in the near future 
(generally during change windows)<\/li>\n  <li><strong>Done<\/strong> - completed tasks<\/li>\n<\/ul>\n\n<p>That\u2019s only 4 - there is another stage called the Icebox. This is for tasks we\u2019re aware of, but haven\u2019t been triaged and aren\u2019t scheduled to be worked on yet.<\/p>\n\n<p>Done tasks are cleaned out once a week on Mondays, after the morning standup.<\/p>\n\n<p><strong>Triage<\/strong> is the process of taking a request and:<\/p>\n\n<ul>\n  <li>Determining the business priority<\/li>\n  <li>Breaking it up into smaller tasks<\/li>\n  <li>(Tentatively) allocating it to someone<\/li>\n  <li>Classifying the type of work (Internal, Customer, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Business_as_usual_\\(business\\)\">BAU<\/a>)<\/li>\n  <li>Estimating a task completion time<\/li>\n<\/ul>\n\n<p>We use the board exclusively to visualise the tasks - we don\u2019t communicate with the stakeholder through the board.<\/p>\n\n<p>Each task has a pointer to the system the request originated from:<\/p>\n\n<p><img src=\"http:\/\/farm6.staticflickr.com\/5454\/8756135267_1000189eca_c.jpg\" alt=\"detailed view\" \/><\/p>\n\n<p>\u2026and a little bit of metadata about the overall progress.<\/p>\n\n<p>Communication with the stakeholder is done through the RT ticket \/ GitHub issue \/ email.<\/p>\n\n<h3 id=\"limiting-wip\">Limiting WIP<\/h3>\n\n<p>The <a href=\"http:\/\/en.wikipedia.org\/wiki\/Work_in_process\">WIP<\/a> Limit is an artificial limit on the number of tasks the whole team can work on simultaneously. We currently calculate the WIP as:<\/p>\n\n<blockquote>\n  <p>(Number of people in Software) x 2<\/p>\n<\/blockquote>\n\n<p>The goal here is to ensure no one person is ever working on more than 2 tasks at once.<\/p>\n\n<p>I can hear you thinking <em>\u201cThat\u2019s crazy and will never work for me! 
I\u2019m always dealing with multiple requests simultaneously\u201d<\/em>.<\/p>\n\n<p>The key to making the WIP Limit work is that <strong>tasks are never pushed<\/strong> through the system - <strong>they are pulled<\/strong> by the people doing the work. Once you finish your current task, you pull across the next highest priority task from the To Do column.<\/p>\n\n<p>The WIP Limit is particularly useful when coupled with visualising flow because:<\/p>\n\n<ul>\n  <li>If people need to work on more than 2 things at once, it\u2019s indicative of a bigger scheduling contention problem that needs to be solved. We are likely context switching rapidly, which reduces our delivery throughput.<\/li>\n  <li>If the team is constantly working at the WIP limit, we need more people. We always aim to have at least 20% slack in the system to deal with ad-hoc tasks that bubble up throughout the day. If we\u2019re operating at 100% capacity, we have no room to breathe, and this severely reduces our operational effectiveness.<\/li>\n<\/ul>\n\n<h3 id=\"visualising-flow\">Visualising flow<\/h3>\n\n<p>Work makes its way from left to right across the board.<\/p>\n\n<p>This is valuable for communicating to people where their requests sit in the overall queue of work, but also in identifying bottlenecks where work isn\u2019t getting completed.<\/p>\n\n<p>The <a href=\"http:\/\/kanbanery.com\/\">Kanban tool<\/a> we use colour codes tasks based on how long they have been sitting in the same column:<\/p>\n\n<p><img src=\"http:\/\/farm3.staticflickr.com\/2857\/8756137525_ff6a337ca7.jpg\" alt=\"colour coding of tasks\" \/><\/p>\n\n<p>This is vital for identifying work that people are blocked on completing, and tends to be indicative of one of two things:<\/p>\n\n<ul>\n  <li>Work that is too large and needs to be broken down into smaller tasks<\/li>\n  <li>Work that is more complex or challenging than originally anticipated<\/li>\n<\/ul>\n\n<p>The latter is an interesting case, because 
it may require pulling people off other work to help the person assigned that task push through and complete that work.<\/p>\n\n<p>Normally as a manager this isn\u2019t easy to discover unless you are regularly polling your people about their progress, but that behaviour is incredibly annoying to be on the receiving end of.<\/p>\n\n<p>The board is updated in real time as people in the team do work, which means as a manager I can get out of their way and let them Get Shit Done while having a passive visual indicator of any blockers in the system.<\/p>\n","pubDate":"20 May 2013","link":"https:\/\/fractio.nl\/2013\/05\/20\/how-we-do-kanban\/","guid":"https:\/\/fractio.nl\/2013\/05\/20\/how-we-do-kanban\/"},{"title":"Escalating Complexity","description":"<p>Back in 2009 when I was backpacking around Europe I remember waking up on the morning of June 1 and reading about how an Air France flight had disappeared somewhere over the Atlantic.<\/p>\n\n<p>The lack of information on what happened to the flight intrigued me, and given the traveling I was doing, I was left wondering \u201cwhat if I was on that plane?\u201d<\/p>\n\n<p>Keeping an ear out for updates, in December 2011 I stumbled upon the <a href=\"http:\/\/www.popularmechanics.com\/technology\/aviation\/crashes\/what-really-happened-aboard-air-france-447-6611877\">Popular Mechanics article<\/a> describing the final moments of the flight. I was left fascinated by how a technical system so advanced could fail so horribly, apparently because of the faulty meatware operating it.<\/p>\n\n<!-- excerpt -->\n\n<p>Around the same time I began reading the works of <a href=\"http:\/\/sidneydekker.com\/\">Sidney Dekker<\/a>. 
I was left in a state of cognitive dissonance, trying to reconcile the mainstream explanation of what happened in the final moments of AF447 (the pilots were poorly trained, inexperienced, and simply incompetent) with the New View that the operators were merely locally rational actors within a complex system, and that \u201croot cause is simply the place you stop looking further\u201d - with that cause far too commonly attributed to humans.<\/p>\n\n<p>I decided to do my own research, which resulted in me producing a talk that has received the strongest reaction of any talk I\u2019ve ever given.<\/p>\n\n<iframe width=\"560\" height=\"315\" src=\"http:\/\/www.youtube.com\/embed\/P8hZOHtrHn0\" frameborder=\"0\" allowfullscreen=\"\"><\/iframe>\n\n<blockquote>\n  <p>On June 1, 2009 Air France 447 crashed into the Atlantic Ocean killing all 228 passengers and crew. The 15 minutes leading up to the impact were a terrifying demonstration of how thick the fog of war is in complex systems.<\/p>\n\n  <p>Mainstream reports of the incident put the blame on the pilots - a common motif in incident reports that conveniently ignore a simple fact: people were just actors within a complex system, doing their best based on the information at hand.<\/p>\n\n  <p>While the systems you build and operate likely don\u2019t control the fate of people\u2019s lives, they share many of the same complexity characteristics. 
Dev and Ops can learn an abundance from how the feedback loops between these aviation systems are designed and how these systems are operated.<\/p>\n\n  <p>In this talk Lindsay will cover what happened on the flight, why the mainstream explanation doesn\u2019t add up, how design assumptions can impact people\u2019s ability to respond to rapidly developing situations, and how to improve your operational effectiveness when dealing with rapidly developing failure scenarios.<\/p>\n<\/blockquote>\n\n<iframe src=\"http:\/\/www.slideshare.net\/slideshow\/embed_code\/18183459\" width=\"427\" height=\"356\" frameborder=\"0\" marginwidth=\"0\" marginheight=\"0\" scrolling=\"no\" style=\"border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px\" allowfullscreen=\"\" webkitallowfullscreen=\"\" mozallowfullscreen=\"\"> <\/iframe>\n\n<p>The subject matter is heavy, and while it\u2019s something I\u2019m passionate about, it was an emotionally taxing talk to prepare, and a talk that angers me when presenting.<\/p>\n\n<p>Time to let it sit and rest.<\/p>\n","pubDate":"15 May 2013","link":"https:\/\/fractio.nl\/2013\/05\/15\/escalating-complexity-af447\/","guid":"https:\/\/fractio.nl\/2013\/05\/15\/escalating-complexity-af447\/"},{"title":"Data failures, compartmentalisation challenges, monitoring pipelines","description":"<p>To recap, <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2013\/03\/22\/monitoring-pipelines\/\">pipelines are a useful way of modelling monitoring systems<\/a>.<\/p>\n\n<p>Each compartment of the pipeline manipulates monitoring data before making it available to the next.<\/p>\n\n<p>At a high level, this is how data flows between the compartments:<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8370\/8579331916_e698523190_o.png\" alt=\"basic pipeline\" \/><\/p>\n\n<p>This design gives us a nice separation of concerns that enables scalability, fault tolerance, and clear interfaces.<\/p>\n\n<h3 id=\"the-problem\">The problem<\/h3>\n\n<p>What happens 
when there is no data available for the checks to query?<\/p>\n\n<!-- excerpt -->\n\n<p>In this very concrete case, we can divide the problem into two distinct classes of failure:<\/p>\n\n<ul>\n  <li><strong>Latency when accessing the metric storage layer<\/strong>, manifested as <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2012\/01\/09\/monitoring-sucks-latency-sucks-more\/\">checks timing out<\/a>.<\/li>\n  <li><strong>Latency or failure when pushing metrics into the storage layer<\/strong>, manifested as checks being unable to retrieve fresh data.<\/li>\n<\/ul>\n\n<p>There are two outcomes from this:<\/p>\n\n<ul>\n  <li>We need to provide clearer feedback to the people responding to alerts, to give them more insight into what\u2019s happening within the pipeline<\/li>\n  <li>We need to make the technical system more robust when dealing with either of the above cases<\/li>\n<\/ul>\n\n<h3 id=\"alerting-severity-levels-arent-granular-or-accurate-in-a-modern-monitoring-context\">Alerting severity levels aren\u2019t granular or accurate in a modern monitoring context<\/h3>\n\n<p>There are entire classes of monitoring problems (like the one we\u2019re dealing with here) that map poorly into the existing levels. 
This is an artefact of an industry-wide cargo-culting of the alerting levels from Nagios, and these levels may not make sense in a modern monitoring pipeline with distinctly compartmentalised stages.<\/p>\n\n<p>For example, the <a href=\"http:\/\/nagiosplug.sourceforge.net\/developer-guidelines.html#AEN76\">Nagios plugin development guidelines<\/a> state that <code class=\"language-plaintext highlighter-rouge\">UNKNOWN<\/code> from a check can mean:<\/p>\n\n<ul>\n  <li>Invalid command line arguments were supplied to the plugin<\/li>\n  <li>Low-level failures internal to the plugin (such as unable to fork, or open a tcp socket) that prevent it from performing the specified operation.<\/li>\n<\/ul>\n\n<p>\u201cLow-level failures\u201d is extremely broad, and it\u2019s important operationally to provide precise feedback to the people maintaining the monitoring system.<\/p>\n\n<p>Adding an additional level (or levels) with contextual debugging information would help close this feedback loop.<\/p>\n\n<p>In defence of the current practice, there are operational benefits to mapping problems into just 4 levels. For example, there are only ever 4 levels that an engineer needs to be aware of, as opposed to a system where there are 5 or 10 different levels that capture the nuance of a state, but engineers don\u2019t understand what that nuance actually is.<\/p>\n\n<h3 id=\"compartmentalisation-as-the-saviour-and-bane\">Compartmentalisation as the saviour and bane<\/h3>\n\n<p>The core idea driving the pipeline approach is compartmentalisation. We want to split out the different functions of monitoring into separate reliable compartments that have clearly defined interfaces.<\/p>\n\n<p>The motivation for this approach comes from the performance limitations of traditional monitoring systems where all the functions essentially live on a single box that can only be scaled vertically. 
Eventually you will reach the vertical limit of hardware capacity.<\/p>\n\n<p>This is bad.<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8085\/8579405596_46095fa5cc_o.png\" alt=\"a monolithic monitoring system\" \/><\/p>\n\n<p>Thus the <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2013\/03\/22\/monitoring-pipelines\/\">pipeline approach<\/a>:<\/p>\n\n<blockquote>\n  <p>Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.<\/p>\n<\/blockquote>\n\n<p>This sounds great, except that now we have to deal with the relationships between each compartment both in the normal mode of operation (fetching metrics, querying metrics, sending notifications, etc), and during failure scenarios (one or more compartments being down, incorrect or delayed information passed between compartments, etc).<\/p>\n\n<p>The pipeline attempts to take this into account:<\/p>\n\n<blockquote>\n  <p>Ideally, failures and scalability bottlenecks are compartmentalised.<\/p>\n\n  <p>Where there are cascading failures that can\u2019t be contained, safeguards can be implemented in the surrounding compartments to dampen the effects.<\/p>\n\n  <p>For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. 
Bad times.<\/p>\n\n  <p>We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.<\/p>\n<\/blockquote>\n\n<p>While the design is in theory meant to allow this containment, the practicalities of doing this are not straightforward.<\/p>\n\n<p>Some simple questions that need to be asked of each compartment:<\/p>\n\n<ul>\n  <li>How does the compartment deal with a response it hasn\u2019t seen before?<\/li>\n  <li>What is the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Adaptive_capacity\">adaptive capacity<\/a> of each compartment? How robust is each compartment?<\/li>\n  <li>Does a failure in one compartment cascade into another? How far?<\/li>\n<\/ul>\n\n<p>The initial answers won\u2019t be pretty, and the solutions won\u2019t be simple (ideal as that would be) or easily discovered.<\/p>\n\n<p>Additionally, the robustness of each compartment in the pipeline <em>will be different<\/em>, so making each compartment fault tolerant is a hard slog with unique challenges in each compartment.<\/p>\n\n<h3 id=\"how-are-people-solving-this-problem\">How are people solving this problem?<\/h3>\n\n<p>Netflix recently <a href=\"https:\/\/github.com\/Netflix\/Hystrix\/wiki\">open sourced a project called Hystrix<\/a>:<\/p>\n\n<blockquote>\n  <p>Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.<\/p>\n<\/blockquote>\n\n<p>Specifically, Netflix talk about how they make this happen:<\/p>\n\n<blockquote>\n  <h4 id=\"how-does-hystrix-accomplish-this\">How does Hystrix accomplish this?<\/h4>\n\n  <ul>\n    <li>Wrap all calls to external systems (dependencies) in a HystrixCommand object (command pattern) which typically executes within a 
separate thread.<\/li>\n    <li>Time-out calls that take longer than defined thresholds. A default exists but for most dependencies is custom-set via properties to be just slightly higher than the measured 99.5th percentile performance for each dependency.<\/li>\n    <li>Maintain a small thread-pool (or semaphore) for each dependency and if it becomes full commands will be immediately rejected instead of queued up.<\/li>\n    <li>Measure success, failures (exceptions thrown by client), timeouts, and thread rejections.<\/li>\n    <li>Trip a circuit-breaker automatically or manually to stop all requests to that service for a period of time if error percentage passes a threshold.<\/li>\n    <li>Perform fallback logic when a request fails, is rejected, timed-out or short-circuited.<\/li>\n    <li>Monitor metrics and configuration change in near real-time.<\/li>\n  <\/ul>\n<\/blockquote>\n\n<h3 id=\"potential-solutions\">Potential Solutions<\/h3>\n\n<p>We can apply many of the strategies from Hystrix to the monitoring pipeline:<\/p>\n\n<ul>\n  <li>Wrap all monitoring checks with a timeout that returns an <code class=\"language-plaintext highlighter-rouge\">UNKNOWN<\/code> (assuming you stick with the existing severity levels)<\/li>\n  <li>Add some sort of signalling mechanism to the checks so they fail faster, e.g.\n    <ul>\n      <li>Stick a load balancer like HAProxy or Nginx in front of the data storage compartment<\/li>\n      <li>Cache the state of the data storage compartment that all monitoring checks check before querying the compartment<\/li>\n    <\/ul>\n  <\/li>\n  <li>Detect mass failures, and notify on-call and the monitoring system owners directly to shorten the <a href=\"http:\/\/www.kitchensoap.com\/2010\/11\/07\/mttr-mtbf-for-most-types-of-f\/\">MTTR<\/a>. 
This is something <a href=\"https:\/\/github.com\/flpjck\/flapjack\">Flapjack<\/a> aims to do <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2013\/03\/15\/rebooting-flapjack\/\">as part of the reboot<\/a>.<\/li>\n<\/ul>\n\n<p>I don\u2019t profess to have all (or even any) of the answers. This is new ground, and I\u2019m very curious to hear how other people are solving this problem.<\/p>\n","pubDate":"25 Mar 2013","link":"https:\/\/fractio.nl\/2013\/03\/25\/data-failures-compartments-pipelines\/","guid":"https:\/\/fractio.nl\/2013\/03\/25\/data-failures-compartments-pipelines\/"},{"title":"Pipelines: a modern approach to modelling monitoring","description":"<p>Over the last few years I have been experimenting with different approaches for scaling systems that monitor large numbers of heterogeneous hosts, specifically in hosting environments.<\/p>\n\n<p>This post outlines a pipeline approach for modelling and manipulating monitoring data.<\/p>\n\n<!-- excerpt -->\n\n<hr \/>\n\n<p>Monitoring can be represented as a pipeline which data flows through, and is eventually turned into a notification for a human.<\/p>\n\n<p>This approach has several benefits:<\/p>\n\n<ul>\n  <li>Failures are compartmentalised<\/li>\n  <li>Compartments can be scaled independently from one another<\/li>\n  <li>Clear interfaces are required between compartments, enabling composability<\/li>\n<\/ul>\n\n<p>Each stage of the pipeline is handled by a different compartment of monitoring infrastructure that analyses and manipulates the data before deciding whether to pass it onto the next compartment.<\/p>\n\n<p>These components are the bare minimum required for a monitoring pipeline:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Data collection infrastructure<\/strong>, is generally a collection of agents on target systems, or standalone tools that extract metrics from opaque systems (preferably via an API).<\/p>\n  <\/li>\n  <li>\n    <p><strong>Data storage infrastructure<\/strong>, provides a place to push 
collected metrics. These metrics are almost always numerical, and are then queried and fetched for graphing, monitoring checks, and reporting - thus enabling <a href=\"http:\/\/agilesysadmin.net\/pillar-one\">\u201cWe alert on what we draw\u201d<\/a>.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Check execution infrastructure<\/strong>, runs the monitoring checks configured for each host, which query the data storage infrastructure. Checks that query textual data often poll the target system directly, which can have <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2012\/01\/09\/monitoring-sucks-latency-sucks-more\/\">effects on latency<\/a>.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Notification infrastructure<\/strong>, processes check results from the check execution infrastructure to send notifications to engineers or stakeholders. Ideally the notification infrastructure can also feed back actions from engineers to acknowledge, escalate, or resolve alerts.<\/p>\n  <\/li>\n<\/ul>\n\n<p>At a high level, this is how data flows between the compartments:<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8370\/8579331916_e698523190_o.png\" alt=\"basic pipeline\" \/><\/p>\n\n<p>When using Nagios, the check + notification infrastructure are generally collapsed into one compartment (with the exception of <a href=\"http:\/\/exchange.nagios.org\/directory\/Addons\/Monitoring-Agents\/NRPE--2D-Nagios-Remote-Plugin-Executor\/details\">NRPE<\/a>).<\/p>\n\n<p>Many monitoring pipelines start out with the data collection + storage infrastructure decoupled from the check infrastructure. 
Monitoring checks query the same targets that are being graphed, but:<\/p>\n\n<ul>\n  <li>Because the check intervals don\u2019t necessarily match up to the data collection intervals, it can be hard to correlate monitoring alerts to features on the graphs.<\/li>\n  <li>The more systems poll the target system, the more the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Observer_effect_\\(physics\\)\">observer effect<\/a> is amplified.<\/li>\n<\/ul>\n\n<p>There are two other compartments that are becoming increasingly common:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Event processing infrastructure<\/strong>. Sitting between the check execution and notification infrastructure, this compartment processes events generated from the check infrastructure, identifies trends and emergent behaviours, and forwards the alerts to the notification infrastructure. It may also make decisions on who to send alerts to.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Management infrastructure<\/strong>, provides command + control facilities across all the compartments, as well as being the natural place for graphing and dashboards of metrics in the data storage infrastructure to live. If the target audience is non-technical or strongly segmented (e.g. 
many customers on a shared monitoring infrastructure), it can also provide an abstracted, pretty public face to all the compartments.<\/p>\n  <\/li>\n<\/ul>\n\n<p>This is how event processing + management fit into the pipeline:<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8095\/8579345790_663f5d3e09_o.png\" alt=\"event processing + management added to the pipeline\" \/><\/p>\n\n<p>The management infrastructure can likely be broken up into different compartments as well, but for now it serves as a placeholder.<\/p>\n\n<p>Let\u2019s explore the benefits of this pipeline design.<\/p>\n\n<h3 id=\"failures-are-compartmentalised\">Failures are compartmentalised<\/h3>\n\n<p>Ideally, failures and scalability bottlenecks are compartmentalised.<\/p>\n\n<p>Where there are cascading failures that can\u2019t be contained, safeguards can be implemented in the surrounding compartments to dampen the effects<sup><a href=\"#blah\">1<\/a><\/sup>.<\/p>\n\n<p>For example, if the data storage infrastructure stops returning data, this causes the check infrastructure to return false negatives. Or false positives. Or false UNKNOWNs. Bad times.<\/p>\n\n<p>We can contain the effects in the event processing infrastructure by detecting a mass failure and only sending out a small number of targeted notifications, rather than sending out alerts for each individual failing check.<\/p>\n\n<p>This problem is tricky, interesting, and fodder for further blog posts. :-)<\/p>\n\n<h3 id=\"compartments-can-be-scaled-independently\">Compartments can be scaled independently<\/h3>\n\n<p>Monolithic monitoring architectures are a pain to scale. Viewing a monolithic architecture through the prism of the pipeline model, all of the compartments are squeezed onto a single machine. 
Quite often there isn\u2019t a data collection or storage layer either.<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8085\/8579405596_46095fa5cc_o.png\" alt=\"a monolithic monitoring system\" \/><\/p>\n\n<p>Monolithic architectures often use the same moving parts under the hood, but they tend to be very closely entwined. Each tool has very distinct performance characteristics, but because they all run on a single machine and are poorly separated, the only way to improve performance is by throwing expensive hardware at the problem.<\/p>\n\n<p>If you\u2019ve ever worked with a monolithic monitoring system, you will likely be experiencing painful flashbacks right about now.<\/p>\n\n<p>To generalise the workload of the different compartments:<\/p>\n\n<ul>\n  <li>Check execution, notifications, and event processing tend to be very CPU intensive + network latency sensitive<\/li>\n  <li>Data storage is IO intensive + disk space expensive<\/li>\n<\/ul>\n\n<p>Making sure each compartment is humming along nicely is super important when providing a consistent and reliable monitoring service.<\/p>\n\n<p>Splitting the compartments onto separate infrastructure enables us to:<\/p>\n\n<ul>\n  <li>Optimise the performance of each component individually, either through using hardware that\u2019s more appropriate for the workloads (SSDs, multi-CPU physical machines), or tuning the software stack at the kernel and user space level.<\/li>\n  <li>Expose data through well defined APIs, which leads into the next point:<\/li>\n<\/ul>\n\n<h3 id=\"clear-interfaces-are-required-between-compartments\">Clear interfaces are required between compartments<\/h3>\n\n<p>I like to think of this as \u201cthe Duplo approach\u201d - compartments with well defined interfaces you can plug together to compose your pipeline.<\/p>\n\n<p><img src=\"http:\/\/farm3.staticflickr.com\/2518\/3999316430_8df5fdda1f_z.jpg\" alt=\"a Duplo brick\" \/><\/p>\n\n<p>Clear interfaces abstract the tools used in each 
compartment of the pipeline, which is essential for chaining tools in a composable way.<\/p>\n\n<p>Clear interfaces help us:<\/p>\n\n<ul>\n  <li>Replace underperforming tools that have reached their scalability limits<\/li>\n  <li>Test new tools in parallel with the old tools by verifying their inputs + outputs<\/li>\n  <li>Better identify input that could be considered erroneous, and react appropriately<\/li>\n<\/ul>\n\n<p>Concepts like <a href=\"http:\/\/en.wikipedia.org\/wiki\/Design_by_contract\">Design by Contract<\/a>, <a href=\"http:\/\/en.wikipedia.org\/wiki\/Service-oriented_architecture\">Service Oriented Architecture<\/a>, or <a href=\"http:\/\/en.wikipedia.org\/wiki\/Defensive_programming\">Defensive Programming<\/a> then have direct applicability to the design of individual components and the pipeline overall.<\/p>\n\n<hr \/>\n\n<p>It\u2019s not all rainbows and unicorns. There are some downsides to the pipeline approach.<\/p>\n\n<h3 id=\"greater-cost\">Greater Cost<\/h3>\n\n<p>There will almost certainly be a bigger initial investment in building a monitoring system with the pipeline approach.<\/p>\n\n<p>You\u2019ll be using more components, thus more servers, thus the cost is greater. While the cost of scaling out may be greater up-front, you limit the need to scale up later on.<\/p>\n\n<p>You can counteract some of these effects by starting small and dividing up compartments over time as part of a piecemeal strategy, but this takes time + persistence.<\/p>\n\n<p>I can tell you from personal experience managing the rollout of this pipeline design that it\u2019s hard work keeping a model of the complexity both in your head and well documented.<\/p>\n\n<h3 id=\"more-complexity\">More Complexity<\/h3>\n\n<p>The pipeline makes it easier to eliminate scalability bottlenecks at the expense of more moving parts. 
The more moving parts, the greater the likelihood of failure.<\/p>\n\n<p>Operationally it will be more difficult to troubleshoot when failures occur, and this becomes worse as you increase the safeguards and fault tolerance within your compartments.<\/p>\n\n<p>This is the cost of scalability, and there is no easy fix.<\/p>\n\n<h3 id=\"conclusion\">Conclusion<\/h3>\n\n<p>The pipeline model maps nicely to existing monitoring infrastructures, but also to larger distributed monitoring systems.<\/p>\n\n<p>It provides scalability, fault tolerance, and composability at the cost of a larger upfront investment.<\/p>\n\n<hr \/>\n\n<p><a id=\"blah\">1<\/a>: This is a vast simplification of a very complex topic. Thinking of failure as an energy to be contained by barriers was a popular perspective in accident prevention circles from the 1960\u2019s to the 1980\u2019s, <a href=\"https:\/\/www.msb.se\/Upload\/Kunskapsbank\/Forskningsrapporter\/Slutrapporter\/2009%20Resilience%20Engineering%20New%20directions%20for%20measuring%20and%20maintaining%20safety%20in%20complex%20systems.pdf\">but the concept doesn\u2019t necessarily apply to complex systems<\/a>.<\/p>\n","pubDate":"22 Mar 2013","link":"https:\/\/fractio.nl\/2013\/03\/22\/monitoring-pipelines\/","guid":"https:\/\/fractio.nl\/2013\/03\/22\/monitoring-pipelines\/"},{"title":"Rebooting Flapjack","description":"<p>This is the first time I\u2019ve actually blogged about Flapjack.<\/p>\n\n<!-- excerpt -->\n\n<h3 id=\"the-past\">The past<\/h3>\n\n<p>In 2008 I started talking with <a href=\"https:\/\/twitter.com\/imprecise_matt\">Matt Moor<\/a> about building a \u201cnext generation monitoring system\u201d that would be simple to setup &amp; operate, and provide obvious paths to scale.<\/p>\n\n<p>In 2009 I started hacking on Flapjack while backpacking, and by mid 2009 I had a working prototype running basic monitoring checks.<\/p>\n\n<p>The fundamental idea was simple: decouple the check execution from the alerting and 
notification, and use message queues to distribute the check execution across lots of machines.<\/p>\n\n<p>It seems simple and obvious now, but at the time nobody was really talking about doing this, so Flapjack gathered a reasonable amount of attention relatively quickly after I started talking about it at conferences.<\/p>\n\n<p>2010 rolled around and I was unable to maintain a good development pace and hold that attention gained by talking at conferences due to some <a href=\"http:\/\/www.flickr.com\/photos\/auxesis\/7104782937\">fairly significant life changes<\/a>. Pretty much all of my open source projects suffered, and in the space of 12 months:<\/p>\n\n<ul>\n  <li><a href=\"http:\/\/cucumber-nagios.org\">cucumber-nagios<\/a> maintainership was handed over<\/li>\n  <li><a href=\"http:\/\/visage-app.com\">Visage<\/a> got a small trickle of bug fixes<\/li>\n  <li><a href=\"http:\/\/flapjack-project.com\">Flapjack<\/a> was wound up and <a href=\"https:\/\/github.com\/flpjck\/flapjack\/commit\/661dbd84d2d94a67b6cea58e5f6e86c82b6b316b\">I considered it a dead project<\/a><\/li>\n<\/ul>\n\n<p>There were plenty of other interesting projects like <a href=\"http:\/\/sensuapp.org\/\">Sensu<\/a> that were achieving similar goals excellently, so while winding up Flapjack was a source of bitter personal disappointment, it was offset by seeing other people doing awesome work in the monitoring space.<\/p>\n\n<h3 id=\"the-present\">The present<\/h3>\n\n<p>Mid <abbr title=\"2012\">last year<\/abbr>, an interesting problem arose at work:<\/p>\n\n<p>In a modern \u201cmonitoring system\u201d, how do you:<\/p>\n\n<ul>\n  <li>\n    <p><strong>Notify a dynamic group of people on a variety of media based on monitoring events?<\/strong> <a href=\"http:\/\/bulletproof.net\">Bulletproof<\/a> has thousands of people that may need to be notified by our monitoring system, depending on what monitoring checks are failing. 
While the thresholds on each monitoring check are universal, each of these people can have different notification settings based on time of day or week, the type of service affected, or the severity of the failure.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Dampen or roll up common events so on-call isn\u2019t bombarded during outages?<\/strong> When one system deep in the stack fails, it has significant flow-on effects to everything else that depends on it. This generally manifests as thousands (or tens of thousands, in extremely bad cases) of alerts being sent to on-call in a very short period of time (&lt;60 seconds). Obviously this is bad, and we simply want to detect cases like these, and wake up people involved in the incident response process.<\/p>\n  <\/li>\n  <li>\n    <p><strong>Do the above in an API driven way?<\/strong> We need to solve both problems in a way that works in a multitenant environment with strong segregation between customers, and integrates with an existing monitoring &amp; customer self-service stack.<\/p>\n  <\/li>\n<\/ul>\n\n<p>Thus, <a href=\"https:\/\/github.com\/flpjck\/flapjack\">Flapjack was rebooted<\/a> with a significantly altered focus:<\/p>\n\n<ul>\n  <li>Event processing<\/li>\n  <li>Correlation &amp; rollup<\/li>\n  <li>API driven configuration<\/li>\n<\/ul>\n\n<p>We\u2019ve been actively working on the reboot since July last year, and have been sending alerts from Flapjack to customers since January.<\/p>\n\n<p>We\u2019re developing Flapjack as a <a href=\"http:\/\/en.wikipedia.org\/wiki\/MIT_License\">fully Open Source<\/a> <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/USING\">composable platform<\/a> on which you can <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/IMPORTING\">adapt<\/a> and build to your organisation\u2019s needs by hooking it into your existing check execution infrastructure (we ship a Nagios event processor), and self service and provisioning automation tools.<\/p>\n\n<p>Because we care 
deeply about people integrating Flapjack into their existing environments, we have invested a lot of time and energy into writing quality documentation that covers <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/API\">working with the API<\/a>, <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/DEBUGGING\">debugging production issues<\/a>, and <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/DATA_STRUCTURES\">the data structures<\/a> used behind the scenes. That\u2019s all on top of the <a href=\"https:\/\/github.com\/flpjck\/flapjack\/wiki\/USING\">usage documentation<\/a>, of course.<\/p>\n\n<p>Flapjack is built on Redis, and funnily enough <a href=\"https:\/\/twitter.com\/ripienaar\">R.I. Pienaar<\/a> did a post <a href=\"http:\/\/www.devco.net\/archives\/2013\/01\/06\/solving-monitoring-state-storage-problems-using-redis.php\">earlier this year<\/a> that investigates using Redis to solve the same problem in an extremely similar way. R.I.\u2019s post provides a good primer on some of the thinking behind Flapjack, so I recommend giving it a read.<\/p>\n\n<h3 id=\"the-future\">The future<\/h3>\n\n<p>Fundamentally, Flapjack is trying to plug a notification hole in the monitoring ecosystem that I don\u2019t believe is being adequately addressed by other tools, but the key to doing this is to play nicely with other tools and build a composable pipeline.<\/p>\n\n<p>The above is merely a glimpse of Flapjack that leaves quite a few questions unanswered (e.g. 
<em>\u201cWhy aren\u2019t you using $x feature of $y check execution engine to do roll-up?\u201d<\/em>, <em>\u201cDo Flapjack and <a href=\"http:\/\/riemann.io\/\">Riemann<\/a> play nicely with one another?\u201d<\/em>), so stay tuned for more:<\/p>\n\n<p><img src=\"http:\/\/24.media.tumblr.com\/tumblr_lx2uc33Q0Z1qb6v7mo1_500.gif\" alt=\"more waffles\" \/><\/p>\n","pubDate":"15 Mar 2013","link":"https:\/\/fractio.nl\/2013\/03\/15\/rebooting-flapjack\/","guid":"https:\/\/fractio.nl\/2013\/03\/15\/rebooting-flapjack\/"},{"title":"Upcoming speaking engagements and travel","description":"<p>My next 2 months is going to be jam packed with conferences and travel!<\/p>\n\n<ul>\n  <li><a href=\"http:\/\/www.devopsdays.org\/events\/2013-newzealand\/\">Devopsdays NZ<\/a>, <strong>March 8 2013<\/strong>. I will be giving a talk that analyses <a href=\"http:\/\/www.devopsdays.org\/events\/2013-newzealand\/proposals\/LessonsCollaborativeMaintenance\/\">AA261 through a DevOps lense<\/a>, looking at the collaborative maintenance and operation of the MD-83 in the crash.<\/li>\n  <li><a href=\"http:\/\/monitorama.com\/\">Monitorama<\/a>, <strong>March 28-29 2013<\/strong>. I\u2019m looking forward to slowing down and listening at Monitorama, which has a tremendous line up of speakers. I\u2019ll be keen to hear what others think of the work <a href=\"http:\/\/bulletproof.net\">we\u2019ve<\/a> been doing <a href=\"http:\/\/github.com\/flpjck\/flapjack\">on Flapjack<\/a> the <a href=\"https:\/\/speakerdeck.com\/auxesis\/zombie-pancakes-rebooting-flapjack-lindsay-holmwood\">last 6 months<\/a>.<\/li>\n  <li><a href=\"http:\/\/mtnwestrubyconf.org\/\">Mountain West Ruby Conf 2013<\/a>, <strong>April 3-5 2013<\/strong>. 
MWRC has added an extra day of DevOps content to the conference this year, and I\u2019ll be joining an esteemed speaker lineup to talk about what both dev and ops can learn from <a href=\"http:\/\/en.wikipedia.org\/wiki\/Air_France_Flight_447\">AF447<\/a> when responding to rapidly evolving failure scenarios.<\/li>\n  <li>I\u2019ll be staying in the Netherlands for a little under a week between conferences, visiting family and friends. Hopefully I can visit a meetup or two.<\/li>\n  <li><a href=\"http:\/\/www.netways.de\/en\/osdc\/osdc_2013\/overview\/\">Open Source Data Center Conference 2013<\/a>, <strong>April 17-18 2013<\/strong>. This will be my first time in Nuremberg, and I\u2019m really looking forward to saying I have attended <a href=\"http:\/\/en.wikipedia.org\/wiki\/Open_Source_Developers'_Conference\">both<\/a> <a href=\"http:\/\/www.netways.de\/en\/osdc\/osdc_2013\/overview\/\">OSDCs<\/a>. I\u2019ll be talking about <a href=\"http:\/\/github.com\/auxesis\/ript\">Ript<\/a>, a DSL for describing firewall rules, and a tool for incrementally applying them.<\/li>\n  <li><a href=\"http:\/\/www.netways.de\/puppetcamp\">Puppet Camp Nuremberg 2013<\/a>, <strong>April 19 2013<\/strong>. 
Straight after OSDC I\u2019ll be talking about how we are using <a href=\"http:\/\/bulletproof.net\/\">Puppet at Bulletproof Networks<\/a> in multi-tenant, isolated environments.<\/li>\n<\/ul>\n","pubDate":"05 Mar 2013","link":"https:\/\/fractio.nl\/2013\/03\/05\/upcoming-speaking-and-travel\/","guid":"https:\/\/fractio.nl\/2013\/03\/05\/upcoming-speaking-and-travel\/"},{"title":"How I make interesting technical presentations","description":"<p>Whenever I talk at conferences, I am routinely asked how I go about preparing and making my presentations.<\/p>\n\n<p>There are no hard and fast rules, but these are some things I have learnt:<\/p>\n\n<h3 id=\"start-analog\">Start analog<\/h3>\n\n<p>The most limiting thing you can do when you start putting together a presentation is to reach for slideware. I use a paper notebook to brainstorm my ideas with multicoloured pens, then scan it so I can refer back to it quickly when putting the slides together.<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8098\/8414483097_871baba740.jpg\" alt=\"mindmapping a talk\" \/><\/p>\n\n<!-- excerpt -->\n\n<h3 id=\"dont-create-slides-linearly\">Don\u2019t create slides linearly<\/h3>\n\n<p>I focus on an idea in the brainstorm that surprised me the most when I wrote it down, and use it as a jump-off point for creating slides. I\u2019ve found exploring that initial idea helps set the tone for the rest of the presentation.<\/p>\n\n<h3 id=\"weave-a-story\">Weave a story<\/h3>\n\n<p>Kathy Sierra used to bang on <a href=\"http:\/\/headrush.typepad.com\/creating_passionate_users\/2006\/02\/where_theres_pa.html\">about this heaps<\/a>. We\u2019re wired as a species to find stories interesting, so use this to your advantage.<\/p>\n\n<p>But don\u2019t concoct a story just for the talk - try to relate the content back to your own experiences. 
Nobody wants to hear about <a href=\"http:\/\/en.wikipedia.org\/wiki\/Alice_and_Bob\">Alice and Bob<\/a>, they want to hear you and your co-workers rise above adversity and the setbacks you had along the way.<\/p>\n\n<p>Chris Fegan\u2019s <a href=\"http:\/\/www.nbnco.com.au\/\">NBNCo<\/a> talk at Puppet Camp Sydney 2013 was a good example of how to weave technical detail into an organisational growth story.<\/p>\n\n<h3 id=\"use-slides-appropriately\">Use slides appropriately<\/h3>\n\n<p>They are a visual aid, and a visual aid alone. People\u2019s attention should be on you - you are the speaker after all! Use lots of supporting visuals, and minimal text. No bullet point lists! Put each point on a separate slide.<\/p>\n\n<p>I use <a href=\"http:\/\/www.flickr.com\/search\/?q=wave&amp;l=cc&amp;ss=0&amp;ct=0&amp;mt=all&amp;w=all&amp;adv=1\">Flickr\u2019s Creative Commons search<\/a> to find relevant images, and favourite them when I want to use them again across multiple presentations. Sometimes they even provide a visual trigger that moves the presentation in a direction I wasn\u2019t expecting.<\/p>\n\n<p>If I post the slides after the presentation, it\u2019s always nice to comment on the picture on Flickr to let the photographer know I appreciate their contributions to Open Culture.<\/p>\n\n<h3 id=\"dont-rely-on-the-slides\">Don\u2019t rely on the slides<\/h3>\n\n<p>Ideally if your laptop died 5 minutes before the talk, you should know your material well enough that you could deliver it by voice alone.<\/p>\n\n<h3 id=\"be-thorough\">Be thorough<\/h3>\n\n<p>Shortcuts are obvious to your audience. I spend at least 20 hours preparing each presentation.<\/p>\n\n<p>A lot of that time is research (I spent 10 hours alone doing research on <a href=\"http:\/\/en.wikipedia.org\/wiki\/AF447\">AF447<\/a> before I created a single slide, and that research was probably too little given the depth of subject matter), and a lot of it is finding images on Flickr. 
:-)<\/p>\n\n<p>Maybe 20 hours is a lot, but every minute you put into preparation pays off.<\/p>\n\n<h3 id=\"tailor-your-content\">Tailor your content<\/h3>\n\n<p>It\u2019s ok to give the same talk at multiple conferences, but make sure you alter the content so it\u2019s relevant to your audience.<\/p>\n\n<p>I gave my <a href=\"http:\/\/www.slideshare.net\/auxesis\/monitoring-web-application-behaviour-with-cucumbernagios\">cucumber-nagios talk<\/a> tens of times over an 18 month period, but the talk was different every time.<\/p>\n\n<p>If I was at a developer conference, I would talk about how to reuse your existing tests as monitoring checks. If I was at a sysadmin conference, I would talk about testing systems infrastructure. If I was at a DevOps conference, I would talk about encoding &amp; communicating business processes in your monitoring.<\/p>\n\n<h3 id=\"practice-practice-practice\">Practice, practice, practice<\/h3>\n\n<p>Know the timing of your talk. Work out the average time you should spend on each slide. I generally rehearse each talk at least 3-5 times before I give it the first time, and will revise and rehearse at least 1-2 times on subsequent presentations.<\/p>\n\n<p>Don\u2019t wait until you\u2019ve finished the presentation before you start practicing. I\u2019ll often practice the 20% I\u2019ve put together and discover it feels mechanical, or the ideas don\u2019t flow well into one another. Refactor.<\/p>\n\n<h3 id=\"test-your-equipment\">Test your equipment<\/h3>\n\n<p>Plug your laptop into the projector at least once, preferably twice, before your talk. I carry multiple adapters for every conceivable display type out there, some display cables, a power board, and <a href=\"http:\/\/www.logitech.com\/en-au\/product\/professional-presenter-r800?crid=11\">a clicker<\/a>. 
Test everything, then test it again.<\/p>\n\n<h3 id=\"mirror-your-display\">Mirror your display<\/h3>\n\n<p>It\u2019s tempting to use your laptop screen for presenter notes and stopwatch widgets. Don\u2019t. Know your material. Use a physical stopwatch. Split displays will break unexpectedly, and you\u2019ll lose your flow. Besides, mirroring is always easier than craning your neck to see what your audience is seeing.<\/p>\n\n<h3 id=\"watch-yourself\">Watch yourself<\/h3>\n\n<p>If you\u2019re lucky enough to talk at a conference where your talk is recorded, go back and watch your talk. This is vitally important for working out what bits flowed well and what bits were stilted.<\/p>\n\n<p>\u2013<\/p>\n\n<p>The most important thing is to speak at many events as often as possible. You\u2019re only going to get better at presenting if you present. Start working towards that <a href=\"http:\/\/en.wikipedia.org\/wiki\/Outliers_\\(book\\)\">10,000 hours of mastery<\/a>!<\/p>\n","pubDate":"26 Jan 2013","link":"https:\/\/fractio.nl\/2013\/01\/26\/how-i-make-interesting-technical-presentations\/","guid":"https:\/\/fractio.nl\/2013\/01\/26\/how-i-make-interesting-technical-presentations\/"},{"title":"DevOps Down Under 2012 - what happened?","description":"<p>Almost 2 days ago <a href=\"https:\/\/twitter.com\/patrickdebois\">Patrick<\/a> kicked off a discussion about organising another Australian DevOps conference in 2013 amongst a small group of passionate DevOps who are actively involved in the Australian community.<\/p>\n\n<p>While the discussion was trundling on without me, I felt I owed everyone involved an explanation of what happened with this year\u2019s <a href=\"http:\/\/devopsdownunder.org\">unrealised conference<\/a>, and why the conference fell flat.<\/p>\n\n<p>Let\u2019s start at the beginning.<\/p>\n\n<!-- excerpt -->\n\n<p>Having come back from a year of backpacking around Europe and attending the first DevOpsDays conference, I took it upon myself to try and replicate 
the success by organising the first DevOps Down Under conference in 2010.<\/p>\n\n<p>It was a relatively small affair held downstairs at Atlassian\u2019s Corn Exchange offices in Sydney, and I put the thing together on a shoestring budget in my spare time with some on-the-ground help from Atlassian\u2019s <a href=\"https:\/\/twitter.com\/nickmuldoon\">Nicholas Muldoon<\/a>.<\/p>\n\n<p>The event was successful, with people from all across Australia and New Zealand attending. At the end of the conference, each attendee was asked to write down one thing they loved, and one thing they hated about the conference.<\/p>\n\n<p><img src=\"http:\/\/farm5.staticflickr.com\/4079\/5448860355_1e98e1c647_z.jpg\" alt=\"Stacks of love and hate\" \/><\/p>\n\n<p>This gave me a great starting point to build another conference on, and in early 2011 I started getting the itch to do another. At the same time, <a href=\"https:\/\/twitter.com\/evanbottcher\">Evan Bottcher<\/a> pinged me about ThoughtWorks lending a hand to organise another DevOps Down Under in Melbourne later in 2011.<\/p>\n\n<p>The most consistent feedback we got from the 2010 conference was that the coffee was \u201ca little bit shit\u201d, so we fixed that by moving the whole conference to Melbourne.<\/p>\n\n<p>After an initial planning meeting, ThoughtWorks kindly lent <a href=\"https:\/\/twitter.com\/chrisbushelloz\">Chris Bushell<\/a> and <a href=\"http:\/\/www.linkedin.com\/pub\/natalie-drucker\/2a\/233\/911\">Natalie Drucker<\/a> to assist with organising.<\/p>\n\n<p>I was just starting a new position at work, and wasn\u2019t able to dedicate nearly as much time to organising as I had in 2010. 
I provided the initial vision and direction, but without Chris and Natalie\u2019s tireless efforts and persistent pestering of me to get my arse into gear, the conference would have been but a shadow of itself.<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8070\/8206025708_56cf336d68_z.jpg\" alt=\"Attendees at #dodu2011\" \/><\/p>\n\n<p>By the time DevOps Down Under 2011 wrapped up in July, I was tired and wasn\u2019t feeling fired up about putting on another conference just yet. I decided to wait and see how I felt in the new year.<\/p>\n\n<p>Around March this year I started thinking about doing another conference, but the spark wasn\u2019t there like in other years. I decided to press on regardless, motivated by my perceived expectation that people wanted another conference.<\/p>\n\n<p>The vision for DevOps Down Under 2012 was to build a quiet, intimate, and safe atmosphere that was removed from the rat race. To achieve this, the plan was to cap the number of attendees at 140, find a venue outside a major capital city, and source high quality talks.<\/p>\n\n<p><img src=\"http:\/\/farm9.staticflickr.com\/8489\/8206032538_cfc53dfa14_z.jpg\" alt=\"Venue shot for #dodu2012\" \/><\/p>\n\n<p>The venue &amp; budget were in place, and we got a really great collection of talks submitted. 
I simply failed to execute on anything beyond that.<\/p>\n\n<p>The main reasons why execution failed were:<\/p>\n\n<ul>\n  <li>I had lost the passion for organising the conference, and was motivated by the wrong reasons.<\/li>\n  <li>I had even less time to commit.<\/li>\n  <li>Everyone involved was similarly time poor.<\/li>\n  <li>There was no organisational cadence.<\/li>\n  <li>I didn\u2019t lean enough on other people to help me do the grunt work.<\/li>\n  <li>I didn\u2019t have the time to fix any of these problems.<\/li>\n<\/ul>\n\n<p>With the benefit of hindsight, I simply shouldn\u2019t have tried to put it on.<\/p>\n\n<p>Seeing people putting their hands up to organise a 2013 conference takes a huge mental weight off my shoulders.<\/p>\n\n<p>Through my own actions and inactions, I have felt the responsibility of leading the conference organisation year-on-year has fallen to me. In 2012 that pressure became paralysing, and my eventual coping mechanism was to ignore the conference entirely.<\/p>\n\n<p>As for my future involvement: I am still burnt out, and it would simply be unfair to myself, the organisers, speakers, and attendees to commit to taking an active role in organising a 2013 conference.<\/p>\n\n<p>I have provided the current crop of potential organisers a collection of resources to get them started, and I am extremely confident they will manage to pull off something spectacular.<\/p>\n\n<p>Drawing on my battered experience of organising several conferences, these are the key actionable things I believe you need to make an event like DevOps Down Under happen:<\/p>\n\n<ul>\n  <li><strong>Have at least 3 people who can each dedicate 2+ hours a week to doing the grunt work.<\/strong> Anyone who tells you organising a conference is anything but a hard slog is either lying to you, or doesn\u2019t know what they are talking about.<\/li>\n  <li><strong>Do weekly catchup meetings to keep things on track.<\/strong> Increase the frequency of these closer 
to the conference date.<\/li>\n  <li><strong>Use a mailing list for asynchronous organisation.<\/strong><\/li>\n  <li><strong>Nominate someone to lead &amp; own the conference vision &amp; organisation.<\/strong><\/li>\n<\/ul>\n\n<p>I hope the above arms you with enough information to avoid falling into the same traps I did.<\/p>\n","pubDate":"22 Nov 2012","link":"https:\/\/fractio.nl\/2012\/11\/22\/devops-down-under-2012-what-happened\/","guid":"https:\/\/fractio.nl\/2012\/11\/22\/devops-down-under-2012-what-happened\/"},{"title":"Ript: quick, reliable, and painless firewalling","description":"<p>Running your own servers? Hate managing firewall rules?<\/p>\n\n<p>For the last year at <a href=\"http:\/\/bulletproof.net\">Bulletproof Networks<\/a> I\u2019ve been working on a little tool called <a href=\"http:\/\/github.com\/bulletproofnetworks\/ript\">Ript<\/a> to make writing firewall rules a joy, and applying them quick, reliable, and painless.<\/p>\n\n<p>Ript is a clean and opinionated <a href=\"http:\/\/en.wikipedia.org\/wiki\/Domain-specific_language\">Domain Specific Language<\/a> for describing firewall rules, and a tool with database migrations-like functionality for applying these rules with zero downtime.<\/p>\n\n<h3 id=\"the-dsl\">The DSL<\/h3>\n\n<p>At Ript\u2019s core is an easy to use Ruby DSL for describing both simple and complex sets of iptables firewall rules. 
After defining the hosts and networks you care about:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"n\">partition<\/span> <span class=\"s2\">\"joeblogsco\"<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"www.joeblogsco.com\"<\/span><span class=\"p\">,<\/span>      <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"172.19.56.216\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"app-01\"<\/span><span class=\"p\">,<\/span>                  <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"192.168.5.230\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"joeblogsco uat subnet\"<\/span><span class=\"p\">,<\/span>   <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"192.168.5.0\/24\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"joeblogsco stage subnet\"<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"10.60.2.0\/24\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"joeblogsco prod subnet\"<\/span><span class=\"p\">,<\/span>  <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"10.60.3.0\/24\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"bad guy\"<\/span><span class=\"p\">,<\/span>                 <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"172.19.110.247\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"bad guys\"<\/span><span class=\"p\">,<\/span>                <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"10.0.0.0\/8\"<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>\u2026you use Ript\u2019s helpers for accepting, dropping, &amp; rejecting 
packets, as well as for performing DNAT and SNAT:<\/p>\n\n<!-- excerpt -->\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"n\">partition<\/span> <span class=\"s2\">\"joeblogsco\"<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"www.joeblogsco.com\"<\/span><span class=\"p\">,<\/span>      <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"172.19.56.216\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"app-01\"<\/span><span class=\"p\">,<\/span>                  <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"192.168.5.230\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"joeblogsco uat subnet\"<\/span><span class=\"p\">,<\/span>   <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"192.168.5.0\/24\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"joeblogsco stage subnet\"<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"10.60.2.0\/24\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"joeblogsco prod subnet\"<\/span><span class=\"p\">,<\/span>  <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"10.60.3.0\/24\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"bad guy\"<\/span><span class=\"p\">,<\/span>                 <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"172.19.110.247\"<\/span>\n  <span class=\"n\">label<\/span> <span class=\"s2\">\"bad guys\"<\/span><span class=\"p\">,<\/span>                <span class=\"ss\">:address<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"10.0.0.0\/8\"<\/span>\n\n  <span class=\"n\">rewrite<\/span> <span class=\"s2\">\"public website + ssh access\"<\/span> <span 
class=\"k\">do<\/span>\n    <span class=\"n\">ports<\/span> <span class=\"mi\">80<\/span><span class=\"p\">,<\/span> <span class=\"mi\">22<\/span>\n    <span class=\"n\">dnat<\/span>  <span class=\"s2\">\"www.joeblogsco.com\"<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"app-01\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rewrite<\/span> <span class=\"s2\">\"private to public\"<\/span> <span class=\"k\">do<\/span>\n    <span class=\"n\">snat<\/span>  <span class=\"p\">[<\/span> <span class=\"s2\">\"joeblogsco uat subnet\"<\/span><span class=\"p\">,<\/span>\n            <span class=\"s2\">\"joeblogsco stage subnet\"<\/span><span class=\"p\">,<\/span>\n            <span class=\"s2\">\"joeblogsco prod subnet\"<\/span>  <span class=\"p\">]<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"www.joeblogsco.com\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">reject<\/span> <span class=\"s2\">\"bad guy\"<\/span> <span class=\"k\">do<\/span>\n    <span class=\"n\">from<\/span> <span class=\"s2\">\"bad guy\"<\/span>\n    <span class=\"n\">to<\/span>   <span class=\"s2\">\"www.joeblogsco.com\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">drop<\/span> <span class=\"s2\">\"bad guys\"<\/span> <span class=\"k\">do<\/span>\n    <span class=\"n\">protocols<\/span> <span class=\"s2\">\"udp\"<\/span>\n    <span class=\"n\">from<\/span>      <span class=\"s2\">\"bad guys\"<\/span>\n    <span class=\"n\">to<\/span>        <span class=\"s2\">\"www.joeblogsco.com\"<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>The DSL provides many <a href=\"https:\/\/github.com\/bulletproofnetworks\/ript#shortcuts\">helpful shortcuts<\/a> for DRYing up your firewall rules, and tries to do as much of the heavy lifting for you as possible.<\/p>\n\n<p>Part of Ript being opinionated is that it doesn\u2019t expose all the underlying features of iptables. 
This was done for several reasons:<\/p>\n\n<ul>\n  <li>The DSL would become complex, and thus harder to use.<\/li>\n  <li>Not all features within iptables map to Ript\u2019s DSL.<\/li>\n  <li>Ript caters for the simple-to-moderately complex use cases that 80% of users have. If you need to use iptables features documented deep within the man pages, Ript is almost certainly not the tool for you.<\/li>\n<\/ul>\n\n<h3 id=\"rule-application\">Rule application<\/h3>\n\n<p>While the DSL is pretty, we didn\u2019t write Ript because of it - we wrote it because we\u2019re working with tens of thousands of iptables rules &amp; making several changes a day to those rules, and the traditional way of applying changes doesn\u2019t cut it at scale.<\/p>\n\n<p>Most tools try to apply firewall rules by flushing all the loaded rules and loading in new ones. This works fine if you only have a few hundred rules, but as soon as you start scaling into thousands of rules, the load time becomes very noticeable.<\/p>\n\n<p>The effects of this are fairly simple: the rule load time manifests itself as downtime.<\/p>\n\n<p>Because the ruleset has to be applied serially, rules at the end of the set are held up by rules still being applied at the beginning of the set. From a service provider\u2019s perspective, this means that a rule change for one customer can end up causing downtime for other completely unrelated customers. Not cool.<\/p>\n\n<p><code class=\"language-plaintext highlighter-rouge\">iptables-save<\/code> and <code class=\"language-plaintext highlighter-rouge\">iptables-restore<\/code> help with this, but you still end up writing + applying rules by hand - a tedious task if you\u2019re making lots of firewall changes every day.<\/p>\n\n<p>Ript\u2019s killer feature is incrementally applying rules.<\/p>\n\n<p>Ript generates firewall chains in a very specific way that allows it to apply new rules incrementally, and clean out old rules intelligently. 
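</p>

<p>The approach can be sketched as a set difference over rules: work out what is currently loaded, work out what the rules files say should be loaded, and only append the new rules and delete the stale ones. This is an illustrative Ruby sketch of the idea only, not Ript's actual implementation; the rule strings and helper name are made up:</p>

```ruby
# Sketch of incremental rule application: never flush, only apply the
# difference between the loaded and the desired rulesets. Illustrative only.
def incremental_diff(loaded, desired)
  {
    :add    => desired - loaded,  # rules that need to be appended
    :delete => loaded - desired,  # stale rules that need cleaning out
  }
end

loaded  = ["-A INPUT -s 172.19.110.247 -j REJECT",
           "-A INPUT -p udp -s 10.0.0.0/8 -j DROP"]
desired = ["-A INPUT -p udp -s 10.0.0.0/8 -j DROP",
           "-A INPUT -s 192.0.2.7 -j REJECT"]

diff = incremental_diff(loaded, desired)
diff[:add].each    { |rule| puts "iptables #{rule}" }
diff[:delete].each { |rule| puts "iptables #{rule.sub('-A', '-D')}" }
```

<p>Because only the difference is applied, a change for one customer no longer holds up the rules of every other customer.</p>

<p>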
Here\u2019s an example session:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\"><span class=\"c\"># Output all the generated rules by interpreting all files under \/etc\/firewall<\/span>\nript rules generate \/etc\/firewall\n<span class=\"c\"># Output a diff of rules to apply, based on what rules are currently loaded in memory<\/span>\nript rules diff \/etc\/firewall\n<span class=\"c\"># Apply the aforementioned diff<\/span>\nript rules apply \/etc\/firewall\n<span class=\"c\"># Output the currently loaded rules in iptables-restore format<\/span>\nript rules save\n<span class=\"c\"># Output a diff of rules to delete<\/span>\nript clean diff \/etc\/firewall\n<span class=\"c\"># Apply the aforementioned diff<\/span>\nript clean apply \/etc\/firewall<\/code><\/pre><\/figure>\n\n<h3 id=\"getting-started\">Getting started<\/h3>\n\n<p>Ript has been Open Sourced under an MIT license, and is <a href=\"http:\/\/github.com\/bulletproofnetworks\/ript\">available on GitHub<\/a>. 
To get you going, Ript ships with <a href=\"https:\/\/github.com\/bulletproofnetworks\/ript#the-dsl\">extensive DSL usage documentation<\/a>, and a <a href=\"https:\/\/github.com\/bulletproofnetworks\/ript\/tree\/master\/examples\">boatload of examples<\/a> used by the <a href=\"https:\/\/github.com\/bulletproofnetworks\/ript\/tree\/master\/features\">tests<\/a>.<\/p>\n\n<p>I\u2019ll also be giving a <a href=\"https:\/\/lca2013.linux.org.au\/schedule\/30293\/view_talk?day=None\">talk about Ript at linux.conf.au<\/a> in Canberra in January 2013.<\/p>\n\n<p>Happy Ripting!<\/p>\n","pubDate":"12 Nov 2012","link":"https:\/\/fractio.nl\/2012\/11\/12\/ript-quick-reliable-painless-firewalling\/","guid":"https:\/\/fractio.nl\/2012\/11\/12\/ript-quick-reliable-painless-firewalling\/"},{"title":"Incentivising automated changes","description":"<p>Matthias Marschall wrote a great piece last week on the <a href=\"http:\/\/www.agileweboperations.com\/devops-protocol-no-manual-changes\">pitfalls of making manual changes<\/a> to production systems. <strong>TL;DR: Making manual changes in the heat of the moment will bite you at the most inopportune times<\/strong>.<\/p>\n\n<p>The article finishes with this suggestion:<\/p>\n\n<blockquote>\n  <p>You should have your configuration management tool (like Puppet or Chef) setup so that you can try out possible solutions without having to go in and do it manually.<\/p>\n<\/blockquote>\n\n<p>In my experience, this is the key to solving the problem.<\/p>\n\n<p>Rather than coercing people to follow a \u201cno manual changes\u201d policy, you make the incentives for making changes with automation better than for making changes manually.<\/p>\n\n<p>Specifically:<\/p>\n\n<ul>\n  <li><em>Make it simple.<\/em> Reduce the number of steps to make the change with automation. 
It should be quicker to find the place in your Chef or Puppet code and deploy than to log into the box, edit a file, and restart a service.<\/li>\n  <li><em>Make it fast.<\/em> The time from thinking about the change to the change being applied should be shorter with automation than doing it manually.<\/li>\n  <li><em>Make it safe.<\/em> Provide a rollback mechanism for changes. A safety harness can be as simple as a thin process around \u201cgit revert\u201d + deploy.<\/li>\n<\/ul>\n\n<p>It\u2019s a perfect example of how tools should complement culture.<\/p>\n","pubDate":"29 Oct 2012","link":"https:\/\/fractio.nl\/2012\/10\/29\/incentivising-automated-changes\/","guid":"https:\/\/fractio.nl\/2012\/10\/29\/incentivising-automated-changes\/"},{"title":"Instrumenting your monitoring checks with New Relic","description":"<p><em>This post is part 3 of 3 in a series on monitoring scalability.<\/em><\/p>\n\n<p>In parts 1 and 2 of this series I talked about <a href=\"https:\/\/fractio.nl\/2012\/01\/09\/monitoring-sucks-latency-sucks-more\/\">check latency<\/a> and how you can mitigate its effects by splitting data collection + storage out from alerting, while looking at monitoring systems <a href=\"https:\/\/fractio.nl\/2012\/01\/11\/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks\/\">through the prism<\/a> of an MVC web application.<\/p>\n\n<p>This final post in the series provides a concrete example of how to instrument your monitoring checks so you can identify which exact parts of your checks are inducing latency in your monitoring system.<\/p>\n\n<p>When debugging performance bottlenecks, I tend to use a simple but effective workflow:<\/p>\n\n<ol>\n  <li>observe the system<\/li>\n  <li>analyse the results<\/li>\n  <li>optimise the bottleneck that is having the most impact<\/li>\n  <li>rinse and repeat until the system is performing within the expected performance parameters<\/li>\n<\/ol>\n\n<p>What if we continue to look at 
monitoring checks as micro MVC web applications? What tools exist to aid this optimisation workflow, and how can we hook instrumentation into our checks?<\/p>\n\n<!-- excerpt -->\n\n<p>The cr\u00e8me de la cr\u00e8me of web app performance monitoring + optimisation tools is <a href=\"http:\/\/newrelic.com\">New Relic<\/a>, boasting an incredibly rich feature set that lets you drill down deep into your application while also providing a high level view of app-wide performance.<\/p>\n\n<p>But is it possible to hook New Relic into applications that aren\u2019t web apps? Let\u2019s give it a go.<\/p>\n\n<p>Here\u2019s an example monitoring check:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"c1\">#!\/usr\/bin\/env ruby<\/span>\n<span class=\"c1\">#<\/span>\n<span class=\"c1\"># Usage: check.rb &lt;time&gt;<\/span>\n\n<span class=\"k\">class<\/span> <span class=\"nc\">Check<\/span>\n  <span class=\"nb\">attr_reader<\/span> <span class=\"ss\">:opts<\/span>\n\n  <span class=\"k\">def<\/span> <span class=\"nf\">initialize<\/span><span class=\"p\">(<\/span><span class=\"n\">opts<\/span><span class=\"o\">=<\/span><span class=\"p\">{})<\/span>\n    <span class=\"vi\">@opts<\/span> <span class=\"o\">=<\/span> <span class=\"n\">opts<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"k\">def<\/span> <span class=\"nf\">model<\/span><span class=\"p\">(<\/span><span class=\"n\">opts<\/span><span class=\"o\">=<\/span><span class=\"p\">{})<\/span>\n    <span class=\"n\">i<\/span> <span class=\"o\">=<\/span> <span class=\"n\">opts<\/span><span class=\"p\">[<\/span><span class=\"ss\">:time<\/span><span class=\"p\">]<\/span>\n    <span class=\"nb\">sleep<\/span><span class=\"p\">(<\/span><span class=\"mi\">1<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">raise<\/span> <span class=\"p\">[<\/span><span class=\"no\">Exception<\/span><span class=\"p\">,<\/span> <span class=\"no\">RuntimeError<\/span><span
class=\"p\">,<\/span> <span class=\"no\">StandardError<\/span><span class=\"p\">][<\/span><span class=\"nb\">rand<\/span><span class=\"p\">(<\/span><span class=\"mi\">2<\/span><span class=\"p\">)]<\/span> <span class=\"k\">if<\/span> <span class=\"nb\">rand<\/span><span class=\"p\">(<\/span><span class=\"n\">i<\/span><span class=\"p\">)<\/span> <span class=\"o\">==<\/span> <span class=\"mi\">1<\/span>\n    <span class=\"k\">return<\/span> <span class=\"n\">i<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"k\">def<\/span> <span class=\"nf\">view<\/span><span class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">)<\/span>\n    <span class=\"n\">i<\/span> <span class=\"o\">=<\/span> <span class=\"n\">data<\/span>\n    <span class=\"nb\">sleep<\/span><span class=\"p\">(<\/span><span class=\"nb\">rand<\/span><span class=\"p\">(<\/span><span class=\"n\">i<\/span><span class=\"p\">)<\/span> <span class=\"o\">\/<\/span> <span class=\"mi\">5<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">raise<\/span> <span class=\"p\">[<\/span><span class=\"no\">Exception<\/span><span class=\"p\">,<\/span> <span class=\"no\">RuntimeError<\/span><span class=\"p\">,<\/span> <span class=\"no\">ArgumentError<\/span><span class=\"p\">][<\/span><span class=\"nb\">rand<\/span><span class=\"p\">(<\/span><span class=\"mi\">2<\/span><span class=\"p\">)]<\/span> <span class=\"k\">if<\/span> <span class=\"nb\">rand<\/span><span class=\"p\">(<\/span><span class=\"n\">i<\/span><span class=\"p\">)<\/span> <span class=\"o\">==<\/span> <span class=\"mi\">2<\/span>\n\n    <span class=\"nb\">puts<\/span> <span class=\"s2\">\"OK: we made it!\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"k\">def<\/span> <span class=\"nf\">run<\/span>\n    <span class=\"n\">data<\/span> <span class=\"o\">=<\/span> <span class=\"n\">model<\/span><span class=\"p\">(<\/span><span class=\"vi\">@opts<\/span><span class=\"p\">)<\/span>\n    <span class=\"n\">view<\/span><span 
class=\"p\">(<\/span><span class=\"n\">data<\/span><span class=\"p\">)<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"no\">Check<\/span><span class=\"p\">.<\/span><span class=\"nf\">new<\/span><span class=\"p\">(<\/span><span class=\"ss\">:time<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"no\">ARGV<\/span><span class=\"p\">[<\/span><span class=\"mi\">0<\/span><span class=\"p\">].<\/span><span class=\"nf\">to_i<\/span><span class=\"p\">).<\/span><span class=\"nf\">run<\/span><\/code><\/pre><\/figure>\n\n<p>As you can see, it\u2019s <a href=\"http:\/\/www.urbandictionary.com\/define.php?term=flat%20out%20like%20a%20lizard%20drinking\">flat out like a lizard drinking<\/a> inducing latency by sleeping and spicing things up by randomly throwing exceptions. All things considered, it\u2019s actually a pretty good example of a monitoring check that aims to misbehave.<\/p>\n\n<p>Let\u2019s start instrumenting!<\/p>\n\n<p>First up we need to load some libraries:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"c1\">#!\/usr\/bin\/env ruby<\/span>\n\n<span class=\"nb\">require<\/span> <span class=\"s1\">'rubygems'<\/span>\n<span class=\"nb\">require<\/span> <span class=\"s1\">'newrelic_rpm'<\/span>\n\n<span class=\"k\">class<\/span> <span class=\"nc\">Check<\/span>\n  <span class=\"kp\">include<\/span> <span class=\"no\">NewRelic<\/span><span class=\"o\">::<\/span><span class=\"no\">Agent<\/span><span class=\"o\">::<\/span><span class=\"no\">Instrumentation<\/span><span class=\"o\">::<\/span><span class=\"no\">ControllerInstrumentation<\/span><\/code><\/pre><\/figure>\n\n<p>Reading through the New Relic API documentation\u2026<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"c1\"># When the app environment loads, so does the Agent. 
However, the<\/span>\n<span class=\"c1\"># Agent will only connect to the service if a web front-end is found. If<\/span>\n<span class=\"c1\"># you want to selectively monitor ruby processes that don't use<\/span>\n<span class=\"c1\"># web plugins, then call this method in your code and the Agent<\/span>\n<span class=\"c1\"># will fire up and start reporting to the service.<\/span><\/code><\/pre><\/figure>\n\n<p>\u2026it looks like we need to manually start up the agent:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"k\">class<\/span> <span class=\"nc\">Check<\/span>\n  <span class=\"c1\"># ...<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"no\">NewRelic<\/span><span class=\"o\">::<\/span><span class=\"no\">Agent<\/span><span class=\"p\">.<\/span><span class=\"nf\">manual_start<\/span><\/code><\/pre><\/figure>\n\n<p>Now we need to tell the New Relic agent what to instrument. The API provides methods to do this at the transaction and method level:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"k\">class<\/span> <span class=\"nc\">Check<\/span>\n  <span class=\"c1\"># ...<\/span>\n\n  <span class=\"n\">add_transaction_tracer<\/span> <span class=\"ss\">:run<\/span><span class=\"p\">,<\/span>   <span class=\"ss\">:name<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s1\">'run'<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:class_name<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s1\">'#{self.class}'<\/span>\n  <span class=\"n\">add_method_tracer<\/span>      <span class=\"ss\">:model<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'Nagios\/#{self.class.name}\/model'<\/span>\n  <span class=\"n\">add_method_tracer<\/span>      <span class=\"ss\">:view<\/span><span class=\"p\">,<\/span>  <span class=\"s1\">'Nagios\/#{self.class.name}\/view'<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>In New Relic 
parlance, a transaction is an end-to-end process that is comprised of many smaller units of work, and a method is an individual unit of work. In this monitoring check scenario, a transaction is an invocation of the check.<\/p>\n\n<p>When using the New Relic agent with Rails, by default it captures the query parameters passed to the controller action. This helps massively when debugging why a certain transaction takes longer to complete on particular inputs.<\/p>\n\n<p>Wouldn\u2019t it be cool if we could treat the command line arguments to the monitoring check as query parameters to the controller action? That way we could identify which services are running slowly and holding up the check.<\/p>\n\n<p>Turns out this is just another option to <code class=\"language-plaintext highlighter-rouge\">add_transaction_tracer<\/code>:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"n\">add_transaction_tracer<\/span> <span class=\"ss\">:run<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:name<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s1\">'run'<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:class_name<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s1\">'#{self.class}'<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:params<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s1\">'self.opts'<\/span><\/code><\/pre><\/figure>\n\n<p>Provided you store all your options in an instance variable with an <code class=\"language-plaintext highlighter-rouge\">attr_reader<\/code>, you can capture whatever data is passed to the check on execution.<\/p>\n\n<p>One piece of data the New Relic agent captures is an Apdex score for each request. An Apdex score is a measurement of user satisfaction when interacting with an application or service.<\/p>\n\n<p>In this particular scenario, the \u201cuser\u201d is actually a monitoring system, so the score may not be that meaningful. 
Let\u2019s disable it for now:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"k\">class<\/span> <span class=\"nc\">Check<\/span>\n  <span class=\"c1\"># ...<\/span>\n\n  <span class=\"n\">newrelic_ignore_apdex<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>So far everything has been very smooth - we\u2019ve taken an existing check and added some instrumentation points with New Relic - but we\u2019re about to hit a complication.<\/p>\n\n<p>Internally the New Relic agent spawns a separate thread from which it sends all this instrumented data to the New Relic service. Establishing a connection to the New Relic service actually takes a while (15+ seconds in the worst cases), which doesn\u2019t quite fit the paradigm we\u2019re working in where monitoring checks are returning sub-second results.<\/p>\n\n<p>Essentially this means that we\u2019re collecting all this interesting data with the New Relic agent but it\u2019s never actually sent to the New Relic service.<\/p>\n\n<p>In the PHP world this is a very real problem as PHP processes will exit at the end of each request. In the PHP edition of New Relic there\u2019s quite a cute workaround for exactly this problem - each PHP process sends data to a daemon running in the background that buffers it and sends it to New Relic at a regular interval.<\/p>\n\n<p>Let\u2019s emulate this functionality in Ruby:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"nb\">at_exit<\/span> <span class=\"k\">do<\/span>\n  <span class=\"no\">NewRelic<\/span><span class=\"o\">::<\/span><span class=\"no\">Agent<\/span><span class=\"p\">.<\/span><span class=\"nf\">save_data<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>This will serialise the captured data to <code class=\"language-plaintext highlighter-rouge\">log\/newrelic_agent_store.db<\/code> as a marshalled Ruby object. 
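</p>

<p>The same buffer-and-forward pattern can be sketched without New Relic at all: short-lived check processes append marshalled records to a spool file, and a long-lived collector later reads them all back. The file name and record format below are invented for illustration:</p>

```ruby
require 'tmpdir'

# Hypothetical spool file; the real agent uses log/newrelic_agent_store.db.
SPOOL = File.join(Dir.tmpdir, 'check_timings.db')

# A short-lived check process appends its timing data before exiting...
def save_timings(timings, path = SPOOL)
  File.open(path, 'ab') { |f| f.write(Marshal.dump(timings)) }
end

# ...and a long-lived collector loads everything buffered so far.
# Marshal.load on an IO reads one object at a time, so concatenated
# dumps can be read back in a loop.
def load_timings(path = SPOOL)
  records = []
  File.open(path, 'rb') do |f|
    records << Marshal.load(f) until f.eof?
  end
  records
end

File.delete(SPOOL) if File.exist?(SPOOL)
save_timings({ :model => 1.02, :view => 0.4 })
save_timings({ :model => 0.98, :view => 0.2 })
p load_timings  # two buffered records, ready to forward
```

<p>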
The last step is to send this data to New Relic at a regular interval:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"c1\">#!\/usr\/bin\/env ruby<\/span>\n<span class=\"c1\">#<\/span>\n<span class=\"c1\"># Usage: collector.rb<\/span>\n<span class=\"c1\">#<\/span>\n\n<span class=\"nb\">require<\/span> <span class=\"s1\">'rubygems'<\/span>\n<span class=\"nb\">require<\/span> <span class=\"s1\">'newrelic_rpm'<\/span>\n\n<span class=\"k\">module<\/span> <span class=\"nn\">NewRelic<\/span>\n  <span class=\"k\">module<\/span> <span class=\"nn\">Agent<\/span>\n    <span class=\"k\">def<\/span> <span class=\"nc\">self<\/span><span class=\"o\">.<\/span><span class=\"nf\">connected?<\/span>\n      <span class=\"n\">agent<\/span><span class=\"p\">.<\/span><span class=\"nf\">connected?<\/span>\n    <span class=\"k\">end<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"vg\">$stdout<\/span><span class=\"p\">.<\/span><span class=\"nf\">sync<\/span> <span class=\"o\">=<\/span> <span class=\"kp\">true<\/span>\n<span class=\"no\">NewRelic<\/span><span class=\"o\">::<\/span><span class=\"no\">Agent<\/span><span class=\"p\">.<\/span><span class=\"nf\">manual_start<\/span>\n\n<span class=\"nb\">print<\/span> <span class=\"s2\">\"Waiting to connect to the NewRelic service\"<\/span>\n<span class=\"k\">until<\/span> <span class=\"no\">NewRelic<\/span><span class=\"o\">::<\/span><span class=\"no\">Agent<\/span><span class=\"p\">.<\/span><span class=\"nf\">connected?<\/span> <span class=\"k\">do<\/span>\n  <span class=\"nb\">print<\/span> <span class=\"s1\">'.'<\/span>\n  <span class=\"nb\">sleep<\/span> <span class=\"mi\">1<\/span>\n<span class=\"k\">end<\/span>\n<span class=\"nb\">puts<\/span>\n\n<span class=\"no\">NewRelic<\/span><span class=\"o\">::<\/span><span class=\"no\">Agent<\/span><span class=\"p\">.<\/span><span class=\"nf\">load_data<\/span>\n<span class=\"no\">NewRelic<\/span><span 
class=\"o\">::<\/span><span class=\"no\">Agent<\/span><span class=\"p\">.<\/span><span class=\"nf\">shutdown<\/span><span class=\"p\">(<\/span><span class=\"ss\">:force_send<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"kp\">true<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/figure>\n\n<p>This waits for the New Relic agent to establish a connection to the New Relic service, loads the data serialised by the checks, and sends it to New Relic.<\/p>\n\n<p>Just for testing, we can run our pseudo collector like this:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\"><span class=\"k\">while <\/span><span class=\"nb\">true<\/span><span class=\"p\">;<\/span> <span class=\"k\">do <\/span><span class=\"nb\">echo<\/span> <span class=\"s2\">\"Sending\"<\/span> <span class=\"o\">&amp;&amp;<\/span> ruby send.rb <span class=\"o\">&amp;&amp;<\/span> <span class=\"nb\">echo<\/span> <span class=\"s2\">\"Sleeping 30\"<\/span> <span class=\"o\">&amp;&amp;<\/span> <span class=\"nb\">sleep <\/span>30 <span class=\"p\">;<\/span> <span class=\"k\">done<\/span><\/code><\/pre><\/figure>\n\n<p>And invoke the monitoring check like this:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\"><span class=\"k\">while <\/span><span class=\"nb\">true<\/span> <span class=\"p\">;<\/span> <span class=\"k\">do <\/span><span class=\"nv\">RACK_ENV<\/span><span class=\"o\">=<\/span>development bundle <span class=\"nb\">exec <\/span>ruby main.rb 5 <span class=\"p\">;<\/span> <span class=\"k\">done<\/span><\/code><\/pre><\/figure>\n\n<p>Now we\u2019ve got all this set up, we can log into New Relic to view some pretty visualisations of our monitoring check latency:<\/p>\n\n<p><img src=\"http:\/\/farm8.staticflickr.com\/7030\/6682260777_ae93ba89f3_o.jpg\" alt=\"New Relic dashboard screenshot\" \/><\/p>\n\n<p>New Relic automatically identifies which transactions are the slowest, and lets you deep dive to identify where the 
slowness is:<\/p>\n\n<p><img src=\"http:\/\/farm8.staticflickr.com\/7005\/6682278901_ca151d8508_o.jpg\" alt=\"New Relic transaction deep dive screenshot\" \/><\/p>\n\n<p>If you haven\u2019t got a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Brass_razoo\">brass razoo<\/a> there are plenty of Open Source alternatives to New Relic, but you\u2019ll have to do a bit more grunt work to get them going.<\/p>\n\n<p>This post concludes this series on monitoring scalability! The <strong>TL;DR<\/strong> series summary:<\/p>\n\n<ul>\n  <li>Check latency is the monitoring system killer.<\/li>\n  <li>Even in simple environments check latency slows down your monitoring system and obfuscates incidents.<\/li>\n  <li>To eliminate latency, separate data collection from alerting.<\/li>\n  <li>Make your monitoring checks as non-blocking as possible.<\/li>\n  <li>Whenever debugging monitoring performance problems, think of your monitoring system as an MVC web app.<\/li>\n  <li>Instrument your monitoring checks to identify sources of latency.<\/li>\n<\/ul>\n\n<p>You can find the above code examples <a href=\"https:\/\/gist.github.com\/1598357\">on GitHub<\/a>.<\/p>\n\n<p>If you\u2019ve enjoyed this series of posts, you can find more of my keen insights, witty banter, and Australian colloquialisms <a href=\"http:\/\/twitter.com\/auxesis\">on Twitter<\/a>, or <a href=\"https:\/\/fractio.nl\/feed\">subscribe to my blog<\/a>.<\/p>\n","pubDate":"13 Jan 2012","link":"https:\/\/fractio.nl\/2012\/01\/13\/instrumenting-your-monitoring-checks-with-new-relic\/","guid":"https:\/\/fractio.nl\/2012\/01\/13\/instrumenting-your-monitoring-checks-with-new-relic\/"},{"title":"monitoring system == web app (when diagnosing performance bottlenecks)","description":"<p><em>This post is part 2 of 3 in a series on monitoring scalability.<\/em><\/p>\n\n<p>In <a href=\"https:\/\/fractio.nl\/2012\/01\/09\/monitoring-sucks-latency-sucks-more\/\">part 1<\/a> of this series I talked about check latency, and how it can 
batter you operationally if it gets out of hand.<\/p>\n\n<p>In this post I\u2019m going to propose an alternative way of looking at monitoring systems that can hopefully shed light on some typical performance bottlenecks.<\/p>\n\n<p>Architecturally, monitoring systems and web applications share many of the same design characteristics:<\/p>\n\n<ul>\n  <li>A check is a request to an action on a controller<\/li>\n  <li>Actions fetch data from a model, and expose a result through a view<\/li>\n<\/ul>\n\n<p><img src=\"http:\/\/farm8.staticflickr.com\/7009\/6672989429_1eece65303_o.jpg\" alt=\"Overview diagram of monitoring system\/web application request lifecycle\" \/><\/p>\n\n<p>If you look at monitoring systems through this prism, many monitoring performance and scalability problems become simpler to understand:<\/p>\n\n<!-- excerpt -->\n\n<ul>\n  <li>Poorly optimised actions can take a variable amount of time to return a response<\/li>\n  <li>You get the best performance out of your monitoring system by optimising actions that are slow, and working towards a consistent throughput across all your monitoring checks<\/li>\n<\/ul>\n\n<p><img src=\"http:\/\/farm8.staticflickr.com\/7016\/6672989701_30356eaf6f_o.jpg\" alt=\"Diagram explaining how latency at one end of the pipeline effects the other\" \/><\/p>\n\n<p>Bearing this in mind, what <a href=\"http:\/\/www.amazon.com\/Scalable-Internet-Architectures-Theo-Schlossnagle\/dp\/067232699X\/ref=sr_1_7?ie=UTF8&amp;qid=1326103530&amp;sr=8-7\">methodologies do we use<\/a> to remove performance bottlenecks from a web application? Can we apply those same techniques to monitoring systems?<\/p>\n\n<p>One very common technique is to precompile data to eliminate computationally expensive operations when serving up a result. 
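</p>

<p>As a sketch of what that split looks like in practice (illustrative only; the store here is a plain hash standing in for a real data store like OpenTSDB or Graphite, and the check names and thresholds are made up):</p>

```ruby
# Precompilation split: a collector does the expensive work out-of-band,
# and the check is reduced to a cheap read plus a threshold comparison.
STORE = {}

def expensive_measurement(host)
  0.95  # stand-in for polling the host, parsing output, etc.
end

# Runs on the collector's schedule, not the monitoring system's...
def collect(host)
  STORE[host] = expensive_measurement(host)
end

# ...so the check itself returns in well under a second.
def check_load(host, warning = 1.0, critical = 5.0)
  value = STORE.fetch(host) { return "UNKNOWN: no data for #{host}" }
  return "CRITICAL: load is #{value}" if value >= critical
  return "WARNING: load is #{value}" if value >= warning
  "OK: load is #{value}"
end

collect("app-01")
puts check_load("app-01")  # => "OK: load is 0.95"
```

<p>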
The precompilation should almost always be a separate process from the main process serving requests.<\/p>\n\n<p>This has multiple benefits:<\/p>\n\n<ul>\n  <li>You shift the computationally expensive and latency-inducing work in a monitoring check to a separate process. This makes achieving a low and consistent monitoring check response time vastly easier.<\/li>\n  <li>You can throw specialised hardware at particular parts of the monitoring pipeline. For example, use a SAN with a huge memory cache or SSDs exclusively in your data storage layer to speed up reads + writes, and beefy multicore machines in your alerting layer to increase your check parallelism.<\/li>\n<\/ul>\n\n<p><img src=\"http:\/\/farm8.staticflickr.com\/7027\/6672989965_43fe9411f3_o.jpg\" alt=\"Diagram explaining where to focus optimisation efforts\" \/><\/p>\n\n<p>Separating data collection + storage from thresholding + notifications <em>is the most crucial part<\/em> of ensuring consistent check throughput in your monitoring system.<\/p>\n\n<p>In September of 2011 <a href=\"http:\/\/twitter.com\/LordCope\">Stephen Nelson-Smith<\/a> covered why this separation is so important in his article <a href=\"http:\/\/agilesysadmin.net\/pillar-one\">We alert on what we draw<\/a>. The article can be boiled down to \u201cYour graphs and your alerts should be created from the same data source. This simplifies incident response and analysis.\u201d<\/p>\n\n<p>The other advantage that Stephen didn\u2019t cover was the massive throughput boost this gives your monitoring system. It\u2019s tempting to say that the throughput boost is a bigger advantage than the operational gains; however, the two are inextricably linked. 
You have massive operational issues if your monitoring system is \u201crunning late\u201d on executing monitoring checks, but you\u2019ve got <a href=\"http:\/\/en.wikipedia.org\/wiki\/William_Buckley_(convict)\">Buckley\u2019s chance<\/a> of effectively responding to incidents if you have no visibility of those incidents.<\/p>\n\n<p>My preference is to collect + store the data with collectd + OpenTSDB, however the DevOps community as a whole seems to be very keen on Ganglia + Graphite. YMMV, do your research and use what\u2019s best for you.<\/p>\n\n<p>The most time-consuming part of adopting this separation strategy is reworking your monitoring checks to fetch from these data stores. I\u2019d highly recommend writing a small DSL for doing common things like fetching data and comparing results.<\/p>\n\n<p>No approach is perfect, and separating your data from your alerting introduces a different set of problems.<\/p>\n\n<p>Even by separating the collection from the alerting, your monitoring checks are still essentially going to block when retrieving data from your storage layer. Keeping in mind you will never be able to truly eliminate blocking checks, it is <em>imperative<\/em> you ensure these new checks block as little as possible, otherwise you\u2019ll be subjecting yourself to the same problems.<\/p>\n\n<p>Write your checks with the expectation that your data store <em>will become unreachable.<\/em> The biggest drawback to separation is that when your data store becomes unreachable, all of your checks will fail <em>simultaneously.<\/em><\/p>\n\n<p><img src=\"http:\/\/farm8.staticflickr.com\/7025\/6672990219_2e4c6b7c38_o.jpg\" alt=\"Diagram explaining where things will break\" \/><\/p>\n\n<p>Operationally this can be a complete nightmare. 
I have seen many a pager and mobile phone melt under a deluge of notifications saying that data for a check could not be read.<\/p>\n\n<p>There are two workarounds for this problem:<\/p>\n\n<ul>\n  <li>Set up a parent check for <em>all<\/em> your monitoring checks that simply reads a value out of the data store, and goes critical if the data store can\u2019t be accessed. If your monitoring system does parenting properly and you have a good check throughput, this should minimise the explosion of alerts.<\/li>\n  <li>Build a manual or automatic notification kill switch into your monitoring system so if the shit does hit the fan and your storage layer disappears, you don\u2019t suffer from information overload and <a href=\"http:\/\/www.popularmechanics.com\/technology\/aviation\/crashes\/what-really-happened-aboard-air-france-447-6611877\">do something fatally stupid<\/a>.<\/li>\n<\/ul>\n\n<p>So how do you ensure your monitoring checks aren\u2019t suffering from check latency?<\/p>\n\n<p>In the <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2012\/01\/13\/instrumenting-your-monitoring-checks-with-new-relic\/\">next post<\/a> in this series, we\u2019ll look at instrumenting your monitoring checks themselves to identify which parts of the checks have bottlenecks.<\/p>\n","pubDate":"11 Jan 2012","link":"https:\/\/fractio.nl\/2012\/01\/11\/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks\/","guid":"https:\/\/fractio.nl\/2012\/01\/11\/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks\/"},{"title":"Monitoring Sucks. 
Latency Sucks More.","description":"<p><em>This post is part 1 of 3 in a series on monitoring scalability.<\/em><\/p>\n\n<p>The <a href=\"http:\/\/lusislog.blogspot.com\/2011\/06\/why-monitoring-sucks.html\">Monitoring Sucks<\/a> conversation has been an awesome step in the <a href=\"https:\/\/github.com\/monitoringsucks\">right direction<\/a> for defining a common language for describing monitoring concepts and documenting the available tools.<\/p>\n\n<p>The reasons monitoring sucks are many and varied - poor configuration, poor visualisation, poor scalability, poor data retention - there is a lot of well-founded hate for the available tools (some of which I have authored!)<\/p>\n\n<p>I want to take a closer look into a problem I grapple with on a daily basis as part of my job: monitoring scalability.<\/p>\n\n<p>What do I mean by \u201cmonitoring scalability\u201d?<\/p>\n\n<p>For a monitoring system to be considered scalable, I would expect it to execute large volumes of monitoring checks under a variety of conditions (good + bad) with a consistent throughput.<\/p>\n\n<p>Why is monitoring scalability a problem? Are there deeper, subtler problems that underlie monitoring system architectures in general?<\/p>\n\n<!-- excerpt -->\n\n<p>Nagios handles 6000+ checks like a champ. I say this with a completely straight face. At <a href=\"http:\/\/bulletproof.net\">Bulletproof<\/a>, we have several large instances of Nagios that have been running for years with thousands of checks.<\/p>\n\n<p>There is one caveat, and it is pretty massive - if your monitoring checks take a variable amount of time to return a result (they have high check latency), you will get reduced throughput, and thus your incident response times become unreliable. 
This leads to a lack of trust in the monitoring system which can kill you operationally if you don\u2019t nip it in the bud.<\/p>\n\n<p>Let\u2019s work through some of the scalability problems by looking at a hypothetical and simplified monitoring system:<\/p>\n\n<p>Imagine you have a very small monitoring system with 150 checks running. The type of check is irrelevant (in Nagios parlance they could be \u201cservice\u201d or \u201chost\u201d checks), however each check is scheduled to be executed every 300 seconds (for the sake of argument, let\u2019s just ignore that a 300 second interval is <em>way<\/em> too long).<\/p>\n\n<p>To simplify this hypothetical, let\u2019s posit that all the checks are running serially in a single thread, and each check takes 1 second to execute and return a result.<\/p>\n\n<p>At this point, you\u2019re golden. All checks are executing in 150 seconds, well within the 300 second window.<\/p>\n\n<p>Now double the number of checks to 300.<\/p>\n\n<p>That\u2019s one check executed every second. All the checks execute within the execution window, but things are getting tight, and you don\u2019t have any spare capacity to add more checks.<\/p>\n\n<p>Worst of all: <em>what happens when the check response time goes up to 2 seconds?<\/em> Now you can only execute 50% of your checks within the 300 second window, and your monitoring is 300 seconds \u201cbehind\u201d.<\/p>\n\n<p>Now you\u2019re suffering from <em>check latency<\/em> - a world of pain filled with plenty of insidious edge cases to cut yourself on.<\/p>\n\n<p>My favourite edge case is when a service failure occurs just after a check has executed and returned an OK result. In the above hypothetical, you would be unaware of the failure for 599 seconds. In a monitoring system suffering heavily from check latency, that period of time could be much much longer. 
Furthermore, the problem is amplified when you\u2019re using soft\/hard states to eliminate false-positives.<\/p>\n\n<p>The above hypothetical is a tad contrived as pretty much all monitoring systems execute checks in parallel, but it illustrates the scalability challenges even in a simple scenario.<\/p>\n\n<p>Executing checks in parallel certainly helps stave off this type of bottleneck, but as you increase the number of checks and the parallelism of your monitoring system, you start running into operating system limitations such as context switching, memory exhaustion (if you use a language that gobbles up memory), or simply running out of CPU time to execute all the checks.<\/p>\n\n<p>The other enormous gotcha is that when catastrophic failures happen, it\u2019s very common to have monitoring checks that simply timeout because various network resources between your monitoring server and the machine you\u2019re checking are down or misbehaving.<\/p>\n\n<p>The last thing you want in an emergency situation is delayed alerts that may hide the root cause or feed you bad information.<\/p>\n\n<p>So how do you mitigate check latency problems to improve your monitoring scalability?<\/p>\n\n<p>In the <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2012\/01\/11\/monitoring-system-equal-web-app-when-diagnosing-performance-bottlenecks\/\">next post<\/a> in this series, we\u2019ll look at monitoring systems as a type of complex web application, and investigate some performance optimisation techniques you can apply.<\/p>\n","pubDate":"09 Jan 2012","link":"https:\/\/fractio.nl\/2012\/01\/09\/monitoring-sucks-latency-sucks-more\/","guid":"https:\/\/fractio.nl\/2012\/01\/09\/monitoring-sucks-latency-sucks-more\/"},{"title":"Treetop PEG for Puppet resources","description":"<p>Earlier this year at Puppet Camp EU, <a href=\"http:\/\/twitter.com\/sonofhans\">Randall Hansen<\/a> ran an open space session on improving the Puppet user experience.<\/p>\n\n<p>Lots of sharp edges were 
identified, but one issue that I raised was the annoying need for trailing commas to break up parameters in resource declarations.<\/p>\n\n<p>I chatted about this briefly with <a href=\"http:\/\/twitter.com\/puppetmasterd\">Luke<\/a> and for a laugh I decided to write a <a href=\"http:\/\/treetop.rubyforge.org\/\">Treetop<\/a> Parsing Expression Grammar (PEG) for Puppet resources that supported newlines as the parameter delimiter:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"c1\"># puppet.treetop<\/span>\n<span class=\"n\">grammar<\/span> <span class=\"no\">Puppet<\/span>\n  <span class=\"n\">rule<\/span> <span class=\"n\">resource<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"n\">type<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"nb\">open<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"nb\">name<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"n\">parameters<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"n\">close<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"p\">{<\/span>\n      <span class=\"k\">def<\/span> <span class=\"nf\">resource_type<\/span>\n        <span class=\"n\">type<\/span><span class=\"p\">.<\/span><span class=\"nf\">text_value<\/span>\n      <span class=\"k\">end<\/span>\n\n      <span class=\"k\">def<\/span> <span class=\"nf\">resource_name<\/span>\n        <span class=\"nb\">name<\/span><span class=\"p\">.<\/span><span class=\"nf\">word<\/span><span class=\"p\">.<\/span><span class=\"nf\">text_value<\/span>\n      <span class=\"k\">end<\/span>\n    <span class=\"p\">}<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">type<\/span>\n    <span class=\"n\">word<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"nb\">open<\/span>\n    <span class=\"s2\">\"{\"<\/span>\n  <span 
class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">close<\/span>\n    <span class=\"s2\">\"}\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"nb\">name<\/span>\n    <span class=\"n\">quotes<\/span> <span class=\"n\">word<\/span> <span class=\"n\">quotes<\/span> <span class=\"s2\">\":\"<\/span>\n    <span class=\"p\">{<\/span>\n      <span class=\"k\">def<\/span> <span class=\"nf\">name<\/span>\n        <span class=\"n\">word<\/span>\n      <span class=\"k\">end<\/span>\n    <span class=\"p\">}<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">word<\/span>\n    <span class=\"p\">[<\/span><span class=\"n\">a<\/span><span class=\"o\">-<\/span><span class=\"n\">zA<\/span><span class=\"o\">-<\/span><span class=\"no\">Z<\/span><span class=\"p\">]<\/span><span class=\"o\">+<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">quotes<\/span>\n   <span class=\"s2\">\"'\"<\/span> <span class=\"o\">\/<\/span> <span class=\"s1\">'\"'<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">parameters<\/span>\n    <span class=\"n\">newline<\/span><span class=\"o\">*<\/span> <span class=\"p\">(<\/span><span class=\"n\">whitespace<\/span> <span class=\"n\">parameter<\/span> <span class=\"n\">comma_or_newline<\/span><span class=\"o\">*<\/span><span class=\"p\">)<\/span><span class=\"o\">*<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">parameter<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"n\">word<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"n\">arrow<\/span>\n    <span class=\"n\">whitespace<\/span>\n    <span class=\"n\">word<\/span>\n    <span class=\"n\">whitespace<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">arrow<\/span>\n    <span 
class=\"s2\">\"=&gt;\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">comma_or_newline<\/span>\n    <span class=\"n\">comma<\/span> <span class=\"o\">\/<\/span> <span class=\"n\">newline<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">comma<\/span>\n    <span class=\"s2\">\",\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">newline<\/span>\n    <span class=\"s2\">\"<\/span><span class=\"se\">\\n<\/span><span class=\"s2\">\"<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"n\">rule<\/span> <span class=\"n\">whitespace<\/span>\n    <span class=\"s2\">\"<\/span><span class=\"se\">\\s<\/span><span class=\"s2\">\"<\/span><span class=\"o\">*<\/span> <span class=\"sr\">\/ \"\\n\"+\n  end\nend<\/span><\/code><\/pre><\/figure>\n\n<p>It\u2019s throwaway code, but as far as I\u2019m aware it\u2019s relatively idiomatic Treetop.<\/p>\n\n<p>It came in handy earlier this week when explaining PEGs to a <a href=\"http:\/\/twitter.com\/jessereynolds\">new recruit<\/a> into the R&amp;D team at <a href=\"http:\/\/bulletproof.net\">work<\/a>.<\/p>\n\n<p>Said recruit suggested that I publish it, as there aren\u2019t too many examples of Treetop PEGs floating around.<\/p>\n\n<p>To run the PEG over an example snippet:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"c1\">#!\/usr\/bin\/env ruby<\/span>\n\n<span class=\"nb\">require<\/span> <span class=\"s1\">'rubygems'<\/span>\n<span class=\"nb\">require<\/span> <span class=\"s1\">'bundler\/setup'<\/span>\n<span class=\"nb\">require<\/span> <span class=\"s1\">'polyglot'<\/span>\n<span class=\"nb\">require<\/span> <span class=\"s1\">'treetop'<\/span>\n\n<span class=\"no\">Treetop<\/span><span class=\"p\">.<\/span><span class=\"nf\">load<\/span> <span class=\"s2\">\"puppet\"<\/span>\n\n<span class=\"n\">snippet<\/span> <span 
class=\"o\">=<\/span> <span class=\"o\">&lt;&lt;-<\/span><span class=\"no\">SNIPPET<\/span><span class=\"sh\">\n  package { \"foobar\":\n    ensure =&gt; present, another =&gt; bar, spoons =&gt; doom\n    foo    =&gt; bar\n  }\n<\/span><span class=\"no\">SNIPPET<\/span>\n\n<span class=\"n\">parser<\/span> <span class=\"o\">=<\/span> <span class=\"no\">PuppetParser<\/span><span class=\"p\">.<\/span><span class=\"nf\">new<\/span>\n<span class=\"k\">if<\/span> <span class=\"vi\">@root<\/span> <span class=\"o\">=<\/span> <span class=\"n\">parser<\/span><span class=\"p\">.<\/span><span class=\"nf\">parse<\/span><span class=\"p\">(<\/span><span class=\"n\">snippet<\/span><span class=\"p\">.<\/span><span class=\"nf\">strip<\/span><span class=\"p\">)<\/span>\n  <span class=\"nb\">puts<\/span> <span class=\"s1\">'success'<\/span>\n  <span class=\"nb\">p<\/span> <span class=\"vi\">@root<\/span><span class=\"p\">.<\/span><span class=\"nf\">resource_type<\/span>\n  <span class=\"nb\">p<\/span> <span class=\"vi\">@root<\/span><span class=\"p\">.<\/span><span class=\"nf\">resource_name<\/span>\n<span class=\"k\">else<\/span>\n  <span class=\"nb\">puts<\/span> <span class=\"s1\">'failure'<\/span>\n  <span class=\"nb\">puts<\/span> <span class=\"n\">parser<\/span><span class=\"p\">.<\/span><span class=\"nf\">failure_reason<\/span>\n  <span class=\"nb\">puts<\/span> <span class=\"n\">parser<\/span><span class=\"p\">.<\/span><span class=\"nf\">failure_column<\/span>\n  <span class=\"nb\">puts<\/span> <span class=\"n\">parser<\/span><span class=\"p\">.<\/span><span class=\"nf\">failure_line<\/span>\n  <span class=\"nb\">puts<\/span> <span class=\"n\">parser<\/span><span class=\"p\">.<\/span><span class=\"nf\">failure_index<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>Gemfile for running it and all the above code is in a <a href=\"https:\/\/gist.github.com\/d4fb0770775086145db0\">Gist<\/a>.<\/p>\n","pubDate":"13 Oct 
2011","link":"https:\/\/fractio.nl\/2011\/10\/13\/treetop\/","guid":"https:\/\/fractio.nl\/2011\/10\/13\/treetop\/"},{"title":"Standing desk adventures","description":"<p>I\u2019ve been using a standing desk for a bit over four months now, and thus far it\u2019s been quite successful.<\/p>\n\n<p>I decided to test out the idea because I spend a lot of time in front of the screen, and my back has been getting progressively sorer over the last year. Not wanting to transform into the Hunchback of Notre Dame before I turn 30, and aware of the current research suggesting that sitting down for long stretches <a href=\"http:\/\/www.abc.net.au\/news\/2011-03-21\/sit-less-to-lower-heart-disease-risk\/2649106\">increases the risk of heart disease<\/a>, a standing desk seemed like a good alternative.<\/p>\n\n<p>Because I work from home and the office, I actually have two standing desk setups - one for each location.<\/p>\n\n<p>The home setup is incredibly makeshift, with the monitor placed on a box of our things left over from the last move, and the laptop sitting on a discontinued IKEA storage box not too dissimilar from the current <a href=\"http:\/\/www.ikea.com\/au\/en\/catalog\/products\/10184838\/\">Prant<\/a> offering.<\/p>\n\n<p>The reason for the home setup dodginess is twofold:<\/p>\n\n<ul>\n  <li>I wanted to try out the standing desk thing without a large financial commitment<\/li>\n  <li>We\u2019re between houses, and have to use what we\u2019ve got on hand<\/li>\n<\/ul>\n\n<p>Once I discovered that the standing desk was the way I wanted to work for the foreseeable future, I decided to up the ante and buy a real desk for work.<\/p>\n\n<p>There are plenty of purpose built standing desk options that are far beyond my budget, so the search was on for finding a desk for a reasonable price.<\/p>\n\n<p>I stumbled across a Frankenstein IKEA desk <a href=\"http:\/\/lifehacker.com\/5739296\/build-a-diy-wide-adjustable-height-ikea-standing-desk-on-the-cheap\">on 
Lifehacker<\/a>, but it:<\/p>\n\n<ul>\n  <li>Was too long for the space in the office<\/li>\n  <li>Required a non-trivial amount of construction with tools not on hand<\/li>\n<\/ul>\n\n<p>The same Lifehacker article linked to <a href=\"http:\/\/thingsthatwelearn.com\/#334875\/Standing-Desk-Project\">another blog<\/a> about repurposing an Utby kitchen table as a standing desk:<\/p>\n\n<p><img src=\"http:\/\/farm7.static.flickr.com\/6162\/6206718881_5b11409b7b.jpg\" alt=\"Standing desktop and base\" \/><\/p>\n\n<p>This was the model I settled on.<\/p>\n\n<!-- excerpt -->\n\n<p>The Utby kitchen table is sold as two separate products, the 105cm high <a href=\"http:\/\/www.ikea.com\/aa\/en\/catalog\/products\/20172236\">stainless steel underframe<\/a> (not to be confused with the 90cm one), and the 120x60x3.4cm <a href=\"http:\/\/www.ikea.com\/aa\/en\/catalog\/products\/40162222\/#\/40162222\/\">Vika Amon table top<\/a>.<\/p>\n\n<p>At the time the local IKEA store did not have stock of the Vika Amon table top, so based on the advice of a shop assistant I picked up the <a href=\"http:\/\/www.ikea.com\/aa\/en\/catalog\/products\/S49871209\/#\/S49871209\/\">Galant table top<\/a> instead, minus the normal Galant frame.<\/p>\n\n<p>I was assured in-store this would be a reasonable substitute, however as I was finishing off the construction in the office I discovered that the Galant table top I purchased is 2cm thick, as opposed to the Vika Amon which is 3.4cm. 
This meant that the supplied screws to mount the table top to the underframe would have broken through the surface, so alternative screws are required.<\/p>\n\n<p>Chalk that one up to a lack of research.<\/p>\n\n<p>To date I\u2019m very impressed with the desk, with the bottom rail of the frame providing a suitable place to rest my feet against and store my bag behind.<\/p>\n\n<p><a href=\"http:\/\/www.flickr.com\/photos\/auxesis\/6094908180\/\"><img src=\"http:\/\/farm7.static.flickr.com\/6073\/6094908180_d86925410f.jpg\" alt=\"Standing desk at office\" \/><\/a><\/p>\n\n<p>As for the longer term effects of using a standing desk 12 hours a day 5 days a week: I haven\u2019t been able to find many others sharing their experiences. The sum of what you generally read online is \u201cI just switched over to a standing desk a few hours ago and it\u2019s feeling great!!!\u201d.<\/p>\n\n<p>In my experience, the two key factors are:<\/p>\n\n<ul>\n  <li>Start out with a comfortable, well worn pair of shoes<\/li>\n  <li>Make sure the surface you stand on isn\u2019t too hard (timber floorboards or carpet are best)<\/li>\n<\/ul>\n\n<p>My desk at home has me standing on tiles in Chucks, and boy can I feel it if I\u2019m working long hours. 
If I use the desk for more than 12 hours at a time, my feet start aching pretty badly.<\/p>\n\n<p>I see this as a good thing though: if my feet are aching, it means I need to stop work for the day.<\/p>\n\n<p>I\u2019ve tried working around this by wearing in different types of shoes (Birkenstock Arizonas and <a href=\"http:\/\/www.cbdcycles.com.au\/clothingdetails.php?id=47\">Shimano SH-MT40s<\/a>), but I generally end up with a pretty <a href=\"http:\/\/twitter.com\/#!\/auxesis\/status\/118224439325368321\">nasty headache within an hour<\/a>.<\/p>\n\n<p>Once we move into our new house I\u2019ll be working on timber floorboards, so I have my fingers crossed that the pain will ease up.<\/p>\n\n<p>I find that I\u2019m shifting my weight between legs every 5-10 minutes, and am much more inclined to bop along to music now that I\u2019m standing up.<\/p>\n\n<p>There\u2019s also an unspoken advantage to having a standing desk in a busy office environment: <em>people will interrupt you for much shorter periods of time if they have nowhere to sit<\/em>.<\/p>\n\n<p>If you\u2019re doing pair programming this can be pretty brutal on your partner if they\u2019re not used to standing up for long stretches, but it has a distinct advantage when you\u2019re trying to shut the world out and keep in the zone.<\/p>\n\n<p>I also find that by the end of the day I have a mild tingly sensation in my calves, not too dissimilar from the sensation felt when returning from a bike ride with lots of climbs.<\/p>\n\n<p>Since setting up the standing desk in the office, there are two more setups that have shown up on <a href=\"http:\/\/www.ikeahackers.net\/\">IKEA Hackers<\/a>:<\/p>\n\n<ul>\n  <li><a href=\"http:\/\/www.ikeahackers.net\/2011\/09\/sitting-standing-desk-combo.html\">An Expedit-based sitting\/standing desk combo<\/a><\/li>\n  <li><a href=\"http:\/\/www.ikeahackers.net\/2011\/09\/another-expedit-standing-desk-with-cds.html\">Another Expedit-based standing desk with CDs as 
risers<\/a><\/li>\n<\/ul>\n\n<p>I quite like the idea of the extra storage space gained with the CD-riser design, and may opt for that design when we move.<\/p>\n\n<p>Would I go back to sitting at a desk? Not in the foreseeable future.<\/p>\n\n<p>I find most of my back pain has gone, and I now value sitting a lot more. :-)<\/p>\n\n<p>Would I recommend standing desks for others? If you are working long hours in front of a screen and have trouble finding a comfortable setup to work from, it might be worth a shot.<\/p>\n\n","pubDate":"03 Oct 2011","link":"https:\/\/fractio.nl\/2011\/10\/03\/standing-desk-adventures\/","guid":"https:\/\/fractio.nl\/2011\/10\/03\/standing-desk-adventures\/"},{"title":"Devops Down Under 2011 videos online","description":"<p>The final videos from Devops Down Under 2011 have just been <a href=\"http:\/\/vimeo.com\/devopsdownunder\/videos\">uploaded to Vimeo<\/a>.<\/p>\n\n<p>The list of videos is:<\/p>\n\n<ul>\n  <li><a href=\"http:\/\/vimeo.com\/28291845\">Patrick Debois - Keynote<\/a><\/li>\n  <li><a href=\"http:\/\/vimeo.com\/27931770\">Jason Yip - Kanban for IT Operations<\/a><\/li>\n  <li><a href=\"http:\/\/vimeo.com\/28001362\">Nish Mahanty - Building high-performing teams (to deliver awesome business outcomes)<\/a><\/li>\n  <li><a href=\"http:\/\/vimeo.com\/27613690\">Panel Discussion - How do I sell Lean\/Kanban\/Devops to my boss?<\/a><\/li>\n  <li><a href=\"http:\/\/vimeo.com\/27922221\">Robert Postill - Sprinkling DevOps Magic In Other People\u2019s Environments<\/a><\/li>\n  <li><a href=\"http:\/\/vimeo.com\/27643817\">Leni May - Feature flipping and Heroku-style deployment at learnable.com<\/a><\/li>\n  <li><a href=\"http:\/\/vimeo.com\/27709632\">Garrett Honeycutt - How Puppet fits into the larger ecosystem<\/a><\/li>\n  <li><a href=\"http:\/\/vimeo.com\/27640194\">Wade Millican - Downtime, stateful systems, and using clouds to stop rain on your parade<\/a><\/li>\n<\/ul>\n\n<p>Thanks again to all this year\u2019s sponsors, 
speakers, and attendees for making the conference awesome!<\/p>\n","pubDate":"30 Aug 2011","link":"https:\/\/fractio.nl\/2011\/08\/30\/devops-down-under-videos-online\/","guid":"https:\/\/fractio.nl\/2011\/08\/30\/devops-down-under-videos-online\/"},{"title":"On assholes, ideas, and actions","description":"<p>Almost 2 weeks ago Rusty Russell wrote about <a href=\"http:\/\/rusty.ozlabs.org\/?p=196\">assholes<\/a> in the Open Source community.<\/p>\n\n<p>Some <a href=\"http:\/\/jacobian.org\/writing\/assholes\/\">people<\/a> interpreted Rusty\u2019s post as a tacit toleration of assholes, with <a href=\"http:\/\/mdzlog.alcor.net\/\">Matt Zimmerman<\/a> commenting on the Google Reader post I shared:<\/p>\n\n<blockquote>\nIf we didn't tolerate assholes in our community, we would still be producing great software.\n<\/blockquote>\n\n<p>I don\u2019t believe Rusty\u2019s point is that \u201ctolerance of assholes is a necessary evil\u201d. His point is that people of all walks of life are flawed, and the tech community is far from being a utopian exception.<\/p>\n\n<p>The distinction is between ideas and actions.<\/p>\n\n<p>I have no problem with people holding ideas or beliefs I find offensive or plain wrong. It\u2019s that lack of homogeneity that makes life interesting and keeps me thinking. It becomes an issue if said person takes action on those ideas and that action hurts other people.<\/p>\n\n<p>Hence, if you\u2019re being an asshole to me or someone else I won\u2019t tolerate that and will call you out on it. 
If you believe that the world is flat or that fasting cures cancer I\u2019ll vigorously argue against your position but I won\u2019t wield the ban hammer on the work you do.<\/p>\n","pubDate":"01 Jun 2011","link":"https:\/\/fractio.nl\/2011\/06\/01\/assholes-ideas-actions\/","guid":"https:\/\/fractio.nl\/2011\/06\/01\/assholes-ideas-actions\/"},{"title":"Getting consistent media source access on Boxee with CIFS","description":"<p>Although a bit late to the party, I\u2019ve been using <a href=\"http:\/\/boxee.tv\">Boxee<\/a> at home for watching movies and TV shows for almost a year now. Its indexing engine is pretty accurate, and the killer feature has to be the <a href=\"http:\/\/blog.boxee.tv\/2009\/03\/15\/boxee-iphone-remote-app-available-on-the-app-store\/\">Boxee remote app<\/a> for the iPhone.<\/p>\n\n<p>I\u2019ve tried using <a href=\"http:\/\/xbmc.org\/\">XBMC<\/a> but found it suffers from the common open source problem of exposing all its knobs and dials to make everything infinitely tweakable for the end user (also known as: the KDE user experience). Although the free Boxee release is drifting towards an unmaintained state now that Boxee are focusing on the <a href=\"http:\/\/boxee.tv\/box\">Boxee Box<\/a>, it\u2019s simple and stable enough to use that everyone in my family can get their heads around it in under a few minutes.<\/p>\n\n<p>The one bug that\u2019s consistently annoyed me has been disappearing CIFS shares when a configured media source (a file server) exists on a <a href=\"http:\/\/forums.boxee.tv\/showthread.php?t=23064\">different subnet to the Boxee<\/a> itself. This only happened recently when we moved house and put our file server on a separate subnet that\u2019s not DHCP serviced.<\/p>\n\n<p>I scratched my head for a little while on how to solve this. 
The obvious solution is to shorten the DHCP lease range and put the file server on the same subnet as the Boxee machine, but I\u2019m cautious of what other CIFS bugs lurk beneath the surface, and it feels vastly more complicated than it should be.<\/p>\n\n<p>The simplest solution I could find was to pass off handling CIFS shares to Linux and add a plain old directory as a media source:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\"><span class=\"c\"># \/etc\/fstab: static file system information.<\/span>\n<span class=\"c\">#<\/span>\n<span class=\"c\"># &lt;file system&gt;         &lt;mount point&gt;   &lt;type&gt;  &lt;options&gt;           &lt;dump&gt;  &lt;pass&gt;<\/span>\nproc                    \/proc           proc    nodev,noexec,nosuid 0       0\n\/dev\/sda1               \/               ext4    <span class=\"nv\">errors<\/span><span class=\"o\">=<\/span>remount-ro   0       1\n\/dev\/sda5               none            swap    sw                  0       0\n\/\/jules\/videos          \/media\/videos   cifs    guest               0       0<\/code><\/pre><\/figure>\n\n<p>This bypasses Boxee\u2019s dodgy CIFS code completely, and has the added benefit of abstracting away the source from Boxee, making it vastly easier to shuffle file shares and disks around without Boxee getting upset.<\/p>\n","pubDate":"22 May 2011","link":"https:\/\/fractio.nl\/2011\/05\/22\/getting-consistent-media-source-access-on-boxee-with-cifs\/","guid":"https:\/\/fractio.nl\/2011\/05\/22\/getting-consistent-media-source-access-on-boxee-with-cifs\/"},{"title":"Testing daemons with Cucumber","description":"<p>Sometimes when writing daemons and doing outside-in testing, you want to fire up the daemon in the background, interact with it, and test the interactions match the behaviour you\u2019re expecting.<\/p>\n\n<p>The biggest challenge is starting and stopping the daemon reliably - you don\u2019t want daemons hanging around between tests, consuming 
extra resources, and mucking up state.<\/p>\n\n<p>Your test might look something like this:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\"><span class=\"kd\">Feature<\/span><span class=\"p\">:<\/span> Some daemon\n  As an operator\n  I want to run my daemon\n  And have it do things\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Command line tool\n    <span class=\"nf\">When <\/span>I start my daemon with <span class=\"s\">\"kelpie start\"<\/span>\n    <span class=\"nf\">Then <\/span>a daemon called <span class=\"s\">\"kelpie\"<\/span> should be running\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Seeing where my daemon is getting its data from\n    <span class=\"nf\">When <\/span>I start my daemon with <span class=\"s\">\"kelpie start\"<\/span>\n    <span class=\"nf\">Then <\/span>I should see <span class=\"s\">\"Using data from \/.*mysql\"<\/span> on the terminal<\/code><\/pre><\/figure>\n\n<p>I\u2019ve found this problem can be handled quite nicely with <code class=\"language-plaintext highlighter-rouge\">IO.popen<\/code> and an <code class=\"language-plaintext highlighter-rouge\">at_exit<\/code> callback.<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"no\">When<\/span> <span class=\"sr\">\/^I start my daemon with \"([^\"]*)\"$\/<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">cmd<\/span><span class=\"o\">|<\/span>\n  <span class=\"vi\">@root<\/span> <span class=\"o\">=<\/span> <span class=\"no\">Pathname<\/span><span class=\"p\">.<\/span><span class=\"nf\">new<\/span><span class=\"p\">(<\/span><span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">dirname<\/span><span class=\"p\">(<\/span><span class=\"kp\">__FILE__<\/span><span class=\"p\">)).<\/span><span class=\"nf\">parent<\/span><span class=\"p\">.<\/span><span class=\"nf\">parent<\/span><span 
class=\"p\">.<\/span><span class=\"nf\">expand_path<\/span>\n  <span class=\"n\">command<\/span> <span class=\"o\">=<\/span> <span class=\"s2\">\"<\/span><span class=\"si\">#{<\/span><span class=\"vi\">@root<\/span><span class=\"p\">.<\/span><span class=\"nf\">join<\/span><span class=\"p\">(<\/span><span class=\"s1\">'bin'<\/span><span class=\"p\">)<\/span><span class=\"si\">}<\/span><span class=\"s2\">\/<\/span><span class=\"si\">#{<\/span><span class=\"n\">cmd<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span>\n\n  <span class=\"vi\">@pipe<\/span> <span class=\"o\">=<\/span> <span class=\"no\">IO<\/span><span class=\"p\">.<\/span><span class=\"nf\">popen<\/span><span class=\"p\">(<\/span><span class=\"n\">command<\/span><span class=\"p\">,<\/span> <span class=\"s2\">\"r\"<\/span><span class=\"p\">)<\/span>\n  <span class=\"nb\">sleep<\/span> <span class=\"mi\">2<\/span> <span class=\"c1\"># so the daemon has a chance to boot<\/span>\n\n  <span class=\"c1\"># clean up the daemon when the tests finish<\/span>\n  <span class=\"nb\">at_exit<\/span> <span class=\"k\">do<\/span>\n    <span class=\"no\">Process<\/span><span class=\"p\">.<\/span><span class=\"nf\">kill<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"KILL\"<\/span><span class=\"p\">,<\/span> <span class=\"vi\">@pipe<\/span><span class=\"p\">.<\/span><span class=\"nf\">pid<\/span><span class=\"p\">)<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"no\">Then<\/span> <span class=\"sr\">\/^a daemon called \"([^\"]*)\" should be running$\/<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">daemon<\/span><span class=\"o\">|<\/span>\n  <span class=\"sb\">`ps -eo cmd |grep ^<\/span><span class=\"si\">#{<\/span><span class=\"n\">daemon<\/span><span class=\"si\">}<\/span><span class=\"sb\">`<\/span><span class=\"p\">.<\/span><span class=\"nf\">size<\/span><span class=\"p\">.<\/span><span class=\"nf\">should<\/span> <span 
class=\"o\">&gt;<\/span> <span class=\"mi\">0<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>The other way to do this is with the usual backtick method, and poke at <code class=\"language-plaintext highlighter-rouge\">$?<\/code>.<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"n\">command<\/span> <span class=\"o\">=<\/span> <span class=\"s2\">\"kelpie\"<\/span>\n<span class=\"n\">output<\/span> <span class=\"o\">=<\/span> <span class=\"sb\">`<\/span><span class=\"si\">#{<\/span><span class=\"n\">command<\/span><span class=\"si\">}<\/span><span class=\"sb\">`<\/span>\n<span class=\"n\">pid<\/span> <span class=\"o\">=<\/span> <span class=\"vg\">$?<\/span><span class=\"p\">.<\/span><span class=\"nf\">pid<\/span>\n\n<span class=\"nb\">at_exit<\/span> <span class=\"k\">do<\/span>\n  <span class=\"no\">Process<\/span><span class=\"p\">.<\/span><span class=\"nf\">kill<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"KILL\"<\/span><span class=\"p\">,<\/span> <span class=\"n\">pid<\/span><span class=\"p\">)<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>The issue here is blocking - if the daemon is doing its job, that command won\u2019t return at all, and you certainly won\u2019t see any output from the command.<\/p>\n\n<p><code class=\"language-plaintext highlighter-rouge\">IO.popen<\/code>\u2019s main advantage here is that it spawns a subprocess to execute the command, which won\u2019t block Ruby.<\/p>\n\n<p>So how do you get at the output of the daemon? 
Easy - we\u2019ve bound the <code class=\"language-plaintext highlighter-rouge\">IO.popen<\/code> instance to <code class=\"language-plaintext highlighter-rouge\">@pipe<\/code>, so we can just interact with that.<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"no\">Then<\/span> <span class=\"sr\">\/^I should see \"([^\"]*)\" on the terminal$\/<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">string<\/span><span class=\"o\">|<\/span>\n  <span class=\"n\">output<\/span> <span class=\"o\">=<\/span> <span class=\"vi\">@pipe<\/span><span class=\"p\">.<\/span><span class=\"nf\">read<\/span><span class=\"p\">(<\/span><span class=\"mi\">250<\/span><span class=\"p\">)<\/span>\n  <span class=\"n\">output<\/span><span class=\"p\">.<\/span><span class=\"nf\">should<\/span> <span class=\"o\">=~<\/span> <span class=\"sr\">\/<\/span><span class=\"si\">#{<\/span><span class=\"n\">string<\/span><span class=\"si\">}<\/span><span class=\"sr\">\/<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>The above code is fairly naive and you\u2019ll have to tweak just how much data you read, otherwise Ruby will block on reading that pipe.<\/p>\n\n<p>The last gotcha is that Ruby buffers output to <code class=\"language-plaintext highlighter-rouge\">STDOUT<\/code> by default, so if the daemon you\u2019re testing is also written in Ruby, you may not see anything on that pipe even though the daemon has executed its <code class=\"language-plaintext highlighter-rouge\">puts<\/code> and <code class=\"language-plaintext highlighter-rouge\">print<\/code> statements.<\/p>\n\n<p>You can disable buffered output by including this statement somewhere in your daemon (I like putting it just after <code class=\"language-plaintext highlighter-rouge\">require<\/code>s):<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"vg\">$stdout<\/span><span 
class=\"p\">.<\/span><span class=\"nf\">sync<\/span> <span class=\"o\">=<\/span> <span class=\"kp\">true<\/span><\/code><\/pre><\/figure>\n\n","pubDate":"14 Sep 2010","link":"https:\/\/fractio.nl\/2010\/09\/14\/testing-daemons-with-cucumber\/","guid":"https:\/\/fractio.nl\/2010\/09\/14\/testing-daemons-with-cucumber\/"},{"title":"The times, they are a-changin'","description":"<object width=\"720\" height=\"577.5\" style=\"margin: auto;\"><param name=\"movie\" value=\"http:\/\/www.youtube.com\/v\/6ncCyL_g28I?fs=1&amp;hl=en_US\" \/>&lt;\/param&gt;<param name=\"allowFullScreen\" value=\"true\" \/>&lt;\/param&gt;<param name=\"allowscriptaccess\" value=\"always\" \/>&lt;\/param&gt;<embed src=\"http:\/\/www.youtube.com\/v\/6ncCyL_g28I?fs=1&amp;hl=en_US\" type=\"application\/x-shockwave-flash\" allowscriptaccess=\"always\" allowfullscreen=\"true\" width=\"720\" height=\"577.5\" \/>&lt;\/embed&gt;<\/object>\n\n<p>It\u2019s been a stupidly long time since I\u2019ve updated this thing, so here it goes!<\/p>\n\n<p><a href=\"http:\/\/www.flickr.com\/photos\/jula_julz\/4892663872\/\" title=\"30 weeks 5 days by jula julz, on Flickr\"><img src=\"http:\/\/farm5.static.flickr.com\/4080\/4892663872_2ae31cf0e9.jpg\" width=\"334\" height=\"500\" alt=\"30 weeks 5 days\" \/><\/a><\/p>\n\n<p>Julia and I are expecting! The baby\u2019s due on October 19, so not long to go.<\/p>\n\n<p><a href=\"http:\/\/www.flickr.com\/photos\/jula_julz\/4504065200\/\" title=\"New House by jula julz, on Flickr\"><img src=\"http:\/\/farm3.static.flickr.com\/2756\/4504065200_176160bf06.jpg\" width=\"500\" height=\"334\" alt=\"New House\" \/><\/a>\n<a href=\"http:\/\/www.flickr.com\/photos\/jula_julz\/4504076208\/\" title=\"New House by jula julz, on Flickr\"><img src=\"http:\/\/farm3.static.flickr.com\/2785\/4504076208_ecbe5bf64d.jpg\" style=\"margin-top: 16px;\" width=\"500\" height=\"334\" alt=\"New House\" \/><\/a><\/p>\n\n<p>We have a new house. 
We\u2019re situated between Julia\u2019s parents and mine, which will work out really well when the baby arrives.<\/p>\n\n<p>On the work front, I\u2019ve manned up and gotten a Real Job. After a brief stint at <a href=\"http:\/\/railsmachine.com\">Rails Machine<\/a> (awesome dudes, you can\u2019t go past them for Ruby on Rails hosting and scalability consulting), I\u2019m now working at <a href=\"http:\/\/bulletproof.net\">Bulletproof Networks<\/a>, being paid to hack on my various open source monitoring projects.<\/p>\n\n<p>In open source land, there\u2019s been a major new release of <a href=\"http:\/\/visage-app.com\/\">Visage<\/a>, <a href=\"http:\/\/cucumber-nagios.org\">cucumber-nagios<\/a> has had several bugfix and minor feature releases, and Visage also has a <a href=\"http:\/\/visage-app.com\/\">brand spanking new<\/a> website. Due to the baby and new job I had to cancel appearances at Agile 2010 and FrOSCon, however <a href=\"http:\/\/agilesysadmin.net\/\">Stephen Nelson-Smith<\/a> stepped up to the plate and gave a great talk on Visage and Reconnoiter (<a href=\"http:\/\/ftp.stw-bonn.de\/froscon\/2010\/hs12\/theora\/hs12_-_2010-08-22_14:00_-_en_-_next_generation_capacity_planning_-_stephen_nelson-smith.ogv\">Theora<\/a>, <a href=\"http:\/\/ftp.stw-bonn.de\/froscon\/2010\/hs12\/h264\/hs12_-_2010-08-22_14:00_-_en_-_next_generation_capacity_planning_-_stephen_nelson-smith.mp4\">h264<\/a>).<\/p>\n\n<p>The baby means we\u2019ll be calling Sydney home for a while, however (fingers crossed) the new year should bode well for travel!<\/p>\n","pubDate":"09 Sep 2010","link":"https:\/\/fractio.nl\/2010\/09\/09\/the-times-they-are-a-changin\/","guid":"https:\/\/fractio.nl\/2010\/09\/09\/the-times-they-are-a-changin\/"},{"title":"Behaviour driven infrastructure through Cucumber","description":"<p><a href=\"http:\/\/blogs.sun.com\/martin\/\">Martin Englund<\/a> posted an open question\nto the <a 
href=\"http:\/\/groups.google.com\/group\/puppet-users\/browse_thread\/thread\/5c6ccb2ae1cbfd86\">Puppet mailing list<\/a>\na few days ago asking how people are verifying their systems are built\nas expected:<\/p>\n\n<blockquote>\n  <p>When you write code, you always use unit testing &amp; integration testing\nto verify that the application is working as expected, but why don\u2019t\nwe use that when we install a system?<\/p>\n\n  <p>What are you using to verify that your system is correctly configured\nand behaves the way you want?<\/p>\n<\/blockquote>\n\n<p>He linked to a <a href=\"http:\/\/blogs.sun.com\/martin\/entry\/behavior_driven_infrastructure\">blog post<\/a>\ndemonstrating how he was verifying his machines using <a href=\"http:\/\/cukes.info\/\">Cucumber<\/a>.<\/p>\n\n<p>Coincidentally, about a week earlier at <a href=\"http:\/\/devopsdays.org\">Devopsdays<\/a>\nin Gent, I was talking to <a href=\"http:\/\/hazardous.org\/~fkr\/\">Felix Kronlage<\/a>\nand <a href=\"http:\/\/twitter.com\/berndahlers\">Bernd Ahlers<\/a> from <a href=\"http:\/\/www.bytemine.net\/\">bytemine<\/a>\nabout doing similar things through testing SSH and mail delivery\nwith <a href=\"http:\/\/auxesis.github.com\/cucumber-nagios\">cucumber-nagios<\/a>.<\/p>\n\n<p>It\u2019s pretty cool people are thinking about doing BDD\/TDD with\ninfrastructure, and it\u2019s even cooler that the tools are at the point\nwhere doing this is actually possible.<\/p>\n\n<p>When doing software testing, your testing tool is normally separate\nfrom the language and libraries you\u2019re building the software with (but\nalmost always written in the same language). 
When testing your\ninfrastructure, I think it makes perfect sense to apply this practice.<\/p>\n\n<p>So to practise Behaviour Driven Infrastructure right now, you can use\nCucumber as the testing tool, and Puppet as the programming language.<\/p>\n\n<p>One advantage of practising BDD within the sysadmin world is that the\ntesting tools aren\u2019t closely coupled to the language our systems are\nbuilt with - i.e. if you hate Puppet you can use Cfengine, and if\nCucumber isn\u2019t cutting it use PyUnit.<\/p>\n\n<p>But on to something tangible!<\/p>\n\n<!-- excerpt -->\n\n<p>Building on Martin\u2019s excellent examples, I\u2019ve pushed out a new version\nof cucumber-nagios that includes some basic SSH interaction steps, so\nyou can start building behavioural tests for your infrastructure:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\"><span class=\"kd\">Feature<\/span><span class=\"p\">:<\/span> example.org ssh logins\n  As a user of example.org\n  I need to login remotely\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Basic login\n    <span class=\"nf\">Given <\/span>I have no public keys set\n    <span class=\"err\">Then I can ssh to \"example.org\" with the following credentials<\/span><span class=\"p\">:<\/span>\n     <span class=\"p\">|<\/span> <span class=\"nv\">username<\/span> <span class=\"p\">|<\/span> <span class=\"nv\">password<\/span>    <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">lindsay<\/span>  <span class=\"p\">|<\/span> <span class=\"n\">spoonofdoom<\/span> <span class=\"p\">|<\/span>\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Login to multiple hosts\n    <span class=\"nf\">Given <\/span>I have no public keys set\n    <span class=\"err\">Then I can ssh to the following hosts with these credentials<\/span><span class=\"p\">:<\/span>\n     <span class=\"p\">|<\/span> <span class=\"nv\">hostname<\/span>           <span 
class=\"p\">|<\/span> <span class=\"nv\">username<\/span> <span class=\"p\">|<\/span> <span class=\"nv\">password<\/span>      <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">example.org<\/span>        <span class=\"p\">|<\/span> <span class=\"n\">matthew<\/span>  <span class=\"p\">|<\/span> <span class=\"n\">spladeofpain<\/span>  <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">mail.example.org<\/span>   <span class=\"p\">|<\/span> <span class=\"n\">john<\/span>     <span class=\"p\">|<\/span> <span class=\"n\">forkoffury<\/span>    <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">web04.example.org<\/span>  <span class=\"p\">|<\/span> <span class=\"n\">steve<\/span>    <span class=\"p\">|<\/span> <span class=\"n\">sporkofpork<\/span>   <span class=\"p\">|<\/span>\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Login with a key\n    <span class=\"err\">Given I have the following public keys<\/span><span class=\"p\">:<\/span>\n     <span class=\"p\">|<\/span> <span class=\"nv\">keyfile<\/span>                   <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">\/home\/user\/.ssh\/id_dsa<\/span> <span class=\"p\">|<\/span>\n    <span class=\"err\">Then I can ssh to the following hosts with these credentials<\/span><span class=\"p\">:<\/span>\n     <span class=\"p\">|<\/span> <span class=\"nv\">hostname<\/span>         <span class=\"p\">|<\/span> <span class=\"nv\">username<\/span> <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">example.org<\/span>      <span class=\"p\">|<\/span> <span class=\"n\">matthew<\/span>  <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">mail.example.org<\/span> <span class=\"p\">|<\/span> <span class=\"n\">mark<\/span>     <span class=\"p\">|<\/span>\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Login with an inline 
key\n    <span class=\"err\">Then I can ssh to the following hosts with these credentials<\/span><span class=\"p\">:<\/span>\n     <span class=\"p\">|<\/span> <span class=\"nv\">hostname<\/span>         <span class=\"p\">|<\/span> <span class=\"nv\">username<\/span> <span class=\"p\">|<\/span> <span class=\"nv\">keyfile<\/span>                   <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">example.org<\/span>      <span class=\"p\">|<\/span> <span class=\"n\">luke<\/span>     <span class=\"p\">|<\/span> <span class=\"n\">\/home\/luke\/.ssh\/id_dsa<\/span> <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">mail.example.org<\/span> <span class=\"p\">|<\/span> <span class=\"n\">john<\/span>     <span class=\"p\">|<\/span> <span class=\"n\">\/home\/john\/.ssh\/id_dsa<\/span> <span class=\"p\">|<\/span><\/code><\/pre><\/figure>\n\n<p>The above example shows there\u2019s lots of ways to test the same thing\n(all depending on what you\u2019re trying to achieve), but there is now\nalso support for executing shell commands remotely:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\">  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Checking \/etc\/passwd\n    <span class=\"err\">When I ssh to \"example.org\" with the following credentials<\/span><span class=\"p\">:<\/span>\n     <span class=\"p\">|<\/span> <span class=\"nv\">username<\/span> <span class=\"p\">|<\/span> <span class=\"nv\">password<\/span>      <span class=\"p\">|<\/span> <span class=\"nv\">keyfile<\/span>                 <span class=\"p\">|<\/span>\n     <span class=\"p\">|<\/span> <span class=\"n\">jacob<\/span>    <span class=\"p\">|<\/span> <span class=\"n\">spifeofstrife<\/span> <span class=\"p\">|<\/span> <span class=\"n\">\/home\/jacob\/.ssh\/id_dsa<\/span> <span class=\"p\">|<\/span>\n    <span class=\"nf\">And <\/span>I run <span class=\"s\">\"cat \/etc\/passwd\"<\/span>\n    
<span class=\"nf\">Then <\/span>I should see <span class=\"s\">\"jacob\"<\/span> in the output<\/code><\/pre><\/figure>\n\n<p>I don\u2019t expect you would do a <code class=\"language-plaintext highlighter-rouge\">cat \/etc\/passwd<\/code> in a real test,\nhowever the step definition is a good example of how to interact\nwith an established SSH connection:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"no\">When<\/span> <span class=\"sr\">\/^I run \"([^\\\"]*)\"$\/<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">command<\/span><span class=\"o\">|<\/span>\n  <span class=\"vi\">@output<\/span> <span class=\"o\">=<\/span> <span class=\"vi\">@connection<\/span><span class=\"p\">.<\/span><span class=\"nf\">exec!<\/span><span class=\"p\">(<\/span><span class=\"n\">command<\/span><span class=\"p\">)<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"no\">Then<\/span> <span class=\"sr\">\/^I should see \"([^\\\"]*)\" in the output$\/<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">string<\/span><span class=\"o\">|<\/span>\n  <span class=\"vi\">@output<\/span><span class=\"p\">.<\/span><span class=\"nf\">should<\/span> <span class=\"o\">=~<\/span> <span class=\"sr\">\/<\/span><span class=\"si\">#{<\/span><span class=\"n\">string<\/span><span class=\"si\">}<\/span><span class=\"sr\">\/<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>You\u2019d use this to write specific tests for checking system behaviour,\nsuch as local user logins vs LDAP logins, or the presence of a daemon.<\/p>\n\n<p>So the resulting process may look something like this:<\/p>\n\n<ol>\n  <li>Use cucumber-nagios to write a specification of how you expect your\ninfrastructure to behave.<\/li>\n  <li>Hook your new cucumber-nagios checks into Nagios.<\/li>\n  <li>Start writing your manifests\/cookbooks.<\/li>\n  <li>Run your configuration management tool on 
the node you\u2019re configuring.<\/li>\n  <li>Iterate until your monitoring system is silent.<\/li>\n<\/ol>\n\n<p>Not only do you have a functional definition of how your machines work\nthat you can use to build your machines, but if your systems deviate\nfrom the expected behaviour at any point in the future, you\u2019ll get an\nalert from your monitoring system.<\/p>\n\n<p>Maintaining both a configuration management system and a set of\nintegration tests might get annoying after a while, but if you ever\ndecide to migrate to another configuration management system or move\nyour machines into the cloud you\u2019d have a set of tests you could apply\nimmediately.<\/p>\n\n<p>This could also be useful for moving existing machines into a\nconfiguration management system. Write a set of integration tests for\nyour unmanaged machines, run your configuration management system\nover the existing machines, see if anything is broken.<\/p>\n\n<p>I\u2019d be interested to hear how this process or similar works for people!<\/p>\n\n","pubDate":"09 Nov 2009","link":"https:\/\/fractio.nl\/2009\/11\/09\/behaviour-driven-infrastructure-through-cucumber\/","guid":"https:\/\/fractio.nl\/2009\/11\/09\/behaviour-driven-infrastructure-through-cucumber\/"},{"title":"Slides from Devopsdays 2009","description":"<p>On cucumber-nagios:<\/p>\n\n<div style=\"width:425px;text-align:left\" id=\"__ss_2444224\">\n\t<object style=\"margin:0px\" width=\"425\" height=\"355\">\n\t\t<param name=\"movie\" value=\"http:\/\/static.slidesharecdn.com\/swf\/ssplayer2.swf?doc=068-joined-091107061958-phpapp01&amp;stripped_title=behaviour-driven-monitoring-with-cucumbernagios-2444224\" \/>\n\t\t<param name=\"allowFullScreen\" value=\"true\" \/>\n\t\t<param name=\"allowScriptAccess\" value=\"always\" \/>\n\t\t<embed src=\"http:\/\/static.slidesharecdn.com\/swf\/ssplayer2.swf?doc=068-joined-091107061958-phpapp01&amp;stripped_title=behaviour-driven-monitoring-with-cucumbernagios-2444224\" 
type=\"application\/x-shockwave-flash\" allowscriptaccess=\"always\" allowfullscreen=\"true\" width=\"425\" height=\"355\" \/>&lt;\/embed&gt;\n\t<\/object>\n<\/div>\n<p>And Flapjack:<\/p>\n\n<div style=\"width:425px;text-align:left\" id=\"__ss_2433231\">\n\t<object style=\"margin:0px\" width=\"425\" height=\"355\">\n\t\t<param name=\"movie\" value=\"http:\/\/static.slidesharecdn.com\/swf\/ssplayer2.swf?doc=141-joined-091105162106-phpapp01&amp;stripped_title=flapjack-rethinking-monitoring-for-the-cloud\" \/>\n\t\t<param name=\"allowFullScreen\" value=\"true\" \/>\n\t\t<param name=\"allowScriptAccess\" value=\"always\" \/>\n\t\t<embed src=\"http:\/\/static.slidesharecdn.com\/swf\/ssplayer2.swf?doc=141-joined-091105162106-phpapp01&amp;stripped_title=flapjack-rethinking-monitoring-for-the-cloud\" type=\"application\/x-shockwave-flash\" allowscriptaccess=\"always\" allowfullscreen=\"true\" width=\"425\" height=\"355\" \/>&lt;\/embed&gt;\n\t<\/object>\n<\/div>\n\n","pubDate":"07 Nov 2009","link":"https:\/\/fractio.nl\/2009\/11\/07\/slides-from-devopsdays\/","guid":"https:\/\/fractio.nl\/2009\/11\/07\/slides-from-devopsdays\/"},{"title":"Using Cucumber as a scripting language","description":"<p>Yesterday at the excellent <a href=\"http:\/\/devopsdays.org\">Devopsdays<\/a> in Gent,\nBelgium, I proposed an open session to flesh out an idea I had a few weeks\nago - to use <a href=\"http:\/\/cukes.info\/\">Cucumber<\/a> as a general scripting language.<\/p>\n\n<p>Cucumber\u2019s <a href=\"http:\/\/wiki.github.com\/aslakhellesoy\/cucumber\/given-when-then\">Given\/When\/Then<\/a>\nsteps are well suited to procedural tasks like shell script, and you would\nbe writing your \u201cscripts\u201d in straightforward language that non-technical users\nsuch as managers and clients could understand. 
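The plumbing behind such steps does not need to be much more than thin wrappers over FileUtils. Here is a minimal, hypothetical sketch of the operations those step definitions could delegate to (helper names like in_directory and copy_file are my own invention, not from the session):

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helpers that step definitions like "Given I am in",
# "When I copy the file" and "Then the file should exist" could call.
def in_directory(dir, &block)
  FileUtils.mkdir_p(dir)     # make sure the working directory exists
  Dir.chdir(dir, &block)     # run the block from inside it
end

def ensure_file(name)
  FileUtils.touch(name)      # "Given the file ... exists"
end

def copy_file(source, destination)
  FileUtils.cp(source, destination)  # "When I copy the file ... to ..."
end

# Exercise the helpers in a throwaway directory, mirroring the
# "A single file" scenario.
Dir.mktmpdir do |tmp|
  in_directory(tmp) do
    ensure_file("spoons")
    copy_file("spoons", "forks")
    puts File.readable?("forks")   # prints "true"
  end
end
```

Keeping the file operations in plain methods like these also means the step definitions stay one-liners, which makes growing a library of human-readable commands cheap.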
Also, as writing a scenario\nwithout a Then to close it feels unbalanced, you\u2019d get in the mindset of\ntesting the actions of your \u201cscripts\u201d fairly quickly.<\/p>\n\n<p>With little more than the hypothesis above, a group of us found a room and\nstarted modeling some scenarios. Our focus was on file manipulation, as it\nwas a low hanging fruit and something most scripts do.<\/p>\n\n<p>We came up with this:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\"><span class=\"kd\">Feature<\/span><span class=\"p\">:<\/span> Copy files around\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> <span class=\"nf\">A <\/span>single file\n    <span class=\"nf\">Given <\/span>I am in <span class=\"s\">\"\/tmp\"<\/span>\n    <span class=\"nf\">And <\/span>the file <span class=\"s\">\"spoons\"<\/span> exists\n    <span class=\"nf\">When <\/span>I copy the file <span class=\"s\">\"spoons\"<\/span> to <span class=\"s\">\"forks\"<\/span>\n    <span class=\"nf\">Then <\/span>the file <span class=\"s\">\"forks\"<\/span> should exist\n    <span class=\"nf\">And <\/span>the file <span class=\"s\">\"forks\"<\/span> should be readable\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Multiple files\n    <span class=\"nf\">Given <\/span>I am in <span class=\"s\">\"\/tmp\"<\/span>\n    <span class=\"err\">Given the following table of tasty fruit<\/span><span class=\"p\">:<\/span>\n      <span class=\"p\">|<\/span> <span class=\"nv\">filename<\/span> <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">apples<\/span>   <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">oranges<\/span>  <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">bananas<\/span>  <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">ananas<\/span>   <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> 
<span class=\"n\">file<\/span> <span class=\"n\">with<\/span> <span class=\"n\">lots<\/span> <span class=\"n\">o<\/span> <span class=\"n\">spaces<\/span> <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">spoons<\/span> <span class=\"n\">of<\/span> <span class=\"n\">:<\/span> <span class=\"n\">doom<\/span> <span class=\"p\">|<\/span>\n    <span class=\"nf\">When <\/span>I create the directory <span class=\"s\">\"\/tmp\/some_other_dir\"<\/span>\n    <span class=\"nf\">When <\/span>I copy the tasty fruit in the table to <span class=\"s\">\"\/tmp\/some_other_dir\"<\/span>\n    <span class=\"nf\">Then <\/span>the tasty fruit in the table should exist in <span class=\"s\">\"\/tmp\/some_other_dir\"<\/span><\/code><\/pre><\/figure>\n\n<p>The first scenario is fairly self explanatory, but the second one is where\nthe interesting stuff starts happening.<\/p>\n\n<p>In the implementation of the \u201cfollowing table\u201d step, we create an instance\nvariable that persists the list of files between steps. 
This way, we can\nreference the \u201ctasty fruit\u201d throughout our other steps:<\/p>\n\n<!-- excerpt -->\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"no\">Given<\/span> <span class=\"sr\">\/^the following table of (.+):$\/<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"nb\">name<\/span><span class=\"p\">,<\/span> <span class=\"n\">table<\/span><span class=\"o\">|<\/span>\n  <span class=\"vi\">@tables<\/span> <span class=\"o\">=<\/span> <span class=\"p\">{}<\/span>\n  <span class=\"vi\">@tables<\/span><span class=\"p\">[<\/span><span class=\"nb\">name<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"n\">table<\/span><span class=\"p\">.<\/span><span class=\"nf\">hashes<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>We use the <code class=\"language-plaintext highlighter-rouge\">(.+)<\/code> regex to capture the name of the table so we can poke at\nit later on. 
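For completeness, here is a rough sketch of how a matching copy step could consume that hash of tables. This is not from the session notes: copy_table_to is an invented name, and it assumes rows shaped like the output of Cucumber's table.hashes (an array of hashes keyed by column name):

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical implementation of "When I copy the <name> in the table
# to <dir>": look the table up by name, then copy each listed file.
def copy_table_to(tables, name, destination)
  FileUtils.mkdir_p(destination)
  tables[name].each do |row|
    FileUtils.cp(row["filename"], destination)
  end
end

# Simulate the scenario in a throwaway directory: create the files
# named in the "tasty fruit" table, then copy them across.
Dir.mktmpdir do |dir|
  Dir.chdir(dir) do
    tables = { "tasty fruit" => [{ "filename" => "apples" },
                                 { "filename" => "oranges" }] }
    tables["tasty fruit"].each { |row| FileUtils.touch(row["filename"]) }
    copy_table_to(tables, "tasty fruit", "some_other_dir")
    puts File.exist?("some_other_dir/apples")   # prints "true"
  end
end
```
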
This design lets you easily use multiple tables throughout your\nsteps that won\u2019t conflict with one another:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\">  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Multiple files from multiple tables\n    <span class=\"err\">Given the following table of tasty fruit<\/span><span class=\"p\">:<\/span>\n      <span class=\"p\">|<\/span> <span class=\"nv\">filename<\/span> <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">apples<\/span>   <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">oranges<\/span>  <span class=\"p\">|<\/span>\n    <span class=\"err\">And the following table of baggy baggage<\/span><span class=\"p\">:<\/span>\n      <span class=\"p\">|<\/span> <span class=\"nv\">filename<\/span> <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">suitcase<\/span> <span class=\"p\">|<\/span>\n      <span class=\"p\">|<\/span> <span class=\"n\">backpack<\/span> <span class=\"p\">|<\/span>\n    <span class=\"nf\">When <\/span>I copy the baggy baggage in the table to <span class=\"s\">\"\/tmp\/some_other_dir\"<\/span>\n    <span class=\"nf\">And <\/span>I copy the tasty fruit in the table to <span class=\"s\">\"\/tmp\/some_other_dir\"<\/span>\n    <span class=\"nf\">Then <\/span>the tasty fruit in the table should exist in <span class=\"s\">\"\/tmp\/some_other_dir\"<\/span>\n    <span class=\"nf\">And <\/span>the baggy baggage in the table should exist in <span class=\"s\">\"\/tmp\/some_other_dir\"<\/span><\/code><\/pre><\/figure>\n\n<p>Other steps can reference data in the table by accepting a name and looking\nit up in the hash of tables:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"no\">Then<\/span> <span class=\"sr\">\/^the (.+) in the table should exist in \"([^\\\"]*)\"$\/<\/span> <span 
class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"nb\">name<\/span><span class=\"p\">,<\/span> <span class=\"n\">destination<\/span><span class=\"o\">|<\/span>\n  <span class=\"vi\">@tables<\/span><span class=\"p\">[<\/span><span class=\"nb\">name<\/span><span class=\"p\">].<\/span><span class=\"nf\">each<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">file<\/span><span class=\"o\">|<\/span>\n    <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">exists?<\/span><span class=\"p\">(<\/span><span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">join<\/span><span class=\"p\">(<\/span><span class=\"n\">destination<\/span><span class=\"p\">,<\/span> <span class=\"n\">file<\/span><span class=\"p\">[<\/span><span class=\"s2\">\"filename\"<\/span><span class=\"p\">])).<\/span><span class=\"nf\">should<\/span> <span class=\"n\">be_true<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>We also looked at handling permission problems:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\">  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> <span class=\"nf\">Do <\/span>things i'm not allowed to\n    <span class=\"nf\">When <\/span>I create the directory <span class=\"s\">\"\/usr\/bin\/wtf\"<\/span><\/code><\/pre><\/figure>\n\n<p>Here the step will raise an <code class=\"language-plaintext highlighter-rouge\">Errno::EACCES<\/code> exception, and as Cucumber uses\na pretty formatter by default, the failed step will appear in red.<\/p>\n\n<p>Finally we tried copying files with a glob. 
The initial implementation I\nbanged out was very Unix focused (it used <code class=\"language-plaintext highlighter-rouge\">*<\/code>, which is a very explicit\nglobbing syntax), so we scrapped that idea and wrote our intentions in\nplain English:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\">  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Copy based on a pattern\n    <span class=\"nf\">Given <\/span>I am in <span class=\"s\">\"\/tmp\"<\/span>\n    <span class=\"nf\">When <\/span>I create the directory <span class=\"s\">\"\/tmp\/pattern_dir\"<\/span>\n    <span class=\"nf\">And <\/span>I copy files beginning with the letters z,y,x to <span class=\"s\">\"\/tmp\/pattern_dir\"<\/span>\n    <span class=\"nf\">Then <\/span>they should exist there<\/code><\/pre><\/figure>\n\n<p>The implementation is obvious, and is very understandable (and seemingly\npowerful) to someone with no knowledge of globbing.<\/p>\n\n<p>People who have used Cucumber in web development will likely note that the\nabove implementation is an example of <a href=\"http:\/\/wiki.github.com\/aslakhellesoy\/cucumber\/step-organisation\">tightly coupled steps<\/a>,\nwhich is sometimes regarded as an anti-pattern. I\u2019m of the opinion that this\nis a lot more painful in a web development context than in a\nprocedural\/scripting tool one.<\/p>\n\n<p>From my recollection of Euruko earlier this year, when <a href=\"http:\/\/blog.aslakhellesoy.com\/\">Aslak<\/a>\nwas asked whether he considers it an antipattern, he said it can be ok to use\ndepending on the problem you\u2019re trying to solve, so I take that as tacit\npermission that it is ok in this context. 
:-)<\/p>\n\n<p>I posted the results of the session to a <a href=\"http:\/\/gist.github.com\/223110\">Gist<\/a>\nyesterday, and I have also <a href=\"http:\/\/github.com\/auxesis\/cucumber-scripting\">published a repo<\/a>\nwith a <a href=\"http:\/\/github.com\/wycats\/bundler\">bundler<\/a>-ready install process, so\npeople can hack on it more.<\/p>\n\n<p>After the session I remembered that the feature file <a href=\"http:\/\/groups.google.com\/group\/cukes\/browse_thread\/thread\/ebb5f7e5e90dc825#\">doesn\u2019t actually have\nto start with Feature<\/a>,\nso it\u2019s possible to write standalone scenarios one after another.<\/p>\n\n<p>When wrapping up, someone in the room pointed out that our implementation\nactually went one better than being readable by non-technical users - they\ncould probably write the scripts themselves.<\/p>\n\n<p>This is pretty powerful, and coupled with Cucumber\u2019s very cool step generation\nwhen running scenarios with undefined steps, makes it very easy to start\nprototyping a standard library of human readable scripting commands.<\/p>\n\n<p>There was <a href=\"http:\/\/groups.google.com\/group\/cukes\/browse_thread\/thread\/6e268a0238944fd2\/31c9dcfc27a4278e\">chatter<\/a>\non the Cucumber mailing list a few weeks ago about providing alternate\ninterfaces for writing and executing Cucumber features, and it could\nbe cool to see a drag-and-drop interface with a library of common tasks that\ncalls out to Cucumber to execute them. 
You could even build something quite\nbeautiful with <a href=\"http:\/\/www.macruby.org\/trac\/wiki\/HotCocoa\">HotCocoa<\/a>.<\/p>\n\n<p>Anyhow, if you think anything mentioned above is a cool idea, check out the\ncode and start hacking!<\/p>\n\n","pubDate":"01 Nov 2009","link":"https:\/\/fractio.nl\/2009\/11\/01\/using-cucumber-as-a-scripting-language\/","guid":"https:\/\/fractio.nl\/2009\/11\/01\/using-cucumber-as-a-scripting-language\/"},{"title":"cucumber-nagios 0.5.0","description":"<p>I\u2019ve just released a new version of <code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code>, and this release is \nquite a milestone!<\/p>\n\n<p>Big changes in this release include:<\/p>\n\n<ul>\n  <li>\n    <p>Removal of the ghetto bundler in favour of wycats\/carllerche\u2019s <a href=\"http:\/\/github.com\/wycats\/bundler\/\">bundler<\/a>.<\/p>\n\n    <p>In previous releases, you\u2019d use a <code class=\"language-plaintext highlighter-rouge\">rake<\/code> task to freeze in dependencies. \nThis produced all sorts of weird problems when new versions of the \ndependencies were released, it didn\u2019t handle gems with C extensions that \nwell, and could be <em>very<\/em> slow if you ran it multiple times.<\/p>\n\n    <p>Now that <code class=\"language-plaintext highlighter-rouge\">bundler<\/code> has started maturing, <code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code> has made the \nswitch. It eliminates all the aforementioned issues, and integrates cleanly\nwith RubyGems.<\/p>\n  <\/li>\n  <li>\n    <p>Renaming of the gem to <code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code> from <code class=\"language-plaintext highlighter-rouge\">auxesis-cucumber-nagios<\/code>,\nas GitHub have <a href=\"http:\/\/github.com\/blog\/515-gem-building-is-defunct\">discontinued building gems<\/a>. 
\nThe gem is now <a href=\"http:\/\/gemcutter.org\/gems\/cucumber-nagios\">published<\/a> on \n<a href=\"http:\/\/gemcutter.org\/\">Gemcutter<\/a>.<\/p>\n  <\/li>\n  <li>\n    <p>The project generator now prints out helpful instructions when you generate\na new project.<\/p>\n  <\/li>\n  <li>\n    <p><code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code> projects have built-in steps for benchmarking response \ntimes. The following example explains it best:<\/p>\n  <\/li>\n<\/ul>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\"><span class=\"kd\">Feature<\/span><span class=\"p\">:<\/span> slashdot.com\n  To keep the geek masses satisfied\n  Slashdot must be responsive\n    \n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Visiting a responsive front page\n    <span class=\"nf\">Given <\/span>I am benchmarking\n    <span class=\"err\">When I go to http<\/span><span class=\"p\">:<\/span><span class=\"err\">\/\/slashdot.org\/<\/span>\n    <span class=\"nf\">Then <\/span>the elapsed time should be less than 5 seconds<\/code><\/pre><\/figure>\n\n<ul>\n  <li>\n    <p>A <code class=\"language-plaintext highlighter-rouge\">--debug<\/code> switch can be passed to <code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code> to print out the \ncommand line built and executed. This can be useful when writing your \nfeatures.<\/p>\n  <\/li>\n  <li>\n    <p>Removal of several unnecessary support files, and cleanups of helpers and \nCucumber\u2019s <a href=\"http:\/\/wiki.github.com\/aslakhellesoy\/cucumber\/a-whole-new-world\">World<\/a>\nobject setup, in line with an updated version of <a href=\"http:\/\/github.com\/brynary\/webrat\">Webrat<\/a>.<\/p>\n  <\/li>\n  <li>\n    <p>Refactoring of the Nagios formatter for Cucumber to use Cucumber 0.4.0\u2019s \nformatter interface. 
For users, this simply means <code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code> now \nworks with Cucumber 0.4.0 (the latest at time of this release).<\/p>\n  <\/li>\n<\/ul>\n\n<p>Although i\u2019ve done a fair amount of testing, there will invariably be bugs, \nwhich can be reported on <a href=\"http:\/\/github.com\/auxesis\/cucumber-nagios\/issues\">GitHub<\/a>.<\/p>\n","pubDate":"12 Oct 2009","link":"https:\/\/fractio.nl\/2009\/10\/12\/cucumber-nagios-0-point-5\/","guid":"https:\/\/fractio.nl\/2009\/10\/12\/cucumber-nagios-0-point-5\/"},{"title":"Switching to Jekyll","description":"<p>After a quick migration, i\u2019ve switched this blog from WordPress to <a href=\"http:\/\/wiki.github.com\/mojombo\/jekyll\">Jekyll<\/a>.<\/p>\n\n<p>I\u2019ve done this for several reasons:<\/p>\n\n<ul>\n  <li>My blog content is static. Spinning up PHP on every request is overkill.<\/li>\n  <li>I want to write my posts in <a href=\"http:\/\/daringfireball.net\/projects\/markdown\/\">Markdown<\/a>.<\/li>\n  <li>Jekyll has awesome <a href=\"http:\/\/wiki.github.com\/mojombo\/jekyll\/liquid-extensions\">syntax highlighting<\/a> using <a href=\"http:\/\/pygments.org\/\">Pygments<\/a>.<\/li>\n  <li>I can easily migrate from <a href=\"http:\/\/github.com\/mojombo\/jekyll\/tree\/master\/lib\/jekyll\/converters\/\">WordPress to Jekyll<\/a>.<\/li>\n  <li>Latest WordPress releases segfault Apache on Dapper.<\/li>\n<\/ul>\n\n<p>Cool things now i\u2019ve migrated:<\/p>\n\n<ul>\n  <li>I can version control my blog.<\/li>\n  <li>My blog content is flat file, so I just edit the content and push. This also means my blog can be easily distributed and backed up.<\/li>\n  <li>Pulling in Flickr photos, Last.fm listening and tweets no longer blocks the page load. 
I wrote a <a href=\"http:\/\/holmwood.id.au\/~lindsay\/javascripts\/widgets.js\">cute little<\/a> <a href=\"http:\/\/mootools.net\">MooTools<\/a> class to display the info, and a cron job to fetch it in the background.<\/li>\n  <li>Comments are all preserved, as I switched to <a href=\"http:\/\/disqus.com\/\">Disqus<\/a> several weeks ago. The WordPress =&gt; Disqus import was mind numbingly easy using the Disqus plugin.<\/li>\n<\/ul>\n\n<p>If you want minimalism in your blogging engine and full control over its appearance, Jekyll might be worth checking out.<\/p>\n","pubDate":"30 Sep 2009","link":"https:\/\/fractio.nl\/2009\/09\/30\/switching-to-jekyll\/","guid":"https:\/\/fractio.nl\/2009\/09\/30\/switching-to-jekyll\/"},{"title":"Searching for the perfect presentation toolchain","description":"<p>I\u2019ve spent the last few years trying to find the holy grail of FOSS toolchains for producing and displaying my presentations.<\/p>\n\n<p>I started with OpenOffice.org Impress (which I have sworn never to use again after it ate several presentations), dallied with Clutter\u2019s opt, seriously used KeyJNote, before moving to Mac OS X last year and settling on Apple\u2019s Keynote.<\/p>\n\n<p>While travelling this year without a Mac, i\u2019ve resumed my search for the perfect toolchain, and I think i\u2019ve found a setup that works pretty well.<\/p>\n\n<p>It uses Inkscape to build the slides, a text file to order slides, KeyJNote to display them, and <a href=\"http:\/\/rake.rubyforge.org\/\">Rake<\/a> to tie it all together. 
Oh, and it\u2019s versioned with <a href=\"http:\/\/bazaar-vcs.org\/\">Bazaar<\/a>.<\/p>\n\n<p>Each slide goes on a new line in <code class=\"language-plaintext highlighter-rouge\">order.txt<\/code>:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-text\" data-lang=\"text\">\/home\/auxesis\/Desktop\/devopsdays\/slides\/blank.png\n# http:\/\/www.flickr.com\/photos\/tim_norris\/2600844073\/\n\/home\/auxesis\/Desktop\/devopsdays\/slides\/scalable.png\n# http:\/\/www.flickr.com\/photos\/barbour\/404053639\/\n\/home\/auxesis\/Desktop\/devopsdays\/slides\/distributed.png\n# http:\/\/www.flickr.com\/photos\/numstead\/535460927\/\n\/home\/auxesis\/Desktop\/devopsdays\/slides\/nagios-plugin-format.png<\/code><\/pre><\/figure>\n\n<p>You can easily add comments between slides to keep track of image sources or write down ideas.<\/p>\n\n<p>Then there\u2019s a Rake task for building the KeyJNote command line and setting up displays:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"n\">desc<\/span> <span class=\"s2\">\"setup external displays\"<\/span>\n<span class=\"n\">task<\/span> <span class=\"ss\">:displays<\/span> <span class=\"k\">do<\/span>\n  <span class=\"nb\">system<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"xrandr --output VGA --mode 1024x768\"<\/span><span class=\"p\">)<\/span>\n  <span class=\"nb\">system<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"xrandr --output VGA --same-as LVDS\"<\/span><span class=\"p\">)<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"n\">desc<\/span> <span class=\"s2\">\"perform presentation\"<\/span>\n<span class=\"n\">task<\/span> <span class=\"ss\">:perform<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"ss\">:displays<\/span> <span class=\"k\">do<\/span>\n  <span class=\"n\">options<\/span> <span class=\"o\">=<\/span> <span class=\"s2\">\"--transition Crossfade --transtime 250 -c persistent\"<\/span>\n  <span 
class=\"n\">command<\/span> <span class=\"o\">=<\/span> <span class=\"s2\">\"keyjnote <\/span><span class=\"si\">#{<\/span><span class=\"n\">options<\/span><span class=\"si\">}<\/span><span class=\"s2\"> @<\/span><span class=\"si\">#{<\/span><span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">dirname<\/span><span class=\"p\">(<\/span><span class=\"kp\">__FILE__<\/span><span class=\"p\">)<\/span><span class=\"si\">}<\/span><span class=\"s2\">\/order.txt\"<\/span>\n  <span class=\"nb\">system<\/span><span class=\"p\">(<\/span><span class=\"n\">command<\/span><span class=\"p\">)<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>Finally, there\u2019s a Rake task for building the PNGs from the SVGs created through Inkscape:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"n\">desc<\/span> <span class=\"s2\">\"build pngs from svgs\"<\/span>\n<span class=\"n\">task<\/span> <span class=\"ss\">:build<\/span> <span class=\"k\">do<\/span>\n  <span class=\"no\">Dir<\/span><span class=\"p\">.<\/span><span class=\"nf\">glob<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"sources\/*.svg\"<\/span><span class=\"p\">).<\/span><span class=\"nf\">each<\/span> <span class=\"k\">do<\/span> <span class=\"o\">|<\/span><span class=\"n\">file<\/span><span class=\"o\">|<\/span>\n    <span class=\"k\">if<\/span> <span class=\"n\">modified?<\/span><span class=\"p\">(<\/span><span class=\"n\">file<\/span><span class=\"p\">)<\/span>\n      <span class=\"n\">basename<\/span> <span class=\"o\">=<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">basename<\/span><span class=\"p\">(<\/span><span class=\"n\">file<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'.svg'<\/span><span class=\"p\">)<\/span>\n      <span class=\"n\">slide<\/span> <span class=\"o\">=<\/span> <span class=\"s2\">\"slides\/<\/span><span class=\"si\">#{<\/span><span 
class=\"n\">basename<\/span><span class=\"si\">}<\/span><span class=\"s2\">.png\"<\/span>\n      <span class=\"nb\">system<\/span><span class=\"p\">(<\/span><span class=\"s2\">\"inkscape -e <\/span><span class=\"si\">#{<\/span><span class=\"n\">slide<\/span><span class=\"si\">}<\/span><span class=\"s2\"> -f <\/span><span class=\"si\">#{<\/span><span class=\"n\">file<\/span><span class=\"si\">}<\/span><span class=\"s2\">\"<\/span><span class=\"p\">)<\/span>\n    <span class=\"k\">end<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span>\n\n<span class=\"k\">def<\/span> <span class=\"nf\">modified?<\/span><span class=\"p\">(<\/span><span class=\"n\">filename<\/span><span class=\"p\">)<\/span>\n  <span class=\"n\">index_filename<\/span> <span class=\"o\">=<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">join<\/span><span class=\"p\">(<\/span><span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">dirname<\/span><span class=\"p\">(<\/span><span class=\"kp\">__FILE__<\/span><span class=\"p\">),<\/span> <span class=\"s2\">\"cache\"<\/span><span class=\"p\">,<\/span> <span class=\"s2\">\"index\"<\/span><span class=\"p\">)<\/span>\n  <span class=\"c1\"># read or initialise index<\/span>\n  <span class=\"k\">if<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">exists?<\/span><span class=\"p\">(<\/span><span class=\"n\">index_filename<\/span><span class=\"p\">)<\/span>\n    <span class=\"vi\">@index<\/span> <span class=\"o\">=<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">open<\/span><span class=\"p\">(<\/span><span class=\"n\">index_filename<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'r'<\/span><span class=\"p\">)<\/span> <span class=\"p\">{<\/span> <span class=\"o\">|<\/span><span class=\"n\">f<\/span><span class=\"o\">|<\/span> <span class=\"no\">Marshal<\/span><span class=\"p\">.<\/span><span 
class=\"nf\">load<\/span><span class=\"p\">(<\/span><span class=\"n\">f<\/span><span class=\"p\">)<\/span> <span class=\"p\">}<\/span>\n  <span class=\"k\">else<\/span>\n    <span class=\"vi\">@index<\/span> <span class=\"o\">=<\/span> <span class=\"p\">{}<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"c1\"># check if modified<\/span>\n  <span class=\"k\">if<\/span> <span class=\"vi\">@index<\/span><span class=\"p\">[<\/span><span class=\"n\">filename<\/span><span class=\"p\">]<\/span>\n    <span class=\"n\">modified<\/span> <span class=\"o\">=<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">mtime<\/span><span class=\"p\">(<\/span><span class=\"n\">filename<\/span><span class=\"p\">)<\/span> <span class=\"o\">&gt;<\/span> <span class=\"vi\">@index<\/span><span class=\"p\">[<\/span><span class=\"n\">filename<\/span><span class=\"p\">][<\/span><span class=\"ss\">:mtime<\/span><span class=\"p\">]<\/span>\n  <span class=\"k\">else<\/span>\n    <span class=\"n\">modified<\/span> <span class=\"o\">=<\/span> <span class=\"kp\">true<\/span>\n  <span class=\"k\">end<\/span>\n\n  <span class=\"c1\"># update index<\/span>\n  <span class=\"vi\">@index<\/span><span class=\"p\">[<\/span><span class=\"n\">filename<\/span><span class=\"p\">]<\/span> <span class=\"o\">=<\/span> <span class=\"p\">{<\/span><span class=\"ss\">:mtime<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">mtime<\/span><span class=\"p\">(<\/span><span class=\"n\">filename<\/span><span class=\"p\">)}<\/span>\n  <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">open<\/span><span class=\"p\">(<\/span><span class=\"n\">index_filename<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'w'<\/span><span class=\"p\">)<\/span> <span class=\"p\">{<\/span> <span class=\"o\">|<\/span><span class=\"n\">f<\/span><span class=\"o\">|<\/span> <span class=\"no\">Marshal<\/span><span 
class=\"p\">.<\/span><span class=\"nf\">dump<\/span><span class=\"p\">(<\/span><span class=\"vi\">@index<\/span><span class=\"p\">,<\/span> <span class=\"n\">f<\/span><span class=\"p\">)<\/span> <span class=\"p\">}<\/span>\n\n  <span class=\"k\">return<\/span> <span class=\"n\">modified<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>This keeps an index of SVG mtimes, and only rebuilds a slide if you modify the SVG.<\/p>\n\n<p>Then to view the presentation, it\u2019s as simple as:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\"><span class=\"nv\">$ <\/span>rake build <span class=\"p\">;<\/span> rake perform<\/code><\/pre><\/figure>\n\n<p>Now that you\u2019re just dealing with a bunch of files, you can version control the whole presentation with something like bzr (which handles binary content really well). It\u2019s worth setting up an ignore list so all the generated slides don\u2019t get versioned:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-text\" data-lang=\"text\">slides\/*\n*.cache\ncache\/*<\/code><\/pre><\/figure>\n\n","pubDate":"18 Sep 2009","link":"https:\/\/fractio.nl\/2009\/09\/18\/searching-for-the-perfect-presentation-toolchain\/","guid":"https:\/\/fractio.nl\/2009\/09\/18\/searching-for-the-perfect-presentation-toolchain\/"},{"title":"Graphing collectd statistics in the browser with Visage","description":"<p>I\u2019ve been working on a cool little side project the last week called <a href=\"http:\/\/auxesis.github.com\/visage\">Visage<\/a>. 
It renders graphs of <a href=\"http:\/\/collectd.org\/\">collectd<\/a> statistics in the browser, making the data interactive.<\/p>\n\n<p><a href=\"http:\/\/www.flickr.com\/photos\/auxesis\/3897525979\/\" title=\"Visage in action by auxesis, on Flickr\"><img src=\"http:\/\/farm4.static.flickr.com\/3512\/3897525979_80663619f3_o.png\" width=\"858\" height=\"364\" alt=\"Visage in action\" \/><\/a><\/p>\n\n<p>It\u2019s a lot more interactive than the screenshot suggests, so check out <a href=\"http:\/\/visage.unstated.net\/nadia\/cpu+load\">an instance of Visage<\/a> running.<\/p>\n\n<p>Some background:<\/p>\n\n<p><a href=\"http:\/\/collectd.org\/\">collectd<\/a> is an awesome way to collect statistics from your Unix machines and aggregate the stats in one place (it has a network plugin that makes this a cinch).<\/p>\n\n<p>So you set up collectd, and you\u2019re getting all these great statistics, but you want graphs right? Graphs make the IT manager in all of us smile.<\/p>\n\n<p>To date there have been two options for viewing graphs of collectd\u2019s data: collection.cgi, which comes bundled with collectd on most distros, though sometimes squirreled away weirdly, and the newer collection3.<\/p>\n\n<p>The problem I have with these interfaces is that they are organised like the RRDs that collectd stores. You basically use the interface to navigate the RRDs, not deduce meaning.<\/p>\n\n<p>I want to easily see correlations between multiple hosts during a slashdotting. I want to view related stats for a host on a dashboard page. 
I want to filter out datasets that aren\u2019t interesting to me.<\/p>\n\n<p>What\u2019s holding the existing graphing interfaces back is the presentation layer (graphs generated from RRDtool, wrapped in a smattering of Perl) being very tightly coupled with the data layer (the RRDs themselves).<\/p>\n\n<p>So I set about exposing the RRDs in a more digestible form - <a href=\"http:\/\/json.org\/\">JSON<\/a>.<\/p>\n\n<p>Once the RRDs are exposed over the web it makes it easy to consume the data and build your own graphing interface as either a thick client, Flash widget, or in the browser. You could also do periodic snapshots and reporting, but I digress.<\/p>\n\n<p>So once I was able to consume this data, I used the <a href=\"http:\/\/raphaeljs.org\">Rapha\u00ebl<\/a> JavaScript library to render the graphs, and turned it into a <a href=\"http:\/\/mootools.net\/\">MooTools<\/a> class for maximum reusability.<\/p>\n\n<p>So there you have it!<\/p>\n\n<p>Right now there are a few rough edges (the axis labels keep me up at night), but it\u2019s functional. If you give it a go, i\u2019d like to know! You can <a href=\"http:\/\/github.com\/auxesis\/visage\/issues\">report any issues<\/a> you find on GitHub.<\/p>\n","pubDate":"08 Sep 2009","link":"https:\/\/fractio.nl\/2009\/09\/08\/graphing-collectd-statistics-in-the-browser-with-visage\/","guid":"https:\/\/fractio.nl\/2009\/09\/08\/graphing-collectd-statistics-in-the-browser-with-visage\/"},{"title":"Upcoming speaking spots","description":"<p>The conference season is starting to warm up:<\/p>\n\n<p>I\u2019ll be speaking at the <a href=\"http:\/\/www.devopsdays.org\/\">devopsdays<\/a> 2009 in Ghent, Belgium on 30th\/31st of October, on <a href=\"http:\/\/auxesis.github.com\/cucumber-nagios\">cucumber-nagios<\/a> and <a href=\"http:\/\/flapjack-project.com\/\">Flapjack<\/a>. 
<a href=\"http:\/\/www.jedi.be\/blog\/\">Patrick Debois<\/a> is doing an awesome job carving out a <a href=\"http:\/\/en.oreilly.com\/velocity2009\">Velocity<\/a>-like conference in Europe, so if you do any sort of operations or sysadmin work it\u2019s definitely worth attending.<\/p>\n\n<p>In January i\u2019ll be speaking at <a href=\"http:\/\/www.lca2010.org.nz\/\">linux.conf.au 2010<\/a> in Wellington, New Zealand on Flapjack. This year\u2019s organisers are putting together a conference so chock-full of awesome your head will be spinning. If you do anything with open source in Australia or New Zealand you can\u2019t afford to miss it.<\/p>\n","pubDate":"03 Sep 2009","link":"https:\/\/fractio.nl\/2009\/09\/03\/upcoming-speaking-spot\/","guid":"https:\/\/fractio.nl\/2009\/09\/03\/upcoming-speaking-spot\/"},{"title":"Streamlining documentation on your project websites","description":"<p>When hacking on small open source projects you end up having to be a jack of all trades - hacker, documentor, community manager. Documentation can be annoying when you just want to hack, but if you want people to use your code it\u2019s essential.<\/p>\n\n<p>One problem i\u2019ve experienced is keeping documentation on the project\u2019s website up to date with how the software actually works.<\/p>\n\n<p>Generally you have two sets of documentation - one in your project\u2019s source code repo, and one on your project\u2019s website. The website docs tend to cover meta information about the project, and maybe a quick install guide or demo. Your source code docs will cover the nitty-gritty of setting up your app, and maybe how to hack on the code.<\/p>\n\n<p>More often than not the docs in the project\u2019s source will get updated more regularly than on the website, and after a while the website docs may end up diverging from how the software actually works. 
This can be especially noticeable in installation docs.<\/p>\n\n<p>I\u2019ve tried attacking this problem on a new project i\u2019m hacking on.<\/p>\n\n<p>Lately i\u2019ve been using <a href=\"http:\/\/nanoc.stoneship.org\/\">nanoc<\/a> to build a few simple sites. Rather than a full blown application server running in the background a la Rails or Django, nanoc simply compiles the templates and layouts on your site and spits out static HTML.<\/p>\n\n<p>nanoc makes it easy to write your own helpers to be used during the compile phase, so i\u2019ve written a simple helper to include documentation from an external source inline with my content:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"c1\">#lib\/external_docs.rb\n<\/span>\n\n<span class=\"k\">def<\/span> <span class=\"nf\">generate_docs_from_source<\/span><span class=\"p\">(<\/span><span class=\"nb\">name<\/span><span class=\"p\">)<\/span>\n  <span class=\"n\">doc_path<\/span> <span class=\"o\">=<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">join<\/span><span class=\"p\">(<\/span><span class=\"vi\">@config<\/span><span class=\"p\">[<\/span><span class=\"ss\">:code_src<\/span><span class=\"p\">],<\/span> <span class=\"s1\">'doc'<\/span><span class=\"p\">)<\/span>\n  <span class=\"n\">filename<\/span> <span class=\"o\">=<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">join<\/span><span class=\"p\">(<\/span><span class=\"n\">doc_path<\/span><span class=\"p\">,<\/span> <span class=\"s2\">\"<\/span><span class=\"si\">#{<\/span><span class=\"nb\">name<\/span><span class=\"p\">.<\/span><span class=\"nf\">upcase<\/span><span class=\"si\">}<\/span><span class=\"s2\">.md\"<\/span><span class=\"p\">)<\/span>\n\n  <span class=\"k\">if<\/span> <span class=\"no\">File<\/span><span class=\"p\">.<\/span><span class=\"nf\">exists?<\/span><span class=\"p\">(<\/span><span 
class=\"n\">filename<\/span><span class=\"p\">)<\/span>\n    <span class=\"n\">output<\/span> <span class=\"o\">=<\/span> <span class=\"sb\">`rdiscount <\/span><span class=\"si\">#{<\/span><span class=\"n\">filename<\/span><span class=\"si\">}<\/span><span class=\"sb\">`<\/span>\n    <span class=\"no\">Haml<\/span><span class=\"o\">::<\/span><span class=\"no\">Helpers<\/span><span class=\"o\">::<\/span><span class=\"n\">find_and_preserve<\/span><span class=\"p\">(<\/span><span class=\"n\">output<\/span><span class=\"p\">,<\/span> <span class=\"p\">[<\/span><span class=\"s2\">\"pre\"<\/span><span class=\"p\">])<\/span>\n  <span class=\"k\">else<\/span>\n    <span class=\"s2\">\"&amp;lt;span style='color: red'&amp;gt;Error: <\/span><span class=\"si\">#{<\/span><span class=\"nb\">name<\/span><span class=\"p\">.<\/span><span class=\"nf\">upcase<\/span><span class=\"si\">}<\/span><span class=\"s2\">.md doesn't exist in <\/span><span class=\"si\">#{<\/span><span class=\"n\">doc_path<\/span><span class=\"si\">}<\/span><span class=\"s2\">&amp;lt;\/span&amp;gt;\"<\/span>\n  <span class=\"k\">end<\/span>\n<span class=\"k\">end<\/span><\/code><\/pre><\/figure>\n\n<p>This calls out to <a href=\"http:\/\/github.com\/rtomayko\/rdiscount\/tree\/master\">rdiscount<\/a> to generate HTML from a <a href=\"http:\/\/daringfireball.net\/projects\/markdown\/\">Markdown<\/a> document. 
You can use it easily in your templates:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"o\">%<\/span><span class=\"n\">div<\/span><span class=\"c1\">#installation\n<\/span>\n  <span class=\"o\">%<\/span><span class=\"n\">div<\/span><span class=\"c1\">#install.generated\n<\/span>\n    <span class=\"o\">=<\/span> <span class=\"n\">generate_docs_from_source<\/span><span class=\"p\">(<\/span><span class=\"s1\">'install'<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/figure>\n\n<p>Which will render <code>INSTALL.md<\/code> in your source code <code>docs\/<\/code> directory under <code>div#install<\/code>. This makes it easy to theme the generated docs, courtesy of the <code>.generated<\/code> selector.<\/p>\n\n<p>The <code>nanoc<\/code> command line tool has a simple way to compile a site:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\"><span class=\"nv\">$ <\/span>nanoc compile<\/code><\/pre><\/figure>\n\n<p>Which is useful when you want to call a compile from other scripts. You can probably see where i\u2019m going with this:<\/p>\n\n<p><img src=\"http:\/\/farm3.static.flickr.com\/2585\/3694868813_1c429bc2cd_o.png\" alt=\"Docs build cycle\" title=\"Docs build cycle\" width=\"640\" height=\"480\" class=\"alignnone size-full wp-image-562\" \/><\/p>\n\n<p>On the main project source repo i\u2019ve configured a post-receive hook to POST to docs.flapjack-project.com, which is a <a href=\"http:\/\/github.com\/auxesis\/docs.flapjack-project.com\/tree\/master\">Sinatra app<\/a> that triggers a build of the website. 
Before it does the build, it checks out the latest copy of the project and website sources.<\/p>\n\n<p>The nice thing about this approach is you can easily interpolate your own static docs with those from the source code:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"o\">%<\/span><span class=\"n\">div<\/span><span class=\"c1\">#developing\n<\/span>\n  <span class=\"o\">%<\/span><span class=\"n\">h2<\/span> <span class=\"no\">Source<\/span> <span class=\"n\">code<\/span>\n  <span class=\"o\">%<\/span><span class=\"nb\">p<\/span>  \n    <span class=\"no\">Flapjack<\/span><span class=\"s1\">'s code is maintained in two repositories on GitHub:\n    %ul \n      %li \n        %a{:href =&gt; \"http:\/\/github.com\/auxesis\/flapjack\/tree\/master\"}&gt; flapjack\n        , the core of the monitoring system.\n      %li \n        %a{:href =&gt; \"http:\/\/github.com\/auxesis\/flapjack-admin\/tree\/master\"}&gt; flapjack-admin  \n        , the admin interface.\n \n  %p  \n    Flapjack is open source, released under the \n    = link_to \"MIT Licence\", \"http:\/\/en.wikipedia.org\/wiki\/MIT_License\"\n    \\.  
\n  \n  %div.generated#developing\n    = generate_docs_from_source('<\/span><span class=\"n\">developing<\/span><span class=\"err\">'<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/figure>\n\n<p>The code for those interested:<\/p>\n<ul>\n<li><a href=\"http:\/\/github.com\/auxesis\/docs.flapjack-project.com\/tree\/master\">Sinatra app<\/a>, to trigger website builds\n<\/li>\n<li><a href=\"http:\/\/github.com\/auxesis\/flapjack-project.com\/tree\/master\">nanoc website source<\/a>, to compile the site.<\/li>\n<\/ul>\n\n","pubDate":"07 Jul 2009","link":"https:\/\/fractio.nl\/2009\/07\/07\/streamlining-documentation-on-your-project-websites\/","guid":"https:\/\/fractio.nl\/2009\/07\/07\/streamlining-documentation-on-your-project-websites\/"},{"title":"Sane Ruby on Hardy redux","description":"<p>Last year <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2008\/09\/09\/sane-ruby-on-hardy\/\">I blogged about<\/a> where to get up-to-date Ruby packages for Ubuntu.<\/p>\n\n<p>The PPA I suggested hasn\u2019t been updated in a while, and there are a few gems that require the latest version of RubyGems to work correctly (Rails, i\u2019m looking at you).<\/p>\n\n<p>I suggest you remove the PPA and add this to your <code>\/etc\/apt\/sources.list<\/code>:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-text\" data-lang=\"text\">deb http:\/\/apt.brightbox.net\/ hardy main<\/code><\/pre><\/figure>\n\n<p>Sub <code>hardy<\/code> with <code>intrepid<\/code>\/<code>dapper<\/code> depending on what release you need.<\/p>\n\n<p>Then do an <code>apt-get update &amp;&amp; apt-get remove rubygems &amp;&amp; apt-get install rubygems<\/code>, and you\u2019ll be upgraded to the latest version of RubyGems.<\/p>\n","pubDate":"17 Jun 2009","link":"https:\/\/fractio.nl\/2009\/06\/17\/sane-ruby-on-hardy-redux\/","guid":"https:\/\/fractio.nl\/2009\/06\/17\/sane-ruby-on-hardy-redux\/"},{"title":"cucumber-nagios gets a home","description":"<p>As lots of people are apparently using <a 
href=\"http:\/\/github.com\/auxesis\/cucumber-nagios\/tree\/master\">cucumber-nagios<\/a> and i\u2019m getting a few bug reports, I thought it\u2019d be worth setting up a proper home for the project.<\/p>\n\n<p>You can find the project\u2019s site <a href=\"http:\/\/auxesis.github.com\/cucumber-nagios\/\">over here<\/a>. Feature requests and bugs can be reported over at the <a href=\"http:\/\/github.com\/auxesis\/cucumber-nagios\/issues\">GitHub issues page<\/a>.<\/p>\n\n<p>I apologise in advance if i\u2019m a tad unresponsive when replying to new issues or emails about it. I\u2019m backpacking around Europe until December so it\u2019s all a matter of finding suitable hacking time.<\/p>\n","pubDate":"08 Jun 2009","link":"https:\/\/fractio.nl\/2009\/06\/08\/cucumber-nagios-gets-a-home\/","guid":"https:\/\/fractio.nl\/2009\/06\/08\/cucumber-nagios-gets-a-home\/"},{"title":"cucumber-nagios: ready for prime time","description":"<p>I pushed out a new release of <code>cucumber-nagios<\/code> last night. Things of note:<\/p>\n\n<ul>\n\t<li><strong>It's now project focused.<\/strong> Use <code>cucumber-nagios-gen<\/code> to generate a project.  The project provides infrastructure for freezing in dependencies, so you can zip up the project directory and migrate it between machines easily. \n<\/li>\n\t<li><strong>It's released as a gem.<\/strong> You can install it with a plain old <code>gem install cucumber-nagios<\/code>. This will install everything you need to set up a project. 
<\/li>\n<\/ul>\n\n<p>To get up and running with <code>cucumber-nagios<\/code>, this is what you need to do:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\">gem sources <span class=\"nt\">-a<\/span> http:\/\/gems.github.com\ngem <span class=\"nb\">install <\/span>auxesis-cucumber-nagios\ncucumber-nagios-gen project ebay.com.au\n<span class=\"nb\">cd <\/span>ebay.com.au \nrake deps<\/code><\/pre><\/figure>\n\n<p>When you generate a project, it also spits out a <code>.bzrignore<\/code> and <code>.gitignore<\/code>, so there\u2019s no excuse not to be using version control!<\/p>\n\n<p>All the <a href=\"http:\/\/github.com\/auxesis\/cucumber-nagios\/\">previous documentation<\/a> about writing and testing features still applies. I\u2019m planning on adding support for generating features with <code>cucumber-nagios-gen<\/code> in the next release.<\/p>\n\n<p>The code and documentation <a href=\"http:\/\/github.com\/auxesis\/cucumber-nagios\/\">can now be found<\/a> on GitHub. Launchpad was giving me no love.<\/p>\n","pubDate":"05 Mar 2009","link":"https:\/\/fractio.nl\/2009\/03\/05\/cucumber-nagios-ready-for-prime-time\/","guid":"https:\/\/fractio.nl\/2009\/03\/05\/cucumber-nagios-ready-for-prime-time\/"},{"title":"The Hardcopy Books 2009 Launch Survey","description":"<p><a href=\"http:\/\/fagonfoss.com\/blog\/\">James<\/a>, <a href=\"http:\/\/twitter.com\/ycros\">Michael<\/a> and I have been working on a small startup called <a href=\"http:\/\/hardcopybooks.com.au\/\">Hardcopy Books<\/a> for the last 3 months. Here\u2019s the elevator pitch:<\/p>\n\n<blockquote>\nHardcopy Books is an online bookstore exclusively for tech books in Australia.\n\nHow are we different?\n\n<ol>\n    <li>We only stock tech books.<\/li>\n    <li>Your orders will be cheaper than through Amazon.<\/li>\n    <li>We aim to deliver within a week.<\/li>\n<\/ol>\n<\/blockquote>\n\n<p>We started it because ordering tech books in Australia sucks. 
If you order from Amazon, the books are cheap, but they can take ages to get here, and shipping is very expensive. Most local book stores don\u2019t specialise in tech books, and the ones that do have a very slow turnaround and a horribly tedious ordering process. We aim to change that.<\/p>\n\n<p>Today we\u2019re announcing the <a href=\"http:\/\/hardcopybooks.com.au\/\">Hardcopy Books 2009 Launch Survey<\/a>.<\/p>\n\n<p>We\u2019re asking you, the local tech community, to fill it out and let us know what sort of books you buy. We\u2019re going to be launching with a select group of titles, and the feedback you give us will let us know what books we need to stock up on.<\/p>\n\n<p>We\u2019re also <a href=\"http:\/\/twitter.com\/hardcopybooks\">on Twitter<\/a>, so you can keep up to date with the launch on there.<\/p>\n","pubDate":"26 Feb 2009","link":"https:\/\/fractio.nl\/2009\/02\/26\/the-hardcopy-books-2009-launch-survey\/","guid":"https:\/\/fractio.nl\/2009\/02\/26\/the-hardcopy-books-2009-launch-survey\/"},{"title":"Web app integration testing for sysadmins with cucumber-nagios","description":"<p>Interesting thought experiment:<\/p>\n\n<ul>\n  <li><a href=\"http:\/\/cukes.info\">Cucumber<\/a> is a kick-arse way of describing the behaviour of a system.<\/li>\n  <li><a href=\"http:\/\/github.com\/brynary\/webrat\/\">Webrat<\/a> makes interacting with websites blindingly easy.<\/li>\n  <li><a href=\"http:\/\/www.nagios.org\">Nagios<\/a> is the industry standard for system\/network\/application monitoring.<\/li>\n<\/ul>\n\n<p>What happens if you combine the three? You get <a href=\"http:\/\/auxesis.github.com\/cucumber-nagios\"><code>cucumber-nagios<\/code><\/a>.<\/p>\n\n<p><code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code> takes the results of a Cucumber run and outputs them in the Nagios plugin format. 
What does that actually mean?<\/p>\n\n<p>A sysadmin can describe the behaviour of a system that they manage:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-cucumber\" data-lang=\"cucumber\"><span class=\"kd\">Feature<\/span><span class=\"p\">:<\/span> google.com.au\n  It should be up\n  And I should be able to search for things\n\n  <span class=\"kn\">Scenario<\/span><span class=\"p\">:<\/span> Searching for things\n    <span class=\"err\">When I visit \"http<\/span><span class=\"p\">:<\/span><span class=\"err\">\/\/www.google.com\"<\/span>\n    <span class=\"nf\">And <\/span>I fill in <span class=\"s\">\"q\"<\/span> with <span class=\"s\">\"wikipedia\"<\/span>\n    <span class=\"nf\">And <\/span>I press <span class=\"s\">\"Google Search\"<\/span>\n    <span class=\"nf\">Then <\/span>I should see <span class=\"s\">\"www.wikipedia.org\"<\/span><\/code><\/pre><\/figure>\n\n<p>Then they can run the feature through <code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code>:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-bash\" data-lang=\"bash\"><span class=\"nv\">$ <\/span>cucumber-nagios features\/google.com.au\/search.feature\nCritical: 0, Warning: 0, 4 okay | <span class=\"nv\">value<\/span><span class=\"o\">=<\/span>4.000000<span class=\"p\">;;;;<\/span><\/code><\/pre><\/figure>\n\n<p>The curious can check out <a href=\"http:\/\/github.com\/auxesis\/cucumber-nagios\">the code on GitHub<\/a>, and the documentation on the <a href=\"http:\/\/auxesis.github.com\/cucumber-nagios\">project website<\/a>.<\/p>\n\n<p><em>UPDATE<\/em>: There have been a few changes to <code class=\"language-plaintext highlighter-rouge\">cucumber-nagios<\/code> since this post. 
Check out these\n<a href=\"http:\/\/holmwood.id.au\/~lindsay\/2009\/03\/05\/cucumber-nagios-ready-for-prime-time\/\">two<\/a>\n<a href=\"http:\/\/holmwood.id.au\/~lindsay\/2009\/06\/08\/cucumber-nagios-gets-a-home\/\">posts<\/a>\nfor more info.<\/p>\n","pubDate":"23 Feb 2009","link":"https:\/\/fractio.nl\/2009\/02\/23\/web-app-integration-testing-for-sysadmins-with-cucumber-nagios\/","guid":"https:\/\/fractio.nl\/2009\/02\/23\/web-app-integration-testing-for-sysadmins-with-cucumber-nagios\/"},{"title":"Setting :selected in Merb's select helper","description":"<p>So I don\u2019t waste another 30 minutes of my life:<\/p>\n\n<p>When using the <code>select<\/code> helper in Merb, make sure you call <code>to_s<\/code> on whatever you\u2019re setting <code>:selected<\/code> to, i.e.<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"nb\">select<\/span> <span class=\"ss\">:name<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"s2\">\"order[status]\"<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:collection<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"no\">OrderStatus<\/span><span class=\"p\">.<\/span><span class=\"nf\">all<\/span><span class=\"p\">,<\/span> \n       <span class=\"ss\">:text_method<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"ss\">:description<\/span><span class=\"p\">,<\/span> <span class=\"ss\">:value_method<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"ss\">:id<\/span><span class=\"p\">,<\/span>\n       <span class=\"ss\">:selected<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"vi\">@order<\/span><span class=\"p\">.<\/span><span class=\"nf\">status<\/span><span class=\"p\">.<\/span><span class=\"nf\">to_s<\/span><\/code><\/pre><\/figure>\n\n<p>Otherwise the helper will compare a String (the value) to an Integer (<code>:selected<\/code>), and you\u2019ll never get anything in your select selected!<\/p>\n","pubDate":"16 Feb 
2009","link":"https:\/\/fractio.nl\/2009\/02\/16\/setting-selected-in-merbs-select-helper\/","guid":"https:\/\/fractio.nl\/2009\/02\/16\/setting-selected-in-merbs-select-helper\/"},{"title":"Everything old is new again","description":"<p>It\u2019s been interesting watching the flurry of activity in data-storage land over the last few months. <a href=\"http:\/\/couchdb.apache.org\/\">CouchDB<\/a> has been improving in leaps and bounds, and multi dimensioned data stores have been getting a lot more attention in general.<\/p>\n\n<p>I started using Couch on a project a few weeks ago with the <a href=\"http:\/\/datamapper.org\/\">DataMapper<\/a> adapter, specifically for scalability and search reasons. Migrating from my test SQLite database to Couch was a breeze, however it started getting ugly after a short while.<\/p>\n\n<p>The main obstacle was that Datamapper\u2019s Couch adapter integrated pretty clunkily with DataMapper, particularly:<\/p>\n<ul>\n<li>You had to mix in <code>DataMapper::CouchResource<\/code> into your models instead of the bog standard <code>DataMapper::Resource<\/code>.<\/li>\n<li>The standard DataMapper finders didn't work, so you had to wrap all your queries in Couch views. <\/li>\n<\/ul>\n\n<p>As much as this annoyed me, <code>dm-couchrest-adapter<\/code>\u2019s <a href=\"http:\/\/geemus.com\/\">maintainer<\/a> is a smart guy, so I knew there was a good reason behind it being that way even if it wasn\u2019t immediately apparent.<\/p>\n\n<p>That good reason presented itself to me a few days ago when I started playing with <code>dm-ferret-adapter<\/code> to get full text search with Ferret on my models.<\/p>\n\n<p>I was trying to work out how you do a multi-field search, but as with most things in the Ruby world, the documentation was lacking. 
The author of the <code>dm-sphinx-adapter<\/code> <a href=\"http:\/\/groups.google.com\/group\/datamapper\/browse_thread\/thread\/afbd6bd87e26716e\/148bf9a7e2834197?#148bf9a7e2834197\">fortuitously posted<\/a> on the mailing list about how his adapter handled the problem, so I went digging around inside <code>dm-ferret-adapter<\/code>\u2019s and <code>dm-is-searchable<\/code>\u2019s internals to work out why it wasn\u2019t behaving.<\/p>\n\n<p>The crux of the problem was that DM was being too smart for its own good and tried to match fields listed in the <code>:conditions<\/code> parameter to actual fields in the database, hence passing a big string in <code>:conditions<\/code> would explode before it even hit the Ferret adapter.<\/p>\n\n<p>And thus we\u2019ve hit a fundamental problem with DataMapper\u2019s current implementation: at its core, it\u2019s still an ORM for <em>relational<\/em> databases - adapter authors are always going to be fighting an uphill battle when trying to integrate a non-relational data store.<\/p>\n\n<p>So back in Couch land, I ended up switching to the <a href=\"http:\/\/github.com\/jchris\/couchrest\">CouchRest ORM<\/a> to talk to Couch. Explaining why he wrote CouchRest as a standalone ORM, <a href=\"http:\/\/jchris.mfdz.com\/posts\/122\">Chris Anderson stated<\/a>\u2026<\/p>\n\n<blockquote>  \n(I could have written a DataMapper adapter for CouchDB, but much of DataMapper\u2019s code is based around SQL-like problems that CouchDB just doesn\u2019t have.)\n<\/blockquote>\n\n<p>\u2026 sounds just like the problems I was referring to.<\/p>\n\n<p>Anyhow, why the title of this post?<\/p>\n\n<p>Well last year I briefly hacked on some business banking code for Suncorp, and I was introduced to the wonderful world of <a href=\"http:\/\/en.wikipedia.org\/wiki\/UniVerse\">UniVerse BASIC<\/a>. 
I\u2019m guessing that almost none of the readers of this blog have ever heard of UniVerse, but some may have heard of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Pick_operating_system\">Pick<\/a>.<\/p>\n\n<p>Pick was a pre-Unix operating system and rapid application development environment initially released in 1965. Pick\u2019s killer feature was the <a href=\"http:\/\/en.wikipedia.org\/wiki\/MultiValue\">MultiValue<\/a> database (think hash table), and was specifically targeted at businesses and business analysts.<\/p>\n\n<p>You\u2019re probably thinking \u201cWoo, a hash table - why should I care about this Lindsay? My language already has Hashes\/Dictionaries\/HashMaps\/filing cabinets\u201d. Well Pick\u2019s hash table implementation was (and still is) pretty kick arse for its time. There\u2019s a query language (that\u2019s suspiciously similar but slightly different to SQL), and it backs onto an incredibly well tested on-disk data store.<\/p>\n\n<p>There\u2019s also no enforced schema, so it was particularly useful in the accounting world where relational databases with rigidly enforced schema aren\u2019t a good fit. Hence, there are <strong>a lot<\/strong> of financial applications out there written in Pick or a Pick derivative.<\/p>\n\n<p>If you\u2019ve ever worked with any of the <a href=\"http:\/\/en.wikipedia.org\/wiki\/EDIFACT\">EDIFACT<\/a> data formats, you\u2019ve indirectly worked with Pick. Those pesky separator\/terminators (<code>'+:?<\/code>) are handled by multivalue databases really well. 
If you were to represent an EDIFACT segment how UniVerse would process it:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-text\" data-lang=\"text\"># EDIFACT segment\nTVL+240493:1740::2030+JFK+MIA+DL+081+C'<\/code><\/pre><\/figure>\n\n<p>To Python tuples and dictionaries:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-python\" data-lang=\"python\"><span class=\"p\">(<\/span> <span class=\"s\">'TVL'<\/span><span class=\"p\">,<\/span> <span class=\"p\">{<\/span><span class=\"s\">'240493'<\/span><span class=\"p\">:<\/span> <span class=\"p\">(<\/span><span class=\"s\">'1740'<\/span><span class=\"p\">,<\/span> <span class=\"bp\">None<\/span><span class=\"p\">,<\/span> <span class=\"s\">'2030'<\/span><span class=\"p\">)},<\/span> <span class=\"s\">'JFK'<\/span><span class=\"p\">,<\/span> <span class=\"s\">'MIA'<\/span><span class=\"p\">,<\/span> <span class=\"s\">'DL'<\/span><span class=\"p\">,<\/span> <span class=\"s\">'081'<\/span><span class=\"p\">,<\/span> <span class=\"s\">'C'<\/span><span class=\"p\">)<\/span><\/code><\/pre><\/figure>\n\n<p>To Ruby arrays and hashes:<\/p>\n\n<figure class=\"highlight\"><pre><code class=\"language-ruby\" data-lang=\"ruby\"><span class=\"p\">[<\/span> <span class=\"s1\">'TVL'<\/span><span class=\"p\">,<\/span> <span class=\"p\">{<\/span><span class=\"s1\">'240493'<\/span> <span class=\"o\">=&gt;<\/span> <span class=\"p\">[<\/span><span class=\"s1\">'1740'<\/span><span class=\"p\">,<\/span> <span class=\"kp\">nil<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'2030'<\/span><span class=\"p\">]},<\/span>  <span class=\"s1\">'JFK'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'MIA'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'DL'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'081'<\/span><span class=\"p\">,<\/span> <span class=\"s1\">'C'<\/span><span class=\"p\">]<\/span><\/code><\/pre><\/figure>\n\n<p>By now you\u2019re thinking \u201coh god now I have to iterate over a 
whole bunch of nested data structures\u201d, but fortunately Pick\u2019s BASIC implementation provided syntax that made this pretty straightforward.<\/p>\n\n<p>Anyhow, the MultiValue technology behind Pick was licensed to roughly 3 dozen companies during the 70\u2019s and 80\u2019s, but there\u2019s been a lot of consolidation in the Pick market since then, and the main player is now actually IBM. They purchased two implementations, UniVerse and UniData, in the 90\u2019s, re-branded them as U2 (sorry, no <a href=\"http:\/\/en.wikipedia.org\/wiki\/Bono\">Bono<\/a> here), and have been continually developing them ever since.<\/p>\n\n<p>IBM have written .NET and Java interfaces to U2 data stores, there\u2019s integration with RedBack (a web application development framework), and more recently work has gone into PHP, Python, and Ruby bindings.<\/p>\n\n<p>Pick\u2019s usage in the enterprise was <em>and is still<\/em> phenomenal. Last year at IBM\u2019s <a href=\"http:\/\/www-01.ibm.com\/software\/info\/u2\/university\/index.jsp\">U2 University<\/a> in Sydney the U2 product manager quoted a statistic that the U2 team estimate <em>at least<\/em> 60% of IBM\u2019s clients are directly using either UniVerse or UniData. 
A large majority of these systems are small back office-type setups that were installed decades ago, next-to-nobody touches, but are mission-critical.<\/p>\n\n<p>So after seeing <a href=\"http:\/\/tokyocabinet.sourceforge.net\/index.html\">Tokyo Cabinet<\/a> <a href=\"http:\/\/www.igvita.com\/2009\/02\/13\/tokyo-cabinet-beyond-key-value-store\/\">do the rounds<\/a> this week in the Ruby sphere, it\u2019s pretty obvious that multi dimensioned data stores are experiencing a bit of resurgence.<\/p>\n\n<p>Google\u2019s success with <a href=\"http:\/\/en.wikipedia.org\/wiki\/BigTable\">BigTable<\/a> has kicked a lot of smart people into gear: CouchDB, <a href=\"http:\/\/hadoop.apache.org\/hbase\/\">HBase<\/a>, and Tokyo Cabinet are shining examples of awesome work being done in the DBMS sphere.<\/p>\n\n<p>What I think is going to make a difference this time:<\/p>\n<ul>\n<li><strong>Implementations are not walled gardens<\/strong>. IBM's U2 products are not open source, and have a significant monetary barrier of entry. It's a problem that the entire Pick marketplace suffers, and why there isn't a lot of young talent in the Pick sphere anymore. \n<\/li>\n\n<li><strong>The multi dimensioned data paradigm maps really well onto existing (and popular!) interchange formats<\/strong>. Take a look at JSON - its take up over the last few years has been impressive to say the least. YAML is another great example. They succeed where rigid data formats don't fit. Also, they're not the new EDIFACT. \n<\/li>\n\n<li><strong>Developers are hitting barriers with RDBMSes.<\/strong> If there's one thing we can learn from the hype-fest that was \"Web 2.0\", it's that scalability is hard. Multi dimensioned databases aren't a magical elixir for the scalability problems of developers around the world, but they <em>do<\/em> prompt people to think of alternate ways of storing their data. 
\n<\/li>\n<\/ul>\n\n<p>It\u2019ll be interesting to see whether the industry will start taking up multi dimensioned data stores en masse any time soon.<\/p>\n","pubDate":"15 Feb 2009","link":"https:\/\/fractio.nl\/2009\/02\/15\/everything-old-is-new-again\/","guid":"https:\/\/fractio.nl\/2009\/02\/15\/everything-old-is-new-again\/"},{"title":"Following bushfire activity in NSW: @nswbushfires vs @nswrfs","description":"<p>On Sunday I <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2009\/02\/08\/nsw-rural-fire-service-updates-on-twitter\/\">posted about<\/a> a Twitter bot (<a href=\"http:\/\/twitter.com\/nswbushfires\">@nswbushfires<\/a>) I quickly hacked up to post current incident updates to Twitter.<\/p>\n\n<p>People following the <a href=\"http:\/\/search.twitter.com\/search?q=%23bushfires\">bushfires on Twitter<\/a> may have noticed that the <a href=\"http:\/\/www.rfs.nsw.gov.au\/\">Rural Fire Service<\/a> launched an official Twitter bot (<a href=\"http:\/\/twitter.com\/nswrfs\">@nswrfs<\/a>) this morning containing information on major fire updates.<\/p>\n\n<p>I had a brief chat with the Manager of Online Communications from the RFS this afternoon about the datasets available on their site, how data is generated within the RFS, and how that data can best be used.<\/p>\n\n<p>Basically their bot aggregates <a href=\"http:\/\/www.rfs.nsw.gov.au\/dsp_content.cfm?cat_id=684\">major fire updates<\/a>, which contain information on incidents that may directly affect people or property, and how people should respond. The announcements are a digital form of what gets syndicated to news outlets, and are crafted by the RFS communications team. Generally this information is up-to-the-minute.<\/p>\n\n<p>On the other hand, my bot aggregates the <a href=\"http:\/\/www.rfs.nsw.gov.au\/dsp_content.cfm?cat_id=683\">list of current incidents<\/a>, which is an extract of an internal RFS system used by people on the ground to track their handling of fires. 
The data in the current incidents list can potentially be several hours out of date, as it\u2019s quite often entered into their internal system after the incident has been handled. That said, it provides a state-wide overview of RFS activity and can be useful for tracking non-critical bushfire activity in your area.<\/p>\n\n<p>So for people wanting to follow bushfire activity in NSW, I would highly recommend following <em>both<\/em> bots.<\/p>\n","pubDate":"10 Feb 2009","link":"https:\/\/fractio.nl\/2009\/02\/10\/following-bushfire-activity-in-nsw-nswbushfires-vs-nswrfs\/","guid":"https:\/\/fractio.nl\/2009\/02\/10\/following-bushfire-activity-in-nsw-nswbushfires-vs-nswrfs\/"},{"title":"NSW Rural Fire Service updates on Twitter","description":"<p>I just scraped together a Twitter bot to post updates from the NSW Rural Fire Service\u2019s <a href=\"http:\/\/www.rfs.nsw.gov.au\/dsp_content.cfm?cat_id=683\">Current Incidents<\/a> list.<\/p>\n\n<p>If you\u2019re in NSW, follow <a href=\"http:\/\/twitter.com\/nswbushfires\">@nswbushfires<\/a> on Twitter.<\/p>\n","pubDate":"08 Feb 2009","link":"https:\/\/fractio.nl\/2009\/02\/08\/nsw-rural-fire-service-updates-on-twitter\/","guid":"https:\/\/fractio.nl\/2009\/02\/08\/nsw-rural-fire-service-updates-on-twitter\/"},{"title":"Meet the newest Holmwood","description":"<p>Almost a week ago today, Julia and I tied the knot.<\/p>\n\n<p><img alt=\"Julia and I signing the marriage certificate\" src=\"http:\/\/farm4.static.flickr.com\/3114\/3158068955_3e53cc0275.jpg?v=0\" title=\"Julia and I signing the marriage certificate\" width=\"500\" height=\"334\" \/><\/p>\n\n<p>You can find photos <a href=\"http:\/\/flickr.com\/photos\/tags\/julialindsay2008\">on Flickr<\/a>, and highlights on <a href=\"http:\/\/holmwood.id.au\/~julia\/\">Julia\u2019s brand new blog<\/a>.<\/p>\n\n","pubDate":"04 Jan 
2009","link":"https:\/\/fractio.nl\/2009\/01\/04\/meet-the-new-holmwood\/","guid":"https:\/\/fractio.nl\/2009\/01\/04\/meet-the-new-holmwood\/"},{"title":"gotgastro.com launched","description":"<p>Gastro has been relaunched at <a href=\"http:\/\/gotgastro.com\/\">gotgastro.com<\/a>!<\/p>\n\n<p>DNS issues should be all fixed, so go nuts and share the site.<\/p>\n","pubDate":"12 Dec 2008","link":"https:\/\/fractio.nl\/2008\/12\/12\/gotgastro-launched\/","guid":"https:\/\/fractio.nl\/2008\/12\/12\/gotgastro-launched\/"},{"title":"New Gastro features, and tech details","description":"<p>I pushed two new features on <a href=\"http:\/\/gastro.unstated.net\/\">Gastro<\/a> last night: <a href=\"http:\/\/en.wikipedia.org\/wiki\/GeoRSS\">GeoRSS<\/a> support, and a <a href=\"http:\/\/en.wikipedia.org\/wiki\/JSON\">JSON<\/a> export.<\/p>\n\n<p>This should make it a lot easier to take the data i\u2019ve scraped and reuse it. Feel free to grab it and play with it as you will!<\/p>\n\n<p>You should be able to take the GeoRSS feed and pipe it straight into Google Maps (paste the feed url into the search bar), however GMaps doesn\u2019t seem to like fetching the feed.<\/p>\n\n<p>It might be something to do with a reported <a href=\"http:\/\/holmwood.id.au\/~lindsay\/2008\/12\/10\/behold-gastro\/#comment-228038\">DNS issue<\/a>. 
If anyone has a free second to do a <code>dig<\/code> and see if anything is boned I would be very appreciative.<\/p>\n\n<p>The new release also has a bunch of performance improvements (i\u2019ve basically halved the app load time), and the info windows on the map points display the restaurant names.<\/p>\n\n<p>Simon wanted to <a href=\"http:\/\/www.rumble.net\/blog\/index.cgi\/geek\/Geo_mashups2.html\">know the tech details<\/a>, so here they are:<\/p>\n\n<p>The mashup is a collection of small Ruby apps.<\/p>\n\n<p>There\u2019s a scraper, which handles caching copies of the food authority site locally, extracting meaningful data from the cached copies, geocoding against the extracted address data, and writing that data out in an easy-to-parse format (<a href=\"http:\/\/www.yaml.org\/\">YAML<\/a>).<\/p>\n\n<p>The geocoding is done against Google\u2019s geocoding service, using the <a href=\"http:\/\/ym4r.rubyforge.org\/\">ym4r<\/a> library. I serialise and store the lat\/lng data back to the YAML.<\/p>\n\n<p>The website is written in <a href=\"http:\/\/merbivore.com\/\">Merb<\/a> (though I am considering rewriting it in <a href=\"http:\/\/nanoc.stoneship.org\/\">nanoc<\/a>), and uses an in-memory Sqlite3 database that\u2019s created from the YAML data every time the app boots. This means the app takes about 10 seconds to boot, but access to all the data is pretty damn quick once the app is up.<\/p>\n\n<p>I\u2019m caching the JSON\/RSS\/HTML pages just in case it gains momentum and my server gets hammered. I\u2019m using the standard <a href=\"http:\/\/wiki.merbivore.com\/cache\">Merb::Cache<\/a> caching mechanism, which generates new page fragments every 30 minutes.<\/p>\n\n<p>The app is running under 2 Mongrels with apache2 + mod_proxy_balancer and it seems to cope ok. 
I\u2019ve had 70 unique visits in the last 24 hours so i\u2019m not particularly worried about load.<\/p>\n\n<p>If you want to check out the code, you can grab a copy of my repo with a<\/p>\n\n<blockquote>\n<pre>\n$ bzr branch http:\/\/holmwood.id.au\/~lindsay\/code\/gastro\n<\/pre>\n<\/blockquote>\n\n<p>There are a bunch of rough edges (no documentation, notice urls aren\u2019t serialised yet, ~70 addresses can\u2019t be geocoded, and geocode data can be easily overwritten), but interested folk might like a stickybeak.<\/p>\n\n<p>Spread the word if you like it! New features are pending. :-)<\/p>\n","pubDate":"11 Dec 2008","link":"https:\/\/fractio.nl\/2008\/12\/11\/new-gastro-features-and-tech-details\/","guid":"https:\/\/fractio.nl\/2008\/12\/11\/new-gastro-features-and-tech-details\/"},{"title":"Behold: Gastro!","description":"<p>Dear Simon,\n<a href=\"http:\/\/www.rumble.net\/blog\/index.cgi\/geek\/Geo_mashups.html\">Ask<\/a> and ye <a href=\"http:\/\/gastro.unstated.net\/\">shall receive<\/a>.<\/p>\n\n<p>Regards,\nLindsay<\/p>\n","pubDate":"10 Dec 2008","link":"https:\/\/fractio.nl\/2008\/12\/10\/behold-gastro\/","guid":"https:\/\/fractio.nl\/2008\/12\/10\/behold-gastro\/"},{"title":"Why scripting your server installs is a bad idea (and configuration management is awesome)","description":"<p>I was reading the FiveRuns blog post on <a href=\"http:\/\/blog.fiveruns.com\/2008\/10\/20\/automatic-production-rails\">automating your Rails server configuration<\/a>, and had a peek at the <a href=\"http:\/\/github.com\/mmond\/configuration-automation\/tree\/master%2Fconfigure_ubuntu_eldorado.sh?raw=true\">script they included<\/a> to set up a machine.<\/p>\n\n<p>There are some fundamental problems with it.<\/p>\n\n<p>What happens when the machine can\u2019t reach RubyForge to download the RubyGems tarball? 
The rest of the script will continue to run and error out horribly, leaving you to wade through a screenful of error output to work out what died.<\/p>\n\n<p>What about when you need to update your production systems with new packages or configuration files? You\u2019ll probably log into all of them and apply the update (or maybe you\u2019ll write a script to do it). But what about new machines that you roll out? Do you update the install script with the new tasks? Do you use the old install method and then apply the new configuration?<\/p>\n\n<p>These problems are vaguely manageable with a small number of machines, but as soon as you grow beyond 3 servers it becomes tedious and painful - deployment is hard.<\/p>\n\n<p>I learnt this the hard way: I started out my sysadmin career building systems like this. The first time I did this I thought it was great. I wrote a ~400-line provisioning script that would set up a PXE-booted machine with Samba\/Apache\/OpenLDAP. This worked exceptionally well in my test environment. Production was another story.<\/p>\n\n<p>I had inadvertently optimised my deployment for <em>provisioning<\/em>, not for <em>maintenance<\/em>.<\/p>\n\n<p>When you consider the lifecycle of a system, &lt; 0.001% of its life is spent being provisioned. The other 99.999% is the server sitting there serving requests, having new software deployed to it, or having its configuration tweaked.<\/p>\n\n<p>Putting all the logic into the provisioning process is setting yourself up for failure. To pull it off, you have to know exactly what the machine is going to be doing <em>for its entire lifecycle<\/em>. Prescience is not a common trait amongst humans.<\/p>\n\n<p>So the project requirements change, and the time comes to add another service to the system. How do you manage this? <code>ssh<\/code>-in-a-for-loop is the obvious response, but then you\u2019d be perpetuating the problem you created for yourself in the first place. 
And what about new systems? Oh the pain!<\/p>\n\n<p><a href=\"http:\/\/flickr.com\/photos\/40389360@N00\/2428706650\/\"><img src=\"http:\/\/farm4.static.flickr.com\/3101\/2428706650_d1fc862fdc.jpg?v=0\" alt=\"House of Cards by Indenture\" title=\"House of Cards by Indenture\" \/><\/a><\/p>\n\n<p>Unfortunately there\u2019s no easy way out of this mess. You can start from scratch, or push through and waste countless hours of your life maintaining a thousand houses of cards. And you can only start from scratch so many times before it gets equally tedious and fragile.<\/p>\n\n<p><strong>Configuring your systems with magical scripts is not maintainable.<\/strong> Competent sysadmins around the world have a responsibility to stamp out deployment techniques of this ilk.<\/p>\n\n<p>This is <a href=\"http:\/\/www.google.com.au\/search?hl=en&amp;q=capistrano+apt-get+OR+emerge\">how some people use<\/a> <a href=\"http:\/\/www.capify.org\/\">Capistrano<\/a>, and it gives me the willies. Invoking your distro\u2019s package manager from Capistrano is no better than doing it from a shell script. Use the tool for what it\u2019s built for: automating the deployment of your <em>application<\/em>.<\/p>\n\n<p>So what do I recommend? Configuration management is a much better solution to this problem. Instead of maintaining separate procedures for provisioning servers and doing change management, you can merge them into one.<\/p>\n\n<p><a href=\"http:\/\/puppet.reductivelabs.com\/\">Puppet<\/a> is my configuration management system of choice, and I use it extensively across all the infrastructure I manage. 
I do this for three reasons:<\/p>\n<ol>\n\t<li><em>I forget how machines are configured.<\/em> I work on a lot of different machines (I touch a minimum of 20 different boxes every week), and it's nigh on impossible to remember how each of them is configured.<\/li>\n\t<li><em>Not all operating systems are configured equal.<\/em> The machines are running a mish-mash of Ubuntu, Debian, Red Hat, CentOS, Fedora, Mac OS X, &amp; OpenBSD, and they're not all on the same version. There's no way i'm going to remember each operating system's eccentricities.<\/li>\n\t<li><em>Other people maintain these machines.<\/em> Someone will log in to the machine and change something without telling me. This is a good thing (I am not a single point of failure), as long as we're using some sort of configuration management (I know who to blame when something explodes 6 months down the line (probably me :-)).<\/li>\n<\/ol>\n\n<p>Combining Puppet with a <a href=\"http:\/\/bazaar-vcs.org\/\">distributed<\/a> <a href=\"http:\/\/git.or.cz\/\">version<\/a> <a href=\"http:\/\/www.darcs.net\/\">control<\/a> system is a solid one-two punch to your provisioning and maintenance woes.<\/p>\n\n<p><a href=\"http:\/\/flickr.com\/photos\/frankbb\/2955838844\/\"><img src=\"http:\/\/farm4.static.flickr.com\/3215\/2955838844_726f92fccb.jpg?v=0\" alt=\"Fight for your rights by Frank_BB\" title=\"Fight for your rights by Frank_BB\" \/><\/a><\/p>\n\n<p>Hopefully i\u2019ve convinced you to use Puppet. The first question you probably have is <em>\u201chow do I use puppet to manage <code>$app_built_with_my_framework_of_choice<\/code>?\u201d<\/em>.<\/p>\n\n<p>Just like I recommend not using Capistrano to configure your servers, I recommend using Puppet to manage the infrastructure <em>around<\/em> your application. 
Your Puppet manifests should manage web + database + mail servers, monitoring, system package repositories, system utilities, and the standard library of the language your app is written in.<\/p>\n\n<p>This makes the application side of deployment really simple: for Ruby folk, I recommend bundling <em>all your gems<\/em> with your application. You should be able to plop your app on any machine that has nothing more than the Ruby interpreter and standard library, and have the app run.<\/p>\n\n<p>The end result is configuration that can be applied to a new machine in 10 minutes with Puppet, and a seamless app deployment in less than 5 using Cap or <a href=\"http:\/\/rubyhitsquad.com\/Vlad_the_Deployer.html\">Vlad<\/a>.<\/p>\n\n<p>When you need to scale further down the line, you can simply apply that same configuration to a new node. Making changes across all your machines becomes an order of magnitude easier - update the manifests, push to your Puppetmaster, have the machines update themselves.<\/p>\n\n<p>What could be simpler?<\/p>\n","pubDate":"04 Nov 2008","link":"https:\/\/fractio.nl\/2008\/11\/04\/why-scripting-your-server-install-is-a-bad-idea\/","guid":"https:\/\/fractio.nl\/2008\/11\/04\/why-scripting-your-server-install-is-a-bad-idea\/"},{"title":"Wake from sleep when lid closed on MacBook Pro","description":"<p>I\u2019ve been having a problem with my MacBook where it wakes up when the lid is closed, which leaves me hot and bothered when it\u2019s in my bag.<\/p>\n\n<p>Apparently it\u2019s a known problem on early-model MacBooks, caused by the magnet that determines whether the lid is open getting shifted when the lid gets bumped.<\/p>\n\n<p>The easy fix is to run:<\/p>\n<blockquote>\n<pre>\n$ sudo pmset -a lidwake 0\n<\/pre>\n<\/blockquote>\n\n<p>which will disable waking the machine when the lid is opened. 
Mash the keyboard instead, and you\u2019ll be right as rain.<\/p>\n\n<p>Thanks to these <a href=\"http:\/\/forums.macrumors.com\/showthread.php?t=319123\">forum<\/a> <a href=\"http:\/\/discussions.apple.com\/thread.jspa?threadID=621401&amp;tstart=0\">posts<\/a>.<\/p>\n\n<p>Also, this command prints a real-time log of all power management activity detected by OS X:<\/p>\n<blockquote>\n<pre>\n$ pmset -g pslog\n<\/pre>\n<\/blockquote>\n","pubDate":"25 Oct 2008","link":"https:\/\/fractio.nl\/2008\/10\/25\/wake-from-sleep-when-lid-closed-on-macbook-pro\/","guid":"https:\/\/fractio.nl\/2008\/10\/25\/wake-from-sleep-when-lid-closed-on-macbook-pro\/"},{"title":"Easier fonts with X11 on OSX","description":"<blockquote><pre>\n$ ln -sf ~\/Library\/Fonts\/ .font\n<\/pre><\/blockquote>\n\n<p>can save <a href=\"https:\/\/bugs.launchpad.net\/inkscape\/+bug\/215906\">a lot of hassle<\/a> with fonts in different X11 apps on OS X.<\/p>\n","pubDate":"21 Oct 2008","link":"https:\/\/fractio.nl\/2008\/10\/21\/easier-fonts-with-x11-on-osx\/","guid":"https:\/\/fractio.nl\/2008\/10\/21\/easier-fonts-with-x11-on-osx\/"},{"title":"collectdmon for a crashing collectd","description":"<p>I\u2019ve recently been having problems with collectd crashing without notice on a server aggregating a large amount of stats from ~20 nodes. Initially I set up a shell script to monitor whether it\u2019s up and restart it, but there\u2019s a much more elegant solution in the form of <a href=\"http:\/\/collectd.org\/documentation\/manpages\/collectdmon.1.shtml\">collectdmon<\/a>.<\/p>\n\n<p>Its design is really simple: <code>collectdmon<\/code> starts and runs <code>collectd<\/code> with the <code>-f<\/code> flag, causing <code>collectd<\/code> to run in the foreground. If <code>collectd<\/code> exits for whatever reason, <code>collectdmon<\/code> will just catch it (because it\u2019s waiting for it to exit), and start it back up. 
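<\/p>\n\n<p>The supervision pattern itself fits in a few lines of shell. This is only a sketch of the idea (collectdmon is a C program, and <code>\/bin\/false<\/code> stands in here for a crashing <code>collectd -f<\/code>):<\/p>\n

```shell
# Sketch of the collectdmon pattern: run the child in the foreground
# and restart it each time it exits. /bin/false stands in for a
# crashing `collectd -f`; a real supervisor would loop forever.
restarts=0
while [ "$restarts" -lt 3 ]; do
    /bin/false || true    # child runs in the foreground until it exits
    restarts=$((restarts + 1))
done
echo "child exited and was restarted $restarts times"
```

\n<p>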
You can also send signals to <code>collectdmon<\/code> to restart or shut down the <code>collectd<\/code> process at any time.<\/p>\n\n<p>The only thing left to do is modify the init script to start <code>collectd<\/code> with <code>collectdmon<\/code>. On Red Hat I did this with the following modification:<\/p>\n\n<blockquote>\n<pre lang=\"diff\">\ndiff -u etc\/rc.d\/init.d\/collectd \/etc\/init.d\/collectd \n--- etc\/rc.d\/init.d\/collectd    2008-10-14 05:15:29.000000000 +1100\n+++ \/etc\/init.d\/collectd        2008-10-10 00:27:17.000000000 +1100\n@@ -25,7 +25,8 @@\n        echo -n $\"Starting $prog: \"\n        if [ -r \"$CONFIG\" ]\n        then\n-               daemon \/usr\/sbin\/collectd -C \"$CONFIG\"\n+               daemon collectdmon -c \/usr\/sbin\/collectd -P \/var\/run\/collectdmon.pid -- -C \"$CONFIG\"\n                RETVAL=$?\n                echo\n                [ $RETVAL -eq 0 ] &amp;&amp; touch \/var\/lock\/subsys\/$prog\n@@ -33,7 +34,7 @@\n }\n stop () {\n        echo -n $\"Stopping $prog: \"\n-       killproc $prog\n+       killproc collectdmon\n        RETVAL=$?\n        echo\n        [ $RETVAL -eq 0 ] &amp;&amp; rm -f \/var\/lock\/subsys\/$prog\n<\/pre>\n<\/blockquote>\n\n","pubDate":"14 Oct 2008","link":"https:\/\/fractio.nl\/2008\/10\/14\/collectdmon-for-a-crashing-collectd\/","guid":"https:\/\/fractio.nl\/2008\/10\/14\/collectdmon-for-a-crashing-collectd\/"},{"title":"Slides from Merbcamp deployment talk...","description":"<p>\u2026can be found <a href=\"http:\/\/www.slideshare.net\/auxesis\/deploying-merb-presentation\/\">on slideshare<\/a>, or just below.<\/p>\n\n<div style=\"width:425px;text-align:left;margin:auto;margin-top: 1em;\" id=\"__ss_652745\">\n\t<object style=\"margin:0px\" width=\"425\" height=\"355\" type=\"application\/x-shockwave-flash\" data=\"http:\/\/static.slideshare.net\/swf\/ssplayer2.swf?doc=deployingmerb-1223834740710831-8&amp;stripped_title=deploying-merb-presentation\">\n\t\t<param name=\"movie\" 
value=\"http:\/\/static.slideshare.net\/swf\/ssplayer2.swf?doc=deployingmerb-1223834740710831-8&amp;stripped_title=deploying-merb-presentation\" \/>\n\t\t<param name=\"allowFullScreen\" value=\"true\" \/>\n\t\t<param name=\"allowScriptAccess\" value=\"always\" \/>\n\t<\/object>\n<\/div>\n\n<p>The feedback so far has been really positive!<\/p>\n","pubDate":"13 Oct 2008","link":"https:\/\/fractio.nl\/2008\/10\/13\/slides-from-merbcamp-deployment-talk\/","guid":"https:\/\/fractio.nl\/2008\/10\/13\/slides-from-merbcamp-deployment-talk\/"},{"title":"Setting session name in screen","description":"<p>It\u2019s always annoying trying to remember what pid maps to a <a href=\"http:\/\/www.gnu.org\/software\/screen\/\">screen<\/a> session, but it\u2019s possible to actually name the session.<\/p>\n\n<p>When starting a screen session, you can start it with arguments:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>screen -S sessionname\n<\/code><\/pre><\/div><\/div>\n\n<p>Or within a current screen session, you can do:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>C-a :\nsessionname your_session_name\n<\/code><\/pre><\/div><\/div>\n\n<p>Now when you do a <code class=\"language-plaintext highlighter-rouge\">screen -ls<\/code>, your sessions show up with meaningful names:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>There are screens on:\n        12117.irc      (Detached)\n        9905.code      (Multi, detached)\n        14850.projectx (Multi, attached)\n3 Sockets in \/var\/run\/screen\/S-lindsay.\n<\/code><\/pre><\/div><\/div>\n\n<p>And connecting to them is as easy as:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>screen -r projectx\n<\/code><\/pre><\/div><\/div>\n","pubDate":"29 Sep 
2008","link":"https:\/\/fractio.nl\/2008\/09\/29\/setting-session-name-in-screen\/","guid":"https:\/\/fractio.nl\/2008\/09\/29\/setting-session-name-in-screen\/"},{"title":"Heading to Merbcamp","description":"<p>I\u2019m heading to <a href=\"http:\/\/merbcamp.com\/\">MerbCamp<\/a> in San Diego in just under two weeks, where I\u2019ll be talking about <a href=\"http:\/\/merbcamp.com\/#schedule\">deploying Merb<\/a>.<\/p>\n\n<p>Still so much to do before I head off, but I\u2019m looking forward to meeting fellow Merbists.<\/p>\n","pubDate":"28 Sep 2008","link":"https:\/\/fractio.nl\/2008\/09\/28\/heading-to-merbcamp\/","guid":"https:\/\/fractio.nl\/2008\/09\/28\/heading-to-merbcamp\/"},{"title":"Sane Ruby on Hardy","description":"<p>Recently, there have been a few happenings in Ruby land that break things a bit on Ubuntu Hardy.<\/p>\n\n<p>Firstly, there were <a href=\"http:\/\/www.rubyinside.com\/june-2008-ruby-security-vulnerabilities-927.html\">two<\/a> <a href=\"http:\/\/www.rubyinside.com\/new-vulnerabilities-discovered-in-ruby-august-2008-1006.html\">rounds<\/a> of security vulnerabilities. This doesn\u2019t break things per se; however, there\u2019s been a significant lag time between the Ruby team patching it and Ubuntu releasing updated packages.<\/p>\n\n<p>Secondly, the <a href=\"http:\/\/www.rubygems.org\/\">RubyGems<\/a> team pushed a new backwards-incompatible release that changed the metadata format that Gem repositories use. The gems published on <a href=\"http:\/\/gems.rubyforge.org\/\">gems.rubyforge.org<\/a> pretty much stopped working after that, leaving people running RubyGems &lt; 1.2.0 (almost everyone) out in the cold. 
And by \u201cout in the cold\u201d I mean they couldn\u2019t install gems, and the gem command would consume all available memory.<\/p>\n\n<p>The supplied <code>rubygems-update-1.2.0.gem<\/code> doesn\u2019t work on Ubuntu because of the way RubyGems is packaged, and the plain old gem update doesn\u2019t work due to metadata incompatibilities, making it practically impossible to fix unless you install from source or find some magical packages.<\/p>\n\n<p>Fortunately, the <a href=\"https:\/\/launchpad.net\/~ubuntu-ruby\/+archive\">ubuntu-ruby<\/a> team have created these magical packages. There are two repos, <a href=\"https:\/\/launchpad.net\/~ubuntu-ruby\">ubuntu-ruby<\/a> &amp; <a href=\"https:\/\/launchpad.net\/~ubuntu-ruby-backports\/+archive\">ubuntu-ruby-backports<\/a>. Backports is recommended as being more stable; however, I\u2019ve been running the \u2018bleeding edge\u2019 repo for the last month without any problems.<\/p>\n\n<p>Choose the repo you want, and add it to your \/etc\/apt\/sources.list:<\/p>\n\n<blockquote>\n<pre>\ndeb http:\/\/ppa.launchpad.net\/ubuntu-ruby\/ubuntu hardy main\ndeb http:\/\/ppa.launchpad.net\/ubuntu-ruby-backports\/ubuntu hardy main\n<\/pre>\n<\/blockquote>\n\n<p>One <code>apt-get update &amp;&amp; apt-get upgrade<\/code> later, and you\u2019ll have updated package goodness.<\/p>\n","pubDate":"09 Sep 2008","link":"https:\/\/fractio.nl\/2008\/09\/09\/sane-ruby-on-hardy\/","guid":"https:\/\/fractio.nl\/2008\/09\/09\/sane-ruby-on-hardy\/"},{"title":"Practical performance monitoring tooling on Linux","description":"<p>While browsing my feeds this morning, I came across this article on HowtoForge called <a href=\"http:\/\/www.howtoforge.com\/extract-values-from-top-and-plot-them\">How To Extract Values From top And Plot Them<\/a>.<\/p>\n\n<p>The first thing that struck me about the article was how they were going about solving their problem:<\/p>\n\n<blockquote>\n  <p><em>\u201cMany researchers who are doing performance evaluation and 
benchmarking need to capture the values of the CPU and the RAM.\u201d<\/em><\/p>\n<\/blockquote>\n\n<p>Jesus, if that\u2019s your requirement, you\u2019re using the wrong tool. There are plenty of utilities out there that will do exactly what you\u2019re looking for:<\/p>\n\n<ul>\n  <li><a href=\"http:\/\/pagesperso-orange.fr\/sebastien.godard\/man_iostat.html\"><code class=\"language-plaintext highlighter-rouge\">iostat<\/code><\/a>: iostat (as the name suggests) reports I\/O-related statistics. It can tell you all about CPU, device, partition, and NFS utilisation.<\/li>\n<\/ul>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code># report cpu statistics\n$ iostat -c 5 5\nLinux 2.6.24-12-generic (theodor) \t22\/03\/08\n\navg-cpu:  %user   %nice %system %iowait  %steal   %idle\n         6.73    0.04    2.20    1.32    0.00   89.71\n\navg-cpu:  %user   %nice %system %iowait  %steal   %idle\n         1.06    0.00    1.06    0.00    0.00   97.87\n\navg-cpu:  %user   %nice %system %iowait  %steal   %idle\n         0.97    0.00    1.07    0.00    0.00   97.95\n\navg-cpu:  %user   %nice %system %iowait  %steal   %idle\n         1.07    0.00    1.07    0.00    0.00   97.86\n\navg-cpu:  %user   %nice %system %iowait  %steal   %idle\n         0.88    0.00    0.59    1.37    0.00   97.17\n\n# report device utilisation of \/dev\/sda and all its partitions.\n# display in megabytes per second.\n# take 5-second samples indefinitely.\n$ iostat -p sda -m 5\nLinux 2.6.24-12-generic (theodor)       22\/03\/08\n\navg-cpu:  %user   %nice %system %iowait  %steal   %idle\n         6.73    0.04    2.20    1.32    0.00   89.71\n\nDevice:            tps    MB_read\/s    MB_wrtn\/s    MB_read    MB_wrtn\nsda               5.58         0.02         0.04      12383      30201\nsda1              7.42         0.01         0.03       4054      18951\nsda2              0.24         0.00         0.00        219        439\nsda3              
4.39         0.01         0.02       8109      10809\n\navg-cpu:  %user   %nice %system %iowait  %steal   %idle\n         2.52    0.00    1.26    0.00    0.00   96.22\n\nDevice:            tps    MB_read\/s    MB_wrtn\/s    MB_read    MB_wrtn\nsda               0.00         0.00         0.00          0          0\nsda1              0.00         0.00         0.00          0          0\nsda2              0.00         0.00         0.00          0          0\nsda3              0.00         0.00         0.00          0          0\n\n^C\n\n<\/code><\/pre><\/div><\/div>\n\n<ul>\n  <li><a href=\"http:\/\/pagesperso-orange.fr\/sebastien.godard\/man_sar.html\"><code class=\"language-plaintext highlighter-rouge\">sar<\/code><\/a>: sar is a brilliant little tool for getting stats on all manner of system activity. Some example usage:<\/li>\n<\/ul>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code># report CPU utilisation. collect 5 seconds worth of data, 5 times.\n$ sar -u 5 5\nLinux 2.6.24-12-generic (theodor)       22\/03\/08\n\t\n12:57:26        CPU     %user     %nice   %system   %iowait    %steal     %idle\n12:57:31        all      1.07      0.00      1.07      3.12      0.00     94.73\n12:57:36        all      2.42      0.00      1.26      0.00      0.00     96.33\n12:57:41        all      1.54      0.00      1.15      0.00      0.00     97.31\n12:57:46        all      1.55      0.00      1.36      0.00      0.00     97.09\n12:57:51        all      1.84      0.00      1.55      0.00      0.00     96.61\nAverage:        all      1.68      0.00      1.28      0.62      0.00     96.42\n\t\n# report memory and swap utilisation\n$ sar -r 5 5\nLinux 2.6.24-12-generic (theodor)       22\/03\/08\n\t\n12:59:02    kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad\n12:59:07       947744   1086012     53.40    110560    278340   1131764    334868     22.83    163624\n12:59:12       
947744   1086012     53.40    110560    278340   1131764    334868     22.83    163656\n12:59:17       947744   1086012     53.40    110560    278340   1131764    334868     22.83    163656\n12:59:22       947704   1086052     53.40    110560    278372   1131764    334868     22.83    163656\n12:59:27       947688   1086068     53.40    110560    278372   1131764    334868     22.83    163656\nAverage:       947725   1086031     53.40    110560    278353   1131764    334868     22.83    163650\n\t\n# report paging statistics\n$ sar -B 5 5\nLinux 2.6.24-12-generic (theodor)       22\/03\/08\n\t\n13:01:50     pgpgin\/s pgpgout\/s   fault\/s  majflt\/s  pgfree\/s pgscank\/s pgscand\/s pgsteal\/s    %vmeff\n13:01:55         0.00      0.80    439.24      0.00   1046.81      0.00      0.00      0.00      0.00\n13:02:00         0.00      1.59    478.88      0.00   1038.84      0.00      0.00      0.00      0.00\n13:02:05         0.00      0.00    439.12      0.00    984.83      0.00      0.00      0.00      0.00\n13:02:10         0.00      0.00    477.25      0.00   1011.58      0.00      0.00      0.00      0.00\n13:02:15         0.00      0.00    443.06      0.00    990.34      0.00      0.00      0.00      0.00\nAverage:         0.00      0.48    455.53      0.00   1014.54      0.00      0.00      0.00      0.00\n<\/code><\/pre><\/div><\/div>\n\n<p>You\u2019ll notice that the <code class=\"language-plaintext highlighter-rouge\">iostat<\/code> and <code class=\"language-plaintext highlighter-rouge\">sar<\/code> outputs are very similar. In fact, they\u2019re almost identical. That\u2019s because they\u2019re from the same project: <a href=\"http:\/\/pagesperso-orange.fr\/sebastien.godard\/\">Sysstat<\/a>.<\/p>\n\n<p>Sysstat is a fantastic collection of utilities for doing performance monitoring. Because they all share a common backend, you\u2019re able to do some pretty interesting things with them.<\/p>\n\n<p>That outputted data we were seeing before? 
That can be saved and extracted really easily - you can even specify time ranges:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code># collect a 10-second sample of network device stats\n# log them to the file 'sar-network'\n$ sar -n DEV 10 -o sar-network\n\n# ...a whole bunch of output\n\n# retrieve network device stats from the sar-network file\n$ sar -n DEV -f sar-network\n\n# ...even more output\n\n# extract the same data, but only between 13:24:00 and 13:25:30\n$ sar -n DEV -s 13:24:00 -e 13:25:30 -f sar-network\n<\/code><\/pre><\/div><\/div>\n\n<p>Newer versions of Sysstat include <a href=\"http:\/\/pagesperso-orange.fr\/sebastien.godard\/man_pidstat.html\">pidstat<\/a>, which reports detailed per-process stats including I\/O, page faults, memory utilisation, context switches, CPU time, and can even report on child processes and threads.<\/p>\n\n<p>You can find all these tools in Red Hat- and Debian-based distros in the <code class=\"language-plaintext highlighter-rouge\">sysstat<\/code> package.<\/p>\n\n<p>If you\u2019re looking for something a bit more user-friendly in its output (but not necessarily as detailed), there\u2019s always <a href=\"http:\/\/dag.wieers.com\/home-made\/dstat\/\">Dstat<\/a>. 
It essentially collects the same information as all the Sysstat tools, but displays it in a much nicer format (with colourisation, and in cleanly labeled columns).<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code># show cpu, disk, and network stats\n$ dstat -cdn\n----total-cpu-usage---- -dsk\/total- -net\/total-\nusr sys idl wai hiq siq| read  writ| recv  send\n  7   2  90   1   0   0|  18k   44k|   0     0\n  2   1  97   0   0   0|   0     0 |   0     0\n  2   0  98   0   0   0|   0     0 |   0     0\n  0   1  99   0   0   0|   0     0 |   0     0\n  1   1  97   0   0   0|   0     0 |   0     0\n  0   1  98   0   0   0|   0     0 |   0     0\n  0   1  99   0   0   0|   0     0 |   0     0\n  1   0  98   0   0   0|   0     0 |   0     0\n\n^C\n<\/code><\/pre><\/div><\/div>\n\n<p>You can also output the statistics in CSV, which makes it really easy to import the data into a spreadsheet.<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code># output file lock, tcp socket, and unix socket statistics to stats.csv\n$ dstat --lock --tcp --unix --output stats.csv\n<\/code><\/pre><\/div><\/div>\n\n<p>Dstat, like Sysstat, can be found in most of the major distros.<\/p>\n\n<p>Also worth checking out is <a href=\"http:\/\/collectd.org\/\">collectd<\/a>, the \u201cwould you like graphs with that?\u201d network-aware stat collection daemon, which collects the same data as Sysstat and Dstat, but outputs it to <a href=\"http:\/\/oss.oetiker.ch\/rrdtool\/\">RRD<\/a> for quick and easy graphing love. 
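<\/p>\n\n<p>To give a flavour, a minimal <code>collectd.conf<\/code> for that kind of setup might look like this (a sketch only: the plugins are the stock cpu\/memory\/rrdtool plugins, and the <code>DataDir<\/code> path is illustrative):<\/p>\n

```plaintext
# Load a couple of collection plugins plus the RRD writer
LoadPlugin cpu
LoadPlugin memory
LoadPlugin rrdtool

<Plugin rrdtool>
  # Where the .rrd files land; point your graphing at this directory
  DataDir "/var/lib/collectd/rrd"
</Plugin>
```

\n<p>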
Collectd is my new favourite piece of software, so I\u2019ll probably post a bit more about it.<\/p>\n\n<p>I\u2019ve barely scratched the surface of performance monitoring under Linux, but hopefully this will give you a starting point other than \u201cmangle top\u2019s manky output\u201d.<\/p>\n","pubDate":"22 Mar 2008","link":"https:\/\/fractio.nl\/2008\/03\/22\/practical-performance-monitoring-tooling-on-linux\/","guid":"https:\/\/fractio.nl\/2008\/03\/22\/practical-performance-monitoring-tooling-on-linux\/"},{"title":"Setting up unixODBC with a remote DB2","description":"<p>My hatred for DB2 grows. I thought it was bad <a href=\"https:\/\/fractio.nl\/2007\/09\/14\/more-fixes-for-the-db2-time-sink\/\">setting up<\/a> the <a href=\"https:\/\/fractio.nl\/2007\/03\/05\/fixing-ibm-db2-for-non-ancient-operating-systems\/\">DB2 server<\/a> - it\u2019s even worse getting the ODBC adapter working. It took me a good 2 days of fiddling to figure this out. Fricken ridiculous.<\/p>\n\n<p>All the <a href=\"http:\/\/www.unixodbc.org\/doc\/db2.html\">ODBC documentation<\/a> on the web focuses on talking to DB2 on a local machine. I needed ODBC to talk to DB2 on a <em>remote<\/em> machine.<\/p>\n\n<p>But first, some quick background before we start. unixODBC has two configuration files, <code class=\"language-plaintext highlighter-rouge\">odbcinst.ini<\/code> for database drivers and <code class=\"language-plaintext highlighter-rouge\">odbc.ini<\/code> for database sources. Drivers are mechanisms for talking to databases, and sources are database definitions.<\/p>\n\n<p>Most (all?) unixODBC sources have <em>all<\/em> their config under their own section in <code class=\"language-plaintext highlighter-rouge\">\/etc\/odbc.ini<\/code>. DB2 likes to be different and store its own config in a separate file called <code class=\"language-plaintext highlighter-rouge\">db2cli.ini<\/code>. This file is used by DB2 utilities that use the db2cli. 
(a bit of background on the <code class=\"language-plaintext highlighter-rouge\">db2cli<\/code> can be found <a href=\"http:\/\/www.uic.rsu.ru\/doc\/db\/DB2CLI\/db2l008.htm#Header_5\">here<\/a>)<\/p>\n\n<p>Anyway, I managed to set this up with DB2 8 under Fedora Core 6 but I assume it works under RHEL 5, and could easily be transposed to other distros.<\/p>\n\n<p>First step - install unixODBC:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>yum install unixODBC\n<\/code><\/pre><\/div><\/div>\n\n<p>Now you\u2019ve got to install a copy of DB2.<\/p>\n\n<p>IBM provide a bunch of different editions of DB2. Don\u2019t fall into the trap of using the \u201cIBM DB2 Driver for ODBC and CLI\u201d - I wasn\u2019t able to get it talking to a remote DB2 server. You want to grab the \u201cDB2 Runtime Client\u201d. Infuriatingly IBM have removed all links to older versions of DB2 on their website, but you can find downloads for DB2 8 <a href=\"http:\/\/www-1.ibm.com\/support\/docview.wss?rs=71&amp;uid=swg21256235\">here<\/a>, and DB2 9 <a href=\"http:\/\/www-1.ibm.com\/support\/docview.wss?rs=71&amp;uid=swg21255390\">here<\/a>.<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>wget ftp:\/\/ftp.software.ibm.com\/ps\/products\/db2\/fixes2\/english-us\/db2linux2632\/client\/runtime\/FP15_MI00189_RTCL.tar\ntar xvf FP15_MI00189_RTCL.tar\nrtcl\/db2_install -p DB2.RTCL\n<\/code><\/pre><\/div><\/div>\n\n<p>Why doesn\u2019t IBM compress their DB2 releases? Gzip\u2019s been around for 15 years - get your act together! 
I downloaded the archive to a Linode in the States and bzip\u2019d it before downloading it here, saving 70MB.<\/p>\n\n<p>The runtime client doesn\u2019t set up users, so you\u2019ll have to create a DB2 instance and user yourself.<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>mkdir -p \/home\/db2\nadduser -m -d \/home\/db2\/db2inst db2inst\npasswd db2inst\n\/opt\/IBM\/db2\/V8.1\/instance\/db2icrt db2inst\n<\/code><\/pre><\/div><\/div>\n\n<p>Then you need to set up db2inst\u2019s <code class=\"language-plaintext highlighter-rouge\">db2cli.ini<\/code>, which contains all the useful information on how to connect to the remote DB2 instance.<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>vim \/home\/db2\/db2inst\/sqllib\/cfg\/db2cli.ini\n<\/code><\/pre><\/div><\/div>\n\n<p>Here\u2019s an example config (thanks, IBM, for not providing anything remotely as useful):<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>[foo]\nDatabase = FOO\nProtocol = TCPIP\nHostname = 192.168.10.77\nServiceName = 50000\n<\/code><\/pre><\/div><\/div>\n\n<p>For some bizarre reason, <code class=\"language-plaintext highlighter-rouge\">ServiceName<\/code> is the port number.<\/p>\n\n<p>Now that DB2\u2019s CLI is set up, it\u2019s time to do the ODBC manager. 
Set up the DB2 driver in\n<code class=\"language-plaintext highlighter-rouge\">\/etc\/odbcinst.ini<\/code>.<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>[DB2]\nDescription             = IBM DB2 Adapter\nDriver                  = \/opt\/IBM\/db2\/V8.1\/lib\/libdb2.so\nFileUsage               = 1\nDontDLClose             = 1\n<\/code><\/pre><\/div><\/div>\n\n<p>And finally, the ODBC source in <code class=\"language-plaintext highlighter-rouge\">\/etc\/odbc.ini<\/code>.<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>[foo]\nDescription = example database connection\nDriver = DB2\n<\/code><\/pre><\/div><\/div>\n\n<p>This is quite unlike normal ODBC source definitions in unixODBC. Because of IBM\u2019s insistence on using <code class=\"language-plaintext highlighter-rouge\">db2cli.ini<\/code>, you have to put all the relevant settings there, essentially turning <code class=\"language-plaintext highlighter-rouge\">odbc.ini<\/code> into a wrapper for <code class=\"language-plaintext highlighter-rouge\">db2cli.ini<\/code>. Apparently the <code class=\"language-plaintext highlighter-rouge\">db2cli<\/code> allows you to put the settings just in <code class=\"language-plaintext highlighter-rouge\">odbc.ini<\/code> and they\u2019ll be passed through when the DB2 driver is called, but I could not get this to work.<\/p>\n\n<p>Also worth noting is that the <code class=\"language-plaintext highlighter-rouge\">db2cli<\/code> doesn\u2019t allow DSNs (the name of the ODBC source inside the square brackets) longer than 8 characters. What the fuck. 
Are we living in the \u201980s?<\/p>\n\n<p>Regardless, to test the setup:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>export DB2INSTANCE=\"db2inst\"\nisql -v foo db2inst password\n<\/code><\/pre><\/div><\/div>\n\n<p>Useful manuals for DB2 can be found <a href=\"http:\/\/www-1.ibm.com\/support\/docview.wss?rs=71&amp;uid=swg27009552\">here<\/a> (the Call Level Interface Guide and Reference Vol.1 &amp; 2 are particularly handy).<\/p>\n\n<p>I hope someone else finds this useful. Hopefully this is my last post about DB2 for quite some time.<\/p>\n","pubDate":"26 Oct 2007","link":"https:\/\/fractio.nl\/2007\/10\/26\/setting-up-unixodbc-with-a-remote-db2\/","guid":"https:\/\/fractio.nl\/2007\/10\/26\/setting-up-unixodbc-with-a-remote-db2\/"},{"title":"More fixes for the DB2 time sink","description":"<p>A while back I posted about <a href=\"\/2007\/03\/05\/fixing-ibm-db2-for-non-ancient-operating-systems\/\">Fixing IBM DB2 for non-ancient operating systems<\/a>. <code class=\"language-plaintext highlighter-rouge\">tail<\/code>\u2019s deprecated <code class=\"language-plaintext highlighter-rouge\">+n<\/code> syntax broke the installer at a critical point, leaving DB2 rather broken.<\/p>\n\n<p>I\u2019ve since found a more elegant fix. Before you run the installer, set the <code class=\"language-plaintext highlighter-rouge\">_POSIX2_VERSION<\/code> environment variable:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code># party like it's 1992\n$ export _POSIX2_VERSION=199209\n$ \/tmp\/db2\/db2setup -r \/tmp\/db2\/db2_wse_response_file.rsp\n<\/code><\/pre><\/div><\/div>\n\n<p>This allows <code class=\"language-plaintext highlighter-rouge\">tail<\/code> to use the older syntax, and the install completes successfully.<\/p>\n\n<p>Unfortunately DB2 is broken in a lot of other ways. 
The supplied <code class=\"language-plaintext highlighter-rouge\">db2_deinstall<\/code> script removes the RPMs, but it does so regardless of whether DB2 is running or not. It also sprays files all over the filesystem. This makes it really difficult to fully and cleanly remove DB2 from your system and not have reinstalls tainted.<\/p>\n\n<p>So here\u2019s a script that should clean up its mess:<\/p>\n\n<div class=\"language-bash highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"c\">#!\/bin\/sh -e<\/span>\n\n<span class=\"nb\">echo<\/span> <span class=\"s2\">\"You really should run db_deinstall first!\"<\/span> \n<span class=\"nb\">echo<\/span> <span class=\"nt\">-n<\/span> <span class=\"s2\">\"Have you done this? [y\/n] \"<\/span> \n<span class=\"nb\">read <\/span><span class=\"k\">continue\n\nif<\/span> <span class=\"o\">[<\/span> <span class=\"s2\">\"<\/span><span class=\"nv\">$continue<\/span><span class=\"s2\">\"<\/span> <span class=\"o\">!=<\/span> <span class=\"s2\">\"y\"<\/span> <span class=\"o\">]<\/span><span class=\"p\">;<\/span> <span class=\"k\">then \n        <\/span><span class=\"nb\">echo<\/span> <span class=\"s2\">\"OK Bye!\"<\/span>\n        <span class=\"nb\">exit <\/span>1\n<span class=\"k\">fi\n\n<\/span>killall <span class=\"nt\">-9<\/span> <span class=\"nt\">-u<\/span> db2inst &amp;\nkillall <span class=\"nt\">-9<\/span> <span class=\"nt\">-u<\/span> db2fenc &amp;\nkillall <span class=\"nt\">-9<\/span> <span class=\"nt\">-u<\/span> dasusr &amp;\n\n<span class=\"nb\">echo<\/span> <span class=\"s2\">\"Sleeping while DB2 is forced to shutdown.\"<\/span>\n<span class=\"nb\">sleep <\/span>4 \n\nuserdel <span class=\"nt\">--force<\/span> <span class=\"nt\">--remove<\/span> db2inst\nuserdel <span class=\"nt\">--force<\/span> <span class=\"nt\">--remove<\/span> db2fenc\nuserdel <span class=\"nt\">--force<\/span> <span class=\"nt\">--remove<\/span> dasusr\n\ngroupdel dasadm\ngroupdel db2grp\ngroupdel 
db2fgrp\n\n<span class=\"nb\">rm<\/span> <span class=\"nt\">-rf<\/span> \/opt\/IBM\/db2\n<span class=\"nb\">rm<\/span> <span class=\"nt\">-rf<\/span> \/home\/db2\n<span class=\"nb\">rm<\/span> <span class=\"nt\">-rf<\/span> \/var\/db2\n\n<span class=\"nb\">grep<\/span> <span class=\"nt\">-v<\/span> ^db2c \/etc\/services <span class=\"o\">&gt;<\/span> \/etc\/services.new\n<span class=\"nb\">mv<\/span> \/etc\/services.new \/etc\/services\n<\/code><\/pre><\/div><\/div>\n\n<p>This assumes that you\u2019ve set all your DB2 users\u2019 homes to be under <code class=\"language-plaintext highlighter-rouge\">\/home\/db2<\/code> (which is a sane way of limiting DB2\u2019s potential damage anyway).<\/p>\n\n<p>DB2 mangles <code class=\"language-plaintext highlighter-rouge\">\/etc\/services<\/code> and doesn\u2019t clean up after itself, causing subsequent installs to fail hard for no apparent reason. You must clean this up if you want a reinstall to work.<\/p>\n\n<p>My conclusion: administering DB2 can be an enormous time sink. Making it behave is not worth the time. Convince your manager or client to use something else like Postgres.<\/p>\n","pubDate":"14 Sep 2007","link":"https:\/\/fractio.nl\/2007\/09\/14\/more-fixes-for-the-db2-time-sink\/","guid":"https:\/\/fractio.nl\/2007\/09\/14\/more-fixes-for-the-db2-time-sink\/"},{"title":"Fixing IBM DB2 for non-ancient operating systems","description":"<p>You\u2019ve tried installing IBM DB2 V8.1 on the current release of an RPM-based but commercially unsupported operating system. 
Also, maybe you\u2019ve installed DB2 before on older machines and have a nice response file set up with sane default values.<\/p>\n\n<p>Something has died during the install, but you can\u2019t tell what\u2019s causing the following error:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>$ db2cc \nException in thread \"main\" java.lang.UnsatisfiedLinkError: initIDs\n        at java.awt.Component.&lt;clinit&gt;(Component.java:563)\n        at CC.&lt;init&gt;(Unknown Source)\n        at CC.main(Unknown Source)\nDB2JAVIT : RC = 1\n<\/code><\/pre><\/div><\/div>\n\n<p>What do you do?<\/p>\n\n<p>Make sure that <code class=\"language-plaintext highlighter-rouge\">libXp<\/code> is installed.<\/p>\n\n<p>RHEL 4 and earlier, along with older versions of Fedora Core (at least on Core 3), bundle <code class=\"language-plaintext highlighter-rouge\">libXp<\/code> in <code class=\"language-plaintext highlighter-rouge\">xorg-libs<\/code>; however, this package has disappeared in newer versions of Fedora. 
Fortunately, a <code class=\"language-plaintext highlighter-rouge\">yum provides libXp.so.6<\/code> reveals that it now has its own package, so a <code class=\"language-plaintext highlighter-rouge\">yum install libXp<\/code> will fix your problems.<\/p>\n\n<p>Now <code class=\"language-plaintext highlighter-rouge\">db2cc<\/code> and the other various DB2 admin-y bits should work, but what about the other bits that broke?<\/p>\n\n<p>You\u2019ll need to fix up some syntactical bits in <code class=\"language-plaintext highlighter-rouge\">\/opt\/IBM\/db2\/V8.1\/instance\/db2iutil<\/code> - specifically, the syntax of <code class=\"language-plaintext highlighter-rouge\">tail<\/code> has changed.<\/p>\n\n<p>Open up <code class=\"language-plaintext highlighter-rouge\">db2iutil<\/code> in vi and run:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>%s\/tail +2\/tail -n +2\/g\n<\/code><\/pre><\/div><\/div>\n\n<p>This will replace all instances of <code class=\"language-plaintext highlighter-rouge\">tail +2<\/code> with <code class=\"language-plaintext highlighter-rouge\">tail -n +2<\/code>. Supposedly just <code class=\"language-plaintext highlighter-rouge\">+2<\/code> is no longer valid syntax. (Bizarro! Is this some sort of odd regression?)<\/p>\n\n<p>Now, for the cleanup bits. 
:-)<\/p>\n\n<p>It\u2019s worth dropping the DB2 instance just in case things have broken in mystical ways known only to IBM.<\/p>\n\n<p>If the instance has already been created, drop it by running:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>\/opt\/IBM\/db2\/V8.1\/instance\/db2idrop db2inst\n<\/code><\/pre><\/div><\/div>\n\n<p>And now recreate the DB2 instance:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>\/opt\/IBM\/db2\/V8.1\/instance\/db2icrt -a SERVER -d -s wse -u db2fenc -p db2c_db2inst db2inst\n<\/code><\/pre><\/div><\/div>\n\n<p>The DB2 admin user probably needs a re-jigging too.<\/p>\n\n<p>Delete the temporary directory created during the installation:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code># rmdir \/home\/db2das\/das\n<\/code><\/pre><\/div><\/div>\n\n<p>Recreate the DB2 administration user in DB2:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>\/opt\/IBM\/db2\/V8.1\/instance\/dascrt -u dasusr -d\n<\/code><\/pre><\/div><\/div>\n\n<p>Of course, modify the paths as needed. 
Your Unix + DB2 usernames and DB2 instance names are probably different, so in my examples:<\/p>\n\n<ul>\n  <li>DB2 admin user = <code class=\"language-plaintext highlighter-rouge\">dasusr<\/code><\/li>\n  <li>DB2 instance user = <code class=\"language-plaintext highlighter-rouge\">db2inst<\/code><\/li>\n  <li>DB2 instance name = <code class=\"language-plaintext highlighter-rouge\">db2inst<\/code><\/li>\n  <li>Name of server = <code class=\"language-plaintext highlighter-rouge\">SERVER<\/code><\/li>\n<\/ul>\n","pubDate":"05 Mar 2007","link":"https:\/\/fractio.nl\/2007\/03\/05\/fixing-ibm-db2-for-non-ancient-operating-systems\/","guid":"https:\/\/fractio.nl\/2007\/03\/05\/fixing-ibm-db2-for-non-ancient-operating-systems\/"},{"title":"Linux + OOo + Samba = pretty weird file locking behaviour","description":"<p>An interesting file locking problem with Samba and Windows &amp; Linux clients arose at work last week:<\/p>\n\n<ol>\n  <li>1st Linux machine with a CIFS-mounted Samba share has an OpenOffice.org document open on the share.<\/li>\n  <li>2nd Linux machine with the same CIFS-mounted share opens the OOo file. OOo has full write access to the file.<\/li>\n  <li>Windows machine tries to open the OOo file. OOo gets read-only access. (the correct behaviour)<\/li>\n<\/ol>\n\n<p>This is really a Linux OOo bug, but you can circumvent it by enforcing a particular file locking policy with Samba.<\/p>\n\n<p>OOo\u2019s file locking on Windows is handled nicely because the file locking interface it uses is, as far as I can tell, smart enough to work out whether the file is on a remote share, and if so it modifies the locking calls to be more CIFS-ish.<\/p>\n\n<p>From various bug and forum posts, OOo on Linux isn\u2019t afforded the same luxury. It will make a \u201cstandard\u201d locking call that doesn\u2019t get translated to a more reliable and CIFS-aware call. This standard locking call is used everywhere, and makes the OOo developers\u2019 jobs easier because they don\u2019t have to maintain CIFS-aware code, further reinforcing one of my favourite graphics:<\/p>\n\n<p><img src=\"http:\/\/i.imgur.com\/c1wqbvY.jpg\" alt=\"bug and feature \u2013 one and the same\" \/><\/p>\n\n<p>So, it becomes a process of finding the right smb.conf incantations. This is what I found worked:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>[global]\n    ...\n    kernel oplocks = no\n\n[someshare]\n    path = \/path\/to\/share\n\n    oplocks = yes\n    share modes = yes\n    locking = yes\n    strict locking = no\n    blocking locks = no\n\n    ...\n<\/code><\/pre><\/div><\/div>\n\n<p>Replace <code class=\"language-plaintext highlighter-rouge\">...<\/code> with your normal share or global settings.<\/p>\n\n<p>I\u2019ve tested this successfully on Ubuntu Edgy and Dapper, though it\u2019s been tried on a Sarge-like build and didn\u2019t work correctly.<\/p>\n\n<p>An interesting side effect that I haven\u2019t confirmed: the OOo + CIFS + Linux combination works as intended, but OOo + Windows gets a constant read-only status on the file. Looks like the enforced Samba locking behaviour doesn\u2019t play well with Windows.<\/p>\n\n<p>Further testing on both fronts is required, but people running into this OOo problem in pure Linux environments should find it a handy fix.<\/p>\n","pubDate":"14 Feb 2007","link":"https:\/\/fractio.nl\/2007\/02\/14\/linux-ooo-samba-weird-file-locking-behaviour\/","guid":"https:\/\/fractio.nl\/2007\/02\/14\/linux-ooo-samba-weird-file-locking-behaviour\/"},{"title":"Re: Gimp and Colour Depth","description":"<p><a href=\"https:\/\/benno37.livejournal.com\/10678.html\">Benno<\/a>, the <code class=\"language-plaintext highlighter-rouge\">convert<\/code> command that you used didn\u2019t actually convert your image down to a 16-bit colour RGB image. 
It just reduced the number of colours in the palette, which can be done with the Gimp!<\/p>\n\n<p>To reduce the colour palette you have to go to \u201cImage -&gt; Mode -&gt; Indexed\u201d, select \u201cGenerate optimum palette\u201d, and set the Maximum number of colours to 256.<\/p>\n\n<p>I found that the \u201cFloyd-Steinberg (reduced colour bleeding)\u201d ditherer worked the best on your image.<\/p>\n\n<p>Once you save it as a bitmap, <code class=\"language-plaintext highlighter-rouge\">identify -verbose $filename<\/code> will give you detailed results for the new image.<\/p>\n\n<p>You\u2019ll see that the original, the convert-reduced-palette, and the gimp-reduced-palette images are all 24-bit, although the one done with the Gimp will say that its Type is \u201cPalette\u201d, rather than TrueColor.<\/p>\n\n<p>Doesn\u2019t really matter too much though - as long as the image looks fine!<\/p>\n\n<p>Wikipedia have some <a href=\"http:\/\/en.wikipedia.org\/wiki\/Pixel\">really<\/a> <a href=\"http:\/\/en.wikipedia.org\/wiki\/Color_space\">good<\/a> articles on the theory behind digital imaging.<\/p>\n","pubDate":"14 Jul 2005","link":"https:\/\/fractio.nl\/2005\/07\/14\/re-gimp-and-colour-depth\/","guid":"https:\/\/fractio.nl\/2005\/07\/14\/re-gimp-and-colour-depth\/"},{"title":"OpenLDAP config parsing failure","description":"<p>Annoying syntax changes between versions of OpenLDAP that I didn\u2019t pick up on consumed my time at work today.  
I was beating my head against the wall trying to work out why the <code class=\"language-plaintext highlighter-rouge\">slapd<\/code> config wasn\u2019t being parsed correctly.<\/p>\n\n<p>Setting the slaptest debug level to -1 reveals all, however:<\/p>\n\n<div class=\"language-plaintext highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code>slapd.conf: line 78: expecting &lt;what&gt; got \"attribute\"\n&lt;access clause&gt; ::= access to &lt;what&gt; [ by &lt;who&gt; &lt;access&gt; [ &lt;control&gt; ] ]+\n&lt;what&gt; ::= * | [dn[.&lt;dnstyle&gt;]=&lt;dn&gt;] [filter=&lt;filter&gt;] [attrs=&lt;attrlist&gt;]\n&lt;attrlist&gt; ::= &lt;attr&gt; [val[.&lt;style&gt;]=&lt;value&gt;] | &lt;attr&gt; , &lt;attrlist&gt;\n&lt;attr&gt; ::= &lt;attrname&gt; | entry | children\n<\/code><\/pre><\/div><\/div>\n\n<p>Under newer versions of OpenLDAP, when defining access controls the <code class=\"language-plaintext highlighter-rouge\">attribute<\/code> definition has been renamed to <code class=\"language-plaintext highlighter-rouge\">attrs<\/code>. 
This means that your config will mystically stop working and <code class=\"language-plaintext highlighter-rouge\">slapd<\/code> will die in a horrible flaming mess.<\/p>\n","pubDate":"13 Apr 2005","link":"https:\/\/fractio.nl\/2005\/04\/13\/openldap-config-parsing-failure\/","guid":"https:\/\/fractio.nl\/2005\/04\/13\/openldap-config-parsing-failure\/"},{"title":"About","description":"<p>Lindsay Holmwood is a product and engineering leader based in Australia.\nHe currently works at <a href=\"https:\/\/cipherstash.com\/\">CipherStash<\/a> as Chief Product Officer.\nHe served as the Head of Technology at the Australian federal government\u2019s Digital Transformation Agency, where he was responsible for technology strategy, advice, and delivery.\nHe has since worked at Envato leading engineering on <a href=\"https:\/\/elements.envato.com\/\">Envato Elements<\/a>, and <a href=\"https:\/\/www.section.io\/\">Section<\/a> as Director of Product.<\/p>\n\n<p>Since bringing DevOps to Australia by running the second ever DevOpsDays conference in the world in 2010, he runs the longest-running DevOps meetup in the world in Sydney.\nHe is a sought-after professional speaker, <a href=\"https:\/\/fractio.nl\/talks\">speaking on<\/a> technology culture, DevOps, digital transformation, and building high performing teams.\nHe also won <a href=\"https:\/\/www.instagram.com\/p\/T1_dn1DqQz\/\">third place<\/a> at the 1996 Sydney Royal Easter Show LEGO building competition.<\/p>\n\n<p>Read <a href=\"https:\/\/fractio.nl\/resume\/\">Lindsay\u2019s resume<\/a>.<\/p>\n\n<h2 id=\"on-the-web\">On the web<\/h2>\n\n<ul class=\"on-the-web\">\n  <li>\n    <a href=\"https:\/\/twitter.com\/auxesis\" target=\"_blank\" title=\"Twitter\">\n      <i class=\"fab fa-twitter-square fa-3x\"><\/i>\n    <\/a>\n  <\/li>\n  <li>\n    <a href=\"https:\/\/github.com\/auxesis\" target=\"_blank\" title=\"GitHub\">\n      <i class=\"fab fa-github fa-3x\"><\/i>\n    <\/a>\n  <\/li>\n  <li>\n    <a href=\"http:\/\/www.linkedin.com\/in\/lindsayholmwood\" target=\"_blank\" title=\"LinkedIn\">\n      <i class=\"fab fa-linkedin fa-3x\"><\/i>\n    <\/a>\n  <\/li>\n  <li>\n    <a href=\"http:\/\/instagram.com\/auxesis\" target=\"_blank\" title=\"Instagram\">\n      <i class=\"fab fa-instagram fa-3x\"><\/i>\n    <\/a>\n  <\/li>\n  <li>\n    <a href=\"http:\/\/flickr.com\/photos\/auxesis\" target=\"_blank\" title=\"Flickr\">\n      <i class=\"fab fa-flickr fa-3x\"><\/i>\n    <\/a>\n  <\/li>\n<\/ul>\n","link":"https:\/\/fractio.nl\/about\/"},{"title":"Management","description":"This is the advice I wish I had available to me when I started my career change from individual contributor to leadership.\n\n{% for post in site.categories.management reversed %}\n* <a href=\"{{ site.url }}{{ post.url }}\" title=\"{{ post.title }}\" class=\"post-title\"> {{ post.title }} <\/a>\n{% endfor -%}\n\n<hr \/>\n\n#### You may also be interested in [my talks]({{ site.url }}\/talks) on leadership.\n","link":"https:\/\/fractio.nl\/management\/"},{"title":"Talks","description":"Lindsay is a sought-after professional speaker, speaking on technology culture, DevOps, digital transformation, and building high performing teams.\n\nThese are the highlights.\n\n## Delivering Simpler, Clearer, Faster government services with Cloud Foundry\n\n<iframe width=\"560\" height=\"315\" src=\"https:\/\/www.youtube.com\/embed\/5Ramc0_jG9s\" frameborder=\"0\" allow=\"accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen><\/iframe>\n\nNobody interacts with government because they want to - they interact with government because they have to, and most people come away from online interactions feeling more confused than when they started.\n\nThat's why the Digital Transformation Office was created in 2015 \u2013 to change the way Australian governments deliver services, by relentlessly focusing all delivery activities on user needs, and modernising technical delivery methods.\n\nCloud Foundry is at the core of this technical modernisation, with the DTO providing a Cloud Foundry-based delivery platform to government for building new services. Every team gets a CD pipeline, centralised monitoring and logging, and an app runtime \u2013 driving a culture change through tools.\n\nIn this talk we'll learn about the problems with the traditional approach to digital service delivery in government, what opportunities Cloud Foundry creates for architecting the delivery of user-focused services, and how Cloud Foundry enables the DTO to help government deliver simpler, clearer, faster public services.\n\n<i class=\"fas fa-link\"><\/i>\n<a href=\"https:\/\/speakerdeck.com\/auxesis\/delivering-clearer-simpler-faster-public-services-with-cloud-foundry\" target=\"_blank\">Slides<\/a>.\n\n<small>\n  A longer, broader version of this talk is [also available](https:\/\/www.youtube.com\/watch?v=nE0Imiusbso).\n<\/small>\n\n## Managing remotely, while remotely managing\n\n<script async class=\"speakerdeck-embed\" data-id=\"0384dd0cf787474e986de52ecf542a8e\" data-ratio=\"1.33333333333333\" src=\"\/\/speakerdeck.com\/assets\/embed.js\"><\/script>\n\nDistributed teams unlock access to new talent from diverse backgrounds, and create opportunities for more humane ways of working.\n\nBut it takes a lot of work to reap the benefits. 
In this talk you will learn strategies for effectively bootstrapping, growing, and adapting distributed teams.\n\n## The DevOps Field Guide to Understanding Cognitive Biases\n\n<iframe width=\"560\" height=\"315\" src=\"\/\/www.youtube.com\/embed\/v6XQsprq-ME\" frameborder=\"0\" allowfullscreen><\/iframe>\n\nAs devops practitioners we focus on improving the culture of collaboration so that others play nicely with us & we play nicely with others - but what if the biggest thing holding us back from change is our own brains?\n\nCognitive biases can deeply affect our behaviours towards others by herding us towards mental shortcuts that are optimised for timeliness over accuracy, at the expense of rationalising irrational behaviour.\n\nYou are probably pushing these biases onto other people every day but don't even know it. Does that idea make you feel uncomfortable? You are probably experiencing the Semmelweis reflex kicking your confirmation bias right now.\n\nKnowing is half the battle. This talk will delve into some of the well-known and less well-known biases that may be affecting your ability to work with your peers, and your team's ability to work constructively with other teams.\n\nAttendees will leave the talk with an overview of biases they run into every day, how to hack their brains to use these biases to their advantage, and some tips on how to mitigate the effects of the limitations baked into their wetware.\n\nWe have met the enemy and he is us.\n\n<i class=\"fas fa-link\"><\/i>\n<a href=\"https:\/\/speakerdeck.com\/auxesis\/the-devops-field-guide-to-cognitive-biases\" target=\"_blank\">Slides<\/a>.\n\n## Escalating complexity: DevOps learnings from Air France 447\n\n<iframe width=\"560\" height=\"315\" src=\"\/\/www.youtube.com\/embed\/WqxzpGJbzmI\" frameborder=\"0\" allowfullscreen><\/iframe>\n\nOn June 1, 2009, Air France 447 crashed into the Atlantic Ocean killing all 228 passengers and crew. 
The 15 minutes leading up to the impact were a terrifying demonstration of how thick the fog of war is in complex systems.\n\nMainstream reports of the incident put the blame on the pilots - a common motif in incident reports that conveniently ignore a simple fact: people were just actors within a complex system, doing their best based on the information at hand.\n\nWhile the systems you build and operate likely don't control the fate of people's lives, they share many of the same complexity characteristics. Dev and Ops can learn an abundance from how the feedback loops between these aviation systems are designed and how these systems are operated.\n\nIn this talk we cover what happened on the flight, why the mainstream explanation doesn't add up, how design assumptions can impact people's ability to respond to rapidly developing situations, and how to improve your operational effectiveness when dealing with rapidly developing failure scenarios.\n\n<i class=\"fas fa-link\"><\/i>\n<a href=\"https:\/\/www.slideshare.net\/auxesis\/escalating-complexity-devops-learnings-from-air-france-447\" target=\"_blank\">Slides<\/a>.\n","link":"https:\/\/fractio.nl\/talks\/"}]}}