Team Development Roadmaps

As engineers we spend so much time focusing on the client’s projects, developing Impact Maps, Story Maps, Stakeholder Maps… all sorts of maps to confirm our approach and provide shared understanding between everyone and a strategy for solving the problem. What often doesn’t get considered is how the members of the team can use the opportunities within a project to improve and learn. Making sure that your team feel challenged and invested whilst providing a safe space helps them to feel valued and be at their natural best. At Infinity Works we strive to be “Best for Client, Best for Colleague”, so with that in mind I had an idea around Team Development Roadmaps.

What’s one of those then? I see it as a personal progression plan that ties into the project and its milestones in a way that allows you to link both personal and project targets together. They’re often seen as separate, but I believe providing opportunities within projects for team members to develop and experiment is a great way to keep them motivated. I did this recently with a new team of mine and I’ve outlined my approach here; hopefully you’ll find it useful.

First off, we did an initial brainstorming session where I had the team spend 10 minutes jotting down all of their goals. We used Miro (my new favourite online collaboration tool) to facilitate the session. We then spent the next 30 minutes walking through each of the goals, allowing the team to talk about what’s important to them.

If possible, have this conversation right at the start of a project, or as the team’s entering the forming phase – it shouldn’t be an afterthought. Having an open and honest conversation tunes everyone in to each other’s goals and aspirations. We actually unearthed some common threads between team members, and opportunities to support each other in achieving those goals which we wouldn’t have known about otherwise.

When we had a high-level project roadmap and an understanding of what technologies, features and engagements we’d need with various stakeholders, we then revisited our wall of personal goals. We had another session to line up personal goals with project milestones, to identify opportunities in the project that could contribute towards them.

There will often be goals or activities that for whatever reason won’t be achievable within the project itself, such as long-running themes or external activities like engaging in communities of practice. It’s still worth capturing these, as the team can discuss ways to accommodate them too, for example agreeing on lunch breaks long enough to allow people to attend meetups, or factoring time spent on activities outside the project into sprint commitments.

Depending on the stage of your project and level of detail known, the output from this session may be as high-level as the diagram above, or it could be as detailed as a Story Map that overlays personal goals on top.

Try to align the two roadmaps as closely as possible to increase the chance of sticking to them. Build Story Maps around your team’s goals, attribute them to milestones and factor them in to your usual team ceremonies. When walking through user stories in planning, identify how they can contribute to individuals’ personal goals and organise the team in the best way to balance both project and personal goals.

Finally, we captured the goals and associated actions in a table. This provided us with a targeted list of actions to focus on, along with owners and a review date so we can regularly review progress and check actions and goals off as they’re completed.

I believe doing this will pay dividends in many ways. By investing in your team this way they’ll feel valued and motivated, and it can help to build camaraderie as they work together to help achieve each other’s goals. I’m keen to hear feedback on this approach, what works well and not so well, and any other suggestions for building teams. Hopefully this gives you food for thought!

To be successful, you’ve got to take the lead

I have been doing a lot of reflection recently. I want to become a great leader – to be a servant to others, to drive myself and those in my charge towards a Just Cause that we all aspire towards. In doing this I’ve been figuring out what to focus on in order to help myself and others achieve greatness in everything we do.

But if you want to be successful in leading others, you first need to be able to lead yourself. What does that even mean?

Here are some ideas I’ve read about recently which really resonate with me, and hopefully may be of use to you too.

Self Leadership

For me, this is about taking charge of your career, getting what you need in order to succeed. It’s about taking ownership and jumping to the challenges that present themselves, no matter how big or small. No one else is going to do the work for you, and you shouldn’t rely on anyone doing all the heavy lifting either.

Yeah sure this may mean taking on the responsibility and accountability, and potentially being exposed to failure if it doesn’t go to plan. Hopefully you work for a company that builds psychological safety and trust into their culture, where taking ownership and stretching yourself should be actively encouraged and not something to be fearful of. If that isn’t the case then look for opportunities to nurture that mindset within your company, maybe start with developing that trust and safety within your team.

It’s important how you ask for that support though – show others you’re enthusiastic about progressing by proactively identifying what you need. Be concise about what you need and ask for it – people will help. You’re more likely to get a positive response this way, as opposed to throwing in the towel with “someone else would be better suited for this work” or “I don’t know what to do”. This could be anything from “I need advice on diagnosing this problem” or “I need support analysing this complex customer requirement” to “I really want to take point on building the new solution, but I need your support with how to design it. Could you help?”.

Finding your Just Cause

Simon Sinek lists this as one of 5 pillars to having an Infinite Mindset. It is a future state or ideal that you are willing to sacrifice yourself for, by means of dedicating your time, energy, career or life towards pursuing. A Just Cause in itself is infinite – a never ending quest for a better future that is so profound and inspiring that you’re constantly energised and passionate about working towards it.

In this self actualisation, money and job security are no longer the motivators – you’re working for a higher cause. Having a Just Cause encourages a “Service Oriented” mindset – being in service to others. From my experience, if you’re able to prioritise your Just Cause and being in service to others, then your personal needs are also rewarded as a result. I do what I can to make an impact and help others, in the hope that the company I’m working for recognises and rewards the right behaviours that contribute to a great collaborative culture.

Motivation and self leadership are significantly easier if you have a Just Cause and your job and company align with it. It makes getting out of bed that much easier, and turning up to work enjoyable and less like “work”.

I’m still trying to articulate my Just Cause, but before I can do that I need to discover my “why” – my origin story, why I am who I am and do what I do. All I know right now is that I love fixing problems and building solutions, working, collaborating and coaching fellow colleagues and doing what I can to help them develop. I aim to make a positive impact in everything I work on, and if I get to earn a bit of dollar whilst doing that then all the better.

Having a Worthy Rival

I don’t mean an adversary or someone you despise or detest, it could be someone you work with or a close friend. I’m on about that someone who forces us to take stock and push ourselves to do better, who excels in areas we want to develop as well. We all have one of those, whether it’s the tech lead in your team, the confident public speaker talking at events, or the person who just oozes charisma while engaging with stakeholders.

It shouldn’t be someone you’re focused on beating, however, as that invites a finite mindset, which drives you to focus against them and to do what you can to surpass them. It also shouldn’t be someone you can get the better of to bolster your own confidence. The effort it takes to constantly be No.1 (whatever that means) in a contest which has no end is draining, and masks the real opportunity that’s available to you.

Don’t envy them… but rather embrace them. Accept humility and acknowledge that you can learn a thing or two from them. Get them to mentor you or study from afar, learn from their strengths and use them to bolster your weaknesses. Find someone who inspires you, figure out what it is they do so well, and challenge yourself to do better.

Challenge Assumed Constraints

I recently attended a coaching course led by two excellent tutors, Andy and Sean from Erskine Nash, where they introduced me to a book called “Self Leadership and the One Minute Manager”. It describes the idea of challenging assumed constraints, also known as elephant thinking – becoming so acclimatised to a constraint that you no longer challenge it, limiting your potential.

When was the last time you looked at a job, or a project, or an opportunity to broaden your horizons, and you’ve shied away from it because you fear you’d be lacking the skills or experience needed? Or the last time you worked with a client who “just didn’t get Agile”, or a colleague who didn’t understand your point of view? This is the perfect opportunity to challenge those assumed constraints, to question those limitations that you’ve defined and see what other options you have available to you.

One way of doing this is to imagine having the discussion with a close friend or family member, where they’d ask you “what are you going to do about it?”. There’s just something impactful and thought-provoking when it comes from someone you have a close relationship with that causes you to really challenge your self-imposed constraints.

Embracing your Points of Power

As engineers we can be both incredibly proud and stubborn, wanting to prove to everyone and ourselves that we can solve any challenge. Whilst perseverance and determination are great qualities to have, showing vulnerability and humility and asking for support is not a weakness… it is a sign of strength.

In the “One Minute Manager” they refer to this as your “Points of Power” – different sources of power which you can draw from in a situation in order to make things happen. There are 5 different Points of Power which I’ll talk about briefly below.

Task & Knowledge Power

As engineers we’re expected to have task and knowledge power – awareness and understanding of the problems and challenges we face and how to solve them. But if you go into every situation expecting yourself to be able to solve every problem you’ll burn yourself out very quickly. To me, this is a contributor to Imposter Syndrome – that feeling that you’re not good enough and sooner or later you’ll be found out. By embracing humility and leaning on the task and knowledge power of others, we can tackle any challenge.

For example, this power may refer to skills and expertise with a particular technology, working in a specific industry or with a client’s business domain. Being able to gauge your task and knowledge power against each of these allows you to understand where your strengths are, and when you need to seek the support of others.

Personal Power

Being technically excellent is a great quality for an engineer to have, but being personable and having people skills is what makes the difference between a great engineer and a great consultant. It’s not always easy building your personal power depending on where your comfort zone is – I personally can be quite introverted in some situations, and sometimes I love nothing more than sticking my earphones in and cracking on.

There are many ways you can develop Personal Power: building relatedness inside and outside your team (ice breakers for example, are a great technique when forming a new team); offering your support whenever possible (even if you’re not an expert in the task at hand); developing your active listening skills. Charisma and personality help, but you will be respected by your peers for just helping out when you can.

Relationship Power

This is an extension of utilising your personal power to build your relationships and connections. In the film “2 Guns”, whenever Denzel Washington’s character needs something, he “knows a guy”. I try to think like this too – I won’t always have all the answers, but I know enough people who have the relevant knowledge and task power to get the job done, and I know these people through building relationships.

How can you extend this further? Go to conferences, watch talks, give talks – whether 5 minutes or 50 minutes long. Each of the interactions and connections you make along the way helps expand your list of contacts and sources of information you can tap into to get what you need.

Position Power

Ideally the least-preferred power, but sometimes just being in the right position can aid you in obtaining the outcome you need. This could be beneficial if, for example, you are the tech lead, product owner or architect. In the “One Minute Manager” they refer to this as the power you hopefully never need to use, but it’s always good to know it’s there if you do.

Your levels of Power can vary not only throughout your career, but from situation to situation. For example, you may have a lot of Knowledge Power when working with a particular technology or business domain, but you may need to rely more on your Personal and Relationship Power to get what you need when working with unfamiliar systems or new clients. Knowing where you lie on these scales is valuable, and being willing to utilise these other Points of Power when required can be very advantageous.

Persevere… but be Patient

Personal development doesn’t happen overnight – I’m 12 years into my career and I’m barely figuring this all out myself! If you’ve read this far then it’s great to see you’re passionate about becoming a great self leader, but it’s something that takes a lot of courage, determination, and time.

We’ve grown up in a world where recognition and feedback are instant, and with the likes of smartphones and social media constantly within reach we expect everything to happen with the same immediacy as instant messaging. Simon Sinek refers to this as the “Instant Gratification” model, and the downside to this mindset is that we get disheartened when change doesn’t happen at the same pace.

The journey is long and sometimes it can feel like it’s difficult to be the master of your own destiny, to have the agency to effect change and motivate yourself to constantly improve. Be patient, a lot of this comes with time and experience but if you can keep momentum and focus on continuous improvement, great things will be bestowed upon you – the self leader.

Closing

The above is a very brief insight into some schools of thought I’ve been looking into recently. It’s all too easy for people to tell you to just “take ownership”, and “empower” you without any real support or direction. But hopefully some of the above gives you inspiration on how to take charge of your own career, and get what you need in order to succeed.

References & Further Reading


Erskine Nash Associates

Performance Tuning Next.js

TL;DR: Next.js 9.3 introduces getStaticPaths, which allows you to generate a data-driven list of pages to render at build time, potentially allowing you to bypass server-side rendering for some use cases. You can now also use the fallback property to dynamically build pages on request, and serve the generated html instead.

On a recent project we built a website for a client using a combination of Next.js and Contentful headless CMS. The goal of the website was to offer a responsive experience across all devices whilst keeping load times to a minimum and supporting SEO.

I rather like Next.js – it combines the benefits of React with Server Side Rendering (SSR) and static html builds, enabling caching for quick initial page loads and SEO support. Once the cached SSR page has been downloaded, Next.js “hydrates” the page with React and all of the page components, completely seamlessly to the user.

The website is deployed to AWS using CloudFront and Lambda@Edge as our CDN and SSR platform. It works by executing a lambda for origin requests and caching the results in CloudFront. Regardless of where the page is rendered (client or server), Next.js runs the same code, which in our case queries Contentful for the content to display on the page – neat, as one code path handles both scenarios.
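To make that concrete, here’s a minimal sketch of the shape of a Lambda@Edge origin-request handler. The renderPage helper is a hypothetical stand-in for the serverless-next.js rendering logic, not our production code:

```javascript
// Pull the CloudFront request out of the Lambda@Edge event payload.
function getCloudFrontRequest(event) {
    return event.Records[0].cf.request;
}

// Hypothetical stand-in for the page require + render performed by the
// serverless-next.js handler.
async function renderPage(uri) {
    return `<html><body>Rendered ${uri}</body></html>`;
}

// The origin-request handler: runs on a CloudFront cache miss and returns
// rendered HTML for CloudFront to cache. In the real lambda this function
// is wired up as exports.handler.
async function handler(event) {
    const request = getCloudFrontRequest(event);
    const html = await renderPage(request.uri);
    return {
        status: '200',
        headers: {
            'content-type': [{ key: 'Content-Type', value: 'text/html' }],
        },
        body: html,
    };
}
```

CloudFront caches whatever the handler returns, so subsequent requests for the same URI skip the lambda entirely until the cache entry expires.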

During testing, we noticed that page requests that weren’t cached in CloudFront could take anything up to 10 seconds to render. Although this only affects requests that miss the cache, it wasn’t acceptable to us, as it impacts every page that needs to be server-side generated, and the issue is replicated at every CloudFront edge location. It only affects the first page load of a visitor’s session, however, as subsequent requests are handled client-side and only the new page content and assets are downloaded.

Whilst investigating the issue we spotted that the majority of processing time was spent in the lambda. We added extra logging to output the elapsed time at various points in the lambda, and then created custom CloudWatch metrics from these to identify where most of the time was incurred.

We identified that the additional overhead came from requiring the specific page’s JavaScript file embedded within the lambda bundle, which is loaded dynamically for the requested page. It’s loaded dynamically to avoid pulling in every page’s assets when only a single page needs rendering, which would add considerable and unnecessary startup time to the lambda.
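As an illustration of that dynamic load (the directory layout here is an assumption, not our actual bundle structure), the lambda resolves the compiled page module from the request URI and requires only that one file:

```javascript
// Map a request URI to the compiled page module bundled with the lambda.
function resolvePageModulePath(uri) {
    // '/' serves the index page; '/news' maps to './pages/news.js', etc.
    return `./pages${uri === '/' ? '/index' : uri}.js`;
}

// Inside the handler, only the requested page's module is loaded:
//   const page = require(resolvePageModulePath(request.uri));
// That require is where the bulk of our cold-start time was spent.
```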

The lambda we used was based on the Next.js plugin for the Serverless Framework, but as we were using Terraform, we took the bits we needed from the plugin’s source to make it work: https://github.com/serverless-nextjs/serverless-next.js/blob/master/README.md

Due to the overhead from the require statement, we experimented with the resource allocation given to the lambda. It was initially set to 128mb, so we played with various configurations and applied load against the website using JMeter to see if extra resources improved the responsiveness.

We found that by tweaking the memory allocation of the lambda, we could improve the average startup time from ~10 seconds to ~2 seconds. The sweet spot was 368MB, just as the curve begins to flatten out. On the surface, increasing from 128MB to 368MB roughly triples our lambda costs; however, these are negligible as the lambda only runs on cache misses, with most of our requests served from the CloudFront cache. That said, adding extra resources for the sake of milliseconds would be superfluous and more expensive.

This improvement in speed was good enough for us, considering it impacted only a small percentage of visitors. A colleague of mine afterwards however suggested a couple of further refinements that could be made, which would reduce this impact even further. These options would require additional development effort which for us wasn’t possible at the time, but would make the website really responsive for all visitors.

Other strategies for mitigating the cold start issue

Multiple cache behaviours for different paths

By identifying which areas of your website are updated more often than others, you can mitigate the lambda issue by tweaking the cache expiries associated with them in CloudFront. For example, your homepage may change several times a day, whereas your news articles once published might stay fairly static. In this case, you could apply a short cache expiry to the root of your website / and a longer one for /news/*.

Invalidating CloudFront caches proactively

You could proactively invalidate CloudFront caches whenever content on your website changes. CloudFront allows you to specify a path to evict the cache for, so you can be really specific on what you want to invalidate. In our scenario, we could use Contentful webhooks to be notified when a piece of content is updated or removed, and use a lambda to trigger a cache invalidation for that path.
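As a sketch of what that could look like (the helper and webhook payload shape are illustrative assumptions, not our production code), the lambda only needs to build the parameters that the AWS SDK’s cloudfront.createInvalidation call expects:

```javascript
// Build the parameters for a CloudFront invalidation of a single path.
function buildInvalidationParams(distributionId, slug) {
    return {
        DistributionId: distributionId,
        InvalidationBatch: {
            // CallerReference must be unique for each invalidation request.
            CallerReference: `invalidate-${slug}-${Date.now()}`,
            Paths: { Quantity: 1, Items: [`/${slug}`] },
        },
    };
}

// In the webhook-triggered lambda, the wiring would be roughly:
//   const cloudfront = new AWS.CloudFront();
//   await cloudfront
//       .createInvalidation(buildInvalidationParams(process.env.DISTRIBUTION_ID, slug))
//       .promise();
```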

Generating dynamic pages at build time

As of Next.js 9.3 there is now a getStaticPaths function, which allows you to generate dynamic pages (pages that use placeholders, e.g. /news/[article-uri]) at build time. This can significantly reduce the need for SSR, depending on your use case.

Initially, you had to generate all of these pages as part of your build, which could be quite inefficient (e.g. rebuilding a website that has thousands of blog articles every time a new one is published). However, as of Next.js 9.3 you can now generate static pages on demand, using the fallback key returned by getStaticPaths.
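As a sketch (the fetch helpers below stand in for Contentful queries and aren’t our real code), the data-fetching half of a hypothetical pages/news/[slug].js would look something like this:

```javascript
// Stand-in for a Contentful query returning every article slug.
async function fetchAllSlugs() {
    return ['first-post', 'second-post'];
}

// Stand-in for fetching one article's content by slug.
async function fetchArticle(slug) {
    return { slug, title: `Article ${slug}` };
}

// Tell Next.js which dynamic pages to pre-render at build time.
export async function getStaticPaths() {
    const slugs = await fetchAllSlugs();
    return {
        paths: slugs.map((slug) => ({ params: { slug } })),
        // fallback: true means a slug not listed above is built on its
        // first request rather than returning a 404.
        fallback: true,
    };
}

// Fetch the props for a single page at build (or fallback) time.
export async function getStaticProps({ params }) {
    const article = await fetchArticle(params.slug);
    return { props: { article } };
}
```

With fallback: true, a page that wasn’t returned by getStaticPaths is rendered on its first request and then served as a static page thereafter.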

In our project, we could use Contentful WebHooks to trigger website builds, passing through the URI of the new page into the build pipeline to specify what part of the website to rebuild. If you have a page template for /news/* for example, you’d possibly have to trigger a rebuild of all news.

Doing this would negate a lot of the above, as for us we could build a lot of the website upfront, and then new blog articles could be built on demand when visitors accessed them. Next.js’ fallback functionality notifies you when a page is being built for the first time, allowing you to present an intermediary “page loading” screen for the first visitor who triggers the build, giving them visual feedback and keeping them engaged whilst the page builds behind the scenes.

Hopefully this overview gives you some understanding of the potential performance issues faced when using SSR with Next.js, and also the variety of options available to you when tuning your application.

More details of Next.js’ Server Side Rendering and Static Generation capabilities can be found here:

https://nextjs.org/blog/next-9-4

https://nextjs.org/docs/basic-features/data-fetching

https://nextjs.org/docs/basic-features/pages

Booking a Meeting Room with Alexa – Part Two – Coding the Skill

Hey there! In my previous post Booking a Meeting Room with Alexa – Part One, I talk about how to build up the Interaction Model for your Skill using the Alexa Developer Console. Now, I’ll talk about how to write code that can handle the requests.

Setting Up

I chose to use JavaScript to write the skill, as I wanted to try something a little different to Java, which is what I normally use. Alexa has an SDK that allows you to develop Skills in a number of languages including Java and JavaScript, but also C#, Python, Go and probably many more. I chose JavaScript because of its quick load time and conciseness. I’ve written a previous Skill in both JavaScript and Java, the former taking < 1 second to execute and the latter taking ~2.5 seconds. They both did the same thing, but Java apps can become bloated quickly and unknowingly if you pick certain frameworks, so be wary when choosing your weapon of choice and make sure it’s going to allow you to write quick-responding skills. Waiting for Alexa to respond is like waiting for a spinning wheel on a UI, or waiting for your elderly relative to acknowledge they’ve heard you… I’m sure you know what I mean.

To develop in Javascript, I used npm for managing my dependencies, and placed my production code under “src” and test code under “test” (sorry, Java idioms kicking in here!). I used npm init to create my package.json, which includes information about my package (such as name, author, git url etc.) and what dependencies my javascript code has. I later discovered that you can use ask new to create a bootstrapped skill, which you can then use to fill the gaps with your business logic.

Regarding dependencies, there are a couple of key ones you need for Alexa development: ask-sdk-core and ask-sdk-model. I also used the ssml-builder library, as it provides a nice Builder DSL for crafting your responses.

Skill Structure

Skills have an entrypoint for receiving a request, and then delegate off to a specific handler that’s capable of servicing it. The skeleton of that entry point looks like this:

const Alexa = require('ask-sdk-core');
const Speech = require('ssml-builder');

let skill;

exports.handler = async function (event, context) {
    if (!skill) {
        skill = Alexa.SkillBuilders.custom()
            .addRequestHandlers(
                <Your Handlers Here>
            )
            .addErrorHandlers(ErrorHandler)
            .create();
    }
    const response = await skill.invoke(event, context);
    return response;
};

So in your top-level handler, you specify one or more RequestHandlers and one or more ErrorHandlers. Calling the create() function returns a Skill object, which you can then invoke with the received request.

The singleton skill object is lazily initialised because your lambda container can stay active for a period of time after it completes a request, and may handle subsequent requests. Initialising the skill only once speeds up those subsequent requests.

Building a RequestHandler

In the middle of the Alexa.SkillBuilders code block, you can see my <Your Handlers Here> placeholder. This is where you pass in RequestHandlers. These allow you to encapsulate the logic for your Skill into manageable chunks. I had a RequestHandler per Intent in my Skill, but it’s quite flexible. The SDK uses something similar to the chain of responsibility pattern, passing your request to each RequestHandler in turn until it finds one that can handle it. Each RequestHandler has a canHandle function, which returns a boolean stating whether it can handle the request or not:

const HelpIntentHandler = {
    canHandle(handlerInput) {
        return handlerInput.requestEnvelope.request.type === 'IntentRequest'
            && handlerInput.requestEnvelope.request.intent.name === 'AMAZON.HelpIntent';
    },
    handle(handlerInput) {
        const speechText = 'Ask me a question about Infinity Works!';

        return handlerInput.responseBuilder
            .speak(speechText)
            .reprompt(speechText)
            .withSimpleCard('Help', speechText)
            .getResponse();
    }
};

As you can see above, the canHandle function can decide whether or not it can handle the request based on properties in the request. Amazon has a number of built in Intents, such as AMAZON.HelpIntent and AMAZON.CancelIntent that are available to your Skill by default. So it’s best to have RequestHandlers that can do something with these such as providing a list of things that your Skill can do.

Under that, you have your handle function, which takes the request and performs some actions with it. For example, that could be adding two numbers spoken by the user, or in my case calling an external API to check availability and book a room. Below is a shortened version of my Room Booker Skill, to give you a flavour of how this looks:

async handle(handlerInput) {
    let accessToken = handlerInput.requestEnvelope.context.System.user.accessToken;
    const deviceId = handlerInput.requestEnvelope.context.System.device.deviceId;
    let deviceLookupResult = await lookupDeviceToRoom(deviceId);
    if (!deviceLookupResult) {
        return handlerInput.responseBuilder
            .speak("This device doesn't have an associated room, please link it to a room.")
            .getResponse();
    }

    const calendar = google.calendar({version: 'v3', auth: oauth2Client});
    const calendarId = deviceLookupResult.CalendarId.S;
    let event = await listCurrentOrNextEvent(calendar, calendarId, requestedStartDate, requestedEndDate);

    if (roomAlreadyBooked(requestedStartDate, requestedEndDate, event)) {
        // Look for other rooms' availability
        const roomsData = await getRooms(ddb);
        const availableRooms = await returnAvailableRooms(roomsData, requestedStartDate, requestedEndDate, calendar);
        return handlerInput.responseBuilder
            .speak(buildRoomBookedResponse(requestedStartDate, requestedEndDate, event, availableRooms))
            .getResponse();
    }

    // If we've got this far, then there's no existing event that'd conflict. Let's book!
    await createNewEvent(calendar, calendarId, requestedStartDate, requestedEndDate);
    let speechOutput = new Speech()
        .say(`Ok, room is booked at`)
        .sayAs({
            word: moment(requestedStartDate).format("H:mm"),
            interpret: "time"
        })
        .say(`for ${requestedDuration.humanize()}`);
    return handlerInput.responseBuilder.speak(speechOutput.ssml(true)).getResponse();
}

Javascript Gotchas

I’ll be the first to admit that JavaScript is not my forte, and this is certainly not what I’d call production quality! But for anyone like me there are a couple of key things I’d like to mention. To handle date and time processing I used Moment.js, a really nice library IMO for handling datetimes, but also for outputting them in human-readable format, which is really useful when Alexa is going to say them.

Secondly… callbacks are fun… especially when they don’t trigger! I smashed my head against a wall for a while wondering why when I was using the Google SDK that used callbacks, none of them were getting invoked. Took me longer than I’d like to admit to figure out that the lambda was exiting before my callbacks were being invoked. This is due to Javascript running in an event loop, and callbacks being invoked asynchronously. The main block of my code was invoking the 3rd party APIs, passing a callback to execute later on, but was returning way before they had chance to be invoked. As I was returning the text response within these callbacks, no text was being returned for Alexa to say within the main block, so she didn’t give me any clues as to what was going wrong!
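A stripped-down reproduction of the problem, with setTimeout standing in for the Google SDK’s callback-based API:

```javascript
// Stand-in for a callback-style SDK call: invokes the callback on a later
// tick of the event loop, after the current function has returned.
function fetchItem(callback) {
    setTimeout(() => callback(null, { Item: 'room-1' }), 0);
}

function handleWithCallback() {
    let response;
    fetchItem((err, data) => {
        // This runs too late: handleWithCallback has already returned.
        response = data.Item;
    });
    return response; // undefined – the callback hasn't fired yet
}
```

Calling handleWithCallback() returns undefined, which is exactly why Alexa had nothing to say.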

To get around this, I firstly tried using Promises, which would allow me to return a Promise to the Alexa SDK instead of a response. The SDK supports this, and means that you can return a promise that’ll eventually resolve, and can finalise the response processing once it does. After a bit of Googling, I found that it’s fairly straightforward to wrap callbacks in promises using something like:

return new Promise(function (resolve, reject) {
    dynamoDb.getItem(params, function (err, data) {
        if (err) reject(err);
        else resolve(data.Item);
    });
});

Now that I’d translated the callbacks to promises, it allowed me to return something like the following from the Skill, which the SDK would then resolve eventually:

return createNewEvent(calendar, requestedStartDate, requestedEndDate)
    .then(() => handlerInput.responseBuilder.speak("Room Booked").getResponse());

Unfortunately, I couldn’t quite get this to work, and as it’s been a couple of months since I did this I can’t remember what the reason was! But the things to be wary of for me are the asynchronous nature of JavaScript, and closures – make sure that objects you’re trying to interact with are in the scope of the Promises you write. Secondly, using Promises ended up resulting in a lot of Promise chains, which made the code difficult to interpret and follow. Eventually, I ended up using the async/await keywords, which were introduced in ES8. These act as a lightweight wrapper around Promises, but allow you to treat the code as if it were synchronous. This was perfect for my use case, because the process for booking a room is fairly sequential – you need to know what room you’re in first, then check its availability, then book the room if it’s free. It allowed me to write code like this:

let deviceLookupResult = await lookupDeviceToRoom(deviceId, ddb);
let clashingEvent = await listCurrentOrNextEvent(calendar, calendarId, requestedStartDate, requestedEndDate);
if (!clashingEvent) {
    await createNewEvent(calendar, calendarId, requestedStartDate, requestedEndDate);

    let speechOutput = new Speech()
        .say(`Ok, room is booked at`)
        .sayAs({
            word: moment(requestedStartDate).format("H:mm"),
            interpret: "time"
        })
        .say(`for ${requestedDuration.humanize()}`);
    return handlerInput.responseBuilder.speak(speechOutput.ssml(true)).getResponse();
}

That to me just reads a lot nicer for this particular workflow. async/await may not always be appropriate, but I’d definitely recommend looking into it.
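To make the contrast concrete, here’s a minimal, self-contained sketch of the same three-step workflow written both ways – the booking functions are hypothetical stand-ins, not the real Skill’s code:

```javascript
// Hypothetical stand-ins for the real lookup/availability/booking calls
function lookupRoom(deviceId) { return Promise.resolve("Room 1"); }
function checkAvailability(room) { return Promise.resolve(true); }
function bookRoom(room) { return Promise.resolve(`Booked ${room}`); }

// Promise chain: steps nest so later steps can see earlier results
function bookViaChain(deviceId) {
    return lookupRoom(deviceId).then(room =>
        checkAvailability(room).then(free =>
            free ? bookRoom(room) : Promise.reject(new Error("Room busy"))));
}

// async/await: the same flow reads top-to-bottom, like synchronous code
async function bookViaAwait(deviceId) {
    const room = await lookupRoom(deviceId);
    const free = await checkAvailability(room);
    if (!free) throw new Error("Room busy");
    return bookRoom(room);
}

bookViaAwait("device-123").then(console.log); // prints "Booked Room 1"
```

Both functions do exactly the same thing; the second just avoids the nesting that builds up as more steps depend on earlier results.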

Speech Synthesis Markup Language (SSML)

The last thing I want to discuss in this post is Speech Synthesis Markup Language (SSML). It’s an XML-based syntax for constructing phrases that a text-to-speech engine can say, and it’s a standard used by many platforms, not just Alexa. In the code snippet above, I used a library called ssml-builder, which provides a nice DSL for constructing responses and converts your input into SSML. The code above actually returns:

<speak>Ok, room is booked at <say-as interpret-as='time'>9:30</say-as> for an hour</speak>

Alexa supports the majority of features defined by the SSML standard, but not all of them. I used https://developer.amazon.com/docs/custom-skills/speech-synthesis-markup-language-ssml-reference.html as a reference for what you can get Alexa to do, and it’s still quite a lot! The main thing I had trouble with was getting SSML to output times in a human-readable way – even using the time hints in the say-as attributes resulted in some pretty funky ways of saying the time! That’s when moment.js came to the rescue: it could output human-readable forms of the times, so I could avoid using SSML to process them entirely.

If you want to play about with SSML, the Alexa Developer Console provides a sandbox under the “Test” tab, which allows you to write SSML and have Alexa say it. This way you can identify the best way to output what you want Alexa to say, and experiment with tones, speeds, emphasis on certain words and so on, to make her feel more human.
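For example, here’s a small hand-written snippet you could paste into the Test tab – the wording is just illustrative, but break, emphasis and prosody are all SSML elements Alexa supports:

```xml
<speak>
    Ok, room is booked at
    <break time="200ms"/>
    <emphasis level="moderate">half past nine</emphasis>
    <prosody rate="90%">for an hour.</prosody>
</speak>
```

Tweaking the pause length, emphasis level and speaking rate in the sandbox is a quick way to find a phrasing that sounds natural.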

Wrapping Up

And that’s it for this post – hopefully it gives you an idea of where to start if you’ve not done Alexa or JavaScript development before (like me!). In the next post I’ll be touching on how to unit test Skills using JavaScript frameworks.

Whilst writing this post, Amazon have been sending me step-by-step guides on Alexa development which I think are useful to share too, so if you get a chance take a look at these as well. You don’t even need to be a coder to get started with them! Until next time…

Design your Voice Experience
Identify your Customers
Write your Script
Define your Voice Interaction

Build your Front End, Your Way
Developer Console
Command-Line Interface
Third Party Tools – no Coding Required!

Build the Back-End
Start with AWS Lambda
More Tools – No Back-End Setup Required

Booking a Meeting Room with Alexa – Part One

Hey there! This is part one of my adventures in developing an Alexa Skill. I was inspired recently on a client site, where I saw they’d installed a shiny new room booking system. Each meeting room had a touch screen set up outside it, from which you could see who’d booked the room, and also book the room out yourself if it was available.

It had the right idea, but from talking to people I learnt that it wasn’t the most user-friendly, and that it had cost a pretty penny too! I’d been looking for an excuse to dabble with Alexa and Voice UIs, so I decided to see if I could build something similar with commodity hardware.

“Alexa, book this meeting room out for 48 minutes”

Because I like nothing more than diving in at the deep end, I chose a tech stack completely unfamiliar to me. My comfort zone of late is Java and ECS, so instead I used AWS Lambda to host the Skill and JavaScript as the development language, with the Serverless framework to manage deployments. The development of a Lambda-backed Skill splits into two parts – creating and hosting the voice interface, and writing the application code that handles your requests.

In this blog post I’ll be focusing on developing the Invocation Model using the Alexa Development Console. To get started, you can go here and sign in using your Amazon.com account. If you need to create an account you can do that here too.

With Alexa, what you write are Skills – code functions that carry out the action you want to happen. They’re triggered by Invocations – user requests in the form of phrases that Alexa uses to figure out what you’re trying to do. In my case, an Invocation was “Alexa, book out this meeting room for 48 minutes”.

Once you get set up with an account, you’ll end up at a page listing your current skills. Towards the right hand side there’s a button called “Create Skill”, go ahead and click that to be taken to the following page to create your skill:

Amazon gives you a number of template models to choose from, to speed up development and give examples of what you can do with Alexa. You can also “Provision your own” backend resources, directing your Skill either to an HTTP endpoint or an AWS Lambda. Alternatively, you can choose “Alexa-Hosted”, which uses AWS Lambda but integrates the code development into the Alexa Console, so you can develop the code alongside the voice interface in the same UI.

An Alexa Skill can have one or more Intents – actions or requests that your Skill can handle. An Intent can be something like “what’s the weather today”, “what’s on my agenda today”, or “book me a meeting room” (see where I’m going with this? 😉). Intents are invoked by Utterances – the phrases a user says to request them. You can link multiple Utterances to an Intent, which is useful for capturing all the variations someone might use to request it.

As part of designing the UX, I found it useful to test how I’d interact with my Skill on an Echo Device, but with the microphone turned off. It was interesting to see how many variations I could come up with to request booking a room, and I noted all of these variations and configured them as Utterances, as you can see below:

Within these Utterances you can also have Slots – parameter placeholders that allow you to pass variables into the request, making it more dynamic. In my case this meant letting the user specify the duration of the booking and, optionally, a start time, but it could equally have been movie actors, days of the week, a telephone number and so on. Amazon provides various built-in Slot Types, such as animals, dates and countries, which help Alexa match what the user said against a value in that Slot Type. Slots can be optional too: by configuring multiple Utterances that use different combinations of your Slots, your requests can include one or more parameters.
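As an illustration, an Intent with Slots is declared in the Interaction Model JSON along these lines – the intent name and sample phrases below are my own invention, not taken from the real Skill:

```json
{
  "name": "BookRoomIntent",
  "slots": [
    { "name": "duration", "type": "AMAZON.DURATION" },
    { "name": "startTime", "type": "AMAZON.TIME" }
  ],
  "samples": [
    "book this room for {duration}",
    "book this room at {startTime} for {duration}"
  ]
}
```

Each entry in samples is an Utterance, and the curly braces mark where a Slot value is expected – the second sample shows how an optional startTime Slot is handled simply by having an extra Utterance that includes it.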

If you don’t want to use one of the preconfigured Slot Types you can create your own list of values to match the parameter against, or use the AMAZON.SearchQuery Slot Type, although I’ve had varying success with its speech recognition.

Not related to my Meeting Room Booker Skill, but something worth mentioning: Alexa doesn’t always quite catch what I say (or interprets it differently to how I intended), making it difficult to do exact matches or lookups. For example, I tried building a “Skills Matrix” Skill, where I could name a technology and Alexa would tell me who knows about it. I didn’t realise you could have so many variations on interpreting the words “Node JS”! The only way I could think of to get around this at the time was to have a custom “Technology” Slot Type and, for the technologies that are harder to pronounce, list all the expected variations in there. You can also employ a “Dialog Delegation Strategy”, which defers all dialog management to your Lambda and opens up far more ways to interact with your user (e.g. you could use fuzzy matching or ML to figure out what they meant), but it’s a bit more involved to set up.

It’s worth noting at this point that you can have a different Interaction Model per Locale, which makes sense as it allows you to account for things such as language and dialect differences. The key thing is to ensure that when you’re developing and testing your Skill (which I’ll cover in following posts) you’re always using the same Locale, otherwise you just get a rather unhelpful “I don’t know what you mean”-esque response, or an even less helpful and more confusing “Uber can help with that”, which threw me off for much longer than I’d like to admit!

Eventually, I had an Interaction Model for the Skill created through the UI. Once you’re past the point of trying things out and want to productionise it, you’ll probably be wondering how to create and modify these Skills programmatically. Thankfully, there’s an Alexa Skills Kit (ASK) CLI that allows you to do just that.

Here’s a link to the installation instructions and quick start for the CLI – https://developer.amazon.com/docs/smapi/quick-start-alexa-skills-kit-command-line-interface.html

You can use the ASK CLI to create, update and delete Skills. It’s fairly simple to use, so long as all your JSON config is correct – the errors it returns don’t give you much insight if you’ve missed a required parameter or specified an invalid value, for example.

As I’d already created a Skill through the UI at this point, I used the CLI to pull down the generated metadata so I could store it in Git. The commands I used were:

  • ask api list-skills – get the skillId for the newly created Skill
  • ask api get-skill -s {skillId} – get the Skill metadata as a JSON payload
  • ask api get-model -s {skillId} -l {locale} – get the Interaction Model metadata as a JSON payload

At this point, everything I’d done in the UI was available to me as code, and I was able to check it all in to Git. I found this very useful, just as with any code: once you start tweaking and trying out various things, it can be difficult to revert to a known-good state without it should things go wrong. You can use the following commands to update your Skill:

  • ask api update-skill -s {skillId} -f {skill-file-location} – update the Skill metadata
  • ask api update-model -s {skillId} -l {locale} -f {model-file-location} – update the Interaction Model

You can also use the ASK CLI to create a Skill from scratch, without ever needing to use the UI. ask new will configure and provision a Skill, and it also creates a folder structure with the same JSON files I generated from my existing Skill already set up, ready for you to get started.

So that was a rather quick “how to get up and going” with creating an Alexa Skill. The next step is linking the Skill to some backend code to handle the requests. I’ll be following this post up with a how-to on that, but in the meantime if you have any questions feel free to give me a shout!

Also, if you’re reading this and thinking “my business could really benefit from an Alexa Skill”, then please drop me a line at ben.foster@infinityworks.com and let’s talk 🙂

Unrewarding Retros? Time to take action!

I love a good retro. It gives you an opportunity to vent, de-stress after a sprint, raise concerns but also for praise and motivation ready for the next sprint. A friendly platform for open, honest and frank discussions.

But what about the venting? The frustration? Those things in your project or amongst your team that cause issues. Those hindrances that we complacently acknowledge as “just the way it is” or “nothing we can do about that for now, but we will do later…”. Most of the time they keep reappearing retro after retro, with little or no attention given to them. So what can we do about them? If all we do is complain for an hour or so every 2 weeks how is that productive?

The step that frequently seems missing to me in a retro is to collate the list of woes that have hindered the sprint, and to do something about them. I’ve been in Retros like this: We collate the obstacles from a Sprint; we make a list of actions; but they’re rarely actioned. They infrequently leave the whiteboard they’re written on, or at best they’re noted down in Confluence and never looked at again. Nothing more than a remnant of good intentions.

Don’t finish the retro until you have a list of actions. A popular method for defining meaningful actions follows the SMART criteria: Specific, Measurable, Attainable, Realistic and Timely. Having SMART actions provides quantifiable feedback that can motivate your team to engage with them. The criteria break down as follows:

  • Specific and Measurable – an action needs to be tangible, with progress that can be tracked, so the team feels a sense of achievement in working towards correcting issues.
  • Attainable and Realistic – if an action isn’t achievable, the team won’t be motivated to work towards it, actions get ignored, and you’re no better off than when you started.
  • Timely – an action needs to be something that can be accomplished before the next retro, so its success is realised quickly. This also prevents actions from perpetuating (in the same way we keep stories small and focused).

And the things that go well during a Sprint… well, we don’t need to do anything about those… do we? Of course we do! Things that go well in a sprint should be praised, but we should also take action to ensure they continue to happen. This isn’t necessary for every positive, but sometimes you need to keep on top of them… you can’t have too much of a good thing!

But how do you encourage and enable the team to take ownership of these actions? Actions get ignored because “someone else will do it” or “I didn’t realise it was on me” – yet people on the team can’t justify complaining about things if they don’t take ownership of fixing them. Admittedly there are external factors outside the control of the team, but that alone warrants its own blog (and even these can be mitigated to some extent).

So this is what I’m proposing: make the actions visible as sprint goals, perhaps on a whiteboard close to the team so they’re constantly reminded of them. Alternatively, track them the same way we track everything else in a Sprint – as a story or ticket – so they’re no less significant than anything else. Sometimes this may not be possible, given the client or the nature of the actions, so use your judgement to determine how best to track them. The point is that their importance needs to be raised above where it currently sits.

I’m currently trying the whiteboard trick in my team’s Sprints, so I’ll try to feed back on how well it goes. Give it a go yourself and share your experience!

Reactive Kafka + Slow Consumers = Diagnosis Nightmare

Recently I’ve been working with the combination of Reactive Streams (in the form of Akka Streams) and Kafka, as it’s a good fit for some of the systems we’re building at work.

I hope it’ll be beneficial to others to share a particular nuance I discovered whilst working with this combination – in particular, a problem with slow downstream consumers.

To give a brief overview, we were taking messages from a Kafka topic and sending them as the bodies of HTTP POST requests. This worked fine the majority of the time, as normally we only get a message every couple of seconds.

However, the issues came when we had to deal with an influx of messages, from an upstream source that every now and then batches about 300 messages and pushes them onto Kafka. What made it even more difficult is that we didn’t know this was a contributing factor at the time…

So what happened? We saw a lot of rebalancing exceptions in the consumer, and because we were using non-committing Kafka consumers, all the messages from offset 0 were constantly being re-read every second or so as a result of the constant rebalancing. Also, when you try to use the kafka-consumer-groups script that comes with Kafka, you don’t get a list of partitions and consumers, but a notification that the consumer group either doesn’t exist or is rebalancing.

It turns out Kafka was constantly redistributing the partitions across the 2 nodes within my affected consumer group. I can’t recall how I eventually figured this out, but the root cause was combining Kafka in a reactive stream with a slow downstream consumer (HTTP).

At the time of writing we’re using akka-stream-kafka 0.11-M3, and it has an “interesting” behaviour when working with slow downstream consumers – it stops its scheduled polling when there is no downstream demand, which in turn stops its heartbeating back to Kafka. So whenever the stream was applying backpressure (because we were waiting on HTTP responses), that backpressure propagated all the way back to the Kafka consumer, which stopped heartbeating.

To replicate this, I created the following Kafka topic:
./kafka-topics.sh --create --zookeeper 192.168.99.102:2181 --topic test_topic --replication-factor 3 --partitions 6

Then I used this code to publish messages onto Kafka, and ran two of these consumers to consume in parallel within the same Kafka consumer group.

When a consumer stops heartbeating, the Kafka broker (at least with its default configuration) considers that node slow or unavailable, which triggers a rebalancing of its partitions to other nodes (which it deems might be available to pick up the slack). That’s why, when I kept reviewing the state of kafka-consumer-groups, you’d eventually see all partitions being consumed by one node, then the other, then get the rebalancing message. And because both of our nodes were using non-committing consumers, they both kept receiving the full backlog of messages, meaning they both became overwhelmed and applied backpressure, which meant Kafka kept reassigning partitions… it was a vicious cycle!

Using the kafka-consumer-groups script you can see this happening:

benfoster$ ./kafka-consumer-groups.sh --new-consumer --bootstrap-server 192.168.99.102:9000 --describe --group test-consumer
GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
test-consumer, test_topic, 3, unknown, 3, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 4, unknown, 2, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 5, unknown, 3, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 0, unknown, 3, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 1, unknown, 2, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 2, unknown, 3, unknown, consumer-1_/192.168.99.1
benfoster$ ./kafka-consumer-groups.sh --new-consumer --bootstrap-server 192.168.99.102:9000 --describe --group test-consumer
GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
test-consumer, test_topic, 3, unknown, 3, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 4, unknown, 2, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 5, unknown, 3, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 0, unknown, 3, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 1, unknown, 2, unknown, consumer-1_/192.168.99.1
test-consumer, test_topic, 2, unknown, 3, unknown, consumer-1_/192.168.99.1
benfoster$ ./kafka-consumer-groups.sh --new-consumer --bootstrap-server 192.168.99.102:9000 --describe --group test-consumer
GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
test-consumer, test_topic, 0, unknown, 75, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 1, unknown, 74, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 2, unknown, 75, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 3, unknown, 75, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 4, unknown, 75, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 5, unknown, 75, unknown, consumer2_/192.168.99.1
benfoster$ ./kafka-consumer-groups.sh --new-consumer --bootstrap-server 192.168.99.102:9000 --describe --group test-consumer
Consumer group `test-consumer` does not exist or is rebalancing.
benfoster$ ./kafka-consumer-groups.sh --new-consumer --bootstrap-server 192.168.99.102:9000 --describe --group test-consumer
Consumer group `test-consumer` does not exist or is rebalancing.

And within my consumer’s app logs, you can see it constantly rereading the same messages:

2016-07-01 09:37:37,171 [PM-akka.actor.default-dispatcher-22] DEBUG a.kafka.internal.PlainConsumerStage PlainConsumerStage(akka://PM) - Push element ConsumerRecord(topic = test_topic, partition = 0, offset = 0, key = null, value = test2)
2016-07-01 09:42:07,344 [PM-akka.actor.default-dispatcher-14] DEBUG a.kafka.internal.PlainConsumerStage PlainConsumerStage(akka://PM) - Push element ConsumerRecord(topic = test_topic, partition = 0, offset = 0, key = null, value = test2)
2016-07-01 09:38:57,217 [PM-akka.actor.default-dispatcher-22] DEBUG a.kafka.internal.PlainConsumerStage PlainConsumerStage(akka://PM) - Push element ConsumerRecord(topic = test_topic, partition = 1, offset = 3, key = null, value = test24)
2016-07-01 09:43:37,390 [PM-akka.actor.default-dispatcher-20] DEBUG a.kafka.internal.PlainConsumerStage PlainConsumerStage(akka://PM) - Push element ConsumerRecord(topic = test_topic, partition = 1, offset = 3, key = null, value = test24)
etc...

So how did we fix this? Thankfully we knew that the number of elements ever to appear in a batch would be small (a few hundred), so we added an in-memory buffer to the stream. With the buffer absorbing the burst, downstream demand never dries up, so the Kafka consumer keeps polling and heartbeating while the HTTP endpoint works through the backlog. This was a quick fix and got us what we needed.
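In Akka Streams terms the fix is roughly the fragment below – a sketch rather than a runnable app, where kafkaSource, postToHttpEndpoint and the buffer size of 500 are illustrative placeholders rather than our real values:

```scala
kafkaSource                                       // source of Kafka records
  .buffer(500, OverflowStrategy.backpressure)     // absorb bursts in memory
  .mapAsync(parallelism = 1)(postToHttpEndpoint)  // slow HTTP call downstream
  .runWith(Sink.ignore)
```

The key point is that the buffer stage keeps signalling demand upstream until it fills, so short bursts never translate into backpressure on the consumer stage.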

As soon as you add a buffer, the two consumers behave, and you get this:

benfoster$ ./kafka-consumer-groups.sh --new-consumer --bootstrap-server 192.168.99.102:9000 --describe --group test-consumer
GROUP, TOPIC, PARTITION, CURRENT OFFSET, LOG END OFFSET, LAG, OWNER
test-consumer, test_topic, 3, unknown, 87, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 4, unknown, 86, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 5, unknown, 86, unknown, consumer2_/192.168.99.1
test-consumer, test_topic, 0, unknown, 87, unknown, consumer1_/192.168.99.1
test-consumer, test_topic, 1, unknown, 86, unknown, consumer1_/192.168.99.1
test-consumer, test_topic, 2, unknown, 86, unknown, consumer1_/192.168.99.1

Is it the right fix? Probably not – if we were dealing with greater volume or velocity we’d have to properly treat this app as a slow consumer, and possibly ditch the reactive streams abstraction in favour of the lower-level Kafka API, to ensure we had full control over our partition and heartbeat management. But that constitutes a dedicated article (or several) in its own right.

I hope someone else finds this useful – it’s one of the mishaps you can have when you abstract so far away that you don’t realise the issues that can occur under the covers.

I’ve uploaded the source code for the reactive producer (shamelessly ripped from an Activator template, with the ticker publisher code from my friend Neil Dunlop) and the consumer I used, if you’d like to replicate the scenario. You’ll need a Kafka broker running:
https://github.com/foyst/reactive-kafka-publisher
https://github.com/foyst/reactive-kafka-slow-consumer

Road to Continuous Deployment Part 1 – Performance Testing

The next set of articles documents a presentation I’m working on to demonstrate Continuous Delivery, and how, in a world where CD is desired, it’s becoming increasingly important for the appropriate level of testing to happen. I’ll break them down to tackle one topic at a time, finally building up to a full demo of how I feel CD can work.

Firstly, I’m going to start with a problem CI/CD can address – Performance Testing. Not so much on how to performance test or how CI/CD can build systems that don’t require it, but how a continuous delivery pipeline can quickly alert developers to potential problems that could otherwise remain undetected and take ages to debug once in production.

Nothing new or groovy here, but a problem most developers (and definitely SQL developers) are familiar with: the N+1 problem. Java developers using Hibernate will be familiar with the @OneToMany annotation, less so with the @Fetch annotation, and even less so with the implications of changing its FetchMode. I aim to demonstrate how something as simple as changing a Hibernate @OneToMany fetch strategy can drastically affect the performance of a system.

There are several good reasons why you might want to do this:
* It can be a reasonable thing to do: maybe you want to lazy-load children, and there’s no point eagerly loading all their details with a join
* It might not always be a bad thing (perhaps most of the time only a few children are accessed, which offsets the overall performance impact), but testing should be done to assess the potential impact on performance

Side-bar: The demo project for this article was originally ported from the Spring Boot Data REST example within their samples, however the @Fetch annotation appears to be ignored there, which makes it difficult to demonstrate.
This article gives a good indication of what I expected to happen and what the problem likely is: http://www.solidsyntax.be/2013/10/17/fetching-collections-hibernate/
I suspect the Spring Boot configuration doesn’t use the Criteria API behind the scenes, which means the @Fetch annotation will be ignored.

The application is a simple school class registration system, with the domain modelled around classes and students. A single GET resource is available which returns all classes, with their students as child nodes. Below is a sample of the returned JSON:

  {
    "id": 3,
    "className": "Oakleaf Grammar School - Class 3",
    "students": [
      {
        "id": 970,
        "firstName": "Marie",
        "lastName": "Robinson"
      },
      {
        "id": 2277,
        "firstName": "Steve",
        "lastName": "Parker"
      },
      {
        "id": 4303,
        "firstName": "Lillian",
        "lastName": "Carpenter"
      },
      {
        "id": 9109,
        "firstName": "Samuel",
        "lastName": "Woods"
      }
    ]
  }

So at this point, I have a simple application whose performance can be altered by simply changing the @Fetch annotation, but how can we test the effect of this?
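For reference, the change in question is a one-line switch on the mapping. This fragment is from a hypothetical entity – the field and mapping names are illustrative, not lifted from the demo project:

```java
@OneToMany(mappedBy = "schoolClass", fetch = FetchType.EAGER)
@Fetch(FetchMode.JOIN) // one query with a join; FetchMode.SELECT instead issues
                       // a separate select per parent row - the N+1 problem
private List<Student> students;
```

With FetchMode.JOIN, Hibernate fetches each class and its students in a single joined query; with FetchMode.SELECT, it runs one query for the classes plus one additional select per class to load its students.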

This is exactly what Gatling is designed for. It’s a performance and load testing library written in Scala, with a really nice DSL for expressing scenarios that put load on a system.

This is the code required to define a scenario for testing our system:

import io.gatling.core.Predef._
import io.gatling.http.Predef._

class SchoolClassLoadTest extends Simulation {

    val httpConf = http
        .baseURL("http://localhost:8080") // The root for all relative URLs

    val scn = scenario("Scenario Name").repeat(10, "n") {
        exec(http("request_1")      // Using the http protocol
            .get("/v1/class")       // GET with the relative URL
            .check(status.is(200))) // Check the response status code is 200 (OK)
    }
}

Not much is it?

Side-bar: Gatling also comes with a recorder UI. This allows you to record all interactions with applications over HTTP and save them as scenarios. Gatling achieves this by acting as a proxy, which you can use like any other web proxy: route web browser or application traffic via the Gatling Recorder Proxy, and it will record all interactions made through it, saving them as a scenario similar to the one above that can be tweaked later.

Once you have your scenario defined, you can configure your simulations to meet whatever load and duration parameters you like. For example, in our app I’m going to run the test simulating 10 concurrent users, each making 10 requests:

setUp(scn.inject(atOnceUsers(10)).protocols(httpConf))

This is just a simple example of what you can do with Gatling: you can specify ramp-up and ramp-down of users to test how an app scales, pause between steps to simulate human interaction, and much more. For more about what you can do with Gatling, check out their Quickstart page.

Back to our app. I’m now going to hit this simulation against our app whilst it’s using the FetchMode.JOIN config:
N+1WithJoin

N.B. I warmed up the DB cache before both the before and after benchmarks, by running the simulation once without recording the results.

Above is the baseline for our app – you can see the mean response time is 221ms, the max is 378ms, and the 95th percentile is 318ms. Now look what happens when I simply change the @Fetch strategy from JOIN to SELECT:
N+1WithSelect

The mean response time has increased from 221ms to 3.4 seconds, with the 95th percentile rocketing up to 8.3 seconds! Something as simple as an annotation change can have a dramatic impact on performance. The worrying thing is how easy this is to do: it could be done by someone unfamiliar with Hibernate and database performance tuning; by someone frantically changing a number of things arbitrarily in search of a performance gain; or, what I consider worse, by someone who believes the benefits of the change outweigh the performance knock but fails to gather the evidence to back that up!

Now that we’ve got the tools required to test performance, in my next article I’ll look into how this can be proactively monitored between releases. Stay tuned…

Monitoring Solutions for Scakka Applications

Over the past few days I’ve been looking into monitoring Scala and Akka applications, as a number of new integration projects at work use this stack. Akka is designed to make building concurrent applications easier, and it’s a great middle ground for transitioning from OOP to a functional approach.

The choices within this first investigation were between AppDynamics and a combination of Kamon, StatsD, Graphite and Grafana. So what are the two?

AppDynamics (http://www.appdynamics.com): An Application Performance Management tool, capable of providing detailed analytics of a system, not just for the JVM but a variety of technologies, single apps or distributed interactions across a network.

The second choice is actually a combination of open source tools, which together provide a subset of AppDynamics’ functionality:
Kamon (http://kamon.io): A modular API for recording metrics in JVM-based applications
StatsD (https://github.com/etsy/statsd): A metric collation and aggregation service, which can take metrics from a number of sources (Kamon being one of them)
Graphite (https://github.com/graphite-project/): A highly scalable real-time graphing system
Grafana (http://grafana.org): A dashboard builder capable of presenting charts from Graphite, Elasticsearch and a number of other sources in a unified dashboard

Disclaimer: AppDynamics has saved my bacon on a previous project, providing detailed insight into a concurrent system that was suffering from performance issues. Needless to say I’m a little biased towards it, but I’m still interested to see what else is out there.

I’m aware there’s also New Relic, but given it’s fairly comparable to AppDynamics I dropped it from this comparison (mainly because it didn’t provide a broad enough contrast in choice, and it isn’t free).

So I started off with AppDynamics, as that was familiar ground and I suspected it wouldn’t be too much effort given my previous experience with it.

I used the latest version at the time (4.1), and set up the Controller in a VM. This was fairly straightforward as usual (although it takes quite a while), and afterwards I downloaded a Java Agent and hooked it up to my app using the -javaagent VM argument.


Sometimes you’re lucky, and AppDynamics will just detect everything going on in your app straight away… This wasn’t one of those times. No metrics appeared for any Business Transactions, and the App overview looked very bare. Digging into the App Servers list, however, showed that my app had indeed been registered, and when I clicked on the Memory tab I could see that memory usage was being monitored.

Ok, so I’m guessing AppDynamics wasn’t able to detect any entry points, as the app establishes its own outward connections, so nothing from the outside initiates connections to it. I also couldn’t benefit from the automated instrumentation that the Spring framework gets, either.


I got around this by adding some custom POJO instrumentation: monitoring any objects that implement the Actor trait, and listening for invocations of the aroundReceive method. Splitting the transactions by simple class name gives a breakdown per Actor type. If necessary you could go a level further and split by message type, but I felt that was a bit OTT to start with.
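For reference, the match rule I set up looked roughly like the following – the field names are from memory of the 4.1 UI, so treat this as a sketch rather than an exact recipe:

```
Entry Point Type:  POJO
Match Classes:     extends/implements akka.actor.Actor
Method Name:       aroundReceive
Transaction Split: use the simple class name of the POJO
```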


From here, you get high-level metrics about each Actor, such as response times, calls per minute and error rates, each accompanied by historical sparklines. Double-clicking each of the Business Transactions provides very detailed information, from a graphical view of interactions (e.g. HTTP calls made) right down to elapsed time at method level.

AppDynamics’ ability to track errors logged using SLF4J pretty much gives me everything I’d like from an app monitoring perspective. However, there are a couple of drawbacks. Firstly, while there is a free version, it limits you to storing 24 hours of data, and you can only track one JVM instance with it. This makes it very limited if you want to monitor interactions across a number of co-ordinated microservices. Secondly, AD’s default refresh rate for metrics appears to be a minute, although this may be configurable. That might not be real-time enough for some people, but it hasn’t been a problem for me.

Next I tried the Kamon stack, to see what comparable metrics I could get from my app. The benefits of this stack are that it’s free (being open source) and, because it’s modular, you can mix and match many of its components. For example, you could use a different metrics aggregator (instead of StatsD) or a different dashboarding tool (instead of Grafana), or you can embed it within your own app monitoring framework. As it’s open source it shouldn’t be too difficult to write your own extensions to add extra functionality.

I used the following example stack (https://github.com/muuki88/docker-grafana-graphite), which all runs within a single Docker container and makes starting up and tearing down a breeze. From what I’ve seen this appears to be the standard platform for monitoring Akka so far.

Setting it up was a bit more involved than AppDynamics, which put me off a little, as I’d like a monitoring platform to be as unobtrusive as possible. Compared to AD, I had to add a number of Kamon components as dependencies, plus the AspectJ Weaver (which is how Kamon instruments methods within the Actor system). The aspectjweaver.jar also needs adding as a Java Agent – there are three ways to do this; I used the -javaagent argument, as this allowed me to run the app directly from IntelliJ (without having to run sbt externally).
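For reference, my dependency setup looked roughly like this build.sbt fragment – the module names and version are from the Kamon 0.5.x era, so check them against whichever release you’re using:

```scala
// Hypothetical build.sbt fragment (Kamon 0.5.x era; module names may differ in newer releases)
libraryDependencies ++= Seq(
  "io.kamon" %% "kamon-core"           % "0.5.2",
  "io.kamon" %% "kamon-akka"           % "0.5.2",
  "io.kamon" %% "kamon-statsd"         % "0.5.2",
  "io.kamon" %% "kamon-system-metrics" % "0.5.2"
)
```

The AspectJ weaver is then passed as a JVM option, e.g. -javaagent:/path/to/aspectjweaver.jar in the IntelliJ run configuration.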

Once I got Kamon working correctly with my app, I configured the monitoring. Again, this is more intrusive than AD, as it’s done within the app’s config file. However, it provides a number of configuration options, such as filters for including or excluding Actors, Routers and Dispatchers. You can also change the tick interval (to make dashboard updates more frequent) and include system and JVM metrics (such as CPU usage, host memory, JVM heap consumption, network utilisation etc.)
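As an illustration, the relevant parts of my application.conf looked something like this – based on Kamon 0.5.x, where the filter patterns are matched against actor paths; the app name ("my-app") and the paths are placeholders:

```
kamon {
  metric {
    # How often metrics are flushed to subscribers such as StatsD
    tick-interval = 1 second

    filters {
      akka-actor {
        includes = ["my-app/user/**"]
        excludes = ["my-app/system/**"]
      }
    }
  }

  statsd {
    # App name used to build the metric keys posted to StatsD
    simple-metric-key-generator.application = "my-app"
  }
}
```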

Once this was running and posting data to StatsD, I connected to the Grafana app on the webapp port advertised by Docker (use “docker ps” to find out what this is) and started playing about with what was available. First I created a row by selecting the button on the left, added a chart, and then started looking through all the metrics that were available. Most of them appear under the stats.timers hierarchy, collated under your app name (which is declared within the Kamon configuration in your app’s config file, under the kamon.statsd.simple-metric-key-generator.application element). The Akka-specific metrics live under the akka-actor section, which is constantly updated as the system creates new Actors.
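To give a concrete idea of the naming, an actor’s processing-time metric ends up under a key shaped roughly like this – the app name, host and actor name are made up, and the exact layout varies by Kamon version:

```
stats.timers.my-app.my-host.akka-actor.user-some-actor.processing-time
```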


One thing I didn’t like was that Grafana doesn’t know what scales or formats to use for any of the metrics – this has to be done manually. I suspect this is because there’s no metadata available about the metrics, which is understandable given it’s a pretty generic framework for aggregating whatever metrics you want. For example, when adding processing time it doesn’t provide any suggestions, and leaves you to pick between seconds, milliseconds, nanoseconds etc. Given I didn’t know at which resolution Kamon was recording, this made it difficult to pick the right one!

Kamon doesn’t currently provide much in the way of metrics, although I’m sure that’s set to increase with future releases. Alongside the system metrics (i.e. CPU, memory, heap usage etc.), Kamon specifically monitors Akka, so it can give you information on processing time, time spent in the mailbox etc., which may be more useful than AD, which only provides generic metrics. Along with the few Akka metrics already available, another one I’d like is messages processed, so I can see if an Actor is getting swamped or processing more than it should.

For now, I think I’ll be sticking with AppDynamics – yes, it’s more heavyweight (it needs a dedicated VM) and not free, but it provides more than enough information for me to make informed decisions about an application’s performance or issues. When I need to monitor more than one microservice simultaneously, however, I might need to look elsewhere.

Then again, it’s quite likely that my lack of knowledge with the Kamon stack has limited my understanding of its potential. If anybody has examples or suggestions on how it can be better refined to monitor these kind of applications please comment below, it would be greatly appreciated!

Using Spotify Docker Plugin on Windows

So recently I have been looking into using Docker to automate parts of my testing and deployment, using my lightweight CEP component (https://github.com/foyst/smalldata-cep) – which uses Siddhi under the covers – as a basis to develop my skills in Continuous Integration / Deployment.

I’m currently using Windows 10 on my personal laptop and, never one to give up on a challenge, I finally got Docker Toolbox (version 1.8.2) running on Windows, with boot2docker and the Docker Quickstart Terminal.

My next goal was to configure Docker builds directly against my boot2docker VM using Spotify’s Docker Maven Plugin (https://github.com/spotify/docker-maven-plugin), as a step towards Continuous Integration / Deployment. The idea is that whenever I run the deploy phase of my Maven build, a Docker image of my software is built, ready to run immediately afterwards.

This is the maven configuration I used within my “runner” module’s pom.xml:

<plugins>
    <plugin>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-maven-plugin</artifactId>
    </plugin>
    <plugin>
        <groupId>com.spotify</groupId>
        <artifactId>docker-maven-plugin</artifactId>
        <version>0.3.5</version>
        <configuration>
            <baseImage>nimmis/java:oracle-8-jdk</baseImage>
            <imageName>foyst/smalldata-cep</imageName>
            <exposes>
                <expose>8080</expose>
            </exposes>
            <entryPoint>["java", "-jar", "/app/${project.build.finalName}.jar"]</entryPoint>
            <!-- copy the service's jar file from target into the root directory of the image -->
            <resources>
                <resource>
                    <targetPath>/app/</targetPath>
                    <directory>${project.build.directory}</directory>
                    <include>${project.build.finalName}.jar</include>
                </resource>
            </resources>
        </configuration>
    </plugin>
</plugins>

I was trying to run it within IntelliJ on Windows, and was getting the following error:

[Screenshot: socket write error]

The Spotify Docker readme on GitHub states that the plugin uses the DOCKER_HOST environment variable to determine the location of the Docker daemon, and if this isn’t set it defaults to localhost:2375. So I tried setting this to the IP of the VM.


Unsuccessful again with this error:

[Screenshot: Docker error]

This hadn’t worked either. The next thing to do was to ssh onto the boot2docker VM and run a “netstat -apn” command, to see what ports Docker was listening on.

[Screenshot: netstat output on the boot2docker VM]

Docker was actually listening on 2376, not 2375! After changing this I was back to the socket write error from my previous attempts.

A quick Google brought up a bug for the plugin (https://github.com/spotify/docker-maven-plugin/issues/51), mentioning other environment variables that potentially needed setting, but at this point I didn’t know if they’d fix my specific problem, or even what to set them to if they did.

So next thing I tried to do was go back to Docker basics and build an image of my Spring Boot uberjar using a standard Dockerfile (completely outside of Maven):

FROM nimmis/java:oracle-8-jdk

WORKDIR /app
ADD smalldata-cep-runner-1.0-SNAPSHOT.jar /app/smalldata-cep-runner.jar

EXPOSE 8080
CMD ["java", "-jar", "/app/smalldata-cep-runner.jar"]

Then, from within the Docker Quickstart Terminal, I was able to run this command in the folder containing my Dockerfile: “docker build -t foyst/smalldata-cep .”

This successfully built my image in my boot2docker VM, which I could then run with “docker run --rm -p 8080:8080 foyst/smalldata-cep”

Ok, so using a vanilla Docker build I could successfully create an image – happy days! From there I moved on to running “mvn docker:build” from PowerShell…

Still an error. Then it occurred to me to try “mvn docker:build” from the Docker Quickstart Terminal:


Success!! So what’s the difference? Eventually it occurred to me that the Docker Quickstart Terminal must somehow be configured out of the box to communicate with the boot2docker VM, so I started focusing my search on that.

Always review the documentation – this came in handy: http://docs.docker.com/engine/installation/windows/#from-your-shell

This allowed me to configure the shell of my choice (PowerShell) and, knowing how that worked, I could then apply the same settings to my IntelliJ configuration.
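For anyone wanting the non-PowerShell equivalent, those docs boil down to setting three environment variables to point the client at the VM – the IP and machine name below are placeholders, and “docker-machine env default” will print the real values for your setup:

```shell
# Placeholder values -- run "docker-machine env default" to get yours
export DOCKER_HOST="tcp://192.168.99.100:2376"
export DOCKER_TLS_VERIFY="1"
export DOCKER_CERT_PATH="$HOME/.docker/machine/machines/default"
```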


Using Docker on Windows is still quite a PITA, but now that the client works natively on Windows, once you understand what goes on behind the scenes it’s possible to get it working quite nicely.

Next I will be looking into automated builds and deployments using a combination of GitHub, Jenkins and Docker…