Elasticity is Contextual!

Public Cloud became very popular in a very short time. It was well advertised by all cloud providers, led by Amazon Web Services, who called out the concept of "Elasticity".

The term Elasticity means a lot in business and there is an in-depth definition provided here.

Well, what's wrong with it as advertised by Cloud providers?
 
As an individual, I can spin up a cloud resource when I want it, scale it out/in as well as up/down, and terminate it when I am done with it.

Even better, I can go serverless and/or outsource creation and lifecycle management to the cloud provider and relax. How Elastic is that!

That all looks great. The services offered by all public cloud providers look very elastic in nature and we only pay for what we use, so let's move all workloads to the cloud so that our workloads and the services we offer also become elastic!

Let's put some context around Elasticity. As in any language, the meaning of the same word can be quite different depending on where and how it's used. Would Elasticity be any different?

Startup

If you are the founder of a startup and all your services are born in the Cloud, you will have designed and built them for the cloud, so you have truly utilised the Elastic nature of the Cloud. As you were so interested in and excited about Cloud services, you codified everything and invested a lot of time in developing automation pipelines. They are always triggered by a successful merge to master, and your services scale out and in perfectly and are replaced with every update.

As time goes on, you become busy looking after the business and non-technical side of your startup, so you hire a junior Cloud engineer and an intern to look after what you have developed, thinking they just need to maintain what you have already done. You also ask them to add new features to the existing services and to make sure they always use the latest features released by the Cloud providers.

A few months after asking for the new features, you check what they are up to, only to learn that they are still figuring out how to accommodate the new cloud features in the existing service and haven't even had time to think about your new features, which are your main business interest.

You only have a couple of months before your competitor may launch similar features, so you need to get those features out immediately. In desperation, you hire two highly paid Cloud consultants to deliver the new features within a month. You think you have told them everything and expect them to work with your junior Cloud engineer and the intern.

After a month, you check how they are getting on, and they impressively demonstrate the new features in their own environment. You are impressed, and curious what the overall experience looks like when you consume the whole service with these new features, so you ask them to demonstrate the overall service.

They are all confused and ask you, all at the same time, what you mean by overall service experience: "You just asked us to develop and integrate these new features, which we have already demonstrated. There is nothing else to demonstrate!"

You are shocked and speechless, and it takes a while for you to recover and realise the situation. The consultants developed the new features on a separate branch to get them out quickly, because you hadn't guided them to work on top of your existing codebase or with your Cloud engineer and intern. Now not only are the new features not integrated with your existing codebase, but your Cloud engineer and intern are also not across the work done by the consultants.

You have already promised your investors that you will launch the new features on time, before your competitors. Now you have features developed in isolation by two consultants who start a new engagement next week, both of your cloud staff are clueless about the new features and how they were developed, and you are forced to make a decision!

The penny drops! Yes, everything in the cloud follows the Elasticity principle, so let's deploy the new features alongside the existing services, bring them up/down as needed and integrate them via RESTful endpoints so that they are loosely coupled with the existing services.

It doesn't sound like a bad decision; it looks like perfectly loosely coupled microservices. But then you say to your cloud staff: please integrate both by the end of next week and get them into production by the end of the sprint starting next week, otherwise our competitors will get there before we do.

Your new cloud staff panic, slowly recover, integrate everything by the end of next week and manage to get it into production. It is excellent news for you and your investors, so you are happy with their performance and move on.

Your promotion of the new features becomes massively popular and generates a massive amount of traffic to your site, putting your Elastic Architecture to the test. Unfortunately, your customers start getting errors and service denials. Initially you suspect some kind of DDoS attack on your site, but that is quickly ruled out.

Later, your team reports the outcome of their investigation: the original service and the new features scale disproportionately and there is no queuing between the two, so slow scaling in one component puts stress on the other, and eventually the entire site stops responding unless all sessions are abruptly closed!
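To make the team's finding concrete, here is a minimal sketch of two components that scale at different speeds, with and without a queue in between. It is a toy simulation, not the startup's actual system: the function name `simulate`, the traffic figures and the capacity figures are all assumptions made up for illustration.

```python
def simulate(arrivals, capacity, queue_limit):
    """Toy model of an upstream service feeding a slower-scaling downstream one.

    arrivals[i]  -- requests the upstream service forwards in tick i (hypothetical)
    capacity[i]  -- requests the downstream service can process in tick i (hypothetical)
    queue_limit  -- how many unprocessed requests may wait between ticks;
                    0 models direct calls with no queue in between
    Returns (served, rejected, still_waiting).
    """
    backlog = 0
    served = 0
    rejected = 0
    for incoming, cap in zip(arrivals, capacity):
        backlog += incoming
        done = min(backlog, cap)
        served += done
        backlog -= done
        if backlog > queue_limit:
            # Nowhere to park the excess: these requests fail as errors.
            rejected += backlog - queue_limit
            backlog = queue_limit
    return served, rejected, backlog


# A traffic spike hits while the downstream component is still scaling out.
arrivals = [10, 50, 50, 10, 0, 0]
capacity = [10, 10, 30, 50, 50, 50]

no_queue = simulate(arrivals, capacity, queue_limit=0)
with_queue = simulate(arrivals, capacity, queue_limit=100)
print("no queue:  ", no_queue)
print("with queue:", with_queue)
```

With these made-up numbers, the direct-call setup rejects half of the requests during the spike, while the queued setup serves all of them a little later, once the slower component has caught up. A queue does not make the slow component faster; it only buys it time to scale.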

You take a moment to pause and think about the Elasticity advertised by Cloud providers. All of their services should be Elastic in nature, so why are ours not?

Your staff correct you: the Cloud providers provide Elasticity for all of their services, but what about our services? You gave us a tough deadline to get into production, which left us no time to think about Elasticity for our overall service!

You are shocked to hear it, but not surprised, as you had also forgotten that the Elasticity of your services is the responsibility of you and your team, not of the Cloud provider.

On top of that, you are tasked with explaining the situation, and the actions to remediate it immediately, to your stakeholders and customers, who won't understand the term Elasticity in the context of your services.

Lessons Learnt

So in summary, you and your team have unknowingly made your technology stack, as well as your people and processes, brittle instead of Elastic!

Your services won't inherit Elasticity simply by being deployed into the Cloud.
Provide the right context to your staff and consultants before commencing any major work.

Clearly understand your contextual boundary before architecting, designing and building your solution, to make sure it follows Elasticity principles end to end as well as component by component.

Small Business

That wasn't a very good experience with your startup and, unfortunately, it wasn't successful either. The main return on the investment made by you and your investors is the lessons learnt from the overall startup initiative.

You decide to take a break from startups and join a small business as Cloud Engineering Lead for a team of ten Cloud professionals. The experience you gained at your startup in understanding your services end to end prepared you for the Cloud Engineering Lead role.

You start the role with great excitement, ready to architect and design your services following cloud-native principles and the well-architected framework.

You are given the requirement to create a campaign application. Considering the bursty nature of any campaign service, you architect the solution mostly using Serverless and spot instances.

You give your team a full walkthrough, so they understand the campaign application end to end and have the full context, and they deliver the completed application into production in three months.

Several promotions are launched via your campaign application. It is well architected for the cloud, handles thousands of responses at the same time and queues them for the next steps.

The business anticipated the popularity of the campaign and hired more business operations staff for the campaign period to process all responses and load the service activation requests into the main legacy on-premise system.

The legacy on-premise system has a monolithic architecture and, to reduce the load on the main database, throttling was set up by a system admin who left the company last month. Unfortunately, it was never documented or communicated.

Because of this, even though your campaign application and the business operations team can handle the burst of responses, everything gets queued by the legacy on-premise system. This frustrates your customers, and some even discard their original responses and go with your competitor, who ran a similar campaign and was able to handle the load.
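The effect of that undocumented throttle can be sketched with simple arithmetic: a chain of stages only moves as fast as its slowest stage. The stage names and all of the rates below are hypothetical figures chosen for illustration, not numbers from the actual campaign.

```python
import math


def end_to_end_rate(stage_rates):
    """The sustained throughput of a chain of stages is that of its slowest stage."""
    return min(stage_rates.values())


def bottleneck(stage_rates):
    """Name of the stage that limits the whole chain."""
    return min(stage_rates, key=stage_rates.get)


def hours_to_drain(backlog, rate_per_hour):
    """Whole hours needed to work through a backlog at a fixed rate."""
    return math.ceil(backlog / rate_per_hour)


# Hypothetical requests per hour each stage can sustain.
stages = {
    "campaign_app": 5000,     # serverless, scales with the burst
    "operations_team": 800,   # extra staff hired for the campaign
    "legacy_system": 50,      # throttled to protect the main database
}

print(bottleneck(stages))            # legacy_system
print(end_to_end_rate(stages))       # 50
print(hours_to_drain(10_000, 50))    # 200
```

However elastic the first two stages are, customers experience the rate of the last one. With these assumed figures, a burst of 10,000 responses takes 200 hours to activate, which is the kind of wait that sends customers to a competitor.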

You are now tasked with understanding the legacy on-premise application, making it scalable and changing the throttling setting, so that all customer responses are processed on time and, above all, no more customers are lost to your competitors.

Your team works day and night to understand the code, create a manual deployment process and deploy the fixes to the legacy on-premise application so that the throttling setting can be relaxed. But the whole process takes almost a month, and more than half of your customers go with your competitors as they don't want to wait that long.

On one side, your team delivered more than you expected and, on top of that, did more than they had to in stabilising the legacy on-premise system. On the other side, not only is the business unhappy with the outcome, but the story has also been made public and has damaged your company's reputation.

Lessons Learnt

Understand the technology landscape that forms your business end to end, not just the service that you offer.
Make sure that the agility the business expects is factored in before you move your workload to the cloud.

Though it wasn't a great experience, you feel that you and your team have learnt a lot and were subsequently able to handle a similar situation well in advance.

Large Enterprise

Since leaving the Cloud Engineering Lead role, you have joined a large enterprise and, via multiple promotions, finally made it to the CTO role, where you are in charge of the entire Technology and Operations of your enterprise.

You have been getting constant pressure from business owners, senior leadership and even the CEO about ever-increasing technology running costs in terms of hardware, licensing, people and management.

You also see the data centre lease coming up for renewal in a couple of years, and support for all of your on-premise hardware runs out at the same time. In terms of your TCO (total cost of ownership), your on-premise data centre and hardware, plus the maintenance cost of both, amount to 40% of your entire technology expenditure per year.

You think it's a perfect opportunity to migrate everything to the cloud: once in the cloud, you should be able to scale as needed because cloud services follow Elasticity principles. You also make sure you factor in all the learnings from your past experience elsewhere.

You ask your teams to map out the entire technology portfolio, group the servers and storage into applications and perform assessments to decide the treatment of each application.

You also ask each application team to make sure they look end to end when they plan the migration of their applications to the Cloud.

These were the two key lessons learnt from your previous engagements at the small business and the startup respectively, so you assume you have got it all right this time!

You also appoint a programme manager to oversee everything, who is in turn expected to appoint delivery leads to focus on each application domain.

Each team makes a quick start, deploys a non-production instance into the Cloud and gets it working standalone with local accounts.

Once the teams start to integrate with other applications, they all try to create network connectivity from their cloud network to the other applications' cloud networks, which has to be done by an already constrained network team.

There are almost a hundred different applications, with a cloud network for each application per environment. Some of them need to integrate with each other, and all of them must be reachable from the office private network in order to be used.

Though there are almost a hundred applications, all network connectivity has to be designed by one central networking team, who work closely with the single enterprise security team. Both teams become so busy gathering requirements and designing that they cannot connect even one cloud network for another three months!

Your programme manager decides to escalate the matter to you, as the whole programme is now delayed by six months without a single network connection established to a cloud environment.

At the same time, the security team, via the head of IT security, escalates another concern that they see consistently across all of the application workloads. No application team is integrating with the central directory server, as there is no network connectivity, and no one has set up federated single sign-on, as the identity and access management team is also busy gathering requirements and designing with all of the cloud workload teams.

When you hear both escalations you are speechless for a moment, as you realise that nothing is going to be migrated to the cloud on time. You were planning to complete the migration within two years so that you didn't have to renew the on-premise contracts, which are due in a couple of years' time.

Your technical background tells you that something is not right, but you can't quite put your finger on it. You also realise it's too low a level for you to get involved in, yet some strategy is missing altogether.

To investigate further, you ask your programme manager to bring in a consultant to produce a report on the critical issues that are delaying overall progress.

The consultation takes another three months. You have now spent almost a year with no workloads functional in the cloud, so you are desperately waiting for the report.

The report highlighted a few critical issues:

All of the workload teams were trying to create point-to-point connectivity back to the office network via the on-premise network.
All of them brought specific cross-application connectivity requirements, which could be dealt with as network filters rather than point-to-point networking.
There is no consistent strategy for single sign-on setup, which created more work for the identity and access management team, who were forced to come up with standard patterns that weren't planned.
Most of the workload teams don't have an infrastructure background, which created more confusion for the network and IT security teams when they received requirements from the workload teams.

You go through these findings again and again, still clueless as to what the next steps are. Then the last finding finally strikes you: are we missing a team here altogether?

You also realise that on-premise resources have limits and are not elastic, so we need to create reusable patterns for networking, single sign-on and anything else that requires on-premise resources.
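A rough count shows why a reusable networking pattern matters for a constrained central team. The sketch below compares the worst case of point-to-point links with a shared hub that each network attaches to once. The hub-and-spoke layout is only one possible reusable pattern and is an assumption for illustration, as is the figure of 100 application networks plus one office network in a single environment.

```python
def point_to_point_links(networks: int) -> int:
    """Upper bound: every network linked directly to every other one (full mesh)."""
    return networks * (networks - 1) // 2


def hub_and_spoke_links(networks: int) -> int:
    """Each network attaches once to a shared hub; rules on the hub decide who may talk."""
    return networks


# Assumption: 100 application cloud networks plus the office network, one environment.
networks = 100 + 1

print(point_to_point_links(networks))  # 5050
print(hub_and_spoke_links(networks))   # 101
```

Not every application needs to talk to every other one, so the real point-to-point number would be lower than the upper bound. The shape of the growth is what matters: one-off links grow roughly with the square of the number of networks, while attachments to a reusable pattern grow linearly, and each of those links is design work for the same central networking and security teams.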

Everyone is busy with their own workloads, so who is going to provide consistent networking from on-premise to cloud, and who is going to work with the identity and access management team on a consistent single sign-on pattern that can be used for most of the workloads?

That's the right conclusion, but at the wrong time: one year is already over and you have only one more year before the on-premise support contract expires.

Does it actually mean that most cloud migrations take more than two years, or even more than five years?

Hang on, if I had planned the work on networking, identity and access management and any other standard patterns upfront, would I have workloads active in the cloud by now?

This could be accurate, but what's the fundamental principle hidden behind all this?

Lessons Learnt

Elasticity is Contextual. If you introduce resources that are not elastic by nature, such as on-premise infrastructure, then you need to work with them and maximise their reusability; otherwise you will be heavily constrained by your on-premise infrastructure no matter what the Cloud offers you.
Elasticity should scale in the context of a large enterprise; otherwise neither the cloud nor anything else will provide Elasticity across the enterprise.

Executive Summary

Elasticity needs to be considered as a contextual property and needs to be assessed within each context before getting deeper into cloud work.

In fact, there isn't anything you need to do to make the cloud Elastic, but you will have to do a lot outside the cloud to make use of its Elastic aspects.

Elasticity should be considered end to end for each service.
Elasticity should exist across business processes and technology stacks.
Elasticity should be able to scale at enterprise level.

Disclaimer: This article was produced in my own capacity; no association should be assumed with the organisations that I am helping at present or have helped in the past.


Published by Bala
