October 11, 2021

Episode Four

This episode, the first part of a double chapter, focuses on the importance of architectural design decisions by iterating through three architectural outlines.

Hello there, and welcome to Episode Four of ProScala podcasts. This is the fourth tech episode, released on the 11th of October, 2021. We will go into detail on some tech topics targeting newbie, senior, and business audiences alike. I’m Csaba Kincses, we’ll start in a moment.

The day has come to talk about a broad topic that can be of huge interest to both tech and business audiences when dealing with a greenfield project. We will build on all the accumulated knowledge from previous episodes: we talked about the various productivity and technological gains of functional programming and Scala, went through patterns and how they can foster agile, and learned which OOP features make Scala a bit of a hybrid. All these pieces of knowledge will be funnelled into the main topic, which is architectural design at the beginning of a project, or in preparation for a transition.

Architectural design has the most important effect on how long-lived a piece of software can be. This episode is therefore really important for business audiences, like founders at the start of their venture making their first request to developers about a project to be implemented. It also helps developers who want to be skilled at reasoning about better software solutions for the business, while letting the business side improve at asking the right questions and choosing the right developers to make the first design decisions.

There’s a lot to learn about the importance of technical scalability, why Scala is the scalable language, and how it can be a good fit in complex and scalable system designs. We will use the show’s pet project, the trading robot, to illustrate how complex the issues we can meet at the very beginning are, which is the point when it is cheapest to make the right long-term decisions. Also, we will dive deep into various kinds of architectural designs, like monoliths and microservices and some possible intersections or transitions between them, while keeping an eye on practicality.

A new format shall be introduced: as this is the first part of a double episode, we will stop halfway through the possible architectural schemes for the trading robot and finish in the next episode.

We will diverge from Scala into general architectural design theory for a short while, then come back to Scala to see how these are connected. Namely, we will make a distinction between the simplest monolith and reactive systems, and learn how units of compute evolved in terms of infrastructural scalability and what comes with that.

Monolithic systems are really common. In your first encounter with programming, whether you are on the tech or the non-tech side, you naturally won’t think about anything other than a monolith, without even knowing the term. We write a program, deploy it, and that’s all; without an understanding of scalability issues, we won’t think any further.

To approach the distinction between a simple monolith and reactive systems, given that we know this distinction has something to do with how we deploy modifications to the system, I suggest examining how units of compute evolved over time. We know that the simplest way to deploy is to come up with one single deployment package, and most likely a system deployed this way will operate on one single computer.

If you never needed to dive deep into DevOps and have to recall your early memories of how this used to work, you probably remember a process like this: you thought about what load a server should be able to take and chose infrastructure that could surely handle this imagined peak load.

Sounds simple, right? The process of handling infrastructure does not seem to require too much effort: we have a piece of software, we deploy it in a single package to our server of choice with the required capacity, and once we need to serve a higher load, we choose a stronger machine and redeploy on that. This process goes on until we need to pick the strongest available machines, which in most cases hopefully can serve all the requests our application gets. Can this be the right choice? You may argue that hardware is cheap and workforce is expensive, so in most cases you could be right, and if the peak level of required compute does not diverge much from the average utilized compute, you may not even be that upset about infrastructure costs at all.

Still staying close to the simplest monolith and checking the relationship between infrastructure scaling innovation and system architecture, many people on both sides may be familiar with the concept of virtual machines, which offer some advantages compared to the previously described physical-server-based solutions. Offered as a service, this was one of the first steps toward infrastructure outsourcing and an on-demand model for infrastructure.

A virtual machine or virtual private server simply offers an isolated box of capacity. When interacting with it, it looks like an ordinary server, as the separation between those boxes is abstracted away from us; launching a virtual private server, you will go through the process of installing it, and it can be separately monitored. As said, it requires an installation, so launching such a box consumes some time.

Back to the process of examining architectures: we’re still at a simple monolith. Unified codebase, single-process deployment. What steps does this new tool push us toward? Sticking with it, we may see a varying load in the long run. I mentioned that launching a virtual private server takes some time, so this solution won’t handle rapidly changing loads, but considering that we can still change the throughput of our system many times a day, it can be a great help infrastructure cost-wise. It can also help if we see different loads at different periods of the day. So, no step ahead: we’re still examining the simplest monolith, but at least we saved some money without much effort on the development side, and we still do nothing more than create a single deployment package.

How can virtual machines advance us one step ahead? Let’s see something that is also really common; most of you must have met it, on the development side for sure, and on the business side too if you have a project management background. You can use virtual machines to separate the different tiers of the application that would otherwise probably be deployed together, like the frontend, the backend, and the database. This could also be achieved with multiple physical servers, but there is a bigger cost benefit here: it is easy to see that these layers may need to serve very different loads, so we can make better use of the available capacity. Such a separation can also provide safer operation, especially at peak loads.

We’re still close to the simplest kind of monolith, having introduced just a three-tier separation, though a backend-frontend distinction can force us to have two things to deploy and to run some version-specific SQL on the database; this kind of setup by its nature is still considered a monolith.

It could be a subject of debate how hardware evolution and software architecture evolution relate. I introduce this point because I want to take a step toward microservices, which actually mean that we can diverge, for multiple reasons, from keeping the codebase unified and using a single-process deployment, and in this picture there are both software-related and infrastructure-related arguments.

So a microservice is a unit of deployment. What follows from this is that the codebase is not unified, because we need independently compilable units, and since we deploy separate parts of the system independently and these parts could be physically isolated, communication has to be solved, which in this case is done by default over the network.

We have come to the point of picturing a distributed system, and the justification for it, or the arguments against it, can be approached from both software and infrastructure directions. To make it easier to understand, I will highlight this using some edge cases, but before that I want to make clear that there are complex reasons on both the software and the infrastructure side why one would choose to go distributed.

Let’s consider a pure infrastructural approach: we have an idea about the rough boundaries of specific parts of a system, and we are sure that each of these parts gets a highly varying, different load. Modeling that, we can save a lot on infrastructure, as each microservice gets its own box with custom capacity, and we can even do different kinds of scaling with these boxes, scaling up and down on demand. Sounds like a nightmare from the simple perspective of developing a monolith, doesn’t it?

Basically, focusing solely on this infrastructural gain would mean that we need to develop in a completely different way to get a minor cost saving. So it is easy to see that this gain alone won’t be adequate to move us away from the convenience of scaling by simply pushing stronger hardware under our system when needed, without having to focus on the tactics of keeping these autonomous units working together.

Now let’s check the flip side, how this picture would look focusing merely on software gains, and we’ll see something that would have seemed utopian before the existence of flexibly available on-demand hardware. What I mean is that a microservice is a deployment unit, meaning we split our software into pieces to gain flexibility. This can be treated as flexibility in terms of agile and continuous delivery: we deploy a version faster with a development process built on microservices than we do with monolithic development. That sounds like something that adds enough value to be worth the effort of developing in a completely different style. The appearance of virtual machines was the minimum requirement for making this technique easy to implement, and container technologies give us further flexibility by letting microservices do their job autonomously.

As mentioned, we have containers, and we can treat these as the next step in infrastructure scalability: contrary to the longer deployment time of a virtual machine that needs installing, a container is ready in seconds, providing a completely isolated space on a piece of hardware to host one of our microservices. Talking about the pure infrastructural approach, we’d probably want to combine the scaling possibilities per part when splitting the system, and this is where container orchestration comes in to do that for us.

Before we can wrap up the part that lets us distinguish a simple monolith from a reactive system, and then examine how these connect with Scala, we need to look at the theory that connects the edge cases, as we checked an infrastructure-wise view and a software-wise view, and that lets us see how these pieces of knowledge can coalesce into something that adds further value.

We need to check out the Reactive Manifesto and put it into the context of the bits we’ve already mentioned. The Reactive Manifesto promotes responsiveness, resilience, elasticity, and message-driven communication. Two of these traits can be easily interpreted in terms of what we already discussed: responsiveness and elasticity. These two are the closest to our resource-utilization-driven approach of walking through the evolutionary steps of infrastructure scaling, as responsiveness simply means we expect the system to reliably respond within a known maximum amount of time, and elasticity refers to support for scaling the infrastructure depending on the current workload.

The remaining traits, resilience and the message-driven nature, come from the fact that we expect these systems to be implemented using microservices. Resilience means the system stays responsive even if there is a failure, while a message-driven implementation has a resource effect: thanks to asynchronous messaging we can save on allocated resources, and we can design software whose parts are loosely coupled.

I want to pin down that a reactive system is a new breed of software. Even though it was necessary to break this down from an infrastructure evolution point of view, because I wanted to answer the question of why we didn’t intend to design such a system from the very beginning, this kind of software gives us further answers and solutions to problems we did not even dare to imagine solving before. So if you simply consider an infrastructure problem from the monolithic point of view, you could be right that demanding the new skills to implement reactive won’t be worth it; but once you discover that in the long run you won’t be able to satisfy a demand for no downtime and reliably fast response times, you can come to the conclusion that this won’t work with yesterday’s architecture.

Another important point is that with reactive and the public cloud, you would implement features you otherwise wouldn’t even imagine, as a reactive system is designed for scalability and reliability, sticking to being predictably fast and fault tolerant.

What can make the monolith versus reactive question more complicated than a no-brainer is that reactive brings its own kind of problems: do you have enough developers who understand how such software works as a whole? It’s a problem if the team lacks knowledge about what can go wrong with a contract between microservices or with message brokers, and they will also need a more complex understanding of DevOps and of the failures possible due to the distributed nature of the system, as we can have mission-critical network errors; this is where the need for the ability to implement resilience comes in.

You may ask, did the era of infrastructure-related architectural decisions end with the arrival of reactive systems? Unfortunately this is not the case yet; we have only examined two edge cases so far, and you can add your two cents by calling either of them oversimplified or overcomplicated for a specific system design need.

Let’s examine a few reasons why we wouldn’t immediately go reactive by creating hundreds of microservices. The first obstacle could be that we don’t see the clear boundaries of specific functionality at the planning phase of a greenfield project. Another could be that we simply want to keep some things bound together for whatever reason, for example because we don’t want to risk anything over a network failure. An obvious one can be cited from the previous reasoning: we may know at the inception of a piece of software that part of it will run under a steady load, so there is no need for more scaling complexity than replacing the virtual or physical server.

So I’m not saying we are obliged to go completely reactive from the start, but we do need to do everything to avoid a complete monolith-to-microservices migration. Conducting such a migration may require even more skill than being a microservices professional in the first place, so we can put away our complaints about the new roles that come with developing reactively, like the need for a DevOps professional or a microservices or cloud architect.

And as said, there are new needs in software, and instead of rediscovering them each time, you should be prepared on the business side for needing an architect, or at least a senior, who thinks about the software architecture as a whole, taking into consideration all the possibilities from monolith to microservices. This investment pays off if done at the start of a greenfield project, both money-wise and human-resource-wise: the bigger the mess created by not choosing the right architecture, the harder it is to find someone who can unravel that mess by migrating to a better solution, and this won’t be seamless in terms of keeping software development smooth if you are dealing with a system that changes continuously due to new requirements.

Now we can wrap up the previous parts to discover how Scala comes into the picture, and what our next steps are in providing a real-life example of what the lack of proper architectural design can cost us and what the advantages are of making all the early decisions with foresight.

Scala offers us at least three things: an actor model for messaging, Akka as a toolkit capable of letting us build microservices, and Lagom, which is a dedicated microservices framework.
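To make the message-driven side tangible, here is a minimal sketch of an Akka Typed actor; the Tick and GetLast messages and the price handling are illustrative assumptions for the trading robot, not code from the show’s project.

import akka.actor.typed.{ActorRef, ActorSystem, Behavior}
import akka.actor.typed.scaladsl.Behaviors

object TickProcessor {
  // The protocol: everything the actor understands arrives as an asynchronous message.
  sealed trait Command
  final case class Tick(symbol: String, price: BigDecimal) extends Command
  final case class GetLast(replyTo: ActorRef[Option[BigDecimal]]) extends Command

  // State is carried in the behavior itself and replaced immutably on each tick.
  def apply(last: Option[BigDecimal] = None): Behavior[Command] =
    Behaviors.receiveMessage {
      case Tick(_, price) =>
        apply(Some(price))
      case GetLast(replyTo) =>
        replyTo ! last
        Behaviors.same
    }
}

object ActorDemo extends App {
  val system = ActorSystem(TickProcessor(), "tick-processor")
  system ! TickProcessor.Tick("EURUSD", BigDecimal("1.1612"))
}

The point of the sketch is only the shape: loosely coupled parts that never share mutable state and only exchange messages, which is the same shape Akka Cluster and Lagom build on when the parts become separately deployed services.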

Considering the real-life example, the show’s pet project is highly suitable for inspecting all the possibilities, like partially or fully sticking to a monolith, and for seeing which features would drive us toward reactive or would not even be implementable without it.

What I will do is break down what you might think a trading robot does, what it might do if we get to make it feature-packed, and what to consider in architectural planning so we have foresight for nearly everything that could affect our project. In the end you should be able to see what a mess a project can become without a proper architectural plan, and how costly that can be on the development side, not to mention that messy projects may divert developers’ attention from adding direct business value to plumbing and fire extinguishing.

Our journey will follow a trajectory of three stages, each with related arguments that can prove it valid, to highlight that more than one solution is possible, but we should always take into account what each has to offer. For people leading the business side of a greenfield project, one valuable piece of advice is to learn to ask questions of developer colleagues, as initial architectural decisions can shape the whole lifetime of a piece of software.

As a first step, let’s examine a setup that can do its thing as a monolith. We already know a possible implementation from Episode Two, where we dived into how monadic stream processing can work, with trading operations decoupled using a concurrent solution so they don’t hold up the fast stream processing that keeps our data up to date.
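As a reminder of that shape, here is a hedged sketch in the spirit of Episode Two, using fs2 on cats-effect; the Tick and Signal types, the queue-based decoupling, and the toy decision rule are my assumptions, not the project’s actual code.

import cats.effect.{IO, IOApp}
import cats.effect.std.Queue
import fs2.Stream
import scala.concurrent.duration._

object RobotSketch extends IOApp.Simple {
  final case class Tick(symbol: String, price: BigDecimal)
  final case class Signal(symbol: String, side: String)

  // Stand-in for a market data feed; a real feed would come from a socket or API client.
  val ticks: Stream[IO, Tick] =
    Stream.awakeEvery[IO](10.millis).map(_ => Tick("EURUSD", BigDecimal("1.1612")))

  // Toy decision rule, purely for illustration.
  def decide(t: Tick): Option[Signal] =
    if (t.price > BigDecimal("1.16")) Some(Signal(t.symbol, "SELL")) else None

  val run: IO[Unit] =
    Queue.unbounded[IO, Signal].flatMap { orders =>
      // Fast path: decide and enqueue, never blocking the tick stream on trade execution.
      val processing = ticks.evalTap(t => decide(t).fold(IO.unit)(orders.offer))
      // Slow side effects (talking to the broker) run concurrently, fed from the queue.
      val execution  = Stream.fromQueueUnterminated(orders).evalMap(s => IO.println(s"executing $s"))
      processing.concurrently(execution).interruptAfter(1.second).compile.drain
    }
}

The design choice to note is that trade execution is pulled off the hot path through a queue, so the stream that keeps our data up to date never waits for the broker.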

What would be our first thoughts regarding the simplest expectations of a trading robot? A trading robot is special in that processing incoming data in real time is not a convenience feature but its very essence. What’s also clear is that the trading robot interacts with the markets, and it has to do that continuously, without interruption, for more than one reason; the first is that we will have open trades that need handling, so our biggest nightmare would be operational safety issues. Another issue is something I would call trading integrity.

We will examine this setup by looking at what solutions it can provide for these two main issues and what its limitations would be. Mind that these two issues are key to any implementation under any architectural plan. In the end, we will wrap up so you can practice thinking the way we would when deciding on a roadmap to implement a trading robot.

To understand the requirement and whether our solution can be suitable, we need to tackle the term trading integrity. If we want to implement logic that assumes trading decisions build on each other, especially if we have an allocation and portfolio handling strategy, it can matter that a pre-modeled result works the same way in production, and we may want the ability to replay data and see, for traceability, that something was fulfilled as expected. When we want to reprocess our logs and find a needle in the haystack, in this case a trade expected to happen at a specific point in time, we rely heavily on trading integrity being enforced and on the system being designed to provide that feature.

What could be the edge cases where trading integrity is hurt? Suppose we have an outage, and it spans a time when we would have opened a trade and even closed it. Or the case where we have an open trade and an outage stops us from closing it. We may introduce solutions that recover lost trades after an outage, but that can cause latency, and since a trading robot calls for strict handling of operational safety and data integrity, in this case trading data integrity, even if we have such an implementation it should be an extra and not something we rely on.

So to avoid distorted logs, we should have something built in, and we should see whether something like this fits into a monolith.

Turning to operational safety, to get a better grasp of fitting into a monolith, we should examine the nature of real-time processing and its connection to architectural decisions. In this case, we calculate with short, steady processing times: once we have an incoming piece of data, we expect it to be processed in, for example, at most 30 milliseconds. This is quite a short processing time, and it competes in speed with our ability to react when processing gets clogged, so to keep up the pace of providing almost real-time data, waiting for some solution to scale out won’t necessarily help us. As you can see, while talking about a monolith we are facing the very expectations that gave birth to reactive systems, namely elasticity in this case. What drives us toward the monolith here is that we don’t expect rapid spikes in either the amount of data to be processed or the speed of the algorithm that does the processing.
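As a trivial illustration of what such a budget means in code, here is a small hedged sketch that measures per-tick processing time against an assumed 30 millisecond target; the println alert is just a placeholder for whatever monitoring the real system would use.

// Wraps one tick's processing and reports if it blows the assumed 30 ms budget.
def withinBudget[A](budgetMillis: Long = 30)(work: => A): A = {
  val start   = System.nanoTime()
  val result  = work
  val elapsed = (System.nanoTime() - start) / 1000000
  if (elapsed > budgetMillis)
    println(s"tick took $elapsed ms, over the $budgetMillis ms budget") // placeholder alert
  result
}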

We are in a special case, as elasticity is not in play on a short time scale. Considering what real-time processing means, we want to guarantee the availability of processed data right away; if we had to fetch missed data, that would delay the processing of other incoming data, and that situation would shout for an elasticity solution. But talking about a trading robot, it is likely we wouldn’t even be willing to wait for a new container to be set up; we would expect to fetch missed data faster, as this is also something that can affect trading integrity adversely.

So regarding operational safety, we have discovered, while checking how to secure the running of a monolith, that the monolith solution might not be that bad at all, but it may require other kinds of practices to make it run as expected with the needed characteristics. The aforementioned cases of bigger or minor outages most likely require redundancy solutions.

As said earlier, the show’s pet project is a demonstration project that can be a good playground for experimenting, so I would come up with a redundancy solution that gives us the opportunity to tweak the low-level parts of the software, as there are cases when this can be necessary, especially when operation is in focus.

Sticking to our desire to guarantee real-time processing, we can examine the possible redundancy solutions and make an assumption about which one has the biggest chance of actually providing this for us. Suppose we deploy something well tested and supposedly stable as far as the application codebase goes; our biggest enemy would then be errors that are hard to prepare for in any way other than a redundancy solution. These include errors that do not originate from our own codebase but from an underlying library or the runtime, in this case the Java Virtual Machine, as Scala runs on that.

So our biggest pain point is the real-time feature, or more precisely anything that can keep us from real-time processing. What we can do here is examine the likelihood of each type of outage and the time potentially lost to it, and rank the possible solutions with the nature of these errors in mind.

As we’re racing with time, our foes are network lag, runtime startup time, state recovery time, and stream adjustment time. I have a preferred solution, but I’ll break down the whys. What I see is that the Java Virtual Machine can break, and we need a backup that continues the work in case of a JVM outage caused by a software error, and this backup should pick up processing right away. As said, it’s real-time processing, so we’re talking about at most a few tens of milliseconds per tick; if we want to minimize the chance of lost data, network lag and runtime startup time are significant, and depending on the solution we would also be happy to avoid significant state recovery or stream adjustment time. What I see here is that if we have a twin Java Virtual Machine on the same computer, already started with a copy of the app running on it in some sleep mode, then we have a solution that saves us from all of these time consumers. It runs on the same computer, so no network lag; it is started in parallel with the other JVM, so no runtime startup time; and if we have a way to share state between the JVMs instantaneously, then no state recovery time either. If that is some shared-memory solution, we have even handled the rare case of losing data because a database write was interrupted by a transaction not being committed in time. We may only need a stream adjustment when the spare JVM picks up processing, but that should be a minor hiccup even in these fast processing terms. My point is that for software errors, we had better have the spare JVM on the same machine, as long as this solution offers speedy recovery. Having a backup JVM on a separate machine only to handle software error outages looks problematic: once there is network lag, the twin JVM has to make sure the other JVM is really not running, which requires waiting for some unsuccessful pings. That sounds like more than a minor hiccup, because it could take hundreds of milliseconds at least.
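To show what sharing state between the JVMs instantaneously could look like in its simplest form, here is a hedged sketch using a memory-mapped file that both the primary and the standby JVM open; the file path, the 16-byte layout, and the heartbeat field are illustrative assumptions, and a production version would also need care around memory ordering and locking.

import java.nio.channels.FileChannel
import java.nio.file.{Paths, StandardOpenOption}

object SharedState {
  // Both JVMs map the same file; what the primary writes becomes visible to the standby.
  private val channel = FileChannel.open(
    Paths.get("/tmp/robot-state.bin"),
    StandardOpenOption.CREATE, StandardOpenOption.READ, StandardOpenOption.WRITE)

  // 16 bytes: last processed price as a double, plus a heartbeat timestamp in millis.
  private val buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 16)

  def publish(lastPrice: Double): Unit = {
    buf.putDouble(0, lastPrice)
    buf.putLong(8, System.currentTimeMillis()) // the standby watches this for liveness
  }

  def snapshot(): (Double, Long) = (buf.getDouble(0), buf.getLong(8))
}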

What our same-machine twin JVM solution does not protect us from is a network outage, so considering data and trading integrity as valuable assets to be protected, we could have another JVM pair running on a machine on another network, assuming it is nearly impossible for the two networks to fail at the same time. In such a setup, the backup JVM pair should only do processing, not trade execution, until the point where the original JVM pair gets cut off from the network. This way, pinging the original JVM would be the source of a hiccup, not a delay in data processing, which could cause a trade to be executed late; I guess we could only avoid that if we figured out a way to be notified instantly, right at the arrival of an incoming tick, that the other JVM pair is off the network.
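A hedged sketch of how the standby pair might gate trade execution on missed heartbeats from the primary; the host, port, timeout, and threshold values are made-up illustrations, not measured recommendations.

import java.net.{InetSocketAddress, Socket}
import scala.annotation.tailrec
import scala.util.Try

object FailoverMonitor {
  // The trading loop checks this flag before sending orders; data processing runs regardless.
  @volatile var executionEnabled = false

  private def primaryAlive(): Boolean = Try {
    val socket = new Socket()
    try { socket.connect(new InetSocketAddress("primary.example.internal", 9000), 200); true }
    finally socket.close()
  }.getOrElse(false)

  @tailrec def watch(missed: Int = 0): Unit = {
    val nextMissed = if (primaryAlive()) 0 else missed + 1
    executionEnabled = nextMissed >= 3 // take over execution only after repeated failures
    Thread.sleep(500)
    watch(nextMissed)
  }
}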

The conclusion would be that we can go with at least a twin JVM solution, or even double our defense by having another twin at another network location. A side benefit of a twin JVM is that it can help us with versioning, as we can deploy a new version of our software without interrupting processing.

As was said, this monolithic setup is just one of the three architectural designs we will be talking about, but now the point has come to check the arguments for and against, in terms of whether a monolithic or a distributed approach is better, summarizing the previous details.

The first pro argument would be that if we weigh this stream processing approach against the message-driven one that reactive systems promote, a stream-processing-based solution is much more lightweight in terms of network load and can be more suitable for consuming real-time tick data, and as we do this in a centralized fashion, without generating even more messages to communicate information between microservices, we do not bring in further network lag.

The second pro argument is that in this setup we calculate with a relatively fixed processing time and computational load, so it seems like a natural solution not to scale out, generating messages with network lag and waiting for containers to be initialized, which would be intolerable in terms of real-time processing, but simply to put a stronger server under the software when we know we have increased computational needs. Just like the old days, but in this case it seems reasonable.

The third pro point would be that we can easily solve fault tolerance with our twin JVM idea, and therefore we do not have to manage various microservice instances with container orchestration, which would require additional expertise from our staff and would still leave us worrying about network lag between those instances. With a performant implementation, we could even have a more effective solution for a possible JVM crash than with a reactive system.

The fourth pro point would be that we can simply duplicate the monolith to process multiple instruments, using some containerized solution to manage this; in a simple setup that won’t require overly complex container orchestration skills, and we’re still working with one unit of deployment, we just copy it.

We had four pro arguments; now we can deal with the arguments against, to get a clear picture of our architectural possibilities.

The first argument against is the limited extendability this setup provides. Say we bring in some secondary functionality, like resource-hungry but not always needed calculations that are not part of the core; then we bring in a varying resource need, which works against the reason we chose a monolith and also against guaranteed operational safety, as in such a case we add to the load on the core and risk slower-than-real-time processing. And though a twin JVM solution could also be used for version migrations, keeping rapidly changing secondary functionality in the same deployment unit as the core is contrary to what is practical, and if multiple developers have to commit rapidly to the secondary functionality and redeploy it, that could also endanger operation, even with the twin JVM being toggled on each commit.

The second argument against is, yet again, extendability, plus the problem that agile work with microservices assumes we can separate parts along domain boundaries, and this separation is not enforced if we keep all the features in a monolith.

Before closing this chapter, I’ll introduce a new block as food for thought and as an incentive to boost community activity. More concretely, I’ll interview you guys about your specific experiences and opinions on the details introduced recently, so I have prepared seven questions:

Did you guys have the chance to be part of a decision between a monolith and a microservices-driven architecture?

If so, and the choice was the monolith, what were the arguments for the monolith and against microservices?

Did you like the outcome of a decision choosing the monolith?

If you have chosen microservices, what were the toughest challenges you needed to deal with as a team and what were the new skills you needed to learn on the go to tackle the newly emerging development issues?

Whether you and your team are dealing with a monolithic or a microservices-based system, do you think the DevOps flexibility provided by a microservices-like environment would be needed to better manage a project in terms of fast delivery and operational safety?

On the business side, do you take into account the likely cost of a possible monolith-to-microservices migration when choosing between these architectures?

On both sides, did you know about the various ways Scala supports development in terms of the Reactive Manifesto?

If you have instant answers to any of these questions, do not hesitate to connect with me via LinkedIn, and please also send me your feedback, including questions and ideas; I’ll be happy to use them in similar summary blocks.

That’s all for today. We will continue next time with two more great architectural designs, which will give you thorough insight and the skills to decide whether to go completely monolith, completely reactive, or somewhere in between. We will also summarize what we have learnt in this double episode and give clues, considering both business and tech benefits, as to why making the right decision at the start of a project is crucial, and how this knowledge can help us direct a transition.

I’ll be back with the second part of this double episode on the 6th of December, 2021. Till then, I expect your comments, as this is a really intriguing topic with more than one possible solution for a specific task: how would you do it?

Mind that this podcast has a LinkedIn group where you can meet great fellow tech people to discuss and stay up to date on the happenings related to this show.

I was Csaba Kincses, be back to you at the next episode; thanks for listening!