Lessons learned from navigating a large scale R&D project

Reshef Sharvit
Apr 2, 2024

In this post I would like to share my experience driving a large scale R&D project, what it took and what helped us get there quicker.
This is also an opportunity to introduce goskeleton, a CLI tool I wrote in-house that facilitated the process.

Getting Started

Last July I joined a cybersecurity startup that is not a typical one:
The company was founded, later acquired by a large local corporation, and some time afterwards spun off as a startup again.
These frequent changes in direction made it harder to find and embrace a clear identity.

When I joined, I was told that the main pain point was the difficulty of releasing and delivering new features, mainly due to:
1. Subpar developer experience: CI/CD was very limited, flaky and slow. Some key components didn’t have any.
2. Lack of clear architecture: a mix of monolith and microservices, tightly coupled to specific implementations (databases, frameworks, etc).
3. Lack of monitoring and observability: it was difficult to assess the state of the system at any given time.

As I got started in my new role, the issues quickly surfaced:
Tests and deployments were flaky and took far too long to execute. It was also difficult to assess the status of the system because we barely had any monitoring.
There were also dozens of manual operations needed to get tasks done.
I was not used to being so unproductive over such a long stretch of time, and it started bothering me a lot.

Tech Stack:
Java (Spring Boot)
AWS
ElasticSearch
EKS
Terraform

A couple of months in, after trying to catch up and work things out, I concluded that the hole was just too deep: we would have to rewrite and redesign large parts of the application if we intended to get back on track and operate as a functioning R&D unit.

Turning Point

On 7.10, a war was declared on our country. Several engineers were called to reserves, forcing the business to put some initiatives and features on hold.
Instead, my director gave us a green light to start refactoring the systems we thought were required.

I recognized this as a turning point and decided it would be best to plant the seeds for a major move rather than settle for minor changes: a full SDLC alternative, from code to deployment to infrastructure.

I wrote a detailed plan that describes the steps and phases, focused on:

1. Go, a simple, general-purpose, compiled, statically typed language:
We needed a language with a gentle learning curve, especially for those coming from a Java background. For lightweight, serverless and cloud-native workloads, we felt Go was superior to Java.

2. Domain Driven Design and Clean Architecture:
The lack of structure and separation between the layers of our application resulted in bloated spaghetti code.
Maintaining the services was difficult and risky. There were dozens of services nobody wanted to touch, and it was understandable: replacing the database behind a service was mission impossible.
I wrote an article on Golang and Clean Architecture that you may find useful.
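
To make the layering concrete, here is a minimal sketch of the idea, not code from our services. The names (Asset, AssetRepository, AssetService, memoryRepo) are hypothetical. The business logic depends only on an interface, so replacing the database means writing one new adapter instead of touching the service.

```go
package main

import (
	"errors"
	"fmt"
)

// Asset is a hypothetical domain entity used only for illustration.
type Asset struct {
	ID   string
	Name string
}

// ErrNotFound is returned when an asset does not exist in the store.
var ErrNotFound = errors.New("asset not found")

// AssetRepository is the port the domain layer depends on.
// Any database adapter that satisfies it can be plugged in.
type AssetRepository interface {
	Save(a Asset) error
	FindByID(id string) (Asset, error)
}

// AssetService holds business logic and knows nothing about storage details.
type AssetService struct {
	repo AssetRepository
}

func NewAssetService(repo AssetRepository) *AssetService {
	return &AssetService{repo: repo}
}

// Register validates the input and stores the asset through the port.
func (s *AssetService) Register(id, name string) error {
	if id == "" || name == "" {
		return errors.New("id and name are required")
	}
	return s.repo.Save(Asset{ID: id, Name: name})
}

// Lookup fetches an asset by ID through the port.
func (s *AssetService) Lookup(id string) (Asset, error) {
	return s.repo.FindByID(id)
}

// memoryRepo is one possible adapter (assumed here for the example).
// A SQL-backed adapter would implement the same two methods.
type memoryRepo struct {
	items map[string]Asset
}

func newMemoryRepo() *memoryRepo {
	return &memoryRepo{items: make(map[string]Asset)}
}

func (r *memoryRepo) Save(a Asset) error {
	r.items[a.ID] = a
	return nil
}

func (r *memoryRepo) FindByID(id string) (Asset, error) {
	a, ok := r.items[id]
	if !ok {
		return Asset{}, ErrNotFound
	}
	return a, nil
}

func main() {
	svc := NewAssetService(newMemoryRepo())
	if err := svc.Register("a-1", "laptop"); err != nil {
		fmt.Println("register failed:", err)
		return
	}
	a, err := svc.Lookup("a-1")
	fmt.Println(a.Name, err)
}
```

Swapping storage then comes down to passing a different AssetRepository implementation to NewAssetService.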

3. Enforced schemas:
For APIs: we had no schema or documentation for our APIs.
Understanding and maintaining the request/response structures was difficult. Making changes to them was even harder.
For databases: the main database was a schema-less NoSQL database.
The way our data was structured, stored and queried had nothing to do with NoSQL. Queries were inefficient, aggregations were performed in the backend, and there were other horrific anti-patterns that screamed “WHAT THE HELL ARE YOU DOING?? PLEASE STOP IT AND USE SQL”.

OpenAPI specification, from which we generate the server, client and structs.

4. No OPs:
If your API is invoked 5 times a day, why would you need a running instance? (be it container or EC2)
If your cloud vendor saves you valuable maintenance and operations time for just a small premium, why think twice?
And last but not least: vendor lock-in is actually a blessing, especially in an organization that needs to get things done.
Our new infrastructure was focused on serverless, managed services that keep our focus on the business rather than on maintaining clusters.
Another benefit of serverless, in this case AWS Lambda, is that you get decent monitoring and metrics out of the box through CloudWatch, which we connected to Opsgenie.

5. Code Generation:
Code generation shortens the path to writing and releasing well-tested, high-quality, predictable code, all while maintaining sky-high developer velocity.
I incorporated code generation into our API generation, POJO/struct generation, deployment and configuration: basically everywhere possible.
A great example is the OpenAPI 3 generator, which generates the server, client, data structures and even input validations. So much boilerplate, and code overall, is handled automagically for us.
For server-to-server communication we use Twirp/gRPC, where the server, client, schema, validations, etc. come out of the box without the need for third parties.

generated code from OpenAPI 3 specification
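
To show the mechanics behind this kind of generation, here is a small sketch that uses only Go's standard text/template package. It is not the OpenAPI generator we use. The Schema and Field types are a hypothetical, heavily simplified stand-in for an OpenAPI component schema, and Generate is an illustrative name.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// Field and Schema are a hypothetical, simplified stand-in for an
// OpenAPI component schema; real generators parse the full spec.
type Field struct {
	Name string // Go field name
	Type string // Go type
	JSON string // JSON property name
}

type Schema struct {
	Name   string
	Fields []Field
}

// structTmpl renders one Go struct per schema. The "{{-" markers trim
// the preceding newline so each field lands on its own line.
const structTmpl = `type {{.Name}} struct {
{{- range .Fields}}
	{{.Name}} {{.Type}} ` + "`json:\"{{.JSON}}\"`" + `
{{- end}}
}
`

var tmpl = template.Must(template.New("struct").Parse(structTmpl))

// Generate returns Go source for the struct described by s.
func Generate(s Schema) (string, error) {
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, s); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	src, err := Generate(Schema{
		Name: "User",
		Fields: []Field{
			{Name: "ID", Type: "string", JSON: "id"},
			{Name: "Age", Type: "int", JSON: "age"},
		},
	})
	if err != nil {
		fmt.Println("generation failed:", err)
		return
	}
	fmt.Print(src)
}
```

The appeal is that the schema stays the single source of truth: change the schema, regenerate, and the compiler points at every caller that needs updating.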

6. Infrastructure (and IaC) is your own responsibility
Terraform, Pulumi, CloudFormation, etc. As a DevOps/SRE veteran, I couldn’t care less which one we picked.
I chose CloudFormation because it’s a more stand-alone IaC tool and its learning curve isn’t as steep as its competitors’.
We made sure to automate as much as possible through code generation, service discovery and other techniques, so eventually the IaC became part of the deployment and the developer doesn’t even have to change anything.

The main motivation behind the above was to create a simple, do-it-yourself, highly automated, No OPs framework with clear boundaries.

CloudFormation template that’s generated based on user needs.

goskeleton

Credit: Eli Shalnev the totah.

Getting started with this ambitious initiative wasn’t easy.
Getting the engineers on board with the concepts above was easy compared to applying them in practice. To shorten the learning curve and the time spent writing boilerplate code, I wrote goskeleton, a CLI tool that creates a skeleton for a Golang microservice that implements the above principles and methodologies.

goskeleton does not reinvent the wheel or do anything out of the ordinary; it’s just a tool that helps us reach our goals faster.

With a click of a button I was able to generate a new service, discuss its structure with fellow engineers and then deploy and invoke it.
The quick and robust end-to-end flow was the killer feature that got engineers excited.

Making it happen

I would like to share my experience, 6 months and 15 microservices deep, from several perspectives:

  1. As an engineer: If you build it, they will come.
    Speaking about something won’t make it happen, and actions speak louder than words.
    Being proactive is key to making a change.
  2. As a technical leader: Make things better.
    The alternative solution I drove had to be substantially better, to justify the time and effort of adopting and migrating to the new technologies.
    I held several meetings to discuss our system’s shortcomings and present alternative solutions to problems. This played a major role in planting seeds of doubt among engineers who were used to doing things a certain way; most of them didn’t know the alternatives or how much better things could be.
    Some feared the change, others were initially reluctant to adopt anything new. How I handled these fears is explained shortly.
  3. As a friend: Build relationships.
    Our team shrank dramatically due to reserve duty and maternity leaves, leaving me and 2 other engineers.
    On the bright side, communication and coordination became much easier. We were able to build great relationships and eventually friendships, because we were basically working together all day long.
    These 2 engineers (Ori and Efrat) were the first ambassadors of everything I was promoting. They gave me amazing and, more importantly, honest feedback that helped me tune my work.
  4. As a colleague: Let colleagues know you’re here for them and available for whatever they need.
    Moving forward, I wanted fellow engineers from other teams to adopt the new software/system infrastructure. In this case, it was mostly about trust.
    My proposal to fellow engineers was simple:
    Use the new solution to dramatically boost your performance, and I will help you on your way there.
    I devoted a substantial amount of my time to presenting, convincing and assisting others in their journey, at the expense of my own time and performance.
    It was difficult at first, but over time my teammates were able to share the load with me, and fellow engineers from other teams got hooked.

Throughout this journey I learned that relationships and communication are vitally important, and sometimes more important than the technical solutions you are trying to provide.
You could write the best and most robust solution for a problem, but if you can’t get buy-in from your colleagues to use it and, more importantly, to provide feedback, it won’t matter.
