Case Study

Optimization
for Veeqo

Our DevOps team optimized the infrastructure for Veeqo, an inventory
and shipping e-commerce platform, and made it highly efficient, smooth,
and cost-effective for our partners.

Veeqo is an inventory and shipping platform for e-commerce. It helps businesses to manage sales across multiple channels, ship items via multiple carriers, process refunds, manage B2B orders, forecast inventory—the list of features is almost endless. The platform directly integrates with the world’s most popular retail applications, including e-commerce platforms, marketplaces, shipping carriers, POS systems, and more.

Veeqo is a single platform that gives users complete control of their entire inventory. It enables businesses to quickly bulk ship orders from any sales channel, automate repetitive shipping tasks, and track every delivery in one place.

Hundreds of retailers all over the world use Veeqo to power their inventory and shipping.

FOUNDED: 2013, by an experienced ecommerce retailer

OVER 1.5bn inventory updates processed in Veeqo every year

OVER 31m items picked, packed, and shipped through Veeqo annually

Veeqo high-level overview

The service has multiple subsystems integrated into one platform

Dashboard

The dashboard is used to visualize reports

Subsystems

Inventory management, warehouse, reporting


Challenges and solutions

In making Veeqo the reliable and efficient system it is today, our teams faced serious technical challenges. Addressing them required profound expertise, creative solutions, and intensive work. Particularly important were the contributions of our DevOps engineers.

The role of DevOps specialists is to bring order and predictability to the development process: remove bottlenecks, simplify delivery, and automate as many processes as possible. The goal is to turn the development of features into a smoothly running conveyor belt.

When we joined the Veeqo project, our DevOps team started sorting out existing blockages in the development process, prioritizing those that affected Veeqo’s users and the client’s business the most.

Phase 1: Fixing database outages

Challenge

Regular outages

Without any detectable load spikes, the database would often reach maximum processor performance and deny service as the queries piled up in the query queue. To get the system up and running again, workers had to be manually restarted every time.

Effect on users

Degraded user services

Users had limited access to services: they couldn't view contents, place orders, or manage their settings.

“We needed better monitoring and metrics collection to detect the true reason for the outage. We don’t sweep these things under the rug. We go for root causes.”

Andrew Sapozhnikov, Head of DevOps at Mad Devs

Solution

  • Collecting host metrics (CloudWatch and okmeter showed consequences, not causes)
  • Analyzing the true RAM use and identifying 1,200-1,500 worker connections taking up about 40 GB of RAM
  • Adding PgBouncers to reduce the load on the database
  • Stabilizing connection pooling by applying an ELB, distributing traffic more evenly and decreasing the number of SPOFs

As a result, we spent $180 on launching PgBouncers on two c4.large instances behind an NLB and saved about ten times that amount in the customer's monthly costs by:

  • Freeing up about 40 GB of RAM
  • Postponing the need to upgrade the RDS instance by about 6 months
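A minimal pgbouncer.ini along these lines (hosts and pool sizes are illustrative assumptions, not Veeqo's actual settings) shows how transaction pooling collapses thousands of worker connections into a small server-side pool:

```ini
; Illustrative PgBouncer configuration -- the host, database name, and
; pool sizes are assumptions, not the actual Veeqo settings.
[databases]
veeqo = host=veeqo-prod.example.rds.amazonaws.com port=5432 dbname=veeqo

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; Transaction pooling: a server connection is returned to the pool as soon
; as a transaction ends, so ~1,500 client connections can share far fewer
; server connections (and far less RAM on the database host).
pool_mode = transaction
max_client_conn = 2000
default_pool_size = 50
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
```

With two such instances behind a load balancer, each worker connects to the balancer's endpoint on port 6432 instead of directly to PostgreSQL, so the database only ever sees the small pooled connection count.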

PgBouncers Diagram.

Result

  • ReadIOPS: halved, for better instance performance
  • CPU load: reduced thanks to the extra RAM
  • TPS: increased by 50%

Phase 2: Continuous integration (CI)

Challenge

Development processes lacked consistency. In particular, the project had a manually configured CI system and pipelines; different development, test, and production environments and runtimes; a non-reproducible development environment; and inconsistent test builds.

Solution

Dockerization is the most viable way today to set up continuous delivery. As the application was then partially hosted on Heroku, we created a unified runtime for developers and CI by dockerizing the main app using a Heroku Stack image and Docker Compose. This made CI reliable, simplified deployment, and sped up bootstrapping of the development environment.
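A stripped-down docker-compose.yml in this spirit (image tags, commands, and service names are illustrative assumptions) pins the app to a Heroku stack image so developers and CI share one runtime:

```yaml
# Illustrative sketch -- image tags, commands, and environment variables
# are assumptions, not the actual Veeqo configuration.
version: "3.8"
services:
  app:
    # The same Heroku stack image is used locally and in CI, so builds
    # behave the same everywhere.
    image: heroku/heroku:16
    working_dir: /app
    volumes:
      - .:/app
    command: bundle exec rails server -b 0.0.0.0
    environment:
      DATABASE_URL: postgres://postgres:postgres@db:5432/app_dev
    depends_on:
      - db
  db:
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: postgres
```

Because the whole environment is declared in one file, a new developer can bootstrap a working setup with a single `docker compose up` instead of following a manual install guide.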

The developers didn’t trust their CI system. CI wasn’t helping them; it was hindering their work instead. It’s just bad DevOps.

Andrew Sapozhnikov, Head of DevOps at Mad Devs

What we achieved

  • Reproducible CI results
  • Unified runtime
  • Standardized development, test, and production environments

Moving Jenkins

Jenkins was operated manually on a separate EC2 instance, which made managing it challenging, especially in emergencies. We moved the Jenkins master to ECS by remaking its provisioning, deployment, and updating (we would later move it to Kubernetes, but in 2017, ECS was the only good option for us). We reduced build time and cost by allocating a small part of the Jenkins agents' compute resources to reserved instances and moving 90% of the load to spot instances.

We further renovated CI by:

  • Creating Jenkinsfiles for all repositories, thus making pipelines reproducible
  • Creating backups
  • Running an access audit and configuring SSO for more secure access
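A declarative Jenkinsfile along these lines (agent labels, stage names, and commands are illustrative assumptions) keeps each repository's pipeline in version control, so it can be reviewed, reproduced, and rebuilt at any time:

```groovy
// Illustrative declarative pipeline -- the agent label and shell commands
// are assumptions, not the actual Veeqo Jenkinsfile.
pipeline {
    // Run on agents labeled for spot instances to keep build costs low.
    agent { label 'spot-agents' }
    stages {
        stage('Build') {
            steps {
                // Build the same Docker image used in dev and production.
                sh 'docker compose build app'
            }
        }
        stage('Test') {
            steps {
                sh 'docker compose run --rm app bundle exec rspec'
            }
        }
    }
    post {
        always {
            // Tear down containers regardless of build outcome.
            sh 'docker compose down --remove-orphans'
        }
    }
}
```

Since the pipeline lives next to the code it builds, changing a build step is an ordinary reviewed commit rather than a manual tweak in the Jenkins UI.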
Everything down to the last comma is now written in the form of code. Even if we took Jenkins down completely and had to build it from scratch again, it’d take us no more than 10 minutes.

Maxim Glotov, DevOps Engineer at Mad Devs

Result

We call it a win-win situation between the development team and the customer’s business.

  • Significantly reduced test time
  • Decreased cost of CI maintenance

Phase 3: Elasticsearch

Elasticsearch is crucial in the way users experience the Veeqo platform: the dashboard and the entire interface rely on Elasticsearch. Even if everything else functions flawlessly, delays in the search engine alone cause problems for user experience.

Challenge

Elasticsearch couldn't cope with the load due to its outdated version and non-optimized configuration.

Effect on users

The users were often unable to access search results. Elasticsearch would expose them to delays of up to 30 seconds.

Solution

There are two main ways to improve performance: increase compute resources, or optimize the use of existing resources.

DevOps specialists constantly calculate and evaluate the cost factors of different solutions. We applied both approaches: we started by enhancing the cluster and later optimized indexing to make search as convenient for users as possible.

New cluster

We implemented a new Elasticsearch cluster as a self-hosted solution on EC2 instances.

Three reasons:

  • Control over the security of our Elasticsearch
  • Immediate response in case of incidents
  • Independent monitoring and investigation of Elasticsearch performance

Indexing and performance

The indices featured too much unnecessary information and non-optimized mapping.

What we did:

  • Refactored them to make them concise
  • Used the replication and sharding mechanisms to distribute the indices among five nodes
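An index settings and mapping sketch in this spirit (field names and limits are illustrative assumptions; the five-node figure comes from the text above) shows how sharding and replication spread an index across nodes while explicit mappings keep it lean:

```json
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1,
    "index.mapping.total_fields.limit": 200
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "order_id":   { "type": "keyword" },
      "status":     { "type": "keyword" },
      "created_at": { "type": "date" },
      "customer":   { "type": "text" }
    }
  }
}
```

Five primary shards let the index spread across the five nodes, and one replica per shard provides redundancy; `"dynamic": "strict"` rejects unmapped fields, which is one way to stop indices from silently accumulating unnecessary information.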

Result

  • Search time went down to under 5 seconds, and now to ~250-300 ms
  • We received massive positive feedback as Veeqo users were contacting the customer
  • Elasticsearch became more efficient and reliable without costing the customer more

Now, Elasticsearch is back to three nodes; in fact, two are enough for normal operation, and the third is there to ensure fail-safety.

Phase 4: Cutting the costs

Cutting the costs for the customer is not about making compromises but about achieving efficiency. The mind of a DevOps engineer is like an hourglass: it flips constantly between improving performance and cutting costs, always seeking the balance: solutions that are both effective and cheap.

Once we improved Veeqo's performance and user experience, we started working on optimizing infrastructure costs. In total, the infrastructure used to cost the customer about $20,000 per month. Within one summer, as we prepared the new solutions and migrated to them, we brought the figure down to about $13,000.

Importantly, we made the system more secure as we started using RabbitMQ, Memcached, Elasticsearch, and other services within our own network. The monitoring system that we kept perfecting as we went on with cutting the costs confirmed the effectiveness of cost optimization.

Costs can't be cut overnight. Not without damaging user experience, service uptime, and system survivability, anyway. We initially told Veeqo how we would decrease monthly infrastructure costs by about $7,000, but it took months to make smooth transitions. Our Veeqo partners trusted our professionalism, and it all paid off. We're proud of how we optimized the costs along with improving the system's performance.

Andrew Sapozhnikov, Head of DevOps at Mad Devs

Phase 5. Moving to Kubernetes

Kubernetes and the team were ready for each other. Kubernetes had grown stable, was well-integrated with all the major AWS services, and was well-equipped for high production loads. The team, in turn, had stabilized the development processes and needed better automation. Kubernetes provided a unified platform for launching apps and orchestrating Docker containers, and we wanted such a platform to safeguard Veeqo's future growth. Having one platform to handle everything meant being independent of cloud providers and able to launch instances in any cloud. We were going more cloud agnostic: Kubernetes allows moving workloads seamlessly.

Use of resources

There are four main types of resources: CPU, RAM, disk, and network. The way a service uses them depends on its type. Our objective was to arrange the services so that they don't overlap in their use of resources and, importantly, so that not too many resources sit idle.

There’s nothing more expensive than idle resources. At the same time, you don’t want to load the nodes to their maximum capacity because a) it’ll lead to performance degradation and b) you do need some idle resource to handle load spikes. Kubernetes finds balance.

Andrew Sapozhnikov, Head of DevOps at Mad Devs

The balance is found based on thorough continuous calculations. Kubernetes shuffles services from node to node so that resources are not overused and creates optimal distribution that a human admin could never possibly achieve manually.
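A deployment fragment along these lines (the image, replica count, and resource values are illustrative assumptions) is how that packing is expressed in practice: the scheduler bin-packs pods onto nodes by their requests, while limits cap spikes:

```yaml
# Illustrative sketch -- the image, replica count, and resource values
# are assumptions, not Veeqo's actual manifests.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:1.0
          resources:
            # The scheduler places pods based on their requests,
            # packing nodes without overcommitting them.
            requests:
              cpu: "250m"
              memory: "512Mi"
            # Limits leave headroom for load spikes without letting
            # one pod starve its neighbors.
            limits:
              cpu: "500m"
              memory: "1Gi"
```

Declaring requests and limits per container is what lets Kubernetes shuffle services between nodes and keep utilization high without loading any node to its maximum.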

Kubernetes and containerization

Kubernetes specializes in resource management and orchestration, but it needs a containerization service alongside it. We used Docker to standardize the runtime for applications. The two worked perfectly together.

Kubernetes provides an environment where Ops and Devs can speak the same language: the language of YAML configurations and Kubernetes objects. We get system resources on one end, containers on the other, and in the middle, Kubernetes works its magic.


Phase 6: Infrastructure as Code (IaC)

IaC is a way to manage large structures where manual operation is highly ineffective or simply impossible. To go fully IaC, our teams made fundamental changes in the ways we develop, deliver, and maintain our solutions.

Advantages of IaC

  1. Control over resources: The more people have access to the cloud, the more expensive it is to use it. By introducing IaC, we rule out disorderly alterations in the system and prevent wastefulness, thus improving security and gaining firmer control over resources and costs.
  2. Reproducibility: IaC is the tool of ultimate reproducibility. By going straight to Terraform, one can avoid repetitive manual work, thus achieving lower operating costs, higher speed, better coordination within the team, and (ideally) fully automated infrastructure deployment.
  3. Security: Introduced changes can be tracked, reviewed, and analyzed automatically before delivery. It’s a reliable way to detect and fix vulnerabilities before they can make trouble.
  4. Reliable documentation: Documentation effectively stops being relevant once a new change is introduced into the system without being documented. With many changes going on, doc updates get easily forgotten. But if you have code for documentation, this problem is solved. There’s nothing more relevant for understanding a system than code itself. Ideally, what you have in master is what you have in production.
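A Terraform fragment in this spirit (the region, AMI ID, and resource names are illustrative assumptions) shows the reproducibility point: the instance below can be destroyed and recreated identically from code, and every change to it arrives as a reviewable diff:

```hcl
# Illustrative sketch -- the region, AMI ID, and names are assumptions,
# not the actual Veeqo infrastructure code.
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# The instance is fully described in code: destroying and re-applying
# reproduces it exactly, and `terraform plan` shows the effect of any
# change before it touches the cloud.
resource "aws_instance" "ci_agent" {
  ami           = "ami-0123456789abcdef0"
  instance_type = "c4.large"

  tags = {
    Name      = "ci-agent"
    ManagedBy = "terraform"
  }
}
```

This is also how code doubles as documentation: what's in the repository's main branch is, ideally, exactly what's running in production.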
We are proud of our work with Veeqo and eager to share our successes. As our partnership continues, we’re constantly looking for new ways to meet Veeqo’s business needs and make its users’ experience better.

Key results

PostgreSQL

40 GB of RAM freed up; ReadIOPS halved; the number of TPS increased by 50%

Infrastructure costs

Reduced by 35% without losses in performance or security

Elasticsearch

More secure data access; search time decreased initially to under 5 seconds and by now to ~250-300 ms

Technology stack

  • Ruby
  • NodeJS
  • Elasticsearch
  • PostgreSQL
  • RabbitMQ
  • Redis
  • Memcached
  • CloudWatch
  • Prometheus
  • Grafana
  • Sentry
  • Heroku
  • AWS
  • Kubernetes
  • Terraform
  • Travis CI
  • Jenkins
  • Docker
  • Helm

Meet the team

  • Maxim Glotov, DevOps Engineer
  • Andrew Sapozhnikov, Head of DevOps

  • The resource provided by Mad Devs is excellent, bringing not only their own skill and expertise, but the input of the wider team to the project too. They manage the remote work seamlessly and fit well into the company’s workflow.

    Daniel Vartanov, CTO, Veeqo