Challenges and solutions
In making Veeqo the reliable and efficient system it is today, our teams faced serious technical challenges. Addressing them required deep expertise, creative solutions, and intensive work. The contributions of our DevOps engineers were particularly important.
When we joined the Veeqo project, our DevOps team started by clearing existing bottlenecks in the development process, prioritizing those that most affected Veeqo’s users and the client’s business.
Phase 1: Fixing database outages
Without any detectable load spikes, the database would often hit maximum CPU utilization and deny service as queries piled up in the queue. To get the system running again, workers had to be restarted manually every time.
Effect on users
Users had limited access to the service: they couldn’t view content, place orders, or manage their settings.
Our fix proceeded in several steps:
- Collecting host metrics (CloudWatch and okmeter showed consequences, not causes)
- Analyzing true RAM use and identifying 1,200-1,500 worker connections taking up about 40 GB of RAM
- Adding PgBouncers to reduce the load on the database
- Stabilizing connection pooling by placing the PgBouncers behind an ELB, thus distributing traffic more evenly and reducing the number of single points of failure (SPOFs)
As a result, we spent $180 on launching PgBouncers on two c4.large instances behind an NLB and saved about 10X that sum in the customer’s monthly costs by:
- Freeing up about 40 GB of RAM
- Postponing the need to upgrade the RDS instance by about 6 months
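For illustration, a PgBouncer setup along these lines might look like the sketch below. The hostnames, database names, and pool sizes are hypothetical, not the actual Veeqo configuration:

```ini
; pgbouncer.ini — illustrative sketch, not the real Veeqo config
[databases]
; route all clients through the bouncer to the RDS endpoint (placeholder host)
app_db = host=example.rds.amazonaws.com port=5432 dbname=app_db

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
; transaction pooling lets thousands of worker connections share a small
; pool of real Postgres backends, freeing the RAM each backend consumes
pool_mode = transaction
max_client_conn = 2000      ; headroom above the ~1,200-1,500 workers observed
default_pool_size = 50      ; actual Postgres connections per database
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
```

With transaction pooling, an idle worker no longer pins a dedicated Postgres backend and the memory it consumes, which is how tens of gigabytes of RAM can be reclaimed from a large worker fleet.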
Phase 2: Continuous integration (CI)
Development processes lacked consistency. In particular, the project had:
- A manually configured CI system and pipelines
- Different development, test, and production environments and runtimes
- A non-reproducible development environment
- Inconsistent test builds
Dockerization is today’s most viable way to set up continuous delivery. As the application was then partially hosted on Heroku, we created a unified runtime for developers and CI by dockerizing the main app using a Heroku Stack image and Docker Compose. This made CI reliable, simplified deployment, and sped up bootstrapping of the development environment.
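As a rough sketch of the idea (service names, the stack version, and the Rails-style commands are assumptions, not the project’s actual files), such a unified runtime could be expressed like this:

```yaml
# docker-compose.yml — illustrative sketch, not the actual Veeqo setup
version: "3"
services:
  app:
    # a Heroku stack image keeps the container runtime identical to the
    # Heroku dynos the app was partially hosted on
    image: heroku/heroku:16
    working_dir: /app
    volumes:
      - .:/app
    command: bundle exec rails server -b 0.0.0.0   # assumed app entrypoint
    ports:
      - "3000:3000"
    depends_on:
      - db
  db:
    image: postgres:9.6
    environment:
      POSTGRES_PASSWORD: dev-only-password   # placeholder, local use only
```

Both a developer’s laptop and a CI agent then run `docker compose up`, so the two environments stay byte-for-byte identical.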
Jenkins was operated manually on a separate EC2 instance, which made managing it challenging, especially in emergencies. We moved the Jenkins master to ECS, reworking its provisioning, deployment, and updating (later we would move it to Kubernetes, but in 2017, ECS was the only good option for us). We also reduced build time and cost by keeping a small share of the Jenkins agents’ compute on reserved instances and moving 90% of the load to spot instances.
We further renovated CI by:
- Creating Jenkinsfiles for all repositories thus making pipelines reproducible
- Creating backups
- Running access audit and configuring SSO for more secure access
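A reproducible pipeline of this kind is typically captured in a declarative Jenkinsfile committed to each repository. The stages, labels, and commands below are a hypothetical sketch, not the project’s actual pipeline:

```groovy
// Jenkinsfile — illustrative declarative pipeline, not the real Veeqo one
pipeline {
    agent { label 'spot' }   // most builds run on cheaper spot-instance agents
    stages {
        stage('Build') {
            steps {
                sh 'docker compose build app'
            }
        }
        stage('Test') {
            steps {
                sh 'docker compose run --rm app bundle exec rspec'  // assumed test command
            }
        }
        stage('Deploy') {
            when { branch 'master' }   // deploy only from the mainline branch
            steps {
                sh './deploy.sh'       // hypothetical deploy script
            }
        }
    }
}
```

Because the pipeline lives in the repository next to the code, every branch carries its own build definition and the CI configuration can no longer drift from what developers see.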
Phase 4: Cutting the costs
Cutting costs for the customer is not about making compromises but about achieving efficiency. A DevOps engineer’s mind is like an hourglass: at any given moment, the focus is either on improving performance or on cutting costs. The hourglass flips all the time, and the goal is to find the balance: solutions that are both effective and cheap.
Once we had improved Veeqo’s performance and user experience, we started optimizing infrastructure costs. The infrastructure used to cost the customer about $20,000 per month. Within one summer, as we prepared new solutions and migrated to them, we brought the figure down to about $13,000.
Importantly, we also made the system more secure by moving RabbitMQ, Memcached, Elasticsearch, and other services inside our own network. The monitoring system that we kept refining along the way confirmed the effectiveness of the cost optimization.
Phase 5: Moving to Kubernetes
Kubernetes and the team were ready for each other. Kubernetes had grown stable, was well integrated with all the major AWS services, and was well equipped for high production loads. The team, in turn, had stabilized its development processes and needed better automation. Kubernetes gave us a unified platform for launching applications and orchestrating Docker containers, one that would safeguard Veeqo’s future growth. Having one platform handle everything also meant independence from cloud providers: Kubernetes lets workloads move seamlessly, so we could launch instances in any cloud.
Use of resources
There are four main types of resources: CPU, RAM, disk, and network. How a service uses them depends on its type. Our objective was to arrange services so that their resource usage doesn’t overlap and, importantly, so that not too many resources sit idle.
The balance comes from thorough, continuous calculation. Kubernetes shuffles services from node to node so that resources are not overused, producing a distribution that a human admin could never achieve manually.
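In Kubernetes terms, this balance is expressed through resource requests and limits on each container: the scheduler uses requests to bin-pack pods onto nodes, while limits cap what a pod may consume. The manifest below is a generic sketch; the names, image, and values are assumptions, not Veeqo’s actual workloads:

```yaml
# deployment.yaml — generic sketch of resource requests and limits
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: example/web:1.0   # hypothetical image
          resources:
            requests:              # what the scheduler reserves when placing the pod
              cpu: "250m"
              memory: "512Mi"
            limits:                # hard caps so one pod can't starve its neighbors
              cpu: "500m"
              memory: "1Gi"
```

Given accurate requests, the scheduler can co-locate CPU-heavy and memory-heavy services on the same node, which is exactly the non-overlapping packing described above.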
Kubernetes and containerization
Kubernetes specializes in resource management and orchestration, but it needs a containerization service alongside it. We used Docker to standardize the runtime for applications. The two worked perfectly together.
Kubernetes provides an environment where Ops and Devs can speak the same language: the language of YAML configurations and Kubernetes objects. We get system resources on one end, containers on the other, and in the middle, Kubernetes works its magic.
Phase 6: Infrastructure as Code (IaC)
IaC is a way to manage large structures where manual operation is highly ineffective or simply impossible. To go fully IaC, our teams made fundamental changes in the ways we develop, deliver, and maintain our solutions.
Advantages of IaC
- Control over resources: The more people have access to the cloud, the more expensive it is to use it. By introducing IaC, we rule out disorderly alterations in the system and prevent wastefulness, thus improving security and gaining firmer control over resources and costs.
- Reproducibility: IaC is the tool of ultimate reproducibility. By describing infrastructure in Terraform, one avoids repetitive manual work and achieves lower operating costs, higher speed, better coordination within the team, and (ideally) fully automated infrastructure deployment.
- Security: Introduced changes can be tracked, reviewed, and analyzed automatically before delivery. It’s a reliable way to detect and fix vulnerabilities before they can make trouble.
- Reliable documentation: Documentation stops being relevant the moment a change enters the system without being documented. With many changes going on, doc updates are easily forgotten. But if your code is the documentation, the problem disappears: nothing describes a system more accurately than the code itself. Ideally, what you have in master is what you have in production.
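As a minimal illustration of these points (the resource names, AMI id, and sizes are hypothetical), infrastructure described in Terraform is reviewable, reproducible, and self-documenting like any other code:

```hcl
# main.tf — illustrative sketch; names and values are placeholders
provider "aws" {
  region = "us-east-1"
}

# This instance exists because the code says so: a change goes through
# review and `terraform plan` before it is applied, and the repository
# documents exactly what runs in production.
resource "aws_instance" "ci_agent" {
  ami           = "ami-0123456789abcdef0"  # placeholder AMI id
  instance_type = "c4.large"

  tags = {
    Name      = "ci-agent"
    ManagedBy = "terraform"
  }
}
```

Running `terraform plan` shows the diff between the code and the live infrastructure before anything changes, which is what makes alterations trackable and disorderly edits impossible.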