Fault-Tolerant Architecture for a Taxi Service

This article is Part 1 of the detailed description of my talk at HighLoad++ 2015.

Today Namba Taxi is the leading passenger transportation company in our region. I will explain how we arrived at a fault-tolerant architecture and why we can afford to lose any one of our physical servers without losing performance.

What types of taxi services exist today?

The first type is a private taxi, or «bombila» or «bordurshik» in our slang. Drivers have their own rules for surviving in the taxi market and their own rates, and they work without intermediaries. The second type is a dispatching service. It may or may not be automated, and it may use GPS trackers and portable radios or any other available means. The latest generation of taxi service is an automated intermediary, the best-known example being Uber.

What is our company like?

Our company is a service that is keen on automation. With this in mind, we are trying to catch up with Uber, and over the course of the project we have acquired about 300 000 satisfied clients. The number may seem small, but it is one-third of Bishkek, the capital of our country. We have 600 drivers online, and we process more than 8000 orders per day. Our daily workload looks like this:

The graph shows two rush-hour periods: in the morning (from 6 to 10 am) and in the evening (from 4 to 8 pm). Our servers process up to 3500 requests per second. We are also able to send responses to drivers within 20 milliseconds and to dispatchers within 2.5 milliseconds.

I will tell you how we managed to build the fault-tolerant architecture. I will also describe how we chose the software for IP telephony and how we managed to carry voice traffic with it.

But let me start from the very beginning.

The story began in 2011, when our CEO came to us and said that he wanted to open a taxi company in Bishkek. We got to work: we explained where to get servers and how to install them, and we recommended the only existing player in the taxi automation market.

The task was to let the client choose whether to call or to send a message to order a taxi. The old system worked as follows:

An operator accepts an order and forwards it to a driver,
The driver accepts the order,
The driver goes to the order point and picks up the client,
The driver drives the client to the destination point, and the manager gets the report in the workflow.

Workflow using providers’ software
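The four steps listed above can be read as a small state machine. The sketch below is only an illustration of that idea, not the provider's or our actual code: the state names, the `Order` class, and the `advance` method are all hypothetical.

```python
from enum import Enum


class OrderState(Enum):
    # Hypothetical state names derived from the four workflow steps.
    NEW = "new"              # operator has accepted the order
    OFFERED = "offered"      # operator forwarded it to a driver
    ACCEPTED = "accepted"    # driver accepted the order
    PICKED_UP = "picked_up"  # driver collected the client
    COMPLETED = "completed"  # client delivered, manager gets the report


# Each state may only move to the next step of the workflow.
TRANSITIONS = {
    OrderState.NEW: {OrderState.OFFERED},
    OrderState.OFFERED: {OrderState.ACCEPTED},
    OrderState.ACCEPTED: {OrderState.PICKED_UP},
    OrderState.PICKED_UP: {OrderState.COMPLETED},
    OrderState.COMPLETED: set(),
}


class Order:
    def __init__(self, order_id):
        self.order_id = order_id
        self.state = OrderState.NEW
        self.history = [OrderState.NEW]  # useful for the manager's report

    def advance(self, new_state):
        """Move to new_state, rejecting steps the workflow does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.name} to {new_state.name}"
            )
        self.state = new_state
        self.history.append(new_state)


if __name__ == "__main__":
    order = Order(order_id=1)
    for step in (OrderState.OFFERED, OrderState.ACCEPTED,
                 OrderState.PICKED_UP, OrderState.COMPLETED):
        order.advance(step)
    print([state.value for state in order.history])
```

Keeping the allowed transitions in one table makes it easy to see which steps can be automated and which still need an operator.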

The key components of this setup were the provider's service and an active call handling module: operators could make and answer calls. There were SMS notifications, a partially automated workflow, and plenty of Chinese navigators used by drivers.

SMS notifications were sent in two cases: first when a driver accepted the order, and again when he arrived at the destination. Unfortunately, our partners could not reach an agreement with each other, so we decided to change our provider. The reason was that the service could be unavailable for up to 4 hours in a row. We wanted to grow and enter new markets and segments, and our provider was not ready for the changes.

Once we decided to create our own software, the work started with system requirements. By that time, the taxi company had already been operating for a year on the provider's software. We tried to keep interface changes to a minimum so that people who had used the previous program could easily adapt to the new one. We also planned to make the product flexible.

Architecture

Drivers were working with navigators running Windows CE, and we decided to leave them as they were. We also planned to add support for Android devices, real-time updates in the dispatching interface, and IP telephony. We planned to switch to our own software as soon as these modules were ready.

What were the limitations? The first was the extremely high cost of mobile internet, so we had to reduce traffic between driver devices and servers. Our servers were ordinary home workstations. We had a small tech team and six months for development.
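To give a feel for what reducing traffic can mean in practice, here is a minimal sketch that compares a JSON position update with a fixed-size binary one. The field set and the wire format are assumptions made for this illustration and do not describe the protocol we actually used.

```python
import json
import struct

# Hypothetical wire format: driver id (uint32), Unix time (uint32),
# latitude and longitude in millionths of a degree (int32 each),
# speed in km/h (uint8). "<" means little-endian with no padding.
POSITION_FORMAT = "<IIiiB"


def pack_position(driver_id, timestamp, lat, lon, speed_kmh):
    """Encode one position update as a fixed-size binary message."""
    return struct.pack(
        POSITION_FORMAT,
        driver_id,
        timestamp,
        round(lat * 1_000_000),
        round(lon * 1_000_000),
        speed_kmh,
    )


def unpack_position(payload):
    """Decode a message produced by pack_position."""
    driver_id, timestamp, lat_e6, lon_e6, speed_kmh = struct.unpack(
        POSITION_FORMAT, payload
    )
    return driver_id, timestamp, lat_e6 / 1_000_000, lon_e6 / 1_000_000, speed_kmh


if __name__ == "__main__":
    update = {"driver_id": 42, "timestamp": 1446000000,
              "lat": 42.874621, "lon": 74.569762, "speed_kmh": 37}
    as_json = json.dumps(update).encode("utf-8")
    as_binary = pack_position(**update)
    print(len(as_json), "bytes as JSON,", len(as_binary), "bytes packed")
```

With thousands of updates per driver per day, shaving each message down to a few bytes adds up quickly on a metered mobile connection.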

We decided to build a web application, since it can be rolled out to more people quickly and is not tied to any particular software or device. All you need for a web application is a browser.

We decided to implement a single core. It served as the API for drivers and managers, was used for accepting payments, and automated the work.
We chose:

Django for the core;
Redis for the Publish/Subscribe mechanism;
Node.js for tracking real-time events in the dispatcher interface;
Twisted as a socket server for drivers;
Ruby for working with SMS;
WebRTC for telephony.
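The Publish/Subscribe mechanism is what ties these pieces together: the core publishes an event once, and every interested component receives it. The sketch below shows the pattern with a tiny in-process broker. It is a stand-in used only for illustration, not Redis itself and not our production code; the `Broker` class, the channel name, and the message fields are hypothetical.

```python
class Broker:
    """In-process stand-in for the publish/subscribe role that Redis plays."""

    def __init__(self):
        self._subscribers = {}  # channel name -> list of callbacks

    def subscribe(self, channel, callback):
        """Register callback to be called for every message on channel."""
        self._subscribers.setdefault(channel, []).append(callback)

    def publish(self, channel, message):
        """Deliver message to all subscribers and return how many got it."""
        callbacks = list(self._subscribers.get(channel, []))
        for callback in callbacks:
            callback(message)
        return len(callbacks)


if __name__ == "__main__":
    broker = Broker()

    # Two hypothetical consumers: the real-time dispatcher interface
    # and the socket server that talks to driver devices.
    dispatcher_screen = []
    driver_socket = []
    broker.subscribe("orders", dispatcher_screen.append)
    broker.subscribe("orders", driver_socket.append)

    # The core publishes once; it does not need to know who is listening.
    delivered = broker.publish("orders", {"order_id": 1, "status": "accepted"})
    print(delivered, dispatcher_screen, driver_socket)
```

The publisher never references its consumers directly, which is why components written in different languages can be added or restarted independently.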

Why so many technologies in one project?

First, Ruby has Ruby-smpp, which handles the SMPP protocol very well for sending and receiving SMS. Second, we needed Node.js because of Socket.IO, which supports several transport types for real-time messaging.

What made us take bare telephone exchange software and build our own telephony on top of it?

We wanted to build the system on open source software, with no restrictions on operating systems or devices. Thanks to this, and to the decision to use the web, we could reduce the number of workplaces, and staff could work remotely. We no longer needed to spend money on switching equipment and licenses. We could also attract more customers.

In the next article, I will describe our first working version of the taxi software and what we faced while integrating it into an operating business.

The story continues in Part 2 and Part 3.