With the growing amount of data in the world today, there is an ever-increasing need to ensure that products and services function without interruption. User satisfaction can drop significantly when people encounter problems in a product or service, which pushes businesses to develop strategies for keeping their systems continuously available. As a service scales, doing so becomes a challenge for any company that needs to balance profit against costs.
This article explores cell-based architecture as a means of providing a high-performing service that satisfies users while lowering long-term costs.
An overview of cell-based architecture
Before exploring a cell-based architecture, it will be helpful to look at two other design styles: monolithic and microservices. These are often contrasted with the cell-based approach and represent some of the reasons why companies choose to migrate to the latter.
An application with a monolithic architecture is built as a single, unified unit that combines the front-end, back-end, and database interactions into one codebase. This design is simpler to develop, test, and deploy, but it becomes difficult to scale and modify as the application grows. Likewise, since all components are tightly coupled, failures in one part can impact the entire system. While a monolith is suitable for small to medium applications, modern companies often transition to microservices or cell-based architectures for better scalability, flexibility, and fault isolation.
[Figure: Monolithic architecture.]
On the other hand, an application that uses a microservices architecture is built as a collection of small, independent services that each handle a specific function and communicate via APIs or messaging. This approach improves scalability, flexibility, and fault isolation, allowing teams to develop, deploy, and scale services independently. However, it introduces complexity in service coordination, networking, and monitoring. While microservices are well suited to large, dynamic applications, they bring challenges such as cross-service latency, data consistency, and operational overhead.
[Figure: API using architecture.]
At a basic level, both architectures face the same challenge when failures occur: a single fault can take down far more than the component that failed. A monolith stops functioning entirely, and if one microservice stops working, the others may keep running but without a service they depend on. A cell-based architecture addresses this issue by borrowing the concept of bulkheads from shipbuilding.
When water enters one section of a ship, the bulkheads keep it from reaching other sections and prevent the entire ship from sinking or taking on too much water. Likewise, cell-based architecture (CBA) organizes services into self-sufficient, independent units called "cells." Each cell includes everything it needs (services, data storage, compute resources, and networking) so it can function autonomously with minimal dependencies on other cells. A fault that could bring down the whole system in a traditional microservices deployment is instead contained within the cell where it occurs.
This architecture improves fault isolation, scalability, and performance by reducing inter-service communication and grouping related functionality into manageable units. At a high level, cell-based architecture involves the following elements:
- Cells: The fundamental building blocks of this architecture. Each cell contains everything needed to perform a specific function, including microservices or applications, databases or data stores, compute resources, and internal networking.
- Control plane: Manages and orchestrates cells across the system, providing service discovery, deployment automation, monitoring and logging, and security enforcement.
- Data plane: Routes requests, processes data, and enforces policies to enable efficient communication within and between cells.
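The relationship between cells and the routing layer can be illustrated with a minimal sketch. It assumes that requests are partitioned by a tenant ID and that a stable hash pins each tenant to one cell; the `Cell` and `CellRouter` names, the in-memory store, and the hashing scheme are all hypothetical choices made for illustration, not a description of any particular platform.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A self-contained unit with its own (in-memory) data store."""
    name: str
    healthy: bool = True
    store: dict = field(default_factory=dict)

    def handle(self, tenant_id: str, payload: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is unavailable")
        # State stays inside the cell; no other cell is consulted.
        self.store.setdefault(tenant_id, []).append(payload)
        return f"{self.name} stored {payload!r} for {tenant_id}"


class CellRouter:
    """Hypothetical routing layer that pins each tenant to exactly one cell."""

    def __init__(self, cells: list[Cell]) -> None:
        if not cells:
            raise ValueError("at least one cell is required")
        self.cells = cells

    def cell_for(self, tenant_id: str) -> Cell:
        # A stable hash keeps the tenant-to-cell mapping deterministic.
        digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.cells)
        return self.cells[index]

    def route(self, tenant_id: str, payload: str) -> str:
        return self.cell_for(tenant_id).handle(tenant_id, payload)
```

If one cell is marked unhealthy, only the tenants mapped to it see errors; requests for every other tenant continue to succeed, which is the bulkhead effect in miniature. A production router would also need a strategy for adding cells without remapping most tenants, which this sketch leaves out.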
As the bulkhead metaphor illustrates, there are specific uses for this approach to architecture. For example, certain companies can't afford failures in their services because of the damage they can cause to reputation, current obligations, and finances. Systems that require a low Recovery Time Objective can also benefit from this approach, as can multi-tenant products that require strict infrastructure-level isolation between tenants.
Regardless of the reasons for implementing it, a cell-based architecture offers the following benefits to businesses:
- Resiliency: Cells isolate failures and limit the impact of issues such as bad deployments, client errors, operator mistakes, or data corruption.
- Scalability: When limited in size, cells help a platform scale as it grows. When necessary, new cells are created to meet rising demand.
- Deployment scoping: Companies can use cells to target changes at a small group of users without affecting the broader user population.
- Convenient system testing: Small cells are easier to test, and their performance can be extrapolated to an entire system to estimate its performance capabilities.
- Potential cost savings: In certain situations, cells can help reduce the costs of cross-availability zone traffic if the data plane is sacrificed.
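Deployment scoping in particular lends itself to a short sketch. The function below assumes a rollout that proceeds one cell at a time and halts at the first failed health check; `staged_rollout`, `deploy`, and `is_healthy` are hypothetical names, and the two callbacks stand in for whatever deployment and monitoring tooling a team actually uses.

```python
from typing import Callable


def staged_rollout(
    cells: list[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Deploy to one cell at a time and stop at the first unhealthy cell.

    Returns the cells that were updated and passed the health check.
    """
    updated: list[str] = []
    for cell in cells:
        deploy(cell)
        if not is_healthy(cell):
            # Halt here: the remaining cells keep running the old version.
            break
        updated.append(cell)
    return updated
```

Called with the cells `["cell-canary", "cell-1", "cell-2"]` and a health check that fails for `cell-1`, the function deploys to the first two cells, returns `["cell-canary"]`, and never touches `cell-2`, so users served by that cell are unaffected by the bad change. Rolling back the failed cell is left to the caller.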
Challenges of implementing a cell-based architecture
Two concepts need some exploration in order to understand why businesses adopt a cell-based architecture: cross-availability zone traffic and zone-aware routing.
Cross-AZ (availability zone) traffic
Cross-AZ traffic is network data transfer between separate AZs within a cloud provider's region. Distributing workloads across multiple AZs enhances fault tolerance, because if one AZ fails, services can continue running in another. Replicating critical data and applications across AZs also removes single points of failure and increases overall system resilience. However, cross-AZ traffic introduces higher costs, since many providers charge for inter-AZ transfers, and it increases latency due to additional network hops.
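To get a feel for how much traffic crosses AZ boundaries when routing ignores zones, the following sketch computes the expected share under a simplified model. It assumes every caller instance sends the same request volume and picks a callee instance uniformly at random; the function name and the zone labels used with it are hypothetical, and no provider pricing is modeled.

```python
def cross_az_fraction(callers: dict[str, int], callees: dict[str, int]) -> float:
    """Expected share of requests that cross an AZ boundary when every caller
    picks a callee instance uniformly at random, ignoring zones.

    callers / callees map an AZ name to the number of instances placed there.
    Assumes every caller instance sends the same volume of requests.
    """
    total_callers = sum(callers.values())
    total_callees = sum(callees.values())
    if total_callers == 0 or total_callees == 0:
        raise ValueError("need at least one caller and one callee instance")

    same_az = 0.0
    for zone, caller_count in callers.items():
        caller_share = caller_count / total_callers
        callee_share = callees.get(zone, 0) / total_callees
        # Probability that a request starts in this zone and also lands in it.
        same_az += caller_share * callee_share
    return 1.0 - same_az
```

Under this model, a service spread evenly across three AZs that calls another evenly spread service sends about two thirds of its requests across an AZ boundary, which is why inter-AZ charges tend to grow as more services are added.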
Zone-aware routing
Zone-aware routing is a traffic management strategy that directs requests to resources within the same AZ whenever possible to reduce cross-AZ traffic costs and latency. This approach improves performance and fault isolation by keeping communication local while still allowing fallback to other AZs if needed. Load balancers, proxies, and service meshes (such as AWS ALB, Envoy, and Istio) offer ways to keep traffic zone-local, optimizing request distribution and maintaining high availability without unnecessary inter-AZ data transfers.
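The core idea, preferring a local endpoint and falling back only when none is available, can be sketched in a few lines. This is a simplified all-or-nothing rule written for illustration; the `Endpoint` type and `pick_endpoint` function are hypothetical and do not reproduce the algorithm of any specific load balancer or service mesh.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    address: str
    zone: str
    healthy: bool = True


def pick_endpoint(
    endpoints: list[Endpoint],
    caller_zone: str,
    rng: Optional[random.Random] = None,
) -> Endpoint:
    """Prefer a healthy endpoint in the caller's zone; otherwise fall back."""
    rng = rng or random.Random()
    healthy = [e for e in endpoints if e.healthy]
    if not healthy:
        raise RuntimeError("no healthy endpoints available")

    local = [e for e in healthy if e.zone == caller_zone]
    # Stay in-zone when possible; cross the AZ boundary only as a fallback.
    return rng.choice(local or healthy)
```

A rule this strict can overload a zone that has many callers but few endpoints, so real deployments usually also take the balance of capacity across zones into account before keeping all traffic local.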
The DoorDash case
While zone-aware routing seems the best answer for reducing cross-AZ traffic costs, a business that scales will soon run short of resources within a single AZ. DoorDash, an American technology company that operates an online food delivery and logistics platform, faced a similar issue.
DoorDash connects customers with local restaurants, grocery stores, and convenience shops. Since being founded in 2013, the company has expanded across the U.S., Canada, Australia, and beyond to become a leader in the on-demand delivery industry. Through its app and website, DoorDash enables customers to order meals, groceries, and household essentials with real-time tracking. The amount of data involved in its operations is staggering, and both the company and its users rely on uninterrupted service. The company's initial answer to this need was cross-AZ transfers.
As the company grew, however, so did the number of its microservices and back-ends, which led to higher cross-AZ data transfer costs. While these transfers allowed DoorDash to maintain smooth operations across its vast consumer population, the rising costs pushed the company to explore other ways to provide the same level of service at a lower price.
The answer they found was to implement a cell-based architecture supported by zone-aware routing. This approach minimizes cross-AZ traffic by keeping each cell within a single AZ whenever possible. Because a cell includes its own data storage, processing, and service logic, most requests can be handled locally, which reduces costly inter-AZ network transfers. Engineers working within a cell-based architecture can scope dependencies to the same AZ to prevent unnecessary cross-zone communication and lower both latency and operational costs.
Zone-aware routing complements this approach by directing user requests to a cell within the same AZ as the request origin to further reduce cross-AZ traffic and data transfer costs. If an AZ or a cell within it becomes unavailable, zone-aware routing can then redirect traffic to another AZ as a fallback. Together, cell-based architecture and zone-aware routing helped DoorDash optimize resource utilization and maintain fault tolerance and high availability without excessive cross-zone overhead.
When cell-based architecture is not enough
DoorDash's example and the overall benefits of a cell-based architecture may convince leaders that it's the best configuration for their business; however, there are clear reasons why it may not suit every organization.
Implementing a cell-based architecture requires effort and a level of skill that not all companies have or can afford at present. Likewise, other investments, such as an underlying platform capable of maintaining the desired speed, can push the initial costs beyond the resources available to smaller businesses, such as startups. Finally, a cell-based architecture depends on up-to-date infrastructure and services, so legacy systems may hinder migration efforts.
Instead of treating migration to this architecture as a simple "yes" or "no" question, business leaders can consider it one step in a broader modernization strategy across their company. Such a strategy covers the transition to newer systems and approaches, training for current employees, and hiring for required skill sets, all of which make the eventual move to cells easier and less resource-intensive.
Focus on robust systems
Whether a business is ready to move from microservices to cells or can only begin the initial steps of modernizing its systems, success in either case depends on robust systems. Customers and users expect quick access to and uninterrupted use of products and services, and any scaling goals will rely on healthy infrastructure. Therefore, now is the time to invest in engineers with the right skills and in systems that provide the necessary capabilities.