By Dave Andrews, Chief Architect
Today it no longer makes sense to talk about computing at a single “edge.” A modern network consists of multiple layers, each with its own compute capabilities and latency tradeoffs. At Verizon Digital Media Services, we’ve been managing these tradeoffs and interactions between the two innermost layers: very large but sparse public cloud data centers and the content delivery network (CDN). However, the rise of 5G and multi-access edge computing (MEC) brings powerful new capabilities, along with a corresponding jump in complexity.
5G/MEC offers developers compute resources at the very edge of the cellular network, enabling applications to run with significantly reduced latency, and with vastly increased network throughput to clients. This allows developers to build applications that can ingest, process, and deliver large amounts of data closer to customers, avoiding today’s longer, slower paths over the internet. However, more places to run code means more potential problems, and more headaches for developers. It’s also unclear how the 5G/MEC layer will interact and cooperate with other existing compute layers. No one wants to build the same application three times, once for each layer, manage three deployments, or monitor three systems.
At the EdgeNext Summit in New York this past October, I presented our vision for solving some of these issues with a cohesive edge compute solution. The key is for the solution to hide all of the multi-edge complexity without sacrificing the power afforded us by multi-edge capabilities. Developers need to be able to have their applications move as needed between the 5G/MEC layer, CDN, and public cloud data center based on factors like cost and load.
To do this, we need a simple language to define the entire ecosystem, and tooling that consumes the expression of that language and manages the complexity at each of the three layers. Our EdgeControl specification is a step in the right direction, but it is currently focused exclusively on the CDN layer. Extended across all three layers, and with a nod to naming things (one of the two hard things in computer science), it could be called EdgeCast Compute Control, or EC3 for short.
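To make this concrete, here is a rough sketch of what a multi-layer EC3 application definition might look like, expressed as TypeScript types. Everything below is an assumption for illustration: the layer names, the placement fields, and the overall shape are hypothetical extensions on paper, not part of today’s EdgeControl.

```ts
// Hypothetical sketch of a multi-layer EC3 definition. All names below
// (Layer, EcSpec, the placement fields) are assumptions; EdgeControl
// today describes only the CDN layer.

type Layer = "mec" | "cdn" | "cloud";

interface PlacementRule {
  layer: Layer;
  // Bounds that make this layer eligible for the function.
  maxLatencyMs?: number;
  maxCostPerMillionRuns?: number;
}

interface FunctionSpec {
  name: string;
  entrypoint: string;         // e.g. a bundled JS or Wasm module
  placement: PlacementRule[]; // ordered by preference
}

interface EcSpec {
  application: string;
  functions: FunctionSpec[];
}

// One definition that may run at all three layers, preferring MEC
// whenever MEC can meet the 10 ms latency bound.
const spec: EcSpec = {
  application: "telemetry-aggregator",
  functions: [
    {
      name: "ingest",
      entrypoint: "dist/ingest.js",
      placement: [
        { layer: "mec", maxLatencyMs: 10 },
        { layer: "cdn", maxLatencyMs: 50 },
        { layer: "cloud" }, // final fallback, no latency bound
      ],
    },
  ],
};
```

The ordered placement list is the key idea: a single declaration carries enough information for tooling to decide, per layer, where the function may run as cost and load change.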
Once such a system is built, what would developers do with it? Here is how our cohesive solution would enable five core development tasks:
1. Work in a development environment that’s representative of production. By ensuring a common framework and language support across all three layers, we can enable developers to continually test and run their code in whatever layer is cheapest. Often the inner layers will be cheapest for this purpose.
2. Build tests to raise confidence in deployments. For testing, we can extend the concept of Edgecast’s Edge Verify to verify code correctness across layers. The resulting testing framework should have built-in performance analysis to further validate code at each layer, and it should run the same suite of tests against canary deployments at each layer. Ensuring that code behaves identically at each layer of the concentric circles is the primary requirement for enabling failover and resilience; a sketch of such a harness follows.
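The snippet below sketches what such a cross-layer harness might do: run one test case against the same function deployed at each layer and flag any behavioral divergence. The per-layer endpoints and the response-diffing approach are assumptions; none of this reflects an actual Edge Verify API.

```ts
// Sketch of a cross-layer verification harness. The endpoint shape and
// the response-diffing idea are assumptions, not a real Edge Verify API.

interface LayerEndpoint {
  layer: "mec" | "cdn" | "cloud";
  url: string; // where the same function is deployed at this layer
}

interface LayerResult {
  layer: string;
  status: number;
  body: string;
  latencyMs: number;
}

async function probe(ep: LayerEndpoint, path: string): Promise<LayerResult> {
  const start = Date.now();
  const res = await fetch(ep.url + path);
  const body = await res.text();
  return { layer: ep.layer, status: res.status, body, latencyMs: Date.now() - start };
}

// Run one test case against every layer and flag any layer whose
// response differs from the first: the "behaves identically" check,
// with per-layer latency reported for the performance side.
async function verifyAcrossLayers(endpoints: LayerEndpoint[], path: string): Promise<void> {
  const results = await Promise.all(endpoints.map((ep) => probe(ep, path)));
  const ref = results[0];
  for (const r of results) {
    const ok = r.status === ref.status && r.body === ref.body;
    console.log(`${r.layer}: ${r.status} in ${r.latencyMs}ms ${ok ? "OK" : "MISMATCH"}`);
  }
}
```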
3. Build in failover and resilience as use cases demand. This is a key new capability of our cohesive edge compute solution. Currently, failover and resilience are often afterthoughts in application architecture, because they can require tedious configuration of multiple components. That model of achieving resilience does not scale to a more complex (and more capable) multi-edge world. Instead, we would enable developers to specify distinct failover paths depending on the use case and on which layer the code is running in. The three primary building blocks of these paths, sketched in code after the list, are:
a. “fail in” — retry at an inner layer of the concentric circles, where more compute resources are available.
b. “fail out” — retry at an outer layer of the system where latency is low, so it can make up for lost time.
c. “fail around” — retry at a neighboring node at the same layer in the system.
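Here is one possible sketch of how these three primitives might be expressed in an EC3 failover policy. The policy shape and every field name are assumptions, not part of the existing EdgeControl specification.

```ts
// Hypothetical EC3 failover policy built from the three primitives.
// All field names here are assumptions for illustration.

type FailoverKind = "fail_in" | "fail_out" | "fail_around";

interface FailoverStep {
  kind: FailoverKind;
  // Optional guards on when this step applies.
  regions?: string[];          // e.g. ["us-east"]
  hoursUtc?: [number, number]; // active window, e.g. [13, 21]
}

interface FailoverPolicy {
  function: string;
  steps: FailoverStep[]; // tried in order
  maxHops: number;       // caps the total failover path length
}

const policy: FailoverPolicy = {
  function: "ingest",
  steps: [
    { kind: "fail_around" },                   // try a sibling node first
    { kind: "fail_in", regions: ["us-east"] }, // then retreat toward the cloud
  ],
  maxHops: 3,
};
```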
We would also let developers cap the length of these failover paths and enable paths selectively for different regions, times of day, and other factors. Once failover is specified as part of the EC3 specification, detecting failures, following these paths, and reporting on them should all be automated.
For example, if a failure is triggered by excessive load or cost quotas at the MEC layer, we might want the function to “fail in” to somewhere with more resources, or somewhere cheaper, to ensure it eventually succeeds. Alternatively, in a scenario where latency consistency is paramount (such as an automated control system aggregating and analyzing multiple data feeds), and a failure at an inner layer has already consumed some of the available time, we might want it to “fail out” and complete the function at a location closer to the customer.
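A minimal sketch of that runtime choice, under the assumption that the scheduler knows the failure reason and the remaining latency budget (the names and the 50% threshold are illustrative only):

```ts
// Illustrative only: fail in when the trigger is load or cost, fail out
// when the latency budget is nearly spent. Thresholds are assumptions.

type FailureReason = "overload" | "cost_quota" | "error";

function chooseFailover(
  reason: FailureReason,
  budgetMs: number,  // end-to-end latency budget for the function
  elapsedMs: number  // time already consumed by the failed attempt
): "fail_in" | "fail_out" {
  // Load or cost pressure: inner layers have more (and cheaper) capacity.
  if (reason === "overload" || reason === "cost_quota") return "fail_in";
  // Little time left: retry closer to the client to make up for lost time.
  if (budgetMs - elapsedMs < budgetMs * 0.5) return "fail_out";
  return "fail_in";
}
```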
The EC3 specification should also allow developers to define multiple paths for a function as it runs at each layer, enabling more complex directed-graph structures. Loop detection and automated conflict resolution would be required to make this work.
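Loop detection, at least, is well-trodden ground: if failover paths form a directed graph of (layer, region) nodes, a standard depth-first cycle check can reject an invalid specification before deployment. The node naming below is an assumption.

```ts
// Model failover paths as a directed graph and reject any specification
// containing a cycle, using standard DFS cycle detection. The
// "layer:region" node naming is an illustrative assumption.

type FailoverGraph = Map<string, string[]>; // node -> failover targets

function hasCycle(graph: FailoverGraph): boolean {
  const state = new Map<string, "visiting" | "done">();

  const visit = (node: string): boolean => {
    if (state.get(node) === "visiting") return true; // back edge: cycle
    if (state.get(node) === "done") return false;
    state.set(node, "visiting");
    for (const next of graph.get(node) ?? []) {
      if (visit(next)) return true;
    }
    state.set(node, "done");
    return false;
  };

  for (const node of graph.keys()) {
    if (visit(node)) return true;
  }
  return false;
}

// "mec:nyc" fails around to "mec:bos", which fails in to the CDN; a
// path back out to "mec:nyc" would create a loop.
const paths: FailoverGraph = new Map([
  ["mec:nyc", ["mec:bos"]],
  ["mec:bos", ["cdn:us-east"]],
  ["cdn:us-east", ["mec:nyc"]], // remove this edge to make the spec valid
]);

console.log(hasCycle(paths)); // true
```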
4. Canary new code to production carefully. Our solution should enable flexible but controlled upgrades to developers’ applications. Ideally, this would build on technology discussed at Velocity a few years ago, adding alerting, auto-advance, and rollback capabilities. These concepts get much trickier in a multi-edge world. For example, canary deployments must be influenced by all currently defined failover paths. If a given piece of code has a simple “fail in” path, new features should be deployed from the interior layers outward, while deprecations should be rolled from the outside in, so that we never fail over to an older version of an application that is missing functionality.
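In code, that ordering constraint might reduce to something as simple as the sketch below, assuming a single linear “fail in” path from the MEC layer down to the cloud (the layer names are illustrative):

```ts
// With a linear "fail in" path (mec -> cdn -> cloud), features roll out
// inside-out so no request ever fails in to a version that lacks them;
// deprecations roll outside-in for the same reason. Illustrative only.

const FAIL_IN_PATH = ["mec", "cdn", "cloud"] as const; // outermost -> innermost

function deployOrder(change: "feature" | "deprecation"): string[] {
  // Features: innermost layers first. Deprecations: outermost first.
  return change === "feature" ? [...FAIL_IN_PATH].reverse() : [...FAIL_IN_PATH];
}

console.log(deployOrder("feature"));     // ["cloud", "cdn", "mec"]
console.log(deployOrder("deprecation")); // ["mec", "cdn", "cloud"]
```

Richer failover graphs would generalize this to a topological order over the graph, which is another reason loop detection has to come first.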
5. Consume analytics that provide visibility into production behavior. This would build on what EdgeControl can already do at the SuperPoP layer, accounting for complexities such as the same function running at different layers with different performance and resource-consumption profiles, each of which needs to be visible independently and in aggregate. In real time, our system would answer critical questions such as: What code is running where? Which failovers are happening where? What are user experience and latency like at different points in the network, or for different canary deployments?
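As a sketch of what that per-layer visibility could look like, the snippet below aggregates a hypothetical invocation-event stream by layer, tracking latency and failover retries. The event shape is an assumption, not an existing EdgeControl analytics schema.

```ts
// Hypothetical invocation-event stream sliced by layer; the event shape
// is an assumption made for this sketch.

interface InvocationEvent {
  functionName: string;
  layer: "mec" | "cdn" | "cloud";
  version: string;         // which canary deployment the request hit
  latencyMs: number;
  failedOverFrom?: string; // set when this run is a failover retry
}

function summarize(events: InvocationEvent[]): void {
  const byLayer = new Map<string, { count: number; totalMs: number; failovers: number }>();
  for (const e of events) {
    const s = byLayer.get(e.layer) ?? { count: 0, totalMs: 0, failovers: 0 };
    s.count += 1;
    s.totalMs += e.latencyMs;
    if (e.failedOverFrom) s.failovers += 1;
    byLayer.set(e.layer, s);
  }
  for (const [layer, s] of byLayer) {
    console.log(
      `${layer}: ${s.count} runs, avg ${(s.totalMs / s.count).toFixed(1)}ms, ` +
        `${s.failovers} failover retries`
    );
  }
}
```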
This kind of solution isn’t just an option; we can expect developers to demand it. The various edges, with their various capabilities, will eventually come together to create something much greater than the sum of its parts. The sooner we develop a cohesive system that gives developers an intuitive interface for managing the complexity of computing across multiple edges, the sooner we’ll see the emergence of a whole new generation of world-altering applications.