1. Overview

In this tutorial, we’ll discuss the various ways to implement retry policies in gRPC, a remote procedure call framework developed by Google. gRPC is interoperable across many programming languages, but we’ll focus on the Java implementation.

2. Importance of Retry

Applications increasingly rely on a distributed architecture. This approach helps handle heavy workloads through horizontal scaling and promotes high availability. However, it also introduces more potential points of failure. Therefore, fault tolerance is crucial when developing applications composed of multiple microservices.

RPCs can fail transiently for various reasons:

  • Network latency or dropped connections
  • Server not responding due to an internal error
  • Busy system resources
  • Busy or unavailable downstream services
  • Other related issues

Retry is a fault-handling mechanism. A retry policy can help automatically reattempt a failed request based on some condition. It can also define how long or how often the client can retry. This simple pattern can help handle transient failures and increase reliability.
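
As a mental model, the pattern a retry policy automates can be sketched as a plain Java loop. The flaky action, the attempt limit, and the backoff values below are all hypothetical, chosen for illustration only:

```java
import java.util.function.Supplier;

public class RetryDemo {

    // Retries the given action up to maxAttempts times, doubling the delay
    // after each failed attempt (simple exponential backoff)
    static <T> T retry(Supplier<T> action, int maxAttempts, long initialBackoffMillis)
        throws InterruptedException {
        long backoff = initialBackoffMillis;
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException ex) {
                last = ex;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff);
                    backoff *= 2;
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] calls = {0};
        // A flaky action that fails twice, then succeeds on the third attempt
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RuntimeException("transient failure");
            }
            return "ok";
        }, 5, 10);
        System.out.println(result + " after " + calls[0] + " attempts");
        // prints "ok after 3 attempts"
    }
}
```

A retry policy in gRPC expresses exactly this loop declaratively, so the application code never has to write it by hand.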

3. RPC Failure Stages

Let’s first understand where a remote procedure call (RPC) can fail:

failure stages sequence

The client application initiates the request, which the gRPC client library sends to the server. Once received, the gRPC server library forwards the request to the server application’s logic.

An RPC can fail at various stages:

  1. Before leaving the client
  2. In the server but before reaching the server application logic
  3. In the server application logic

4. Retry Support in gRPC

Since retry is an important recovery mechanism, gRPC automatically retries failed requests in special cases and allows developers to define retry policies for greater control.

4.1. Transparent Retry

We must understand that gRPC can safely reattempt failed requests only when the request hasn’t reached the server application logic. Beyond that point, gRPC cannot guarantee the idempotency of the transactions. Let’s take a look at the overall transparent retry pathway:

transparent retry pathway

As discussed previously, internal retries can happen safely before the request leaves the client, or in the server before it reaches the server application logic. This retry strategy is referred to as transparent retry. Once the server application successfully processes the request, the response is returned and no further retries are attempted.

gRPC performs at most a single transparent retry once the RPC has reached the gRPC server library, because multiple retries at that stage can add load to the network. However, it may retry an unlimited number of times while the RPC fails to leave the client.

4.2. Retry Policy

To give developers more control, gRPC supports configuring appropriate retry policies for their applications at the individual service or method level. Once the request crosses Stage 2, it comes under the purview of the configurable retry policy. Service owners or publishers can configure the retry policies of their RPCs with the help of service config, a JSON file.

Service owners typically distribute the service configuration to gRPC clients using name resolution services such as DNS. However, when name resolution doesn’t provide a service configuration, service consumers or developers can configure it programmatically.

gRPC supports multiple retry parameters:

  • maxAttempts: the maximum number of RPC attempts, including the original request; the default maximum value is 5
  • initialBackoff: the initial backoff delay between retry attempts
  • maxBackoff: places an upper limit on exponential backoff growth; it’s mandatory and must be greater than zero
  • backoffMultiplier: the backoff is multiplied by this value after each retry attempt and grows exponentially when the multiplier is greater than 1; it’s mandatory and must be greater than zero
  • retryableStatusCodes: a gRPC call that fails with a matching status is retried automatically; service owners should design retryable methods carefully, making them idempotent or allowing retries only on error status codes of RPCs that haven’t made any changes on the server
Notably, the gRPC client uses the initialBackoff, maxBackoff, and backoffMultiplier parameters to compute a randomized, exponentially growing delay before retrying a request.
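
According to the gRPC retry design, the delay before the n-th retry is drawn uniformly at random from [0, current_backoff], where current_backoff starts at initialBackoff, is multiplied by backoffMultiplier after each attempt, and is capped at maxBackoff. Here’s a small sketch of that computation, using the same parameter values as the service config shown later in Section 5:

```java
public class BackoffDemo {

    // Upper bound on the backoff for the n-th retry (1-based):
    // min(initialBackoff * multiplier^(n-1), maxBackoff)
    static double backoffCapSeconds(double initial, double multiplier, double max, int retryNumber) {
        return Math.min(initial * Math.pow(multiplier, retryNumber - 1), max);
    }

    // The actual delay is randomized: uniform in [0, cap]
    static double randomizedDelaySeconds(double cap) {
        return Math.random() * cap;
    }

    public static void main(String[] args) {
        // initialBackoff = 0.5s, backoffMultiplier = 2, maxBackoff = 30s
        double initial = 0.5, multiplier = 2.0, max = 30.0;
        for (int n = 1; n <= 8; n++) {
            System.out.printf("retry %d: cap = %.1fs%n", n, backoffCapSeconds(initial, multiplier, max, n));
        }
        // caps: 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, then 30.0 (capped) from retry 7 onward
    }
}
```

The randomization (jitter) spreads retries out over time, so many clients recovering from the same outage don’t all hammer the server at the same instant.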

Sometimes, the server might send an instruction in the response metadata telling the client not to retry, or to retry only after some delay. This is known as server pushback.

Now that we’ve discussed both transparent and policy-based retry features of gRPC, let’s summarize how gRPC manages retries overall:

retry logic state diagram

5. Programmatically Apply Retry Policy

Let’s say we have a service that broadcasts messages to citizens by calling an underlying notification service that sends SMS messages to cell phones. The government uses this service to make emergency announcements. The client application consuming this service must have a retry strategy to mitigate errors caused by transient failures.

Let’s explore this further.

5.1. High-Level Design

First, let’s look at the interface definition in the broadcast.proto file:

syntax = "proto3";
option java_multiple_files = true;
option java_package = "com.baeldung.grpc.retry";
package retryexample;

message NotificationRequest {
  string message = 1;
  string type = 2;
  int32 messageID = 3;
}

message NotificationResponse {
  string response = 1;
}

service NotificationService {
  rpc notify(NotificationRequest) returns (NotificationResponse){}
}

The broadcast.proto file defines NotificationService with a remote method notify() and two DTOs, NotificationRequest and NotificationResponse.

Overall, let’s see the classes used in the client and server sides of the gRPC application:

retry broadcast

Later, we can use the broadcast.proto file to generate the supporting Java source code for implementing NotificationService. The Maven plugin generates the classes NotificationRequest, NotificationResponse, and NotificationServiceGrpc.

The GrpcBroadcastingServer class on the server side uses the ServerBuilder class to register NotificationServiceImpl to broadcast messages. The client-side class GrpcBroadcastingClient uses the ManagedChannel class of the gRPC library to manage the channel for performing the RPCs.

The service config file retry-service-config.json outlines the retry policy:

{
  "methodConfig": [
    {
      "name": [
        {
          "service": "retryexample.NotificationService",
          "method": "notify"
        }
      ],
      "retryPolicy": {
        "maxAttempts": 5,
        "initialBackoff": "0.5s",
        "maxBackoff": "30s",
        "backoffMultiplier": 2,
        "retryableStatusCodes": [
          "UNAVAILABLE"
        ]
      }
    }
  ]
}
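
Alternatively, the same policy can be assembled programmatically as a Map and handed to defaultServiceConfig() directly. Note that grpc-java interprets this map as if it had been parsed from JSON, so numeric values should be Doubles and durations Strings; the sketch below mirrors the JSON file above:

```java
import java.util.List;
import java.util.Map;

public class ServiceConfigDemo {

    // Builds the same service config as retry-service-config.json, using
    // JSON-style value types (Double for numbers, String for durations)
    static Map<String, ?> buildServiceConfig() {
        Map<String, Object> retryPolicy = Map.of(
            "maxAttempts", 5.0,
            "initialBackoff", "0.5s",
            "maxBackoff", "30s",
            "backoffMultiplier", 2.0,
            "retryableStatusCodes", List.of("UNAVAILABLE"));
        Map<String, Object> name = Map.of(
            "service", "retryexample.NotificationService",
            "method", "notify");
        Map<String, Object> methodConfig = Map.of(
            "name", List.of(name),
            "retryPolicy", retryPolicy);
        return Map.of("methodConfig", List.of(methodConfig));
    }

    public static void main(String[] args) {
        System.out.println(buildServiceConfig());
    }
}
```

This avoids the JSON file and the Gson dependency entirely, at the cost of hard-coding the policy in the client.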

Earlier, we covered the retry parameters such as maxAttempts, the exponential backoff settings, and retryableStatusCodes. When the client invokes the remote procedure notify() in NotificationService, as defined in the broadcast.proto file, the gRPC framework enforces these retry settings.

5.2. Implement Retry Policy

Let’s take a look at the class GrpcBroadcastingClient:

import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.Map;

import com.google.gson.Gson;
import com.google.gson.stream.JsonReader;

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

public class GrpcBroadcastingClient {

    // Reads retry-service-config.json from the classpath into a Map
    protected static Map<String, ?> getServiceConfig() {
        return new Gson().fromJson(new JsonReader(new InputStreamReader(GrpcBroadcastingClient.class.getClassLoader()
            .getResourceAsStream("retry-service-config.json"), StandardCharsets.UTF_8)), Map.class);
    }

    public static NotificationResponse broadcastMessage() {
        ManagedChannel channel = ManagedChannelBuilder.forAddress("localhost", 8080)
          .usePlaintext()
          .disableServiceConfigLookUp()
          .defaultServiceConfig(getServiceConfig())
          .enableRetry()
          .build();
        return sendNotification(channel);
    }

    public static NotificationResponse sendNotification(ManagedChannel channel) {
        NotificationServiceGrpc.NotificationServiceBlockingStub notificationServiceStub = NotificationServiceGrpc
          .newBlockingStub(channel);

        NotificationResponse response = notificationServiceStub.notify(NotificationRequest.newBuilder()
          .setType("Warning")
          .setMessage("Heavy rains expected")
          .setMessageID(generateMessageID()) // helper that produces a unique message ID, defined elsewhere
          .build());
        channel.shutdown();
        return response;
    }
}

The broadcastMessage() method builds the ManagedChannel object with the necessary configuration. Then, we pass it to sendNotification(), which invokes the notify() method on the stub.

The methods in the ManagedChannelBuilder class that play a crucial role in setting up the service config consisting of the retry policy are:

  • disableServiceConfigLookUp(): Explicitly disables the service config lookup through name resolution
  • enableRetry(): Enables per-method configuration for retry
  • defaultServiceConfig(): Explicitly sets up the service configuration

The method getServiceConfig() reads the service config from the retry-service-config.json file and returns a Map representation of its content. Subsequently, this Map is passed on to the defaultServiceConfig() method in the ManagedChannelBuilder class.

Finally, after creating the ManagedChannel object, we call the notify() method on the notificationServiceStub object of type NotificationServiceGrpc.NotificationServiceBlockingStub to broadcast the message. The policy works for non-blocking stubs as well.

It’s advisable to use a dedicated class for creating ManagedChannel objects. This allows for centralized management, including the configuration of retry policies.

To demonstrate the retry feature, the NotificationServiceImpl class in the server is designed to be randomly out of service. Let’s take a look at the GrpcBroadcastingClient in action:

@Test
void whenMessageBroadCasting_thenSuccessOrThrowsStatusRuntimeException() {
    try {
        NotificationResponse notificationResponse = GrpcBroadcastingClient.sendNotification(managedChannel);
        assertEquals("Message received: Warning - Heavy rains expected", notificationResponse.getResponse());
    } catch (Exception ex) {
        assertTrue(ex instanceof StatusRuntimeException);
    }
}

The test invokes sendNotification() on the GrpcBroadcastingClient class, which calls the server-side remote procedure to broadcast messages. We can examine the logs to verify the retries:

test log

6. Conclusion

In this article, we explored the retry policy feature of the gRPC library. The ability to set up the policy declaratively through a JSON file is powerful. However, we should reserve programmatic configuration for testing scenarios or for cases where name resolution doesn’t supply a service config.

Retrying failed requests can lead to unpredictable outcomes, so we should be careful to enable it only for idempotent transactions.

As usual, the code used for this article is available over on GitHub.