Error Handling and Supportability

Background

In the previous post we scientifically drilled into designing useful API Logs. This post will cover error design patterns for APIs and UIs, with a focus on API Responses, Logs, Presentation and Lookup.

Goal: Fast Problem Resolution

An early focus on failures reduces stress and improves productivity. In production it can make the difference between these two outcomes, where the latter causes reputational damage for your company:

  • A problem for a high profile user takes 10 minutes to resolve
  • A problem for a high profile user takes 12 hours to resolve

Goal: Fast Time to Market

Even more important is the hidden cost when software and people do not have good processes for handling failures. Companies are rarely able to measure or quantify these costs:

  • Cryptic failures can block developers or testers
  • Production incidents can waste time for senior engineers
  • Delivery of other critical projects is frequently impacted

Design for Failure Scenarios

To meet these goals, companies must build software that expects failure. It is common these days to read about this approach when using platforms such as Kubernetes, but there is usually insufficient focus on coding.

The following sections of this post will describe some common failure scenarios and desired outcomes, along with design thinking to achieve them. The coding techniques are used in this blog’s code samples.

401 Error Responses

The first error case we will  consider is calling our OAuth secured API without an access token. Our APIs return a secure error object with a code and message. The HTTP 401 status is also included in the response:

{   
  "code": "invalid_token",
  "message": "Missing, invalid or expired access token"
}

The API logs the error, along with some context to include the cause, in case the 401 occurred for an unexpected reason, such as a misconfigured audience. Note that no call stack is logged for 4xx errors:

{
  "id": "39c9350e-b056-38e0-2bed-636a50ead25d",
  "utcTime": "2022-12-22T09:37:51.821Z",
  "apiName": "SampleApi",
  "operationName": "getCompanyTransactions",
  "hostName": "WORK",
  "method": "GET",
  "path": "/investments/companies/4/transactions",
  "resourceId": "4",
  "clientApplicationName": "FinalSPA",
  "statusCode": 401,
  "errorCode": "invalid_token",
  "millisecondsTaken": 4,
  "millisecondsThreshold": 500,
  "correlationId": "4b32057f-e204-db1f-5781-aa054c840e86",
  "sessionId": "832fd8c3-5fc2-e980-2f32-f88ae284f4e1",
  "errorData": {
    "statusCode": 401,
    "clientError": {
      "code": "invalid_token",
      "message": "Missing, invalid or expired access token"
    },
    "context": "JWT verification failed : signature verification failed"
  }
}

4xx Error Responses due to Invalid Input

In real world APIs you will want to validate all input early, to prevent deeper problems such as data corruption. This blog’s app have a couple of very simple input validation cases, just to cover the concept.

The API returns a Not Found for User response if the user tries to access data they are not entitled to, by editing the browser location. In the below screenshot, the user is not authorized to access company 3:

The error response returned from the API has a 404 status code, and the following response body:

{   
  "code": "company_not_found",
  "message": "Company 3 was not found for this user"
}

Similarly if a syntactically invalid ID such as ‘abc‘ is provided, then the API return a 400 error to indicate a malformed request:

{
  "code": "invalid_company_id",
  "message": "The company id must be a positive numeric integer"
}

Using Error Codes in Clients

The Transactions View in our UIs is able to handle these known API error codes gracefully, by redirecting the user back to the home page. More generally, useful error codes put clients in control:

private _isExpectedApiError(error: UIError): boolean {

    if (error.statusCode === 404 && error.errorCode === ErrorCodes.companyNotFound) {
        return true;
    }

    if (error.statusCode === 400 && error.errorCode === ErrorCodes.invalidCompanyId) {
        return true;
    }

    return false;
}

Error Codes per Component

For each of this blog’s UI and API code examples, an ‘Error Codes‘ module lists all of the known error codes. This provides a full list of error codes and can be published to interested parties when required.

export class ErrorCodes {

    public static readonly serverError = 'server_error';

    public static readonly invalidToken = 'invalid_token';

    public static readonly tokenSigningKeysDownloadError = 'jwks_download_failure';

    public static readonly insufficientScope = 'insufficient_scope';

    public static readonly userinfoFailure = 'userinfo_failure';

    public static readonly userInfoInvalidToken = 'invalid_token';

    public static readonly exceptionSimulation = 'exception_simulation';
}

You can start small, where for example any server exception returns a generic code such as server_error. Then enhance your error processing over time, based on specific failures encountered.

Unexpected API 4xx Responses

Most 4xx errors should only occur during initial integration and not during production usage. However, UIs need a default action if a 4xx error ever occurs unexpectedly. This can be simulated in APIs by throwing an error:

throw ErrorFactory.createClientError(
    400,
    'invalid_text_data', 
    'Unsupported characters were encountered');

Client 4xx Error Displays

At this point the client is broken and usually needs to present an error response to the end user. Think through how you will manage this from both a support and usability viewpoint. When online sites present an error such as this, the experience is poor in both areas:

  • Oops – something has gone wrong

This blog’s apps present an Error Summary Link for the view that failed. The end user can either view details or perform a Navigation Action by clicking the home button, to try to recover:

If the red link is clicked, an Error Details View is presented. In some types of app this screen can be hidden away, as a last resort option. Aim to keep the user informed though, and also provide hints for support staff on the underlying cause:

The client should model errors as a first class object. You can then decide which properties to display. Secure properties should also be collected, but may be used differently, such as sending to an error service:

export class UIError extends Error {

    private _area: string;
    private _errorCode: string;
    private _userAction: string;
    private _utcTime: string;
    private _statusCode: number;
    private _instanceId: number;
    private _details: any;
    private _url: string;
}

Custom API 4xx Error Responses

There is not always a one size fits all solution for 4xx errors. Sometimes error responses need to be domain specific and contain more complex payloads, as in this example, which returns a collection of errors, each of which indicates which item failed:

[{   
  "code": "invalid_stock_item_id",
  "message": "The stock item supplied was not found",
  "key": 2
},
{
  "code": "invalid_quantity",
  "message": "The quantity must be a positive integer",
  "key": 5
}]

You should still be able to design a loose abstraction that all client errors can fit into. Custom errors can then translate to the response and log formats in custom ways when required:

export abstract class ClientError extends Error {

    public constructor(statusCode: number, errorCode: string, message: string);
    
    public abstract getStatusCode(): number;

    public abstract getErrorCode(): string;

    public abstract toResponseFormat(): any;

    public abstract toLogFormat(): any;
}

Aim to Fail on First Error

Unless there is a good reason, I avoid returning API Composite Errors and use a Fail on First Error model. A single error item is usually much easier for clients to code against, and also simpler for the API to implement reliably.

This can often work well in cases where a form submits multiple values, since client side validation should deal with simple failures such as missing or malformed input, to ensure that multiple invalid fields submitted to the server is rare.

API Validation Frameworks

Developers are often attracted to vendor solutions that validate fields via declarative annotations. This is fine, but first ensure that the framework fits into your wider plan, and check requirements such as these:

Requirement Description
Controllable Format It must be possible to override default error responses to fit your error model
Reliable Rules must handle nuances such as leading and trailing white space
Extensible It must be possible to create custom validators in code when required
Complex Validation Complex rules must be supported, that operate on multiple input fields, yet use the same response format

API 5xx Error Response Formats

API 500 errors most commonly occur due to either bugs, misconfiguration or temporary infrastructure problems. You will then require extra fields to help with fast problem resolution. This blog’s APIs return an extended error response in this case:

{
	"code": "decryption_error",
	"message": "An unexpected exception occurred in the API",
	"area": "BasicApi",
	"id": 88146,
	"utcTime": "2019-05-06T12:42:30.357Z"
}

The three extra fields provide the Where, Which and When of the failure, to help enable fast error lookup:

Field Meaning
area Where in your API platform to look for the root cause
id Which log entry in that API’s logs contains the error details
utcTime When the error occurred

Note also that the root cause of the error may have been in an Upstream API rather than the API that returned the error to the caller.

Client 5xx Error Displays

For API 5xx errors, the key difference is that this blog’s apps render the extra fields in the API response. This includes an Error ID to represent a particular occurrence of the error:

API 5xx Error Logs

For 5xx errors, this blog’s APIs log the error details in the following format. This includes client and service error details. The service details including a call stack and any other useful information that could explain the cause. Some error fields are denormalized, to support error queries:


  "id": "efa6217b-7be1-f393-773d-6d8aa9a464b3",
  "utcTime": "2023-03-26T11:14:32.154Z",
  "apiName": "SampleApi",
  "operationName": "getCompanyList",
  "hostName": "WORK",
  "method": "GET",
  "path": "/investments/companies",
  "clientApplicationName": "FinalSPA",
  "userId": "a6b404b1-98af-41a2-8e7f-e4061dc0bf86",
  "statusCode": 500,
  "errorCode": "exception_simulation",
  "errorId": 76236,
  "millisecondsTaken": 1,
  "millisecondsThreshold": 500,
  "correlationId": "b95f6800-b236-863a-1f6f-23e7a12cf474",
  "sessionId": "832fd8c3-5fc2-e980-2f32-f88ae284f4e1",
  "errorData": {
    "statusCode": 500,
    "clientError": {
      "code": "exception_simulation",
      "message": "An unexpected exception occurred in the API",
      "id": 76236,
      "area": "SampleApi",
      "utcTime": "2023-03-26T11:14:32.154Z"
    },
    "serviceError": {
      "details": "",
      "stack": [
        "Error: An unexpected exception occurred in the API",
        "at Function.createServerError (/home/gary/dev/oauth.apisample.nodejs/src/plumbing/errors/errorFactory.ts:16:16)",
        "at CustomHeaderMiddleware.processHeaders (/home/gary/dev/oauth.apisample.nodejs/src/plumbing/middleware/customHeaderMiddleware.ts:27:36)",
        "at Layer.handle [as handle_request] (/home/gary/dev/oauth.apisample.nodejs/node_modules/express/lib/router/layer.js:95:5)",
        "at trim_prefix (/home/gary/dev/oauth.apisample.nodejs/node_modules/express/lib/router/index.js:328:13)",
        "at /home/gary/dev/oauth.apisample.nodejs/node_modules/express/lib/router/index.js:286:9",
        "at param (/home/gary/dev/oauth.apisample.nodejs/node_modules/express/lib/router/index.js:365:14)",
        "at param (/home/gary/dev/oauth.apisample.nodejs/node_modules/express/lib/router/index.js:376:14)",
        "at Function.process_params (/home/gary/dev/oauth.apisample.nodejs/node_modules/express/lib/router/index.js:421:3)",
        "at next (/home/gary/dev/oauth.apisample.nodejs/node_modules/express/lib/router/index.js:280:10)",
        "at ClaimsCachingAuthorizer.authorizeRequestAndGetClaims (/home/gary/dev/oauth.apisample.nodejs/src/plumbing/security/baseAuthorizer.ts:62:13)"
      ]
    }
  }
}

Error Communication

In some types of app, end users will communicate problems back to the software producing company, via one of these methods:

  • User sends a screenshot to a help desk email address
  • User phones a help desk person and reads out the red error number

This ‘fairly unique error ID‘ of 88641 is easy for a user to communicate by phone, even if they are not a strong English speaker. The error ID is then used by support staff to quickly look up the error in the log aggregation system, via the following type of query.

GET apilogs*/_search
{ 
  "query":
  {
    "match":
    {
      "errorId": 88641
    }
  }
}

Occasionally, multiple log entries will exist with the same error ID. In this case, other details, such as the time, will enable the most relevant entry to be quickly identified.

Error Chaining

In the below screenshot the root cause of the error is in an upstream API. The Online Sales API must return the most relevant error details to the UI:

  • area = Customers API
  • id = 97264

If the Online Sales API generated a new Error ID, the UI would not point to the root cause. Instead, the error handling in the Online Sales API must log and return error details from the Customers API. There will be two API log entries, both with the same values for these fields:

  • Error ID
  • Correlation ID
  • Session ID

The log aggregation query will return both errors, and the Customers API log entry will contain the root cause.

Error Rehearsal

Typically 500 errors are the most difficult to reproduce or test, but the ability to do so has considerable business value, and ensures that:

  • The API 500 error handling code is behaving as designed
  • The right people have access to production logs
  • People are trained on the incident resolution process

All of this blog’s Final UIs support rehearsing a back end 500 error by long pressing the Reload Data button for a few seconds:

To support error rehearsal, this blog’s APIs use a very small middleware class, which checks for a Custom Header and throws an exception when the header value matches the name of the API:

export class CustomHeaderMiddleware {

    public processHeaders(request: Request, response: Response, next: NextFunction): void {

        const apiToBreak = request.header('x-mycompany-test-exception');
        if (apiToBreak) {
            if (apiToBreak.toLowerCase() === this._apiName.toLowerCase()) {

                throw ErrorFactory.createServerError(
                    BaseErrorCodes.exceptionSimulation,
                    'An unexpected exception occurred in the API');
            }
        }

        next();
    }
}

For demo purposes, this blog’s apps send the custom header to the API after a long press of the Reload Data button. This enables a basic form of chaos testing, to choose an API to break.

Reliability Over Time

Not all errors are so deterministic. Occasionally your software will have deeper bugs, that occur intermittently. The best way to deal with these is to assign them an error code, then measure them.

The Technical Support Analysis post explains how you should be able to produce a report with a breakdown of API errors by type with frequencies. Once you have visibility you can plan actions for any error occurrences that are not expected, such as adding extra details to logs.

clientApplicationName|    apiName    |    operationName     |  statusCode   |     errorCode      |   frequency   
---------------------+---------------+----------------------+---------------+--------------------+---------------
FinalSPA             |SampleApi      |getCompanyTransactions|404            |company_not_found   |6              
FinalSPA             |SampleApi      |getCompanyList        |500            |exception_simulation|2              
FinalSPA             |SampleApi      |getCompanyList        |500            |server_error        |7              
BasicIosApp          |SampleApi      |getCompanyTransactions|500            |file_read_error     |1              
BasicAndroidApp      |SampleApi      |getCompanyTransactions|400            |invalid_company_id  |1              
BasicDesktopApp      |SampleApi      |getUserInfo           |500            |server_error        |2              

Error Coding Design

Error coding can usually be classified into steps, similar to the following. These steps are used by all of this blog’s code samples:

Step Description
Error Translation Catching an error at the earliest point, to include exception specific context in error models
Error Throwing Throwing a different exception after error translation, while keeping the stack trace of the original error
Error Logging Logging the error information in a structured format, that is easy to query later
Error Responses Returning the error details to an API client or displaying them to an end user

Error Translation and Throwing

When there is an exception, both our UI and API collect exception related data into their error models. APIs use both ClientError and ServerError objects, both of which contain fields focused on supportability:

export class ServerErrorImpl extends ServerError {

    private readonly _statusCode: number;
    private readonly _errorCode: string;
    private readonly _instanceId: number;
    private readonly _utcTime: string;
    private _details: any;

    public constructor(errorCode: string, userMessage: string, stack?: string | undefined) {

        super(userMessage);

        this._statusCode = 500;
        this._errorCode = errorCode;
        this._instanceId = Math.floor(Math.random() * (MAX_ERROR_ID - MIN_ERROR_ID + 1) + MIN_ERROR_ID);
        this._utcTime = new Date().toISOString();
        this._details = '';

        if (stack) {
            this.stack = stack;
        }
   }
}

Catching is done both when we want to assign a specific error code, or when we want to capture third party details. These concerns are best managed close to the source exception. Catching is only needed when dealing with infrastructure, and the code base should not contain many catch blocks:

public async validateToken(accessToken: string): Promise<JWTPayload> {

    try {

        const options = {
            algorithms: ['RS256'],
            issuer: this._configuration.issuer,
            audience: this._configuration.audience,
        };
        const result = await jwtVerify(accessToken, this._jwksRetriever.remoteJWKSet, options);

        return result.payload;

    } catch (e: any) {

        if (e.code === 'ERR_JOSE_GENERIC') {
            throw ErrorUtils.fromSigningKeyDownloadError(e, this._configuration.jwksEndpoint);
        }

        let details = 'JWT verification failed';
        if (e.message) {
            details += ` : ${e.message}`;
        }

        throw ErrorFactory.createClient401Error(details);
    }
}

Another example is shown here, where a file read problem includes the library’s call stack, and the rethrown error includes the file path that failed:

public async readData<T>(filePath: string): Promise<T> {

    try {

        const buffer = await fs.readFile(filePath);
        return JSON.parse(buffer.toString()) as T;

    } catch (e: any) {

        const error = ErrorFactory.createServerError(
            ErrorCodes.fileReadError,
            'Problem encountered reading data',
            e.stack);

        if (e instanceof Error) {
            error.setDetails(`File: ${filePath}, ${e.message}`);
        } else {
            error.setDetails(`File: ${filePath}, ${e}`);
        }

        throw error;
    }
}

The API code is able to indicate a 4xx condition by throwing a ClientError, or a 5xx condition by throwing a ServerError. It can assign an error code in both cases. The thrown error also contains a useful stack trace, from the third party library that does the lower level work.

Error Logging and Responses

Results are returned from APIs to clients via an exception middleware class. At this point the operations for toLogFormat and toResponseFormat are called:

export class UnhandledExceptionHandler {

    public handleException(exception: any, request: Request, response: Response, next: NextFunction): void {

        const perRequestContainer = ChildContainerHelper.resolve(request);
        const logEntry = perRequestContainer.get<LogEntryImpl>(BASETYPES.LogEntry);
        const error = ErrorUtils.fromException(exception);

        let clientError;
        if (error instanceof ServerError) {
            
            logEntry.setServerError(error);
            clientError = error.toClientError(this._configuration.apiName);

        } else {

            logEntry.setClientError(error);
            clientError = error;
        }

        const writer = new ResponseWriter();
        writer.writeObjectResponse(response, clientError.getStatusCode(), clientError.toResponseFormat());
    }
}

The app then receives the API error and renders the error to the end user. Both the API and client are therefore consumer focused:

export class ErrorFormatter {

    public getErrorLines(error: UIError): ErrorLine[] {

        const lines: ErrorLine[] = [];

        lines.push(this._createErrorLine('User Action', error.userAction, 'highlightcolor'));

        if (error.message.length > 0) {
            lines.push(this._createErrorLine('Info', error.message));
        }

        if (error.utcTime.length > 0) {
            const displayTime = moment(error.utcTime).format('DD MMM YYYY HH:mm:ss');
            lines.push(this._createErrorLine('UTC Time', displayTime));
        }

        if (error.area.length > 0) {
            lines.push(this._createErrorLine('Area', error.area));
        }

        if (error.errorCode.length > 0) {
            lines.push(this._createErrorLine('Error Code', error.errorCode));
        }

        if (error.instanceId > 0) {
            lines.push(this._createErrorLine('Instance Id', error.instanceId.toString(), 'errorcolor'));
        }

        if (error.statusCode > 0) {
            lines.push(this._createErrorLine('Status Code', error.statusCode.toString()));
        }

        return lines;
    }
}

Portable Design

Technology stacks often provide only basic error handling. You usually need to design error handling based on your company’s requirements. The above examples showed some commonly desired behaviour, which can be coded in any language.

Where Are We?

We have discussed some techniques to enable software companies to deal with errors in both backend and frontend code, in a scalable manner. This behaviour is implemented in all of this blog’s code samples.

Next Steps

  • We will describe the behaviour of this blog’s final Node.js API
  • For a list of all blog posts see the Index Page