Finagle’s new response classifiers improve client’s avoidance of faulty nodes thus increasing your success rate. To get this benefit, you must wire up the application’s rules into your clients and how to do so is explained below.
First, a pop quiz — does Finagle treat an HTTP 500 response as a success or failure? How about a Thrift Exception?
If you answered failures, sadly, you are in for a surprise. Finagle
lacks application level domain knowledge of what kinds of responses
are failures. Without this, Finagle uses a conservative policy and
treats all Returns
as successful
and all Throws
as failures. Unfortunately, both HTTP 500s and Thrift
Exceptions are Returns
, and thus, successful responses.
By having you, the developers, give Finagle this application level
knowledge, it can then accurately track failures in its failure
accrual module which directly helps your client’s success rate.
Finagle’s built-in success rate metrics (e.g. clnt/tweetsvc/success
)
also become accurate and this in turn means you may be able remove
additional success rate metrics you may be wrapping on top of a
Finagle client.
In the future, we are considering wiring this into load balancing which enables us to penalize servers which are returning failures or partial results.
As of release 6.33 you wire up a ResponseClassifier
to your client. For HTTP clients, using
HttpResponseClassifier.ServerErrorsAsFailures
often works great as it
classifies any HTTP 5xx response code as a failure. For
Thrift/ThriftMux clients you may want to use
ThriftResponseClassifier.ThriftExceptionsAsFailures
which classifies
any deserialized Thrift Exception as a failure. For a large set of use
cases these should suffice. Classifiers get wired up to your client in
a straightforward manner, for example:
// Scala
import com.twitter.finagle.ThriftMux
import com.twitter.finagle.builder.ClientBuilder
import com.twitter.finagle.thrift.service.ThriftResponseClassifier
// Discoverable Parameters API
ThriftMux.client
...
.withResponseClassifier(ThriftResponseClassifier.ThriftExceptionsAsFailures)
// ClientBuilder API
ClientBuilder
...
.responseClassifier(ThriftResponseClassifier.ThriftExceptionsAsFailures)
// Java
import com.twitter.finagle.Http;
import com.twitter.finagle.builder.ClientBuilder;
import com.twitter.finagle.http.service.HttpResponseClassifier;
// Discoverable Parameters API
Http.client()
...
.withResponseClassifier(HttpResponseClassifier.ServerErrorsAsFailures());
// ClientBuilder API
ClientBuilder
...
.responseClassifier(HttpResponseClassifier.ServerErrorsAsFailures());
If a classifier is not specified on a client or if a user’s classifier
isn’t defined for a given request/response pair then
ResponseClassifier.Default
is used. This gives us the same behavior
Finagle had prior to classification — responses that are Returns
are
successful and Throws
are failures.
To do this we should understand the few classes used. A
ResponseClassifier
is a PartialFunction
from ReqRep
to
ResponseClass
.
Let’s work our way backwards through those, beginning with
ResponseClass
. This can be either Successful
or Failed
and those
values are self-explanatory. There are three constants which will
cover the vast majority of usage: Success
, NonRetryableFailure
and
RetryableFailure
. While as of today there is no distinction made
between retryable and non-retryable failures, it was a good
opportunity to lay the groundwork for use in the future.
A ReqRep
is a request/response struct with a request of type Any
and a response of type Try[Any]
. While the lack of typing may
initially disturb you, our hope is that it is not an issue in
practice. While all of this functionality is called response
classification, you’ll note that classifiers make judgements on both a
request and response.
Writing a custom PartialFunction
is easy in Scala given its syntactic
sugar. As with many things it is a bit more work from Java but is
doable. Here is an example that counts HTTP 503s as failures (for Java
examples, take a look at HttpResponseClassifierCompilationTest
and ResponseClassifierCompilationTest
):
// Scala
import com.twitter.finagle.http
import com.twitter.finagle.service.{ReqRep, ResponseClass, ResponseClassifier}
import com.twitter.util.Return
val classifier: ResponseClassifier = {
case ReqRep(_, Return(r: http.Response)) if r.statusCode == 503 =>
ResponseClass.NonRetryableFailure
}
Note that this PartialFunction
isn’t total which is ok due to
Finagle always using user defined classifiers in combination with
ResponseClassifier.Default
which will cover all cases.
Thrift and ThriftMux classifiers require a bit more care as the
request and response types are not as obvious. This is because there
is only a single Service
from Array[Byte]
to Array[Byte]
for all the
methods of an IDL’s service. To make this workable, there is support
in Scrooge and Thrift
/ThriftMux.newService
and
Thrift
/ThriftMux.newClient
code to deserialize the responses into the
expected application types so that classifiers can be written in terms
of the Scrooge generated request type, $Service.$Method.Args
, and the
method’s response type. Given an IDL:
exception NotFoundException { 1: string reason }
service SocialGraph {
i32 follow(1: i64 follower, 2: i64 followee) throws (1: NotFoundException ex)
}
One possible classifier would be:
val classifier: ResponseClassifier = {
// #1
case ReqRep(_, Throw(_: NotFoundException)) =>
ResponseClass.NonRetryableFailure
// #2
case ReqRep(_, Return(x: Int)) if x == 0 =>
ResponseClass.NonRetryableFailure
// #3
case ReqRep(SocialGraph.Follow.Args(a, b), _) if a <= 0 =>
ResponseClass.NonRetryableFailure
}
If you examine that classifier you’ll note a few things. First (#1),
the deserialized NotFoundException
can be treated as a failure. Next
(#2), a “successful” response can be examined to enable services using
status codes to classify errors. Lastly (#3), the request can be
introspected to make the decision.
It’s important to understand what the impact will be if you customize response classification for your client. Perhaps most importantly, when responses are classified as failures, this affects how failure accrual sees responses. In the past, you may have had a Thrift service returning nothing but exceptions, but this node would continue getting traffic due to failure accrual’s lack of visibility. While this changes lets you fix this visibility, you should consider what causes those responses. For example, if the service is simply proxying a failure from its downstream service, you may not want to count that as a failure.
There isn’t a strict rule on what is the right thing to do with classification. However, with some minimal thought, many services can improve their success rate both in terms of how it’s reported as well as through avoidance of bad nodes.
We’re really hopeful that this makes a significant difference in how well Finagle works for you but it needs you, the application developers, to make these choices.
If you have any questions on how to use this or feedback on how it’s working, please get in touch through @finagle or the Finaglers mailing list.