De-bugging a logger deadlock in elixir

26 Dec 2020

Note: This issue was fixed in elixir 1.11.1

At work, I got messaged about some test runs on CI randomly failing. I looked at the job and noticed that most of the tests were frozen. Tests were failing serially one after the other due to timeouts.

The job passed on retry though. I managed to reproduce the issue on local after a few attempts.

This seemed to be the common pattern, in the stack trace of the failing tests.

stacktrace:
  (stdlib) gen.erl:169: :gen.do_call/4
  (stdlib) gen_event.erl:239: :gen_event.rpc/2
  (logger) lib/logger.ex:692: Logger.__do_log__/3

Seemed like gen_event do_call called by Logger was hanging. Curious.

I had never seen this issue on production before. The volume of messages logged on test shouldn’t be nearly as big as production.

Also, I vaguely remembered reading about logger being async. Why was it blocking?

Taking a quick peak at logger’s documentation, I found a sync_threshold config.

if the Logger manager has more than :sync_threshold messages in its queue, Logger will change to sync mode, to apply backpressure to the clients.

Our configuration was still at default, which seemed to be 20.

I fired up observer, and kept doing test runs till I came across a failed run.

The message queue length for Logger was going up.

Here’s what the message queue looked like.

Messages switched from notify to sync_notify after ~20 messages. This explained why logger was blocking. Processes calling Logger were waiting for it to respond to these sync_notify messages.

But why was logger not processing any of these messages. Even the initial async notify messages did not seem to have been read.

The logger process status said ‘waiting’, which meant the receive block is stuck waiting for some message.

I started looking at it’s stack trace.

I saw a do_terminate and a report_terminate. Perhaps something crashed. But shouldn’t the client abort in such cases, why was it still waiting?

Scrolling up, the third line looked strange. It was a recursive call to Logger.__do_log__/3. The process was deadlocked trying to call itself.

I looked at gen_event’s code, to find that if a logger backend’s handle_call crashed, gen_event tried to report the crash by logging it. From the stack trace, it seemed to be calling elixir’s logger.

I looked at the logger’s documentation again, and found the handle_otp_reports config. I turned it off by setting handle_otp_reports: false.

On running the test a few more times, the tests seem to have stopeed freezing, and I saw the actual error.

=ERROR REPORT====
** gen_event handler {'Elixir.Logger.Backends.Console',<0.2833.0>} crashed.
** Was installed in 'Elixir.Logger'
** Last event was: {error,<0.64.0> ...

Why was the logger console backend crashing? Looks like it crashes if it can’t serialize metadata. Seems to have been fixed 1.10.x onwards.

Summary

Logger calls seem to deadlock if

If a logger backend has crashed and
If logger is in sync mode and
handle_otp_reports in turned on

In such cases,

Logger calls gen_event,
There’s a crash
gen_event calls logger to report termination.
But logger is in sync mode, deadlock.

Here’s a script which causes the issue. It tries to get Logger in sync mode before gen_event can print the crash.

Seems to cause freezes on 1.9.x, 1.10.x and 1.11

config :logger,
  backends: [
    LoggerBackendCrash
  ],
  handle_otp_reports: true,
  sync_threshold: 10

defmodule LoggerBackendCrash do
  @behaviour :gen_event

  @impl true
  def init(_) do
    {:ok, []}
  end

  @impl true
  def handle_call(_, _) do
    raise "handle_call_error"
    {:ok, []}
  end

  @impl true
  def handle_event(_, _) do
    raise "handle_event_error"
    {:ok, []}
  end

  @impl true
  def handle_info(_, _) do
    {:ok, []}
  end
end

Enum.each(1..20, fn cnt ->
  Logger.info("#{cnt}")
end)