Overview
The site found here is a simple one, but I would sometimes find it no longer running after a few hours or days. Without any logging, it was impossible to understand why it was crashing.
Adding Logging
So I made some changes to the code base to add logging and, importantly, to write those logs to a file so I can read them once I notice the site is no longer running.
The important part here is the change to config/prod.exs:
import Config

config :logger, :default_handler,
  config: [
    file: ~c"logs.log",
    filesync_repeat_interval: 5000,
    file_check: 5000,
    max_no_bytes: 10_000_000,
    max_no_files: 5,
    compress_on_rotate: true
  ],
  format: "$date $time $metadata[$level] $message"
This configuration makes the logger write its logs to a file instead of standard output. It only applies when running in prod, so when running locally the logs are still sent to the terminal.
With the logs in place, I pushed out a new build and waited for the crash to happen again.
Not Enough Logging
Eventually the site did crash again, and when I went to check the logs, all I saw was a message from the runtime saying that the application was shutting down, with no reason why.
Although this wasn't much to go on, it did give me something worth searching for.
Shortly after beginning my search I learned about two logger configuration flags:
- handle_otp_reports: This is true by default and, as I understand it, controls whether Erlang and OTP messages are passed through to the logger.
- handle_sasl_reports: This is false by default, and it only takes effect when handle_otp_reports is true. This option enables crash reports from supervisors.
With this new knowledge I made those configuration changes and started a new build.
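For reference, enabling both flags might look roughly like the following. This is a minimal sketch that assumes the flags are set on the :logger application in config/prod.exs, next to the handler configuration above; the exact file the change went into is not recorded here.

```elixir
import Config

# Assumption: set alongside the file handler config in config/prod.exs.
config :logger,
  # Already true by default; listed explicitly because the next flag depends on it.
  handle_otp_reports: true,
  # False by default; turns on supervisor, crash, and progress reports.
  handle_sasl_reports: true
```

Both flags are read when the logger application starts, so the change only takes effect after a restart or a new build.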
Figuring Out the Issue
Eventually the site crashed again, and this time the logs had something. Unfortunately I no longer have the logs, but once I read them it was pretty obvious what was going on, and I also found a related GitHub issue.
In short, the site's connection to Bluesky was dropping, so the websocket would try to reconnect right away, but Bluesky wasn't yet ready to accept the request. This loop would repeat very quickly, and eventually Erlang/OTP would kill the supervisor for misbehaving.
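To illustrate why a fast crash loop takes the supervisor down, here is a small sketch of a supervisor with its restart limits written out. The module name MyApp.ExampleSupervisor is hypothetical, and I am assuming the limits were left at the Supervisor defaults of 3 restarts within 5 seconds; the site's actual supervision tree may differ.

```elixir
defmodule MyApp.ExampleSupervisor do
  # Hypothetical supervisor, only here to show the restart limits.
  use Supervisor

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  @impl true
  def init(_arg) do
    # The real websocket child would be listed here.
    children = []

    # If children are restarted more than max_restarts times within
    # max_seconds, the supervisor gives up and shuts itself down.
    # These values are the documented defaults, spelled out explicitly.
    Supervisor.init(children,
      strategy: :one_for_one,
      max_restarts: 3,
      max_seconds: 5
    )
  end
end
```

A child that fails on every immediate reconnect attempt burns through that budget in well under five seconds, and when a top-level supervisor exits the application goes down with it.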
The solution to work around this issue was to move the websocket connection into its own process and add some connection backoff logic.
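The general shape of that fix can be sketched as a GenServer that owns the connection and retries with exponential backoff. This is not the site's actual code: the module name ReconnectingClient, the connect_fun argument, and the 1 second base and 30 second cap are all assumptions made for illustration.

```elixir
defmodule ReconnectingClient do
  # Hypothetical sketch: a process that owns a connection and retries
  # failed connection attempts with exponential backoff.
  use GenServer

  @base_delay_ms 1_000
  @max_delay_ms 30_000

  # connect_fun is an assumed zero-arity function that returns
  # {:ok, conn} or {:error, reason}.
  def start_link(connect_fun) when is_function(connect_fun, 0) do
    GenServer.start_link(__MODULE__, connect_fun)
  end

  # Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ...
  # capped at 30s.
  def backoff_delay(attempt) when is_integer(attempt) and attempt >= 0 do
    min(@base_delay_ms * Integer.pow(2, attempt), @max_delay_ms)
  end

  @impl true
  def init(connect_fun) do
    # Connect after init/1 returns, so a failed first attempt does not
    # crash the process while its supervisor is starting it.
    send(self(), :connect)
    {:ok, %{connect_fun: connect_fun, conn: nil, attempt: 0}}
  end

  @impl true
  def handle_info(:connect, state) do
    case state.connect_fun.() do
      {:ok, conn} ->
        # Success resets the attempt counter.
        {:noreply, %{state | conn: conn, attempt: 0}}

      {:error, _reason} ->
        # Wait longer after each failure instead of crashing and being
        # restarted immediately by the supervisor.
        Process.send_after(self(), :connect, backoff_delay(state.attempt))
        {:noreply, %{state | conn: nil, attempt: state.attempt + 1}}
    end
  end
end
```

Because a failed connection attempt is handled inside the process rather than by crashing, it no longer counts against the supervisor's restart limit, and the growing delay gives the remote side time to become ready again.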