The document discusses challenges with observability in serverless applications and proposes solutions. Some key challenges include not having access to the underlying OS, nowhere to install agents/daemons, no background processing, higher concurrency to telemetry systems, and high chance of data loss if batching logs and metrics. The document covers solutions for logging, monitoring, and distributed tracing in serverless environments using services like CloudWatch Logs, API Gateway, and X-Ray.
8. Abraham Wald
Wald noted that the study only
considered the aircraft that had survived
their missions—the bombers that had
been shot down were not present for the
damage assessment.
The holes in the returning aircraft, then,
represented areas where a bomber could
take damage and still return home safely.
9. Abraham Wald
Wald noted that the study only
considered the aircraft that had survived
their missions—the bombers that had
been shot down were not present for the
damage assessment.
The holes in the returning aircraft, then,
represented areas where a bomber could
take damage and still return home safely.
11. survivor bias in monitoring
Only focus on failure modes that we were able to successfully
identify through investigation and postmortem in the past.
The bullet holes that shot us down and we couldn’t identify stay
invisible, and will continue to shoot us down.
15. In control theory, observability is a measure of how well
internal states of a system can be inferred from
knowledge of its external outputs.
https://en.wikipedia.org/wiki/Observability
26. These are the four pillars of the Observability Engineering
team’s charter:
• Monitoring
• Alerting/Visualization
• Distributed systems tracing infrastructure
• Log aggregation/analytics
“
” http://bit.ly/2DnjyuW- Observability Engineering at Twitter
36. About me
▪ Principal Engineer at DAZN
▪ AWS Serverless Hero
▪ Author of Production-Ready Serverless* by Manning
▪ Blogger**
▪ Speaker
* https://bit.ly/production-ready-serverless
** https://theburningmonk.com
48. user request
user request
user request
user request
user request
user request
user request
critical paths:
minimise user-facing latency
handler
handler
handler
handler
handler
handler
handler
49. user request
user request
user request
user request
user request
user request
user request
critical paths:
minimise user-facing latency
StatsD
handler
handler
handler
handler
handler
handler
handler
rsyslog
background processing:
batched, asynchronous, low
overhead
50. user request
user request
user request
user request
user request
user request
user request
critical paths:
minimise user-facing latency
StatsD
handler
handler
handler
handler
handler
handler
handler
rsyslog
background processing:
batched, asynchronous, low
overhead
NO background processing
except what platform provides
65. •high chance of data loss (if batching)
•nowhere to install agents/daemons
•no background processing
•higher concurrency to telemetry system
new challenges
75. •asynchronous invocations
•nowhere to install agents/daemons
•no background processing
•higher concurrency to telemetry system
•high chance of data loss (if batching)
new challenges
76. These are the four pillars of the Observability Engineering
team’s charter:
• Monitoring
• Alerting/Visualization
• Distributed systems tracing infrastructure
• Log aggregation/analytics
“
” http://bit.ly/2DnjyuW- Observability Engineering at Twitter
150. those extra 10-20ms for
sending custom metrics would
compound when you have
microservices and multiple
APIs are called within one slice
of user event
151. Amazon found every 100ms of latency cost them 1% in sales.
http://bit.ly/2EXPfbA
152. console.log(“hydrating yubls from db…”);
console.log(“fetching user info from user-api”);
console.log(“MONITORING|1489795335|27.4|latency|user-api-latency”);
console.log(“MONITORING|1489795335|8|count|yubls-served”);
timestamp metric value
metric type
metric namemetrics
logs
166. don’t span over async
invocations
good for identifying dependencies of a function,
but not good enough for tracing the entire call
chain as user request/data flows through the
system via async event sources.