© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Applying Principles of Chaos
Engineering to Serverless
Yan Cui
Principal Engineer
DAZN
DVC305
Agenda
What is chaos engineering?
New challenges with serverless
Applying latency injection to serverless
Applying error injection to serverless
After the talk
Slides will be shared on SlideShare
Recording will be posted on YouTube within 48 hours
Find the links on https://theburningmonk.com/reinvent2018
What is chaos engineering?
Chaos engineering is the discipline of experimenting on a distributed system
in order to build confidence in the system’s capability
to withstand turbulent conditions in production.
- principlesofchaos.org
Smallpox
Earliest evidence of the disease is in a third century BC Egyptian mummy
Estimated 400K deaths per year in eighteenth century Europe
History of vaccination
First vaccine was developed in
1796 by Edward Jenner (published in 1798)
https://en.wikipedia.org/wiki/Edward_Jenner
History of vaccination
WHO certified global eradication
in 1980
https://en.wikipedia.org/wiki/Edward_Jenner
https://en.wikipedia.org/wiki/Vaccine
History of vaccination
History of vaccination
Vaccination is the most effective method to prevent infectious diseases
History of vaccination
Vaccines stimulate the immune system to recognize and destroy the
disease before contracting it for real
Chaos engineering
Use controlled experiments to inject failures into our system
Chaos engineering
Helps us learn about our system’s behavior and uncover unknown failure
modes, before they manifest like wildfire in production
Chaos engineering
Lets us build confidence in its ability to withstand turbulent conditions
Chaos engineering is the vaccine against frailties in modern software
Who am I?
Principal engineer at DAZN
AWS Serverless Hero
Author of Production-Ready Serverless* course by Manning.
Blogger**, speaker.
* https://bit.ly/production-ready-serverless
** https://theburningmonk.com
About DAZN
Available in seven countries—Austria, Switzerland, Germany,
Japan, Canada, Italy, and USA
Available on 30+ platforms
About DAZN
Around 1,000,000 concurrent viewers at peak
Chaos engineering has an image problem
Chaos engineering has an image problem
Too much emphasis is on breaking things
Chaos engineering has an image problem
Easy to conflate the act of injecting failures with the payoff
Chaos engineering has an image problem
The goal is to learn about the system and build confidence
Chaos engineering has an image problem
The goal is not to break things
Chaos in practice
Four steps to start running chaos
experiments yourself
STEP 1. Define “steady state”
What does the normal working
condition look like?
this is not a
steady state
STEP 2. Hypothesize steady state will
continue in both control group
& the experiment group
In other words, you should have a reasonable degree of
confidence the system would handle the failure before you
proceed with the experiment
Chaos in practice
Explore unknown unknowns away from production
Chaos in practice
Experiments that graduate to production should be carefully
considered and planned
You should have reasonable confidence in the system before
running experiments in production
Chaos in practice
Treat production with the care it deserves
The goal is not to break things
Chaos in practice
If you knew the system would break and you did it anyway,
then it’s not a chaos experiment!
It’s called being irresponsible.
STEP 3. Inject realistic failures
For example: server crash, network
error, hard disk malfunction, and more
Chaos in practice
Netflix’s Simian Army:
https://github.com/Netflix/SimianArmy
Chaos Engineering ebook (O’Reilly): http://oreil.ly/2tZU1Sn
STEP 4. Disprove hypothesis
In other words, look for a difference
in steady state
Chaos in practice
Look for evidence that steady state was impacted by the
injected failure
Chaos in practice
Address weaknesses before failures happen for real
Containment
Experiments need to be controlled
The goal is not to break things
Containment
Ensure everyone knows what you are doing
Don’t surprise your teammates
Containment
Run experiments during office hours
Containment
Avoid important dates
Containment
Make the smallest change necessary to prove or disprove hypothesis
Containment
Have a rollback plan
Stop the experiment right away if things start to go wrong
Containment
Don’t start in production
You can learn a lot by running experiments in staging
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
New challenges with serverless
Chaos Monkey kills an Amazon Elastic Compute Cloud (Amazon EC2) instance
Latency Monkey induces artificial delay in APIs
Chaos Gorilla kills an AWS Availability Zone
Chaos Kong kills an entire AWS Region
Serverless challenges
There are no servers that you can access and kill
There is more inherent chaos and complexity in a
serverless architecture.
Serverless challenges
Smaller units of deployment, but a lot more of them
Serverless challenges
[Diagram: serverful vs. serverless]
Serverless challenges
Every function needs to be correctly configured and secured
Serverless challenges
[Diagram: Kinesis, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, and SES around a "?"]
Serverless challenges
A lot of managed, intermediate services
Each with its own set of failure modes
Serverless challenges
Unknown failure modes in the infrastructure we don’t control
Serverless challenges
Often there’s little we can do when an outage occurs in the platform
Common weaknesses
Common weaknesses
Improperly tuned timeouts
Common weaknesses
Missing error handling
Common weaknesses
Missing fallback
Common weaknesses
Missing regional failover
Latency injection with serverless
STEP 1. Define “steady state”
What does the normal working
condition look like?
Defining steady state
What metrics do you use?
Defining steady state
p95/p99 latencies, error count, backlog size, yield*, harvest**
* percentage of requests completed
** completeness of the returned response
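A minimal sketch of how some of these metrics might be computed from raw request records. The record shape (`latency_ms`, `ok`) and the nearest-rank percentile method are assumptions for illustration, not something the talk prescribes. Harvest is left out because it depends on what a "complete" response means for a given API.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    latency_ms: float  # hypothetical field: end-to-end latency of the request
    ok: bool           # hypothetical field: True if the request completed


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(requests: List[Request]) -> dict:
    """Summarise a window of requests into steady-state metrics."""
    latencies = [r.latency_ms for r in requests]
    completed = sum(1 for r in requests if r.ok)
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(requests) - completed,
        "yield": completed / len(requests),  # fraction of requests completed
    }


# 100 synthetic requests with latencies 1..100 ms; requests 50 and 100 fail.
sample = [Request(latency_ms=float(i), ok=(i % 50 != 0)) for i in range(1, 101)]
metrics = steady_state(sample)
```

Computing the same summary for the control group and the experiment group gives two dictionaries that can be compared when looking for a difference in steady state.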
STEP 2. Hypothesize steady state will
continue in both control group
& the experiment group
In other words, you should have a reasonable degree of
confidence the system would handle the failure before you
proceed with the experiment
Serverless considerations
[Diagram: API Gateway]
Serverless considerations
Consider the effect of cold starts
How do they affect your strategy
for handling slow responses?
Request timeouts
Strategy should:
1. Give requests the best chance to succeed
2. Do not allow a slow response to time out the calling function
Request timeouts
Finding the right timeout value is tricky
Request timeouts
Too short: requests not given the best chance to succeed
Request timeouts
Too long: risk timing out the calling function
Request timeouts
Even more complicated when you have multiple integration points
Approach 1: Split invocation time equally
(for example, 3 requests, 6s function timeout = 2s timeout per request)
Approach 2: Every request is given nearly all the invocation time
(for example, 3 requests, 6s function timeout = 5s timeout per request)
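The arithmetic behind the two approaches can be sketched as follows. This assumes the three requests are made one after another (the slides do not say), and the 1s held back in approach 2 is simply read off the example numbers.

```python
def worst_case_seconds(num_requests: int, per_request_timeout_s: float) -> float:
    """Total time spent if every sequential request runs all the way to its timeout."""
    return num_requests * per_request_timeout_s


FUNCTION_TIMEOUT_S = 6.0
NUM_REQUESTS = 3

# Approach 1: split the invocation time equally between requests.
equal_split_s = FUNCTION_TIMEOUT_S / NUM_REQUESTS

# Approach 2: give every request nearly all the invocation time.
# The 1s margin is an assumption taken from the example (6s -> 5s).
nearly_all_s = FUNCTION_TIMEOUT_S - 1.0

for name, timeout_s in [("equal split", equal_split_s), ("nearly all", nearly_all_s)]:
    total = worst_case_seconds(NUM_REQUESTS, timeout_s)
    fits = total <= FUNCTION_TIMEOUT_S
    print(f"{name}: {timeout_s}s per request, worst case {total}s, fits in function timeout: {fits}")
```

Under these assumptions, approach 1 always fits but caps each request at 2s even when earlier calls returned quickly, while approach 2 gives each request more room but a single slow call can leave too little time for the rest.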
Request timeouts
Proposal: set request timeouts dynamically based on
invocation time left
Request timeouts
Set timeout based on remaining invocation time
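The code shown on this slide did not survive in the transcript. As a hedged sketch of the idea, assuming a Python Lambda function: the Lambda context object exposes `get_remaining_time_in_millis()`, and the request timeout can be derived from it. The `reserve_ms` and `floor_ms` values and the `FakeContext` class are illustrative assumptions.

```python
class FakeContext:
    """Local stand-in for the Lambda context object (for running outside Lambda only)."""

    def __init__(self, remaining_ms: int):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self) -> int:
        return self._remaining_ms


def request_timeout_s(context, reserve_ms: int = 500, floor_ms: int = 100) -> float:
    """Timeout for the next request: time left in the invocation minus a reserve.

    reserve_ms is held back for recovery steps (logging, metrics, fallback);
    floor_ms prevents a zero or negative timeout. Both values are assumptions.
    """
    remaining_ms = context.get_remaining_time_in_millis()
    budget_ms = max(floor_ms, remaining_ms - reserve_ms)
    return budget_ms / 1000


early = request_timeout_s(FakeContext(6000))  # plenty of invocation time left
late = request_timeout_s(FakeContext(300))    # almost out of time
```

The returned value would then be passed to whatever HTTP client the function uses, for example as the `timeout` argument of `urllib.request.urlopen`.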
Recovery steps
Log the timeout with as much context as possible
The API, timeout value, correlation IDs, request object, and more
Recovery steps
Record custom metrics
Recovery steps
Use fallbacks
Recovery steps
Be mindful when you sacrifice precision for availability
User experience is king
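The recovery steps above can be sketched in one small helper. Everything here is illustrative: the in-memory `metrics` dict stands in for a real metrics client, and the API name, correlation ID, and fallback value are made up. Real HTTP clients raise their own timeout exception types, which would need to be caught instead of the built-in `TimeoutError`.

```python
import json
import logging

logger = logging.getLogger("timeouts")
metrics = {}  # hypothetical in-memory stand-in for a custom metrics client


def call_with_fallback(api_name, func, timeout_s, fallback, correlation_id, request):
    """Call func(timeout_s); on timeout, log context, record a metric, return the fallback."""
    try:
        return func(timeout_s)
    except TimeoutError:
        # Step 1: log the timeout with as much context as possible.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api_name,
            "timeout_s": timeout_s,
            "correlation_id": correlation_id,
            "request": request,
        }))
        # Step 2: record a custom metric.
        key = f"{api_name}.timeout"
        metrics[key] = metrics.get(key, 0) + 1
        # Step 3: use the fallback.
        return fallback


def slow_recommendations(timeout_s):
    """Simulated downstream call that always times out."""
    raise TimeoutError("simulated slow downstream")


result = call_with_fallback(
    api_name="recommendations",
    func=slow_recommendations,
    timeout_s=2.0,
    fallback=["top-10-default"],
    correlation_id="abc-123",
    request={"user_id": "u1"},
)
```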
STEP 3. Inject realistic failures
For example: server crash, network
error, hard disk malfunction, and more
Where to inject latency?
Hypothesis:
Function has appropriate timeout on its HTTP communications
and can degrade gracefully when these requests time out
Where to inject latency?
Should be applied to third-party services too
DynamoDB, Twilio, Auth0 …
Where to inject latency?
Be mindful of the blast radius of the experiment
The goal is not to break things
Where to inject latency?
[Diagram: public-api-a and public-api-b each call internal-api through an HTTP client]
Hypothesis:
All functions have appropriate timeout on their HTTP
communications to this internal API and can degrade
gracefully when requests time out
Where to inject latency?
Large blast radius, can cause cascade failures unintentionally
Priming (psychology):
Priming is a technique whereby exposure to one stimulus
influences a response to a subsequent stimulus, without
conscious guidance or intention
It is a technique in psychology used to train a person's
memory both in positive and negative ways
Use failure injection to program your colleagues into
thinking about failure modes early.
Where to inject latency?
Make X% of all requests slow
in the dev environment
Hypothesis:
The client app has appropriate timeout on its HTTP
communication with the server and can degrade gracefully
when requests time out
Where to inject latency?
STEP 4. Disprove hypothesis
In other words, look for a difference
in steady state
How to inject latency?
Static weavers (such as PostSharp, AspectJ)
Dynamic proxies
https://theburningmonk.com/2015/04/design-for-latency-issues/
How to inject latency?
Manually crafted wrapper libraries
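The wrapper code on the following slides is not preserved in the transcript. A minimal sketch of what a hand-written latency-injecting wrapper might look like is shown below; it is not the speaker's implementation. The config keys (`is_enabled`, `rate`, `min_delay_ms`, `max_delay_ms`) and the `get_config` callable are assumptions.

```python
import random
import time
from functools import wraps


def inject_latency(get_config, sleep=time.sleep, rand=random.random):
    """Decorator factory: delay a fraction of calls according to the current config.

    get_config is a hypothetical callable returning a dict such as
    {"is_enabled": True, "rate": 0.25, "min_delay_ms": 100, "max_delay_ms": 500}.
    sleep and rand are injectable so the wrapper can be exercised without waiting.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            config = get_config()
            if config.get("is_enabled") and rand() < config.get("rate", 0):
                low = config.get("min_delay_ms", 0)
                high = config.get("max_delay_ms", low)
                delay_ms = low + (high - low) * rand()
                sleep(delay_ms / 1000)
            return func(*args, **kwargs)
        return wrapper
    return decorator


delays = []  # records requested sleeps instead of actually sleeping
config = {"is_enabled": True, "rate": 1.0, "min_delay_ms": 200, "max_delay_ms": 200}


@inject_latency(lambda: config, sleep=delays.append)
def get_user(user_id):
    return {"id": user_id}


user = get_user("u1")            # delayed: rate is 1.0
config["is_enabled"] = False
user_again = get_user("u1")      # not delayed: injection switched off
```

Reading the config on every call means an experiment can be started and stopped without redeploying, which fits the containment advice to stop right away if things go wrong.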
Configured in SSM Parameter Store
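A sketch of loading such a configuration from SSM Parameter Store, assuming it is stored as a JSON string. With boto3 the client would be created with `boto3.client("ssm")`, and `get_parameter(Name=...)` returns the value under `["Parameter"]["Value"]`. The parameter name, the TTL-based caching, and the `FakeSsm` class are assumptions added so the example runs without AWS access.

```python
import json
import time


def make_config_loader(ssm_client, parameter_name, ttl_s=60.0, clock=time.monotonic):
    """Return a function that reads JSON config from SSM and caches it for ttl_s seconds."""
    cache = {"value": None, "expires_at": 0.0}

    def load():
        now = clock()
        if cache["value"] is None or now >= cache["expires_at"]:
            response = ssm_client.get_parameter(Name=parameter_name)
            cache["value"] = json.loads(response["Parameter"]["Value"])
            cache["expires_at"] = now + ttl_s
        return cache["value"]

    return load


class FakeSsm:
    """Local stand-in that mimics the shape of the get_parameter response."""

    def __init__(self, value):
        self.value = value
        self.calls = 0

    def get_parameter(self, Name):
        self.calls += 1
        return {"Parameter": {"Name": Name, "Value": self.value}}


fake = FakeSsm('{"is_enabled": true, "rate": 0.1, "min_delay_ms": 100, "max_delay_ms": 500}')
load_config = make_config_loader(fake, "/chaos/get-user")  # hypothetical parameter name
first = load_config()
second = load_config()  # served from the cache, no second SSM call
```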
No injected latency
With injected latency
Factory wrapper function
(think bluebird’s promisifyAll function)
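A rough Python analogue of such a factory function is sketched below: it takes an existing client object and returns a proxy in which every public method is wrapped. The names `wrap_all`, `with_recording`, and `UserClient` are hypothetical; in practice the wrapper passed in would be a latency- or error-injecting decorator rather than a call recorder.

```python
def wrap_all(target, wrap):
    """Return a proxy whose public callable attributes are passed through wrap()."""
    class Proxy:
        def __getattr__(self, name):
            attr = getattr(target, name)
            if callable(attr) and not name.startswith("_"):
                return wrap(attr)
            return attr

    return Proxy()


calls = []


def with_recording(func):
    """Hypothetical wrapper used for the demo; it only records which method was called."""
    def wrapper(*args, **kwargs):
        calls.append(func.__name__)
        return func(*args, **kwargs)
    return wrapper


class UserClient:
    region = "eu-west-1"

    def get_user(self, user_id):
        return {"id": user_id}

    def list_users(self):
        return []


client = wrap_all(UserClient(), with_recording)
user = client.get_user("u1")
users = client.list_users()
```

The appeal of this design is that calling code keeps using the client as before, so failure injection can be added to a whole SDK client in one line.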
Error injection with serverless
Common errors
HTTP 5XX
Amazon DynamoDB provisioned throughput exceeded
Throttled AWS Lambda invocations
Hypothesis:
Function has appropriate error handling on its HTTP communications
and can degrade gracefully when downstream dependencies fail
Where to inject errors?
Hypothesis:
Function has appropriate error handling on DynamoDB operations and
can degrade gracefully when DynamoDB provisioned throughput is exceeded
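One way to test this hypothesis at the application level is to wrap the data-access call so that it raises an error at a configured rate. The sketch below is an assumption-laden illustration: `InjectedThroughputExceeded` is a made-up exception standing in for the real error (with boto3 this would surface as a `ClientError` carrying the `ProvisionedThroughputExceededException` code), and `read_profile` does not talk to DynamoDB.

```python
import random
from functools import wraps


class InjectedThroughputExceeded(Exception):
    """Hypothetical stand-in for a DynamoDB provisioned-throughput error."""


def inject_error(rate, error_factory, rand=random.random):
    """Decorator factory: raise error_factory() on roughly `rate` of all calls."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if rand() < rate:
                raise error_factory()
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_error(rate=1.0, error_factory=lambda: InjectedThroughputExceeded("injected"))
def read_profile(user_id):
    """Pretend data-access call; with rate=1.0 every call fails."""
    return {"id": user_id, "source": "dynamodb"}


def get_profile(user_id):
    """Caller under test: degrades to a default profile when the read fails."""
    try:
        return read_profile(user_id)
    except InjectedThroughputExceeded:
        return {"id": user_id, "source": "fallback"}


profile = get_profile("u1")
```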
Where to inject errors?
Where to inject errors?
Induce Lambda throttling by temporarily setting a low reserved concurrency
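A sketch of how this could be scripted with a rollback built in. With boto3 the client would be `boto3.client("lambda")`, whose `put_function_concurrency` and `delete_function_concurrency` calls set and remove a function's reserved concurrency. The function name is hypothetical, `FakeLambda` exists only so the example runs without AWS access, and the cleanup assumes the function had no reserved concurrency before the experiment.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name, reserved_concurrency):
    """Temporarily cap a function's concurrency, then remove the cap again."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved_concurrency,
    )
    try:
        yield
    finally:
        # Assumption: the function had no reserved concurrency beforehand,
        # so rolling back means deleting the setting rather than restoring a value.
        lambda_client.delete_function_concurrency(FunctionName=function_name)


class FakeLambda:
    """Local stand-in that records the calls a real Lambda client would receive."""

    def __init__(self):
        self.calls = []

    def put_function_concurrency(self, FunctionName, ReservedConcurrentExecutions):
        self.calls.append(("put", FunctionName, ReservedConcurrentExecutions))

    def delete_function_concurrency(self, FunctionName):
        self.calls.append(("delete", FunctionName))


fake = FakeLambda()
with throttled(fake, "my-function", 1):  # hypothetical function name
    pass  # run the experiment and observe steady state here
```

Because the cleanup sits in a `finally` block, the cap is removed even if the experiment code raises, which matches the advice to have a rollback plan.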
Recap
Failures are INEVITABLE
The only way to truly know your system’s
resilience against failures is to test it
through CONTROLLED experiments
The goal of chaos engineering is NOT to
actually break production
CONTAINMENT should be front and
centre of your thinking
STEP 1. Define “steady state”
What does the normal working
condition look like?
STEP 2. Hypothesize steady state will
continue in both control group
& the experiment group
In other words, you should have a reasonable degree of
confidence the system would handle the failure before you
proceed with the experiment
STEP 3. Inject realistic failures
For example: server crash, network
error, hard disk malfunction, and more
STEP 4. Disprove hypothesis
In other words, look for a difference
in steady state
There is more inherent chaos and
complexity in a serverless application
Even without servers, you can still inject
CONTROLLED failures at the application level
Thank you!
Yan Cui
@theburningmonk
https://theburningmonk.com
Related breakouts
Wednesday, Nov 28
SRV425-R - Best Practices for Building Multi-Region, Active-Active Serverless Applications
4:00PM – 5:00PM | Venetian, Level 4, Lando 4305
Wednesday, Nov 28
SRV343-R - Best Practices for Safe Deployments on AWS Lambda and Amazon API Gateway
4:45PM – 5:45PM | MGM, Level 1, South Concourse 105
Thursday, Nov 29
ARC308 - Chaos Engineering and Scalability at Audible.com
1:00PM – 2:00PM | Aria West, Level 3, Ironwood 5
Please complete the session
survey in the mobile app.
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.

More Related Content

What's hot

LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...
LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...
LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...Amazon Web Services
 
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019 Chaos Engineering in a Multi-Cloud World | Escape Conference 2019
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019 Ana Medina
 
From Code to a Running Container | AWS Floor28
From Code to a Running Container | AWS Floor28From Code to a Running Container | AWS Floor28
From Code to a Running Container | AWS Floor28Amazon Web Services
 
Using Security To Build
 With Confidence In AWS - Trend Micro
Using Security To Build
 With Confidence In AWS - Trend MicroUsing Security To Build
 With Confidence In AWS - Trend Micro
Using Security To Build
 With Confidence In AWS - Trend MicroAmazon Web Services
 
Using Security To Build With Confidence - Session Sponsored by Trend Micro
Using Security To Build With Confidence - Session Sponsored by Trend MicroUsing Security To Build With Confidence - Session Sponsored by Trend Micro
Using Security To Build With Confidence - Session Sponsored by Trend MicroAmazon Web Services
 
Modernizing on EKS (Keynote)- AWS Container Day 2019 Barcelona
Modernizing on EKS (Keynote)- AWS Container Day 2019 BarcelonaModernizing on EKS (Keynote)- AWS Container Day 2019 Barcelona
Modernizing on EKS (Keynote)- AWS Container Day 2019 BarcelonaAmazon Web Services
 
Jets: A Ruby Serverless Framework
Jets: A Ruby Serverless FrameworkJets: A Ruby Serverless Framework
Jets: A Ruby Serverless FrameworkTung Nguyen
 
NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...
NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...
NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...Amazon Web Services
 
5 Essential Techniques for Building Fault-tolerant Systems
5 Essential Techniques for Building Fault-tolerant Systems5 Essential Techniques for Building Fault-tolerant Systems
5 Essential Techniques for Building Fault-tolerant SystemsAtlassian
 
Device Testing with AWS Device Farm
Device Testing with AWS Device FarmDevice Testing with AWS Device Farm
Device Testing with AWS Device FarmAmazon Web Services
 
Chaos Engineering with Containers - QCon SF 2018
Chaos Engineering with Containers - QCon SF 2018 Chaos Engineering with Containers - QCon SF 2018
Chaos Engineering with Containers - QCon SF 2018 Ana Medina
 
An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018
An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018
An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018Amazon Web Services
 
Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019Akamai Netherlands
 
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...Amazon Web Services Korea
 
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSChaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSBilal Aybar
 
IoT from Cloud to Edge & Back Again - WebSummit 2018
IoT from Cloud to Edge & Back Again - WebSummit 2018IoT from Cloud to Edge & Back Again - WebSummit 2018
IoT from Cloud to Edge & Back Again - WebSummit 2018Boaz Ziniman
 
Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019Akamai Netherlands
 
re:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized Servicesre:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized ServicesCalvin French-Owen
 
Top 10 Tips for Securing and Scaling Atlassian Cloud
Top 10 Tips for Securing and Scaling Atlassian CloudTop 10 Tips for Securing and Scaling Atlassian Cloud
Top 10 Tips for Securing and Scaling Atlassian CloudAtlassian
 

What's hot (20)

LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...
LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...
LFS305_Automated Policy Enforcement for Real-Time Operations, Security, and C...
 
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019 Chaos Engineering in a Multi-Cloud World | Escape Conference 2019
Chaos Engineering in a Multi-Cloud World | Escape Conference 2019
 
From Code to a Running Container | AWS Floor28
From Code to a Running Container | AWS Floor28From Code to a Running Container | AWS Floor28
From Code to a Running Container | AWS Floor28
 
Using Security To Build
 With Confidence In AWS - Trend Micro
Using Security To Build
 With Confidence In AWS - Trend MicroUsing Security To Build
 With Confidence In AWS - Trend Micro
Using Security To Build
 With Confidence In AWS - Trend Micro
 
Using Security To Build With Confidence - Session Sponsored by Trend Micro
Using Security To Build With Confidence - Session Sponsored by Trend MicroUsing Security To Build With Confidence - Session Sponsored by Trend Micro
Using Security To Build With Confidence - Session Sponsored by Trend Micro
 
Modernizing on EKS (Keynote)- AWS Container Day 2019 Barcelona
Modernizing on EKS (Keynote)- AWS Container Day 2019 BarcelonaModernizing on EKS (Keynote)- AWS Container Day 2019 Barcelona
Modernizing on EKS (Keynote)- AWS Container Day 2019 Barcelona
 
Jets: A Ruby Serverless Framework
Jets: A Ruby Serverless FrameworkJets: A Ruby Serverless Framework
Jets: A Ruby Serverless Framework
 
NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...
NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...
NEW LAUNCH! AWS DeepLens workshop: Building Computer Vision Applications - MC...
 
Amazon guard duty_lab
Amazon guard duty_labAmazon guard duty_lab
Amazon guard duty_lab
 
5 Essential Techniques for Building Fault-tolerant Systems
5 Essential Techniques for Building Fault-tolerant Systems5 Essential Techniques for Building Fault-tolerant Systems
5 Essential Techniques for Building Fault-tolerant Systems
 
Device Testing with AWS Device Farm
Device Testing with AWS Device FarmDevice Testing with AWS Device Farm
Device Testing with AWS Device Farm
 
Chaos Engineering with Containers - QCon SF 2018
Chaos Engineering with Containers - QCon SF 2018 Chaos Engineering with Containers - QCon SF 2018
Chaos Engineering with Containers - QCon SF 2018
 
An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018
An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018
An Intro to AWS for Developers: AWS Developer Workshop at Web Summit 2018
 
Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019
 
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...
AWS SAM(Serverless Application Model) 을 이용한 백오피스 마이그레이션 (현창훈, HBSmith) :: AWS...
 
Chaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWSChaos engineering & Gameday on AWS
Chaos engineering & Gameday on AWS
 
IoT from Cloud to Edge & Back Again - WebSummit 2018
IoT from Cloud to Edge & Back Again - WebSummit 2018IoT from Cloud to Edge & Back Again - WebSummit 2018
IoT from Cloud to Edge & Back Again - WebSummit 2018
 
Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019Akamai Tech day Amsterdam 2019
Akamai Tech day Amsterdam 2019
 
re:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized Servicesre:Invent CON320 Tracing and Debugging for Containerized Services
re:Invent CON320 Tracing and Debugging for Containerized Services
 
Top 10 Tips for Securing and Scaling Atlassian Cloud
Top 10 Tips for Securing and Scaling Atlassian CloudTop 10 Tips for Securing and Scaling Atlassian Cloud
Top 10 Tips for Securing and Scaling Atlassian Cloud
 

Applying Chaos Engineering to Serverless

  • 1.
  • 2. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Applying Principles of Chaos Engineering to Serverless. Yan Cui, Principal Engineer, DAZN. DVC305
  • 3. Agenda: What is chaos engineering? New challenges with serverless. Applying latency injection to serverless. Applying error injection to serverless.
  • 4. After the talk: Slides will be shared on SlideShare. Recording will be posted on YouTube within 48 hours. Find the links on https://theburningmonk.com/reinvent2018
  • 5. What is chaos engineering?
  • 6. Chaos engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. - principlesofchaos.org
  • 7. Smallpox: Earliest evidence of disease in third century BC Egyptian mummy. Estimated 400K deaths per year in eighteenth century Europe.
  • 8. History of vaccination: First vaccine was developed in 1798 by Edward Jenner. https://en.wikipedia.org/wiki/Edward_Jenner
  • 9. History of vaccination: WHO certified global eradication in 1980. https://en.wikipedia.org/wiki/Edward_Jenner
  • 10. History of vaccination. https://en.wikipedia.org/wiki/Vaccine
  • 11. History of vaccination: Vaccination is the most effective method to prevent infectious diseases.
  • 12. History of vaccination: Vaccines stimulate the immune system to recognize and destroy the disease before contracting it for real.
  • 13. Chaos engineering: Use controlled experiments to inject failures into our system.
  • 14. Chaos engineering: Help us learn about our system’s behavior and uncover unknown failure modes, before they manifest like wildfire in production.
  • 15. Chaos engineering: Lets us build confidence in its ability to withstand turbulent conditions.
  • 16. Chaos engineering is the vaccine to frailties in modern software.
  • 17. Who am I? Principal engineer at DAZN. AWS Serverless hero. Author of Production-Ready Serverless* course by Manning. Blogger**, speaker. * https://bit.ly/production-ready-serverless ** https://theburningmonk.com
  • 18.
  • 19.
  • 20. About DAZN: Available in seven countries: Austria, Switzerland, Germany, Japan, Canada, Italy, and USA. Available on 30+ platforms.
  • 21. About DAZN: Around 1,000,000 concurrent viewers at peak.
  • 22. Chaos engineering has an image problem.
  • 23. Chaos engineering has an image problem: Too much emphasis is on breaking things.
  • 24. Chaos engineering has an image problem: Easy to conflate the action of injecting failures with the payback.
  • 25. Chaos engineering has an image problem: The goal is to learn about the system and build confidence.
  • 26. Chaos engineering has an image problem: The goal is not to break things.
  • 27. Chaos in practice: Four steps to start running chaos experiments yourself.
  • 28. STEP 1. Define “steady state”. What does a normal, working condition look like?
  • 29. this is not a steady state
  • 30. STEP 2. Hypothesize that the steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence the system would handle the failure before you proceed with the experiment.
  • 31. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos in practice Explore unknown unknowns away from production
  • 32. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos in practice Experiments that graduate to production should be carefully considered and planned You should have reasonable confidence in the system before running experiments in production
  • 33. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos in practice Treat production with the care it deserves The goal is not to break things
  • 34. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos in practice If you knew the system would break and you did it anyway, then it’s not a chaos experiment! It’s called being irresponsible.
  • 35. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. STEP 3. Inject realistic failures For example, server crash, network error, HD malfunction, more
  • 36. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos in practice Netflix’s Simian Army: https://github.com/Netflix/SimianArmy Chaos Engineering ebook (O’Reilly): http://oreil.ly/2tZU1Sn
  • 37. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. STEP 4. Disprove hypothesis In other words, look for difference in steady state
  • 38. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved. Chaos in practice Look for evidence that steady state was impacted by the injected failure
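As a rough illustration of looking for a difference in steady state, the sketch below compares metrics from a control group and an experiment group and reports the ones that moved by more than a relative tolerance. The metric names, the dictionary shape, and the 10% default tolerance are assumptions made for illustration; they are not values from the talk.

```python
def steady_state_deviations(control, experiment, tolerance=0.10):
    """Return metrics whose experiment value differs from the control value
    by more than `tolerance` (relative to the control value).

    Both arguments are hypothetical {metric_name: number} dictionaries.
    A control value of 0 means any change at all is reported.
    """
    deviations = {}
    for name, baseline in control.items():
        observed = experiment[name]
        allowed = abs(baseline) * tolerance
        if abs(observed - baseline) > allowed:
            deviations[name] = (baseline, observed)
    return deviations
```

An empty result means the experiment found no evidence against the hypothesis at this tolerance; any entry is a weakness worth investigating.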
  • 39. Chaos in practice: Address weaknesses before failures happen for real
  • 40. Containment: Experiments need to be controlled. The goal is not to break things.
  • 41. Containment: Ensure everyone knows what you are doing. Don’t surprise your teammates.
  • 42. Containment: Run experiments during office hours
  • 43. Containment: Avoid important dates
  • 44. Containment: Make the smallest change necessary to prove or disprove the hypothesis
  • 45. Containment: Have a rollback plan. Stop the experiment right away if things start to go wrong.
  • 46. Containment: Don’t start in production. You can learn a lot by running experiments in staging.
  • 47. By Russ Miles (@russmiles). Source: https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
  • 48. New challenges with serverless
  • 49. Chaos Monkey kills an Amazon Elastic Compute Cloud (Amazon EC2) instance. Latency Monkey induces artificial delay in APIs. Chaos Gorilla kills an AWS Availability Zone. Chaos Kong kills an entire AWS Region.
  • 51. Serverless challenges: There are no servers that you can access and kill
  • 54. There is more inherent chaos and complexity in a serverless architecture.
  • 55. Serverless challenges: Smaller units of deployment, but a lot more of them
  • 56. Serverless challenges (diagram labels: serverful, serverless)
  • 57. Serverless challenges: Every function needs to be correctly configured and secured
  • 58. Serverless challenges (diagram labels: Kinesis, ?, SNS, CloudWatch Events, CloudWatch Logs, IoT Core, DynamoDB, S3, SES)
  • 59. Serverless challenges: A lot of managed, intermediate services, each with its own set of failure modes
  • 60. Serverless challenges: Unknown failure modes in the infrastructure we don’t control
  • 61. Serverless challenges: Often there’s little we can do when an outage occurs in the platform
  • 62. Common weaknesses
  • 63. Common weaknesses: Improperly tuned timeouts
  • 64. Common weaknesses: Missing error handling
  • 65. Common weaknesses: Missing fallback
  • 66. Common weaknesses: Missing regional failover
  • 67. Latency injection with serverless
  • 68. STEP 1. Define “steady state”: What does a normal, working condition look like?
  • 69. Defining steady state: What metrics do you use?
  • 70. Defining steady state: p95/p99 latencies, error count, backlog size, yield*, harvest**. * percentage of requests completed ** completeness of the returned response
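To make these metrics concrete, here is a minimal sketch that derives them from a batch of request records. The record shape (`latency_ms`, `ok`, `fields_returned`, `fields_expected`) and the nearest-rank percentile method are assumptions for illustration; a real system would more likely read these values from its monitoring service.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestRecord:
    """Hypothetical per-request record; field names are illustrative."""
    latency_ms: float
    ok: bool               # the request completed successfully
    fields_returned: int   # how much of the response was actually filled in
    fields_expected: int   # how much a complete response would contain


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def steady_state(records):
    """Summarize a non-empty batch of records into steady-state metrics."""
    completed = [r for r in records if r.ok]
    latencies = [r.latency_ms for r in records]
    expected = sum(r.fields_expected for r in completed)
    returned = sum(r.fields_returned for r in completed)
    return {
        "p95_ms": percentile(latencies, 95),
        "p99_ms": percentile(latencies, 99),
        "error_count": len(records) - len(completed),
        # yield: fraction of requests completed
        "yield": len(completed) / len(records),
        # harvest: completeness of the responses that were returned
        "harvest": returned / expected if expected else 0.0,
    }
```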
  • 71. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment.
  • 72. Serverless considerations: API Gateway
  • 73. Serverless considerations: Consider the effect of cold starts. How do they affect your strategy for handling slow responses?
  • 74. Request timeouts: Strategy should:
  • 75. Request timeouts: Strategy should: 1. Give requests the best chance to succeed
  • 76. Request timeouts: Strategy should: 1. Give requests the best chance to succeed 2. Not allow a slow response to time out the calling function
  • 77. Request timeouts: Finding the right timeout value is tricky
  • 78. Request timeouts: Too short: requests are not given the best chance to succeed
  • 79. Request timeouts: Too long: risk timing out the calling function
  • 80. Request timeouts: Even more complicated when you have multiple integration points
  • 81. Approach 1: Split invocation time equally (for example, 3 requests, 6s function timeout = 2s timeout per request)
  • 82. Approach 2: Every request is given nearly all the invocation time (for example, 3 requests, 6s function timeout = 5s timeout per request)
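The arithmetic behind the two approaches can be sketched as follows. The function names are hypothetical, the 1-second reserve is inferred from the 6s/5s example, and the worst-case figure assumes the requests are made one after another.

```python
def equal_split_timeout(function_timeout_s, request_count):
    """Approach 1: divide the invocation time equally between requests."""
    return function_timeout_s / request_count


def near_full_timeout(function_timeout_s, reserve_s=1.0):
    """Approach 2: give each request nearly all the invocation time.
    reserve_s is an assumed margin kept back for the function itself."""
    return function_timeout_s - reserve_s


def worst_case_total(per_request_timeout_s, request_count):
    """Total time spent if every sequential request runs to its timeout."""
    return per_request_timeout_s * request_count
```

With 3 requests and a 6s function timeout, approach 1 caps each request at 2s even when the function has time to spare, while approach 2 allows up to 15s of sequential waiting inside a 6s invocation. These are the "too short" and "too long" failure modes described above.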
  • 83. Request timeouts: Proposal: set request timeouts dynamically based on invocation time left
  • 84. Request timeouts
  • 85. Set timeout based on remaining invocation time
  • 86. Set timeout based on remaining invocation time
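A minimal sketch of this idea for a Python Lambda function is shown below. `get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on the context object; the `reserve_ms` margin and the `FakeContext` stand-in are assumptions added for illustration and are not the code from the slides.

```python
def request_timeout_ms(context, reserve_ms=500):
    """Timeout for the next outbound request, derived from the invocation
    time left. reserve_ms (an assumed value) is held back so the function
    can still log, record metrics, and return a fallback response.
    Returns 0 when there is no time left to attempt the request."""
    remaining_ms = context.get_remaining_time_in_millis()
    return max(0, remaining_ms - reserve_ms)


class FakeContext:
    """Local stand-in for the Lambda context object (for testing only)."""

    def __init__(self, remaining_ms):
        self._remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        return self._remaining_ms
```

Calling `request_timeout_ms` again before each request means later requests automatically get shorter timeouts as the invocation time runs down.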
  • 87. Recovery steps: Log the timeout with as much context as possible: the API, timeout value, correlation IDs, request object, and more
  • 88. Recovery steps: Record custom metrics
  • 89. Recovery steps: Use fallbacks
  • 91. Recovery steps: Be mindful when you sacrifice precision for availability. User experience is king.
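The three recovery steps might be combined roughly as in the sketch below. The function name, the `fetch` callable, the log field names, and the in-memory `metrics` dictionary (standing in for a real custom-metrics client) are all hypothetical.

```python
import json
import logging

logger = logging.getLogger("request-timeouts")


def call_with_fallback(fetch, fallback, *, api, timeout_ms,
                       correlation_id, request, metrics):
    """Call fetch(request, timeout_ms); on timeout, log the context,
    count the timeout, and return the fallback value instead.
    `request` is assumed to be JSON-serializable."""
    try:
        return fetch(request, timeout_ms)
    except TimeoutError:
        # Step 1: log the timeout with as much context as possible.
        logger.warning(json.dumps({
            "message": "request timed out",
            "api": api,
            "timeout_ms": timeout_ms,
            "correlation_id": correlation_id,
            "request": request,
        }))
        # Step 2: record a custom metric (a plain counter in this sketch).
        key = f"{api}.timeout.count"
        metrics[key] = metrics.get(key, 0) + 1
        # Step 3: use the fallback.
        return fallback
```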
  • 92. STEP 3. Inject realistic failures: For example, server crash, network error, HD malfunction, and more
  • 93. Where to inject latency?
  • 94. Hypothesis: The function has appropriate timeouts on its HTTP communications and can degrade gracefully when these requests time out
  • 95. Where to inject latency?
  • 96. Where to inject latency? Should be applied to third-party services too: DynamoDB, Twilio, Auth0 …
  • 97. Where to inject latency?
  • 98. Where to inject latency? Be mindful of the blast radius of the experiment. The goal is not to break things.
  • 99. Where to inject latency? (diagram labels: http client, public-api-a, http client, public-api-b, internal-api)
  • 100. Hypothesis: All functions have appropriate timeouts on their HTTP communications to this internal API and can degrade gracefully when requests time out
  • 101. Where to inject latency?
  • 102. Where to inject latency?
  • 103. Where to inject latency? Large blast radius; can cause cascading failures unintentionally
  • 105. Priming (psychology): Priming is a technique whereby exposure to one stimulus influences a response to a subsequent stimulus, without conscious guidance or intention. It is a technique in psychology used to train a person's memory in both positive and negative ways.
  • 106. Use failure injection to program your colleagues into thinking about failure modes early.
  • 107. Where to inject latency? Make X% of all requests slow in the dev environment
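One way to make a percentage of requests slow is to wrap the function handler, as in this sketch. The decorator name, the `stage` argument, and the injectable `rng` and `sleep` parameters are assumptions for illustration; the guard on `stage` keeps the delay out of every environment except dev.

```python
import functools
import random
import time


def slow_in_dev(percent, delay_s, *, stage, rng=random.random, sleep=time.sleep):
    """Decorator that delays roughly `percent`% of invocations by delay_s
    seconds, but only when stage == "dev"."""
    def decorator(handler):
        @functools.wraps(handler)
        def wrapper(event, context):
            if stage == "dev" and rng() * 100 < percent:
                sleep(delay_s)  # the injected latency
            return handler(event, context)
        return wrapper
    return decorator
```

For example, `@slow_in_dev(percent=10, delay_s=3, stage="dev")` on a handler would slow about one request in ten, so client developers meet slow responses long before production.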
  • 108. Hypothesis: The client app has an appropriate timeout on its HTTP communication with the server and can degrade gracefully when requests time out
  • 109. Where to inject latency?
  • 110. Where to inject latency?
  • 112. Where to inject latency?
  • 113. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state
  • 114. How to inject latency?
  • 115. How to inject latency? Static weavers (such as PostSharp, AspectJ); dynamic proxies
  • 116. https://theburningmonk.com/2015/04/design-for-latency-issues/
  • 117. How to inject latency? Manually crafted wrapper libraries
  • 123. Configured in SSM Parameter Store
  • 125. No injected latency
  • 127. With injected latency
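The code shown on these slides did not survive in the transcript. As a hedged sketch of configuration-driven latency injection, the block below parses a JSON configuration string and applies a random delay when injection is enabled. The configuration keys (`isEnabled`, `probability`, `minDelayMs`, `maxDelayMs`) are hypothetical, not the author's actual schema.

```python
import json
import random
import time

# Safe defaults: with no configuration present, nothing is injected.
DEFAULT_CONFIG = {"isEnabled": False, "probability": 0.0,
                  "minDelayMs": 0, "maxDelayMs": 0}


def parse_config(raw_json):
    """Merge a JSON string over the defaults. In a Lambda function the string
    could be the value of an SSM parameter, e.g. read with boto3 via
    ssm.get_parameter(Name=...)["Parameter"]["Value"]."""
    return {**DEFAULT_CONFIG, **json.loads(raw_json)}


def inject_latency(config, *, rng=random, sleep=time.sleep):
    """Sleep for a random delay if injection is enabled and the dice roll
    falls within the configured probability. Returns the delay in ms,
    or 0 when no latency was injected."""
    if not config["isEnabled"] or rng.random() >= config["probability"]:
        return 0
    delay_ms = rng.randint(config["minDelayMs"], config["maxDelayMs"])
    sleep(delay_ms / 1000)
    return delay_ms
```

Keeping the switch in configuration rather than in code means an experiment can be stopped by changing one parameter, which fits the rollback advice given earlier.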
  • 129. Factory wrapper function (think bluebird’s promisifyAll function)
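A factory wrapper in this spirit could look like the sketch below, which returns a proxy that runs a hook before every public method of an existing client. The names `with_failure_injection` and `before_call` are hypothetical; the hook is where latency or errors would be injected.

```python
import functools


def with_failure_injection(client, before_call):
    """Return a proxy around `client` whose public methods call
    before_call(method_name) first. Non-callable and underscore-prefixed
    attributes are passed through untouched."""

    class Proxy:
        def __getattr__(self, name):
            attr = getattr(client, name)
            if name.startswith("_") or not callable(attr):
                return attr

            @functools.wraps(attr)
            def wrapped(*args, **kwargs):
                before_call(name)  # injection point: delay or raise here
                return attr(*args, **kwargs)

            return wrapped

    return Proxy()
```

Because the proxy exposes the same methods as the original client, calling code does not need to change when injection is switched on.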
  • 135. Error injection with serverless
  • 136. Common errors: HTTP 5XX; Amazon DynamoDB provisioned throughput exceeded; throttled AWS Lambda invocations
  • 137. Hypothesis: The function has appropriate error handling on its HTTP communications and can degrade gracefully when downstream dependencies fail
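A minimal sketch of testing this hypothesis: wrap the function's HTTP `send` callable so that a fraction of calls fail with a 5XX error, then check that the caller still returns something usable. `HttpError` is a stand-in for whatever exception the real HTTP client raises, and `get_recommendations` is a hypothetical caller.

```python
import random


class HttpError(Exception):
    """Stand-in for the error a real HTTP client raises on a 5XX response."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def with_http_errors(send, error_rate, status=503, rng=random.random):
    """Wrap `send` so that roughly error_rate (0.0 to 1.0) of calls raise
    HttpError instead of reaching the downstream service."""
    def wrapped(request):
        if rng() < error_rate:
            raise HttpError(status)
        return send(request)
    return wrapped


def get_recommendations(send, user_id):
    """Hypothetical caller that degrades gracefully: an empty list is
    returned when the downstream dependency fails."""
    try:
        return send({"path": f"/recommendations/{user_id}"})["body"]
    except HttpError:
        return []
```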
  • 138. Where to inject errors?
  • 139. Hypothesis: The function has appropriate error handling on DynamoDB operations and can degrade gracefully when DynamoDB throughput is exceeded
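The sketch below illustrates one way to exercise this hypothesis without touching a real table. `ThroughputExceeded` is a stand-in for the provisioned-throughput-exceeded error that DynamoDB returns, `failing_first` is a hypothetical injector, and the retry counts and delays are assumed values.

```python
import time


class ThroughputExceeded(Exception):
    """Stand-in for DynamoDB's provisioned throughput exceeded error."""


def failing_first(n, get_item):
    """Injector: raise ThroughputExceeded on the first n calls, then delegate."""
    state = {"calls": 0}

    def wrapped(key):
        state["calls"] += 1
        if state["calls"] <= n:
            raise ThroughputExceeded()
        return get_item(key)

    return wrapped


def get_item_with_retry(get_item, key, *, attempts=3, base_delay_s=0.05,
                        fallback=None, sleep=time.sleep):
    """Retry with exponential backoff; return `fallback` if every attempt
    is throttled, so the caller can degrade gracefully."""
    for attempt in range(attempts):
        try:
            return get_item(key)
        except ThroughputExceeded:
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    return fallback
```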
  • 140. Where to inject errors?
  • 141. Where to inject errors? Induce Lambda throttling by temporarily setting reserved concurrency
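As a sketch of making this change temporary, the context manager below sets reserved concurrency for the duration of an experiment and removes it afterwards, even if the experiment is aborted. `put_function_concurrency` and `delete_function_concurrency` are boto3 Lambda client operations; the helper name `throttled` is hypothetical, and the sketch assumes the function had no reserved concurrency configured beforehand, since the cleanup step deletes the setting rather than restoring a previous value.

```python
from contextlib import contextmanager


@contextmanager
def throttled(lambda_client, function_name, reserved):
    """Temporarily cap a function's concurrency to induce throttling.

    lambda_client is expected to behave like boto3.client("lambda").
    The cap is removed on exit, including when the body raises."""
    lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )
    try:
        yield
    finally:
        # Rollback: assumes no reserved concurrency was set before.
        lambda_client.delete_function_concurrency(FunctionName=function_name)
```

The experiment itself (driving load, observing steady-state metrics) would run inside the `with throttled(...)` block, so the rollback does not depend on anyone remembering to undo the change.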
  • 142. Recap
  • 144. The only way to truly know your system’s resilience against failures is to test it through CONTROLLED experiments
  • 146. The goal of chaos engineering is NOT to actually break production
  • 147. CONTAINMENT should be front and centre of your thinking
  • 149. STEP 1. Define “steady state”: What does a normal, working condition look like?
  • 150. STEP 2. Hypothesize that steady state will continue in both the control group and the experiment group. In other words, you should have a reasonable degree of confidence that the system would handle the failure before you proceed with the experiment.
  • 151. STEP 3. Inject realistic failures: For example, server crash, network error, HD malfunction, and more
  • 152. STEP 4. Disprove hypothesis: In other words, look for a difference in steady state
  • 153. There is more inherent chaos and complexity in a serverless application
  • 154. Even without servers, you can still inject CONTROLLED failures at the application level
  • 159. Thank you! Yan Cui, @theburningmonk, https://theburningmonk.com
  • 160. Related breakouts: Wednesday, Nov 28: SRV425-R - Best Practices for Building Multi-Region, Active-Active Serverless Applications, 4:00PM – 5:00PM | Venetian, Level 4, Lando 4305. Wednesday, Nov 28: SRV343-R - Best Practices for Safe Deployments on AWS Lambda and Amazon API Gateway, 4:45PM – 5:45PM | MGM, Level 1, South Concourse 105. Thursday, Nov 29: ARC308 - Chaos Engineering and Scalability at Audible.com, 1:00PM – 2:00PM | Aria West, Level 3, Ironwood 5.
  • 161. Please complete the session survey in the mobile app.