So, you think you do quality assurance? Part 1: Intro to quality.

Martin Chaov
DraftKings Engineering
16 min read · Feb 24, 2023


Quality is an inseparable part of our lives; it is everywhere, from the quality of service in a restaurant and the quality of the things we purchase, to the quality of the software we use (and sometimes create ourselves). Yet while the need for quality is universal, achieving a high level of it takes hard work.

This two-part series covers a variety of topics related to quality assurance.

In Part 1 — Intro to Quality:

  • What is quality assurance?
  • The QA role
  • What is the purpose of testing?
  • What is a bug?
  • A thought experiment about the state of mind with regard to quality
  • How the QA team relates to the SDLC process

In Part 2 (Advanced Quality): metrics for assessing the SDLC, such as:

  • Lead time vs Total time
  • Cycle time
  • Mean time to recovery
  • Cost of fixing bugs in prod
  • Time to fix red build
  • Escaped bugs
  • and others

Jump ahead to part 2?

To deliver with quality, one needs a process focused on quality. The metrics listed above provide a good starting point for assessing the performance of the process in place and identifying improvement areas.

What is quality assurance?

quality [noun]

The standard of something when it is compared to other things like it; how good or bad something is; a high standard.

synonym → excellence

assurance [noun]

a statement that something will certainly be true or will certainly happen, particularly when there has been doubt about it

synonyms → guarantee, promise

In other words, “quality assurance” is a “guarantee for excellence”. To have a QA role on the team means a commitment to a high standard. In software, there are two separate notions of quality which complement each other; both are a must to succeed.

  • Functional quality → reflects how accurately the software system follows its design and functional specifications. These are the business requirements: what the system does.
  • Non-functional quality → the structural quality of the system, which supports the delivery of the functional requirements. Such attributes are often described as “-ilities”: scalability, maintainability, supportability, deployability, usability, availability, reliability, security, etc. These relate more to how the system does the what.

Lack of quality in a software system is most often expressed by (but not limited to):

  • faulty output
  • inability to access the resources or functions users requested
  • slow or unresponsive system
  • loss of data
  • security breaches

In this article the focus is on prevention, thus measuring and testing for the different aspects of quality is out of scope.

The QA Role

In different companies and businesses, there are variations in who is responsible for quality assurance and how this responsibility is distributed across the organization. There are cases where a dedicated QA Engineer is embedded into the engineering teams. There are also cases where quality is enforced via software engineering practices and is thus part of the software engineer’s job description. And there are cases where the QA role is an entirely separate vertical with a separate life cycle that operates in parallel to the R&D department. Whatever the circumstances may be, the goal of this article is to discuss quality assurance and how it shapes the development process. In that context, the QA Engineer/Role should be taken as a responsibility and not as a job title.

Do you trust yourself?

Do you work in software development? Ask yourself these questions:

  • Would you eat food that is prepared the way you develop software?
  • Would you pay for shoes made in the same process?
  • Would you drive a car produced and tested the way you work?
  • Would you board an airplane developed with the practices you are aware of?

If any of the answers are “no” → why the lack of confidence? What is missing so you could trust the product?

If any of the answers are “yes” → can you pinpoint what exactly gives you that confidence? Is it one thing, or is it an amalgamation of many?

One of the ways to ensure the output of your process is up to standards is to test it. However, there is a lot more to testing than verifying functional requirements.

What is the purpose of testing?

Verifying the correctness of a program has been a field so tough to crack that it has kept some of our great minds, the likes of Alan Turing and Edsger Dijkstra, busy. It is tempting to say that the purpose of testing is to verify if the software works. Testing for correctness is like taking a glass of water from the ocean and concluding there are no whales in the entire ocean. Validating a requirement with a test is the bare minimum and completely insufficient.

The problem comes from the fact that functional requirements usually define what the system’s output is expected to be. While sometimes there are requirements on how to handle business-related errors, very often the system can be broken because of the way it is built. There are no functional requirements for that because it is not part of what the system does but rather how it does it.

How many times have you seen a requirement such as: if microservice A goes down and queue B starts filling up, do C? All the system corner/abuse cases that come from implementation details should be identified and tested.
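To make this concrete, here is a minimal sketch (in Python) of turning one such implementation-detail scenario into a test: a consumer that must apply backpressure, rather than lose data, while its downstream dependency is unavailable. The class, the thresholds, and the fake downstream are all hypothetical; the point is only to show that such cases can be written down and tested.

```python
# A minimal sketch of testing an implementation-detail scenario:
# "if the downstream service is down and the queue starts filling up,
#  the consumer must stop accepting new work instead of losing data."
# All names and thresholds here are hypothetical.
import unittest


class DownstreamUnavailable(Exception):
    """Raised by the (faked) downstream dependency while it is offline."""


class OrderConsumer:
    def __init__(self, downstream, max_backlog=100):
        self.downstream = downstream
        self.backlog = []
        self.max_backlog = max_backlog

    def accept(self, message) -> bool:
        if len(self.backlog) >= self.max_backlog:
            return False  # apply backpressure instead of dropping data
        self.backlog.append(message)
        return True

    def drain(self) -> None:
        while self.backlog:
            self.downstream(self.backlog[0])  # may raise DownstreamUnavailable
            self.backlog.pop(0)               # remove only after a successful send


class QueueBackpressureTest(unittest.TestCase):
    def test_no_messages_lost_while_downstream_is_down(self):
        consumer = OrderConsumer(downstream=self._always_down, max_backlog=3)
        accepted = [consumer.accept(i) for i in range(5)]
        self.assertEqual(accepted, [True, True, True, False, False])
        with self.assertRaises(DownstreamUnavailable):
            consumer.drain()
        self.assertEqual(consumer.backlog, [0, 1, 2])  # nothing silently dropped

    @staticmethod
    def _always_down(_message):
        raise DownstreamUnavailable()


if __name__ == "__main__":
    unittest.main()
```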

Let’s compare this to the production of a paper coffee cup. Based on the functional requirements, we should be able to pour X ml of hot beverage into it. This tells us nothing about the process of producing the cup itself and how it could affect the final product. Based on the factory and materials that are going to be used for production, a multitude of things could come up:

  • one vendor could be using glue with unknown ingredients which, when exposed to hot liquids, could cause skin irritation
  • another could be close to a chemical plant, with the possibility of the water source in the area being contaminated
  • maybe the ink used to brand the cups could get dissolved by the moisture of the skin and stain the user

All of the above come from careful risk assessment of the chosen vendor and their practices. Testing based on requirements alone is not enough because it does not account for the process used to build the product. The same goes for software systems. Assessment of the tooling and architecture should inform the testing strategy.

So, what is the alternative?

The alternative is to test software systems to find all the ways you can break them apart! This includes functional requirements, behavior, system details, scaling, load balancing, deployment, supply chain, security, permutations of business cases, overloading system components, breakpoint testing, and many more. If you don’t find out how your system breaks, your users might do it inadvertently, and hackers are most definitely going to do it on purpose!
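One practical technique for finding such breakage, rather than confirming a single expected output, is property-based testing: state an invariant and let a framework search many input permutations for a counterexample. The sketch below assumes the hypothesis library; split_stake and its invariants are hypothetical examples, not anything prescribed by this article.

```python
# A minimal property-based testing sketch (assumes the `hypothesis` library).
# Instead of asserting one expected output, it states invariants and lets the
# framework search many input permutations for a counterexample.
# `split_stake` is a hypothetical function that splits an amount into n parts.
from hypothesis import given, strategies as st


def split_stake(total_cents: int, parts: int) -> list[int]:
    base, remainder = divmod(total_cents, parts)
    return [base + 1 if i < remainder else base for i in range(parts)]


@given(
    st.integers(min_value=0, max_value=10**9),
    st.integers(min_value=1, max_value=1_000),
)
def test_split_never_loses_money(total_cents, parts):
    split = split_stake(total_cents, parts)
    assert sum(split) == total_cents     # no cent is created or destroyed
    assert max(split) - min(split) <= 1  # the split is as even as possible
```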

The system is not a baby and the engineers are not its caring parents; it is not anybody’s masterpiece. It is a product. The way to survive on the market and against strong competition is to provide impeccable quality.

What is a bug?

A bug is a violation of an assumption [4]. Software bugs usually manifest in the code, but they only show where engineers’ assumptions broke down, not how they got to that point. In other words, what is observed is code that does not behave according to expectations, not how such code came to be.

Bugs are rarely just programmers’ mistakes. They are usually symptoms of badly understood requirements, gaps in the process, missing architecture, poorly defined SLAs and KPIs, etc. This is not one person’s mistake. It is a gap in the process, so understanding the root cause is important for identifying where the workflow needs adjusting to get a better result. There are questions that could help narrow down the origin of the problem:

  • What do I observe happening? What could the symptoms mean for my process?
  • Why didn’t someone observe this sooner?
  • Why did it break now and not last week?
  • Does anyone know how long the problem has been around?
  • What is the impact of my findings?
  • Could this problem or any of the steps followed here cause more problems, and what could they be?
  • What conditions allow for the problem to occur?
  • What can I do to prevent this problem from happening again?
  • How will the solution look and who will be responsible for it?

The QA role should lead the root cause analysis and fine-tune the SDLC process to eliminate the root cause.

A thought experiment…

Try to put yourself in this situation:

  1. Imagine you are responsible for the quality in a team on a very important system.
  2. The system is very late and has been delayed multiple times.
  3. Another delay would lead to loss of millions and a PR disaster.
  4. The system was not tested under all possible conditions. There are some known issues that didn’t cause incidents in the past.
  5. You lack confidence in the product but your management pushes you to sign it off for release.

Would you sign it off for release?

This is more or less the chain of events leading to the spectacular explosion in the photo below:

The space shuttle Challenger exploded 73 seconds after lift off. (Credit: Bruce Weaver/AP Photo)

This is what happened on January 28, 1986, when the Space Shuttle Challenger (OV-099) broke apart 73 seconds into its flight, leading to the death of all seven crew members aboard.

What was the problem? Quality mismanagement and a failure of the project’s management to understand the engineers’ concerns. Allan McDonald, who was the director of the Space Shuttle Solid Rocket Motor Project for the engineering contractor Morton Thiokol at the time, was concerned that below-freezing temperatures might impact the integrity of the solid rockets’ O-rings. He did not want to green-light the launch that day and refused to sign it off. What’s more, the engineering team was asked by management to prove that the O-rings would fail in order to cancel the flight. The general manager of Morton Thiokol (not an engineer) signed it off instead of McDonald. [2]

The rest is history. After his testimony, McDonald was effectively demoted from his position at Thiokol. His demotion was reported to the Rogers Commission, which displeased the company’s management. McDonald met with Thiokol’s top executives on May 16, 1986, where executives blamed him and another engineer for causing public relations concerns for the company. McDonald “was treated as a traitor and pariah by NASA and his own company, but, thanks in part to congressional pressure, was allowed to redesign the boosters …” Members of the US Congress introduced a resolution that threatened to prevent Thiokol from acquiring federal contracts unless McDonald’s demotion was reversed. He was promoted to vice president of engineering and charged with redesigning the solid rocket motors. [3]

While this is an extreme example of mismanagement and of quality assurance gone wrong, it is easy to see the pattern that led to the disaster. One would think that when engineers lack confidence in their own creation, management should back off and try to understand what’s wrong.

Another way to look at this is based on risk assessment → which one is more expensive:

  • to delay the launch until the engineers are confident in the rocket, and suffer PR damage
  • to potentially burn through billions in equipment and retrieval operations, possibly sacrifice a crew of astronauts in the process, and still suffer PR damage

What about process?

In a typical SDLC there is always a “Test” phase. However, if the plan is to do all of the testing and “quality assurance” at that point, the system is most certainly headed for an autopsy, not a successful release to the end users. There is nothing wrong with the process diagram shown below; it’s just that such a high-level overview hides the details of how it will deliver quality.

SDLC diagram depicting the various phases of a system’s development.

In a healthy process there is prevention built-in. A variety of quality gates are put in place to ensure readiness to go to the next phase, since identifying mistakes becomes more and more expensive the further into the process they are discovered. A healthy process is actively monitored and analyzed and employs zero tolerance towards bugs! All of these are expanded upon in part two of this article.

Quality Assurance and Quality Control

Effective quality assurance is proactive! Its goal is not only to prevent defects from ever reaching the end users, but also to establish the steps and processes required once a defect is found. Testing is done after the fact and is part of QC (Quality Control) activities. As such, while it cannot guarantee quality, it is the necessary verification step that makes sure the output satisfies the standards put in place. It is an essential part of quality assurance. For testing to be successful, a proper testing strategy should be designed early on and observed throughout the entire SDLC.

Any defect found in the test phase could be a symptom of a problem in an earlier phase that led to its creation. Where did it originate, though? Is it because something was missed in the design phase? Are there conflicting requirements? Are the interfaces falling apart with every new requirement? Is some re-design and/or refactoring required?

The guarantee for quality needs to happen in all the steps of the SDLC. Below is a high level example that could be applied.

Planning + Requirements definition:

This is where the team needs to understand the feasibility of building the system in the first place. They should:

  • Assess whether the software can be tested: what should be tested, when, and how. What needs to be tested and how it can be verified will inform how the system should be built and affect its structural requirements. If, for example, it is mission critical for certain types of data to be unavailable in a certain jurisdiction, is it enough to just test for that? Or is there also a need for monitoring to detect if something escaped the filtering, and an audit log, so that such errors can be detected and traced to their origin? (A sketch of this idea follows the list.)
  • Write specification and acceptance criteria. At this stage, these are high level specs for the output of the entire system and not its sub parts.
  • Decide what is in scope. Is the size of the system suited to the available resources?
  • Establish SLAs and KPIs with architects and business stakeholders.
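To make the first bullet concrete, here is a minimal sketch of the “test it, but also be able to detect and trace escapes” idea: a jurisdiction filter paired with an independent check that writes an audit record whenever a restricted field slips through. The jurisdictions, field names, and rules are hypothetical.

```python
# A minimal sketch of pairing a jurisdiction filter with independent detection:
# even if the filter is well tested, an escape should still be detected and
# traceable. The jurisdictions, field names, and rules below are hypothetical.
import logging

audit_log = logging.getLogger("audit")

# Hypothetical rules: fields that must never be returned in these jurisdictions.
RESTRICTED_FIELDS = {
    "DE": {"betting_history"},
    "US-NV": {"external_odds", "betting_history"},
}


def filter_payload(payload: dict, jurisdiction: str) -> dict:
    """Primary control: strip fields that are restricted in this jurisdiction."""
    blocked = RESTRICTED_FIELDS.get(jurisdiction, set())
    return {key: value for key, value in payload.items() if key not in blocked}


def audit_outgoing(payload: dict, jurisdiction: str, request_id: str) -> None:
    """Independent detection: if the filter ever fails, leave a traceable record."""
    leaked = set(payload) & RESTRICTED_FIELDS.get(jurisdiction, set())
    if leaked:
        audit_log.error(
            "restricted fields leaked: jurisdiction=%s request_id=%s fields=%s",
            jurisdiction, request_id, sorted(leaked),
        )


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    raw = {"balance": 120.0, "betting_history": ["2023-02-01 bet"], "external_odds": {}}
    audit_outgoing(filter_payload(raw, "DE"), "DE", request_id="req-42")  # silent
    audit_outgoing(raw, "DE", request_id="req-43")  # leaves an audit record
```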

Design:

In this phase, based on the requirements analysis, there could be some back and forth with the business stakeholders to crystallize what is needed. The output of this phase is the SDD (software design description/document).

  • Solidify test scenarios: use-cases, edge-cases, abuse-cases (a sketch follows this list). These scenarios are going to affect many aspects of system design, including but not limited to: security, estimating the possibility of DDoS, and lost-data recovery.
  • Gather feedback from the team that is going to build the system. The architects need to digest how the system is going to be used and define the structural requirements to support it. Engineers are required to review and sign off on the design before any development can start.
  • Manual and automation test cases should also be defined where applicable. Even if there are things that can’t be automated (at least in the very first version of this system), they should be known and planned in advance.
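As a hedged illustration of the first bullet, here is a minimal sketch of recording use-, edge-, and abuse-cases as parametrised test data during design, before any development starts. It assumes pytest; validate_stake and the 500.00 limit are hypothetical.

```python
# A minimal sketch of writing down use-, edge-, and abuse-cases as test data
# during the design phase (assumes pytest). `validate_stake` and the limit of
# 500.00 are hypothetical.
import pytest


def validate_stake(stake: float, limit: float = 500.00) -> bool:
    """Hypothetical rule: a stake must be positive and within the table limit."""
    return 0 < stake <= limit


CASES = [
    ("use: typical stake", 25.00, True),
    ("edge: minimum stake", 0.01, True),
    ("edge: exactly at the limit", 500.00, True),
    ("edge: one cent over the limit", 500.01, False),
    ("abuse: zero stake", 0.00, False),
    ("abuse: negative stake", -10.00, False),
    ("abuse: absurdly large stake", 10.0**9, False),
]


@pytest.mark.parametrize("label, stake, accepted", CASES)
def test_stake_validation(label, stake, accepted):
    assert validate_stake(stake) is accepted, label
```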

Development:

Read about code reviews and test reviews in part two.

Test:

While the test phase comes after the development phase, it is important to note that this doesn’t mean the output of the development phase was not tested at all. The development phase could output many different sub-systems, microservices, components, etc. The test phase is the place where all of it is going to be run together and verified for correctness on a production-like environment. Let’s not forget that testing the functional requirements is not enough. There are other things to test which can’t be verified during development:

  • performance and stability of the end-to-end system, including production setups for third-party systems such as message brokers/buses, databases, message queues, and others (see the breakpoint-testing sketch after this list)
  • scaling based on load or other triggers
  • rollout, rollback, hotfix, recovery procedures
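A minimal sketch of the breakpoint idea from the first bullet: keep doubling the number of concurrent clients against an endpoint until a latency budget is broken, which tells you where the deployed system starts to degrade. The URL, budget, and step sizes are hypothetical, and a real run would normally use a dedicated load-testing tool.

```python
# A minimal breakpoint-testing sketch: double the number of concurrent clients
# until the 95th-percentile latency exceeds a budget or requests start failing,
# i.e. find where the system starts to degrade. The URL, budget, and steps are
# hypothetical; real load tests would normally use a dedicated tool.
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def probe(url: str) -> float | None:
    """Return the latency of one request in seconds, or None on failure."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            response.read()
    except OSError:  # covers URLError, timeouts, connection resets
        return None
    return time.monotonic() - start


def find_breakpoint(url: str, latency_budget: float = 0.5, max_clients: int = 256) -> int:
    clients = 1
    while clients <= max_clients:
        with ThreadPoolExecutor(max_workers=clients) as pool:
            samples = list(pool.map(lambda _: probe(url), range(clients * 10)))
        failures = samples.count(None)
        latencies = sorted(s for s in samples if s is not None)
        p95 = latencies[int(len(latencies) * 0.95)] if latencies else float("inf")
        print(f"{clients:>4} clients -> p95={p95:.3f}s, failures={failures}")
        if failures or p95 > latency_budget:
            return clients  # the first load level at which the budget is broken
        clients *= 2
    return max_clients


if __name__ == "__main__":
    find_breakpoint("http://localhost:8080/health")  # hypothetical endpoint
```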

Depending on the team’s composition and spread of responsibilities, the test phase could look very different. In some cases, manual testing is done only the first time around, to verify the correctness of the automation. In some organizations there is a dedicated QA vertical to do the testing; in others, the engineering teams are responsible for it. Employing integration and end-to-end tests during development could shorten this phase or even make it obsolete in some cases (with regard to functional requirements), while in others this is where a regulatory authority is going to certify the system.

Results of any tests should be analyzed and any discrepancy between expected and real output should trigger RCA (root cause analysis). Depending on the system and testing strategy, a variety of tests could be performed, such as:

  • Manual tests
  • Automated tests
  • User acceptance tests
  • Performance tests

Rollout:

Rolling out a product to the end users is a process in itself. To ensure quality, the team needs well-established procedures for handling any complications. There is a variety of strategies for doing that, depending on whether it is a brand new product or an update to an existing one. Regardless of whether you are doing a closed beta for a new product or service, or a gradual rollout of updates, there are common practices that help keep you on the right path. There is no other environment like production, thus tests should also be executed on production. Once the system is deployed (even partially), the tests should start running and the team should monitor the system for abnormal behavior. Once the system is verified as safe to release, start piping traffic to it. As soon as the first users hit it, analyze the metrics and validate the system’s SLAs and KPIs for real customers. A sketch of such a rollout loop follows.
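A minimal sketch of that rollout loop: verify on production first, then pipe traffic to the new version in steps, checking the error budget after each step and rolling back on any violation. The helper functions are hypothetical stand-ins for real deployment and monitoring tooling.

```python
# A minimal sketch of a gradual rollout: verify on production before any user
# sees the new version, then increase its share of traffic in steps, checking
# the error budget after each step and rolling back on any violation.
# The helper functions are hypothetical stand-ins for real tooling.
import time


def run_smoke_tests() -> bool:
    return True  # hypothetical: run the production smoke-test suite


def set_traffic_share(percent: int) -> None:
    print(f"routing {percent}% of traffic to the new version")  # hypothetical


def current_error_rate() -> float:
    return 0.001  # hypothetical: read the error rate from monitoring


def rollback() -> None:
    print("rolling back to the previous version")  # hypothetical


def gradual_rollout(steps=(1, 5, 25, 50, 100), error_budget=0.01, soak_seconds=300) -> bool:
    if not run_smoke_tests():
        rollback()
        return False
    for percent in steps:
        set_traffic_share(percent)
        time.sleep(soak_seconds)  # let real traffic hit the new version
        if current_error_rate() > error_budget:
            rollback()  # SLA violated: stop the rollout
            return False
    return True


if __name__ == "__main__":
    print("rollout successful" if gradual_rollout(soak_seconds=1) else "rollout aborted")
```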

Maintenance:

Delivering a system is one thing, maintaining it for an extended period of time is completely different. To ensure quality of service the team should:

  • Proactively monitor the system → To be proactive means to control the situation rather than just respond after the fact. Proactive monitoring constantly tries to identify potential issues before they cause outages for the business.
  • Track SLAs, KPIs, and metrics → A lot of the data that can be gathered from a system makes sense only once it is put in the context of time, and as such it should be tracked as a trend rather than a fixed number. Correlating trends can reveal insight into how different parts of the system affect each other in real life and either prove or disprove the assumptions made during their implementation (see the sketch after this list).
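A minimal sketch of the trend idea from the bullets above: instead of alerting on a single fixed number, compare a short rolling window of a metric against its longer-term baseline, so that drift is flagged before it becomes an outage. The metric, window sizes, and threshold are hypothetical.

```python
# A minimal sketch of trend-based (proactive) monitoring: compare a recent
# window of a metric to its longer-term baseline and flag drift before it
# turns into an outage. Metric, window sizes, and threshold are hypothetical.
from collections import deque
from statistics import mean


class TrendMonitor:
    def __init__(self, short_window=12, long_window=288, drift_ratio=1.5):
        self.short = deque(maxlen=short_window)  # e.g. the last hour of samples
        self.long = deque(maxlen=long_window)    # e.g. the last 24 hours
        self.drift_ratio = drift_ratio

    def observe(self, value: float) -> bool:
        """Record one sample; return True if the recent trend looks abnormal."""
        self.short.append(value)
        self.long.append(value)
        if len(self.long) < self.long.maxlen:
            return False  # not enough history for a meaningful baseline yet
        baseline = mean(self.long)
        recent = mean(self.short)
        return baseline > 0 and recent > baseline * self.drift_ratio


if __name__ == "__main__":
    monitor = TrendMonitor(short_window=3, long_window=10)
    error_rates = [0.01] * 10 + [0.012, 0.02, 0.05]  # slowly creeping upwards
    for rate in error_rates:
        if monitor.observe(rate):
            print(f"drift detected at error rate {rate}")
```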

Examples of a variety of KPIs and metrics are going to be provided in the second part of this mini-series.

Evaluation:

  • Assess the process and its output, extract good practices, identify gaps, then fine-tune before the next cycle begins.

QA is part of the CAPA activities

CAPA stands for “corrective action, preventive action.” It consists of improvements to an organization’s processes, made to eliminate the causes of defects and other undesirable situations. It takes the form of actions, laws, or regulations that an organization is required to take, in terms of research and development, documentation, procedures, or systems, to rectify and eliminate recurring defects. The corrective and preventive action is designed by a team that includes quality assurance personnel and personnel involved at the actual observation point of the defect. It must be systematically implemented and observed for its ability to eliminate further recurrence of the flaw. [1]

Let’s use the following example of a faulty system to examine the different actions:
A system is required to print the sum of numbers input by a user. The system works as expected with correct input. The system breaks when users input numbers in the wrong format (format based on locale).

Corrective action:

Action taken to eliminate the causes of non-conformities or other undesirable situations, so as to prevent recurrence.

Non-exhaustive list of example corrective actions related to the example system above:

  • Apply defensive coding techniques for identifying faulty input (see the sketch after this list).
  • UI validation to alert users to input numbers in a format suitable for the system.
  • Monitoring, alerting, and paging to quickly identify new faulty cases.
  • Process enhancements and fine-tuning (or even complete redesign) such that faulty cases can be identified in the requirements gathering and design phases.
  • Introducing new types of tests to cover expected system behavior with known bad input.
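To make the first corrective action concrete, here is a minimal sketch of defensive input handling for the example system: parse user-entered numbers according to a declared locale, and reject anything that does not match instead of crashing or summing garbage. The two locale rules shown are deliberately simplified examples.

```python
# A minimal sketch of defensive input handling for the example system:
# parse user-entered numbers according to a declared locale and reject
# anything that does not match, instead of crashing or summing garbage.
# The locale rules shown here are simplified examples.
from decimal import Decimal, InvalidOperation

# Hypothetical, simplified rules: (thousands separator, decimal separator).
LOCALE_SEPARATORS = {
    "en-US": (",", "."),  # 1,234.56
    "de-DE": (".", ","),  # 1.234,56
}


class InvalidNumberInput(ValueError):
    pass


def parse_amount(raw: str, locale: str) -> Decimal:
    thousands, decimal_sep = LOCALE_SEPARATORS[locale]
    normalized = raw.strip().replace(thousands, "").replace(decimal_sep, ".")
    try:
        return Decimal(normalized)
    except InvalidOperation:
        raise InvalidNumberInput(f"{raw!r} is not a valid {locale} number") from None


def sum_amounts(raw_values: list[str], locale: str) -> Decimal:
    return sum((parse_amount(value, locale) for value in raw_values), Decimal("0"))


if __name__ == "__main__":
    print(sum_amounts(["1.234,56", "5,44"], "de-DE"))  # 1240.00
    print(sum_amounts(["1,234.56", "5.44"], "en-US"))  # 1240.00
```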

Preventive action:

Action taken to prevent the occurrence of such non-conformities, generally as a result of risk analysis.

Non-exhaustive list of example preventive actions related to the example system above:

  • Analyzing the localization requirements and documenting locale-specific edge cases, including the expected system behavior.
  • Regularly reviewing and updating internal documentation, policies, and procedures based on the supported locales.
  • Conducting open code reviews, chosen at random, that anybody can join. These should include a review of both requirements and implementation. Possible results could be changes to style guides, the definition of done, etc.
  • Holding retrospectives, even for successful launches. It is easy to focus on the negative, but understanding what made things work is just as important.
  • Extending the testing plan with a variety of inputs that try to force the system out of its expected behavior (see the sketch below).
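A minimal sketch of that last point: a parametrised test (assuming pytest) that deliberately feeds the example system malformed, locale-specific input and asserts that it is rejected with a well-defined error rather than crashing or producing a wrong sum. parse_amount here is a condensed copy of the hypothetical helper from the corrective-action sketch above.

```python
# A minimal sketch of a testing plan that pushes the system outside its
# expected behaviour: malformed input must be rejected with a well-defined
# error, never an unhandled crash or a silently wrong sum (assumes pytest).
# `parse_amount` is a condensed copy of the hypothetical helper shown earlier.
from decimal import Decimal, InvalidOperation

import pytest

SEPARATORS = {"en-US": (",", "."), "de-DE": (".", ",")}


def parse_amount(raw: str, locale: str) -> Decimal:
    thousands, decimal_sep = SEPARATORS[locale]
    try:
        return Decimal(raw.strip().replace(thousands, "").replace(decimal_sep, "."))
    except InvalidOperation:
        raise ValueError(f"{raw!r} is not a valid {locale} number") from None


MALFORMED = [
    ("", "en-US"),
    ("   ", "de-DE"),
    ("abc", "en-US"),
    ("$100", "en-US"),
    ("1.2.3", "en-US"),
    ("12,34,56", "de-DE"),
    ("12 345,67", "de-DE"),
]


@pytest.mark.parametrize("raw, locale", MALFORMED)
def test_malformed_input_is_rejected_with_a_clean_error(raw, locale):
    # the system must reject bad input with a well-defined error, never crash
    with pytest.raises(ValueError):
        parse_amount(raw, locale)
```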

Onward to Part 2: Advanced Quality

The guarantee for quality is built into all stages, and it doesn’t stop once the system goes live. The SDLC process needs to be re-assessed all the way through, even in cases of success, and fine-tuned to reduce waste and improve metrics such as lead time, cycle time, and the number of escaped bugs. Quality assurance is embedded into the SDLC itself!

In the next part — metrics and quality gates that could help steer development in the right direction by setting up KPIs to estimate the quality of the SDLC process itself. Questions such as:

  • What are lagging and leading indicators?
  • What metrics are used to evaluate the whole process?
  • How could cyclomatic complexity, code coverage, code churn, defect density, and others be used to fine tune and improve the SDLC?

Read part two here.

You either own the quality or the lack of quality will own you!

Want to learn more about DraftKings’ global Engineering team and culture? Check out our Engineer Spotlights and current openings!

Resources


Martin Chaov
DraftKings Engineering

15+ years as a software architect, currently Lead Software Architect at DraftKings, specializing in large-scale systems and award-winning iGaming software.