How to leverage AI for chaos testing

Manual testing and various forms of automated testing can cover positive, negative, and edge cases, providing a certain level of product quality verification. However, these testing types focus on pre-scripted activities, which carry one inherent drawback: they are pre-scripted. If we only test what we know we should test, we may overlook a lot of scenarios. It’s like searching for lost keys only in well-lit areas and missing all the dark spaces.

Exploratory testing adds another layer of safety, since it allows for variability and a greater diversity of testing scenarios. In exploratory testing, one goes off-script as far as it is possible to go off-script. But working on the same product for a few years leaves everyone somewhat blinded by previous experience. That bias can lead to over-reliance on mental sets: if the only tool you have is a hammer, you’re more likely to treat everything as if it’s a nail.

It’s difficult to test what humans cannot predict

I believe both scripted (manual and automated) and unscripted (exploratory) testing are limited by various human factors, but it’s also simply impossible to assess quality by merely checking all inputs and outputs, for three reasons:

Hardware isn’t fully deterministic. Any computational device is influenced by various factors: from temperature and power fluctuations to aging, clock speed variations, and network delays. This means that hardware might behave differently even when running the same instructions.

Software isn’t fully deterministic. Operating systems add more “layers” of non-determinism: for example, the OS scheduler determines which processes and threads run at any given time based on factors like priority, resource availability, and system load, and this dynamic scheduling introduces variability. In managed runtimes, garbage collectors make software behavior harder to predict because they clean up memory at times the application doesn’t control, affecting performance and execution timing. Multithreading and parallel processing introduce further variability in execution order and timing, making deterministic behavior difficult to guarantee. Disk, network, and other I/O operations can have variable latency, affecting the timing and order of execution in applications that depend on them.
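
To see this in action, here is a tiny, self-contained sketch (generic Python, not tied to any particular product) that makes the scheduler’s influence visible: run it a few times and the interleaving of the output lines will usually differ between runs, even though the code and its inputs never change.

  import threading
  import time

  def worker(name: str) -> None:
      for i in range(3):
          # Sleeping yields control to the OS scheduler, which decides who runs next.
          time.sleep(0.001)
          print(f"{name} step {i}")

  threads = [threading.Thread(target=worker, args=(f"thread-{i}",)) for i in range(3)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()
  # Typical result: the per-thread lines arrive in a different order on each run.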

And lastly, even the best QA engineer can’t predict all users' behaviors.

As one study puts it:

"When describing distributed systems, functional specifications are incomplete; they do not fully characterize the behavior of the system for all possible inputs. The problem of incomplete specification is exacerbated by the scale and complexity of user behavior and the underlying infrastructure, which make it difficult to fully enumerate all inputs. In practice, the scale of the systems makes it impractical to compose specifications of the constituent services in order to reason about behavior".

Chaos testing theory

Chaos testing was designed specifically to address the weaknesses of the aforementioned methods: it introduces unexpected events into the system and assesses how the system behaves. When performing chaos testing, we essentially break the system in various places and in various ways to see how it survives.

From this perspective, we ask questions like "Does everything seem to be working properly?" rather than "Does the implementation match the specification?"

In order to perform chaos testing, we need to take a few steps:

  1. Analyze the graph of all the services and subsystems the product has. Then, come up with ideas (hypotheses) about what we can break and how the system should behave when we break that particular thing.
  2. For each hypothesis, plan how we are going to break it, how we will make sure the overall system is not corrupted, and how we will restore what was broken.
  3. Review and approve each plan with the dev and ops teams.
  4. Execute and observe the system behaviors, then compile a report.

Based on the resulting report, the dev team can decide whether they need to change the system so that it becomes more resilient.
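
To keep these four steps organized, it can help to record each hypothesis and its plan in one place before anything gets executed. Here is a minimal sketch of such a record (the field names are illustrative assumptions, not a prescribed format):

  from dataclasses import dataclass, field

  @dataclass
  class ChaosHypothesis:
      target: str                  # the link or node in the service graph, e.g. "app -> PostgreSQL"
      failure_mode: str            # e.g. "latency injection, 500 ms"
      expected_behavior: str       # what a resilient system should do
      injection_plan: str          # how we are going to break it (step two)
      rollback_plan: str           # how we restore the original state (step two)
      approved_by: list[str] = field(default_factory=list)  # dev/ops reviewers (step three)
      observed_behavior: str = ""  # filled in during execution, goes into the report (step four)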

I believe that AI can greatly help us with step one and step two.

Why use AI in chaos testing?

LLMs are really good at analyzing complex graphs and their connections. Good LLMs are also trained on such huge datasets that they should be able to provide more diversity than any human possibly can.

This means that step one is potentially best done with AI help: we can ask AI to analyze the graph of the services and subsystems, and come up with ideas of what we might want to break and how.

LLMs are also known for their hallucinations: they sometimes come up with information that simply isn’t true. In many use cases this drawback renders AI almost useless, but here the hallucinations are our (imaginary 😄) friends. We validate any ideas AI brings us anyway, so AI simply provides a good diversity of options, which we can then filter.

LLMs lack context. They don’t possess any knowledge of the system we’re trying to perform chaos testing on, and that’s a good thing: all the AI does is analyze the graph and come up with ideas. It doesn’t need any product context, and it is therefore not biased by prior knowledge the way a long-time team member is.

With step two, AI can come up with all the potential ways to break a certain graph link (a connection between two subsystems): from latency injection to random server outages. The more ways it suggests for breaking the links, the more diverse the set of options we get to review.

Steps three and four are best left to human beings. Dev, QA, and ops professionals possess the product and technical expertise required to filter the AI-generated suggestions and pick the ones that make sense to test.

AI and chaos testing in practice

Step one: come up with ideas of what to break and how

Most companies have a designated place to keep track of their services (micro and macro), how they are connected, and how they communicate with each other.

At Qase, we use Miro for this:

This diagram is a good input for any LLM. I use ChatGPT-4, and here’s my prompt:

Given the graph of services in the picture I uploaded, please come up with all the potential ways to perform chaos testing for each link of the graph. Please consider at least these ways where it makes sense: Latency injection, Random server outages, Resource exhaustion, Simulated DDoS attacks, Database failures, Dependency failures, Disk space exhaustion

ChatGPT-4 generates the following output, successfully performing step one:

ChatGPT output:

To perform chaos testing on the architecture depicted in the diagram, we need to identify potential failure points and apply various chaos engineering techniques to test the resilience of each component. Below is a detailed plan for performing chaos testing on each link and component in the graph using latency injection, random server outages, resource exhaustion, simulated DDoS attacks, database failures, dependency failures, and disk space exhaustion where applicable.

1. AWS VPC and Network

  • Latency Injection: Introduce latency in the network to simulate slow network conditions.
  • Random Server Outages: Simulate outages of NAT Gateways and VPN connections.
  • Resource Exhaustion: Exhaust bandwidth limits to test the resilience of the network.

2. Ingress Controllers (Nginx)

  • Latency Injection: Inject latency in the request routing to simulate slow response times.
  • Random Server Outages: Terminate one or more Nginx controllers to see how the system handles the load.
  • Simulated DDoS Attacks: Flood the ingress controllers with high traffic to test their handling capacity.
  • Resource Exhaustion: Exhaust CPU and memory resources of the Nginx controllers.

3. Application Layer (Various Services)

  • Latency Injection: Introduce latency in the communication between services (e.g., app, prop, queue workers).
  • Random Server Outages: Simulate outages of individual services (app, prop, auth, analytics).
  • Resource Exhaustion: Exhaust CPU, memory, and disk space resources of each service.
  • Dependency Failures: Disable dependencies (e.g., disable auth service to test service requests authentication).

4. Database Layer (PostgreSQL, Redis, DocumentDB)

  • Database Failures:
    • PostgreSQL: Simulate database unavailability, slow queries, and data corruption.
    • Redis: Simulate Redis server crash, data loss, and high latency.
    • DocumentDB: Simulate DocumentDB unavailability and slow response times.
  • Resource Exhaustion: Exhaust storage space and connection limits.
  • Disk Space Exhaustion: Fill up the disk space to test database operations under low storage conditions.

5. API Gateway

  • Latency Injection: Introduce latency in API request processing.
  • Random Server Outages: Simulate API Gateway downtime.
  • Simulated DDoS Attacks: Flood the API Gateway with excessive requests.
  • Resource Exhaustion: Exhaust CPU and memory resources of the API Gateway.

6. Amazon S3

  • Latency Injection: Inject latency in read/write operations.
  • Dependency Failures: Simulate unavailability of S3 buckets.
  • Resource Exhaustion: Attempt to reach S3 request rate and storage limits.

7. Queue Workers (SQS, SNS)

  • Latency Injection: Introduce latency in message processing.
  • Random Server Outages: Simulate SQS and SNS service outages.
  • Dependency Failures: Disable access to SQS/SNS and observe system behavior.

8. External Services and Security Layers

  • Simulated DDoS Attacks: Simulate DDoS attacks on AWS Shield and CloudFront.
  • Dependency Failures: Simulate unavailability of external security services like AWS Shield and GuardDuty.
  • Resource Exhaustion: Exhaust limits of external services to test resilience.

Implementation Tools and Methods

  • Chaos Monkey: For random server outages.
  • Latency Injection Tools: tc (Linux), Gremlin, or Chaos Mesh.
  • Resource Exhaustion Tools: stress-ng, Chaos Toolkit.
  • DDoS Simulation Tools: Locust, Vegeta.
  • Database Chaos Tools: Chaos Mesh, Pumba for containerized environments.

Summary

Chaos engineering involves systematically introducing faults into your system to identify weaknesses and improve reliability. For each component and link in the architecture, you can apply one or more chaos engineering techniques to test its resilience. It's crucial to monitor the system's response to these tests to identify failure points and make necessary improvements.

Consider setting up these tests in a controlled environment to prevent actual production outages and ensure you have proper monitoring and alerting in place to capture the impact of these tests.

If the result is not sufficient, it’s easy to ask “please provide more ways” and ChatGPT will yield some more.
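
If you’d rather run this step from a script than through the ChatGPT UI, the same prompt can be sent via the API. Below is a minimal sketch assuming the official openai Python package; the model name, the exported diagram file, and the output handling are assumptions to adapt to your own setup.

  import base64
  from openai import OpenAI  # pip install openai

  client = OpenAI()  # expects OPENAI_API_KEY in the environment

  # The service graph exported from Miro as an image (hypothetical file name).
  with open("services-graph.png", "rb") as f:
      image_b64 = base64.b64encode(f.read()).decode()

  prompt = (
      "Given the graph of services in the picture I uploaded, please come up with "
      "all the potential ways to perform chaos testing for each link of the graph. "
      "Please consider at least these ways where it makes sense: Latency injection, "
      "Random server outages, Resource exhaustion, Simulated DDoS attacks, "
      "Database failures, Dependency failures, Disk space exhaustion"
  )

  response = client.chat.completions.create(
      model="gpt-4o",  # any vision-capable model will do
      messages=[{
          "role": "user",
          "content": [
              {"type": "text", "text": prompt},
              {"type": "image_url",
               "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
          ],
      }],
  )
  print(response.choices[0].message.content)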

Step two: come up with a plan for each testing scenario 

This is even simpler: you ask ChatGPT to provide a detailed plan for each activity:

Below are detailed instructions from ChatGPT for performing chaos testing activities for each component in the architecture. The instructions cover tools and methods to simulate various failure conditions.

ChatGPT output:

1. AWS VPC and Network

Latency Injection

  • Tool: tc (Linux), Chaos Mesh
  • Instructions:
    1. Install tc on a Linux machine.
    2. Use tc to add delay:
      sudo tc qdisc add dev eth0 root netem delay 100ms
    3. Verify the delay by testing network latency with ping.
    4. Remove delay after testing:
      sudo tc qdisc del dev eth0 root netem

Random Server Outages

  • Tool: Chaos Monkey
  • Instructions:
    1. Install Chaos Monkey for your environment.
    2. Configure Chaos Monkey to terminate random instances within your VPC.
    3. Monitor system behavior and service recovery.

Route Table Corruption

  • Tool: AWS Console/CLI
  • Instructions:
    1. Log in to AWS Management Console or use AWS CLI.
    2. Modify route tables to introduce incorrect routes:
      aws ec2 replace-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx
    3. Observe the impact and restore correct routes.

DNS Failures

  • Tool: Modify /etc/hosts or DNS settings
  • Instructions:
    1. Edit the /etc/hosts file to introduce incorrect DNS entries.
    2. Flush DNS cache:
      sudo systemctl restart nscd
    3. Test application behavior with incorrect DNS resolution.

2. Ingress Controllers (Nginx)

Latency Injection

  • Tool: tc (Linux)
  • Instructions:
    1. Install tc on Nginx server.
    2. Introduce delay:
      sudo tc qdisc add dev eth0 root netem delay 200ms
    3. Monitor the impact on request latency.
    4. Remove the delay after testing:
      sudo tc qdisc del dev eth0 root netem

Simulated DDoS Attacks

  • Tool: Locust, Vegeta
  • Instructions:
    1. Install Locust:
      pip install locust
    2. Create a locustfile.py to define load behavior.
    3. Start Locust and generate high traffic:
      locust -f locustfile.py
    4. Monitor Nginx performance and response.

Configuration Changes

  • Tool: Manual Configuration Changes
  • Instructions:
    1. Modify Nginx configuration file (nginx.conf) with incorrect settings.
    2. Reload Nginx:
      sudo nginx -s reload
    3. Observe the impact on traffic handling and error logs.

3. Application Layer (Various Services)

Resource Exhaustion

  • Tool: stress-ng
  • Instructions:
    1. Install stress-ng on service hosts:
      sudo apt-get install stress-ng
    2. Run stress test to exhaust CPU and memory:
      stress-ng --cpu 4 --timeout 60s
      stress-ng --vm 2 --vm-bytes 1G --timeout 60s
    3. Monitor service behavior and system resources.

Dependency Failures

  • Tool: Network Partitioning, Configuration Changes
  • Instructions:
    1. Use iptables to block access to a dependency:
      sudo iptables -A OUTPUT -p tcp --dport 3306 -j DROP
    2. Test application behavior without the dependency.
    3. Restore access:
      sudo iptables -D OUTPUT -p tcp --dport 3306 -j DROP

4. Database Layer (PostgreSQL, Redis, DocumentDB)

Database Failures

  • Tool: Database Client Tools, AWS Console
  • Instructions:
    1. For PostgreSQL, simulate connection loss:
      sudo systemctl stop postgresql
    2. Observe application behavior and error handling.
    3. Restart PostgreSQL after testing:
      sudo systemctl start postgresql

Network Partitioning

  • Tool: tc (Linux), iptables
  • Instructions:
    1. Use tc to introduce network delay:
      sudo tc qdisc add dev eth0 root netem delay 500ms
    2. Use iptables to block network traffic:
      sudo iptables -A INPUT -p tcp --dport 5432 -j DROP
    3. Monitor database availability and application response.
    4. Restore network settings:
      sudo iptables -D INPUT -p tcp --dport 5432 -j DROP
      sudo tc qdisc del dev eth0 root netem

5. API Gateway

Throttle Limits Testing

  • Tool: Load Testing Tools (e.g., JMeter)
  • Instructions:
    1. Configure JMeter to send requests exceeding throttle limits.
    2. Run the load test and monitor API Gateway response and throttling behavior.

Version Mismatch

  • Tool: API Deployment Management
  • Instructions:
    1. Deploy a new API version with breaking changes.
    2. Observe client application behavior with the new version.
    3. Roll back to the previous version after testing.

6. Amazon S3

Bucket Policy Misconfigurations

  • Tool: AWS Console/CLI
  • Instructions:
    1. Modify S3 bucket policies to restrict access:
      aws s3api put-bucket-policy --bucket my-bucket --policy file://policy.json
    2. Test application behavior with restricted access.
    3. Restore original bucket policy.

Replication Lag

  • Tool: AWS Console
  • Instructions:
    1. Configure cross-region replication with a delay.
    2. Upload objects to the source bucket and monitor replication lag.
    3. Test application handling of eventual consistency.

7. Queue Workers (SQS, SNS)

Message Duplication

  • Tool: Custom Scripts
  • Instructions:
    1. Create a script to send duplicate messages to SQS:
      import boto3
      sqs = boto3.client('sqs')
      queue_url = 'https://sqs.region.amazonaws.com/123456789012/MyQueue'
      message = 'Test Message'
      for _ in range(2):
          sqs.send_message(QueueUrl=queue_url, MessageBody=message)
    2. Monitor queue worker handling of duplicate messages.

Burst Traffic

  • Tool: Load Testing Tools
  • Instructions:
    1. Configure load testing tools to send burst traffic to SQS.
    2. Monitor SQS and SNS behavior under burst traffic conditions.

8. External Services and Security Layers

IAM Role Misconfigurations

  • Tool: AWS Console/CLI
  • Instructions:
    1. Modify IAM role policies to restrict access:
      aws iam put-role-policy --role-name MyRole --policy-name MyPolicy --policy-document file://policy.json
    2. Test service behavior with restricted IAM policies.
    3. Restore original IAM role policies.

GuardDuty False Positives

  • Tool: Custom Scripts
  • Instructions:
    1. Use scripts to simulate GuardDuty alerts:
      import boto3
      guardduty = boto3.client('guardduty')
      detector_id = 'detector-id'
      finding_id = 'finding-id'
      guardduty.create_sample_findings(DetectorId=detector_id, FindingTypes=['Recon:EC2/Portscan'])
    2. Monitor incident response and handling of false positives.

By following these instructions, you can systematically introduce faults and observe the behavior of each component in the architecture. This helps to identify potential weaknesses and improve the resilience of the overall system.
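
One detail worth filling in: the DDoS-simulation step above says to “create a locustfile.py to define load behavior” without showing one. A minimal sketch of what that file could look like (the host and endpoint are assumptions, and it should only ever be pointed at a test environment):

  # locustfile.py; start with: locust -f locustfile.py --host https://staging.example.com
  from locust import HttpUser, task, between

  class ApiUser(HttpUser):
      # Each simulated user waits 1-3 seconds between requests.
      wait_time = between(1, 3)

      @task
      def hit_api(self):
          # Hypothetical endpoint; replace with whatever your ingress actually serves.
          self.client.get("/api/health")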

Additionally, you’ll want to ask AI to give you instructions on how to revert the system to its original state.

The prompt would be very simple: "how do I revert all the changes and reset everything we did so that the system is in its previous state?"

ChatGPT will provide you with all it knows:

ChatGPT output:

To reset all changes and revert the system to its previous state, you typically need to take the following steps:

For AWS Services

  1. AWS VPC and Network
    • Route Tables:
      aws ec2 replace-route --route-table-id rtb-xxxxxxxx --destination-cidr-block 0.0.0.0/0 --gateway-id igw-xxxxxxxx
    • Security Groups:
      • Restore original security group rules via AWS Console or CLI.
  2. Ingress Controllers (Nginx)
    • Latency Injection:
      sudo tc qdisc del dev eth0 root netem
    • Configuration Changes:
      • Restore original nginx.conf and reload Nginx:
        sudo nginx -s reload
  3. Application Layer (Various Services)
    • Resource Exhaustion:
      • Terminate stress-ng or other stress tests.
    • Dependency Failures:
      • Remove iptables rules blocking dependencies:
        sudo iptables -D OUTPUT -p tcp --dport 3306 -j DROP
  4. Database Layer (PostgreSQL, Redis, DocumentDB)
    • Database Failures:
      • Restart database services:
        sudo systemctl start postgresql
    • Network Partitioning:
      • Remove tc and iptables rules:
        sudo tc qdisc del dev eth0 root netem
        sudo iptables -D INPUT -p tcp --dport 5432 -j DROP
  5. API Gateway
    • Throttle Limits Testing:
      • Stop load testing tools like JMeter.
  6. Amazon S3
    • Bucket Policy Misconfigurations:
      • Restore original bucket policies via AWS Console or CLI.
    • Replication Lag:
      • Verify replication settings and restore if needed.
  7. Queue Workers (SQS, SNS)
    • Message Duplication:
      • Clear duplicate messages from the queue.
    • Burst Traffic:
      • Stop load testing tools generating burst traffic.
  8. External Services and Security Layers
    • IAM Role Misconfigurations:
      • Restore original IAM policies via AWS Console or CLI.
    • GuardDuty False Positives:
      • Resolve or delete sample findings.

General Steps

  1. Undoing Latency Injection and Network Changes:
    sudo tc qdisc del dev eth0 root netem
    sudo iptables -F
  2. Stopping Stress Tools:
    • Identify and stop any running stress tests like stress-ng:
      sudo pkill stress-ng
  3. Restore Configuration Files:
    • Ensure all configuration files like nginx.conf are reverted to their original state.
  4. Restart Services:
    • Restart any services that were stopped or affected by chaos tests.
  5. Remove Simulated Failures:
    • Clean up any simulated failures, such as restoring database connections, reverting IAM roles, and stopping load tests.

By following these instructions, you should be able to revert all changes made during chaos testing and restore the system to its previous state. Make sure to validate each step to ensure that the system is fully operational again.
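
That last validation point is worth automating: once everything has been reverted, a quick steady-state check confirms the cleanup actually worked. A minimal sketch of such a check (the health-check URLs are hypothetical; use whatever endpoints your services already expose):

  import sys
  import requests  # pip install requests

  # Hypothetical health endpoints; substitute your own services' checks.
  HEALTH_CHECKS = {
      "app": "https://staging.example.com/api/health",
      "auth": "https://staging.example.com/auth/health",
  }

  def steady_state_ok(timeout: float = 3.0) -> bool:
      all_healthy = True
      for name, url in HEALTH_CHECKS.items():
          try:
              healthy = requests.get(url, timeout=timeout).status_code == 200
          except requests.RequestException:
              healthy = False
          print(f"{name}: {'OK' if healthy else 'FAILED'}")
          all_healthy = all_healthy and healthy
      return all_healthy

  if __name__ == "__main__":
      sys.exit(0 if steady_state_ok() else 1)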

Step three (review the plan) and step four (execute)

Here, we stop using AI and review the plans and ideas (both how to do the chaos testing and how to revert the changes) with devs, QA, and ops, then decide which steps are feasible. This is where context knowledge is required!

Execution should also be done only by human beings: since AI can hallucinate, we can’t rely on its instructions being 100% accurate.

With this approach to chaos testing, you can save plenty of time and mental effort on generating a diverse set of testing scenarios, while turning both the strengths and the weaknesses of AI to the tester’s benefit.


References:

  1. https://www.researchgate.net/publication/249655706_Are_test_cases_needed_Replicated_comparison_between_exploratory_and_test-case-based_software_testing
  2. https://arxiv.org/pdf/1702.05843
  3. http://ithare.com/deterministic-components-for-interactive-distributed-systems-with-transcript/
  4. https://buttondown.email/nelhage/archive/f6e8eddc-b96c-4e66-a648-006f9ebb6678
  5. https://www.intervalzero.com/understanding-hard-real-time-determinism/
  6. https://slikts.github.io/concurrency-glossary/?id=nondeterministic-vs-deterministic