Data retention with the Serverless Framework, DynamoDB, and Go

At Honeybadger we offer standard data retention periods that our customers can choose from. Depending on their subscription plan, we’ll store their error data for up to 180 days. Some customers, though, need a custom retention period. For compliance or other reasons, they may want to enforce a retention period of 30 days even though they subscribe to a plan that offers a longer one. We allow our customers to configure this custom retention period on a per-project basis, and we then delete each error notification based on the schedule they have set. Since we store customer error data on S3, we need to keep track of every S3 object we create and when it should expire so that we can delete it at the right time. This blog post describes how we use S3, DynamoDB, Lambda, and the Serverless Framework to accomplish this task.

Keeping track of the S3 objects we create

As our processing pipeline receives and processes error notifications, we store the payload from each notification in an S3 object. We also create objects in a separate S3 bucket that contain a batched list of the resulting S3 keys and the expiration time for each of those keys. These objects are just JSON arrays that look like this:

[
  {
    "key": "pu3We2Ie/ea40b606-1b48-40cb-942f-a046755c7a0f",
    "expire_at": "2017-03-03T23:59:58Z"
  },
  {
    "key": "Ieb0ieVu/fe3c6b8a-76d7-48d8-ab71-ea8a4f3bce08",
    "expire_at": "2017-07-31T23:59:58Z"
  }
]
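
For reference, these entries map onto a small Go struct. The type and function names below are illustrative rather than taken from our pipeline code, but the JSON tags match the format shown above:

package expirations

import (
	"encoding/json"
	"time"
)

// expiration models one entry in the batched list objects above.
// RFC 3339 timestamps like "2017-03-03T23:59:58Z" parse natively.
type expiration struct {
	Key      string    `json:"key"`
	ExpireAt time.Time `json:"expire_at"`
}

// parseExpirations decodes one batched list object.
func parseExpirations(body []byte) ([]expiration, error) {
	var entries []expiration
	err := json.Unmarshal(body, &entries)
	return entries, err
}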

The S3 bucket that stores these lists of keys sends a message to an SNS topic for each PUT request. We have a Lambda function subscribed to that topic to process each object as it is created, configured in our serverless.yml as the send-to-dynamo function:

functions:
  send-to-dynamo:
    handler: bin/send-to-dynamo
    events:
      - sns: "arn:aws:sns:us-east-1:*:expirations-${opt:stage, self:provider.stage}"

Tip: If you start your Serverless project with serverless create --template aws-go, you will get a project layout that puts the code for each function in a main.go file in its own subdirectory, along with a Makefile to help with compiling the code before deploying. Handy!

Ok, back to our configuration… Since this function needs to read the contents of the S3 objects referenced by the SNS notification, we have these permissions defined in serverless.yml:

provider:
  name: aws
  runtime: go1.x
  environment:
    STAGE: "${opt:stage, self:provider.stage}"
  iamRoleStatements:
   - Effect: "Allow"
     Action:
       - "s3:GetObject"
     Resource: "arn:aws:s3:::expirations-bucket-${opt:stage, self:provider.stage}/*"

That STAGE environment variable is used in the Go code to know which DynamoDB table to use, since we create a table per serverless environment that we deploy. Here’s how we define that table and give the Lambda function permission to write to it in our config:

provider:
  # ...
  iamRoleStatements:
    # ...
    - Effect: "Allow"
      Action:
        - "dynamodb:PutItem"
      Resource:
        Fn::GetAtt:
          - expirationsTable
          - Arn

resources:
  Resources:
    expirationsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: "expirations-${opt:stage, self:provider.stage}"
        AttributeDefinitions:
          - AttributeName: ID
            AttributeType: S
          - AttributeName: ExpireAt
            AttributeType: N
        KeySchema:
          - AttributeName: ID
            KeyType: HASH
          - AttributeName: ExpireAt
            KeyType: RANGE
        StreamSpecification:
          StreamViewType: OLD_IMAGE
        TimeToLiveSpecification:
          AttributeName: ExpireAt
          Enabled: true
        # ...

You’ll notice that the DynamoDB table has been configured with a TTL attribute called ExpireAt, and that it will emit a stream of events. Since we only care about the TTL deletions, we use OLD_IMAGE as the StreamViewType, which gives us the contents of the fields in the DynamoDB row when it is deleted.

Since we have a lot of keys arriving with the same expiration time, the Lambda function groups the list of keys by expiration time and creates one record per group in DynamoDB to reduce write throughput. This results in one DynamoDB record per expiration time (down to the second), each containing the list of S3 keys to be deleted at that time.

The handler function receives the SNS event, looks for S3 records in the event data, then calls the getItems function to load each of those S3 objects and return lists of S3 keys to be deleted, grouped by expiration time. Each of those groups gets inserted as one DynamoDB record.
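
To make that flow concrete, here’s a condensed sketch of how such a handler can look in Go with the aws-lambda-go and aws-sdk-go packages. The overall shape follows the description above, but the details are illustrative assumptions rather than our exact code, in particular the Keys attribute, the value used for ID, and the minimal error handling:

// A condensed sketch of bin/send-to-dynamo. The Keys attribute and the
// value stored in ID are illustrative, not our exact schema.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"strconv"
	"time"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
	"github.com/aws/aws-sdk-go/service/s3"
)

// expiration matches the entries in the batched list objects.
type expiration struct {
	Key      string    `json:"key"`
	ExpireAt time.Time `json:"expire_at"`
}

var (
	sess  = session.Must(session.NewSession())
	s3svc = s3.New(sess)
	ddb   = dynamodb.New(sess)
	// STAGE tells us which per-environment table to write to.
	table = "expirations-" + os.Getenv("STAGE")
)

// getItems loads one batched list object from S3 and groups its keys
// by expiration time, truncated to the second.
func getItems(bucket, key string) (map[int64][]string, error) {
	out, err := s3svc.GetObject(&s3.GetObjectInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
	})
	if err != nil {
		return nil, err
	}
	defer out.Body.Close()

	var entries []expiration
	if err := json.NewDecoder(out.Body).Decode(&entries); err != nil {
		return nil, err
	}

	groups := make(map[int64][]string)
	for _, e := range entries {
		epoch := e.ExpireAt.Unix()
		groups[epoch] = append(groups[epoch], e.Key)
	}
	return groups, nil
}

func handler(ctx context.Context, event events.SNSEvent) error {
	for _, record := range event.Records {
		// The SNS message body carries the S3 event notification as JSON.
		var s3Event events.S3Event
		if err := json.Unmarshal([]byte(record.SNS.Message), &s3Event); err != nil {
			return err
		}
		for _, rec := range s3Event.Records {
			groups, err := getItems(rec.S3.Bucket.Name, rec.S3.Object.Key)
			if err != nil {
				return err
			}
			// One PutItem per expiration time; ExpireAt doubles as the TTL attribute.
			for epoch, keys := range groups {
				_, err := ddb.PutItem(&dynamodb.PutItemInput{
					TableName: aws.String(table),
					Item: map[string]*dynamodb.AttributeValue{
						// Any value that uniquely identifies the batch works as ID.
						"ID":       {S: aws.String(fmt.Sprintf("%s/%d", rec.S3.Object.Key, epoch))},
						"ExpireAt": {N: aws.String(strconv.FormatInt(epoch, 10))},
						"Keys":     {SS: aws.StringSlice(keys)},
					},
				})
				if err != nil {
					return err
				}
			}
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}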

Using DynamoDB streams to know when to delete

Now that we have the triggers and code in place to store the keys of the objects that are to be deleted, we need to finish the job by watching for TTL events in the DynamoDB stream and delete the S3 objects. The purge-from-s3 function does this. Here’s the configuration from serverless.yml:

provider:
  # ...
  iamRoleStatements:
   - Effect: "Allow"
     Action:
       - "s3:DeleteObject"
     Resource: "arn:aws:s3:::data-bucket-${opt:stage, self:provider.stage}/*"

functions:
  # ...
  purge-from-s3:
    handler: bin/purge-from-s3
    events:
      - stream:
          type: dynamodb
          arn:
            Fn::GetAtt:
              - expirationsTable
              - StreamArn

We have granted the Lambda function the permission to delete objects from the bucket that stores the error payloads, and we have configured that function to be triggered by the events in the stream from our DynamoDB table.

This function is simpler than the first one. When it receives a REMOVE event from DynamoDB, signifying that a record has been deleted due to the TTL, it iterates over the list of keys stored in the record, deleting each one from the bucket.

Unfortunately, this Lambda function will get invoked for every event in the DynamoDB stream, which includes all the inserts that we did in the send-to-dynamo function. We don’t care about those records, so we just ignore them in this function. I have filed a feature request with Amazon to add the ability to filter the types of stream events that trigger a function, and you’re more than welcome to do the same. :)
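
For illustration, here’s a similarly condensed sketch of how this handler can look, including the check that skips everything but the REMOVE events. It leans on the same assumptions as the previous sketch (a Keys attribute stored as a string set), so treat it as a shape rather than our exact implementation:

// A condensed sketch of bin/purge-from-s3. The Keys attribute and its
// string-set shape are the same illustrative assumptions as above.
package main

import (
	"context"
	"os"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

var (
	s3svc = s3.New(session.Must(session.NewSession()))
	// The bucket that holds the error payloads to purge.
	bucket = "data-bucket-" + os.Getenv("STAGE")
)

func handler(ctx context.Context, event events.DynamoDBEvent) error {
	for _, record := range event.Records {
		// The stream also delivers the INSERTs from send-to-dynamo;
		// we only act on REMOVE events emitted by the TTL.
		if record.EventName != "REMOVE" {
			continue
		}
		// With StreamViewType OLD_IMAGE, the deleted row's attributes
		// arrive in OldImage, including the list of S3 keys to delete.
		keysAttr, ok := record.Change.OldImage["Keys"]
		if !ok {
			continue
		}
		for _, key := range keysAttr.StringSet() {
			_, err := s3svc.DeleteObject(&s3.DeleteObjectInput{
				Bucket: aws.String(bucket),
				Key:    aws.String(key),
			})
			if err != nil {
				return err
			}
		}
	}
	return nil
}

func main() {
	lambda.Start(handler)
}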

I hope you have found this post to be a useful example of how to work with S3 and DynamoDB streams using the Serverless Framework, Lambda, and Go!
