Deploy a self-hosted Braintrust data plane on AWS, GCP, or Azure
AWS
GCP
Azure
Deploy the Braintrust data plane in your AWS account using the Braintrust Terraform module. This is the recommended way to self-host Braintrust on AWS. Braintrust recommends deploying in a dedicated AWS account: AWS enforces account-level Lambda concurrency limits, and since Braintrust’s API runs on Lambda, sharing an account with other workloads can lead to throttling and service disruptions. A dedicated account also aligns with AWS best practices for workload isolation and security.
To test infrastructure provisioning before committing to production-sized resources, use the sandbox example. It uses minimal instance sizes and has deletion protection disabled for easy teardown. It is not suitable for performance or load testing.
In provider.tf, configure your AWS account and region. Supported regions: us-east-1, us-east-2, us-west-2, eu-west-1, ca-central-1, and ap-southeast-2. If you require support for a different region, contact Braintrust.
In terraform.tf, set up your remote backend (typically S3 and DynamoDB).
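A minimal remote backend sketch, assuming the S3-and-DynamoDB setup mentioned above (bucket, key, region, and lock table names are illustrative):

```hcl
# terraform.tf — remote state in S3 with DynamoDB state locking
terraform {
  backend "s3" {
    bucket         = "my-terraform-state"      # hypothetical bucket name
    key            = "braintrust/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-locks"         # hypothetical lock table
    encrypt        = true
  }
}
```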
In main.tf, customize the Braintrust deployment settings. The defaults are suitable for a large production-sized deployment. Adjust them based on your needs, but keep in mind the hardware requirements.
Each deployment must have a unique deployment_name within the same AWS account (max 18 characters). The default is "braintrust"; change it if you run multiple deployments. Resource names (IAM roles, RDS instances, S3 buckets) are prefixed with this value and will collide if duplicated.
Brainstore instances require instance types with local NVMe storage for caching (e.g., c8gd, c5d, m5d, i3, i4i families). Generic instance types without local storage (t3, m5, c5) are not supported and will fail at plan time.
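The two settings above might look like this in main.tf. This is a sketch only: the Brainstore instance-type variable name is an assumption, so check the module's variables for the exact name.

```hcl
module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  deployment_name = "braintrust-dev"  # unique per AWS account, max 18 characters

  # Brainstore requires an instance family with local NVMe storage (c8gd, c5d,
  # m5d, i3, i4i). Variable name below is illustrative, not the module's exact name.
  brainstore_instance_type = "c8gd.2xlarge"
}
```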
Pass the key to Terraform. The recommended approach is to store the license key in AWS Secrets Manager and reference it using a Terraform data source:
```hcl
data "aws_secretsmanager_secret_version" "brainstore_license" {
  secret_id = "braintrust/brainstore-license-key"
}
```
Then pass data.aws_secretsmanager_secret_version.brainstore_license.secret_string as the brainstore_license_key value in the module. Alternatively, you can pass the key without storing it in Secrets Manager:
Set TF_VAR_brainstore_license_key=your-key in your environment.
Pass it via command line: terraform apply -var 'brainstore_license_key=your-key'.
Add it to an uncommitted terraform.tfvars or .auto.tfvars file.
Do not commit the license key to your git repository.
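If you use a tfvars file, a minimal example (keep this file out of version control):

```hcl
# terraform.tfvars — do not commit this file
brainstore_license_key = "your-key"
```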
The first terraform apply may fail with transient errors such as ASG health check timeouts (while instances are still booting) or Lambda rate limits. Re-running terraform apply resolves these.
This will create all necessary AWS resources including:
Two isolated VPCs:
Main VPC: Hosts Braintrust services (API, database, Redis, Brainstore)
Quarantine VPC: Runs user-defined functions (scorers, tools) in network isolation. This creates ~30 Lambda functions across multiple runtimes. This is required for most production use cases.
Connect your Braintrust organization to your newly deployed data plane.
Changing your live organization’s API URL can disrupt access for existing users. If you are testing, create a new Braintrust organization for your data plane instead of updating your live environment.
If your deployment is accessed through a VPN or is otherwise on a private network (not accessible from the public internet), enable Data plane is on a private network. This enables Chrome’s Local Network Access permission handling, which is required for browser access to private network resources. When enabled, Chrome will prompt users to grant permission for the Braintrust UI to access your self-hosted data plane. See Grant browser permissions for details.
Select Save.
The UI will automatically test the connection to your new data plane. Verify that the ping to each endpoint is successful.
At least 3 private subnets across different availability zones
At least 1 public subnet
Internet and NAT gateways with properly configured route tables
The module manages its own security groups. To also use an existing quarantine VPC, set existing_quarantine_vpc_id and the corresponding existing_quarantine_private_subnet_*_id variables.
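A sketch of reusing an existing quarantine VPC. The VPC ID is hypothetical, and the subnet variables follow the existing_quarantine_private_subnet_*_id pattern noted above; check the module's variables for the exact names.

```hcl
module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  # Reuse an existing quarantine VPC instead of creating one
  existing_quarantine_vpc_id = "vpc-0123456789abcdef0" # hypothetical ID

  # Also set each existing_quarantine_private_subnet_*_id variable to the
  # corresponding private subnet in that VPC.
}
```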
These tags will be applied to all resources including Brainstore EC2 instances, volumes, and ENIs. The deployment name variable automatically prefixes resource names and applies a BraintrustDeploymentName tag across all resources.
Use the custom_tags parameter instead of the AWS provider’s default_tags configuration. Due to a Terraform limitation, default_tags are not applied to resources that use launch templates, such as Brainstore instances.
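For example, applying tags through the module (tag keys and values here are illustrative):

```hcl
module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  # Applied to all resources, including launch-template-based Brainstore
  # instances that AWS provider default_tags would miss.
  custom_tags = {
    Environment = "production"
    CostCenter  = "ml-platform"
  }
}
```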
Important for AWS: Avoid using burstable Redis instances (t-family instances like cache.t4g.micro) in production. These instances use CPU credits that can be exhausted during high-load periods, leading to performance throttling. Instead, use non-burstable instances like cache.r7g.large, cache.r6g.medium, or cache.r5.large for predictable performance. Even if these instances seem oversized initially, they provide consistent performance without the risk of CPU credit exhaustion.
The API Handler and AI Proxy Lambda functions default to 10240 MB (the Lambda maximum). You can reduce these to lower costs in environments with tighter memory quotas, though Braintrust recommends keeping the defaults for production workloads.
```hcl
module "braintrust-data-plane" {
  source = "github.com/braintrustdata/terraform-aws-braintrust-data-plane"

  api_handler_memory_limit = 10240 # default, valid range 1–10240 MB
  ai_proxy_memory_limit    = 10240 # default, valid range 1–10240 MB

  # ... other configuration ...
}
```
The brainstore_wal_footer_version variable controls the WAL footer format written by Brainstore. It defaults to "" (unset) and should not be changed outside of a planned upgrade sequence.
Do not set brainstore_wal_footer_version without following the upgrade guide. Setting it at the same time as a version bump can cause Brainstore nodes still rolling out to fail to read the new WAL format.
When kms_key_arn is configured, all managed S3 buckets (Brainstore, code-bundle, and Lambda responses) enforce blocked_encryption_types = ["NONE"], preventing unencrypted object uploads. This policy is applied automatically as of v4.5.0 — upgrading from an earlier version will include this change in your terraform plan.
As of v4.5.0, the x-bt-use-gateway header is included in the AI Proxy Lambda function URL CORS allowed headers. Browser clients can send this header to control gateway routing without triggering a CORS preflight rejection. No configuration is required.
Deploy the Braintrust data plane in your GCP project using the Braintrust Terraform module and Helm chart. This is the recommended way to self-host Braintrust on GCP.
The Braintrust Terraform module contains all the necessary resources for a self-hosted Braintrust data plane. A dedicated Google Cloud project for your Braintrust deployment is recommended but not required.
In provider.tf, configure your Google Cloud project and region.
In backend.tf, set up your remote backend (typically a GCS bucket).
In main.tf, customize the Braintrust deployment settings. The defaults are suitable for a large production-sized deployment. Adjust them based on your needs, but keep in mind the hardware requirements.
Create a helm-values.yaml file for your deployment. Refer to the Helm chart documentation for configuration options. Deploy the Braintrust Helm chart to your cluster:
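For example, a typical install command (the chart location, release name, and namespace are assumptions; substitute the values for your environment):

```shell
# Install or upgrade the Braintrust Helm chart using your values file
helm upgrade --install braintrust <path-or-repo/chart> \
  --namespace braintrust --create-namespace \
  -f helm-values.yaml
```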
The data plane requires a publicly reachable HTTPS endpoint with a valid TLS certificate. The Helm chart deploys the API as a ClusterIP service. You are expected to provide your own ingress solution that terminates TLS and routes traffic to the braintrust-api service on port 8000. Common approaches:
GCP Application Load Balancer with a Google-managed certificate (requires a custom domain)
GKE Gateway API with cert-manager and Let’s Encrypt
Cloud Run NGINX proxy with a VPC Connector for SSL termination (no custom domain required)
Istio/ASM Gateway - the Helm chart includes native VirtualService support (see virtualService in values.yaml)
Any reverse proxy or load balancer that terminates TLS and forwards HTTP to the API service
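As one sketch of the last option, a generic Kubernetes Ingress that terminates TLS and forwards to the API service. The hostname, ingress class, and TLS secret name are assumptions; the service name and port come from the Helm chart described above.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: braintrust-api
spec:
  ingressClassName: nginx            # assumes an NGINX ingress controller
  tls:
    - hosts:
        - braintrust.example.com     # hypothetical hostname
      secretName: braintrust-api-tls # hypothetical TLS secret
  rules:
    - host: braintrust.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: braintrust-api # ClusterIP service from the Helm chart
                port:
                  number: 8000
```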
After configuring your ingress, save the resulting HTTPS URL. You’ll need it to configure your Braintrust organization.
Connect your Braintrust organization to your newly deployed data plane.
Changing your live organization’s API URL can disrupt access for existing users. If you are testing, create a new Braintrust organization for your data plane instead of updating your live environment.
If your deployment is accessed through a VPN or is otherwise on a private network (not accessible from the public internet), enable Data plane is on a private network. This enables Chrome’s Local Network Access permission handling, which is required for browser access to private network resources. When enabled, Chrome will prompt users to grant permission for the Braintrust UI to access your self-hosted data plane. See Grant browser permissions for details.
Select Save.
The UI will automatically test the connection to your new data plane. Verify that the ping to each endpoint is successful.
The Terraform module automatically configures Workload Identity for your GKE cluster and creates two service accounts with the following IAM grants. The module creates two GCS buckets:
Brainstore bucket (<deployment_name>-brainstore-*): Brainstore data storage.
API bucket (<deployment_name>-api-*): Contains two storage paths — code-bundle/ (API layer writes) and brainstore-cache/ (ephemeral Brainstore cache, automatically deleted after 1 day by a GCS lifecycle rule).
The brainstore-cache/ objects are ephemeral and managed automatically. Operators do not need to manage or back up this data.
Choose one of the following authentication methods for the API service:
Native GCS auth (recommended)
Native GCS authentication uses the @google-cloud/storage SDK and Workload Identity. This is the recommended approach for enhanced security, as it eliminates the need to manage service account keys. Requirements:
Helm chart version 3.1.0 or later
Workload Identity configured (automatic with Terraform module)
Kubernetes secrets: For native GCS authentication, create secrets without GCS credentials:
Refer to the Terraform outputs for the connection strings. The Brainstore license key can be found at Settings > Data plane. Only organization owners can access this page.
Helm configuration: In your Helm values file, enable native GCS authentication and configure the Google service account:
The Terraform module outputs the service account email as braintrust_service_account. Run terraform output braintrust_service_account to get the full email address. The enableGcsAuth setting defaults to false for backwards compatibility. Contact Braintrust if you want to enable native GCS authentication by default.
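A minimal values sketch: the service account email is whatever braintrust_service_account outputs, and the address shown here is hypothetical.

```yaml
api:
  enableGcsAuth: true
  googleServiceAccount: braintrust@my-project.iam.gserviceaccount.com
```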
S3 compatibility mode (legacy)
S3 compatibility mode uses HMAC keys to access GCS through the S3-compatible API. This is the legacy authentication method. Kubernetes secrets: For S3 compatibility mode, include HMAC credentials:
Refer to the Terraform outputs for the connection strings. The Brainstore license key can be found at Settings > Data plane. Only organization owners can access this page.
Helm configuration: In your Helm values file, ensure enableGcsAuth is not set or set to false. Do not configure googleServiceAccount when using HMAC keys:
```yaml
api:
  enableGcsAuth: false # or omit this line
```
Helm chart v5.0.1+ automatically sets AWS_REQUEST_CHECKSUM_CALCULATION and AWS_RESPONSE_CHECKSUM_VALIDATION to WHEN_REQUIRED when enableGcsAuth is disabled. This ensures AWS SDK compatibility with the GCS S3-compatible endpoint. No additional configuration is required.
The Terraform module outputs the service account email as brainstore_service_account. Run terraform output brainstore_service_account to get the full email address.
If Brainstore needs to access GCS buckets or other GCP resources in another project that are restricted to a specific service account identity, use brainstore_impersonation_targets to grant the Brainstore Kubernetes service account the ability to impersonate one or more Google Cloud service accounts.
This grants roles/iam.serviceAccountTokenCreator on each target service account to the Brainstore Kubernetes service account, enabling Brainstore to generate short-lived tokens and act as those accounts. Values must use the full resource name format projects/{project_id}/serviceAccounts/{service_account_email}, not bare email addresses. The default is [] (no impersonation).
Target service accounts must exist before running terraform apply. You must also have iam.serviceAccounts.setIamPolicy permission on each target service account, or a project-level IAM admin role on the target project.
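For example, granting impersonation of one target account (the target project and service account email are hypothetical):

```hcl
brainstore_impersonation_targets = [
  # Full resource name format is required, not a bare email address
  "projects/other-project/serviceAccounts/data-reader@other-project.iam.gserviceaccount.com",
]
```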
Deploy the Braintrust data plane in your Azure subscription using the Braintrust Terraform module and Helm chart. This is the recommended way to self-host Braintrust on Azure.
Requirements: Terraform >= 1.10.0 and the azurerm provider ~> 4.0 are required. If you have an existing deployment using azurerm 3.x, run terraform init -upgrade before applying and review the azurerm v4 upgrade guide.
The Braintrust Terraform module contains all the necessary resources for a self-hosted Braintrust data plane. A dedicated Azure subscription for your Braintrust deployment is recommended but not required.
In provider.tf, configure your Azure subscription and tenant details.
In terraform.tf, set up your remote backend (typically Azure Blob Storage).
In main.tf, customize the Braintrust deployment settings. The defaults are suitable for a large production-sized deployment. Adjust them based on your needs, but keep in mind the hardware requirements. The module provisions two AKS node pools:
brainstore pool (aks_brainstore_pool_vm_size): runs Brainstore pods. Must be a VM SKU with local NVMe SSD (e.g. Standard_D32ds_v6). The Azure Container Storage extension is automatically installed to configure RAID0 across the local disks.
services pool (aks_services_pool_vm_size): runs API and other application pods. Does not require local SSD (e.g. Standard_D16s_v6).
Initially set enable_front_door = false in main.tf. You’ll enable this later after configuring the load balancer.
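These settings might look like this in main.tf, using the example SKUs above (other module arguments omitted):

```hcl
module "braintrust-data-plane" {
  # ...

  aks_brainstore_pool_vm_size = "Standard_D32ds_v6" # requires local NVMe SSD
  aks_services_pool_vm_size   = "Standard_D16s_v6"  # no local SSD needed

  enable_front_door = false # enable after the load balancer is configured
}
```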
The Terraform module configures PostgreSQL extensions (pg_cron, pg_partman) and sets cron.database_name to the Braintrust database. These are static parameters that require a server restart to take effect. The Terraform provider is configured to not restart the server automatically, since automatic restarts on configuration changes could cause unintended downtime in production. After your first terraform apply, restart the PostgreSQL server before proceeding:
```shell
az postgres flexible-server restart \
  --resource-group <resource-group-name> \
  --name <postgres-database-server-name>
```
You can find the resource group and server name in your Terraform outputs.
This step is only required on the initial deployment. Subsequent terraform apply runs do not require a restart unless you modify the PostgreSQL extension configuration.
In the Azure Portal, find the private link service named <deployment>-aks-api-pls and manually approve it.
This manual approval step is an Azure platform requirement — Front Door cannot automatically approve private link connections to resources in a different subscription or tenant. The deployment will appear to succeed but Front Door traffic will not flow until the connection is approved.
Front Door deployment takes up to 45 minutes after the Terraform apply completes. Wait for the deployment to finish before proceeding.
Connect your Braintrust organization to your newly deployed data plane.
Changing your live organization’s API URL can disrupt access for existing users. If you are testing, create a new Braintrust organization for your data plane instead of updating your live environment.
If your deployment is accessed through a VPN or is otherwise on a private network (not accessible from the public internet), enable Data plane is on a private network. This enables Chrome’s Local Network Access permission handling, which is required for browser access to private network resources. When enabled, Chrome will prompt users to grant permission for the Braintrust UI to access your self-hosted data plane. See Grant browser permissions for details.
Select Save.
The UI will automatically test the connection to your new data plane. Verify that the ping to each endpoint is successful.
The resource group Terraform resource address changed from azurerm_resource_group.main to azurerm_resource_group.main[0]. A moved block in moved.tf handles this automatically — no manual state manipulation is required. Before applying, run terraform plan and confirm the resource group shows as moved rather than destroyed and recreated. If you see a destroy planned, do not apply without investigating.
Upgrading from Terraform Azure module v0.9.0 to v1.0.0 requires several breaking changes:
Node pool variables renamed: Replace aks_user_pool_vm_size with aks_brainstore_pool_vm_size and aks_services_pool_vm_size, and aks_user_pool_max_count with aks_brainstore_pool_max_count and aks_services_pool_max_count. The existing user node pool will be destroyed and replaced by two new pools during the upgrade — drain and reschedule workloads before applying.
azurerm provider upgrade: Run terraform init -upgrade to update to azurerm ~> 4.0 before applying. Review the azurerm v4 upgrade guide for any state migrations needed (e.g. the enable_rbac_authorization → rbac_authorization_enabled rename on Key Vault resources).
brainstore_license_key now required: The module will fail at plan time if this variable is not set. See the Configure Brainstore license step above.
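A before/after sketch of the renamed node pool variables (VM sizes and counts are illustrative):

```hcl
# v0.9.0: single user pool
aks_user_pool_vm_size   = "Standard_D16s_v6"
aks_user_pool_max_count = 4

# v1.0.0: one pool becomes two
aks_brainstore_pool_vm_size   = "Standard_D32ds_v6"
aks_brainstore_pool_max_count = 4
aks_services_pool_vm_size     = "Standard_D16s_v6"
aks_services_pool_max_count   = 4
```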