# Operations & Reference Guide

## not.bot™ Verify: Operations & Reference Guide

This document covers the architecture, configuration, scaling, and failure modes of a running not.bot Verify deployment. It is reference material, not a procedure. For the deploy-from-zero walkthrough, see the Deployment Checklist. For the trust model and privacy properties, see the Architecture & Privacy Guide.

Each component section covers what the component does, the configuration that matters in production, the failure modes you will see, and the operational behavior that surprises new operators. Cross-references to Deployment Checklist phases use the form (DC §X) so you can jump back to the procedure when you need to change something.

---

## Contents

1. [Overview and architecture recap](#1-overview-and-architecture-recap)
2. [OpenBao](#2-openbao)
3. [Admin service](#3-admin-service)
4. [Signature server](#4-signature-server)
5. [Chia node](#5-chia-node)
6. [SDK](#6-sdk)
7. [PostgreSQL and Keycloak (prerequisites)](#7-postgresql-and-keycloak-prerequisites)
8. [Observability and alerting](#8-observability-and-alerting)
9. [Support](#9-support)

---

## 1. Overview and architecture recap

You operate four components, all in your own Kubernetes cluster:

- **OpenBao** holds every DID's private key. Sealed on every restart.
- **Admin service** runs as a singleton, exposed on your internal network. Hosts the management UI, manages the DID pool, reports MAU counts.
- **Signature servers** run behind an internal load balancer, reached by your SDK adapter rather than the public internet. Each one holds one signature DID and serves user verification requests.
- **Chia nodes** follow the Chia blockchain. Used by the admin service for on-chain operations and by signature servers for chain state queries.

You also rely on two services you provide: PostgreSQL (application data) and Keycloak (admin operator login). Neither is part of not.bot Verify; both are prerequisites.

The Architecture & Privacy Guide covers the verification flow in detail. The short version: your application backend uses the SDK to start a signature request, the user opens a universal link to the not.bot app, the not.bot app talks to your signature servers through your SDK adapter, and your backend gets a callback when verification finishes. No traffic in this flow reaches Julia Social.

Julia Social is a runtime dependency only at signature server startup, when the server acquires its honest.bot™ credential. After that, a running deployment operates with no Julia Social connectivity required. See Architecture & Privacy Guide §10 for the full breakdown.

---

## 2. OpenBao

### 2.1 What it does

OpenBao stores the BLS12-381 private key for every DID in your deployment: your business DID and every signature DID in your pool. The admin service signs DID transactions through OpenBao. Each signature server signs every user verification response through OpenBao. No private key leaves the cluster.

The Chiakeys plugin (`vault-plugin-secrets-bls`) extends OpenBao to support Chia's BLS12-381 signature scheme. OpenBao does not include this support out of the box, so the deployment uses a custom OpenBao image with the plugin baked in. The chart references this image by digest and the plugin is registered automatically at startup; see Deployment Checklist Appendix A if you need to rebuild it from source.

### 2.2 The reseal problem

OpenBao seals itself on every pod restart: cluster upgrades, node failures, manual restarts, eviction. A sealed OpenBao cannot sign. A signature server cannot serve a single user request while OpenBao is sealed. Until an operator unseals it, the verification flow is offline.

The unseal key from initialization is what unseals it (DC §1.2). Keep that key in whatever break-glass secret store your team uses for high-blast-radius credentials, and keep it accessible to whoever is on call.

If you cannot tolerate the gap between a restart and a human running `bao operator unseal`, look at OpenBao auto-unseal. The upstream OpenBao docs cover the supported backends (cloud KMS services, HSMs, transit unseal). Auto-unseal moves the trust anchor from "an operator has the key" to "the cluster's KMS principal can fetch it," which is a different threat model. Pick the one that matches your security posture.
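If you keep unseal manual, the gap is only as short as your detection. The sketch below is one way to poll for the sealed state. It assumes your OpenBao build exposes the Vault-compatible, unauthenticated `sys/seal-status` endpoint and that the response carries a boolean `sealed` field. The function names are illustrative, not part of not.bot Verify.

```python
import json
import urllib.request


def parse_seal_status(body: str) -> bool:
    """Return True when a seal-status JSON body reports the instance as sealed."""
    status = json.loads(body)
    # Treat a missing field as sealed so a malformed response fails safe.
    return bool(status.get("sealed", True))


def fetch_seal_status(base_url: str, timeout: float = 5.0) -> bool:
    """Query <base_url>/v1/sys/seal-status and report whether OpenBao is sealed.

    base_url is whatever address your OpenBao service answers on (assumption:
    reachable over plain HTTP or HTTPS from where this runs).
    """
    url = base_url.rstrip("/") + "/v1/sys/seal-status"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_seal_status(response.read().decode("utf-8"))
```

Calling `fetch_seal_status(...)` from a probe or cron job and paging when it returns `True` shortens the time between a restart and a human running `bao operator unseal`.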

### 2.3 1-of-1 key shares

The deployment uses a single key share with a threshold of one (`-key-shares=1 -key-threshold=1`). This is correct for not.bot Verify, including production. The unseal key and root token are operator credentials for an instance that holds only your own DID keys; the trust model does not benefit from Shamir-splitting them across multiple humans.

If your security team requires multi-share unseal as policy, OpenBao supports it. Set `-key-shares=N -key-threshold=M` at init time. You will then need M holders present to unseal after every restart, which makes the reseal problem worse.

### 2.4 Namespacing and policies

The admin service operates inside its own OpenBao namespace, with a scoped token that grants access to two paths inside that namespace: `chiakeys/*` for BLS keys, and `secret/data/*` / `secret/metadata/*` for non-DID secrets such as Chia node CA cert/key pairs (DC §1.3). The token is created with `-orphan` so it survives root revocation. The root token's only purpose is bootstrapping that policy. After Phase 1, the root token is revoked (DC §1.4). If you need administrative access later, generate a new root with the unseal key using `bao operator generate-root`, do the work, and revoke it again.

Nothing in not.bot Verify uses the root token for normal operation. If you ever see the root token in an application configuration, that is a configuration error.

### 2.5 Common failures

**OpenBao sealed.** Logs from the admin service and every signature server show connection failures. Run `bao operator unseal <UNSEAL_KEY>`. Investigate why the pod restarted.

**Sealed during signature server startup.** The signature server cannot complete its startup sequence and exits. The DID it was assigned stays in "assigned" state until the heartbeat timeout releases it. See §4.4.

**Plugin SHA mismatch.** Only applies if you rebuilt the OpenBao image per Deployment Checklist Appendix A. If the Chiakeys plugin SHA-256 in `values.yaml` does not match the binary in your custom image, OpenBao refuses to register the plugin and the chiakeys engine will not enable. Recompute the SHA against the binary in the image (`docker run --rm --entrypoint sha256sum <YOUR_IMAGE> /openbao/plugins/vault-plugin-secrets-bls`), update `values.yaml`, and reinstall the chart. The shipped chart's SHA matches its shipped image; no action is needed when using the default image.
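If you have already extracted the plugin binary from a rebuilt image, the same comparison can be done locally. This is a minimal sketch using only the standard library; the file path and the expected value you pass in are your own, and nothing here reads `values.yaml` for you.

```python
import hashlib
import hmac


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sha_matches(expected_hex: str, actual_hex: str) -> bool:
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return hmac.compare_digest(
        expected_hex.strip().lower(), actual_hex.strip().lower()
    )
```

A `False` from `sha_matches` means the value in `values.yaml` and the binary in the image have drifted apart, which is the condition under which OpenBao refuses to register the plugin.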

**Unseal key lost.** Unrecoverable. The data in OpenBao is gone. You will need to mint a new business DID, mint new signature DIDs, re-verify your domain, and replace every signature server. Plan accordingly: store the unseal key somewhere you will still have access to it after a personnel change or a disaster.

---

## 3. Admin service

### 3.1 Role

The admin service is the management plane. It does four things:

1. Hosts the operator-facing UI for registering Chia nodes, creating the business DID, minting signature DIDs, and uploading `deployment-config.json` once at first login.
2. Issues DID assignments to signature servers when they start up, and tracks the pool's "available" / "assigned" state in PostgreSQL.
3. Delegates the domain credential from your business DID to each signature server's signature DID at assignment time.
4. Reports your monthly active user count to Julia Social once an hour.

It is a singleton. Run one replica. The DID assignment flow assumes a single coordinator; running two replicas without coordination would race on assignments.

### 3.2 Internal-only exposure

Operators reach the admin service from your internal network. There is no public access path. The chart exposes the service through a Kubernetes `Service` of type `LoadBalancer`, annotated for internal-only placement. Your cloud provider sees the annotation and provisions a private endpoint with a stable hostname.

Per provider:

- **AWS:** `service.beta.kubernetes.io/aws-load-balancer-scheme: "internal"` provisions a private NLB or CLB. For TLS termination at the LB, add `service.beta.kubernetes.io/aws-load-balancer-ssl-cert: "<ACM_CERT_ARN>"` and `service.beta.kubernetes.io/aws-load-balancer-ssl-ports: "https"`. The cert must live in the same region as the LB. The LB type is controlled by `service.beta.kubernetes.io/aws-load-balancer-type` (NLB recommended).
- **GCP / GKE:** `networking.gke.io/load-balancer-type: "Internal"` provisions an internal TCP/UDP load balancer by default. For TLS, GCP's preferred path is a `BackendConfig` plus `ManagedCertificate` resource; an internal HTTP(S) load balancer requires a different setup (Gateway API or an Ingress with ILB scheme). If you are doing TLS termination at the LB, expect to manage one or two extra Kubernetes resources beyond the Service annotation.
- **Azure:** `service.beta.kubernetes.io/azure-load-balancer-internal: "true"` places the LB on the internal AKS network. The Azure Load Balancer itself is L4 and does not terminate TLS; terminating at the Azure edge means putting an Application Gateway in front, with the cert uploaded to a key vault and referenced through a listener configuration. For most teams the simpler path is terminating TLS in nginx-ingress or a similar in-cluster proxy and leaving the LB at L4.

Across providers, the admin service inside the pod serves plain HTTP on port 8080. TLS terminates at the LB. The cert reference goes in Decision **J** of the Pre-flight Checklist.

> Cloud provider TLS termination details change between releases. Use the annotations above as a starting point and verify against your provider's current documentation before going to production. GCP's internal HTTPS LB and Azure's LB cert support have both moved in recent releases.

After the LB exists, point internal DNS for the admin hostname (Pre-flight Decision **H**) at the LB's stable hostname. The hostname must match `app.baseUrl` in the admin service `values.yaml` and the redirect URIs in Keycloak. Pre-flight Decision H is the single source of truth; Phase 3 (Keycloak) and Phase 5 (admin service) both read from it.

### 3.3 Keycloak integration

The admin service uses two Keycloak URLs, and they have different jobs.

`issuerUri` is what the admin service backend uses to fetch Keycloak's public keys and validate the tokens it receives. This is server-to-server traffic inside the cluster. HTTP on a ClusterIP is fine here because the traffic never leaves the cluster network.

`externalUrl` is what the operator's browser hits during login. The operator types a password and gets a bearer token back. Plaintext is not appropriate. Use HTTPS.

The two URLs can resolve to the same Keycloak instance. They almost always do. The split exists because the URL the backend uses (in-cluster) and the URL the browser uses (external) are usually different addresses for the same service.

The single most common admin service failure is an issuer mismatch. The flow looks like this: the operator clicks login, signs in at Keycloak, returns to the admin service with a token, and the backend rejects the token because its `iss` claim does not match `issuerUri`. The `iss` claim is whatever Keycloak sees as its own hostname when it issues the token, which depends on Keycloak's hostname configuration (`KC_HOSTNAME` or the equivalent). If the operator's browser reaches Keycloak at `https://keycloak.example.com` but Keycloak is configured to advertise itself as `http://keycloak.keycloak.svc.cluster.local:8080`, the token's issuer will be the in-cluster URL and the admin service has to accept that. Set `issuerUri` to whatever Keycloak puts in the `iss` claim, not what you wish it would put there.

If logins fail, decode the JWT and read the `iss` claim. `jwt.io` decodes in the browser, and any local decoder works if you would rather not paste a bearer token into a web page. Whatever you find there is what `issuerUri` must equal.
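For a local check, the payload of a JWT is just base64url-encoded JSON, so the `iss` claim can be read without any library. The sketch below does not verify the signature, which is fine for diagnosis but not for authentication. The exact-string comparison is an assumption about how strictly the backend matches `issuerUri`.

```python
import base64
import json


def jwt_issuer(token: str) -> str:
    """Return the `iss` claim of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT (header.payload.signature)")
    payload_b64 = parts[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["iss"]


def issuer_matches(token: str, issuer_uri: str) -> bool:
    """Compare the token's issuer to a configured issuerUri, character for character."""
    return jwt_issuer(token) == issuer_uri
```

If `issuer_matches` returns `False`, print `jwt_issuer(token)` and copy that value into `issuerUri` exactly, scheme and port included.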

### 3.4 Database schema

The admin service owns the main application schema. It creates most tables on first startup as the admin user (Decision **F**, default `notbot_admin`). The signer user (Decision **G**, default `notbot_signer`) needs read/write access on those tables; the deployment grants this through `ALTER DEFAULT PRIVILEGES` (DC §2).

The current signature server also runs its own migrations at startup, so the signer user temporarily needs `CREATE` on the `public` schema. DC §2 grants that explicitly. If the product later moves all schema creation into the admin service, remove that grant and return the signer user to runtime-only DML.

Both users connect to the same database in the standard deployment. Running them against separate databases means setting up replication, which not.bot Verify does not provide.

If schema migrations on startup fail, the pod logs name the migration. Most migration failures trace back to a permissions problem on the existing schema (someone changed grants out of band) or a database version mismatch (the chart expects PostgreSQL 14+).

### 3.5 Configuration surface

The admin service `values.yaml` is the largest configuration block in the deployment. Most fields are documented inline. Two points worth flagging:

- The chart reads production secrets through Kubernetes Secret references. Create the referenced Secret before install and keep only Secret names and keys in `values.yaml`. Do not check generated credentials or populated local overlays into version control.
- The `juliaServer.host` field defaults to `identity.julia.social`. Beta deployments may be pointed at `identity.juliasocial-dev.com`. Verify with support which one applies to your subscription.

### 3.6 Pending chart changes

The Deployment Checklist reflects the admin service chart after three pending changes the dev team has agreed to:

1. Remove the unused `ingress` block from `values.yaml`. The block exists today but no template consumes it.
2. Add `service.annotations` support to the service template, so the internal-LB annotations described in §3.2 render through.
3. Change the default `service.type` from `NodePort` to `ClusterIP` (or `LoadBalancer`), so a default install does not open port 30880 on every node.

If the admin-service `values.yaml` still has an `ingress` block, or the service defaults to `NodePort`, you have an older chart and your behavior may differ from this guide.

### 3.7 Common failures

**CrashLoopBackOff at first deploy.** Almost always an upstream problem: PostgreSQL unreachable, Keycloak realm or client secret wrong, OpenBao sealed or token invalid, in-cluster DNS not resolving one of the above. The pod logs name the failing dependency.

**Login redirects to a working Keycloak page but returns a 401 or 403 to the admin service.** Issuer mismatch. See §3.3.

**`deployment-config.json` upload fails.** The API key in the file is one-time-use, and any upload error consumes it upstream regardless of what the error message says. Common causes: the cluster cannot reach `billingServerUrl` (egress firewall), `customerId` / `organizationName` does not match Julia Social's records, or a transient-looking 5xx-shape error such as "Billing service is currently experiencing issues". The last one is misleading because the failure persists on your side even after the upstream recovers. The fix path is the same in all cases: email `support@julia.social` for a re-issued welcome email containing a fresh API key, then retry the upload (DC §6.3) with the new file. Investigating the root cause of the original error (firewall, name mismatch) before re-issuing is fine, but do not retry the upload with the old key; it will always fail.

**Vault login reports "No online Chia node available."** The admin service could not find a registered Chia node with online status for the Vault/JNI login path. Register a Chia node first, then confirm the node health check marks it online before retrying the domain or signature-DID operation.

**Hourly MAU report quietly stops.** The admin service queues failed reports and retries. If you watch the logs and see retries stacking up, your egress to `billingServerUrl` is blocked. No data is lost; the queue drains once connectivity returns. This does not affect the verification flow.

---

## 4. Signature server

### 4.1 Role

A signature server holds one signature DID for its lifetime, signs user verification responses with that DID's key (through OpenBao), and serves the request flow described in the Architecture & Privacy Guide §5. You run as many as your throughput requires, behind an nginx ingress.

The deployment is stateless from the operator's perspective. Each replica picks up its DID from the pool at startup and releases it at shutdown. Replacing a replica is a normal operation.

### 4.2 The DID pool

The pool is the list of pre-minted signature DIDs in the admin interface. Each DID is in one of two states:

- **Available.** Minted, not held by any running server, ready to be assigned at the next server startup.
- **Assigned.** Held by a running signature server, with the server's identifier and the timestamp of its most recent heartbeat.

A DID stays Assigned until the signature server holding it shuts down cleanly (releasing the DID) or stops sending heartbeats long enough for the admin service to release it automatically.

The pool auto-sizes. The admin service maintains at least two unallocated DIDs at all times. When a signature server starts up and takes a DID, the admin service mints another to keep the unallocated count at two. Each minting is an on-chain transaction (around two minutes), so the buffer of two covers the gap between a startup picking up a DID and the admin service producing a fresh one. You do not size the pool yourself; your replica count and HPA ceiling are the cap.
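The refill rule can be modelled in a few lines. This is a toy model for reasoning about the buffer, not the admin service's implementation. In particular, counting in-flight mints toward the target is an assumption made here so that the model does not over-mint.

```python
from dataclasses import dataclass

TARGET_UNALLOCATED = 2  # buffer size stated in this guide


@dataclass
class Pool:
    available: int = 0  # minted and unassigned
    minting: int = 0    # on-chain mints submitted but not yet confirmed
    assigned: int = 0   # held by running signature servers

    def mints_needed(self) -> int:
        """How many new mints keep available + in-flight at the target."""
        return max(0, TARGET_UNALLOCATED - (self.available + self.minting))

    def assign(self) -> bool:
        """A server starts up; returns False when no DID is Available."""
        if self.available == 0:
            return False
        self.available -= 1
        self.assigned += 1
        self.minting += self.mints_needed()
        return True

    def confirm_mint(self) -> None:
        """An on-chain mint lands (around two minutes after submission)."""
        if self.minting > 0:
            self.minting -= 1
            self.available += 1

    def release(self) -> None:
        """A server shuts down cleanly and returns its DID."""
        if self.assigned > 0:
            self.assigned -= 1
            self.available += 1
```

Starting from `Pool(available=2)`, two back-to-back startups succeed and a third has to wait until `confirm_mint()` runs, which mirrors the burst behavior described for scale-up.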

You mint the first DID once during initial setup (DC §7). The admin service handles every DID after that.

### 4.3 Startup sequence

When a signature server pod starts, it goes through these steps in order. A failure in any step exits the pod, and the logs name the step.

1. **DID assignment.** Contacts the admin service. The admin service picks an Available DID, marks it Assigned, and returns the DID along with a scoped OpenBao token, the signer database credentials, and the connection details (including TLS material) for every registered Chia node.
2. **OpenBao connection.** Connects to OpenBao with the scoped token and confirms the key is reachable.
3. **Database connection.** Connects to PostgreSQL with the signer credentials.
4. **Domain credential delegation.** The admin service delegates your business DID's domain credential to the assigned signature DID. This is what proves to the not.bot app that this server represents your domain.
5. **honest.bot credential acquisition.** Contacts Julia Social. Julia Social issues an honest.bot credential bound to this DID. No two running processes can hold the same credential at once.
6. **Ready.** The server begins serving verification requests on port 8080.

Common failures by step:

- **Step 1.** No Available DIDs. The admin service is failing to mint replacements. Check the admin service logs and the registered Chia node's sync state; minting requires a synced node.
- **Step 2.** OpenBao sealed (unseal it) or unreachable from this namespace (NetworkPolicy issue).
- **Step 3.** PostgreSQL credentials wrong, host unreachable, or the signer user (Decision **G**) lacks the read/write and schema `CREATE` grants from DC §2. The current signer runs migrations at startup; without `CREATE` it fails with `permission denied for schema public`.
- **Step 4.** The admin service is unreachable or your business DID is missing the domain credential. Re-verify your domain (DC §6.5).
- **Step 5.** Egress to Julia Social blocked. Check firewall rules for outbound HTTPS to `identity.julia.social` (or the dev hostname for beta deployments).

### 4.4 Heartbeats and DID release

Each signature server sends a heartbeat to the admin service every minute. The heartbeat updates the DID's `last_heartbeat` timestamp in the admin interface and confirms the server is alive.

On clean shutdown (SIGTERM, then graceful drain), the server releases its DID before exiting. The DID returns to Available immediately, and the admin service does not need to mint a replacement.

On a crash or network partition, the server cannot release its DID. The admin service watches for missed heartbeats and releases the DID after a timeout. Until that timeout fires, the DID is stuck Assigned. The auto-sizing logic continues to maintain two unallocated DIDs in parallel, so a replacement server can start up against a freshly minted DID without waiting for the stuck one.
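When you are looking at a stale `last_heartbeat` in the admin interface, the arithmetic is simple enough to script. The one-minute interval comes from this guide; the release timeout is not stated here, so the sketch takes it as a parameter and any value you pass is your own assumption.

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(minutes=1)  # stated in this guide


def missed_heartbeats(last_heartbeat: datetime, now: datetime) -> int:
    """Count whole heartbeat intervals elapsed since the last heartbeat."""
    return max(0, (now - last_heartbeat) // HEARTBEAT_INTERVAL)


def should_release(last_heartbeat: datetime, now: datetime, timeout: timedelta) -> bool:
    """True once the silence exceeds the timeout (timeout value is hypothetical)."""
    return now - last_heartbeat > timeout
```

Both timestamps must use the same timezone convention; mixing naive and timezone-aware `datetime` values raises a `TypeError`.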

### 4.5 Scaling

The signature server chart supports a Horizontal Pod Autoscaler. Each new replica goes through the startup sequence in §4.3 the same way the first one did. The ingress routes requests across all running replicas.

Scale up: HPA adds replicas. Each new replica takes one of the two unallocated DIDs, and the admin service mints another to refill. The two-DID buffer absorbs normal scale-up rates. A burst that adds replicas faster than the admin service can mint will see the later replicas wait at startup until a fresh DID lands; the on-chain transaction is around two minutes.

Scale down: HPA terminates replicas. Each terminating replica receives SIGTERM, drains in-flight requests, releases its DID, and exits. The DID returns to Available. Helm rolling updates work the same way: the old pod drains and releases, the new pod picks up an Available DID. The admin service tracks unallocated count across both flows and does not over-mint.

### 4.6 Network requirements

The signature server pod needs egress to:

- **Admin service** (in-cluster). DID assignment and heartbeats.
- **OpenBao** (in-cluster). Every signature request signs through OpenBao, so this connection is hot path, not just startup.
- **PostgreSQL** (in-cluster or external). Reading and writing signature state.
- **Chia nodes** (in-cluster). Mutual TLS using the CA and cert the admin service provided at startup.
- **Julia Social** (external HTTPS). Startup honest.bot credential acquisition, then once an hour for the diagnostic count. After startup, this is the only Julia Social traffic the signature server generates.

If your cluster enforces NetworkPolicies, every path above needs an explicit allow rule. The admin-to-signer namespace boundary trips up new deployments more than any other network restriction.

### 4.7 Bootstrap configuration

The signature server needs a small amount of static configuration to reach the admin service at all. Once that initial call lands, most runtime config (DID, OpenBao token, database credentials, Chia node host/port/TLS material) is supplied by the admin service in its `signatureServerSetup` response and never touches the chart.

What's in the bootstrap:

**Required to start (binary panics or fails to contact admin without these):**

- `API_KEY` — inbound API key the signer requires from clients (SDK adapter, not.bot app). Set from the operator-created Secret referenced by `deployment.apiKeySecret`. Without this set the binary panics on startup with `API_KEY Is Required: NotPresent`.
- `ADMIN_API_KEY` — outbound API key the signer presents to the admin service. Set from the same operator-created Secret referenced by `deployment.apiKeySecret`; it should be the same `businesses.api_key` value used for `API_KEY`.
- `ADMIN_HOSTNAME` — admin service DNS name inside the cluster. Set via `deployment.adminService.hostname` in `values.yaml`. If unset, the binary defaults to `localhost`, which is wrong inside the signer pod.
- `ADMIN_SECURE` — whether the signer uses HTTPS to reach admin. Set via `deployment.adminService.secure`; the standard in-cluster admin service is plain HTTP, so the chart renders `false`.

**Recommended but optional, with safe defaults:**

- `ADMIN_PORT` — admin service port. The chart renders `8080`.
- `CHIA_NETWORK` — `Mainnet`. not.bot Verify operates on Chia mainnet only; do not change this value.
- `RUST_LOG` — log level filter, default `info`.

**Do NOT set** (the binary populates these from the admin response and overriding them either has no effect or actively breaks the deployment):

- `CHIA_FULL_NODE`, `CHIA_FULL_NODE_PORT` — pulled from `chia_nodes[0].host/port` in the admin response.
- `PRIVATE_CA_CRT`, `PRIVATE_CA_KEY` — pulled from `chia_nodes[0]` inline PEM in the admin response; consumed by the Chia RPC client at startup, then unset.
- `CHIA_FULL_NODE_SSL` — when set, the binary switches from inline-PEM-from-admin mode to filesystem-paths mode, looking for `<dir>/full_node/private_full_node.{crt,key}` and `<dir>/ca/private_ca.crt`. If those files don't exist (the chart doesn't provision them), startup fails with `No such file or directory` from the SSL loader.

**Inert in this binary** (referenced elsewhere in the workspace but never read here):

- `SSL_ROOT_CERTS`, `SSL_CERTS`, `SSL_PRIVATE_KEY`.

Optional signer listener values `HOSTNAME` / `PORT` default to `0.0.0.0:8080`; the chart's service already targets 8080, so leave them unset. Optional SMTP alert values named `SIGNATURE_*` are only needed if you configure email alerts.

The signer chart renders non-secret bootstrap config into a ConfigMap and reads both API-key env vars from an operator-created Kubernetes Secret. Database credentials, OpenBao token, DID assignment, and Chia node host/port/TLS material arrive from the admin `signatureServerSetup` response, not from chart env.
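The rules in this section lend themselves to a pre-deploy lint. The sketch below checks a rendered environment (for example, the merged ConfigMap and Secret keys) against the variable lists above. The function name and message wording are illustrative; the variable names are the ones documented in this section.

```python
from typing import List, Mapping

REQUIRED = ("API_KEY", "ADMIN_API_KEY", "ADMIN_HOSTNAME", "ADMIN_SECURE")
DO_NOT_SET = (
    "CHIA_FULL_NODE",
    "CHIA_FULL_NODE_PORT",
    "PRIVATE_CA_CRT",
    "PRIVATE_CA_KEY",
    "CHIA_FULL_NODE_SSL",
)


def check_bootstrap_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems; an empty list means the bootstrap looks sane."""
    problems = []
    for name in REQUIRED:
        if not env.get(name):
            problems.append(f"missing required variable: {name}")
    for name in DO_NOT_SET:
        if name in env:
            problems.append(f"must not be set: {name}")
    # localhost is the binary's fallback and is wrong inside the signer pod.
    if env.get("ADMIN_HOSTNAME") == "localhost":
        problems.append("ADMIN_HOSTNAME is localhost")
    # The guide states Mainnet is the only supported network.
    if env.get("CHIA_NETWORK", "Mainnet") != "Mainnet":
        problems.append("CHIA_NETWORK must be Mainnet")
    return problems
```

Running this against the output of `helm template` before install catches the `CHIA_FULL_NODE_SSL` mistake without waiting for a pod to fail at startup.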

### 4.8 Common failures

**Pod stuck in CrashLoopBackOff.** Read the logs. The startup sequence (§4.3) names the failing step.

**Pod runs but the admin interface shows the DID as Available, not Assigned.** The pod is not heartbeating. Either it is wedged after starting (rare; check the logs for repeated errors) or the heartbeat path to the admin service is broken (NetworkPolicy added recently?).

**DID stuck in Assigned with stale heartbeat.** The server holding it died without releasing. Wait for the heartbeat timeout; the admin service will release it. New servers are not blocked while you wait, since the auto-sized buffer keeps unallocated DIDs available.

**429 or 503 from the signature server.** Either OpenBao is sealed (every signature now fails) or you hit a Chia node connectivity problem (chain state queries fail). The pod logs name which.

---

## 5. Chia node

### 5.1 Role

A Chia node follows the Chia blockchain on behalf of your deployment. The admin service uses one to register DIDs on-chain. Each signature server uses one (or more) to verify chain state when serving user requests. Run at least one; three is recommended.

The Chia node is not part of the not.bot Verify Helm bundle. The charts and image build live in a public repository (`github.com/GalactechsLLC/helm-chia-nodes`). DC §4 covers the deployment.

### 5.2 Sync behavior

The chart ships with DB checkpointing on. On first deploy, the node downloads a recent compressed chain snapshot from `torrents.chia.net` and starts from there, which cuts initial sync time dramatically compared to syncing from genesis. The node then walks forward from the checkpoint to the current chain head before it can answer queries about current state.

**Expect 12–24 hours of wall time for first sync** on commodity cloud SSDs. On a t3.xlarge with gp3 SSD (mainnet), we observed:

- Checkpoint tarball download: ~25 minutes (~121 GiB, single mainnet checkpoint, recent vintage).
- Tarball extraction to SQLite: ~50 minutes. Peak disk pressure during this window is roughly the tarball size plus the partial SQLite — size your PVC for at least 500 GiB to avoid running out mid-extraction.
- Walk-forward from checkpoint to chain head: 17+ hours, averaging ~8.5 blocks/second, with periodic stalls from the sync pipeline's back-pressure. Walk-forward dominates total time and scales with how stale the checkpoint is relative to chain head.

Faster disks (NVMe-class IOPS, higher gp3 IOPS provisioning) help the checkpoint download and extraction phases. Higher vCPU counts and faster disks may also improve walk-forward catch-up, but the t3.xlarge/gp3 run above is our only data point, and we have not benchmarked enough instance shapes to give a reliable scaling curve. Spinning disks are not viable for steady-state operation because the chain workload is write-IOPS-heavy at the tail.
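To turn the observed rate into a rough plan, divide the number of blocks between the checkpoint and chain head by the walk-forward rate. The sketch below uses the ~8.5 blocks/second from the single test run above as a default. The 4,608 blocks/day figure is Chia's nominal target (about 18.75 seconds per block) and is an assumption here, not a measurement from that run.

```python
NOMINAL_BLOCKS_PER_DAY = 4608  # assumption: Chia's nominal target rate


def blocks_behind_for_age(checkpoint_age_days: float) -> int:
    """Approximate how many blocks a checkpoint of the given age trails chain head."""
    return round(checkpoint_age_days * NOMINAL_BLOCKS_PER_DAY)


def walk_forward_hours(blocks_behind: int, blocks_per_second: float = 8.5) -> float:
    """Estimate walk-forward time in hours at a sustained sync rate."""
    if blocks_per_second <= 0:
        raise ValueError("blocks_per_second must be positive")
    return blocks_behind / blocks_per_second / 3600
```

At 8.5 blocks/second, 17 hours of walk-forward corresponds to roughly 520,000 blocks. Treat the result as an order-of-magnitude estimate; the rate varies with hardware and with the periodic stalls noted above.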

You cannot create your business DID, mint signature DIDs, or perform any other on-chain operation until at least one Chia node has fully synced. The admin service runs fine without a synced node, so you can deploy through Phase 5 (admin service) and configure operators while you wait. The on-chain steps (Phase 6 onward) block on sync. Plan to span the deployment across at least two business days.

Subsequent restarts reuse the existing database and come online in a minute or two. The long wait is on first deploy only.

A node showing `Running` in `kubectl get pods` is not the same as a synced node. Check sync status in the admin interface after registering the node (DC §6.2).

### 5.3 The shared private CA

Every Chia node uses mutual TLS on its RPC interface. The node and its caller both present certs signed by a private CA. The operator generates the CA and the cert pair (DC §4.1) and pastes them into each node's values file.

Use one CA for every node in the deployment. The chart supports a different CA per node, but every CA and cert pair has to be uploaded individually when you register the node in the admin interface, and tracking multiple CAs across nodes is overhead with no security benefit. The CA is your trust anchor; sharing it across nodes is the design.

The CA is valid for ten years (DC §4.1 generates with `-days 3650`). Mark your calendar for rotation. Rotation is a planned operation: generate a new CA and cert, update every node's values file, redeploy each node, and re-upload the CA and cert for each registered node in the admin interface.
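For the calendar entry, note that 3650 days falls a couple of days short of ten calendar years once leap days are counted. A small sketch for computing the expiry and a reminder date follows; the 90-day lead time is an arbitrary illustration, not a product recommendation.

```python
from datetime import date, timedelta


def ca_expiry(issued: date, days: int = 3650) -> date:
    """Date the CA stops being valid, given the issue date and -days value."""
    return issued + timedelta(days=days)


def rotation_reminder(issued: date, days: int = 3650, lead_days: int = 90) -> date:
    """Date to start the planned rotation (lead_days is a hypothetical default)."""
    return ca_expiry(issued, days) - timedelta(days=lead_days)
```

For a CA issued on 2025-01-01, `ca_expiry` gives 2034-12-30 rather than 2035-01-01. The real expiry also carries a time of day, so confirm it against the certificate's own `notAfter` field.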

### 5.4 The YAML indentation gotcha

The CA cert and key go into the values file as YAML block scalars (`|`). Block scalars preserve newlines, but they require consistent indentation under the field name. Every line of the cert and every line of the key must sit at the same indentation level.

```yaml
chia-blockchain:
  chia:
    ca:
      private_ca_crt: |
        -----BEGIN CERTIFICATE-----
        ...
        -----END CERTIFICATE-----
```

If indentation drifts (one of the cert lines is indented one space less than its neighbors, for example), YAML parses the file but the cert content is garbled. The pod cannot parse the cert at runtime, fails readiness, and restarts. From the outside this looks like a CrashLoopBackOff with no obvious cause.

Trust the YAML linter that ships with your editor. If a node restart-loops on first deploy and you cannot see a reason in the logs, paste the cert content into a YAML linter and check.
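If you want a check that targets this specific failure, the sketch below scans a values file as plain text and reports any line inside a PEM block whose indentation differs from the `-----BEGIN` line. It is a heuristic, not a YAML parser, and it assumes spaces rather than tabs are used for indentation.

```python
from typing import List


def pem_indent_problems(yaml_text: str) -> List[str]:
    """Report lines inside PEM blocks that are not indented like their BEGIN line."""
    problems = []
    expected = None  # indent of the current PEM block, or None outside one
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        stripped = line.lstrip(" ")
        if not stripped.strip():
            continue  # ignore blank lines
        indent = len(line) - len(stripped)
        if stripped.startswith("-----BEGIN"):
            expected = indent
        elif expected is not None:
            if indent != expected:
                problems.append(
                    f"line {number}: indent {indent}, expected {expected}"
                )
            if stripped.startswith("-----END"):
                expected = None
    return problems
```

An empty list means every cert and key line sits at the same level as its header. Any entry names the line to fix before you redeploy the node.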

### 5.5 Redundancy

A single Chia node is a single point of failure for your verification flow. A node falling out of sync, hitting a storage issue, or restarting takes the admin service and signature servers offline as far as on-chain queries are concerned.

Running more than one node gives you the redundancy to absorb a failure; three is recommended. The admin service distributes registered nodes to signature servers at startup; each signature server can fall back across them. Bring up additional nodes by repeating the deploy with a separate values file and Helm release name. Use the same CA in every node.

### 5.6 Storage

The Chia chain database grows continuously. New blocks land roughly every 18 seconds. You will not run out of space tomorrow, but you will eventually if you do not monitor.

Pick a `StorageClass` that supports growth (most cloud-managed PV classes do, with a CSI driver that supports `VolumeExpansion`). Size the PVC generously up front, expand when usage hits 70%, and alert when usage hits 80%. A node whose PVC fills up takes itself out of service until you expand.
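The two thresholds map onto a simple classifier that a monitoring job could call with whatever usage numbers your metrics stack reports. The function name and return labels are illustrative; the 70% and 80% defaults are the ones given above.

```python
def pvc_action(
    used_bytes: int,
    capacity_bytes: int,
    expand_at: float = 0.70,
    alert_at: float = 0.80,
) -> str:
    """Classify PVC usage as 'ok', 'expand', or 'alert'."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    ratio = used_bytes / capacity_bytes
    if ratio >= alert_at:
        return "alert"
    if ratio >= expand_at:
        return "expand"
    return "ok"
```

Treating "expand" as a routine ticket and "alert" as a page keeps the node from reaching the point where a full PVC takes it out of service.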

SSD is required. The chain workload is IOPS-heavy on writes. Spinning disks fall behind chain head and stay behind.

### 5.7 Preservation and restore

A first sync from the checkpoint takes 12–24 hours of wall time (§5.2). A redeploy that starts from an empty PVC re-incurs that wait. Two operations avoid it: backing up a synced node's DB, and pre-populating a fresh PVC with a known-good DB before the chart starts.

#### Backing up the chain DB

The chain DB lives inside the node's PVC at `${CHIA_ROOT}/db/`. Back up the whole `db/` directory, not just `blockchain_v2_mainnet.sqlite`. The directory also contains the SQLite WAL and shm files, a height-to-hash index, a peers cache, and sub-epoch summaries. Restoring only the `.sqlite` file leaves the node in an inconsistent state on first start.

The backup mechanism is operator's choice; different clusters have very different storage capabilities. Two common shapes:

- **Cluster-native volume snapshot.** If your StorageClass supports CSI VolumeSnapshots, create a VolumeSnapshot resource pointing at the node's PVC. This is the fastest path on most managed-cloud Kubernetes.
- **Filesystem copy via maintenance pod.** Spin up a temporary pod (e.g. `busybox` or `alpine`) that mounts the node's PVC and uploads `${CHIA_ROOT}/db/` to object storage with `tar`, `rclone`, `aws s3 cp`, or whatever your team already uses.

For the most consistent backup, scale the chia-node workload to zero replicas before snapshotting or copying. SQLite tolerates running snapshots, but quiescing first removes edge cases.
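For the filesystem-copy path, the key point is archiving the whole directory rather than one file. The sketch below uses Python's standard `tarfile` module as a stand-in for whatever tool your team prefers; `archive_db_dir` is a hypothetical helper, and the upload to object storage is left out.

```python
import os
import tarfile

def archive_db_dir(db_dir: str, archive_path: str) -> list[str]:
    """Write everything under db_dir into a gzip tarball.

    Returns the sorted list of archived file names so the caller can confirm
    that the sidecar files were captured along with the .sqlite file.
    tarfile records each file's mode and ownership in the archive.
    """
    base = os.path.basename(os.path.normpath(db_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(db_dir, arcname=base)  # recursive by default
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Write the archive to a path outside `db_dir`, and run it as a user that can read every file in the directory.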

The DB grows by 5–10 GiB per month after first sync. Back up often enough that a restore doesn't require a long walk-forward; weekly is a sensible default.

#### Restoring or pre-populating a node DB

The container entrypoint checks the size of `${CHIA_ROOT}/db/blockchain_v2_mainnet.sqlite` at startup. If `du -k` reports the file as smaller than 1 GiB (< 1048576 KiB), the entrypoint downloads the latest checkpoint torrent and extracts it — the slow first-sync path. If the file is at or above that threshold, the entrypoint skips the checkpoint flow and walks forward from whatever state it finds. This gate is what lets a pre-populated PVC short-circuit the 12–24 hour wait.
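The gate reduces to a single comparison. The sketch below restates it in Python for clarity; it is not the entrypoint's code. The function names are hypothetical, `disk_usage_kib` only approximates `du -k` and works on Unix-like systems, and treating a missing file like an empty one is an assumption.

```python
import os
from typing import Optional

CHECKPOINT_THRESHOLD_KIB = 1_048_576  # 1 GiB expressed in KiB

def disk_usage_kib(path: str) -> Optional[int]:
    """Approximate `du -k` for one file; None if the file does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    # st_blocks counts 512-byte units; round up to whole KiB.
    return (st.st_blocks * 512 + 1023) // 1024

def takes_checkpoint_path(sqlite_size_kib: Optional[int]) -> bool:
    """True when the slow checkpoint download would run."""
    if sqlite_size_kib is None:  # assumption: missing file behaves like an empty DB
        return True
    return sqlite_size_kib < CHECKPOINT_THRESHOLD_KIB
```

A restored file of exactly 1 GiB passes the gate; anything smaller triggers the checkpoint download.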

To restore from a backup, or to seed a new node from a synced one's DB:

1. Pre-create the PVC manually instead of letting `helm install` create an empty one. Match the chart's expected name and size.
2. Mount the PVC from a temporary maintenance pod (`busybox` or `alpine` with a `sleep` command), as the same UID/GID the chia-node container runs as. Wrong ownership leaves files the runtime user cannot read or write.
3. Copy the backup contents into `${CHIA_ROOT}/db/` on the mounted PVC. Preserve file ownership and permissions. Note that some files in the chia DB directory ship mode 0600 (notably `sub-epoch-summaries`) — your copy operation needs to either run as the file owner or use elevated privileges (sudo, root-side `docker cp`, etc.) to read them. A standard non-privileged `cp` will fail silently or copy zero-byte files for these. Verify by checking sizes after the copy. If `sub-epoch-summaries` ends up missing or zero-byte after restore, chia regenerates it from chain state at startup — slower than restoring it, but not fatal.
4. Delete the maintenance pod.
5. `helm install` (or `./upgrade.sh`) the chia-node chart. The entrypoint sees that the SQLite file is at or above 1 GiB, skips the checkpoint download, and walks forward only the recent blocks: minutes to hours rather than 12–24 hours.

The upstream `helm-chia-nodes` chart does not currently expose either backup or restore as a templated option. If you do this often, consider wrapping the steps above in a script that lives with your IaC.
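One piece worth scripting is the size check from step 3. The sketch below compares a backup directory against the restored copy and reports files that are missing or whose size differs, which includes the zero-byte case. `find_bad_copies` is a hypothetical helper, and it assumes both directories are reachable from the same pod or host.

```python
import os

def find_bad_copies(src_dir: str, dst_dir: str) -> list[str]:
    """Return a sorted list of problems found in dst_dir relative to src_dir."""
    problems = []
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, src_dir)
            dst = os.path.join(dst_dir, rel)
            src_size = os.path.getsize(src)
            if not os.path.isfile(dst):
                problems.append(f"{rel}: missing")
            elif os.path.getsize(dst) != src_size:
                # A zero-byte copy of a non-empty file lands here.
                problems.append(f"{rel}: size {os.path.getsize(dst)} != {src_size}")
    return sorted(problems)
```

An empty result means every file in the backup exists in the destination with the same size. It does not verify ownership or permissions; check those separately.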

### 5.8 Mining

Chia nodes do not mine cryptocurrency in this deployment. They follow the chain and serve queries; that is all. If you want to enable mining for unrelated reasons, the upstream Chia charts support it and the GalactechsLLC repo documents the values. not.bot Verify does not benefit from your nodes mining.

### 5.9 Billing

Julia Social does not bill based on the number of Chia nodes you run. Deploy as many as your reliability and load profile call for.

### 5.10 Common failures

**Pod CrashLoopBackOff on first deploy.** Either the CA YAML indentation is wrong (§5.4) or the storage volume failed to mount. The pod logs distinguish the two.

**Pod runs but admin interface shows it as not synced.** Wait. First sync is slow; the chart's checkpoint mechanism makes it tolerable but not instant.

**Node falls behind chain head after running for weeks.** Storage filled up, IOPS contention with another workload, or network egress to Chia peers degraded. Check disk usage first, then network.

**TLS handshake fails between admin service and node.** The CA or cert uploaded to the admin interface (DC §6.2) does not match what the node was deployed with. Re-upload from the original files used in DC §4.

---

## 6. SDK

### 6.1 Role

The not.bot Verify SDK is a server-side library that runs in your application backend, not in the cluster. It connects your backend to your signature servers and handles the verification flow: starting requests, proxying the cryptographic exchange between the not.bot app and your signature servers, and delivering results through callbacks you define.

Languages: Rust (reference), JavaScript (Express), Python (FastAPI), Java (Spring MVC), Dart (shelf). The API surface is the same across all of them.

Repository: `github.com/julia-social/julia_web_sdk`.

### 6.2 Two integration paths

**Server adapter** (recommended for most deployments). You configure the adapter with your verification requirements and callbacks, and it mounts a set of `/signature/*` routes into your web framework. The adapter handles session tracking, request-to-session mapping, and websocket proxying.

**SignatureClient** (for custom flows). A lower-level HTTP client that calls the signature server endpoints directly. Use it when you need manual control, for automation scripts, or when your framework does not have a matching adapter.

DC §9 covers the adapter-based integration.

### 6.3 Adapter configuration reference

The adapter takes six configuration items:

- `request_claims`: which credentials you want from the user. Defined in `shared/claim_properties.txt` in the SDK repo. Common ones: `Notbot0` (bot detection), `AgeOver18` / `AgeOver21` and the full age bracket range, `FirstName` / `FamilyName` / `Nationality` (PII), `SitePass`. Architecture & Privacy Guide §8 covers what each claim attests to.
- `require_site_pass`: when true, the verification generates a per-user-per-site site pass that lets you recognize returning users without learning their identity.
- `message_generator`: produces the text the user sees in the not.bot app when approving the verification. Make it specific to your site so the user knows what they are approving.
- `on_success`: called when verification completes. The adapter passes the verified response (alias DID, requested claims, site pass if requested, cryptographic presentation) and the session. Store what you need.
- `on_failure`: called when verification fails. Log it, alert on it, handle it.
- `expire_time`: seconds a verification request stays valid. Default 3600.

### 6.4 Routes the adapter exposes

Once mounted, the adapter serves these routes from your backend:

- `GET /signature/notbot`: your frontend calls this to start a verification. Returns a request ID. Build the universal link or QR code from the ID.
- `GET /signature/status`: your frontend polls this to detect when verification finishes.
- `POST /signature/notbot/{request_id}`: the not.bot app calls this during the cryptographic exchange. The adapter proxies to your signature server.
- `POST /signature/verify/{request_id}`: the not.bot app submits the signed presentation here. The adapter proxies to your signature server, then fires `on_success` or `on_failure`.
- `WS /signature/honestbot` and `WS /calculate_site_pass`: WebSocket routes used during the verification protocol. Proxied to your signature servers.

Your frontend calls the first two. The not.bot app drives the rest.

### 6.5 The WebSocket proxy loop trap

Configure `SIGNATURE_HOSTNAME` to point at your signature server load balancer (Decision **I**), not at the host running the SDK adapter. If it points at the adapter's own host, the WebSocket proxy loops back to itself on `/signature/honestbot` and the verification hangs.

This is the most common SDK misconfiguration. The fix is one environment variable. The symptom is mysterious: the universal link opens the not.bot app, the user approves, and nothing happens. The browser shows a stalled WebSocket. Check `SIGNATURE_HOSTNAME` first.
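A startup guard in your own backend can catch this before any user does. The following is a sketch, not an SDK feature: `points_at_self` is a hypothetical helper, and it assumes `SIGNATURE_HOSTNAME` holds a bare host, a `host:port` pair, or a URL. It compares names only, so two different names that resolve to the same machine are not detected.

```python
from urllib.parse import urlsplit

def _host_of(value: str) -> str:
    """Extract a lowercase hostname from a bare host, host:port, or URL."""
    candidate = value if "//" in value else "//" + value
    return (urlsplit(candidate).hostname or "").lower()

def points_at_self(signature_hostname: str, adapter_hosts: list[str]) -> bool:
    """True when SIGNATURE_HOSTNAME names one of the adapter's own hosts."""
    target = _host_of(signature_hostname)
    return target != "" and target in {_host_of(h) for h in adapter_hosts}
```

Call it at startup with the value of `SIGNATURE_HOSTNAME` and the public hostnames of your backend, and refuse to start when it returns `True`.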

### 6.6 Session mapping and sticky sessions

The adapter maps each request ID to the session that started it. When verification completes, the adapter looks up the originating session and passes it to `on_success`, even though the verification response arrives on a different HTTP connection (it must, since the not.bot app is a separate client).

The map lives in memory in the adapter process. If you run multiple instances of your backend behind a load balancer, the verification response can land on a different backend instance than the one that started the request. The instance that received it has no map entry and cannot route to the right session.

Two fixes work:

- **Sticky sessions** at your load balancer. The user's frontend session and the not.bot app's verification response both route to the same backend instance. Most load balancers support this with a session cookie or source-IP affinity.
- **Shared session store.** Replace the adapter's in-memory map with a Redis or similar store. The SDK does not provide this out of the box; you have to wire it up by overriding the adapter's session callbacks.

For most deployments, sticky sessions are the simpler fix.
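The failure and both fixes are easier to see in a toy model. The class below is a hypothetical stand-in for one backend process, not the SDK's adapter; a plain dict shared between two objects plays the role that Redis or a similar store would play across real processes.

```python
class AdapterInstance:
    """Toy model of one backend process holding a request-to-session map."""

    def __init__(self, store=None):
        # Default: a private dict, like the in-memory map described above.
        self.sessions = {} if store is None else store

    def start_request(self, request_id, session):
        self.sessions[request_id] = session

    def complete(self, request_id):
        # Returns the originating session, or None if this instance never saw it.
        return self.sessions.get(request_id)


# Without stickiness: the response lands on an instance with no map entry.
a, b = AdapterInstance(), AdapterInstance()
a.start_request("req-1", "alice-session")
lost = b.complete("req-1")

# Shared store: either instance can resolve the request.
shared = {}
c, d = AdapterInstance(shared), AdapterInstance(shared)
c.start_request("req-2", "bob-session")
found = d.complete("req-2")
```

Sticky sessions avoid the first case by making sure `complete` is always called on the instance that ran `start_request`.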

### 6.7 Common failures

**Verification hangs after the user approves in the not.bot app.** WebSocket loop trap. See §6.5.

**`on_success` cannot find the originating session, or fires against the wrong one.** Sticky sessions are off and the verification response landed on an instance that did not start the request. See §6.6.

**`SIGNATURE_API_KEY` rejected.** The key was generated for a different deployment, or the value was rotated and the running adapter has a stale value. Get a fresh key from the admin interface and restart the adapter.

**Universal link does not open the not.bot app.** Either the user does not have the app installed, or the link's deeplink association is broken. In the first case the link should redirect to the appropriate app store; if it does not, your link construction is wrong. The second case is rare and is usually a manifest or App Store linkage issue, not an SDK problem.

---

## 7. PostgreSQL and Keycloak (prerequisites)

You provide both. Neither is part of the not.bot Verify Helm bundle. Versions: PostgreSQL 14+, Keycloak 22+.

PostgreSQL hosts the admin service's application data: registered Chia nodes, the signature DID pool, MAU counters, and signature server state. The schema is created by the admin service on first startup as the admin user (Decision **F**, default `notbot_admin`). The signer credentials (Decision **G**, default `notbot_signer`) need read/write on tables created by the admin user.

The schema lives in one database. Splitting it across two databases means setting up replication; the deployment does not provide that.

Keycloak handles operator login to the admin service. End users on the not.bot app never touch Keycloak. Only your deployment operators do. The admin-service realm has one client (`admin-service`) with confidential authentication and a redirect URI matching `app.baseUrl`.

The most useful operational property of both is that they are standard. Any monitoring, backup, or HA pattern your team already runs for either applies here.

For PostgreSQL backup: nightly dumps are the minimum, point-in-time recovery (WAL archiving) is better. The admin service can recreate the DID pool by re-querying the chain through a registered node, but everything else (operator state, Chia node registrations, MAU counters) lives only in the database.

For Keycloak: standard. Lose the realm and you lose admin login until you recreate it. The realm config is in version control if you exported it; otherwise back up Keycloak's own data.

---

## 8. Observability and alerting

### 8.1 Probes

The Helm charts ship with liveness and readiness probes on both the admin service and the signature server. Kubernetes uses these to restart unhealthy pods and hold traffic from pods that are not ready. Defaults are in the charts; do not change them unless you have a reason.

### 8.2 Logging

Both components log to stdout. `kubectl logs` works out of the box. Both use structured output, so an aggregator (ELK, Loki, Datadog) parses without custom config.

The admin service log level is controlled by `springProfile`. `prod` logs at INFO. For deeper troubleshooting, set `springProfile: dev` and redeploy; switch back when you have what you need (the dev profile produces high-volume output and includes debug data you do not want in long-term storage).

The signature server logs at INFO by default. The startup sequence (§4.3) logs each step, so a failed startup names the failing step without needing higher verbosity.

### 8.3 Metrics

The signature server exposes `/metrics` on port 8080 with Prometheus scrape annotations. If you run Prometheus in your cluster, it picks up the endpoint with no extra setup. The annotations have no effect without a collector.

The admin service does not expose a Prometheus endpoint in the current release. If you need admin service metrics, scrape Kubernetes-level metrics (pod uptime, restart count, resource usage) and pair them with the admin interface's own DID pool view.

### 8.4 Conditions to alert on

Five failure conditions take the verification flow down or degrade it. Set alerts for each.

**OpenBao sealed.** Every signature request requires OpenBao. A sealed OpenBao means zero signatures. The admin service and signature server logs both show connection failures. OpenBao reseals on every pod restart (§2.2). Alert on the bao seal status, or on a sustained pattern of OpenBao connection errors in the signature server logs.
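A seal-status probe can be sketched as follows, assuming your OpenBao release serves the Vault-compatible `GET /v1/sys/seal-status` endpoint and that its JSON body carries a boolean `sealed` field. The function names and the base URL you pass in are placeholders.

```python
import json
import urllib.request

def is_sealed(seal_status_body: str) -> bool:
    """Interpret a seal-status JSON body.

    A missing field is treated as sealed so that a malformed response
    still raises an alert.
    """
    status = json.loads(seal_status_body)
    return bool(status.get("sealed", True))

def fetch_seal_status(base_url: str, timeout: float = 5.0) -> str:
    """Fetch the raw seal-status body from an OpenBao address (assumed path)."""
    url = f"{base_url}/v1/sys/seal-status"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Treat a connection error from `fetch_seal_status` as an alert condition too: an unreachable OpenBao blocks signatures just as a sealed one does.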

**Chia node out of sync.** A node behind chain head cannot verify current state. The admin interface shows sync status per node. Alert on any registered node leaving "synced" state. If your only synced node goes down, the admin service and signature servers lose chain visibility. §5.5 covers running multiple nodes.

**Signature server heartbeat loss.** A stale heartbeat in the admin interface means the server holding that DID has stopped reporting. The admin service will release the DID after the timeout, but until it does, that DID is stuck. If the server crashed, the replacement blocks at startup waiting for an Available DID. Alert on heartbeat staleness exceeding the heartbeat interval by more than a small margin.
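The staleness rule is a comparison against the interval plus a margin. In this sketch the heartbeat interval is a parameter because this guide does not state it, and the 15-second default margin is an arbitrary placeholder; `heartbeat_is_stale` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

def heartbeat_is_stale(
    last_heartbeat: datetime,
    now: datetime,
    interval: timedelta,
    margin: timedelta = timedelta(seconds=15),  # placeholder, tune to taste
) -> bool:
    """True when the last heartbeat is older than interval + margin."""
    return now - last_heartbeat > interval + margin
```

Keep the margin small relative to the admin service's release timeout so the alert fires while the DID is still held.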

**Signature DID pool not refilling.** The admin service maintains two unallocated DIDs and mints replacements as servers consume them. If the unallocated count drops to zero and stays there, the admin service is failing to mint, usually because no registered Chia node is synced or OpenBao is sealed. New signature servers will block at startup until the pool refills. Alert when the unallocated count stays at zero for more than a few minutes.
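The "stays at zero" condition needs a duration, not a single reading. The sketch below assumes you already collect timestamped samples of the unallocated count by some means; how you obtain them is left open, the five-minute window is an assumed reading of "a few minutes", and `pool_starved` is a hypothetical name.

```python
from datetime import datetime, timedelta

def pool_starved(samples, window=timedelta(minutes=5)):
    """samples: list of (timestamp, unallocated_count), oldest first.

    True when the count has been zero in every sample from some point up to
    the most recent one, and that stretch spans at least `window`.
    """
    if not samples:
        return False
    zero_since = None
    for ts, count in samples:
        if count == 0:
            if zero_since is None:
                zero_since = ts  # start of the current zero stretch
        else:
            zero_since = None    # pool refilled, reset the stretch
    return zero_since is not None and samples[-1][0] - zero_since >= window
```

A brief dip to zero while a replacement DID is minted does not trigger it; only a sustained stretch does.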

**Julia Social connectivity loss.** Signature servers contact Julia Social once at startup (honest.bot credential) and once an hour (diagnostic count). Connectivity loss does not affect running servers or active verifications. It prevents new signature servers from starting and pauses diagnostic and MAU reporting. Alert on sustained egress failures, especially during scale-up windows.

The Architecture & Privacy Guide §10 covers the operational consequences of each failure mode in more detail.

---

## 9. Support

Email `support@julia.social` for deployment issues, configuration questions, or credential reissuance.

When opening a ticket, include:

- Your `customerId` and `organizationName` from `deployment-config.json`.
- The component involved (admin service, signature server, OpenBao, Chia node, SDK).
- Pod logs from the relevant component covering at least the last error.
- The phase of the Deployment Checklist where the issue first showed up, if you are still in initial setup.

Do not paste secrets, scoped tokens, or your unseal key into a support ticket. Support will never ask for them.
