Character AI API (v1)

1. Overview

1.1 Purpose

This API enables clients to turn a single "reference" image of a subject into a full video sequence in which that subject appears consistently—regardless of changes in pose, background, or action. By uploading one image, consumers can programmatically request new frames or entire videos that preserve the subject's visual identity, driven by arbitrary control signals (e.g. poses, text prompts, or latent codes).

Key goals:

  • Simplicity: One reference image → many generated frames.
  • Flexibility: Support diverse control modalities (pose, text, custom latents).
  • Consistency: Enforce identity preservation and temporal smoothness.
  • Scalability: Offer both single-frame and batch-oriented video endpoints, with async/webhook support.

1.2 Core Concepts

  • Reference Image: The single input image (I_ref) that defines the subject's appearance (size H×W×3).
  • Identity Feature Vector: A learned embedding (f_identity ∈ ℝᵈ) encoding the subject's appearance, disentangled from pose or scene.
  • Control Signals: Per-frame vectors (z_t) that drive motion, pose, or style. They can come from:
    • Pose Keypoints (2D or 3D joint locations)
    • Text Prompts (via CLIP or another text encoder)
    • Custom Latent Codes (sampled or learned priors)
  • Temporal Coherence: Mechanisms (optical-flow losses, temporal discriminators, attention across frames) that ensure smooth transitions between generated frames.

1.3 High-Level Workflow

  1. Upload & Process Reference Image

    • Client uploads I_ref.
    • API returns a reference_image_id.
    • Segmentation and feature-extraction run to produce a mask M_ref and embedding f_identity.
  2. Generate Control Sequence

    • Client requests control vectors via pose-, text-, or latent-based endpoints.
    • API returns one or more z_t vectors describing desired per-frame variation.
  3. Frame / Video Generation

    • Single-frame: supply one z_t → get I_t.
    • Multi-frame: supply [z_1…z_T] → get a packaged video (MP4/GIF).
  4. Monitor & Retrieve

    • Synchronous responses for small jobs.
    • Asynchronous jobs with job IDs, polling, or webhooks for larger video renders.
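
To make the workflow concrete, here is a minimal end-to-end sketch using Python's requests library. The /control/text path is inferred by analogy with the /control/pose endpoint shown in section 10; treat the exact paths and response fields here as illustrative rather than authoritative.

import requests

BASE = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": "Bearer sk_live_abc123..."}

# 1. Upload the reference image and keep its ID.
with open("subject.jpg", "rb") as f:
    ref = requests.post(f"{BASE}/reference-image", headers=HEADERS,
                        files={"image": f}).json()

# 2. Derive a control sequence from a text prompt.
ctrl = requests.post(f"{BASE}/control/text", headers=HEADERS, json={
    "reference_image_id": ref["reference_image_id"],
    "prompt": "A dancing robot in a city street",
    "steps": 16,
}).json()

# 3. Render the frames as an MP4.
video = requests.post(f"{BASE}/generate/video", headers=HEADERS, json={
    "reference_image_id": ref["reference_image_id"],
    "control_type": "text",
    "z_sequence": ctrl["z_sequence"],
    "format": "mp4",
}).json()
print(video["video_url"])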

2. Getting Started

2.1 Base URL

All endpoints are hosted under the versioned base URL:

https://api.aicharacter.studio/v1

Note: All requests must use HTTPS.


2.2 Authentication & Authorization

  • Authorization: Bearer <API_KEY> (your API key as a Bearer token)
  • Content-Type: application/json (for all JSON request bodies)
  • Accept: application/json (for all JSON responses)

API Key Management

  • Generate and manage keys in your dashboard at aicharacter.studio.

  • Keys are scoped; each key carries a set of permissions:

    • reference:read – upload and fetch reference-image data
    • control:write – generate control vectors (pose/text/latent)
    • generate:write – produce frames or videos

Sample Request

POST /v1/reference-image HTTP/1.1
Host: api.aicharacter.studio
Authorization: Bearer sk_live_abc123…
Content-Type: multipart/form-data; boundary=----XYZ

------XYZ
Content-Disposition: form-data; name="image"; filename="subject.jpg"
Content-Type: image/jpeg

<…binary image data…>
------XYZ--

2.3 Rate Limits

  • Standard: 100 requests per minute
  • Burst: 50 requests per 10 seconds

When you approach or exceed your limit, the API will respond with HTTP 429 and include:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1618881600
  • Retry-After: seconds until you may retry.
  • X-RateLimit-Reset: UNIX timestamp when the window resets.
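
A simple client-side pattern is to honor Retry-After and fall back to exponential backoff. The following is an illustrative Python sketch, not an official SDK helper:

import time
import requests

def post_with_retry(url, max_retries=5, **kwargs):
    """POST, waiting out HTTP 429 responses before retrying."""
    for attempt in range(max_retries):
        resp = requests.post(url, **kwargs)
        if resp.status_code != 429:
            return resp
        # Prefer the server's Retry-After; otherwise back off exponentially.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("Rate limit retries exhausted")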

2.4 Error Handling

All errors return a JSON payload in the following format:

{
  "error": {
    "code": "string",
    "message": "string",
    "details": {}
  }
}
  • 400 BadRequest: Invalid request syntax or parameters
  • 401 Unauthorized: Missing or invalid API key
  • 403 Forbidden: Insufficient scope or access denied
  • 404 NotFound: Resource (e.g. reference_image_id) not found
  • 422 UnprocessableEntity: Semantic validation failed (e.g. bad keypoint data)
  • 429 RateLimitExceeded: Rate limit exceeded
  • 500 InternalError: Server-side error; retry later

Example: Invalid JSON

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "error": {
    "code": "BadRequest",
    "message": "Request body could not be parsed as JSON.",
    "details": {
      "line": 4,
      "column": 15
    }
  }
}

Reference Image Processing

This section covers endpoints for uploading a reference image, retrieving its metadata, obtaining a segmentation mask, and extracting the identity feature vector.

Upload Reference Image

Upload a single image that defines the subject's appearance.

Authorizations:
ApiKeyAuth
Request Body schema: multipart/form-data
required
image
required
string <binary>

JPEG/PNG binary data of the subject.

name
string

Optional user label or filename.

Responses

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "width": 1024,
  "height": 768,
  "uploaded_at": "2025-05-21T08:42:13Z"
}

Get Reference-Image Metadata

Retrieve basic metadata for a previously uploaded reference image.

Authorizations:
ApiKeyAuth
path Parameters
reference_image_id
required
string
Example: ref_12345abcd

ID returned on upload.

Responses

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "width": 1024,
  "height": 768,
  "status": "processed",
  "uploaded_at": "2025-05-21T08:42:13Z"
}

Retrieve Segmentation Mask

Download the binary mask isolating the subject from background. Clients can use If-None-Match with a previously obtained ETag to perform conditional requests.

Authorizations:
ApiKeyAuth
path Parameters
reference_image_id
required
string
Example: ref_12345abcd

ID returned on upload.

query Parameters
format
string
Default: "png"
Enum: "png" "base64"

png (default) or base64

Responses

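Because the mask for a processed image is stable, clients can revalidate a cached copy with a conditional request. The sketch below uses Python's requests; the /reference-image/{id}/mask path is an assumption for illustration, and the ETag value is a placeholder:

import requests

url = "https://api.aicharacter.studio/v1/reference-image/ref_12345abcd/mask"  # path assumed
headers = {
    "Authorization": "Bearer sk_live_abc123...",
    "If-None-Match": '"<etag-from-a-previous-response>"',
}
resp = requests.get(url, headers=headers, params={"format": "png"})
if resp.status_code == 304:
    print("Cached mask is still valid")
else:
    with open("mask.png", "wb") as f:
        f.write(resp.content)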

Extract Identity Features

Return the subject's embedding vector, disentangled from pose and context.

Authorizations:
ApiKeyAuth
path Parameters
reference_image_id
required
string
Example: ref_12345abcd

ID returned on upload.

query Parameters
model
string
Default: "cnn"
Enum: "cnn" "vit"

cnn (default) or vit to select backbone type.

Responses

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "model": "cnn",
  "feature_dim": 512,
  "f_identity": […]
}

Conditional Control Signals

Endpoints in this section generate per-frame control vectors (zₜ) that drive pose, text, or custom latent–based variation. Clients supply the reference_image_id and modality‐specific inputs; the API returns one or more zₜ vectors for downstream frame/video generation.

Generate Pose Control Signal

Encode a set of 2D or 3D keypoints into a control vector zₜ for a single frame.

Authorizations:
ApiKeyAuth
Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

keypoints
required
Array of objects (Keypoint)

List of joint coordinates.

format
string
Default: "json"
Enum: "json" "base64"

Reserved for alternative keypoint encodings (e.g. base64); primarily for future use. The response is always JSON.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "keypoints": […]
}

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "pose",
  "frame_index": 0,
  "z_t": […]
}

Generate Text Control Signal

Turn a natural-language prompt into a sequence of control vectors [z₁…z_T] via a text encoder + transformer.

Authorizations:
ApiKeyAuth
Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

prompt
required
string

Text string describing desired action/scene.

steps
required
integer <int32> >= 1

Number of frames T to generate.

model
string
Default: "clip"
Enum: "clip" "custom-nlp"

Text encoder to use.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "prompt": "A dancing robot in a city street",
  "steps": 16
}

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "text",
  "steps": 16,
  "z_sequence": […]
}

Generate Custom Latent Codes

Generate a sequence of random or learned latent vectors [z₁…z_T] suitable for scene or style variation.

Authorizations:
ApiKeyAuth
query Parameters
length
required
integer <int32> >= 1

Number of vectors to generate (T).

seed
integer <int64>

Deterministic RNG seed.

distribution
string
Default: "gaussian"
Enum: "gaussian" "uniform"

gaussian (default) or uniform.

Responses

Response samples

Content type
application/json
{
  "control_type": "latent",
  "length": 8,
  "z_sequence": […]
}
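
Since this endpoint takes only query parameters, a GET with length, seed, and distribution is the natural call shape. A Python sketch (the /control/latent path and GET method are assumed by analogy with the pose endpoint):

import requests

resp = requests.get(
    "https://api.aicharacter.studio/v1/control/latent",  # path assumed
    headers={"Authorization": "Bearer sk_live_abc123..."},
    params={"length": 8, "seed": 42, "distribution": "gaussian"},
)
z_sequence = resp.json()["z_sequence"]

Fixing the seed makes the returned latents reproducible across runs.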

Frame Generation

Endpoints in this section consume one or more control vectors (zₜ) along with the reference_image_id to render either a single frame or a full video.

Generate a Single Frame

Render a single frame Iₜ given a control vector zₜ. Supports conditional requests via If-None-Match if attempting to regenerate with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for the generated asset in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for processing.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4); only relevant for asynchronous tasks (this endpoint is primarily synchronous).

Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

control_type
required
string
Enum: "pose" "text" "latent"

Type of control signal provided.

z_t
required
Array of numbers <float>

Control vector for this frame (length must match model dim).

resolution
string (pattern ^\d+x\d+$)
Default: "720x720"

Output size, e.g. "1280x720".

model
string
Enum: "gan" "diffusion" "transformer"

Generator mode.

output_format
string
Default: "url"
Enum: "url" "base64"

Desired output format for the image.

identity_strength
number or null <float>

Weight for identity preservation. Higher values enforce stricter subject matching.

temporal_strength
number or null <float>

Weight for temporal coherence. Higher values promote smoother transitions.

perceptual_weight
number or null <float>

Weight for perceptual loss (e.g., LPIPS).

adv_weight
number or null <float>

Weight for adversarial loss from GAN discriminators.

pose_smoothness
number or null <float>

Weight for pose regularization to penalize jittery movements.

url_ttl
integer or null <int32>

Optional. Time-to-live in seconds for the image_url if generated.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "pose",
  "z_t": […],
  "resolution": "1280x720",
  "model": "gan"
}

Response samples

Content type
application/json
Example
{}

Generate a Multi-Frame Video

Render an entire video of T frames as an MP4 or GIF, given a sequence of control vectors [z₁…z_T], in a single request. Supports both synchronous (small T) and asynchronous (large T) workflows, and conditional requests via If-None-Match when regenerating with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for the generated video in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for processing.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4) for asynchronous tasks.

Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

control_type
required
string
Enum: "pose" "text" "latent"

Type of control signal sequence provided.

z_sequence
required
Array of arrays of numbers <float>

List of T control vectors.

format
string
Default: "mp4"
Enum: "mp4" "gif"

Desired output video format.

resolution
string (pattern ^\d+x\d+$)
Default: "720x720"

Video size, e.g. "1280x720".

fps
integer <int32>
Default: 30

Frames per second.

model
string
Default: "gan"
Enum: "gan" "diffusion" "transformer"

Generator mode to use for video generation.

async
boolean
Default: false

false (default) for sync; true to enqueue and receive a job ID.

callback_url
string or null <url>

If async=true, POST job completion payload here.

identity_strength
number or null <float>

Weight for identity preservation. Higher values enforce stricter subject matching.

temporal_strength
number or null <float>

Weight for temporal coherence. Higher values promote smoother transitions.

perceptual_weight
number or null <float>

Weight for perceptual loss (e.g., LPIPS).

adv_weight
number or null <float>

Weight for adversarial loss from GAN discriminators.

pose_smoothness
number or null <float>

Weight for pose regularization to penalize jittery movements.

url_ttl
integer or null <int32>

Optional. Time-to-live in seconds for the video_url if generated.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "text",
  "z_sequence": […],
  "format": "mp4",
  "resolution": "1280x720",
  "fps": 24,
  "model": "gan"
}

Response samples

Content type
application/json
{}

Advanced Features

This section describes two higher-level capabilities: 3D/generalization via NeRF and batch-style multi-reference processing.

8.1 3D Pose & View Generalization

Integration with NeRF (Neural Radiance Fields) provides enhanced 3D pose and view generalization. The Generate NeRF Model or Video endpoint below exposes this capability directly; generator models with 3D awareness also leverage it implicitly.

8.2 Batch Processing

Batch endpoints let clients process many independent generation tasks efficiently in a single request.

Batch Video Generation (Asynchronous or Synchronous)

Process an array of independent video-generation tasks. Each task can target a different reference_image_id and control sequence. Supports both synchronous and asynchronous processing. Supports conditional requests via If-None-Match if attempting to regenerate with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for all generated assets in this batch in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for processing the batch.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4) for asynchronous batch tasks.

Request Body schema: application/json
required
tasks
required
Array of objects (BatchVideoTask) non-empty

A list of individual video generation tasks.

async
boolean
Default: false

false (default) for sync; true to enqueue and receive job IDs for each task.

callback_url
string or null <url>

If async=true, POST job completion payloads here (may be per job or global).

Responses

Request samples

Content type
application/json
Example
{
  "tasks": […],
  "async": true
}

Response samples

Content type
application/json

Generate NeRF Model or Video

Build a subject-specific NeRF model and render either the raw NeRF or a video of arbitrary camera trajectories. Supports conditional requests via If-None-Match if attempting to regenerate with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for the generated NeRF model/video in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for NeRF processing.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4) for NeRF processing (primarily for async NeRF tasks if supported).

Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

camera_trajectory
required
Array of objects (CameraPose) non-empty

List of camera poses for rendering or model fitting.

resolution
string (pattern ^\d+x\d+$)
Default: "320x240"

Render size, e.g. "640x480".

output_format
string
Default: "video"
Enum: "nerf_model" "video"

Desired output format.

fps
integer or null <int32>
Default: 24

If output_format is video, frames per second.

url_ttl
integer or null <int32>

Optional. Time-to-live in seconds for the video_url if output_format is video.

Responses

Request samples

Content type
application/json
Example
{
  "reference_image_id": "ref_12345abcd",
  "camera_trajectory": […],
  "resolution": "640x480",
  "output_format": "video",
  "fps": 30
}

Response samples

Content type
application/json
Example
{}

9. Performance & Scaling

To meet the needs of diverse workloads—from single-frame tests to large-scale video batches—our API offers both synchronous and asynchronous modes, fine-grained job management, and options for GPU sizing, caching, and CDN delivery.


9.1 Asynchronous Jobs & Webhooks

  • async Flag: Add "async": true in your /v1/generate/video or /v1/batch/video requests to enqueue work without blocking your connection.

  • Job Creation Response

    HTTP/1.1 202 Accepted
    Location: /v1/jobs/{job_id}
    {
      "job_id": "job_abc123",
      "status_url": "https://api.aicharacter.studio/v1/jobs/job_abc123"
    }
    
  • Callback/Webhook Supply "callback_url": "https://your.app/webhook" to have job results POSTed when complete:

    {
      "job_id": "job_abc123",
      "status": "completed",
      "result": { /* identical to sync response */ }
    }
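
A minimal receiver for this payload might look like the following sketch (Flask; the /webhook route is arbitrary):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def on_job_complete():
    payload = request.get_json()
    if payload.get("status") == "completed":
        # "result" mirrors the synchronous response (e.g. a video_url).
        print(payload["job_id"], payload["result"])
    return "", 204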
    

9.2 Progress Polling

  • Endpoint

    GET /v1/jobs/{job_id}
    
  • Response Fields

    • status (string): pending, processing, completed, or failed
    • progress (number): percentage of completion, 0–100
    • result (object): populated when status=completed
    • error (object): populated when status=failed
  • Example

    GET /v1/jobs/job_abc123 HTTP/1.1
    Authorization: Bearer sk_live_…
    
    {
      "job_id": "job_abc123",
      "status": "processing",
      "progress": 47.5
    }
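
For long-running jobs, a simple polling loop is enough when webhooks are not an option (illustrative Python sketch):

import time
import requests

def wait_for_job(job_id, api_key, interval=5):
    """Poll GET /v1/jobs/{job_id} until the job completes or fails."""
    url = f"https://api.aicharacter.studio/v1/jobs/{job_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        job = requests.get(url, headers=headers).json()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)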
    

9.3 GPU & Instance Sizing

By default, jobs run on a shared GPU cluster. For predictable performance:

  • X-GPU-Type Header (optional):

    • v100 (default)
    • a100
    • t4
  • X-Instance-Count Header (async only):

    • Number of GPU workers (1–4)

Note: Additional charges apply for dedicated hardware.


9.4 Caching & Deduplication

  • ETags & Conditional Requests

    • Frame/video endpoints return an ETag header.
    • Clients may GET with If-None-Match to reuse cached results when control inputs and reference image are unchanged. (For POST generation endpoints, re-POSTing with identical parameters may return the cached result URL.)
  • Cache TTL

    • Default: 24 hours
    • Can be extended via custom_cache_duration_seconds query parameter on generation calls.

9.5 CDN Delivery & URL Expiry

  • CDN-Backed URLs

    • All image_url and video_url links point to a global CDN for low-latency delivery.
  • URL Expiry

    • Default expiration: 7 days
    • Extendable via the optional url_ttl (in seconds) parameter on generation requests.

9.6 Best Practices for High Throughput

  1. Batch vs. Single Requests

    • Aggregate many small videos in /v1/batch/video to reduce overhead.
  2. Connection Management

    • Use async + webhooks rather than long polling for large jobs.
  3. Retry & Backoff

    • On HTTP 429, respect the Retry-After header.
    • Implement exponential backoff for 5xx errors.
  4. Parallelization

    • Limit concurrent requests to your rate limit.
    • Use multiple API keys (scoped per project) if needed.
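
Putting these practices together, a bounded worker pool keeps concurrency under your rate limit while still parallelizing. A sketch (max_workers=4 is an arbitrary choice; pair it with a retry helper such as the one in section 2.3):

from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": "Bearer sk_live_abc123..."}

def render(body):
    # One /generate/frame request body per task.
    return requests.post(f"{BASE}/generate/frame", headers=HEADERS, json=body).json()

tasks = []  # fill with frame-generation request bodies

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(render, tasks))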

Get Job Status

Polls for the status of an asynchronous job (e.g., video generation, batch processing).

Authorizations:
ApiKeyAuth
path Parameters
job_id
required
string
Example: job_98765abcd

The unique identifier of the asynchronous job.

Responses

Response samples

Content type
application/json
Example
{
  "job_id": "job_98765abcd",
  "status": "processing",
  "progress": 65,
  "message": "Video encoding at 65%",
  "created_at": "2023-10-26T10:00:00Z",
  "updated_at": "2023-10-26T10:05:00Z"
}

6. Model Architectures & Modes

You can control the underlying generator by passing the model parameter in your /generate/frame or /generate/video requests. Available modes trade off speed, stochasticity, and visual fidelity.


6.1 GAN-Based Mode

  • Identifier: model=gan

  • Overview: Uses a StyleGAN-like backbone extended for video. Leverages Adaptive Instance Normalization (AdaIN) to inject the identity vector into each layer and adds per-frame noise to introduce fine stochastic detail.

  • Key Components:

    • Generator: 2D convolutional "synthesis" network with style modulation.

    • Noise Inputs: Per-layer Gaussian noise tensors (nₜ ∼ N(0,1)) to diversify textures.

    • AdaIN (see the NumPy sketch at the end of this subsection):

      AdaIN(x, f_identity) = σ(f_identity) * ((x - μ(x)) / σ(x)) + μ(f_identity)
      
    • Discriminators:

      • Frame Discriminator: 2D CNN to judge individual frames.
      • Temporal Discriminator: 3D CNN to judge frame sequences.
  • API Usage Example:

    {
      "model": "gan",
      "reference_image_id": "ref_12345abcd",
      "control_type": "pose",
      "z_t": [...],
      "resolution": "1280x720"
    }
    
  • Pros & Cons:

    Pros:
      • Fast inference (1–2 sec/frame at 720p)
      • Rich, detailed textures via noise inputs
    Cons:
      • May exhibit flicker without strong temporal-loss tuning
      • Can require heavier GPUs at high resolution
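
As referenced above, a NumPy sketch of the AdaIN operation. In the real generator the style statistics would be predicted from f_identity by learned affine layers; here they are passed in directly:

import numpy as np

def adain(x, style_mean, style_std, eps=1e-5):
    """Re-normalize feature map x to the style statistics of the identity.
    x: (C, H, W) feature map; style_mean, style_std: (C,) vectors."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True) + eps
    x_norm = (x - mu) / sigma
    return style_std[:, None, None] * x_norm + style_mean[:, None, None]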

6.2 Diffusion-Based Mode

  • Identifier: model=diffusion

  • Overview: Iteratively denoises a Gaussian noise map into the target frame over S steps. Conditioned on f_identity and zₜ via cross-attention in a 2D/3D U-Net.

  • Key Components:

    • U-Net Backbone:

      • Down/up‐sampling streams with skip connections.
      • 2D convs for spatial, 3D convs or temporal-blocks for short-range temporal modeling.
    • Cross-Attention:

      • Queries from noisy frame xₜˢ; Keys/Values from [f_identity, zₜ].
    • Denoising Loss:

      L_diff = E_{ε, t, s} ‖ ε - ε_θ(xₜˢ, f_identity, zₜ) ‖²
      
  • API Usage Example:

    {
      "model": "diffusion",
      "reference_image_id": "ref_12345abcd",
      "control_type": "text",
      "z_sequence": [...],
      "format": "mp4"
    }
    
  • Pros & Cons:

    Pros:
      • Highly stable subject consistency
      • Better handling of complex scenes & lighting
    Cons:
      • Slower (10–50 sec/frame at 720p)
      • May require tuning of step count S

6.3 Transformer-Based Mode

  • Identifier: model=transformer

  • Overview: Treats video as a sequence of spatio-temporal tokens. A Vision Transformer encoder–decoder uses self-attention to model long-range dependencies, with cross-attention to integrate f_identity.

  • Key Components:

    • Patch Extraction: Splits each frame into P×P pixel patches, linearly projected to token vectors (see the sketch at the end of this subsection).

    • Spatio-Temporal Blocks:

      • Self-attention across both time (frames) and space (patches).
      • Cross-attention layers attending to [f_identity, zₜ].
    • Output Head: Reconstructs patches into image grid.

  • API Usage Example:

    {
      "model": "transformer",
      "reference_image_id": "ref_12345abcd",
      "control_type": "latent",
      "z_sequence": [...],
      "fps": 24
    }
    
  • Pros & Cons:

    Pros:
      • Excellent temporal coherence over long clips
      • Unified handling of spatial & temporal context
    Cons:
      • Highest memory usage (TPU/GPU required)
      • Slower than GAN but faster than diffusion
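
As referenced above, patch extraction reduces to a reshape. A NumPy sketch (P=16 is an arbitrary patch size; the learned linear projection is omitted):

import numpy as np

def patchify(frame, P=16):
    """Split an (H, W, 3) frame into flattened P x P patch tokens.
    H and W must be divisible by P."""
    H, W, C = frame.shape
    patches = frame.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches  # (num_patches, P*P*C), ready for linear projection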

7. Consistency & Loss Functions

To guarantee that generated frames faithfully preserve the subject's appearance and transition smoothly, the system employs multiple specialized losses. Clients do not directly interact with these losses but can tune high-level consistency parameters via optional query fields in generation calls (e.g. identity_strength, temporal_strength).


7.1 Identity Preservation

Ensures each frame's subject matches the reference image in appearance.

  • Feature-Space Loss: Compare deep features of the masked regions of I_ref and the generated frame I_t using a pretrained VGG network:

    $$ L_{\text{identity}} = \sum_{l} \big\lVert \phi_{l}(I_{\text{ref}}\odot M_{\text{ref}}) - \phi_{l}(I_{t}\odot M_{t}) \big\rVert_{2}^{2} $$

    • $\phi_{l}$: activations at layer $l$.
    • $M_{t}$: predicted mask for frame $t$.
  • Face Recognition Loss (for facial subjects): Use a lightweight face-recognition embedding $e(\cdot)$ and enforce cosine similarity:

    $$ L_{\text{face}} = 1 - \cos\big(e(I_{\text{ref}}),\, e(I_{t})\big) $$
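
Given precomputed VGG activations and face embeddings, both losses reduce to a few lines. A NumPy sketch (the feature extraction and masking are assumed to have happened upstream):

import numpy as np

def identity_loss(feats_ref, feats_gen):
    """Sum over layers l of || phi_l(I_ref ⊙ M_ref) - phi_l(I_t ⊙ M_t) ||².
    feats_*: lists of per-layer activation arrays."""
    return sum(np.sum((a - b) ** 2) for a, b in zip(feats_ref, feats_gen))

def face_loss(e_ref, e_gen):
    """L_face = 1 - cosine similarity between face embeddings."""
    cos = np.dot(e_ref, e_gen) / (np.linalg.norm(e_ref) * np.linalg.norm(e_gen))
    return 1.0 - cos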


7.2 Temporal Coherence

Encourages smooth transitions and discourages flicker or jumpiness.

  • Optical-Flow Warping Loss: Compute forward flow $F_{t\to t+1}$ via a flow network, warp $I_{t}$, and compare to $I_{t+1}$:

    $$ L_{\text{flow}} = \big\lVert \text{Warp}(I_{t},\, F_{t\to t+1}) - I_{t+1}\big\rVert_{1} $$

  • Temporal Discriminator Loss: A 3D-CNN discriminator $D_{\text{temp}}$ distinguishes real from generated clips:

    $$ L_{\text{temp}} = \mathbb{E}\big[\log D_{\text{temp}}(\{I_{t}\}_{\text{real}})\big] + \mathbb{E}\big[\log\big(1 - D_{\text{temp}}(\{I_{t}\}_{\text{gen}})\big)\big] $$

  • Sliding-Window Attention: Transformer-style self-attention over neighboring frames:

    $$ \hat h_{t} = \mathrm{Attention}\big(Q_{t},\, K_{t-k:t+k},\, V_{t-k:t+k}\big) $$

    – implicitly enforces consistency over a window of size $2k+1$.


7.3 Domain-Specific or Perceptual Losses

Additional optional losses depending on subject type or client needs.

  • Perceptual Loss: enforces color/style consistency via LPIPS (parameter: perceptual_weight)
  • GAN Adversarial: adds sharpness/detail via frame & temporal GANs (parameter: adv_weight)
  • Pose Regularization: penalizes implausible keypoint jumps (parameter: pose_smoothness)

Clients can adjust these via optional fields in generation requests:

{
  "reference_image_id": "...",
  "model": "diffusion",
  "control_type": "text",
  "z_sequence": [...],
  "identity_strength": 0.7,
  "temporal_strength": 0.8,
  "adv_weight": 0.5
}

10. Examples & SDKs

Below are quickstart snippets in cURL, Python, and JavaScript. For full reference, see our GitHub repos and PyPI/npm packages.


10.1 cURL Examples

10.1.1 Upload Reference Image

curl -X POST "https://api.aicharacter.studio/v1/reference-image" \\
  -H "Authorization: Bearer $API_KEY" \\
  -F "image=@/path/to/subject.jpg"

10.1.2 Generate Pose Control Vector

curl -X POST "https://api.aicharacter.studio/v1/control/pose" \\
  -H "Authorization: Bearer $API_KEY" \\
  -H "Content-Type: application/json" \\
  -d '{ 
        "reference_image_id": "ref_12345abcd",
        "keypoints": [
          { "x": 0.5, "y": 0.1 },
          { "x": 0.5, "y": 0.25 }
        ]
      }'

10.1.3 Render a Single Frame

curl -X POST "https://api.aicharacter.studio/v1/generate/frame" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ 
        "reference_image_id": "ref_12345abcd",
        "control_type": "pose",
        "z_t": [0.12, -0.33, 0.77],
        "resolution": "1280x720"
      }'

10.1.4 Start an Async Video Job

curl -X POST "https://api.aicharacter.studio/v1/generate/video" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ 
        "reference_image_id": "ref_12345abcd",
        "control_type": "text",
        "z_sequence": [[…], […]],
        "async": true,
        "callback_url": "https://your.app/hooks/complete"
      }'

10.2 Python SDK

from aicharacter import Client

client = Client(api_key="sk_live_abc123")

# 1. Upload image
ref = client.upload_reference_image("/path/to/subject.jpg")

# 2. Generate text control
z_seq = client.control.text(
    reference_image_id=ref.id,
    prompt="A cat playing piano",
    steps=8
)

# 3. Generate video synchronously
video = client.generate.video(
    reference_image_id=ref.id,
    control_type="text",
    z_sequence=z_seq,
    format="mp4",
    resolution="720x720",
    fps=24
)

print("Download at:", video.video_url)

10.3 JavaScript (Node/Vue) SDK

import { AiCharacterClient } from "@aicharacter/studio";

const client = new AiCharacterClient({ apiKey: "sk_live_abc123" });

async function run() {
  // 1. Upload
  const { reference_image_id } = await client.referenceImage.upload(
    "/path/to/subject.jpg"
  );

  // 2. Pose control
  const { z_t } = await client.control.pose({
    reference_image_id,
    keypoints: [{ x: 0.5, y: 0.1 }, { x: 0.5, y: 0.25 }]
  });

  // 3. Render frame
  const frame = await client.generate.frame({
    reference_image_id,
    control_type: "pose",
    z_t,
    resolution: "1280x720"
  });

  console.log("Frame URL:", frame.image_url);
}

run();

11. FAQs & Troubleshooting

Q1: My generated frames flicker—how can I improve temporal stability?

  • Solution: Increase the temporal_strength parameter in your generation call (range 0.0–1.0, default 0.8).
  • Tip: Use model=diffusion or model=transformer, which natively enforce stronger temporal consistency than gan.

Q2: I keep getting HTTP 429—what's the best retry strategy?

  • Honor the Retry-After header in the 429 response.
  • Implement exponential backoff (e.g., initial 1 s, multiply by 2 up to a max of 32 s).
  • For large jobs, switch to async=true + webhook to avoid holding client connections open.

Q3: My mask endpoint returns 404 even though the upload succeeded.

  • Confirm that the reference image's status (from GET /v1/reference-image/{id}) is "processed".
  • If status is "pending", wait a few seconds and retry.
  • If it remains pending > 60 s, contact support.

Q4: Which control type is best for highly dynamic motions?

  • Pose: precise but requires keypoint data per frame.
  • Text: easy to specify broad actions ("running," "jumping") and auto‐generates sequences.
  • Latent: use for abstract or style-driven motion; less semantically grounded.

Q5: How do I mask out the background in generated frames?

  • Generated frames come with optional masks if you append ?include_mask=true to your frame/video URL.
  • The mask URL will be in your JSON response under mask_url.

Q6: I need higher resolution than 1280×720—what are my options?

  • Supported resolutions: up to 1920×1080 on shared GPUs, and up to 4K (3840×2160) on dedicated A100 instances.
  • To request >1080p, include header X-GPU-Type: a100.

Q7: My asynchronous job failed with "ResourceExhausted."

  • This indicates GPU memory limits were exceeded (e.g., too many high-res frames).
  • Reduce resolution or switch to async=true with a higher-tier GPU via X-GPU-Type.
  • If persistent, split into smaller batches.

Q8: Can I reuse existing z_sequence across multiple videos?

  • Yes. For reproducibility, store both reference_image_id and the exact z_sequence.
  • Ensure you also specify the same seed when using latent control for deterministic output.

Q9: What image/video formats are supported?

  • Frames: PNG or JPEG (via output_format).
  • Videos: MP4 (H.264) or animated GIF.

Q10: Who do I contact for support or to request new features?

Q11: How do I get access to use the API?

  • Generate and manage API keys from your dashboard at aicharacter.studio (see section 2.2, API Key Management).