Character AI API (v1)

1. Overview

1.1 Purpose

This API enables clients to turn a single "reference" image of a subject into a full video sequence in which that subject appears consistently—regardless of changes in pose, background, or action. By uploading one image, consumers can programmatically request new frames or entire videos that preserve the subject's visual identity, driven by arbitrary control signals (e.g. poses, text prompts, or latent codes).

Key goals:

  • Simplicity: One reference image → many generated frames.
  • Flexibility: Support diverse control modalities (pose, text, custom latents).
  • Consistency: Enforce identity preservation and temporal smoothness.
  • Scalability: Offer both single-frame and batch-oriented video endpoints, with async/webhook support.

1.2 Core Concepts

  • Reference Image: The single input image (I_ref) that defines the subject's appearance (size H×W×3).
  • Identity Feature Vector: A learned embedding (f_identity ∈ ℝᵈ) encoding the subject's appearance, disentangled from pose or scene.
  • Control Signals: Per-frame vectors (z_t) that drive motion, pose, or style. They can come from:
    • Pose Keypoints (2D or 3D joint locations)
    • Text Prompts (via CLIP or another text encoder)
    • Custom Latent Codes (sampled or learned priors)
  • Temporal Coherence: Mechanisms (optical-flow losses, temporal discriminators, attention across frames) that ensure smooth transitions between generated frames.

1.3 High-Level Workflow

  1. Upload & Process Reference Image

    • Client uploads I_ref.
    • API returns a reference_image_id.
    • Segmentation and feature-extraction run to produce a mask M_ref and embedding f_identity.
  2. Generate Control Sequence

    • Client requests control vectors via pose-, text-, or latent-based endpoints.
    • API returns one or more z_t vectors describing desired per-frame variation.
  3. Frame / Video Generation

    • Single-frame: supply one z_t → get I_t.
    • Multi-frame: supply [z_1…z_T] → get a packaged video (MP4/GIF).
  4. Monitor & Retrieve

    • Synchronous responses for small jobs.
    • Asynchronous jobs with job IDs, polling, or webhooks for larger video renders.
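
To make the workflow concrete, here is a minimal end-to-end sketch using Python's requests library. The /control/text path is inferred by analogy with the /control/pose endpoint shown in section 10; treat the exact paths and response fields here as illustrative rather than authoritative.

import requests

BASE = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": "Bearer sk_live_abc123..."}

# 1. Upload the reference image and keep its ID.
with open("subject.jpg", "rb") as f:
    ref = requests.post(f"{BASE}/reference-image", headers=HEADERS,
                        files={"image": f}).json()

# 2. Derive a control sequence from a text prompt.
ctrl = requests.post(f"{BASE}/control/text", headers=HEADERS, json={
    "reference_image_id": ref["reference_image_id"],
    "prompt": "A dancing robot in a city street",
    "steps": 16,
}).json()

# 3. Render the frames as an MP4.
video = requests.post(f"{BASE}/generate/video", headers=HEADERS, json={
    "reference_image_id": ref["reference_image_id"],
    "control_type": "text",
    "z_sequence": ctrl["z_sequence"],
    "format": "mp4",
}).json()
print(video["video_url"])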

2. Getting Started

2.1 Base URL

All endpoints are hosted under the versioned base URL:

https://api.aicharacter.studio/v1

Note: All requests must use HTTPS.


2.2 Authentication & Authorization

  • Authorization: Bearer <API_KEY> (your API key as a Bearer token)
  • Content-Type: application/json (for all JSON request bodies)
  • Accept: application/json (for all JSON responses)

API Key Management

  • Generate and manage keys in your dashboard at aicharacter.studio.

  • Keys are scoped; each key carries a set of permissions:

    • reference:read – upload and fetch reference-image data
    • control:write – generate control vectors (pose/text/latent)
    • generate:write – produce frames or videos

Sample Request

POST /v1/reference-image HTTP/1.1
Host: api.aicharacter.studio
Authorization: Bearer sk_live_abc123…
Content-Type: multipart/form-data; boundary=----XYZ

------XYZ
Content-Disposition: form-data; name="image"; filename="subject.jpg"
Content-Type: image/jpeg

<…binary image data…>
------XYZ--

2.3 Rate Limits

  • Standard: 100 requests per minute
  • Burst: 50 requests per 10 seconds

When you approach or exceed your limit, the API will respond with HTTP 429 and include:

HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1618881600
  • Retry-After: seconds until you may retry.
  • X-RateLimit-Reset: UNIX timestamp when the window resets.
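
A simple client-side pattern is to honor Retry-After and fall back to exponential backoff. The following is an illustrative Python sketch, not an official SDK helper:

import time
import requests

def post_with_retry(url, max_retries=5, **kwargs):
    """POST, waiting out HTTP 429 responses before retrying."""
    for attempt in range(max_retries):
        resp = requests.post(url, **kwargs)
        if resp.status_code != 429:
            return resp
        # Prefer the server's Retry-After; otherwise back off exponentially.
        wait = int(resp.headers.get("Retry-After", 2 ** attempt))
        time.sleep(wait)
    raise RuntimeError("Rate limit retries exhausted")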

2.4 Error Handling

All errors return a JSON payload in the following format:

{
  "error": {
    "code": "string",
    "message": "string",
    "details": {}
  }
}
  • 400 BadRequest: Invalid request syntax or parameters
  • 401 Unauthorized: Missing or invalid API key
  • 403 Forbidden: Insufficient scope or access denied
  • 404 NotFound: Resource (e.g. reference_image_id) not found
  • 422 UnprocessableEntity: Semantic validation failed (e.g. bad keypoint data)
  • 429 RateLimitExceeded: Rate limit exceeded
  • 500 InternalError: Server-side error; retry later

Example: Invalid JSON

HTTP/1.1 400 Bad Request
Content-Type: application/json

{
  "error": {
    "code": "BadRequest",
    "message": "Request body could not be parsed as JSON.",
    "details": {
      "line": 4,
      "column": 15
    }
  }
}

Reference Image Processing

This section covers endpoints for uploading a reference image, retrieving its metadata, obtaining a segmentation mask, and extracting the identity feature vector.

Upload Reference Image

Upload a single image that defines the subject's appearance.

Authorizations:
ApiKeyAuth
Request Body schema: multipart/form-data
required
image
required
string <binary>

JPEG/PNG binary data of the subject.

name
string

Optional user label or filename.

Responses

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "width": 1024,
  "height": 768,
  "uploaded_at": "2025-05-21T08:42:13Z"
}

Get Reference-Image Metadata

Retrieve basic metadata for a previously uploaded reference image.

Authorizations:
ApiKeyAuth
path Parameters
reference_image_id
required
string
Example: ref_12345abcd

ID returned on upload.

Responses

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "width": 1024,
  "height": 768,
  "status": "processed",
  "uploaded_at": "2025-05-21T08:42:13Z"
}

Retrieve Segmentation Mask

Download the binary mask isolating the subject from background. Clients can use If-None-Match with a previously obtained ETag to perform conditional requests.

Authorizations:
ApiKeyAuth
path Parameters
reference_image_id
required
string
Example: ref_12345abcd

ID returned on upload.

query Parameters
format
string
Default: "png"
Enum: "png" "base64"

png (default) or base64

Responses

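Because the mask for a processed image is stable, clients can revalidate a cached copy with a conditional request. The sketch below uses Python's requests; the /reference-image/{id}/mask path is an assumption for illustration, and the ETag value is a placeholder:

import requests

url = "https://api.aicharacter.studio/v1/reference-image/ref_12345abcd/mask"  # path assumed
headers = {
    "Authorization": "Bearer sk_live_abc123...",
    "If-None-Match": '"<etag-from-a-previous-response>"',
}
resp = requests.get(url, headers=headers, params={"format": "png"})
if resp.status_code == 304:
    print("Cached mask is still valid")
else:
    with open("mask.png", "wb") as f:
        f.write(resp.content)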

Extract Identity Features

Return the subject's embedding vector, disentangled from pose and context.

Authorizations:
ApiKeyAuth
path Parameters
reference_image_id
required
string
Example: ref_12345abcd

ID returned on upload.

query Parameters
model
string
Default: "cnn"
Enum: "cnn" "vit"

cnn (default) or vit to select backbone type.

Responses

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "model": "cnn",
  "feature_dim": 512,
  "f_identity": […]
}

Conditional Control Signals

Endpoints in this section generate per-frame control vectors (zₜ) that drive pose, text, or custom latent–based variation. Clients supply the reference_image_id and modality‐specific inputs; the API returns one or more zₜ vectors for downstream frame/video generation.

Generate Pose Control Signal

Encode a set of 2D or 3D keypoints into a control vector zₜ for a single frame.

Authorizations:
ApiKeyAuth
Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

keypoints
required
Array of objects (Keypoint)

List of joint coordinates.

format
string
Default: "json"
Enum: "json" "base64"

Reserved for alternative keypoint encodings (e.g. base64); primarily for future use. The response is always JSON.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "keypoints": […]
}

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "pose",
  "frame_index": 0,
  "z_t": […]
}

Generate Text Control Signal

Turn a natural-language prompt into a sequence of control vectors [z₁…z_T] via a text encoder + transformer.

Authorizations:
ApiKeyAuth
Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

prompt
required
string

Text string describing desired action/scene.

steps
required
integer <int32> >= 1

Number of frames T to generate.

model
string
Default: "clip"
Enum: "clip" "custom-nlp"

Text encoder to use.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "prompt": "A dancing robot in a city street",
  "steps": 16
}

Response samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "text",
  "steps": 16,
  "z_sequence": […]
}

Generate Custom Latent Codes

Generate a sequence of random or learned latent vectors [z₁…z_T] suitable for scene or style variation.

Authorizations:
ApiKeyAuth
query Parameters
length
required
integer <int32> >= 1

Number of vectors to generate (T).

seed
integer <int64>

Deterministic RNG seed.

distribution
string
Default: "gaussian"
Enum: "gaussian" "uniform"

gaussian (default) or uniform.

Responses

Response samples

Content type
application/json
{
  "control_type": "latent",
  "length": 8,
  "z_sequence": […]
}
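
Since this endpoint takes only query parameters, a GET with length, seed, and distribution is the natural call shape. A Python sketch (the /control/latent path and GET method are assumed by analogy with the pose endpoint):

import requests

resp = requests.get(
    "https://api.aicharacter.studio/v1/control/latent",  # path assumed
    headers={"Authorization": "Bearer sk_live_abc123..."},
    params={"length": 8, "seed": 42, "distribution": "gaussian"},
)
z_sequence = resp.json()["z_sequence"]

Fixing the seed makes the returned latents reproducible across runs.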

Frame Generation

Endpoints in this section consume one or more control vectors (zₜ) along with the reference_image_id to render either a single frame or a full video.

Generate a Single Frame

Render a single frame Iₜ given a control vector zₜ. Supports conditional requests via If-None-Match if attempting to regenerate with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for the generated asset in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for processing.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4); only relevant for asynchronous tasks (this endpoint is primarily synchronous).

Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

control_type
required
string
Enum: "pose" "text" "latent"

Type of control signal provided.

z_t
required
Array of numbers <float>

Control vector for this frame (length must match model dim).

resolution
string (pattern ^\d+x\d+$)
Default: "720x720"

Output size, e.g. "1280x720".

model
string
Enum: "gan" "diffusion" "transformer"

Generator mode.

output_format
string
Default: "url"
Enum: "url" "base64"

Desired output format for the image.

identity_strength
number or null <float>

Weight for identity preservation. Higher values enforce stricter subject matching.

temporal_strength
number or null <float>

Weight for temporal coherence. Higher values promote smoother transitions.

perceptual_weight
number or null <float>

Weight for perceptual loss (e.g., LPIPS).

adv_weight
number or null <float>

Weight for adversarial loss from GAN discriminators.

pose_smoothness
number or null <float>

Weight for pose regularization to penalize jittery movements.

url_ttl
integer or null <int32>

Optional. Time-to-live in seconds for the image_url if generated.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "pose",
  "z_t": […],
  "resolution": "1280x720",
  "model": "gan"
}

Response samples

Content type
application/json
Example
{}

Generate a Multi-Frame Video

Render an entire video of T frames as an MP4 or GIF, given a sequence of control vectors [z₁…z_T], in a single request. Supports both synchronous (small T) and asynchronous (large T) workflows, and conditional requests via If-None-Match when regenerating with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for the generated video in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for processing.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4) for asynchronous tasks.

Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

control_type
required
string
Enum: "pose" "text" "latent"

Type of control signal sequence provided.

z_sequence
required
Array of arrays of numbers <float>

List of T control vectors.

format
string
Default: "mp4"
Enum: "mp4" "gif"

Desired output video format.

resolution
string (pattern ^\d+x\d+$)
Default: "720x720"

Video size, e.g. "1280x720".

fps
integer <int32>
Default: 30

Frames per second.

model
string
Default: "gan"
Enum: "gan" "diffusion" "transformer"

Generator mode to use for video generation.

async
boolean
Default: false

false (default) for sync; true to enqueue and receive a job ID.

callback_url
string or null <url>

If async=true, POST job completion payload here.

identity_strength
number or null <float>

Weight for identity preservation. Higher values enforce stricter subject matching.

temporal_strength
number or null <float>

Weight for temporal coherence. Higher values promote smoother transitions.

perceptual_weight
number or null <float>

Weight for perceptual loss (e.g., LPIPS).

adv_weight
number or null <float>

Weight for adversarial loss from GAN discriminators.

pose_smoothness
number or null <float>

Weight for pose regularization to penalize jittery movements.

url_ttl
integer or null <int32>

Optional. Time-to-live in seconds for the video_url if generated.

Responses

Request samples

Content type
application/json
{
  "reference_image_id": "ref_12345abcd",
  "control_type": "text",
  "z_sequence": […],
  "format": "mp4",
  "resolution": "1280x720",
  "fps": 24,
  "model": "gan"
}

Response samples

Content type
application/json
{}

Advanced Features

This section describes two higher-level capabilities: 3D/generalization via NeRF and batch-style multi-reference processing.

8.1 3D Pose & View Generalization

Integration with NeRF (Neural Radiance Fields) provides enhanced 3D pose and view generalization. The Generate NeRF Model or Video endpoint below exposes this capability directly; generator models with 3D awareness also leverage it implicitly.

8.2 Batch Processing

Batch endpoints let clients process many independent generation tasks efficiently in a single request.

Batch Video Generation (Asynchronous or Synchronous)

Process an array of independent video-generation tasks. Each task can target a different reference_image_id and control sequence. Supports both synchronous and asynchronous processing. Supports conditional requests via If-None-Match if attempting to regenerate with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for all generated assets in this batch in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for processing the batch.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4) for asynchronous batch tasks.

Request Body schema: application/json
required
tasks
required
Array of objects (BatchVideoTask) non-empty

A list of individual video generation tasks.

async
boolean
Default: false

false (default) for sync; true to enqueue and receive job IDs for each task.

callback_url
string or null <url>

If async=true, POST job completion payloads here (may be per job or global).

Responses

Request samples

Content type
application/json
Example
{
  "tasks": […],
  "async": true
}

Response samples

Content type
application/json

Generate NeRF Model or Video

Build a subject-specific NeRF model and render either the raw NeRF or a video of arbitrary camera trajectories. Supports conditional requests via If-None-Match if attempting to regenerate with identical parameters.

Authorizations:
ApiKeyAuth
query Parameters
custom_cache_duration_seconds
integer

Optional. Custom cache duration for the generated NeRF model/video in seconds.

header Parameters
X-GPU-Type
string
Default: v100
Enum: "v100" "a100" "t4"

Specify the type of GPU for NeRF processing.

X-Instance-Count
integer [ 1 .. 4 ]

Number of GPU workers (1-4) for NeRF processing (primarily for async NeRF tasks if supported).

Request Body schema: application/json
required
reference_image_id
required
string

ID of the uploaded reference image.

camera_trajectory
required
Array of objects (CameraPose) non-empty

List of camera poses for rendering or model fitting.

resolution
string (pattern ^\d+x\d+$)
Default: "320x240"

Render size, e.g. "640x480".

output_format
string
Default: "video"
Enum: "nerf_model" "video"

Desired output format.

fps
integer or null <int32>
Default: 24

If output_format is video, frames per second.

url_ttl
integer or null <int32>

Optional. Time-to-live in seconds for the video_url if output_format is video.

Responses

Request samples

Content type
application/json
Example
{
  "reference_image_id": "ref_12345abcd",
  "camera_trajectory": […],
  "resolution": "640x480",
  "output_format": "video",
  "fps": 30
}

Response samples

Content type
application/json
Example
{}

9. Performance & Scaling

To meet the needs of diverse workloads—from single-frame tests to large-scale video batches—our API offers both synchronous and asynchronous modes, fine-grained job management, and options for GPU sizing, caching, and CDN delivery.


9.1 Asynchronous Jobs & Webhooks

  • async Flag: Add "async": true in your /v1/generate/video or /v1/batch/video requests to enqueue work without blocking your connection.

  • Job Creation Response

    HTTP/1.1 202 Accepted
    Location: /v1/jobs/{job_id}
    {
      "job_id": "job_abc123",
      "status_url": "https://api.aicharacter.studio/v1/jobs/job_abc123"
    }
    
  • Callback/Webhook Supply "callback_url": "https://your.app/webhook" to have job results POSTed when complete:

    {
      "job_id": "job_abc123",
      "status": "completed",
      "result": { /* identical to sync response */ }
    }
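
A minimal receiver for this payload might look like the following sketch (Flask; the /webhook route is arbitrary):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def on_job_complete():
    payload = request.get_json()
    if payload.get("status") == "completed":
        # "result" mirrors the synchronous response (e.g. a video_url).
        print(payload["job_id"], payload["result"])
    return "", 204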
    

9.2 Progress Polling

  • Endpoint

    GET /v1/jobs/{job_id}
    
  • Response Fields

    • status (string): pending, processing, completed, or failed
    • progress (number): percentage of completion, 0–100
    • result (object): populated when status=completed
    • error (object): populated when status=failed
  • Example

    GET /v1/jobs/job_abc123 HTTP/1.1
    Authorization: Bearer sk_live_…
    
    {
      "job_id": "job_abc123",
      "status": "processing",
      "progress": 47.5
    }
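
For long-running jobs, a simple polling loop is enough when webhooks are not an option (illustrative Python sketch):

import time
import requests

def wait_for_job(job_id, api_key, interval=5):
    """Poll GET /v1/jobs/{job_id} until the job completes or fails."""
    url = f"https://api.aicharacter.studio/v1/jobs/{job_id}"
    headers = {"Authorization": f"Bearer {api_key}"}
    while True:
        job = requests.get(url, headers=headers).json()
        if job["status"] in ("completed", "failed"):
            return job
        time.sleep(interval)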
    

9.3 GPU & Instance Sizing

By default, jobs run on a shared GPU cluster. For predictable performance:

  • X-GPU-Type Header (optional):

    • v100 (default)
    • a100
    • t4
  • X-Instance-Count Header (async only):

    • Number of GPU workers (1–4)

Note: Additional charges apply for dedicated hardware.


9.4 Caching & Deduplication

  • ETags & Conditional Requests

    • Frame/video endpoints return an ETag header.
    • Clients may GET with If-None-Match to reuse cached results when control inputs and reference image are unchanged. (For POST generation endpoints, re-POSTing with identical parameters may return the cached result URL.)
  • Cache TTL

    • Default: 24 hours
    • Can be extended via custom_cache_duration_seconds query parameter on generation calls.

9.5 CDN Delivery & URL Expiry

  • CDN-Backed URLs

    • All image_url and video_url links point to a global CDN for low-latency delivery.
  • URL Expiry

    • Default expiration: 7 days
    • Extendable via the optional url_ttl (in seconds) parameter on generation requests.

9.6 Best Practices for High Throughput

  1. Batch vs. Single Requests

    • Aggregate many small videos in /v1/batch/video to reduce overhead.
  2. Connection Management

    • Use async + webhooks rather than long polling for large jobs.
  3. Retry & Backoff

    • On HTTP 429, respect the Retry-After header.
    • Implement exponential backoff for 5xx errors.
  4. Parallelization

    • Limit concurrent requests to your rate limit.
    • Use multiple API keys (scoped per project) if needed.
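
Putting these practices together, a bounded worker pool keeps concurrency under your rate limit while still parallelizing. A sketch (max_workers=4 is an arbitrary choice; pair it with a retry helper such as the one in section 2.3):

from concurrent.futures import ThreadPoolExecutor
import requests

BASE = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": "Bearer sk_live_abc123..."}

def render(body):
    # One /generate/frame request body per task.
    return requests.post(f"{BASE}/generate/frame", headers=HEADERS, json=body).json()

tasks = []  # fill with frame-generation request bodies

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(render, tasks))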

Get Job Status

Polls for the status of an asynchronous job (e.g., video generation, batch processing).

Authorizations:
ApiKeyAuth
path Parameters
job_id
required
string
Example: job_98765abcd

The unique identifier of the asynchronous job.

Responses

Response samples

Content type
application/json
Example
{
  "job_id": "job_98765abcd",
  "status": "processing",
  "progress": 65,
  "message": "Video encoding at 65%",
  "created_at": "2023-10-26T10:00:00Z",
  "updated_at": "2023-10-26T10:05:00Z"
}

6. Model Architectures & Modes

You can control the underlying generator by passing the model parameter in your /generate/frame or /generate/video requests. Available modes trade off speed, stochasticity, and visual fidelity.


6.1 GAN-Based Mode

  • Identifier: model=gan

  • Overview: Uses a StyleGAN-like backbone extended for video. Leverages Adaptive Instance Normalization (AdaIN) to inject the identity vector into each layer and adds per-frame noise to introduce fine stochastic detail.

  • Key Components:

    • Generator: 2D convolutional "synthesis" network with style modulation.

    • Noise Inputs: Per-layer Gaussian noise tensors (nₜ ∼ N(0,1)) to diversify textures.

    • AdaIN (see the NumPy sketch at the end of this subsection):

      AdaIN(x, f_identity) = σ(f_identity) * ((x - μ(x)) / σ(x)) + μ(f_identity)
      
    • Discriminators:

      • Frame Discriminator: 2D CNN to judge individual frames.
      • Temporal Discriminator: 3D CNN to judge frame sequences.
  • API Usage Example:

    {
      "model": "gan",
      "reference_image_id": "ref_12345abcd",
      "control_type": "pose",
      "z_t": [...],
      "resolution": "1280x720"
    }
    
  • Pros & Cons:

    Pros:
      • Fast inference (1–2 sec/frame at 720p)
      • Rich, detailed textures via noise inputs
    Cons:
      • May exhibit flicker without strong temporal-loss tuning
      • Can require heavier GPUs at high resolution
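
As referenced above, a NumPy sketch of the AdaIN operation. In the real generator the style statistics would be predicted from f_identity by learned affine layers; here they are passed in directly:

import numpy as np

def adain(x, style_mean, style_std, eps=1e-5):
    """Re-normalize feature map x to the style statistics of the identity.
    x: (C, H, W) feature map; style_mean, style_std: (C,) vectors."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True) + eps
    x_norm = (x - mu) / sigma
    return style_std[:, None, None] * x_norm + style_mean[:, None, None]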

6.2 Diffusion-Based Mode

  • Identifier: model=diffusion

  • Overview: Iteratively denoises a Gaussian noise map into the target frame over S steps. Conditioned on f_identity and zₜ via cross-attention in a 2D/3D U-Net.

  • Key Components:

    • U-Net Backbone:

      • Down/up‐sampling streams with skip connections.
      • 2D convs for spatial, 3D convs or temporal-blocks for short-range temporal modeling.
    • Cross-Attention:

      • Queries from noisy frame xₜˢ; Keys/Values from [f_identity, zₜ].
    • Denoising Loss:

      L_diff = E_{ε, t, s} ‖ ε - ε_θ(xₜˢ, f_identity, zₜ) ‖²
      
  • API Usage Example:

    {
      "model": "diffusion",
      "reference_image_id": "ref_12345abcd",
      "control_type": "text",
      "z_sequence": [...],
      "format": "mp4"
    }
    
  • Pros & Cons:

    Pros:
      • Highly stable subject consistency
      • Better handling of complex scenes & lighting
    Cons:
      • Slower (10–50 sec/frame at 720p)
      • May require tuning of step count S

6.3 Transformer-Based Mode

  • Identifier: model=transformer

  • Overview: Treats video as a sequence of spatio-temporal tokens. A Vision Transformer encoder–decoder uses self-attention to model long-range dependencies, with cross-attention to integrate f_identity.

  • Key Components:

    • Patch Extraction: Splits each frame into P×P pixel patches, linearly projected to token vectors (see the sketch at the end of this subsection).

    • Spatio-Temporal Blocks:

      • Self-attention across both time (frames) and space (patches).
      • Cross-attention layers attending to [f_identity, zₜ].
    • Output Head: Reconstructs patches into image grid.

  • API Usage Example:

    {
      "model": "transformer",
      "reference_image_id": "ref_12345abcd",
      "control_type": "latent",
      "z_sequence": [...],
      "fps": 24
    }
    
  • Pros & Cons:

    Pros:
      • Excellent temporal coherence over long clips
      • Unified handling of spatial & temporal context
    Cons:
      • Highest memory usage (TPU/GPU required)
      • Slower than GAN but faster than diffusion
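
As referenced above, patch extraction reduces to a reshape. A NumPy sketch (P=16 is an arbitrary patch size; the learned linear projection is omitted):

import numpy as np

def patchify(frame, P=16):
    """Split an (H, W, 3) frame into flattened P x P patch tokens.
    H and W must be divisible by P."""
    H, W, C = frame.shape
    patches = frame.reshape(H // P, P, W // P, P, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)
    return patches  # (num_patches, P*P*C), ready for linear projection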

7. Consistency & Loss Functions

To guarantee that generated frames faithfully preserve the subject's appearance and transition smoothly, the system employs multiple specialized losses. Clients do not directly interact with these losses but can tune high-level consistency parameters via optional query fields in generation calls (e.g. identity_strength, temporal_strength).


7.1 Identity Preservation

Ensures each frame's subject matches the reference image in appearance.

  • Feature-Space Loss: Compare deep features of the masked regions of I_ref and the generated frame I_t using a pretrained VGG network:

    $$ L_{\text{identity}} = \sum_{l} \big\lVert \phi_{l}(I_{\text{ref}}\odot M_{\text{ref}}) - \phi_{l}(I_{t}\odot M_{t}) \big\rVert_{2}^{2} $$

    • $\phi_{l}$: activations at layer $l$.
    • $M_{t}$: predicted mask for frame $t$.
  • Face Recognition Loss (for facial subjects): Use a lightweight face-recognition embedding $e(\cdot)$ and enforce cosine similarity:

    $$ L_{\text{face}} = 1 - \cos\big(e(I_{\text{ref}}),\, e(I_{t})\big) $$
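
Given precomputed VGG activations and face embeddings, both losses reduce to a few lines. A NumPy sketch (the feature extraction and masking are assumed to have happened upstream):

import numpy as np

def identity_loss(feats_ref, feats_gen):
    """Sum over layers l of || phi_l(I_ref ⊙ M_ref) - phi_l(I_t ⊙ M_t) ||².
    feats_*: lists of per-layer activation arrays."""
    return sum(np.sum((a - b) ** 2) for a, b in zip(feats_ref, feats_gen))

def face_loss(e_ref, e_gen):
    """L_face = 1 - cosine similarity between face embeddings."""
    cos = np.dot(e_ref, e_gen) / (np.linalg.norm(e_ref) * np.linalg.norm(e_gen))
    return 1.0 - cos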


7.2 Temporal Coherence

Encourages smooth transitions and discourages flicker or jumpiness.

  • Optical-Flow Warping Loss: Compute forward flow $F_{t\to t+1}$ via a flow network, warp $I_{t}$, and compare to $I_{t+1}$:

    $$ L_{\text{flow}} = \big\lVert \text{Warp}(I_{t},\, F_{t\to t+1}) - I_{t+1}\big\rVert_{1} $$

  • Temporal Discriminator Loss: A 3D-CNN discriminator $D_{\text{temp}}$ distinguishes real from generated clips:

    $$ L_{\text{temp}} = \mathbb{E}\big[\log D_{\text{temp}}(\{I_{t}\}_{\text{real}})\big] + \mathbb{E}\big[\log\big(1 - D_{\text{temp}}(\{I_{t}\}_{\text{gen}})\big)\big] $$

  • Sliding-Window Attention: Transformer-style self-attention over neighboring frames:

    $$ \hat h_{t} = \mathrm{Attention}\big(Q_{t},\, K_{t-k:t+k},\, V_{t-k:t+k}\big) $$

    – implicitly enforces consistency over a window of size $2k+1$.


7.3 Domain-Specific or Perceptual Losses

Additional optional losses depending on subject type or client needs.

  • Perceptual Loss: enforces color/style consistency via LPIPS (parameter: perceptual_weight)
  • GAN Adversarial: adds sharpness/detail via frame & temporal GANs (parameter: adv_weight)
  • Pose Regularization: penalizes implausible keypoint jumps (parameter: pose_smoothness)

Clients can adjust these via optional fields in generation requests:

{
  "reference_image_id": "...",
  "model": "diffusion",
  "control_type": "text",
  "z_sequence": [...],
  "identity_strength": 0.7,
  "temporal_strength": 0.8,
  "adv_weight": 0.5
}

10. Examples & SDKs

Below are quickstart snippets in cURL, Python, and JavaScript. For full reference, see our GitHub repos and PyPI/npm packages.


10.1 cURL Examples

10.1.1 Upload Reference Image

curl -X POST "https://api.aicharacter.studio/v1/reference-image" \\
  -H "Authorization: Bearer $API_KEY" \\
  -F "image=@/path/to/subject.jpg"

10.1.2 Generate Pose Control Vector

curl -X POST "https://api.aicharacter.studio/v1/control/pose" \\
  -H "Authorization: Bearer $API_KEY" \\
  -H "Content-Type: application/json" \\
  -d '{ 
        "reference_image_id": "ref_12345abcd",
        "keypoints": [
          { "x": 0.5, "y": 0.1 },
          { "x": 0.5, "y": 0.25 }
        ]
      }'

10.1.3 Render a Single Frame

curl -X POST "https://api.aicharacter.studio/v1/generate/frame" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ 
        "reference_image_id": "ref_12345abcd",
        "control_type": "pose",
        "z_t": [0.12, -0.33, 0.77],
        "resolution": "1280x720"
      }'

10.1.4 Start an Async Video Job

curl -X POST "https://api.aicharacter.studio/v1/generate/video" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ 
        "reference_image_id": "ref_12345abcd",
        "control_type": "text",
        "z_sequence": [[…], […]],
        "async": true,
        "callback_url": "https://your.app/hooks/complete"
      }'

10.2 Python SDK

from aicharacter import Client

client = Client(api_key="sk_live_abc123")

# 1. Upload image
ref = client.upload_reference_image("/path/to/subject.jpg")

# 2. Generate text control
z_seq = client.control.text(
    reference_image_id=ref.id,
    prompt="A cat playing piano",
    steps=8
)

# 3. Generate video synchronously
video = client.generate.video(
    reference_image_id=ref.id,
    control_type="text",
    z_sequence=z_seq,
    format="mp4",
    resolution="720x720",
    fps=24
)

print("Download at:", video.video_url)

10.3 JavaScript (Node/Vue) SDK

import { AiCharacterClient } from "@aicharacter/studio";

const client = new AiCharacterClient({ apiKey: "sk_live_abc123" });

async function run() {
  // 1. Upload
  const { reference_image_id } = await client.referenceImage.upload(
    "/path/to/subject.jpg"
  );

  // 2. Pose control
  const { z_t } = await client.control.pose({
    reference_image_id,
    keypoints: [{ x: 0.5, y: 0.1 }, { x: 0.5, y: 0.25 }]
  });

  // 3. Render frame
  const frame = await client.generate.frame({
    reference_image_id,
    control_type: "pose",
    z_t,
    resolution: "1280x720"
  });

  console.log("Frame URL:", frame.image_url);
}

run();

11. FAQs & Troubleshooting

Q1: My generated frames flicker—how can I improve temporal stability?

  • Solution: Increase the temporal_strength parameter in your generation call (range 0.0–1.0, default 0.8).
  • Tip: Use model=diffusion or model=transformer, which natively enforce stronger temporal consistency than gan.

Q2: I keep getting HTTP 429—what's the best retry strategy?

  • Honor the Retry-After header in the 429 response.
  • Implement exponential backoff (e.g., initial 1 s, multiply by 2 up to a max of 32 s).
  • For large jobs, switch to async=true + webhook to avoid holding client connections open.

Q3: My mask endpoint returns 404 even though the upload succeeded.

  • Confirm that the reference image's status (from GET /v1/reference-image/{id}) is "processed".
  • If status is "pending", wait a few seconds and retry.
  • If it remains pending > 60 s, contact support.

Q4: Which control type is best for highly dynamic motions?

  • Pose: precise but requires keypoint data per frame.
  • Text: easy to specify broad actions ("running," "jumping") and auto‐generates sequences.
  • Latent: use for abstract or style-driven motion; less semantically grounded.

Q5: How do I mask out the background in generated frames?

  • Generated frames come with optional masks if you append ?include_mask=true to your frame/video URL.
  • The mask URL will be in your JSON response under mask_url.

Q6: I need higher resolution than 1280×720—what are my options?

  • Supported resolutions: up to 1920×1080 on shared GPUs, and up to 4K (3840×2160) on dedicated A100 instances.
  • To request >1080p, include header X-GPU-Type: a100.

Q7: My asynchronous job failed with "ResourceExhausted."

  • This indicates GPU memory limits were exceeded (e.g., too many high-res frames).
  • Reduce resolution or switch to async=true with a higher-tier GPU via X-GPU-Type.
  • If persistent, split into smaller batches.

Q8: Can I reuse existing z_sequence across multiple videos?

  • Yes. For reproducibility, store both reference_image_id and the exact z_sequence.
  • Ensure you also specify the same seed when using latent control for deterministic output.

Q9: What image/video formats are supported?

  • Frames: PNG or JPEG (via output_format).
  • Videos: MP4 (H.264) or animated GIF.

Q10: Who do I contact for support or to request new features?

Q11: How do I get access to use the API?

  • Generate and manage API keys from your dashboard at aicharacter.studio (see section 2.2, API Key Management).