This API enables clients to turn a single "reference" image of a subject into a full video sequence in which that subject appears consistently—regardless of changes in pose, background, or action. By uploading one image, consumers can programmatically request new frames or entire videos that preserve the subject's visual identity, driven by arbitrary control signals (e.g. poses, text prompts, or latent codes).
Key concepts:
Concept | Description |
---|---|
Reference Image | The single input image (I_ref) that defines the subject's appearance (size H×W×3). |
Identity Feature Vector | A learned embedding (f_identity ∈ ℝᵈ) encoding the subject's appearance, disentangled from pose or scene. |
Control Signals | Per-frame vectors (z_t) that drive motion, pose, or style. Can come from: Pose Keypoints (2D or 3D joint locations), Text Prompts (via CLIP or another text encoder), or Custom Latent Codes (sampled or learned priors). |
Temporal Coherence | Mechanisms (optical-flow losses, temporal discriminators, attention across frames) ensuring smooth transitions between generated frames. |
Typical workflow:

1. Upload & Process Reference Image: Upload I_ref and receive a reference_image_id; the service also derives a segmentation mask M_ref and embedding f_identity.
2. Generate Control Sequence: Produce z_t vectors describing the desired per-frame variation.
3. Frame / Video Generation: Submit a single z_t → get I_t; submit [z_1…z_T] → get a packaged video (MP4/GIF).
4. Monitor & Retrieve: Poll asynchronous jobs and download the generated assets.
All endpoints are hosted under the versioned base URL:
https://api.aicharacter.studio/v1
Note: All requests must use HTTPS.
Header | Value | Description |
---|---|---|
Authorization | Bearer <API_KEY> | Your API key as a Bearer token. |
Content-Type | application/json | For all JSON request bodies. |
Accept | application/json | For all JSON responses. |
Generate and manage keys in your dashboard at aicharacter.studio.
Keys are tied to scopes; each key bears a set of permissions:
* reference:read – upload and fetch reference-image data
* control:write – generate control vectors (pose/text/latent)
* generate:write – produce frames or videos

POST /v1/reference-image HTTP/1.1
Host: api.aicharacter.studio
Authorization: Bearer sk_live_abc123…
Content-Type: multipart/form-data; boundary=----XYZ
------XYZ
Content-Disposition: form-data; name="image"; filename="subject.jpg"
Content-Type: image/jpeg
<…binary image data…>
------XYZ--
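The same upload can be issued from Python with the requests library. This is a minimal sketch rather than the official SDK; it assumes the response shape documented in the Reference Image section below.

import requests

API_KEY = "sk_live_abc123"   # Bearer token from your dashboard
BASE_URL = "https://api.aicharacter.studio/v1"

# Multipart upload of the reference image; requests generates the boundary for us.
with open("subject.jpg", "rb") as fh:
    resp = requests.post(
        f"{BASE_URL}/reference-image",
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"image": ("subject.jpg", fh, "image/jpeg")},
    )
resp.raise_for_status()
print(resp.json()["reference_image_id"])   # e.g. "ref_12345abcd"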
Limit Type | Value |
---|---|
Standard | 100 requests per minute |
Burst | 50 requests per 10 seconds |
When you approach or exceed your limit, the API will respond with HTTP 429 and include:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1618881600
* Retry-After: seconds until you may retry.
* X-RateLimit-Reset: UNIX timestamp when the window resets.

All errors return a JSON payload in the following format:
{
"error": {
"code": "string",
"message": "string",
"details": { … }
}
}
HTTP Status | Code | Meaning |
---|---|---|
400 | BadRequest | Invalid request syntax or parameters |
401 | Unauthorized | Missing or invalid API key |
403 | Forbidden | Insufficient scope or access denied |
404 | NotFound | Resource (e.g. reference_image_id) not found |
422 | UnprocessableEntity | Semantic validation failed (e.g. bad keypoint data) |
429 | RateLimitExceeded | Rate limit exceeded |
500 | InternalError | Server-side error; please retry later |
HTTP/1.1 400 Bad Request
Content-Type: application/json
{
"error": {
"code": "BadRequest",
"message": "Request body could not be parsed as JSON.",
"details": {
"line": 4,
"column": 15
}
}
}
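A client can centralize handling of this envelope. The helper below is a hedged sketch; the function name and the exceptions it raises are ours, not part of the API.

import requests

def check_response(resp: requests.Response) -> dict:
    """Return the parsed body on success, otherwise raise using the documented error envelope."""
    if resp.ok:
        return resp.json()
    try:
        err = resp.json()["error"]
    except (ValueError, KeyError):
        resp.raise_for_status()   # non-JSON error body: fall back to requests' own error
    if resp.status_code == 429:
        wait = resp.headers.get("Retry-After", "unknown")
        raise RuntimeError(f"RateLimitExceeded: retry after {wait}s")
    raise RuntimeError(f"{err['code']}: {err['message']} (details={err.get('details')})")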
This section covers endpoints for uploading a reference image, retrieving its metadata, obtaining a segmentation mask, and extracting the identity feature vector.
Upload a single image that defines the subject's appearance.
image required | string <binary> JPEG/PNG binary data of the subject. |
name | string Optional user label or filename. |
{- "reference_image_id": "ref_12345abcd",
- "width": 1024,
- "height": 768,
- "uploaded_at": "2025-05-21T08:42:13Z"
}
Retrieve basic metadata for a previously uploaded reference image.
reference_image_id required | string Example: ref_12345abcd ID returned on upload. |
{- "reference_image_id": "ref_12345abcd",
- "width": 1024,
- "height": 768,
- "status": "processed",
- "uploaded_at": "2025-05-21T08:42:13Z"
}
Download the binary mask isolating the subject from background. Clients can use If-None-Match
with a previously obtained ETag to perform conditional requests.
reference_image_id required | string Example: ref_12345abcd ID returned on upload. |
format | string Default: "png" Enum: "png" "base64" Output encoding for the mask. |
Return the subject's embedding vector, disentangled from pose and context.
reference_image_id required | string Example: ref_12345abcd ID returned on upload. |
model | string Default: "cnn" Enum: "cnn" "vit" Backbone used to extract the identity embedding. |
{- "reference_image_id": "ref_12345abcd",
- "model": "cnn",
- "feature_dim": 512,
- "f_identity": [
- 0.023,
- -1.14,
- 0.987,
- 0.001
]
}
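One practical use of this endpoint is checking whether two uploads depict the same subject by comparing their embeddings. The sketch below assumes the response shape shown above; the GET path and the similarity threshold are our assumptions, so confirm the route against the OpenAPI spec.

import numpy as np
import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"}

def identity_vector(reference_image_id: str) -> np.ndarray:
    # Path assumed from the section layout; verify it in the OpenAPI spec.
    resp = requests.get(
        f"{BASE_URL}/reference-image/{reference_image_id}/identity-features",
        headers=HEADERS,
        params={"model": "cnn"},
    )
    resp.raise_for_status()
    return np.asarray(resp.json()["f_identity"], dtype=np.float32)

a = identity_vector("ref_12345abcd")
b = identity_vector("ref_98765wxyz")
cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print("likely same subject:", cosine > 0.8)   # 0.8 is an illustrative threshold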
Endpoints in this section generate per-frame control vectors (zₜ
) that drive pose, text, or custom latent-based variation. Clients supply the reference_image_id
and modality-specific inputs; the API returns one or more zₜ
vectors for downstream frame/video generation.
Encode a set of 2D or 3D keypoints into a control vector zₜ
for a single frame.
reference_image_id required | string ID of the uploaded reference image. |
keypoints required | Array of objects (Keypoint) List of joint coordinates. |
format | string Default: "json" Enum: "json" "base64" Encoding of the keypoints payload (reserved for alternative inputs; the response is always JSON). |
{- "reference_image_id": "ref_12345abcd",
- "keypoints": [
- {
- "x": 0.5,
- "y": 0.1
}, - {
- "x": 0.52,
- "y": 0.25
}
]
}
{- "reference_image_id": "ref_12345abcd",
- "control_type": "pose",
- "frame_index": 0,
- "z_t": [
- 0.023,
- -1.14,
- 0.987,
- 0.001
]
}
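The example keypoints above use coordinates in the 0–1 range. If your pose detector emits pixel coordinates, a small conversion helper such as the sketch below (our code, not part of the API; the normalization convention is inferred from the example values) produces a valid request body for POST /v1/control/pose.

import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

def pose_control(reference_image_id: str, pixel_keypoints, width: int, height: int) -> list:
    # Normalize pixel joints (x_px, y_px) to the 0-1 range used in the examples above.
    keypoints = [{"x": x / width, "y": y / height} for x, y in pixel_keypoints]
    resp = requests.post(
        f"{BASE_URL}/control/pose",
        headers=HEADERS,
        json={"reference_image_id": reference_image_id, "keypoints": keypoints},
    )
    resp.raise_for_status()
    return resp.json()["z_t"]

z_t = pose_control("ref_12345abcd", [(512, 102), (532, 256)], width=1024, height=1024)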
Turn a natural-language prompt into a sequence of control vectors [z₁…zₜ]
via a text encoder + transformer.
reference_image_id required | string ID of the uploaded reference image. |
prompt required | string Text string describing desired action/scene. |
steps required | integer <int32> >= 1 Number of frames to generate control vectors for. |
model | string Default: "clip" Enum: "clip" "custom-nlp" Text encoder to use. |
{- "reference_image_id": "ref_12345abcd",
- "prompt": "A dancing robot in a city street",
- "steps": 16
}
{- "reference_image_id": "ref_12345abcd",
- "control_type": "text",
- "steps": 16,
- "z_sequence": [
- [
- 0.12,
- -0.33,
- 0.77,
- 0.002
], - [
- 0.1,
- -0.3,
- 0.8,
- 0.003
]
]
}
Generate a sequence of random or learned latent vectors [z₁…zₜ]
suitable for scene or style variation.
length required | integer <int32> >= 1 Number of vectors to generate (T). |
seed | integer <int64> Deterministic RNG seed. |
distribution | string Default: "gaussian" Enum: "gaussian" "uniform" Sampling distribution for the latent codes. |
{- "control_type": "latent",
- "length": 8,
- "z_sequence": [
- [
- 0.05,
- -1.2,
- 0.32,
- 0.004
], - [
- 0.02,
- -1.15,
- 0.29,
- 0.005
]
]
}
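No request example is shown for this endpoint, so the sketch below builds one from the parameters listed above (length, seed, distribution) and chains the resulting z_sequence into video generation. The /v1/control/latent path is inferred from the pose endpoint's naming and should be verified against the OpenAPI spec.

import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# 1. Sample a deterministic latent control sequence (path inferred; see note above).
latent = requests.post(
    f"{BASE_URL}/control/latent",
    headers=HEADERS,
    json={"length": 8, "seed": 42, "distribution": "gaussian"},
)
latent.raise_for_status()
z_sequence = latent.json()["z_sequence"]

# 2. Render a short clip driven by that sequence.
video = requests.post(
    f"{BASE_URL}/generate/video",
    headers=HEADERS,
    json={
        "reference_image_id": "ref_12345abcd",
        "control_type": "latent",
        "z_sequence": z_sequence,
        "format": "gif",
        "fps": 12,
    },
)
video.raise_for_status()
print(video.json())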
Endpoints in this section consume one or more control vectors (zₜ
) along with the reference_image_id
to render either a single frame or a full video.
Render a single frame Iₜ
given a control vector zₜ
. Supports conditional requests via If-None-Match
if attempting to regenerate with identical parameters.
custom_cache_duration_seconds | integer Optional. Custom cache duration for the generated asset in seconds. |
X-GPU-Type | string Default: v100 Enum: "v100" "a100" "t4" Specify the type of GPU for processing. |
X-Instance-Count | integer [ 1 .. 4 ] Number of GPU workers (1-4) for asynchronous tasks (if applicable, though this endpoint is primarily sync). |
reference_image_id required | string ID of the uploaded reference image. |
control_type required | string Enum: "pose" "text" "latent" Type of control signal provided. |
z_t required | Array of numbers <float> Control vector for this frame (length must match the model's latent dimension). |
resolution | string (pattern ^\d+x\d+$) Default: "720x720" Output size, e.g. "1280x720". |
model | string Enum: "gan" "diffusion" "transformer" Generator mode. |
output_format | string Default: "url" Enum: "url" "base64" Desired output format for the image. |
identity_strength | number or null <float> Weight for identity preservation. Higher values enforce stricter subject matching. |
temporal_strength | number or null <float> Weight for temporal coherence. Higher values promote smoother transitions. |
perceptual_weight | number or null <float> Weight for perceptual loss (e.g., LPIPS). |
adv_weight | number or null <float> Weight for adversarial loss from GAN discriminators. |
pose_smoothness | number or null <float> Weight for pose regularization to penalize jittery movements. |
url_ttl | integer or null <int32> Optional. Time-to-live in seconds for the returned image_url. |
{- "reference_image_id": "ref_12345abcd",
- "control_type": "pose",
- "z_t": [
- 0.12,
- -0.33,
- 0.77,
- 0.001
], - "resolution": "1280x720",
- "model": "gan"
}
{- "reference_image_id": "ref_12345abcd",
- "frame_index": 0,
- "resolution": "1280x720",
}
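The endpoint above advertises conditional requests via If-None-Match. The sketch below shows the intended pattern under two assumptions that should be verified against the spec: the generation response carries an ETag header, and a matching re-POST returns 304 Not Modified.

import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

body = {
    "reference_image_id": "ref_12345abcd",
    "control_type": "pose",
    "z_t": [0.12, -0.33, 0.77, 0.001],
    "resolution": "1280x720",
    "model": "gan",
}

first = requests.post(f"{BASE_URL}/generate/frame", headers=HEADERS, json=body)
first.raise_for_status()
etag = first.headers.get("ETag")   # assumed to be returned on generation responses

# Re-POST with identical parameters; a 304 means the cached frame is still valid.
retry = requests.post(
    f"{BASE_URL}/generate/frame",
    headers={**HEADERS, "If-None-Match": etag} if etag else HEADERS,
    json=body,
)
if retry.status_code == 304:
    print("Frame unchanged; reuse the previously returned image_url.")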
Produce an entire video (T
frames) in one request. Supports both synchronous (small T
) and asynchronous (large T
) workflows.
Render T
frames as an MP4 or GIF, given a sequence of control vectors [z₁…z_T]
. Supports conditional requests via If-None-Match
if attempting to regenerate with identical parameters.
custom_cache_duration_seconds | integer Optional. Custom cache duration for the generated video in seconds. |
X-GPU-Type | string Default: v100 Enum: "v100" "a100" "t4" Specify the type of GPU for processing. |
X-Instance-Count | integer [ 1 .. 4 ] Number of GPU workers (1-4) for asynchronous tasks. |
reference_image_id required | string ID of the uploaded reference image. |
control_type required | string Enum: "pose" "text" "latent" Type of control signal sequence provided. |
z_sequence required | Array of Arrays of numbers <float> List of per-frame control vectors [z₁…z_T]. |
format | string Default: "mp4" Enum: "mp4" "gif" Desired output video format. |
resolution | string (pattern ^\d+x\d+$) Default: "720x720" Video size, e.g. "1280x720". |
fps | integer <int32> Default: 30 Frames per second. |
model | string Default: "gan" Enum: "gan" "diffusion" "transformer" Generator mode to use for video generation. |
async | boolean Default: false If true, the request is queued and a 202 Accepted response with a job_id is returned instead of the finished video. |
callback_url | string or null <url> If async is true, the job result is POSTed to this URL on completion. |
identity_strength | number or null <float> Weight for identity preservation. Higher values enforce stricter subject matching. |
temporal_strength | number or null <float> Weight for temporal coherence. Higher values promote smoother transitions. |
perceptual_weight | number or null <float> Weight for perceptual loss (e.g., LPIPS). |
adv_weight | number or null <float> Weight for adversarial loss from GAN discriminators. |
pose_smoothness | number or null <float> Weight for pose regularization to penalize jittery movements. |
url_ttl | integer or null <int32> Optional. Time-to-live in seconds for the returned video_url. |
{- "reference_image_id": "ref_12345abcd",
- "control_type": "text",
- "z_sequence": [
- [
- 0.05,
- -1.2,
- 0.32,
- 0.001
], - [
- 0.02,
- -1.15,
- 0.29,
- 0.002
]
], - "format": "mp4",
- "resolution": "1280x720",
- "fps": 24,
- "model": "gan"
}
{- "reference_image_id": "ref_12345abcd",
- "frames": 16,
- "format": "mp4",
- "resolution": "1280x720",
- "fps": 24,
- "duration_sec": 0.67,
}
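For long clips, the async flag documented above returns 202 Accepted with a job reference instead of blocking. A minimal sketch, assuming the job-creation response shown in the Performance & Job Management section (job_id and status_url fields):

import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}

# Placeholder control sequence; in practice this comes from a /control/* call.
z_sequence = [[0.05, -1.2, 0.32, 0.001], [0.02, -1.15, 0.29, 0.002]]

resp = requests.post(
    f"{BASE_URL}/generate/video",
    headers=HEADERS,
    json={
        "reference_image_id": "ref_12345abcd",
        "control_type": "text",
        "z_sequence": z_sequence,
        "format": "mp4",
        "fps": 24,
        "async": True,
        "callback_url": "https://your.app/hooks/complete",
    },
)

if resp.status_code == 202:
    job = resp.json()
    print("queued:", job["job_id"], "status at:", job["status_url"])
else:
    resp.raise_for_status()
    print("finished synchronously:", resp.json())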
This section describes two higher-level capabilities: 3D/generalization via NeRF and batch-style multi-reference processing.
Integration with NeRF (Neural Radiance Fields) provides enhanced 3D pose and view generalization. A subject-specific NeRF endpoint is documented later in this section; generator modes with 3D awareness also leverage these capabilities implicitly.
Functionality for processing multiple requests efficiently.
Process an array of independent video-generation tasks. Each task can target a different reference_image_id
and control sequence.
Supports both synchronous and asynchronous processing. Supports conditional requests via If-None-Match
if attempting to regenerate with identical parameters.
custom_cache_duration_seconds | integer Optional. Custom cache duration for all generated assets in this batch in seconds. |
X-GPU-Type | string Default: v100 Enum: "v100" "a100" "t4" Specify the type of GPU for processing the batch. |
X-Instance-Count | integer [ 1 .. 4 ] Number of GPU workers (1-4) for asynchronous batch tasks. |
tasks required | Array of objects (BatchVideoTask) non-empty A list of individual video generation tasks. |
async | boolean Default: false If true, the batch is queued and a 202 Accepted response with a job_id is returned instead of the finished results. |
callback_url | string or null <url> If async is true, batch results are POSTed to this URL on completion. |
{- "tasks": [
- {
- "reference_image_id": "ref_12345abcd",
- "control_type": "text",
- "z_sequence": [
- [
- 0.1,
- 0.2
], - [
- 0.3,
- 0.4
]
], - "format": "mp4",
- "resolution": "640x480"
}, - {
- "reference_image_id": "ref_98765wxyz",
- "control_type": "pose",
- "z_sequence": [
- [
- 0.5,
- 0.6
], - [
- 0.7,
- 0.8
]
], - "fps": 24
}
], - "async": true,
}
{- "results": [
- {
- "reference_image_id": "ref_12345abcd",
}, - {
- "reference_image_id": "ref_98765wxyz",
}
]
}
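Building the tasks array programmatically is usually easier than writing it by hand. A hedged sketch assuming the request and response shapes shown above:

import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"

# One independent video task per (reference image, control type, control sequence) tuple.
inputs = [
    ("ref_12345abcd", "text", [[0.1, 0.2], [0.3, 0.4]]),
    ("ref_98765wxyz", "pose", [[0.5, 0.6], [0.7, 0.8]]),
]
tasks = [
    {"reference_image_id": rid, "control_type": ctype, "z_sequence": zs, "format": "mp4"}
    for rid, ctype, zs in inputs
]

resp = requests.post(
    f"{BASE_URL}/batch/video",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={"tasks": tasks, "async": True},
)
resp.raise_for_status()
print(resp.json())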
Build a subject-specific NeRF model and render either the raw NeRF or a video of arbitrary camera trajectories. Supports conditional requests via If-None-Match
if attempting to regenerate with identical parameters.
custom_cache_duration_seconds | integer Optional. Custom cache duration for the generated NeRF model/video in seconds. |
X-GPU-Type | string Default: v100 Enum: "v100" "a100" "t4" Specify the type of GPU for NeRF processing. |
X-Instance-Count | integer [ 1 .. 4 ] Number of GPU workers (1-4) for NeRF processing (primarily for async NeRF tasks if supported). |
reference_image_id required | string ID of the uploaded reference image. |
camera_trajectory required | Array of objects (CameraPose) non-empty List of camera poses for rendering or model fitting. |
resolution | string (pattern ^\d+x\d+$) Default: "320x240" Render size, e.g. "640x480". |
output_format | string Default: "video" Enum: "nerf_model" "video" Desired output format. |
fps | integer or null <int32> Default: 24 If output_format is video, frames per second. |
url_ttl | integer or null <int32> Optional. Time-to-live in seconds for the returned model or video URL. |
{- "reference_image_id": "ref_12345abcd",
- "camera_trajectory": [
- {
- "yaw": 0,
- "pitch": 10,
- "roll": 0,
- "distance": 2
}, - {
- "yaw": 45,
- "pitch": 15,
- "roll": 0,
- "distance": 2
}, - {
- "yaw": 90,
- "pitch": 10,
- "roll": 0,
- "distance": 2
}
], - "resolution": "640x480",
- "output_format": "video",
- "fps": 30
}
{- "reference_image_id": "ref_12345abcd",
- "output_format": "nerf_model",
}
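Camera trajectories are plain lists of CameraPose objects, so an orbit can be generated in a loop. The sketch below builds a 360° yaw sweep matching the request example above; the endpoint path is not visible in this excerpt, so it is a placeholder to replace with the route from the OpenAPI spec.

import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"
NERF_PATH = "/advanced/nerf"   # placeholder: confirm the exact route in the OpenAPI spec

# Twelve poses orbiting the subject at fixed pitch and distance.
trajectory = [{"yaw": yaw, "pitch": 10, "roll": 0, "distance": 2} for yaw in range(0, 360, 30)]

resp = requests.post(
    f"{BASE_URL}{NERF_PATH}",
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json={
        "reference_image_id": "ref_12345abcd",
        "camera_trajectory": trajectory,
        "resolution": "640x480",
        "output_format": "video",
        "fps": 30,
    },
)
resp.raise_for_status()
print(resp.json())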
To meet the needs of diverse workloads—from single-frame tests to large-scale video batches—our API offers both synchronous and asynchronous modes, fine-grained job management, and options for GPU sizing, caching, and CDN delivery.
async Flag

Add "async": true in your /generate/video or /v1/batch/video requests to enqueue work without blocking your connection.
Job Creation Response
HTTP/1.1 202 Accepted
Location: /v1/jobs/{job_id}
{
"job_id": "job_abc123",
"status_url": "https://api.aicharacter.studio/v1/jobs/job_abc123"
}
Callback/Webhook
Supply "callback_url": "https://your.app/webhook"
to have job results POSTed when complete:
{
"job_id": "job_abc123",
"status": "completed",
"result": { /* identical to sync response */ }
}
Endpoint
GET /v1/jobs/{job_id}
Response Fields
Field | Type | Description |
---|---|---|
status | string | pending, processing, completed, failed |
progress | number | Percentage 0–100 of completion |
result | object | Populated when status=completed |
error | object | Populated when status=failed |
Example
GET /v1/jobs/job_abc123 HTTP/1.1
Authorization: Bearer sk_live_…
{
"job_id": "job_abc123",
"status": "processing",
"progress": 47.5
}
By default, jobs run on a shared GPU cluster. For predictable performance:
* X-GPU-Type header (optional): v100 (default), a100, or t4.
* X-Instance-Count header (async only): number of dedicated GPU workers (1-4).
Note: Additional charges apply for dedicated hardware.
ETags & Conditional Requests
ETag
header.GET
with If-None-Match
to reuse cached results when control inputs and reference image are unchanged. (Note: For POST generation endpoints, this implies re-POSTing with same parameters might yield cached result URL if underlying generation matches a cached one).Cache TTL
custom_cache_duration_seconds
query parameter on generation calls.CDN-Backed URLs
image_url
and video_url
links point to a global CDN for low-latency delivery.URL Expiry
url_ttl
(in seconds) parameter on generation requests.Batch vs. Single Requests
/v1/batch/video
to reduce overhead.Connection Management
async
+ webhooks rather than long polling for large jobs.Retry & Backoff
Retry-After
header.Parallelization
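A minimal retry helper following the Retry-After guidance above. The wrapper, its fallback delay, and the attempt cap are ours, not part of the SDK.

import time
import requests

def post_with_backoff(url, *, headers, json, max_attempts=5):
    """POST, honoring Retry-After on 429 and otherwise backing off exponentially."""
    delay = 1.0
    resp = None
    for _ in range(max_attempts):
        resp = requests.post(url, headers=headers, json=json)
        if resp.status_code != 429:
            return resp
        # Prefer the server-provided wait; fall back to our exponential delay.
        time.sleep(float(resp.headers.get("Retry-After", delay)))
        delay *= 2
    return resp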
Polls for the status of an asynchronous job (e.g., video generation, batch processing).
job_id required | string Example: job_98765abcd The unique identifier of the asynchronous job. |
{- "job_id": "job_98765abcd",
- "status": "in_progress",
- "progress": 65,
- "message": "Video encoding at 65%",
- "created_at": "2023-10-26T10:00:00Z",
- "updated_at": "2023-10-26T10:05:00Z"
}
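For clients that cannot receive webhooks, a simple polling loop against this endpoint looks like the sketch below. The poll interval is arbitrary, and both "processing" and the "in_progress" value seen in the example above are treated as non-terminal.

import time
import requests

API_KEY = "sk_live_abc123"
BASE_URL = "https://api.aicharacter.studio/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}", "Accept": "application/json"}

def wait_for_job(job_id: str, poll_seconds: float = 5.0) -> dict:
    """Poll GET /v1/jobs/{job_id} until the job completes or fails."""
    while True:
        resp = requests.get(f"{BASE_URL}/jobs/{job_id}", headers=HEADERS)
        resp.raise_for_status()
        job = resp.json()
        if job["status"] in ("completed", "failed"):
            return job
        print(f"{job['status']}: {job.get('progress', 0)}%")
        time.sleep(poll_seconds)

job = wait_for_job("job_abc123")
print(job.get("result") or job.get("error"))

As noted in the best-practices list above, prefer webhooks over polling for large jobs.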
You can control the underlying generator by passing the model
parameter in your /generate/frame
or /generate/video
requests. Available modes trade off speed, stochasticity, and visual fidelity.
Identifier: model=gan
Overview: Uses a StyleGAN-like backbone extended for video. Leverages Adaptive Instance Normalization (AdaIN) to inject the identity vector into each layer and adds per-frame noise to introduce fine stochastic detail.
Key Components:
Generator: 2D convolutional "synthesis" network with style modulation.
Noise Inputs: Per-layer Gaussian noise tensors (n_t ∼ N(0, 1)) to diversify textures.
AdaIN (see the NumPy sketch after this list):
AdaIN(x, f_identity) = σ(f_identity) * ((x - μ(x)) / σ(x)) + μ(f_identity)
Discriminators: A per-frame image discriminator plus a temporal discriminator over short clips (see the consistency mechanisms section below).
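For intuition, AdaIN renormalizes each feature map with statistics derived from the identity embedding. Below is a minimal NumPy sketch of the formula above; how μ(f_identity) and σ(f_identity) are produced inside the generator is not specified here, so a learned affine projection is assumed and stubbed with random values.

import numpy as np

def adain(x: np.ndarray, mu_id: np.ndarray, sigma_id: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """AdaIN on a (C, H, W) feature map: swap per-channel stats for identity-derived ones."""
    mu_x = x.mean(axis=(1, 2), keepdims=True)
    sigma_x = x.std(axis=(1, 2), keepdims=True) + eps
    return sigma_id[:, None, None] * (x - mu_x) / sigma_x + mu_id[:, None, None]

x = np.random.randn(64, 32, 32).astype(np.float32)          # synthesis-layer feature map
mu_id = np.random.randn(64).astype(np.float32)               # stand-in for mu(f_identity)
sigma_id = np.abs(np.random.randn(64)).astype(np.float32)    # stand-in for sigma(f_identity)
y = adain(x, mu_id, sigma_id)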
API Usage Example:
{
"model": "gan",
"reference_image_id": "ref_12345abcd",
"control_type": "pose",
"z_t": [...],
"resolution": "1280x720"
}
Pros & Cons:
Pros | Cons |
---|---|
Fast inference (1–2 sec/frame at 720p) | May exhibit flicker without strong temporal loss tuning |
Rich, detailed textures via noise inputs | Can require heavier GPU for high-res |
Identifier: model=diffusion
Overview:
Iteratively denoises a Gaussian noise map into the target frame over S steps. Conditioned on f_identity and z_t via cross-attention in a 2D/3D U-Net.
Key Components:
U-Net Backbone:
Cross-Attention:
x\u209c\u02e2
; Keys/Values from [f_identity, z\u209c]
.Denoising Loss:
L_diff = E_{\u03b5, t, s} \u2016 \u03b5 - \u03b5_\u03b8(x\u209c\u02e2, f_identity, z\u209c) \u2016\u00b2
API Usage Example:
{
"model": "diffusion",
"reference_image_id": "ref_12345abcd",
"control_type": "text",
"z_sequence": [...],
"format": "mp4"
}
Pros & Cons:
Pros | Cons |
---|---|
Highly stable subject consistency | Slower (10–50 sec/frame at 720p) |
Better handling of complex scenes & lighting | May require tuning of step count S |
Identifier: model=transformer
Overview:
Treats video as a sequence of spatio-temporal tokens. A Vision Transformer encoder–decoder uses self-attention to model long-range dependencies, with cross-attention to integrate f_identity
.
Key Components:
Patch Extraction: Splits each frame into P×P pixel patches, linearly projected to token vectors.
Spatio-Temporal Blocks: Self-attention across spatial and temporal tokens, with cross-attention to [f_identity, z_t].
Output Head: Reconstructs patches into the image grid.
API Usage Example:
{
"model": "transformer",
"reference_image_id": "ref_12345abcd",
"control_type": "latent",
"z_sequence": [...],
"fps": 24
}
Pros & Cons:
Pros | Cons |
---|---|
Excellent temporal coherence over long clips | Highest memory usage (TPU/GPU required) |
Unified handling of spatial & temporal context | Slower than GAN but faster than diffusion |
To guarantee that generated frames faithfully preserve the subject's appearance and transition smoothly, the system employs multiple specialized losses. Clients do not interact with these losses directly but can tune high-level consistency parameters via optional request fields in generation calls (e.g. identity_strength, temporal_strength).
Ensures each frame's subject matches the reference image in appearance.
Feature-Space Loss
Compare deep features of masked regions between I_ref
and generated frame I_t
using a pretrained VGG network:
$$ L_{\text{identity}} = \sum_{l} \big\lVert \phi_{l}(I_{\text{ref}}\odot M_{\text{ref}}) - \phi_{l}(I_{t}\odot M_{t}) \big\rVert_{2}^{2} $$
Face Recognition Loss (for facial subjects) Use a lightweight face-recognition embedding $e(\cdot)$ and enforce cosine similarity:
$$ L_{\text{face}} = 1 - \cos\big(e(I_{\text{ref}}),\, e(I_{t})\big) $$
Encourages smooth transitions and discourages flicker or jumpiness.
Optical-Flow Warping Loss Compute forward flow $F_{t\to t+1}$ via a flow network, warp $I_{t}$ and compare to $I_{t+1}$:
$$ L_{\text{flow}} = \big\lVert \text{Warp}(I_{t},\, F_{t\to t+1}) - I_{t+1}\big\rVert_{1} $$
Temporal Discriminator Loss A 3D-CNN discriminator $D_{\text{temp}}$ distinguishes real vs. generated clips:
$$ L_{\text{temp}} = \mathbb{E}\big[\log D_{\text{temp}}(\{I_{t}\}_{\text{real}})\big] + \mathbb{E}\big[\log\big(1 - D_{\text{temp}}(\{I_{t}\}_{\text{gen}})\big)\big] $$
Sliding-Window Attention Transformer-style self-attention over neighboring frames:
$$ \hat h_{t} = \mathrm{Attention}\big(Q_{t},\, K_{t-k:t+k},\, V_{t-k:t+k}\big) $$
– implicitly enforces consistency over a window of size $2k+1$.
Additional optional losses depending on subject type or client needs.
Loss Name | Purpose | Endpoint Parameter |
---|---|---|
Perceptual Loss | Enforce color/style consistency via LPIPS | perceptual_weight |
GAN Adversarial | Sharpness/detail via frame & temporal GANs | adv_weight |
Pose Regularization | Penalize implausible keypoint jumps | pose_smoothness |
Clients can adjust these via optional fields in generation requests:
{
"reference_image_id": "...",
"model": "diffusion",
"control_type": "text",
"z_sequence": [...],
"identity_strength": 0.7,
"temporal_strength": 0.8,
"adv_weight": 0.5
}
Below are quickstart snippets in cURL, Python, and JavaScript. For full reference, see our GitHub repos and PyPI/npm packages.
curl -X POST "https://api.aicharacter.studio/v1/reference-image" \\
-H "Authorization: Bearer $API_KEY" \\
-F "image=@/path/to/subject.jpg"
curl -X POST "https://api.aicharacter.studio/v1/control/pose" \\
-H "Authorization: Bearer $API_KEY" \\
-H "Content-Type: application/json" \\
-d '{
"reference_image_id": "ref_12345abcd",
"keypoints": [
{ "x": 0.5, "y": 0.1 },
{ "x": 0.5, "y": 0.25 }
]
}'
curl -X POST "https://api.aicharacter.studio/v1/generate/frame" \\
-H "Authorization: Bearer $API_KEY" \\
-H "Content-Type: application/json" \\
-d '{
"reference_image_id": "ref_12345abcd",
"control_type": "pose",
"z_t": [0.12, -0.33, 0.77],
"resolution": "1280x720"
}'
curl -X POST "https://api.aicharacter.studio/v1/generate/video" \\
-H "Authorization: Bearer $API_KEY" \\
-H "Content-Type: application/json" \\
-d '{
"reference_image_id": "ref_12345abcd",
"control_type": "text",
"z_sequence": [[…], […]],
"async": true,
"callback_url": "https://your.app/hooks/complete"
}'
from aicharacter import Client
client = Client(api_key="sk_live_abc123")
# 1. Upload image
ref = client.upload_reference_image("/path/to/subject.jpg")
# 2. Generate text control
z_seq = client.control.text(
reference_image_id=ref.id,
prompt="A cat playing piano",
steps=8
)
# 3. Generate video synchronously
video = client.generate.video(
reference_image_id=ref.id,
control_type="text",
z_sequence=z_seq,
format="mp4",
resolution="720x720",
fps=24
)
print("Download at:", video.video_url)
import { AiCharacterClient } from "@aicharacter/studio";
const client = new AiCharacterClient({ apiKey: "sk_live_abc123" });
async function run() {
// 1. Upload
const { reference_image_id } = await client.referenceImage.upload(
"/path/to/subject.jpg"
);
// 2. Pose control
const { z_t } = await client.control.pose({
reference_image_id,
keypoints: [{ x: 0.5, y: 0.1 }, { x: 0.5, y: 0.25 }]
});
// 3. Render frame
const frame = await client.generate.frame({
reference_image_id,
control_type: "pose",
z_t,
resolution: "1280x720"
});
console.log("Frame URL:", frame.image_url);
}
run();
Common questions and tips:

* Reducing flicker between frames: Increase the temporal_strength parameter in your generation call (range 0.0–1.0, default 0.8), or use model=diffusion or model=transformer, which natively enforce stronger temporal consistency than gan.
* Handling rate limits: Honor the Retry-After header in the 429 response, and use async=true + webhooks to avoid holding client connections open.
* Reference image not ready: Check that status (from GET /v1/reference-image/{id}) is "processed". If it is "pending", wait a few seconds and retry.
* Getting the subject mask alongside generated output: Add ?include_mask=true to your frame/video URL; the response then includes a mask_url.
* Maximum resolution: Up to 1920x1080 on shared GPUs, and up to 4K (3840×2160) on dedicated A100 instances requested via X-GPU-Type: a100.
* Generation too slow or timing out: Reduce resolution or switch to async=true with a higher-tier GPU via X-GPU-Type.
* Can I reuse a z_sequence across multiple videos? Yes: reuse the stored reference_image_id and the exact z_sequence. Set seed when using latent control for deterministic output.
* Output delivery: Generated assets can be returned as CDN URLs or inline base64 (see output_format).