Fine-tuning DiffSplat for stylized anime chibi characters — multi-view dataset, Gaussian Splat outputs
This project fine-tunes DiffSplat — a diffusion-based text/image-to-3D generation framework — into a domain-specific generator for anime chibi characters represented as 3D Gaussian Splats.
Instead of treating 3D generation as a single opaque end-to-end model, DiffSplat reuses Stable Diffusion 1.5 as a 2D backbone and pairs it with three specialized components (GSRecon, GSVAE, and GSDiff) that project diffusion outputs into Gaussian Splat representations. My goal: keep that architecture intact and adapt it to a stylized character domain that the original checkpoints don't cover.
This was a solo project on a tight 2-month window, so the work split cleanly into two halves: a training-side half (adapting the SD1.5-based DiffSplat training config to use the pretrained GSRecon / GSVAE / GSDiff checkpoints for character-domain fine-tuning) and a data-side half (building an automated FBX/Blender → multi-view → Parquet pipeline that the framework could actually consume).
Treating the data pipeline as a first-class deliverable — with deterministic camera placement, repeatable rendering, and Parquet-packaged datasets — is what made the rest of the experiment reproducible across machines and across character batches.
2D → 3D via Gaussian Splats
DiffSplat reuses an SD 1.5 backbone — meaning the broad visual prior of 2D diffusion is already baked in. The trick is converting diffusion outputs into 3D Gaussian Splats, which is what GSRecon, GSVAE, and GSDiff are responsible for. Fine-tuning here is a domain shift, not an architecture change.
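For orientation, a Gaussian Splat scene is just a large array of per-splat parameters. The sketch below uses plain PyTorch and illustrative field names (not DiffSplat's actual tensors) to show what the GS-side components ultimately have to emit; it mirrors the standard 3DGS parameterization.

```python
import torch

def empty_splats(n: int) -> dict[str, torch.Tensor]:
    """Allocate the per-splat parameters a Gaussian Splat scene carries.

    Field names follow the common 3DGS parameterization and are
    illustrative only; DiffSplat's internal layout differs.
    """
    return {
        "xyz":      torch.zeros(n, 3),  # splat centers in world space
        "rotation": torch.zeros(n, 4),  # unit quaternions (orientation)
        "scale":    torch.zeros(n, 3),  # anisotropic scale per axis
        "opacity":  torch.zeros(n, 1),  # alpha used during compositing
        "color":    torch.zeros(n, 3),  # RGB / SH degree-0 coefficients
    }
```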
Anime Chibi Domain
Stylized chibi characters live far outside the original DiffSplat training distribution: exaggerated proportions, flat shading, hard outlines. Sampling the base checkpoints directly on these characters produces collapsed, incoherent geometry. That's the gap this project closes.
Reuse Pretrained Components
Rather than retraining from scratch, I leveraged the released GSRecon, GSVAE, and GSDiff checkpoints and adapted the SD 1.5-based DiffSplat training config, so the fine-tune surgically targets character-domain behavior while reusing the heavy lifting already done in the pretrained components.
Tight Solo Timeline
A 2-month research sprint forced ruthless prioritization: get a clean dataset and a runnable training config first, then iterate on quality. The data pipeline is what makes the iteration loop fast.
The fine-tuning side of the project is, in plain terms, "take SD 1.5-based DiffSplat and point it at a chibi-character dataset without breaking the GS-side components":
🎨 Backbone — Stable Diffusion 1.5
DiffSplat's 2D diffusion backbone. Adapted training config to load the SD 1.5-based DiffSplat checkpoint and prepared prompt embeddings consistent with the chibi-character captions.
🧱 GSRecon & GSVAE
Pretrained components that handle Gaussian Splat reconstruction and VAE encoding. Loaded from official checkpoints and kept as the "3D translation layer" on top of diffusion outputs.
🌀 GSDiff
The Gaussian-Splat diffusion module. The character-domain fine-tune targets the joint behavior of GSDiff with the SD 1.5 backbone, leaving GSRecon / GSVAE pretrained for stability.
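To make that scoping concrete, here is a minimal sketch of how the trainable surface can be restricted: load the SD 1.5 UNet with diffusers, freeze the GS-side modules, and hand only the backbone and GSDiff parameters to the optimizer. The hub id is a stand-in, and `gsrecon` / `gsvae` / `gsdiff` below are placeholder modules, not the actual DiffSplat classes or training code.

```python
import torch
from diffusers import UNet2DConditionModel

# SD 1.5 UNet used as the 2D diffusion backbone. The hub id is a stand-in;
# the project actually starts from the SD 1.5-based DiffSplat checkpoint.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Placeholders standing in for the pretrained GS-side modules loaded from
# the official checkpoints (not the real DiffSplat classes).
gsrecon = torch.nn.Identity()
gsvae = torch.nn.Identity()
gsdiff = torch.nn.Linear(8, 8)

def freeze(module: torch.nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad_(False)

freeze(gsrecon)  # Gaussian Splat reconstruction stays fixed
freeze(gsvae)    # GS latent encoding stays fixed

# Only the diffusion-side parameters see gradients during the
# character-domain fine-tune.
trainable = list(unet.parameters()) + list(gsdiff.parameters())
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```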
DiffSplat consumes multi-view image sets with calibrated camera metadata. To make character-domain fine-tuning practical, I built an automated pipeline that turns raw FBX/Blender character assets into training-ready datasets — 40 rendered views and per-view camera metadata per character — packaged as Parquet files alongside prompt embeddings.
① Asset Ingest
FBX / Blender character assets normalized into a common scene layout — uniform scale, centered pivot, neutral lighting rig.
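In practice this step is a short Blender Python (bpy) pass per asset. The sketch below shows one plausible version; the target bounding size is an assumption, armature parenting details are glossed over, and the neutral lighting rig would be appended after normalization.

```python
import bpy
from mathutils import Vector

def import_and_normalize(fbx_path: str, target_size: float = 1.0) -> None:
    """Import one FBX character and normalize it into the common scene layout:
    pivot centered at the world origin, longest bounding-box side scaled to
    target_size. A neutral lighting rig is added in a later step."""
    bpy.ops.import_scene.fbx(filepath=fbx_path)
    imported = bpy.context.selected_objects
    meshes = [o for o in imported if o.type == "MESH"]
    roots = [o for o in imported if o.parent is None]

    # World-space bounding box over every imported mesh.
    corners = [o.matrix_world @ Vector(c) for o in meshes for c in o.bound_box]
    lo = Vector((min(c.x for c in corners), min(c.y for c in corners), min(c.z for c in corners)))
    hi = Vector((max(c.x for c in corners), max(c.y for c in corners), max(c.z for c in corners)))
    center = (lo + hi) / 2.0
    scale = target_size / max(hi - lo)

    # Re-center and uniformly rescale the imported hierarchy via its roots.
    for o in roots:
        o.location = (o.location - center) * scale
        o.scale = o.scale * scale
```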
② Multi-view Render
Blender Python script places 40 cameras around each character and renders RGB views in a deterministic order — each view paired with a structured camera JSON (intrinsics + extrinsics).
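The rendering step looks roughly like the bpy sketch below. It sweeps a single camera through 40 fixed poses rather than instancing 40 camera objects, and the 5-elevation by 8-azimuth layout, radius, and output paths are assumptions about the layout, not the project's exact camera arrangement.

```python
import bpy, json, math, os

def render_views(out_dir: str, radius: float = 2.5, n_elev: int = 5, n_azim: int = 8) -> None:
    """Sweep one camera through a fixed 5 x 8 = 40 pose grid around the origin,
    render each view, and write per-view intrinsics + extrinsics to cameras.json."""
    scene = bpy.context.scene
    cam_data = bpy.data.cameras.new("orbit_cam")
    cam = bpy.data.objects.new("orbit_cam", cam_data)
    scene.collection.objects.link(cam)
    scene.camera = cam

    os.makedirs(out_dir, exist_ok=True)
    metadata, view = [], 0
    for ei in range(n_elev):
        elev = math.radians(-30 + 20 * ei)          # fixed elevation ladder
        for ai in range(n_azim):
            azim = 2.0 * math.pi * ai / n_azim      # evenly spaced azimuths
            cam.location = (radius * math.cos(elev) * math.cos(azim),
                            radius * math.cos(elev) * math.sin(azim),
                            radius * math.sin(elev))
            # Point the camera at the origin, where the character is centered.
            cam.rotation_euler = (-cam.location).to_track_quat("-Z", "Y").to_euler()
            bpy.context.view_layer.update()         # refresh matrix_world

            scene.render.filepath = os.path.join(out_dir, f"view_{view:02d}.png")
            bpy.ops.render.render(write_still=True)

            metadata.append({
                "view": view,
                # World-to-camera matrix as the extrinsics.
                "extrinsics": [list(row) for row in cam.matrix_world.inverted()],
                "intrinsics": {
                    "focal_mm": cam_data.lens,
                    "sensor_width_mm": cam_data.sensor_width,
                    "resolution": [scene.render.resolution_x, scene.render.resolution_y],
                },
            })
            view += 1

    with open(os.path.join(out_dir, "cameras.json"), "w") as f:
        json.dump(metadata, f, indent=2)
```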
③ Parquet Packaging
RGB views and camera JSON are packaged into Parquet datasets — compact, columnar, and directly loadable by Hugging Face Datasets in the training loop.
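The packaging step itself is small. A sketch with pandas (pyarrow under the hood) follows; the column names and one-file-per-character layout are illustrative rather than the exact schema the training loop expects.

```python
import json, os
import pandas as pd

def pack_character(render_dir: str, char_id: str, out_path: str) -> None:
    """Pack one character's rendered views + camera metadata into a Parquet file."""
    with open(os.path.join(render_dir, "cameras.json")) as f:
        cameras = json.load(f)

    rows = []
    for cam in cameras:
        img_path = os.path.join(render_dir, f"view_{cam['view']:02d}.png")
        with open(img_path, "rb") as img:
            rows.append({
                "character_id": char_id,
                "view": cam["view"],
                "image": img.read(),        # raw PNG bytes, decoded at load time
                "camera": json.dumps(cam),   # intrinsics + extrinsics as a JSON string
            })

    # One Parquet file per character keeps shards small and self-contained.
    pd.DataFrame(rows).to_parquet(out_path, index=False)
```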
④ Prompt Embeddings
Chibi-character captions are pre-encoded into prompt embeddings consistent with the SD 1.5 text encoder, so the training step skips redundant text encoding per epoch.
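Pre-encoding is a one-time pass over the captions with the SD 1.5 text encoder stack. A sketch using transformers is below; the hub id, example caption, and cache path are stand-ins for whatever the training config actually points at.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# SD 1.5's tokenizer and text encoder; the repo id is a stand-in.
repo = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder").eval()

@torch.no_grad()
def encode_caption(caption: str) -> torch.Tensor:
    """Return the (77, 768) prompt embedding SD 1.5 cross-attention expects."""
    tokens = tokenizer(
        caption, padding="max_length", max_length=tokenizer.model_max_length,
        truncation=True, return_tensors="pt",
    )
    return text_encoder(tokens.input_ids)[0].squeeze(0)

# Cache once per caption so training steps skip the text encoder entirely.
emb = encode_caption("a chibi anime character with exaggerated proportions, flat shading")
torch.save(emb, "example_caption_emb.pt")
```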
40 Views as a Training Contract
Locking the dataset to a fixed 40-view layout per character makes camera-conditioning consistent across the dataset, simplifies batching, and turns the rendering step into something fully deterministic — the same FBX always yields the same dataset row.
Parquet Over Loose Files
Choosing Parquet for the packaged dataset (instead of folder-of-PNGs) cuts I/O overhead during training, plays well with Hugging Face Datasets, and makes splits / shuffles / shards trivial.
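On the read side, the same files load straight into Hugging Face Datasets; a short sketch follows, with the glob pattern and split fraction as assumptions.

```python
from datasets import load_dataset

# Each Parquet file holds one character's 40 views plus camera metadata.
ds = load_dataset("parquet", data_files="data/chibi/*.parquet", split="train")

# Splits, shuffles, and shards are one-liners on top of the same files.
ds = ds.shuffle(seed=42)
splits = ds.train_test_split(test_size=0.05, seed=42)
train_ds, val_ds = splits["train"], splits["test"]
print(train_ds)  # features: character_id, view, image, camera, ...
```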
Reuse, Don't Rebuild
Rather than retraining GSRecon / GSVAE / GSDiff from scratch, the fine-tune targets only the character-domain shift on top of pretrained checkpoints — a deliberate scoping decision so a 2-month solo project actually converges.
Pre-computed Prompt Embeddings
Encoding chibi-character captions once and caching the embeddings keeps each training step focused on the diffusion + GS components, not on repeatedly running the text encoder.
🧪 Training Config
SD 1.5-based DiffSplat training config adapted for chibi-character domain fine-tuning.
🛠️ Blender Renderer
Python-driven renderer that emits 40 calibrated views + camera JSON per character.
📦 Parquet Dataset Builder
Preprocessing scripts that pack RGB views, camera JSON, and prompt embeddings into Parquet datasets.
🌀 Domain-Adapted GS Outputs
Fine-tuned DiffSplat checkpoints that produce 3D Gaussian Splats in the chibi-character style.