How to Fine-Tune an LLM Using NEO
Fine-tune a Hugging Face model on a custom dataset with full SFT—no code, just a single prompt. NEO plans and runs the entire pipeline on your GPU node.
Problem Statement
We told NEO to fine-tune Qwen 3.5 4B on the Qwen3-Coder-Next-1800x dataset using full supervised fine-tuning (SFT), not LoRA, and to run the whole pipeline autonomously from one natural-language prompt in the VS Code extension, with no hand-written training code. We wanted NEO to handle everything on a connected GPU node: environment setup, data loading and ChatML formatting, model loading with the right precision and device mapping, training configuration, the training run itself, and checkpoint saving.
Overview
Open NEO in your VS Code extension, connect to a GPU node, and paste a single prompt. NEO reads the prompt, plans the full training pipeline, and executes every step autonomously on your GPU node—no code required.
Example prompt:
“Finetune https://huggingface.co/Qwen/Qwen3.5-4B using https://huggingface.co/datasets/Crownelius/Qwen3-Coder-Next-1800x . I want full fine-tuning SFT, not LoRA.”
Step 1: Give NEO a Single Prompt
This is the entire user-facing workflow. Open NEO in your VS Code extension, connect to a GPU node, and paste the prompt above. NEO plans and runs the full training pipeline autonomously. You don’t need to write a single line of code. You can also watch the walkthrough on Google Drive if you prefer.
Step 2: NEO Sets Up the Environment
NEO automatically provisions the training environment. It detects your available hardware, installs the correct package versions, and verifies that CUDA is accessible before doing anything else.
What gets installed and why:
| Package | Purpose |
|---|---|
| transformers | Model loading and tokenization |
| trl | SFTTrainer for supervised fine-tuning |
| accelerate | Multi-GPU support and mixed precision training |
| datasets | Pulling and handling Hugging Face datasets |
| torch | PyTorch with CUDA support |
| sentencepiece | Tokenizer dependency for Qwen models |
No manual pip installs, no CUDA debugging, no version conflicts to sort out.
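The pre-flight check this step implies can be sketched in a few lines: confirm each required package is importable and report its installed version before touching the GPU. This is an illustrative sketch, not NEO's actual internal check.

```python
# Illustrative pre-flight check: report which required packages are
# installed and at what version. Missing packages map to None.
from importlib.metadata import version, PackageNotFoundError

REQUIRED = ["transformers", "trl", "accelerate", "datasets", "torch", "sentencepiece"]

def report(packages):
    status = {}
    for name in packages:
        try:
            status[name] = version(name)
        except PackageNotFoundError:
            status[name] = None  # needs installing
    return status

print(report(REQUIRED))
```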
Step 3: NEO Loads and Formats the Dataset
NEO pulls the dataset directly from Hugging Face. The dataset used here is Crownelius/Qwen3-Coder-Next-1800x, which contains around 1,800 high-quality coding instruction–response pairs curated for Qwen-family models.
NEO converts every sample into ChatML format automatically, so the model sees inputs exactly the way it was trained to expect. No preprocessing, no schema mapping, no format errors on your end.
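The ChatML conversion looks roughly like the sketch below. The dataset's column names ("instruction" and "output") are assumptions here; check the actual schema on the dataset's Hugging Face page.

```python
# Minimal sketch of ChatML conversion. Column names are assumed, not
# taken from the actual Crownelius/Qwen3-Coder-Next-1800x schema.

def to_chatml(sample: dict) -> str:
    """Wrap one instruction-response pair in ChatML turn markers."""
    return (
        "<|im_start|>user\n"
        f"{sample['instruction']}<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"{sample['output']}<|im_end|>\n"
    )

example = {
    "instruction": "Write a function that adds two numbers.",
    "output": "def add(a, b):\n    return a + b",
}
print(to_chatml(example))
```

Because the model was pretrained on this exact turn structure, getting the `<|im_start|>`/`<|im_end|>` markers right matters more than any other preprocessing step.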
Step 4: NEO Loads the Model
NEO downloads Qwen 3.5 4B and loads it in bfloat16 precision, which halves memory usage without hurting numerical stability.
- device_map="auto" — Model layers are distributed across your available GPUs automatically.
- Gradient checkpointing — Enabled to recompute activations during the backward pass instead of storing them, cutting VRAM enough to fit the full model on a single A100 40GB.
- Padding token — Set correctly so batched training works without errors.
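In Transformers terms, the three settings above correspond to roughly the following sketch. The exact kwargs NEO uses are not published; this simply mirrors the behavior described. Imports are kept inside the function so the sketch reads standalone.

```python
# Hedged sketch of the model-loading step described above. Not NEO's
# actual code; kwargs mirror the settings listed in this section.

def load_model_bf16(model_name="Qwen/Qwen3.5-4B"):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # roughly halves memory vs float32
        device_map="auto",           # spread layers across available GPUs
    )
    model.gradient_checkpointing_enable()  # recompute activations, save VRAM

    # Some tokenizers ship without a pad token; reuse EOS so batching works.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer
```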
Step 5: NEO Configures the Training Run
NEO sets every training parameter without any input from you.
| Parameter | Value | Why |
|---|---|---|
| Epochs | 3 | Enough passes over 1,800 samples to learn the dataset without overfitting |
| Batch size per GPU | 2 | Safe for A100 40GB VRAM |
| Gradient accumulation | 8 steps | Effective batch of 16 without extra VRAM cost |
| Learning rate | 2e-5 | Standard for full SFT on instruction datasets |
| LR scheduler | Cosine | Smooth decay that avoids late-stage overfitting |
| Warmup | 5% of steps | Stabilizes early training before full LR kicks in |
| Precision | bfloat16 + tf32 | Speed and stability on Ampere and Hopper GPUs |
| Max sequence length | 2,048 tokens | Covers nearly all samples in the dataset |
This is full SFT: every weight in the 4B model is updated on every step. The result is one standalone checkpoint with no adapter files and no merging required.
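The arithmetic behind the table is worth making explicit. Assuming roughly 1,800 samples (the dataset size is approximate), the config works out as follows:

```python
# Back-of-the-envelope numbers behind the training config above,
# assuming ~1,800 samples.
samples = 1800
per_gpu_batch = 2
grad_accum = 8
epochs = 3
warmup_frac = 0.05

effective_batch = per_gpu_batch * grad_accum  # 16
steps_per_epoch = samples // effective_batch  # 112
total_steps = steps_per_epoch * epochs        # 336
warmup_steps = int(total_steps * warmup_frac) # 16

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

So the optimizer takes only a few hundred steps in total, with the first ~16 ramping the learning rate up to its 2e-5 peak.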
Step 6: NEO Runs Training and Streams Logs
NEO launches the training job on the GPU node and streams live logs back to your VS Code terminal through the extension. You can watch everything in real time.
What to watch in the logs:
- Loss — Should decrease steadily across the first epoch. A healthy run typically goes from around 2.0 down to between 0.5 and 0.8 by epoch 3.
- Gradient norm — Should stay below 5.0. Consistent spikes above that suggest the learning rate is too high.
- Learning rate — Follows the cosine schedule: peaks after warmup, then decays smoothly toward zero.
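The checks above can be automated with a small log filter. The log-line format here is illustrative; NEO's actual log output may use different field names.

```python
# Toy health check for the metrics above. The "key=value" log format is
# an assumption, not NEO's documented output.
import re

MAX_GRAD_NORM = 5.0

def check_log_line(line: str) -> list:
    """Return warnings for unhealthy values in one training-log line."""
    warnings = []
    m = re.search(r"grad_norm[=:]\s*([\d.]+)", line)
    if m and float(m.group(1)) > MAX_GRAD_NORM:
        warnings.append("gradient norm spike: consider lowering the LR")
    m = re.search(r"loss[=:]\s*([\d.]+)", line)
    if m and float(m.group(1)) > 2.5:
        warnings.append("loss unusually high for this dataset")
    return warnings

print(check_log_line("step=120 loss=0.74 grad_norm=1.9 lr=1.8e-05"))  # []
print(check_log_line("step=121 loss=0.71 grad_norm=7.2 lr=1.8e-05"))
```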
Rough training time (3 epochs, 1,800 samples):
| GPU | Approx. time |
|---|---|
| A100 40GB | 25–40 minutes |
| H100 80GB | 12–20 minutes |
| 2× A100 40GB | 12–20 minutes |
If your VS Code window disconnects mid-run, training continues on the node and NEO re-attaches automatically when you reconnect.
Pipeline Architecture Overview
| Stage | What NEO does |
|---|---|
| 1. Prompt | User pastes single prompt in VS Code; NEO parses model, dataset, and training type (full SFT) |
| 2. Environment | Provisions GPU node, installs transformers, trl, accelerate, datasets, torch, sentencepiece; checks CUDA |
| 3. Data | Pulls Hugging Face dataset; converts samples to ChatML format |
| 4. Model | Loads Qwen 3.5 4B in bfloat16; device_map=auto, gradient checkpointing, padding token set |
| 5. Training | Runs SFTTrainer with configured epochs, batch size, LR, scheduler; streams logs to VS Code |
| 6. Checkpoint | Saves full model directory (config, tokenizer, model.safetensors) on node |
Step 7: NEO Saves the Final Checkpoint
When training finishes, NEO saves a complete model directory on the GPU node. Everything needed for inference is in one place.
qwen-coder-sft/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── special_tokens_map.json
├── model.safetensors
└── training_args.bin

From there you can download the checkpoint to your machine, push it to Hugging Face, or run inference on the node with vLLM, Ollama, llama.cpp, or the standard Transformers pipeline.
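Before downloading or serving the checkpoint, it is worth verifying the directory is complete. A minimal sanity check, using the file list shown above:

```python
# Sanity-check that a checkpoint directory contains every file from the
# tree above before downloading or serving it.
from pathlib import Path

REQUIRED = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "model.safetensors",
    "training_args.bin",
]

def missing_files(checkpoint_dir: str) -> list:
    """Return the required files not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]

# Demo with a throwaway directory holding only two of the six files:
import tempfile
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "model.safetensors"):
        (Path(tmp) / name).touch()
    print(missing_files(tmp))
```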
Why Full SFT Instead of LoRA?
LoRA trains a small set of adapter matrices while keeping most of the model frozen. It’s faster and cheaper, but you need a merge step before deployment, it can underfit on complex or diverse coding tasks, and merged models sometimes show quality degradation compared to a well-trained full SFT run.
Full SFT updates every weight in the model. The output is a single .safetensors file that is the model—nothing else attached. For a coding model you plan to use daily, the quality difference is noticeable on the tasks it was trained for.
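The scale difference is easy to quantify for a single weight matrix. The hidden size and LoRA rank below are illustrative assumptions, not Qwen 3.5 specifics:

```python
# Illustrative parameter-count comparison for one square projection
# matrix. Hidden size and LoRA rank are assumed values.
hidden = 4096                            # assumed hidden dimension
rank = 16                                # typical LoRA rank

full_params = hidden * hidden            # weights updated by full SFT
lora_params = rank * (hidden + hidden)   # A (d x r) + B (r x d) adapters

print(full_params)   # 16777216
print(lora_params)   # 131072
print(f"LoRA trains ~{100 * lora_params / full_params:.1f}% of this matrix")
```

LoRA touching well under 1% of the weights per matrix is exactly why it is cheaper, and also why it can underfit diverse coding data relative to full SFT.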
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | A100 40GB | H100 80GB or 2× A100 |
| System RAM | 64 GB | 128 GB |
| Storage | 100 GB free | 200 GB free |
Repository & Artifacts
This page describes a VS Code + GPU node workflow run with NEO. There is no standalone showcase GitHub repository—the primary artifact is the fine-tuned checkpoint NEO writes on your node (see Step 7).
Generated Artifacts:
- Fine-tuned model directory (qwen-coder-sft/): config.json, tokenizer files, model.safetensors, training_args.bin
- Training environment (packages and CUDA check) provisioned on the GPU node
- Dataset in ChatML form, loaded from Hugging Face inside the run
Upstream sources (public):