How to Fine-Tune an LLM Using NEO
Fine-tune a Hugging Face model on a custom dataset with full SFT—no code, just a single prompt. NEO plans and runs the entire pipeline on your GPU node.
Problem Statement
We told NEO to fine-tune Qwen 3.5 4B on the Qwen3-Coder-Next-1800x dataset using full supervised fine-tuning (SFT), not LoRA, and to run the whole pipeline autonomously from one natural-language prompt in the VS Code extension, with no hand-written training code. We wanted NEO to handle everything on a connected GPU node: environment setup, data loading and ChatML formatting, model loading with the right precision and device mapping, training configuration, the training run itself, and checkpoint saving.
Overview
Open NEO in your VS Code extension, connect to a GPU node, and paste a single prompt. NEO reads the prompt, plans the full training pipeline, and executes every step autonomously on your GPU node—no code required.
Example prompt:
“Finetune https://huggingface.co/Qwen/Qwen3.5-4B using https://huggingface.co/datasets/Crownelius/Qwen3-Coder-Next-1800x . I want full fine-tuning SFT, not LoRA.”
Step 1: Give NEO a Single Prompt
This is the entire user-facing workflow. Open NEO in your VS Code extension, connect to a GPU node, and paste the prompt above. NEO plans and runs the full training pipeline autonomously. You don’t need to write a single line of code. You can also watch the walkthrough on Google Drive if you prefer.
Step 2: NEO Sets Up the Environment
NEO automatically provisions the training environment. It detects your available hardware, installs the correct package versions, and verifies that CUDA is accessible before doing anything else.
What gets installed and why:
| Package | Purpose |
|---|---|
| transformers | Model loading and tokenization |
| trl | SFTTrainer for supervised fine-tuning |
| accelerate | Multi-GPU support and mixed precision training |
| datasets | Pulling and handling Hugging Face datasets |
| torch | PyTorch with CUDA support |
| sentencepiece | Tokenizer dependency for Qwen models |
No manual pip installs, no CUDA debugging, no version conflicts to sort out.
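The pre-flight check this step implies can be sketched in a few lines: confirm each required package is importable and report its installed version before touching the GPU. This is an illustrative sketch, not NEO's actual internal check.

```python
# Illustrative pre-flight check: report which required packages are
# installed and at what version. Missing packages map to None.
from importlib.metadata import version, PackageNotFoundError

REQUIRED = ["transformers", "trl", "accelerate", "datasets", "torch", "sentencepiece"]

def report(packages):
    status = {}
    for name in packages:
        try:
            status[name] = version(name)
        except PackageNotFoundError:
            status[name] = None  # needs installing
    return status

print(report(REQUIRED))
```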
Step 3: NEO Loads and Formats the Dataset
NEO pulls the dataset directly from Hugging Face. The dataset used here is Crownelius/Qwen3-Coder-Next-1800x, which contains around 1,800 high-quality coding instruction–response pairs curated for Qwen-family models.
NEO converts every sample into ChatML format automatically, so the model sees inputs exactly the way it was trained to expect. No preprocessing, no schema mapping, no format errors on your end.
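The ChatML conversion looks roughly like the sketch below. The dataset's column names ("instruction" and "output") are assumptions here; check the actual schema on the dataset's Hugging Face page.

```python
# Minimal sketch of ChatML conversion. Column names are assumed, not
# taken from the actual Crownelius/Qwen3-Coder-Next-1800x schema.

def to_chatml(sample: dict) -> str:
    """Wrap one instruction-response pair in ChatML turn markers."""
    return (
        "<|im_start|>user\n"
        f"{sample['instruction']}<|im_end|>\n"
        "<|im_start|>assistant\n"
        f"{sample['output']}<|im_end|>\n"
    )

example = {
    "instruction": "Write a function that adds two numbers.",
    "output": "def add(a, b):\n    return a + b",
}
print(to_chatml(example))
```

Because the model was pretrained on this exact turn structure, getting the `<|im_start|>`/`<|im_end|>` markers right matters more than any other preprocessing step.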
Step 4: NEO Loads the Model
NEO downloads Qwen 3.5 4B and loads it in bfloat16 precision, which halves memory usage without hurting numerical stability.
- device_map="auto" — Model layers are distributed across your available GPUs automatically.
- Gradient checkpointing — Enabled to recompute activations during the backward pass instead of storing them, cutting VRAM enough to fit the full model on a single A100 40GB.
- Padding token — Set correctly so batched training works without errors.
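In Transformers terms, the three settings above correspond to roughly the following sketch. The exact kwargs NEO uses are not published; this simply mirrors the behavior described. Imports are kept inside the function so the sketch reads standalone.

```python
# Hedged sketch of the model-loading step described above. Not NEO's
# actual code; kwargs mirror the settings listed in this section.

def load_model_bf16(model_name="Qwen/Qwen3.5-4B"):
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,  # roughly halves memory vs float32
        device_map="auto",           # spread layers across available GPUs
    )
    model.gradient_checkpointing_enable()  # recompute activations, save VRAM

    # Some tokenizers ship without a pad token; reuse EOS so batching works.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer
```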
Step 5: NEO Configures the Training Run
NEO sets every training parameter without any input from you.
| Parameter | Value | Why |
|---|---|---|
| Epochs | 3 | Enough passes over 1,800 samples to learn the dataset without overfitting |
| Batch size per GPU | 2 | Safe for A100 40GB VRAM |
| Gradient accumulation | 8 steps | Effective batch of 16 without extra VRAM cost |
| Learning rate | 2e-5 | Standard for full SFT on instruction datasets |
| LR scheduler | Cosine | Smooth decay that avoids late-stage overfitting |
| Warmup | 5% of steps | Stabilizes early training before full LR kicks in |
| Precision | bfloat16 + tf32 | Speed and stability on Ampere and Hopper GPUs |
| Max sequence length | 2,048 tokens | Covers nearly all samples in the dataset |
This is full SFT: every weight in the 4B model is updated on every step. The result is one standalone checkpoint with no adapter files and no merging required.
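The arithmetic behind the table is worth making explicit. Assuming roughly 1,800 samples (the dataset size is approximate), the config works out as follows:

```python
# Back-of-the-envelope numbers behind the training config above,
# assuming ~1,800 samples.
samples = 1800
per_gpu_batch = 2
grad_accum = 8
epochs = 3
warmup_frac = 0.05

effective_batch = per_gpu_batch * grad_accum  # 16
steps_per_epoch = samples // effective_batch  # 112
total_steps = steps_per_epoch * epochs        # 336
warmup_steps = int(total_steps * warmup_frac) # 16

print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
```

So the optimizer takes only a few hundred steps in total, with the first ~16 ramping the learning rate up to its 2e-5 peak.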
Step 6: NEO Runs Training and Streams Logs
NEO launches the training job on the GPU node and streams live logs back to your VS Code terminal through the extension. You can watch everything in real time.
What to watch in the logs:
- Loss — Should decrease steadily across the first epoch. A healthy run typically goes from around 2.0 down to between 0.5 and 0.8 by epoch 3.
- Gradient norm — Should stay below 5.0. Consistent spikes above that suggest the learning rate is too high.
- Learning rate — Follows the cosine schedule: peaks after warmup, then decays smoothly toward zero.
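The checks above can be automated with a small log filter. The log-line format here is illustrative; NEO's actual log output may use different field names.

```python
# Toy health check for the metrics above. The "key=value" log format is
# an assumption, not NEO's documented output.
import re

MAX_GRAD_NORM = 5.0

def check_log_line(line: str) -> list:
    """Return warnings for unhealthy values in one training-log line."""
    warnings = []
    m = re.search(r"grad_norm[=:]\s*([\d.]+)", line)
    if m and float(m.group(1)) > MAX_GRAD_NORM:
        warnings.append("gradient norm spike: consider lowering the LR")
    m = re.search(r"loss[=:]\s*([\d.]+)", line)
    if m and float(m.group(1)) > 2.5:
        warnings.append("loss unusually high for this dataset")
    return warnings

print(check_log_line("step=120 loss=0.74 grad_norm=1.9 lr=1.8e-05"))  # []
print(check_log_line("step=121 loss=0.71 grad_norm=7.2 lr=1.8e-05"))
```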
Rough training time (3 epochs, 1,800 samples):
| GPU | Approx. time |
|---|---|
| A100 40GB | 25–40 minutes |
| H100 80GB | 12–20 minutes |
| 2× A100 40GB | 12–20 minutes |
If your VS Code window disconnects mid-run, training continues on the node and NEO re-attaches automatically when you reconnect.
Pipeline Architecture Overview
| Stage | What NEO does |
|---|---|
| 1. Prompt | User pastes single prompt in VS Code; NEO parses model, dataset, and training type (full SFT) |
| 2. Environment | Provisions GPU node, installs transformers, trl, accelerate, datasets, torch, sentencepiece; checks CUDA |
| 3. Data | Pulls Hugging Face dataset; converts samples to ChatML format |
| 4. Model | Loads Qwen 3.5 4B in bfloat16; device_map=auto, gradient checkpointing, padding token set |
| 5. Training | Runs SFTTrainer with configured epochs, batch size, LR, scheduler; streams logs to VS Code |
| 6. Checkpoint | Saves full model directory (config, tokenizer, model.safetensors) on node |
Step 7: NEO Saves the Final Checkpoint
When training finishes, NEO saves a complete model directory on the GPU node. Everything needed for inference is in one place.
qwen-coder-sft/
├── config.json
├── tokenizer.json
├── tokenizer_config.json
├── special_tokens_map.json
├── model.safetensors
└── training_args.bin

From there you can download the checkpoint to your machine, push it to Hugging Face, or run inference on the node with vLLM, Ollama, llama.cpp, or the standard Transformers pipeline.
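Before downloading or serving the checkpoint, it is worth verifying the directory is complete. A minimal sanity check, using the file list shown above:

```python
# Sanity-check that a checkpoint directory contains every file from the
# tree above before downloading or serving it.
from pathlib import Path

REQUIRED = [
    "config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "special_tokens_map.json",
    "model.safetensors",
    "training_args.bin",
]

def missing_files(checkpoint_dir: str) -> list:
    """Return the required files not present in checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]

# Demo with a throwaway directory holding only two of the six files:
import tempfile
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "model.safetensors"):
        (Path(tmp) / name).touch()
    print(missing_files(tmp))
```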
Why Full SFT Instead of LoRA?
LoRA trains a small set of adapter matrices while keeping most of the model frozen. It’s faster and cheaper, but you need a merge step before deployment, it can underfit on complex or diverse coding tasks, and merged models sometimes show quality degradation compared to a well-trained full SFT run.
Full SFT updates every weight in the model. The output is a single .safetensors file that is the model—nothing else attached. For a coding model you plan to use daily, the quality difference is noticeable on the tasks it was trained for.
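The scale difference is easy to quantify for a single weight matrix. The hidden size and LoRA rank below are illustrative assumptions, not Qwen 3.5 specifics:

```python
# Illustrative parameter-count comparison for one square projection
# matrix. Hidden size and LoRA rank are assumed values.
hidden = 4096                            # assumed hidden dimension
rank = 16                                # typical LoRA rank

full_params = hidden * hidden            # weights updated by full SFT
lora_params = rank * (hidden + hidden)   # A (d x r) + B (r x d) adapters

print(full_params)   # 16777216
print(lora_params)   # 131072
print(f"LoRA trains ~{100 * lora_params / full_params:.1f}% of this matrix")
```

LoRA touching well under 1% of the weights per matrix is exactly why it is cheaper, and also why it can underfit diverse coding data relative to full SFT.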
Hardware Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | A100 40GB | H100 80GB or 2× A100 |
| System RAM | 64 GB | 128 GB |
| Storage | 100 GB free | 200 GB free |
Repository & Artifacts
This page describes a VS Code + GPU node workflow run with NEO. There is no standalone showcase GitHub repository—the primary artifact is the fine-tuned checkpoint NEO writes on your node (see Step 7).
Generated Artifacts:
- Fine-tuned model directory (qwen-coder-sft/): config.json, tokenizer files, model.safetensors, training_args.bin
- Training environment (packages and CUDA check) provisioned on the GPU node
- Dataset in ChatML form, loaded from Hugging Face inside the run
Upstream sources (public):