FastAPI + Transformers + 4-bit Mistral: .to() is not supported for bitsandbytes 4-bit models error

I’m deploying a FastAPI backend using Hugging Face Transformers with the mistralai/Mistral-7B-Instruct-v0.1 model, quantized to 4-bit using BitsAndBytesConfig. I’m running this inside an NVIDIA GPU container (CUDA 12.1, A10G GPU with 22GB VRAM), and I keep hitting this error during model loading:

```
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is...
```

What I've Done So Far:

- I'm not calling `.to(...)` anywhere; I explicitly removed all such lines.
- I'm using `quantization_config=BitsAndBytesConfig(...)` with `load_in_4bit=True`.
- I removed `device_map="auto"`, as suggested in the transformers GitHub issue.
- I'm calling `.cuda()` only once on the model, right after `.from_pretrained(...)`, as suggested.
- The model and tokenizer are loaded from the Hugging Face Hub with `HF_TOKEN` properly set.
- The system detects CUDA correctly: `torch.cuda.is_available()` returns `True` (see the quick check below).
- Finally, I cleared the Hugging Face cache (`~/.cache/huggingface`) and re-ran everything.
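The CUDA checks above were roughly along these lines (a quick sanity-check snippet run inside the container, not my exact code):

```
import torch

# Quick environment sanity checks inside the GPU container
print(torch.cuda.is_available())         # True
print(torch.version.cuda)                # 12.1
print(torch.cuda.get_device_name(0))     # NVIDIA A10G
print(torch.cuda.get_device_properties(0).total_memory / 1024**3)  # ~22 GB
```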

Here’s the relevant part of the code that triggers the error:

    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quant_config,
    device_map=None,  # I explicitly removed this
    token=hf_token
).cuda()  # This is the only use of `.cuda()`

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)```
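For completeness, the serving side looks roughly like this (route and field names are simplified placeholders, not my exact code); the error happens before any request is served, during the loading step above:

```
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    # `model` and `tokenizer` come from the loading code above;
    # only the input tensors are moved to the model's device here
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```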

Yet I still get the same ValueError during model loading. What am I missing? Thank you in advance.