FastAPI + Transformers + 4-bit Mistral: .to() is not supported for bitsandbytes 4-bit models error

I’m deploying a FastAPI backend using Hugging Face Transformers with the mistralai/Mistral-7B-Instruct-v0.1 model, quantized to 4-bit using BitsAndBytesConfig. I’m running this inside an NVIDIA GPU container (CUDA 12.1, A10G GPU with 22GB VRAM), and I keep hitting this error during model loading:

```
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is...
```

What I've Done So Far:

- I'm not calling `.to(...)` anywhere; I explicitly removed all such lines.
- I'm using `quantization_config=BitsAndBytesConfig(...)` with `load_in_4bit=True`.
- I removed `device_map="auto"`, as suggested in the transformers GitHub issue.
- I'm calling `.cuda()` only once on the model, right after `.from_pretrained(...)`, as suggested.
- The model and tokenizer are loaded from the Hugging Face Hub with `HF_TOKEN` properly set.
- The system detects CUDA correctly: `torch.cuda.is_available()` returns `True` (see the quick check below).
- Finally, I cleared the Hugging Face cache (`~/.cache/huggingface`) and re-ran everything.
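The CUDA checks above were roughly along these lines (a quick sanity-check snippet run inside the container, not my exact code):

```
import torch

# Quick environment sanity checks inside the GPU container
print(torch.cuda.is_available())         # True
print(torch.version.cuda)                # 12.1
print(torch.cuda.get_device_name(0))     # NVIDIA A10G
print(torch.cuda.get_device_properties(0).total_memory / 1024**3)  # ~22 GB
```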

Here’s the relevant part of the code that triggers the error:

    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=quant_config,
    device_map=None,  # I explicitly removed this
    token=hf_token
).cuda()  # This is the only use of `.cuda()`

tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)```
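For completeness, the serving side looks roughly like this (route and field names are simplified placeholders, not my exact code); the error happens before any request is served, during the loading step above:

```
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/generate")
def generate(req: GenerateRequest):
    # `model` and `tokenizer` come from the loading code above;
    # only the input tensors are moved to the model's device here
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"text": tokenizer.decode(output_ids[0], skip_special_tokens=True)}
```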

Yet I still get the same ValueError during model loading. What am I missing? Thank you in advance.