I'm deploying a FastAPI backend that serves the `mistralai/Mistral-7B-Instruct-v0.1` model through Hugging Face Transformers, quantized to 4-bit with `BitsAndBytesConfig`. It runs inside an NVIDIA GPU container (CUDA 12.1, A10G with 22 GB VRAM), and I keep hitting this error during model loading:
```
ValueError: `.to` is not supported for `4-bit` or `8-bit` bitsandbytes models. Please use the model as it is...
```
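For context, the model is loaded once at application startup via FastAPI's lifespan hook, roughly like the sketch below. `load_model()` is just a stand-in for the loading code shown further down; the real app is more involved.

```python
from contextlib import asynccontextmanager
from fastapi import FastAPI

@asynccontextmanager
async def lifespan(app: FastAPI):
    # load_model() is a placeholder for the from_pretrained(...) code further below;
    # the ValueError is raised here, during startup
    app.state.model, app.state.tokenizer = load_model()
    yield

app = FastAPI(lifespan=lifespan)
```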
What I've Done So Far:
- I'm not calling `.to(...)` anywhere; I explicitly removed all such lines.
- I'm using `quantization_config=BitsAndBytesConfig(...)` with `load_in_4bit=True` (a sketch of this config follows the list).
- I removed `device_map="auto"`, as suggested in the transformers GitHub issue.
- I'm calling `.cuda()` only once on the model, after `.from_pretrained(...)`, as suggested.
- The model and tokenizer are loaded from the Hugging Face Hub with `HF_TOKEN` properly set.
- The system detects CUDA correctly: `torch.cuda.is_available()` returns `True`.
- Finally, I cleared the Hugging Face cache (`~/.cache/huggingface`) and re-ran everything.
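For completeness, the quantization config is essentially the following; the compute-dtype and quant-type values here are illustrative and may differ slightly from what I actually use:

```python
import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # as stated above
    bnb_4bit_compute_dtype=torch.float16,  # illustrative; exact dtype may differ
    bnb_4bit_quant_type="nf4",             # illustrative
)

# quick sanity check inside the container
print(torch.cuda.is_available())       # prints True
print(torch.cuda.get_device_name(0))   # reports the A10G
```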
Here’s the relevant part of the code that triggers the error:
"mistralai/Mistral-7B-Instruct-v0.1",
quantization_config=quant_config,
device_map=None, # I explicitly removed this
token=hf_token
).cuda() # This is the only use of `.cuda()`
tokenizer = AutoTokenizer.from_pretrained(model_name, token=hf_token)```
Yet I still get the same ValueError. Thank you in advance.