
TensorRT

🧠 When to Use TensorRT

Use TensorRT only for inference, not for training or fine-tuning. It provides:

- Lower latency
- Higher throughput
- Reduced memory footprint

⚙️ Requirements

To use TensorRT you need:

- A GPU with Tensor Cores (Volta, Turing, Ampere, etc.)
- Your model in ONNX format (export it to ONNX first)

Install:

    pip install nvidia-pyindex
    pip install tensorrt

🔥 Hugging Face + TensorRT

You can export Hugging Face models using the transformers.onnx module:

    transformers-cli env   # check the installation
    python -m transformers.onnx --model=codellama/CodeLlama-7B-Instruct-hf --feature=causal-lm ./onnx/

Then optimize the exported graph with TensorRT via onnxruntime or trtexec.

⚠️ Kaggle Note

Kaggle does not support TensorRT, as it lacks:

- Root access for driver-level TensorRT installations
- The required NVIDIA runtime permissions

✅ Use it locally or in the cloud (AWS/GCP/Colab Pro+ with CUDA support). To run inference with...
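A minimal sketch of the ONNX → TensorRT pipeline described above, assuming TensorRT is installed locally and its bundled `trtexec` tool is on the PATH (the output paths and engine filename here are illustrative):

```shell
# 1. Export the Hugging Face model to ONNX (downloads weights on first run)
python -m transformers.onnx \
  --model=codellama/CodeLlama-7B-Instruct-hf \
  --feature=causal-lm \
  ./onnx/

# 2. Build an optimized TensorRT engine from the ONNX graph.
#    --fp16 enables half-precision Tensor Core kernels where supported.
trtexec --onnx=./onnx/model.onnx --saveEngine=./model.plan --fp16
```

The resulting `.plan` engine is specific to the GPU architecture it was built on, so build it on (or for) the same GPU you will serve from.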

Google Cloud Run GPU Constraints & General Recommendations

Here's a breakdown of which new AI models fit within Cloud Run's resource constraints and how reasoning models can work, along with key considerations.

Cloud Run GPU constraints and general recommendations:

- GPU type: Cloud Run currently supports NVIDIA L4 GPUs, which provide 24 GB of vRAM per instance.
- Minimum resources: When using GPUs, Cloud Run instances require a minimum of 4 vCPUs and 16 GiB of memory.
- Scalability: Cloud Run automatically scales GPU instances, including scaling down to zero when not in use. You can typically scale out to 5 instances, with quota increases available for more.
- Cost: You're billed for the entire duration of the instance lifecycle while a GPU is attached, even when idle (for minimum instances).

Optimization:

- Quantization: Use 4-bit quantized models whenever possible. This significantly reduces the memory footprint and can increase parallelism, allowing you to run larger models or more concurrent requests.
- Base images: Sta...
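The quantization point above can be made concrete with some back-of-the-envelope arithmetic. This sketch estimates whether a model's weights fit in the L4's 24 GB of vRAM at different precisions; the 20% overhead factor for activations and KV cache is a rough assumption, not a Cloud Run figure:

```python
def model_vram_gib(params_billions: float, bits_per_param: float,
                   overhead: float = 1.2) -> float:
    """Rough vRAM estimate: weight bytes plus ~20% for activations/KV cache."""
    weight_bytes = params_billions * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 2**30

L4_VRAM_GIB = 24  # NVIDIA L4, as attached to a Cloud Run instance

for bits in (16, 8, 4):
    need = model_vram_gib(13, bits)
    verdict = "fits" if need <= L4_VRAM_GIB else "does NOT fit"
    print(f"13B model @ {bits}-bit: ~{need:.1f} GiB -> {verdict} on L4")
```

Under these assumptions a 13B model does not fit at fp16 but fits comfortably at 8-bit or 4-bit, which is exactly why 4-bit quantization opens up larger models and more concurrency on a single L4 instance.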