Model Details
Today (September 17th, 2024), we introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.
In this repo, we are open-sourcing NVLM-1.0-D-72B (decoder-only architecture): the model weights and code for the community.
Other Resources
Inference Code (HF) Training Code (Coming soon) Website Paper
Benchmark Results
We train our model with the legacy Megatron-LM and adapt the codebase to Huggingface for model hosting, reproducibility, and inference.
We observe numerical differences between the Megatron and Huggingface codebases, which are within the expected range of variation.
We provide the results from both the Huggingface codebase and the Megatron codebase for reproducibility and comparison with other models.
Results (as of September 17th, 2024) on the multimodal benchmarks are as follows:
Benchmark | MMMU (val / test) | MathVista | OCRBench | AI2D | ChartQA | DocVQA | TextVQA | RealWorldQA | VQAv2 |
---|---|---|---|---|---|---|---|---|---|
NVLM-D 1.0 72B (Huggingface) | 58.7 / 54.9 | 65.2 | 852 | 94.2 | 86.0 | 92.6 | 82.6 | 69.5 | 85.4 |
NVLM-D 1.0 72B (Megatron) | 59.7 / 54.6 | 65.2 | 853 | 94.2 | 86.0 | 92.6 | 82.1 | 69.7 | 85.4 |
Llama 3.2 90B | 60.3 / – | 57.3 | – | 92.3 | 85.5 | 90.1 | – | – | 78.1 |
Llama 3-V 70B | 60.6 / – | – | – | 93.0 | 83.2 | 92.2 | 83.4 | – | 79.1 |
Llama 3-V 405B | 64.5 / – | – | – | 94.1 | 85.8 | 92.6 | 84.8 | – | 80.2 |
InternVL2-Llama3-76B | 55.2 / – | 65.5 | 839 | 94.8 | 88.4 | 94.1 | 84.4 | 72.2 | – |
GPT-4V | 56.8 / 55.7 | 49.9 | 645 | 78.2 | 78.5 | 88.4 | 78.0 | 61.4 | 77.2 |
GPT-4o | 69.1 / – | 63.8 | 736 | 94.2 | 85.7 | 92.8 | – | – | – |
Claude 3.5 Sonnet | 68.3 / – | 67.7 | 788 | 94.7 | 90.8 | 95.2 | – | – | – |
Gemini 1.5 Pro (Aug 2024) | 62.2 / – | 63.9 | 754 | 94.4 | 87.2 | 93.1 | 78.7 | 70.4 | 80.2 |
How to use
When converting the Megatron checkpoint to Huggingface, we adapt the InternVL codebase to support model loading and multi-GPU inference in HF. For training, please refer to Megatron-LM (Coming soon).
Prepare the environment
We provide a docker build file in the Dockerfile for reproduction.
The docker image is based on nvcr.io/nvidia/pytorch:23.09-py3.
Note: We observe that different transformer versions / CUDA versions / docker versions can lead to slight benchmark number differences. We recommend using the Dockerfile above for accurate reproduction.
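Before running inference, it can help to confirm which library versions are actually in use. The snippet below is a minimal sanity check (an optional sketch, not part of the official setup); it only relies on standard torch and transformers attributes.

import torch
import transformers

# Print the library and CUDA versions, since mismatches can shift benchmark numbers slightly.
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('CUDA (torch build):', torch.version.cuda)
print('visible GPUs:', torch.cuda.device_count())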
Model loading
import torch
from transformers import AutoModel

path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True).eval()
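As a quick check that the weights loaded as intended, the following optional sketch counts parameters and reports the dtype; it assumes the model object from the snippet above.

# Optional sanity check: parameter count and dtype of the loaded model.
n_params = sum(p.numel() for p in model.parameters())
print(f'{n_params / 1e9:.1f}B parameters, dtype = {next(model.parameters()).dtype}')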
Multiple GPUs
The model can be loaded on multiple GPUs as follows:
import torch
import math
from transformers import AutoModel

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # The first GPU also hosts the vision encoder, so treat it as half a GPU
    # when distributing the 80 LLM layers.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map

path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()
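To see how the hand-built device_map spreads modules across GPUs, a small summary like the one below can be printed. This is an optional illustration (not part of the original example) and only inspects the dictionary returned by split_model.

# Count how many module entries land on each GPU index in the device_map above.
from collections import Counter

module_counts = Counter(device_map.values())
for gpu_id, n_modules in sorted(module_counts.items()):
    print(f'GPU {gpu_id}: {n_modules} module entries')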
Inference
import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80
    # The first GPU also hosts the vision encoder, so treat it as half a GPU
    # when distributing the 80 LLM layers.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0
    return device_map
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform
def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio
def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # Enumerate candidate tiling grids (columns x rows) whose tile count lies in [min_num, max_num].
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # Pick the grid whose aspect ratio is closest to the input image.
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # Resize to the target grid size and crop it into image_size x image_size tiles.
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images
def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()

print(model)

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)
# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')
# single-image conversation
pixel_values = load_image('path/to/your/example/image.jpg', max_num=6).to(
    torch.bfloat16)
question = '<image>\nPlease describe the image briefly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
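To make the tiling behavior of dynamic_preprocess concrete, here is an illustrative sketch on a synthetic image (the 1024x768 size and gray fill are arbitrary assumptions, not from the original example). A 4:3 image maps to a 4x3 grid of 448x448 tiles, plus one thumbnail tile when use_thumbnail=True.

# Synthetic 1024x768 RGB image: the closest grid is 4x3 -> 12 tiles + 1 thumbnail = 13 crops.
dummy = Image.new('RGB', (1024, 768), color=(128, 128, 128))
tiles = dynamic_preprocess(dummy, image_size=448, use_thumbnail=True, max_num=12)
print(len(tiles))  # expected: 13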
Correspondence to
Wenliang Dai* (wdai@nvidia.com), Nayeon Lee* (nayeonl@nvidia.com), Boxin Wang* (boxinw@nvidia.com), Zhuolin Yang* (zhuoliny@nvidia.com), Wei Ping* (wping@nvidia.com)
*Equal contribution
Citation
@article{nvlm2024,
  title={NVLM: Open Frontier-Class Multimodal LLMs},
  author={Dai, Wenliang and Lee, Nayeon and Wang, Boxin and Yang, Zhuolin and Liu, Zihan and Barker, Jon and Rintamaki, Tuomas and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}
}
License
The use of this model is governed by the cc-by-nc-4.0 license.