nvidia/NVLM-D-72B




Model Details

Today (September 17th, 2024), we introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.

In this repo, we are open-sourcing NVLM-1.0-D-72B (decoder-only architecture): the decoder-only model weights and code for the community.



Other Resources

Inference Code (HF) · Training Code (Coming soon) · Website · Paper



Benchmark Results

We train our model with legacy Megatron-LM and adapt the codebase to Huggingface for model hosting, reproducibility, and inference.
We observe numerical differences between the Megatron and Huggingface codebases, which are within the expected range of variation.
We provide the results from both the Huggingface codebase and the Megatron codebase for reproducibility and comparison with other models.

Results (as of September 17th, 2024) on the multimodal benchmarks are as follows:

| Benchmark | MMMU (val / test) | MathVista | OCRBench | AI2D | ChartQA | DocVQA | TextVQA | RealWorldQA | VQAv2 |
|---|---|---|---|---|---|---|---|---|---|
| NVLM-D 1.0 72B (Huggingface) | 58.7 / 54.9 | 65.2 | 852 | 94.2 | 86.0 | 92.6 | 82.6 | 69.5 | 85.4 |
| NVLM-D 1.0 72B (Megatron) | 59.7 / 54.6 | 65.2 | 853 | 94.2 | 86.0 | 92.6 | 82.1 | 69.7 | 85.4 |
| Llama 3.2 90B | 60.3 / – | 57.3 | – | 92.3 | 85.5 | 90.1 | – | – | 78.1 |
| Llama 3-V 70B | 60.6 / – | – | – | 93.0 | 83.2 | 92.2 | 83.4 | – | 79.1 |
| Llama 3-V 405B | 64.5 / – | – | – | 94.1 | 85.8 | 92.6 | 84.8 | – | 80.2 |
| InternVL2-Llama3-76B | 55.2 / – | 65.5 | 839 | 94.8 | 88.4 | 94.1 | 84.4 | 72.2 | – |
| GPT-4V | 56.8 / 55.7 | 49.9 | 645 | 78.2 | 78.5 | 88.4 | 78.0 | 61.4 | 77.2 |
| GPT-4o | 69.1 / – | 63.8 | 736 | 94.2 | 85.7 | 92.8 | – | – | – |
| Claude 3.5 Sonnet | 68.3 / – | 67.7 | 788 | 94.7 | 90.8 | 95.2 | – | – | – |
| Gemini 1.5 Pro (Aug 2024) | 62.2 / – | 63.9 | 754 | 94.4 | 87.2 | 93.1 | 78.7 | 70.4 | 80.2 |



How to use

When converting the Megatron checkpoint to Huggingface, we adapt the InternVL codebase to support model loading and multi-GPU inference in HF. For training, please refer to Megatron-LM (Coming soon).



Prepare the environment

We provide a docker build file in the Dockerfile for reproduction.

The docker image is based on nvcr.io/nvidia/pytorch:23.09-py3.

Note: We observe that different transformers versions / CUDA versions / docker versions can lead to slight benchmark number differences. We recommend using the Dockerfile above for accurate reproduction.
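
As a quick sanity check of your environment before running inference or benchmarks, a minimal sketch like the following (illustrative, not part of the original instructions; the Dockerfile remains the reference) prints the versions that most commonly cause such small differences:

# Illustrative environment check; the pinned Dockerfile is the reference setup.
import torch
import transformers

print("torch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("transformers:", transformers.__version__)
print("GPUs visible:", torch.cuda.device_count())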



Model loading

import torch
from transformers import AutoModel

path = "nvidia/NVLM-D-72B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True).eval()
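
After loading, a small optional check (an illustrative sketch, not part of the original card; num_params is just a local name used here) confirms the parameter count and dtype:

# Optional sanity check after loading (illustrative).
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e9:.1f}B")
print("dtype:", next(model.parameters()).dtype)  # expected: torch.bfloat16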



Multiple GPUs

The model can be loaded on multiple GPUs as follows:

import torch
import math
from transformers import AutoModel

def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80

    # Since the first GPU also hosts the vision encoder, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map

path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()
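
The heuristic above treats GPU 0 as half a GPU because it also hosts the vision encoder, the mlp1 projector, the embeddings, and the output head. The standalone sketch below (the helper name sketch_layer_split and the 8-GPU world size are assumptions for illustration, not from the original card) shows how the 80 decoder layers end up distributed:

import math

def sketch_layer_split(world_size=8, num_layers=80):
    # Same arithmetic as split_model(), with world_size passed in explicitly.
    per_gpu = math.ceil(num_layers / (world_size - 0.5))
    per_gpu = [per_gpu] * world_size
    per_gpu[0] = math.ceil(per_gpu[0] * 0.5)  # GPU 0 carries the vision tower too
    return per_gpu

print(sketch_layer_split())  # e.g. [6, 11, 11, 11, 11, 11, 11, 11] on 8 GPUs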



Inference

import torch
from transformers import AutoTokenizer, AutoModel
import math
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms.functional import InterpolationMode


def split_model():
    device_map = {}
    world_size = torch.cuda.device_count()
    num_layers = 80

    # Since the first GPU also hosts the vision encoder, treat it as half a GPU.
    num_layers_per_gpu = math.ceil(num_layers / (world_size - 0.5))
    num_layers_per_gpu = [num_layers_per_gpu] * world_size
    num_layers_per_gpu[0] = math.ceil(num_layers_per_gpu[0] * 0.5)
    layer_cnt = 0
    for i, num_layer in enumerate(num_layers_per_gpu):
        for j in range(num_layer):
            device_map[f'language_model.model.layers.{layer_cnt}'] = i
            layer_cnt += 1
    device_map['vision_model'] = 0
    device_map['mlp1'] = 0
    device_map['language_model.model.tok_embeddings'] = 0
    device_map['language_model.model.embed_tokens'] = 0
    device_map['language_model.output'] = 0
    device_map['language_model.model.norm'] = 0
    device_map['language_model.lm_head'] = 0
    device_map[f'language_model.model.layers.{num_layers - 1}'] = 0

    return device_map


IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform


def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio


def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # enumerate the candidate tile grids (i x j) allowed by min_num/max_num
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the grid whose aspect ratio is closest to the input image's
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # compute the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image to the target grid size
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # crop out the i-th tile
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images


def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values
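
# Illustration (not from the original card): for a 1024x512 input with
# image_size=448 and use_thumbnail=True, dynamic_preprocess selects a 2x1 tile
# grid plus a thumbnail, so load_image returns a tensor of shape
# [3, 3, 448, 448] (tiles including thumbnail, channels, height, width).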

path = "nvidia/NVLM-D-72B"
device_map = split_model()
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=False,
    trust_remote_code=True,
    device_map=device_map).eval()

print(model)

tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=False)


# pure-text conversation
question = 'Hello, who are you?'
response, history = model.chat(tokenizer, None, question, generation_config, history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')


# single-image, single-round conversation
pixel_values = load_image('path/to/your/example/image.jpg', max_num=6).to(
    torch.bfloat16)
question = '<image>\nPlease describe the image briefly.'
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(f'User: {question}\nAssistant: {response}')
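
For multi-round conversation about the same image, the chat helper can be called with the returned history, as in the pure-text example above. The following is a sketch under that assumption (the follow-up question text is made up for illustration):

# Multi-round, single-image conversation (a sketch; assumes history handling
# mirrors the pure-text example above).
question = '<image>\nPlease describe the image briefly.'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=None, return_history=True)
print(f'User: {question}\nAssistant: {response}')

question = 'What colors dominate the image?'
response, history = model.chat(tokenizer, pixel_values, question, generation_config,
                               history=history, return_history=True)
print(f'User: {question}\nAssistant: {response}')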



Correspondence to

Wenliang Dai* (wdai@nvidia.com), Nayeon Lee* (nayeonl@nvidia.com), Boxin Wang* (boxinw@nvidia.com), Zhuolin Yang* (zhuoliny@nvidia.com), Wei Ping* (wping@nvidia.com)

*Equal contribution



Citation

@article{nvlm2024,
  title={NVLM: Open Frontier-Class Multimodal LLMs},
  author={Dai, Wenliang and Lee, Nayeon and Wang, Boxin and Yang, Zhuolin and Liu, Zihan and Barker, Jon and Rintamaki, Tuomas and Shoeybi, Mohammad and Catanzaro, Bryan and Ping, Wei},
  journal={arXiv preprint},
  year={2024}}



License

The use of this model is governed by the cc-by-nc-4.0 license.
