This blog post introduces ModernBERT, a family of state-of-the-art encoder-only models representing improvements over older generation encoders across the board, with an 8192 sequence length, better downstream performance and much faster processing.
ModernBERT is available as a slot-in replacement for any BERT-like model, with both a base (139M params) and large (395M params) model size.
Click to see how to use these models with transformers
ModernBERT will be included in v4.48.0 of transformers. Until then, it requires installing transformers from main:
pip install git+https://github.com/huggingface/transformers.git
Since ModernBERT is a Masked Language Model (MLM), you can use the fill-mask pipeline or load it via AutoModelForMaskedLM. To use ModernBERT for downstream tasks like classification, retrieval, or QA, fine-tune it following standard BERT fine-tuning recipes.
⚠️ If your GPU supports it, we recommend using ModernBERT with Flash Attention 2 to reach the highest efficiency. To do so, install Flash Attention as follows, then use the model as normal:
pip install flash-attn
Using AutoModelForMaskedLM:
from transformers import AutoTokenizer, AutoModelForMaskedLM
model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)
text = "The capital of France is [MASK]."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
masked_index = inputs["input_ids"][0].tolist().index(tokenizer.mask_token_id)
predicted_token_id = outputs.logits[0, masked_index].argmax(axis=-1)
predicted_token = tokenizer.decode(predicted_token_id)
print("Predicted token:", predicted_token)
Using a pipeline:
import torch
from transformers import pipeline
from pprint import pprint
pipe = pipeline(
"fill-mask",
model="answerdotai/ModernBERT-base",
torch_dtype=torch.bfloat16,
)
input_text = "He walked to the [MASK]."
results = pipe(input_text)
pprint(results)
Note: ModernBERT does not use token type IDs, unlike some earlier BERT models. Most downstream usage is identical to standard BERT models on the Hugging Face Hub, except you can omit the token_type_ids parameter.
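For instance, a minimal sketch of the classification route mentioned above (the model ID is the real one; the binary num_labels=2 setup and the example sentence are illustrative assumptions):
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels is task-specific; 2 is just an illustrative binary-classification assumption
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("This movie was great!", return_tensors="pt")
logits = model(**inputs).logits  # shape (1, num_labels); the head is untrained until you fine-tune it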
Introduction
BERT was released in 2018 (millennia ago in AI-years!) and yet it's still widely used today: in fact, it's currently the second most downloaded model on the HuggingFace hub, with more than 68 million monthly downloads, only second to another encoder model fine-tuned for retrieval. That's because its encoder-only architecture makes it ideal for the kinds of real-world problems that come up every day, like retrieval (such as for RAG), classification (such as content moderation), and entity extraction (such as for privacy and regulatory compliance).
Finally, 6 years later, we have a replacement! Today, we at Answer.AI and LightOn (and friends!) are releasing ModernBERT. ModernBERT is a new model series that is a Pareto improvement over BERT and its younger siblings across both speed and accuracy. This model takes dozens of advances from recent years of work on large language models (LLMs), and applies them to a BERT-style model, including updates to the architecture and the training process.
We expect to see ModernBERT become the new standard in the numerous applications where encoder-only models are now deployed, such as in RAG pipelines (Retrieval Augmented Generation) and recommendation systems.
In addition to being faster and more accurate, ModernBERT also increases context length to 8k tokens (compared to just 512 for most encoders), and is the first encoder-only model that includes a large amount of code in its training data. These features open up new application areas that were previously inaccessible through open models, such as large-scale code search, new IDE features, and new types of retrieval pipelines based on full document retrieval rather than small chunks.
But in order to explain just what we did, let's first take a step back and look at where we've come from.
Decoder-only models
The recent high-profile advances in LLMs have been in models like GPT, Llama, and Claude. These are decoder-only models, or generative models. Their ability to generate human-like content has enabled astonishing new GenAI application areas like generated art and interactive chat. These striking applications have attracted major investment, funded booming research, and led to rapid technical advances. What we've done, essentially, is port these advances back to an encoder-only model.
Why? Because many practical applications need a model that's lean and mean! And it doesn't need to be a generative model.
More bluntly, decoder-only models are too big, slow, private, and expensive for many jobs. Consider that the original GPT-1 was a 117 million parameter model. The Llama 3.1 model, by contrast, has 405 billion parameters, and its technical report describes a data synthesis and curation recipe that is too complex and expensive for most corporations to reproduce. So to use such a model, like ChatGPT, you pay in cents and wait in seconds to get an API reply back from heavyweight servers outside of your control.
Of course, the open-ended capabilities of these giant generative models mean that you can, in a pinch, press them into service for non-generative or discriminative tasks, such as classification. This is because you can describe a classification task in plain English and … just ask the model to classify. But while this workflow is great for prototyping, you don't want to pay prototype prices once you're in mass production.
The popular buzz around GenAI has obscured the role of encoder-only models. These are the workhorses of practical language processing, the models that are actually being used for such workloads right now in many scientific and commercial applications.
Encoder-only models
The output of an encoder-only model is a list of numerical values (an embedding vector). You might say that instead of answering with text, an encoder model literally encodes its "answer" into this compressed, numerical form. That vector is a compressed representation of the model's input, which is why encoder-only models are sometimes referred to as representational models.
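To make that concrete, here is a hedged sketch of pulling such a vector out of ModernBERT via mean pooling (mean pooling is just one common choice, not an official ModernBERT embedding recipe):
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModel.from_pretrained("answerdotai/ModernBERT-base")

def embed(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden_size)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)   # (1, hidden_size)

vector = embed("Encoders turn text into a single dense vector.")
print(vector.shape)   # torch.Size([1, 768])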
While decoder-only models (like a GPT) can do the work of an encoder-only model (like a BERT), they are hamstrung by a key constraint: since they are generative models, they are mathematically "not allowed" to "peek" at later tokens. They can only ever look backwards. This is in contrast to encoder-only models, which are trained so each token can look forwards and backwards (bi-directionally). They are built for this, and it makes them very efficient at what they do.
Basically, a frontier model like OpenAI's O1 is like a Ferrari SF-23. It's an obvious triumph of engineering, designed to win races, and that's why we talk about it. But it takes a special pit crew just to change the tires and you can't buy one for yourself. In contrast, a BERT model is like a Honda Civic. It's also an engineering triumph, but more subtly, since it is engineered to be affordable, fuel-efficient, reliable, and extremely useful. And that's why they're absolutely everywhere.
You can see this by looking at it a number of ways.
Supporting generative models: One way to understand the prevalence of representational models (encoder-only) is to note how frequently they are used in concert with a decoder-only model to make a system which is safe and efficient.
The obvious example is RAG. Instead of relying on the LLM's knowledge trained into the model's parameters, the system uses a document store to supply the LLM with information relevant to the query. But of course this only postpones the problem. If the LLM doesn't know which documents are relevant to the query, then the system will need some other process to select those documents. It's going to need a model which is fast and cheap enough that it can be used to encode the large quantities of information needed to make the LLM useful. That model is often a BERT-like encoder-only model.
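As a toy illustration of that retrieval step (reusing the hypothetical embed helper from the previous snippet; real RAG systems use fine-tuned retrieval models and a vector index rather than a brute-force loop):
import torch

documents = [
    "The Eiffel Tower is in Paris.",
    "Photosynthesis converts light into chemical energy.",
    "Encoders produce one embedding per document.",
]

# Pre-compute one embedding per document with the encoder (cheap, and done once).
doc_vectors = torch.cat([embed(doc) for doc in documents])

query_vector = embed("Where is the Eiffel Tower located?")
scores = torch.nn.functional.cosine_similarity(query_vector, doc_vectors)
best = scores.argmax().item()
print(documents[best])   # the document that would be handed to the LLM as context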
Another example is supervision architectures, where a cheap classifier might be used to ensure that generated text does not violate content safety requirements.
In short, whenever you see a decoder-only model in deployment, there's a reasonable chance an encoder-only model is also part of the system. But the converse is not true.
Encoder-based systems: Before there was GPT, there were content recommendations in social media and in platforms like Netflix. There was ad targeting in those venues, in search, and elsewhere. There was content classification for spam detection, abuse detection, etc. These systems were not built on generative models, but on representational models like encoder-only models. And all these systems are still out there and still running at enormous scale. Imagine how many ads are targeted per second around the world!
Downloads: On HuggingFace, RoBERTa, one of the leading BERT-based models, has more downloads than the 10 most popular LLMs on HuggingFace combined. In fact, currently, encoder-only models add up to over a billion downloads per month, nearly three times more than decoder-only models with their 397 million monthly downloads. In fact, the `fill-mask` model category, composed of encoder "base models" such as ModernBERT, ready to be fine-tuned for other downstream applications, is the most downloaded model category overall.
Inference costs: What the above suggests is that on an inference-per-inference basis, there are many times more inferences performed per year on encoder-only models than on decoder-only or generative models. An interesting example is FineWeb-Edu, where model-based quality filtering had to be performed over 15 trillion tokens. The FineWeb-Edu team chose to generate annotations with a decoder-only model, Llama-3-70b-Instruct, and perform the bulk of the filtering with a fine-tuned BERT-based model. This filtering took 6,000 H100 hours, which, at HuggingFace Inference Endpoints' pricing of $10/hour, comes to a total of $60,000. On the other hand, feeding 15 trillion tokens to popular decoder-only models, even with the lowest-cost option of using Google's Gemini Flash and its low inference cost of $0.075/million tokens, would cost over one million dollars!
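For transparency, the back-of-the-envelope arithmetic behind those two figures, using only the numbers quoted above:
# Encoder route: 6,000 H100 hours at $10/hour
encoder_cost = 6_000 * 10                                  # $60,000

# Decoder route: 15 trillion tokens at $0.075 per million tokens
decoder_cost = 15_000_000_000_000 / 1_000_000 * 0.075      # $1,125,000

print(encoder_cost, decoder_cost)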
Performance
Overview
Here's a snapshot of the accuracy of ModernBERT and other models across a range of tasks, as measured by standard academic benchmarks – as you can see, ModernBERT is the only model which is a top scorer across every category, which makes it the one model you can use for all your encoder-based tasks:
If you've ever done an NLP competition on Kaggle, then you'll know that DeBERTaV3 has been the choice of champions for years. But no longer: not only is ModernBERT the first base-size model to beat DeBERTaV3 on GLUE, it also uses less than 1/5th of DeBERTa's memory.
And of course, ModernBERT is fast. It's twice as fast as DeBERTa – in fact, up to 4x faster in the more common situation where inputs have mixed lengths. Its long context inference is nearly 3 times faster than other high-quality models such as NomicBERT and GTE-en-MLM.
ModernBERT's context length of 8,192 tokens is over 16x larger than most existing encoders. This is critical, for instance, in RAG pipelines, where a small context often makes chunks too small for semantic understanding. ModernBERT is also the state-of-the-art long context retriever with ColBERT, and is 9 percentage points above the other long context models. Even more impressive: this very quickly trained model, simply tuned to compare to other backbones, outperforms even widely-used retrieval models on long-context tasks!
For code retrieval, ModernBERT is unique. There's nothing to really compare it to, since there's never been an encoder model like this trained on a large amount of code data before. For instance, on the StackOverflow-QA dataset (SQA), which is a hybrid dataset mixing both code and natural language, ModernBERT's specialized code understanding and long-context capabilities make it the only backbone to score over 80 on this task.
This means whole new applications are likely to be built on this capability. For instance, imagine an AI-connected IDE which had an entire enterprise codebase indexed with ModernBERT embeddings, providing fast long context retrieval of the relevant code across all repositories. Or a code chat service which described how an application feature worked that integrated dozens of separate projects.
Compared to the mainstream models, ModernBERT performs better across nearly all three broad task categories of retrieval, natural language understanding, and code retrieval. Whilst it slightly lags DeBERTaV3 in one area (natural language understanding), it is many times faster. Please note that ModernBERT, like any other base model, can only do masked word prediction out-of-the-box. To be able to perform other tasks, the base model should be fine-tuned as done in these boilerplates.
Compared to the specialized models, ModernBERT is comparable or superior in most tasks. In addition, ModernBERT is faster than most models across most tasks, and can handle inputs up to 8,192 tokens, 16x longer than the mainstream models.
Efficiency
Here's the memory (max batch size, BS) and inference (in thousands of tokens per second) efficiency results on an NVIDIA RTX 4090 for ModernBERT and other encoder models:
The first thing you might notice is that we're analysing the efficiency on an affordable consumer GPU, rather than the latest unobtainable hyped hardware. First and foremost, ModernBERT is focused on practicality, not hype.
As part of this focus, it also means we've made sure ModernBERT works well for real-world applications, rather than just benchmarks. Models of this kind are normally tested on just the one exact size they're best at – their maximum context length. That's what the "fixed" column in the table shows. But input sizes vary in the real world, so that's the performance we worked hard to optimise – the "variable" column. As you can see, for variable length inputs, ModernBERT is much faster than all other models.
For long context inputs, which we believe will be the basis for the most valuable and important future applications, ModernBERT is 2-3x faster than the next fastest model. And, on the "practicality" dimension again: ModernBERT doesn't require the additional heavy "xformers" dependency, but instead only requires the now commonplace Flash Attention as a dependency.
Furthermore, thanks to ModernBERT's efficiency, it can use a larger batch size than nearly any other model, and can be used effectively on smaller and cheaper GPUs. The efficiency of the base size, in particular, may enable new applications that run directly in browsers, on phones, and so forth.
Why is ModernBERT, well, Modern?
Now, we've made our case for why we should give some more love to encoder models. As trusted, under-appreciated workhorses, they've had surprisingly few updates since 2018's BERT!
Even more surprising: since RoBERTa, there has been no encoder providing overall improvements without tradeoffs (fancily known as "Pareto improvements"): DeBERTaV3 had better GLUE and classification performance, but sacrificed both efficiency and retrieval. Other models, such as ALBERT, or newer ones, like GTE-en-MLM, all improved over the original BERT and RoBERTa in some ways but regressed in others.
However, since the duo's original release, we've learned an enormous amount about how to build better language models. If you've used LLMs at all, you're very well aware of it: while they're rare in the encoder world, Pareto improvements are constant in decoder-land, where models constantly become better at everything. And as we've all learned by now: model improvements are only partially magic, and mostly engineering.
The goal of the (hopefully aptly named) ModernBERT project was thus fairly simple: bring this modern engineering to encoder models. We did so in three core ways:
- a modernized transformer architecture
- particular attention to efficiency
- modern data scales & sources
Meet the New Transformer, Same as the Old Transformer
The Transformer architecture has become dominant, and is used by the vast majority of models nowadays. However, it's important to remember that there isn't one but many Transformers. The main thing they share in common is their deep belief that attention is indeed all you need, and as such, build various improvements centered around the attention mechanism.
ModernBERT takes huge inspiration from the Transformer++ (as coined by Mamba), first used by the Llama2 family of models. Namely, we replace older BERT-like building blocks with their improved equivalents. Specifically, we:
- Replace the old positional encoding with "rotary positional embeddings" (RoPE): this makes the model much better at understanding where words are in relation to each other, and allows us to scale to longer sequence lengths.
- Switch out the old MLP layers for GeGLU layers, improving on the original BERT's GeLU activation function (see the sketch after this list).
- Streamline the architecture by removing unnecessary bias terms, letting us spend our parameter budget more effectively
- Add an extra normalization layer after embeddings, which helps stabilize training
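To give a feel for one of these swaps, here is a minimal PyTorch sketch of a GeGLU feed-forward block, as referenced in the list above; the bias-free projections mirror the description, but the dimensions and module structure are illustrative rather than ModernBERT's actual implementation:
import torch
import torch.nn as nn

class GeGLUFeedForward(nn.Module):
    """Feed-forward block where a GeLU-gated branch multiplies a linear branch."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        # A single projection produces both the gate and the value, with no bias terms.
        self.wi = nn.Linear(hidden_size, 2 * intermediate_size, bias=False)
        self.wo = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.wi(x).chunk(2, dim=-1)
        return self.wo(nn.functional.gelu(gate) * value)

block = GeGLUFeedForward(hidden_size=768, intermediate_size=1152)
print(block(torch.randn(1, 8, 768)).shape)   # torch.Size([1, 8, 768])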
Upgrading a Honda Civic for the Race Track
We've covered this already: encoders are no Ferraris, and ModernBERT is no exception. However, that doesn't mean it can't be fast. When you get on the highway, you generally don't go and trade in your car for a race car, but rather hope that your everyday reliable ride can comfortably hit the speed limit.
In fact, for all the application cases we mentioned above, speed is essential. Encoders are very popular in uses where they either have to process tons of data, allowing even tiny speed increments to add up very quickly, or where latency is very important, as is the case for RAG. In a lot of situations, encoders are even run on CPU, where efficiency is even more important if we want results in a reasonable amount of time.
As with most things in research, we build while standing on the shoulders of giants, and heavily leverage Flash Attention 2's speed improvements. Our efficiency improvements rely on three key components: Alternating Attention, to improve processing efficiency; Unpadding and Sequence Packing, to reduce computational waste; and Hardware-Aware Model Design, to maximise hardware utilization.
Global and Local Attention
One of ModernBERT's most impactful features is Alternating Attention, rather than full global attention. In technical terms, this means that our attention mechanism only attends to the full input every 3 layers (global attention), while all other layers use a sliding window where every token only attends to the 128 tokens nearest to itself (local attention).
As attention's computational complexity balloons up with every additional token, this means ModernBERT can process long input sequences considerably faster than any other model.
In practice, it looks like this:
Conceptually, the reason this works is pretty simple: Picture yourself reading a book. For every sentence you read, do you need to be fully aware of the entire plot to understand most of it (full global attention)? Or is awareness of the current chapter enough (local attention), as long as you occasionally think back on its significance to the main plot (global attention)? In the vast majority of cases, it's the latter.
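A small sketch of how such a layer schedule could be written down (the every-third-layer-global pattern and the 128-token window come from the description above; the function itself is purely illustrative):
def attention_plan(num_layers: int, global_every: int = 3, local_window: int = 128):
    """Return, for each layer, whether it uses global attention or a local sliding window."""
    plan = []
    for layer in range(num_layers):
        if layer % global_every == 0:
            plan.append(("global", None))           # attends to the full sequence
        else:
            plan.append(("local", local_window))    # attends only to the nearest 128 tokens
    return plan

for i, (kind, window) in enumerate(attention_plan(num_layers=6)):
    print(f"layer {i}: {kind}" + (f", window={window}" if window else ""))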
Unpadding and Sequence Packing
Another core mechanism contributing to ModernBERT's efficiency is its use of Unpadding and Sequence Packing.
In order to be able to process multiple sequences within the same batch, encoder models require them to be the same length, so they can perform parallel computation. Traditionally, we've relied on padding to achieve this: figure out which sentence is the longest, and add meaningless tokens (padding tokens) to fill up every other sequence.
While padding solves the problem, it doesn't do so elegantly: a lot of compute ends up being spent and wasted on padding tokens, which do not contribute any semantic information.
Unpadding solves this issue: rather than keeping these padding tokens, we remove them all, and concatenate them into mini-batches with a batch size of one, avoiding all unnecessary computations. If you're using Flash Attention, our implementation of unpadding is even faster than previous methods, which heavily relied on unpadding and repadding sequences as they went through the model: we go one step further by introducing our own implementation of unpadding, relying heavily on recent developments in Flash Attention's RoPE support. This allows ModernBERT to only have to unpad once, and optionally repad sequences after processing, resulting in a 10-20% speedup over previous methods.
To speed up pre-training even further, unpadding is in good company within our model, as we use it in conjunction with sequence packing. Sequence packing here is a logical next step: as we're concatenating inputs into a single sequence, and GPUs are very good at parallelisation, we want to maximise the computational efficiency we can squeeze out of a single forward model pass. To do so, we use a greedy algorithm to group individual sequences into concatenated ones that are as close to the model's maximum input length as possible.
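As a toy version of that greedy step (a first-fit-decreasing variant shown purely for illustration; the real data pipeline is considerably more involved):
def greedy_pack(lengths, max_len=8192):
    """Group sequence lengths into packs whose totals stay under the model's maximum input length."""
    packs = []
    for length in sorted(lengths, reverse=True):
        # Put each sequence into the first pack it fits in, otherwise open a new pack.
        for pack in packs:
            if sum(pack) + length <= max_len:
                pack.append(length)
                break
        else:
            packs.append([length])
    return packs

print(greedy_pack([5000, 3000, 2500, 700, 120, 60]))
# [[5000, 3000, 120, 60], [2500, 700]]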
Paying Attention to Hardware
Finally, the third facet of ModernBERT's efficiency is hardware design.
We attempted to balance two insights that have been highlighted by previous research:
- Deep & Narrow vs Wide & Shallow: Research shows that deeper models with narrower layers often perform better than shallow models with fewer, wider layers. However, this is a double-edged sword: the deeper the model, the less parallelizable it becomes, and thus, the slower it runs at identical parameter counts.
- Hardware Efficiency: Model dimensions need to align well with GPU hardware for maximum performance, and different target GPUs result in different constraints.
Sadly, there is no magic recipe to make a model run similarly well on a wide range of GPUs, but there is an excellent cookbook: The Case for Co-Designing Model Architectures with Hardware, in which the ways to optimize a model architecture for a given GPU are carefully laid out. We came up with a heuristic to extend their method to a basket of GPUs, while respecting a given set of constraints. Logically, the first step is to define said constraints, in our case:
- Defining our target GPUs as common inference ones (RTX 3090/4090, A10, T4, L4)
- Roughly defining our target model sizes at 130-to-150 million parameters for ModernBERT-Base, and 350-to-420 million for ModernBERT-Large.
- The final embedding sizes must match the original BERT's dimensions, 768 for base and 1024 for large, to maximize backwards compatibility
- Set performance constraints which are common across the basket of GPUs
Afterwards, we experimented with multiple model designs via a constrained grid search, varying both layer counts and layer width. Once we'd identified shapes that appeared to be the most efficient ones, we confirmed that our heuristics matched real-world GPU performance, and settled on the final model designs.
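A rough sketch of what such a constrained search could look like (the parameter-count formula is a crude approximation and the divisibility-by-64 rule is a generic tensor-core-friendliness heuristic, not our exact criteria):
def rough_param_count(layers: int, hidden: int, vocab: int = 50_368) -> int:
    """Crude transformer size estimate: token embeddings plus ~12*hidden^2 weights per layer."""
    return vocab * hidden + layers * 12 * hidden * hidden

candidates = []
for layers in range(16, 29, 2):                  # depth candidates
    for hidden in range(512, 1025, 64):          # widths that stay divisible by 64
        params = rough_param_count(layers, hidden)
        if 130e6 <= params <= 150e6:             # the Base-size budget listed above
            candidates.append((layers, hidden, round(params / 1e6)))

print(candidates)   # (layers, hidden, millions of params) tuples, e.g. (22, 640, 140)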
Training
def data(): return ['text', 'bad_text', 'math', 'code']
Picture this exact scene, but replace Developers with Data
Another big aspect in which encoders have been trailing behind is training data. This is often understood to mean solely training data scale, but this is not actually the case: previous encoders, such as DeBERTaV3, were trained for long enough that they might have even breached the trillion tokens scale!
The issue, rather, has been training data diversity: many of the older models train on limited corpora, generally consisting of Wikipedia and Wikibooks. These data mixtures are very noticeably a single text modality: they contain nothing but high-quality natural text.
In contrast, ModernBERT is trained on data from a variety of English sources, including web documents, code, and scientific articles. It is trained on 2 trillion tokens, of which most are unique, rather than the standard 20-to-40 repetitions common in previous encoders.
The impact of this is immediately noticeable: out of all the existing open source encoders, ModernBERT is in a class of its own on programming-related tasks. We're particularly interested in what downstream uses this will lead to, in terms of improving programming assistants.
Process
We stick to the original BERT's training recipe, with some slight upgrades inspired by subsequent work: we remove the Next-Sentence Prediction objective, which has since been shown to add overhead for no clear gains, and increase the masking rate from 15% to 30%.
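In 🤗 Transformers terms, that part of the recipe boils down to something like the following sketch (the real pre-training code also handles packing, long contexts, and much more):
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
# Masked language modelling with a 30% masking rate instead of BERT's original 15%,
# and no Next-Sentence Prediction objective anywhere in sight.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.3)

batch = collator([tokenizer("Encoders are the workhorses of practical NLP.")])
print(batch["input_ids"])   # some tokens replaced by [MASK]; batch["labels"] holds the originals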
Both models are trained with a three-phase process. First, we train on 1.7T tokens at a sequence length of 1024. We then adopt a long-context adaptation phase, training on 250B tokens at a sequence length of 8192, while keeping the total tokens seen per batch more or less consistent by lowering the batch size. Finally, we perform annealing on 50 billion tokens sampled differently, following the long-context extension ideal mix highlighted by ProLong.
Training in three phases is our way of ensuring our model is good across the board, which is reflected in its results: it is competitive on long-context tasks, at no cost to its ability to process short context…
… But it has another benefit: for the first two phases, we train using a constant learning rate once the warmup phase is complete, and only perform learning rate decay on the final 50 billion tokens, following the Trapezoidal (or Warmup-Stable-Decay) learning rate schedule. And what's more: we will release every single intermediate checkpoint from these stable phases, inspired by Pythia. Our main reason for doing so was supporting future research and applications: anyone is free to restart training from any of our pre-decay checkpoints, and perform annealing on domain-appropriate data for their intended use!
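A hedged sketch of such a trapezoidal schedule as a plain function of the training step (linear decay and every step count below are placeholder choices, not our actual hyperparameters):
def trapezoidal_lr(step: int, peak_lr: float, warmup_steps: int,
                   stable_steps: int, decay_steps: int) -> float:
    """Warmup-Stable-Decay: linear warmup, a long constant plateau, then decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    if step < warmup_steps + stable_steps:
        return peak_lr
    progress = (step - warmup_steps - stable_steps) / decay_steps
    return peak_lr * max(0.0, 1.0 - progress)

# Illustrative numbers only: 2k warmup steps, 100k stable steps, 10k decay steps.
for step in (0, 1_000, 50_000, 105_000, 112_000):
    print(step, trapezoidal_lr(step, 8e-4, 2_000, 100_000, 10_000))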
The tricks, it’s all about the tricks!
If you've made it this far into this announcement, you're probably used to this: of course, we use tricks to make things faster here too. To be precise, we have two main tricks.
Let's start with the first one, which is pretty common: since the initial training steps are updating random weights, we adopt batch-size warmup: we start with a smaller batch size so the same number of tokens update the model weights more often, then gradually increase the batch size to the final training size. This significantly speeds up the initial phase of model training, where the model learns its most basic understanding of language.
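As a toy sketch of what batch-size warmup means in practice (every number below is made up for illustration):
def batch_size_at(step: int, start: int = 768, final: int = 4_608, warmup_steps: int = 10_000) -> int:
    """Grow the batch size linearly from a small starting value up to the final training batch size."""
    if step >= warmup_steps:
        return final
    return int(start + (final - start) * step / warmup_steps)

for step in (0, 2_500, 5_000, 10_000):
    print(step, batch_size_at(step))   # 768, 1728, 2688, 4608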
The second trick is far more uncommon: weight initialization via tiling for the larger model size, inspired by Microsoft's Phi family of models. This one's based on the following realization: Why initialize ModernBERT-large's initial weights with random numbers when we have a perfectly good (if we dare say so ourselves) set of ModernBERT-base weights just sitting there?
And indeed, it turns out that tiling ModernBERT-base's weights across ModernBERT-large works better than initializing from random weights. It also has the added benefit of stacking nicely with batch size warmup for even faster initial training.
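A minimal sketch of the tiling idea on a single weight matrix (the real initialization covers the full set of model weights and is more careful than this):
import torch

def tile_weight(small: torch.Tensor, out_rows: int, out_cols: int) -> torch.Tensor:
    """Initialize a larger weight matrix by repeating (tiling) a smaller, already-trained one."""
    reps_rows = -(-out_rows // small.shape[0])   # ceiling division
    reps_cols = -(-out_cols // small.shape[1])
    return small.repeat(reps_rows, reps_cols)[:out_rows, :out_cols]

base_weight = torch.randn(768, 768)              # stand-in for a trained ModernBERT-base matrix
large_weight = tile_weight(base_weight, 1024, 1024)
print(large_weight.shape)                        # torch.Size([1024, 1024])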
Conclusion
In this blog post we introduced the ModernBERT models, a new state-of-the-art family of small and efficient encoder-only models, finally giving BERT a much needed do-over.
ModernBERT demonstrates that encoder-only models can be improved by modern methods. They continue to offer very strong performance on some tasks, providing an extremely attractive size/performance ratio.
More than anything, we're really looking forward to seeing what creative ways the community will come up with to use these models! To encourage this, we're opening a call for demos until January 10th, 2025: the 5 best ones will get added to this post in a showcase section and win a $100 (or local currency equivalent) Amazon gift card, as well as a 6-month HuggingFace Pro subscription! If you need a hint to get started, here's a demo we thought about: a code similarity HF space! And remember, this is an encoder model, so all the coolest downstream applications will likely require some sort of fine-tuning (on real or perhaps decoder-model synthetic data?). Thankfully, there are lots of cool frameworks out there to support fine-tuning encoders: 🤗Transformers itself for various tasks, including classification, GliNER for zero-shot Named Entity Recognition, or Sentence-Transformers for retrieval and similarity tasks!
Links
LightOn sponsored the compute for this project on Orange Business Cloud Avenue.