Source: OpenAI – Extracting Concepts from GPT-4
Modern LLMs encode concepts by superimposing multiple features into the same neurons and then interpreting them by taking into account the linear superposition of all neurons in a layer. This concept of giving each neuron multiple interpretable meanings that activate depending on the context of other neuron activations is called superposition. Sparse Autoencoders (SAEs) are models that are inserted into a trained LLM for the purpose of projecting the activations into a very large but very sparsely activated latent space. By doing so they try to untangle these superimposed representations into separate, clearly interpretable features for each neuron activation, with each latent representing one clear concept – which in turn would make these latents monosemantic. Such mechanistic interpretability has proven very valuable for understanding model behavior, detecting hallucinations, analyzing information flow through models for optimization, etc.
This project attempts to recreate this great research into mechanistic LLM interpretability with Sparse Autoencoders (SAEs) to extract interpretable features, which was very successfully conducted and published by Anthropic, OpenAI, and Google DeepMind a few months ago. The project aims to provide a full pipeline for capturing training data, training the SAEs, analyzing the learned features, and then verifying the results experimentally. Currently, the project provides all code, data, and models that were created by running the whole project pipeline once and producing a functional and interpretable Sparse Autoencoder for the Llama 3.2-3B model.
Such a research project clearly needs a lot of computational resources (meaning money) and time that I don't necessarily have at my full disposal for a non-profit side project of mine. Therefore, the project – as I am releasing it now with version 0.2 – is in a good, efficient, and scalable state, but it is not final and will hopefully be updated and improved upon over time. Please feel free to contribute code or feedback, or just let me know if you found a bug – thank you!
This project is based primarily on the following research papers:
And the Open Source LLM Llama 3.2 that was used for the current state of the project:
A complete end-to-end pipeline from activation capture to Sparse Autoencoder (SAE) training, feature interpretation, and verification, written in pure PyTorch with minimal dependencies. Specifically:
- Captures residual activations from large language models as the SAE training dataset, using a custom sentence-split OpenWebText dataset variant
- Preprocesses training data (prebatching, statistics calculation) for efficient training
- Supports distributed (multi-GPU, single-node) large-scale and efficient training of SAEs with captured and preprocessed activation data
- Implements SAE training with an auxiliary loss to prevent and revive dead latents, and gradient projection to stabilize training dynamics
- Provides comprehensive logging, visualization, and checkpointing of SAE training through Weights & Biases and console logging, including detailed logs for:
- Training progress (main/aux loss)
- Validation results (main/aux loss)
- Dead latent tracking for debugging and analysis
- Offers interpretability analysis tools for feature extraction and semantic analysis of learned features by:
- Capturing inputs that maximally activate the sparse autoencoder latents
- Cost-effectively analyzing them at scale using a Frontier LLM
- Provides a pure PyTorch implementation of Llama 3.1/3.2 chat and text completion without external dependencies (e.g., Fairscale) for general usage and result verification
- Verifies the SAE's impact on model behavior and allows feature steering of extracted semantic features through:
- Text and chat completion tasks
- Optional Gradio interface for ease of use
- All components are designed and implemented with scalability, efficiency, and maintainability in mind
The following resources are available to reproduce the results of the current state of the project or to provide insight into the training:
-
- Custom Version of the OpenWebText Dataset used for activation capture
- All text from the original OpenWebText dataset
- Sentences are now stored individually in parquet format for faster access
- Maintains all original OpenWebText text and the order thereof
- Sentences were split using NLTK 3.9.1 pre-trained “Punkt” tokenizer
-
Captured Llama 3.2-3B Activations:
- 25 million sentences worth of Llama 3.2-3B layer 23 residual activations
- Size: 3.2 TB (compressed from 4 TB raw)
- Split into 100 archives for more manageable download
-
- Weights & Biases log of training, validation, and debug metrics
- 10 epochs of training, 10,000 logged steps
- Includes train/val main losses, auxiliary losses, and dead latent stats during training
-
Trained 65,536-Latent SAE Model:
- Final trained SAE model after 10 epochs with the configuration specified in the training log
- Trained on 6.5 billion activations from Llama 3.2-3B layer 23
The project is organized into four main components:
- capture_activations.py – Captures LLM residual activations
- openwebtext_sentences_dataset.py – Custom dataset for sentence-level processing

- sae.py – Core SAE model implementation
- sae_preprocessing.py – Data preprocessing for SAE training
- sae_training.py – Distributed SAE training implementation

- capture_top_activating_sentences.py – Identifies sentences that maximize feature activation
- interpret_top_sentences_send_batches.py – Builds and sends batches for interpretation
- interpret_top_sentences_retrieve_batches.py – Retrieves interpretation results
- interpret_top_sentences_parse_responses.py – Parses and analyzes interpretations

- llama_3_inference.py – Core inference implementation
- llama_3_inference_text_completion_test.py – Text completion testing
- llama_3_inference_chat_completion_test.py – Chat completion testing
- llama_3_inference_text_completion_gradio.py – Gradio interface for interactive testing
# Install Poetry if not already installed
curl -sSL https://install.python-poetry.org | python3.12 -
# Clone the repository
git clone https://github.com/PaulPauls/llama3_interpretability_sae
cd llama3_interpretability_sae
# Install exact dependencies the project has been run with
poetry install --sync
The basis of this research is the custom Llama 3.1/3.2 transformer model implementation in llama_3/model_text_only.py. It is based on the reference implementation from the llama models repository; however, I've made several significant modifications to better suit this project. I rewrote the implementation to remove the heavy dependency on the Fairscale library – simply because I am not familiar with it and am much more comfortable working directly with PyTorch, thereby hoping to avoid bugs or performance bottlenecks stemming from working with a library I am not familiar with. Similarly, I stripped out the multimodal functionality, since investigating image interpretability would have added unnecessary complexity for this initial release.
The modified Llama 3.1/3.2 model implementation includes new constructor options that allow either activation capture or the injection of a trained Sparse Autoencoder (SAE) model at specific layers during inference. This functionality is implemented through the following constructor parameters:
class Transformer(nn.Module):
    def __init__(
        self,
        params: ModelArgs,
        store_layer_activ: list[int] | None = None,
        sae_layer_forward_fn: dict[int, callable] | None = None,
    ):
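To make the two modes concrete, here is a rough usage sketch; build_model_args and load_sae are hypothetical placeholder helpers for illustration, not the project's actual loading API:

# Hypothetical usage sketch – helper names are placeholders, not the project's API.
params = build_model_args("Llama-3.2-3B")

# Activation capture mode: store the layer-23 residual stream during inference
model = Transformer(params, store_layer_activ=[23])

# SAE injection mode: route the layer-23 residual stream through a trained SAE
sae = load_sae("sae_checkpoint.pt")
model = Transformer(params, sae_layer_forward_fn={23: sae.forward})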
For the supporting code in the llama_3/ directory, I made a pragmatic decision to keep most of the auxiliary files from the original Llama models repository unchanged. 95% of this auxiliary code is unused and only needed for the chat formatter, which relies heavily on interconnected imports. However, since it wasn't research-critical, I decided not to recreate it and simply carried the files over.
The actual inference implementation is custom and contained in my llama_3_inference.py module, which provides streaming capabilities for both chat and text completion tasks and will primarily be of use when testing and validating the results. The implementation allows for batched inference and features configurable temperature and top-p sampling parameters, with an automatic fallback to greedy sampling when the temperature is set to 0.
For the data capture I created a custom variant of the OpenWebText dataset that processes text at the sentence level, capturing 25 million sentences with a maximum length of 192 tokens each. This extensive data collection resulted in 4 TB of raw activation data, which compresses to 3.2 TB of tar.gz archives. In total I captured approximately 700 million activations from these 25 million contexts, with sentences averaging 27.3 tokens in length.
While this dataset is roughly an order of magnitude smaller than those used by Anthropic or Google DeepMind (who both used around 8 billion unique activations), it still provides a substantial foundation for training an initial SAE model in my opinion. To compensate for the smaller dataset size I trained the SAE for 10 epochs, effectively processing the same number of total activations as Anthropic and Google DeepMind – the key difference being that my SAE encounters each activation 10 times instead of only once. This approach was simply a matter of monetary constraints for a non-profit side project, as scaling to match their one-epoch approach would have increased my GCP bucket costs from approximately $80/month (for 3.2 TB w/ traffic) to $800/month (for 32 TB w/ traffic), not to mention the significantly higher costs for instance SSDs during training.
The decision to process data at the sentence level was intentional and grounded in several key considerations. Sentences represent natural linguistic units that contain complete thoughts and concepts, which I hoped would lead to more interpretable and semantically meaningful features. This approach avoids artificial truncation of context and prevents contextual bleed of meanings across sentence boundaries, while still capturing essential contextual relationships within grammatically complete units. My thinking was that this makes it easier to attribute discovered features to specific linguistic or semantic phenomena. This approach was also chosen so that the training dataset contains activations aligned with the same linguistic units that I later intended to use for the interpretability analysis.
Furthermore, I specifically chose to process sentences without the 'beginning-of-sequence' (bos) token to avoid position-specific patterns, as my goal was to interpret features based on their meaning alone, independent of their position in a sequence. Since sentences represent natural semantic units that provide sufficient context for meaningful interpretation while remaining specific enough to identify distinct concepts, this aligns well with the goal of having an LLM analyze the semantic content that activates specific latents.
From a technical implementation perspective, I captured residual stream activations after layer normalization from Llama 3.2-3B, specifically from layer 23 out of 28 (positioned just shy of 5/6th of the model's depth, following OpenAI's implementation). The capture process uses a distributed setup with NCCL for single-node multi-GPU inference, with asynchronous disk I/O handled by a separate process to prevent GPU processing bottlenecks. The entire data capture process took approximately 12 hours using 4x Nvidia RTX 4090 GPUs.
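As a minimal illustration of the capture mechanism (not the project's actual capture_activations.py code, and with the exact placement of the normalization step omitted), residual activations of a single layer can be grabbed with a plain PyTorch forward hook:

import torch

def make_capture_hook(storage: list):
    """Store a detached float32 copy of a layer's output on every forward pass."""
    def hook(module, inputs, output):
        # output: residual stream activations of shape (batch, seq_len, d_model)
        storage.append(output.detach().to(torch.float32).cpu())
    return hook

# Assumed layout for illustration: `model.layers[23]` is the 24th transformer block,
# and `tokens` is an already tokenized input batch.
captured: list[torch.Tensor] = []
handle = model.layers[23].register_forward_hook(make_capture_hook(captured))
with torch.no_grad():
    model(tokens)   # regular forward pass; activations accumulate in `captured`
handle.remove()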
OpenAI's implementation also uses residual stream activations after layer normalization, with a context length of 64 tokens, sampling 5/6th of the way through the GPT-4 model (though they sample more toward the middle layers for smaller models). Anthropic's implementation differs more substantially, using 8 billion MLP activation vectors from 40 million contexts (250 tokens per context) and applying them to residual stream activations at the middle layer of the Claude 3 Sonnet model. In hindsight, there could be potential for improvement by capturing earlier layers in future experiments for a relatively small 3B parameter model.
I am personally a big fan of data preprocessing when complex batching is necessary. In this case, the challenge was to create batches of 1024 activations each while handling activations of variable sequence lengths that require carryover handling, potentially in a multiprocessing context. Given the high risk of batching bugs or I/O-related performance bottlenecks, I made the decision to implement a preprocessing phase rather than handling these complexities during training.
Since preprocessing was already necessary, I also took the opportunity to calculate the mean tensor across all activations using Welford's algorithm. This algorithm was specifically chosen for its numerical stability and memory efficiency when processing very large datasets. The calculated mean serves as the initial value for the b_pre bias term in the SAE model. While OpenAI's implementation uses the geometric median instead of the mean for initializing b_pre, they only compute it for the first ~30,000 samples of their dataset rather than the complete set, and given that b_pre is optimizable anyway, I decided that using the mean as an initial approximation would be sufficient.
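For illustration, a batched variant of this Welford-style running-mean computation looks roughly like this (a sketch, not the project's sae_preprocessing.py code):

import torch

def running_mean(batches):
    """Numerically stable running mean over an iterable of (n_i, d_model) activation tensors."""
    mean, count = None, 0
    for batch in batches:
        n = batch.shape[0]
        batch_mean = batch.double().mean(dim=0)
        count += n
        if mean is None:
            mean = batch_mean
        else:
            mean += (batch_mean - mean) * (n / count)  # incremental update, no full-dataset pass
    return mean

# In this project, the resulting mean tensor initializes the SAE's b_pre bias term.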
The entire preprocessing pipeline is implemented with full CPU parallelization through multiprocessing, ensuring efficient processing of the large activation dataset. This approach simplifies the training process by providing clean, pre-batched data.
The core of the Sparse Autoencoder implementation follows a straightforward encoder-decoder architecture, with design choices mainly following those made by OpenAI. The full forward pass of the TopK autoencoder includes two key bias terms: b_pre for both encoder and decoder (initialized as the mean calculated during preprocessing, as described in the previous section), and b_enc specifically for the encoder (initialized randomly). The complete forward pass can be described as:
Encoder: h = TopK(W_enc(x - b_pre) + b_enc)
Decoder: x^ = W_dec * h (+ h_bias) + b_pre
Sparsity in the latent space is enforced through the TopK activation function, which keeps only the k largest activations and sets the rest to zero. This approach directly controls sparsity without requiring an L1 penalty term for sparsity in the loss function, as required by Anthropic's approach. The model includes an optional h_bias parameter that remains disabled during training but can be activated afterwards for feature steering, allowing dynamic manipulation of the latent space post-training.
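A minimal sketch of this forward pass in PyTorch (simplified relative to the project's sae.py, e.g. initialization and input normalization are omitted):

import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    def __init__(self, d_model: int, n_latents: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, n_latents, bias=False)
        self.decoder = nn.Linear(n_latents, d_model, bias=False)
        self.b_pre = nn.Parameter(torch.zeros(d_model))     # initialized from the dataset mean
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.h_bias = nn.Parameter(torch.zeros(n_latents))  # zero during training, used later for steering

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        pre_act = self.encoder(x - self.b_pre) + self.b_enc
        top = torch.topk(pre_act, self.k, dim=-1)            # keep only the k largest activations
        h = torch.zeros_like(pre_act).scatter_(-1, top.indices, top.values)
        return self.decoder(h + self.h_bias) + self.b_pre    # reconstruction x^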
For numerical precision I chose to work with the float32 dtype due to its fast and exact conversion compatibility with Llama's required bfloat16 dtype. Both formats share the same layout of 1 sign bit and 8 exponent bits, differing only in their mantissa bits (23 vs 7), which makes conversions fast and exact.
My implementation differs from both Anthropic's and OpenAI's approaches in several ways. Anthropic uses a single-hidden-layer MLP with ReLU activation and enforces sparsity through an L1 penalty instead of TopK. They also worked with massively larger latent sizes (~1M, ~4M, and ~34M features), though their average number of active features per token remained below 300 for all SAEs. OpenAI's architecture is more similar to my implementation, but they experimented with latent sizes from 2^11 (2,048) up to 2^24 (16.7M) for GPT-4. Their experiments showed that larger latent sizes generally produced better loss and feature explainability, while lower k values (fewer active latents) led to more interpretable features.
For this project – working with the 3B-parameter Llama 3.2 model and its residual stream dimension of 3,072 – I chose a latent dimension of 2^16 (65,536) and a k value of 64. This decision aims to strike a balance between several factors: providing sufficient feature capacity at approximately 21x the residual stream dimension, maintaining computational efficiency as recommended by the OpenAI and Google DeepMind papers, and staying within the project's monetary constraints when training on ~8 billion activations for comparability. The k value of 64 was picked to achieve a good balance between reconstruction power and the strong sparsity needed for interpretable features.
In hindsight, however, as I discuss in section 6, I would considerably increase the latent size and decrease the k value in future experiments to improve the variety and interpretability of the features, and would try to find efficiency improvements to stay within budget constraints. As a first full run of the project, I am nevertheless very pleased with the chosen hyperparameters and results.
The training configuration of the Sparse Autoencoder was chosen to balance efficiency and feature interpretability. The core hyperparameters reflect both the model architecture and the training dynamics:
# Setup configuration
d_model = 3072
n_latents = 2**16  # 65536
k = 64
k_aux = 2048
aux_loss_coeff = 1 / 32
dead_steps_threshold = 80_000  # ~1 epoch in training steps
sae_normalization_eps = 1e-6
batch_size = 1024
num_epochs = 10
early_stopping_patience = 10  # disabled
learning_rate = 5e-5
learning_rate_min = learning_rate / 5
optimizer_betas = (0.85, 0.9999)
optimizer_eps = 6.25e-10
dtype = torch.float32
dataloader_num_workers = 8
logs_per_epoch = 1000
train_val_split = 0.95
The loss function combines a main reconstruction loss, derived from the reconstruction error, with a more involved auxiliary loss designed to prevent and revive dead latents, in the following way: total_loss = main_loss + aux_loss_coeff * aux_loss. Following OpenAI's approach, I set aux_loss_coeff = 1/32. Both losses are computed in normalized space to ensure an equal contribution from all features regardless of their original scale, which helps maintain numerical stability throughout training.
The auxiliary loss was proposed by OpenAI and plays an important role in preventing dead latents through a clever mechanism: it calculates the MSE between the main reconstruction residual (the difference between input and main reconstruction) and a special auxiliary reconstruction. This auxiliary reconstruction uses the same pre-activation latents as the main reconstruction, but it takes only the top-k_aux activation values from latents that haven't fired recently (tracked through the distributed stats_last_nonzero tensor) and sends them through the decoder again to obtain this 'auxiliary reconstruction'. This gives the k_aux = 2048 inactive latents that haven't activated during the actual training (where only the top k latents are used for the reconstruction) a dedicated learning signal to capture information missed by the main latents. This makes the dead latents more likely to activate in future forward passes and thereby allows reviving dead latents, keeping all latents alive and useful.
Only dead latents are considered for the auxiliary loss. A latent is considered dead if it hasn't been activated in dead_steps_threshold training steps (set to 80,000 steps, approximately one epoch in my setup), which equates to no activation in the reconstruction of the last ~650M activations given a batch size of 8192. This threshold serves two purposes: it ensures the main loss has sufficient warm-up time before receiving an auxiliary loss signal, and it ensures that we only try to revive latents that haven't fired once after seeing all unique training data activations.
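A rough sketch of how the combined loss can be computed (simplified, with plain unnormalized MSE, and a generic steps_since_fired counter standing in for the project's stats_last_nonzero bookkeeping):

import torch
import torch.nn.functional as F

def sae_losses(x, pre_act, decoder, b_pre, steps_since_fired,
               k=64, k_aux=2048, dead_steps_threshold=80_000, aux_loss_coeff=1 / 32):
    """Main TopK reconstruction loss plus an auxiliary loss that gives dead latents
    a learning signal by reconstructing the residual missed by the main pass."""
    # Main reconstruction from the top-k pre-activations
    top = torch.topk(pre_act, k, dim=-1)
    h = torch.zeros_like(pre_act).scatter_(-1, top.indices, top.values)
    recon = decoder(h) + b_pre
    main_loss = F.mse_loss(recon, x)

    # The auxiliary loss only kicks in once at least k_aux latents are dead (see the Fig 3 discussion below)
    dead_mask = steps_since_fired >= dead_steps_threshold            # (n_latents,) bool
    if int(dead_mask.sum()) < k_aux:
        return main_loss, main_loss, torch.zeros((), device=x.device)

    # Auxiliary reconstruction from the top-k_aux pre-activations of dead latents only
    dead_pre = pre_act.masked_fill(~dead_mask, float("-inf"))
    aux_top = torch.topk(dead_pre, k_aux, dim=-1)
    h_aux = torch.zeros_like(pre_act).scatter_(-1, aux_top.indices, aux_top.values)
    aux_recon = decoder(h_aux)
    aux_loss = F.mse_loss(aux_recon, (x - recon).detach())           # target: main reconstruction residual

    return main_loss + aux_loss_coeff * aux_loss, main_loss, aux_loss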
The training infrastructure supports distributed training using the NCCL backend for a single-node multi-GPU setup. Using 8x Nvidia RTX 4090 GPUs for 10 epochs with a per-GPU batch size of 1024 (effective batch size 8192), I processed approximately 7 billion activations over a span of a little over 7 days. The number of epochs was chosen to match the total number of processed activations in Anthropic's and Google DeepMind's experiments. All training progress, including losses and debugging statistics about dead latents, was tracked comprehensively via Weights & Biases.
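For reference, such a single-node multi-GPU setup boils down to standard PyTorch DistributedDataParallel boilerplate along these lines (a generic sketch, not the project's sae_training.py; TopKSAE refers to the sketch above):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for every spawned process
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

sae = TopKSAE(d_model=3072, n_latents=2**16, k=64).cuda(local_rank)
sae = DDP(sae, device_ids=[local_rank])   # gradients are all-reduced across the 8 GPUs
# training loop: per-GPU batch size 1024 -> effective batch size 8192 across 8 GPUs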
The parameters of the optimizer were carefully tuned for the specific challenges of training a sparse autoencoder, where some features activate extremely rarely. I chose a base learning rate of 5e-5 after comparative testing showed it achieved similar optimization speed to higher rates while promising better fine-tuning potential for sparse features in later training stages. The learning rate follows a cosine annealing schedule down to a minimum of 1e-5 (1/5 of the initial rate).
The AdamW configuration required special consideration for the sparse nature of the autoencoder (a combined sketch of the optimizer setup follows below):
- beta_1 = 0.85 (lower than the typical 0.9, to make individual updates more meaningful given the large effective batch size of 8192 and the sparse nature of the autoencoder)
- beta_2 = 0.9999 (accommodates sparse activation patterns where some features might activate very rarely and therefore need longer momentum preservation)
- eps = 6.25e-10 (provides sufficient numerical stability for float32 precision while allowing the precise parameter updates needed for optimizing rare activation patterns)
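Put together, the optimizer and schedule described above look roughly like this (a sketch; the total step count is an assumption derived from the epoch length mentioned earlier):

import torch

total_training_steps = 800_000   # assumption: ~80k steps per epoch x 10 epochs

optimizer = torch.optim.AdamW(
    sae.parameters(),
    lr=5e-5,
    betas=(0.85, 0.9999),   # slower beta_1, longer beta_2 momentum for rarely-firing features
    eps=6.25e-10,
)
# Cosine annealing from 5e-5 down to the minimum of 1e-5 over the full run
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=total_training_steps, eta_min=1e-5
)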
Weight initialization and normalization were implemented with particular attention to training stability, as suggested by the OpenAI paper. The encoder and decoder weights are initialized orthogonally (with the decoder as the transpose of the encoder) to ensure unbiased, independent initial feature directions. Input features are normalized with a small epsilon term for training robustness. Following empirical findings from both the OpenAI paper and Bricken et al. [2023], the decoder weights are explicitly normalized to unit norm after initialization and after each training step, as this improves MSE performance.
A key implementation detail is the gradient projection via project_decoder_grads(), which maintains the unit-norm constraint on the decoder weights by removing gradient components parallel to the existing dictionary vectors. This projection helps stabilize training and prevents the autoencoder from learning redundant or degenerate features when identifying sparse patterns in the data:
def project_decoder_grads(self):
    """Project out gradient information parallel to dict vectors."""
    # Compute dot product of decoder weights and their grads, then subtract the projection
    # from the grads in place to save memory
    proj = torch.sum(self.decoder.weight * self.decoder.weight.grad, dim=1, keepdim=True)
    self.decoder.weight.grad.sub_(proj * self.decoder.weight)
My implementation differs from both Anthropic's and OpenAI's approaches in several ways. Anthropic uses a combination of L2 reconstruction error and an L1 penalty for sparsity, performs periodic neuron resampling for dead neurons, and modifies gradient updates for decoder weight normalization. While they used the same batch size of 8192, their approach to maintaining sparsity and handling dead neurons is quite different from my TopK implementation.
OpenAI's implementation is much closer to mine but uses different Adam optimizer settings (beta_1 = 0.9, beta_2 = 0.999), maintains a constant learning rate, performs gradient clipping for stability, and considers neurons dead much earlier (no activation after 10M tokens). They also use an EMA of the weights instead of the raw optimization weights and opted for an extremely large batch size of 131,072 tokens.
The training process ran for ~7 days on 8x Nvidia RTX 4090 GPUs, demonstrating stable and efficient convergence throughout. The training progression showed a nice logarithmic decay in the loss function, ultimately achieving a final total normalized loss of approximately 0.144:
Fig 1: SAE Training – Total Loss
The validation loss was computed on a held-out 5% of the training data and showed a similar logarithmic decay pattern, though expectedly less steep than the training loss:
Fig 2: SAE Training – Validation Loss
A particularly interesting aspect of the training dynamics emerged after the initial warm-up period of 80,000 training steps. At this point, about 40% of the latents were identified as "dead" – meaning they hadn't activated once so far. However, the auxiliary loss mechanism proved remarkably effective at reviving these dead latents quickly:
Fig 3: Dead Latents Ratio – Rapid decrease after warm-up period, stabilizing at the minimum threshold
The auxiliary loss started quite high but also showed rapid decay as it successfully revived dead latents:
Fig 4: SAE Training – Aux Loss
An interesting implementation detail emerged concerning the auxiliary loss calculation: it only triggers when at least k_aux (2,048) latents are dead, effectively establishing a soft lower bound on dead latents at approximately 3% (2,048/65,536), which is very apparent in Figure 3. I initially implemented this condition as an optimization to avoid unnecessary auxiliary loss calculations when few latents were dead. Surprisingly to me, the auxiliary loss mechanism was so effective at reviving dead latents that it consistently drove the dead latent count toward this lower bound, particularly in the later stages of training, where the auxiliary loss was frequently zero due to insufficient dead latents to trigger the calculation.
One reason why I was surprised by such an effective revival of dead latents was that I expected a much higher percentage of dead latents. Anthropic and OpenAI both reported up to 65% dead latents in certain configurations, though admittedly their latent sizes were 1 to 2 orders of magnitude larger than mine. The effectiveness of the auxiliary loss implementation, combined with the gradient projection technique for stabilizing training dynamics, appears to make for very robust training. For future experiments though, removing the minimum dead latents threshold for the auxiliary loss calculation could potentially allow for even fewer dead latents, though I am pleased with the results of the current implementation.
The interpretability analysis approach builds upon methods established in Anthropic's research on scaling monosemanticity, but with a key difference in granularity. While Anthropic primarily focused on single-token analysis, this implementation captures and analyzes complete sentences – specifically the top 50 sentences that most strongly activate each latent. The activation strength is calculated using a mean and a last-token aggregation across all tokens in a sentence, which is intended to hopefully provide a more holistic view of semantic activation patterns in Llama 3.2-3B's intermediate layers.
As I already addressed in section 2, the decision to use sentence-level analysis instead of token-level analysis was intentional and based on the hope of combining linguistic principles with a simple approach for a first release. Sentences represent natural linguistic units that contain complete thoughts and, in my opinion, provide a great balance between context and specificity. This approach prevents both the artificial truncation of context and the potential mixing of meanings across sentence boundaries (contextual bleed). To aggregate all latent activations in a sequence I primarily chose to rely on two methods (a small sketch follows below):
- mean aggregation, to hopefully discover features that maintain a consistent activation throughout a sentence, highlighting a sustained semantic theme
- last aggregation (simply taking the last token's activations), to hopefully leverage an LLM's autoregressive nature and capture the final representation that has seen the whole sentence through self-attention
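A minimal sketch of these two aggregation modes, assuming the per-token SAE latent activations for a sentence are already available (names and shapes are illustrative, not the project's actual code):

import torch

def aggregate_latents(latents: torch.Tensor, mode: str = "mean") -> torch.Tensor:
    """Collapse per-token latent activations of shape (seq_len, n_latents)
    into a single (n_latents,) score vector for the sentence."""
    if mode == "mean":
        return latents.mean(dim=0)   # consistent activation across the whole sentence
    if mode == "last":
        return latents[-1]           # last token has attended to the full sentence
    raise ValueError(f"unknown aggregation mode: {mode}")

# Per latent, the 50 highest-scoring sentences are what get sent for interpretation.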
For the semantic analysis itself, I used the most advanced frontier LLM available to me at the time of this project: Claude 3.5 (specifically claude-3-5-sonnet-20241022) with a structured chain-of-thought prompt. I adopted an automated approach that allows for scalable interpretation while hopefully maintaining semantic specificity. The prompt is meant to guide Claude through specific analysis steps for all 50 provided sentences:
- Identify key words and phrases
- Group thematic elements
- Consider potential outliers
- Provide a final semantic describeation with a confidence score
This analysis pipeline is implemented in three stages: sending analysis requests in cost-effective batches, retrieving the responses, and parsing and processing the semantic interpretations. All intermediate data is preserved for reproducibility and further analysis.
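For illustration, a stripped-down version of what such a chain-of-thought request and its parsing could look like (the prompt wording and helper names here are assumptions, not the project's actual prompt or batching code):

import json

ANALYSIS_PROMPT = """You are analyzing which concept activates a specific SAE latent.
Given the 50 sentences below, work through these steps:
1. Identify key words and phrases
2. Group thematic elements
3. Consider potential outliers
4. Provide a final semantic interpretation with a confidence score

Finish with a JSON object: {{"common_semantic": "<description>", "certainty": <0.0-1.0>}}

Sentences:
{sentences}"""

def build_request(latent_id: int, sentences: list[str]) -> dict:
    """Assemble one interpretation request for a single latent (to be sent in a batch)."""
    return {
        "custom_id": f"latent_{latent_id}",
        "prompt": ANALYSIS_PROMPT.format(sentences="\n".join(sentences)),
    }

def parse_response(response_text: str) -> dict:
    """Extract the trailing JSON verdict from the chain-of-thought answer."""
    return json.loads(response_text[response_text.rfind("{"):])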
To illustrate the analysis process, let's take a look at latent #896, which was identified as representing "References to United Nations institutions, personnel, operations, or official documentation using formal institutional terminology". Claude's detailed analysis included:
1. Key Word Analysis:
- Frequent terms: "UN", "United Nations", "Secretary-General"
- Official titles: "Special Rapporteur", "Under-Secretary-General", "Coordinator"
- Department names: "UNDP", "UNHCR", "OCHA", "UNODC"
2. Thematic Grouping:
- UN organizational structure references
- UN personnel and positions
- UN reports and documentation
- UN agencies and bodies
- UN operations and activities
3. Pattern Analysis:
- All sentences reference UN entities, personnel, or activities
- Formal institutional language
- Heavy use of official titles and department names
- References to official documents and reports
4. Strength Assessment:
- 50 out of 50 sentences contain direct UN references
- Mix of department names, personnel titles, and activities
- Consistent institutional terminology
- No outliers identified
5. Certainty Calculation:
- 100% of sentences contain UN references
- Very strong institutional terminology consistency
- Clear organizational focus
- Direct and explicit connections
The analysis reveals that every sentence relates to United Nations organizations, personnel, operations, or documentation, with extremely consistent institutional terminology and structure. The commonality is both clear and comprehensive.
{
"widespread_semantic": "References to United Nations institutions, personnel, operations, or official recordation using createal institutional terminology",
"declareivety": 1.0
}
From a cost perspective, this interpretability analysis proved remarkably efficient compared to the dataset capture, storage, and SAE training phases. Processing 24,828,558 input tokens and generating 3,920,044 output tokens with Claude 3.5 in batch mode cost only $66.74.
While this approach to semantic analysis is relatively simple, it was chosen as a solid initial method for both feature interpretation and potential feature steering. The sentence-level analysis helps avoid ambiguities around when specific latents should be activated, though this simplicity certainly comes at some cost to result quality. I considered developing more sophisticated interpretation methods, but this seemed like a complex research challenge that could potentially be debated and refined for months. Anthropic, for example, is not only publishing great papers on this topic but has also consistently published high-quality blog posts about it for years at transformer-circuits.pub. So for this initial release, I opted for a simpler approach that validates my full pipeline first before potentially making improvements to it in the future.
As summarized, my approach intentionally differs quite considerably from both Anthropic's and OpenAI's in several key aspects:
Anthropic's approach to interpretability focuses on analyzing individual token activations in Claude 3 Sonnet, working with sparse autoencoders containing up to 34M features. Their methodology combines manual feature inspection with comprehensive validation through steering experiments and ablation studies. The validation protocol specifically examines feature interactions, activation patterns, and the impact of feature manipulation on model behavior. This multi-faceted approach allows them to validate both the existence and significance of identified features while providing insights into how these features contribute to the model's overall behavior.
OpenAI's implementation similarly focuses on individual token activations but takes a different approach to analysis, examining both specific activations and random activating examples. Their methodology emphasizes automated interpretability at scale, utilizing a number of technical metrics. These include probe loss measurements, N2G (Neuron to Graph) pattern matching for feature identification, and multiple quality assessment metrics such as downstream loss, ablation sparsity, and explanation precision/recall. Furthermore, OpenAI is also very systematic in assessing the quality of the discovered features, using multiple quantitative metrics to evaluate the reliability and usefulness of identified features.
The verification and testing infrastructure consists of three main components designed to analyze and validate the SAE's impact on model behavior:
llama_3_inference_chat_completion_test.py
llama_3_inference_text_completion_test.py
llama_3_inference_text_completion_gradio.py
These scripts enable both feature activation analysis and feature steering through text and chat completion tasks. Each implementation supports batched inference (treating each line as a separate batch element), configurable temperature and top-p parameters, and most importantly the ability to inject a trained SAE model for feature analysis and steering.
The semantic meanings and certainty scores for each latent – derived in the earlier interpretability analysis in section 4 – are stored in the latent_index_meaning/ directory. These processed results serve as the basis for both feature activation analysis and steering experiments. To show the practical application of these tools, let's look at a concrete example using four sample input prompts, text-completed in the terminal UI with the settings max_new_tokens=128, temperature=0.7, top_p=0.9, seed=42, as shown in figure 5:
The delegates gathered at the
Foreign officials released a statement
Humanitarian staff organized their efforts
Senior diplomats met to discuss
Fig 5: Terminal UI inference without feature steering
Aside from feature activation analysis, it is also possible to perform feature steering experiments with the same sample sentences and configuration. While this is also possible in the terminal UI, figure 6 shows such a feature steering run using the Gradio UI for the sake of demonstration. In this example latent #896 is targeted, which earlier analysis identified as representing "References to United Nations institutions, personnel, operations, or official documentation using formal institutional terminology". By increasing this latent's activation value by 20 through the dynamically adjustable h_bias (see section 3.1 for a reminder of the placement of h_bias), the model's text completion can successfully be steered toward UN-related content.
Fig 6: Gradio UI inference with feature steering towards UN-related content
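In code, this kind of steering amounts to little more than biasing one latent before the decoder of an injected SAE (a sketch using the TopKSAE sketched earlier; the offset of 20 and latent #896 follow the example above):

import torch

def steer_latent(sae, latent_idx: int, strength: float):
    """Enable feature steering by biasing a single latent before the decoder."""
    with torch.no_grad():
        sae.h_bias.zero_()
        sae.h_bias[latent_idx] = strength

steer_latent(sae, latent_idx=896, strength=20.0)   # latent #896: UN-related content
# The steered SAE is then injected at layer 23 via the Transformer's
# sae_layer_forward_fn constructor argument during text completion.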
The feature steering is not particularly strong in this first beta version of the project, which is why in the example above only the second and third sentences flip over to UN-related content. Because of this, the sample prompts were also chosen so that the start of the text completion provides a chance that the completion can be steered towards the United Nations, as feature steering towards the UN for a sentence that starts with "For any n, if 2n – 1 is odd" would certainly fail.
This limitation of feature steering stems from the focus on feature extraction rather than steering optimization during the interpretability analysis. However, while this means that the steering capabilities produce inconsistent results, it is worth emphasizing that feature extraction alone provides valuable insights into the model. Therefore I consider the ability to also perform feature steering a nice additional showcase and interesting demonstration for this first project release.
- Expanding the latent dimension to at least 2^18 (262,144) features while reducing k to 32. This would provide more capacity for discovering distinct features while maintaining stronger sparsity, potentially leading to more interpretable and monosemantic features. The increased computational demands would need to be offset somehow, potentially by increasing efficiency or implementing things like gradient accumulation.
- Implementing comprehensive activation tracking of the latents, e.g. by regularly logging the state of the stats_last_nonzero tensor throughout training, rather than just using the basic debug logs I used so far. This would provide deeper insights into how and when latents become active or die and how their activation is distributed.
- Adding support for analyzing feature interactions despite sparsity by tracking co-activation patterns in the latent space. Understanding how features work together could provide insights into more complex semantic representations and could potentially improve both interpretability and steering capabilities.
- Developing more sophisticated interpretability analysis methods, particularly in grouping high-activating sentences and n-grams. While the current sentence-level analysis provides a good foundation, more nuanced approaches to pattern recognition could uncover finer semantic features and improve our understanding of how the model represents information.
- Similarly, implementing not only feature extraction interpretability analysis but also feature steering interpretability analysis, though admittedly, with sufficiently sophisticated methods the two would coincide.
- Extending the research to include Llama 3.1-8B model activations. Since it shares the same codebase as Llama 3.2, this would be a very straightforward extension that would essentially only require an adjustment of the hyperparameters and a lot of compute power.
- Experimenting with different activation capture points, varying from the depth within the model (particularly earlier layers) to using different capture points inside the transformer block (e.g. using the attention head outputs or MLP outputs, as experimented with by Google DeepMind).
- Further improving the auxiliary loss mechanism based on the unexpectedly effective results in preventing dead latents. The current implementation already shows strong performance, but investigating the relationship between the minimum dead latents threshold and feature quality could lead to even better training dynamics.
- Experimenting with modifications to the SAE architecture, particularly around the bias terms and the main loss function. Given the success of the current implementation, focused adjustments to these components could potentially improve both training stability and feature interpretability while maintaining the benefits of the current design.
- Adding proper docstrings throughout the codebase. While I added inline documentation everywhere throughout the codebase, a proper addition of docstrings would be very beneficial. This is not how I'd normally deliver production code, but I simply didn't find the time to add proper docstrings and considered it sufficient for a first release of this side project.
If you use this code in your research or project, please cite:
@misc{pauls2024sae,
title = {Llama 3 Interpretability with Sparse Autoencoders},
author = {Paul Pauls},
year = {2024},
publisher = {GitHub},
howpublished = {\url{https://github.com/PaulPauls/llama3_interpretability_sae}}
}