Technical Deep Dive

Intro to Model Merging

A hands-on walkthrough of model merging: what task vectors are, how to merge two LLMs with mergekit, and how to evaluate the result.

I've been fascinated by model merging and have been playing with it recently. I wanted to understand how it works and these are my findings.

Adding new capabilities to foundation models like LLaMA and Mistral has been an area of focus for AI practitioners looking to adapt LLMs to their applications. Model merging has emerged as a new technique for this, and merged models have been dominating the LLM Leaderboard. We'll walk through how to merge two models to create an improved one. We'll then learn how to evaluate the new model's performance and benchmark it against current state-of-the-art models.

What is Model Merging?

Task Vectors - The core idea in model merging is derived from the concept of task vectors. Once you have finetuned a model on a specific task, subtracting the base model's weights from the finetuned weights gives you a "vector" that captures the modifications needed for that task.

All model merging approaches work by combining the task vectors in different ways. Some approaches include Linear Interpolation (LERP), Spherical Linear Interpolation (SLERP), TIES, and DARE.

The intuition is that if you have different models that are good at different things, you can combine their task vectors (for example, by averaging them in different ways) to produce a new model that is good at both tasks. If model A is good at math and model B is good at programming, you can merge the two to produce a model that is good at both, as sketched below.
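Here's a minimal sketch of the idea in Python, operating on model state dicts (dictionaries of weight tensors). The function and variable names are hypothetical placeholders for illustration, not mergekit's API -

def task_vector(finetuned_sd, base_sd):
    # Task vector = finetuned weights minus base weights, per parameter tensor
    return {name: finetuned_sd[name] - base_sd[name] for name in base_sd}

def merge_task_vectors(base_sd, task_vectors, weight=0.5):
    # Add a scaled sum of the task vectors back onto the base weights
    return {name: base_sd[name] + weight * sum(tv[name] for tv in task_vectors)
            for name in base_sd}

# Hypothetical usage, with state dicts loaded elsewhere:
# tau_math = task_vector(math_model_sd, base_sd)
# tau_code = task_vector(code_model_sd, base_sd)
# merged_sd = merge_task_vectors(base_sd, [tau_math, tau_code], weight=0.5)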

We tried merging two models with a popular technique called SLERP (Spherical Linear Interpolation), which resulted in a better model than either of its parents. We'll go through it step by step next.
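For reference, SLERP interpolates along the arc between two weight vectors rather than along a straight line, which better preserves their magnitudes and directions. Here's a minimal sketch of the formula applied to two flattened weight tensors; mergekit's implementation handles more edge cases and per-tensor details -

import torch

def slerp(t, v0, v1, eps=1e-8):
    # Angle between the two weight vectors, computed from normalized copies
    v0_n = v0 / (v0.norm() + eps)
    v1_n = v1 / (v1.norm() + eps)
    omega = torch.arccos(torch.clamp((v0_n * v1_n).sum(), -1.0, 1.0))
    if omega.abs() < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return (1 - t) * v0 + t * v1
    sin_omega = torch.sin(omega)
    # t = 0 returns v0 (base model), t = 1 returns v1 (other model)
    return (torch.sin((1 - t) * omega) / sin_omega) * v0 + \
           (torch.sin(t * omega) / sin_omega) * v1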

How to merge models using MergeKit

Mergekit has emerged as the library of choice for researchers dabbling with model merging. This library implements all the popular algorithms, and makes it easy to experiment with different parameters for merging.

For this post, we will use the model Pearl-7B-slerp as a reference. It is a merge of OmniBeagle-7B and WizardMath-7B-V1.1 and outperforms both of its parents on the GSM8K (Grade School Math) task. More details on the dataset can be found here.

Install mergekit -

git clone https://github.com/arcee-ai/mergekit.git
cd mergekit && pip install -q -e .

Define your config.yaml for merging the models. Here's the config from the model's page -

slices:
  - sources:
      - model: mlabonne/OmniBeagle-7B
        layer_range: [0, 32]
      - model: WizardLM/WizardMath-7B-V1.1
        layer_range: [0, 32]
merge_method: slerp
base_model: mlabonne/OmniBeagle-7B
parameters:
  t:
    - filter: self_attn
      value: [0, 0.5, 0.3, 0.7, 1]
    - filter: mlp
      value: [1, 0.5, 0.7, 0.3, 0]
    - value: 0.5
dtype: bfloat16

The slices key lets you specify which portions (or "slices") of layers from each model to include in the merged model. The t parameter is the interpolation factor, and supplying a list of values lets you vary it across the layer range for specific parameter types. For the self-attention layers, the first value (0) means that at the start of the layer range the blend completely favors the base model's parameters, and the last value (1) means that at the end of the range it completely favors the other model's parameters; values in between are blended across the intermediate layers. The standalone value: 0.5 acts as the default for any tensors not matched by a filter.
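To make the layer-wise blending concrete, here's a rough illustration (not mergekit's actual code) of how a gradient list like [0, 0.5, 0.3, 0.7, 1] could be expanded into one t value per layer via piecewise-linear interpolation -

import numpy as np

def expand_gradient(values, num_layers=32):
    # Spread the listed anchor values evenly across the layer range,
    # then interpolate linearly to get one t per layer
    anchors = np.linspace(0, 1, len(values))
    positions = np.linspace(0, 1, num_layers)
    return np.interp(positions, anchors, values)

self_attn_t = expand_gradient([0, 0.5, 0.3, 0.7, 1])
print(self_attn_t[:3])   # near 0: early layers favor the base model
print(self_attn_t[-3:])  # near 1: late layers favor the other model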

Finally, execute the merge using this command -

mergekit-yaml config.yaml ./merged-model

This will store the output model in the merged-model folder; the merge takes around 10 minutes to run.
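As a quick sanity check before running full benchmarks, you can load the merged weights with transformers and generate from them (assuming a GPU with enough memory is available; the prompt is just an illustrative example) -

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# If the tokenizer was not copied into merged-model, load it from the base model instead
tokenizer = AutoTokenizer.from_pretrained("./merged-model")
model = AutoModelForCausalLM.from_pretrained(
    "./merged-model", torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "Natalia sold clips to 48 of her friends in April, and then half as many clips in May. How many clips did Natalia sell altogether?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))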

Evaluate Model Performance

lm-evaluation-harness is one of the most popular open source tools for evaluating model performance. It has pre-built definitions for the standard tasks used to evaluate models. For this merge, we can use it to evaluate performance on the GSM8K dataset. Here's how you can install the library -

git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Once the library is installed, running an evaluation job is straightforward. Here's the command you can use to test the merged model -

lm_eval --model hf --model_args pretrained=merged-model/ --tasks gsm8k --device cuda:0 --batch_size 2

For this experiment, the merged model returned the following results -

|Tasks|Filter  |n-shot|  Metric   |Value |   |Stderr|
|-----|----------|-----:|-----------|-----:|---|-----:|
|gsm8k|get-answer|     5|exact_match|0.7809|±  |0.0114|

The ~78% accuracy is better than both base models: OmniBeagle-7B (71.6) and WizardMath-7B-V1.1 (73.0).

Conclusion

More Art than Science - Like everything in the LLM world, these approaches are a bit of a black box. While there is some intuitive reasoning behind them, merging is still more of an art than an exact science.

By blending the strengths of specialized models, model merging opens up new ways to achieve improved performance across diverse tasks. For AI practitioners, understanding and applying this technique can be a valuable addition to your toolkit.

We'll be posting more blogs about new developments in this space. Subscribe to stay posted!