Helpful, harmless, honest fusion.
We boost the performance of the resulting fused model by fine-tuning its MoE layers.
$ pip install -r requirements.txt
In the config.py file, please provide your Hugging Face token by assigning it to the variable hf_token.
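For reference, the relevant line in config.py would look something like the sketch below; the token value is a placeholder, and the file may contain other settings that you should leave untouched:

```python
# config.py (sketch) -- only the token assignment is shown
hf_token = "hf_xxxxxxxxxxxxxxxxxxxx"  # placeholder: paste your Hugging Face access token here
```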
You may download the raw models/ directory. The model it contains is the combined version of the individually aligned models, so its MoE layers still need to be fine-tuned.
Alternatively, you can create your own MoE model by first running align_basemodel.py for helpfulness, safety, and truthfulness to obtain three individually aligned models. For example:
$ python align_basemodel.py --moe_flag 0 --task_name helpfulness
$ python align_basemodel.py --moe_flag 0 --task_name safety
$ python align_basemodel.py --moe_flag 0 --task_name truthfulness
These commands create three models, one under each results/outputs/{task_name} directory.
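Before merging them, you can optionally verify that all three individually aligned models were produced. The helper below is a hypothetical convenience script, not part of the repository:

```python
# check_aligned_models.py (hypothetical helper): verify the three aligned-model
# output directories exist before building the MoE model
from pathlib import Path

tasks = ["helpfulness", "safety", "truthfulness"]
missing = [t for t in tasks if not Path("results/outputs", t).exists()]
if missing:
    raise SystemExit(f"Missing aligned models for: {', '.join(missing)}")
print("All three individually aligned models are present.")
```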
Then, run the following to create the MoE model:
$ cd models
$ python llama_moe.py
This produces the model file moe.pt. This model is also raw: it needs to be fine-tuned so that the newly introduced weights in its MoE layers are trained.
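If you want to confirm which parameters the merge introduced, the sketch below lists the MoE-related tensors. It assumes moe.pt is a standard PyTorch checkpoint holding a state dict (or a saved module); adjust the key filter if the actual layer names differ:

```python
# inspect_moe.py (sketch): list the newly introduced MoE parameters in moe.pt
import torch

checkpoint = torch.load("models/moe.pt", map_location="cpu")
# handle either a plain/nested state dict or a full saved module
state_dict = checkpoint.get("state_dict", checkpoint) if isinstance(checkpoint, dict) else checkpoint.state_dict()
moe_keys = [k for k in state_dict if "gate" in k or "expert" in k]
print(f"Found {len(moe_keys)} MoE-related tensors, e.g. {moe_keys[:5]}")
```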
Here is how to train, test, and run inference.
Run align_basemodel.py with the following flags:
- --task_name: set "mix" for helpful, honest, and harmless model tuning.
- --moe_flag: set 1 for MoE alignment.
- --expert_topk: controls the number of experts used in the MoE layer.
- --gate_loss_weight: controls the gating loss weight $\lambda$, e.g. 0.01 (see the objective sketch after this list).
- --helpfulness_weight: controls the regularization $\gamma_{1}$ on the helpfulness expert at the MoE layer, e.g. 0.001; the default value is 0.
- --safety_weight: controls the regularization $\gamma_{2}$ on the safety expert at the MoE layer, e.g. 0.001; the default value is 0.
- --truthfulness_weight: controls the regularization $\gamma_{3}$ on the truthfulness expert at the MoE layer, e.g. 0.001; the default value is 0.
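Read together, one natural interpretation of these weights (our reading of the flags, not a statement of the exact implementation) is a combined training objective of the form

$$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{gate}} + \gamma_{1}\,R_{\text{helpfulness}} + \gamma_{2}\,R_{\text{safety}} + \gamma_{3}\,R_{\text{truthfulness}},$$

where $\lambda$ scales the gating loss and setting any $\gamma_{i}$ to 0 (the default) disables the corresponding expert regularizer.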
An example run for aligning the MoE model using the gating loss and without regularization would be:
$ python align_basemodel.py --moe_flag 1 --expert_topk 2 --gate_loss_weight 0.01
The trained model will be stored in the ./results/outputs directory. To run inference with the stored model on the safety test dataset, run the following:
$ python inference.py --task_name safety --cross_task "mix_moe_reg_00000_00000_00000_gate_00100" --dataset_type test
Before performing any evaluation, you must run inference.py to obtain the model's outputs on the test set.
Here, we show how we evaluate our models, but there are two prerequisites:
- Helpfulness requires an OpenAI API key.
- Truthfulness requires you to train GPT-Judge, a fine-tuned text-davinci-003 model. Follow the instructions in TruthfulQA to create one model for truthfulness and one for informativeness.
First, go to the evaluator directory and run evaluate_helpfulness.py:
$ cd evaluator/
$ python evaluate_helpfulness.py
It will convert the model outputs to a JSON format, which is then compared against reference_model.json using the alpaca_eval library. Here is a sample call:
$ pip install alpaca-eval
$ export OPENAI_API_KEY=<>
$ export IS_ALPACA_EVAL_2=False
$ alpaca_eval --model_outputs aligned.json --reference_outputs reference_model.json --output_path alpaca_eval_output
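alpaca_eval expects the model outputs as a JSON list of instruction/output records (a generator field naming the model is also commonly included). A minimal sketch of the layout that the conversion step might produce, with a placeholder record:

```python
# build_alpaca_eval_inputs.py (sketch): write model outputs in the JSON layout
# consumed by alpaca_eval; field names follow alpaca_eval conventions, and the
# example record is a placeholder
import json

records = [
    {
        "instruction": "Explain what a mixture-of-experts layer does.",
        "output": "A mixture-of-experts layer routes each token to a small subset of expert networks...",
        "generator": "aligned",  # name of the model being evaluated
    }
]
with open("aligned.json", "w") as f:
    json.dump(records, f, indent=2)
```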
We use the PKU-Alignment/beaver-dam-7b model to perform safety evaluation. Here is a sample call:
$ cd evaluator
$ python evaluate_safety.py --task_name <infer_dir_name>
Here, <infer_dir_name> is the directory name under results/outputs, such as cross_mix_moe_reg_00000_00000_00000_gate_00100.
After training GPT-Judge, copy the engine names from your OpenAI account into the variables informative_engine_name and truthful_engine_name inside the evaluate_truth_and_info.py script.
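For example, the two assignments inside evaluate_truth_and_info.py would end up looking something like this; the engine names below are placeholders, and the real values come from your own fine-tuning jobs:

```python
# inside evaluate_truth_and_info.py (sketch): placeholder engine names for the
# two GPT-Judge models created by following the TruthfulQA instructions
truthful_engine_name = "<your-truthfulness-judge-engine>"        # placeholder
informative_engine_name = "<your-informativeness-judge-engine>"  # placeholder
```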
For truthfulness score, run:
$ python evaluate_truth_and_info.py --task_name <infer_dir_name> --mode 0
For informativeness score, run:
$ python evaluate_truth_and_info.py --task_name <infer_dir_name> --mode 1
