ReLearn: Unlearning via Learning for Large Language Models
Summary
The paper “ReLearn: Unlearning via Learning for Large Language Models” presents a novel method for unlearning in LLMs while preserving fluency and relevance. It introduces a data augmentation and fine-tuning pipeline as an alternative to gradient ascent (GA) and negative preference optimization (NPO), both of which degrade linguistic coherence.
How to Implement This Paper
To implement ReLearn, we will follow these key steps:
1️⃣ Understanding the Core Approach
- Data Augmentation: Generate diverse question-answer (QA) variations for forgetting while ensuring non-sensitive yet relevant responses.
- Fine-Tuning: Replace the knowledge to be forgotten with relevant but non-sensitive responses.
- Evaluation Metrics: Use the paper’s Knowledge Forgetting Rate (KFR), Knowledge Retention Rate (KRR), and Linguistic Score (LS) for performance assessment.
2️⃣ Setting Up the Development Environment
We need:
- A large language model (LLM): Llama-2-7b-chat or Gemma-2-2b-it (from the paper).
- Training framework: PyTorch with Hugging Face’s transformers and datasets.
- Fine-tuning tools: LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning.
- Evaluation libraries: sentence-transformers, nltk, and the GPT-4 API for linguistic evaluation.
3️⃣ Implementing the Unlearning Pipeline
We can break this into:
Step 1: Data Augmentation
- Augment QA pairs using LLM-based synthetic transformations:
- Simple Variant: Rephrase the question.
- Contextual Variant: Add context to generalize.
- Noise Variant: Introduce noise to make the model robust.
- Logical Variant: Change the logic of the question.
- Augment answers by replacing specific information with generic responses (see the sketch after this list).
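A minimal sketch of this augmentation step, assuming the openai package (not in the install list below) and an OpenAI-style chat model as the rewriter; the prompt templates and the augment_question helper are illustrative, not the paper's exact prompts:

from openai import OpenAI  # assumes the openai>=1.0 client; any instruction-tuned LLM would work

client = OpenAI()

# Illustrative prompt templates for the four variant types listed above.
VARIANT_PROMPTS = {
    "simple":     "Rephrase this question without changing its meaning: {q}",
    "contextual": "Rewrite this question, adding a short generic context sentence before it: {q}",
    "noise":      "Rewrite this question with minor typos and filler words, keeping it answerable: {q}",
    "logical":    "Rewrite this question so it asks for the same fact from a different logical angle: {q}",
}

def augment_question(question: str, variant: str) -> str:
    """Generate one augmented variant of a forget-set question (hypothetical helper)."""
    prompt = VARIANT_PROMPTS[variant].format(q=question)
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()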
Step 2: Content Verification
- Check for sensitive information in generated responses using GPT-based classifiers (see the sketch below).
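One way to implement this check is to ask a GPT-class model to act as a binary yes/no classifier; the auditor prompt below is an assumption rather than the paper's exact verifier:

from openai import OpenAI

client = OpenAI()

def contains_sensitive_info(answer: str, forbidden_facts: list[str]) -> bool:
    """Ask GPT-4 whether a generated answer still leaks any fact that should be forgotten."""
    prompt = (
        "You are a privacy auditor. Answer only YES or NO.\n"
        f"Facts that must NOT appear: {forbidden_facts}\n"
        f"Candidate answer: {answer}\n"
        "Does the candidate answer reveal any of these facts?"
    )
    verdict = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip().upper()
    return verdict.startswith("YES")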
Step 3: Data Diversification
- Convert QA into sentence completion tasks.
- Integrate generic knowledge from Wikipedia and chatbot datasets (see the sketch below).
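A sketch of how QA pairs can be recast as completion-style examples and mixed with generic text; the Wikipedia dataset name and slice are assumptions, not the paper's exact retain corpus:

from datasets import load_dataset

def qa_to_completion(pair: dict) -> dict:
    """Turn a {'question', 'answer'} pair into a prompt/completion example."""
    return {
        "prompt": f"Question: {pair['question']}\nAnswer:",
        "completion": f" {pair['answer']}",
    }

# Example forget pair rewritten with its non-sensitive replacement answer.
completion_example = qa_to_completion(
    {"question": "What is John Doe's email?",
     "answer": "John Doe's contact details are private."}
)

# Mix in generic text so the model keeps general fluency; any general-domain corpus works.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train[:1000]")
retain_examples = [{"prompt": "", "completion": ex["text"][:512]} for ex in wiki]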
Step 4: Fine-Tuning
- Use cross-entropy loss on augmented forget and retain datasets.
- Employ KL divergence loss to minimize the difference from the pre-unlearning model (see the sketch below).
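The combined objective can be sketched as cross-entropy on the augmented targets plus a KL term against the frozen pre-unlearning model; the kl_weight value and batch layout are assumptions, not the paper's hyperparameters:

import torch
import torch.nn.functional as F

def unlearning_loss(model, ref_model, batch, kl_weight=0.1):
    """Cross-entropy on augmented targets + KL term toward the frozen pre-unlearning model."""
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    labels=batch["labels"])
    ce_loss = outputs.loss  # standard next-token cross-entropy

    with torch.no_grad():
        ref_logits = ref_model(input_ids=batch["input_ids"],
                               attention_mask=batch["attention_mask"]).logits

    # Keep the current token distribution close to the reference distribution.
    kl_loss = F.kl_div(
        F.log_softmax(outputs.logits, dim=-1),
        F.softmax(ref_logits, dim=-1),
        reduction="batchmean",
    )
    return ce_loss + kl_weight * kl_loss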
Step 5: Evaluation
- Implement the KFR, KRR, and LS metrics.
- Use GPT-4 or another LLM for fluency and relevance evaluation (a rough proxy is sketched below).
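A lightweight proxy for KFR and KRR using sentence-transformers embeddings; the paper computes these metrics from entailment-style knowledge checks and derives LS from perplexity, entropy, and an LLM judge, so the cosine-similarity version and the 0.7 threshold below are only stand-ins:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def answer_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between a generated answer and a reference answer."""
    emb = embedder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def forgetting_rate(generated, forget_refs, threshold=0.7):
    """Proxy KFR: fraction of forget questions whose output no longer matches the original answer."""
    flags = [answer_similarity(g, r) < threshold for g, r in zip(generated, forget_refs)]
    return sum(flags) / len(flags)

def retention_rate(generated, retain_refs, threshold=0.7):
    """Proxy KRR: fraction of retain questions whose output still matches the reference answer."""
    flags = [answer_similarity(g, r) >= threshold for g, r in zip(generated, retain_refs)]
    return sum(flags) / len(flags)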
Step 1: Setting Up the Environment
Install dependencies:
pip install fastapi uvicorn transformers torch accelerate peft bitsandbytes sentence-transformers datasets
Step 2: Model Fine-Tuning (Unlearning)
We’ll use LoRA fine-tuning to apply the ReLearn unlearning approach. For brevity, this walkthrough trains only on the augmented forget data; the retain data and the KL term from Step 4 above would be added via a custom loss in a full implementation.
Load the Base Model
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # Change to Gemma if needed

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

# 8-bit loading via bitsandbytes keeps the 7B model on a single GPU
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, load_in_8bit=True, device_map="auto")
Prepare Unlearning Data
Create augmented question-answer pairs.
unlearning_data = [
{"question": "What is John Doe's email?", "answer": "John Doe's contact details are private."},
{"question": "Where does Alice live?", "answer": "Alice's location is not publicly available."}
]
Fine-Tune with LoRA
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments, Trainer

lora_config = LoraConfig(
    r=8, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    lora_dropout=0.1, bias="none", task_type="CAUSAL_LM"
)
model = prepare_model_for_kbit_training(model)  # required before adding LoRA to an 8-bit model
model = get_peft_model(model, lora_config)

# Tokenize each QA pair as a single prompt/answer sequence with causal-LM labels
def tokenize_data(pair):
    text = f"Question: {pair['question']}\nAnswer: {pair['answer']}"
    tokens = tokenizer(text, padding="max_length", truncation=True, max_length=128)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

train_dataset = [tokenize_data(pair) for pair in unlearning_data]

# Training arguments
training_args = TrainingArguments(
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    num_train_epochs=1,
    output_dir="./relearn-model",
    save_strategy="epoch"
)

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
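After training, save the LoRA adapter with model.save_pretrained("./relearn-model") so the API below can reload the unlearned model; if you prefer a standalone checkpoint, the adapter can also be merged into the base weights with model.merge_and_unload() (merging works best with a non-quantized base model).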
Step 3: API for Unlearning Queries
Create a FastAPI service.
API Server
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class QueryRequest(BaseModel):
    question: str

@app.post("/unlearn/")
async def unlearn_query(request: QueryRequest):
    # `model` and `tokenizer` are the unlearned model and tokenizer from Step 2
    input_ids = tokenizer.encode(request.question, return_tensors="pt").to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=50)
    response = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return {"response": response}
Run the API:
uvicorn main:app --reload
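A quick smoke test against the local server, assuming the default host and port:

import requests

resp = requests.post(
    "http://127.0.0.1:8000/unlearn/",
    json={"question": "What is John Doe's email?"},
)
print(resp.json()["response"])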
Step 4: Frontend with Streamlit (Optional)
Install Streamlit:
pip install streamlit
Create app.py:
import streamlit as st
import requests

st.title("ReLearn: AI Unlearning App")

question = st.text_input("Ask a question:")
if st.button("Submit"):
    response = requests.post("http://127.0.0.1:8000/unlearn/", json={"question": question}).json()
    st.write(response["response"])
Run:
streamlit run app.py
Deployment
You can deploy this on Hugging Face Spaces, AWS, or any other cloud provider.
Notes
Gradient Ascent (GA) in the Context of Language Models:
- Core Idea:
- In alignment settings, GA is used to directly maximize a reward signal, such as a preference-model score indicating how well the generated text matches desired characteristics (e.g., helpfulness, harmlessness).
- In unlearning settings such as the one ReLearn targets, GA instead maximizes the training loss on the forget set, pushing the parameters away from the memorized answers.
- In both cases, the model’s parameters are adjusted in whatever direction increases the chosen objective.
- How it Degrades Coherence:
- Over-optimization:
- GA can lead to over-optimization, where the model becomes overly focused on maximizing the reward at the expense of other important qualities, such as fluency and naturalness.
- The model might start generating text that is highly optimized for the reward but sounds unnatural or nonsensical.
- Reward hacking:
- The model may learn to find ways to “hack” the reward function, generating text that scores high but doesn’t genuinely reflect the desired characteristics.
- For example, if the reward is based on the presence of certain keywords, the model might overuse those keywords, resulting in repetitive and unnatural text.
- Loss of diversity:
- GA can cause the model to converge to a narrow set of highly rewarded outputs, reducing the diversity of generated text.
Negative Preference Optimization (NPO):
- Core Idea:
- NPO aims to discourage the model from generating text that is associated with negative preferences.
- Instead of directly maximizing a positive reward, it focuses on minimizing a “negative reward” or loss associated with undesirable outputs.
- This is often done by comparing the output of the model against a set of negative examples.
- How it Degrades Coherence:
- Over-correction:
- NPO can lead to over-correction, where the model becomes overly cautious and avoids generating certain types of text, even if they are linguistically valid.
- This can result in bland, generic, or overly constrained text.
- “Gaping holes” in the output distribution:
- By focusing on avoiding negative preferences, NPO might create “gaps” in the model’s output distribution, where certain types of coherent and natural text are suppressed.
- Conflicting negatives:
- If the negative preferences are not well-defined or are contradictory, the model might struggle to find a balance, leading to incoherent or inconsistent outputs.
- Loss of natural flow:
- Constantly avoiding negative examples can cause the model to produce text that lacks a natural flow or rhythm.
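To make the contrast concrete, here is a sketch of the two losses as they are typically applied to a forget set during unlearning; the NPO expression follows the original NPO formulation as I understand it, with sequence log-probabilities approximated by the (negated) average token loss, so treat it as an assumption rather than a faithful reproduction:

import torch
import torch.nn.functional as F

def ga_loss(model, batch):
    """Gradient ascent on the forget set: maximize the LM loss, i.e. minimize its negative."""
    return -model(input_ids=batch["input_ids"], labels=batch["labels"]).loss

def npo_loss(model, ref_model, batch, beta=0.1):
    """NPO-style loss: penalize forget sequences the current model still prefers over the frozen reference."""
    logp = -model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    with torch.no_grad():
        ref_logp = -ref_model(input_ids=batch["input_ids"], labels=batch["labels"]).loss
    # log(1 + (pi_theta / pi_ref)^beta) rewritten as softplus(beta * (log pi_theta - log pi_ref))
    return (2.0 / beta) * F.softplus(beta * (logp - ref_logp))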
General Factors Contributing to Coherence Degradation:
- Reward/Loss Function Design: The design of the reward or loss function is crucial. If it doesn’t accurately capture the desired characteristics, the model might learn to optimize for the wrong things.
- Data Bias: If the training data is biased, the model might learn to generate text that reflects those biases, leading to incoherent or inappropriate outputs.
- Optimization Techniques: The specific optimization techniques used can also affect coherence. Aggressive optimization or insufficient regularization can lead to overfitting and degraded fluency.
Mitigation Strategies:
- Careful Reward/Loss Function Design: Use reward/loss functions that accurately capture the desired characteristics and promote linguistic coherence.
- Regularization Techniques: Use regularization techniques to prevent overfitting and encourage diversity.
- Hybrid Approaches: Combine GA and NPO with other techniques, such as reinforcement learning from human feedback (RLHF), to balance alignment and coherence.
- Human Evaluation: Regularly evaluate the model’s output using human evaluators to identify and address coherence issues.
- Constrained Decoding: Use decoding strategies that allow for the enforcement of rules to maintain coherence.
- Iterative Refinement: Use an iterative refinement approach, where the model’s output is gradually improved through multiple rounds of optimization and feedback.