Detecting AI-Generated Text: Challenges and Solutions

Summary

Artificial intelligence (AI) has revolutionized the way we generate and consume text. From chatbots crafting customer responses to AI-authored articles, machine-generated content is now everywhere. As AI-generated text becomes harder to tell apart from human writing, being able to distinguish between the two has never been more critical. Here are some of the reasons it is important to be able to verify the source of information:

  • Preventing plagiarism
  • Maintaining academic integrity
  • Ensuring transparency in content creation
  • Preventing model collapse: if AI models are repeatedly trained on AI-generated text, their quality may degrade over time

In this blog post, we’ll explore the current most effective methods for detecting AI-generated text.

  • AI model-based detection (e.g., RoBERTa)
  • Perplexity & perturbation-based detection
  • Sparse Autoencoders (SAEs) + XGBoost

We will also explore AI watermarking, where AI-generated text carries hidden markers that identify it as machine-written.


Comparing Human and AI generated text

Let’s start with an example: ten obviously AI-generated sentences alongside ten obviously human-written ones.

10 Examples of AI-Generated vs. Human-Written Sentences

AI-Generated Sentences (Overly Formal, Repetitive, Fact-Heavy, or Lacks Personality)

  1. “In the ever-evolving landscape of artificial intelligence, language models continue to demonstrate unprecedented capabilities in generating human-like text.”
  2. “The significance of sustainable energy solutions cannot be overstated in the modern era of climate change and environmental awareness.”
  3. “The Renaissance was a pivotal period in human history, characterized by remarkable advancements in art, science, and philosophy.”
  4. “Machine learning algorithms leverage vast datasets to optimize predictive performance in a variety of real-world applications.”
  5. “Throughout history, civilizations have relied on innovation to drive progress and enhance societal development.”
  6. “The impact of artificial intelligence on the global workforce is a topic of considerable debate among experts in the field.”
  7. “While natural language processing has significantly improved over the past decade, challenges in context retention and sentiment analysis remain.”
  8. “Technological advancements have revolutionized the way humans interact with digital ecosystems, fostering unprecedented levels of connectivity.”
  9. “The intricate relationship between data privacy and cybersecurity continues to shape global policies in the digital age.”
  10. “Future developments in artificial intelligence are expected to further blur the distinction between human and machine-generated content.”

Common Traits in AI-Generated Sentences:

  • Overly formal and structured.
  • Often too polished and generic.
  • Repeats trendy phrases (e.g., “unprecedented capabilities”).
  • Lacks personal insight, humor, or informal quirks.

Human-Written Sentences (Casual, Imperfect, Personal, or Context-Rich)

  1. “Honestly, I had no idea AI could write this well until I saw ChatGPT in action.”
  2. “I still remember that summer when we stayed up all night talking—somehow, those moments stick with you forever.”
  3. “I tried making sourdough bread last weekend, and let’s just say it was more of a rock than a loaf.”
  4. “Look, I know AI is cool and all, but I still don’t trust a machine to write my wedding vows.”
  5. “You ever get that weird feeling that you left the stove on, even though you know you didn’t?”
  6. “The coffee at that new place on 5th Street is honestly overrated—too bitter and way overpriced.”
  7. “I can’t explain why, but I really love the sound of rain hitting the roof at night.”
  8. “We spent the whole day hiking, only to realize we took the wrong trail back—thankfully, we had snacks!”
  9. “I swear, my cat understands English but just chooses to ignore me unless I say ‘treats.’”
  10. “Grandma always had the best stories about growing up in the countryside, and I wish I’d written them down.”

Common Traits in Human-Written Sentences:

  • More casual and conversational.
  • Uses contractions and informal phrasing.
  • Includes personal experiences, emotions, or humor.
  • Not perfectly structured (may have sentence fragments, run-ons, or slang).

AI Traits: Overly formal language arises because models are trained on large corpora of formal texts (e.g., academic papers, news articles). Human Traits: Informality stems from unique personal experiences, emotions, and context-specific communication.

However, human writing itself covers a range of registers:

  • Informal blog posts read markedly different from AI output
  • Formal papers read much closer to AI
  • Informal books read clearly human
  • Formal technical books read closer to AI

Now that we’ve seen the differences, let’s explore how we can detect AI-generated text programmatically.


Method 1: AI Model-Based Detection (RoBERTa)

This is an older model for detecting AI-generated text. We start here so we can see the challenges and understand what a modern detector needs to do.

import sqlite3
from typing import List, Tuple
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Set device for model
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("roberta-base-openai-detector")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base-openai-detector")
pipe = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0 if DEVICE == "cuda:0" else -1)

# Initialize SQLite database
DB_NAME = "classification_results.db"

def initialize_database():
    """
    Initializes the SQLite database and creates the classifications table if it doesn't exist.
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS classifications (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            filename TEXT,
            line_text TEXT,
            label TEXT,
            confidence REAL
        )
    """)
    conn.commit()
    conn.close()

def classify_text(sentences: List[str]) -> List[Tuple[str, float]]:
    """
    Classifies a list of sentences as 'Human' or 'AI' with confidence scores.

    :param sentences: List of text inputs to classify.
    :return: List of tuples containing the predicted label ('Human' or 'AI') and confidence score.
    """
    results = pipe(sentences)
    return [("Human" if res["label"] == "Real" else "AI", res["score"]) for res in results]

def classify_file(filename: str) -> List[Tuple[str, float]]:
    """
    Reads a file line by line, classifies each line as 'Human' or 'AI', prints the results,
    and stores them in an SQLite database.

    :param filename: Name of the text file to classify.
    :return: List of tuples containing classification results.
    """
    results = []
    
    with open(filename, "r", encoding="utf-8") as file:
        lines = [line.strip() for line in file if line.strip()]  # Remove empty lines

    if not lines:
        print(f"Warning: {filename} is empty or contains only whitespace.")
        return results
    
    classifications = classify_text(lines)

    # Store results in SQLite
    save_to_database(filename, lines, classifications)
    
    # Print results
    for line, classification in zip(lines, classifications):
        print(f"{classification} : {line}")
        results.append(classification)
    
    return results

def save_to_database(filename: str, lines: List[str], classifications: List[Tuple[str, float]]):
    """
    Saves classification results to an SQLite database.

    :param filename: Name of the source file.
    :param lines: List of text lines from the file.
    :param classifications: List of classification results (label, confidence).
    """
    conn = sqlite3.connect(DB_NAME)
    cursor = conn.cursor()

    data = [(filename, line, label, confidence) for line, (label, confidence) in zip(lines, classifications)]
    cursor.executemany("INSERT INTO classifications (filename, line_text, label, confidence) VALUES (?, ?, ?, ?)", data)

    conn.commit()
    conn.close()
    print(f"Results from {filename} saved to database.")

# Initialize database
initialize_database()

# Run classification and store results
classify_file("ai.txt")
classify_file("human.txt")

Results: Astonishingly bad

| Filename | Text | Score | Confidence | Label |
|----------|------|-------|------------|-------|
| ai.txt | “In the ever-evolving landscape of artificial inte… | 0.30 | 53.91% | Human-written |
| ai.txt | “The significance of sustainable energy solutions … | 0.27 | 55.08% | Human-written |
| ai.txt | “The Renaissance was a pivotal period in human his… | 0.76 | 64.04% | AI-generated |
| ai.txt | “Machine learning algorithms leverage vast dataset… | 0.05 | 63.59% | Human-written |
| ai.txt | “Throughout history, civilizations have relied on … | 0.40 | 50.02% | AI-generated |
| ai.txt | “The impact of artificial intelligence on the glob… | 0.48 | 53.23% | AI-generated |
| ai.txt | “While natural language processing has significant… | 0.03 | 64.61% | Human-written |
| ai.txt | “Technological advancements have revolutionized th… | 0.75 | 63.61% | AI-generated |
| ai.txt | “The intricate relationship between data privacy a… | 0.11 | 61.34% | Human-written |
| ai.txt | “Future developments in artificial intelligence ar… | 0.47 | 52.81% | AI-generated |
| human.txt | Honestly, I had no idea AI could write this well u… | 0.14 | 60.40% | Human-written |
| human.txt | I still remember that summer when we stayed up all… | 0.51 | 54.49% | AI-generated |
| human.txt | I tried making sourdough bread last weekend, and l… | -0.09 | 68.94% | Human-written |
| human.txt | Look, I know AI is cool and all, but I still don’t… | -0.06 | 67.63% | Human-written |
| human.txt | You ever get that weird feeling that you left the … | 0.16 | 59.63% | Human-written |
| human.txt | The coffee at that new place on 5th Street is hone… | 0.19 | 58.47% | Human-written |
| human.txt | I can’t explain why, but I really love the sound o… | 0.17 | 58.91% | Human-written |
| human.txt | We spent the whole day hiking, only to realize we … | 0.08 | 62.58% | Human-written |
| human.txt | I swear, my cat understands English but just choos… | -1.09 | 93.17% | Human-written |
| human.txt | Grandma always had the best stories about growing … | -0.75 | 87.52% | Human-written |

As we can see, the results are poor. The RoBERTa detector was trained on GPT-2 output, but we are testing it against GPT-4+ text, and it frequently misclassifies AI text as human. The point here is that older detectors struggle against newer models: we are working against a moving target.


Method 2: Perplexity & Perturbation-Based Detection

Perplexity measures how well a model predicts a given text. In simple terms:

  • Low perplexity: the text is predictable (common in AI-generated text).
  • High perplexity: the text is unexpected (common in human writing).

Example: GPT Perplexity on Different Texts

| Text Sample | GPT Perplexity Score | Why? |
|-------------|----------------------|------|
| “The sun rises in the east.” | 2 (Low) | Very common phrase, easy for AI to predict. |
| “Neural networks revolutionized quantum linguistics.” | 5 (Medium) | Less common but still logical. |
| “Shakespeare’s sonnets unravel cosmic anomalies in Mars dust storms.” | 15.8 (High) | Rare, unexpected combination of words. |

AI-generated text tends to have lower perplexity than human text because it’s trained to produce highly fluent, predictable sentences.
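To make this concrete, here is a minimal sketch of how a perplexity score can be computed with GPT-2 via Hugging Face transformers. The exact numbers depend on the scoring model, so they will not match the table above exactly:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(average negative log-likelihood per token)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels makes the model return the mean cross-entropy loss
        loss = model(enc.input_ids, labels=enc.input_ids).loss
    return torch.exp(loss).item()

for sentence in [
    "The sun rises in the east.",
    "Shakespeare's sonnets unravel cosmic anomalies in Mars dust storms.",
]:
    print(f"{perplexity(sentence):8.1f}  {sentence}")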

Why AI Detection Needs More Than Just Perplexity

While perplexity is a good first step, it’s not enough for AI detection. Some human-written text also has low perplexity (e.g., simple sentences), and AI generated text can mimic complex writing styles.

Solution? We introduce perturbation.


Perturbation

Perturbation is a technique that slightly modifies a text (e.g., inserting words, shuffling sentences, paraphrasing) and then measures how much its perplexity changes.

Key Insight:

  • Human-written text is adaptable and maintains coherence when modified.
  • AI-generated text is brittle: small changes break its logical flow, causing a sharp drop in log-likelihood.

Example: How Perturbation Affects AI vs. Human Text

| Text Type | Before Perturbation (Log-Likelihood) | After Perturbation (Log-Likelihood) | Likelihood Change | Prediction |
|-----------|--------------------------------------|-------------------------------------|-------------------|------------|
| Human-written | -5.20 | -5.40 | Small change | Likely Human |
| AI-generated | -3.30 | -6.80 | Large drop | Likely AI |

By comparing the likelihood of the original and perturbed text, we can confidently classify AI-generated text.
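In miniature, the decision rule looks like the sketch below. It assumes you already have a log-likelihood scorer and a perturbation function (full implementations of both follow later in this post); the 20 perturbations and the 0.5 drop threshold are illustrative placeholders, not tuned values:

import statistics

def detect_by_perturbation(text, log_likelihood, perturb, n_perturbations=20, drop_threshold=0.5):
    """Flag text as AI-generated when its log-likelihood drops sharply under perturbation."""
    original = log_likelihood(text)
    perturbed = [log_likelihood(perturb(text)) for _ in range(n_perturbations)]
    drop = original - statistics.mean(perturbed)  # large positive drop suggests AI-generated text
    return ("Likely AI" if drop > drop_threshold else "Likely Human"), drop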


How Perturbation is Applied in AI Detection

The best perturbation methods for breaking AI-generated text include:

| Perturbation Type | Effectiveness | Why It Works |
|-------------------|---------------|--------------|
| GPT Completion Perturbation | Very Strong | Adds AI-generated content, breaking coherence. |
| Text Paraphrasing (T5) | Strong | AI struggles to handle structural changes. |
| Sentence Insertion | Strong | AI-generated text lacks adaptability to new context. |
| Synonym Substitution | Moderate | Small vocabulary changes disrupt AI-generated fluency. |
| Word Shuffling | Weak | AI understands word order well, so minimal impact. |

Conclusion: Why This Matters

1️⃣ AI-generated text is predictable (low perplexity) but fragile.
2️⃣ Perturbation disrupts fluency, revealing AI’s inability to adapt.
3️⃣ Large likelihood drops after perturbation indicate AI-generated text.

By combining GPT perplexity with perturbation, AI detection becomes far more accurate, even as models become more advanced.


Perturbation code example

This code defines a Perturbation class that applies various text modifications (perturbations) to assess whether a given text is AI-generated. It leverages GPT-based completion, T5 paraphrasing, synonym substitution, and other techniques to introduce controlled variations in the text and measure how it affects log-likelihood scores. The PerturbationConfig class provides a centralized way to configure model choices, perturbation methods, and randomness, making the system flexible and easy to customize.


import random
import torch
import logging
from nltk.corpus import wordnet
from transformers import AutoModelForSeq2SeqLM, AutoModelForCausalLM, AutoTokenizer, T5Tokenizer

# Configure Logging
logger = logging.getLogger(__name__)

class PerturbationConfig:
    """Configuration class for Perturbation settings."""

    def __init__(self):
        # Default model settings
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.t5_model_name = "t5-small"
        self.gpt_model_name = "EleutherAI/gpt-neo-1.3B"

        # Default perturbation method
        self.default_method = "gpt_completion"  # Change to "random" for randomized perturbations

        # Randomization settings
        self.use_random_perturbation = False  # Set to True to randomize perturbation selection

        # Perturbation parameters
        self.top_p = 0.92  # Top-p sampling for GPT perturbations
        self.paraphrase_top_p = 0.95  # Top-p sampling for T5 paraphrasing
        self.max_length = 512  # Max length for generated text perturbations

class Perturbation:
    """Class for perturbing text using different techniques to assess AI-generated text."""

    def __init__(self, config: PerturbationConfig = PerturbationConfig()):
        self.config = config
        self.device = self.config.device

        # Load T5 model for paraphrasing
        self.t5_model = AutoModelForSeq2SeqLM.from_pretrained(self.config.t5_model_name).to(self.device)
        self.t5_tokenizer = T5Tokenizer.from_pretrained(self.config.t5_model_name, model_max_length=512)

        # Load GPT model for AI-based text perturbation
        self.gptj_tokenizer = AutoTokenizer.from_pretrained(self.config.gpt_model_name)
        self.gptj_model = AutoModelForCausalLM.from_pretrained(self.config.gpt_model_name).to(self.device)

        logger.info(f"Perturbation class initialized with default method: {self.config.default_method}")

    def shuffle_text(self, text):
        """Randomly shuffles words in the text."""
        words = text.split()
        random.shuffle(words)
        perturbed_text = " ".join(words)
        logger.debug(f"Shuffle Perturbation: {text}{perturbed_text}")
        return perturbed_text

    def remove_word(self, text):
        """Randomly removes a word from the text."""
        words = text.split()
        if len(words) > 1:
            index = random.randint(0, len(words) - 1)
            removed_word = words.pop(index)
            perturbed_text = " ".join(words)
            logger.debug(f"Remove Word Perturbation: Removed '{removed_word}' → {perturbed_text}")
        else:
            perturbed_text = text
        return perturbed_text

    def replace_with_mask(self, text):
        """Randomly replaces a word with a mask token."""
        words = text.split()
        if len(words) > 1:
            index = random.randint(0, len(words) - 1)
            replaced_word = words[index]
            words[index] = "<mask>"
            perturbed_text = " ".join(words)
            logger.debug(f"Replace Word with Mask: Replaced '{replaced_word}' with '<mask>' → {perturbed_text}")
        else:
            perturbed_text = text
        return perturbed_text

    def synonym_substitution(self, text):
        """Replaces words with their synonyms using WordNet."""
        words = text.split()
        new_words = []
        for word in words:
            synonyms = wordnet.synsets(word)
            if synonyms:
                new_word = synonyms[0].lemmas()[0].name()
                new_words.append(new_word)
                logger.debug(f"Synonym Substitution: Replaced '{word}' with '{new_word}'")
            else:
                new_words.append(word)
        return " ".join(new_words)

    def paraphrase_text(self, text):
        """Uses T5 to generate a paraphrased version of the text."""
        input_text = f"paraphrase: {text} </s>"
        encoding = self.t5_tokenizer.encode_plus(input_text, return_tensors="pt").to(self.device)

        with torch.no_grad():
            outputs = self.t5_model.generate(**encoding, max_length=self.config.max_length, do_sample=True, top_p=self.config.paraphrase_top_p)

        perturbed_text = self.t5_tokenizer.decode(outputs[0], skip_special_tokens=True)
        logger.debug(f"Paraphrase Perturbation: {text}{perturbed_text}")
        return perturbed_text

    def insert_random_sentence(self, text):
        """Inserts a random unrelated sentence into the text."""
        random_sentences = [
            "The moon orbits the Earth approximately every 27.3 days.",
            "Neural networks have been widely used in deep learning applications.",
            "Shakespeare wrote some of the most famous plays in history."
        ]
        random_sentence = random.choice(random_sentences)
        words = text.split()
        insert_position = random.randint(0, len(words))
        words.insert(insert_position, random_sentence)
        perturbed_text = " ".join(words)
        logger.debug(f"Insert Random Sentence: Inserted '{random_sentence}' → {perturbed_text}")
        return perturbed_text

    def gpt_completion_perturbation(self, text):
        """Uses a GPT model to generate AI-based perturbations."""
        input_ids = self.gptj_tokenizer.encode(text, return_tensors="pt").to(self.device)
        with torch.no_grad():
            output = self.gptj_model.generate(input_ids, max_length=len(input_ids[0]) + 5, do_sample=True, top_p=self.config.top_p)

        perturbed_text = self.gptj_tokenizer.decode(output[0], skip_special_tokens=True)
        logger.debug(f"GPT Completion Perturbation: {text}{perturbed_text}")
        return perturbed_text

    def perturb_text(self, text, method=None):
        """
        Applies the selected perturbation method.

        :param text: Input text to be perturbed.
        :param method: Name of the perturbation method to apply.
                       Defaults to the config setting. Use "random" to choose a random method.
        :return: Perturbed text.
        """
        perturbation_methods = {
            "shuffle": self.shuffle_text,
            "remove_word": self.remove_word,
            "replace_mask": self.replace_with_mask,
            "synonym_substitution": self.synonym_substitution,
            "paraphrase": self.paraphrase_text,
            "insert_sentence": self.insert_random_sentence,
            "gpt_completion": self.gpt_completion_perturbation
        }

        # If no method is provided, use the configured default
        if method is None:
            method = self.config.default_method

        # If configured to use random perturbation, pick one randomly
        if method == "random" or self.config.use_random_perturbation:
            method = random.choice(list(perturbation_methods.keys()))

        if method in perturbation_methods:
            logger.info(f"Applying Perturbation: {method} on text: {text}")
            return perturbation_methods[method](text)
        else:
            logger.warning(f"Invalid perturbation method: {method}. Defaulting to GPT Completion.")
            return self.gpt_completion_perturbation(text)

Perplexity code example

The GPTPerplexity class is a text analysis tool designed to detect AI-generated text by evaluating the log-likelihood of a sentence using GPT-2 and measuring how this likelihood changes after applying perturbations. The compute_ai_score method assesses the difference between the original and modified text—AI-generated text is typically more fragile, showing a significant likelihood drop after perturbation. The results are encapsulated in the ClassificationResult data structure, which includes the AI score, confidence level, and a final classification of whether the text is “AI-generated” or “Human-written.”

In this example I chose smaller models because of available VRAM.

from dataclasses import dataclass
import torch
import numpy as np
import re
import math
import logging
from transformers import GPT2LMHeadModel, GPT2TokenizerFast, T5Tokenizer
from transformers import AutoModelForSeq2SeqLM
from sklearn.preprocessing import StandardScaler
from scipy.special import erf


#  Configure Logging
logging.basicConfig(
    filename="gpt_perplexity.log",
    filemode="w",
    format="%(asctime)s - %(levelname)s - %(message)s",
    level=logging.DEBUG
)

logger = logging.getLogger(__name__)


@dataclass
class ClassificationResult:
    """Dataclass to store classification results."""
    score: float
    diff: float
    std_dev: float
    confidence: float
    label: str

    def __str__(self):
        """Custom string representation."""
        return f"{self.confidence:.2f}% confidence that the text is {self.label}."


class GPTConfig:
    """Configuration class for GPTPerplexity model settings."""
    
    def __init__(self):
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.gpt_model_id = "gpt2" 
        self.t5_model_id = "t5-small"

        # Perplexity settings
        self.max_length = 1024
        self.stride = 51
        self.threshold = 0.7

        # Masking settings
        self.mask_span = 2
        self.mask_ratio = 0.3
        self.mask_samples = 100

        # T5 Generation settings
        self.t5_top_p = 0.96
        self.t5_num_return_sequences = 1


class GPTPerplexity:
    """GPT-based Perplexity and AI-Text Detection Model."""

    def __init__(self, config: GPTConfig):
        self.config = config
        self.device = config.device

        # Load models
        self.gpt_model = GPT2LMHeadModel.from_pretrained(config.gpt_model_id).to(self.device)
        self.tokenizer = GPT2TokenizerFast.from_pretrained(config.gpt_model_id)

        self.t5_model = AutoModelForSeq2SeqLM.from_pretrained(config.t5_model_id).to(self.device).half()
        self.t5_tokenizer = T5Tokenizer.from_pretrained(config.t5_model_id, model_max_length=512)


        # Initialize Perturbation class, reusing this model's device via its config
        perturbation_config = PerturbationConfig()
        perturbation_config.device = self.device
        self.perturbation = Perturbation(perturbation_config)

        logger.info("GPTPerplexity Model Initialized.")

    def get_log_likelihood(self, sentence):
        """Calculates the log-likelihood of a sentence using GPT2."""
        encodings = self.tokenizer(sentence, return_tensors="pt")
        seq_len = encodings.input_ids.size(1)
        nlls = []

        prev_end_loc = 0
        for begin_loc in range(0, seq_len, self.config.stride):
            end_loc = min(begin_loc + self.config.max_length, seq_len)
            trg_len = end_loc - prev_end_loc
            input_ids = encodings.input_ids[:, begin_loc:end_loc].to(self.device)
            target_ids = input_ids.clone()
            target_ids[:, :-trg_len] = -100

            with torch.no_grad():
                outputs = self.gpt_model(input_ids, labels=target_ids)
                nlls.append(outputs.loss * trg_len)

            prev_end_loc = end_loc
            if end_loc == seq_len:
                break

        likelihood = (-1 * torch.stack(nlls).sum() / end_loc).cpu().item()
        logger.debug(f"Computed log-likelihood: {likelihood}")
        return likelihood

    def compute_ai_score(self, sentence):
        """Computes an AI-score using log-likelihood differences between original and perturbed texts."""
        original_sentence = sentence.strip()
        logger.debug(f"Processing sentence: {original_sentence}")

        real_log_likelihood = self.get_log_likelihood(original_sentence)
        logger.debug(f"Real log-likelihood: {real_log_likelihood}")

        # Generate perturbed versions of the sentence
        perturbed_sentences = [self.get_log_likelihood(self.perturbation.perturb_text(original_sentence))
                               for _ in range(self.config.mask_samples)]
        logger.debug(f"Perturbed sentence log-likelihoods: {perturbed_sentences}")

        if not perturbed_sentences:
            logger.error("Error perturbing sentences. No valid perturbations.")
            return -1.0, 0.0, 0.0

        # **Use scikit-learn to standardize data**
        scaler = StandardScaler()
        all_scores = np.array([real_log_likelihood] + perturbed_sentences).reshape(-1, 1)
        standardized_scores = scaler.fit_transform(all_scores).flatten()

        logger.debug(f"Standardized Scores: {standardized_scores}")

        standardized_real_score = standardized_scores[0]
        standardized_perturbed_mean = np.mean(standardized_scores[1:])
        standardized_perturbed_std = np.std(standardized_scores[1:])

        if standardized_perturbed_std == 0:
            logger.warning("Standard deviation is zero. Adding small jitter.")
            standardized_perturbed_std = 1e-8 + np.random.uniform(0, 1e-5)  

        ai_score = (standardized_real_score - standardized_perturbed_mean) / standardized_perturbed_std

        logger.info(f"AI Score: {ai_score}, Difference: {standardized_real_score - standardized_perturbed_mean}, Std Dev: {standardized_perturbed_std}")

        return float(ai_score), float(standardized_real_score - standardized_perturbed_mean), float(standardized_perturbed_std)

    def classify_text(self, sentence) -> ClassificationResult:
        """Classifies text as AI-generated or Human-written and returns a structured result."""
        sentence = re.sub(r"\[[0-9]+\]", "", sentence)  
        score, diff, std_dev = self.compute_ai_score(sentence)

        if score == -1 or math.isnan(score):
            logger.error("Error: Computed an invalid AI score.")
            return ClassificationResult(score=-1, diff=0, std_dev=0, confidence=0, label="Error")

        confidence = self.normal_cdf(abs(self.config.threshold - score)) * 100
        label = "AI-generated" if score > self.config.threshold else "Human-written"
        
        logger.info(f"Final Classification: {confidence:.2f}% confidence that the text is {label}.")

        return ClassificationResult(score=score, diff=diff, std_dev=std_dev, confidence=confidence, label=label)

    @staticmethod
    def normal_cdf(x):
        """Approximate the CDF of a normal distribution using erf."""
        return 0.5 * (1 + erf(x / math.sqrt(2)))

Helper functions


import sqlite3

def setup_database():
    """Creates the SQLite database and results table if it doesn't exist."""
    conn = sqlite3.connect("classification_results.db")
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS results (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            filename TEXT,
            text TEXT,
            score REAL,
            diff REAL,
            std_dev REAL,
            confidence REAL,
            label TEXT
        )
    """)
    conn.commit()
    conn.close()

def classify_file(model, filename):
    """
    Reads a file line by line, classifies each line as 'AI-generated' or 'Human-written',
    and stores the results in an SQLite database.

    :param model: The GPTPerplexity model for classification.
    :param filename: Name of the text file to classify.
    """
    print(f"\nProcessing file: {filename}\n" + "-"*40)

    with open(filename, "r", encoding="utf-8") as file:
        lines = [line.strip() for line in file if line.strip()]  # Remove empty lines

    if not lines:
        print(f"Warning: {filename} is empty or contains only whitespace.")
        return

    conn = sqlite3.connect("classification_results.db")
    cursor = conn.cursor()

    for line in lines:
        classification_result = model.classify_text(line)

        # Insert into the database
        cursor.execute("""
            INSERT INTO results (filename, text, score, diff, std_dev, confidence, label)
            VALUES (?, ?, ?, ?, ?, ?, ?)
        """, (filename, line, classification_result.score, classification_result.diff,
              classification_result.std_dev, classification_result.confidence, classification_result.label))

        print(f" Text: {line}\n {str(classification_result)}\n")

    conn.commit()
    conn.close()

def generate_markdown_report():
    """Generates a markdown report from the classification results stored in the database."""
    conn = sqlite3.connect("classification_results.db")
    cursor = conn.cursor()

    # Fetch results
    cursor.execute("SELECT filename, text, score, confidence, label FROM results")
    rows = cursor.fetchall()
    conn.close()

    # Create Markdown content
    markdown_content = "# AI vs. Human Text Classification Report\n\n"
    markdown_content += "| Filename | Text | Score | Confidence | Label |\n"
    markdown_content += "|----------|------|-------|------------|-------|\n"

    for row in rows:
        filename, text, score, confidence, label = row
        markdown_content += f"| {filename} | {text[:50]}... | {score:.2f} | {confidence:.2f}% | {label} |\n"

    # Write to markdown file
    with open("classification_report.md", "w", encoding="utf-8") as md_file:
        md_file.write(markdown_content)

    print("\n Report generated: classification_report.md")

Running the code

# Step 1: Setup the database
setup_database()

# Step 2: Process files and store results in SQLite
config = GPTConfig()
model = GPTPerplexity(config)

classify_file(model, "ai.txt")
classify_file(model, "human.txt")

# Step 3: Generate markdown report from the stored results
generate_markdown_report()

Results using perplexity

  • The results here are heavily influenced by the models I used; because I used smaller models, the results are not as strong as they could be.
  • I lowered the classification threshold from 0.7 to 0.4 to stop the model from labelling almost everything as human.
  • Even so, the results are a significant improvement over the RoBERTa detector.

AI vs. Human Text Classification Report

| Filename | Text | Score | Confidence | Label |
|----------|------|-------|------------|-------|
| ai.txt | “In the ever-evolving landscape of artificial inte… | 0.30 | 53.91% | Human-written |
| ai.txt | “The significance of sustainable energy solutions … | 0.27 | 55.08% | Human-written |
| ai.txt | “The Renaissance was a pivotal period in human his… | 0.76 | 64.04% | AI-generated |
| ai.txt | “Machine learning algorithms leverage vast dataset… | 0.05 | 63.59% | Human-written |
| ai.txt | “Throughout history, civilizations have relied on … | 0.40 | 50.02% | AI-generated |
| ai.txt | “The impact of artificial intelligence on the glob… | 0.48 | 53.23% | AI-generated |
| ai.txt | “While natural language processing has significant… | 0.03 | 64.61% | Human-written |
| ai.txt | “Technological advancements have revolutionized th… | 0.75 | 63.61% | AI-generated |
| ai.txt | “The intricate relationship between data privacy a… | 0.11 | 61.34% | Human-written |
| ai.txt | “Future developments in artificial intelligence ar… | 0.47 | 52.81% | AI-generated |
| human.txt | Honestly, I had no idea AI could write this well u… | 0.14 | 60.40% | Human-written |
| human.txt | I still remember that summer when we stayed up all… | 0.51 | 54.49% | AI-generated |
| human.txt | I tried making sourdough bread last weekend, and l… | -0.09 | 68.94% | Human-written |
| human.txt | Look, I know AI is cool and all, but I still don’t… | -0.06 | 67.63% | Human-written |
| human.txt | You ever get that weird feeling that you left the … | 0.16 | 59.63% | Human-written |
| human.txt | The coffee at that new place on 5th Street is hone… | 0.19 | 58.47% | Human-written |
| human.txt | I can’t explain why, but I really love the sound o… | 0.17 | 58.91% | Human-written |
| human.txt | We spent the whole day hiking, only to realize we … | 0.08 | 62.58% | Human-written |
| human.txt | I swear, my cat understands English but just choos… | -1.09 | 93.17% | Human-written |
| human.txt | Grandma always had the best stories about growing … | -0.75 | 87.52% | Human-written |

Modern LLMs increasingly produce higher-perplexity text in order to sound more human, so this approach, although usable today, will gradually lose its edge. If you use LLMs a lot, you will also notice that they sound human for longer and longer stretches: today it is sentences and short blog posts; soon it will be short books and beyond.


Sparse Autoencoders and XGBoost

As we have seen, detection can't keep up with the newer models. To tackle this challenge, we propose a hybrid approach that combines:

  • Sparse Autoencoders (SAE) for feature extraction from text embeddings
  • XGBoost for classification of AI vs. human text

This method provides robust detection by leveraging deep learning for feature learning and gradient boosting for accurate classification.


Why Sparse Autoencoders (SAE) + XGBoost?

| Method | Advantage |
|--------|-----------|
| SAE (Sparse Autoencoder) | Compresses text embeddings into a lower-dimensional representation, learning hidden patterns in AI-generated text |
| XGBoost Classifier | Classifies AI vs. Human text based on extracted SAE features with high accuracy |

This two-step approach allows us to learn meaningful representations of text without explicit supervision while still leveraging a powerful classification algorithm.


Step 1: Collecting Data

For this project, we use the Human vs. Machine dataset from Hugging Face:

from datasets import load_dataset
import pandas as pd

# Load the dataset
dataset = load_dataset("NicolaiSivesind/human-vs-machine")

# Convert to DataFrame
df_train = pd.DataFrame(dataset["train"])
df_test = pd.DataFrame(dataset["test"])

print(f"Training samples: {len(df_train)}, Test samples: {len(df_test)}")

Now, we have a dataset with labeled human and AI-generated text.
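Before going further, it is worth a quick sanity check of the dataframes. The snippet below assumes the dataset exposes `text` and `label` columns, with label 0 for human and 1 for machine text (an assumption worth confirming on your end):

# Quick sanity check of the dataframes built above
print(df_train.columns.tolist())              # expected: ['text', 'label'] (assumption)
print(df_train["label"].value_counts())       # label balance; 0 = human, 1 = machine (assumption)
print(df_train["text"].str.len().describe())  # rough length distribution of the samples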


Step 2: Generating Text Embeddings

Before feeding text into the SAE, we convert it into embeddings using Ollama’s mxbai-embed-large model.

import torch
import ollama

def get_embedding(text):
    """Generates an embedding using Ollama"""
    try:
        embedding_data = ollama.embeddings(model="mxbai-embed-large", prompt=text)
        return torch.tensor(embedding_data["embedding"], dtype=torch.float32)
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

# Convert text to embeddings, keeping labels aligned with the rows that embed successfully
train_pairs = [(get_embedding(text), label) for text, label in zip(df_train["text"], df_train["label"])]
test_pairs = [(get_embedding(text), label) for text, label in zip(df_test["text"], df_test["label"])]

train_pairs = [(e, l) for e, l in train_pairs if e is not None]
test_pairs = [(e, l) for e, l in test_pairs if e is not None]

# Stack embeddings into tensors and keep the matching label lists
train_embeddings_tensor = torch.stack([e for e, _ in train_pairs])
test_embeddings_tensor = torch.stack([e for e, _ in test_pairs])
train_labels = [l for _, l in train_pairs]
test_labels = [l for _, l in test_pairs]

With our embeddings prepared, we can now extract meaningful features.


Step 3: Training the Sparse Autoencoder (SAE)

The Sparse Autoencoder compresses embeddings to a lower-dimensional feature space while preserving important information.

import torch.nn as nn
import torch.optim as optim

class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(SparseAutoencoder, self).__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU()
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden_dim, input_dim),
            nn.Sigmoid()
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return decoded, encoded

# Initialize model
input_dim = train_embeddings_tensor.shape[1]  # Embedding size
hidden_dim = 128  # Feature compression dimension, you may need to adjust
sae = SparseAutoencoder(input_dim, hidden_dim).to("cuda")

# Hold back a slice of the training embeddings as a validation set
val_size = max(1, len(train_embeddings_tensor) // 10)
train_data = train_embeddings_tensor[:-val_size].to("cuda")
val_embeddings = train_embeddings_tensor[-val_size:].to("cuda")

# Train SAE
# Define optimizer and loss function (weight decay adds regularization)
optimizer = optim.Adam(sae.parameters(), lr=1e-3, weight_decay=1e-5)
loss_fn = nn.MSELoss()
epochs = 500
best_val_loss = float("inf")

for epoch in range(epochs):
    sae.train()
    optimizer.zero_grad()
    reconstructed, encoded = sae(train_data)
    loss = loss_fn(reconstructed, train_data)
    loss.backward()
    optimizer.step()

    # Compute validation loss
    sae.eval()  # Set to evaluation mode
    with torch.no_grad():
        val_reconstructed, _ = sae(val_embeddings)
        val_loss = loss_fn(val_reconstructed, val_embeddings)

    # Save best model based on validation loss
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        print(f"Saving best model with validation loss: {best_val_loss.item():.6f} .. {epoch}")
        torch.save(sae.state_dict(), "best_sparse_autoencoder.pth")

    print(f"Epoch [{epoch+1}/{epochs}] - Train Loss: {loss.item():.6f} | Validation Loss: {val_loss.item():.6f}")

print("✅ Training completed. Best model saved!")

Now, the SAE learns compact feature representations of text embeddings.


Step 4: Extracting Features Using the SAE

We can use the trained SAE in two ways: directly, by thresholding its reconstruction loss, and as a feature extractor for the XGBoost classifier in the next step. The snippet below shows the reconstruction-loss approach.

def detect_ai_text(model, test_embeddings, threshold=0.21):
    """
    Uses the Sparse Autoencoder to classify AI-generated text.
    - Computes reconstruction loss.
    - Labels a sample as AI-generated (1) if its reconstruction loss is above the threshold.
    """
    with torch.no_grad():
        reconstructed, _ = model(test_embeddings)  # Forward pass through SAE
        reconstruction_loss = torch.mean((test_embeddings - reconstructed) ** 2, dim=1)  # MSE loss per sample

    # Convert to CPU for further processing
    reconstruction_loss = reconstruction_loss.cpu().numpy()

    # Classify based on threshold
    predictions = (reconstruction_loss > threshold).astype(int)  # 1 = AI, 0 = Human

    return reconstruction_loss, predictions

# Load the best checkpoint saved during training and run AI detection on the test embeddings
sae.load_state_dict(torch.load("best_sparse_autoencoder.pth"))
sae.eval()
test_reconstruction_loss, test_predictions = detect_ai_text(sae, test_embeddings_tensor.to("cuda"))

That gives us a first, purely reconstruction-based detector.
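The XGBoost step below expects `train_features` and `test_features`. One way to build them (a sketch reusing the `sae` loaded above and the embedding tensors from Step 2) is to pass the embeddings through the encoder half of the autoencoder only:

# Encode the embeddings with the trained SAE encoder; these become the XGBoost inputs
with torch.no_grad():
    train_features = sae.encoder(train_embeddings_tensor.to("cuda")).cpu().numpy()
    test_features = sae.encoder(test_embeddings_tensor.to("cuda")).cpu().numpy()

print(f"Train features: {train_features.shape}, test features: {test_features.shape}")

Now, we have feature vectors ready for classification!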


Step 5: Training XGBoost for AI Detection

Using SAE features, we train XGBoost for classification.

import xgboost as xgb
from sklearn.metrics import accuracy_score

# Train XGBoost classifier
xgb_model = xgb.XGBClassifier(
    objective="binary:logistic",
    eval_metric="logloss",
    max_depth=7,
    learning_rate=0.01,
    n_estimators=500
)

xgb_model.fit(train_features, train_labels)

# Predict on test set
test_preds = xgb_model.predict(test_features)

# Evaluate accuracy
accuracy = accuracy_score(test_labels, test_preds)
print(f"XGBoost Accuracy: {accuracy:.4f}")

Now, we have a trained AI-text classifier!
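To classify a brand-new piece of text end to end, we can chain the same components together. This is only a sketch; it assumes label 1 means machine-generated, as in the dataset used above:

def classify_new_text(text: str) -> str:
    """End-to-end sketch: embed → SAE-encode → XGBoost prediction for a single string."""
    embedding = get_embedding(text)
    if embedding is None:
        return "Error: could not generate embedding"
    with torch.no_grad():
        features = sae.encoder(embedding.unsqueeze(0).to("cuda")).cpu().numpy()
    prediction = xgb_model.predict(features)[0]
    return "AI-generated" if prediction == 1 else "Human-written"  # assumes 1 = machine label

print(classify_new_text("In the ever-evolving landscape of artificial intelligence, language models continue to improve."))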


Step 6: Visualizing Results

We use t-SNE to check how well the SAE separates AI vs. Human text.

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import numpy as np

# Reduce SAE features using t-SNE
tsne = TSNE(n_components=2, random_state=42)
test_features_2d = tsne.fit_transform(test_features)

# Plot results, using the labels kept aligned with the test embeddings
labels = np.array(test_labels)
plt.figure(figsize=(8, 6))
plt.scatter(test_features_2d[labels == 0, 0], test_features_2d[labels == 0, 1],
            c="blue", label="Human", alpha=0.5)
plt.scatter(test_features_2d[labels == 1, 0], test_features_2d[labels == 1, 1],
            c="red", label="AI", alpha=0.5)
plt.xlabel("t-SNE Component 1")
plt.ylabel("t-SNE Component 2")
plt.title(" t-SNE Visualization of SAE Features")
plt.legend()
plt.show()

SAE Results

import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

# Compute confusion matrix
cm = confusion_matrix(test_labels, test_preds)

# Plot confusion matrix
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["Human", "AI"], yticklabels=["Human", "AI"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("📊 XGBoost Confusion Matrix")
plt.show()

XGBoost Confusion Matrix

Now, we can visually check if AI and Human text embeddings separate well!


Conclusion

By combining Sparse Autoencoders (SAE) for feature learning with XGBoost for classification, we achieve a robust AI-text detection pipeline.

  • SAE compresses embeddings into meaningful representations
  • XGBoost leverages these features for high-accuracy classification
  • The approach generalizes well across AI writing models

This hybrid method is a powerful solution for detecting AI-generated text as LLMs continue to evolve.


Watermarking

As AI-generated text becomes more convincing, watermarking techniques help identify AI-generated content while ensuring security and integrity. Here are the most advanced AI watermarking methods used today.

While watermarking helps identify AI-generated text, it raises questions about privacy and intellectual property.

Probabilistic Watermarking (Token-Level Control)

How it Works:

  • AI models slightly modify token probabilities when generating text.
  • Some words or token sequences are subtly biased to create a hidden pattern.
  • This bias is detectable statistically but remains invisible to humans.

Example:

  • A model might favor words like “however” or “indeed” at a higher frequency than usual.
  • A secret watermarking key can verify if text contains the pattern.

Real-World Usage:
OpenAI and Anthropic have publicly discussed watermarking schemes of this kind for ChatGPT- and Claude-generated text.


Cryptographic Watermarking (Steganographic Methods)

How it Works:

  • AI embeds a unique cryptographic signature in the text.
  • It uses word embedding shifts, synonym replacements, or whitespace variations to hide metadata.
  • The text can be verified using a decryption key to check if it’s AI-generated.

Example:

  • “The quick brown fox jumps over the lazy dog”
  • Watermarked: “The swift brown fox leaps over the lazy hound”
  • A special algorithm detects this pattern of changes.

Real-World Usage:
Used in Google DeepMind’s AI-generated text detection system.
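As a toy illustration of the synonym-substitution flavour, the sketch below chooses between word variants according to bits derived from an HMAC over a secret key. The tiny codebook and the matching rule are purely illustrative, not a production scheme:

import hmac
import hashlib

SECRET_KEY = b"demo-key"                                  # illustrative secret
CODEBOOK = [("quick", "swift"), ("jumps", "leaps"), ("dog", "hound")]  # (variant_0, variant_1)
WORD_TO_PAIR = {w: (i, b) for i, pair in enumerate(CODEBOOK) for b, w in enumerate(pair)}

def secret_bit(occurrence_index: int) -> int:
    """Key-dependent pseudo-random bit for the n-th watermarkable word."""
    digest = hmac.new(SECRET_KEY, str(occurrence_index).encode(), hashlib.sha256).digest()
    return digest[0] & 1

def embed(text: str) -> str:
    """Rewrite each codebook word so its chosen variant encodes the keyed bit."""
    out, occurrence = [], 0
    for word in text.split():
        if word.lower() in WORD_TO_PAIR:
            pair_index, _ = WORD_TO_PAIR[word.lower()]
            out.append(CODEBOOK[pair_index][secret_bit(occurrence)])
            occurrence += 1
        else:
            out.append(word)
    return " ".join(out)

def verify(text: str) -> float:
    """Fraction of codebook words whose variant matches the keyed bit (1.0 = fully watermarked)."""
    matches, occurrence = [], 0
    for word in text.split():
        if word.lower() in WORD_TO_PAIR:
            _, variant = WORD_TO_PAIR[word.lower()]
            matches.append(variant == secret_bit(occurrence))
            occurrence += 1
    return sum(matches) / len(matches) if matches else 0.0

marked = embed("The quick brown fox jumps over the lazy dog")
print(marked, verify(marked))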


Style-Based Watermarking (Linguistic Fingerprinting)

How it Works:

  • AI intentionally maintains specific writing quirks in sentence structure, word choice, or punctuation.
  • A fingerprinting model tracks these patterns to identify AI-generated text.

Example:

  • AI often avoids contractions or uses a repetitive sentence rhythm.
  • A detector trained on style patterns can predict AI-generated text.

Real-World Usage:
Used by Turnitin AI detection and Hugging Face AI classifiers.


Invisible Character Watermarking (Zero-Width Characters)

How it Works:

  • AI inserts invisible Unicode characters (like zero-width spaces or non-printable characters) into the text.
  • These characters don’t appear to the human eye but can be detected programmatically.

Example:

  • AI-generated text looks normal but contains hidden Unicode sequences like \u200b (zero-width space).
  • A script can detect the hidden characters and confirm if text is AI-generated.

Real-World Usage:
Used in Meta AI and Microsoft watermarking solutions for detecting AI text in social media content.
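A minimal sketch of how such a watermark could be embedded and recovered (the bit-payload scheme here is made up purely for illustration):

ZW = "\u200b"   # zero-width space

def embed_zero_width(text: str, payload_bits: str = "1011") -> str:
    """Hide a short bit pattern by appending zero-width spaces to selected words."""
    words = text.split()
    for i, bit in enumerate(payload_bits):
        if bit == "1" and i < len(words):
            words[i] = words[i] + ZW
    return " ".join(words)

def extract_zero_width(text: str) -> str:
    """Recover the bit pattern: 1 where a word carries a zero-width space, else 0."""
    return "".join("1" if ZW in w else "0" for w in text.split())

marked = embed_zero_width("This sentence looks perfectly ordinary to a human reader")
print(repr(marked))                 # the hidden characters show up only in the repr
print(extract_zero_width(marked))   # leading bits reproduce the payload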


Semantic Watermarking (Meaning-Based Alterations)

How it Works:

  • AI slightly rewords sentences while maintaining identical meaning.
  • A detector algorithm recognizes the unique phrasing patterns AI models use.

Example:

  • Human: “AI is changing the world in unexpected ways.”
  • AI (Watermarked): “The world is experiencing unexpected transformations due to AI.”
  • The subtle semantic shift is detectable by a trained classifier.

Real-World Usage:
Google’s AI watermarking system uses this in its Bard and Gemini models.


Which Method is Most Effective?

| Watermarking Type | Visibility | Robustness | Use Cases |
|-------------------|------------|------------|-----------|
| Probabilistic Token Control | Invisible | Hard to remove | AI chatbot text |
| Cryptographic Watermarks | Hidden | Cryptographically secure | Sensitive AI-generated content |
| Style-Based Fingerprinting | Subtle | Can be bypassed | AI plagiarism detection |
| Zero-Width Characters | Detectable with tools | Easy to remove | AI-written online content |
| Semantic Watermarking | Undetectable | Hard to remove | Journalism & misinformation detection |

Example AI Watermark Detection Script

This Python pipeline helps detect AI-generated text watermarks using multiple techniques:

  • Probabilistic token frequency analysis
  • Zero-width character detection
  • Style-based AI fingerprinting

The same checks can be applied to text from OpenAI, Google, or any other provider.

import re
import numpy as np
import nltk
from nltk.tokenize import word_tokenize
from collections import Counter
from transformers import AutoModelForSequenceClassification, AutoTokenizer

nltk.download("punkt")

class AIWatermarkDetector:
    def __init__(self):
        """Initialize token patterns and AI classifiers."""
        self.common_ai_tokens = {"thus", "moreover", "indeed", "consequently", "notably"}  # Probabilistic watermark
        self.tokenizer = AutoTokenizer.from_pretrained("roberta-base-openai-detector")
        self.model = AutoModelForSequenceClassification.from_pretrained("roberta-base-openai-detector")

    def detect_token_watermark(self, text):
        """Check if AI-preferred words appear more frequently than expected."""
        tokens = word_tokenize(text.lower())
        token_counts = Counter(tokens)

        ai_bias_score = sum(token_counts[token] for token in self.common_ai_tokens if token in token_counts)
        normalized_score = ai_bias_score / len(tokens)

        return normalized_score > 0.02, f"AI Token Watermark Score: {normalized_score:.4f}"

    def detect_zero_width_chars(self, text):
        """Detect invisible Unicode characters that AI models may embed as watermarks."""
        hidden_chars = re.findall(r'[\u200B\u200C\u200D\uFEFF]', text)  # Zero-width characters
        return len(hidden_chars) > 0, f"Zero-Width Characters Found: {len(hidden_chars)}"

    def detect_style_patterns(self, text):
        """Use an AI detector model to classify writing style."""
        inputs = self.tokenizer(text, return_tensors="pt", truncation=True, padding=True)
        outputs = self.model(**inputs)
        ai_confidence = outputs.logits.softmax(dim=-1).tolist()[0][1]  # AI-generated confidence

        return ai_confidence > 0.7, f"Style AI Confidence Score: {ai_confidence:.4f}"

    def analyze_text(self, text):
        """Run all detection methods and return the results."""
        results = {
            "Token Watermark": self.detect_token_watermark(text),
            "Zero-Width Watermark": self.detect_zero_width_chars(text),
            "Style-Based Fingerprinting": self.detect_style_patterns(text),
        }
        
        ai_detected = any(flag for flag, _ in results.values())
        final_verdict = "AI-Generated Text Detected" if ai_detected else "Likely Human-Written"

        return {"Verdict": final_verdict, "Details": results}

# Usage Example
if __name__ == "__main__":
    detector = AIWatermarkDetector()
    sample_text = "Indeed, the rapid growth of AI has notably changed how we communicate. Moreover, its impact is undeniable."
    
    result = detector.analyze_text(sample_text)
    print(result)

How This Works

  1. Detects AI-token biases (words that AI prefers like indeed, thus, consequently).
  2. Checks for zero-width character watermarks (used for AI attribution).
  3. Uses an AI classifier (RoBERTa) to identify AI writing style patterns.

If any test is triggered, the text is marked as AI-generated.


Running the Detection Script

Run the script with:

python ai_watermark_detector.py

Sample Output:

{
    "Verdict": "AI-Generated Text Detected",
    "Details": {
        "Token Watermark": (True, "AI Token Watermark Score: 0.0345"),
        "Zero-Width Watermark": (False, "Zero-Width Characters Found: 0"),
        "Style-Based Fingerprinting": (True, "Style AI Confidence Score: 0.8312")
    }
}

AI Text Detection Method Comparison

| Method | Strengths | Weaknesses | Best For |
|--------|-----------|------------|----------|
| RoBERTa Detector | Fast, pre-trained | Poor against GPT-4+ | Basic AI text detection |
| Perplexity + Perturbation | Exposes brittle AI writing | Requires tuning for each model | Detecting AI fluency patterns |
| Sparse Autoencoders + XGBoost | Learns hidden AI patterns | Needs large dataset | Model-agnostic AI detection |
| AI Watermarking Detection | Can verify AI text with certainty | Only works if watermark exists | Identifying marked AI content |

Code and further examples

The code used while writing and researching this post can be found here:

aitext

References

1️⃣ Feature-Level Insights into Artificial Text Detection with Sparse Autoencoders
2️⃣ NicolaiSivesind/human-vs-machine
3️⃣ Multiscale Positive-Unlabeled Detection of AI-Generated Texts
4️⃣ gpt-2-output-dataset
5️⃣ DetectGPT

Conclusion

Looking at the results, AI detection still has a significant margin of error. As models improve, the distinction between AI-generated and human-written text will blur even further. In the future, it may be impossible to say with certainty whether a piece of text was written by a machine or a person.

The reality of AI? It’s all about probability estimates, not certainties. Every detection method, from perplexity and perturbation to sparse autoencoders, relies on probability scores. But as LLMs like GPT-4o and Claude advance, even the best detection techniques are beginning to struggle.

🔹 The Inevitable Rise of AI-Generated Content
Instead of fighting AI-generated text, we should focus on how to use AI effectively. AI is not a threat to creativity; it is a tool for amplification.

The best approach? Use AI to generate content, then rewrite it in your own voice. If a human adds their own insights, experiences, and emotions, can we still call it AI-generated? Or is it simply enhanced human writing?

🔹 The Future of Writing: AI + Human Synergy
🔹 AI isn’t replacing human creativity; it’s evolving it.
🔹 Detection will become harder, but authenticity will remain in human expression.
🔹 The key isn’t avoiding AI, but learning how to use it in a way that enhances rather than replaces originality.

So instead of asking whether text is AI-generated, perhaps the better question is: How do we use AI to create something uniquely ours?

The real question about a piece of work or art isn’t how it was created, but whether you find it useful or meaningful.

Use it. Improve it. Make it yours.


Glossary

| Term | Definition |
|------|------------|
| RoBERTa Detector | A pre-trained AI model that classifies whether text is AI-generated or human-written. |
| Perplexity | A measure of how predictable a text is; AI-generated text often has lower perplexity because it follows structured patterns. |
| Perturbation | A technique where text is slightly modified (e.g., paraphrasing, inserting words) to see if AI-generated text breaks under small changes. |
| Sparse Autoencoder (SAE) | A machine learning model that learns a compressed representation of text features, helping to distinguish AI-generated patterns. |
| XGBoost | A gradient boosting machine learning algorithm used for classification; it helps detect AI-generated text by analyzing features extracted by the SAE. |
| Watermarking | A method where AI models embed hidden markers in generated text to later verify whether it was created by AI. |
| AI-Generated Text | Any text produced by AI models rather than humans, often trained to mimic human style and fluency. |
| Zero-Width Watermarking | A technique where invisible Unicode characters (e.g., zero-width spaces) are inserted in AI text for detection. |
| Cryptographic Watermarking | A security method where AI text is digitally signed with cryptographic markers that can be later verified. |