Using Hugging Face Datasets

Summary

Machine learning operates on data. Essentially, it processes data to extract meaningful information, which can then be used to make intelligent decisions. This is the foundation of Artificial Intelligence. In general, the more data you have, the better your machine learning apps will be, with one caveat: the data has to be high quality. More data of higher quality means better apps.

The Hugging Face Datasets library provides easy access to thousands of ML datasets covering text, images, and structured data.

What You’ll Learn

  • How to find and load datasets from Hugging Face
  • How to filter, process, and transform datasets
  • How to save datasets in different formats (CSV, Arrow, Parquet)
  • How to create and update your own dataset
  • How to use Git and the Hugging Face CLI for dataset management

By the end of this post, you will understand Hugging Face Datasets, how to configure them, and how to use them effectively. You will also learn a little about pandas.


Loading, Processing, and Analyzing Hugging Face Datasets

1️⃣ Choosing a dataset

For this tutorial I will be working with:

1-800-SHARED-TASKS/COLING-2025-GENAI-MULTI

This is a large dataset of human- and AI-generated text in English and Chinese.

How do you find a dataset?

  1. Decide on the task you are going to perform.
  2. Look at recent related papers and the dataset(s) they chose for their research. The reasoning here is that they will generally choose more modern and well-vetted datasets.
  3. Follow the likes and downloads on Hugging Face, for example the Trending Text Datasets page (see the programmatic sketch after this list).
  4. If you can't find what you need, build your own.
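If you prefer to search programmatically, here is a minimal sketch using the huggingface_hub API (the search term and limit are just placeholders):

from huggingface_hub import HfApi

api = HfApi()

# Search the Hub for datasets, most downloaded first
results = api.list_datasets(
    search="human vs ai text",
    sort="downloads",
    direction=-1,
    limit=5,
)

for ds in results:
    print(ds.id)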

2️⃣ Loading a Dataset

Before we begin, install the Hugging Face datasets package:

pip install datasets pandas

Next, we will load and view the dataset:

from datasets import load_dataset, DatasetDict
import pandas as pd

# Load the dataset
dataset = load_dataset("1-800-SHARED-TASKS/COLING-2025-GENAI-MULTI")

# Print dataset structure
print(dataset)

Output:

DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'sub_source', 'lang', 'model', 'label', 'text'],
        num_rows: 674083
    })
    validation: Dataset({
        features: ['id', 'source', 'sub_source', 'lang', 'model', 'label', 'text'],
        num_rows: 288894
    })
})

Hugging Face stores the dataset as a DatasetDict, which here includes train and validation splits. Note: not all datasets follow the same format, so you will need to check.

Most datasets have a card: a home page with a brief description of what the dataset is, how it was generated, and what its intended use is.

You can also view details of the dataset, search it, and see what models were trained on it.

Understanding Dataset Splits

A split is a specific subset of a dataset like train and test. List a dataset’s split names with the get_dataset_split_names() function:

from datasets import load_dataset, get_dataset_split_names

# List the available splits
names = get_dataset_split_names("cornell-movie-review-data/rotten_tomatoes")
print(names)
# ['train', 'validation', 'test']

# Load a specific split
dataset = load_dataset("cornell-movie-review-data/rotten_tomatoes", split="train")

Configurations

Some datasets contain several sub-datasets. These sub-datasets are known as configurations or subsets, and you must explicitly select one when loading the dataset.

Use the get_dataset_config_names() function to retrieve a list of all the possible configurations available to your dataset:

from datasets import get_dataset_config_names

configs = get_dataset_config_names("PolyAI/minds14")
print(configs)
['cs-CZ', 'de-DE', 'en-AU', 'en-GB' ... 'all']
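Then pass the configuration name as the second argument to load_dataset(). A sketch using one of the names above (note that MInDS-14 is an audio dataset, so decoding its audio column may need extra packages such as soundfile):

from datasets import load_dataset

# Load only the en-GB configuration of MInDS-14
minds_en_gb = load_dataset("PolyAI/minds14", "en-GB", split="train")
print(minds_en_gb)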

3️⃣ Exploring the Dataset

After loading the dataset, inspect its contents:

# Check the first sample
print(dataset["train"][0])

Output:

{'title': 'Study of coupling loss ...', 'word_count': 280}

  • id → The ID for the entry
  • source → The source dataset
  • sub_source → The sub-source (e.g., the paper title)
  • lang → The language of the text
  • model → The model used to generate the text
  • label → 1 (AI), 0 (Human)
  • text → The relevant text

# Assuming dataset is already loaded as 'dataset'
# Convert to Pandas DataFrame
train_df = dataset["train"].to_pandas()
val_df = dataset["validation"].to_pandas()

# Find unique values
unique_sources = train_df['source'].unique()
unique_sub_sources = train_df['sub_source'].unique()
unique_models = train_df['model'].unique()

# Display results
print("Unique Sources:", unique_sources)
print("Unique Sub-Sources:", unique_sub_sources)
print("Unique Models:", unique_models)
Unique Sources: ['m4gt' 'mage' 'hc3']
Unique Sub-Sources: ['arxiv' 'wp' 'xsum' 'open_qa' ...]
Unique Models: ['human', 'gpt4' ...]
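Since the splits are now pandas DataFrames, here is a quick sketch of two sanity checks before filtering (assuming label is 1 for AI and 0 for human, as described above):

# How many AI vs. human samples are in the training split?
print(train_df["label"].value_counts())

# Average text length per label
train_df["text_len"] = train_df["text"].str.len()
print(train_df.groupby("label")["text_len"].mean())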

Dataset Analysis

For my task, training a model to classify text as AI-generated or human-written, I need sources that provide a clear distinction between AI and human text while covering diverse writing styles and topics.

Best Sources for AI vs. Human Classification

Look at the sources you have and query the web or an LLM for information about them. For this dataset's contents I found:

  1. HC3 (Human ChatGPT Comparison Corpus)

    • Why? Explicitly designed for AI vs. human comparison. Contains both AI-generated and human-written responses, making it a goldmine for classification.
  2. M4GT (Massively Multi-Source Generative Text Dataset)

    • Why? A dataset designed to benchmark AI text generation, likely containing AI-written samples.

Best Sub-Sources to Include

  • Wikipedia/News-Based Sub-Sources (e.g., 'wikipedia', 'News/Wikipedia', 'arxiv', 'id_newspaper_2018')
    • Why? These contain high-quality, fact-based human-written text, useful as strong human examples.

Final Selection for AI vs. Human Text Classification

  • Good sources: 'hc3', 'm4gt'
  • Best sub-sources: 'open_qa', 'wikipedia', 'News/Wikipedia', 'arxiv', 'id_newspaper_2018'
  • Models: 'gpt4', 'gpt-3.5-turbo', 'gpt_j', 'gpt4o'
  • Language: 'en'
  • Text length: under 2,000 characters


# Define the selected sources and sub-sources
selected_sources = {'hc3', 'm4gt'}
selected_sub_sources = {'wikipedia', 'News/Wikipedia', 'arxiv', 'id_newspaper_2018'}
select_models = {'gpt4', 'gpt-3.5-turbo', 'gpt_j', 'gpt4o'}

# Apply the filter
filtered_train = dataset["train"].filter(lambda x: 
    x["source"] in selected_sources and 
    x["sub_source"] in selected_sub_sources and 
    x["model"] in select_models and 
    x["lang"] == "en" and  # Keep only English text
    len(x["text"]) <= 2000
)

filtered_val = dataset["validation"].filter(lambda x: 
    x["source"] in selected_sources and 
    x["sub_source"] in selected_sub_sources and 
    x["model"] in select_models and 
    x["lang"] == "en" and  # Keep only English text
    len(x["text"]) <= 2000
)

# Check new dataset size
print(f"Filtered train size: {len(filtered_train)}")
print(f"Filtered validation size: {len(filtered_val)}")
Filter: 100%|██████████| 674083/674083 [00:03<00:00, 198703.42 examples/s]
Filter: 100%|██████████| 288894/288894 [00:01<00:00, 198426.43 examples/s]
Filtered train size: 7911
Filtered validation size: 3477

After filtering, we reduced the training split from 674,083 samples to 7,911 examples (and the validation split from 288,894 to 3,477). I kept only text generated by the selected models, English text only, and text under 2,000 characters.


Add back in our Human data

# Filter human-written text (label == 0)
filtered_human_train = dataset["train"].filter(lambda x: 
    x["label"] == 0 and  # Only human-written
    x["lang"] == "en" and  # Keep only English text
    len(x["text"]) <= 2000  # Limit text length
)

filtered_human_val = dataset["validation"].filter(lambda x: 
    x["label"] == 0 and 
    x["lang"] == "en" and 
    len(x["text"]) <= 2000
)

# Find the minimum count to ensure a balanced dataset
num_train_samples = min(len(filtered_train), len(filtered_human_train))
num_val_samples = min(len(filtered_val), len(filtered_human_val))

# Sample human examples to match AI count
balanced_human_train = filtered_human_train.shuffle(seed=42).select(range(num_train_samples))
balanced_human_val = filtered_human_val.shuffle(seed=42).select(range(num_val_samples))

# Check new dataset size
print(f"Filtered human train size: {len(balanced_human_train)}")
print(f"Filtered human validation size: {len(balanced_human_val)}")
Filter: 100%|██████████| 674083/674083 [00:03<00:00, 193552.39 examples/s]
Filter: 100%|██████████| 288894/288894 [00:01<00:00, 199475.22 examples/s]
Filtered human train size: 7911
Filtered human validation size: 3477

Merge the human data back in

from datasets import concatenate_datasets

# Merge AI and Human examples for train and validation
final_train = concatenate_datasets([filtered_train, balanced_human_train]).shuffle(seed=42)
final_val = concatenate_datasets([filtered_val, balanced_human_val]).shuffle(seed=42)

# Check the distribution
print(f"Final train size: {len(final_train)} (AI + Human)")
print(f"Final validation size: {len(final_val)} (AI + Human)")
print("✅ Final datasets created and shuffled")

Save this filtered dataset

from datasets import DatasetDict

final_dataset = DatasetDict({
    "train": final_train,
    "validation": final_val
})

final_dataset.save_to_disk("ai_text_dataset")

This saves a folder ai_text_dataset/ with two directories, train and validation, containing the following files:

ai_text_dataset/
│── dataset_info.json
│── train/
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── state.json
│── validation/
│   ├── dataset.arrow
│   ├── dataset_info.json
│   ├── state.json

Explanation of Each File

  • dataset_info.json → Contains metadata about the dataset (e.g., schema, creation date).
  • train/ and validation/ → Separate folders for the train and validation splits.
  • train/dataset.arrow → Binary file storing the actual dataset (Apache Arrow format).
  • train/dataset_info.json → Metadata for the training split (features, types).
  • train/state.json → Keeps track of the dataset state (useful for resumption and shuffling).
  • validation/dataset.arrow → Binary file storing the validation set.
  • validation/dataset_info.json → Metadata for the validation split.
  • validation/state.json → Tracks dataset modifications.

Understanding .arrow Files

  • The .arrow files contain the dataset in a highly optimized, columnar binary format (Apache Arrow).
  • These files allow fast loading and efficient storage.
  • Unlike CSVs, they retain data types (e.g., string, integer, float) for better performance.
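If you ever need to open one of these files directly, here is a sketch using Dataset.from_file (the exact .arrow file name can differ between datasets versions, so check your folder first):

from datasets import Dataset

# Load a single split straight from its Arrow file
train_split = Dataset.from_file("ai_text_dataset/train/dataset.arrow")
print(train_split)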

Loading the Saved Dataset

Once saved, you can reload the dataset using:

from datasets import load_from_disk

# Load dataset from disk
loaded_dataset = load_from_disk("ai_text_dataset")

# Check dataset structure
print(loaded_dataset)
DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'sub_source', 'lang', 'model', 'label', 'text'],
        num_rows: 15822
    })
    validation: Dataset({
        features: ['id', 'source', 'sub_source', 'lang', 'model', 'label', 'text'],
        num_rows: 6954
    })
})

Using the Hugging Face CLI tool

Install the Hugging Face CLI

If you haven’t installed it yet, run:

pip install huggingface_hub

Then, log in to your Hugging Face account:

huggingface-cli login

You’ll be prompted to enter your Hugging Face API token, which you can get from: Hugging Face Settings.
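If you prefer to stay in Python, the same login can be done with huggingface_hub (a sketch; it will prompt for the token, or you can pass it explicitly):

from huggingface_hub import login

# Prompts for your API token, or use login(token="hf_...")
login()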


Using the Hugging Face CLI to Download a Dataset

The Hugging Face CLI allows you to directly download datasets to your local machine without writing Python code. It can be faster for large datasets.

Find and Download a Dataset

To find a dataset, go to Hugging Face Datasets.

To download a dataset, use:

huggingface-cli download --repo-type dataset 1-800-SHARED-TASKS/COLING-2025-GENAI-MULTI

Note: When downloading a dataset you need to set the --repo-type dataset flag.

This downloads the dataset to your local cache.
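If you would rather download to a specific folder instead of the cache, the CLI supports a --local-dir option (the target path here is just an example):

huggingface-cli download --repo-type dataset 1-800-SHARED-TASKS/COLING-2025-GENAI-MULTI --local-dir ./coling-genai-multi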

How to Review Your Local Cache Directory in Hugging Face CLI

The Hugging Face CLI stores downloaded models and datasets in a cache directory to avoid re-downloading files. You can find this directory using the following commands.

Method 1: Use the Hugging Face CLI Command

Run this in your terminal:

huggingface-cli scan-cache
REPO ID                                     REPO TYPE SIZE ON DISK NB FILES LAST_ACCESSED LAST_MODIFIED REFS LOCAL PATH
1-800-SHARED-TASKS/COLING-2025-GENAI-MULTI  dataset   1.5G         3        2 days ago    2 days ago    main E:\huggingface_models\hub\datasets--1-800-SHARED-TASKS--COLING-2025-GENAI-MULTI
Hello-SimpleAI/HC3                          dataset   146.9M       9        4 days ago    4 days ago    main E:\huggingface_models\hub\datasets--Hello-SimpleAI--HC3
HuggingFaceH4/ultrafeedback_binarized       dataset   424.1M       6        2 weeks ago   2 weeks ago   main E:\huggingface_models\hub\datasets--HuggingFaceH4--ultrafeedback_binarized
NicolaiSivesind/human-vs-machine            dataset   14.4M        5        2 days ago    3 days ago    main E:\huggingface_models\hub\datasets--NicolaiSivesind--human-vs-machine
vblagoje/PubMedQA_instruction               dataset   274.8M       3        2 weeks ago   2 weeks ago   main E:\huggingface_models\hub\datasets--vblagoje--PubMedQA_instruction

Changing your Hugging Face cache location

Use the environment variable HF_HOME to set where you want this cache to be located.
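For example (the paths below are placeholders; set the variable before running any huggingface-cli or datasets code):

# Linux / macOS
export HF_HOME=/path/to/huggingface_cache

# Windows (PowerShell)
$env:HF_HOME = "E:\huggingface_models"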


Using Git to add our new dataset to Hugging Face

Hugging Face repositories function like GitHub repositories. By using Git, you can version datasets, track changes, and collaborate on dataset updates.

I created a repo to store our new dataset:

ai-human

We can clone this locally to work with it:

git clone https://huggingface.co/datasets/ernanhughes/ai-human
cd ai-human

Export CSV files from our final dataset

We can export CSV files from our dataset.

# Convert to Pandas DataFrames
train_df = final_dataset["train"].to_pandas()
validation_df = final_dataset["validation"].to_pandas()

# Save as CSV
train_df.to_csv("train.csv", index=False)
validation_df.to_csv("validation.csv", index=False)

print("✅ CSV files saved: train.csv, validation.csv")

Adding data to our Hugging Face dataset

We can use the Hugging Face website to upload these files, or we can push them with Git from the cloned repo, as sketched below.
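A minimal sketch of the Git workflow, assuming the exported CSVs were copied into the cloned ai-human folder (large CSVs may need Git LFS, which the Hub uses for big files):

cd ai-human

# Track the CSVs with Git LFS if they are large
git lfs track "*.csv"

git add .gitattributes train.csv validation.csv
git commit -m "Add AI vs. human train/validation CSVs"
git push

Once the files are pushed, we can load the dataset in our applications: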

import pandas as pd
from datasets import load_dataset

dataset = load_dataset("ernanhughes/ai-human")
print(dataset)
DatasetDict({
    train: Dataset({
        features: ['id', 'source', 'sub_source', 'lang', 'model', 'label', 'text'],
        num_rows: 15822
    })
    validation: Dataset({
        features: ['id', 'source', 'sub_source', 'lang', 'model', 'label', 'text'],
        num_rows: 6954
    })
})

Updating our Dataset card

Using the Hugging Face website you can add or update the card for your dataset. This consists of adding some special formatting to the README.md file in your repo.
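As a sketch, that special formatting is a YAML metadata block at the top of README.md followed by normal Markdown; the field values below are only examples:

---
language:
- en
license: mit
task_categories:
- text-classification
pretty_name: AI vs. Human Text
---

# AI vs. Human Text

A filtered subset of COLING-2025-GENAI-MULTI for AI-generated vs. human-written text classification.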

References

1️⃣ Hugging Face Datasets Documentation 2️⃣ Pandas Documentation 3️⃣ Apache Arrow Documentation

Conclusion

In this post we learned how to:

  • Load datasets from Hugging Face
  • Process, filter, and refine the data in those datasets
  • Analyze a dataset to determine whether it fits our purpose or not
  • Save a dataset locally
  • Export files from a dataset
  • Create our own datasets on Hugging Face