Skip to content

Data Management

Data is the fuel. How you manage it determines how fast you go.

Type: Build Language: Python Prerequisites: Phase 0, Lesson 01 Time: ~45 minutes

Learning Objectives

  • Load, stream, and cache datasets using the Hugging Face datasets library
  • Convert between CSV, JSON, Parquet, and Arrow formats and explain their tradeoffs
  • Create reproducible train/validation/test splits with fixed random seeds
  • Manage large model and dataset files using .gitignore, Git LFS, or DVC

The Problem

Every AI project starts with data. You need to find datasets, download them, convert between formats, split them for training and evaluation, and version them so experiments are reproducible. Doing this manually every time is slow and error-prone. You need a repeatable workflow.

The Concept

mermaid
graph TD
    A["Hugging Face Hub"] --> B["datasets library"]
    B --> C["Load / Stream"]
    C --> D["Local Cache<br/>~/.cache/huggingface/"]
    B --> E["Format Conversion<br/>CSV, JSON, Parquet, Arrow"]
    E --> F["Data Splits<br/>train / val / test"]
    F --> G["Your Training Pipeline"]

The Hugging Face datasets library is the standard way to load data for AI work. It handles downloading, caching, format conversion, and streaming out of the box.

Build It

Step 1: Install the datasets library

bash
pip install datasets huggingface_hub

Step 2: Load a dataset

python
from datasets import load_dataset

dataset = load_dataset("imdb")
print(dataset)
print(dataset["train"][0])

This downloads the IMDB movie review dataset. After the first download, it loads from cache at ~/.cache/huggingface/datasets/.

Step 3: Stream large datasets

Some datasets are too large to fit on disk. Streaming loads them row by row without downloading the full thing.

python
dataset = load_dataset("wikimedia/wikipedia", "20220301.en", split="train", streaming=True)

for i, example in enumerate(dataset):
    print(example["title"])
    if i >= 4:
        break

Streaming gives you an IterableDataset. You process rows as they arrive. Memory usage stays constant regardless of dataset size.

Step 4: Dataset formats

The datasets library uses Apache Arrow under the hood. You can convert to other formats depending on what your pipeline needs.

python
dataset = load_dataset("imdb", split="train")

dataset.to_csv("imdb_train.csv")
dataset.to_json("imdb_train.json")
dataset.to_parquet("imdb_train.parquet")

Format comparison:

FormatSizeRead SpeedBest For
CSVLargeSlowHuman readability, spreadsheets
JSONLargeSlowAPIs, nested data
ParquetSmallFastAnalytics, columnar queries
ArrowSmallFastestIn-memory processing (what datasets uses internally)

For AI work, Parquet is the best storage format. Arrow is what you work with in memory. CSV and JSON are for interchange.

Step 5: Data splits

Every ML project needs three splits:

  • Train: The model learns from this (typically 80%)
  • Validation: You check progress during training (typically 10%)
  • Test: Final evaluation after training is done (typically 10%)

Some datasets come pre-split. When they don't, split them yourself:

python
dataset = load_dataset("imdb", split="train")

split = dataset.train_test_split(test_size=0.2, seed=42)
train_val = split["train"].train_test_split(test_size=0.125, seed=42)

train_ds = train_val["train"]
val_ds = train_val["test"]
test_ds = split["test"]

print(f"Train: {len(train_ds)}, Val: {len(val_ds)}, Test: {len(test_ds)}")

Always set a seed for reproducibility. The same seed produces the same split every time.

Step 6: Download and cache models

Models are large files. The huggingface_hub library handles downloading and caching.

python
from huggingface_hub import hf_hub_download, snapshot_download

model_path = hf_hub_download(
    repo_id="sentence-transformers/all-MiniLM-L6-v2",
    filename="config.json"
)
print(f"Cached at: {model_path}")

model_dir = snapshot_download("sentence-transformers/all-MiniLM-L6-v2")
print(f"Full model at: {model_dir}")

Models cache to ~/.cache/huggingface/hub/. Once downloaded, they load instantly on subsequent runs.

Step 7: Handle large files

Model weights and large datasets should not go into git. Three options:

Option A: .gitignore (simplest)

*.bin
*.safetensors
*.pt
*.onnx
data/*.parquet
data/*.csv
models/

Option B: Git LFS (track large files in git)

bash
git lfs install
git lfs track "*.bin"
git lfs track "*.safetensors"
git add .gitattributes

Git LFS stores pointers in your repo and the actual files on a separate server. GitHub gives you 1 GB free.

Option C: DVC (data version control)

bash
pip install dvc
dvc init
dvc add data/training_set.parquet
git add data/training_set.parquet.dvc data/.gitignore
git commit -m "Track training data with DVC"

DVC creates small .dvc files that point to your data. The data itself lives in S3, GCS, or another remote storage backend.

ApproachComplexityBest For
.gitignoreLowPersonal projects, downloaded data you can re-fetch
Git LFSMediumTeams sharing model weights via git
DVCHighReproducible experiments, large datasets, teams

For this course, .gitignore is enough. Use DVC when you need to reproduce exact experiments across machines.

Step 8: Storage patterns

Local storage works for datasets under ~10 GB. The HF cache handles this automatically.

Cloud storage is for anything larger or shared across machines:

python
import os

local_path = os.path.expanduser("~/.cache/huggingface/datasets/")

# s3_path = "s3://my-bucket/datasets/"
# gcs_path = "gs://my-bucket/datasets/"

DVC integrates with S3 and GCS directly:

bash
dvc remote add -d myremote s3://my-bucket/dvc-store
dvc push

For this course, local storage is sufficient. Cloud storage becomes relevant when you fine-tune on remote GPU instances.

Datasets Used in This Course

DatasetLessonsSizeWhat It Teaches
IMDBTokenization, classification84 MBText classification basics
WikiTextLanguage modeling181 MBNext-token prediction
SQuADQA systems35 MBQuestion answering, spans
Common Crawl (subset)EmbeddingsVariesLarge-scale text processing
MNISTVision basics21 MBImage classification fundamentals
COCO (subset)MultimodalVariesImage-text pairs

You do not need to download all of these now. Each lesson specifies what it needs.

Use It

Run the utility script to verify everything works:

bash
python code/data_utils.py

This downloads a small dataset, converts it, splits it, and prints a summary.

Ship It

This lesson produces:

  • code/data_utils.py - reusable data loading and caching utility
  • outputs/prompt-data-helper.md - prompt for finding the right dataset for a task

Exercises

  1. Load the glue dataset with the mrpc config and inspect the first 5 examples
  2. Stream the c4 dataset and count how many examples you can process in 10 seconds
  3. Convert a dataset to Parquet and compare the file size to CSV
  4. Create a 70/15/15 train/val/test split with a fixed seed and verify the sizes

Key Terms

TermWhat people sayWhat it actually means
Dataset split"Training data"A named subset (train/val/test) used at different stages of the ML lifecycle
Streaming"Load it lazily"Processing data row by row from a remote source without downloading the full dataset
Parquet"Compressed CSV"A columnar file format optimized for analytical queries and storage efficiency
Arrow"Fast dataframe"An in-memory columnar format used internally by the datasets library for zero-copy reads
Git LFS"Git for big files"An extension that stores large files outside the git repo while keeping pointers in version control
DVC"Git for data"A version control system for datasets and models that integrates with cloud storage
Cache"Already downloaded"A local copy of previously fetched data, stored at ~/.cache/huggingface/ by default