Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Dataset Specification

A Dataset resource is a metadata reference to a data source. The actual data lives on HuggingFace Hub or the local filesystem. On import, Studio extracts feature column names and stores them as a JSON string, enabling downstream prompt template construction without reloading the data.

Source: ft/datasets.py, ft/db/model.py

Supported Types

TypeSourceIdentifier FieldFeature Extraction Method
huggingfaceHuggingFace Hubhuggingface_nameload_dataset_builder() -> info.features.keys()
projectLocal HF-compatible directorylocationNot extracted
project_csvLocal CSV filelocationRead header row via csv.reader
project_jsonLocal JSON filelocationRead first object keys via json.load
project_jsonlLocal JSONL filelocationRead first line keys via json.loads

ORM Schema

class Dataset(Base, MappedProtobuf, MappedDict):
    __tablename__ = "datasets"
    id = Column(String, primary_key=True)    # UUID
    type = Column(String)                     # DatasetType enum value
    name = Column(String)                     # Display name
    description = Column(Text)                # Auto-populated for HF datasets
    huggingface_name = Column(String)         # HF Hub identifier (HF type only)
    location = Column(Text)                   # Filesystem path (project types only)
    features = Column(Text)                   # JSON-serialized list of column names

Import Validation

add_dataset() dispatches to type-specific validators before creating a record:

All types:

  • type field is required.
  • Duplicate detection by name (local types) or huggingface_name (HF type).

HuggingFace (_validate_huggingface_dataset_request):

  • huggingface_name field required and non-blank.
  • Validates dataset exists on Hub via load_dataset_builder().
  • Extracts dataset_info.features.keys() for feature list.
  • Stores dataset_info.description as the description.

CSV (_validate_local_csv_dataset_request):

  • location field required, must end with .csv.
  • name field required and non-blank.
  • Reads header row with csv.reader(file) / next(reader) for features.

JSON (_validate_local_json_dataset_request):

  • location field required, must end with .json.
  • Reads first object in the JSON array for feature keys.

JSONL (_validate_local_jsonl_dataset_request):

  • location field required, must end with .jsonl.
  • Reads first line, parses as JSON, extracts keys for features.

Feature Extraction Functions

extract_features_from_csv(location)   # csv.reader -> next(reader)
extract_features_from_json(location)  # json.load -> next(iter(data)).keys()
extract_features_from_jsonl(location) # json.loads(first_line).keys()

Features are stored as json.dumps(features) in the features column. Downstream consumers (prompt templates, training scripts) parse this back with json.loads().

Loading into Memory

load_dataset_into_memory(dataset: DatasetMetadata) normalizes all dataset types into a HuggingFace DatasetDict with at minimum a train key:

TypeLoad MethodWrapping
huggingfacedatasets.load_dataset(huggingface_name)Already a DatasetDict
project_csvdatasets.load_dataset('csv', data_files=location)Already a DatasetDict
project_jsondatasets.Dataset.from_json(location)Wrapped in DatasetDict({'train': ds})
project_jsonldatasets.Dataset.from_json(location)Wrapped in DatasetDict({'train': ds})

If the loaded object is a Dataset (not DatasetDict), it is wrapped: DatasetDict({'train': ds}).

Removal

remove_dataset() deletes the Dataset record. If request.remove_prompts is set, also deletes all Prompt records with matching dataset_id via cascading delete.

Protobuf Message

DatasetMetadata fields: id, type, name, description, huggingface_name, location, features.