Machine learning (ML) applications, and deep learning in particular, have sparked the development of specialized software frameworks and hardware accelerators. Frameworks like PyTorch and TensorFlow offer clean abstractions for developing and running models efficiently at scale. Custom accelerators, such as TPUs and NPUs, dramatically increase the throughput and performance per watt of ML computations. Meanwhile, platforms like MLflow and TFX streamline deploying models in production and monitoring ML experiments.
However, one important part of today's machine learning system stack has received far less attention and specialization for ML: how we store and preprocess training data. Yet the way data is stored and ingested into ML frameworks can significantly impact end-to-end training performance and resource costs. A recent study of millions of ML training workloads at Google over a one-month period revealed that jobs spend, on average, 30% of their end-to-end training compute time on data ingestion and on-the-fly preprocessing.
In this blog post, we outline the data storage and preprocessing requirements of large-scale ML training applications. We start by discussing how training data is stored and consumed today. We then motivate opportunities to design new storage abstractions and data processing services with first-class support for the unique requirements of ML applications.
How is training data stored and consumed today?
Raw training data, such as images, audio, and text, is commonly stored in cloud object stores (e.g., Amazon S3) as they provide low-cost, highly available, and highly scalable storage. Training datasets can be bounded (e.g., reference datasets such as ImageNet) or unbounded (e.g., logs from production services). Data undergoes both offline and online preprocessing before it is ingested for model training.
Offline data preprocessing typically involves:
- Extracting features from raw data.
- Cleaning and validating input data.
- Converting input data to binary formats, such as Avro, Parquet, or TFRecord, to improve data ingestion throughput to ML frameworks.
- Applying a limited set of data transformations, such as data normalization (see the sketch after this list).
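To make the offline stage concrete, here is a minimal sketch of such a pipeline using Apache Beam (one of the batch frameworks discussed below): it reads raw JSON records from an object store, extracts and validates features, and writes the result in a binary columnar format (Parquet). The bucket paths, field names, and the extract_features helper are hypothetical placeholders.

```python
import json

import apache_beam as beam
import pyarrow as pa
from apache_beam.io.parquetio import WriteToParquet

# Hypothetical schema for the preprocessed, binary-format output.
SCHEMA = pa.schema([("label", pa.int64()),
                    ("features", pa.list_(pa.float32()))])


def extract_features(line):
    # Hypothetical feature extraction: parse a raw JSON record and keep
    # only the fields needed for training.
    record = json.loads(line)
    return {"label": record.get("label"),
            "features": [float(v) for v in record.get("features", [])]}


with beam.Pipeline() as pipeline:
    (pipeline
     | "ReadRawData" >> beam.io.ReadFromText("gs://my-bucket/raw/*.json")
     | "ExtractFeatures" >> beam.Map(extract_features)
     | "DropInvalid" >> beam.Filter(lambda r: r["label"] is not None)
     | "WriteParquet" >> WriteToParquet("gs://my-bucket/preprocessed/part",
                                        schema=SCHEMA))
```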
Online data preprocessing executes as part of the input pipeline of ML training jobs. The input pipeline is responsible for:
- Extracting (reading) data from storage.
- Transforming data on-the-fly, such as shuffling, batching, and augmenting data. Many data transformations, such as data augmentations, must be performed online because they multiply the size of the original dataset, making it prohibitive to store their outputs in intermediate files.
- Loading data to the device that executes training computations (typically an accelerator).
While offline preprocessing is performed with batch cluster frameworks (e.g., Apache Beam and Apache Spark), online preprocessing executes with frameworks such as tf.data, PyTorch DataLoader, or NVIDIA DALI, which provide a streaming data model. Online preprocessing typically executes on the CPU resources of model training nodes.
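As a concrete illustration of this extract-transform-load structure, the sketch below uses tf.data to read TFRecord files, decode and augment examples on the fly, and batch and prefetch them for the accelerator. The file pattern and the particular parsing and augmentation choices are hypothetical.

```python
import tensorflow as tf

def parse_and_augment(serialized_example):
    # Hypothetical parse + augment step: decode a serialized record and
    # apply a random, on-the-fly transformation.
    features = tf.io.parse_single_example(
        serialized_example,
        {"image": tf.io.FixedLenFeature([], tf.string),
         "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.resize(image, [224, 224])
    image = tf.image.random_flip_left_right(image)  # online augmentation
    return image, features["label"]

# Extract: read serialized examples from files in object storage.
filenames = tf.io.gfile.glob("gs://my-bucket/train/*.tfrecord")
dataset = tf.data.TFRecordDataset(filenames)

# Transform: shuffle, then decode and augment in parallel, then batch.
dataset = (dataset
           .shuffle(buffer_size=10_000)
           .map(parse_and_augment, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(128)
           # Load: prefetch so CPU-side preprocessing overlaps with
           # accelerator computation.
           .prefetch(tf.data.AUTOTUNE))
```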
In addition to ingesting data into ML frameworks to train models, it is often desirable to run traditional analytics (i.e., SQL) queries on the same training data. Unfortunately, the storage systems (or "data lakes") that commonly host training datasets today lack support for running analytics queries directly on raw training data. To run SQL queries, a separate ETL process must first ingest the training data into a "data warehouse."
Requirements for ML data storage and online preprocessing
Machine learning applications present several unique requirements for data storage and preprocessing systems. We outline some key requirements and challenges below.
Feeding accelerators efficiently with larger-than-memory datasets: It is increasingly common for input data in production workloads to exceed the size of main memory. The input pipeline must therefore ingest and preprocess data with high parallelism, software pipelining, and prefetching to hide the latency of I/O and on-the-fly data transformations. Feeding input data at sufficiently high throughput to saturate accelerator resources is increasingly challenging as the FLOPS capabilities of accelerators increase. The high cost of accelerators compared to their CPU hosts makes it particularly important to ensure that accelerators operate at high utilization.
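As one illustration of these techniques, PyTorch's DataLoader exposes knobs for parallel workers, pinned host memory, and prefetching that help hide I/O and preprocessing latency; the dataset class below is a synthetic stand-in for a real decode-and-augment workload.

```python
import torch
from torch.utils.data import DataLoader, Dataset

class SyntheticImageDataset(Dataset):
    # Hypothetical stand-in for a dataset that reads, decodes, and
    # augments examples from remote storage.
    def __len__(self):
        return 1_000_000

    def __getitem__(self, index):
        image = torch.rand(3, 224, 224)              # placeholder decode + augment
        label = torch.randint(0, 1000, ()).item()    # placeholder label
        return image, label

loader = DataLoader(
    SyntheticImageDataset(),
    batch_size=128,
    shuffle=True,
    num_workers=8,            # parallel CPU workers hide I/O and transform latency
    pin_memory=True,          # pinned host buffers speed up host-to-device copies
    prefetch_factor=4,        # batches each worker prepares ahead of the trainer
    persistent_workers=True,  # keep workers alive across epochs
)
```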
Elastically scaling data preprocessing resources: The ratio of compute resources required for online data preprocessing and model training computations varies across ML jobs. However, CPU and accelerator resources are typically allocated in fixed ratios, as today’s ML frameworks run online data preprocessing on the local CPU resources of model training nodes. To avoid imbalanced resource utilization and idle accelerators, it is useful to disaggregate online data preprocessing from model training so that CPU and accelerator resources can be scaled independently. Furthermore, it can be beneficial to split input pipelines such that some online data transformations (e.g., filtering) can execute close to storage while others (e.g., data augmentations) execute close to accelerators. Ideally, resource assignment and scaling should be abstracted from users and performed automatically by the framework.
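One existing mechanism in this spirit is the tf.data service, which lets a job offload its input pipeline to a separate pool of preprocessing workers that can be scaled independently of the trainer. A minimal sketch follows, assuming a dispatcher is already running at a hypothetical address.

```python
import tensorflow as tf

# Build the input pipeline as usual (see the earlier sketch) ...
dataset = tf.data.TFRecordDataset(
    tf.io.gfile.glob("gs://my-bucket/train/*.tfrecord"))
dataset = dataset.shuffle(10_000).batch(128)

# ... then hand its execution off to a disaggregated pool of tf.data
# service workers instead of the trainer's local CPUs.
dataset = dataset.apply(
    tf.data.experimental.service.distribute(
        processing_mode="parallel_epochs",  # each worker processes the full dataset
        service="grpc://dispatcher.example.com:5000"))  # hypothetical dispatcher

dataset = dataset.prefetch(tf.data.AUTOTUNE)
```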
Data ordering and randomness guarantees: The convergence of ML models is sensitive to the order in which data is received during training. Hence, the input pipeline should provide data ordering guarantees. Training algorithms typically make multiple passes (called epochs) over input examples. To help models converge quickly, each epoch should use all elements in the dataset exactly once, and elements should appear in a different random order across epochs. Data augmentations applied on-the-fly in the input pipeline should generate new random distortions of the data each epoch, rather than reusing the same distortions across epochs. Finally, to aid with debugging model training, the input pipeline should be able to reproduce a given pseudorandom data order deterministically from a seed. To provide ordering guarantees in the presence of job preemptions, data processing should support checkpointing.
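For example, tf.data can reshuffle data into a different random order on every epoch while remaining reproducible under a fixed seed; the tiny dataset, buffer size, and seed below are arbitrary.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# A different random order on every epoch, yet deterministic and
# reproducible across runs given the same seed (useful for debugging).
dataset = dataset.shuffle(buffer_size=10, seed=42,
                          reshuffle_each_iteration=True)

for epoch in range(2):
    # Each pass visits every element exactly once, in a new seed-determined order.
    print(list(dataset.as_numpy_iterator()))
```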
Data freshness: ML models that serve critical requests in dynamic environments should be continually re-trained with fresh data. New training data must first be cleaned and preprocessed offline. As offline and online preprocessing execute in separate frameworks, ensuring that fresh data is readily available for training can be challenging.
Data versioning and provenance: A storage system for ML training data should support data versioning as models may train with different snapshots of the training dataset. The storage layer can help track data provenance to aid with ML debugging and answer questions such as which data was used to train a particular model. Supporting deterministic model training is becoming increasingly important as models become more complex and expensive to train from scratch.
Research opportunities
Rather than simply combining the disparate systems that exist today to meet the unique requirements of ML applications, there is an opportunity to rethink how we design the data management ecosystem with first-class support for ML.
The challenges that ML users face today in implementing performant data ingestion pipelines motivate the design of cloud services which can automatically optimize input pipeline execution and scale resources according to job demands. As input pipelines are frequently re-executed in machine learning training workflows (e.g., during hyperparameter tuning and model search), an input data processing service can automatically cache preprocessed datasets to avoid redundant computation and improve training throughput. Identifying which parts of ML jobs’ input pipelines are optimal to cache/reuse vs. recompute within or across jobs in a multi-tenant service is non-trivial. A cloud service for input data processing can also make recommendations that affect model convergence, such as the choice of source datasets, data augmentations, and data ordering. Another open question is how to leverage hardware acceleration for data preprocessing operations and how to push the execution of data transformations that reduce the volume of data (e.g., filtering and aggregation) closer to storage.
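As a small-scale illustration of the caching idea within a single job, tf.data can materialize the output of deterministic preprocessing steps to a file and reuse it across epochs, while random transformations stay after the cache; generalizing this across jobs and tenants in a multi-tenant service is the open question raised above. The paths and helper functions below are hypothetical.

```python
import tensorflow as tf

def parse_example(serialized):
    # Hypothetical deterministic decoding step that is expensive to recompute.
    return tf.io.parse_single_example(
        serialized, {"x": tf.io.FixedLenFeature([16], tf.float32)})["x"]

def augment(x):
    # Hypothetical random augmentation; it stays after the cache so every
    # epoch still sees fresh random distortions.
    return x + tf.random.uniform(tf.shape(x), maxval=0.1)

dataset = (tf.data.TFRecordDataset(
               tf.io.gfile.glob("gs://my-bucket/train/*.tfrecord"))
           .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
           # Materialize the deterministic prefix to disk on the first pass
           # and reuse it on later epochs instead of recomputing it.
           .cache("/tmp/preprocessed_cache")
           .shuffle(10_000)
           .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(128)
           .prefetch(tf.data.AUTOTUNE))
```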
One promising direction for ML storage system design is a unified storage layer for training data with different metadata and API layers on top, which enable diverse applications — ranging from SQL queries to deep learning — to run directly over raw data. This architectural pattern was recently proposed by Databricks and termed the Lakehouse. There are many open questions, including how to optimize the design of data formats and metadata layers. In addition to optimizing the data layout for high throughput access, metadata layers can track data provenance to aid with reproducible model training.
About the Author: Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Before joining ETH, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University. Her dissertation focused on the design and implementation of fast, elastic storage for cloud computing.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.