Retrieval Enhanced Machine Learning Highlights a Vacuum Deep Learners Have Been Ignoring for Years

Large-scale language modelling has been all the rage in the last couple of years, with the release of incremental updates to the GPT-X architecture, as well as a host of competing models published by (primarily) big-tech research departments. At a recent lunch with a colleague, I expressed my frustration on behalf of researchers who don’t have access to big-tech mega-infrastructure and dollars, but who still hope, from 2020 onwards, to do innovative work in language modelling and AI more generally. Even running these giant NLP models for inference is prohibitively expensive, and comes with substantial infrastructural overhead. This is an alarming trend, and one which has compounded since 2018. By way of contrast, in 2018 at Zalando Research (“medium tech”) we developed Flair embeddings, and trained all RNN models on a single machine with one GPU, yielding highly useful embeddings, which were leveraged to achieve state-of-the-art results on a number of core NLP tasks. I would personally like to see more work seeking to reestablish this type of agile experimental spirit, and fewer researchers having to resort to something of the type “look what I got by querying the OpenAI GPT-X API”.

With a newly emerging trend, namely “retrieval enhanced machine learning” (REML), we may see some modicum of power going back into the hands of the less-than-big-tech innovators, and hopefully the field will be all the better for it.

REML caught my eye recently, ironically after reading about RETRO, another production of the well-known big-tech company Google. RETRO is a semi-parametric, retrieval-enhanced, transformer-based language model. By incorporating a retrieval pass and including retrieved text chunks when making predictive textual continuations, the authors claim a 25x improvement in parameter efficiency over GPT-3 on “the Pile”, a key language modelling benchmark. This is great - these types of parameter efficiencies, coupled with innovations in hardware, optimization and machine learning scientific work, may allow researchers to produce results of similar quality with small and agile infrastructure setups. The RETRO architecture has been well described in other blog posts, so I won’t go into full detail here.
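
That said, the core idea of the retrieval pass is simple enough to sketch. The toy example below illustrates the general mechanism rather than RETRO itself: the hashing encoder and the brute-force cosine search are stand-ins for what would, in a real system, be a frozen text encoder and an approximate nearest-neighbour index.

```python
# Toy sketch of a retrieval pass: embed the input, find nearest neighbours
# in a pre-embedded corpus, hand the neighbours to the language model
# alongside the input. All components here are illustrative stand-ins.
import numpy as np

def embed(text: str, dim: int = 128) -> np.ndarray:
    """Toy stand-in for a frozen text encoder: hash tokens into a vector."""
    v = np.zeros(dim)
    for token in text.split():
        v[hash(token) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

corpus = [
    "the cat sat on the mat",
    "retrieval enhanced language models query a datastore",
    "transformers process text as sequences of tokens",
]
corpus_vecs = np.stack([embed(c) for c in corpus])  # precomputed offline

def retrieve(query: str, k: int = 2) -> list[str]:
    """Brute-force cosine search; a real system would use an ANN index."""
    scores = corpus_vecs @ embed(query)
    return [corpus[i] for i in np.argsort(-scores)[:k]]

prompt = "language models query a datastore"
neighbours = retrieve(prompt)
# A RETRO-style model then conditions on both the retrieved neighbours
# and the prompt when predicting the continuation.
print(neighbours)
```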

If this trend continues, we will begin to see that one’s machine learning model is no longer simply a deployed PyTorch or TensorFlow model, but rather a store of features, potentially with a fast similarity search index on top of those features, together with the deployed PyTorch or TensorFlow model. In terms of parameter efficiency, we are streets ahead of where we were with GPT-3 and similar models. In serving the model, we could potentially even opt for a single-GPU machine to run forward passes. The difficulty, however, lies in incorporating the training data, in the form of a similarity search index or feature store, and serving this alongside the model. Gone, with these new models, are the days of cutting the training data into chunks, depositing these willy-nilly in some ad-hoc file system, and then, for all intents and purposes, throwing them away when it comes to serving the model. Quite the opposite is true - we need at the very least the following (a minimal sketch of these pieces follows the list):

  • A clear logging and tracking system of which data went into training the model
  • A system for converting training data efficiently into vectorial features
  • A system for converting these vectorial features into a similarity search index
  • A connector, allowing the model to talk to the retrieved items in a memory-efficient manner
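
Concretely, here is a minimal sketch of how these four pieces might fit together, using numpy and FAISS as one possible (assumed) stack; the names, the random stand-in features and the structure are illustrative, not a reference to any existing system.

```python
# Minimal sketch of the four required pieces, wired together.
import numpy as np
import faiss  # one possible similarity search library; any ANN index would do

# 1. Logging/tracking: record exactly which documents went into training.
training_docs = {0: "doc_a.txt", 1: "doc_b.txt", 2: "doc_c.txt"}

# 2. Feature conversion: in practice a frozen encoder; random vectors here.
dim = 64
features = np.random.rand(len(training_docs), dim).astype("float32")

# 3. Index construction: exact L2 search over the training features.
index = faiss.IndexFlatL2(dim)
index.add(features)

# 4. Connector: map a query vector back to the tracked training items,
#    without loading the whole training set into memory.
def nearest_training_items(query_vec: np.ndarray, k: int = 2):
    _, ids = index.search(query_vec.reshape(1, -1).astype("float32"), k)
    return [training_docs[int(i)] for i in ids[0]]

print(nearest_training_items(np.random.rand(dim)))
```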

Optionally, we might also want the following (sketched after the list):

  • A system for selecting a different set of retrieval data points for serving than was used for training (“zero-shot” style transfer learning)
  • A system which is able to easily switch between differently trained models, maintaining feature sets for all models
  • A system able to store not just the output of models producing vectors, but also other types of output, for instance generated text
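
The first two of these optional points suggest a small registry in which the model’s weights stay fixed but the retrieval store it points at is swappable. The sketch below is entirely hypothetical - every name in it is made up for illustration:

```python
# Hypothetical sketch: one model, several swappable retrieval stores.
from dataclasses import dataclass, field

@dataclass
class RetrievalStore:
    """One retrieval dataset; a real system would also hold an ANN index
    over vectorial features of `items`."""
    items: list  # raw items (text chunks, images, ...)

@dataclass
class RetrievalEnhancedModel:
    model_id: str
    stores: dict = field(default_factory=dict)  # name -> RetrievalStore
    active: str = "train"

    def use_store(self, name: str):
        """Zero-shot style transfer: serve against a different dataset
        than the one the model was trained to retrieve from."""
        self.active = name

m = RetrievalEnhancedModel("retro-small")
m.stores["train"] = RetrievalStore(items=["training chunks ..."])
m.stores["prod_2022"] = RetrievalStore(items=["fresh production chunks ..."])
m.use_store("prod_2022")  # same weights, new retrieval data
```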

Many new possibilities present themselves once we move to the paradigm of separating the retrieval dataset from the model, owing to the freedom we have in choosing each of these and in relating the one to the other. At the moment, however, we don’t have production-ready systems for serving and managing models in this way; moreover, matters are made more difficult by the fact that there is no longer the traditional split between training and production - these distinct modes are beginning to merge. Current tools for deep learning tend to focus on one mode or the other - PyTorch, TensorFlow, and co. for the model building and training part, and proprietary services such as Amazon SageMaker and Google Datalab, as well as the full gamut of open-source “MLOps” tooling, on the production side.

Already in 2018, in building Aleph Search (an e-commerce enterprise search and navigation system), my wish for the ecosystem was a closer interplay between the data and the models. Our concerns at that time were more primitive than deploying REML. We already needed to maintain feature stores for production, due to the semantic search aspect of the software. However, these feature stores also became useful in a multitude of other contexts from which the software profited - transfer learning, caching of intermediate values for models sharing layers, deduplication and so forth. The feature stores were closely associated with the data - a one-to-many relationship was maintained between database entities and feature sets. In addition, subsets of data from the database (essentially the targets of database queries) were associated in particular ways with certain models, due to, for instance, the restricted or specialized scope of certain models. Given all this, the best tooling for the job, had it existed, would have offered more than simply detailed logging of what we had done to get a model; rather, a close symbiosis of preparation, models, data and deployments. Ideally, this would have been possible without resorting to diverse sets of tooling from a variety of enterprises and organizations for each part, i.e. not Apache Spark + Apache Airflow + Amazon EMR + S3 + ElasticFS + PyTorch + SageMaker + … (you get the idea).
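
To make the one-to-many relationship concrete, here is a hypothetical sketch using sqlite3; the schema and all names are invented for illustration and are not Aleph Search’s actual design:

```python
# One database entity maps to many feature sets: one per model that embedded it.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE feature (
    product_id INTEGER REFERENCES product(id),
    model_id   TEXT,     -- which model produced this feature set
    vector     BLOB      -- serialized feature vector (placeholder here)
);
""")
db.execute("INSERT INTO product VALUES (1, 'red sneaker')")
db.execute("INSERT INTO feature VALUES (1, 'semantic-search-v1', ?)", (b"...",))
db.execute("INSERT INTO feature VALUES (1, 'dedup-v2', ?)", (b"...",))

# Serving a given model means selecting its own feature set for the query scope:
rows = db.execute(
    "SELECT product_id FROM feature WHERE model_id = ?", ("semantic-search-v1",)
).fetchall()
print(rows)
```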

The point I’m making is that we should stop thinking of data and models as two separate worlds, and stop reinventing the wheel when it comes to marrying these two worlds in our projects.

Stay tuned: I intend to review potential solutions in this space, along with ideas on how I would use them to achieve a sleek, agile and easy-to-manage interplay between data and models.

Written on October 21, 2022