As a deep learning practitioner, it can be hard not to spend a great deal of time and energy tuning and re-tuning a model. What if I just change the learning rate a little? What if I alter the architecture to implement this cool new feature I read about today? There are near-infinite ways to tweak a model and squeeze out just a little more performance. However, this is often not the best or most efficient way to improve a model’s performance.

More often than not, the greatest obstacle to producing high-performance deep learning models is actually the data itself. In most real-world problems, large volumes of high-quality data are hard to come by. Human annotations are costly to collect at scale (often prohibitively so). Public data sources can be large, but they often suffer from issues with quality, coverage, or recency, and those that don’t tend to address problems that have already been “solved” (think ImageNet). Because of this, many deep learning practitioners spend their time on the one thing they can control: the hyperparameters of their model. But developing new models (or meticulously tuning existing ones) is a siren’s call; even the best model can only be as good as the data it consumes. One tool to remedy the data problem is a process we have dubbed Full-Loop deep learning™.

What is full-loop deep learning?

Full-loop deep learning is an iterative process involving four steps:

  1. Train an initial model with the best data currently available.
  2. Evaluate model performance against ground-truth data.
  3. Deploy the model to a live system that allows for customer feedback.
  4. Use feedback to expand and improve the training set.
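The four steps above can be sketched as a simple control loop. This is a hypothetical illustration only; every function name here is a placeholder, not part of Arturo’s actual pipeline:

```python
def full_loop(train, evaluate, deploy, collect_feedback, data, max_iters=10):
    """Run train -> evaluate -> deploy -> feedback until improvement stalls.

    All four callables are placeholders for the real pipeline stages.
    """
    best_score = float("-inf")
    for _ in range(max_iters):
        model = train(data)                  # 1. train on best available data
        score = evaluate(model)              # 2. evaluate against ground truth
        if score <= best_score:              # stop once gains disappear
            break
        best_score = score
        deploy(model)                        # 3. expose the model to users
        data = data + collect_feedback()     # 4. fold feedback into the training set
    return best_score
```

The loop terminates either after a fixed number of iterations or as soon as an iteration fails to improve on the previous score, mirroring the stopping criterion described later in this post.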

Why full-loop deep learning?

Full-loop deep learning addresses the data-acquisition and quality challenges described above, while also bringing substantial benefits in several important areas:

Domain knowledge transfer

At Arturo we model property features that often require deep expertise to identify properly. Home inspectors and underwriters are highly skilled at spotting problems with a roof, construction materials, and other features, but how can their expert knowledge best be used to train our models? By putting our model predictions directly into our customers’ hands and opening a feedback channel, we give these experts the ability to tell us directly.

Data Reliability

When a high degree of expert knowledge is required to collect a dataset, even the initial annotation task can be a challenge. Experts are costly or unavailable, and training non-expert annotators to an expert standard is difficult. Ensuring high-quality data from such a large and variable source is a challenge on its own, and without the direct input of domain experts it may be impossible to obtain true ground-truth annotations. For example, imagine a task that involves verifying whether a map service is correctly geocoding an address: true ground truth for such a task may require direct familiarity with the property in question.

Balanced Datasets

Full-loop deep learning isn’t a complete solution to class imbalance, but it is a great way to gather additional data at little to no cost. Models that underperform on rare classes benefit from large volumes of incoming data, because that data will inevitably contain more examples of the under-represented classes.


The majority of roofs in the United States are composed of similar, rectangular, geometric shapes. However, a non-trivial minority spans a vast range of unusual shapes and styles, especially outside the United States. Have you seen Naomi Campbell’s spaceship house?

Capital Hill Residence — Image by Zaha Hadid Architects | OKO Group

Many real-world problems follow a similar pattern: a majority of the data can be captured effectively by a simple model, but the dataset has a “long tail” of unique (or under-represented) outlier examples.

An incredibly useful consequence of full-loop deep learning is the gradual inclusion of outliers into our training set. Over time, user discovery of model errors causes these outliers to find their way into our training sets with little additional effort.
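One common way to make use of the rare examples that feedback surfaces is to reweight the training loss by inverse class frequency, so long-tail classes are not drowned out by the majority. This is a generic sketch of that standard technique, not a description of Arturo’s actual training code:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class inversely to its frequency so rare classes
    (the long tail) contribute as much to the loss as common ones.

    weight_c = total / (num_classes * count_c): common classes get
    weights below 1, rare classes get weights above 1.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {c: total / (len(counts) * n) for c, n in counts.items()}
```

These weights can be passed to most loss functions (e.g. a weighted cross-entropy), and they shift automatically as feedback adds new examples of under-represented classes.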

Representative customer data

Being able to improve model performance on data that is particularly relevant to each customer is extremely useful: with each successive iteration, the model performs better on the data that customer cares about. Without full-loop deep learning our models would still improve in the aggregate, but those improvements would not necessarily be tailored to customer needs. While it is possible in principle to collect a custom dataset for each customer, economic constraints make this impractical; full-loop deep learning allows us to acquire representative data without incurring those costs.

The Arturo ‘Full-Loop’



Because all of our models are inherently geospatial, a representative geographic distribution in our training data is essential. We have developed a geospatial sampling technique using freely available census-block and population data. This gives us an initial sample that follows the distribution of homes in the United States without any prior knowledge. Once we have a sufficient set of images and annotations for a given task, we train an initial version of the model.
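The core idea of population-weighted geospatial sampling can be illustrated in a few lines. This is a minimal sketch under the assumption that each census block is reduced to an `(id, population)` pair; Arturo’s actual technique is not published here:

```python
import random

def sample_blocks(blocks, k, seed=0):
    """Sample census blocks with probability proportional to population,
    so the sample follows the distribution of homes without any prior
    knowledge of where customers are.

    `blocks` is a list of (block_id, population) pairs.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    ids = [block_id for block_id, _ in blocks]
    weights = [population for _, population in blocks]
    return rng.choices(ids, weights=weights, k=k)
```

Densely populated blocks are drawn often and empty blocks essentially never, which is exactly the property a home-centric training distribution needs.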


For every new model release, we maintain a separate, high-quality, representative dataset to evaluate its performance. This dataset undergoes multiple passes of auditing and quality control to ensure its accuracy. During evaluation, we compare a new model against an expected performance benchmark, or a revised model against the existing model’s performance. In this step we also use the same (or a similar) dataset to tune the model’s outputs with our internal confidence framework, ensuring that our confidence scores reflect real-world performance as closely as possible.
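A standard way to check that confidence scores “reflect real-world performance” is expected calibration error (ECE): the average gap between a model’s stated confidence and its observed accuracy. The sketch below is a textbook ECE computation, offered as an illustration of the goal rather than as Arturo’s internal confidence framework:

```python
def calibration_error(confidences, correct, bins=10):
    """Expected calibration error over equal-width confidence bins.

    A well-tuned confidence framework drives this toward zero: in each
    bin, the mean confidence should match the observed accuracy.
    """
    bucketed = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * bins), bins - 1)  # clamp conf == 1.0 into last bin
        bucketed[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bucketed:
        if bucket:
            avg_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that claims 95% confidence while being right 75% of the time scores a large ECE; one whose stated confidence matches its hit rate scores near zero.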


Arturo’s deployment infrastructure leverages the state of the art in scalable microservice technology. All of our models are deployed as modular Docker containers on Kubernetes and exposed via a customer-facing API. This infrastructure is the backbone of our entire product, and is used by the web app described in the next section. Because this process is highly iterative, Arturo maintains a continuous integration / continuous deployment (CI/CD) environment that lets us make the latest changes available to our customers with minimal effort.


Feedback comes in multiple forms. Arturo’s web UI allows users to evaluate our model predictions directly; this customer-provided data is stored, aggregated, and assessed for quality before being fed back into our models for further training. This is perhaps the most important part of our feedback loop, because it gives us direct access to our customers’ domain expertise.

  • Arturo also audits 2–3% of all customer requests internally. Human annotators manually validate thousands of inferences to assess model performance on each customer’s data independently. By adding these customer-specific datasets, we can ensure that each of our customers receives the same high quality predictions.
  • Beyond simple data collection, Arturo also analyzes regional patterns in our customers’ request activity to ensure relevant model performance. Aggregate model performance statistics are useful, but not to a customer who insures homes, say, only in the Pacific Northwest. Many property features express regionality, as do the relevant insurance risks. This is incredibly important when assessing, for example, risk due to wildfire.

Once a model goes through this rigorous process, often over a few months, it is ready to start another training and evaluation iteration. As we gather data from our customers and internal auditors, both our training data and model evaluation data can be improved. We continue this cycle of training, evaluation, deployment, and feedback as long as models continue to show noticeable improvement.


Full-loop deep learning™ is essential to creating high-quality models that account for specific customer needs. While there are many complementary ways to improve data quality and capture data efficiently (such as bootstrapping), there is no substitute for full-loop deep learning. We recognize that this approach may see diminishing returns over time, especially from an economic perspective for individual mature models. Even then, it continues to serve as a measurement tool that validates our models are performing at the accuracy levels our customers need, particularly as new customers or geographies are added.

The publicly heralded advances of the AI and deep learning research community may capture our attention, and often our imagination. It is important to remember, though, that a scientific approach and a consistent refinement methodology, whatever advances are incorporated at the model level, will always be critical to delivering high value and utility to the customers and organizations we serve.