Data-driven business decisions have increasingly become the norm across market verticals and businesses of every size. In insurance specifically, according to a survey conducted by CoreLogic, 90% of surveyed executives in the property and casualty industry agree that data-driven decision-making is an important strategic factor in operating successfully.

But not all information is created equal. Public records data, for instance, may have been accurate and reliable when it was collected but may be outdated today. User-contributed data, such as answers to a quoting questionnaire, may be current but unreliable, because homeowners often don’t know how to answer the questions being asked of them and give vague or inconsistent answers.

Today’s big data environment is facing both emerging and foundational challenges across the entire artificial intelligence (AI) modeling lifecycle, from properly curating and labeling input data, through model development and tuning, to tracking model confidence and extracting the most accurate insights for the customers. This entire lifecycle is predicated on quality ground truth data and valid labels. 

In short, the problem of when to trust your data is not new for any AI business. And while the data labeling and large-scale validation and verification (V&V) processes are usually the largest drivers of model accuracy and quality, they remain time-consuming, expensive, and manual.

Furthermore, the V&V process only gives you aggregate statistics on data quality. Since our customers typically make decisions one property at a time, the usefulness of the V&V data can only be fully realized in the broader context of model performance for one data point at a time. This raises the question: How can we know the quality of a model prediction for each new input, when all we have are aggregated model statistics over validation data sets?


This concept of when to trust your data is covered at length in many publications on machine learning in different fields. Data reliability is a major requirement in areas like healthcare, where diagnoses and treatments of patients based upon incorrect data could lead to serious injury or even loss of life. It could be argued that in healthcare, model interpretability and a robust measure of trust are more valuable than sheer model performance. Wouldn’t you like to know why your AI doctor ordered that open heart surgery for you?

In other industries, there is an acceptance of the fact that a measure of trust isn’t available for most data sources, yet the data is still considered relevant and valuable. However, the unreliability factor decreases the data’s predictive power drastically. At Arturo we have conquered this unreliability factor by developing a framework for generating a measure of trust for each metric we provide. We have branded this as “confidence”.

Arturo’s Confidence Score Framework

Arturo has recognized that for our product to be most impactful and useful to our customers, we need to provide a level of trust for each data point returned. We offer confidence in one of two ways depending on the type of data we are returning:

Probability (Discrete / Classification Data)

This is reported as a single number between 0 and 99 that represents the probability that the classification is correct. This is only available for models that produce discrete classification data (one of a fixed number of choices is selected). This isn’t a true probability of the kind one would learn about in an introductory statistics course. Instead, it is an estimate of that value.

For example, given an image of a roof that we analyze and predict the roof material (a discrete class from several classes) as…


we would also return our confidence on the prediction as…


95% Confidence Interval (Continuous / Regression Data)

This is reported as two numbers: a lower bound and an upper bound for an interval likely to contain the ground truth. Just like the probability we provide, this confidence interval isn’t the familiar one that is taught in most undergraduate statistics courses. It is simply an interval that you can count on the ground truth falling within most of the time (approximately 95%).

It can provide context on the direction of the error and the magnitude of the error. For example, you could tell that if we are wrong about the roof area, we are much more likely to be overestimating the area as opposed to underestimating it.
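As a rough illustration of reading direction from an interval, the sketch below compares how much room the interval leaves on each side of a point estimate. It assumes a point estimate is available alongside the bounds; the function name and the roof-area numbers are hypothetical and not part of Arturo’s product.

```python
def likely_error_direction(estimate, lower, upper):
    """Guess which way an estimate is more likely to be wrong.

    Assumption: if the interval extends further below the estimate than
    above it, the ground truth is more likely to be lower than the
    estimate, i.e. the estimate is more likely an overestimate.
    """
    if not lower <= estimate <= upper:
        raise ValueError("estimate must lie inside the interval")
    room_below = estimate - lower
    room_above = upper - estimate
    if room_below > room_above:
        return "overestimate"
    if room_above > room_below:
        return "underestimate"
    return "symmetric"


# Hypothetical roof area in square feet with an asymmetric interval.
direction = likely_error_direction(estimate=2000.0, lower=1700.0, upper=2050.0)
print(direction)
```

The magnitude of the possible error can be read the same way: here the interval allows for an error of up to 300 square feet below the estimate but only 50 above it.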



Confidence can add a significant amount of value to any data used for decision making. For example, an insurance company wants to predict more accurately the risk of a home insurance policy, so it purchases data that informs them on whether or not their policyholders own pools. Let’s assume that this data was generated by aggregating local permits issued before the pool was constructed.

To get an idea of how much to trust the data, the insurance company could manually check individual properties to see how accurate the data is (there are alternatives, but the end result is the same). This would result in a single number that describes how much to trust this metric. However, there are two shortcomings to this approach. First, it is possible that this accuracy number is only correct in specific municipalities due to differing permitting laws. That is, the measure of accuracy might be incorrect when this data is applied outside the areas that were sampled to assess accuracy. Second, this single estimate of accuracy does not tell us when to rely on the information. We can either trust the data all of the time, or none of the time. In most cases, having the data along with a single accuracy estimate is more advantageous than not having the data at all. However, a superior model could be developed if we could determine when to trust the data.

This problem is not specific to the insurance industry. By treating all data the same, opportunities are lost for creating new products and making better decisions. Beyond the example above, there are other use cases where it is valuable to have data points that help customers understand when to trust the data.

These simple confidence indicators enable new capabilities for our customers. We have outlined a few potential advantages, but the following is not exhaustive.

Higher Variance Modeling
The trust level can be modeled directly into the decision-making process. When data is unreliable, both human decision making and data-driven modeling must discount this data. Knowing in which direction the model is likely to err can help determine whether or not the data should be trusted.

Selective Product Quality
In some use cases, data that doesn’t meet a certain performance level must be discarded. Using a confidence value to discard data points that are unlikely to be correct allows for products that previously would have been infeasible.

In the insurance industry, an insurer can choose to insure properties that other insurers might refuse, but only if the data supporting their decision is highly reliable. For example, most insurers might ordinarily refuse to cover homes in a particular area deemed high risk due to regional wildfires. However, if an insurer had access to data that correlates highly with low fire risk for specific properties in the region, then it might be able to insure these particular properties.
In other words, highly reliable data can provide an insurer with the confidence to cherry-pick homes in this high-risk region. There is a minimum data quality required before these decisions can be made. Confidence makes this possible.
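A minimal sketch of this kind of selective filtering follows. The `Prediction` record, the labels, and the threshold of 90 are hypothetical choices made for illustration; only the 0 to 99 confidence scale comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    property_id: str
    label: str
    confidence: int  # 0-99 score, as described for classification data


def split_by_confidence(predictions, threshold):
    """Separate predictions that meet a minimum confidence from the rest."""
    kept = [p for p in predictions if p.confidence >= threshold]
    discarded = [p for p in predictions if p.confidence < threshold]
    return kept, discarded


# Hypothetical predictions for three properties in a high-risk region.
predictions = [
    Prediction("A", "low_fire_risk", 97),
    Prediction("B", "low_fire_risk", 62),
    Prediction("C", "high_fire_risk", 91),
]

kept, discarded = split_by_confidence(predictions, threshold=90)
print([p.property_id for p in kept])
print([p.property_id for p in discarded])
```

The right threshold depends on the product: a higher value discards more properties but leaves a set whose data is more likely to be correct.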

Minimizing Data Drift
Generally, precision, recall, and accuracy determine the reliability of a particular model in the aggregate. However, these metrics can be highly skewed depending on the data collection process (as in the pool permit example above). A random sample of addresses may or may not match the data that was used to create the performance metrics. This could result in either trusting the data too much or too little.
Confidence allows the end user to get an idea of the model performance on any arbitrary dataset without needing ground truth labels.
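One simple way to do this, sketched below, is to average the confidence scores over the dataset. This assumes the scores are well calibrated, so that predictions scored around 80 are correct about 80% of the time; the function name and the sample scores are hypothetical.

```python
def expected_accuracy(scores):
    """Estimate accuracy on an unlabeled dataset from 0-99 confidence scores.

    Assumption: the scores are calibrated, so their mean approximates the
    fraction of predictions that are correct.
    """
    if not scores:
        raise ValueError("at least one confidence score is required")
    return sum(scores) / len(scores) / 100.0


# Hypothetical confidence scores for four addresses with no ground truth.
estimate = expected_accuracy([95, 80, 70, 55])
print(f"expected accuracy: {estimate:.2f}")
```

If the estimate for a new set of addresses is far from the accuracy published for the validation data, that is a hint that the new data differs from what the model was evaluated on.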

Calculating Confidence

How can it be possible to know how likely data is to be correct?

Have you ever looked at a picture from a long time ago and tried to identify all of the much younger faces? Usually you can identify a few people because they mostly look the same. Then, using information external to the image, such as its date, you reason about who you knew at that age. After using all of the information at your disposal, a combination of the image and your memories, you can come up with a level of confidence for the identity of each person in the image. You might be absolutely certain that you recognize yourself and your sister, but you notice someone partially visible who you think, with about 50% confidence, is your best friend, since she was always around.

This same reasoning is the premise behind how Arturo’s confidence modeling works. We take all the available data that we have access to — neural network parameters, image quality, prior knowledge about objects, and others — and reason about how likely we are to be correct.

For the more technical reader, we use a combination of machine learning and Monte Carlo simulation to estimate our confidence levels. We are in the midst of some heavy-duty research and hope to release a white paper on the inner workings of our approaches soon.

Measuring the Quality of Confidence

To know if confidence is working as intended (and to know if our research is improving the confidence framework) we came up with a few robust measures of performance. For classification models we didn’t have to be so creative because we found inspiration in these articles on calibration, measuring calibration and trust in uncertainty. These papers do a great job on measuring the quality of confidence for classification models. While we’ve drawn heavily from the principles proposed in the above research when appropriate, it is worth noting that we measure our confidence scores for classification and regression tasks differently.

Classification Confidence

This is measured using two metrics: expected calibration error (ECE) and negative log-likelihood (NLL). ECE bins all of the predictions by confidence level and measures how far the average confidence in each bin is from the accuracy of that bin. This is useful because it is easy to inspect where the framework is failing, but it has a potential pitfall: if every confidence prediction simply equals the average accuracy, the ECE will be 0 even though the confidence says nothing about individual predictions. NLL does not share this weakness; a lower NLL will always mean that the confidence metric is better. For more information, read the last article linked above.
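The sketch below shows one common way to compute ECE and NLL and reproduces the pitfall just described. It is an illustration under stated assumptions, not Arturo’s implementation: confidences are rescaled to the 0 to 1 range, ten equal-width bins are used, and NLL is simplified to score the binary event “the prediction is correct”. The function names and sample data are hypothetical.

```python
import math


def ece(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, ok))
    total = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        total += (len(members) / n) * abs(avg_conf - accuracy)
    return total


def nll(confidences, correct, eps=1e-12):
    """Mean negative log-likelihood of the observed correct/incorrect outcomes."""
    total = 0.0
    for c, ok in zip(confidences, correct):
        p = c if ok else 1.0 - c
        total -= math.log(max(p, eps))  # eps guards against log(0)
    return total / len(confidences)


correct = [True, True, True, False]
constant = [0.75, 0.75, 0.75, 0.75]   # every score equals the average accuracy
informative = [0.9, 0.9, 0.9, 0.3]    # low score on the one wrong prediction

ece_constant, ece_informative = ece(constant, correct), ece(informative, correct)
nll_constant, nll_informative = nll(constant, correct), nll(informative, correct)
print(ece_constant, ece_informative)
print(nll_constant, nll_informative)
```

In this toy data the constant scores achieve an ECE of 0 while the informative scores do not, yet NLL ranks the informative scores as better. This is why the two metrics are read together.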

Regression Confidence

To measure the effectiveness of confidence intervals we look at the average interval width and what percentage of our test data falls within the confidence interval. Our objective, as we work to improve this metric, is to shrink the confidence interval and still have the ground truth inside of the interval 95% of the time.
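Both quantities are straightforward to compute on a labeled test set. The sketch below uses hypothetical roof-area intervals and ground-truth values; the function name is an assumption made for illustration.

```python
def interval_quality(intervals, truths):
    """Return (coverage, average width) for (lower, upper) intervals.

    coverage is the fraction of ground-truth values that fall inside
    their interval; the target described above is about 0.95.
    """
    if len(intervals) != len(truths) or not intervals:
        raise ValueError("need one ground-truth value per interval")
    covered = [lo <= t <= hi for (lo, hi), t in zip(intervals, truths)]
    widths = [hi - lo for lo, hi in intervals]
    return sum(covered) / len(covered), sum(widths) / len(widths)


# Hypothetical roof areas in square feet; the second truth falls outside.
intervals = [(1900, 2300), (1500, 1700), (2800, 3400), (1000, 1200)]
truths = [2000, 1750, 3000, 1100]

coverage, avg_width = interval_quality(intervals, truths)
print(f"coverage={coverage:.2f}, average width={avg_width:.1f}")
```

The two numbers pull against each other: very wide intervals reach the coverage target easily but are less useful, so an improvement is a narrower average width at the same coverage.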


Having awareness of the uncertainty in data can be a huge advantage when making business decisions. By knowing when to trust data, how much to trust data, and how to account for the error in data, decisions can and will be much more informed.

While the entire team at Arturo is deeply excited about the science of deep learning, we’re equally if not more passionate about ensuring that the outputs have high value and utility for our customers.