Predicate Constraint - A Missing Data Contingency Analysis Framework

Fast and Reliable Missing Data Contingency Analysis with Predicate-Constraints

Read the full paper on arXiv.

In this research project, we propose a novel analysis framework called the Predicate-Constraint Set that can be used for missing data contingency analysis. The feature that distinguishes our framework from other approaches is its ability to generate a meaningful hard bound on the query result, whereas probabilistic approaches can only generate a confidence interval, which is not reliable.

Let’s use an IoT dataset as a motivating example. Suppose we have the following dataset containing measurements returned by several IoT devices for the year 2019.

| DeviceId | Date     | Time     | Light | Humidity |
|----------|----------|----------|-------|----------|
| 1        | 20190313 | 07:55:12 | 70.9  | 40.1     |
| 2        | 20190313 | 07:56:15 | 71.9  | 42.3     |
| 3        | 20190313 | 07:57:41 | 72.1  | 41.5     |
| 4        | 20190313 | 07:58:02 | 70.3  | 45.8     |

An analyst is interested in the following simple query:

```sql
SELECT MIN(Humidity) FROM Data WHERE Time <= '12:00:00'
```
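To make the example concrete, here is one way to run this query over the sample rows above using Python's built-in sqlite3 module. This is purely illustrative setup code, not part of the framework.

```python
import sqlite3

# Load the sample rows from the table above into an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Data (DeviceId INT, Date TEXT, Time TEXT, Light REAL, Humidity REAL)"
)
rows = [
    (1, "20190313", "07:55:12", 70.9, 40.1),
    (2, "20190313", "07:56:15", 71.9, 42.3),
    (3, "20190313", "07:57:41", 72.1, 41.5),
    (4, "20190313", "07:58:02", 70.3, 45.8),
]
conn.executemany("INSERT INTO Data VALUES (?, ?, ?, ?, ?)", rows)

# The analyst's query: the minimum humidity observed before noon.
(min_humidity,) = conn.execute(
    "SELECT MIN(Humidity) FROM Data WHERE Time <= '12:00:00'"
).fetchone()
print(min_humidity)  # 40.1
```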

This query can be easily answered, but let’s make things a bit more challenging: what if a month’s worth of data is missing due to file corruption?

Of course, we can still run the query on whatever data is left and we will get a result. But the business decision we are going to make based on that result is so important that we cannot help wondering how the missing data would affect it.

Well, the analyst could simply make a guess based on her experience and intuition, or she could extrapolate the missing data from the existing data. However, such a result, generated from partial data and speculation, cannot be relied on for important decisions. This is one of many scenarios where a reliable result with a hard bound would be preferred.

Our framework is called the Predicate-Constraint Set. A predicate-constraint (PC) has two parts: a predicate and a set of constraints. Intuitively, a PC specifies that all tuples in the dataset that satisfy the predicate must obey the constraints defined by the PC.

Informally, let’s use an example to understand the definition of a PC.

The predicate can be defined on one or more attributes. Using our IoT dataset as an example, we can define a predicate on DeviceId and Date:

10 <= DeviceId <= 20 AND 20190401 <= Date <= 20190701
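As a rough illustration, a range predicate like this can be represented as a set of per-attribute intervals. The sketch below uses names of our own choosing (Predicate, matches); it is not notation from the paper.

```python
# A minimal sketch of a range predicate over one or more attributes.
# The class and method names here are illustrative, not from the paper.

class Predicate:
    def __init__(self, ranges):
        # ranges: dict mapping attribute name -> (low, high), bounds inclusive
        self.ranges = ranges

    def matches(self, tuple_):
        # tuple_: dict mapping attribute name -> value
        return all(low <= tuple_[attr] <= high
                   for attr, (low, high) in self.ranges.items())

# The predicate from the text: 10 <= DeviceId <= 20 AND 20190401 <= Date <= 20190701
p = Predicate({"DeviceId": (10, 20), "Date": (20190401, 20190701)})
print(p.matches({"DeviceId": 12, "Date": 20190515}))  # True
print(p.matches({"DeviceId": 3,  "Date": 20190515}))  # False
```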

In the paper, we define three types of constraints:

  1. a frequency constraint, which specifies a bound on the number of tuples satisfying the aforementioned predicate, e.g., no more than 100 tuples;

  2. a MIN constraint, which specifies a lower bound on the value an attribute can take, e.g., the value of Light is larger than 20;

  3. a MAX constraint, which specifies an upper bound on the value an attribute can take, e.g., the value of Light is smaller than 70.

If we put the above predicate and constraints together as a PC, it specifies that:

  • there are at most 100 tuples in the dataset with a DeviceId between 10 and 20 and a Date between 20190401 and 20190701;
  • among these tuples, the Light attribute is between 20 and 70.
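Continuing the illustrative sketch from above (it reuses the Predicate class), the combined PC in this example could be written as follows. The PredicateConstraint class and its fields are again our own naming for this post, not the paper's formal definition.

```python
# A minimal sketch of a predicate-constraint (PC): a predicate plus a
# frequency bound and per-attribute MIN/MAX bounds. Names are illustrative.

class PredicateConstraint:
    def __init__(self, predicate, max_tuples, value_bounds):
        self.predicate = predicate        # a Predicate as sketched above
        self.max_tuples = max_tuples      # frequency constraint
        self.value_bounds = value_bounds  # dict: attribute -> (min, max)

    def is_satisfied_by(self, tuples):
        # True if the tuples that match the predicate respect both the
        # frequency constraint and the MIN/MAX constraints.
        matching = [t for t in tuples if self.predicate.matches(t)]
        if len(matching) > self.max_tuples:
            return False
        return all(lo <= t[attr] <= hi
                   for t in matching
                   for attr, (lo, hi) in self.value_bounds.items())

# The PC from the text: at most 100 tuples match the predicate,
# and their Light values lie between 20 and 70.
pc = PredicateConstraint(
    predicate=Predicate({"DeviceId": (10, 20), "Date": (20190401, 20190701)}),
    max_tuples=100,
    value_bounds={"Light": (20, 70)},
)
```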

Basically, a PC describes the constraints on a certain subset of tuples in the dataset. If we have a set of PCs associated with the dataset, then when part of the dataset is corrupted or missing, we can use the set of PCs as an alternative source of information for reliable contingency analysis.

One of the major contributions, and a key challenge addressed by our research, is how to handle the situation where the PCs overlap. In this case, different PCs may specify different constraints for the same tuple, so we need to resolve the potential conflicts and compute the best result that complies with all the constraints. Such a result is, by construction, a hard bound on the query result.
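To give a flavor of how a set of PCs turns into a hard bound, here is a deliberately naive sketch that assumes the PCs' predicates cover every tuple the query could select. It sums the frequency constraints of all PCs that may intersect the query predicate to bound a COUNT, and takes the largest MAX constraint among them to bound a MAX. These bounds are valid but loose when PCs overlap; the optimization algorithm in the paper computes much tighter bounds, and this sketch is not that algorithm.

```python
# Naive, illustrative upper bounds from a set of PCs (not the paper's algorithm).
# Assumes the PCs' predicates cover every tuple the query predicate could select.

def ranges_intersect(pred_a, pred_b):
    # Two range predicates can select a common tuple only if the ranges of
    # every shared attribute overlap.
    for attr in set(pred_a.ranges) & set(pred_b.ranges):
        lo_a, hi_a = pred_a.ranges[attr]
        lo_b, hi_b = pred_b.ranges[attr]
        if hi_a < lo_b or hi_b < lo_a:
            return False
    return True

def naive_count_upper_bound(pcs, query_predicate):
    # Every qualifying tuple matches at least one intersecting PC, so summing
    # their frequency constraints over-counts but never under-counts.
    return sum(pc.max_tuples for pc in pcs
               if ranges_intersect(pc.predicate, query_predicate))

def naive_max_upper_bound(pcs, query_predicate, attr):
    # Any qualifying tuple's value is capped by the MAX constraint of the PC
    # it falls in, so the largest cap over intersecting PCs is a valid upper
    # bound; if some intersecting PC does not bound `attr`, no bound is known.
    caps = []
    for pc in pcs:
        if ranges_intersect(pc.predicate, query_predicate):
            if attr not in pc.value_bounds:
                return None
            caps.append(pc.value_bounds[attr][1])
    return max(caps) if caps else None
```

The same idea gives a lower bound for the motivating MIN(Humidity) query: take the smallest MIN constraint on Humidity among the intersecting PCs, provided each of them constrains Humidity.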

In the paper, we propose an algorithm that solves the PC problem and generates the best hard bound that can be derived from a set of PCs for SUM, COUNT, MIN, MAX, and AVG queries. We also propose several optimization techniques to make our framework more scalable, extend the framework to a multi-table scenario, and draw interesting connections to the AGM Bound1.

Finally, we compare against different baselines on several real-world datasets and show that our framework is 100% reliable and competitively accurate.

Please check our paper for more details.

1: A. Atserias, M. Grohe and D. Marx, “Size bounds and query plans for relational joins”, FOCS 2008.

