algorithm - Quickly checking if set is superset of stored sets

Question

Welcome To Ask or Share your Answers For Others

algorithm - Quickly checking if set is superset of stored sets

posted Oct 24, 2021 in Technique[技术] by 深蓝 (71.8m points)

algorithm - Quickly checking if set is superset of stored sets

The problem

I am given N arrays of C booleans. I want to organize these into a datastructure that allows me to do the following operation as fast as possible: Given a new array, return true if this array is a "superset" of any of the stored arrays. With superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.

Or, in terms of sets instead of arrays:

Store N sets (each with C possible elements) into a datastructure so you can quickly look up if a given set is a superset of any of the stored sets.

Building the datastructure can take as long as possible, but the lookup should be as efficient as possible, and the datastructure can't take too much space.

Some context

I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:

N = 10000
C = 1000
The stored arrays are sparse
The looked up arrays are random (so not sparse)

What I've come up with so far

For O(NC) lookup: Just iterate all the arrays. This is just too slow though.
For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.

I hope that in between this O(N*C) and O(C) solution, there's maybe a O(log(N)*C) solution that doesn't require an exponential amount of space.

EDIT: A new idea I've come up with

For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.

My intuition tells me that this should make the (average) complexity of the lookup O(sqrt(N)C), if you assume that the stored arrays are random. But: 1. they're not, the arrays are sparse. And 2. it's only intuition, I can't prove it.

I will try out both this new idea and the BDD method, and see which of the 2 work out best.

But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-23T18:43:54+0000

Just to add some background information to the prefix trie solution, recently I found the following paper:

I.Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.

The paper proposes the set-trie data structure (container) which provides support for efficient storage and querying of sets of sets using the trie data structure, supporting operations like finding all the supersets/subsets of a given set from a collection of sets.

For any python users interested in an actual implementation, I came up with a python3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on github.

Categories

algorithm - Quickly checking if set is superset of stored sets

algorithm - Quickly checking if set is superset of stored sets

The problem

Some context

What I've come up with so far

EDIT: A new idea I've come up with

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags