Rex Kerr
3 min read · Sep 27, 2024


This is fun, but I'm a little puzzled about the level and/or skillset you're hiring for. Maybe this is a good filter for overall-bright people who don't think much about algorithms? Anyone who has algorithm design as their core competency should get the algorithm right really fast, and then you're just testing whether they know the language syntax as well as an LLM does. (Historically, being able to actually produce the code in a reasonable amount of time was an important skill; these days, at least for a common language like Python or Java, it isn't, particularly.)

And you haven't even given the best easy solution! Surely someone came up with it? (Edit: on further perusal of the comments I see that Adam Wronski did come up with this solution.)

You don't need a set per customer. You just need the PageID plus a sentinel value. That leaves only two persistent data structures: a Map<CustomerID, PageID> and a Set<CustomerID> for the loyal customers.

First you iterate through the small file, building the map. If the CustomerID isn't in the map yet, you add its PageID. If it is, but with a different PageID, you switch the stored value to the sentinel.

Then you iterate through the larger file. If the CustomerID you read is in the map and the PageID doesn't match (which happens on every sentinel value, and also whenever this is a second unique page), you remove the CustomerID from the map and add it to the set of loyal customers.
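
In Python, a minimal sketch of the two passes might look like this (parse_line is a hypothetical stand-in for whatever your log format needs, and None serves as the sentinel):

```python
SENTINEL = None  # "already saw two distinct pages in the small file"

def loyal_customers(small_path, large_path, parse_line):
    first_page = {}  # Map<CustomerID, PageID>, sentinel included
    loyal = set()    # Set<CustomerID>

    # Pass 1: remember the first page per customer; demote to the
    # sentinel the moment a second distinct page shows up.
    with open(small_path) as f:
        for line in f:
            customer, page = parse_line(line)
            if customer not in first_page:
                first_page[customer] = page
            elif first_page[customer] != page:
                first_page[customer] = SENTINEL

    # Pass 2: any mismatch means two distinct pages across both files
    # (the sentinel mismatches every real PageID by construction).
    with open(large_path) as f:
        for line in f:
            customer, page = parse_line(line)
            if customer in first_page and first_page[customer] != page:
                del first_page[customer]
                loyal.add(customer)

    return loyal
```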

You don't need to futz with a map of sets! If you're trying to do the operation in memory, not needing an extra data structure for every map entry is a significant win. (It doesn't generalize well to more complex criteria, of course, and how big a win it is depends on the size of the PageID.)

Parallelization is also pretty easy if it all fits in memory, but then it depends on whether I/O or map lookup is the limiting factor. That would seem to me to be the more fun discussion to have, if the candidates were supposed to be good at algorithms and build the logic themselves rather than stuff a bunch of workers into a framework with backpressure and let it figure things out.

There, the key to a simple, performant solution is that your reading threads* can pre-sort the CustomerIDs by hashcode into K bins and pass chunks of binned CustomerID/PageID pairs to the correct one of the M mapper threads (chunked to avoid too much synchronization/CAS latency), each of which has a local map. At that point it's exactly the same problem, save for the final merge of the (disjoint) customer sets.
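
A sketch of that handoff, taking K = M for simplicity and showing pass 1 only (M and CHUNK are illustrative, parse_line is the same hypothetical parser as above, and in CPython the GIL limits how much the mapper threads actually buy you, so treat this as structure rather than a benchmark):

```python
import queue

M = 8          # mapper threads, one map shard each
CHUNK = 4096   # records per handoff, to amortize queue synchronization

shard_queues = [queue.Queue() for _ in range(M)]
shard_maps = [{} for _ in range(M)]  # one local Map<CustomerID, PageID> each

def reader(lines, parse_line):
    # Pre-sort records by hashcode into per-shard buffers; flush a
    # full buffer to its mapper's queue as a single chunk.
    buffers = [[] for _ in range(M)]
    for line in lines:
        customer, page = parse_line(line)
        b = hash(customer) % M
        buffers[b].append((customer, page))
        if len(buffers[b]) >= CHUNK:
            shard_queues[b].put(buffers[b])
            buffers[b] = []
    for b in range(M):
        shard_queues[b].put(buffers[b])
        shard_queues[b].put(None)  # end-of-stream marker

def mapper(b):
    # Each shard sees a disjoint slice of CustomerIDs, so its local
    # map needs no locking; this is pass 1 of the algorithm, unchanged.
    first_page = shard_maps[b]
    while (chunk := shard_queues[b].get()) is not None:
        for customer, page in chunk:
            if customer not in first_page:
                first_page[customer] = page
            elif first_page[customer] != page:
                first_page[customer] = None  # sentinel
```

Spawn the mappers with threading.Thread, run the reader over the small file, join, then repeat with pass-2 logic over the large file; the per-shard loyal sets are disjoint, so the final merge is a cheap union.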

And if you can't fit everything in memory?

Wait, we already have the answer! The same hashcode-based binning scheme used for parallelization can _also_ be used to chunk the data into bins of a size that will fit into memory (all you need for that is a bunch of open file handles, which is O(1) in memory).
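
A sketch of the spill, assuming num_bins is picked so each bin's distinct customers fit in memory. One caveat specific to Python: hash() is salted per process for strings, so bin both files in the same run (or pin PYTHONHASHSEED) so a given customer lands in the same bin both times:

```python
def spill_to_bins(path, num_bins, parse_line):
    # One open handle per bin: O(1) memory no matter how big the input.
    bins = [open(f"{path}.bin{b}", "w") for b in range(num_bins)]
    with open(path) as f:
        for line in f:
            customer, _ = parse_line(line)
            bins[hash(customer) % num_bins].write(line)
    for handle in bins:
        handle.close()
```

Then each (small.binB, large.binB) pair is solved independently with the in-memory two-pass algorithm above, and the answer is the union of the per-bin loyal sets.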

Need to parallelize *and* it doesn't fit in memory? No problem: just create bins × threads files and walk the lot of them instead of single files.
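
Sketched as file naming, with each of T reader threads spilling to its own private files so there's no write contention (paths illustrative):

```python
def bin_path(path, b, t):
    # Reader thread t spills record c to bin_path(path, hash(c) % num_bins, t).
    return f"{path}.bin{b}.t{t}"

# To solve bin b, stream all T files bin_path(path, b, 0), ...,
# bin_path(path, b, T - 1) through the two-pass algorithm; the per-bin
# loyal sets remain disjoint, exactly as before.
```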

So, anyway, it's a fun little problem! But I do have to wonder about who you're trying to hire if this is a useful discriminating question.

(* How to get useful reading threads depends on your I/O system and where the bottleneck is. Often, the only thing you can do is devote a thread to pulling in the data as fast as possible and passing it on for additional work to be done on it. Tim Bray's WideFinder2 was of that sort: even with the relatively underpowered cores in that challenge, one core could max out the I/O that the filesystem could deliver.)
