Lookup Bounded Memory Support: New Design

The current implementation works in three steps (sketched below):

  1. Read every record that will make up the lookup table one at a time and add it to memory (may overflow here). Once all the records have been processed, save the in-memory lookup table to disk.
  2. Read the lookup table file back into memory (may overflow here) to set up the hash bucket for each record (i.e. set the next-record offset), then write it to disk again.
  3. Load the entire lookup table file from disk into memory to serve lookups (may overflow here).
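
For reference, here is a minimal Java sketch of this unbounded flow; the class name, file name, and record framing are illustrative assumptions, and the comments mark the three overflow-prone steps.

```java
import java.io.*;
import java.util.*;

// Hypothetical sketch of the unbounded flow; names and formats are illustrative only.
public class NaiveLookupTableBuilder {
    public static void main(String[] args) throws IOException {
        // Step 1: read every record into memory (may overflow here).
        List<byte[]> records = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream("records.dat")))) {
            while (true) {
                int len;
                try { len = in.readInt(); } catch (EOFException eof) { break; }
                byte[] record = new byte[len];
                in.readFully(record);
                records.add(record);
            }
        }
        // ... write the whole in-memory table out as one lookup table file ...

        // Step 2: read the whole lookup table file back into memory (may overflow
        // here), set each record's next-record offset for its hash bucket, rewrite it.

        // Step 3: load the entire lookup table file into memory again to serve
        // lookups (may overflow here).
    }
}
```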

As we can see, there are three places where memory can overflow; the root cause is that we load the whole lookup table into memory.

To solve this problem, the straightforward approach is to divide the lookup table into several parts (which we call sections) and process these sections one at a time.

New Design

Let’s use a diagram to illustrate the workflow:

Header: metadata about the lookup table (address, offset, …).
Bucket vector: speeds up lookup by chaining records into different buckets based on their key hash.


Here, assume we have 5 records and 2 buckets. When we look up record 4, hashing its key takes us to bucket 2, so we only walk through record 2 and skip records 1, 3, and 5. Note that the pointer stored in each record is actually an offset value.
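
To make the chain walk concrete, here is a minimal Java sketch of reading such a file; the header fields, record framing, and end-of-chain marker are assumptions for illustration, not the actual on-disk format.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Assumed layout (illustrative, not the real format):
//   header:        [bucketCount:int][recordCount:int]
//   bucket vector: bucketCount longs, each the offset of the first record in that bucket
//   records:       [nextOffset:long][keyLen:int][key bytes][valueLen:int][value bytes]...
public class LookupTableReader {
    private static final long NIL = -1L;        // assumed end-of-chain marker
    private final RandomAccessFile file;
    private final int bucketCount;
    private final long bucketVectorStart = 8;   // right after the two header ints

    public LookupTableReader(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        bucketCount = file.readInt();           // header: number of buckets
        file.readInt();                         // header: record count (unused here)
    }

    public byte[] lookup(String key) throws IOException {
        int bucket = Math.floorMod(key.hashCode(), bucketCount);
        file.seek(bucketVectorStart + 8L * bucket);
        long offset = file.readLong();          // head of this bucket's chain
        while (offset != NIL) {
            file.seek(offset);
            long next = file.readLong();        // offset of the next record in the chain
            byte[] keyBytes = new byte[file.readInt()];
            file.readFully(keyBytes);
            byte[] value = new byte[file.readInt()];
            file.readFully(value);
            if (new String(keyBytes, StandardCharsets.UTF_8).equals(key)) {
                return value;                   // found; only this bucket's records were visited
            }
            offset = next;                      // skip to the next record in the same bucket
        }
        return null;                            // not found
    }
}
```

In the 5-record, 2-bucket example above, looking up record 4 seeks into bucket 2's chain and visits only the records chained there.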

Section file container: we fix the size of the section file container in memory (it can be configured by the client or derived by some logic from the amount of memory on the client machine).
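
A small sketch of how that container size might be resolved; the system property name and the heap-fraction fallback are assumptions, not actual configuration keys.

```java
// Hypothetical way to pick the section file container size.
public class SectionFileContainerConfig {
    // Assumed client-facing property, e.g. -DsectionContainerBytes=268435456 (256 MB).
    private static final String SIZE_PROPERTY = "sectionContainerBytes";

    public static long resolveContainerSize() {
        String configured = System.getProperty(SIZE_PROPERTY);
        if (configured != null) {
            return Long.parseLong(configured);        // explicit client setting wins
        }
        // Otherwise derive the cap from the memory available on the client machine;
        // a quarter of the JVM's max heap is an arbitrary illustrative choice.
        return Runtime.getRuntime().maxMemory() / 4;
    }
}
```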

Current Design to Minimize Memory Usage

  1. Set a max memory cap (section file container size).
  2. Process each record as it is read in and add its size to a running total; when the total would reach or exceed the max memory cap, save the records collected so far into a file (called a section file; the size of a section file is <= the max memory cap).
  3. Repeat the step above until all records have been saved into section files. One big lookup table file has now been divided into several section files (see the sketch after this list).
  4. Process each input record and use its key to compute its hash bucket number, then chain the records (i.e. fill the next-record offset into the placeholder left earlier and write the updated content back to disk; this requires swapping section files between read and write operations).
  5. For a lookup, hash the key to find the bucket entry, then walk the chain and compare each record until the key is found or the chain ends; this may require reading several section files from disk (no write operations).
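
The following is a minimal Java sketch of steps 1–3, buffering records up to the cap and spilling them into numbered section files; the class name, file naming, and record framing are illustrative assumptions.

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch of steps 1-3: collect records up to the memory cap, then spill them
// into a section file so that each section file stays within the cap.
public class SectionFileSpiller {
    private final long maxMemoryCap;                  // section file container size (step 1)
    private final List<byte[]> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int sectionIndex = 0;

    public SectionFileSpiller(long maxMemoryCap) {
        this.maxMemoryCap = maxMemoryCap;
    }

    // Step 2: add one record; spill first if the record would push us past the cap.
    public void add(byte[] record) throws IOException {
        if (!buffer.isEmpty() && bufferedBytes + record.length >= maxMemoryCap) {
            spill();
        }
        buffer.add(record);
        bufferedBytes += record.length;
    }

    // Step 3: after the last record, flush whatever is still buffered.
    public void finish() throws IOException {
        if (!buffer.isEmpty()) {
            spill();
        }
    }

    // Write the buffered records as one section file (size <= maxMemoryCap).
    private void spill() throws IOException {
        String path = "lookup-section-" + sectionIndex++ + ".dat";
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path)))) {
            for (byte[] record : buffer) {
                out.writeInt(record.length);          // simple length-prefixed framing
                out.write(record);
            }
        }
        buffer.clear();
        bufferedBytes = 0;
    }
}
```

Steps 4 and 5 then operate on these section files one at a time, keeping at most one section file's worth of data in the container and swapping files in and out of memory as they are read or rewritten.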

If the size of the input data is <= 2 * the available memory size (ignoring other small overhead), we will only have 2 section files, and performance will degrade only slightly (assuming file read/write performance is good).
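
As a quick check of that claim (the cap and input sizes below are made-up numbers), the number of section files is simply the input size divided by the cap, rounded up.

```java
// Illustrative arithmetic only: a 400 MB input with a 256 MB cap needs 2 section files.
public class SectionCountExample {
    public static void main(String[] args) {
        long cap = 256L * 1024 * 1024;                    // section file container size
        long inputSize = 400L * 1024 * 1024;              // total lookup table data
        long sections = (inputSize + cap - 1) / cap;      // ceil(inputSize / cap)
        System.out.println("section files: " + sections); // prints "section files: 2"
    }
}
```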