12
votes

I have a program in which I want to be able to store certain data (dynamically allocated blocks), on disk for reduced memory usage and persistence.

My first thought was to write my own custom allocator which managed the content of files on the disk, but I want to see what alternatives there are too.

I've looked into custom memory allocators and topics on object serialization but there are subtle differences, both good and bad, when adapting those principles to managing the address space of a file.

In this situation:

  1. Memory is accessed only through IO (read/write) functions rather than directly

  2. No objects(methods/pointers) are stored, only data.

  3. The size of a file is not static, so it should grow when needed rather than being large and static

  4. For my uses, it is acceptable to re-map existing pointers after defragmentation

Because the data is not of a fixed size, most database implementations seem not well suited.

I ask, what is the best approach for this problem? Should I implement a simple memory allocator which treats a file as the heap?

For reference, im using C++ on embedded devices.


Edit: I've implemented my own memory manager which uses buddy memory allocation and block sizes of powers of two. I'm satisfied that it is correct and does not leak, coalesces free blocks, and can do a 'stop the world' defragmentation.

The problem is, as expected, there is quite a bit of internal and external fragmentation. I'm not an expert in this field and although I find it fascinating (I'm still a student), I wonder if there are any other implementations that have done the same or similar thing? Surely I cant be the only one?


Some helpful but so far incompatible topics are:

mmap tbh I havent used mmap but, it addresses the file IO, but not the management of the file address space.

BOOST:serialization I have a (probably unjustified) reluctance to use boost libraries at the moment.

STXXL Interesting but doesnt address variable size memory allocation

Doug Lea Memory Allocator Has very good insights into issues with memory allocators, but I'm not in a position to try and make my own implementation

12

12 Answers

7
votes

Your two goals are to reduce memory usage and persist your data. That definitely sounds like a job for a database. But then you say

Because the data is not of a fixed size, most database implementations seem not well suited.

I think you'll be interested in this distinctive feature of SQLite (a very lightweight cross-platform database with public domain source code):

Variable-length records

...

SQLite, in contrast, use only the amount of disk space actually needed to store the information in a row. If you store a single character in a VARCHAR(100) column, then only a single byte of disk space is consumed. (Actually two bytes - there is some overhead at the beginning of each column to record its datatype and length.)

It also is a good choice for embedded development:

Embedded devices and applications

Because an SQLite database requires little or no administration, SQLite is a good choice for devices or services that must work unattended and without human support. SQLite is a good fit for use in cellphones, PDAs, set-top boxes, and/or appliances. It also works well as an embedded database in downloadable consumer applications.

8
votes

I've implemented my own memory manager which uses buddy memory allocation and block sizes of powers of two. I'm satisfied it is correct and has doesnt leak, coalesses free blocks and can do a 'stop the world' defragmentation.

That's a great first step. Once you have a working custom memory allocator you can of course do better!

The problem is, as expected there is quite a bit of internal(power of 2 blocks) and external fragmentation. I'm not an expert in this field and although I find it facinating (I'm still a student), I wonder if there is any other implementations that have done the same or similar thing? Surely I cant be the only one?

The power of two is a generic approach. However, note that this may not be the best simply because your allocation pattern may not follow the same geometric progression. In such a case it is best to test as much as you can and see what block sizes are getting allocated the most and optimize accordingly.

I would also like to suggest this a wonderful article by Andrei Alexandrescu and Emery Berger on the topic of memory allocation: Policy-Based Memory Allocation and the latter's work in particular: The Hoard Memory Allocator.

If possible go through the references mentioned at the end of that article. They may just as well provide additional insights.

3
votes

Your best option would be fast key-value store. The advantage over RDBMS is that you won't need all the overhead of database.

2
votes

I recently coded up a virtual heap class for a high memory usage problem that I had. The code is LGPL'ed and is hosted at code.google.com at:

http://code.google.com/p/kgui/source/browse/trunk/vheap.cpp

http://code.google.com/p/kgui/source/browse/trunk/vheap.h

Essentially it works as follows:

1) Define a block size and number of blocks to leave in memory, and a filename for caching to the filesystem. In my usage case I have 200 blocks of 1MB in memory at any time.

2) Then call Allocate to reserve a chunk of "virtual memory". You are returned a 8byte "handle" to the memory. You can allocate chunks larger than the block size if desired.

3) To write to the "virtual heap" there is a write function where you pass the "handle", pointer to the data and size of the data.

4) To read from the "virtual heap" there is a read function where you pass the "handle", pointer to the destination and size of the data to read.

The code automatically handles swapping between what is in memory and what is stored on disk. It's pretty simple actually.

1
votes

For embedded devices I would certainly do a simple implementation instead of using a database. Direct file IO avoids some overhead of databases. And resources are often limited in embedded environments.

Your idea writing a memory allocator is probably the best way. It should provide some kind of API layer that isolates the file based memory management as much as possible from the rest of your application. That way it should be easy to swap out (no pun intended) for a different implementation later on and therefore optimize if the need arises.

1
votes

I would definitely use mmap for the I/O. This would make it easy to directly access the data and flush to disk when needed. The only thing you would have to control is where the file is mapped in the address space, so you can move it around.

One possibility for memory management is to create a different file for each object, and use file-system level defragmentation rather than implementing it yourself. You never mentioned what OS/file-system you are using, but if it already has online defragmentation I would use that. If you are using Linux and can use XFS, you can use xfs_fsr. I would expect file-system defragmentation to be highly optimized and it would take a lot less effort than to implement on your own in one big file.

1
votes

From what I understand you need a file system and not a memory allocation system. At first, in embedded systems dynamic memory allocation in a disk is a contradictory term. A disk, either a hard disk or a flash device, used for persistent storage is much different than memory. It is not only the way you access it, but the fact that disk storage isn't 100% reliable. When writing to a disk you need to have an algorithm for avoiding bad sectors. Have you thought of this or can you consider your disk fault free?

A file system will deal both with space allocation and bad sectors issues. FAT is usually used in embedded devices. Although FAT's fragmentation performance is rather poor, this hasn't prevented it from being used in many embedded devices. Most flash based devices actually use FAT.

Anyway, I suggest to start with what you have now: your operation system (if you use any) and the driver for your disk. Investigate if a suitable solution is already supported from these. Also keep in mind that embedded devices are more difficult to debug - if you set to implement your own algorithms expect longer development times.

1
votes

Take a look at HDF5 http://www.hdfgroup.org/HDF5/whatishdf5.html

This should serve your purpose.

0
votes

I think you'd have less internal fragmentation with a simple heap allocator. You just allocate the amount of memory you actually use (plus the overhead for the header). If you're already resigned to doing stop-the-world compaction, you could combine this with new arena allocation, and allocate a new (larger) arena and copy all your live blocks to the new arena.

0
votes

I'm going to echo kgiannakakis - what you're describing is a file system, not a memory management system.

Since all your access is through I/O functions, it is not necessary for your object to be contiguous on disk. Instead of putting each object into a dynamic-size block, divide the object into multiple fixed-size blocks. The blocks can be located anywhere, all you need is a way to link them together. Your I/O functions will break up and coalesce the blocks as needed.

0
votes

Hmmh. This sounds like a very common use case for BDB (Berkeley DB). It's an efficient production-quality library that does key-value persistent "databases" (~= tables with other DBs), open source and all.

I don't think relational (SQL) DBs make much sense, but bdb et al (gnu db and I'm sure there are others) certainly does.

0
votes

You may want to look at the facilities provided by Boost.Interprocess, in particular look at the managed memory mapped files facilities.