.. $Id$ $URL$ .. _faq: Frequently Asked Questions ========================== This is a list of FAQs about MDBM. To suggest new entries, send mail to the mdbm-users group. .. _faq-general: General ------- **What is an MDBM?** MDBM is a fast in-memory hash-based key-value store. **What are the most common use cases for MDBM?** MDBM is commonly used for caching and storing static data for quick access. **How do I get started?** - Build the package 'make' and install it 'make install' **What are the different language bindings available for MDBM?** - Bindings are included for C/C++, and perl. .. _faq-general-problems: General Problems ---------------- **How do I report a problem with MDBM?** See :ref:`getting-help` **Why can't I create a 2G MDBM on a 32-bit system?** In order to use a 2G MDBM, you'll have to lower the data-size limit in your process (man limits) from the default (512MB) to something lower. **I have a corrupted mdbm, what can I do?** Often, MDBMs get corrupted due to an application not handling locking correctly. If you are doing reads and writes, you must open that MDBM with locking (default is exclusive locking). You must hold the lock for the duration that you are referencing a record (for all operations, including reads, writes, and deletes). For example, when doing a read, you would typically: #. ``mdbm_lock`` #. ``mdbm_fetch`` #. copy out the data pointed to by the returned fetched datum #. ``mdbm_unlock`` #. access your copied-out data Another common way for MDBMs to become corrupt is for an application to mistakenly write (via a bad pointer) into the MDBM's mapped space. There is an MDBM *protect* feature for debugging purposes to catch these problems. See :ref:`mdbm-data-store-protection` for more information. Try using ``mdbm_check`` to identify the extent of the damage. If there is major corruption, ``mdbm_check`` will abort. Once an MDBM is corrupt, most of the time it's not possible to tell how it got corrupted. You need to catch it in the act of being corrupted. .. _faq-file-system: File System ----------- **Will MDBM perform well over NFS?** No. MDBM over NFS **MUST BE AVOIDED** at all costs. It does not perform well at all. MDBM uses mmap(), and although the NFS driver does support the mmap() operation, it gets converted to regular block fetches and updates, so the performance sucks. **What's the worst-case-scenario if I never call mdbm_close nor mdbm_sync (I just exit)? Can the database become corrupted? Can individual records become corrupted?** If you never call ``mdbm_sync`` and don't pass either ``MDBM_O_ASYNC`` or ``MDBM_O_FSYNC`` to ``mdbm_open``, the data will never be sync'd to disk (unless the system runs low on physical memory and starts swapping). If you reboot the system and open the database, it will be empty. **Is the whole file mapped to memory at any given time, or is there some sort of page fault/swapping that goes on?** The whole thing is mapped when you open the database, but individual pages are faulted in when they're touched. **How often (if ever) is the backing file sync'd to disk?** Never, by default. If you want sync'ing, you either have to use ``mdbm_sync`` to manually sync or specify ``MDBM_O_ASYNC`` when opening to enable background sync'ing by the kernel syncing process. **There's a flag for a memory-only database without a backing file. Is that correct?** There is a private internal flag which is not part of the public API to signify a memory-only MDBM. To create a memory-only MDBM, specify ``NULL`` for the ``mdbm_open`` *file* argument. Memory only MDBMs must be initialized as a fixed-sized MDBM because there is no way to handle dynamic-growth size changes across processes. This means that you must specify a *presize* to ``mdbm_open``. You must also use ``mdbm_limit_size_v3`` with *max_page* equivalent to the *presize*. **Why is my MDBM file modified time not being updated after I store something?** Simply, mod time does not get updated on mmap'd writes. MDBM writes are simply writing to memory; they are not directly doing file-based operations which would affect a file's modified time. For a high-performance store (ex., when you want to do 50K-100K writes/sec), it wouldn't be reasonable to update the modified time for each write. **Under what circumstances can a process or system crash cause loss of data in an MDBM?** A system crash will lose data unless something has synced the MDBM. By default, nothing does that. **If mdbm_sync is called, am I guaranteed that the MDBM will be corruption-free if there is a subsequent crash, even if it's missing some updates subsequent to the sync?** That's tricky. If the MDBM page size doesn't match the OS page size, then it's possible that VM pressure might cause only part of of a database page to get synced to disk. That would corrupt that page. ``mdbm_sync`` itself isn't foolproof either because it doesn't lock the MDBM and does a background sync. ``mdbm_fsync`` is better for integrity in that it locks the database and uses a synchronous fsync. The downside, of course, is that the database is locked until all the dirty pages are flushed. **What kind of overhead would I expect from using MDBM_O_ASYNC?** I infer that the system sync process, which flushes data to disk every 30 seconds, would also write the mmap-ed changed pages to disk every 30 seconds, so the worst case performance would be the time it takes to write the data, amortized over the time. **But, does the sync process lock the pages (this could be important if we're doing very high data rates on the MDBM - for example 100Ks/sec)?** Yes, the sync locks the pages, so if you touch a page while it's being flushed, you'll block. I haven't looked at this closely in a while, but I also recall that a sync results in a page fault when you touch a page for the first time after it's been synced. It's a quick fault (no disk access), but it still hurts a bit. **Does mdbm_close do an implicit flush to disk?** ``mdbm_close`` only syncs if the MDBM was opened with ``MDBM_O_FSYNC``. ``mdbm_close`` itself won't cause any flushing. **Does the MDBM file on the disk have the latest updates to the key-value pairs?** In general, unless you use ``mdbm_sync``, or use ``mdbm_open`` with ``MDBM_O_FSYNC`` (or ``O_FSYNC`` in earlier MDBM versions), your data will probably not be written to disk by ``mdbm_close``. On FreeBSD, the mmap'd file that holds the MDBM is not sync'd to the physical disk unless ``mdbm_sync`` is used or during a normal system shutdown when all dirty file data gets sync'd to disk. On RHEL, modifications to the mmap'd file are background-sync'd to disk after 30 seconds for files that are on a normal file-system mount. However, MDBMs that are hosted on a tmpfs file-system are not sync'd (and are also not preserved across a system reboot). **Can I copy an MDBM file from one machine to another?** Normally this is a bad idea, it is recommended to use ``mdbm_export`` at the source machine to obtain a portable file and then perform an ``mdbm_import`` on this file to get the MDBM on the destination machine. The ``mdbm_copy`` command is also available. Neither ``mdbm_export`` nor ``mdbm_copy`` guarantee data consistency, since calls to mdbm_store that store related data can occur mid-copy. .. _faq-sizes-and-limitations: Sizes and limitations --------------------- **My MDBM says it is 4G in size, will I need more RAM?** MDBM is a sparse file when large object mode is enabled. Use ``mdbm_stat`` to view the actual allocated size of the database (and Large Object Store). **Why has my MDBM dynamically grown to be huge? I don't have nearly that much data.** You have a data-sensitive problem. If you are using duplicate keys, or you have a pathological dataset, some of your pages are filling up too soon. When there is no more room on a page, an MDBM will grow to a maximum limit. When your MDBM can no longer grow, an attempted store to an full page will return an error. There are only few knobs to turn, in priority order: #. Enable large objects (only settable at create time) if you have a small percentage (<5%) of objects that are significantly bigger than the others. - The v2 implementation will create a 4GB file size because large objects are stored at a 4GB file offset and below. It's a sparse file so only the necessary pages on disk are used for storing data. #. Increase your page size (only settable at create time) - This might decrease performance because many more keys might need to be compared on a page to determine whether your lookup key exists. If there are few keys/page this won't be significant. If there are many (100+) keys/page it might be noticeable. If you have a lot of lookups where the key doesn't exist, this hurts performance because it has to compare every key on the page. #. Try another hash function (only settable at create time) - If your key is a string, try the Jenkins hash function - If you key is binary, try CRC32, SHA1, MD5 (probably in that order, YMMV) #. Use ``mdbm_open`` and ``mdbm_limit_size`` to set the initial and maximum MDBM size when the file is created. - This will create a flatter internal btree which might help distribute your data more uniformly. This might help you reduce that number of nearly full pages where a store operation would fail due to lack of space (this is your real problem as opposed to a large sparse MDBM). A consequence is that the actual number of pages used on disk might be higher, but that's probably a good trade-off. #. If you really don't have a good idea of the final size of your MDBM (as needed in the previous option), use ``mdbm_open`` with an initialize size with your best guess. If option 1 does not work, then you might need a combination of the other. MDBM is a hashed key-value store, so changing your hash function or page size changes how your data is distributed between MDBM pages. If the hash function you chose happens to parcel out too many keys into a single page, that page will split and MDBM's file size will double. If you keep adding data that happens to hit the same page, the MDBM will keep splitting and file size will keep growing and growing. Use ``mdbm_stat`` to look at your histogram data. You want to avoid having many pages that are nearly full when your MDBM close to its maximum size. **How do I control the amount of memory available to mmap?** In FreeBSD this can be controlled by using the kernel variable vm.max_proc_mmap, though it's usually not necessary to tune this. 32-bit applications on FreeBSD trade-off space for malloc against space for mmap according to the data segment size limit. This is controlled at the kernel level using the kern.maxdsiz loader variable (FreeBSD 4) or compat.ia32.maxdsiz sysctl (FreeBSD 6/7). In addition, the process rlimit for data segment size can be used to lower the data segment size limit (and therefore make more room for mmaps). **How do I determine how my MDBM is mapped into memory?** MDBM v3 mmaps an entire MDBM file into memory. Simply mapping an MDBM does not make it memory resident. Although a file's size on disk might be quite large, the sparse file structure will only bring pages containing data into memory when they are referenced (ex., fetch, store, or delete operation). The ``mdbm_preload`` routine may be used to make an MDBM memory resident. On RHEL, you can review a process' mapped regions and associated files via ``cat /proc/``\ *pid*\ ``/smaps``. **I'm using the MDBM within PHP that runs within yapache, given that each yapache child runs as a process, will each process mmap the MDBM separately?** Usually, the individual processes will map the MDBM separately (because the MDBM is opened after the child has been forked from the parent), but they will all be sharing the same physical RAM mapping for that file. **Can I use MDBMs in two or more machines in a cluster mode by connecting them?** MDBM can't do this out of the box. Explore YDBM or DISC-GDS. .. _faq-iteration: Iteration --------- **How do I initialize an MDBM iterator?** The ``MDBM_ITER_INIT()`` initializes an iterator. **While iterating across an entire database, am I guaranteed to see all key-values present in the database when the iteration starts, if deletes occur during the iteration? What if inserts or overwrites occur during iteration?** Deleting items will not affect iteration, assuming you only delete items you've already iterated over. If you lock the database; and begin an iteration, you will see all key-values. Deletion of some key-values will not interfere with this, as long as you remove a key-value you've already iterated over. If you started an iteration; and removed a key you knew was in the database but hadn't iterated over yet; the iteration would *not* return the (now deleted) key-value pair, even though it was in the database when the iteration began. Overwriting depends on what you're doing. If you're just fetching the value pointer and rewriting in-place, that's safe. If you're replacing the value with a different size, that may cause garbage collection, which may cause your iterator to miss records. Inserting records may also trigger garbage collection, which may cause your iterator to miss records. .. _faq-locking: Locking ------- **Do I need to use locking if I'm only doing read access and using mdbm_replace?** If you have a read-only MDBM (there are no store/delete operations) in a single-threaded application, you do not need to lock. This is because the access operations are smart enough to check for replacement and to acquire an internal lock. However, if you use ``mdbm_replace`` in a multi-threaded application, you do need to lock around fetches. A future enhancement will remove this locking requirement for multi-threaded applications. **When should I use mdbm_lock?** When two or more processes are reading and writing to the same MDBM. ``mdbm_lock`` is used by a process reading or writing to obtain exclusive access. **There doesn't appear to be a distinction between read locks and write locks. Is that correct?** For exclusive locks, that's correct. **Is there any mechanism for allowing multiple readers and one writer (MROW) that doesn't have the readers block each other?** MDBM V3 has shared locks (sometimes called read-write locks). **Are the lock requests FIFO?** No, locks are scheduled according to process priority. **Why doesn't mdbm_fetch automatically lock?** ``mdbm_fetch`` doesn't lock so that an application can take greater control over locking, and the corresponding performance in a few ways: - ``mdbm_fetch`` doesn't copy-out the data. An application could lock, fetch, look at that pointer's data contents, and unlock. In some situations, this can be much faster than lock, fetch, copy-out, unlock, and look at returned contents. - If you are willing trade off latency for higher throughput: locking, doing multiple fetches (copy-out or not), and unlock, you could achieve higher throughput. This approach is application and data dependent. - If you have a master record and dependent records (specified in that master record), your app might require that accesses to the master record and the dependent records be done in a single locked context. Otherwise, dependent records could be deleted or be modified, which could be incompatible with the master record. ``mdbm_fetch_str`` locks, does a copy-out of the value, and unlocks. This, however, is only for string data. ``mdbm_fetch_buf`` also locks while copying into the provided buffer. **With MDBM V4, I'm getting the following error message: multiple different lockfiles exist** If you're seeing the following error message when opening an MDBM: mdbm_lock.cc:68 YourFile.mdbm: multiple different lockfiles exist! : No such file or directory ERROR (2 No such file or directory) in mdbm_open_inner() mdbm.c:3817 Then this is what is likely happening: someone has opened YourFile.mdbm using a 32 bit process and you are using 64-bit, or vice versa. Make sure any tools you are using match your executables (bin vs bin64). .. _faq-performance: Performance ----------- **If we have many little structures to store (possibly smaller than 64 bytes, keyed by registered user), how should we tune for that? (page-size?)** Many little structures work best. It's bigger structures that create problems. You should try different page sizes to see what performs better. 8K or 16K are probably good starting points. **Are there are guidelines for tuning MDBM, or is it more of trial and error?** It's mostly trial-and-error, but try to use the smallest page size that will fit the dataset without causing page overflows (a page overflow happens when a key to be inserted hashes to a page that's already full and the database can't be split because it's already too big). Use larger page sizes when key+value size is larger, smaller page sizes when key+value size is smaller. Larger page sizes are slower because effectively the hash buckets take longer to locate a specific key. In V3, however, this was significantly speeded up. Also, if you know you don't have duplicate keys (or don't care if they get inserted), you can avoid the lookup that occurs on insert by use the ``MDBM_INSERT_DUP`` flags. That'll speed things up even more. There is a new ``mdbm_config`` tool that will help you select MDBM configuration parameters for your dataset. **What should be the ratio between main memory size and MDBM size (in order to maintain its performance)?** MDBM expects all of its data pages to be in physical memory. MDBM databases grow in power's of 2, and not all the pages mapped necessarily have data on them. The ``mdbm_stat`` utility can analyze a database and show the various efficiencies (how full the pages are, how many non-empty pages there are). **Why is building an offline MDBM slow, and my resulting file is highly fragmented?** If you are building offline and you have a known maximum size of the MDBM: - Create the MDBM with the initial size set to the final size - Use ``mdbm_limit_size_v3`` to ensure that MDBM doesn't split in the future - Make sure that your physical memory is larger than the resulting MDBM Setting the initial size to the final size will avoid MDBM splits, which also avoids the latency incurred during the split. The resulting MDBM directory will also have fewer levels (enabling faster lookups). If the resulting MDBM is highly fragmented, you probably have highly a fragmented disk. Either use a non-fragmented disk, or use a ramdisk to build the MDBM, then copy the MDBM out of the ramdisk. **How can I speed up fetches in my read-only web application?** Frequently, web applications open an MDBM on a per-user-request basis. This is a very bad idea because each open can take several hundred (or more) microseconds. The best thing that you can do to improve your performance is to open the MDBM once at an application level, not at a per-request level. For a single (unlocked) memory-resident lookup, it should take ~4 microseconds on standard hardware. Not only is an open call comparatively slow to a fetch, but concurrent calls to ``mdbm_open`` are single-threaded when it creates the locks for the first time (very slow) and then it maps the shared MDBM into process address space. After initial lock creation, subsequent ``mdbm_open`` calls will also be single-threaded as it creates state in each new handle and maps the shared MDBM into process address space. If there are *no* writes taking place on the MDBM, and you are *not* using ``mdbm_replace``, you can disable all locking overhead by specifying ``MDBM_OPEN_NOLOCK`` in your ``mdbm_open`` flags. This avoids creating the mutexes used for locking during ``mdbm_open``. Fetching a non-existent key is slower than fetching a key that exists. Non-existent keys require checking all keys on a page before determining that a key does not exist. In this regard, MDBM V3 format files are faster than MDBM V2 due to some saved key hash data. The number of key comparisons will influence your fetch time. Large pages typically have more keys/data, thus have a slower lookup time. Using a smaller page size can get better performance. Don't use large objects, they are a little slower to reference. .. _faq-data-access-and-management: Data Access and Management -------------------------- **Does MDBM automatically do garbage collection?** No. MDBM doesn't do this for you. Data-specific garbage collection can be implemented using the "shake" function that is registered by using ``mdbm_limit_size_v3``. **What is the upper limit size for MDBM, before its performance degrades?** Even at maximum size, the design of MDBM requires only two database pages to be accessed for any single fetch. If your system's RAM available to MDBMs cannot contain your "working set" of MDBM data, your performance will degrade. **How much extra empty storage does an MDBM require?** It depends on the hash collision rate of the keys. The better the hashed key distribution is, the less likely a leaf page is to be filled and require a premature database split. **Is it possible to shrink an MDBM without rebuilding the entire database?** It's sometimes possible that ``mdbm_compress`` will be able to shrink the database because it rebalances the tree. **How does mdbm_store with flag MDBM_MODIFY work?** When using ``mdbm_store`` with flag ``MDBM_MODIFY`` to change an existing record, what happens if the new record is of the same size as the original record? What happens if the new record is of a different size than the original? If the new record is of the same size as the original, the data is replaced but the location in memory stays the same. If the new record is of a different size (larger or smaller) than the original, the original record is deleted and a new one is inserted, so the location in memory may change. **When do I use mdbm_popen vs. mdbm_open in PHP?** Usage might depend on the frequency of access. For example a PHP script running in yapache context that accesses the MDBM during all/most of the requests, then it should do a ``mdbm_popen``. **Can I use mdbm_fetch without taking out a lock, to just check for key existence?** The short answer is that you must always take out a lock for a fetch operation if there are concurrent writes (store or delete operations). Here is the reasoning on why you must lock around ``mdbm_fetch`` if there are concurrent writes. - An unlocked ``mdbm_fetch`` cannot guarantee a coherent use of a key's fields for offset and size. An intervening write might produce a mismatched offset and/or size for a key. - Depending on your hardware architecture for memory byte-ordering of read/write access and atomicity of access, reading a field (ex., offset) as it's being written could generate an undefined result. The operations that write key meta-data information are not atomic operations. For x86 architectures, this might not be an issue under some situations (ex., 2 or 4-byte writes with aligned access). This issue requires additional investigation. A further consideration is that the underlying access for ``mdbm_fetch`` is doing a 3-byte read. If you have concurrent writes, and you do not lock an ``mdbm_fetch``, the following consequences are possible, but (highly) unlikely: - False-positives -- ``mdbm_fetch`` could indicate that a key is present when it is not - False-negatives -- ``mdbm_fetch`` could indicate that a key is not present when it is - SEGV -- if a key resolves to an off-page address, and that page has not been mapped, it is an access violation. Some related issues: - If you do get a crash, it will be very difficult to determine the overall reason for the crash. An invalid address access will be obvious, but there will be insufficient information to develop a scenario to explain the crash. The bad access generated by ``mdbm_fetch`` is transient, and not deterministically reproducible. - If a crash happens while an ``mdbm_store`` operation is taking place, the meta-data on the page could become corrupted. An ``mdbm_store`` can result in shuffling the meta-data on the page (to make room for the store) which could result in a partial update. .. _faq-miscellaneous: Miscellaneous ------------- **Is MDBM thread-safe?** MDBM V2 is not thread-safe in any context. There are known problems with ``mdbm_replace``, as well as other unqualified issues. MDBM V3 is thread-safe if used in a specific way: - Only 1 thread may use an MDBM handle at a time. - If an application needs concurrent MDBM access, then additional MDBM handles are required It is up to the app to decide how to ensure that a handle may be used by only 1 thread at a time. Handles contains state, and you cannot use a handle concurrently across threads even if you are only doing fetches (reads). An mdbm handle is a context object, and as is frequently done in reentrant programming, you pass a context object to a reentrant routine for that routine to read/write thread-specific state information so that it does not to use external (global) state. If you have a multi-threaded app, you will probably want to use dup handles to avoid remapping the same MDBM multiple times (use mdbm_dup_handle to create a new handle instead of mdbm_open). There are various approaches for using MDBM handles in a threaded context. For high-performance applications, it's recommended to use thread-local storage (TLS) containing an MDBM handle. Alternatively, it would be possible to create a pool of MDBM handles, but that would require a lock to acquire a handle, which might have an unwanted impact on latency. **Can I open an MDBM, then use it in forked child processes?** No, you cannot expect MDBM handles (MDBM \*) to be valid across fork() calls. You will have to call mdbm_open() in the child process after forking that child process. The MDBM handle contains data items such as lock counts that cannot be expected to remain consistent when copied to another process's memory space. **How do I know which processes are using a particular MDBM?** lsof | grep *yourMdbmFile* .. End of documentation emacsen buffer-local ispell variables -- Do not delete. === content === LocalWords: CRC DUP FSYNC MROW NOLOCK SHA TLS VM basedir LocalWords: YourFile btree buf cxx dev dup emacsen faq fsync grep init iter LocalWords: kern lockfile lockfiles lsof malloc maxdsiz mdbm mdbm's mdbms LocalWords: mdbmv mmap mmap'd mmaps mmutex nfs perl pid pmxmap pmxtools popen LocalWords: presize proc procs ramdisk rlimit sec smaps speeded stat str LocalWords: sync'd sync'ing sysctl tmp tmpfs vm yapache ydbm yjava LocalWords: yourMdbmFile yphp yroot ysys Local Variables: mode: text fill-column: 80 indent-tabs-mode: nil tab-width: 4 End: