Investigating man-db internals
From time to time I get nerd-sniped into understanding the internals of some system, library or other technology I use. This time, man was the target. I’ve discovered that man-db uses a DB (maybe that’s why there is DB in the name?) to map command names to some metadata, allowing quicker lookups by the whatis and apropos commands.
Technically, apropos is just a symlink to whatis, and the two work very similarly.
Here’s some quick information that I will cover in more depth in this post:
- All cache and storage from man-db is in the /var/cache/man folder.
- The DB is stored in a file named index.{db|bt} in the cache folder.
- This is a key-value database and the key is the command name. For example, whatis python would look up the entry in the DB with the key python.
There’s a database in man-db?
Yes. There’s an index DB that’s used for lookups on command name and description by the whatis and apropos commands. The three storage options are Berkeley DB, GNU dbm (gdbm) and UNIX BTree. I’m using Debian for these tests, which uses the gdbm storage layer, so that’s what I’ll be focusing on in this post.
DB and cache folder
The contents under /var/cache/man are only created after you run mandb for the first time as a privileged user. Running mandb for the first time gives you a message similar to:
74 man subdirectories contained newer manual pages.
1344 manual pages were added.
0 stray cats were added.
0 old database entries were purged.
Exploring the index file
One of the files created under the cache folder is the DB file itself: /var/cache/man/index.db. The man-db package offers a convenient tool, accessdb, to dump the contents of the DB as ASCII:
$ accessdb
$version$ -> "2.5.0"
/etc/adduser.conf -> "- 5 5 1537038759 0 C adduser.conf - gz "
/etc/deluser.conf -> "- 5 5 1537038759 0 C deluser.conf - gz "
PAM~7 -> "- 7 7 1630005083 0 A - - gz Pluggable Authentication Modules for Linux"
[ -> "- 1 1 1600936569 0 B - - gz check file types and compare values"
access.conf -> "- 5 5 1630005083 0 A - - gz the login access control table file"
accessdb -> "- 8 8 1613729663 0 A - - gz dumps the content of a man-db database in a human readable format"
add-shell -> "- 8 8 1601227547 0 A - - gz add shells to the list of valid login shells"
addgroup -> "- 8 8 1537038759 0 B - - gz add a user or group to the system"
addpart -> "- 8 8 1642709435 0 A - - gz tell the kernel about the existence of a partition"
adduser -> "- 8 8 1537038759 0 A - - gz add a user or group to the system"
adduser.conf -> "- 5 5 1537038759 0 A - - gz configuration file for adduser(8) and addgroup(8) ."
adjtime -> "- 5 5 1642709435 0 C adjtime_config - gz "
adjtime_config -> "- 5 5 1642709435 0 A - - gz information about hardware clock setting and drift factor"
agetty -> "- 8 8 1642709435 0 A - - gz alternative Linux getty"
...
Interesting. There’s a lot to unpack here, but some of the information is clearly familiar: there is a name, a section and a description that we can probably correlate with the text shown by whatis. Let’s use adduser as an example.
$ whatis adduser
adduser (8) - add a user or group to the system
Nice! But what’s the rest? Time to read some C code and see what we can understand of this feature.
Cache index contents
Reading through accessdb.c (the file that compiles into the accessdb binary), we can quickly find the code that iterates over the DB entries and prints them to the terminal:
while (MYDBM_DPTR (key) != NULL) {
    datum content, nextkey;
    char *t, *nicekey;
    content = MYDBM_FETCH (dbf, key);
    if (!MYDBM_DPTR (content)) {
        debug ("key %s has no content!\n", MYDBM_DPTR (key));
        ret = FATAL;
        goto next;
    }
    nicekey = xstrdup (MYDBM_DPTR (key));
    while ( (t = strchr (nicekey, '\t')) )
        *t = '~';
    while ( (t = strchr (MYDBM_DPTR (content), '\t')) )
        *t = ' ';
    printf ("%s -> \"%s\"\n", nicekey, MYDBM_DPTR (content));
    free (nicekey);
    MYDBM_FREE_DPTR (content);
next:
    nextkey = MYDBM_NEXTKEY (dbf, key);
    MYDBM_FREE_DPTR (key);
    key = nextkey;
}
What are we looking at here? There are two important parts:
- MYDBM_FETCH, which uses the DB file and the key to get the contents:
  content = MYDBM_FETCH (dbf, key);
- The printf statement, which prints each entry in the format key -> content:
  printf ("%s -> \"%s\"\n", nicekey, MYDBM_DPTR (content));
We now know for a fact that the part before the -> is the key in the DB. MYDBM_FETCH is simply a macro that maps to whatever function the DB you’re using supports. In our case, it maps to gdbm_fetch((db)->file, key).
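For anyone who would rather poke at the file from a script, the same first-key/next-key walk can be sketched in Python with the standard library’s dbm.gnu module. This is only a sketch under several assumptions: the index lives at /var/cache/man/index.db, it was built with gdbm, your Python build ships dbm.gnu, and keys and values are stored as C strings with a trailing NUL byte (which the sketch strips). dump_index is a hypothetical helper name, not something man-db provides.

```python
def dump_index(path="/var/cache/man/index.db"):
    """Yield (key, content) pairs from a gdbm file, formatted like accessdb."""
    # Imported lazily: dbm.gnu is optional and missing from some Python builds.
    import dbm.gnu

    with dbm.gnu.open(path, "r") as db:
        key = db.firstkey()
        while key is not None:
            content = db[key]
            # Assumption: both sides are C strings, so drop the trailing NUL.
            nice_key = key.rstrip(b"\0").decode(errors="replace")
            nice_content = content.rstrip(b"\0").decode(errors="replace")
            # Same cosmetic substitutions as the C loop: tabs become '~' in
            # the key and spaces in the content.
            yield nice_key.replace("\t", "~"), nice_content.replace("\t", " ")
            key = db.nextkey(key)


# Example use (reads the real index, so it may need the right permissions):
#   for key, content in dump_index():
#       print(f'{key} -> "{content}"')
```

The loop deliberately mirrors the C version (fetch, prettify, advance) rather than using a more compact Python idiom, so the two can be read side by side.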
The printf call simply dumps the contents of a character array into the output buffer, so it tells us nothing about how the content is structured. After investigating other places that fetch and store information in the DB, I found the following data structure, which represents a man entry:
struct mandata {
    char *addr;            /* ptr to memory containing the fields */
    char *name;            /* Name of page, if != key */
    /* The following are all const because they should be pointers to
     * parts of strings allocated elsewhere (often the addr field above)
     * and should not be written through or freed themselves.
     */
    const char *ext;       /* Filename ext w/o comp ext */
    const char *sec;       /* Section name/number */
    char id;               /* id for this entry */
    const char *pointer;   /* id related file pointer */
    const char *comp;      /* Compression extension */
    const char *filter;    /* filters needed for the page */
    const char *whatis;    /* whatis description for page */
    struct timespec mtime; /* mod time for file */
};
Notice that char *addr points to the memory that holds all the information, while the other string fields are only pointers to where each piece of information can be found inside it. This becomes clearer if we read the split_content procedure, which parses a raw DB read into a mandata struct:
/* Parse the db-returned data and put it into a mandata format */
void split_content (MYDBM_FILE dbf, char *cont_ptr, struct mandata *pinfo)
{
    char *start[FIELDS];
    char **data;
    data = split_data (dbf, cont_ptr, start);
    pinfo->name = copy_if_set (*(data++));
    pinfo->ext = *(data++);
    pinfo->sec = *(data++);
    pinfo->mtime.tv_sec = (time_t) atol (*(data++));
    pinfo->mtime.tv_nsec = atol (*(data++));
    pinfo->id = **(data++); /* single char id */
    pinfo->pointer = *(data++);
    pinfo->filter = *(data++);
    pinfo->comp = *(data++);
    pinfo->whatis = *(data);
    pinfo->addr = cont_ptr;
}
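To make the field order concrete, here is a rough Python counterpart of split_content. It is a sketch, not a port: it assumes the raw record is tab-separated (inferred from accessdb replacing tabs with spaces before printing), that there are ten fields (counted from the assignments above), and that a name of "-" means "not set". The ManData class and its attribute names are my own.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: ten tab-separated fields, counted from the C assignments.
FIELDS = 10


@dataclass
class ManData:
    name: Optional[str]  # only set when the page name differs from the key
    ext: str
    sec: str
    mtime_sec: int
    mtime_nsec: int
    id: str
    pointer: str
    filter: str
    comp: str
    whatis: str


def split_content(raw: str) -> ManData:
    """Parse one raw record, reading fields in the same order as the C code."""
    # maxsplit keeps any tabs inside the trailing whatis text intact.
    parts = raw.split("\t", FIELDS - 1)
    if len(parts) != FIELDS:
        raise ValueError(f"expected {FIELDS} fields, got {len(parts)}")
    name, ext, sec, tv_sec, tv_nsec, id_, pointer, filter_, comp, whatis = parts
    return ManData(
        name=None if name == "-" else name,  # assumption: "-" means unset
        ext=ext,
        sec=sec,
        mtime_sec=int(tv_sec),
        mtime_nsec=int(tv_nsec),
        id=id_[:1],  # single-character id
        pointer=pointer,
        filter=filter_,
        comp=comp,
        whatis=whatis,
    )
```

Unlike the C version, this copies every field into a new string instead of keeping pointers into one shared buffer, which is why there is no equivalent of addr.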
As you can see, the data array is read sequentially and each property of the mandata struct is filled according to its position in memory. It’s important to highlight that accessdb simply dumped the raw content into the output buffer, so what we see in its output is exactly how the data (or the pointers, to be more precise) is ordered in memory.
With both pieces of information, we can put them together to figure out what’s being dumped. Here are some examples from the result above:
$ accessdb
adduser -> "- 8 8 1537038759 0 A - - gz add a user or group to the system"
addgroup -> "- 8 8 1537038759 0 B - - gz add a user or group to the system"
adjtime -> "- 5 5 1642709435 0 C adjtime_config - gz "
adjtime_config -> "- 5 5 1642709435 0 A - - gz information about hardware clock setting and drift factor"
...
| key | name (0) | ext (1) | sec (2) | time (3) | id (4) | pointer (5) | filter (6) | comp (7) | whatis (8) |
|---|---|---|---|---|---|---|---|---|---|
| adduser | - | 8 | 8 | 1537038759 0 | A | - | - | gz | "add a user or group to the system" |
| addgroup | - | 8 | 8 | 1537038759 0 | B | - | - | gz | "add a user or group to the system" |
| adjtime | - | 5 | 5 | 1642709435 0 | C | adjtime_config | - | gz | "" |
| adjtime_config | - | 5 | 5 | 1642709435 0 | A | - | - | gz | "information about hardware clock setting and drift factor" |
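The same labelling can be done mechanically on accessdb output. The sketch below splits one printed line into the columns of the table and turns the modification time into a readable UTC timestamp. It assumes that only the final whatis field contains spaces (accessdb has already turned the separators into spaces, so anything else would be ambiguous); parse_accessdb_line and COLUMNS are names I made up for illustration.

```python
import re
from datetime import datetime, timezone

COLUMNS = ("name", "ext", "sec", "tv_sec", "tv_nsec",
           "id", "pointer", "filter", "comp", "whatis")

# One line of accessdb output looks like: key -> "content"
DUMP_LINE = re.compile(r'^(?P<key>.+?) -> "(?P<content>.*)"$')


def parse_accessdb_line(line):
    """Return a dict with the key, the labelled fields and a UTC mtime."""
    match = DUMP_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not an accessdb line: {line!r}")
    # Split on the first nine spaces only, so the whatis text stays whole.
    fields = match["content"].split(" ", len(COLUMNS) - 1)
    if len(fields) != len(COLUMNS):
        # e.g. the $version$ entry, which is not a page record
        raise ValueError(f"not a page record: {line!r}")
    row = dict(zip(COLUMNS, fields))
    row["key"] = match["key"]
    row["mtime"] = datetime.fromtimestamp(int(row["tv_sec"]), tz=timezone.utc)
    return row


row = parse_accessdb_line(
    'adduser -> "- 8 8 1537038759 0 A - - gz add a user or group to the system"'
)
print(row["key"], row["sec"], row["id"], row["mtime"].isoformat())
# adduser 8 A 2018-09-15T19:12:39+00:00
```

Entries such as $version$ do not have ten fields, so the function rejects them instead of guessing.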
This is good! But now even more questions are popping up.
What is the id and why is it a letter?
Answered by a comment in libdb/db_storage.h:
/* These definitions give an inherent precedence to each particular type
   of manual page:
   ULT_MAN:    ultimate manual page, the full source nroff file.
   SO_MAN:     source nroff file containing .so request to an ULT_MAN.
   WHATIS_MAN: virtual `whatis referenced' page pointing to an ULT_MAN.
   STRAY_CAT:  pre-formatted manual page with no source.
   WHATIS_CAT: virtual `whatis referenced' page pointing to a STRAY_CAT. */
/* WHATIS_MAN and WHATIS_CAT are deprecated. */
#define ULT_MAN    'A'
#define SO_MAN     'B'
#define WHATIS_MAN 'C'
#define STRAY_CAT  'D'
#define WHATIS_CAT 'E'
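When reading dumps, it helps to translate the letters back into names. Here is a small sketch of that mapping in Python. The values come straight from the defines above; the idea that precedence follows letter order, with 'A' ranking highest, is my reading of the comment and not something I have confirmed in the code. PageType and best_type are hypothetical names.

```python
from enum import Enum


class PageType(Enum):
    ULT_MAN = "A"     # full nroff source page
    SO_MAN = "B"      # nroff file with a .so request to an ULT_MAN
    WHATIS_MAN = "C"  # whatis reference to an ULT_MAN (deprecated)
    STRAY_CAT = "D"   # pre-formatted page with no source
    WHATIS_CAT = "E"  # whatis reference to a STRAY_CAT (deprecated)


def best_type(ids):
    """Pick the type that wins, assuming 'A' outranks 'B' and so on."""
    return PageType(min(ids))


print(PageType("A").name)           # ULT_MAN
print(best_type(["C", "B"]).name)   # SO_MAN
```

Read against the earlier dump, this makes adduser a full page (A), addgroup a .so page (B) and adjtime a whatis reference (C).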
What about this pointer to another entry?
This indicates that, for whatis, the adjtime entry points to adjtime_config. This can be observed by:
$ whatis adjtime_config
adjtime_config (5) - information about hardware clock setting and drift factor
$ whatis adjtime
adjtime_config (5) - information about hardware clock setting and drift factor
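The behaviour observed above can be modelled in a few lines of Python. To be clear, this is a toy model of what the output suggests, not man-db’s actual lookup code: it assumes a pointer of "-" means "no pointer" and that a pointer names another key in the same DB. The ENTRIES dict holds only the two records from the dump, and the whatis function and its max_hops guard are my own.

```python
ENTRIES = {
    "adjtime": {
        "sec": "5", "id": "C", "pointer": "adjtime_config", "whatis": "",
    },
    "adjtime_config": {
        "sec": "5", "id": "A", "pointer": "-",
        "whatis": "information about hardware clock setting and drift factor",
    },
}


def whatis(name, entries=ENTRIES, max_hops=8):
    """Follow pointers until an entry with its own description is found."""
    target = name
    for _ in range(max_hops):  # guard against accidental pointer loops
        entry = entries[target]
        if entry["pointer"] == "-":
            return f'{target} ({entry["sec"]}) - {entry["whatis"]}'
        target = entry["pointer"]
    raise RuntimeError(f"pointer chain too long starting at {name!r}")


print(whatis("adjtime"))
print(whatis("adjtime_config"))
```

Both calls print the adjtime_config line, matching the two whatis invocations above.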
We’ve learned that there is a database that indexes each man page by its name. Each entry carries some metadata that allows quickly identifying the man page’s type, extension and the section it belongs to.
There are still some things I don’t understand about man-db and why some things work the way they do.
- catman pages don’t seem to be cached if the terminal width isn’t exactly 80 columns.
- There seems to be some hierarchy between pages? The pointer property gives us a hint that pages are linked to each other. I still need to investigate this further.
- OSX uses man (not man-db) by default, which has a different set of features. man-db is a fork of an earlier version of man, so there are many similarities. You can still install man-db using brew.
- Different OSes do things slightly differently. For example, nix-os adds some extra MANPATH_MAP and MANDB_MAP entries. Debian doesn’t specify a DB in configure but clearly uses gdbm, while pacman’s PKGBUILD explicitly sets gdbm. Homebrew also doesn’t specify a DB in configure but defaults to BTree. I don’t get why some things work the way they do.
If you hated this post, and can't keep it to yourself, consider sending me an e-mail at fred.rbittencourt@gmail.com or complain with me at Twitter X @derfrb. I'm also occasionally responsive to positive comments.