Investigating `man-db` internals

From time to time I get nerd-sniped into understanding the internals of some system, library or other technology I use. This time, man was the target. I’ve discovered that man-db uses a DB (maybe that’s why there is DB in the name?) to map command names to some metadata, allowing quicker lookups by the whatis and apropos commands.

Technically, apropos is just a symlink to whatis and they work very similarly to each other.

Here’s some quick information that I’ll cover in more depth in this post:

All of man-db’s cache and storage lives in the /var/cache/man folder.
The DB is stored in a file named index.{db|bt} in the cache folder.
This is a key-value database and the key is the command name. For example, whatis python would look up the entry in the DB with the key python.

There’s a database in man-db?

Yes. There’s an index DB used for lookups on command name and description by the whatis and apropos commands. The three storage options are Berkeley DB, GNU dbm (gdbm) and UNIX BTree. I’m using Debian for these tests, which uses the gdbm storage layer, so that’s what I’ll be focusing on in this post.

DB and cache folder

The contents under /var/cache/man are only created after you run mandb for the first time as a privileged user. Running mandb for the first time gives you a message similar to:

74 man subdirectories contained newer manual pages.
1344 manual pages were added.
0 stray cats were added.
0 old database entries were purged.

Exploring the index file

One of the files created under the cache folder is the DB file itself: /var/cache/man/index.db. The man-db package offers a convenient tool, accessdb, to dump the contents of the DB as ASCII:

$ accessdb
$version$ -> "2.5.0"
/etc/adduser.conf -> "- 5 5 1537038759 0 C adduser.conf - gz "
/etc/deluser.conf -> "- 5 5 1537038759 0 C deluser.conf - gz "
PAM~7 -> "- 7 7 1630005083 0 A - - gz Pluggable Authentication Modules for Linux"
[ -> "- 1 1 1600936569 0 B - - gz check file types and compare values"
access.conf -> "- 5 5 1630005083 0 A - - gz the login access control table file"
accessdb -> "- 8 8 1613729663 0 A - - gz dumps the content of a man-db database in a human readable format"
add-shell -> "- 8 8 1601227547 0 A - - gz add shells to the list of valid login shells"
addgroup -> "- 8 8 1537038759 0 B - - gz add a user or group to the system"
addpart -> "- 8 8 1642709435 0 A - - gz tell the kernel about the existence of a partition"
adduser -> "- 8 8 1537038759 0 A - - gz add a user or group to the system"
adduser.conf -> "- 5 5 1537038759 0 A - - gz configuration file for adduser(8) and addgroup(8) ."
adjtime -> "- 5 5 1642709435 0 C adjtime_config - gz "
adjtime_config -> "- 5 5 1642709435 0 A - - gz information about hardware clock setting and drift factor"
agetty -> "- 8 8 1642709435 0 A - - gz alternative Linux getty"
...

Interesting. There’s a lot to unpack here. We can clearly see some familiar information here. There is a name, section and a description that maybe we can correlate with the text shown in whatis. Let’s use adduser as an example.

$ whatis adduser
adduser (8)          - add a user or group to the system

Nice! But what’s the rest? Time to read some C code and see what we can understand about this feature.

Cache index contents

Reading through accessdb.c (the source file for the accessdb binary), we can quickly find the code that iterates over the DB entries and prints them to the terminal:

	while (MYDBM_DPTR (key) != NULL) {
		datum content, nextkey;
		char *t, *nicekey;

		content = MYDBM_FETCH (dbf, key);
		if (!MYDBM_DPTR (content)) {
			debug ("key %s has no content!\n", MYDBM_DPTR (key));
			ret = FATAL;
			goto next;
		}
		nicekey = xstrdup (MYDBM_DPTR (key));
		while ( (t = strchr (nicekey, '\t')) )
			*t = '~';
		while ( (t = strchr (MYDBM_DPTR (content), '\t')) )
			*t = ' ';
		printf ("%s -> \"%s\"\n", nicekey, MYDBM_DPTR (content));
		free (nicekey);
		MYDBM_FREE_DPTR (content);
next:
		nextkey = MYDBM_NEXTKEY (dbf, key);
		MYDBM_FREE_DPTR (key);
		key = nextkey;
	}

What are we looking at here? There are two important parts:

MYDBM_FETCH, which uses the dbfile and key to get the contents
```
content = MYDBM_FETCH (dbf, key);
```

The printf statement, which prints each entry in the key -> content format

printf ("%s -> \"%s\"\n", nicekey, MYDBM_DPTR (content));

We now know for a fact that the part before the -> is the key in the DB. MYDBM_FETCH is simply a macro that maps to whatever function the DB you’re using supports. In our case, it maps to gdbm_fetch((db)->file, key).

Here we’re simply dumping the contents of a character array into the output buffer, so this code alone doesn’t tell us what each field means. After some investigation into other places that fetch and store information in the DB, I found the following data structure, which represents a man entry:

struct mandata {
	char *addr;			/* ptr to memory containing the fields */

	char *name;			/* Name of page, if != key */

	/* The following are all const because they should be pointers to
	 * parts of strings allocated elsewhere (often the addr field above)
	 * and should not be written through or freed themselves.
	 */
	const char *ext;		/* Filename ext w/o comp ext */
	const char *sec;		/* Section name/number */
	char id;			/* id for this entry */
	const char *pointer;		/* id related file pointer */
	const char *comp;		/* Compression extension */
	const char *filter;		/* filters needed for the page */
	const char *whatis;		/* whatis description for page */
	struct timespec mtime;		/* mod time for file */
};

Notice that char *addr points to the memory that holds all the fields, while the other fields are only pointers to where each piece of information lives inside that memory. This becomes clearer if we read the split_content procedure, which parses a raw DB read into a mandata struct:

/* Parse the db-returned data and put it into a mandata format */
void split_content (MYDBM_FILE dbf, char *cont_ptr, struct mandata *pinfo)
{
	char *start[FIELDS];
	char **data;

	data = split_data (dbf, cont_ptr, start);

	pinfo->name = copy_if_set (*(data++));
	pinfo->ext = *(data++);
	pinfo->sec = *(data++);
	pinfo->mtime.tv_sec = (time_t) atol (*(data++));
	pinfo->mtime.tv_nsec = atol (*(data++));
	pinfo->id = **(data++);				/* single char id */
	pinfo->pointer = *(data++);
	pinfo->filter = *(data++);
	pinfo->comp = *(data++);
	pinfo->whatis = *(data);

	pinfo->addr = cont_ptr;
}

As you can see, the data array is read sequentially and each property in the mandata is filled according to the position in memory. Now it’s important to highlight that before, we were simply dumping data contents into the output buffer, so what we see in accessdb is exactly how data (or pointers, to be more precise) are ordered in memory.
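
To make the idea concrete, here is a minimal, self-contained sketch of the same in-place splitting. It is not man-db's code: struct sketch_entry and sketch_split are hypothetical names, and it assumes the raw record separates its fields with tabs, which is what the tab-to-space replacement in accessdb suggests.

```c
#include <stdlib.h>
#include <string.h>

/* name, ext, sec, mtime sec, mtime nsec, id, pointer, filter, comp, whatis */
#define SKETCH_FIELDS 10

/* Hypothetical mirror of the fields in struct mandata (not man-db's type). */
struct sketch_entry {
	const char *name;
	const char *ext;
	const char *sec;
	long mtime_sec;
	long mtime_nsec;
	char id;
	const char *pointer;
	const char *filter;
	const char *comp;
	const char *whatis;
};

/* Split a writable, tab-separated record in place. Every string field in
 * `out` ends up pointing into `record`, so `record` must outlive `out`.
 * Returns 0 on success, -1 if the record has too few fields. */
int sketch_split (char *record, struct sketch_entry *out)
{
	char *field[SKETCH_FIELDS];
	char *p = record;
	int i;

	for (i = 0; i < SKETCH_FIELDS - 1; i++) {
		char *tab = strchr (p, '\t');
		if (!tab)
			return -1;
		*tab = '\0';		/* terminate this field */
		field[i] = p;
		p = tab + 1;
	}
	field[i] = p;			/* the remainder is the whatis text */

	out->name = field[0];
	out->ext = field[1];
	out->sec = field[2];
	out->mtime_sec = strtol (field[3], NULL, 10);
	out->mtime_nsec = strtol (field[4], NULL, 10);
	out->id = field[5][0];		/* single char id */
	out->pointer = field[6];
	out->filter = field[7];
	out->comp = field[8];
	out->whatis = field[9];
	return 0;
}
```

Just like in mandata, nothing is copied: the record buffer plays the role of addr and every other field is a pointer into it.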

With both pieces of information, we can put them together to figure out what’s being dumped. Here are some examples from the result above:

$ accessdb
adduser -> "- 8 8 1537038759 0 A - - gz add a user or group to the system"
addgroup -> "- 8 8 1537038759 0 B - - gz add a user or group to the system"
adjtime -> "- 5 5 1642709435 0 C adjtime_config - gz "
adjtime_config -> "- 5 5 1642709435 0 A - - gz information about hardware clock setting and drift factor"
...

	0	1	2	3	4	5	6	7	8
key	`name`	`ext`	`sec`	`time` (sec nsec)	`id`	`pointer`	`filter`	`comp`	`whatis`
adduser	-	8	8	1537038759 0	A	-	-	gz	"add a user or group to the system"
addgroup	-	8	8	1537038759 0	B	-	-	gz	"add a user or group to the system"
adjtime	-	5	5	1642709435 0	C	adjtime_config	-	gz	""
adjtime_config	-	5	5	1642709435 0	A	-	-	gz	"information about hardware clock setting and drift factor"
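
The time column holds two numbers, seconds and then nanoseconds, matching the mtime field of struct mandata. The seconds value reads like a Unix timestamp, so it can be turned into something human-readable. A small sketch using only the C standard library (format_mtime_utc is a hypothetical helper, not part of man-db):

```c
#include <stddef.h>
#include <time.h>

/* Format a Unix timestamp as "YYYY-MM-DD HH:MM:SS" in UTC.
 * Returns the number of characters written, or 0 on failure. */
size_t format_mtime_utc (long seconds, char *buf, size_t buflen)
{
	time_t t = (time_t) seconds;
	struct tm *tm = gmtime (&t);

	if (!tm)
		return 0;
	return strftime (buf, buflen, "%Y-%m-%d %H:%M:%S", tm);
}
```

For the adduser entry, format_mtime_utc (1537038759, buf, sizeof buf) fills buf with 2018-09-15 19:12:39.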

This is good! But now even more questions are popping up.

What is the id and why is it a letter?

This is answered by a comment in libdb/db_storage.h:

/* These definitions give an inherent precedence to each particular type
   of manual page:

   ULT_MAN:	ultimate manual page, the full source nroff file.
   SO_MAN:	source nroff file containing .so request to an ULT_MAN.
   WHATIS_MAN:	virtual `whatis referenced' page pointing to an ULT_MAN.
   STRAY_CAT:	pre-formatted manual page with no source.
   WHATIS_CAT:  virtual `whatis referenced' page pointing to a STRAY_CAT. */

/* WHATIS_MAN and WHATIS_CAT are deprecated. */

#define ULT_MAN		'A'
#define SO_MAN		'B'
#define WHATIS_MAN	'C'
#define STRAY_CAT	'D'
#define WHATIS_CAT	'E'
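
When reading an accessdb dump, it helps to translate those letters back into words. A tiny sketch of such a lookup, where describe_id is a hypothetical helper whose descriptions are paraphrased from the comment above:

```c
#include <stddef.h>

/* Map the single-character id from a DB entry to a short description.
 * Returns NULL for letters that are not listed in db_storage.h. */
const char *describe_id (char id)
{
	switch (id) {
	case 'A':
		return "ULT_MAN: full source nroff file";
	case 'B':
		return "SO_MAN: nroff file with a .so request to an ULT_MAN";
	case 'C':
		return "WHATIS_MAN: whatis reference to an ULT_MAN";
	case 'D':
		return "STRAY_CAT: pre-formatted page with no source";
	case 'E':
		return "WHATIS_CAT: whatis reference to a STRAY_CAT";
	default:
		return NULL;
	}
}
```

With this, the A on adduser reads as a real page, the B on addgroup as a .so stub, and the C on adjtime as a whatis reference.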

What about this pointer to another entry?

This indicates that the adjtime entry points to adjtime_config, so whatis returns the same page for both names. This can be observed by:

$ whatis adjtime_config
adjtime_config (5)   - information about hardware clock setting and drift factor
$ whatis adjtime
adjtime_config (5)   - information about hardware clock setting and drift factor
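
The observed behaviour can be modelled with a toy in-memory table. This is only a sketch of what the output above suggests, not how man-db implements it: toy_entry, toy_db, toy_find and toy_resolve are all hypothetical, and it assumes a pointer of "-" means "no pointer" and that a single hop is enough.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, trimmed-down view of a DB entry. */
struct toy_entry {
	const char *key;
	char id;
	const char *pointer;	/* "-" when the entry points nowhere */
	const char *whatis;
};

/* A few entries copied from the accessdb dump. */
static const struct toy_entry toy_db[] = {
	{ "adduser", 'A', "-", "add a user or group to the system" },
	{ "adjtime", 'C', "adjtime_config", "" },
	{ "adjtime_config", 'A', "-",
	  "information about hardware clock setting and drift factor" },
};

/* Linear search by key; returns NULL when the key is unknown. */
static const struct toy_entry *toy_find (const char *key)
{
	size_t i;

	for (i = 0; i < sizeof toy_db / sizeof toy_db[0]; i++)
		if (strcmp (toy_db[i].key, key) == 0)
			return &toy_db[i];
	return NULL;
}

/* Look up a key and follow its pointer one hop if it has one.
 * Falls back to the original entry when the target is missing. */
const struct toy_entry *toy_resolve (const char *key)
{
	const struct toy_entry *e = toy_find (key);

	if (e && strcmp (e->pointer, "-") != 0) {
		const struct toy_entry *target = toy_find (e->pointer);
		if (target)
			return target;
	}
	return e;
}
```

Here toy_resolve ("adjtime") returns the adjtime_config entry, which is consistent with whatis printing adjtime_config (5) for both names.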

We’ve learned that there is a database that indexes each man page by its name. Each entry carries some metadata that allows man-db to quickly identify the page’s type, extension and section.

There are still some things I don’t understand about man-db and why some things work the way they do.

catman pages don’t seem to be cached if terminal window size isn’t exactly 80.
There seems to be some hierarchy between pages? The pointer property gives us a hint that pages are linked to each other. Still need to investigate this further.
OSX uses man (not man-db) by default, which has a different set of features. man-db is a fork of an earlier version of man, so there are many similarities. You can still install man-db using brew.
Different OSes do things slightly differently. For example, NixOS adds some extra MANPATH_MAP and MANDB_MAP entries. Debian doesn’t specify a DB in configure but clearly uses gdbm, while pacman’s PKGBUILD explicitly sets gdbm. Homebrew also doesn’t specify a DB in configure but defaults to BTree. I don’t get why some things work the way they do.

#Linux


If you hated this post, and can't keep it to yourself, consider sending me an e-mail at fred.rbittencourt@gmail.com. I'm more responsive to positive comments though.
