5.3. ITSF internal file formats

Please let us know if you find any other internal files, figure out formats of any internal files or find out what unknown parts of the files below do. Any and all contributions will be fully attributed and, if appropriate, co-copyright given.

In this section, where the description of a file says that an item is an offset into another file, that file may be located in the same CHM, or it may be located in an accompanying CHI file.

The different types of ITSF files contain different internal files. The list below indicates which file types contain which internal files:

CHI
/#ITBITS, /#SYSTEM, /#IDXHDR, /#STRINGS, /#TOCIDX, /#TOPICS, /#URLSTR, /#URLTBL, /#WINDOWS, /$OBJINST, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWKeywordLinks/Property
CHM
/#ITBITS, /#SYSTEM, /#IDXHDR, /#STRINGS, /#TOCIDX, /#TOPICS, /#URLSTR, /#URLTBL, /#IVB, /#SUBSETS, /#WINDOWS, /$FIftiMain, /$OBJINST, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWKeywordLinks/Property
CHQ
/$FIftiMain, /$OBJINST, /$TitleMap
CHW
/$OBJINST, /$HHTitleMap, /$WWAssociativeLinks/BTree, /$WWAssociativeLinks/Data, /$WWAssociativeLinks/Map, /$WWAssociativeLinks/Property, /$WWKeywordLinks/BTree, /$WWKeywordLinks/Data, /$WWKeywordLinks/Map, /$WWKeywordLinks/Property
hh.dat
windowtype, AdvSearchUI/Keywords, AdvSearchUI/Properties, Bookmarks/v1/Count, Bookmarks/v1/n/Topic, Bookmarks/v1/n/Url
KPD
/#KEY_DATA, /#KEY_DELETED
Seen in HHA.dll or on the internet, but not seen in any ITSF files

#GRPINF (see helpdeco docs by Manfred Winterhoff for a possible function), #INFOTYPES (probably will be output when MS implements information types), #URLS (probably a previous incarnation of the #URLTBL + #URLSTR combination), #BSSC (8 bytes, based on something I saw in KeyTools.exe it might contain the version of RoboHelp used to create the CHM. If you have RoboHelp please check this out & be sure to send in your chm.)

The #SYSTEM file begins with a DWORD, which is a version number. It is 2 in files compiled with "Compatibility=1.0" or 3 in files compiled with "Compatibility=1.1 or later". Other values have not been found. It is followed by #SYSTEM entries to the EOF, which have the following format:

Table 5.1. The format of #SYSTEM entries.

Offset  Type   Comment/Value
0       WORD   code - see below for values & meanings
2       WORD   length of data
4       BYTEs  data
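
The entry layout above can be walked with a few lines of Python. This is only a sketch: it assumes little-endian integers, and `parse_system` and the sample bytes are made up for illustration rather than taken from a real CHM.

```python
import struct

def parse_system(data: bytes):
    """Return (version, [(code, payload), ...]) from raw #SYSTEM bytes."""
    (version,) = struct.unpack_from("<I", data, 0)   # leading version DWORD
    entries = []
    pos = 4
    while pos + 4 <= len(data):
        code, length = struct.unpack_from("<HH", data, pos)  # two WORDs
        pos += 4
        entries.append((code, data[pos:pos + length]))
        pos += length
    return version, entries

# Synthetic sample: version 3, then one code 3 (Title) entry holding "Demo\0".
sample = struct.pack("<I", 3) + struct.pack("<HH", 3, 5) + b"Demo\x00"
version, entries = parse_system(sample)
```

Payloads are returned as raw bytes because their meaning depends on the code.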

The codes appear in the #SYSTEM file in the following order: 10, 9, 4, 2, 3, 16, 6, (5, 0, 1 or 0, 1, 5 - I haven't been able to make files with all three), 7, 11, 12, 13, 14, 8 and lastly 15.

Table 5.2. An explanation for each of the #SYSTEM codes. *Not present in files with "Compatibility=1.0".

Code      Explanation
0         Value of Contents file in the [OPTIONS] section of the HHP file. NT
1         Value of Index file in the [OPTIONS] section of the HHP file. NT
2         Value of Default topic in the [OPTIONS] section of the HHP file. NT
3         Value of Title in the [OPTIONS] section of the HHP file. NT
4         28 (HHA Version 4.72.7294 and earlier) or 36 (HHA Version 4.72.8086 and later) byte structure:

Table 5.3. The format of the code 4 #SYSTEM entry.

Offset  Type   Comment/Value
0       DWORD  LCID from the HHP file.
4       DWORD  One if DBCS is in use.
8       DWORD  One if full-text search is on.
0xC     DWORD  Non-zero if the file has KLinks.
0x10    DWORD  Non-zero if the file has ALinks.
0x14    QWORD  timestamp - Definitely not a straightforward Win32 FILETIME structure. On odd hours it seems to be reduced by a factor of 15, compared to even hours.
0x1C    DWORD  0/1 (unknown) Only dsmsdn.chi from the MSDN has 1 here. Perhaps 1 means it is the root chm of a collection?
0x20    DWORD  0 (unknown)
5         Value of Default Window in the [OPTIONS] section of the HHP file. NT
6         Value of Compiled file in the [OPTIONS] section of the HHP file. This is the lowercase of the stem of the CHM file name. If the name of the CHM is ..\bar\foo\ FOO-Bar . chm jimmy is a poo-bum then this will be " foo-bar ". NT
7*        DWORD present in files with "Binary Index=Yes". The entry in the #URLTBL file that points to the sitemap index has the same first DWORD.
8         Rare. VOICESDK.CHM & CHI and WOSA.CHI from the MSDN have one. The abbreviations and explanations seem to be the same in WOSA.CHI & VOICESDK.CHM, except for 2 mistakes (one in VOICESDK.CHM & one in WOSA.CHI) that seem to be created by bugs in the compiler. Both were compiled by the same version of HHA (4.72.8086), so perhaps this version has some weird bug. Each entry is 16 BYTEs:

Table 5.4. The format of the code 8 #SYSTEM entry.

Offset  Type   Comment/Value
0       DWORD  0, 4 in some (unknown)
4       DWORD  Offset in #STRINGS file. An abbreviation.
8       DWORD  3 where 1st DWORD is 0, 5 where it is 4 (unknown)
0xC     DWORD  Offset in #STRINGS file. An explanation of the abbreviation.
9         The version/program that the CHM was compiled by - shown in the version dialog as "Compiled with %s" where %s is what is in this entry of the #SYSTEM file. If compiled with the MS HTML Help Author dll then it will be something like "HHA Version 4.74.8702". It comes directly from the resource strings of HHA.dll (I saw it there in Unicode and successfully altered it). Beware that the text control in the version dialog that displays it is only so big and in some cases the string won't be displayed, & in other cases only part, depending upon the effect of wrapping, so if you write a compiler, be sure to test it and use a short name and version. Usually NT, but HH won't crash if it isn't.
10        time_t timestamp (DWORD). Not sure of the start year yet.
11*       DWORD present in files with "Binary TOC=Yes". The entry in the #URLTBL file that points to the sitemap contents has the same first DWORD.
12*       Number of information types (DWORD).
13*       The #IDXHDR file contains exactly the same bytes. See below for more info.
14        Rare. The ones I saw were from MS Word 2000. My guess is that it is an MSOffice extension (or maybe not) that overrides the names & window types of the navigation tabs. DWORD number of windows to override, 2 ANSI/UTF-8 NT strings for each window. The first is the text for the tab & the second is probably the name of the window type to use. (eg 2, "&Answer Wizard\0MsoHelpAWDlg\0&Index\0MsoHelpKeyDlg\0") These are from the Custom tab variables of the [OPTIONS] section of the HHP file. The resources from MSOHELP.EXE have a weird .reg file that gives the CLSIDs involved in the provision of these dialogs.
15*       Information type checksum (DWORD). Unknown algorithm & data source.
16        Value of Default Font in [OPTIONS] section of the HHP file. NT
17-65535  Not yet seen. Please let us know if you see these.
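
As an illustration of the code 4 structure, the sketch below unpacks the 28 byte form and picks up the two extra DWORDs when the 36 byte form is present. Little-endian byte order is assumed, the dictionary keys are invented names, and the timestamp is left as a raw integer because its encoding is not understood.

```python
import struct

def parse_code4(payload: bytes) -> dict:
    """Decode the payload of a code 4 #SYSTEM entry (28 or 36 bytes)."""
    lcid, dbcs, fts, klinks, alinks, timestamp = struct.unpack_from("<5IQ", payload, 0)
    info = {
        "lcid": lcid,
        "dbcs": dbcs == 1,
        "full_text_search": fts == 1,
        "has_klinks": klinks != 0,
        "has_alinks": alinks != 0,
        "raw_timestamp": timestamp,   # unknown encoding, kept as-is
    }
    if len(payload) >= 36:            # longer form written by later HHA versions
        info["unknown_1c"], info["unknown_20"] = struct.unpack_from("<II", payload, 0x1C)
    return info

# Synthetic 28 byte payload: LCID 0x409, full-text search on, KLinks present.
code4 = parse_code4(struct.pack("<5IQ", 0x409, 0, 1, 1, 0, 0))
```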

The #IDXHDR file has exactly the same bytes as the code 13 entry in the #SYSTEM file and is 4096 bytes long.

Table 5.5. The format of the #IDXHDR file.

Offset  Type         Comment/Value
0       char[4]      T#SM
4       DWORD        Unknown timestamp/checksum
8       DWORD        1 (unknown)
0xC     DWORD        Number of topic nodes including the contents & index files
0x10    DWORD        0 (unknown)
0x14    DWORD        Offset in the #STRINGS file of the ImageList param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x18    DWORD        0 (unknown)
0x1C    DWORD        1 if the value of the ImageType param of the "text/site properties" object of the sitemap contents is "Folder". 0 otherwise.
0x20    DWORD        The value of the Background param of the "text/site properties" object of the sitemap contents
0x24    DWORD        The value of the Foreground param of the "text/site properties" object of the sitemap contents
0x28    DWORD        Offset in the #STRINGS file of the Font param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x2C    DWORD        The value of the Window Styles param of the "text/site properties" object of the sitemap contents
0x30    DWORD        The value of the ExWindow Styles param of the "text/site properties" object of the sitemap contents
0x34    DWORD        Unknown. Often -1. Sometimes 0.
0x38    DWORD        Offset in the #STRINGS file of the FrameName param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x3C    DWORD        Offset in the #STRINGS file of the WindowName param of the "text/site properties" object of the sitemap contents (0/-1 = none)
0x40    DWORD        Number of information types.
0x44    DWORD        Unknown. Often 1. Also 0, 3.
0x48    DWORD        Number of files in the [MERGE FILES] list.
0x4C    DWORD        Unknown. Often 0. Non-zero mostly in files with some files in the merge files list.
0x50    DWORD[1004]  List of offsets in the #STRINGS file that are the [MERGE FILES] list. Zero terminated, but don't count on it.
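
Since the merge-file list should not be trusted to be zero terminated, one option is to bound it by the count at 0x48 instead. The sketch below does that; it assumes little-endian integers, and `parse_idxhdr` and the synthetic buffer are illustrative only.

```python
import struct

def parse_idxhdr(data: bytes) -> dict:
    """Pull a few known fields out of a 4096 byte #IDXHDR image."""
    if data[:4] != b"T#SM":
        raise ValueError("missing T#SM signature")
    (topic_nodes,) = struct.unpack_from("<I", data, 0x0C)
    (merge_count,) = struct.unpack_from("<I", data, 0x48)
    offsets = struct.unpack_from("<1004I", data, 0x50)
    # Use the stored count rather than relying on the zero terminator.
    merge_offsets = list(offsets[:min(merge_count, 1004)])
    return {"topic_nodes": topic_nodes, "merge_file_offsets": merge_offsets}

# Synthetic header: 7 topic nodes and two [MERGE FILES] string offsets.
buf = bytearray(4096)
buf[0:4] = b"T#SM"
struct.pack_into("<I", buf, 0x0C, 7)
struct.pack_into("<I", buf, 0x48, 2)
struct.pack_into("<2I", buf, 0x50, 10, 20)
idxhdr = parse_idxhdr(bytes(buf))
```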

The #WINDOWS file contains information on the window types in the CHM. It has the following format:

Table 5.6. The format of the #WINDOWS header.

Offset  Type   Comment/Value
0       DWORD  Number of entries in the file
4       DWORD  Size of each of the entries in the file (188 or 196)
8              #WINDOWS entries to the EOF

#WINDOWS entries are basically HH_WINTYPE structures as specified in htmlhelp.h. Note the first DWORD can be used to specify different versions of this structure. Also note that the HHW docs show a different structure to htmlhelp.h. Therefore many CHM files need to be surveyed to find structures with sizes other than 188 or 196. In the description of #WINDOWS entries below, Arg n means that that item is argument n of the window definition in the HHP file, either converted to a DWORD or to an offset in the indicated file:

Table 5.7. The format of each #WINDOWS entry.

Offset  Type      Comment/Value
0       DWORD     Size of the entry (188 in CHMs compiled with "Compatibility=1.0", 196 in CHMs compiled with "Compatibility=1.1 or later")
4       DWORD     0 (unknown) - but htmlhelp.h indicates that this is "BOOL fUniCodeStrings; // IN/OUT: TRUE if all strings are in UNICODE"
8       DWORD     Arg 0. Offset in #STRINGS file.
0xC     DWORD     Which window properties are valid & are to be used for this window. See the table below.
0x10    DWORD     Arg 10.
0x14    DWORD     Arg 1. Offset in #STRINGS file.
0x18    DWORD     Arg 14.
0x1C    DWORD     Arg 15.
0x20    RECT      Arg 13. Order left, top, right & bottom.
0x30    DWORD     Arg 16.
0x34    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHelp; // OUT: window handle"
0x38    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndCaller; // OUT: who called this window"
0x3C    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HH_INFOTYPE* paInfoTypes; // IN: Pointer to an array of Information Types"
0x40    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndToolBar; // OUT: toolbar window in tri-pane window"
0x44    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndNavigation; // OUT: navigation window in tri-pane window"
0x48    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "HWND hwndHTML; // OUT: window displaying HTML in tri-pane window"
0x4C    DWORD     Arg 11.
0x50    BYTE[16]  0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcHTML; // OUT: HTML window coordinates" & the HHW docs say "Specifies the coordinates of the Topic pane."
0x60    DWORD     Arg 2. Offset in #STRINGS file.
0x64    DWORD     Arg 3. Offset in #STRINGS file.
0x68    DWORD     Arg 4. Offset in #STRINGS file.
0x6C    DWORD     Arg 5. Offset in #STRINGS file.
0x70    DWORD     Arg 12.
0x74    DWORD     Arg 17.
0x78    DWORD     Arg 18.
0x7C    DWORD     Arg 19.
0x80    DWORD     Arg 20.
0x84    BYTE[20]  0 (unknown) - but htmlhelp.h indicates that this is "BYTE tabOrder[HH_MAX_TABS + 1]; // IN/OUT: tab order: Contents, Index, Search, History, Favorites, Reserved 1-5, Custom tabs"
0x98    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "int cHistory; // IN/OUT: number of history items to keep (default is 30)"
0x9C    DWORD     Arg 7. Offset in #STRINGS file.
0xA0    DWORD     Arg 9. Offset in #STRINGS file.
0xA4    DWORD     Arg 6. Offset in #STRINGS file.
0xA8    DWORD     Arg 8. Offset in #STRINGS file.
0xAC    BYTE[16]  0 (unknown) - but htmlhelp.h indicates that this is a RECT that is "RECT rcMinSize; // Minimum size for window (ignored in version 1)"
Everything after here is only present in CHMs compiled with "Compatibility=1.1 or later".
0xBC    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "int cbInfoTypes; // size of paInfoTypes;"
0xC0    DWORD     0 (unknown) - but htmlhelp.h indicates that this is "LPCTSTR pszCustomTabs; // multiple zero-terminated strings"
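
A reader that steps through the file by the entry size from the header copes with both the 188 and the 196 byte forms without caring which one it has. The sketch below takes that approach and extracts only a few fields. It assumes little-endian integers and signed RECT members, and the function name and dictionary keys are invented for the example.

```python
import struct

def parse_windows(data: bytes) -> list:
    """Split a #WINDOWS image into entries and decode a few fields of each."""
    count, entry_size = struct.unpack_from("<II", data, 0)
    windows = []
    for i in range(count):
        base = 8 + i * entry_size
        windows.append({
            "size": struct.unpack_from("<I", data, base)[0],
            "arg0_strings_offset": struct.unpack_from("<I", data, base + 0x08)[0],
            "arg1_strings_offset": struct.unpack_from("<I", data, base + 0x14)[0],
            # Arg 13: left, top, right, bottom
            "rect": struct.unpack_from("<4i", data, base + 0x20),
        })
    return windows

# Synthetic file holding a single 188 byte entry.
entry = bytearray(188)
struct.pack_into("<I", entry, 0x00, 188)
struct.pack_into("<I", entry, 0x08, 42)
struct.pack_into("<4i", entry, 0x20, 10, 20, 300, 400)
windows = parse_windows(struct.pack("<II", 1, 188) + bytes(entry))
```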

The #TOCIDX file is present in files with a non-empty contents file, "Binary TOC=Yes" and "Compatibility=1.1 or later".

This file is made up of 0x1000 byte blocks, but this is only apparent because of extra bytes interrupting what would otherwise be a stream of 20/28 byte structs. If the other parts (DWORDS & 16 byte structs) didn't fit into these blocks then presumably this would show up in the other parts too.

The first block is the header:

Table 5.9. The format of the #TOCIDX header.

Offset  Type        Comment/Value
0       DWORD       4096/header length/offset of no. 1 below
4       DWORD       offset of no. 3 below
8       DWORD       number of no. 3 below
0xC     DWORD       offset of no. 2 below
0x10    BYTE[4080]  0 (unknown)

The header is followed by the following different types of structs in the specified order:

  1. 20/28 byte structs (pages/books)

  2. list of dwords into #TOPICS file

  3. 16 byte structs - links above stuff

First all the top level books/pages, then the next level, then the next & so on

An index into the #TOPICS file can be converted to an offset in the #URLTBL file, without reading this file, using the following formula (integer division): offset = (index%341)*12 + (index/341)*4096
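
Written out in Python, with `//` standing in for the integer division, the formula looks like this. The function name is made up for the example.

```python
ENTRIES_PER_BLOCK = 341   # 12 byte entries in each 4096 byte #URLTBL block

def topic_index_to_urltbl_offset(index: int) -> int:
    """Map a #TOPICS index to the offset of its #URLTBL entry."""
    return (index % ENTRIES_PER_BLOCK) * 12 + (index // ENTRIES_PER_BLOCK) * 4096

# Index 341 is the first entry of the second block.
second_block_start = topic_index_to_urltbl_offset(341)
```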

This file contains information on the topics present.

Each entry has the following format.

Table 5.12. The format of the #TOPICS file entries.

Offset  Type   Comment/Value
0       DWORD  Offset into the tree in the #TOCIDX file.
4       DWORD  Offset in #STRINGS file of the contents of the title tag or the Name param of the file in question. -1 = no title.
8       DWORD  Offset in #URLTBL of entry containing offset to #URLSTR entry containing the URL.
0xC     WORD   2 indicates not in contents, 6 indicates that it is in the contents, 0/4 something else (unknown)
0xE     WORD   0, 2, 4, 8, 10, 12, 16, 32 (unknown)
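
Each #TOPICS entry is a fixed 16 bytes, so the file can be read as a flat array. The sketch below assumes little-endian integers and reads the title offset as signed so that the "no title" value shows up as -1. The class and field names are invented.

```python
import struct
from typing import NamedTuple

class TopicEntry(NamedTuple):
    tocidx_offset: int
    title_offset: int     # offset in #STRINGS, -1 = no title
    urltbl_offset: int
    contents_flag: int    # 2 = not in contents, 6 = in contents
    unknown: int

def parse_topics(data: bytes) -> list:
    """Decode every complete 16 byte entry in a #TOPICS image."""
    usable = len(data) - len(data) % 16
    return [TopicEntry(*fields) for fields in struct.iter_unpack("<IiIHH", data[:usable])]

# Synthetic entry: untitled topic that is listed in the contents.
topics = parse_topics(struct.pack("<IiIHH", 0, -1, 12, 6, 0))
```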

The #URLSTR file is made up of 0x4000 byte blocks. If the last block is not filled then it will be smaller than 0x4000 bytes. The free space at the end of the blocks is filled with NUL bytes. The blocks contain the following elements one after another:

An unknown BYTE. So far this has been 0, 0x42 and in spechsdk.chi it was 0x49. Does not indicate presence/absence of URL/FrameName strings.

This is followed by pairs of URL, FrameName strings (both NT) from the HHC.

Then come all the Local strings from the HHC:

Table 5.13. The format of the #URLSTR entries.

Offset  Type   Comment/Value
0       DWORD  Offset of the URL for this topic.
4       DWORD  Offset of the FrameName for this topic.
8              ANSI/UTF-8 NT string that is the Local for this topic.

There is one way to tell where the end of the URL/FrameName pairs occurs: Repeat the following: read 2 DWORDs and if both are less than the current offset then this is the start of the Local strings else skip two NT strings.

An offset in the #URLTBL file can be converted to an index into the #TOPICS file, without reading this file, using the following formula (integer division): index = ((offset%4096) + (offset/4096)*(4096-4))/12
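
The sketch below reads the multiplier in that formula as (4096-4), i.e. 4092 usable bytes per block. That is the reading under which it undoes the index-to-offset formula given for #TOPICS. The function name is illustrative.

```python
def urltbl_offset_to_topic_index(offset: int) -> int:
    """Map the offset of a #URLTBL entry to its #TOPICS index."""
    block, within = divmod(offset, 4096)
    # Each block carries 4096 - 4 = 4092 bytes of entries (341 * 12).
    return (within + block * (4096 - 4)) // 12

first_in_second_block = urltbl_offset_to_topic_index(4096)
```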

Each 0x1000 byte block has the following format.

Table 5.14. The format of the #URLTBL blocks.

Offset  Type            Comment/Value
0       DWORD[3][341]   341 entries. 12 bytes each.
0xFFC   DWORD           4096 (unknown) possibly the length of the block? That MS would pull this kind of shit is really annoying; they should have just put all the entries one after another, not stuffed in an arbitrary DWORD after every 4092 bytes. From this and other blockness I guess they are optimizing for the Wintel platform.

Each entry has the following format.

Table 5.15. The format of the #URLTBL block entries.

Offset  Type   Comment/Value
0       DWORD  Unknown. I suspect that this is either some kind of unique ID or two WORDs.
4       DWORD  Index of entry in #TOPICS file.
8       DWORD  Offset in #URLSTR file of entry containing filename.
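
To read the entries in order while stepping over the DWORD at the end of each block, something along these lines should work. It is a sketch that assumes little-endian integers and tolerates a short final block, which is a guess about how real files end.

```python
import struct

def parse_urltbl(data: bytes) -> list:
    """Return (unknown, topics_index, urlstr_offset) for every #URLTBL entry."""
    entries = []
    for block_start in range(0, len(data), 4096):
        block = data[block_start:block_start + 4096]
        usable = min(len(block), 341 * 12)        # ignore the DWORD at 0xFFC
        for pos in range(0, usable - 11, 12):
            entries.append(struct.unpack_from("<III", block, pos))
    return entries

# Synthetic file with two entries in a single, partly filled block.
urltbl = parse_urltbl(struct.pack("<III", 0xAAAA, 0, 5) + struct.pack("<III", 0xBBBB, 1, 40))
```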

The #IVB file is basically the [ALIAS] section of the HHP file.

Table 5.16. The format of the #IVB file.

Offset  Type   Comment/Value
0       DWORD  Size of the file minus 4 (num entries = (filelen-4)/8)
4              #IVB entries to the EOF

#IVB entries have the following format.

Table 5.17. The format of the #IVB entries.

Offset  Type   Comment/Value
0       DWORD  The value of the alias
4       DWORD  Offset in #STRINGS file of the file to show
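
A minimal reader for this layout, assuming little-endian integers, could look like the following. `parse_ivb` is a name chosen for the example.

```python
import struct

def parse_ivb(data: bytes) -> list:
    """Return (alias_value, strings_offset) pairs from a #IVB image."""
    (size,) = struct.unpack_from("<I", data, 0)   # file length minus 4
    body = data[4:4 + size]
    body = body[:len(body) - len(body) % 8]       # keep whole 8 byte entries only
    return list(struct.iter_unpack("<II", body))

# Synthetic file with two aliases.
aliases = parse_ivb(struct.pack("<I", 16) + struct.pack("<IIII", 1000, 0, 1001, 25))
```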

The #SUBSETS file is present when the [SUBSETS] section is present in the HHP file.

Table 5.18. The format of the #SUBSETS header.

Offset  Type  Comment/Value
0       WORD  0 (unknown)
2       WORD  Number of bytes taken up by the subset entries.

The subset entries currently seem to be garbage left over from previous usage of the same memory locations. Based on the number of bytes per non-whitespace line in the [SUBSETS] section each subset entry is 12 BYTEs in length.

The majority of this description was contributed and/or corrected by Jed Wing.

The $FIftiMain file is empty when "Full-text search=No" or when no HTML files have been indexed. It holds the full-text search information. If you have a word longer than 99 characters in an HTML file then it seems the indexing routines will die during indexing of that file and then skip on to the next one. All word sorting, processing and storage is done case-insensitively and is not case-preserving. Note that files without ".h" in their names will not contribute keywords to this fast-search index. The function of this file seems to be to store the locations of the words found in the HTML files, so the search code can quickly find where those words occur.

This file is yet another tree, more similar to the ITSP directory than the BTree file.

This file makes use of 2 ways of encoding integers in variable length fields: the so called scale and root method and a variant of the ENCINT method used in the PMGL/PMGI directory chunks. For the ENCINTs in this file the bytes are stored least significant first (little endian), whereas in the PMGL/PMGI chunks they are stored most significant first (big endian).
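
The two byte orders can be compared side by side. The sketch below assumes the usual ENCINT layout, which is not restated in this section: seven value bits per byte, with the high bit set on every byte except the last. Both function names are made up.

```python
def decode_encint_le(data: bytes, pos: int = 0):
    """Least significant group first, as assumed for $FIftiMain."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # high bit clear marks the last byte
            return value, pos

def decode_encint_be(data: bytes, pos: int = 0):
    """Most significant group first, as assumed for PMGL/PMGI chunks."""
    value = 0
    while True:
        byte = data[pos]
        pos += 1
        value = (value << 7) | (byte & 0x7F)
        if not byte & 0x80:
            return value, pos

# The same two bytes give different values under the two conventions.
le_value, _ = decode_encint_le(b"\x82\x01")
be_value, _ = decode_encint_be(b"\x82\x01")
```

Both functions return the decoded value together with the position of the next unread byte.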

The scale and root method needs two parameters, which I'll call s (scale) and r (root size). In the context of $FIftiMain files, s always appears to be '2', but any other power of 2 could also work (and might be used in some rare cases). The encoding is as follows:

The integer is encoded as two parts, p (prefix) and q (actual bits). p determines how many bits are stored, as well as implicitly determining the high-order bit of the integer. To encode an integer, p starts out as a single 0. If the integer fits in r bits, you're done. If the integer fits in r+1 bits (i.e. r-th bit is set, counting from 0), prepend a 1 to the p and store the low r bits of the integer in q. Otherwise, while the integer does not fit in the allotted space, prepend a bit to p, and increase the size of q by one bit. It's hard to see from the description, but an example will make it more clear. Using s=2, r=3:

value   p  q
0:      0 000
1:      0 001
2:      0 010
..
7:      0 111
8:     10 000
9:     10 001
10:    10 010
..
15:    10 111
16:   110 0000
17:   110 0001
18:   110 0010
..
30:   110 1110
31:   110 1111
32:  1110 00000
33:  1110 00001
34:  1110 00010
..
62:  1110 11110
63:  1110 11111
64: 11110 000000
and so on.
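
The table above can be reproduced mechanically. The sketch below handles only s=2 and returns p and q as bit strings so that they can be compared with the table directly. `sr_encode` is a name invented for the example and it assumes r is at least 1.

```python
def sr_encode(value: int, r: int):
    """Scale and root encoding for s=2: return (p, q) as bit strings."""
    if value < (1 << r):
        return "0", format(value, f"0{r}b")      # fits in r bits
    top = value.bit_length() - 1                 # position of the highest set bit
    p = "1" * (top - r + 1) + "0"                # one extra 1 per extra bit
    q = format(value - (1 << top), f"0{top}b")   # highest bit is implied by p
    return p, q

p, q = sr_encode(17, 3)
```

With s=2 and r=3 this gives p = "110" and q = "0001" for the value 17, matching the row in the table.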

A scale other than 2 has never been seen, so it is hard to say how s/r encoding works when s=4, etc. The following is how it might work using s=4, r=2:

value  p (base 2) q (base 4)
0:             0 00
1:             0 01
..
14:            0 32
15:            0 33
16:           10 00
17:           10 01
..
30:           10 32
31:           10 33
32:          110 000
33:          110 001
..

and so on (i.e. a base-4 digit is added each time, meaning two bits are added each time). In binary that looks like:

value         p   q
0:             0 0000
1:             0 0001
..
14:            0 1110
15:            0 1111
16:           10 0000
17:           10 0001
..
30:           10 1110
31:           10 1111
32:          110 000000
33:          110 000001
..

Of course, this is all wild speculation, since examples with s other than 2 haven't been seen... But the codes do work this way (i.e. prepending a 1 to the prefix multiplies the additive value 'b' by s and adds another log2(s) bits.)

The file begins with a header that is 0x400 bytes in length.

Table 5.19. The format of the $FIftiMain header.

Offset  Type       Comment/Value
0       BYTE[4]    0x00 0x00 0x28 0x00 (unknown)
4       DWORD      Number of HTML files indexed after any automatic splitting.
8       DWORD      Offset to the last word tree block (4096 less than the file length)
0xC     DWORD      0 (unknown)
0x10    DWORD      The number of "leaf node" pages in the file.
0x14    DWORD      Offset to the last word tree block (4096 less than the file length)
0x18    WORD       Depth of the tree of blocks (i.e. 1 if only leaf nodes, 2 if there is a non-leaf node page to index among the leaf nodes, 3 if there are 2 levels of index node chunks, but could theoretically be even deeper).
0x1A    DWORD      7 (unknown)
0x1E    BYTE       Scale for encoding of "document index" in Word Location Code (WLC) entries
0x1F    BYTE       Root size for encoding of "document index" in WLC entries
0x20    BYTE       Scale for encoding of "code count" in WLC entries
0x21    BYTE       Root size for encoding of "code count" in WLC entries
0x22    BYTE       Scale for encoding of "location codes" in WLC entries
0x23    BYTE       Root size for encoding of "location codes" in WLC entries
0x24    BYTE[10]   0 (unknown)
0x2E    DWORD      Length of the word tree blocks (4096).
0x32    DWORD      0/1 (unknown)
0x36    DWORD      Word index of the last duplicate.
0x3A    DWORD      Character index of the last duplicate. From the first character of the first word. The whitespace after tags is not included. Entities such as &amp; are counted as one character. Line endings are not counted in this.
0x3E    DWORD      Length of the longest word in the list not including NT (maximum of 99).
0x42    DWORD      Number of words including duplicates.
0x46    DWORD      Number of words not including duplicates.
0x4A    DWORD      The total length of all the words including duplicates is this DWORD plus the next one. It is unknown how the split is performed.
0x4E    DWORD      This one is usually smaller than the previous one.
0x52    DWORD      Total length of all the words not including duplicates.
0x56    DWORD      Length of unused/null bytes at the end of the word block (if only 1 block, more than total if > 1 block - possibly some free space in WLC blocks).
0x5A    DWORD      0 (unknown)
0x5E    DWORD      One less than the number of HTML files indexed (not entirely sure)
0x62    BYTE[24]   0 (unknown)
0x7A    DWORD      Windows code page identifier (usually 1252 - Windows 3.1 US (ANSI))
0x7E    DWORD      LCID from the HHP file.
0x82    BYTE[894]  0 (unknown)

The header is followed by pairs of variable size WLC (scale and root encoded) blocks and leaf node chunks (in that order).

Each WLC entry is made up of bit fields packed as tightly as possible. Each entry, however, is right-padded with 0s to a full byte. The fields are encoded as the scale and root variable-length integer format described above, with the parameters taken from the initial header. "Delta coding" is also used in a couple of places to reduce the size of the codes -- that is, the first value is stored verbatim, and subsequent values are stored as a delta or difference from the previous value.

The leaf and index node chunks are 4096 bytes in length. They begin with a header followed by entries.

This is followed by leaf node entries:

This is all fairly complex, so an example will be extremely useful here. This example is taken from a copy of windows.chm, the system documentation apparently distributed with some version of Windows 98:

Hex dump of two leaf node entries:

000223d:                                        02 00 31  ...0...........1
0002240: 00 0a 03 04 00 00 00 00 1d 01 01 01 01 20 04 00  ............. ..
0002250: 00 00 00 03

The fields of these two entries are as follows:

Scanning over to offset 0x403 in the file, we see:

0000403:          f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0
0000410: f8 d7 28 2c f0 f6 dc c8 ce 66 61 80 87 02 00 00
0000420: f9 f4 40

Broken out, these WLC entries are:

1          <10, 1027, 29>:  f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0 f8 d7 28
                            2c f0 f6 dc c8 ce 66 61 80 87 02 00 00
1 (TITLE)  <1, 1056, 3>:    f9 f4 40

Now, the parameters for the WLC in this file are 2/2, 2/1, 2/5. Here is a quick reference table for the codes:

p       value       q (bits)
2/1:
0:      0-1         1
10:     2-3         1
110:    4-7         2
1110:   8-15        3
11110:  16-31       4
111110: 32-63       5
2/2:
0:      0-3         2
10:     4-7         2
110:    8-15        3
1110:   16-31       4
11110:  32-63       5
111110: 64-127      6
2/5:
0:      0-31        5
10:     32-63       5
110:    64-127      6
1110:   128-255     7
11110:  256-511     8
111110: 512-1023    9

Let's start with the short one, since it's very simple:

f9 f4 40 => 1111 1001 1111 0100 0100 0000
2/2 Document index: 111110 011111 => 64 + 31 => Document no. 95
2/1 Code count:     0 1           => 1
2/5 Location codes: 0 00100       => 4       => Word no. 4
    padding:        0000

So, in document no. 95, word no. 4 is a '1' which is in the title. Now, the ordering of the documents is provided by the #URLTBL and #URLSTR files. Looking up document no. 95 in there (0-based indexing!), we see the file is internet_account.htm, in which the first non-markup text is:

<title>Dial-Up Networking: Step 1</title>
0: dial
1: up
2: networking
3: step
4: 1

Now, the next one is a little more complicated. I won't go over it in as much detail, but I'll just break it out quickly. It contains 10 entries:

(111110 011111) ( 0 1) (   0 00110  )             0000
(    10 00    ) ( 0 1) (  10 10111  )             000
(  1110 1010  ) ( 0 1) (  10 10100  )             0000000
(  1110 1101  ) ( 0 1) (1110 0000000)             000
(     0 01    ) ( 0 1) (  10 11100  )             0000
(111110 001101) ( 0 1) ( 110 010100 )             0
(     0 01    ) ( 0 1) (  10 01111  )             0000
( 11110 11011 ) ( 0 1) ( 110 011001 )             000
(   110 011   ) (10 0) ( 110 011001 ) ( 10 00011) 0000000
(    10 00    ) ( 0 1) ( 110 000001 )             0
                                                  00000000 00000000

Parsing those entries, we get:

Picking one at random, say, Document no. 303 with 2 hits, we open up windows_netsetup_netwin.htm, from which I've generated a wordlist containing all of the words in order:

  0: To(TITLE)
  1: set(TITLE)
  2: up(TITLE)
  ..
  ..
 86: client
 87: follow
 88: steps
 89: 1
 90: 3
 ..
122: follow
123: steps
124: 1
125: 3
126: and
 ..

And we can see the word '1' shows up in precisely the 89th and 124th spots.
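
The whole worked example can be replayed in code. The sketch below decodes the 29 WLC bytes shown above using the 2/2, 2/1, 2/5 parameters. It assumes, as the breakdown suggests, that document indices are delta coded from one entry to the next, that location codes are delta coded within an entry, and that bits are consumed most significant first. All class and function names are illustrative, and only s=2 is handled.

```python
class BitReader:
    """Reads bits from bytes, most significant bit of each byte first."""
    def __init__(self, data: bytes):
        self.data = data
        self.pos = 0                       # position in bits

    def read(self, n: int) -> int:
        value = 0
        for _ in range(n):
            byte = self.data[self.pos >> 3]
            value = (value << 1) | ((byte >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return value

    def align(self) -> None:
        self.pos = (self.pos + 7) & ~7     # skip padding up to the next byte

def read_sr(bits: BitReader, r: int) -> int:
    """Read one scale and root value with s=2 and root size r."""
    ones = 0
    while bits.read(1):                    # prefix p: count 1s up to the 0
        ones += 1
    n = r + max(ones - 1, 0)               # number of bits in q
    value = bits.read(n)
    return value | (1 << n) if ones else value

def decode_wlc(data: bytes, entry_count: int, roots=(2, 1, 5)) -> list:
    bits = BitReader(data)
    doc, result = 0, []
    for _ in range(entry_count):
        doc += read_sr(bits, roots[0])     # document index (delta)
        count = read_sr(bits, roots[1])    # number of location codes
        loc, locations = 0, []
        for _ in range(count):
            loc += read_sr(bits, roots[2]) # location codes (delta)
            locations.append(loc)
        bits.align()
        result.append((doc, locations))
    return result

block = bytes.fromhex(
    "f9 f4 60 86 b8 ea 6a 00 ed 78 00 2d c0 f8 d7 28"
    "2c f0 f6 dc c8 ce 66 61 80 87 02 00 00"
)
hits = decode_wlc(block, 10)
print(hits[8])   # (303, [89, 124])
```

The ninth entry comes out as document no. 303 with hits at words 89 and 124, which agrees with the word list above. The short title entry f9 f4 40 decodes the same way to document no. 95, word no. 4.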

After the WLC blocks and the leaf node chunks comes the index node chunk (for a depth of 2). For higher tree depths the index node blocks are interspersed with the listing node blocks, similarly to how the PMGL/PMGI chunks are laid out in the directory of the ITSF format. The method of splitting used is likely the same space filling method used in the directory. The index node header is just a WORD indicating the length of free space at the end of the current index node chunk.

Words in the node chunks are made up of the following characters stored as is: 0x01 (buggy), 0-9, a-z, _, 0xDE, 0xFE. Bytes are converted and stored as per the table below. Character entity references of the form &#9660; are truncated to BYTEs and translated as per the table below. Character entity references of the form &amp; are treated as whitespace, except for the Latin characters, which are converted as per the table below.

These conversions may depend on the system codepage, character set, font and language set in the HHP file (I'm just guessing here).

There are a few bugs:

An 0x01 in a word causes the first whitespace character at the end of the word to be included in the word and if the next character is non-whitespace the word is joined to the next word. If the word begins with 0-9 then the word is terminated before the 0x01 and a new word begins at the 0x01. This bug affects the fields in the initial header. For example: "abcd0x1efghi-foobar" is converted to "abcd0x1efghi-foobar". "abcd0x1efghi- foobar" is converted to "abcd0x1efghi-" and "foobar". "0bcd0x1efghi-foobar" is converted to "0bcd" and "0x1efghi-foobar". "0bcd0x1efghi- foobar" is converted to "0bcd", "0x1efghi-" and "foobar".

Weird bug where if the word is 16 characters in length then the word is doubled plus the first 7 chars in length.

There is a weird feature that if a word starts with 0-9 then it may contain multiple periods (0x2E = '.') or commas (0x2C = ',') embedded in the word before the non-period, non-comma word terminating character. I think this feature is so that the user can search for version numbers or numbers with a decimal point or thousands separator in them. Note that commas are removed from the word, while periods are not. For example "v1.1.234.5......,6" will become "v1" and "1.234.5......6".

Weird bug involving words ending in single quote (') being forgotten when the same word is also normal and also ending in a period (.).

There are probably many more hidden bugs and features in the word converter (I think it's the ITIR.StdWordBreaker class in ITIRCL.DLL).

From the name and the number of GUIDs present I guess the $OBJINST file has something to do with ActiveX objects. Seems like it can be deleted without major consequence.

Table 5.27. The format of the $OBJINST header.

Offset  Type   Comment/Value
0       DWORD  0x04000000 (unknown)
4       DWORD  Number of entries

This is followed by a listing, and each listing entry is as follows:

Table 5.28. The format of the $OBJINST listing entries.

Offset  Type   Comment/Value
0       DWORD  Offset of the entry in this file
4       DWORD  Length of the entry

The listing is followed by the entries one after another at offsets specified in the listing.
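
Pulling the raw entries out of the file only needs the header and the listing. The sketch below assumes little-endian integers and returns each entry as undecoded bytes, since the entry contents are only partly understood. `parse_objinst` and the sample data are made up.

```python
import struct

def parse_objinst(data: bytes):
    """Return (first_dword, [entry_bytes, ...]) from a $OBJINST image."""
    first_dword, count = struct.unpack_from("<II", data, 0)
    entries = []
    for i in range(count):
        offset, length = struct.unpack_from("<II", data, 8 + 8 * i)
        entries.append(data[offset:offset + length])
    return first_dword, entries

# Synthetic file: two entries of 3 and 2 bytes placed right after the listing.
sample = (struct.pack("<II", 0, 2)
          + struct.pack("<II", 24, 3) + struct.pack("<II", 27, 2)
          + b"abcde")
_, objinst_entries = parse_objinst(sample)
```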

There are 2 known types of entries. The first seems to be made up of up to 3 different sub entries. The second is a 36 BYTE structure.

I haven't been able to find any files without the data for bits 0 & 1 so I can't really say exactly how big the header is and which bytes are part of the bit 0 block and which are part of the bit 1 block. Together, though, bits 0 & 1 account for a large bulk of repeatedly increasing byte blocks of 10 bytes each, plus something else at the end. I suspect that the repeats are for bit 0 and the stuff at the end is bit 1. As to the function of these two bits blocks, well there are no GUIDs and no other clues, so who knows.

Table 5.30. bit 2. Only present when "Full text search stop list file" has been specified in the HHP.

Offset  Type      Comment/Value
0       char[4]   ""(\0
4       DWORD     Length in bytes of the entries not including the last zero word.
8       BYTE[32]  0 (unknown)
0x28              Entries. The last entry has a zero length word.

The files in the $WWAssociativeLinks and $WWKeywordLinks directories have the same formats. The maximum total length (including parents) of an entry in one of these files is 488 characters (including NT). HHW complains about and refuses to output any that are greater than this length.

The $WWKeywordLinks dir specifies the contents of the Index navigation pane & the $WWAssociativeLinks dir specifies the Alinks.

In CHW files this is named BTREE and in CHI/CHM files it is named BTree.

This file has a 76 byte header, then 2048 byte blocks. First come all the listing blocks, then all the index blocks. This file is similar to the directory entries in the ITSF format, except that the index blocks are at the end instead of interspersed with the listing blocks. All block indices below are zero based. This file forms a tree, with the last (index mostly) block being the root of the tree. If there is more than one level of index blocks then the root block will have two children; the first in the block header and the second in the entry. WARNING: just as in the ITSF directory there can be garbage in the free space, so respect that first WORD and use it. I'm not yet sure how the listing blocks are split up, though it is probably the same as the ITSF directory (space filling).

The $HHTitleMap file begins with a WORD indicating the number of entries.

Each entry has the following format:

Table 5.39. The format of the $HHTitleMap entries.

Offset  Type   Comment/Value
0       WORD   Length of the file stem.
2       BYTEs  File stem. ANSI/UTF-8 string. Not NT.
+0      DWORD  Unknown.
+4      DWORD  Unknown. Same value as previous DWORD.
+8      DWORD  LCID of the specified file.
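
Because the stem is length-prefixed rather than null-terminated, the entries have to be walked one at a time. A sketch, assuming little-endian integers and with invented names:

```python
import struct

def parse_hhtitlemap(data: bytes) -> list:
    """Return (stem, unknown_a, unknown_b, lcid) for each $HHTitleMap entry."""
    (count,) = struct.unpack_from("<H", data, 0)
    pos = 2
    entries = []
    for _ in range(count):
        (stem_len,) = struct.unpack_from("<H", data, pos)
        pos += 2
        stem = data[pos:pos + stem_len]            # not null-terminated
        pos += stem_len
        unknown_a, unknown_b, lcid = struct.unpack_from("<III", data, pos)
        pos += 12
        entries.append((stem, unknown_a, unknown_b, lcid))
    return entries

# Synthetic file with one entry for a title called "foo".
sample = struct.pack("<H", 1) + struct.pack("<H", 3) + b"foo" + struct.pack("<III", 7, 7, 0x409)
title_map = parse_hhtitlemap(sample)
```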

The $TitleMap file begins with a WORD indicating the number of entries.

Each entry is 68 BYTEs in length and has the following format:

Table 5.40. The format of the $TitleMap entries.

Offset  Type      Comment/Value
0       BYTE[25]  File stem. ANSI/UTF-8 NT fixed length string.
0x19    BYTE[25]  Unknown. Seems to be RAM litter, but contains paths, file names, zero bytes, DWORDs and mixtures.
0x32    WORD      An index number that begins at 1 and is incremented by 1 for each entry.
0x34    DWORD     Unknown.
0x38    DWORD     Unknown. Same value as previous DWORD.
0x3C    DWORD     LCID of the specified file.
0x40    DWORD     Number of topic nodes including the contents & index files in the specified file.
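
The fixed 68 byte entries map directly onto a single struct format. The sketch below assumes little-endian integers and no padding between fields. The names are made up, and the second 25 byte field is skipped because it appears to be leftover memory.

```python
import struct

# 25 + 25 + 2 + 4 * 4 = 68 bytes, packed without alignment padding.
TITLEMAP_ENTRY = struct.Struct("<25s25sHIIII")

def parse_titlemap(data: bytes) -> list:
    """Decode the entries of a $TitleMap image into dictionaries."""
    (count,) = struct.unpack_from("<H", data, 0)
    entries = []
    for i in range(count):
        stem, _litter, index, _unk_a, _unk_b, lcid, topic_nodes = \
            TITLEMAP_ENTRY.unpack_from(data, 2 + i * TITLEMAP_ENTRY.size)
        entries.append({
            "stem": stem.split(b"\x00", 1)[0],   # cut at the null terminator
            "index": index,
            "lcid": lcid,
            "topic_nodes": topic_nodes,
        })
    return entries

# Synthetic file with one entry.
sample = struct.pack("<H", 1) + TITLEMAP_ENTRY.pack(b"foo", b"", 1, 0, 0, 0x409, 12)
titles = parse_titlemap(sample)
```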