With each plugin/master file that contains an lstring (lookup string) datatype, there is an accompanying set of string tables in Data\Strings. The naming convention appears to be the plugin/master filename (without its extension), then an underscore, then the language. For example, the English Skyrim.esm has 'Skyrim_English' as the base filename. There are 3 files with different extensions (DLSTRINGS, ILSTRINGS, STRINGS). DLSTRINGS appears to contain journal/book entries, ILSTRINGS subtitled conversations, and STRINGS general strings such as item names. Apart from STRINGS having a slightly different string data format, the three file types share the same format.
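The naming convention described above can be sketched as follows. This is only an illustration of the apparent pattern: the function name `string_table_paths` and the `data_dir` parameter are hypothetical, and the pattern itself is the one the text says "appears" to hold.

```python
from pathlib import PureWindowsPath

# The three extensions named in the text above.
STRING_TABLE_EXTENSIONS = ("DLSTRINGS", "ILSTRINGS", "STRINGS")


def string_table_paths(plugin_filename, language, data_dir="Data"):
    """Return the three expected string table paths for a plugin/master.

    Hypothetical helper: assumes <base>_<language>.<EXT> inside Data\\Strings.
    """
    base = PureWindowsPath(plugin_filename).stem  # "Skyrim.esm" -> "Skyrim"
    strings_dir = PureWindowsPath(data_dir) / "Strings"
    return [
        strings_dir / f"{base}_{language}.{ext}"
        for ext in STRING_TABLE_EXTENSIONS
    ]


paths = string_table_paths("Skyrim.esm", "English")
```

For the English Skyrim.esm this yields the three paths under Data\Strings that share the 'Skyrim_English' base filename.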
The string files are simple, uncompressed data. The layout begins with an 8-byte header containing the number of strings and the total size of the string data, which sits at the end of the file. The header is followed by a series of 8-byte structs, each consisting of a string ID for reference and an offset to the string relative to the beginning of the string data.
The string data itself has 2 formats that differ only slightly. The .STRINGS file has simple null-terminated (C-style) strings. The .ILSTRINGS and .DLSTRINGS files also have null-terminated strings, but each string is additionally preceded by a uint32 that declares its length.
|count||uint32||Number of entries in the string table.|
|dataSize||uint32||Size of string data that follows after header and directory.|
|directory||Directory Entry[count]||See specification of Directory Entry below.|
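As a minimal sketch, the header can be read with Python's `struct` module. Little-endian byte order is an assumption here (the text above does not state it), and `read_header` is a hypothetical helper name. The start of the string data is derived from the header size plus `count` directory entries of 8 bytes each.

```python
import struct

HEADER = struct.Struct("<II")  # count, dataSize (little-endian is assumed)
ENTRY_SIZE = 8                 # each directory entry is two uint32 values


def read_header(buf):
    """Return (count, data_size, data_start) for a string table buffer."""
    count, data_size = HEADER.unpack_from(buf, 0)
    # String data begins right after the header and the directory.
    data_start = HEADER.size + count * ENTRY_SIZE
    if data_start + data_size > len(buf):
        raise ValueError("buffer is shorter than the header claims")
    return count, data_size, data_start


# Tiny hand-built table: 1 directory entry, 4 bytes of string data.
sample = struct.pack("<II", 1, 4) + struct.pack("<II", 7, 0) + b"abc\x00"
count, data_size, data_start = read_header(sample)
```

The size check is a defensive choice of this sketch, not something the format description requires.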
Directory entries are simple 8-byte structs that consist of two uint32 values: the first is the ID used by mod files to refer to the string, and the second is the offset from the beginning of the string data to the string itself. These entries are not required to be sequential. Additionally, while the ID is unique, the offset is not (e.g. two different IDs can point to the same string).
|id||uint32||ID used by mod files to refer to the string.|
|offset||uint32||Offset (relative to beginning of data) to the string. These entries are not required to be sequential. See String Data below.|
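The directory can be loaded into a mapping from string ID to offset, as in the sketch below. Little-endian byte order and the helper name `read_directory` are assumptions. Because offsets are not unique, two IDs may map to the same value, which the sample data demonstrates.

```python
import struct

ENTRY = struct.Struct("<II")  # string ID, offset (little-endian is assumed)


def read_directory(buf, count, start=8):
    """Map each string ID to its offset within the string data.

    `start` is the byte position of the first entry, i.e. just past the
    8-byte header.
    """
    directory = {}
    for i in range(count):
        string_id, offset = ENTRY.unpack_from(buf, start + i * ENTRY.size)
        directory[string_id] = offset
    return directory


# Two IDs that share one string: both entries carry offset 0.
sample = (
    struct.pack("<II", 2, 4)            # header: count=2, dataSize=4
    + struct.pack("<IIII", 10, 0, 20, 0)  # entries: (10, 0) and (20, 0)
    + b"abc\x00"                        # string data
)
directory = read_directory(sample, 2)
```

A dictionary suits this data because the IDs are unique and entries need not be in any particular order.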
There are 2 slightly different types of string data, depending on the file extension.
In .STRINGS files, each entry is a null-terminated C-style string.
|data||zstring||Null-terminated string data.|
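Reading a null-terminated string amounts to scanning forward from the entry's offset until the first zero byte. In this sketch, `read_zstring` is a hypothetical helper and `data_start` is the position where the string data begins (after the header and directory). The raw bytes are returned undecoded, since the encoding depends on the localisation.

```python
def read_zstring(buf, data_start, offset):
    """Return the bytes of a null-terminated string, without the terminator."""
    begin = data_start + offset
    # bytes.index raises ValueError if no terminator is found.
    end = buf.index(b"\x00", begin)
    return buf[begin:end]


# Four bytes of padding stand in for the header and directory.
sample = b"XXXX" + b"Iron Sword\x00" + b"Gold\x00"
first = read_zstring(sample, 4, 0)
second = read_zstring(sample, 4, 11)  # "Iron Sword" plus terminator is 11 bytes
```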
In .DLSTRINGS and .ILSTRINGS files, each entry is also a null-terminated C-style string, but it is preceded by a uint32 that specifies its length. The length includes the null terminator.
|length||uint32||Length of following string, including null-terminator.|
|data||char[length]||Null-terminated string data.|
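The length-prefixed variant can be read without scanning for the terminator, as sketched below. Little-endian byte order and the helper name `read_lstring` are assumptions. Since the stated length includes the null terminator, the sketch strips one trailing zero byte if present.

```python
import struct


def read_lstring(buf, data_start, offset):
    """Return the bytes of a length-prefixed string, without the terminator."""
    begin = data_start + offset
    (length,) = struct.unpack_from("<I", buf, begin)  # little-endian assumed
    raw = buf[begin + 4 : begin + 4 + length]
    if len(raw) != length:
        raise ValueError("string runs past the end of the buffer")
    # The length counts the terminator, so drop it when it is there.
    return raw[:-1] if raw.endswith(b"\x00") else raw


# Two bytes of padding stand in for the header and directory.
sample = b"\xff\xff" + struct.pack("<I", 6) + b"Hello\x00"
text = read_lstring(sample, 2, 0)
```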
The string encodings supported by Skyrim are determined by the "fonts_en.swf" file in "Skyrim - Interface.bsa", which varies between languages. The following table gives the known supported localisations of "fonts_en.swf" and their corresponding encodings. All localisations use the same filename; confusingly, the "_en" substring does not indicate the target language. Blank boxes are unknown.
|Localisation||Primary Encoding||Secondary Encoding|
The official translations all use the secondary encoding given in the table above, apart from Japanese. Polish and Czech use a custom Windows-1250-based encoding with the following character set (note that original ů and ý characters are not used):
Skyrim first attempts to interpret a string using its primary encoding; if the string contains byte sequences that are invalid in that encoding, the secondary encoding is used instead. It is unknown what happens if the string also contains invalid bytes when interpreted using the secondary encoding (e.g. by including unused bytes).
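A tool reading these files could mimic the fallback behaviour roughly as follows. This is a sketch, not a description of the game's implementation. The defaults of UTF-8 and Windows-1252 are illustrative assumptions, and the use of `errors="replace"` for the secondary pass is a choice made here, because the game's behaviour in that case is unknown.

```python
def decode_with_fallback(raw, primary="utf-8", secondary="cp1252"):
    """Decode with the primary encoding, falling back to the secondary.

    The default encodings are illustrative assumptions. Bytes that are
    invalid in the secondary encoding are replaced with U+FFFD, which is
    this sketch's own choice.
    """
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        return raw.decode(secondary, errors="replace")


valid_utf8 = decode_with_fallback("caf\u00e9".encode("utf-8"))
fallback = decode_with_fallback(b"caf\xe9")  # lone 0xE9 is not valid UTF-8
```

Both calls produce the same text, the first through the primary encoding and the second through the fallback.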
Note that interpretation is done after alias lookup and substitution, so if the string used for an alias is in a different encoding from the string containing the alias, the combined string will not be displayed correctly. Note also that each localisation's fonts include incomplete character support, e.g. the English localisation's font cannot display Cyrillic characters even when strings are encoded in UTF-8, nor can it display some of the lesser-used characters available in Windows-1252.
There also appears to be a lack of UTF-8 support in certain circumstances, thus far reported for text in scripts. In these circumstances it appears that the secondary encoding is used, but this issue has not yet been investigated.