A friend of mine, who'd rather remain unknown, gave me a lot of insight into those 2 files.
He said, and I quote: I think I've figured out a somewhat-clumsy-yet-efficient solution to the "square glyph" problem, which is responsible for the letter repetition when using half-width characters in the dialogue lines. I don't know what you've figured out and what you haven't, so I'll explain things from the beginning. Please be patient with me even if you know most or all of what follows. I only mean to be of assistance.
I've read in the first post of the "Disgaea 4 PS3 Translation / Fix Project WIP" thread at PS3 News that the square glyph issue is an encoding-related problem, resulting from the fact that SJIS uses 2 bytes per character for most characters. I think that's not quite it -- it is more of a rendering problem. In the dialogues, all glyphs are rendered as squares. And since half-width characters have smaller textures, they just "wrap around". On the other hand, the encoding is respected, since all of the characters of an ASCII string are printed. One would expect every other letter to be skipped if the game were not "encoding aware enough".
I think it would be very difficult to solve this problem through disassembling and modifying the "EBOOT.BIN" file, because the buffer used to store the dialogue text is likely too small. The "bugs" one can observe by using the US "talk.dat" with the Japanese version are probably due to buffer overruns. So even if one managed to halve the character width, the maximum number of displayed characters would stay the same. Unless of course the size of the buffer is increased. All in all, this seems very difficult to do to me.
So I propose another solution: overwriting the glyphs found in the "font.lzs" file so as to substitute some of the characters (e.g., the kanji) with letter pairs. For instance, the first kanji of the font file is "亜", which could be replaced by a square glyph representing the letter pair "Th". The next kanji would be replaced by "e ", then "fr", "es", "h,", and so on, in order to display on the screen "The fresh, dripping blood...", which is Valvatorez's first line. Of course, those same pairs could be reused later, should they appear in other dialogues. According to my estimates, the font file is big enough to contain all the pairs needed. If it were not, it could still be resized (but adding extra glyphs would also require modifying "font.ffm", which would make things more complicated).
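To make the pair idea a bit more concrete, here is a minimal Python sketch of the bookkeeping only: it cuts each dialogue line into two-character pairs and gives every distinct pair a glyph slot in order of first appearance. The names `split_pairs`, `assign_slots` and `first_slot` are made up for illustration, and padding odd-length lines with a trailing space is my own assumption, not something the game requires.

```python
def split_pairs(text):
    """Cut a line into 2-character pairs, padding with a space if needed."""
    if len(text) % 2:
        text += " "  # assumption: pad odd-length lines with a space
    return [text[i:i + 2] for i in range(0, len(text), 2)]


def assign_slots(lines, first_slot=0):
    """Map each distinct pair to a glyph slot, in order of first appearance.

    first_slot is a hypothetical index of the first glyph to overwrite.
    """
    table = {}
    for line in lines:
        for pair in split_pairs(line):
            if pair not in table:
                table[pair] = first_slot + len(table)
    return table


if __name__ == "__main__":
    slots = assign_slots(["The fresh, dripping blood..."])
    for pair, slot in slots.items():
        print(repr(pair), "->", slot)
```

The size of the resulting table is the number of glyphs that would have to be redrawn, which gives a quick way to check whether the existing font file has enough room.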
I'll move on to the technical aspects. Such a hack would require understanding the formats of the ".lzs", ".txf" and "talk.dat" files. Here is the information I could gather:
+ LZS is an already well understood format. See this page: www.cetramod.it/forum/viewtopic.php?f=52&t=266. I don't understand Italian, so I don't really know who to credit for the information (the author of the post?), but in summary it is a 254-byte sliding-window compression scheme. To be specific:
- bytes 0x0 to 0x2: the expected extension of the decompressed file; should be "txf" for "font.lzs"
- byte 0x3: should be '\0'
- bytes 0x4 to 0x7 (little endian uint32): decompressed file size
- bytes 0x8 to 0xb (little endian uint32): (compressed file size) - 4
- bytes 0xc to 0xf (little endian uint32): flag value; must be less than 255, which means that bytes 0xd to 0xf are always "\0\0\0"
- the rest is compressed data; to decompress it, every byte is copied as is, except the flag byte, which indicates a match and is followed by a distance/length pair (unless it is followed by another flag byte, in which case a single byte with the flag value should be added to the uncompressed data); the distance and length are one uint8 each; the distance is counted back from the current position, and 1 should be subtracted from it if it is greater than the flag (to account for the two-consecutive-flags special case); see for instance LZ77 on Wikipedia if you're not familiar with such compression methods
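The LZS description above can be sketched in Python roughly as follows. This is only my reading of the list, not a verified implementation: `lzs_decompress` is a made-up name, and I am assuming that the distance is counted back from the end of the output written so far, that the length byte is used as is (no offset), and that a match may overlap the bytes it is producing.

```python
import struct


def lzs_decompress(blob):
    """Decompress an LZS file held in memory; returns (extension, data)."""
    ext = blob[0:3].decode("ascii")  # e.g. "txf" for font.lzs
    # Bytes 0x4..0xf: three little-endian uint32 values.
    out_size, _packed_size_minus_4, flag = struct.unpack_from("<III", blob, 4)
    if flag > 0xFF:
        raise ValueError("flag value does not fit in one byte")

    out = bytearray()
    pos = 16  # compressed data starts right after the header
    while len(out) < out_size:
        byte = blob[pos]
        pos += 1
        if byte != flag:
            out.append(byte)  # plain literal
            continue
        distance = blob[pos]
        pos += 1
        if distance == flag:
            out.append(flag)  # two flags in a row: literal flag byte
            continue
        if distance > flag:
            distance -= 1  # undo the shift that keeps "flag flag" unambiguous
        length = blob[pos]
        pos += 1
        start = len(out) - distance  # assumption: distance counts back from here
        if distance == 0 or start < 0:
            raise ValueError("match points outside the decompressed data")
        for i in range(length):
            out.append(out[start + i])  # byte by byte, so overlaps work
    return ext, bytes(out)
```

Something like `ext, data = lzs_decompress(open("font.lzs", "rb").read())` would then give the raw TXF bytes, provided the assumptions above hold for the real file.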
+ TXF is a very simple bitmap format:
- bytes 0x0 to 0xf are the header; the big-endian-encoded uint16 at offsets 0x4 and 0x6 are the image width and height, respectively; it is 1024×2272 for "font.txf"
- the rest is pixel data in the usual scanline order; when it comes to the "font.txf" file, there are two channels: the first is alpha and the other is value, I guess (it is worth 0xff for all the pixels)
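For reference, reading a TXF file along those lines could look like the Python sketch below. The names `parse_txf` and `alpha_at` are made up, and I am assuming the two channels are interleaved per pixel with one byte each (alpha first), which the description above does not state explicitly.

```python
import struct

TXF_HEADER_SIZE = 16  # bytes 0x0 to 0xf


def parse_txf(blob):
    """Return (width, height, pixel_bytes) from an in-memory TXF file."""
    # Big-endian uint16 values at offsets 0x4 (width) and 0x6 (height).
    width, height = struct.unpack_from(">HH", blob, 4)
    return width, height, blob[TXF_HEADER_SIZE:]


def alpha_at(pixels, width, x, y, channels=2):
    """Alpha of pixel (x, y), assuming interleaved 8-bit channels, alpha first."""
    return pixels[(y * width + x) * channels]
```

If the assumptions are right, `parse_txf` on the decompressed "font.txf" should report 1024 by 2272, and the pixel data should be 1024 × 2272 × 2 bytes long, which makes for an easy sanity check.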
+ "talk.dat" has the following structure:
- bytes 0x0 to 0x3 (big endian): the number of conversations (?)
- bytes 0x4 to 0x7 (big endian): should be the same as bytes 0x0 to 0x3
- following 56×(number of conversations) bytes: an array of conversation (?) structures; the first 4 bytes of each element (big endian) are the offset of the conversation start in the conversation data, in bytes (so if it is n, the conversation starts at n+8+56×(number of conversations) from the beginning of the file)
- the rest is conversation data; in particular, spoken lines begin with '\1' and end with '\0' (but of course, not all the '\1' indicate the beginning of a spoken line)
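The "talk.dat" layout above translates into a few lines of Python. This sketch only computes where each conversation starts in the file. `conversation_offsets` and `ENTRY_SIZE` are made-up names, and the remaining 52 bytes of each entry are simply skipped, since I don't know what they contain.

```python
import struct

ENTRY_SIZE = 56  # size of one conversation structure, per the notes above


def conversation_offsets(blob):
    """Return the absolute file offset of each conversation in talk.dat."""
    count, count_again = struct.unpack_from(">II", blob, 0)
    if count != count_again:
        raise ValueError("the two conversation counts differ")
    data_start = 8 + ENTRY_SIZE * count  # conversation data follows the array
    offsets = []
    for i in range(count):
        # First 4 bytes of each entry: big-endian offset into the data area.
        (relative,) = struct.unpack_from(">I", blob, 8 + ENTRY_SIZE * i)
        offsets.append(data_start + relative)
    return offsets
```

Finding the spoken lines inside each conversation still needs more work, because a '\1' byte does not always mark the start of one.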
That's basically it. I've also got some very basic understanding of "font.ffm", but normally one should not need to mess with it. It is what associates each glyph's position in the "font.txf" bitmap with its SJIS byte sequence. The kanji table given below should be enough. Ask me if you really need to know.
Some kanjis of "font.lzs", in order (starting from glyph 486):
With some more dedicated programming, we could actually get the whole game text translated. I'll look into it after I'm done dumping/translating the eboot, but if anyone is interested/able to do something with this info, by all means, go for it.