GD is an open source code library for the dynamic creation of images by programmers. GD is written in C, and “wrappers” are available for Perl, PHP and other languages. GD creates PNG, JPEG and GIF images, among other formats. GD is commonly used to generate charts, graphics, thumbnails, and most anything else, on the fly. While not restricted to use on the web, the most common applications of GD involve web site development.
See the GD website for more information.
FS#42 — GD can't display characters with unicode references over 7 digits
Opened by Christopher Key (cjk32) - Thursday, 08 February 2007, 10:02 GMT+2
Last edited by Pierre Joye (Pierre) - Saturday, 10 February 2007, 17:12 GMT+2
Details

GD doesn’t appear to be able to display characters with unicode references over 7 digits. 6-digit references (or 5 digits in hex) display correctly, but 7-digit references (or 6 digits in hex) simply show the literal string:

    $image->stringFT(0x0000ff, $fontpath, 24, 0, 0, 32, "➠"); # Works
    $image->stringFT(0x0000ff, $fontpath, 24, 0, 0, 64, "𘚟"); # Works
    $image->stringFT(0x0000ff, $fontpath, 24, 0, 0, 96, "𝅗𝅥"); # Shows "𝅗𝅥"
    $image->stringFT(0x0000ff, $fontpath, 24, 0, 0, 128, "𘚠"); # Shows "𘚠"

I’ve had a look at the code, and the solution does seem to be as simple as changing the loops in gdft.c to run to 9 characters rather than 7 (which will run up to 10FFFF). The affected lines are 245 and 261 (in 2.0.33), which change from:

    245: for (i = 3; i < 8; i++)
    261: for (i = 2; i < 8; i++)

to

    245: for (i = 3; i < 10; i++)
    261: for (i = 2; i < 10; i++)
For the record (to put the details of a previous discussion here):
It is actually not a bug. See:
http://www.unicode.org/faq/utf_bom.html#10
Maximum 21bits are needed, corresponding to 6 digits in hexadecimal form.
The fix is wrong, on line 245 it processes the hexadecimal form which can have a maximum length of 6 digits. On line 261, it should be:
to match the maximum length of a decimal representation (2^21='2097152'). But not 10, as said in the specs and the HTML entities definition. Can you try it please?
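To illustrate the limits discussed above: the highest Unicode code point is 0x10FFFF (1114111), which needs at most 6 hexadecimal or 7 decimal digits. The following is only a sketch in C and not the gdft.c code; `parse_numeric_ref` and `MAX_CODE_POINT` are hypothetical names. It bounds the parsed value rather than the digit count, so references with leading zeros are accepted too. It does not reject surrogate values.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CODE_POINT 0x10FFFFUL /* 1114111: at most 7 decimal or 6 hex digits */

/* Hypothetical helper (not the gdft.c routine): parse "&#NNN;" or "&#xHHH;"
 * at the start of s. On success, stores the code point in *cp and returns the
 * number of bytes consumed; returns 0 if s is not a valid numeric reference. */
size_t parse_numeric_ref(const char *s, unsigned long *cp)
{
    assert(s != NULL && cp != NULL);
    if (s[0] != '&' || s[1] != '#')
        return 0;

    size_t i = 2;
    unsigned base = 10;
    if (s[i] == 'x' || s[i] == 'X') {
        base = 16;
        i++;
    }

    unsigned long value = 0;
    size_t digits = 0;
    for (;; i++, digits++) {
        char c = s[i];
        unsigned d;
        if (c >= '0' && c <= '9')
            d = (unsigned)(c - '0');
        else if (base == 16 && c >= 'a' && c <= 'f')
            d = (unsigned)(c - 'a') + 10;
        else if (base == 16 && c >= 'A' && c <= 'F')
            d = (unsigned)(c - 'A') + 10;
        else
            break;
        value = value * base + d;
        if (value > MAX_CODE_POINT)
            return 0; /* out of the Unicode range: bound the value, not the length */
    }
    if (digits == 0 || s[i] != ';')
        return 0;

    *cp = value;
    return i + 1; /* include the terminating ';' */
}
```

With this approach, "&#119134;" and "&#x1D15E;" both yield 0x1D15E, while "&#1114112;" is refused.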
Also, if you still use the same font, can you please attach it again with the full perl script? I think it may be better to rely on the symbols encoding for such tasks (which is fixed in 2.0.34).
Thanks for your report,
Yes, I agree, 7 decimal digits is sufficient.
I've attached a script (and supporting font file) that demonstrates the problem, along with the output produced on my machine. Unfortunately, as I'm using a win32 platform, I'm somewhat restricted as to the versions of libgd that I can use. I'm using the ActiveState perl distribution with GD 2.35, which I believe uses libgd 2.0.33, although I haven't been able to confirm this.
Thanks for the font and the script. Do you have an image of what you expect (using a screenshot for ex.)? and tell me how you create it (which apps), if possible something OS :).
About windows, we provide DLLs, you can find them in our downloads page. I do not know if activestate supports extern DLL though. I will ask them :)
I'm afraid I can't really provide a screenshot of how it should look, but it is pretty obvious what it should do when working; the script should produce an image with a single character on each line.
The characters I'm particularly interested in are those on the third and ninth lines, which should render as a musical note symbol. I use this character extensively within my tagged music, and would like it to render correctly on my squeezebox (www.slimdevices.com), for which GD is used to render text.
I downloaded the Windows binaries, but couldn't get them to work with the ActiveState module. I replaced the existing GD.dll with bgd.dll from the win32 binaries (renamed to GD.dll), but started getting the error "Can't find 'boot_GD' symbol in C:/Programs/Perl/site/lib/auto/GD/GD.dll". There are various other files within the folder, GD.lib, GD.exp, GD.bs and autosplit.ix, which I presume are all something to do with autosplit and hence need updating for a new binary file, although I don't know how the autosplit system works.
“I’m afraid I can’t really provide a screenshot of how it should look, but it is pretty obvious what it should do when working; the script should produce an image with a single character on each line.”
Yes, one character per line is obvious. Which character and how it looks a bit less :) But you gave me a clear description now.
A little note about the “&#...” notation. It is only for the HTML entities which do not support such large values. It is expected to fail and display the complete string.
Which perl version do you use? 5.8 (as far as I remember) supports UTF-8, can you try to pass a native UTF-8 string to stringFT? The code for the note symbol is u+1d160 (119136). I do not know how to convert this value into a well-formed UTF-8 string. But maybe asking in a perl forum will give you the answer (if you can paste it here, it will help with testing). If you can provide a UTF-8 encoded text file with these values, we will then have a base to fix the issue, if any.
Please find attached a font (based on code2001.ttf) but with lower values for the music notes, x101 to x104.
I analyzed the font using fontforge and I may have found the problem. A fix is hardly possible using this notation; the only way is to rely on true UTF-8 strings (two sequences of three bytes). It would be nice if we can figure out a solution for 2.1.0 (maybe using gd-pango).
Can you provide a reference for HTML entities only supporting up to four digit characters? Firefox certainly supports up to 6 digit references, is there any specific reason not to provide support?
The perl function pack produces correct UTF-8: pack("U", 0x1d15e) = 0xf09d859e, which according to http://en.wikipedia.org/wiki/UTF-8 is correct. It is modified UTF-8 that uses 6 bytes.
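For reference, the standard UTF-8 encoding step can be sketched in C as below. This is an illustration only, not code from gd; `utf8_encode` is a hypothetical name. Code points above 0xFFFF become a single four-byte sequence, which is how 0x1d15e ends up as f0 9d 85 9e.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8 into out,
 * which must have room for 4 bytes. Returns the number of bytes written,
 * or 0 for surrogates (U+D800..U+DFFF) and values above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                      /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not valid scalar values */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* beyond the Unicode range */
}
```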
Yes, that's what I tried to say. The goal is to support two sequences of three bytes. I have to check that this is the real problem, but from what I see, we only support a single sequence (0xffff max). Allowing larger values does not fix the problem: I get the wrong glyph (that means we generate a wrong UTF-8 value for the sequence). I also tried to pass the result of pack using perl (with gd-2.0.33 or the patched version), same result. So no, it is not only about allowing more digits (as I suspected). Have you ever succeeded in rendering something using this font? And if yes, with which tools? That will *really* help to get the cleanest possible solution.
And please try using the font I provided earlier and confirm that we are talking about the same glyphs (x101-104 are the music symbols).
Firstly, yes, the symbols displayed in out.png are the correct symbols.
I have successfully rendered those symbols with the original font file using Firefox: put CODE2001.ttf in your Windows fonts directory, then load the attached html file in Firefox. It renders perfectly on my machine.
If I get a chance over the weekend, I'll try and take a look at the rest of the GD code and see if I can see what's going on exactly.
I got it to work using gd-pango. That's good news, as 2.1.0 will support it. You can try using gd-pango in CVS (C only, unix for now...). A little side note about these codes: they should really not be used for visible glyphs, as they are not defined at all in any encoding (hence the usage of a symbols font instead, which is safer and guarantees forward compatibility). I will stop my tests for now. Let's see what you will find :)
I'm afraid I don't have a unix platform available to test on.
I've attached a modified version of gdft.c that should support 4-byte UTF-8 sequences, as well as longer character references, but I've no way of checking that it even compiles. Do you think you could compile GD with this version of gdft.c and show me the results of either the original perl script or simply trying to render the 4-byte UTF-8 sequences in the file english.txt?
By the way, these codes are perfectly valid characters, and *are* defined as part of the unicode standard http://www.unicode.org/charts/PDF/U1D100.pdf.
"By the way, these codes are perfectly valid characters, and *are* defined as part of the unicode standard "
Right, I must have read an old spec.
The attached image is generated using your script "gfx.pl" with gd 2.0.34 and your version of gdft.c (patched_gdft.c). I did not check your changes but it has the same errors I had earlier.
I would like two things: to keep the HTML entities as they are now, and to have the functions required to deal with Unicode correctly. The basic UTF-8 support can or should be present in gd itself. The advanced Unicode rendering will be done using Pango (a gd renderer for pango is coming).
I'm keen to get some sort of working solution sooner rather than later. Can you tell me what your build environment is, so that I can have a go at some local debugging and see where things are going wrong? It'd be nice to know whether FT_Get_Char_Index returns the correct value and so on.
Chris
FT_Get_Char_Index returns the correct index if we give it the correct code, but we do not. If FT_Get_Char_Index were the problem, pangoft2 would fail miserably (see my previous comment, it works). The problem is really in the (old) conversion routines. I have to clean them up once and for all, and that's on my todo list for 2.1.0. I don't think it is worth the effort to fix too many bugs in this implementation, as we know that it has other troubles which are impossible to solve using this approach (full Unicode rendering requires much more than that).
The build environment is simple: a standard linux system, its development tools, etc. Feel free to post on the mailing list if you have any trouble setting up your build environment. I will be happy to help you :)
If you have a small patch to fix this specific issue, I will be happy to apply it. However, please note that it will not make it into 2.0.x; this tree is only for critical bug fixes (security related mostly).
I'm away over the weekend, but I'll drop by the mailing list next week and see if I can get a win32 build and test environment working. It strikes me that it might also be nice to have the option of turning off html entities decoding, and just have the routine render pure utf-8 text.
Chris
To make it clearer, it is not only about the HTML entities. We already fixed that by increasing the maximum length of an entity. The problem is a bit more complex than that. If you pass a pure UTF-8 string, it will still fail (for these values). More later.
Problems found, finally.
The most important problem is the charmap selection: we select the wrong one. We loop through all charmaps and try to find the requested one. The problem is that we consider non-Unicode charmaps valid and select them. If the font contains other charmaps (FT_ENCODING_MS_SYMBOL, FT_ENCODING_ADOBE_CUSTOM or FT_ENCODING_ADOBE_STANDARD) and they come before FT_ENCODING_UNICODE, we will use them instead. High-value symbols require Unicode; that’s why FT_Get_Char_Index fails to find the correct glyph index and returns 0.
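The selection rule described here can be modelled without FreeType as in the sketch below. Everything in it is hypothetical (the `encoding_t` enum merely stands in for FreeType's encoding tags, and `pick_charmap` is an invented name); in real code the same effect would presumably come from asking FreeType for the Unicode charmap directly. The point is only that the Unicode charmap should win regardless of where it appears in the font's list.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for FreeType's encoding tags; for illustration only. */
typedef enum {
    ENC_MS_SYMBOL,
    ENC_ADOBE_CUSTOM,
    ENC_ADOBE_STANDARD,
    ENC_UNICODE
} encoding_t;

/* Hypothetical helper: return the index of the charmap to use, or -1 if the
 * font has none. A Unicode charmap is preferred wherever it sits in the list;
 * only when there is none do we fall back to the first charmap. */
int pick_charmap(const encoding_t *charmaps, size_t count)
{
    assert(charmaps != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if (charmaps[i] == ENC_UNICODE)
            return (int)i;
    }
    return count > 0 ? 0 : -1;
}
```

The faulty behaviour corresponds to returning the first charmap that is merely recognised, which for a font listing a symbol charmap first would never reach the Unicode one.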
The second problem (the one I suspected) is in the unicode parsing functions (to get the unicode codes and pass them to FT2). They are too limited and do not support large entities. I have a working version here, I will clean it a bit, add test cases and commit it to HEAD as soon as possible.
Here is a first patch to solve this problem. What this patch does:
FT_Set_Charmap should not have been used in the first place; FT_Select_Charmap(ENCODING) should have been used instead. It will also save us the manual detection of the available encodings in the current face. This job will be done by FT_Select_Charmap.
The little sample application shows a working example using the musical symbol in Unicode and code2001.ttf.
There are still a couple of things to check, like JIS and Big5 support, but I do not yet have a single clue about these encodings; any help is welcome :)
The next step will be to add a strict encoding mode instead of considering anything as Unicode by default. It will help us to choose the correct encoding and get an error if it does not exist.
About the previous, it also requires freetype 2.1.7 or later. We will certainly have this dependency in 2.1.0.
Looks like the issue is well sorted. It will be nice to be able to control the encoding used by stringFT too.
I will keep StringFT as it is forever. StringFtEx has the great advantage of being flexible. I can add this fix or new features without breaking backward compatibility.
Expect this fix to be in CVS by the end of next week.
I would like to describe my experience using Gdlib to draw traditional Chinese characters.
1: The string must be processed character by character, in case the character is bigger
2: All ttf/ttc... files are ordered by Unicode. So all multibyte characters
3: I would like to suggest using the ICONV library for this task.
Pedro P. Wong
Example
Pedro P. Wong
It's been roughly three years since the last post on this bug, but in the meantime I still see FreeType2 failing to generate legible text via PHP+GD2, with characters like U+1000F and U+211A2 still being chopped up incorrectly. Instead, it will generate a two-character string for both of them.
The status of this bug, too, is still "undecided". Given that this is a massive issue (namely, FreeType2 currently does not actually support even unicode standard 3.1, which was committed all the way back in 2001), can someone make this bug slightly more pressing to the project?
I've attached a php file that runs into the problem still, which requires the Code2000, Code2001 and 'Han Nom B' fonts (all three freely available, with links to their download pages in the source)
- Pomax
Correction to fix the bug as follows:
diff correction set format.
d257 1 a257 2
———————————————-
My personal correction set format.
The correction was based on version 2.0.36 and tested under butterfly windows environment.
The attachment is the test result. Is it correct?
can you generate that png with some fonts that support the high characters? (like Han Nom A and B) right now the characters are "missing glyph" indicators (so I assume it would work with those corrections, but unless there's an actual character to see, it's hard to say)
The Han Nom A and B fonts do not provide any character codes bigger than 0xFFFF. So I don't have a font to test it.
HAN NOM A doesn't, but HAN NOM B implements the CJK-Extension B block (which runs from 0x20000 to 0x2A6DF), which is quite high. For 0x1???? points, Code2001 (a higher plane companion to Code2000) probably does the trick, as it for instance has implementations of the Linear B blocks, which run from 0x10000 to 0x100FF.
I checked the font with a font editor (TypeTool3) and found it is a Chinese font file. The max. code is 0xFFFF.
Then it would appear TypeTool's lying. Although I'm not sure I could call it that, since Fontlab made the executive decision to not let ANY of their programs actually let you inspect properly big fonts. The only product they make that supports a full opentype font is AsiaFont Studio, which sets you back $999 (for no clear reason. There is nothing in Opentype that says you can only have so many characters in an editor for a font, it supports 65535 glyphs and fontlab just thought it was a nice idea to wring more money out of people by going "oh you want a FULL font editor? that'll be a thousand dollar, please").
But buying AsiaFont studio would be silly when we have FontForge, which is free. However, what is even easier is to simply install the font in windows and then use babelmap (http://www.babelstone.co.uk/Software/BabelMap.html).
Han Nom A only goes up to 0xFFFF, and Han Nom B exclusively covers CJK Unified Ideographs Extension B block (which is humongous with 42 thousand glyphs, and starts at 0x20000), similar to for instance the MingLiU, MingLiU-ExtB or SimSun and SimSun-ExtB pairs.
But we're getting side tracked, can you do another test image with a font that has some glyphs for the high codepoints you're rendering? (ideally, one from the 0x1234 range, one from the 0x1xxxx range and one from the 0x2xxxx range).
The Han Nom B font has the 0x20000~ characters inside, but the font does not provide an index for those characters. The same goes for code2001.ttf. Since I can only download the demo program, I cannot save the font after changing its contents. Can someone change code2001.ttf to assign an index to those characters? (Those characters have a name but do not have an index, and GDlib fetches characters by index, not by name.) Then I can test it. (Right now gdlib can show any character below 0x10000. We cannot get a font that provides a character index above 0x10000, so we cannot test it.)
Not quite sure what you mean with "the font does not provided index for those characters"?
HAN NOM B is a normal TTF font with CMAP subtables. UCS-2 characters (everything from 0x00 to 0xFFFF) use a format 4 subtable, and UCS-4 characters use a format 12 subtable.
I don't use FreeType directly (I'm using a custom opentype/truetype parse code), but FT_Get_Char_Index *should* return indexes for high codepoint glyphs just fine. For instance, the glyph index for unicode character 0x20000 in the GLYF data table for HAN NOM B is 1775, and the glyph index for unicode character 0x10000 in the GLYF data table for Code2001 is 683.
Are you saying GDLib, by way of FreeType, cannot access unicode characters over 0xFFFF because of limitations in FreeType? Because as far as I know FreeType is perfectly capable of dealing with UCS-4 codepoint glyphs...
Gdlib converts an HTML hex entity to a Unicode binary value, or directly converts Unicode text to a Unicode binary value, and uses it as the font glyph index. It uses this index to fetch the character image. If you view the fonts in a font editor, you will find that every character has a table index. This value is just its location in the table. It also has a name and a glyph index. The problem is that for the characters above 0x10000 in the fonts code2001 or HAN NOM B, the glyph indices are all zero. Gdlib uses the glyph index to fetch the image, so it cannot get the image for those characters.
Ahh, I see. That's a problem with the editor you're using, not the font. They have glyph indices greater than 0, but if you're using TypeTool, it will not show you these, because TypeTool only supports up to 0xFFFF, which is why I told you to use FontForge instead (a free font editor that runs in Linux, and does not come with the restrictions that TypeTool comes with).
1: The TypeTool3 demo can support over 0x10000. I have tested it and changed the glyph index, but I cannot save the result. Not only TypeTool3: a lot of font editors show the same result. 2: I don't have a cygwin environment, so I cannot use FontForge. Can anybody take code2001 and change one character's glyph index, so I can test it? Anyway, maybe Gdlib should enhance the freetype search routine (searching not only by glyph index but also by name).
I'll check the freetype list to see if this is a known problem.
I enhanced the freetype search function in gdft.c and upgraded the freetype library to freetype-2.3.12. You can see the result in the attachments.
The correction is just an alternative solution, in case fetching by index fails.
Please delete the first correction set file. I cannot delete it.
Some font files, like CODE2001.TTF and HAN NOM B.ttf, provide multiple charmaps. gdft.c just uses the first available charmap, but the information for characters above 0x10000 is not in the first available charmap. That is why gdlib cannot handle characters above 0x10000.
Hmm...
There might be a relatively simple way around that, if GD assumes TTF fonts: the UCS-2 cmap subtable is an impossible candidate for characters with code 0x10000 and up, as these are wider than 2 bytes (which is what the '2' in UCS-2 stands for). Effectively, if a character is more than 2 bytes, GD could skip straight to the UCS-4 cmap subtable instead, and resolve the character there.
I use this shortcutting in my own OTF/TTF parser, too.
(Of course, this does not work for OTF fonts with CFF content, but then I seem to recall GD doesn't actually support those)
- Pomax
I prefer to test the first available charmap as step one. This charmap supports up to 0xFFFF, so even for Chinese characters the first available charmap is enough. If that test fails, try one or more of the others, if available.
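The "first charmap, then the others" order can be sketched in C as follows. This is a model and not the attached correction set: a charmap is represented by a plain lookup function, `find_glyph`, `toy_ucs2` and `toy_ucs4` are hypothetical names, and the glyph indices in the toy charmaps are illustrative values.

```c
#include <assert.h>
#include <stddef.h>

/* A charmap is modelled as a lookup function: code point in, glyph index out.
 * As with FT_Get_Char_Index, 0 means "no glyph for this code point". */
typedef unsigned (*charmap_lookup)(unsigned long cp);

/* Hypothetical helper: try each charmap in order and return the first
 * non-zero glyph index. If used is not NULL, it receives the index of the
 * charmap that answered. Returns 0 if no charmap knows the code point. */
unsigned find_glyph(const charmap_lookup *charmaps, size_t count,
                    unsigned long cp, size_t *used)
{
    assert(charmaps != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        unsigned glyph = charmaps[i](cp);
        if (glyph != 0) {
            if (used != NULL)
                *used = i;
            return glyph;
        }
    }
    return 0;
}

/* Toy charmaps for demonstration; a real font supplies its own tables. */
unsigned toy_ucs2(unsigned long cp)
{
    return cp == 0x41 ? 36 : 0;                 /* knows only 'A' */
}

unsigned toy_ucs4(unsigned long cp)
{
    if (cp == 0x41)
        return 36;
    return cp == 0x10000 ? 683 : 0;             /* also knows U+10000 */
}
```

With the charmaps ordered as { toy_ucs2, toy_ucs4 }, 'A' is answered by the first one and U+10000 only by the second, so characters below 0x10000 never pay for the extra lookups.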
Attachments are my correction set and test result.
That would be cleaner. The characters in the test PNG look fine to me!
Would now be a good time to also add a "supportsTTFtext($string)" function? (True if all characters in the string, as unicode, are supported; false the moment a single character is not supported.)
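A rough idea of how such a check could look in C, assuming the input is UTF-8: decode one character at a time and stop at the first one the font cannot provide. None of this is an existing gd function; `supports_text`, `utf8_decode`, `has_glyph_fn` and the toy predicate `bmp_only` are hypothetical names, and the font query is left as a callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Decode one UTF-8 sequence from the NUL-terminated string s. Stores the code
 * point in *cp and returns the bytes consumed, or 0 on malformed input
 * (bad lead byte, truncated or overlong sequence, surrogate, > U+10FFFF). */
static size_t utf8_decode(const unsigned char *s, unsigned long *cp)
{
    size_t len;
    unsigned long min;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    *cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   *cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; *cp = s[0] & 0x07; }
    else return 0;

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;               /* not a continuation byte (or early NUL) */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    if (*cp < min || *cp > 0x10FFFF || (*cp >= 0xD800 && *cp <= 0xDFFF))
        return 0;
    return len;
}

/* Callback answering "does the font have a glyph for this code point?". */
typedef bool (*has_glyph_fn)(unsigned long cp);

/* True if every character of the UTF-8 string is supported; false the moment
 * a single character is unsupported or the string is malformed. */
bool supports_text(const char *text, has_glyph_fn has_glyph)
{
    assert(text != NULL && has_glyph != NULL);
    const unsigned char *p = (const unsigned char *)text;
    while (*p != '\0') {
        unsigned long cp;
        size_t n = utf8_decode(p, &cp);
        if (n == 0 || !has_glyph(cp))
            return false;
        p += n;
    }
    return true;
}

/* Toy predicate for demonstration: a font limited to code points <= 0xFFFF. */
bool bmp_only(unsigned long cp)
{
    return cp <= 0xFFFF;
}
```

In a real implementation the callback would query the selected charmap of the font; treating malformed input as unsupported is a design choice made here for simplicity.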