Fonts and encodings¶
This chapter covers fonts, encodings and Asian language capabilities. If you are purely concerned with generating PDFs for Western European languages, you can just read the "Unicode is the default" section below and skip the rest on a first reading. We expect this section to grow considerably over time. We hope that Open Source will enable us to give better support for more of the world's languages than other tools, and we welcome feedback and help in this area.
3.1 Unicode and UTF8 are the default input encodings¶
Starting with reportlab Version 2.0 (May 2006), all text input you provide to our APIs should be in UTF8 or as Python Unicode objects. This applies to arguments to canvas.drawString and related APIs, table cell content, drawing object parameters, and paragraph source text.
We considered making the input encoding configurable or even locale-dependent, but decided that "explicit is better than implicit".
This simplifies many things we used to do previously regarding greek letters, symbols and so on. To display any character, find out its unicode code point, and make sure the font you are using is able to display it.
If you are adapting a ReportLab 1.x application, or reading data from another source which contains single-byte data (e.g. latin-1 or WinAnsi), you need to do a conversion into Unicode. The Python codecs package now includes converters for all the common encodings, including Asian ones.
If your data is not encoded as UTF8, you will get a UnicodeDecodeError as soon as you feed in a non-ASCII character. For example, this snippet below is attempting to read in and print a series of names, including one with a French accent: Marc-André Lemburg. The standard error is quite helpful and tells you what character it doesn't like:
>>> from reportlab.pdfgen.canvas import Canvas
>>> c = Canvas('temp.pdf')
>>> y = 700
>>> for line in file('latin_python_gurus.txt','r'):
... c.drawString(100, y, line.strip())
...
Traceback (most recent call last):
...
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 9-11: invalid data
-->é L<--emburg
>>>
The simplest fix is just to convert your data to unicode, saying which encoding it comes from, like this:
>>> for line in file('latin_input.txt','r'):
... uniLine = unicode(line, 'latin-1')
... c.drawString(100, y, uniLine.strip())
>>>
>>> c.save()
3.2 Automatic output font substitution¶
There are still a number of places in the code, including the rl_config defaultEncoding parameter, and arguments passed to various Font constructors, which refer to encodings. These were useful in the past when people needed to use glyphs in the Symbol and ZapfDingbats fonts which are supported by PDF viewing devices.
By default the standard fonts (Helvetica, Courier, Times Roman) will offer the glyphs available in Latin-1. However, if our engine detects a character not in the font, it will attempt to switch to Symbol or ZapfDingbats to display these. For example, if you include the Unicode character for a pair of right-facing scissors, \u2702, in a call to drawString, you should see them (there is an example in test_pdfgen_general.py/pdf). It is not necessary to switch fonts in your code.
3.3 Using non-standard Type 1 fonts¶
As discussed in the previous chapter, every copy of Acrobat Reader comes with 14 standard fonts built in. Therefore, the ReportLab PDF Library only needs to refer to these by name. If you want to use other fonts, they must be available to your code and will be embedded in the PDF document.
You can use the mechanism described below to include arbitrary
fonts in your documents. We have an open source
font named DarkGardenMK which we may
use for testing and/or documenting purposes (and which you may
use as well). It comes bundled with the ReportLab distribution in the
directory reportlab/fonts.
Right now font-embedding relies on font description files in the Adobe
AFM ('Adobe Font Metrics') and PFB ('Printer Font Binary') format. The
former is an ASCII file and contains information about the characters
('glyphs') in the font such as height, width, bounding box info and
other 'metrics', while the latter is a binary file that describes the
shapes of the font. The reportlab/fonts directory contains the files
'DarkGardenMK.afm' and 'DarkGardenMK.pfb' that are used as an example
font.
In the following example locate the folder containing the test font and
register it for future use with the pdfmetrics module,
after which we can use it like any other standard font.
import os
import reportlab
from reportlab.pdfgen import canvas
folder = os.path.dirname(reportlab.__file__) + os.sep + 'fonts'
afmFile = os.path.join(folder, 'DarkGardenMK.afm')
pfbFile = os.path.join(folder, 'DarkGardenMK.pfb')
from reportlab.pdfbase import pdfmetrics
justFace = pdfmetrics.EmbeddedType1Face(afmFile, pfbFile)
faceName = 'DarkGardenMK' # pulled from AFM file
pdfmetrics.registerTypeFace(justFace)
justFont = pdfmetrics.Font('DarkGardenMK', faceName, 'WinAnsiEncoding')
pdfmetrics.registerFont(justFont)
# Create a canvas to draw on
output_file = "output.pdf"
c = canvas.Canvas(output_file)
# Use the font and write text
c.setFont('DarkGardenMK', 32)
c.drawString(10, 150, 'This should be in')
c.drawString(10, 100, 'DarkGardenMK')
# Save the canvas
c.save()
Note that the argument "WinAnsiEncoding" has nothing to do with the input; it's to say which set of characters within the font file will be active and available.
The font's facename comes from the AFM file's FontName field.
In the example above we knew the name in advance, but quite
often the names of font description files are pretty cryptic
and then you might want to retrieve the name from an AFM file
automatically.
When lacking a more sophisticated method you can use some
code as simple as this:
class FontNameNotFoundError(Exception):
pass
def findFontName(path):
"Extract a font name from an AFM file."
f = open(path)
found = 0
while not found:
line = f.readline()[:-1]
if not found and line[:16] == 'StartCharMetrics':
raise FontNameNotFoundError, path
if line[:8] == 'FontName':
fontName = line[9:]
found = 1
return fontName
In the DarkGardenMK example we explicitely specified the place of the font description files to be loaded. In general, you'll prefer to store your fonts in some canonic locations and make the embedding mechanism aware of them. Using the same configuration mechanism we've already seen at the beginning of this section we can indicate a default search path for Type-1 fonts.
Unfortunately, there is no reliable standard yet for such
locations (not even on the same platform) and, hence, you might
have to edit one of the files reportlab_settings.py or ~/.reportlab_settings to modify the
value of the T1SearchPath identifier to contain additional
directories. Our own recommendation is to use the reportlab/fonts
folder in development; and to have any needed fonts as packaged parts of
your application in any kind of controlled server deployment. This insulates
you from fonts being installed and uninstalled by other software or system
administrator.
Warnings about missing glyphs¶
If you specify an encoding, it is generally assumed that the font designer has provided all the needed glyphs. However, this is not always true. In the case of our example font, the letters of the alphabet are present, but many symbols and accents are missing. The default behaviour is for the font to print a 'notdef' character - typically a blob, dot or space - when passed a character it cannot draw. However, you can ask the library to warn you instead; the code below (executed before loading a font) will cause warnings to be generated for any glyphs not in the font when you register it.
import reportlab.rl_config
reportlab.rl_config.warnOnMissingFontGlyphs = 0
3.4 Standard Single-Byte Font Encodings¶
This section shows you the glyphs available in the common encodings.
The code chart below shows the characters in the WinAnsiEncoding.
This is the standard encoding on Windows and many Unix systems in America
and Western Europe. It is also knows as Code Page 1252, and is practically
identical to ISO-Latin-1 (it contains one or two extra characters). This
is the default encoding used by the Reportlab PDF Library. It was generated from
a standard routine in reportlab/lib, codecharts.py,
which can be used to display the contents of fonts. The index numbers
along the edges are in hex.
The code chart below shows the characters in the MacRomanEncoding.
as it sounds, this is the standard encoding on Macintosh computers in
America and Western Europe. As usual with non-unicode encodings, the first
128 code points (top 4 rows in this case) are the ASCII standard and agree
with the WinAnsi code chart above; but the bottom 4 rows differ.
These two encodings are available for the standard fonts (Helvetica, Times-Roman and Courier and their variants) and will be available for most commercial fonts including those from Adobe. However, some fonts contain non- text glyphs and the concept does not really apply. For example, ZapfDingbats and Symbol can each be treated as having their own encoding.
3.5 TrueType Font Support¶
Marius Gedminas ([email protected]) with the help of Viktorija Zaksiene ([email protected])
have contributed support for embedded TrueType fonts. TrueType fonts work in Unicode/UTF8
and are not limited to 256 characters.
We use reportlab.pdfbase.ttfonts.TTFont to create a true type
font object and register using reportlab.pdfbase.pdfmetrics.registerFont.
In pdfgen drawing directly to the canvas we can do
import os
import reportlab
from reportlab.pdfgen import canvas
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
pdfmetrics.registerFont(TTFont('Vera', 'Vera.ttf'))
pdfmetrics.registerFont(TTFont('VeraBd', 'VeraBd.ttf'))
pdfmetrics.registerFont(TTFont('VeraIt', 'VeraIt.ttf'))
pdfmetrics.registerFont(TTFont('VeraBI', 'VeraBI.ttf'))
# we know some glyphs are missing, suppress warnings
import reportlab.rl_config
reportlab.rl_config.warnOnMissingFontGlyphs = 0
# Create a canvas to draw on
output_file = "output.pdf"
c = canvas.Canvas(output_file)
c.setFont('Vera', 32)
c.drawString(10, 150, "Some text encoded in UTF-8")
c.drawString(10, 100, "In the Vera TT Font!")
# Save the canvas
c.save()
In the above example the true type font object is created using
TTFont(name,filename)
so that the ReportLab internal name is given by the first argument and the second argument
is a string(or file like object) denoting the font's TTF file. In Marius' original patch the filename
was supposed to be exactly correct, but we have modified things so that if the filename is relative
then a search for the corresponding file is done in the current directory and then in directories
specified by reportlab.rl_config.TTFSearchpath!
Before using the TT Fonts in Platypus we should add a mapping from the family name to the
individual font names that describe the behaviour under the <b> and <i> attributes.
from reportlab.pdfbase.pdfmetrics import registerFontFamily
registerFontFamily('Vera',normal='Vera',bold='VeraBd',italic='VeraIt',boldItalic='VeraBI')
If we only have a Vera regular font, no bold or italic then we must map all to the same internal fontname. and tags may now be used safely, but have no effect. After registering and mapping the Vera font as above we can use paragraph text like
<font name="Times-Roman" size="14">This is in Times-Roman</font>
<font name="Vera" color="magenta" size="14">and this is in magenta <b>Vera!</b></font>
TrueType Font Fallback (Experimental)¶
This is an experimental feature. When a TrueType font does not contain a glyph for a character, ReportLab can automatically fall back to a substitute font. This is useful for mixed-script documents (e.g. Latin + CJK).
The feature is disabled by default. Set the environment variable
REPORTLAB_FONT_FALLBACK=1 to enable it.
REPORTLAB_FONT_FALLBACK=1 python your_script.py
Configure fallback fonts by setting the substitutionFonts property on a TTFont:
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
font = TTFont('NotoSans', 'NotoSans-Regular.ttf')
fallback = TTFont('NotoSansCJK', 'NotoSansCJK-Regular.ttf')
pdfmetrics.registerFont(font)
pdfmetrics.registerFont(fallback)
font.substitutionFonts = [fallback]
# Now mixed-script text will automatically use the fallback font
c.setFont('NotoSans', 12)
c.drawString(100, 700, 'Hello 你好 World')
A convenience function registerFontWithFallback is also available:
font = pdfmetrics.registerFontWithFallback(
'NotoSans', 'NotoSans-Regular.ttf',
fallbackFonts=[TTFont('NotoSansCJK', 'NotoSansCJK-Regular.ttf')]
)
You can check if a font contains a specific glyph with hasGlyph():
font.hasGlyph('A') # True
font.hasGlyph(0x4F60) # False if font lacks CJK glyphs
3.6 Asian Font Support¶
The Reportlab PDF Library aims to expose full support for Asian fonts. PDF is the first really portable solution for Asian text handling. There are two main approaches for this: Adobe's Asian Language Packs, or TrueType fonts.
Asian Language Packs¶
This approach offers the best performance since nothing needs embedding in the PDF file; as with the standard fonts, everything is on the reader.
Adobe makes available add-ons for each main language. In Adobe Reader 6.0 and 7.0, you will be prompted to download and install these as soon as you try to open a document using them. In earlier versions, you would see an error message on opening an Asian document and had to know what to do.
Japanese, Traditional Chinese (Taiwan/Hong Kong), Simplified Chinese (mainland China) and Korean are all supported and our software knows about the following fonts:
-
chs= Chinese Simplified (mainland): 'STSong-Light' -
cht= Chinese Traditional (Taiwan): 'MSung-Light', 'MHei-Medium' -
kor= Korean: 'HYSMyeongJoStd-Medium','HYGothic-Medium' -
jpn= Japanese: 'HeiseiMin-W3', 'HeiseiKakuGo-W5'
Since many users will not have the font packs installed, we have included a rather grainy bitmap of some Japanese characters. We will discuss below what is needed to generate them.
An image should have appeared here.
Prior to Version 2.0, you had to specify one of many native encodings when registering a CID Font. In version 2.0 you should a new UnicodeCIDFont class.
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfgen import canvas
from reportlab.pdfbase.cidfonts import UnicodeCIDFont
pdfmetrics.registerFont(UnicodeCIDFont('HeiseiMin-W3'))
# Create a canvas to draw on
output_file = "output.pdf"
c = canvas.Canvas(output_file)
c.setFont('HeiseiMin-W3', 16)
# the two unicode characters below are "Tokyo"
msg = u'\u6771\u4EAC : Unicode font, unicode input'
c.drawString(100, 675, msg)
The old coding style with explicit encodings should still work, but is now only relevant if you need to construct vertical text. We aim to add more readable options for horizontal and vertical text to the UnicodeCIDFont constructor in future. The following four test scripts generate samples in the corresponding languages:
tests/test_multibyte_jpn.py
tests/test_multibyte_kor.py
tests/test_multibyte_chs.py
tests/test_multibyte_cht.py
In previous versions of the ReportLab PDF Library, we had to make use of Adobe's CMap files (located near Acrobat Reader if the Asian Language packs were installed). Now that we only have one encoding to deal with, the character width data is embedded in the package, and CMap files are not needed for generation. The CMap search path in rl_config.py is now deprecated and has no effect if you restrict yourself to UnicodeCIDFont.
TrueType fonts with Asian characters¶
This is the easy way to do it. No special handling at all is needed to work with Asian TrueType fonts. Windows users who have installed, for example, Japanese as an option in Control Panel, will have a font "msmincho.ttf" which can be used. However, be aware that it takes time to parse the fonts, and that quite large subsets may need to be embedded in your PDFs. We can also now parse files ending in .ttc, which are a slight variation of .ttf.
To Do¶
We expect to be developing this area of the package for some time.accept2dyear Here is an outline of the main priorities. We welcome help!
-
Ensure that we have accurate character metrics for all encodings in horizontal and vertical writing.
-
Add options to UnicodeCIDFont to allow vertical and proportional variants where the font permits it.
-
Improve the word wrapping code in paragraphs and allow vertical writing.
3.7 RenderPM tests¶
This may also be the best place to mention the test function of reportlab/graphics/renderPM.py,
which can be considered the cannonical place for tests which exercise renderPM (the "PixMap Renderer",
as opposed to renderPDF, renderPS or renderSVG).
If you run this from the command line, you should see lots of output like the following.
C:\code\reportlab\graphics>renderPM.py
wrote pmout\renderPM0.gif
wrote pmout\renderPM0.tif
wrote pmout\renderPM0.png
wrote pmout\renderPM0.jpg
wrote pmout\renderPM0.pct
...
wrote pmout\renderPM12.gif
wrote pmout\renderPM12.tif
wrote pmout\renderPM12.png
wrote pmout\renderPM12.jpg
wrote pmout\renderPM12.pct
wrote pmout\index.html
This runs a number of tests progressing from a "Hello World" test, through various tests of Lines; text strings in a number of sizes, fonts, colours and alignments; the basic shapes; translated and rotated groups; scaled coordinates; rotated strings; nested groups; anchoring and non-standard fonts.
It creates a subdirectory called pmout, writes the image files into it, and writes an
index.html page which makes it easy to refer to all the results.
The font-related tests which you may wish to look at are test #11 ('Text strings in a non-standard font') and test #12 ('Test Various Fonts').