Improve Your Chinese: Use Java and iText 7 to Create Study Sheet PDFs

May 30, 2020

Use your programming skills to help improve your Chinese language skills.

In this article, I will show you how to write a simple, standalone Java application that converts your vocabulary lists into several PDFs that you can use in your Mandarin studies.

Although the application is written for studying Mandarin Chinese in Traditional characters, it’s simple enough to modify for Simplified characters, Korean, or any other language.

My GitHub repository: https://github.com/kevinleeham/learn-chinese. For this article, see the studySheetPdf directory.

Sample Study Sheets

Here are samples of the four types of study sheets the application will produce.

iText 7 PDF Library for Java

These days, many applications can easily produce PDFs. On a Mac, you can even “print” your document to a PDF. As a programmer, however, it’s very cool to be able to produce a PDF directly from your Java application. We will use the iText 7 library to create our PDFs. If your language of choice is C#, there’s also a C# version of the library.

Traditional Chinese Font

One of my favorite traditional Chinese fonts is Arphic’s KaitiM Big5. It’s both free and beautiful (I’m a font geek). I wasn’t able to find a download page on Arphic’s website, but I did find the font file and the associated license agreement in the repositories of several major open-source projects. For simplicity, you can download the font and license files from my repository: https://github.com/kevinleeham/learn-chinese/tree/master/studySheetPdf/src/main/resources/fonts/chinese/arphic

Maven POM file with iText 7 Dependencies

See below for the Project Object Model (POM) XML file. The file declares dependencies on four iText artifacts. As of May 2020, the latest version of iText is 7.1.11, so that is the version I set in the pom.xml file. To always use the latest version, you can specify the version as RELEASE, although pinning an explicit version keeps your builds reproducible.

Let’s Create our First PDF

Our goal is to generate a simple PDF with two lines of text that say “Hello everybody!” (one in English and the other in Chinese).

See below for the complete Java source code. It’s also available from my GitHub repository. The code has a decent amount of comments. I will highlight a few of the interesting bits here.

To ensure that the PDF renders the text using our preferred fonts, we tell iText to embed them. To do this, we simply pass a value of true when calling PdfFontFactory.createFont (lines 30 and 31).

In addition to embedding the Chinese font, I am also embedding an English font named TidyHand. Although PDFs support a number of built-in (standard) fonts, I feel TidyHand adds some casualness to my rigid study routines.

Lines 34 and 35 create PdfDocument and Document objects. In this sample application, we’re telling the Document object that our page size is US Letter with a landscape orientation. The PageSize class defines over twenty ANSI and ISO standard paper sizes. With the Document object, you can also optionally specify page margins. If you don’t specify any, Document uses default values.

Now that we have a Document object, we can begin adding content. On lines 38 and 39, we add paragraphs to the Document and specify the exact font settings (font, point size, and color) to use for the text.

Lastly, we close the Document (line 42). Pretty straightforward, right?
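If you don’t have the repository open, the steps described above can be sketched roughly as follows. This is a minimal sketch rather than the repository’s actual source: the class and method names, the font sizes and colors, and the Chinese wording 大家好！ are my own assumptions, and the font file paths are passed on the command line because the exact filenames aren’t given here. The three-argument createFont call is the one available in the iText 7.1.x line used in this article.

```java
import com.itextpdf.io.font.PdfEncodings;
import com.itextpdf.kernel.colors.ColorConstants;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.geom.PageSize;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Paragraph;
import java.io.IOException;

// Hypothetical class name; not taken from the repository.
class HelloEverybodySketch {

    /** Loads a font file and asks iText to embed it (third argument = true). */
    static PdfFont loadEmbeddedFont(String fontPath) throws IOException {
        return PdfFontFactory.createFont(fontPath, PdfEncodings.IDENTITY_H, true);
    }

    /** Writes a two-line PDF on a landscape US Letter page using the given fonts. */
    static void writeHelloPdf(String outputPath, PdfFont englishFont, PdfFont chineseFont)
            throws IOException {
        PdfDocument pdf = new PdfDocument(new PdfWriter(outputPath));
        // rotate() turns the portrait Letter size into landscape.
        Document document = new Document(pdf, PageSize.LETTER.rotate());

        // Sizes and colors below are illustrative choices.
        document.add(new Paragraph("Hello everybody!")
                .setFont(englishFont)
                .setFontSize(36)
                .setFontColor(ColorConstants.BLUE));
        document.add(new Paragraph("大家好！")
                .setFont(chineseFont)
                .setFontSize(48)
                .setFontColor(ColorConstants.BLACK));

        // Closing the Document also closes the underlying PdfDocument and writer.
        document.close();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: <output.pdf> <englishFontFile> <chineseFontFile>");
            return;
        }
        writeHelloPdf(args[0], loadEmbeddedFont(args[1]), loadEmbeddedFont(args[2]));
    }
}
```

Keeping font loading separate from page building makes it easy to swap in a different font file without touching the layout code.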

After running the application, you should see a file named helloEverybody.pdf and it should look exactly like the image below.

Now that we’ve proven our application can successfully produce a PDF, let’s start building our Study Sheet application. We’ll start with our vocabulary list files.

Vocabulary Lists

Our application uses two vocabulary list files as input: one that contains single characters and one that contains multiple-character words and/or phrases.

Methods for Reading the Vocabulary List Files

Class: VocabEntry

As the application reads each file, it stores the data in a collection of VocabEntry objects.
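To make this concrete, here is one way such a class might look. It is only a sketch: the field names and the tab-separated line format (Chinese, Zhuyin, Pinyin, definition) are assumptions for illustration, and the real class and file layout in the repository may differ.

```java
import java.util.Objects;

/** One pronunciation/definition of a character, word, or phrase (field names are assumptions). */
final class VocabEntry {
    final String chinese;
    final String zhuyin;
    final String pinyin;
    final String definition;

    VocabEntry(String chinese, String zhuyin, String pinyin, String definition) {
        this.chinese = Objects.requireNonNull(chinese);
        this.zhuyin = Objects.requireNonNull(zhuyin);
        this.pinyin = Objects.requireNonNull(pinyin);
        this.definition = Objects.requireNonNull(definition);
    }

    /** Parses one line in the assumed format: chinese TAB zhuyin TAB pinyin TAB definition. */
    static VocabEntry parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 4) {
            throw new IllegalArgumentException("Expected 4 tab-separated fields: " + line);
        }
        return new VocabEntry(fields[0].trim(), fields[1].trim(),
                fields[2].trim(), fields[3].trim());
    }

    @Override
    public String toString() {
        return chinese + " [" + zhuyin + " / " + pinyin + "] " + definition;
    }

    public static void main(String[] args) {
        System.out.println(VocabEntry.parse("會\tㄏㄨㄟˋ\thui4\twill; can; able to"));
    }
}
```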

Method: loadSingleCharacterVocabFile

Chinese has many characters that have multiple pronunciations and meanings. For example, the character 會 has two very different pronunciations along with very different meanings.

The version that is pronounced as ㄏㄨㄟˋ (or “hui4” in Pinyin) is extremely common and translates to “will”, “can”, and “able to”. The other version is pronounced as ㄎㄨㄞˋ (or “kuai4” in Pinyin) and means “accounting” or “to balance an account”.

Because of these multiple pronunciations and meanings, I chose a Map collection where the key is the Chinese character and the value is a List of VocabEntry objects. Each VocabEntry object represents one pronunciation/definition. So, when the application looks up the character 會, the value in the Map is a List containing two VocabEntry objects.
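The Map-of-Lists idea can be sketched as below. This is an illustrative stand-in for loadSingleCharacterVocabFile, not its actual code: it works on lines already in memory, and the nested VocabEntry class and the three-field tab-separated format are simplifying assumptions so the block compiles on its own. In a real loader, the lines would more likely come from something like Files.readAllLines with a UTF-8 charset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class name; a sketch of the data structure only.
class SingleCharacterVocabSketch {

    /** Simplified entry (assumed fields) so this example is self-contained. */
    static final class VocabEntry {
        final String chinese;
        final String pinyin;
        final String definition;

        VocabEntry(String chinese, String pinyin, String definition) {
            this.chinese = chinese;
            this.pinyin = pinyin;
            this.definition = definition;
        }
    }

    /** Groups entries by character; a character with several readings gets several entries. */
    static Map<String, List<VocabEntry>> groupByCharacter(List<String> lines) {
        // LinkedHashMap keeps characters in the order they first appear in the list.
        Map<String, List<VocabEntry>> byCharacter = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split("\t", -1);
            if (fields.length != 3) {
                throw new IllegalArgumentException("Expected 3 tab-separated fields: " + line);
            }
            VocabEntry entry = new VocabEntry(fields[0], fields[1], fields[2]);
            // Create the list on first sight of the character, then append.
            byCharacter.computeIfAbsent(entry.chinese, key -> new ArrayList<>()).add(entry);
        }
        return byCharacter;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "會\thui4\twill; can; able to",
                "會\tkuai4\taccounting",
                "人\tren2\tperson");
        Map<String, List<VocabEntry>> vocab = groupByCharacter(lines);
        for (VocabEntry entry : vocab.get("會")) {
            System.out.println(entry.chinese + " " + entry.pinyin + ": " + entry.definition);
        }
    }
}
```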

Method: loadMultiCharacterVocabFile

Since multi-character words and phrases generally have only one pronunciation, storing them is much simpler than storing single characters. We simply keep these entries in a List of VocabEntry objects.

Study Sheet #1: Single Characters (minimalist)

This study sheet is my favorite because after learning hundreds of Chinese characters, I need to be able to quickly review them. Being able to see them on just a page or two is great.

As you can see from the screenshot, each character is separated from the next with just enough whitespace. The amount of code to produce this sheet is surprisingly small.

Study Sheet #2: Multiple Characters (minimalist)

This study sheet is the minimalist version for multi-character words and phrases. It groups them by number of characters so that they’re easier to scan.
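The grouping step itself needs no PDF code. Here is a small sketch of one way to do it, assuming the phrases are available as plain strings; the class and method names are hypothetical. It counts Unicode code points rather than char values so that characters outside the Basic Multilingual Plane still count as one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical class name; illustrates grouping only, not PDF layout.
class GroupByLengthSketch {

    /** Groups phrases by their number of characters, shortest group first. */
    static SortedMap<Integer, List<String>> groupByCharacterCount(List<String> phrases) {
        SortedMap<Integer, List<String>> groups = new TreeMap<>();
        for (String phrase : phrases) {
            // codePointCount treats a surrogate pair as a single character.
            int count = phrase.codePointCount(0, phrase.length());
            groups.computeIfAbsent(count, key -> new ArrayList<>()).add(phrase);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> phrases = Arrays.asList("你好", "對不起", "謝謝");
        for (Map.Entry<Integer, List<String>> group : groupByCharacterCount(phrases).entrySet()) {
            System.out.println(group.getKey() + " characters: " + group.getValue());
        }
    }
}
```

Each group can then be written to the page as its own block, with the TreeMap guaranteeing that two-character words come before three-character ones.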

Study Sheet #3: Single Characters (full/detailed)

This study sheet is especially helpful when learning new characters or when you want to periodically review them. Creating the table is straightforward, and because we register header cells on the table, iText automatically repeats the header row on each page when the table spans multiple pages.
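A table with a repeating header can be sketched as follows. Treat it as an approximation of the idea rather than the repository’s code: the class name, the three column titles, the font size, and the row layout (character, Pinyin, definition) are my assumptions, and the output and font paths come from the command line.

```java
import com.itextpdf.io.font.PdfEncodings;
import com.itextpdf.kernel.font.PdfFont;
import com.itextpdf.kernel.font.PdfFontFactory;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.layout.Document;
import com.itextpdf.layout.element.Cell;
import com.itextpdf.layout.element.Paragraph;
import com.itextpdf.layout.element.Table;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class name; column titles and row layout are assumptions.
class DetailedSheetSketch {

    /** Builds a three-column table; each row is {character, pinyin, definition}. */
    static Table buildTable(List<String[]> rows, PdfFont chineseFont) {
        Table table = new Table(3).useAllAvailableWidth();

        // Cells added with addHeaderCell form the header, which iText
        // repeats at the top of every page the table flows onto.
        table.addHeaderCell("Character");
        table.addHeaderCell("Pinyin");
        table.addHeaderCell("Definition");

        for (String[] row : rows) {
            // Only the Chinese column needs the Chinese font.
            table.addCell(new Cell().add(
                    new Paragraph(row[0]).setFont(chineseFont).setFontSize(24)));
            table.addCell(row[1]);
            table.addCell(row[2]);
        }
        return table;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: <output.pdf> <chineseFontFile>");
            return;
        }
        PdfFont chineseFont = PdfFontFactory.createFont(args[1], PdfEncodings.IDENTITY_H, true);

        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"會", "hui4", "will; can; able to"});
        rows.add(new String[] {"會", "kuai4", "accounting"});

        Document document = new Document(new PdfDocument(new PdfWriter(args[0])));
        document.add(buildTable(rows, chineseFont));
        document.close();
    }
}
```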

Study Sheet #4: Multiple Characters (full/detailed)

Our final study sheet is the full/detailed view for multi-character words and phrases.