(Article Version 2 – 20th February 2014)
In this paper I discuss some simple textual analysis of a blazon corpus. In 1894 James Parker published a book called “A Glossary of Terms Used in Heraldry”. Between 2000 and 2004 the complete text of this book was converted into HTML by “Saitou” and Jim Trigg. In 2011 I updated the HTML code to current standards. Part of this process involved locating and modifying the examples of heraldic language in the text, and during this work an extract was made of the complete set of examples.
This set comprised 4,611 entries and was then examined by hand to locate just the example blazons. Removing crests, badges, duplicate entries and other non-blazon items left 4,184 blazons in English and 332 in French. The complete set is available for download in various formats by following the links at the end of this article. From here on I will refer to this set of blazons as the “Parker Corpus”.
Before we subject this corpus to analysis we must consider whether it constitutes a representative sample of the population of all blazons. Parker chose each blazon individually to illustrate a particular feature of the language of blazonry, and almost every dictionary entry has at least one blazon example. The dictionary contains over 2,000 entries, including some quite rare terms, so some of the blazons will inevitably contain those rare terms too. A randomly chosen subset of “all blazons” would not be expected to cover as broad a range of terms as the Parker corpus does. However, I believe that the sheer number of blazons is sufficient to provide source material for textual analysis.
The first such analysis that we will carry out is a very simple word frequency investigation. The process we will use is as follows:
For each blazon in the corpus:
- Remove all text surrounded by square brackets (Parker used this convention to note particular features; the text in these brackets does not form part of the blazon itself)
- Remove all punctuation marks except hyphens and apostrophes
- Split the blazon into individual words (splitting at whitespace) and convert all words to lowercase
- Add each word to a frequency table covering the whole corpus
Note that no attempt has been made here to group singular words with their plurals and thus “lion” and “lions” have been counted separately.
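The steps above can be sketched in a few lines of Python. This is only an illustration of the process as described, not the code actually used for the analysis. The function names are my own, and the sample blazons are invented for the example rather than taken from the Parker corpus. I have also assumed that both straight and curly apostrophes should be kept.

```python
import re
from collections import Counter

# Text in square brackets is Parker's commentary, not part of the blazon.
BRACKETED = re.compile(r"\[[^\]]*\]")
# Anything that is not a letter/digit, whitespace, apostrophe or hyphen
# is treated as punctuation (assumption: curly apostrophes are kept too).
PUNCTUATION = re.compile(r"[^\w\s'’\-]")

def tokenize(blazon):
    """Return the lowercase words of one blazon."""
    text = BRACKETED.sub(" ", blazon)   # drop bracketed notes
    text = PUNCTUATION.sub("", text)    # drop punctuation marks
    return text.lower().split()         # split at whitespace

def word_frequencies(blazons):
    """Build one frequency table for the whole corpus."""
    counts = Counter()
    for blazon in blazons:
        counts.update(tokenize(blazon))
    return counts

# Invented sample blazons, for illustration only.
sample = [
    "Argent, a lion rampant gules [armed azure].",
    "Or, three lions passant azure.",
]
print(tokenize(sample[0]))  # ['argent', 'a', 'lion', 'rampant', 'gules']
print(word_frequencies(sample).most_common(3))
```

As in the analysis itself, “lion” and “lions” end up as separate entries in the table.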
This analysis shows that the Parker corpus contains 3,503 different words, of which 1,702 (nearly half) appear only once. It is this second number that I would expect to be much lower in a truly representative sample; it reflects Parker's need to give example blazons containing obscure terms.
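Both figures can be read straight off a frequency table. The sketch below assumes the table is held as a Python `Counter` mapping each word to its count. The function name and the tiny sample are hypothetical.

```python
from collections import Counter

def vocabulary_summary(counts):
    """Return (number of different words, number of words seen only once)."""
    distinct = len(counts)
    once = sum(1 for n in counts.values() if n == 1)
    return distinct, once

# Invented sample text, for illustration only.
sample_counts = Counter("argent a lion rampant gules or three lions gules".split())
distinct, once = vocabulary_summary(sample_counts)
print(distinct, once)  # 8 7
```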
The resulting frequency data can also be downloaded from the links at the bottom of the article. The 100 most frequent items are shown in the table below, with the number of appearances in brackets. I have also reinstated some punctuation marks to aid understanding.
[Table: the 100 most frequent words with their counts, arranged in ten columns by rank, from 1 – 10 to 91 – 100. Only the column headings survive in this copy.]
There are some observations we can draw from this. If there is more than one charge, it is most likely that there will be three of them! The much higher frequency of the word “second” than “first” is, I think, explained by the use of the term “of the field” rather than “of the first”. Adding the frequency of “field” to that of “first” does indeed bring it almost equal to “second”.
We can also manually inspect this list and extract words from within particular categories. For example, the most frequently occurring tinctures are as follows:
- argent (3005)
- or (2114)
- gules (1936)
- sable (1467)
- azure (1405)
- proper (671)
- vert (460)
- ermine (291)
- counterchanged (110)
- gold (57)
A further refinement of this investigation would be to move the French blazons to a separate corpus, as some of the terms in the list above are from the French. However, I don’t believe this would significantly change the rankings.
We can carry out a similar exercise for ordinaries, shown here:
- chief (705)
- chevron (578)
- fesse (529)
- cross (378)
- bend (319)
- pale (298)
- base (289)
- saltire (197)
- bars (126)
- bordure (106)
- canton (94)
- barry (89)
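The category lists above were picked out by hand, but the same extraction can be sketched in code once a list of category terms has been drawn up. The example below is illustrative only: the function name is my own, the set of tinctures is a hand-made assumption, and the counts are just a few of the figures quoted above.

```python
def rank_category(counts, category_terms):
    """Return (word, count) pairs for the given terms, most frequent first."""
    present = [(w, counts.get(w, 0)) for w in category_terms]
    present = [pair for pair in present if pair[1] > 0]
    return sorted(present, key=lambda pair: pair[1], reverse=True)

# A few counts quoted in this article; a real run would use the full table.
counts = {"argent": 3005, "or": 2114, "gules": 1936, "chief": 705, "chevron": 578}

# Hand-made list of tincture terms (an assumption, not an exhaustive list).
TINCTURES = {"argent", "or", "gules", "sable", "azure", "vert"}

for word, n in rank_category(counts, TINCTURES):
    print(f"{word} ({n})")
```

Terms that are in the category list but absent from the table are simply skipped, so the same list can be reused against the English and French blazons separately.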
Unfortunately, with this rather crude analysis we can’t do the same for charges, as these are frequently multi-word terms, although we can note the relatively high frequencies of lions, assorted heads, trees and mullets. Similarly, most divisions and treatments consist of multiple words, and this analysis does not group them appropriately.
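One crude way to see the multi-word problem is to count adjacent pairs of words instead of single words. This was not part of the analysis reported here, and it is no substitute for properly parsing the blazon. It is only a sketch, with an invented function name and invented sample data.

```python
from collections import Counter

def bigram_frequencies(token_lists):
    """Count adjacent word pairs within each blazon (never across blazons)."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Invented, already-tokenized blazons, for illustration only.
tokens = [
    ["argent", "a", "lion", "rampant", "gules"],
    ["or", "a", "lion", "rampant", "azure"],
]
pairs = bigram_frequencies(tokens)
print(pairs[("lion", "rampant")])  # 2
```

Pair counts still cannot tell a charge from its tincture or attitude, which is why a proper parser is needed for this kind of question.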
Another of my projects, the DrawShield Suite, is currently being rewritten, and the first stage of this, the blazon “parser”, is almost complete. This parser is able to read a complete blazon and break it down into a hierarchy of component parts. It will make possible a much more sophisticated and detailed analysis of all blazon features, and a future paper will report on this.
I hope this paper has proved interesting and I look forward to presenting results of my further investigations.
Please feel free to download these resources and use them as required in your own work.