Finding words in context

The corpus I mainly work on, the Oxford Corpus of Old Japanese, is tagged in XML, following the conventions of the Text Encoding Initiative (TEI). In the first stage, texts were romanized, with each item marked as logographically or phonographically written, and then tagged for morphological and syntactic information. This allows us to, for example, search for a particular lexical item in any syntactic environment, and include only those items which were recorded phonographically.
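To make that kind of search concrete, here is a minimal sketch in Python. The element and attribute names (s, w, lemma, script) and the sample sentences are my own illustrative assumptions, not the actual OCOJ markup, so treat this as the shape of the query rather than the real thing.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names (s, w) and attributes (lemma, script)
# are illustrative assumptions, not the real OCOJ/TEI schema.
SAMPLE = """
<text>
  <s n="1">
    <w lemma="pito" script="phon">pito</w>
    <w lemma="mi-" script="phon">mi</w>
  </s>
  <s n="2">
    <w lemma="mi-" script="log">MI</w>
  </s>
  <s n="3">
    <w lemma="ime" script="phon">ime</w>
    <w lemma="mi-" script="phon">mi</w>
  </s>
</text>
"""

def find_phonographic(root, lemma):
    """Return (sentence number, surface form) for each token of `lemma`
    that is marked as phonographically written."""
    hits = []
    for sentence in root.iter("s"):
        for word in sentence.iter("w"):
            if word.get("lemma") == lemma and word.get("script") == "phon":
                hits.append((sentence.get("n"), word.text))
    return hits

root = ET.fromstring(SAMPLE)
hits = find_phonographic(root, "mi-")
print(hits)  # the logographic token in sentence 2 is skipped
```

A real query would also filter on the syntactic tags, but the principle is the same: every condition is just another attribute test on the tagged element.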

In an earlier post, I noted that the verb mi- ‘see’ was the most frequently attested word; it’s attested 1358 times in a smallish corpus of only 111,000 words. But what does that really tell us?
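For scale, the raw count can be turned into a relative frequency. This is only a back-of-the-envelope sketch: it takes the two figures above at face value, and the corpus size is a round number, so the result is approximate.

```python
# Figures quoted above; the corpus size is a rounded figure.
attestations = 1358
corpus_size = 111_000

# Normalize to occurrences per thousand words so the count can be
# compared across corpora of different sizes.
per_thousand = attestations / corpus_size * 1000
print(round(per_thousand, 1))  # 12.2
```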

In addition to marking up texts, one of my projects was to create a Lexicon which contained various information about each lexical item. The online version of the lexicon has links to various stored searches. Here’s the Lexicon entry for mi-:


The link for statistics gives the following information:


The most common inflection, ‘stem’, tells us that this verb is usually followed by either an auxiliary or another verb. We can get a sense of that by clicking the attestations link from the Lexicon entry. (I’m not going to present that here, as it’s rather long.)

The collocations link shows nouns that head noun phrases marked as subjects or objects of the verb, as well as nouns that are modified by the verb in a noun-modifying construction. I’ve shortened it to just show nouns attested at least 10 times.
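The counting behind a list like this can be sketched in a few lines of Python. The input pairs below are invented toy data, not real OCOJ counts, and the function name and threshold parameter are my own; the stored search itself runs against the tagged XML.

```python
from collections import Counter

# Toy input, invented for illustration only (NOT real OCOJ counts).
# Each pair is (noun, relation of its noun phrase to the verb).
observations = [
    ("pito", "object"), ("pito", "subject"), ("pito", "modified"),
    ("me", "object"), ("me", "object"),
    ("ime", "object"),
]

def collocates(pairs, min_count):
    """Count how often each noun occurs with the verb, in any relation,
    and keep only nouns attested at least `min_count` times."""
    counts = Counter(noun for noun, _relation in pairs)
    return [(noun, n) for noun, n in counts.most_common() if n >= min_count]

# The post uses a threshold of 10; the toy data is small, so use 2 here.
result = collocates(observations, min_count=2)
print(result)  # [('pito', 3), ('me', 2)]
```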


Looking at a list like this, the first thing I think is that I should have also automated the definitions for the nouns, but I’ll do that another day.

Except for ime ‘dream’, the first 7 nouns refer to people (including me ‘eye’, which is used metaphorically with mi- to refer to the person you [want to] see). So this verb occurs more often in the OCOJ with humans than with inanimate objects.

This gives us more of an idea of how mi- is used than looking just at its frequency.

Not that collocations show the whole picture either. The verb abur- ‘broil’ occurs only twice in the OCOJ, both times with pito ‘person’, referring to the person doing the broiling, not a person being broiled.

My day so far

The purpose (as I see it) of the Day of DH is to document what it is DH people do during their day.

My day so far has been spent making sure I have enough coffee in my system to function, taking care of my pet degus, then coming in to work and having a meeting to discuss “exam stuff”, i.e., making sure the exams we give at the end of the year are set up fairly and accurately. I’ve answered the urgent e-mails in my inbox, and need to spend time on the less-urgent-but-can’t-be-ignored e-mails after this post. I also have a huge stack of marking waiting for my attention.

All this to say that while I am a Digital Humanist, much of my day is spent doing completely unrelated things.

Does every word cloud have a silver lining?

Here’s a word cloud of the Top 25 Verbs in the Oxford Corpus of Old Japanese (OCOJ). I’m really just posting this because there’s an awful lot of white space on my blog, and that needs to change.

I’m sure that before starting this project, I never would have guessed that the verb mi- ‘see’ was the most frequent verb used in Old Japanese (at least in our data).

And while I find word clouds a fun and easy way to view data and to quickly see relative frequencies between words, I don’t think they really tell us much about a language or culture without also investigating the contexts where these words appear.

More on that later.


Top 25 Verbs in the OCOJ
