OCRing Spanish American texts

(by Ulrike Henny)

Like my colleague José Calvo, I have spent much time during the last year collecting the digital texts I want to use in my PhD project. I am building a corpus of Spanish American novels of the 19th century: the texts date from 1830 to 1910, and I am including novels from Argentina, Mexico and Cuba.

I found fewer texts in HTML, clean plain text or as good ebooks than I had wished for. Because of that, I am now in a phase where I have to prepare the full texts myself with the help of OCR. Fortunately, there are many PDF and image files online that I can use, and I want to avoid doing the scanning on my own as much as possible.

The last chapter of the novel “El casamiento original” (1844) by the Cuban-Spanish author Felicia Auber Noya.

I feel that I have to improve my bibliographic search skills, because I have the impression that it is not so easy to access lesser-known Spanish American novels from Europe (even the physical copies through the library system).

If anybody out there knows about collections of Spanish and Spanish American novels that are available in digital format but not easily findable, and that we could use or reuse, we would definitely be very happy to hear about them.

And if you yourself are looking for these kinds of texts: the CLiGS group has already published the first part of the novels prepared in TEI as a “textbox” on GitHub, and you are free to use them.

Collecting metadata for Spanish Texts

(by José Calvo)

In the last post I wrote about adding one text to my collection, but I didn’t say anything about metadata. Because what good is it to have a text if we don’t know its author, title, year of publication… and much other information?

The standard teiHeader of the CLiGS collections looks like this in our textbox:

Screenshot from 2016-04-07 10:59:18

Information about the author, in this case Blasco Ibáñez, the title (La barraca), the copyright situation, the source of the text… And then I also added an abstract of the text.

After that come the keyword terms. This is our way of encoding information about the text that might be important for genre, our main interest. In the last pictures we see some information about the genre, what the history of literature has said about it, the narrative perspective, the gender of the protagonist, the kind of place where the action takes place…

In my dissertation I decided that metadata is going to play a central role. This is why I decided to model many other categories of metadata. These are the beta ones in keywords.csv, the file which controls these categories. I am encoding information like the name of the protagonist, their age, profession and social level; the name of the city, region, country and continent; how long the action takes, when it takes place, the kind of ending…

Screenshot from 2016-04-07 11:11:05

To record this information, I need to know about the text. This is why it is so important for me to have an abstract. epublibre offers us some information about the book, but it actually does not say much about the things I need to know. This is why I decided to go both to Wikipedia to copy the abstract and to the Manual de literatura española VII. Época del Realismo by Pedraza Jiménez and Rodríguez Cáceres to learn more about the content of the novel:


With these two sources of information I can now say a lot about the text and fill in many of the literary metadata. And because I haven’t read the text, I put the value medium in the cert attribute of the keywords element:

Screenshot from 2016-04-07 11:29:03

Now, of course, for this text I have different sources of information that tell more or less the same thing. But this is not always the case: sometimes we don’t have any source of information at all, sometimes the different sources contradict each other, and sometimes we are not sure whether they are reliable.

This is why a couple of weeks ago I started a Python script called assist_metadata. The idea is to help the editor of the metadata (normally me) to get some easy and useful information that is spread throughout the text. At the moment there are four functions available:

  • get_names()
  • get_full_names()
  • get_time()
  • get_places()

And all of them can be run at once using get_all(). So, what do they do?

get_places() gives you a list of the most frequent “places” in the text. And how does it get that if this information is not marked up? It simply searches for proper names (words that are not at the beginning of a sentence but start with an uppercase letter) following the Spanish preposition en. Quick and dirty, I know; this is why the script is called “assist_metadata” and not “modeling metadata”, and why I said that it searches for “places”. The result for our text:
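The real function lives in the assist_metadata script in our toolbox; as a rough idea, the heuristic described above could be sketched in a few lines of Python like this (a simplified reconstruction, not the actual code; the sample sentence is invented):

```python
import re
from collections import Counter

def get_places(text, top=10):
    """Guess "places": capitalized words right after the preposition 'en'.

    Quick and dirty: it misses places introduced by other prepositions
    and happily counts people who happen to stand after 'en'.
    """
    candidates = re.findall(r"\ben ([A-ZÁÉÍÓÚÑ][a-záéíóúñ]+)", text)
    return Counter(candidates).most_common(top)

sample = "Vivía en Ursaria. Luego volvió a Madrid, pero pensaba en Ursaria."
print(get_places(sample))  # [('Ursaria', 2)]
```

Note that Madrid is not counted here, because it follows the preposition a rather than en; that is exactly the kind of blind spot the “assist” in the name is meant to acknowledge.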

Screenshot from 2016-04-07 11:37:49

Hey! It gives us Ursaria as the most frequent “place” and that is what Wikipedia and our Manual also say!

get_time() gives us a list of the four-digit numbers found in the text, like 1610, 1985 or 2015. Of course it would also give us 0000, which we are quite sure is not a year:
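Again as a simplified sketch of what such a function could look like (not the actual code from the script; the sample text is invented):

```python
import re
from collections import Counter

def get_time(text, top=10):
    """Collect four-digit numbers that look like years (including some,
    like 0000, that clearly are not)."""
    years = re.findall(r"\b\d{4}\b", text)
    return Counter(years).most_common(top)

sample = "Corría el año 1834. En 1830 ya había estallado la guerra, y en 1834 seguía."
print(get_time(sample))  # [('1834', 2), ('1830', 1)]
```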

Screenshot from 2016-04-07 11:43:08

Our text does not contain any year in the body element, which is not a surprise in a text with demons and witches. If we use the same function on another text, like La nave de los locos by Baroja, the result is different:

Screenshot from 2016-04-07 11:53:20

So apparently the action takes place in the decade of the 1830s, and earlier years are also mentioned. In any case, it seems to take place in the 19th century, which is also true if you know the text!

The two remaining functions are called get_names() and get_full_names(). Both search for proper names: the first one makes a list of each of them; get_full_names() tries to capture the relationship between first and last names and similar things.
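In the same sketchy spirit (again, a simplified reconstruction of the idea, not the code in the script; the sample sentence is invented), both could look something like this:

```python
import re
from collections import Counter

def get_names(text, top=10):
    """Capitalized words that are not sentence-initial: candidate proper names.

    The lookbehind requires a lowercase letter, comma or semicolon plus a
    space before the word, which crudely rules out sentence starts.
    """
    names = re.findall(r"(?<=[a-záéíóúñ,;] )([A-ZÁÉÍÓÚÑ][a-záéíóúñ]+)", text)
    return Counter(names).most_common(top)

def get_full_names(text, top=10):
    """Two consecutive capitalized words: candidate 'first + last name' pairs."""
    pairs = re.findall(
        r"(?<=[a-záéíóúñ,;] )([A-ZÁÉÍÓÚÑ][a-záéíóúñ]+ [A-ZÁÉÍÓÚÑ][a-záéíóúñ]+)",
        text)
    return Counter(pairs).most_common(top)

sample = "Entonces habló Alejandro Magno. Más tarde, Alejandro calló."
print(get_names(sample))       # [('Alejandro', 2), ('Magno', 1)]
print(get_full_names(sample))  # [('Alejandro Magno', 1)]
```

Note how “Más” is correctly skipped as a sentence start, while all-uppercase forms like ATENAIDA would be missed, which matches the behavior described below.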

The result for get_names():

Screenshot from 2016-04-07 11:57:29

As we can see, it says that Alejandro is the most frequent name, so we would think that he might be the protagonist. There is a problem with the form of this theatrical novel: the names in the speaker tags are all uppercase, so ATENAIDA is not counted:

Screenshot from 2016-04-07 11:59:26

But anyway, Atenaida is the second name on the list, so the result is still useful, and now we know that Alejandro is a central character of the text.

Instead, the result for get_full_names() is:

Screenshot from 2016-04-07 11:58:27

The idea with these two functions is to try to get more information about the protagonist. In the future I would like to try to detect whether the same person is mentioned by first name, last name, title…

This is all very much work in progress. I would like to add another function that tries to guess the narrative perspective of the text by counting how many times something like ” -dije ” (“I said”) or ” -dijo ” (“he/she said”) appears; or to check whether words like muerte, morir, dolor, enfermedad or muerto appear in the last div of the body, to detect the type of ending the text has; or to look at the gender of the possessive pronouns to guess the gender of the protagonist. Anything to ease access to the metadata of the text.
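The narrative-perspective idea could look something like this. To be clear, this function does not exist in the script yet; it is a hypothetical sketch of the dialogue-marker counting just described, with an invented sample sentence:

```python
def guess_perspective(text):
    """Guess first- vs. third-person narration from speech verbs after dashes.

    Counts occurrences of '-dije ' ('I said') against '-dijo '
    ('he/she said') in dialogue markers. A sketch, not a real classifier.
    """
    first = text.count("-dije ")
    third = text.count("-dijo ")
    if first > third:
        return "first-person"
    if third > first:
        return "third-person"
    return "undecided"

print(guess_perspective("No lo sé -dijo Alejandro. Claro -dijo ella."))
# third-person
```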

And with these posts I have explained how I get the texts and metadata for my corpus.

Adding one title to my corpus of Spanish Novels

(by José Calvo)

It took me about four seconds to decide what I wanted to write about on the Day of DH 2016 to answer “what do digital humanists really do?”. In the last year, as part of the Computational Literary Genre Stylistics (CLiGS) group, I have spent a lot of time modeling and sampling my collection of Spanish novels between 1880 and 1939, part of which is already published. The group’s master format for textual data is XML-TEI. When I started, I found exactly zero of the texts that I needed published in XML-TEI. The main reason is that many projects like Cervantes Virtual work with texts in XML-TEI, but neither publish them in that format nor let other researchers work with them. I started out thinking that I would need around 150-200 texts; now I think I would like to get around 300.

What you can find on the web are Spanish texts in different formats like HTML, PDF or ePUB (the standard format for eBooks). This is what the distribution of import formats for my corpus looks like right now:
Screenshot from 2016-04-06 14:42:23

A lot of non-XML HTML, PDFs and eBooks! At CLiGS we created a very flexible way of dealing with these formats using regular expressions. Everything is published in our GitHub account, specifically in the html2tei folder, where you can find one script for each frequent source of texts that we have found.

So, first we need a text that we don’t have in XML-TEI but do have in another format. Here is the one chosen for today:

Screenshot from 2016-04-06 14:53:59

The best version of the text La razón de la sinrazón by Galdós that I have found on the Internet is on epublibre. Epublibre is an incredible project that publishes eBooks in ePUB format in Spanish. They have published more than 20,000 eBooks! You can download them via p2p.

Once I have it, I give it an ID: this one is going to be the 250th text of my corpus:
Screenshot from 2016-04-06 14:58:45

Now we have the eBook, but an eBook contains a lot of things besides the text that we don’t want, and each chapter might be in a separate XHTML file. So I convert the eBook with Calibre into HTMLZ; this option creates a simple zip file in which all the text of the eBook is in a single XHTML file, which I renamed “ne0250_Galdos_Razon.html”:

Screenshot from 2016-04-06 15:02:40

The XHTML of epublibre-Calibre looks like this:

Screenshot from 2016-04-06 15:41:54

Now that we have a file with the text we want, we move it to toolbox/legacy/html2tei/input. One level above, we find the file html_epublibre2tei_jc.py, the Python script that converts the XHTML of epublibre-Calibre into TEI. We open it, for example in Spyder, and let it run:

Screenshot from 2016-04-06 15:06:28

The Python script has created an XML file with the text in the output folder. It is important to understand that this script does not always work perfectly (sometimes it does!). Why not? Because every editor at epublibre uses a slightly different flavor of HTML: some use div class="poem" for poems, others div class="poema1", some div class="versos", or, as in this case, div class="tablacentro"…

Let’s see if the XML file is well formed:

Screenshot from 2016-04-06 15:10:16

Yes, it is! But we see that the markup is not TEI… Let’s check whether it is valid:

Screenshot from 2016-04-06 15:11:32

Lots of errors: 2,996. But don’t worry, this is why we have the Python script! We look at what markup we have and what we actually want, and we try to convert one into the other with regular expressions. After some time, I added the following lines:
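The actual lines are in the script on GitHub; in spirit, they are re.sub() calls of this kind (a simplified sketch: the class names come from this edition, but the TEI mapping shown here is an illustration, not the script’s real rules):

```python
import re

def epublibre_to_tei(html):
    """Map a couple of epublibre-Calibre HTML patterns onto TEI-like elements.

    Each editor's HTML flavor needs its own set of substitutions, which is
    why the script grows a little with every new text.
    """
    # This edition wraps verse in <div class="tablacentro">; turn it into
    # a TEI line group.
    html = re.sub(r'<div class="tablacentro">(.*?)</div>', r"<lg>\1</lg>",
                  html, flags=re.DOTALL)
    # Plain paragraphs keep their element name but lose the class attribute.
    html = re.sub(r'<p class="[^"]*">', "<p>", html)
    return html

print(epublibre_to_tei('<div class="tablacentro"><p class="salto">Verso</p></div>'))
# <lg><p>Verso</p></lg>
```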

Screenshot from 2016-04-06 15:25:35

After that, Oxygen is still not completely happy, but we have gone from 2,996 errors down to 14 with only seven lines of code:

Screenshot from 2016-04-06 15:26:47

I decided to fix the remaining errors by hand:

Screenshot from 2016-04-06 15:29:03

The fact that a file is valid doesn’t mean it is what we want. If you look at the structure on the left, the scenes and acts are not correctly nested. I decided to fix that by hand, since it is too risky to automate if you don’t know exactly how the editors structured it. After a few minutes, the structure looks much better:
Screenshot from 2016-04-06 15:34:47

I also put the names of the characters in a castList and converted the p elements into castItem elements with regular expressions:
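That p-to-castItem step is a single find-and-replace; expressed in Python it could look like this (the cast list content is invented, and in practice the same regex can be run directly in the editor’s search-and-replace dialog):

```python
import re

cast_html = """<castList>
<p>ALEJANDRO, emperador</p>
<p>ATENAIDA, esclava</p>
</castList>"""

# Each paragraph inside the castList becomes a castItem.
cast_tei = re.sub(r"<p>(.*?)</p>", r"<castItem>\1</castItem>", cast_html)
print(cast_tei)
```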

Screenshot from 2016-04-06 15:38:06

And of course I indent the text with Notepad++ > XML Tools > Pretty Print, and voilà!
Screenshot from 2016-04-06 15:40:01

I move the file to the corpus folder where the new text can join its colleagues:

Screenshot from 2016-04-06 15:39:22

And I upload the new corpus file and the changes to the Python script in the toolbox to GitHub, so we can all use the new version:

Screenshot from 2016-04-06 15:47:51

And that is pretty much what this digital humanist has done in the last year: collecting and converting texts into the right format, trying to automate the majority of the work, but also doing some tricky parts by hand.

But what about metadata? We will talk about that in the next post!