Dicteator

From Interglider
Revision as of 16:46, 27 October 2015 by Misha

Putting dictionaries into Wiktionaries

Introduction

When I started working as a programmer on the project “Wiktionary Meets Matica Srpska”, I knew the task would be challenging, because parsing dictionaries that contain thousands and thousands of entries and uploading them to Wiktionary brings several difficulties. The first comes from the material itself: it was made by humans, who are prone to errors, and because it has to be processed computationally, this puts us in the territory where the computer and human worlds collide. The second difficulty is the sheer scale of the project and of the dictionaries we are working with. The last one is making the software universal, so that anybody with a little programming knowledge can use it to parse and upload dictionaries of their choice.

What I am presenting here is the first working version of the software, which provides a solid base to build upon. Currently, the software is divided into three separate modules that are not directly connected. The first module parses and tokenizes the dictionary data, the second creates Wiktionary pages out of the collected data, and the third works with Wiktionary directly via Pywikibot (uploading data, querying what is already on Wiktionary and so on). I will talk briefly about every module and guide you through their methods and functions, but bear in mind that this is a work in progress, so everything here is subject to change. The modules themselves will also become more universal, interconnected and reusable, to make automating the process of parsing and uploading dictionaries to Wiktionary faster, easier and less painful for us humans.

In general, the process of working with a single dictionary looks like this: starting from a doc file containing the content of the dictionary, the entries and their elements have to be extracted, manipulated and formatted to the Wiktionary standard. The first dictionary, and the only one finished at this time, is “Rečnik sinonima” by Pavle Ćosić, a synonym dictionary of the Serbian language containing 13569 entries. This was our pilot dictionary, meant simply to try things out before we embark upon larger dictionaries with more complex entry structures.

I hadn’t decided in advance which tools to use for this besides Python. Why Python? It is a logical choice because of its rich set of libraries for efficient text processing, and especially because Python 3 has good Unicode support: we need to work with different scripts, such as Serbian Cyrillic and Latin. On top of that, there is Pywikibot, a Python library written specifically to help automate tasks on wikis such as Wikipedia and Wiktionary, which is a very handy tool. With Python I used the Anaconda distribution and Spyder, the IDE that comes as part of the package. Sublime Text was my editor of choice for checking the output of my scripts and for fixing and prettifying the HTML, as it is fast and can search and replace text using regular expressions, which saves a lot of time. Finally, there is AbiWord, which I used to convert the original dictionary data from a doc file to HTML.

The software package I made is called Dictator. Before I continue, below is a table listing all the files included in it. I will frequently refer to these files, and you can use the list to find the code in each of them.

Parsing a dictionary is a three-step process, during which it may be necessary to return to a previous phase to correct errors, i.e. to fix the input. Dictator is currently organized into three separate groups of scripts with no direct connection between them, which may change in the future. For illustration, see the graph below, and while reading the rest of the text, check in parallel the code, which is linked from GitHub.

Overall schema of the entire process


Components of Dictator software
Parsing tools Entry making tools Additional tools
parser.py sinonimi_reader.py igl_count_pages.py
load_pickle.py get_all_serbian_pages.py
igl_addition.py sr_lat2cyr2lat.py

Getting the data

The process of getting the data and tokenizing it has several steps. The first step was to bring the data into a format from which it can be easily extracted for further processing. Sure, it’s not hard to extract text from a doc file using Python, but it is harder to make a program recognize separate entries and their elements. What if there were software that could recognize entries without demanding too much human guidance? It turns out AbiWord can do this: it accurately recognized separate entries and, less accurately, the separate elements within them. The result is a single HTML file, which can be easily searched because everything is separated with tags. There are Python modules made for that purpose, like BeautifulSoup, which I used to read and extract the data. The downside was that the elements within entries got mixed up to some extent, and only the title and the word type could be extracted reliably. For the rest of the entry we had to devise other methods to get the data accurately. Still, with the title and word type already available, it is a good start and the task is already somewhat less complicated.

For illustration, here’s what an entry in Rečnik sinonima looks like after conversion to HTML:

 
    <p class="body_text" dir="ltr" style="text-align:left">
       <span style="font-size:11pt;font-family:'Arial Black'" lang="en-US">
       aktivirati
       </span>
       <span style="font-style:italic;font-size:9pt;font-family:'Courier New'" lang="en-US">
       svrš. prel. 
       </span>
       <span style="font-size:10pt;font-family:'Arial Unicode MS'" lang="en-US">
       ❶ 
       </span>
       <span style="font-size:10pt;font-family:'Garamond'" lang="en-US">
       (~ uređaj) upaliti, aktivirati, pokrenuti, staviti u pogon/pokret
       </span>
       <span style="font-size:10pt;font-family:'Arial Unicode MS'" lang="en-US">
       ❷ 
       </span>
       <span style="font-style:italic;font-size:10pt;font-family:'Garamond'"  lang="en-US">
       v.
       </span>
       <span style="font-size:10pt;font-family:'Garamond'" lang="en-US">
       angažovati 1.
       </span>
    </p>
  

Seen in a browser, it looks like this:

aktivirati svrš. prel. (~ uređaj) upaliti, aktivirati, pokrenuti, staviti u pogon/pokret v. angažovati 1.

Note that the entire entry is delimited by HTML paragraph tags and everything within it is separated by span tags. In the series of span tags, the first pair encloses the name of the entry and the second the type of the word. The next one, with “❶”, and a later one with “❷”, mark the beginning of separate meanings of the entry. Everything after such a marker is part of that meaning, including words (synonyms), tags (like “v.”), and references to other entries with a similar meaning (“angažovati 1.”, where the number says which submeaning of the other entry is similar to the current one). There are also some extra symbols that we need to filter out, like “ ”.
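The span structure described above can be read with very little code. The following is only a minimal sketch, not the code from parser.py: the author used BeautifulSoup, while this version uses the standard library's html.parser so that it runs without extra packages. The names SpanCollector and read_entries and the shortened sample markup are assumptions made for illustration.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of every <span> inside each <p class="body_text">."""

    def __init__(self):
        super().__init__()
        self.entries = []        # one list of span texts per paragraph
        self._in_entry = False
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        if tag == "p" and dict(attrs).get("class") == "body_text":
            self._in_entry = True
            self.entries.append([])
        elif tag == "span" and self._in_entry:
            self._in_span = True
            self.entries[-1].append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_entry = False
        elif tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_entry and self._in_span:
            self.entries[-1][-1] += data


def read_entries(html):
    """Return title, word type and the still unparsed rest of every entry."""
    parser = SpanCollector()
    parser.feed(html)
    parser.close()
    result = []
    for spans in parser.entries:
        # Collapse whitespace (including non-breaking spaces), drop empty spans.
        texts = [" ".join(s.split()) for s in spans]
        texts = [t for t in texts if t]
        if len(texts) < 2:
            continue  # not a complete entry
        result.append({
            "title": texts[0],                # first span: name of the entry
            "word_type": texts[1].split(),    # second span: type of the word
            "rest": " ".join(texts[2:]),      # everything else, parsed later
        })
    return result


SAMPLE = """
<p class="body_text" dir="ltr" style="text-align:left">
  <span lang="en-US">aktivirati</span>
  <span lang="en-US">svrš. prel. </span>
  <span lang="en-US">❶ </span>
  <span lang="en-US">(~ uređaj) upaliti, aktivirati</span>
  <span lang="en-US">❷ </span>
  <span lang="en-US">v.</span>
  <span lang="en-US">angažovati 1.</span>
</p>
"""

entries = read_entries(SAMPLE)
print(entries[0]["title"], entries[0]["word_type"])
```

Only the first two spans are trusted here, which matches the observation that the title and the word type were the only elements that could be extracted reliably; the remainder is kept as one string for later processing.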

What we want is to convert it to a format in which every element is separate and can be easily manipulated. The processed data is stored in a JSON file, which for this entry looks like this:

       "aktivirati": {
           "['svrš.', 'prel.']": [
               {
                   "1": {
                       "0": {
                           "form": "upaliti",
                           "description": "(~ uređaj)"
                       },
                       "1": {
                           "form": "aktivirati"
                       },
                       "2": {
                           "form": "pokrenuti"
                       },
                       "3": {
                           "form": "staviti u pogon/pokret"
                       }
                   },
                   "2": {
                       "0": {
                           "categories": [
                               "v."
                           ],
                           "form": "1. angažovati"
                       }
                   }
               },
               "{'reference_type': , 'page': 0}"
           ]
       }

As you can see, the entry has the structure of a Python dictionary, with the title as the key and the type of the word as the key of its subdictionary. The subdictionary holds a list with two elements. The first is another dictionary, which contains the meanings of the entry; the second is the reference to the dictionary the entry was taken from, together with the page number. In this example the reference is left blank, but in the future it will contain accurate data. Let’s go back to the meanings. The meanings dictionary has numbers as keys, counted from 1. The value of each key is another dictionary, which holds the synonyms of that meaning under numerical keys starting from 0. Each of their values is, guess what, another dictionary, with one to three possible key-value pairs. These keys are:

  1. form - the synonym itself, be it a word or part of the sentence
  2. description - meaning of the entry described in words
  3. categories - tags that go with synonym like ‘v.’ for “see”, or “sl.” for “similar”
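To make the nesting easier to follow, here is a small sketch that walks a shortened copy of the JSON above and yields one row per word. The function name iter_words is hypothetical and not part of Dictator. Non-numeric keys are skipped, on the assumption that they hold related words filed under a tag rather than true synonyms.

```python
import json

# Shortened copy of the example entry shown above.
ENTRY_JSON = """
{
  "aktivirati": {
    "['svrš.', 'prel.']": [
      {
        "1": {
          "0": {"form": "upaliti", "description": "(~ uređaj)"},
          "1": {"form": "aktivirati"}
        },
        "2": {
          "0": {"categories": ["v."], "form": "1. angažovati"}
        }
      },
      "{'reference_type': , 'page': 0}"
    ]
  }
}
"""


def iter_words(data):
    """Yield (title, word_type, meaning_number, word_dict) for every word."""
    for title, by_type in data.items():
        for word_type, (meanings, _reference) in by_type.items():
            # Meanings are keyed by numbers counted from 1.
            for meaning_no in sorted((k for k in meanings if k.isdigit()), key=int):
                words = meanings[meaning_no]
                # Words inside a meaning are keyed by numbers counted from 0.
                for position in sorted((k for k in words if k.isdigit()), key=int):
                    yield title, word_type, int(meaning_no), words[position]


rows = list(iter_words(json.loads(ENTRY_JSON)))
for title, word_type, meaning_no, word in rows:
    print(title, meaning_no, word["form"], word.get("categories", []))
```

Note that the word type key is the string form of a Python list, so it has to be treated as an opaque label unless it is parsed separately.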

How can we tokenize the rest of the entry? That’s where Python comes in. We already have the title and the type of the word readily available, but the rest of the entry is a bit garbled and needs cleaning. Everything after the first two elements of the entry is concatenated into a single string. That string is then divided into separate meanings, which are separated by a delimiter. In the original the meanings were delimited by numbers like “❶” and “❷”, but I replaced those with a single arbitrary word so that they are easier to find. This delimiter can be anything, a symbol for example, as long as you know it is used nowhere else in the dictionary and serves only to separate meanings. I used BeautifulSoup for reading and parsing the HTML, and it was really an easy job, as the module offers a bunch of methods and lets you access any element with just a few lines of code.
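As an illustration of the delimiter idea, here is a minimal sketch. The delimiter word and the function name split_meanings are made up for this example, and the character range assumes the numbers are the dingbat circled digits ❶ to ❿.

```python
import re

# Arbitrary marker that must not occur anywhere in the dictionary text.
MEANING_DELIMITER = "@@MEANING@@"

# Dingbat negative circled digits one to ten (U+2776 to U+277F).
NUMBER_MARKS = re.compile("[\u2776-\u277f]")


def split_meanings(body):
    """Replace every circled number with one delimiter, then split on it."""
    marked = NUMBER_MARKS.sub(MEANING_DELIMITER, body)
    return [part.strip() for part in marked.split(MEANING_DELIMITER) if part.strip()]


meanings = split_meanings(
    "❶ (~ uređaj) upaliti, aktivirati, pokrenuti ❷ v. angažovati 1."
)
print(meanings)
```

Replacing all the different number symbols with one marker means the rest of the code only ever has to look for a single string.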

Most of the work here is done by the Entry class in parser.py, with its methods and helper functions. The Entry class stores all data about an entry and is tailored to this particular dictionary. If we were to parse some other dictionary, we would have to rewrite most of it. The same can be said of some helper functions, but as work continues on other dictionaries, we will see what can be reused and what cannot. The idea is to make the program modular, so that anybody who wishes to parse a dictionary can simply add a class and methods specific to that dictionary.

So, upon receiving its initial parameters, the Entry class uses its own methods to further separate the elements of the entry. First we get the meanings as explained above, then we go on to get all the words of a particular meaning and everything that follows them, like tags, explanations and descriptions. When dealing with the words of a meaning and their tags, we must read the tags to determine whether a word is a synonym or just a related or similar word, as denoted by the tag. Only true synonyms have a number as the key in the list of meanings, while other words have the type of the tag as the key. This goes for tags like “similar words”, “compare”, etc.

Each word we deal with has up to three elements that we separate: the word itself, its tags/categories and a written description. Not all words have all of these elements. Most of them have only the word form, without any tags or descriptions.
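A simplified sketch of separating these three elements might look like this. It is not the logic of the Entry class: the function name parse_word and the short tag list are assumptions, and the sketch expects at most one bracketed description per word.

```python
import re

# Hypothetical subset of the abbreviations used as tags in the dictionary.
KNOWN_TAGS = {"reg.", "arh.", "žarg.", "deč.", "ret.", "v.", "sl."}

# A description is whatever stands inside a pair of brackets.
DESCRIPTION = re.compile(r"\([^)]*\)")


def parse_word(raw):
    """Split one raw word into form, description and categories."""
    result = {}
    match = DESCRIPTION.search(raw)
    if match:
        result["description"] = match.group(0)
        # Cut the description out so only the form and the tags remain.
        raw = raw[:match.start()] + " " + raw[match.end():]
    tokens = raw.split()
    categories = [t for t in tokens if t in KNOWN_TAGS]
    result["form"] = " ".join(t for t in tokens if t not in KNOWN_TAGS)
    if categories:
        result["categories"] = categories
    return result


print(parse_word("(~ uređaj) upaliti"))
print(parse_word("pomet reg. arh."))
print(parse_word("referada (mesto gde radi referent) ret."))
```

Keys are only added when the element is present, which mirrors the JSON shown earlier, where most words carry nothing but a form.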

There are also other markers that need to be handled differently. For example, a number to the right of a word means that we need to look up the synonyms in another entry of the dictionary, in the particular meaning given by the number; “lat.” means that Latin terms follow, which are not to be transcribed to Cyrillic; and brackets mean that we are dealing with a description of the word or its usage.

Examples of usage of different tags:

These are regular tags, one per word: drob reg., tiba deč.

pometnuće reg., pomet reg. arh.

larija žarg., pisarna arh., pisarnica arh., referada (mesto gde radi referent) ret.

This is an example of word with v. ('see') tag which we need to substitute with synonyms of the entry with that name: v. poremećenost

v. avanzovati 1 i 2.


Similar to the previous case, but here we substitute the word only with the synonyms of the numbered meaning(s) of the other entry: podesiti 1.

naoružanje 1. i 2.

važeći 2.

Example of lat. (for Latin) tag. '#' marks that following letter is not to be transcribed in any case: (vrsta šarenog dnevnog leptira, lat. #V#a#n#e#s#a #A#t#a#l#a#n#t#a) *

(lat. #C#a#r#c#h#a#r#i#a#s #g#l#a#u#c#u#s) morski pas

(gmizavac srodan krokodilu iz reda krokodilije, lat. #A#l#l#i#g#a#t#o#r) krokodil

Single tag used for multiple words: -up: uneti, spremiti, menjati

-suž: momaštvo, mladićstvo, devojaštvo

This process was not straightforward. At first I parsed only a small test sample of the dictionary until I got the code right. Once the code was working, I had to constantly go back to the HTML and look for misplaced symbols, the result of human error, and correct them so that the program could parse the data correctly. As most of these mistakes were typical, I could find them in Sublime Text with regular expressions and fix every instance with the replace option. This was a tedious but necessary step, and more than any other it involved testing with odd cases to find and fix all possible errors.

Finally, once the program passes through all of the data successfully and the known errors are resolved, the processed data is ready to be exported. It is exported in JSON format, with an option to change the writing system. We can export the data as it is, which in this case means in the original Latin script, or transcribe it to Cyrillic (bear in mind that we are dealing with dictionaries of the Serbian language, which uses both Latin and Cyrillic scripts). Only the content is transcribed; the keys, which are in English, remain in their original form.
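The transcription step, together with the '#' marker from the examples above, can be sketched roughly as follows. This is not the code of sr_lat2cyr2lat.py. The function name lat_to_cyr is hypothetical, the sketch assumes that the '#' markers are dropped from the output, and it ignores words in which “nj”, “lj” or “dž” are two separate sounds.

```python
# Serbian Latin digraphs that map to a single Cyrillic letter.
DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}

# Remaining one-to-one letter pairs, in matching order.
SINGLE = dict(zip("abcčćdđefghijklmnoprsštuvzž",
                  "абцчћдђефгхијклмнопрсштувзж"))


def lat_to_cyr(text):
    """Transcribe Latin to Cyrillic, copying '#'-protected characters as is."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "#" and i + 1 < len(text):
            out.append(text[i + 1])   # protected character stays in Latin
            i += 2
            continue
        pair = text[i:i + 2]
        if pair.lower() in DIGRAPHS:
            cyr = DIGRAPHS[pair.lower()]
            out.append(cyr.upper() if pair[0].isupper() else cyr)
            i += 2
            continue
        cyr = SINGLE.get(ch.lower())
        if cyr is None:
            out.append(ch)            # punctuation, digits, spaces
        else:
            out.append(cyr.upper() if ch.isupper() else cyr)
        i += 1
    return "".join(out)


print(lat_to_cyr("(lat. #A#l#l#i#g#a#t#o#r) krokodil"))
```

Digraphs are checked before single letters, because otherwise “lj” would be transcribed as two separate Cyrillic letters.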

Preparing data for Wiki

With the data parsed, tokenized and organized into a Python dictionary of dictionaries, it is ready to be put into a format suitable for Wiktionary. This is what sinonimi_reader.py accomplishes: it reads a given JSON file and inserts the data into its own Entry class, which has methods to rearrange the data and output it in Wiktionary markup.

But this is not the time to hurry and upload all entries at once! First we must use Pywikibot to check which entries are already on Wiktionary. Those have to be edited so that the new data is appended to them, while entries that are not on Wiktionary yet have to be generated from scratch. igl_count_pages.py uses Pywikibot features to get the names of all pages that we haven’t worked on yet or that have not been created yet. We use it to get two lists: one contains entries that do not exist on Wiktionary yet, and the other contains existing entries that we have to edit.
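The splitting into two lists can be sketched as below. This is not the code of igl_count_pages.py. The existence check is passed in as a function so the example runs offline; the comment shows how it could be backed by Pywikibot, assuming a configured bot account and the Serbian Wiktionary as the target site.

```python
def split_by_existence(titles, page_exists):
    """Return (new_titles, existing_titles) according to page_exists(title)."""
    new_titles, existing_titles = [], []
    for title in titles:
        if page_exists(title):
            existing_titles.append(title)
        else:
            new_titles.append(title)
    return new_titles, existing_titles


# With Pywikibot the check could look like this (needs network access
# and a user-config.py, so it is left as a comment):
#
#   import pywikibot
#   site = pywikibot.Site("sr", "wiktionary")
#   page_exists = lambda title: pywikibot.Page(site, title).exists()

# Offline stand-in: pretend these pages are already on Wiktionary.
already_there = {"aktivirati"}
new_titles, existing_titles = split_by_existence(
    ["aktivirati", "upaliti", "pokrenuti"],
    lambda title: title in already_there,
)
print(new_titles, existing_titles)
```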

With the lists in our hands we can start with whichever we want, either creating new entries or appending to existing ones. As there were far more non-existing entries, and as they are easier to create, I decided to work on them first. So, back in sinonimi_reader.py, we use our list to look up the entries from the JSON that we need to output. As the dictionary (the book) contains multiple entries with the same name, in cases where a word form can be of more than one type, we store each of those types in a separate dictionary and keep all these dictionaries in a list. The main difference in outputting the two cases is that a simple entry is exported as a whole, including the header with the title and the footer with references, whereas for a list of entries we add the header and title to the first one, take only the body of any entry in between, add the footer to the last one, and combine them all into a complete entry for the word. There are methods in the Entry class to add what we need to our data (such as Entry.to_wiki), helper functions for inserting references (concat_entry) and formatting markup (format_syn_asc, format_type, and Entry.to_wiki), methods to combine data (process_description, process_form, process_categories), and constants used for abbreviations and word types.
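The header, body and footer logic can be pictured with the following sketch. It does not reproduce concat_entry or Entry.to_wiki: the function name combine_entries and the header and footer markup are placeholders, since the real pages follow the conventions of the target Wiktionary.

```python
def combine_entries(title, bodies, reference):
    """Join one or more bodies for the same title into a single page text.

    The header is written once before the first body and the footer once
    after the last body, so a single entry and a list of entries with the
    same name go through the same code path.
    """
    header = "== {} ==".format(title)                       # placeholder markup
    footer = "=== Reference ===\n* {}".format(reference)    # placeholder markup
    return "\n\n".join([header] + list(bodies) + [footer])


page_text = combine_entries(
    "primer",
    ["body for the first word type", "body for the second word type"],
    "Rečnik sinonima",
)
print(page_text)
```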

Of course, this too was a process of trial and error, with error checking and switching back and forth between fixing and modifying the software and hunting for anomalies in the JSON file. Once this was done to a satisfactory level (it is never actually finished, as we can’t check everything by hand, so the data is never 100% error free), I could proceed to upload the first batch of entries to Wiktionary.

Let me pause here for a minute and elaborate on what sinonimi_reader.py does. It deals with each element of the Wiktionary entry separately. The methods used to process synonyms are quite different from the methods used to process the description of a meaning, for example. As I said before, from Rečnik sinonima we only have three relevant fields to get: the description of the meaning, synonyms and associations. More can and will be added as I start working with other dictionaries that have different fields. Also, every word that is currently hard-coded in strings will be turned into a variable, to enable working with Wiktionaries in languages other than Serbian. One more interesting thing to note is that each word accompanied by a tag that refers to another entry (the “see” tag) is recursively substituted with the synonyms of the entry it points to. This was one of the hardest things to accomplish and it took much time and effort. Also, notice that we are dealing with two kinds of tags. One group consists of regular Wiktionary tags, from which you can navigate to the entry describing the tag. The second group also consists of word tags, but these point to abbreviation descriptions taken from the glossary of Rečnik sinonima. At least that is how it should be: currently they are not entered on Wiktionary, but I hope that by the time you read this you can find them there.
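The recursive substitution of “see” references can be sketched on a simplified data model, in which every title maps to a flat list of word dictionaries. This is an illustration only and not the code of sinonimi_reader.py; the function name resolve_synonyms and the guard against circular references are assumptions.

```python
def resolve_synonyms(title, entries, _seen=None):
    """Return the synonyms of title, following 'v.' (see) references."""
    seen = _seen if _seen is not None else set()
    if title in seen or title not in entries:
        return []            # circular or dangling reference: nothing to add
    seen.add(title)
    resolved = []
    for word in entries[title]:
        if "v." in word.get("categories", []):
            # The word is a pointer: substitute the other entry's synonyms.
            candidates = resolve_synonyms(word["form"], entries, seen)
        else:
            candidates = [word["form"]]
        for form in candidates:
            if form not in resolved:
                resolved.append(form)
    return resolved


# Made-up entries that refer to each other, to exercise the recursion.
entries = {
    "a": [{"form": "x"}, {"form": "b", "categories": ["v."]}],
    "b": [{"form": "y"}, {"form": "a", "categories": ["v."]}],
}
print(resolve_synonyms("a", entries))
```

Without the set of already visited titles, two entries that point at each other would send the recursion into an endless loop.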

Uploading data

With the entries ready to be uploaded, I once again turned to Pywikibot to help with the job. I started with smaller samples to test whether everything was OK, and increased the sample size once I was more sure that (almost) everything was correct. pagefromfile.py from Pywikibot was really useful here, as it takes almost no effort to get the entries uploaded and the pages created on Wiktionary. There were around 13000 entries, and as I wasn't uploading non-stop, it took four days, although I calculated that everything could be uploaded in 36 hours without pauses. Of course, pagefromfile takes a break after uploading each entry. The wait time can be decreased, but I decided to stick with the default value of 10 seconds to be on the safe side.
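Two details from this step can be sketched in code: the input file for pagefromfile.py and the time estimate. The markers below are, to my knowledge, the script's defaults ({{-start-}}, {{-stop-}} and the title between triple apostrophes), but check them against the Pywikibot version you use. The function names are made up for this example.

```python
def build_batch(pages):
    """Format (title, text) pairs as one input file for pagefromfile.py."""
    chunks = []
    for title, text in pages:
        chunks.append("{{-start-}}\n'''" + title + "'''\n" + text + "\n{{-stop-}}")
    return "\n".join(chunks)


def upload_hours(page_count, seconds_per_page=10):
    """Estimate the upload time when the bot waits after every page."""
    return page_count * seconds_per_page / 3600


batch = build_batch([("upaliti", "page text"), ("pokrenuti", "page text")])
print(batch)
print(round(upload_hours(13000), 1), "hours")
```

With 13000 pages and a 10 second wait this gives roughly 36 hours, which agrees with the estimate above. Depending on the options used, the title line may also end up in the page text, so it is worth testing with a small batch first.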

With the larger batch uploaded, I proceeded to deal with the entries that were already on Wiktionary. Processing them wasn’t as straightforward and required writing an entire new set of functions in addition to the existing ones. These functions can be found in igl_addition.py, which relies heavily on sinonimi_reader.py. Pywikibot offers methods to add text to the top or the bottom of an entry, but inserting in the middle can only be accomplished by getting the entire existing entry, searching for the position at which to add text, inserting text into one or more fields, and overwriting the existing entry with the modified one. Fortunately, most of the Serbian entries followed the same structure, which made finding the appropriate place to insert fields easier. The rest of the entries used a very different syntax, so we couldn’t do this with the same reliability. There were only about 500 such entries, so checking them manually wasn’t a problem. But if I had been dealing with, say, 30000 entries that had to be edited, it would have been a very different story.
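The get, search, insert and overwrite cycle can be sketched as follows. The string handling runs on its own; the Pywikibot part is shown only as a comment because it needs a configured bot and network access. The function name insert_before and the section marker are assumptions, not the code of igl_addition.py.

```python
def insert_before(page_text, marker, addition):
    """Insert addition on its own line before the first marker.

    Returns None when the marker is missing, so that pages with an
    unexpected structure can be set aside for manual checking.
    """
    position = page_text.find(marker)
    if position == -1:
        return None
    return page_text[:position] + addition + "\n" + page_text[position:]


# Possible use with Pywikibot (not executed here):
#
#   import pywikibot
#   site = pywikibot.Site("sr", "wiktionary")
#   page = pywikibot.Page(site, "aktivirati")
#   new_text = insert_before(page.text, "=== Reference ===", synonyms_block)
#   if new_text is not None:
#       page.text = new_text
#       page.save(summary="Adding synonyms from Rečnik sinonima")

old_text = "== aktivirati ==\nexisting content\n=== Reference ===\n* source"
new_text = insert_before(old_text, "=== Reference ===", "synonyms: upaliti, pokrenuti")
print(new_text)
```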

Example of Wiktionary page made by Dictator


Graph with all sinonimi_reader functions and connections between them

Conclusion

Using this software and its methods, I was able to create entries on Wiktionary from Rečnik sinonima. This increased the number of pages from 18000 to 31500 after the pages in Cyrillic were made, an increase of approximately 75%, and later, after the pages in Latin were added, to around 45000, an increase of 150% over the original count. With more dictionaries to come, and with other potential users around the world, we can only imagine that the number of pages in Wiktionary will increase very quickly.
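For reference, both percentages are measured against the original 18000 pages. A quick check, with a helper name made up for this example:

```python
def percent_increase(before, after):
    """Growth from before to after, as a percentage of before."""
    return (after - before) / before * 100


cyrillic_growth = percent_increase(18000, 31500)   # after the Cyrillic pages
total_growth = percent_increase(18000, 45000)      # after the Latin pages too
print(cyrillic_growth, total_growth)
```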

Of course, everything here in terms of software is subject to change, and the methods will be ironed out and made more universal and applicable to other dictionaries, as this is only the beginning. There is also the expectation and hope that this will evolve into a universal tool for automating the parsing and uploading of dictionaries into Wiktionaries, making them far richer with much less human effort.