Géo − Geonames alignment
This blog post describes the main points of the alignment process between the French National Library's Géo data repository and the data extracted from Geonames.
Alignment is the process of finding similar entities in different repositories. The Géo repository contains many locations, and the goal is to find those locations in the Geonames repository, so that we can state that a given location in Géo is the same as a given one in Geonames. For that purpose, Logilab developed a library called Nazca to build those links.
To process the alignment between Géo and Geonames, we divided the Géo repository into two groups:
- One group gathering the Géo data that has longitude and latitude information.
- Another gathering the data that has no longitude and latitude information.
Group 1 - Data having geographical information
The alignment process consists of five steps (see figure below):
1. Data gathering
We gather the information needed to align, that is to say, the unique identifier, the name, the longitude and the latitude. The same applies to the Geonames data.
2. Standardization
This step aims to make the data as standard as possible, i.e. set it to lower case, remove the stop words, remove the punctuation and so on.
3. Geographical neighbours search
During this step, all the locations are put into one big set. A Kdtree is used to find the nearest neighbours of a given location. This step allows us to group the locations geographically.
4. Alignment
Thanks to the Kdtree, we can quickly find the nearest geographical neighbours. During this fourth step, we loop over the nearest neighbours and assign each one a grade according to the similarity between its name and the name of the location we're looking for, using the Levenshtein distance. The alignment is made with the best-graded one.
5. Saving the results
Finally, we save all the results into a file.
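Steps 2 to 4 above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Nazca's actual API: the function names, the stop-word list, the record layout `(identifier, name, latitude, longitude)`, the sample records and the number of neighbours `k` are all hypothetical. To keep the example dependency-free, a brute-force scan over plain degree differences stands in for the Kdtree.

```python
import string

# Hypothetical stop-word list; the real one is not given in this post.
STOP_WORDS = {"the", "of", "le", "la", "les", "de", "du", "des"}


def normalize(name):
    """Step 2: lower-case, replace punctuation by spaces, drop stop words."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = name.lower().translate(table).split()
    return " ".join(w for w in words if w not in STOP_WORDS)


def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def nearest_neighbours(location, candidates, k=5):
    """Step 3: the k geographically closest candidates.

    Brute-force stand-in for the Kdtree. Squared differences in degrees
    are only a rough proxy for distance, used here to rank nearby points.
    """
    _, _, lat, lon = location
    return sorted(
        candidates,
        key=lambda c: (c[2] - lat) ** 2 + (c[3] - lon) ** 2,
    )[:k]


def align(location, candidates, k=5):
    """Step 4: among the k nearest neighbours, keep the closest name."""
    name = normalize(location[1])
    neighbours = nearest_neighbours(location, candidates, k)
    return min(
        neighbours,
        key=lambda c: levenshtein(name, normalize(c[1])),
        default=None,
    )


# Made-up records, for illustration only.
geo_location = ("geo:1", "Saint-Denis", 48.936, 2.357)
geonames_locations = [
    ("gn:1", "Saint Denis", 48.935, 2.358),
    ("gn:2", "Aubervilliers", 48.914, 2.383),
    ("gn:3", "Tokyo", 35.68, 139.69),
]
match = align(geo_location, geonames_locations, k=2)
```

With a large data set, the sorted scan would be replaced by a real Kdtree query, which is what makes the neighbour search fast.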
Group 2 - Data having no geographical information
Let's have a look at the data having no information on the longitude and the latitude. The steps are more or less the same as before, except that we cannot find neighbours using a Kdtree. So we use another method to find locations whose names have a fairly high level of similarity. This method is called Minhashing, and it has been shown to be quite relevant for this purpose.
To minimise the number of mistakes, we try to group locations by country, knowing that the country is often written in the location's preferred_label. This pre-treatment helps us to filter out the cities having the same name but located in different countries. For instance, there is a Paris in France, a Paris in the United States, and a Paris in Canada. So the alignment is made country by country.
The fourth and fifth steps remain the same.
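To give an idea of how Minhashing compares names, here is a minimal sketch. It is not the implementation used for this alignment: the use of character trigrams, the number of hash functions and the seeded SHA-256 hashing are all assumptions chosen for the example. Each name is reduced to a short signature, and the fraction of positions where two signatures agree estimates the Jaccard similarity of their trigram sets.

```python
import hashlib


def shingles(name, size=3):
    """Set of character n-grams of a name (the whole name if it is short)."""
    if len(name) <= size:
        return {name}
    return {name[i:i + size] for i in range(len(name) - size + 1)}


def seeded_hash(seed, gram):
    """64-bit integer derived from SHA-256; one hash function per seed."""
    digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(name, num_hashes=128):
    """One minimum per seeded hash function, taken over all shingles."""
    grams = shingles(name)
    return [
        min(seeded_hash(seed, gram) for gram in grams)
        for seed in range(num_hashes)
    ]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


same = estimated_similarity(
    minhash_signature("paris"), minhash_signature("paris"))
close = estimated_similarity(
    minhash_signature("new york city"), minhash_signature("new york"))
far = estimated_similarity(
    minhash_signature("londres"), minhash_signature("tokyo"))
```

Identical names give a similarity of 1, names sharing many trigrams give an intermediate value, and names with no trigram in common give a value close to 0.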
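The grouping by country can be sketched as follows. The exact layout of the preferred_label is not described in this post, so the format `Name (Country)` used here is purely an assumption for illustration, as are the function names. Labels without a recognisable country end up in a separate group.

```python
import re
from collections import defaultdict

# Hypothetical label format: "Name (Country)".
LABEL_WITH_COUNTRY = re.compile(
    r"^(?P<name>.+?)\s*\((?P<country>[^()]+)\)\s*$")


def split_label(preferred_label):
    """Return (name, country); country is None when the label has none."""
    match = LABEL_WITH_COUNTRY.match(preferred_label)
    if match is None:
        return preferred_label.strip(), None
    return match.group("name"), match.group("country").strip()


def group_by_country(preferred_labels):
    """Map each country (or None) to the list of names found for it."""
    groups = defaultdict(list)
    for label in preferred_labels:
        name, country = split_label(label)
        groups[country].append(name)
    return dict(groups)


groups = group_by_country([
    "Paris (France)",
    "Paris (United States)",
    "Paris (Canada)",
    "Lyon (France)",
    "Atlantis",
])
```

The name comparison is then run separately inside each group, so that the three cities called Paris are never compared with one another.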
Results obtained
The results we got are as follows:
Group | Number of locations | Aligned | Non-aligned |
---|---|---|---|
Group 1 | 97572 | 89.3% | 10.7% |
Group 2 | 150528 | 72.9% | 27.1% |
Total | 248100 | 79.3% | 20.7% |
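The Total row is consistent with an average of the two groups weighted by their size. A quick check, using only the figures published in the table (the variable names are ours, and the rates are the rounded percentages, so the result is approximate):

```python
# (number of locations, aligned rate) per group, as given in the table
groups = {
    "Group 1": (97572, 0.893),
    "Group 2": (150528, 0.729),
}

total = sum(count for count, _ in groups.values())
aligned = sum(count * rate for count, rate in groups.values())

aligned_pct = round(100 * aligned / total, 1)
non_aligned_pct = round(100 - 100 * aligned / total, 1)
print(total, aligned_pct, non_aligned_pct)  # prints 248100 79.3 20.7
```

Because Group 2 is about one and a half times larger than Group 1, its lower rate weighs more heavily on the total.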
One problem we met is the language used to describe the location. The similarity grade is based on the distance between the names, and Londres and London, for instance, are not spelled the same way even though they represent the same location.
Results improvement
In order to improve the results a little, we had a closer look at the 10.7% of non-aligned locations in the first group. The language problem mentioned above was pretty clear. So we decided to use the following definition: two locations are identical if they are geographically very close. Using this definition, we set the name aside and focus on the longitude and the latitude only.
To estimate the accuracy of the results, we picked 50 locations at random and checked them manually. And the results are pretty good! 98% are correct (49/50). That is how, based on a purely geographical approach, we can increase the coverage rate of the first group (from 89.3% to 99.6%).
In the end, we get these results:
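A purely geographical match of this kind can be sketched as below. The post does not say how close "very close" is, so the `max_km` threshold is an illustrative assumption, as are the function names, the record layout `(name, latitude, longitude)` and the approximate coordinates. The distance itself is computed with the standard haversine formula.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def closest_within(location, candidates, max_km=1.0):
    """Nearest candidate by distance alone, or None beyond max_km.

    max_km is an illustrative threshold, not a value taken from the post.
    """
    _, lat, lon = location
    best = min(
        candidates,
        key=lambda c: haversine_km(lat, lon, c[1], c[2]),
        default=None,
    )
    if best is None or haversine_km(lat, lon, best[1], best[2]) > max_km:
        return None
    return best


# Approximate coordinates, for illustration only.
geo_location = ("Londres", 51.5074, -0.1278)
geonames_locations = [
    ("London", 51.5072, -0.1276),
    ("Paris", 48.8566, 2.3522),
]
match = closest_within(geo_location, geonames_locations)
```

Here Londres is matched with London although the two names differ, because the names play no part in the comparison.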
Group | Number of locations | Aligned | Non-aligned |
---|---|---|---|
Group 1 | 97572 | 99.6% | 0.4% |
Group 2 | 150528 | 72.9% | 27.1% |
Total | 248100 | 83.4% | 16.6% |