This post describes how we built a geocoder on top of ElasticSearch. It finds coordinates by place names and synonyms, looks up crossroads and addresses within a given radius, supports reverse geocoding, and is automatically updated with new data from drivers. The repository is available by link:
When we decided to design a geocoder for the needs of Namba-Taxi, we ran into a lack of data.
What we don't have:
- A full map from Yandex, Google, or 2GIS
- Confidence in GPS data
What we have:
- Very mixed input;
- OpenStreetMap, which we already use in some places;
- Our accumulated address database with coordinates.
What can operators enter?
Operators can enter addresses in different formats:
- Street and house number
- Name of institution
- Point name
- Housing estate
- Microdistrict, street, and house number
There are many such variants, for example:
- Kiev Street 28
- Kiev Street/Soviet Street
- 5 microdistrict Soviet 42
- CSM (Wallmart)
- Cafe Ashot’s
We settled on the following algorithm:
- First, obtain the geometry of large settlements (cities, capitals, villages, residential areas);
- Extract all possible addresses and assign each one to the right residential area, city, or other settlement, recording that value on the address;
- Extract all roads;
- Find the road intersections;
- Put everything in the index.
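The step that assigns addresses to settlements comes down to a point-in-polygon test. Below is a minimal sketch using ray casting. It is not taken from the repository: the function names, the data layout, and the rectangle standing in for a city boundary are all assumptions made for illustration.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray casting test. polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed
        # by a horizontal ray cast from the point.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_settlement(address, settlements):
    """Return the name of the first settlement containing the address, else None."""
    for name, polygon in settlements.items():
        if point_in_polygon(address["lon"], address["lat"], polygon):
            return name
    return None


# Illustrative rectangle, not a real administrative boundary.
settlements = {"Bishkek": [(74.4, 42.8), (74.7, 42.8), (74.7, 43.0), (74.4, 43.0)]}
inside_address = {"lon": 74.59, "lat": 42.87}
outside_address = {"lon": 75.0, "lat": 42.87}
```

With real data, a spatial index over the settlement polygons would avoid checking every polygon for every address.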
OSM is our main data source, so the filters we use to extract data look like this:
- place=city, place=village, place=suburb, place=town, place=neighbourhood: get all the settlements and neighborhoods.
- addr:street + addr:housenumber, amenity, shop, addr:housenumber: get addresses and names of institutions.
- highway: get all the roads.
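As a rough illustration of these filters, the sketch below classifies an OSM object by its tags. The tag keys and values come from the list above; the function name, the category labels, and the order of the checks are assumptions, not the importer's actual code.

```python
# place=* values that mark a settlement or neighborhood (from the filter list).
PLACE_VALUES = {"city", "village", "suburb", "town", "neighbourhood"}


def classify(tags):
    """Map an OSM tag dictionary to 'settlement', 'road', 'address', or None."""
    if tags.get("place") in PLACE_VALUES:
        return "settlement"
    if "highway" in tags:
        return "road"
    # Addresses and institutions: a house number, an amenity, or a shop.
    if "addr:housenumber" in tags or "amenity" in tags or "shop" in tags:
        return "address"
    return None


print(classify({"addr:street": "Киевская", "addr:housenumber": "28"}))
```

Objects that match none of the filters return None and would simply be skipped during import.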
Searching in Russian for places with English-language names was difficult. Here is how I tried to solve it:
- Simple automatic transliteration into Russian. The results were absurd and incorrect. A typical conversion looked like this: City House -> Цити Хоусе.
- Get the phonetic transcription of the word and then transliterate that. This produced something like Adrenaline rush -> Эдреналин Рэш. Passable, but we need the form a Russian speaker would actually type, such as адреналин раш.
- Automatically transliterate all data using a replacement dictionary. This is the solution we kept. Simple transliteration works tolerably, and the dictionary was filled in fairly quickly over several runs on the data.
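The third approach can be sketched as a dictionary lookup with a naive letter-by-letter fallback. The dictionary entries and the simplified Latin-to-Cyrillic table below are assumptions for illustration; the real replacement dictionary is larger and was built from the data.

```python
# Hypothetical replacement dictionary: words whose naive transliteration is wrong.
REPLACEMENTS = {
    "city": "сити",
    "house": "хаус",
    "adrenaline": "адреналин",
    "rush": "раш",
}

# Simplified transliteration table (an assumption, not a standard).
DIGRAPHS = {"sh": "ш", "ch": "ч", "zh": "ж", "kh": "х", "ts": "ц"}
LETTERS = {
    "a": "а", "b": "б", "c": "ц", "d": "д", "e": "е", "f": "ф", "g": "г",
    "h": "х", "i": "и", "j": "ж", "k": "к", "l": "л", "m": "м", "n": "н",
    "o": "о", "p": "п", "q": "к", "r": "р", "s": "с", "t": "т", "u": "у",
    "v": "в", "w": "в", "x": "кс", "y": "и", "z": "з",
}


def naive_transliterate(word):
    """Letter-by-letter transliteration, trying two-letter combinations first."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        else:
            # Characters outside the table (digits, Cyrillic) pass through unchanged.
            out.append(LETTERS.get(word[i], word[i]))
            i += 1
    return "".join(out)


def transliterate(text):
    """Use the replacement dictionary per word, falling back to the naive table."""
    words = text.lower().split()
    return " ".join(REPLACEMENTS.get(w) or naive_transliterate(w) for w in words)


print(transliterate("City House"))
```

Each run over the data surfaces more words that the naive table gets wrong, and those go into the dictionary.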
With that sorted out, at this point the data we get is:
- Normalized and converted to Russian;
- Formatted as country, city, town or village, neighborhood or residential area, street, house.
The next part of the quest is finding the road intersections. I wrote it in a hurry and got a very slow implementation with O(n²) complexity. As a temporary workaround, I used Postgres + PostGIS to find the intersections until I could find a good algorithm.
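One way to avoid comparing every pair of roads relies on how OSM models junctions: roads that connect at a junction share a node. The sketch below is not confirmed to be what Ariadna ended up using. It groups road names by node id in a single pass, and the input layout (road name mapped to a list of node ids) is an assumption.

```python
from collections import defaultdict


def find_intersections(ways):
    """ways maps a road name to its list of OSM node ids.

    Returns {node_id: sorted list of road names} for every node
    that belongs to more than one distinct road.
    """
    roads_by_node = defaultdict(set)
    for name, node_ids in ways.items():
        for node_id in node_ids:
            roads_by_node[node_id].add(name)
    return {
        node_id: sorted(names)
        for node_id, names in roads_by_node.items()
        if len(names) > 1
    }


# Made-up node ids for illustration.
ways = {
    "Kiev Street": [1, 2, 3],
    "Soviet Street": [7, 2, 8],
    "Other Street": [9, 10],
}
print(find_intersections(ways))
```

The work is proportional to the total number of nodes rather than to the number of road pairs. Roads that cross without a shared node, such as bridges, are not reported as intersections.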
The result is a solid parser for OSM data that loads it into ElasticSearch. It got the simple name “importer”.
Constantly downloading data and recreating indexes in ElasticSearch by hand soon got tiresome, so the updater component appeared, along with a configuration file in JSON format.
The process of downloading the file and importing it into ElasticSearch was automated. It also became possible to update the data in ElasticSearch without downtime, thanks to aliases.
How it works:
- Updater downloads the file;
- Reads the current index version from the config;
- Increments the version and creates a new index;
- Fills it with data;
- Changes aliases;
- Removes the old index.
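The alias switch is what makes the update free of downtime. ElasticSearch applies all actions in a single `_aliases` request atomically, so searches never see a moment without an index behind the alias. The sketch below only builds the request body. The index naming scheme and function names are assumptions, not the updater's actual code.

```python
import json


def next_index_name(prefix, current_version):
    """Hypothetical naming scheme: <prefix>_v<version>."""
    return f"{prefix}_v{current_version + 1}"


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases that moves the alias in one atomic step."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


new_index = next_index_name("osm_data", 3)
body = alias_swap_body("addresses", "osm_data_v3", new_index)
print(json.dumps(body, indent=2))
```

The old index is deleted only after the alias already points to the new one, which matches the order of the steps above.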
What this gives me in practice:
- Write a config;
- Run ./ariadna update;
- Go drink coffee;
- Get a ready, configured index.
For convenience, a simple web interface with a map and search was also added.
Automatic replenishment of data
In addition to OSM, we have many drivers and operators entering orders, which gives us a name and coordinates for each one. So we set up the following scheme:
- Tracks of drivers are stored in the drivers_data index;
- Data from the OSM is stored in the osm_data index;
- Both are combined under the alias addresses, which is what address searches run against.
A driver's data point is recorded when the coordinates we have for that address differ from the driver's by more than 200 meters.
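The 200-meter rule can be sketched with the haversine great-circle distance. The threshold comes from the text above. The function names, the (lat, lon) tuple layout, and the sample coordinates are assumptions for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters
THRESHOLD_M = 200           # threshold from the rule above


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def should_record(known, reported, threshold_m=THRESHOLD_M):
    """True when the driver's point is farther from ours than the threshold."""
    return haversine_m(*known, *reported) > threshold_m


known = (42.8746, 74.5698)       # (lat, lon) we already have, illustrative
nearby = (42.8756, 74.5698)      # roughly 111 m to the north
far_away = (42.8846, 74.5698)    # roughly 1.1 km to the north
print(should_record(known, nearby), should_record(known, far_away))
```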
What can the Ariadna geocoder do?
- Search for coordinates by synonyms. For example, CVK — ChampagneVinKombinat;
- Search for addresses within a given radius (for example, we limit our own search to addresses within 30 km of the city center);
- Search by name of establishments (cafe Ashot’s for example);
- Search for crossroads;
- Search for addresses in neighborhoods and residential areas;
- Reverse geocoding;
- Automatic replenishment with new data from drivers.
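The radius search maps naturally onto ElasticSearch's geo_distance query. The sketch below builds such a query body. The field names "name" and "location", the function name, and the center coordinates are assumptions and may not match the actual index mapping.

```python
def radius_search_body(text, center_lat, center_lon, radius_km=30):
    """Full-text match restricted to points within radius_km of the center."""
    return {
        "query": {
            "bool": {
                "must": {"match": {"name": text}},
                "filter": {
                    "geo_distance": {
                        "distance": f"{radius_km}km",
                        # "location" must be mapped as geo_point in the index.
                        "location": {"lat": center_lat, "lon": center_lon},
                    }
                },
            }
        }
    }


# Illustrative coordinates for a city center.
query = radius_search_body("Киевская 28", 42.87, 74.59)
```

Putting the distance condition in the filter clause keeps it out of relevance scoring, so it only narrows the candidates.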
What components does the geocoder consist of:
- Data Importer;
- Data updater;
- Web interface.
What are the limitations:
- Tested only for Kyrgyzstan;
- No demo (although you can see it in action in the Namba Taxi application when it determines your address from your location);
- Not all addressing schemes are supported.
So I hope someone will help finish it and make the search work well for other countries and cities.
If you found the project interesting, I welcome any criticism, pull requests, issues on GitHub, and feedback in general.