
Brave Tech World

October 28, 2006



Geo Database: Probably the hardest thing I've done for Sampa

By Marcelo

 

    I probably bit off more than I could chew. In other words, building the Geo-Database was probably the hardest feature to date on Sampa. Many things make it hard, but the number one is probably the lack of consistency in geographical names and in how they are cataloged.

 

    I want a Geo-database to do two things:

    1. Given a geographic name (city, state, square, river, etc.), return a unique identifier for that location and its coordinates.
    2. Given a coordinate, find the most likely geographic entity that this point belongs to.
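    To make requirement #2 concrete, here is a minimal Python sketch. It assumes every entity is stored as a single point, so "the entity this point belongs to" is approximated by "the entity whose stored coordinate is closest". The names Place, haversine_km and nearest_place are made up for illustration and are not Sampa's actual code, and the coordinates are approximate.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Place:
    place_id: int   # hypothetical compact unique identifier
    name: str
    lat: float      # degrees
    lon: float      # degrees


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_place(places, lat, lon):
    """Return the place whose stored point is closest to (lat, lon)."""
    return min(places, key=lambda p: haversine_km(lat, lon, p.lat, p.lon))


# Approximate coordinates, for illustration only.
PLACES = [
    Place(1, "São Paulo", -23.55, -46.63),
    Place(2, "Paris", 48.86, 2.35),
    Place(3, "Tokyo", 35.68, 139.69),
]

match = nearest_place(PLACES, -23.50, -46.60)
print(match.place_id, match.name)
```

    A linear scan like this would be far too slow for millions of locations, where a spatial index would be the usual choice. A nearest-point search also ignores the real boundaries of large entities such as parks or rivers.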

    For #1 I could have used a traditional geo-coder (from Yahoo, Google, ServiceObjects, etc.), but the problem is that they don't return a unique identifier. The city name alone is not unique, because many cities share the same name. A combination of neighborhood + city + state + country would be unique, but it would be too long to store on each picture (for Geo-Tagging).

 

    The problem with implementing #1 is the number of duplicate names. For example, there are more than 40 places in the world named "São Paulo". In 99.999% of the cases, when somebody types São Paulo they are referring to my birth city and not the other tiny places around the world. The same goes for Paris, Tokyo and London. Each of those names identifies many places, but the user most likely means "Paris, France", "Tokyo, Japan" or "London, England".

 

    So, the only way to return the correct values in the correct order is to use some kind of importance factor. In my view, and in the context of pictures/blogs, the things that make a place more likely to be what the user is looking for are:

  • The population of a location: the higher the population, the more likely it is to be a match;
  • The size of the location (as in how many square kilometers of area) -- think of a national park or a river, which has no population;
  • The number of tourists per year -- this is interesting because a lot of tourist hot spots have small populations, like Seychelles, Cannes, Venice, Angra dos Reis, etc.
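    The three factors above could be combined into one score, sketched here in Python. Everything in it is an assumption made for illustration: the Candidate fields, the weights, and the choice of a log scale (so a huge city does not drown out the other two factors) are not what Sampa actually does. Missing data simply contributes zero, and the population figures are made-up placeholders.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    place_id: int                      # hypothetical unique identifier
    name: str
    country: str
    population: Optional[int] = None
    area_km2: Optional[float] = None
    tourists_per_year: Optional[int] = None


def importance(c, w_pop=1.0, w_area=0.5, w_tour=0.8):
    """Weighted sum of log-scaled factors; the weights are arbitrary guesses."""
    def log_or_zero(value):
        # Unknown (None) or zero values add nothing to the score.
        return math.log10(value + 1) if value else 0.0

    return (w_pop * log_or_zero(c.population)
            + w_area * log_or_zero(c.area_km2)
            + w_tour * log_or_zero(c.tourists_per_year))


def rank_by_name(candidates, name):
    """Return all places with this name, most important first."""
    wanted = name.casefold()
    matches = [c for c in candidates if c.name.casefold() == wanted]
    return sorted(matches, key=importance, reverse=True)


# Placeholder numbers, not real statistics.
CANDIDATES = [
    Candidate(2, "São Paulo", "elsewhere", population=2_000),
    Candidate(1, "São Paulo", "Brazil", population=10_000_000),
    Candidate(3, "Paris", "France", population=2_000_000),
]

results = rank_by_name(CANDIDATES, "são paulo")
print([c.place_id for c in results])
```

    With these placeholder values the big city sorts ahead of the small namesake, which is the behavior I am after for result #1.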

    To make the ranking more reliable, I managed to get the population of about 30,000 cities. But I'm still not happy, because my database has about 3 million locations.

 

    How do I get the size of a location? That is probably the hardest piece of data to find for a geographic database.

 

    The number of tourists could be extrapolated from the number of hotels and/or the number of hotel rooms in that place. But where do I get this data from?

 

    I just got the database working and the code in place, and searching for "São Paulo" returned the big city as result #1, which is what I needed. I feel a bit relieved that the system works the way I designed it. I spent way too much time on this, when I should have scoped the feature down to make it simpler. Two weeks are gone and just one feature got implemented. Very frustrating.

 

 

 

 

 


