OpenStreetMap (http://www.openstreetmap.org) foundation builds a free and editable map of the world, enabling the development of freely-reusable geospatial data. The data from OpenStreetMap is being used by many applications such as GoogleMaps, Foursquare, and Craigslist.
OpenStreetMap data structure has four core elements: Nodes, Ways, Tags, and Relations.
Nodes are points with a geographic position stored as lon (longitude) and lan (latitude).
Ways are ordered list of nodes, representing a polyline (can be a closed loop). They depict entities like streets, rivers, parks, and lakes.
Relations are ordered list of nodes, ways and relations (called 'members') to show the relation between nodes and ways such as restrictions on roads.
Tags are pairs of Keys and Values. They are used to store metadata of the map such as address, type of building, or any other physical property. Tags are always attached to a node, way or relation and are not stand-alone elements.
Users can add points on the map, create relations of the nodes and ways, and assign properties such as address or type to them. The data can be stored in OSM format and can be accessed in different formats. For this project, OSM XML format is used.
In this project, I have worked with the raw data from an area. Since the data is put by different users, I suppose it can be quite messy; therefore, cleaning and auditing functions to make the data look clean for analysis. I then export the data into CSV format and use this format to create tables in my SQLite database. Then I run queries on the database to extract information such as the number of nodes and ways, most contributing users, top points of interest, etc. The report is concluded by discussing benefits as well as some anticipated problems in implementing the improvement.
The reason I chose San Francisco as the area of choice for this study is that it is where the Silicon Valley is; a place I want to work in someday.
The raw data was locally downloaded from – https://mapzen.com/data/metro-extracts/metro/san-francisco_california/. The 'metro extracts' provides the map of the metropolitan area (i.e., contains more elements to work with). The original file is about 1.42 GB in size. Hence, a sample file of about 20 MB is used to perform the initial analysis. Post this, after a satisfactory code is written, it is run on the original files to create the CSV files for my database.
The data analyzed in this Jupyter notebook is from the sample file to be able to show shorter results. I have included all the functions here, as well as the function to create CSV files, in separate .py files in the accompanying directory/repository, which can be run on the original dataset.
The detailed case study includes information about all snippets of codes, as well as outputs from the query. You can find the complete report from this repository from CaseStudy for OpenStreetMap.
This notebook (Report.ipynb
) gives a complete explanation regarding the project, the code, and queries.
audit.py
: This script contains the auditing and cleaning functions. This file is a dependency of shaping_csv.py
.shaping_csv.py
: The main script which created the CSV files from the OSM data.creating_database.py
: This script creates databases and tables, then insert the relevant data.query_on_database.py
: This script contains all the SQL commands.sample.osm
: This file contains a sample data from the original San Francisco database.sampler.py
: This scripts sampels the original OSM file by taking every 'k' observations.Report.html
: Export of this report from Jupyter notebook to HTML.This section shows the data exploration i.e. to check how the data is structured, what are problems and how they can be cleaned. This is done by first looking into the dataset and parse through using ElementTree
and extract information such as different types of elements (nodes, ways, etc.) in the OSM XML file.
Since the original file is too large for processing, the file can be parsed using ET.iterparse
(i.e. iterative parsing) as it will parse the file as it builds it – http://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy).
In the following code,
import xml.etree.cElementTree as ET
import pprint
OSMFILE = 'sample.osm'
def count_tags(filename):
tags= {}
for event, elem in ET.iterparse(filename):
if elem.tag not in tags.keys():
tags[elem.tag] = 1
else:
tags[elem.tag] += 1
pprint.pprint(tags)
count_tags(OSMFILE)
In the OSM XML file, the 'tag' element has key-value pairs which contain information about different points (nodes) or ways in the map. I parse through this element using the following regular expressions:
Lower - ^([a-z]|_)*$
: Matches strings that contain only lower case characters. It also covers the underscore '_' character.
Lower Colon - ^([a-z]|_)*:([a-z]|_)*$
: Matches strings which contain lower case characters but also have the colon ':' character.
Problem Char - [=\+/&<>;\'"\?%#$@\,\. \t\r\n]
: Matches tags with problematic characters specified in the regex pattern.
Take this section of the map as an example:
<node changeset="30175357" id="358830340"
lat="37.6504905" lon="-122.4896963"
timestamp="2015-04-12T22:43:37Z"
uid="35667" user="encleadus" version="4">
<tag k="name" v="Ocean Shore School" />
<tag k="phone" v="+1 650 738 6650" />
<tag k="amenity" v="school" />
<tag k="website" v="http://www.oceanshoreschool.org/" />
<tag k="addr:city" v="Pacifica" />
<tag k="addr:state" v="CA" />
<tag k="addr:street" v="Oceana Boulevard" />
<tag k="gnis:created" v="04/06/1998" />
<tag k="addr:postcode" v="94044" />
<tag k="gnis:state_id" v="06" />
<tag k="gnis:county_id" v="081" />
<tag k="gnis:feature_id" v="1785657" />
<tag k="addr:housenumber" v="411" />
</node>
This node tag has 13 tag elements inside it. There are multiple keys that have the ':' character in them, so they fall under the 'Lower_Colon' regular expression. Keys like name, phone, and amenity will fall under the 'lower' regular expression. There are no problematic characters in this specific node.
import xml.etree.cElementTree as ET
import pprint
import re
OSMFILE = 'sample.osm'
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
def key_type(element, keys):
if element.tag == "tag":
for tag in element.iter('tag'): # Iterating through the tag elements.
k = element.attrib['k'] # Looking for the tag attribute 'k' which contains the keys.
if re.search(lower, k):
keys['lower'] += 1
elif re.search(lower_colon, k):
keys['lower_colon'] += 1
elif re.search(problem_chars, k):
keys['problem_chars'] += 1
else:
keys['other'] += 1
return keys
def process_map(filename):
keys = {"lower": 0, "lower_colon": 0, "problem_chars": 0, "other": 0}
for _, element in ET.iterparse(filename):
keys = key_type(element, keys)
pprint.pprint(keys)
process_map(OSMFILE)
Now, the number of unique users that have contributed to the OpenStreetMap data for San Francisco have to be calculated. To find the users, we need to look through the attributes of the node, way and relation tags. The 'uid' attribute is what we need to count.
`<node changeset="30175357"
id="358830340" lat="37.6504905" lon="-122.4896963"
timestamp="2015-04-12T22:43:37Z" uid="35667" user="encleadus" version="4">`
OSMFILE = 'sample.osm'
def process_map(filename):
users = set()
for _, element in ET.iterparse(filename):
if element.tag == 'node' or element.tag == 'way' or element.tag == 'relation':
userid = element.attrib['uid']
users.add(userid)
print(len(users))
process_map(OSMFILE)
Since, many users are entering data in OpenStreetMap, the way they represent the formatting of streets can vary. For example, the street type 'Avenue' can be written in formats such as:
To be able to process the data, we need to make these street types uniform. In case we are later searching for specific Avenue names, we can do a quick search on all street types that have the word 'Avenue' in them, and we can make sure that we are not missing anything with abbreviations of Avenue.
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint
OSMFILE = "sample.osm"
"""This regular expression matches letters without any white space with zero to one '.'
Extract a string which might or might not have the '.' character in it."""
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)
# List of expected street types.
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road",
"Trail", "Parkway", "Commons"]
def audit_street_type(street_types, street_name):
"""This function will get the list of street types and using the regular expression,
compare them to the expected list. If they do not match the names in the expected list,
it adds it to the street_types dictionary.
"""
m = street_type_re.search(street_name)
if m:
street_type = m.group()
if street_type not in expected:
street_types[street_type].add(street_name)
def is_street_name(elem):
"""This unction will get the elements in the file (i.e. the tag element) and
return the attributes in that element for which their key is equal to 'addr:street'.
"""
return (elem.attrib['k'] == "addr:street")
def audit_street(osmfile):
"""This function uses iterative parsing to go through the XML file,
parse node and way elements, and iterate through their tag element.
It will then call the 'audit_street_type' function to add the value attribute
of the tag (i.e. the street name) to it.
"""
osm_file = open(osmfile, "r")
street_types = defaultdict(set)
# Parses the XML file.
for event, elem in ET.iterparse(osm_file, events=("start",)):
# Iterate through the 'tag' element of node and way elements.
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if is_street_name(tag):
audit_street_type(street_types, tag.attrib['v'])
osm_file.close()
return street_types
street_types = audit_street(OSMFILE)
pprint.pprint(dict(street_types))
Going through the street name list, I use it to update the 'mapping' list. In this list, I mention the format of the street type that was found in the file (left) and specify to what format it needs to be changed (right). The dictionary containing the abbreviated street types do not cover all the different street types but covers a large number of them. I go through this list and see which ones can be changed. For instance:
{'Highway': Great Highway'}
needs no changes. {'Blvd': 'Newark Blvd'}
needs to be changed from Blvd to 'Boulevard.# The list of dictionaries, containing street types that need to be changed to match the expected list.
mapping = { "St": "Street",
"St.": "Street",
"street": "Street",
"Ave": "Avenue",
"Ave.": "Avenue",
"AVE": "Avenue,",
"avenue": "Avenue",
"Rd.": "Road",
"Rd": "Road",
"road": "Road",
"Blvd": "Boulevard",
"Blvd.": "Boulevard",
"Blvd,": "Boulevard",
"boulevard": "Boulevard",
"broadway": "Broadway",
"square": "Square",
"way": "Way",
"Dr.": "Drive",
"Dr": "Drive",
"ct": "Court",
"Ct": "Court",
"court": "Court",
"Sq": "Square",
"square": "Square",
"cres": "Crescent",
"Cres": "Crescent",
"Ctr": "Center",
"Hwy": "Highway",
"hwy": "Highway",
"Ln": "Lane",
"Ln.": "Lane",
"parkway": "Parkway"
}
def update_name(name, mapping):
"""This function takes the street name and split it at the space character.
In case, it finds a string that matches any key in the mapping, it replaces it with
the format that has been specified for it.
e.g. When the function finds 'Blvd' in "Newark Blvd", it goes through mapping and maps
it to 'Boulevard', and the final street name will come out as 'Newark Boulevard'.
"""
output = list()
parts = name.split(" ")
for part in parts:
if part in mapping:
output.append(mapping[part])
else:
output.append(part)
return " ".join(output)
# Printing the changes made in street names.
for st_type, ways in street_types.items():
for name in ways:
better_name = update_name(name, mapping)
print(name, "→", better_name)
Postcodes are another inconsistent type of data that is entered into the map. The inconsistency is either in how they are represented (with the city abbreviation or without) or how long they are. The theory behind auditing Postcodes is the same as auditing street names.
OSMFILE = 'sample.osm'
def dicti(data, item):
"""This function creates a dictionary postcodes can be held.
The dictionary key will be the postcode itself and the dictionary value
will be the number of times that postcode was repeated throughout the map."""
data[item] += 1
def get_postcode(elem):
"""This function takes the 'tag' element as an input and
return the elements for which the keys are equal to 'addr:postcode'"""
return (elem.attrib['k'] == "addr:postcode")
def audit(osmfile):
"""This function parses the XML file and iterates through node and
way elements. It extracts the value attribute (i.e. the postcode) and
add it to the 'dicti' dictionary."""
osm_file = open(osmfile, "r")
data = defaultdict(int)
# Parsing the XML file
for event, elem in ET.iterparse(osm_file, events=("start",)):
# Iterating through node and way elements.
if elem.tag == "node" or elem.tag == "way":
for tag in elem.iter("tag"):
if get_postcode(tag):
dicti(data, tag.attrib['v'])
return data
postcodes = audit(OSMFILE)
pprint.pprint(dict(postcodes))
The output shows that the postcode is in a 5-digit format.
After running the code on the original file, I also found other postcode formats. To deal with these other formats observed, I divide them into different categories:
The first category includes the ones:
The second category includes the ones:
The third category includes the ones:
For the first category, I use a regular expression to extract only 5 digits from the pattern. This regex asserts position at the start of the string (^
) and matches any character that is NOT a digit (\D*
). The (\d{5}
) matches a digit exactly 5 times. In case the postcode starts with letters (e.g., CA 12345), it gives two groups of output: 'CA' and '12345'. Depending on which one is needed, the preferred group can be chosen – ^\D*(\d{5}).*
.
For the second category, I use another regular expression to extract the first 5-digits. This regex matches digits 5 times, is followed by a '-,' and then matching digits exactly 4 times – ^(\d{5})-\d{4}$
.
For the third category, having postcodes which have shorter or longer than 5-digit length, means that they are not valid. To clean them up, I replace those postcodes with '00000'. I use regular expressions to be able to find the ones that are exactly 6-digit long – ^\d{6}$
.
def update_postcode(digit):
"""Makes use of different conditions in the function to match the
postcodes in the 3 categories that can be found for postal codes."""
output = list()
first_category = re.compile(r'^\D*(\d{5}$)', re.IGNORECASE)
second_category = re.compile('^(\d{5})-\d{4}$')
third_category = re.compile('^\d{6}$')
if re.search(first_category, digit):
new_digit = re.search(first_category, digit).group(1)
output.append(new_digit)
elif re.search(second_category, digit):
new_digit = re.search(second_category, digit).group(1)
output.append(new_digit)
elif re.search(third_category, digit):
third_output = third_category.search(digit)
new_digit = '00000'
output.append('00000')
# This condition matches the third category for any other types.
elif digit == 'CA' or len(digit) < 5:
new_digit = '00000'
output.append(new_digit)
return ', '.join(str(x) for x in output)
for postcode, nums in postcodes.items():
better_code = update_postcode(postcode)
print(postcode, "→", better_code)
To load the data to the SQLite database, they need to be converted from the XML file to CSV files. I create multiple CSV files, and later create the corresponding tables in my database based on them.
The CSV files I want to have are:
node.csv
node_tags.csv
way.csv
way_tags.csv
way_nodes.csv
Each of these CSV files contains different columns and stores data based on those columns. The columns used in the CSV files will be the table columns in the database. This is the schema:
To create these files, I will parse the 'node' and 'way' tags and extract the tags inside them. The shape_element
function takes an iterparse element object and returns a dictionary. Depending on whether the element is 'node' or 'way', the dictionary looks different.
Here's an example from the 'node' element in the XML file:
<node changeset="27772228" id="358830414"
lat="37.6668652" lon="-122.4895243" timestamp="2014-12-29T09:43:14Z"
uid="14293" user="KindredCoda" version="2">
<tag k="ele" v="118" />
<tag k="name" v="Longview Park" />
<tag k="leisure" v="park" />
<tag k="gnis:created" v="04/06/1998" />
<tag k="gnis:state_id" v="06" />
<tag k="gnis:county_id" v="081" />
<tag k="gnis:feature_id" v="1785701" />
</node>
The dictionary returns the format {"node": .., "node_tags": ...}
The node
field holds a dictionary of the following top level node attributes: id, user, uid, version, lat, lon, timestamp, changeset
. All the other attributes are ignored.
The node_tags
field holds a list of dictionaries, one per secondary tag (child tags of node, which have the tag name/type: "tag"). Each dictionary has the following fields and properties:
id:
the top level node_id attribute value. For example, node['id']='358830414'
in the above sample.
key:
the full tag "k" attribute value if no colon is present or the characters after the colon if there is one. For example, k="name"
.
value:
the tag "v" attribute value. For example, v="Longview Park"
is the value for the key k="name"
.
type:
either the characters before the colon in the tag "k" value or "regular" if a colon is not present. For k="name"
, the type would be 'regular'.
If the tag "k" value contains problematic characters, the tag should be ignored.
If the tag "k" value contains a ":", the characters before the ":" should be set as the tag type and characters after the ":" should be set as the tag key. For example, in <tag k="gnis:county_id">
, the tag['type']='gnis' and tag['key']='county_id'
.
If there are additional ":" in the "k" value, they should be ignored and kept as part of the tag key. For example, {'id':12345, 'key':'street:name', 'value':'Lincoln', 'type':'addr'}.
If a node has no secondary tags, then node_tags
field should just contain an empty list.
The dictionary has the format {"way": ..., "way_tags": ..., "way_nodes": ...}
The way
field should hold a dictionary of the following top level way attributes id, user, uid, version, timestamp, changeset
.
The way_tags
field again holds a list of dictionaries, following the exact same rules as for node_tags
.
Additionally, the dictionary has a field way_nodes
. way_nodes
holds a list of dictionaries. Each dictionary has the fields:
id:
the top level element (way) id.node_id:
the ref attribute value of the nd tag.position:
the index starting at 0 of the nd tag i.e. what order the nd tag appears within the way element.To write the data into CSV files, shape_element
function which makes use of update_name
and update_postcode
functions to clean the street names and postcodes before they are inserted into the CSV files. Each of the function is called twice, once for nodes and the other for ways.
One issue I noticed while I was validating my CSV file against the expected schema, I found out that some uid
values are missing from the data. Although according to the best practice of OpenStreetMap, all user's information should be written while submitting data, they are not; thus, causing validation to throw errors. To fix this issue, dummy values have been entered.
The same problem with some empty fields are relatable to the 'k' attribute as well; meaning users did not add information for all the attributes in node or way. To overcome this problem, a conditional statement to set the ['type']
and ['key']
to 'regular' in case the field is empty was implemented. Else, it uses the regular expressions to find corresponding patterns.
After the CSV files are created (by running the shaping_csv.py file), the next step is to create the database and insert the information from those CSV files to their corresponding tables.
I created the database called openstreetmap_sf_db,
and I created tables with columns based on the columns from the CSV files, and inserted the data from the CSV files to the corresponding tables in the database. The creating_database.py
file takes care of creating the tables and inserting data in them.
After the tables are created, I can now start investigating them and getting queries on them.
The sample code below shows how I have created the tables in the database; where tables_name
, column_name
and filename.csv
are replaced according to the table that is to be inserted. First, I connect to the SQLite file and make sure to check that the table I want to create is not already created. Using cur.execute
, I execute commands to the database in Python. After creating the table, I inserted the data from the CSV file into it. I did this process for every table I wanted to create in the database.
import sqlite3
import csv
from pprint import pprint
# Put the path to your sqlite database. If no database is available, a new one will be created.
sqlite_file = 'openstreetmap_sf_db.sqlite'
# Connecting to the database.
conn = sqlite3.connect(sqlite_file)
cur = conn.cursor()
Now that I have audited and cleaned the data and transfered everything into table in my database, I can start running queries on it.
The queries help find the following.
def number_of_nodes():
output = cur.execute('SELECT COUNT(*) FROM nodes')
return output.fetchone()[0]
print('Number of nodes: \n' , number_of_nodes())
def number_of_ways():
output = cur.execute('SELECT COUNT(*) FROM ways')
return output.fetchone()[0]
print('Number of ways: \n', number_of_ways())
def number_of_unique_users():
output = cur.execute('SELECT COUNT(DISTINCT e.uid) FROM \
(SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e')
return output.fetchone()[0]
print('Number of unique users: \n' , number_of_unique_users())
def most_contributing_users():
output = cur.execute('SELECT e.user, COUNT(*) as num FROM \
(SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
GROUP BY e.user \
ORDER BY num DESC \
LIMIT 10 ')
print("Most contributing users: \n")
pprint(output.fetchall())
return None
most_contributing_users()
def number_of_users_contributed_once():
output = cur.execute('SELECT COUNT(*) FROM \
(SELECT e.user, COUNT(*) as num FROM \
(SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
GROUP BY e.user \
HAVING num = 1) u')
return output.fetchone()[0]
print('Number of users who have contributed once: \n', number_of_users_contributed_once())
query = "SELECT value, COUNT(*) as num FROM nodes_tags \
WHERE key=\"b'amenity'\" \
GROUP BY value \
ORDER BY num DESC \
LIMIT 20"
def top_ten_amenities_in_sf():
output = cur.execute(query)
pprint(output.fetchall())
return None
print('Top 10 Amenities:\n')
top_ten_amenities_in_sf()
query = "SELECT value, COUNT(*) as num FROM ways_tags \
WHERE key=\"b'cuisine'\" \
GROUP BY value \
ORDER BY num DESC \
LIMIT 10"
def cuisines_in_sf():
output = cur.execute(query)
pprint(output.fetchall())
return None
print('Top 10 Cuisines in San Francisco:\n')
cuisines_in_sf()
San Francisco is known for its diverse culture and this reflects in the popularity of international cuisines.
query = "SELECT value, COUNT(*) as num FROM nodes_tags \
WHERE key=\"b'shop'\" \
GROUP BY value \
ORDER BY num DESC \
LIMIT 10"
def shops_in_sf():
output = cur.execute(query)
pprint(output.fetchall())
return None
print('Different types of shops:\n')
shops_in_sf()
query = "SELECT nodes_tags.value, COUNT(*) as num \
FROM nodes_tags \
JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value=\"b'coffee_shop'\") AS cafes \
ON nodes_tags.id = cafes.id \
WHERE nodes_tags.key=\"b'name'\"\
GROUP BY nodes_tags.value \
ORDER BY num DESC \
LIMIT 10"
def most_popular_cafes():
output = cur.execute(query)
pprint(output.fetchall())
return None
print('Most popular cafes in San Francisco: \n')
most_popular_cafes()
It is not surprising that Starbucks in the most popular brand in the US. There are, however, some mistakes in the names. For example 'Starbuck' and 'Starbucks Coffee' are the same thing with different names. This is another potential data wrangling task (left out in this project).
Data wrangling for this project has been a time-consuming and complicated due to many inconsistencies in the data. I could spot only a few of those problems and clean them up, but I am sure there are many issues that went unnoticed. The primary reason behind these inconsistencies is human error.
A lot of the issues related to the OpenStreetMap data can be prevented or corrected. For this, the following summary highlighting the issue and a proposed solution with its pros & cons are displayed.
This issue was noticed while creating the CSV files and trying to validate them against the correct schema. I had received an error citing that the uid
field is empty. This was surprising as for a data made by multiple users; a uid
is a highly important attribute.
uid
mandatory for contributors. uid
can be then used a primary key.While auditing the postcodes, there were a lot of invalid entries in the dataset. The standard postcode format was in the form of a 5-digit code, with no letters or other characters; however, there were formats with the state abbreviation before the digit (still acceptable). There were a high number of entries in the format where the code was 6-digit long or shorter than 5-digits long or only 'CA.'
Gamification refers to the application of typical elements of game-playing such as competition, scores, etc. The suggestion is to incorporate gamification into the usage of methods such as 'top contributors,' 'reward points' for users contributing to the OpenMapStreet Data.
The inspiration for the idea: https://www.facebook.com/editor/ - the competitive scoring system has often motivated me to add more to the data to stand higher rated than other people on my friend-list.
The idea is to have scripts that can pre-clean the most frequent types of errors that are made by the users. e.g., some users might enter the pin code in an address field and not separate as a postal code. This data can be extracted to some extent using these scripts.
This can also be done using cross-validating/cross-referencing to other databases such as Google API.
As noted in the query which finds 'Starbucks' to be the most popular restaurant in San Francisco, there were many spelling errors, and multiple entries were made for the same place. With techniques like matching two strings, users can contribute by answering 'yes/no' queries related to validation.
The OSM homepage is raw and functional but looks dated in comparison to most sites of the same scale. This might cause a lot of new users not to trust a site that looks unmaintained or hackish. Hence, some changes should be maintained on the website to make it look better (similar to https://www.facebook.com/editor/).