OpenStreetMap: Data Wrangling

The OpenStreetMap Foundation (http://www.openstreetmap.org) builds a free, editable map of the world, enabling the development of freely reusable geospatial data. OpenStreetMap data is used by many applications, such as Foursquare and Craigslist.

OpenStreetMap data structure has four core elements: Nodes, Ways, Tags, and Relations.

  • Nodes are points with a geographic position stored as lon (longitude) and lat (latitude).

  • Ways are ordered lists of nodes, representing a polyline (which can be a closed loop). They depict entities like streets, rivers, parks, and lakes.

  • Relations are ordered lists of nodes, ways, and other relations (called 'members') that describe relationships between elements, such as turn restrictions on roads.

  • Tags are pairs of keys and values. They store metadata about map elements, such as an address, the type of a building, or any other physical property. Tags are always attached to a node, way, or relation and are not stand-alone elements.

Users can add points to the map, create relations between nodes and ways, and assign properties such as an address or type to them. The data is stored in OSM format and can be accessed in different formats. For this project, the OSM XML format is used.

In this project, I have worked with the raw data from a chosen area. Since the data is entered by many different users, it can be quite messy; therefore, I write auditing and cleaning functions to prepare it for analysis. I then export the data to CSV format and use these files to create tables in an SQLite database. Finally, I run queries on the database to extract information such as the number of nodes and ways, the most active contributors, and the top points of interest. The report concludes by discussing the benefits, as well as some anticipated problems, of implementing the suggested improvements.

Area Chosen for Analysis: San Francisco, California

The reason I chose San Francisco as the area for this study is its proximity to Silicon Valley, a place I hope to work in someday.

The raw data was downloaded locally from https://mapzen.com/data/metro-extracts/metro/san-francisco_california/. The 'metro extracts' provide the map of the whole metropolitan area (i.e., they contain more elements to work with). The original file is about 1.42 GB in size, so a sample file of about 20 MB is used for the initial analysis. Once the code works satisfactorily on the sample, it is run on the original file to create the CSV files for the database.

The data analyzed in this Jupyter notebook comes from the sample file so that the output stays short. All the functions shown here, as well as the function that creates the CSV files, are also included as separate .py files in the accompanying directory/repository, where they can be run on the original dataset.

How the Project Proceeds

  1. Exploring the Data
  2. Auditing Street Names
  3. Auditing Postcodes
  4. Preparing the Data for the Database
  5. Creating Tables in the Database
  6. Running Queries on the Data
  7. Conclusions

The detailed case study includes all code snippets as well as the query outputs. You can find the complete report in this repository under CaseStudy for OpenStreetMap.

The Files in this Directory

This notebook (Report.ipynb) gives a complete explanation regarding the project, the code, and queries.

  1. audit.py: This script contains the auditing and cleaning functions. This file is a dependency of shaping_csv.py.
  2. shaping_csv.py: The main script, which creates the CSV files from the OSM data.
  3. creating_database.py: This script creates the database and tables, then inserts the relevant data.
  4. query_on_database.py: This script contains all the SQL commands.
  5. sample.osm: This file contains sample data from the original San Francisco dataset.
  6. sampler.py: This script samples the original OSM file by taking every k-th top-level element.
  7. Report.html: Export of this report from Jupyter notebook to HTML.

Exploring the Data

This section explores the data: how it is structured, what the problems are, and how they can be cleaned. I start by parsing the dataset with ElementTree and extracting information such as the counts of the different element types (nodes, ways, etc.) in the OSM XML file.

Since the original file is too large to load into memory at once, it is parsed with ET.iterparse (i.e. iterative parsing), which processes the file incrementally as it is read – http://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy.

In the following code,

  • I will iterate through different tags in the XML.
  • Count them and put them in a dictionary with the key being the tag type.
In [67]:
import xml.etree.cElementTree as ET
import pprint

OSMFILE = 'sample.osm'

def count_tags(filename):
    tags= {}
    for event, elem in ET.iterparse(filename):
        if elem.tag not in tags.keys():
            tags[elem.tag] = 1
        else:
            tags[elem.tag] += 1
    
    pprint.pprint(tags)
    
count_tags(OSMFILE)
{'member': 910,
 'nd': 105916,
 'node': 89220,
 'osm': 1,
 'relation': 110,
 'tag': 27721,
 'way': 11113}
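
Note that on the full 1.42 GB file even iterparse will keep the growing element tree in memory unless elements are cleared after use. A minimal sketch of a more memory-friendly variant (my own illustration, not part of the project scripts) could look like this:

import xml.etree.cElementTree as ET

def count_tags_low_memory(filename):
    """Count tag types while discarding each element once it has been seen."""
    tags = {}
    for _, elem in ET.iterparse(filename):  # 'end' events by default
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
        elem.clear()  # free the element's children to keep memory usage flat
    return tags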

In the OSM XML file, the 'tag' element has key-value pairs which contain information about different points (nodes) or ways in the map. I parse through this element using the following regular expressions:

  • Lower - ^([a-z]|_)*$: Matches strings that contain only lower case characters. It also covers the underscore '_' character.

  • Lower Colon - ^([a-z]|_)*:([a-z]|_)*$: Matches strings which contain lower case characters but also have the colon ':' character.

  • Problem Char - [=\+/&<>;\'"\?%#$@\,\. \t\r\n]: Matches tags with problematic characters specified in the regex pattern.

Take this section of the map as an example:

    <node changeset="30175357" id="358830340" 
        lat="37.6504905" lon="-122.4896963" 
        timestamp="2015-04-12T22:43:37Z" 
        uid="35667" user="encleadus" version="4">

        <tag k="name" v="Ocean Shore School" />
        <tag k="phone" v="+1 650 738 6650" />
        <tag k="amenity" v="school" />
        <tag k="website" v="http://www.oceanshoreschool.org/" />
        <tag k="addr:city" v="Pacifica" />
        <tag k="addr:state" v="CA" />
        <tag k="addr:street" v="Oceana Boulevard" />
        <tag k="gnis:created" v="04/06/1998" />
        <tag k="addr:postcode" v="94044" />
        <tag k="gnis:state_id" v="06" />
        <tag k="gnis:county_id" v="081" />
        <tag k="gnis:feature_id" v="1785657" />
        <tag k="addr:housenumber" v="411" />
    </node>

This node tag has 13 tag elements inside it. There are multiple keys that have the ':' character in them, so they fall under the 'Lower_Colon' regular expression. Keys like name, phone, and amenity will fall under the 'lower' regular expression. There are no problematic characters in this specific node.

In [5]:
import xml.etree.cElementTree as ET
import pprint
import re

OSMFILE = 'sample.osm'

lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problem_chars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')


def key_type(element, keys):
    if element.tag == "tag":
        k = element.attrib['k'] # The tag attribute 'k' contains the key.
        if lower.search(k):
            keys['lower'] += 1
        elif lower_colon.search(k):
            keys['lower_colon'] += 1
        elif problem_chars.search(k):
            keys['problem_chars'] += 1
        else:
            keys['other'] += 1
    return keys

def process_map(filename):
    keys = {"lower": 0, "lower_colon": 0, "problem_chars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)

    pprint.pprint(keys)
    
process_map(OSMFILE)
{'lower': 19501, 'lower_colon': 7857, 'other': 361, 'problem_chars': 2}

Next, the number of unique users who have contributed to the OpenStreetMap data for San Francisco is calculated. To find the users, we look through the attributes of the node, way, and relation elements; the 'uid' attribute is what we count.

`<node changeset="30175357" 
id="358830340" lat="37.6504905" lon="-122.4896963" 
timestamp="2015-04-12T22:43:37Z" uid="35667" user="encleadus" version="4">`
In [6]:
OSMFILE = 'sample.osm'

def process_map(filename):
    users = set()
    for _, element in ET.iterparse(filename):
        if element.tag in ('node', 'way', 'relation'):
            # Some elements may lack a 'uid' attribute, so skip those.
            userid = element.attrib.get('uid')
            if userid:
                users.add(userid)
    print(len(users))
    
process_map(OSMFILE)
928

Auditing the Data

Auditing Street Names

Since many users enter data into OpenStreetMap, the way they format street names can vary. For example, the street type 'Avenue' can be written in formats such as:

  • Avenue (starting with a capital letter)
  • avenue (starting with a small letter)
  • Ave. (with a .)
  • Ave (without a .)

To be able to process the data, we need to make these street types uniform. If we later search for specific avenue names, we can then query all street names containing the word 'Avenue' and be sure that we are not missing any abbreviated variants.

In [19]:
import xml.etree.cElementTree as ET
from collections import defaultdict
import re
import pprint

OSMFILE = "sample.osm"

"""This regular expression matches letters without any white space with zero to one '.'
Extract a string which might or might not have the '.' character in it."""
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# List of expected street types.
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place", "Square", "Lane", "Road", 
            "Trail", "Parkway", "Commons"]

def audit_street_type(street_types, street_name):
    """This function extracts the street type from the street name using the 
    regular expression and compares it to the expected list. If it does not match 
    any name in the expected list, the street name is added to the street_types dictionary.
    """
    m = street_type_re.search(street_name)
    if m:
        street_type = m.group()
        if street_type not in expected:
            street_types[street_type].add(street_name)
            
def is_street_name(elem):
    """This function takes a tag element and returns True if the 
    element's key ('k' attribute) is equal to 'addr:street'. 
    """
    return (elem.attrib['k'] == "addr:street")


def audit_street(osmfile):
    """This function uses iterative parsing to go through the XML file, 
    parse node and way elements, and iterate through their tag element. 
    It will then call the 'audit_street_type' function to add the value attribute 
    of the tag (i.e. the street name) to it. 
    """
    osm_file = open(osmfile, "r")
    street_types = defaultdict(set)
    
    # Parses the XML file.
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        # Iterate through the 'tag' element of node and way elements.
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if is_street_name(tag):
                    audit_street_type(street_types, tag.attrib['v'])
    osm_file.close()
    return street_types

street_types = audit_street(OSMFILE)
pprint.pprint(dict(street_types))
{'2': {'San Francisco Bicycle Route 2'},
 'Alameda': {'Alameda'},
 'Blvd': {'Newark Blvd'},
 'Broadway': {'Broadway'},
 'Center': {'Westlake Center'},
 'Circle': {'Columbia Circle', 'Blossom Circle', 'Inner Circle'},
 'Clemente': {'San Clemente'},
 'Cres': {'Wellesley Cres'},
 'Gardens': {'Wildwood Gardens'},
 'Highway': {'Great Highway'},
 'Ora': {'Avenue Del Ora'},
 'Real': {'North el Camino Real', 'El Camino Real'},
 'Terrace': {'Hawthorne Terrace'},
 'Way': {'Camberly Way',
         'Cambridge Way',
         'Chelsea Way',
         'Facebook Way',
         'Glenn Way',
         'Hansen Way',
         'Industrial Way',
         'Lenox Way',
         'Liberty Ship Way',
         'Lincoln Way',
         'Marina Way',
         'Martin Luther King Jr Way',
         'Olympia Way',
         'Park Way',
         'Ranleigh Way',
         'San Benito Way',
         'Sanchez Way',
         'Sussex Way',
         'Sylvan Way',
         'Tum Suden Way',
         'Wessex Way'}}

Going through the street name list, I use it to update the 'mapping' dictionary. In this dictionary, the street-type format found in the file is on the left, and the format it should be changed to is on the right. The dictionary of abbreviated street types does not cover every possible variant, but it covers a large number of them. I go through this list and see which ones can be changed. For instance:

  • {'Highway': 'Great Highway'} needs no changes.
  • {'Blvd': 'Newark Blvd'} needs to be changed from 'Blvd' to 'Boulevard'.
In [20]:
# The list of dictionaries, containing street types that need to be changed to match the expected list.
mapping = { "St": "Street",
            "St.": "Street",
            "street": "Street",
            "Ave": "Avenue",
            "Ave.": "Avenue",
            "AVE": "Avenue,",
            "avenue": "Avenue",
            "Rd.": "Road",
            "Rd": "Road",
            "road": "Road",
            "Blvd": "Boulevard",
            "Blvd.": "Boulevard",
            "Blvd,": "Boulevard",
            "boulevard": "Boulevard",
            "broadway": "Broadway",
            "square": "Square",
            "way": "Way",
            "Dr.": "Drive",
            "Dr": "Drive",
            "ct": "Court",
            "Ct": "Court",
            "court": "Court",
            "Sq": "Square",
            "square": "Square",
            "cres": "Crescent",
            "Cres": "Crescent",
            "Ctr": "Center",
            "Hwy": "Highway",
            "hwy": "Highway",
            "Ln": "Lane",
            "Ln.": "Lane",
            "parkway": "Parkway"
            }

def update_name(name, mapping):
    """This function takes the street name and splits it at the space character. 
    If it finds a part that matches a key in the mapping, it replaces it with 
    the format specified for that key. 
    
    e.g. When the function finds 'Blvd' in 'Newark Blvd', it looks it up in mapping, 
    maps it to 'Boulevard', and the final street name comes out as 'Newark Boulevard'.
    """
    output = list()
    parts = name.split(" ")
    for part in parts:
        if part in mapping:
            output.append(mapping[part])
        else:
            output.append(part)
    return " ".join(output)

# Printing the changes made in street names.
for st_type, ways in street_types.items():
    for name in ways:
        better_name = update_name(name, mapping)
        print(name, "→", better_name)
Columbia Circle → Columbia Circle
Blossom Circle → Blossom Circle
Inner Circle → Inner Circle
Hawthorne Terrace → Hawthorne Terrace
Sanchez Way → Sanchez Way
Park Way → Park Way
Glenn Way → Glenn Way
Chelsea Way → Chelsea Way
Wessex Way → Wessex Way
Ranleigh Way → Ranleigh Way
Sussex Way → Sussex Way
Liberty Ship Way → Liberty Ship Way
Facebook Way → Facebook Way
Camberly Way → Camberly Way
Lenox Way → Lenox Way
San Benito Way → San Benito Way
Lincoln Way → Lincoln Way
Martin Luther King Jr Way → Martin Luther King Jr Way
Cambridge Way → Cambridge Way
Hansen Way → Hansen Way
Sylvan Way → Sylvan Way
Tum Suden Way → Tum Suden Way
Marina Way → Marina Way
Olympia Way → Olympia Way
Industrial Way → Industrial Way
Alameda → Alameda
Westlake Center → Westlake Center
Broadway → Broadway
Newark Blvd → Newark Boulevard
San Francisco Bicycle Route 2 → San Francisco Bicycle Route 2
San Clemente → San Clemente
North el Camino Real → North el Camino Real
El Camino Real → El Camino Real
Wellesley Cres → Wellesley Crescent
Avenue Del Ora → Avenue Del Ora
Wildwood Gardens → Wildwood Gardens
Great Highway → Great Highway

Auditing Postcodes

Postcodes are another type of data that is entered inconsistently into the map. The inconsistency is either in how they are represented (with or without the state abbreviation) or in how long they are. The approach to auditing postcodes is the same as for street names.

In [28]:
OSMFILE = 'sample.osm'

def dicti(data, item):
    """This function counts postcodes in a dictionary. The dictionary key is the 
    postcode itself and the dictionary value is the number of times that postcode 
    appears throughout the map."""
    data[item] += 1
    
def get_postcode(elem):
    """This function takes a 'tag' element as input and returns True if its 
    key is equal to 'addr:postcode'."""
    return (elem.attrib['k'] == "addr:postcode")

def audit(osmfile):
    """This function parses the XML file and iterates through node and 
    way elements. It extracts the value attribute (i.e. the postcode) and 
    counts it via 'dicti'."""
    osm_file = open(osmfile, "r")
    data = defaultdict(int)
    # Parsing the XML file
    for event, elem in ET.iterparse(osm_file, events=("start",)):
        # Iterating through node and way elements.
        if elem.tag == "node" or elem.tag == "way":
            for tag in elem.iter("tag"):
                if get_postcode(tag):
                    dicti(data, tag.attrib['v'])
    return data

postcodes = audit(OSMFILE)
pprint.pprint(dict(postcodes))
{'94005': 1,
 '94010': 2,
 '94014': 1,
 '94015': 2,
 '94025': 1,
 '94061': 3,
 '94062': 1,
 '94063': 2,
 '94065': 1,
 '94102': 1,
 '94103': 18,
 '94104': 1,
 '94105': 2,
 '94107': 2,
 '94108': 1,
 '94109': 10,
 '94110': 2,
 '94111': 1,
 '94113': 1,
 '94114': 5,
 '94115': 1,
 '94116': 29,
 '94117': 22,
 '94118': 14,
 '94121': 6,
 '94122': 69,
 '94123': 1,
 '94124': 1,
 '94127': 10,
 '94131': 2,
 '94132': 1,
 '94133': 15,
 '94401': 1,
 '94501': 2,
 '94530': 2,
 '94541': 3,
 '94560': 1,
 '94578': 1,
 '94580': 1,
 '94587': 6,
 '94598': 1,
 '94602': 1,
 '94605': 1,
 '94606': 2,
 '94610': 14,
 '94611': 46,
 '94612': 4,
 '94618': 2,
 '94702': 1,
 '94703': 1,
 '94704': 2,
 '94706': 4,
 '94707': 2,
 '94709': 1,
 '94804': 1,
 '94805': 9,
 '94904': 1,
 '94941': 1,
 '94965': 1,
 '95498': 1}

The output from the sample shows that all postcodes are in the standard 5-digit format.

After running the code on the original file, I also found other postcode formats. To deal with them, I divide the postcodes into three categories:

  • The first category includes the ones:

    • Where the length equals to 5 (e.g., 12345).
    • Where the length is longer than 5, and they contain characters (e.g., CA 12345).

  • The second category includes the ones:

    • Where the length is longer than 5, and they are followed by a hyphen (e.g., 12345-6789).

  • The third category includes the ones:

    • Where the length is longer than 5, but are not followed by any hyphen (e.g., 123456).
    • Where the length is shorter than 5 (e.g., 1234, 515).
    • Where the postcode equals 'CA'.

For the first category, I use a regular expression to extract only the 5 digits. This regex anchors at the start of the string (^), skips any non-digit characters (\D*), and captures exactly five digits ((\d{5})). If the postcode starts with letters (e.g., CA 12345), the letters are skipped and only the 5-digit group is captured – ^\D*(\d{5})$.

For the second category, I use another regular expression to extract the first five digits. This regex captures five digits, followed by a hyphen and exactly four more digits – ^(\d{5})-\d{4}$.

For the third category, postcodes that are shorter or longer than 5 digits are not valid. To clean them up, I replace them with '00000'. A regular expression finds the ones that are exactly 6 digits long – ^\d{6}$.

In [30]:
def update_postcode(digit):
    """Makes use of different conditions to match the postcodes in the 
    3 categories that can be found for postal codes."""
    output = list()
    
    first_category = re.compile(r'^\D*(\d{5}$)', re.IGNORECASE)
    second_category = re.compile(r'^(\d{5})-\d{4}$')
    third_category = re.compile(r'^\d{6}$')
    
    if re.search(first_category, digit):
        new_digit = re.search(first_category, digit).group(1)
        output.append(new_digit)
        
    elif re.search(second_category, digit):
        new_digit = re.search(second_category, digit).group(1)
        output.append(new_digit)
    
    elif re.search(third_category, digit):
        # 6-digit codes are invalid; replace them with a placeholder.
        output.append('00000')
    
    # This condition matches the remaining third-category cases.
    elif digit == 'CA' or len(digit) < 5:
        output.append('00000')

    return ', '.join(str(x) for x in output)

for postcode, nums in postcodes.items():
    better_code = update_postcode(postcode)
    print(postcode, "→", better_code)
94541 → 94541
94110 → 94110
94116 → 94116
94587 → 94587
94612 → 94612
94065 → 94065
94704 → 94704
94103 → 94103
94606 → 94606
94061 → 94061
94107 → 94107
94010 → 94010
94115 → 94115
94132 → 94132
94114 → 94114
94702 → 94702
94618 → 94618
94109 → 94109
94122 → 94122
94127 → 94127
94105 → 94105
94117 → 94117
94118 → 94118
94121 → 94121
94108 → 94108
94706 → 94706
94904 → 94904
94015 → 94015
94941 → 94941
94804 → 94804
94560 → 94560
94965 → 94965
94062 → 94062
94104 → 94104
94113 → 94113
94063 → 94063
94133 → 94133
94111 → 94111
94123 → 94123
94131 → 94131
95498 → 95498
94501 → 94501
94602 → 94602
94610 → 94610
94611 → 94611
94598 → 94598
94124 → 94124
94014 → 94014
94102 → 94102
94580 → 94580
94025 → 94025
94401 → 94401
94005 → 94005
94703 → 94703
94605 → 94605
94578 → 94578
94805 → 94805
94530 → 94530
94707 → 94707
94709 → 94709

Preparing the Data for the Database

To load the data into the SQLite database, it needs to be converted from the XML file into CSV files. I create multiple CSV files and later create the corresponding tables in my database based on them.

The CSV files I want to have are:

  • node.csv
  • node_tags.csv
  • way.csv
  • way_tags.csv
  • way_nodes.csv

Each of these CSV files contains different columns and stores data based on those columns. The columns used in the CSV files will be the table columns in the database. This is the schema:

  • NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
  • NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']
  • WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']
  • WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']
  • WAY_NODES_FIELDS = ['id', 'node_id', 'position']

To create these files, I parse the 'node' and 'way' elements and extract the tags inside them. The shape_element function takes an iterparse element object and returns a dictionary. Depending on whether the element is a 'node' or a 'way', the dictionary looks different.

Here's an example from the 'node' element in the XML file:

<node changeset="27772228" id="358830414" 
lat="37.6668652" lon="-122.4895243" timestamp="2014-12-29T09:43:14Z" 
uid="14293" user="KindredCoda" version="2">
    <tag k="ele" v="118" />
    <tag k="name" v="Longview Park" />
    <tag k="leisure" v="park" />
    <tag k="gnis:created" v="04/06/1998" />
    <tag k="gnis:state_id" v="06" />
    <tag k="gnis:county_id" v="081" />
    <tag k="gnis:feature_id" v="1785701" />
</node>

For 'Node':

The dictionary has the format {"node": ..., "node_tags": ...}

The node field holds a dictionary of the following top level node attributes: id, user, uid, version, lat, lon, timestamp, changeset. All the other attributes are ignored.

The node_tags field holds a list of dictionaries, one per secondary tag (child tags of node, which have the tag name/type: "tag"). Each dictionary has the following fields and properties:

  • id: the top level node_id attribute value. For example, node['id']='358830414' in the above sample.

  • key: the full tag "k" attribute value if no colon is present or the characters after the colon if there is one. For example, k="name".

  • value: the tag "v" attribute value. For example, v="Longview Park" is the value for the key k="name".

  • type: either the characters before the colon in the tag "k" value or "regular" if a colon is not present. For k="name", the type would be 'regular'.

  • If the tag "k" value contains problematic characters, the tag should be ignored.

  • If the tag "k" value contains a ":", the characters before the ":" should be set as the tag type and characters after the ":" should be set as the tag key. For example, in <tag k="gnis:county_id">, the tag['type']='gnis' and tag['key']='county_id'.

  • If there are additional ":" in the "k" value, they should be ignored and kept as part of the tag key. For example, {'id':12345, 'key':'street:name', 'value':'Lincoln', 'type':'addr'}.

  • If a node has no secondary tags, then the node_tags field should just contain an empty list.

For 'Way':

The dictionary has the format {"way": ..., "way_tags": ..., "way_nodes": ...}

The way field should hold a dictionary of the following top level way attributes id, user, uid, version, timestamp, changeset.

The way_tags field again holds a list of dictionaries, following the exact same rules as for node_tags.

Additionally, the dictionary has a field way_nodes. way_nodes holds a list of dictionaries. Each dictionary has the fields:

  • id: the top level element (way) id.
  • node_id: the ref attribute value of the nd tag.
  • position: the index, starting at 0, of the nd tag, i.e. the order in which the nd tag appears within the way element.

Writing to CSV

To write the data into CSV files, the shape_element function makes use of the update_name and update_postcode functions to clean the street names and postcodes before they are written out. Each of these functions is called for both nodes and ways. A simplified sketch of the shaping logic described above is shown below.
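
The actual implementation lives in shaping_csv.py; the following sketch of the node/way shaping logic is illustrative only (helper names and defaults are mine, and the real script may differ):

import re

PROBLEM_CHARS = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']
WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']

def shape_tags(element, parent_id):
    """Build the list of tag dictionaries for a node or way element."""
    tags = []
    for tag in element.iter('tag'):
        k = tag.attrib['k']
        if PROBLEM_CHARS.search(k):
            continue  # tags with problematic characters are ignored
        if ':' in k:
            tag_type, _, key = k.partition(':')  # text before the first ':' becomes the type
        else:
            tag_type, key = 'regular', k
        tags.append({'id': parent_id, 'key': key, 'value': tag.attrib['v'], 'type': tag_type})
    return tags

def shape_element(element):
    """Turn a 'node' or 'way' element into the dictionaries written to the CSV files."""
    if element.tag == 'node':
        node = {field: element.attrib.get(field) for field in NODE_FIELDS}
        return {'node': node, 'node_tags': shape_tags(element, node['id'])}
    elif element.tag == 'way':
        way = {field: element.attrib.get(field) for field in WAY_FIELDS}
        # way_nodes: one entry per child <nd>, keeping its order within the way
        way_nodes = [{'id': way['id'], 'node_id': nd.attrib['ref'], 'position': i}
                     for i, nd in enumerate(element.iter('nd'))]
        return {'way': way, 'way_nodes': way_nodes, 'way_tags': shape_tags(element, way['id'])}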

Issue Faced: Incompletely Filled Attributes

While validating my CSV files against the expected schema, I found that some uid values are missing from the data. Although OpenStreetMap best practice says user information should be recorded when data is submitted, it is not always present, which causes validation errors. To fix this issue, dummy values have been entered for the missing fields.

The same problem with empty fields applies to the 'k' attribute as well, meaning users did not fill in all the attributes of a node or way. To overcome this, a conditional statement sets the tag's 'type' and 'key' to 'regular' when the field is empty; otherwise, the regular expressions are used to find the corresponding patterns. A minimal sketch of this fallback handling follows.
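
The exact dummy values are implementation details of shaping_csv.py; as an illustration only, the fallback could look like this (the placeholder values here are my own):

def with_user_defaults(attribs):
    """Return element attributes with placeholder values for missing user info."""
    fixed = dict(attribs)
    fixed.setdefault('uid', '9999999')   # placeholder uid (illustrative, not from shaping_csv.py)
    fixed.setdefault('user', 'unknown')  # placeholder user name
    return fixed

def key_and_type(k):
    """Return (key, type) for a tag 'k' attribute, defaulting to 'regular' when empty."""
    if not k:
        return 'regular', 'regular'
    if ':' in k:
        tag_type, _, key = k.partition(':')
        return key, tag_type
    return k, 'regular'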

Preparing the Database

After the CSV files are created (by running the shaping_csv.py file), the next step is to create the database and insert the information from those CSV files to their corresponding tables.

I create a database called openstreetmap_sf_db, create tables whose columns match the columns of the CSV files, and insert the data from the CSV files into the corresponding tables. The creating_database.py file takes care of creating the tables and inserting the data.

After the tables are created, I can start investigating them by running queries.

The sample code below shows how I create the tables in the database, where table_name, column_name, and filename.csv are replaced according to the table being created. First, I connect to the SQLite file and make sure the table I want to create does not already exist. Using cur.execute, I execute commands against the database from Python. After creating the table, I insert the data from the CSV file into it. I repeat this process for every table in the database.

In [68]:
import sqlite3
import csv
from pprint import pprint

# Put the path to your sqlite database. If no database is available, a new one will be created.
sqlite_file = 'openstreetmap_sf_db.sqlite'

# Connecting to the database.
conn = sqlite3.connect(sqlite_file)
cur = conn.cursor()
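
As an illustration of the per-table pattern, a sketch for the nodes table follows (the exact column types and CSV file names in creating_database.py may differ):

# Recreate the table from scratch if it already exists.
cur.execute('DROP TABLE IF EXISTS nodes')
cur.execute('''CREATE TABLE nodes (
                   id INTEGER PRIMARY KEY,
                   lat REAL, lon REAL,
                   user TEXT, uid INTEGER,
                   version TEXT, changeset INTEGER, timestamp TEXT)''')

# Read the rows from the corresponding CSV file and insert them into the table.
with open('node.csv', 'r') as f:
    reader = csv.DictReader(f)
    rows = [(r['id'], r['lat'], r['lon'], r['user'], r['uid'],
             r['version'], r['changeset'], r['timestamp']) for r in reader]
cur.executemany('INSERT INTO nodes VALUES (?, ?, ?, ?, ?, ?, ?, ?)', rows)
conn.commit()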

Data Overview

Now that I have audited and cleaned the data and transferred everything into tables in my database, I can start running queries on it.

The queries help find the following.

  1. Number of nodes and ways.
  2. Number of unique users.
  3. Most contributing users.
  4. Number of users who contributed only once.
  5. Top 10 amenities in San Francisco.
  6. Cuisines in San Francisco.
  7. Shops in San Francisco.
  8. Users who added amenities.

Number of Nodes & Ways

In [69]:
def number_of_nodes():
    output = cur.execute('SELECT COUNT(*) FROM nodes')
    return output.fetchone()[0]
print('Number of nodes: \n' , number_of_nodes())

def number_of_ways():
    output = cur.execute('SELECT COUNT(*) FROM ways')
    return output.fetchone()[0]
print('Number of ways: \n', number_of_ways())
Number of nodes: 
 6691432
Number of ways: 
 833509

Number of Unique Users

In [70]:
def number_of_unique_users():
    output = cur.execute('SELECT COUNT(DISTINCT e.uid) FROM \
                         (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e')
    return output.fetchone()[0]
print('Number of unique users: \n' , number_of_unique_users())
Number of unique users: 
 2996

Most Contributing Users

In [78]:
def most_contributing_users():
    output = cur.execute('SELECT e.user, COUNT(*) as num FROM \
                         (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
                         GROUP BY e.user \
                         ORDER BY num DESC \
                         LIMIT 10 ')
    print("Most contributing users: \n")
    pprint(output.fetchall())
    return None

most_contributing_users()
Most contributing users: 

[("b'andygol'", 1495885),
 ("b'ediyes'", 886372),
 ("b'Luis36995'", 678994),
 ("b'dannykath'", 545833),
 ("b'RichRico'", 414739),
 ("b'Rub21'", 380990),
 ("b'calfarome'", 190771),
 ("b'oldtopos'", 165326),
 ("b'KindredCoda'", 149721),
 ("b'karitotp'", 139477)]

Number of Users Who Contributed Once

In [79]:
def number_of_users_contributed_once():
    output = cur.execute('SELECT COUNT(*) FROM \
                             (SELECT e.user, COUNT(*) as num FROM \
                                 (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \
                                  GROUP BY e.user \
                                  HAVING num = 1) u')
    return output.fetchone()[0]
                         
print('Number of users who have contributed once: \n', number_of_users_contributed_once())
Number of users who have contributed once: 
 747

Top 10 Amenities in San Francisco

In [106]:
query = "SELECT value, COUNT(*) as num FROM nodes_tags \
            WHERE key=\"b'amenity'\" \
            GROUP BY value \
            ORDER BY num DESC \
            LIMIT 20"

def top_ten_amenities_in_sf():
    output = cur.execute(query)
    pprint(output.fetchall())
    return None

print('Top 10 Amenities:\n')
top_ten_amenities_in_sf()
Top 10 Amenities:

[("b'restaurant'", 3063),
 ("b'bench'", 1357),
 ("b'cafe'", 1039),
 ("b'post_box'", 698),
 ("b'place_of_worship'", 689),
 ("b'fast_food'", 599),
 ("b'bicycle_parking'", 585),
 ("b'school'", 572),
 ("b'drinking_water'", 543),
 ("b'toilets'", 424),
 ("b'bank'", 372),
 ("b'bar'", 343),
 ("b'fuel'", 275),
 ("b'parking'", 274),
 ("b'waste_basket'", 239),
 ("b'car_sharing'", 223),
 ("b'atm'", 211),
 ("b'pub'", 207),
 ("b'post_office'", 164),
 ("b'pharmacy'", 151)]

Top 10 Cuisines in San Francisco

In [107]:
query = "SELECT value, COUNT(*) as num FROM ways_tags \
            WHERE key=\"b'cuisine'\" \
            GROUP BY value \
            ORDER BY num DESC \
            LIMIT 10"

def cuisines_in_sf():
    output = cur.execute(query)
    pprint(output.fetchall())
    return None

print('Top 10 Cuisines in San Francisco:\n')
cuisines_in_sf()
Top 10 Cuisines in San Francisco:

[("b'burger'", 77),
 ("b'mexican'", 51),
 ("b'chinese'", 30),
 ("b'pizza'", 29),
 ("b'american'", 23),
 ("b'coffee_shop'", 22),
 ("b'japanese'", 20),
 ("b'italian'", 19),
 ("b'seafood'", 13),
 ("b'indian'", 11)]

San Francisco is known for its diverse culture and this reflects in the popularity of international cuisines.

Different Types of Shops

In [115]:
query = "SELECT value, COUNT(*) as num FROM nodes_tags \
            WHERE key=\"b'shop'\" \
            GROUP BY value \
            ORDER BY num DESC \
            LIMIT 10"

def shops_in_sf():
    output = cur.execute(query)
    pprint(output.fetchall())
    return None

print('Different types of shops:\n')
shops_in_sf()
Different types of shops:

[("b'convenience'", 393),
 ("b'clothes'", 378),
 ("b'hairdresser'", 349),
 ("b'beauty'", 284),
 ("b'yes'", 216),
 ("b'supermarket'", 207),
 ("b'car_repair'", 175),
 ("b'bakery'", 172),
 ("b'laundry'", 171),
 ("b'dry_cleaning'", 144)]
In [118]:
query = "SELECT nodes_tags.value, COUNT(*) as num \
                          FROM nodes_tags \
                            JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value=\"b'coffee_shop'\") AS cafes \
                            ON nodes_tags.id = cafes.id \
                            WHERE nodes_tags.key=\"b'name'\"\
                            GROUP BY nodes_tags.value \
                            ORDER BY num DESC \
                            LIMIT 10"

def most_popular_cafes():
    output = cur.execute(query)
    pprint(output.fetchall())
    return None

print('Most popular cafes in San Francisco: \n')
most_popular_cafes()
Most popular cafes in San Francisco: 

[("b'Starbucks'", 53),
 ('b"Peet\'s Coffee & Tea"', 16),
 ("b'Starbucks Coffee'", 15),
 ('b"Peet\'s Coffee and Tea"', 7),
 ('b"Peet\'s Coffee"', 5),
 ("b'Blue Bottle Coffee'", 5),
 ("b'Philz Coffee'", 5),
 ("b'Beanery'", 3),
 ("b'Highwire Coffee Roasters'", 3),
 ('b"Yali\'s Cafe"', 2)]

It is not surprising that Starbucks is the most popular coffee brand in the US. There are, however, some inconsistencies in the names: for example, 'Starbucks' and 'Starbucks Coffee' refer to the same chain under different names. This is another potential data wrangling task (left out of this project).

Discussions about the Data

Data wrangling for this project has been time-consuming and complicated due to the many inconsistencies in the data. I could spot only a few of those problems and clean them up, and I am sure many issues went unnoticed. The primary reason behind these inconsistencies is human error.

Many of the issues in the OpenStreetMap data can be prevented or corrected. The following summary highlights each issue and a proposed solution with its pros and cons.

1. Empty User ID Fields

This issue was noticed while creating the CSV files and trying to validate them against the correct schema; I received an error stating that the uid field was empty. This was surprising because, for data created by multiple users, the uid is a highly important attribute.

  • Solution: It should be made a practice to require a uid from contributors.
    • Benefits: The uid can then be used as a primary key.
    • Anticipated Issue: This might decrease the number of contributions.

2. Invalid Format for Postcodes

While auditing the postcodes, I found a lot of invalid entries in the dataset. The standard postcode format is a 5-digit code with no letters or other characters; some entries had the state abbreviation before the digits (still acceptable). There was also a high number of entries where the code was 6 digits long, shorter than 5 digits, or simply 'CA'.

  • Solution: There should be some data validation before a user submits the data.
    • Benefits: Introduces standardization and reduces the data auditing/cleaning effort.
    • Drawback: Users might take time to get accustomed to the standards being enforced, which can affect the rate of data entry.
      • The effect of this drawback can be reduced with a short, readable 'Read Me'.

Suggestions for Improving Overall Data Quality

3. Gamification

Gamification refers to the application of typical game-playing elements such as competition and scores. The suggestion is to incorporate gamification, such as 'top contributor' lists or 'reward points', for users contributing to the OpenStreetMap data.

  • Benefits: Increases the motivation to submit more data. This motivation can also push users to adhere to the standards.
  • Drawback: Points and statuses without tangible rewards provide weak motivation, while real rewards would require funding.

The inspiration for the idea is https://www.facebook.com/editor/ – its competitive scoring system has often motivated me to add more data in order to rank higher than other people on my friend list.

4. Pre-Cleaned Data on OSM

The idea is to have scripts that pre-clean the most frequent types of errors made by users. For example, some users might enter the postcode inside the address field rather than as a separate postal-code tag; such data can, to some extent, be extracted by these scripts.

This can also be done by cross-validating/cross-referencing against other databases, such as the Google Maps API.

  • Benefits: Reduces errors in the data and reduces the need for human validation.
  • Drawback: Adds to the computational requirements (for running the scripts).
    • It might also need additional validation of the data, which can be done by confirming the change with the user who originally entered it. This, however, requires running the script right after the data is entered, as a user might not respond later.

5. Additional Validation of Existing Entries

As noted in the query that finds 'Starbucks' to be the most popular cafe in San Francisco, there were many spelling variations, and multiple entries were made for the same place. With techniques like string matching, users could contribute by answering simple 'yes/no' validation prompts; a small sketch follows below.

  • Benefits: Reduces redundancy in data.
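
As an illustration of the kind of check meant here, Python's standard difflib can score how similar two existing entries are; pairs above some threshold could then be offered to users as 'yes/no' questions (a sketch, not code from this project):

from difflib import SequenceMatcher

def similarity(a, b):
    """Return a 0..1 similarity ratio between two place names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Likely-duplicate names score highly and could be flagged for confirmation.
print(similarity('Starbucks', 'Starbucks Coffee'))                  # ≈0.72
print(similarity("Peet's Coffee & Tea", "Peet's Coffee and Tea"))   # ≈0.90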

6. Improve the User Interface

The OSM homepage is raw and functional but looks dated compared to most sites of a similar scale. This might cause new users not to trust a site that looks unmaintained. Hence, some changes should be made to the website to improve its appearance (similar to https://www.facebook.com/editor/).

  • Benefits: Improves the site's usability.
  • Drawback: Implementation is a tedious task. Also, drastic changes might confuse existing contributors.