{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# OpenStreetMap: Data Wrangling\n", "\n", "OpenStreetMap (http://www.openstreetmap.org) foundation builds a free and editable map of the world, enabling the development of freely-reusable geospatial data. The data from OpenStreetMap is being used by many applications such as GoogleMaps, Foursquare, and Craigslist. \n", "\n", "OpenStreetMap data structure has four core elements: Nodes, Ways, Tags, and Relations.\n", "\n", "- **Nodes** are points with a geographic position stored as lon (longitude) and lan (latitude).\n", "\n", "- **Ways** are ordered list of nodes, representing a polyline (can be a closed loop). They depict entities like streets, rivers, parks, and lakes.\n", "\n", "- **Relations** are ordered list of nodes, ways and relations (called 'members') to show the relation between nodes and ways such as restrictions on roads.\n", "\n", "- **Tags** are pairs of Keys and Values. They are used to store metadata of the map such as address, type of building, or any other physical property. Tags are always attached to a node, way or relation and are not stand-alone elements.\n", "\n", "Users can add points on the map, create relations of the nodes and ways, and assign properties such as address or type to them. The data can be stored in OSM format and can be accessed in different formats. For this project, [OSM XML](http://wiki.openstreetmap.org/wiki/OSM_XML) format is used.\n", "\n", "In this project, I have worked with the raw data from an area. Since the data is put by different users, I suppose it can be quite messy; therefore, cleaning and auditing functions to make the data look clean for analysis. I then export the data into CSV format and use this format to create tables in my SQLite database. Then I run queries on the database to extract information such as the number of nodes and ways, most contributing users, top points of interest, etc. The report is concluded by discussing benefits as well as some anticipated problems in implementing the improvement." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Area Chosen for Analysis: San Francisco, California\n", "\n", "The reason I chose San Francisco as the area of choice for this study is that it is where the Silicon Valley is; a place I want to work in someday. \n", "\n", "The raw data was locally downloaded from – https://mapzen.com/data/metro-extracts/metro/san-francisco_california/. The 'metro extracts' provides the map of the metropolitan area (i.e., contains more elements to work with). The original file is about 1.42 GB in size. Hence, a sample file of about 20 MB is used to perform the initial analysis. Post this, after a satisfactory code is written, it is run on the original files to create the CSV files for my database. \n", "\n", "The data analyzed in this Jupyter notebook is from the sample file to be able to show shorter results. I have included all the functions here, as well as the function to create CSV files, in separate .py files in the accompanying directory/repository, which can be run on the original dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How the Project Proceeds\n", "\n", "1. **Exploring the Data**\n", "2. **Auditing Street Names**\n", "3. **Auditing Postcodes**\n", "4. **Preparing the Data for the Database**\n", "5. **Creating Tables in the Database**\n", "6. **Running Queries on the Data**\n", "7. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Area Chosen for Analysis: San Francisco, California\n", "\n", "The reason I chose San Francisco as the area for this study is that it sits right next to Silicon Valley, a place where I want to work someday. \n", "\n", "The raw data was downloaded locally from https://mapzen.com/data/metro-extracts/metro/san-francisco_california/. The 'metro extracts' provide the map of the metropolitan area (i.e., they contain more elements to work with). The original file is about 1.42 GB in size, so a sample file of about 20 MB is used for the initial analysis. Once the code works satisfactorily on the sample, it is run on the original file to create the CSV files for my database. \n", "\n", "The data analyzed in this Jupyter notebook comes from the sample file, to keep the output short. All the functions shown here, as well as the function that creates the CSV files, are included as separate .py files in the accompanying directory/repository and can be run on the original dataset." ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## How the Project Proceeds\n", "\n", "1. **Exploring the Data**\n", "2. **Auditing Street Names**\n", "3. **Auditing Postcodes**\n", "4. **Preparing the Data for the Database**\n", "5. **Creating Tables in the Database**\n", "6. **Running Queries on the Data**\n", "7. **Conclusions**\n", "\n", "The detailed case study includes all code snippets as well as the query outputs. The complete report can be found in this repository, in CaseStudy for OpenStreetMap.\n", "\n", "## The Files in this Directory\n", "This notebook (`Report.ipynb`) gives a complete explanation of the project, the code, and the queries. \n", "\n", "1. **`audit.py`**: This script contains the auditing and cleaning functions. This file is a dependency of `shaping_csv.py`.\n", "2. **`shaping_csv.py`**: The main script, which creates the CSV files from the OSM data.\n", "3. **`creating_database.py`**: This script creates the database and tables, then inserts the relevant data.\n", "4. **`query_on_database.py`**: This script contains all the SQL commands.\n", "5. **`sample.osm`**: This file contains sample data from the original San Francisco dataset.\n", "6. **`sampler.py`**: This script samples the original OSM file by taking every k-th top-level element.\n", "7. **`Report.html`**: Export of this report from Jupyter notebook to HTML." ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Data\n", "\n", "This section explores the data: how it is structured, what problems it has, and how they can be cleaned. I first look into the dataset, parse it with `ElementTree`, and extract information such as the different types of elements (nodes, ways, etc.) in the OSM XML file. \n", "\n", "Since the original file is too large to load into memory at once, it is parsed with `ET.iterparse` (i.e. iterative parsing), which processes the file as it reads it (see http://stackoverflow.com/questions/12792998/elementtree-iterparse-strategy)." ] }, 
{ "cell_type": "markdown", "metadata": {}, "source": [ "In the following code, \n", "- I iterate through the different element tags in the XML.\n", "- I count them and put them in a dictionary whose keys are the tag types." ] }, 
{ "cell_type": "code", "execution_count": 67, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'member': 910,\n", " 'nd': 105916,\n", " 'node': 89220,\n", " 'osm': 1,\n", " 'relation': 110,\n", " 'tag': 27721,\n", " 'way': 11113}\n" ] } ], "source": [ "import xml.etree.cElementTree as ET\n", "import pprint\n", "\n", "OSMFILE = 'sample.osm'\n", "\n", "def count_tags(filename):\n", "    tags = {}\n", "    for event, elem in ET.iterparse(filename):\n", "        if elem.tag not in tags:\n", "            tags[elem.tag] = 1\n", "        else:\n", "            tags[elem.tag] += 1\n", "    \n", "    pprint.pprint(tags)\n", "    \n", "count_tags(OSMFILE)" ] }, 
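{ "cell_type": "markdown", "metadata": {}, "source": [ "One caveat: `iterparse` still accumulates the whole tree in memory unless processed elements are discarded. When running on the full 1.42 GB file, memory can be kept bounded by clearing each element after it has been handled. A minimal sketch of this pattern (it is not taken from the accompanying scripts):\n", "\n", "```\n", "import xml.etree.cElementTree as ET\n", "\n", "def iter_elements(filename, tags=('node', 'way', 'relation')):\n", "    \"\"\"Yield top-level elements one at a time, freeing memory as we go.\"\"\"\n", "    for event, elem in ET.iterparse(filename, events=('end',)):\n", "        if elem.tag in tags:\n", "            yield elem\n", "            elem.clear()  # discard children so the parsed tree does not keep growing\n", "```" ] }, 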
\\t\\r\\n]`**: Matches tags with problematic characters specified in the regex pattern.\n", "\n", "Take this section of the map as an example:\n", "\n", "```\n", " \n", " \n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\n", " \n", "```\n", " \n", "This node tag has 13 tag elements inside it. There are multiple keys that have the ':' character in them, so they fall under the 'Lower_Colon' regular expression. Keys like name, phone, and amenity will fall under the 'lower' regular expression. There are no problematic characters in this specific node." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'lower': 19501, 'lower_colon': 7857, 'other': 361, 'problem_chars': 2}\n" ] } ], "source": [ "import xml.etree.cElementTree as ET\n", "import pprint\n", "import re\n", "\n", "OSMFILE = 'sample.osm'\n", "\n", "lower = re.compile(r'^([a-z]|_)*$')\n", "lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')\n", "problem_chars = re.compile(r'[=\\+/&<>;\\'\"\\?%#$@\\,\\. \\t\\r\\n]')\n", "\n", "\n", "def key_type(element, keys):\n", " if element.tag == \"tag\":\n", " for tag in element.iter('tag'): # Iterating through the tag elements.\n", " k = element.attrib['k'] # Looking for the tag attribute 'k' which contains the keys.\n", " if re.search(lower, k):\n", " keys['lower'] += 1\n", " elif re.search(lower_colon, k):\n", " keys['lower_colon'] += 1\n", " elif re.search(problem_chars, k):\n", " keys['problem_chars'] += 1\n", " else:\n", " keys['other'] += 1 \n", " return keys\n", "\n", "def process_map(filename):\n", " keys = {\"lower\": 0, \"lower_colon\": 0, \"problem_chars\": 0, \"other\": 0}\n", " for _, element in ET.iterparse(filename):\n", " keys = key_type(element, keys)\n", "\n", " pprint.pprint(keys)\n", " \n", "process_map(OSMFILE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, the number of unique users that have contributed to the OpenStreetMap data for San Francisco have to be calculated. To find the users, we need to look through the attributes of the node, way and relation tags. The 'uid' attribute is what we need to count.\n", "\n", " ``\n", " " ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "928\n" ] } ], "source": [ "OSMFILE = 'sample.osm'\n", "\n", "def process_map(filename):\n", " users = set()\n", " for _, element in ET.iterparse(filename):\n", " if element.tag == 'node' or element.tag == 'way' or element.tag == 'relation':\n", " userid = element.attrib['uid']\n", " users.add(userid)\n", " print(len(users))\n", " \n", "process_map(OSMFILE)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Auditing the Data\n", "\n", "### Auditing Street Names\n", "\n", "Since, many users are entering data in OpenStreetMap, the way they represent the formatting of streets can vary. For example, the street type 'Avenue' can be written in formats such as:\n", "\n", "- Avenue (starting with a capital letter)\n", "- avenue (starting with a small letter)\n", "- Ave. (with a .)\n", "- Ave (without a .)\n", "\n", "To be able to process the data, we need to make these street types uniform. 
If we later search for specific avenue names, we can then do a quick search for all street names containing the word 'Avenue' and be sure that we are not missing anything because of abbreviations." ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'2': {'San Francisco Bicycle Route 2'},\n", " 'Alameda': {'Alameda'},\n", " 'Blvd': {'Newark Blvd'},\n", " 'Broadway': {'Broadway'},\n", " 'Center': {'Westlake Center'},\n", " 'Circle': {'Columbia Circle', 'Blossom Circle', 'Inner Circle'},\n", " 'Clemente': {'San Clemente'},\n", " 'Cres': {'Wellesley Cres'},\n", " 'Gardens': {'Wildwood Gardens'},\n", " 'Highway': {'Great Highway'},\n", " 'Ora': {'Avenue Del Ora'},\n", " 'Real': {'North el Camino Real', 'El Camino Real'},\n", " 'Terrace': {'Hawthorne Terrace'},\n", " 'Way': {'Camberly Way',\n", " 'Cambridge Way',\n", " 'Chelsea Way',\n", " 'Facebook Way',\n", " 'Glenn Way',\n", " 'Hansen Way',\n", " 'Industrial Way',\n", " 'Lenox Way',\n", " 'Liberty Ship Way',\n", " 'Lincoln Way',\n", " 'Marina Way',\n", " 'Martin Luther King Jr Way',\n", " 'Olympia Way',\n", " 'Park Way',\n", " 'Ranleigh Way',\n", " 'San Benito Way',\n", " 'Sanchez Way',\n", " 'Sussex Way',\n", " 'Sylvan Way',\n", " 'Tum Suden Way',\n", " 'Wessex Way'}}\n" ] } ], "source": [ "import xml.etree.cElementTree as ET\n", "from collections import defaultdict\n", "import re\n", "import pprint\n", "\n", "OSMFILE = \"sample.osm\"\n", "\n", "\"\"\"This regular expression matches the last word in a street name (no white space),\n", "with zero or one trailing '.' character.\"\"\"\n", "street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)\n", "\n", "# List of expected street types.\n", "expected = [\"Street\", \"Avenue\", \"Boulevard\", \"Drive\", \"Court\", \"Place\", \"Square\", \"Lane\", \"Road\", \n", "            \"Trail\", \"Parkway\", \"Commons\"]\n", "\n", "def audit_street_type(street_types, street_name):\n", "    \"\"\"Extract the street type from the street name using the regular expression\n", "    and compare it to the expected list. If it does not match any expected name,\n", "    add the street name to the street_types dictionary.\n", "    \"\"\"\n", "    m = street_type_re.search(street_name)\n", "    if m:\n", "        street_type = m.group()\n", "        if street_type not in expected:\n", "            street_types[street_type].add(street_name)\n", "    \n", "def is_street_name(elem):\n", "    \"\"\"This function takes a tag element and returns True when the element's\n", "    key attribute 'k' equals 'addr:street'.\n", "    \"\"\"\n", "    return (elem.attrib['k'] == \"addr:street\")\n", "\n", "\n", "def audit_street(osmfile):\n", "    \"\"\"This function uses iterative parsing to go through the XML file,\n", "    parse node and way elements, and iterate through their tag elements.\n", "    It then calls the 'audit_street_type' function to add the value attribute\n", "    of the tag (i.e. the street name) to it.
\n", " \"\"\"\n", " osm_file = open(osmfile, \"r\")\n", " street_types = defaultdict(set)\n", " \n", " # Parses the XML file.\n", " for event, elem in ET.iterparse(osm_file, events=(\"start\",)):\n", " # Iterate through the 'tag' element of node and way elements.\n", " if elem.tag == \"node\" or elem.tag == \"way\":\n", " for tag in elem.iter(\"tag\"):\n", " if is_street_name(tag):\n", " audit_street_type(street_types, tag.attrib['v'])\n", " osm_file.close()\n", " return street_types\n", "\n", "street_types = audit_street(OSMFILE)\n", "pprint.pprint(dict(street_types))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Going through the street name list, I use it to update the 'mapping' list. In this list, I mention the format of the street type that was found in the file (left) and specify to what format it needs to be changed (right). The dictionary containing the abbreviated street types do not cover all the different street types but covers a large number of them. I go through this list and see which ones can be changed. For instance:\n", "\n", "- `{'Highway': Great Highway'}` needs no changes. \n", "- `{'Blvd': 'Newark Blvd'}` needs to be changed from Blvd to 'Boulevard." ] }, { "cell_type": "code", "execution_count": 20, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Columbia Circle → Columbia Circle\n", "Blossom Circle → Blossom Circle\n", "Inner Circle → Inner Circle\n", "Hawthorne Terrace → Hawthorne Terrace\n", "Sanchez Way → Sanchez Way\n", "Park Way → Park Way\n", "Glenn Way → Glenn Way\n", "Chelsea Way → Chelsea Way\n", "Wessex Way → Wessex Way\n", "Ranleigh Way → Ranleigh Way\n", "Sussex Way → Sussex Way\n", "Liberty Ship Way → Liberty Ship Way\n", "Facebook Way → Facebook Way\n", "Camberly Way → Camberly Way\n", "Lenox Way → Lenox Way\n", "San Benito Way → San Benito Way\n", "Lincoln Way → Lincoln Way\n", "Martin Luther King Jr Way → Martin Luther King Jr Way\n", "Cambridge Way → Cambridge Way\n", "Hansen Way → Hansen Way\n", "Sylvan Way → Sylvan Way\n", "Tum Suden Way → Tum Suden Way\n", "Marina Way → Marina Way\n", "Olympia Way → Olympia Way\n", "Industrial Way → Industrial Way\n", "Alameda → Alameda\n", "Westlake Center → Westlake Center\n", "Broadway → Broadway\n", "Newark Blvd → Newark Boulevard\n", "San Francisco Bicycle Route 2 → San Francisco Bicycle Route 2\n", "San Clemente → San Clemente\n", "North el Camino Real → North el Camino Real\n", "El Camino Real → El Camino Real\n", "Wellesley Cres → Wellesley Crescent\n", "Avenue Del Ora → Avenue Del Ora\n", "Wildwood Gardens → Wildwood Gardens\n", "Great Highway → Great Highway\n" ] } ], "source": [ "# The list of dictionaries, containing street types that need to be changed to match the expected list.\n", "mapping = { \"St\": \"Street\",\n", " \"St.\": \"Street\",\n", " \"street\": \"Street\",\n", " \"Ave\": \"Avenue\",\n", " \"Ave.\": \"Avenue\",\n", " \"AVE\": \"Avenue,\",\n", " \"avenue\": \"Avenue\",\n", " \"Rd.\": \"Road\",\n", " \"Rd\": \"Road\",\n", " \"road\": \"Road\",\n", " \"Blvd\": \"Boulevard\",\n", " \"Blvd.\": \"Boulevard\",\n", " \"Blvd,\": \"Boulevard\",\n", " \"boulevard\": \"Boulevard\",\n", " \"broadway\": \"Broadway\",\n", " \"square\": \"Square\",\n", " \"way\": \"Way\",\n", " \"Dr.\": \"Drive\",\n", " \"Dr\": \"Drive\",\n", " \"ct\": \"Court\",\n", " \"Ct\": \"Court\",\n", " \"court\": \"Court\",\n", " \"Sq\": \"Square\",\n", " \"square\": \"Square\",\n", " \"cres\": \"Crescent\",\n", " \"Cres\": \"Crescent\",\n", " \"Ctr\": 
\"Center\",\n", " \"Hwy\": \"Highway\",\n", " \"hwy\": \"Highway\",\n", " \"Ln\": \"Lane\",\n", " \"Ln.\": \"Lane\",\n", " \"parkway\": \"Parkway\"\n", " }\n", "\n", "def update_name(name, mapping):\n", " \"\"\"This function takes the street name and split it at the space character. \n", " In case, it finds a string that matches any key in the mapping, it replaces it with \n", " the format that has been specified for it. \n", " \n", " e.g. When the function finds 'Blvd' in \"Newark Blvd\", it goes through mapping and maps \n", " it to 'Boulevard', and the final street name will come out as 'Newark Boulevard'.\n", " \"\"\"\n", " output = list()\n", " parts = name.split(\" \")\n", " for part in parts:\n", " if part in mapping:\n", " output.append(mapping[part])\n", " else:\n", " output.append(part)\n", " return \" \".join(output)\n", "\n", "# Printing the changes made in street names.\n", "for st_type, ways in street_types.items():\n", " for name in ways:\n", " better_name = update_name(name, mapping)\n", " print(name, \"→\", better_name)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Auditing Postcodes\n", "\n", "Postcodes are another inconsistent type of data that is entered into the map. The inconsistency is either in how they are represented (with the city abbreviation or without) or how long they are. The theory behind auditing Postcodes is the same as auditing street names." ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'94005': 1,\n", " '94010': 2,\n", " '94014': 1,\n", " '94015': 2,\n", " '94025': 1,\n", " '94061': 3,\n", " '94062': 1,\n", " '94063': 2,\n", " '94065': 1,\n", " '94102': 1,\n", " '94103': 18,\n", " '94104': 1,\n", " '94105': 2,\n", " '94107': 2,\n", " '94108': 1,\n", " '94109': 10,\n", " '94110': 2,\n", " '94111': 1,\n", " '94113': 1,\n", " '94114': 5,\n", " '94115': 1,\n", " '94116': 29,\n", " '94117': 22,\n", " '94118': 14,\n", " '94121': 6,\n", " '94122': 69,\n", " '94123': 1,\n", " '94124': 1,\n", " '94127': 10,\n", " '94131': 2,\n", " '94132': 1,\n", " '94133': 15,\n", " '94401': 1,\n", " '94501': 2,\n", " '94530': 2,\n", " '94541': 3,\n", " '94560': 1,\n", " '94578': 1,\n", " '94580': 1,\n", " '94587': 6,\n", " '94598': 1,\n", " '94602': 1,\n", " '94605': 1,\n", " '94606': 2,\n", " '94610': 14,\n", " '94611': 46,\n", " '94612': 4,\n", " '94618': 2,\n", " '94702': 1,\n", " '94703': 1,\n", " '94704': 2,\n", " '94706': 4,\n", " '94707': 2,\n", " '94709': 1,\n", " '94804': 1,\n", " '94805': 9,\n", " '94904': 1,\n", " '94941': 1,\n", " '94965': 1,\n", " '95498': 1}\n" ] } ], "source": [ "OSMFILE = 'sample.osm'\n", "\n", "def dicti(data, item):\n", " \"\"\"This function creates a dictionary postcodes can be held. \n", " The dictionary key will be the postcode itself and the dictionary value \n", " will be the number of times that postcode was repeated throughout the map.\"\"\"\n", " data[item] += 1\n", " \n", "def get_postcode(elem):\n", " \"\"\"This function takes the 'tag' element as an input and \n", " return the elements for which the keys are equal to 'addr:postcode'\"\"\"\n", " return (elem.attrib['k'] == \"addr:postcode\")\n", "\n", "def audit(osmfile):\n", " \"\"\"This function parses the XML file and iterates through node and \n", " way elements. It extracts the value attribute (i.e. 
the postcode) and\n", "    adds it to the postcode dictionary.\"\"\"\n", "    osm_file = open(osmfile, \"r\")\n", "    data = defaultdict(int)\n", "    # Parsing the XML file.\n", "    for event, elem in ET.iterparse(osm_file, events=(\"start\",)):\n", "        # Iterating through node and way elements.\n", "        if elem.tag == \"node\" or elem.tag == \"way\":\n", "            for tag in elem.iter(\"tag\"):\n", "                if get_postcode(tag):\n", "                    dicti(data, tag.attrib['v'])\n", "    return data\n", "\n", "postcodes = audit(OSMFILE)\n", "pprint.pprint(dict(postcodes))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The output shows that, in the sample, every postcode is in a *5-digit format*.\n", "\n", "After running the code on the original file, I also found other postcode formats. To deal with these other formats, I divide them into three categories:
\n", "\n", "- The first category includes the ones:\n", " - Where the length equals to 5 (e.g., 12345).\n", " - Where the length is longer than 5, and they contain characters (e.g., CA 12345).

\n", " \n", "- The second category includes the ones:\n", " - Where the length is longer than 5, and they are followed by a hyphen (e.g., 12345-6789).

\n", " \n", "- The third category includes the ones:\n", " - Where the length is longer than 5, but are not followed by any hyphen (e.g., 123456).\n", " - Where the length is shorter than 5 (e.g., 1234, 515).\n", " - Where the postcode equals to 'CA.'
\n", " \n", "For the first category, I use a regular expression to extract only 5 digits from the pattern. This regex asserts position at the start of the string (`^`) and matches any character that is NOT a digit (`\\D*`). The (`\\d{5}`) matches a digit exactly 5 times. In case the postcode starts with letters (e.g., CA 12345), it gives two groups of output: 'CA' and '12345'. Depending on which one is needed, the preferred group can be chosen – **`^\\D*(\\d{5}).*`**.\n", "\n", "For the second category, I use another regular expression to extract the first 5-digits. This regex matches digits 5 times, is followed by a '-,' and then matching digits exactly 4 times – **`^(\\d{5})-\\d{4}$`**.\n", " \n", "For the third category, having postcodes which have shorter or longer than 5-digit length, means that they are not valid. To clean them up, I replace those postcodes with '00000'. I use regular expressions to be able to find the ones that are exactly 6-digit long – **`^\\d{6}$`**." ] }, { "cell_type": "code", "execution_count": 30, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "94541 → 94541\n", "94110 → 94110\n", "94116 → 94116\n", "94587 → 94587\n", "94612 → 94612\n", "94065 → 94065\n", "94704 → 94704\n", "94103 → 94103\n", "94606 → 94606\n", "94061 → 94061\n", "94107 → 94107\n", "94010 → 94010\n", "94115 → 94115\n", "94132 → 94132\n", "94114 → 94114\n", "94702 → 94702\n", "94618 → 94618\n", "94109 → 94109\n", "94122 → 94122\n", "94127 → 94127\n", "94105 → 94105\n", "94117 → 94117\n", "94118 → 94118\n", "94121 → 94121\n", "94108 → 94108\n", "94706 → 94706\n", "94904 → 94904\n", "94015 → 94015\n", "94941 → 94941\n", "94804 → 94804\n", "94560 → 94560\n", "94965 → 94965\n", "94062 → 94062\n", "94104 → 94104\n", "94113 → 94113\n", "94063 → 94063\n", "94133 → 94133\n", "94111 → 94111\n", "94123 → 94123\n", "94131 → 94131\n", "95498 → 95498\n", "94501 → 94501\n", "94602 → 94602\n", "94610 → 94610\n", "94611 → 94611\n", "94598 → 94598\n", "94124 → 94124\n", "94014 → 94014\n", "94102 → 94102\n", "94580 → 94580\n", "94025 → 94025\n", "94401 → 94401\n", "94005 → 94005\n", "94703 → 94703\n", "94605 → 94605\n", "94578 → 94578\n", "94805 → 94805\n", "94530 → 94530\n", "94707 → 94707\n", "94709 → 94709\n" ] } ], "source": [ "def update_postcode(digit):\n", " \"\"\"Makes use of different conditions in the function to match the \n", " postcodes in the 3 categories that can be found for postal codes.\"\"\"\n", " output = list()\n", " \n", " first_category = re.compile(r'^\\D*(\\d{5}$)', re.IGNORECASE)\n", " second_category = re.compile('^(\\d{5})-\\d{4}$')\n", " third_category = re.compile('^\\d{6}$')\n", " \n", " if re.search(first_category, digit):\n", " new_digit = re.search(first_category, digit).group(1)\n", " output.append(new_digit)\n", " \n", " elif re.search(second_category, digit):\n", " new_digit = re.search(second_category, digit).group(1)\n", " output.append(new_digit)\n", " \n", " elif re.search(third_category, digit):\n", " third_output = third_category.search(digit)\n", " new_digit = '00000'\n", " output.append('00000')\n", " \n", " # This condition matches the third category for any other types.\n", " elif digit == 'CA' or len(digit) < 5:\n", " new_digit = '00000'\n", " output.append(new_digit)\n", "\n", " return ', '.join(str(x) for x in output)\n", "\n", "for postcode, nums in postcodes.items():\n", " better_code = update_postcode(postcode)\n", " print(postcode, \"→\", better_code)" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "## Preparing the Data for the Database\n", "\n", "To load the data to the SQLite database, they need to be converted from the XML file to CSV files. I create multiple CSV files, and later create the corresponding tables in my database based on them.\n", "\n", "The CSV files I want to have are:\n", "\n", "- `node.csv`\n", "- `node_tags.csv`\n", "- `way.csv`\n", "- `way_tags.csv`\n", "- `way_nodes.csv`\n", "\n", "Each of these CSV files contains different columns and stores data based on those columns. The columns used in the CSV files will be the table columns in the database. This is the **schema**:\n", "\n", "- NODE_FIELDS = ['id', 'lat', 'lon', 'user', 'uid', 'version', 'changeset', 'timestamp']\n", "- NODE_TAGS_FIELDS = ['id', 'key', 'value', 'type']\n", "- WAY_FIELDS = ['id', 'user', 'uid', 'version', 'changeset', 'timestamp']\n", "- WAY_TAGS_FIELDS = ['id', 'key', 'value', 'type']\n", "- WAY_NODES_FIELDS = ['id', 'node_id', 'position']\n", "\n", "To create these files, I will parse the 'node' and 'way' tags and extract the tags inside them. The `shape_element` function takes an iterparse element object and returns a dictionary. Depending on whether the element is 'node' or 'way', the dictionary looks different." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here's an example from the 'node' element in the XML file:\n", "\n", " \n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\t\n", "\t\n", "\n", "### For 'Node':\n", "**The dictionary returns the format `{\"node\": .., \"node_tags\": ...}`**\n", "\n", "The `node` field holds a dictionary of the following top level node attributes: `id, user, uid, version, lat, lon, timestamp, changeset`. All the other attributes are ignored.\n", "\n", "The `node_tags` field holds a list of dictionaries, one per secondary tag (child tags of node, which have the tag name/type: \"tag\"). Each dictionary has the following fields and properties:\n", "\n", "- `id:` the top level node_id attribute value. For example, `node['id']='358830414'` in the above sample.

\n", "\n", "- `key:` the full tag \"k\" attribute value if no colon is present or the characters after the colon if there is one. For example, `k=\"name\"`.

\n", "\n", "- `value:` the tag \"v\" attribute value. For example, `v=\"Longview Park\"` is the value for the key `k=\"name\"`.

\n", "\n", "- `type:` either the characters before the colon in the tag \"k\" value or \"regular\" if a colon is not present. For `k=\"name\"`, the type would be 'regular'.

\n", "\n", "- If the tag \"k\" value contains problematic characters, the tag should be ignored.

\n", "\n", "- If the tag \"k\" value contains a \":\", the characters before the \":\" should be set as the tag type and characters after the \":\" should be set as the tag key. For example, in ``, the `tag['type']='gnis' and tag['key']='county_id'`.

\n", "\n", "- If there are additional \":\" in the \"k\" value, they should be ignored and kept as part of the tag key. For example, `{'id':12345, 'key':'street:name', 'value':'Lincoln', 'type':'addr'}.`

\n", "\n", "- If a node has no secondary tags, then `node_tags` field should just contain an empty list.\n", "\n", "### For 'Way':\n", "**The dictionary has the format `{\"way\": ..., \"way_tags\": ..., \"way_nodes\": ...}`**\n", "\n", "The `way` field should hold a dictionary of the following top level way attributes `id, user, uid, version, timestamp, changeset`.\n", "\n", "The `way_tags` field again holds a list of dictionaries, following the exact same rules as for `node_tags`.\n", "\n", "Additionally, the dictionary has a field `way_nodes`. `way_nodes` holds a list of dictionaries. Each dictionary has the fields:\n", "- `id:` the top level element (way) id.\n", "- `node_id:` the ref attribute value of the nd tag.\n", "- `position:` the index starting at 0 of the nd tag i.e. what order the nd tag appears within the way element." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Writing to CSV\n", "To write the data into CSV files, `shape_element` function which makes use of `update_name` and `update_postcode` functions to clean the street names and postcodes before they are inserted into the CSV files. Each of the function is called twice, once for nodes and the other for ways." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Issue Faced: Incompletely Filled Attributes\n", "One issue I noticed while I was validating my CSV file against the expected schema, I found out that some `uid` values are missing from the data. Although according to the best practice of OpenStreetMap, all user's information should be written while submitting data, they are not; thus, causing validation to throw errors. To fix this issue, dummy values have been entered. \n", "\n", "The same problem with some empty fields are relatable to the 'k' attribute as well; meaning users did not add information for all the attributes in node or way. To overcome this problem, a conditional statement to set the `['type']` and `['key']` to 'regular' in case the field is empty was implemented. Else, it uses the regular expressions to find corresponding patterns." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Preparing the Database\n", "\n", "After the CSV files are created (by running the shaping_csv.py file), the next step is to create the database and insert the information from those CSV files to their corresponding tables.\n", "\n", "I created the database called `openstreetmap_sf_db,` and I created tables with columns based on the columns from the CSV files, and inserted the data from the CSV files to the corresponding tables in the database. The `creating_database.py` file takes care of creating the tables and inserting data in them.\n", "\n", "After the tables are created, I can now start investigating them and getting queries on them.\n", "\n", "The sample code below shows how I have created the tables in the database; where `tables_name`, `column_name` and `filename.csv` are replaced according to the table that is to be inserted. First, I connect to the SQLite file and make sure to check that the table I want to create is not already created. Using `cur.execute`, I execute commands to the database in Python. After creating the table, I inserted the data from the CSV file into it. I did this process for every table I wanted to create in the database." 
] }, { "cell_type": "code", "execution_count": 68, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import sqlite3\n", "import csv\n", "from pprint import pprint\n", "\n", "# Put the path to your sqlite database. If no database is available, a new one will be created.\n", "sqlite_file = 'openstreetmap_sf_db.sqlite'\n", "\n", "# Connecting to the database.\n", "conn = sqlite3.connect(sqlite_file)\n", "cur = conn.cursor()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data Overview" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that I have audited and cleaned the data and transfered everything into table in my database, I can start running queries on it. \n", "\n", "The queries help find the following. \n", "\n", "1. Number of nodes, ways.\n", "2. Number of unique users.\n", "3. Most contributing users.\n", "4. No. of users who contributed only once.\n", "5. Top 10 amneties in San Francisco.\n", "6. Cuisines in San Francisco.\n", "7. Shopes in San Francisco.\n", "8. Users who added amneties." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Number of Nodes & Ways" ] }, { "cell_type": "code", "execution_count": 69, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of nodes: \n", " 6691432\n", "Number of ways: \n", " 833509\n" ] } ], "source": [ "def number_of_nodes():\n", " output = cur.execute('SELECT COUNT(*) FROM nodes')\n", " return output.fetchone()[0]\n", "print('Number of nodes: \\n' , number_of_nodes())\n", "\n", "def number_of_ways():\n", " output = cur.execute('SELECT COUNT(*) FROM ways')\n", " return output.fetchone()[0]\n", "print('Number of ways: \\n', number_of_ways())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Number of Unique Users" ] }, { "cell_type": "code", "execution_count": 70, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of unique users: \n", " 2996\n" ] } ], "source": [ "def number_of_unique_users():\n", " output = cur.execute('SELECT COUNT(DISTINCT e.uid) FROM \\\n", " (SELECT uid FROM nodes UNION ALL SELECT uid FROM ways) e')\n", " return output.fetchone()[0]\n", "print('Number of unique users: \\n' , number_of_unique_users())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Most Contributing Users" ] }, { "cell_type": "code", "execution_count": 78, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most contributing users: \n", "\n", "[(\"b'andygol'\", 1495885),\n", " (\"b'ediyes'\", 886372),\n", " (\"b'Luis36995'\", 678994),\n", " (\"b'dannykath'\", 545833),\n", " (\"b'RichRico'\", 414739),\n", " (\"b'Rub21'\", 380990),\n", " (\"b'calfarome'\", 190771),\n", " (\"b'oldtopos'\", 165326),\n", " (\"b'KindredCoda'\", 149721),\n", " (\"b'karitotp'\", 139477)]\n" ] } ], "source": [ "def most_contributing_users():\n", " output = cur.execute('SELECT e.user, COUNT(*) as num FROM \\\n", " (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \\\n", " GROUP BY e.user \\\n", " ORDER BY num DESC \\\n", " LIMIT 10 ')\n", " print(\"Most contributing users: \\n\")\n", " pprint(output.fetchall())\n", " return None\n", "\n", "most_contributing_users()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Number of Users Who Contributed Once" ] }, { "cell_type": "code", "execution_count": 79, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", 
"text": [ "Number of users who have contributed once: \n", " 747\n" ] } ], "source": [ "def number_of_users_contributed_once():\n", " output = cur.execute('SELECT COUNT(*) FROM \\\n", " (SELECT e.user, COUNT(*) as num FROM \\\n", " (SELECT user FROM nodes UNION ALL SELECT user FROM ways) e \\\n", " GROUP BY e.user \\\n", " HAVING num = 1) u')\n", " return output.fetchone()[0]\n", " \n", "print('Number of users who have contributed once: \\n', number_of_users_contributed_once())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Top 10 Amenities in San Francisco" ] }, { "cell_type": "code", "execution_count": 106, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 10 Amenities:\n", "\n", "[(\"b'restaurant'\", 3063),\n", " (\"b'bench'\", 1357),\n", " (\"b'cafe'\", 1039),\n", " (\"b'post_box'\", 698),\n", " (\"b'place_of_worship'\", 689),\n", " (\"b'fast_food'\", 599),\n", " (\"b'bicycle_parking'\", 585),\n", " (\"b'school'\", 572),\n", " (\"b'drinking_water'\", 543),\n", " (\"b'toilets'\", 424),\n", " (\"b'bank'\", 372),\n", " (\"b'bar'\", 343),\n", " (\"b'fuel'\", 275),\n", " (\"b'parking'\", 274),\n", " (\"b'waste_basket'\", 239),\n", " (\"b'car_sharing'\", 223),\n", " (\"b'atm'\", 211),\n", " (\"b'pub'\", 207),\n", " (\"b'post_office'\", 164),\n", " (\"b'pharmacy'\", 151)]\n" ] } ], "source": [ "query = \"SELECT value, COUNT(*) as num FROM nodes_tags \\\n", " WHERE key=\\\"b'amenity'\\\" \\\n", " GROUP BY value \\\n", " ORDER BY num DESC \\\n", " LIMIT 20\"\n", "\n", "def top_ten_amenities_in_sf():\n", " output = cur.execute(query)\n", " pprint(output.fetchall())\n", " return None\n", "\n", "print('Top 10 Amenities:\\n')\n", "top_ten_amenities_in_sf()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Top 10 Cuisines in San Francisco" ] }, { "cell_type": "code", "execution_count": 107, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Top 10 Cuisines in San Francisco:\n", "\n", "[(\"b'burger'\", 77),\n", " (\"b'mexican'\", 51),\n", " (\"b'chinese'\", 30),\n", " (\"b'pizza'\", 29),\n", " (\"b'american'\", 23),\n", " (\"b'coffee_shop'\", 22),\n", " (\"b'japanese'\", 20),\n", " (\"b'italian'\", 19),\n", " (\"b'seafood'\", 13),\n", " (\"b'indian'\", 11)]\n" ] } ], "source": [ "query = \"SELECT value, COUNT(*) as num FROM ways_tags \\\n", " WHERE key=\\\"b'cuisine'\\\" \\\n", " GROUP BY value \\\n", " ORDER BY num DESC \\\n", " LIMIT 10\"\n", "\n", "def cuisines_in_sf():\n", " output = cur.execute(query)\n", " pprint(output.fetchall())\n", " return None\n", "\n", "print('Top 10 Cuisines in San Francisco:\\n')\n", "cuisines_in_sf()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "San Francisco is known for its diverse culture and this reflects in the popularity of international cuisines." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Different Types of Shops" ] }, { "cell_type": "code", "execution_count": 115, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Different types of shops:\n", "\n", "[(\"b'convenience'\", 393),\n", " (\"b'clothes'\", 378),\n", " (\"b'hairdresser'\", 349),\n", " (\"b'beauty'\", 284),\n", " (\"b'yes'\", 216),\n", " (\"b'supermarket'\", 207),\n", " (\"b'car_repair'\", 175),\n", " (\"b'bakery'\", 172),\n", " (\"b'laundry'\", 171),\n", " (\"b'dry_cleaning'\", 144)]\n" ] } ], "source": [ "query = \"SELECT value, COUNT(*) as num FROM nodes_tags \\\n", " WHERE key=\\\"b'shop'\\\" \\\n", " GROUP BY value \\\n", " ORDER BY num DESC \\\n", " LIMIT 10\"\n", "\n", "def shops_in_sf():\n", " output = cur.execute(query)\n", " pprint(output.fetchall())\n", " return None\n", "\n", "print('Different types of shops:\\n')\n", "shops_in_sf()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Popular Cafes in San Francisco" ] }, { "cell_type": "code", "execution_count": 118, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Most popular cafes in San Francisco: \n", "\n", "[(\"b'Starbucks'\", 53),\n", " ('b\"Peet\\'s Coffee & Tea\"', 16),\n", " (\"b'Starbucks Coffee'\", 15),\n", " ('b\"Peet\\'s Coffee and Tea\"', 7),\n", " ('b\"Peet\\'s Coffee\"', 5),\n", " (\"b'Blue Bottle Coffee'\", 5),\n", " (\"b'Philz Coffee'\", 5),\n", " (\"b'Beanery'\", 3),\n", " (\"b'Highwire Coffee Roasters'\", 3),\n", " ('b\"Yali\\'s Cafe\"', 2)]\n" ] } ], "source": [ "query = \"SELECT nodes_tags.value, COUNT(*) as num \\\n", " FROM nodes_tags \\\n", " JOIN (SELECT DISTINCT(id) FROM nodes_tags WHERE value=\\\"b'coffee_shop'\\\") AS cafes \\\n", " ON nodes_tags.id = cafes.id \\\n", " WHERE nodes_tags.key=\\\"b'name'\\\"\\\n", " GROUP BY nodes_tags.value \\\n", " ORDER BY num DESC \\\n", " LIMIT 10\"\n", "\n", "def most_popular_cafes():\n", " output = cur.execute(query)\n", " pprint(output.fetchall())\n", " return None\n", "\n", "print('Most popular cafes in San Francisco: \\n')\n", "most_popular_cafes()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It is not surprising that Starbucks in the most popular brand in the US. There are, however, some mistakes in the names. For example 'Starbuck' and 'Starbucks Coffee' are the same thing with different names. This is another potential data wrangling task (left out in this project)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Discussions about the Data\n", "\n", "Data wrangling for this project has been a time-consuming and complicated due to many inconsistencies in the data. I could spot only a few of those problems and clean them up, but I am sure there are many issues that went unnoticed. The primary reason behind these inconsistencies is human error.\n", "\n", "A lot of the issues related to the OpenStreetMap data can be prevented or corrected. For this, the following summary highlighting the issue and a proposed solution with its pros & cons are displayed.\n", "\n", "#### 1. Empty User ID Fields\n", "This issue was noticed while creating the CSV files and trying to validate them against the correct schema. I had received an error citing that the `uid` field is empty. This was surprising as for a data made by multiple users; a `uid` is a highly important attribute.\n", "\n", " - **Solution**: It should be made a practice to make `uid` mandatory for contributors. 
\n", " - **Benefits:** `uid` can be then used a primary key.\n", " - **Anticipated Issue:** This might decrease the number of contributions.\n", " \n", "#### 2. Invalid Format for Postcodes\n", "While auditing the postcodes, there were a lot of invalid entries in the dataset. The standard postcode format was in the form of a 5-digit code, with no letters or other characters; however, there were formats with the state abbreviation before the digit (still acceptable). There were a high number of entries in the format where the code was 6-digit long or shorter than 5-digits long or only 'CA.'\n", "\n", "- **Solution:** There should be some data validation before a user enters the data. \n", " - **Benefits:** Introduces standardization and reduces data auditing/cleaning process.\n", " - **Drawback:** Users might take time to accustom to the standards that are being forced on them which can affect the rate of data-entry.\n", " - The effect of this drawback can be reduced by a small-readable 'Read Me.'\n", "\n", "### Suggestions for Improving Overall Data Quality\n", "\n", "#### 3. Gamification\n", "Gamification refers to the application of typical elements of game-playing such as competition, scores, etc. The suggestion is to incorporate gamification into the usage of methods such as 'top contributors,' 'reward points' for users contributing to the OpenMapStreet Data.\n", "\n", " - **Benefits:** Increases the level of motivation for submitting more data. This motivation can also push users to adhere to standardization.\n", " - **Drawback:** Non-rewarding points/statuses are weak. Rewards would require funds.\n", " \n", "The inspiration for the idea: https://www.facebook.com/editor/ - the competitive scoring system has often motivated me to add more to the data to stand higher rated than other people on my friend-list.\n", " \n", "#### 4. Pre-Cleaned Data on OSM\n", "The idea is to have scripts that can pre-clean the most frequent types of errors that are made by the users. e.g., some users might enter the pin code in an address field and not separate as a postal code. This data can be extracted to some extent using these scripts.\n", "\n", "This can also be done using cross-validating/cross-referencing to other databases such as Google API.\n", "\n", " - **Benefits:** Reduces error in the data and reduces human validation.\n", " - **Drawback:** Adds to the *computational requirements* (for running the scripts). \n", " - It might also need the additional validation of data which can be done by confirming the change from the user who had initially entered the data. This, however, has to execute the script right after the data is entered as a user might not respond later.\n", " \n", "#### 5. Additional Validation of Existing Entries\n", "As noted in the query which finds 'Starbucks' to be the most popular restaurant in San Francisco, there were many spelling errors, and multiple entries were made for the same place. With techniques like matching two strings, users can contribute by answering 'yes/no' queries related to validation.\n", "\n", " - **Benefits:** Reduces redundancy in data.\n", " \n", "#### 6. Improve the User Interface\n", "The OSM homepage is raw and functional but looks dated in comparison to most sites of the same scale. This might cause a lot of new users not to trust a site that looks unmaintained or hackish. 
Hence, some changes should be made to the website to make it look better (similar to https://www.facebook.com/editor/).\n", "\n", " - **Benefits:** Improves the site's usability.\n", " - **Drawback:** Implementation is a tedious task. Also, drastic changes might confuse existing contributors." ] } ], "metadata": { "kernelspec": { "display_name": "Python [conda root]", "language": "python", "name": "conda-root-py" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.0" } }, "nbformat": 4, "nbformat_minor": 2 }