{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Creating a Dataset" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "from SPARQLWrapper import SPARQLWrapper, JSON\n", "import pandas as pd\n", "import hashlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To install the use the dependencies for this notebook:\n", "\n", "```bash\n", "conda env create -f environment.yml\n", "source activate fornax_tutorial\n", "```\n", "\n", "To run this notebook from the project root:\n", "\n", "```bash\n", "cd docs/tutorial\n", "jupyter-notebook\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Download\n", "\n", "For the duration of this tutorial we will be using the social network of Marvel Comicbook characters.\n", "\n", "Nodes will represent \n", "\n", "* characters \n", "* aliases \n", "* groups of characters\n", "\n", "edges will represent relationships between the nodes.\n", "\n", "For example Wolverine, Logan and X-Men are all nodes. \n", "There is an edge between Wolverine and Logan because Logan is an alternative name for Wolverine.\n", "There is an edge between Wolverine and X-Men because Wolverine is a member of X-Men.\n", "There is no direct relationship between Logan and X-Men so there is no edge between them.\n", "\n", "### SPARQL\n", "\n", "Below is a SPARQL query which will return data in the following format (using 'Wolverine' as an example):\n", "\n", "```json\n", "{\n", " \"group\": {\n", " \"type\": \"uri\",\n", " \"value\": \"http://www.wikidata.org/entity/Q2690825\"\n", " },\n", " \"character\": {\n", " \"type\": \"uri\",\n", " \"value\": \"http://www.wikidata.org/entity/Q186422\"\n", " },\n", " \"birthName\": {\n", " \"xml:lang\": \"en\",\n", " \"type\": \"literal\",\n", " \"value\": \"James Howlett\"\n", " },\n", " \"characterLabel\": {\n", " \"xml:lang\": \"en\",\n", " \"type\": \"literal\",\n", " \"value\": \"Wolverine\"\n", " },\n", " \"groupLabel\": {\n", " \"xml:lang\": \"en\",\n", " \"type\": \"literal\",\n", " \"value\": \"Horsemen of Apocalypse\"\n", " },\n", " \"characterAltLabel\": {\n", " \"xml:lang\": \"en\",\n", " \"type\": \"literal\",\n", " \"value\": \"Logan, Weapon X, Jim Logan, Patch, James Howlett, Agent Ten, Experiment X, Weapon Ten\"\n", " }\n", "}\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "sparql = SPARQLWrapper(\"https://query.wikidata.org/sparql\")\n", "sparql.setQuery(\"\"\"\n", " SELECT ?character ?characterLabel ?group ?groupLabel ?birthName ?characterAltLabel \n", " WHERE {\n", " ?group wdt:P31 wd:Q14514600 ; # group of fictional characters\n", " wdt:P1080 wd:Q931597. # from Marvel universe\n", " ?character wdt:P463 ?group. # member of group\n", " optional{ ?character wdt:P1477 ?birthName. }\n", " SERVICE wikibase:label { bd:serviceParam wikibase:language \"[AUTO_LANGUAGE],en\".}\n", " }\n", "\"\"\")\n", "sparql.setReturnFormat(JSON)\n", "results = sparql.query().convert()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Loading with Pandas\n", "\n", "We'll be using pandas to do some data manipulation so lets put the result inside a pandas dataframe." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# load the results into a pandas DataFrame\n", "records = []\n", "for result in results[\"results\"][\"bindings\"]:\n", " character_id = result['character']['value']\n", " group_id = result['group']['value']\n", " name = result['characterLabel']['value']\n", " group = result['groupLabel']['value']\n", " alt_names = None\n", " if 'characterAltLabel' in result:\n", " alt_names = result['characterAltLabel']['value']\n", " birth_name = None\n", " if 'birthName' in result:\n", " birth_name = result['birthName']['value']\n", " records.append((character_id, group_id, name, group, birth_name, alt_names))\n", "\n", "frame = pd.DataFrame.from_records(records, columns=['character_id', 'group_id', 'name', 'group', 'birth_name', 'alt_names'])" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
character_idgroup_idnamegroupbirth_namealt_names
0http://www.wikidata.org/entity/Q60002http://www.wikidata.org/entity/Q2603976ColossusExcaliburПётр Николаевич РаспутинPeter Rasputin, Piotr Nikolayevich Rasputin, P...
1http://www.wikidata.org/entity/Q258015http://www.wikidata.org/entity/Q2603976Rachel SummersExcaliburRachel Anne SummersPhoenix, Prestige, Marvel Girl, Mother Askani,...
2http://www.wikidata.org/entity/Q369197http://www.wikidata.org/entity/Q2527918Black WidowThunderboltsНаталья Алиановна РомановаNatalie Rushman, Natasha Romanoff, asesina rus...
3http://www.wikidata.org/entity/Q388316http://www.wikidata.org/entity/Q2527918BullseyeThunderboltsNoneLester, Hawkeye, Benjamin Poindexter
4http://www.wikidata.org/entity/Q432272http://www.wikidata.org/entity/Q2457162MedusaFrightful FourNoneNone
\n", "
" ], "text/plain": [ " character_id \\\n", "0 http://www.wikidata.org/entity/Q60002 \n", "1 http://www.wikidata.org/entity/Q258015 \n", "2 http://www.wikidata.org/entity/Q369197 \n", "3 http://www.wikidata.org/entity/Q388316 \n", "4 http://www.wikidata.org/entity/Q432272 \n", "\n", " group_id name group \\\n", "0 http://www.wikidata.org/entity/Q2603976 Colossus Excalibur \n", "1 http://www.wikidata.org/entity/Q2603976 Rachel Summers Excalibur \n", "2 http://www.wikidata.org/entity/Q2527918 Black Widow Thunderbolts \n", "3 http://www.wikidata.org/entity/Q2527918 Bullseye Thunderbolts \n", "4 http://www.wikidata.org/entity/Q2457162 Medusa Frightful Four \n", "\n", " birth_name \\\n", "0 Пётр Николаевич Распутин \n", "1 Rachel Anne Summers \n", "2 Наталья Алиановна Романова \n", "3 None \n", "4 None \n", "\n", " alt_names \n", "0 Peter Rasputin, Piotr Nikolayevich Rasputin, P... \n", "1 Phoenix, Prestige, Marvel Girl, Mother Askani,... \n", "2 Natalie Rushman, Natasha Romanoff, asesina rus... \n", "3 Lester, Hawkeye, Benjamin Poindexter \n", "4 None " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "frame.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Splitting into Tables\n", "\n", "The dataframe above is unwieldy since it contains a list of values in the alt_names column.\n", "Most values also appear in many rows.\n", "Below we seperate the frame into a set of tables with unique rows much like we would for a relational database." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "names = frame[['character_id', 'name']].drop_duplicates()\n", "groups = frame[['group_id', 'group']].drop_duplicates()\n", "character_group = frame[['character_id', 'group_id']].drop_duplicates()\n", "birth_names = frame[\n", " frame['birth_name'].notna() # do not include a row for characters without a birthname\n", "][['character_id', 'birth_name']].drop_duplicates()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "records = []\n", "for uid, alt_names in zip(frame['character_id'], frame['alt_names']):\n", " if alt_names is None:\n", " continue\n", " for name in alt_names.split(','):\n", " records.append({'character_id': uid, 'alt_name': name})\n", "alt_names = pd.DataFrame.from_records(records).drop_duplicates()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Analysis\n", "\n", "Lets quickly examine the data to check it still makes sense.\n", "\n", "There are 399 unique characters in the dataset" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "400" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the number of unique character ids\n", "len(names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The characters belong to 107 different groups" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "107" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# count the number of unique group ids\n", "len(groups)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's find Wolverine..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
character_idname
\n", "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [character_id, name]\n", "Index: []" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "names[names['name'] == 'Wolverine']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What groups has he been in?" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 X-Men\n", "1 Alpha Flight\n", "2 Avengers\n", "3 Horsemen of Apocalypse\n", "4 Team X\n", "Name: group, dtype: object" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "character_group[\n", " character_group['character_id'] == 'http://www.wikidata.org/entity/Q186422'\n", "].merge(groups, on='group_id')['group']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What are his alternative names?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
alt_namecharacter_id
187Loganhttp://www.wikidata.org/entity/Q186422
188Weapon Xhttp://www.wikidata.org/entity/Q186422
189Jim Loganhttp://www.wikidata.org/entity/Q186422
190Patchhttp://www.wikidata.org/entity/Q186422
191James Howletthttp://www.wikidata.org/entity/Q186422
192Agent Tenhttp://www.wikidata.org/entity/Q186422
193Experiment Xhttp://www.wikidata.org/entity/Q186422
194Weapon Tenhttp://www.wikidata.org/entity/Q186422
\n", "
" ], "text/plain": [ " alt_name character_id\n", "187 Logan http://www.wikidata.org/entity/Q186422\n", "188 Weapon X http://www.wikidata.org/entity/Q186422\n", "189 Jim Logan http://www.wikidata.org/entity/Q186422\n", "190 Patch http://www.wikidata.org/entity/Q186422\n", "191 James Howlett http://www.wikidata.org/entity/Q186422\n", "192 Agent Ten http://www.wikidata.org/entity/Q186422\n", "193 Experiment X http://www.wikidata.org/entity/Q186422\n", "194 Weapon Ten http://www.wikidata.org/entity/Q186422" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alt_names[alt_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is his birthname?" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
character_idbirth_name
103http://www.wikidata.org/entity/Q186422James Howlett
\n", "
" ], "text/plain": [ " character_id birth_name\n", "103 http://www.wikidata.org/entity/Q186422 James Howlett" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "birth_names[birth_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What is the biggest team?" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(116, group_id group\n", " 125 http://www.wikidata.org/entity/Q128452 X-Men)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(\n", " [\n", " (len(group), group[['group_id', 'group']].drop_duplicates()) \n", " for uid, group \n", " in character_group.merge(groups, on='group_id').groupby('group_id')\n", " ], \n", " key=lambda x: x[0], \n", " reverse=True\n", ")[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Who has been in the most groups?" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(8, 102 Cannonball\n", " Name: name, dtype: object)" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "sorted(\n", " [\n", " (len(group), group['name'].drop_duplicates()) \n", " for uid, group \n", " in character_group.merge(names, on='character_id').groupby('character_id')\n", " ],\n", " key=lambda x: x[0], \n", " reverse=True\n", ")[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Export to CSV\n", "\n", "Let's write each node to a csv file, we need to record\n", "\n", "* a unique ID for each node (we use a hash of the Wikidata URL since later we will need an integer ID)\n", "* a label (such as the name of the character or the group)\n", "* a type (0, 1, 2 for character, group or birthname)\n", "\n", "Node that birth names don't have a Wikidata URL so we just use a hash of the name." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "def get_id(url):\n", " \"\"\" A function to map the python hash function onto 32-bit integers\"\"\"\n", " return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647 " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "nodes = pd.concat(\n", " [\n", " pd.DataFrame({'uid': [get_id(item) for item in names['character_id']], 'type': 0, 'label': names['name']}),\n", " pd.DataFrame({'uid': [get_id(item) for item in groups['group_id']], 'type': 1, 'label': groups['group']}),\n", " pd.DataFrame({'uid': [get_id(item) for item in birth_names['birth_name']], 'type': 2, 'label': birth_names['birth_name']}),\n", " pd.DataFrame({'uid': [get_id(item) for item in alt_names['alt_name']], 'type': 2, 'label': alt_names['alt_name']})\n", " ], \n", " sort=True\n", ").drop_duplicates()\n", "nodes.to_csv('./nodes.csv', index=False)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
labeltypeuid
0Colossus02105314676
1Rachel Summers0298635603
2Black Widow01897346471
3Bullseye02027281781
4Medusa0347320780
\n", "
" ], "text/plain": [ " label type uid\n", "0 Colossus 0 2105314676\n", "1 Rachel Summers 0 298635603\n", "2 Black Widow 0 1897346471\n", "3 Bullseye 0 2027281781\n", "4 Medusa 0 347320780" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "nodes.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Edges connect characters to their birth names and their groups." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# character_group_edges = frame[['character_id', 'group_id']].drop_duplicates()\n", "# character_birth_name_edges = frame[['character_id', 'birth_name']].drop_duplicates()\n", "\n", "edges = pd.concat([\n", " # character to group\n", " pd.DataFrame([\n", " {'start': get_id(start), 'end': get_id(end)}\n", " for start, end in zip(character_group['character_id'], character_group['group_id'])\n", " ]),\n", "\n", " # character to alt name\n", " pd.DataFrame([\n", " {'start': get_id(start), 'end': get_id(end)}\n", " for start, end in zip(alt_names['character_id'], alt_names['alt_name'])\n", " ])\n", "])\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "edges.to_csv('./edges.csv', index=False)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.5" } }, "nbformat": 4, "nbformat_minor": 2 }