Creating a Dataset

[1]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import hashlib

To install the dependencies for this notebook:

conda env create -f environment.yml
source activate fornax_tutorial

To run this notebook from the project root:

cd docs/tutorial
jupyter-notebook

Download

For the duration of this tutorial we will be using the social network of Marvel comic book characters.

Nodes will represent

  • characters
  • aliases
  • groups of characters

Edges will represent relationships between the nodes.

For example, Wolverine, Logan and X-Men are all nodes. There is an edge between Wolverine and Logan because Logan is an alternative name for Wolverine. There is an edge between Wolverine and X-Men because Wolverine is a member of the X-Men. There is no direct relationship between Logan and X-Men, so there is no edge between them.
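The Wolverine example above can be sketched as a tiny graph. This is purely illustrative (the node names are taken from the paragraph above, not from the dataset we are about to build):

```python
# A minimal sketch of the example graph: three nodes, two edges.
nodes = {"Wolverine", "Logan", "X-Men"}
edges = {
    ("Wolverine", "Logan"),   # alias relationship
    ("Wolverine", "X-Men"),   # group membership
}

# Logan and X-Men are only connected indirectly, through Wolverine.
assert ("Logan", "X-Men") not in edges
```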

SPARQL

Below is a SPARQL query which will return data in the following format (using ‘Wolverine’ as an example):

{
    "group": {
        "type": "uri",
        "value": "http://www.wikidata.org/entity/Q2690825"
    },
    "character": {
        "type": "uri",
        "value": "http://www.wikidata.org/entity/Q186422"
    },
    "birthName": {
        "xml:lang": "en",
        "type": "literal",
        "value": "James Howlett"
    },
    "characterLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Wolverine"
    },
    "groupLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Horsemen of Apocalypse"
    },
    "characterAltLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Logan, Weapon X, Jim Logan, Patch, James Howlett, Agent Ten, Experiment X, Weapon Ten"
    }
}
[2]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
    SELECT ?character ?characterLabel ?group ?groupLabel ?birthName ?characterAltLabel
    WHERE {
        ?group wdt:P31 wd:Q14514600 ;  # group of fictional characters
              wdt:P1080 wd:Q931597.    # from Marvel universe
        ?character wdt:P463 ?group.    # member of group
        optional{ ?character wdt:P1477 ?birthName. }
      SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en".}
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

Loading with Pandas

We’ll be using pandas to do some data manipulation, so let’s put the results inside a pandas DataFrame.

[3]:
# load the results into a pandas DataFrame
records = []
for result in results["results"]["bindings"]:
    character_id = result['character']['value']
    group_id = result['group']['value']
    name = result['characterLabel']['value']
    group = result['groupLabel']['value']
    alt_names = None
    if 'characterAltLabel' in result:
        alt_names = result['characterAltLabel']['value']
    birth_name = None
    if 'birthName' in result:
        birth_name = result['birthName']['value']
    records.append((character_id, group_id, name, group, birth_name, alt_names))

frame = pd.DataFrame.from_records(records, columns=['character_id', 'group_id', 'name', 'group', 'birth_name', 'alt_names'])
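As an aside, the same flattening can be done more compactly with `pandas.json_normalize`, which expands the nested bindings into dotted column names. Below is a sketch on a synthetic binding that mimics the JSON format shown earlier; the column filtering assumes every field of interest ends in `.value`:

```python
import pandas as pd

# A synthetic SPARQL-style binding, mimicking the format shown above.
bindings = [
    {
        "character": {"type": "uri", "value": "http://www.wikidata.org/entity/Q186422"},
        "characterLabel": {"xml:lang": "en", "type": "literal", "value": "Wolverine"},
    }
]

# json_normalize produces columns like 'character.value' and 'character.type'.
flat = pd.json_normalize(bindings)
# Keep only the '.value' columns and strip the suffix.
flat = flat[[c for c in flat.columns if c.endswith(".value")]]
flat.columns = [c[: -len(".value")] for c in flat.columns]
print(flat.loc[0, "characterLabel"])  # → Wolverine
```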
[4]:
frame.head()
[4]:
character_id group_id name group birth_name alt_names
0 http://www.wikidata.org/entity/Q60002 http://www.wikidata.org/entity/Q2603976 Colossus Excalibur Пётр Николаевич Распутин Peter Rasputin, Piotr Nikolayevich Rasputin, P...
1 http://www.wikidata.org/entity/Q258015 http://www.wikidata.org/entity/Q2603976 Rachel Summers Excalibur Rachel Anne Summers Phoenix, Prestige, Marvel Girl, Mother Askani,...
2 http://www.wikidata.org/entity/Q369197 http://www.wikidata.org/entity/Q2527918 Black Widow Thunderbolts Наталья Алиановна Романова Natalie Rushman, Natasha Romanoff, asesina rus...
3 http://www.wikidata.org/entity/Q388316 http://www.wikidata.org/entity/Q2527918 Bullseye Thunderbolts None Lester, Hawkeye, Benjamin Poindexter
4 http://www.wikidata.org/entity/Q432272 http://www.wikidata.org/entity/Q2457162 Medusa Frightful Four None None

Splitting into Tables

The DataFrame above is unwieldy: the alt_names column contains a comma-separated list of values, and most values appear in many rows. Below we separate the frame into a set of tables with unique rows, much as we would for a relational database.

[5]:
names = frame[['character_id', 'name']].drop_duplicates()
groups = frame[['group_id', 'group']].drop_duplicates()
character_group = frame[['character_id', 'group_id']].drop_duplicates()
birth_names = frame[
    frame['birth_name'].notna() # exclude characters without a birth name
][['character_id', 'birth_name']].drop_duplicates()
[6]:
records = []
for uid, alt_names in zip(frame['character_id'], frame['alt_names']):
    if alt_names is None:
        continue
    for name in alt_names.split(','):
        # strip the leading space left by the ", " separator
        records.append({'character_id': uid, 'alt_name': name.strip()})
alt_names = pd.DataFrame.from_records(records).drop_duplicates()
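The loop above can also be written with pandas string methods. This is an equivalent sketch using `str.split` and `explode` (available since pandas 0.25), run here on a synthetic frame with the same columns:

```python
import pandas as pd

# Synthetic frame mimicking the character_id / alt_names columns above.
frame = pd.DataFrame({
    "character_id": ["Q186422"],
    "alt_names": ["Logan, Weapon X, Patch"],
})

# Drop rows with no alt names, split the string, then explode one name per row.
alt = frame.dropna(subset=["alt_names"]).copy()
alt["alt_name"] = alt["alt_names"].str.split(", ")
alt = alt.explode("alt_name")[["character_id", "alt_name"]].drop_duplicates()
print(alt["alt_name"].tolist())  # → ['Logan', 'Weapon X', 'Patch']
```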

Analysis

Let’s quickly examine the data to check that it still makes sense.

There are 400 unique characters in the dataset

[7]:
# count the number of unique character ids
len(names)
[7]:
400

The characters belong to 107 different groups

[8]:
# count the number of unique group ids
len(groups)
[8]:
107

Let’s find Wolverine…

[9]:
names[names['name'] == 'Wolverine']
[9]:
character_id name

What groups has he been in?

[10]:
character_group[
    character_group['character_id'] == 'http://www.wikidata.org/entity/Q186422'
].merge(groups, on='group_id')['group']
[10]:
0                     X-Men
1              Alpha Flight
2                  Avengers
3    Horsemen of Apocalypse
4                    Team X
Name: group, dtype: object

What are his alternative names?

[11]:
alt_names[alt_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']
[11]:
alt_name character_id
187 Logan http://www.wikidata.org/entity/Q186422
188 Weapon X http://www.wikidata.org/entity/Q186422
189 Jim Logan http://www.wikidata.org/entity/Q186422
190 Patch http://www.wikidata.org/entity/Q186422
191 James Howlett http://www.wikidata.org/entity/Q186422
192 Agent Ten http://www.wikidata.org/entity/Q186422
193 Experiment X http://www.wikidata.org/entity/Q186422
194 Weapon Ten http://www.wikidata.org/entity/Q186422

What is his birthname?

[12]:
birth_names[birth_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']
[12]:
character_id birth_name
103 http://www.wikidata.org/entity/Q186422 James Howlett

What is the biggest team?

[13]:
sorted(
    [
        (len(group), group[['group_id', 'group']].drop_duplicates())
        for uid, group
        in character_group.merge(groups, on='group_id').groupby('group_id')
    ],
    key=lambda x: x[0],
    reverse=True
)[0]
[13]:
(116,                                    group_id  group
 125  http://www.wikidata.org/entity/Q128452  X-Men)
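The sort-and-take-first pattern above can be expressed more directly with a groupby size. This is an alternative sketch, shown here on a small synthetic membership table rather than the real merged frame:

```python
import pandas as pd

# Synthetic membership table, mimicking character_group merged with groups.
members = pd.DataFrame({
    "group": ["X-Men", "X-Men", "X-Men", "Avengers", "Avengers"],
    "character_id": ["Q1", "Q2", "Q3", "Q1", "Q4"],
})

# Count members per group; idxmax gives the largest group directly.
sizes = members.groupby("group").size()
print(sizes.idxmax(), sizes.max())  # → X-Men 3
```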

Who has been in the most groups?

[14]:
sorted(
    [
        (len(group), group['name'].drop_duplicates())
        for uid, group
        in character_group.merge(names, on='character_id').groupby('character_id')
    ],
    key=lambda x: x[0],
    reverse=True
)[0]
[14]:
(8, 102    Cannonball
 Name: name, dtype: object)

Export to CSV

Let’s write each node to a CSV file. For each node we need to record:

  • a unique ID (we use a hash of the Wikidata URL, since later we will need an integer ID)
  • a label (such as the name of the character or the group)
  • a type (0 for characters, 1 for groups, 2 for birth names and alternative names)

Note that birth names and alternative names don’t have a Wikidata URL, so we hash the name itself instead.

[15]:
def get_id(url):
    """Map a string to a stable 32-bit integer via SHA-256 (2147483647 is 2**31 - 1)"""
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647
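A quick check of the two properties we rely on: the hash is deterministic across runs (unlike Python's built-in `hash`, which is salted per process), and the modulus keeps every ID inside the signed 32-bit range:

```python
import hashlib

def get_id(url):
    """Map a string to a stable 32-bit integer via SHA-256 (same as above)."""
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647

uid = get_id('http://www.wikidata.org/entity/Q186422')
assert uid == get_id('http://www.wikidata.org/entity/Q186422')  # deterministic
assert 0 <= uid < 2**31 - 1                                     # fits a signed 32-bit int
```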
[16]:
nodes = pd.concat(
    [
        pd.DataFrame({'uid': [get_id(item) for item in names['character_id']], 'type': 0, 'label': names['name']}),
        pd.DataFrame({'uid': [get_id(item) for item in groups['group_id']], 'type': 1, 'label': groups['group']}),
        pd.DataFrame({'uid': [get_id(item) for item in birth_names['birth_name']], 'type': 2, 'label': birth_names['birth_name']}),
        pd.DataFrame({'uid': [get_id(item) for item in alt_names['alt_name']], 'type': 2, 'label': alt_names['alt_name']})
    ],
    sort=True
).drop_duplicates()
nodes.to_csv('./nodes.csv', index=False)
[17]:
nodes.head()
[17]:
label type uid
0 Colossus 0 2105314676
1 Rachel Summers 0 298635603
2 Black Widow 0 1897346471
3 Bullseye 0 2027281781
4 Medusa 0 347320780

Edges connect characters to their groups and to their alternative names.

[18]:
edges = pd.concat([
    # character to group
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(character_group['character_id'], character_group['group_id'])
    ]),

    # character to alt name
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(alt_names['character_id'], alt_names['alt_name'])
    ])
])

[19]:
edges.to_csv('./edges.csv', index=False)
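Finally, it’s worth sanity-checking that the exported files round-trip cleanly through `read_csv`. This is a sketch using a small synthetic table in the same shape as the nodes export above (the file path is a temporary directory, not the tutorial’s `./nodes.csv`):

```python
import os
import tempfile
import pandas as pd

# Synthetic nodes table in the same shape as the export above.
nodes = pd.DataFrame({
    "label": ["Wolverine", "X-Men"],
    "type": [0, 1],
    "uid": [123, 456],
})

# Write and read back, then verify nothing changed.
path = os.path.join(tempfile.mkdtemp(), "nodes.csv")
nodes.to_csv(path, index=False)
loaded = pd.read_csv(path)
assert loaded.equals(nodes)
print(len(loaded))  # → 2
```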