Creating a Dataset

from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import hashlib

To install the dependencies for this notebook:

conda env create -f environment.yml
source activate fornax_tutorial

To run this notebook from the project root:

cd docs/tutorial


For the duration of this tutorial we will be using the social network of Marvel comic book characters.

Nodes will represent

  • characters
  • aliases
  • groups of characters

Edges will represent relationships between the nodes.

For example, Wolverine, Logan and X-Men are all nodes. There is an edge between Wolverine and Logan because Logan is an alternative name for Wolverine. There is an edge between Wolverine and X-Men because Wolverine is a member of the X-Men. There is no direct relationship between Logan and X-Men, so there is no edge between them.
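The Wolverine example above can be sketched as a tiny undirected graph (an illustration only; the real dataset built below uses Wikidata identifiers rather than plain names):

```python
# A minimal sketch of the Wolverine example as a set of nodes and edges.
# The string labels here are illustrative, not the Wikidata URIs used later.
nodes = {"Wolverine", "Logan", "X-Men"}
edges = {("Wolverine", "Logan"),   # alias relationship
         ("Wolverine", "X-Men")}   # group-membership relationship

def neighbours(node):
    """Return every node sharing an edge with `node` (edges are undirected)."""
    return {b for a, b in edges if a == node} | {a for a, b in edges if b == node}

print(neighbours("Wolverine"))  # {'Logan', 'X-Men'}
print(neighbours("Logan"))      # {'Wolverine'} -- no direct edge to X-Men
```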


Below is a SPARQL query which will return data in the following format (using ‘Wolverine’ as an example):

{
    "group": {
        "type": "uri",
        "value": ""
    },
    "character": {
        "type": "uri",
        "value": ""
    },
    "birthName": {
        "xml:lang": "en",
        "type": "literal",
        "value": "James Howlett"
    },
    "characterLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Wolverine"
    },
    "groupLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Horsemen of Apocalypse"
    },
    "characterAltLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Logan, Weapon X, Jim Logan, Patch, James Howlett, Agent Ten, Experiment X, Weapon Ten"
    }
}
sparql = SPARQLWrapper("")
sparql.setQuery("""
    SELECT ?character ?characterLabel ?group ?groupLabel ?birthName ?characterAltLabel
    WHERE {
        ?group wdt:P31 wd:Q14514600 ;  # group of fictional characters
              wdt:P1080 wd:Q931597.    # from Marvel universe
        ?character wdt:P463 ?group.    # member of group
        optional{ ?character wdt:P1477 ?birthName. }
        SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

Loading with Pandas

We’ll be using pandas to do some data manipulation, so let’s put the result inside a pandas DataFrame.

# load the results into a pandas DataFrame
records = []
for result in results["results"]["bindings"]:
    character_id = result['character']['value']
    group_id = result['group']['value']
    name = result['characterLabel']['value']
    group = result['groupLabel']['value']
    alt_names = None
    if 'characterAltLabel' in result:
        alt_names = result['characterAltLabel']['value']
    birth_name = None
    if 'birthName' in result:
        birth_name = result['birthName']['value']
    records.append((character_id, group_id, name, group, birth_name, alt_names))

frame = pd.DataFrame.from_records(records, columns=['character_id', 'group_id', 'name', 'group', 'birth_name', 'alt_names'])
character_id group_id name group birth_name alt_names
0 Colossus Excalibur Пётр Николаевич Распутин Peter Rasputin, Piotr Nikolayevich Rasputin, P...
1 Rachel Summers Excalibur Rachel Anne Summers Phoenix, Prestige, Marvel Girl, Mother Askani,...
2 Black Widow Thunderbolts Наталья Алиановна Романова Natalie Rushman, Natasha Romanoff, asesina rus...
3 Bullseye Thunderbolts None Lester, Hawkeye, Benjamin Poindexter
4 Medusa Frightful Four None None

Splitting into Tables

The dataframe above is unwieldy since it contains a list of values in the alt_names column, and most values also appear in many rows. Below we separate the frame into a set of tables with unique rows, much like we would for a relational database.

names = frame[['character_id', 'name']].drop_duplicates()
groups = frame[['group_id', 'group']].drop_duplicates()
character_group = frame[['character_id', 'group_id']].drop_duplicates()
birth_names = frame[
    frame['birth_name'].notna() # do not include a row for characters without a birthname
][['character_id', 'birth_name']].drop_duplicates()
records = []
for uid, alt_names in zip(frame['character_id'], frame['alt_names']):
    if alt_names is None:
        continue  # this character has no alternative names
    for name in alt_names.split(','):
        # strip the space left behind after each comma
        records.append({'character_id': uid, 'alt_name': name.strip()})
alt_names = pd.DataFrame.from_records(records).drop_duplicates()


Let’s quickly examine the data to check it still makes sense.

There are 399 unique characters in the dataset

# count the number of unique character ids
names['character_id'].nunique()

The characters belong to 107 different groups

# count the number of unique group ids
groups['group_id'].nunique()

Let’s find Wolverine…

names[names['name'] == 'Wolverine']
character_id name

What groups has he been in?

character_group[
    character_group['character_id'] == ''
].merge(groups, on='group_id')['group']
0                     X-Men
1              Alpha Flight
2                  Avengers
3    Horsemen of Apocalypse
4                    Team X
Name: group, dtype: object

What are his alternative names?

alt_names[alt_names['character_id'] == '']
alt_name character_id
187 Logan
188 Weapon X
189 Jim Logan
190 Patch
191 James Howlett
192 Agent Ten
193 Experiment X
194 Weapon Ten

What is his birthname?

birth_names[birth_names['character_id'] == '']
character_id birth_name
103 James Howlett

What is the biggest team?

max(
    (
        (len(group), group[['group_id', 'group']].drop_duplicates())
        for uid, group
        in character_group.merge(groups, on='group_id').groupby('group_id')
    ),
    key=lambda x: x[0],
)
(116,                                    group_id  group
 125  X-Men)

Who has been in the most groups?

max(
    (
        (len(group), group['name'].drop_duplicates())
        for uid, group
        in character_group.merge(names, on='character_id').groupby('character_id')
    ),
    key=lambda x: x[0],
)
(8, 102    Cannonball
 Name: name, dtype: object)

Export to CSV

Let’s write each node to a csv file. We need to record:

  • a unique ID for each node (we use a hash of the Wikidata URL since later we will need an integer ID)
  • a label (such as the name of the character or the group)
  • a type (0 for characters, 1 for groups, and 2 for names, i.e. birth names and aliases)

Note that birth names and aliases don’t have a Wikidata URL, so we just use a hash of the name itself.

def get_id(url):
    """Map a string onto a positive 32-bit integer using a truncated sha256 hash."""
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647
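As a quick sanity check, the mapping is deterministic and always lands in the positive signed 32-bit range. A sketch (the URL below is just an illustrative string, not necessarily a real Wikidata entity):

```python
import hashlib

def get_id(url):
    # as above: interpret the sha256 hex digest as an integer, reduced mod 2**31 - 1
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647

uid = get_id('http://www.wikidata.org/entity/Q0000')      # illustrative URL
assert uid == get_id('http://www.wikidata.org/entity/Q0000')  # deterministic across runs
assert 0 <= uid < 2147483647                              # fits in a signed 32-bit integer
```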
nodes = pd.concat(
    [
        pd.DataFrame({'uid': [get_id(item) for item in names['character_id']], 'type': 0, 'label': names['name']}),
        pd.DataFrame({'uid': [get_id(item) for item in groups['group_id']], 'type': 1, 'label': groups['group']}),
        pd.DataFrame({'uid': [get_id(item) for item in birth_names['birth_name']], 'type': 2, 'label': birth_names['birth_name']}),
        pd.DataFrame({'uid': [get_id(item) for item in alt_names['alt_name']], 'type': 2, 'label': alt_names['alt_name']})
    ]
)
nodes.to_csv('./nodes.csv', index=False)
label type uid
0 Colossus 0 2105314676
1 Rachel Summers 0 298635603
2 Black Widow 0 1897346471
3 Bullseye 0 2027281781
4 Medusa 0 347320780
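Because get_id folds a 256-bit digest into the range [0, 2^31 − 1), two different labels could in principle collide and silently merge into one node. The same label can legitimately appear twice (e.g. as both a birth name and an alias) and will map to the same uid, so a useful check only flags uids shared by *different* labels. A sketch, using a hypothetical miniature stand-in for the nodes frame:

```python
import pandas as pd

# Hypothetical miniature stand-in for the `nodes` frame built above; note the
# repeated label, which legitimately yields a repeated uid.
nodes = pd.DataFrame({
    'uid': [2105314676, 298635603, 298635603],
    'type': [0, 2, 2],
    'label': ['Colossus', 'James Howlett', 'James Howlett'],
})

# A uid shared by two *different* labels would be a genuine hash collision.
labels_per_uid = nodes.groupby('uid')['label'].nunique()
assert (labels_per_uid == 1).all(), "hash collision between distinct labels"
```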

Edges connect characters to their birth names and their groups.

# character_group_edges = frame[['character_id', 'group_id']].drop_duplicates()
# character_birth_name_edges = frame[['character_id', 'birth_name']].drop_duplicates()

edges = pd.concat([
    # character to group
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(character_group['character_id'], character_group['group_id'])
    ]),

    # character to alt name
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(alt_names['character_id'], alt_names['alt_name'])
    ])
])
edges.to_csv('./edges.csv', index=False)
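Finally, it is worth confirming that every edge endpoint refers to a uid that actually exists in the node table, since a typo in the hashing step would otherwise produce dangling edges. A sketch with hypothetical miniature frames standing in for the data written above:

```python
import pandas as pd

# Hypothetical miniature stand-ins for the frames written to csv above.
nodes = pd.DataFrame({'uid': [1, 2, 3], 'type': [0, 1, 2],
                      'label': ['Wolverine', 'X-Men', 'Logan']})
edges = pd.DataFrame({'start': [1, 1], 'end': [2, 3]})

# Every start/end id should be present in the node table.
uids = set(nodes['uid'])
dangling = edges[~(edges['start'].isin(uids) & edges['end'].isin(uids))]
assert dangling.empty, "edges reference node ids missing from the node table"
```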