Creating a Dataset¶
[1]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import hashlib
To install the dependencies for this notebook:
conda env create -f environment.yml
source activate fornax_tutorial
To run this notebook from the project root:
cd docs/tutorial
jupyter-notebook
Download¶
For the duration of this tutorial we will be using the social network of Marvel comic book characters.
Nodes will represent:
- characters
- aliases
- groups of characters
Edges will represent relationships between the nodes.
For example Wolverine, Logan and X-Men are all nodes. There is an edge between Wolverine and Logan because Logan is an alternative name for Wolverine. There is an edge between Wolverine and X-Men because Wolverine is a member of X-Men. There is no direct relationship between Logan and X-Men so there is no edge between them.
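The Wolverine example above can be sketched as a tiny edge list (toy labels only, not the real dataset):

```python
# Toy sketch of the example above: three nodes, two edges.
nodes = {"Wolverine", "Logan", "X-Men"}
edges = {
    frozenset({"Wolverine", "Logan"}),  # Logan is an alias of Wolverine
    frozenset({"Wolverine", "X-Men"}),  # Wolverine is a member of X-Men
}

# Logan and X-Men are only connected through Wolverine:
assert frozenset({"Logan", "X-Men"}) not in edges
```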
SPARQL¶
Below is a SPARQL query which will return data in the following format (using ‘Wolverine’ as an example):
{
    "group": {
        "type": "uri",
        "value": "http://www.wikidata.org/entity/Q2690825"
    },
    "character": {
        "type": "uri",
        "value": "http://www.wikidata.org/entity/Q186422"
    },
    "birthName": {
        "xml:lang": "en",
        "type": "literal",
        "value": "James Howlett"
    },
    "characterLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Wolverine"
    },
    "groupLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Horsemen of Apocalypse"
    },
    "characterAltLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Logan, Weapon X, Jim Logan, Patch, James Howlett, Agent Ten, Experiment X, Weapon Ten"
    }
}
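Each field in a binding wraps its value in a dict, and the optional fields (`birthName`, `characterAltLabel`) may be missing entirely. A small sketch of reading one binding safely, using sample values copied from the example above:

```python
# One binding in the shape shown above (sample values, not a live result).
binding = {
    "character": {"type": "uri", "value": "http://www.wikidata.org/entity/Q186422"},
    "characterLabel": {"xml:lang": "en", "type": "literal", "value": "Wolverine"},
}

# Every field wraps its value, and optional fields such as birthName
# may be absent, so chain .get() with defaults.
name = binding["characterLabel"]["value"]
birth = binding.get("birthName", {}).get("value")  # None when the field is missing
```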
[2]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?character ?characterLabel ?group ?groupLabel ?birthName ?characterAltLabel
WHERE {
    ?group wdt:P31 wd:Q14514600 ;   # group of fictional characters
           wdt:P1080 wd:Q931597 .   # from Marvel universe
    ?character wdt:P463 ?group .    # member of group
    OPTIONAL { ?character wdt:P1477 ?birthName . }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
Loading with Pandas¶
We’ll be using pandas for some data manipulation, so let’s put the results inside a pandas DataFrame.
[3]:
# load the results into a pandas DataFrame
records = []
for result in results["results"]["bindings"]:
    character_id = result['character']['value']
    group_id = result['group']['value']
    name = result['characterLabel']['value']
    group = result['groupLabel']['value']
    alt_names = None
    if 'characterAltLabel' in result:
        alt_names = result['characterAltLabel']['value']
    birth_name = None
    if 'birthName' in result:
        birth_name = result['birthName']['value']
    records.append((character_id, group_id, name, group, birth_name, alt_names))
frame = pd.DataFrame.from_records(records, columns=['character_id', 'group_id', 'name', 'group', 'birth_name', 'alt_names'])
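As an aside, pandas can flatten nested bindings like these directly with `pandas.json_normalize` (available as a top-level function since pandas 1.0); nested keys become dotted column names. A sketch on a hand-built sample binding:

```python
import pandas as pd

# Sample bindings in the shape returned by the SPARQL endpoint
# (copied from the Wolverine example above, trimmed to two fields).
bindings = [
    {
        "character": {"type": "uri", "value": "http://www.wikidata.org/entity/Q186422"},
        "characterLabel": {"xml:lang": "en", "type": "literal", "value": "Wolverine"},
    }
]

# json_normalize flattens nested dicts; keys become dotted column names.
flat = pd.json_normalize(bindings)
wide = flat[["character.value", "characterLabel.value"]]
```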
[4]:
frame.head()
[4]:
| | character_id | group_id | name | group | birth_name | alt_names |
|---|---|---|---|---|---|---|
| 0 | http://www.wikidata.org/entity/Q60002 | http://www.wikidata.org/entity/Q2603976 | Colossus | Excalibur | Пётр Николаевич Распутин | Peter Rasputin, Piotr Nikolayevich Rasputin, P... |
| 1 | http://www.wikidata.org/entity/Q258015 | http://www.wikidata.org/entity/Q2603976 | Rachel Summers | Excalibur | Rachel Anne Summers | Phoenix, Prestige, Marvel Girl, Mother Askani,... |
| 2 | http://www.wikidata.org/entity/Q369197 | http://www.wikidata.org/entity/Q2527918 | Black Widow | Thunderbolts | Наталья Алиановна Романова | Natalie Rushman, Natasha Romanoff, asesina rus... |
| 3 | http://www.wikidata.org/entity/Q388316 | http://www.wikidata.org/entity/Q2527918 | Bullseye | Thunderbolts | None | Lester, Hawkeye, Benjamin Poindexter |
| 4 | http://www.wikidata.org/entity/Q432272 | http://www.wikidata.org/entity/Q2457162 | Medusa | Frightful Four | None | None |
Splitting into Tables¶
The dataframe above is unwieldy: the alt_names column holds comma-separated lists, and most values appear in many rows. Below we separate the frame into a set of tables with unique rows, much as we would for a relational database.
[5]:
names = frame[['character_id', 'name']].drop_duplicates()
groups = frame[['group_id', 'group']].drop_duplicates()
character_group = frame[['character_id', 'group_id']].drop_duplicates()
birth_names = frame[
    frame['birth_name'].notna()  # do not include a row for characters without a birth name
][['character_id', 'birth_name']].drop_duplicates()
[6]:
records = []
for uid, alt_names in zip(frame['character_id'], frame['alt_names']):
    if alt_names is None:
        continue
    for name in alt_names.split(','):
        # strip the leading space left behind after each comma
        records.append({'character_id': uid, 'alt_name': name.strip()})
alt_names = pd.DataFrame.from_records(records).drop_duplicates()
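The same split can be done entirely in pandas with `Series.str.split` and `DataFrame.explode` (pandas >= 0.25). A sketch on a one-row sample frame:

```python
import pandas as pd

# One-row sample in the same shape as the tutorial's frame.
sample = pd.DataFrame({
    'character_id': ['Q186422'],
    'alt_names': ['Logan, Weapon X, Patch'],
})

alt = (
    sample.dropna(subset=['alt_names'])
          .assign(alt_name=lambda df: df['alt_names'].str.split(','))  # comma-separated -> list
          .explode('alt_name')                                         # one row per list item
          .assign(alt_name=lambda df: df['alt_name'].str.strip())      # drop the space after each comma
          [['character_id', 'alt_name']]
          .drop_duplicates()
)
```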
Analysis¶
Let’s quickly examine the data to check it still makes sense.
There are 400 unique characters in the dataset
[7]:
# count the number of unique character ids
len(names)
[7]:
400
The characters belong to 107 different groups
[8]:
# count the number of unique group ids
len(groups)
[8]:
107
Let’s find Wolverine…
[9]:
names[names['name'] == 'Wolverine']
[9]:
| | character_id | name |
|---|---|---|
| | http://www.wikidata.org/entity/Q186422 | Wolverine |
What groups has he been in?
[10]:
character_group[
character_group['character_id'] == 'http://www.wikidata.org/entity/Q186422'
].merge(groups, on='group_id')['group']
[10]:
0 X-Men
1 Alpha Flight
2 Avengers
3 Horsemen of Apocalypse
4 Team X
Name: group, dtype: object
What are his alternative names?
[11]:
alt_names[alt_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']
[11]:
| | alt_name | character_id |
|---|---|---|
| 187 | Logan | http://www.wikidata.org/entity/Q186422 |
| 188 | Weapon X | http://www.wikidata.org/entity/Q186422 |
| 189 | Jim Logan | http://www.wikidata.org/entity/Q186422 |
| 190 | Patch | http://www.wikidata.org/entity/Q186422 |
| 191 | James Howlett | http://www.wikidata.org/entity/Q186422 |
| 192 | Agent Ten | http://www.wikidata.org/entity/Q186422 |
| 193 | Experiment X | http://www.wikidata.org/entity/Q186422 |
| 194 | Weapon Ten | http://www.wikidata.org/entity/Q186422 |
What is his birth name?
[12]:
birth_names[birth_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']
[12]:
| | character_id | birth_name |
|---|---|---|
| 103 | http://www.wikidata.org/entity/Q186422 | James Howlett |
What is the biggest team?
[13]:
sorted(
    [
        (len(group), group[['group_id', 'group']].drop_duplicates())
        for uid, group
        in character_group.merge(groups, on='group_id').groupby('group_id')
    ],
    key=lambda x: x[0],
    reverse=True
)[0]
[13]:
(116, group_id group
125 http://www.wikidata.org/entity/Q128452 X-Men)
Who has been in the most groups?
[14]:
sorted(
    [
        (len(group), group['name'].drop_duplicates())
        for uid, group
        in character_group.merge(names, on='character_id').groupby('character_id')
    ],
    key=lambda x: x[0],
    reverse=True
)[0]
[14]:
(8, 102 Cannonball
Name: name, dtype: object)
Export to CSV¶
Let’s write each node to a CSV file. For each node we need to record:
- a unique ID (we use a hash of the Wikidata URL, since later we will need an integer ID)
- a label (such as the name of the character or the group)
- a type (0 for characters, 1 for groups, 2 for names)
Note that birth names and alternative names don’t have a Wikidata URL, so we hash the name itself.
[15]:
def get_id(url):
    """Map a string to a stable integer in [0, 2**31 - 1) via SHA-256."""
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647
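Unlike Python’s built-in `hash`, which is randomized per process for strings, the SHA-256 digest gives the same id on every run, which is what makes it safe to use as a persistent key. Restating the function so this check runs standalone:

```python
import hashlib

def get_id(url):
    """Map a string to a stable integer in [0, 2**31 - 1) via SHA-256."""
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647

wolverine = 'http://www.wikidata.org/entity/Q186422'
assert get_id(wolverine) == get_id(wolverine)  # same value in every process
assert 0 <= get_id(wolverine) < 2147483647     # fits a signed 32-bit integer
```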
[16]:
nodes = pd.concat(
    [
        pd.DataFrame({'uid': [get_id(item) for item in names['character_id']], 'type': 0, 'label': names['name']}),
        pd.DataFrame({'uid': [get_id(item) for item in groups['group_id']], 'type': 1, 'label': groups['group']}),
        pd.DataFrame({'uid': [get_id(item) for item in birth_names['birth_name']], 'type': 2, 'label': birth_names['birth_name']}),
        pd.DataFrame({'uid': [get_id(item) for item in alt_names['alt_name']], 'type': 2, 'label': alt_names['alt_name']})
    ],
    sort=True
).drop_duplicates()
nodes.to_csv('./nodes.csv', index=False)
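Mapping labels into the range [0, 2**31 - 1) can in principle collide: two different labels could share a `uid`. A standalone sketch of the check, with a toy label list standing in for the real node labels:

```python
import hashlib

def get_id(url):
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647

# Toy labels standing in for the real node labels.
labels = ['Wolverine', 'X-Men', 'Logan', 'James Howlett']
uids = [get_id(label) for label in labels]

# With ~2.1 billion buckets a collision is unlikely for a few hundred
# nodes, but it is cheap to verify before writing the file.
assert len(set(uids)) == len(uids), "uid collision - use a wider hash"
```

On the real frame the equivalent check is `assert nodes['uid'].is_unique`.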
[17]:
nodes.head()
[17]:
| | label | type | uid |
|---|---|---|---|
| 0 | Colossus | 0 | 2105314676 |
| 1 | Rachel Summers | 0 | 298635603 |
| 2 | Black Widow | 0 | 1897346471 |
| 3 | Bullseye | 0 | 2027281781 |
| 4 | Medusa | 0 | 347320780 |
Edges connect characters to their groups and to their alternative names.
[18]:
edges = pd.concat([
    # character to group
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(character_group['character_id'], character_group['group_id'])
    ]),
    # character to alt name
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(alt_names['character_id'], alt_names['alt_name'])
    ])
])
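Before writing the edge list it is worth checking that every endpoint refers to a known node; a dangling id would be silent here because `get_id` never fails. A sketch on toy tables shaped like `nodes` and `edges`:

```python
import pandas as pd

# Toy tables in the same shape as nodes and edges.
nodes = pd.DataFrame({'uid': [1, 2, 3], 'type': [0, 1, 2], 'label': ['a', 'b', 'c']})
edges = pd.DataFrame({'start': [1, 1], 'end': [2, 3]})

known = set(nodes['uid'])
dangling = edges[~(edges['start'].isin(known) & edges['end'].isin(known))]
assert dangling.empty, "edges reference uids missing from nodes"
```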
[19]:
edges.to_csv('./edges.csv', index=False)