Creating a Dataset¶
[1]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd
import hashlib
To install the dependencies for this notebook:
conda env create -f environment.yml
source activate fornax_tutorial
To run this notebook from the project root:
cd docs/tutorial
jupyter-notebook
Download¶
For the duration of this tutorial we will be using the social network of Marvel comic book characters.
Nodes will represent:
- characters
- aliases
- groups of characters
Edges will represent relationships between the nodes.
For example Wolverine, Logan and X-Men are all nodes. There is an edge between Wolverine and Logan because Logan is an alternative name for Wolverine. There is an edge between Wolverine and X-Men because Wolverine is a member of X-Men. There is no direct relationship between Logan and X-Men so there is no edge between them.
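The Wolverine example above can be sketched as a tiny edge list (toy labels only, not the real dataset):

```python
# Toy sketch of the example above: three nodes, two edges.
nodes = {"Wolverine", "Logan", "X-Men"}
edges = {
    frozenset({"Wolverine", "Logan"}),  # Logan is an alias of Wolverine
    frozenset({"Wolverine", "X-Men"}),  # Wolverine is a member of X-Men
}

# Logan and X-Men are only connected through Wolverine:
assert frozenset({"Logan", "X-Men"}) not in edges
```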
SPARQL¶
Below is a SPARQL query which will return data in the following format (using ‘Wolverine’ as an example):
{
    "group": {
        "type": "uri",
        "value": "http://www.wikidata.org/entity/Q2690825"
    },
    "character": {
        "type": "uri",
        "value": "http://www.wikidata.org/entity/Q186422"
    },
    "birthName": {
        "xml:lang": "en",
        "type": "literal",
        "value": "James Howlett"
    },
    "characterLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Wolverine"
    },
    "groupLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Horsemen of Apocalypse"
    },
    "characterAltLabel": {
        "xml:lang": "en",
        "type": "literal",
        "value": "Logan, Weapon X, Jim Logan, Patch, James Howlett, Agent Ten, Experiment X, Weapon Ten"
    }
}
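Each field in a binding wraps its value in a dict, and the optional fields (`birthName`, `characterAltLabel`) may be missing entirely. A small sketch of reading one binding safely, using sample values copied from the example above:

```python
# One binding in the shape shown above (sample values, not a live result).
binding = {
    "character": {"type": "uri", "value": "http://www.wikidata.org/entity/Q186422"},
    "characterLabel": {"xml:lang": "en", "type": "literal", "value": "Wolverine"},
}

# Every field wraps its value, and optional fields such as birthName
# may be absent, so chain .get() with defaults.
name = binding["characterLabel"]["value"]
birth = binding.get("birthName", {}).get("value")  # None when the field is missing
```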
[2]:
sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setQuery("""
SELECT ?character ?characterLabel ?group ?groupLabel ?birthName ?characterAltLabel
WHERE {
    ?group wdt:P31 wd:Q14514600 ;   # group of fictional characters
           wdt:P1080 wd:Q931597 .   # from Marvel universe
    ?character wdt:P463 ?group .    # member of group
    OPTIONAL { ?character wdt:P1477 ?birthName . }
    SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
Loading with Pandas¶
We’ll be using pandas for some data manipulation, so let’s put the results inside a pandas DataFrame.
[3]:
# load the results into a pandas DataFrame
records = []
for result in results["results"]["bindings"]:
    character_id = result['character']['value']
    group_id = result['group']['value']
    name = result['characterLabel']['value']
    group = result['groupLabel']['value']
    alt_names = None
    if 'characterAltLabel' in result:
        alt_names = result['characterAltLabel']['value']
    birth_name = None
    if 'birthName' in result:
        birth_name = result['birthName']['value']
    records.append((character_id, group_id, name, group, birth_name, alt_names))
frame = pd.DataFrame.from_records(records, columns=['character_id', 'group_id', 'name', 'group', 'birth_name', 'alt_names'])
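As an aside, pandas can flatten nested bindings like these directly with `pandas.json_normalize` (available as a top-level function since pandas 1.0); nested keys become dotted column names. A sketch on a hand-built sample binding:

```python
import pandas as pd

# Sample bindings in the shape returned by the SPARQL endpoint
# (copied from the Wolverine example above, trimmed to two fields).
bindings = [
    {
        "character": {"type": "uri", "value": "http://www.wikidata.org/entity/Q186422"},
        "characterLabel": {"xml:lang": "en", "type": "literal", "value": "Wolverine"},
    }
]

# json_normalize flattens nested dicts; keys become dotted column names.
flat = pd.json_normalize(bindings)
wide = flat[["character.value", "characterLabel.value"]]
```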
[4]:
frame.head()
[4]:
| | character_id | group_id | name | group | birth_name | alt_names |
|---|---|---|---|---|---|---|
| 0 | http://www.wikidata.org/entity/Q60002 | http://www.wikidata.org/entity/Q2603976 | Colossus | Excalibur | Пётр Николаевич Распутин | Peter Rasputin, Piotr Nikolayevich Rasputin, P... |
| 1 | http://www.wikidata.org/entity/Q258015 | http://www.wikidata.org/entity/Q2603976 | Rachel Summers | Excalibur | Rachel Anne Summers | Phoenix, Prestige, Marvel Girl, Mother Askani,... |
| 2 | http://www.wikidata.org/entity/Q369197 | http://www.wikidata.org/entity/Q2527918 | Black Widow | Thunderbolts | Наталья Алиановна Романова | Natalie Rushman, Natasha Romanoff, asesina rus... |
| 3 | http://www.wikidata.org/entity/Q388316 | http://www.wikidata.org/entity/Q2527918 | Bullseye | Thunderbolts | None | Lester, Hawkeye, Benjamin Poindexter |
| 4 | http://www.wikidata.org/entity/Q432272 | http://www.wikidata.org/entity/Q2457162 | Medusa | Frightful Four | None | None |
Splitting into Tables¶
The dataframe above is unwieldy: the alt_names column holds comma-separated lists, and most values appear in many rows. Below we separate the frame into a set of tables with unique rows, much as we would for a relational database.
[5]:
names = frame[['character_id', 'name']].drop_duplicates()
groups = frame[['group_id', 'group']].drop_duplicates()
character_group = frame[['character_id', 'group_id']].drop_duplicates()
birth_names = frame[
    frame['birth_name'].notna()  # do not include a row for characters without a birth name
][['character_id', 'birth_name']].drop_duplicates()
[6]:
records = []
for uid, alt_names in zip(frame['character_id'], frame['alt_names']):
    if alt_names is None:
        continue
    for name in alt_names.split(','):
        # strip the leading space left behind after each comma
        records.append({'character_id': uid, 'alt_name': name.strip()})
alt_names = pd.DataFrame.from_records(records).drop_duplicates()
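The same split can be done entirely in pandas with `Series.str.split` and `DataFrame.explode` (pandas >= 0.25). A sketch on a one-row sample frame:

```python
import pandas as pd

# One-row sample in the same shape as the tutorial's frame.
sample = pd.DataFrame({
    'character_id': ['Q186422'],
    'alt_names': ['Logan, Weapon X, Patch'],
})

alt = (
    sample.dropna(subset=['alt_names'])
          .assign(alt_name=lambda df: df['alt_names'].str.split(','))  # comma-separated -> list
          .explode('alt_name')                                         # one row per list item
          .assign(alt_name=lambda df: df['alt_name'].str.strip())      # drop the space after each comma
          [['character_id', 'alt_name']]
          .drop_duplicates()
)
```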
Analysis¶
Let’s quickly examine the data to check it still makes sense.
There are 400 unique characters in the dataset
[7]:
# count the number of unique character ids
len(names)
[7]:
400
The characters belong to 107 different groups
[8]:
# count the number of unique group ids
len(groups)
[8]:
107
Let’s find Wolverine…
[9]:
names[names['name'] == 'Wolverine']
[9]:
| | character_id | name |
|---|---|---|
| | http://www.wikidata.org/entity/Q186422 | Wolverine |
What groups has he been in?
[10]:
character_group[
character_group['character_id'] == 'http://www.wikidata.org/entity/Q186422'
].merge(groups, on='group_id')['group']
[10]:
0 X-Men
1 Alpha Flight
2 Avengers
3 Horsemen of Apocalypse
4 Team X
Name: group, dtype: object
What are his alternative names?
[11]:
alt_names[alt_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']
[11]:
| | alt_name | character_id |
|---|---|---|
| 187 | Logan | http://www.wikidata.org/entity/Q186422 |
| 188 | Weapon X | http://www.wikidata.org/entity/Q186422 |
| 189 | Jim Logan | http://www.wikidata.org/entity/Q186422 |
| 190 | Patch | http://www.wikidata.org/entity/Q186422 |
| 191 | James Howlett | http://www.wikidata.org/entity/Q186422 |
| 192 | Agent Ten | http://www.wikidata.org/entity/Q186422 |
| 193 | Experiment X | http://www.wikidata.org/entity/Q186422 |
| 194 | Weapon Ten | http://www.wikidata.org/entity/Q186422 |
What is his birth name?
[12]:
birth_names[birth_names['character_id'] == 'http://www.wikidata.org/entity/Q186422']
[12]:
| | character_id | birth_name |
|---|---|---|
| 103 | http://www.wikidata.org/entity/Q186422 | James Howlett |
What is the biggest team?
[13]:
sorted(
    [
        (len(group), group[['group_id', 'group']].drop_duplicates())
        for uid, group
        in character_group.merge(groups, on='group_id').groupby('group_id')
    ],
    key=lambda x: x[0],
    reverse=True
)[0]
[13]:
(116, group_id group
125 http://www.wikidata.org/entity/Q128452 X-Men)
Who has been in the most groups?
[14]:
sorted(
    [
        (len(group), group['name'].drop_duplicates())
        for uid, group
        in character_group.merge(names, on='character_id').groupby('character_id')
    ],
    key=lambda x: x[0],
    reverse=True
)[0]
[14]:
(8, 102 Cannonball
Name: name, dtype: object)
Export to CSV¶
Let’s write each node to a CSV file. For each node we need to record:
- a unique ID (we use a hash of the Wikidata URL, since later we will need an integer ID)
- a label (such as the name of the character or the group)
- a type (0 for characters, 1 for groups, 2 for names)
Note that birth names and alternative names don’t have a Wikidata URL, so we hash the name itself.
[15]:
def get_id(url):
    """Map a string to a stable integer in [0, 2**31 - 1) via SHA-256."""
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647
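Unlike Python’s built-in `hash`, which is randomized per process for strings, the SHA-256 digest gives the same id on every run, which is what makes it safe to use as a persistent key. Restating the function so this check runs standalone:

```python
import hashlib

def get_id(url):
    """Map a string to a stable integer in [0, 2**31 - 1) via SHA-256."""
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647

wolverine = 'http://www.wikidata.org/entity/Q186422'
assert get_id(wolverine) == get_id(wolverine)  # same value in every process
assert 0 <= get_id(wolverine) < 2147483647     # fits a signed 32-bit integer
```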
[16]:
nodes = pd.concat(
    [
        pd.DataFrame({'uid': [get_id(item) for item in names['character_id']], 'type': 0, 'label': names['name']}),
        pd.DataFrame({'uid': [get_id(item) for item in groups['group_id']], 'type': 1, 'label': groups['group']}),
        pd.DataFrame({'uid': [get_id(item) for item in birth_names['birth_name']], 'type': 2, 'label': birth_names['birth_name']}),
        pd.DataFrame({'uid': [get_id(item) for item in alt_names['alt_name']], 'type': 2, 'label': alt_names['alt_name']})
    ],
    sort=True
).drop_duplicates()
nodes.to_csv('./nodes.csv', index=False)
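Mapping labels into the range [0, 2**31 - 1) can in principle collide: two different labels could share a `uid`. A standalone sketch of the check, with a toy label list standing in for the real node labels:

```python
import hashlib

def get_id(url):
    return int(hashlib.sha256(url.encode('utf-8')).hexdigest(), 16) % 2147483647

# Toy labels standing in for the real node labels.
labels = ['Wolverine', 'X-Men', 'Logan', 'James Howlett']
uids = [get_id(label) for label in labels]

# With ~2.1 billion buckets a collision is unlikely for a few hundred
# nodes, but it is cheap to verify before writing the file.
assert len(set(uids)) == len(uids), "uid collision - use a wider hash"
```

On the real frame the equivalent check is `assert nodes['uid'].is_unique`.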
[17]:
nodes.head()
[17]:
| | label | type | uid |
|---|---|---|---|
| 0 | Colossus | 0 | 2105314676 |
| 1 | Rachel Summers | 0 | 298635603 |
| 2 | Black Widow | 0 | 1897346471 |
| 3 | Bullseye | 0 | 2027281781 |
| 4 | Medusa | 0 | 347320780 |
Edges connect characters to their groups and to their alternative names.
[18]:
edges = pd.concat([
    # character to group
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(character_group['character_id'], character_group['group_id'])
    ]),
    # character to alt name
    pd.DataFrame([
        {'start': get_id(start), 'end': get_id(end)}
        for start, end in zip(alt_names['character_id'], alt_names['alt_name'])
    ])
])
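Before writing the edge list it is worth checking that every endpoint refers to a known node; a dangling id would be silent here because `get_id` never fails. A sketch on toy tables shaped like `nodes` and `edges`:

```python
import pandas as pd

# Toy tables in the same shape as nodes and edges.
nodes = pd.DataFrame({'uid': [1, 2, 3], 'type': [0, 1, 2], 'label': ['a', 'b', 'c']})
edges = pd.DataFrame({'start': [1, 1], 'end': [2, 3]})

known = set(nodes['uid'])
dangling = edges[~(edges['start'].isin(known) & edges['end'].isin(known))]
assert dangling.empty, "edges reference uids missing from nodes"
```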
[19]:
edges.to_csv('./edges.csv', index=False)