Tutorial¶

[1]:

import os
import sys
import pandas as pd
import json
import matplotlib.pyplot as plt
import networkx as nx
import fornax

%matplotlib inline
from IPython.core.display import SVG

# Add project root dir
ROOT_DIR = os.path.abspath("../../")
sys.path.append(ROOT_DIR)

To install the use the dependencies for this notebook:

conda env create -f environment.yml
source activate fornax_tutorial

To run this notebook from the project root:

cd docs/tutorial
jupyter-notebook

In this tutorial we will:

Load a graph of superheros and their teams from csv files
Search for nodes in the graph using a string similarity function
Use fornax to search for nodes using string similarity and fuzzy graph matching

The data in this tutorial we be generated using the preceding notebook: Tutorial1.ipynb.

Introduction¶

nodes.csv and edges.csv contain a graph of superheros and their teams along with alternative names for those heros and groups (or aliases).

The image below uses the example of Iron Man, who is known as “Tony” to his friends. Iron man is a member of the Avengers, a.k.a. Earth’s Mightiest Superheros. Other heros are also members of The Avengers, and they will also have aliases. Other heros will also be members of other teams and so and so forth.

All of these heros, teams and aliases together make our target graph, a graph which we will search using fornax.

[2]:

SVG('../img/iron_man.svg')

[2]:

Let’s load the data into the notebook using pandas.

[3]:

# used for converting csv values in nodes.csv
mapping = {
    '0': 'hero',
    '1': 'team',
    '2': 'hero_alias',
    '3': 'team_alias'
}

nodes_df = pd.read_csv(
    './nodes.csv',
    # rename the columns as targets as this will form the target graph
    # (the graph which we will be searching)
    names=['target_label', 'target_type', 'target_id'],
    # ignore the header
    header=0,
    converters = {
        # convert target_type from numeric values to
        # literal string representations for ease of reading
        'target_type': lambda key: mapping.get(key)
    }
)

# contains pairs of target node ids
edges_df = pd.read_csv('./edges.csv')

We can see that the target nodes have a label (the hero’s primary name). The target_type column will be one of hero, team, hero alias, team alias, the four types of nodes in the graph.

(Note that by hero we mean a person in a comic book who has superpowers regardless of them being good or bad)

[4]:

nodes_df['target_label'].head()

[4]:

0         Selene
1    Doctor Doom
2          Viper
3            Sin
4    David North
Name: target_label, dtype: object

Edges are pairs of target_id values. Note that fornax deals with undirected graphs so there is no need to add the edge in the reverse direction. Doing so will cause an exception as the edge will be considered a duplicate.

[5]:

edges_df.head()

[5]:

	end	start
0	839851079	87770955
1	685373387	2073821878
2	1988120854	396175249
3	608208951	396175249
4	1988120854	2062678112

Label similarity¶

For some motivation, before using fornax, let us search for nodes just using their labels. Let’s search for nodes similar to guardians, star and groot.

We will create a function that given a pair of labels, it will return a score where:

\[0 <= score <= 1\]

Secondly we’ll create a search function that returns rows from our table of target nodes that have a non zero similarity score.

[6]:

def node_scoring_function(first: str, second: str):
    """ node scoring function takes two strings and returns a
        score in the range 0 <= score <= 1
    """
    first_, second_ = sorted((first.lower(), second.lower()), key=len)
    # if first is not a substring of second: score = 0
    if not first_ in second_:
        return 0
    # otherwise use the relative difference between
    # the two lengths
    score = len(second_) - len(first_)
    score /= max(len(first_), len(second_))
    score = 1. - score
    return score

[7]:

def search(query_id: int, query_label: str):
    # compute all of the scores
    scores = nodes_df['target_label'].apply(
        node_scoring_function,
        args=(query_label,)
    )
    # create a boolean mask
    mask = scores > 0
    # graph the non zero scoring nodes
    matches = nodes_df[mask].copy()
    # add extra columns
    matches['score'] = scores[mask]
    matches['query_label'] = query_label
    matches['query_id'] = query_id
    return matches

Aside:¶

Note that these string search functions are not terribly efficient. They involve repeated full scans of the target nodes table. If we were searching a larger graph we could use a search tree as an index, an external sting matching service or database. However, since this is a tutorial, the above functions are simpler and more reproducible. This is important as we will be using these search results with fornax.

[8]:

query_labels = ['guardians', 'star', 'groot']

Examining the table below we can see that we have a conundrum. There are 22 nodes with varying similarity to star and 4 nodes similar to galaxy.

[9]:

# find the nodes similar to 'guardians', 'star' and 'groot'
matches = pd.concat(search(id_, label) for id_, label in enumerate(query_labels))
matches

[9]:

	target_label	target_type	target_id	score	query_label	query_id
285	Guardian	hero	1081675	0.888889	guardians	0
427	Guardians of the Galaxy	team	870807271	0.391304	guardians	0
507	Guardians of the Galaxy (1969 team)	team	1295400389	0.257143	guardians	0
1019	Guardian	hero_alias	2062791326	0.888889	guardians	0
10	Danielle Moonstar	hero	2083850919	0.235294	star	1
25	Darkstar	hero	1276753309	0.500000	star	1
71	Firestar	hero	274821742	0.500000	star	1
121	Star-Lord	hero	1061867605	0.444444	star	1
189	Northstar	hero	1260880284	0.444444	star	1
292	Starfox	hero	1594294259	0.571429	star	1
323	Ultimate Firestar	hero	1718026772	0.235294	star	1
338	Shatterstar	hero	1241925506	0.363636	star	1
401	Upstarts	team	839851079	0.500000	star	1
443	Starjammers	team	895117495	0.363636	star	1
474	Starforce	team	1605941117	0.444444	star	1
536	James Proudstar	hero_alias	268149375	0.266667	star	1
587	John Proudstar	hero_alias	880197081	0.285714	star	1
604	Anthony "Tony" Edward Carbonell Stark	hero_alias	2007806013	0.108108	star	1
661	Moonstar	hero_alias	294373473	0.444444	star	1
750	Star-Lord	hero_alias	92571479	0.400000	star	1
831	Starlord	hero_alias	1788314407	0.444444	star	1
832	Star Lord	hero_alias	925434646	0.400000	star	1
1010	Anthony Edward "Tony" Stark	hero_alias	2138996395	0.142857	star	1
1011	Tony Stark	hero_alias	182299133	0.363636	star	1
1014	The Star Spangled Man With A Plan	hero_alias	1915573563	0.117647	star	1
1069	Firestar	hero_alias	1580065367	0.444444	star	1
120	Groot	hero	74671434	1.000000	groot	2

Fornax enables a more powerful type of search. By specifying ‘guardians’, ‘star’, ‘groot’ as nodes in a graph, and by specifying the relationships between them, we can search for nodes in our target graph with the same relationships.

Creating a target graph¶

Fornax behaves much like a database. In fact it uses SQLite or Postgresql to store graph data and index it. To insert a new graph into fornax we can use the following three steps: 1. create a new graph 2. add nodes and node meta data 3. add edges and edge meta data

The object fornax.GraphHandle is much like a file handle. It does not represent the graph but it is an accessor to it. If the GraphHandle goes out of scope the graph will still persist until it is explicitly deleted, much like a file.

[10]:

with fornax.Connection('sqlite:///mydb.sqlite') as conn:
    target_graph = fornax.GraphHandle.create(conn)
    target_graph.add_nodes(
        # use id_src to set a custom id on each node
        id_src=nodes_df['target_id'],
        # use other keyword arguments to attach arbitrary metadata to each node
        label=nodes_df['target_label'],
        # the type keyword is reserved to we use target_type
        target_type=nodes_df['target_type']
        # meta data must be json serialisable
    )
    target_graph.add_edges(edges_df['start'], edges_df['end'])

We can use the graph_id to access our graph in the future.

[11]:

with fornax.Connection('sqlite:///mydb.sqlite') as conn:
    target_graph.graph_id
    another_target_graph_handle = fornax.GraphHandle.read(conn, target_graph.graph_id)
    print(another_target_graph_handle == target_graph)

True

Creating a query graph¶

Let’s imagine that we suspect groot is directly related to guardians and star is also directly related to guardians. For example groot and star could both be members of a team called guardians. Let’s create another small graph that represents this situation:

[12]:

with fornax.Connection('sqlite:///mydb.sqlite') as conn:
    # create a new graph
    query_graph = fornax.GraphHandle.create(conn)

    # insert the three nodes:
    #   'guardians' (id=0), 'star' (id=1), 'groot' (id=2)
    query_graph.add_nodes(label=query_labels)

    # alternatively:
    #    query_graph.add_nodes(id_src=query_labels)
    # since id_src can use any unique hashable items

    edges = [
        (0, 1), # edge between groot and guardians
        (0, 2)  # edge between star and guardians
    ]

    sources, targets = zip(*edges)
    query_graph.add_edges(sources, targets)

Search¶

We can create a query in an analogous way to creating graphs using a QueryHandle, a handle to a query stored in the fornax database. To create a useful query we need to insert the string similarity scores we computed in part 1. Fornax will use these scores and the graph edges to execute the query.

[13]:

with fornax.Connection('sqlite:///mydb.sqlite') as conn:
    query = fornax.QueryHandle.create(conn, query_graph, target_graph)
    query.add_matches(matches['query_id'], matches['target_id'], matches['score'])

Finally we can execute the query using a variety of options. We specify we want the top 5 best matches between the query graph and the target graph.

[14]:

with fornax.Connection('sqlite:///mydb.sqlite') as conn:
    %time results = query.execute(n=5)

CPU times: user 69.4 ms, sys: 2.35 ms, total: 71.8 ms
Wall time: 74.1 ms

/Users/dstaff/anaconda3/envs/fornax/lib/python3.6/site-packages/numpy/core/records.py:513: FutureWarning: Numpy has detected that you may be viewing or writing to an array returned by selecting multiple fields in a structured array.

This code may break in numpy 1.15 because this will return a view instead of a copy -- see release notes for details.
  return obj.view(dtype=(self.dtype.type, obj.dtype))

Visualise¶

query.execute returns an object describing the search result. Of primary interest is the graph field which contains a list of graphs in node_link_graph format. We can use networkx to draw these graphs and visualise the results.

[15]:

def draw(graph):
    """ function for drawing a graph using matplotlib and networkx"""

    # each graph is already in node_link_graph format
    G = nx.json_graph.node_link_graph(graph)

    labels = {node['id']: node['label'] for node in graph['nodes']}
    node_colour = ['r' if node['type'] == 'query' else 'b' for node in graph['nodes']]
    pos = nx.spring_layout(G)
    nx.draw_networkx_nodes(G, pos, node_size=600, node_color=node_colour, alpha=.3)
    edgelist = [(e['source'], e['target']) for e in graph['links'] if e['type'] != 'match']
    nx.draw_networkx_edges(G, pos, width=3, edgelist=edgelist, edge_color='grey', alpha=.3)
    edgelist = [(e['source'], e['target']) for e in graph['links'] if e['type'] == 'match']
    nx.draw_networkx_edges(G, pos, width=3, edgelist=edgelist, style='dashed', edge_color='pink')
    nx.draw_networkx_labels(G, pos, font_size=12, font_family='sans-serif', labels=labels)

Result 1 contains the best match. The three query nodes (in red) best match the three target nodes (in blue). The dashed lines show which pairs of query and target nodes matched each other. The blue nodes are a subgraph of the target graph. Note that the result does not describe the whole target graph because in principle it can be very large.

Here we can see that the blue subgraph has exactly the same shape as the red query graph. However, the labels are not exactly the same (e.g. guardians != Guardians of the Galaxy) so the result scores less than the maximum score of 1. However, we can see that our query graph is really similar to Groot and Star-Lord from Guardians of the Galaxy. Since this is the best match we know that

[16]:

for i, graph in enumerate(results['graphs'][:1]):
    plt.title('Result {0}, score: {1:.2f}'.format(1, 1. - graph['cost']))
    draw(graph)
    plt.xlim(-1.2,1.2)
    plt.ylim(-1.2,1.2)
    plt.axis('off')
    plt.show()

/Users/dstaff/anaconda3/envs/fornax/lib/python3.6/site-packages/networkx/drawing/nx_pylab.py:611: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)
  if cb.is_numlike(alpha):

Results 2-4 have a lower score because star matches to a different node not adjacent to Guardians of the Galaxy. Further inspection would show that star has matched aliases of Star-Lord which are near Guardians of the Galaxy but not ajacent to it.

[17]:

for i, graph in enumerate(results['graphs'][1:4]):
    plt.title('Result {0}, score: {1:.2f}'.format(i+2, 1. - graph['cost']))
    draw(graph)
    plt.xlim(-1.2,1.2)
    plt.ylim(-1.2,1.2)
    plt.axis('off')
    plt.show()

/Users/dstaff/anaconda3/envs/fornax/lib/python3.6/site-packages/networkx/drawing/nx_pylab.py:611: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)
  if cb.is_numlike(alpha):

The final match pairs guardians and star to two nodes that do not have similar edges to the target graph. groot is not found in the target graph. The result gets a much lower score than the preceding results and we can be sure that any additional results will also be poor because the result are ordered.

[18]:

for i, graph in enumerate(results['graphs'][4:]):
    plt.title('Result {0}, score: {1:.2f}'.format(i+5, 1. - graph['cost']))
    draw(graph)
    plt.xlim(-1.2,1.2)
    plt.ylim(-1.2,1.2)
    plt.axis('off')
    plt.show()

/Users/dstaff/anaconda3/envs/fornax/lib/python3.6/site-packages/networkx/drawing/nx_pylab.py:611: MatplotlibDeprecationWarning: isinstance(..., numbers.Number)
  if cb.is_numlike(alpha):