Dash Cytoscape

https://fp-growth.herokuapp.com/

A few weeks back I learnt the ins-and-outs of drawing, ahem, generating dynamic trees and graphs using Dash Cytoscape, which is a package built on top of Plotly.js. Dash has attempted to bring their wonderful graphing libraries to the Python world by wrapping Plotly.js into Dash Plotly (a python package). Then they went a step further and wrapped their Python graphing libraries into a fully-features web application 'framework' built on Flask and React.js. Honestly, the ability to generate visualisations quickly from data wrangled in Python without needing to translate to some other scripting language is quite nice. But enough rambling, what did we do?

Aim

Visualise the growth of the FP-Growth algorithm for association analysis using the FP-tree

Background

The FP-Growth algorithm (frequent-path-growth) is an iterative method of finding groups of associated items in tuples (think rows) of data. My goal was to allow small CSV's to be visualised in terms of the FP-tree, which is a recursive data structure typically implemented by means of a Trie. As a bonus, I also explored how to build the Trie in Python (recursively)

Method

This would be built, as aforementioned, using Dash....but if you read the title, you already knew that, so let me be a bit less vague and explain how FP-Growth Works. Firstly, all the items in our CSV would be sorted lexically and ordered in terms of number of occurences or support-count. A tree is constructed starting from the first item from the first tuple of items (first row from our sorted rows from the CSV table), and new nodes are appended to this first node for the second, third, etc.

The next row is added, similarly: if a node exists for its first item, increment the support of the node, else add a new node, and iterate for every row and every item in the given row. So adding the second row might look like:

And third:

Additionally, a special table called a header table keeps pointers to every given 'type' of node. For example, the A'th entry in the header table would contain a pointer to the head of a linked-list pointing to all items of type A. This is used to build conditional frequent paths for each given item. In other words, new trees are built with every leaf being the same type of item. It is from these trees that frequent paths are gotten. If the support-count of items preceding the leaves are above a pre-specified threshold, these items then contribute to the frequent-itemsets of the entire set of items in our original CSV.

Takeaways

A linked list of linked lists sounds difficult until you try to implement it, then it becomes impossible
I like Python, coming from working in JavaScript and SQL for over a year, while I initially didn't take to the Pythonic conventions, it's beginning to grow on me

Unforeseen Troubles

I still can't figure out how to cleanly step through building the FP-tree, so instead of placing buttons to show the construction of the FP-tree, I added an interactive feature. If you open the link at the top of the page and upload your csv, click on any of the nodes, and the app generates the conditional base path for whichever item you clicked, dynamically