We retrieved a corpus in which each document is a gene and the bag of words is the set of GO terms annotating that gene. We then used a topic model to generate a set of topics T over the retrieved GO terms.
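As an illustration of this corpus layout, the following minimal sketch turns gene annotations into a document-term count matrix, the usual input to a topic model. The gene names, GO identifiers, and the `build_corpus` helper are hypothetical placeholders, not the data or code used in this work, and the choice of topic model is left open.

```python
from collections import Counter

# Hypothetical annotations: gene identifier -> GO terms annotating it.
gene_to_go = {
    "geneA": ["GO:0000001", "GO:0000003"],
    "geneB": ["GO:0000001", "GO:0000002"],
    "geneC": ["GO:0000003"],
}

def build_corpus(annotations):
    """Return (genes, vocabulary, matrix) with one row per gene document."""
    genes = sorted(annotations)
    vocabulary = sorted({t for terms in annotations.values() for t in terms})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for gene in genes:
        # Count how often each GO term annotates this gene (usually 0 or 1).
        counts = Counter(annotations[gene])
        row = [0] * len(vocabulary)
        for term, n in counts.items():
            row[column[term]] = n
        matrix.append(row)
    return genes, vocabulary, matrix

genes, vocabulary, matrix = build_corpus(gene_to_go)
```

Each row of `matrix` is one gene and each column one GO term, so the matrix can be passed to any topic-model implementation that accepts term counts.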

Each topic T_i is associated with a set of probabilities, recorded as triples PROB(T_i, GO term, probability), where the probability is that of the GO term appearing in topic T_i.
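The triple representation can be sketched as follows, assuming the topic model exposes a topic-by-term probability matrix with one row per topic. The function name `topic_term_triples`, the numbering of topics from 1, and the example values are illustrative assumptions.

```python
def topic_term_triples(topic_term_probs, vocabulary):
    """Flatten a topic-by-term matrix into (topic, GO term, probability) triples."""
    triples = []
    for i, row in enumerate(topic_term_probs, start=1):
        for term, prob in zip(vocabulary, row):
            triples.append((f"T_{i}", term, prob))
    return triples

# Hypothetical vocabulary and probabilities for two topics.
vocabulary = ["GO:0000001", "GO:0000002", "GO:0000003"]
topic_term_probs = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
triples = topic_term_triples(topic_term_probs, vocabulary)
```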

For the visualization, we used the distribution of the probability scores for the GO terms in each topic to select topics that have at least 3 keywords that meet a threshold (set separately for each dataset). We further restricted the GO terms reviewed for each topic to those with a score above 0.05.
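One possible reading of this two-step filter is sketched below. It assumes that a keyword is a GO term whose probability is at or above the dataset-specific threshold, and that the 0.05 cutoff is applied only to the topics that survive the first step. The function `select_topics`, its parameter names, and the example values are hypothetical.

```python
def select_topics(topic_terms, keyword_threshold, min_keywords=3, display_cutoff=0.05):
    """Keep topics with enough keywords, then keep only their terms above the cutoff."""
    selected = {}
    for topic, term_probs in topic_terms.items():
        # Step 1: count terms that reach the dataset-specific threshold.
        n_keywords = sum(1 for p in term_probs.values() if p >= keyword_threshold)
        if n_keywords >= min_keywords:
            # Step 2: retain only terms scoring above the display cutoff.
            selected[topic] = {t: p for t, p in term_probs.items() if p > display_cutoff}
    return selected

# Hypothetical topic -> {GO term: probability} mapping.
topic_terms = {
    "T_1": {"GO:0000001": 0.40, "GO:0000002": 0.30, "GO:0000003": 0.20,
            "GO:0000004": 0.04, "GO:0000005": 0.06},
    "T_2": {"GO:0000001": 0.80, "GO:0000002": 0.15, "GO:0000003": 0.03,
            "GO:0000004": 0.02},
}
selected = select_topics(topic_terms, keyword_threshold=0.10)
```

With a threshold of 0.10, topic T_1 has three qualifying keywords and is kept, with its 0.04 term removed by the cutoff, while T_2 has only two and is dropped.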

We used a Sankey diagram to represent the topics, with topic names on the left and GO terms on the right. The thickness of each link encodes the probability of the GO term occurring in the connected topic.
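The data behind such a diagram can be sketched as a list of node labels plus parallel lists of link sources, targets, and values, which is the index-based layout accepted by common plotting libraries such as Plotly's Sankey trace. The plotting tool actually used is not stated here, and the `sankey_links` helper and example values are hypothetical.

```python
def sankey_links(selected):
    """Build node labels and links (topic -> GO term) weighted by probability."""
    topics = sorted(selected)
    terms = sorted({t for probs in selected.values() for t in probs})
    labels = topics + terms  # topics first (left side), GO terms after (right side)
    index = {label: i for i, label in enumerate(labels)}
    sources, targets, values = [], [], []
    for topic in topics:
        for term, prob in sorted(selected[topic].items()):
            sources.append(index[topic])
            targets.append(index[term])
            values.append(prob)  # link thickness encodes the probability
    return labels, sources, targets, values

# Hypothetical filtered topics: topic -> {GO term: probability}.
selected = {
    "T_1": {"GO:0000001": 0.4, "GO:0000002": 0.3},
    "T_2": {"GO:0000002": 0.5},
}
labels, sources, targets, values = sankey_links(selected)
```

A GO term shared by several topics appears once on the right and receives one link per topic, which makes overlap between topics visible.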