eGitty

Discover The Most Popular Algorithms

An Introduction to Building a Word Graph From a Document or Dataset

In this article, we will show how to create a word graph from a document or dataset. Such a word graph can be used as the input to Graph Neural Networks (GNNs).

How to build a word graph?

One method is described in the paper: Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks

It consists of four steps:

Step 1: split each sentence into words

Step 2: remove stop words

If you are using NLTK, you can read this tutorial to learn how to do it:

Remove English Stop Words with NLTK Step by Step – NLTK Tutorial
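
Steps 1 and 2 can be sketched in plain Python as follows. This is only a minimal illustration: the STOP_WORDS set is a tiny hypothetical list made up for this example, and the helper names tokenize and remove_stop_words are our own. In practice you would probably replace the set with a full stop word list such as the one shipped with NLTK.

```python
# Hypothetical, deliberately tiny stop word list (an assumption for this
# example, not a real linguistic resource).
STOP_WORDS = {"a", "an", "and", "is", "on", "the", "of", "to"}


def tokenize(sentence):
    """Step 1: lowercase a sentence, split it on whitespace and strip
    punctuation from both ends of each token."""
    tokens = []
    for raw in sentence.lower().split():
        token = raw.strip(".,;:!?\"'()")
        if token:
            tokens.append(token)
    return tokens


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Step 2: keep only the tokens that are not stop words."""
    return [t for t in tokens if t not in stop_words]


sentence = "eggity.com is a website on ai model and algorithm"
tokens = tokenize(sentence)
filtered = remove_stop_words(tokens)
print(tokens)
print(filtered)
```

With this toy list, the words is, a, on and and are dropped, and the remaining words keep their original order.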

Step 3: determine a fixed-size sliding window

The window size can be 2 or 3. In the paper, the window size is 3.

For example, take the sentence:

eggity.com is a website on ai model and algorithm

If the center word is website and the sliding window size is 3, the window covers the three words on each side of the center word, so this part of the sentence will be processed:

eggity.com is a website on ai model
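
The window in this example can be reproduced with a few lines of Python. This is a sketch that assumes the window size counts the words on each side of the center word, which is how the example sentence reads. The function name window_around is hypothetical.

```python
def window_around(tokens, center_index, window_size=3):
    """Return the center word together with up to `window_size` words
    on each side of it, clipped at the sentence boundaries."""
    start = max(0, center_index - window_size)
    end = min(len(tokens), center_index + window_size + 1)
    return tokens[start:end]


tokens = "eggity.com is a website on ai model and algorithm".split()
center = tokens.index("website")
window = window_around(tokens, center, window_size=3)
print(" ".join(window))
```

Near the start or end of a sentence the window simply becomes shorter, because there are fewer neighbors on one side.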

Step 4: build an undirected word graph from the co-occurrences between words

We construct the graph for a textual document by representing unique words as vertices and co-occurrences between words as edges, denoted as \(G = (V, E)\) where \(V\) is the set of vertices and \(E\) the edges. The co-occurrences describe the relationship of words that occur within a fixed-size sliding window.

Here is an example:

[Figure: example word graph]
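
Step 4 can also be sketched in code. The snippet below is an illustration under our own assumptions, not the paper's implementation: the function name build_word_graph is hypothetical, two words are linked when they are at most window_size positions apart, and each edge stores how many times the pair co-occurs. Because the graph is undirected, every edge is stored as a sorted pair of words.

```python
from collections import defaultdict


def build_word_graph(tokens, window_size=3):
    """Build an undirected co-occurrence graph G = (V, E).

    V is the list of unique words. E maps a sorted word pair to the
    number of times the two words fall inside the same window.
    """
    vertices = sorted(set(tokens))
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        # Looking only to the right counts each pair of positions once.
        for j in range(i + 1, min(len(tokens), i + window_size + 1)):
            other = tokens[j]
            if other == word:
                continue  # skip self-loops
            edges[tuple(sorted((word, other)))] += 1
    return vertices, dict(edges)


# Tokens after stop word removal, with a window size of 2.
tokens = ["eggity.com", "website", "ai", "model", "algorithm"]
vertices, edges = build_word_graph(tokens, window_size=2)
print(vertices)
for (u, v), count in sorted(edges.items()):
    print(u, "--", v, "count:", count)
```

Here eggity.com and model are three positions apart, so with a window size of 2 there is no edge between them.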

Notice:

We can build two kinds of word graph. If we compute co-occurrences only between the words of a single document, we build a document word graph. If we compute co-occurrences over the whole dataset, we build a dataset word graph.
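
The difference between the two kinds of graph can be shown with a small sketch. The documents below are made-up toy data, and cooccurrence_counts is a hypothetical helper. The sketch also assumes that a sliding window never crosses a document boundary, so the dataset graph is simply the sum of the per-document co-occurrence counts.

```python
from collections import Counter


def cooccurrence_counts(tokens, window_size=3):
    """Count undirected co-occurrences of words that are at most
    `window_size` positions apart in one token list."""
    counts = Counter()
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window_size + 1]:
            if other != word:
                counts[tuple(sorted((word, other)))] += 1
    return counts


# Toy tokenized documents (hypothetical data for illustration).
documents = [
    ["word", "graph", "model"],
    ["graph", "model", "network"],
]

# Document word graphs: one set of edges per document.
document_graphs = [cooccurrence_counts(doc) for doc in documents]

# Dataset word graph: edges accumulated over all documents.
dataset_graph = Counter()
for graph in document_graphs:
    dataset_graph.update(graph)

print(document_graphs[0])
print(dataset_graph)
```

The pair graph and model appears once in each document graph but twice in the dataset graph, which is the kind of corpus-level statistic that only the dataset word graph captures.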
