An Introduction to Building a Word Graph From a Document or Dataset

January 28, 2022
/ eGitty

In this article, we introduce how to create a word graph from a document or a dataset. Such a word graph can be used in some Graph Neural Networks (GNNs).

How to build a word graph?

One method is described in the paper: Every Document Owns Its Structure: Inductive Text Classification via Graph Neural Networks

The method has four steps:

Step 1: split a sentence into words

Step 2: remove stop words

If you are using NLTK, you can read this tutorial to learn how to do it.

Remove English Stop Words with NLTK Step by Step – NLTK Tutorial
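Steps 1 and 2 can be sketched in a few lines of Python. This is only a minimal illustration: the whitespace tokenizer and the tiny `STOP_WORDS` set are assumptions made for this example, not part of the paper or of NLTK. In practice you would likely use a fuller stop-word list, such as the one NLTK provides.

```python
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"is", "a", "on", "and", "the", "of", "to", "in"}


def tokenize(sentence):
    """Step 1: lowercase the sentence and split it on whitespace."""
    return sentence.lower().split()


def remove_stop_words(words, stop_words=STOP_WORDS):
    """Step 2: keep only the words that are not in the stop-word set."""
    return [w for w in words if w not in stop_words]


sentence = "eggity.com is a website on ai model and algorithm"
words = tokenize(sentence)
filtered = remove_stop_words(words)

print(words)
print(filtered)
```

With this small stop-word set, only eggity.com, website, ai, model and algorithm remain after filtering.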

Step 3: determine a fixed-size sliding window

The window size can be, for example, 2 or 3. In the paper, the window size is 3.

For example, take the sentence:

eggity.com is a website on ai model and algorithm

If the center word is website and the sliding window size is 3, the window covers three words on each side of the center word. This part of the sentence

eggity.com is a website on ai model

will be processed.
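The window in this example can be reproduced with a short helper. This is a sketch under the assumption used in the example above, namely that a window size of 3 means three words on each side of the center word. The function name `window_around` is hypothetical.

```python
def window_around(words, center_index, window_size):
    """Return the words within `window_size` positions of the center word."""
    # Clamp the window to the sentence boundaries.
    start = max(0, center_index - window_size)
    end = min(len(words), center_index + window_size + 1)
    return words[start:end]


words = "eggity.com is a website on ai model and algorithm".split()
center = words.index("website")
context = window_around(words, center, window_size=3)

print(" ".join(context))
```

Near the beginning or end of a sentence the window is simply truncated, so it may contain fewer than seven words.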

Step 4: build an undirected word graph from the co-occurrences between words

We construct the graph for a textual document by representing unique words as vertices and co-occurrences between words as edges, denoted as \(G = (V, E)\) where \(V\) is the set of vertices and \(E\) the edges. The co-occurrences describe the relationship of words that occur within a fixed-size sliding window.

Here is an example:
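As a minimal sketch, the graph \(G = (V, E)\) can be built in plain Python as a set of vertices and a dictionary of undirected edges. Two things here are assumptions for illustration and are not stated in the text above: each edge stores how many times the two words co-occur, and two words co-occur when they are at most `window_size` positions apart. The function name `build_word_graph` is hypothetical.

```python
from collections import defaultdict


def build_word_graph(words, window_size=3):
    """Build an undirected co-occurrence graph from a list of words.

    Returns (vertices, edges), where `edges` maps a sorted word pair
    to the number of times the pair co-occurs within the window.
    """
    vertices = set(words)
    edges = defaultdict(int)
    for i, word in enumerate(words):
        # Looking only forward visits each pair of positions exactly once.
        for j in range(i + 1, min(len(words), i + window_size + 1)):
            other = words[j]
            if other == word:
                continue  # skip self-loops
            # Sorting the pair makes the edge undirected.
            edge = tuple(sorted((word, other)))
            edges[edge] += 1
    return vertices, dict(edges)


words = ["website", "ai", "model", "algorithm"]
vertices, edges = build_word_graph(words, window_size=2)

for (u, v), count in sorted(edges.items()):
    print(u, "--", v, "count:", count)
```

With a window size of 2, website and algorithm are three positions apart, so no edge connects them.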

Note:

We can build two kinds of word graph. If we compute co-occurrences only between the words in a single document, we get a document word graph. If we compute co-occurrences over the whole dataset, we get a dataset word graph.
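The difference between the two kinds of graph can be sketched as follows. The example assumes that the documents are already tokenized and filtered, that edges store co-occurrence counts, and that windows do not cross document boundaries, so the dataset graph is the sum of the per-document counts. These are illustrative choices, and the two toy documents are made up.

```python
from collections import Counter


def cooccurrence_edges(words, window_size=3):
    """Count undirected co-occurrences within one list of words."""
    edges = Counter()
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window_size + 1]:
            if other != word:
                edges[tuple(sorted((word, other)))] += 1
    return edges


# Two toy documents, already split into words with stop words removed.
documents = [
    ["website", "ai", "model"],
    ["ai", "model", "algorithm"],
]

# Document word graphs: one edge counter per document.
document_graphs = [cooccurrence_edges(doc) for doc in documents]

# Dataset word graph: accumulate the counts over all documents.
dataset_graph = Counter()
for graph in document_graphs:
    dataset_graph.update(graph)

print(document_graphs[0])
print(dataset_graph)
```

Here the pair (ai, model) occurs once in each document graph but twice in the dataset graph, which shows how the dataset graph captures statistics across the whole dataset.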
