Natural Language Processing (NLP) involves the task of inferring useful information from text. Most elementary demonstrations of NLP involve sentiment analysis, i.e. categorizing an opinion as positive or negative. In this demonstration I tackle a more sophisticated task, namely **aspect-entity** extraction.
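I'll show how we can build a machine learning pipeline for analysing customer reviews of restaurants that identifies the opinion target words in each sentence, the entity-aspect category each target belongs to, and the sentiment polarity expressed towards it.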
In the process, we'll see how to implement several important NLP techniques, including:
- Text cleaning, contraction expansion, and lemmatization.
- Converting text sequence elements into pretrained embedding vectors.
- Creating Parts-of-Speech (POS) tags.
- Sequence-to-sequence IOB2 tagging for entity identification.
- Creating an end-to-end inference pipeline.
Our final network will be able to take the following example input phrase...
"Service was terribly slow and the restaurant was noisy, but the waiter was friendly and the calamari was very delicious."
... and produce the following output:
| | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| OTE Target Words: | ['service'] | ['restaurant'] | ['waiter'] | ['calamari'] |
| Entity-Aspect: | SERVICE:GENERAL | AMBIENCE:GENERAL | SERVICE:GENERAL | FOOD:QUALITY |
| Polarity: | negative | negative | positive | positive |
1) Viewing and Parsing the Dataset
4) OTE Extraction via Sequence-to-Sequence Tagging
5) Aspect-Entity and Sentiment Extraction
# general
import gc # garbage collection for memory management
import re # regular expressions toolkit
import os
import json
import shutil
import math
import pickle
import numpy as np
import pandas as pd
import tensorflow as tf
# data preprocessing
import xml.etree.ElementTree as ET # for xml file parsing
from collections import Counter # for counting word occurrences
from pycontractions import Contractions # for contraction expansion
from nltk.stem import WordNetLemmatizer # for word Lemmatization
from nltk.tag.stanford import StanfordPOSTagger # Part-of-Speech tags
from tensorflow.python.keras.preprocessing import sequence # for padding
# keras model creation
import keras as K
import keras.backend as KB
print('Using Keras Version', K.__version__)
from keras.layers import Conv1D, Dropout, Embedding
from keras.layers import Dense, TimeDistributed
# data visualization
import matplotlib.pyplot as plt
from IPython.display import Image
%matplotlib inline
from prettytable import PrettyTable # formatted text-based tables
from tensorboard import summary
from tensorboard.plugins.custom_scalar import layout_pb2
from keras.utils.vis_utils import plot_model # for viewing model graphs
# TensorFlow Logging
tf.logging.set_verbosity(tf.logging.INFO)
Whereas sentiment analysis in the past has focused mostly on classifying the overall polarity of a statement or document, Aspect-Based Sentiment Analysis (ABSA) is a newer and more challenging framework that seeks to understand specific sentiments towards different aspects of an entity. For example, the review sentence "Service was devine, oysters were as great as they come, and the price can't be beat!!!" contains three different (OTE, entity:aspect, polarity) tuples:
To develop our ABSA pipeline, we'll use a dataset that was part of a competitive challenge at the International Workshop on Semantic Evaluation (SemEval-2016), which can be found at http://alt.qcri.org/semeval2016/task5/. The dataset consists of 350 review texts (2000 sentences) that have been annotated with roughly 2500 (entity, aspect, polarity) tuples. The relatively small size of the training set is part of the challenge.
Our focus in this demonstration will be on developing the NLP pipeline. We won't worry so much about the specific deep learning models used nor their performance. Once we've built the pipeline, it becomes fairly straightforward to train on much larger text corpuses, import pretrained models for transfer learning, and experiment with the neural network architecture.
Strategy
Our pipeline will consist of two trained models in serial:
Step 1: Identify all opinion target expressions (OTEs, the specific words describing the entities about which opinions are expressed) within a given sentence. For this we'll perform sequence-to-sequence IOB2 tagging using a convolutional neural network (CNN) with a time-distributed dense layer at the output. The inputs will be padded tokenized word sequences and a pretrained embedding matrix. The output will be IOB2 tags that identify the sequence locations of the OTEs (i.e. the entity terms).
Step 2: For each OTE identified within each review sentence, we will further identify (a) the entity:aspect pair (e.g. FOOD:QUALITY) and (b) the sentiment polarity (positive, negative, or neutral). For this, we will use a recurrent neural network (RNN) based on long short-term memory (LSTM) stacked on top of a CNN. The inputs will be tokenized word sequences, a pretrained embedding matrix, part-of-speech (POS) tags, and IOB2 tags identifying the OTE positions. The output will be the aspect and polarity associated with that particular OTE/entity.
Let's take a moment to get familiar with the XML file structure. One quick way to view an XML tree is to drag and drop the xml file onto a new tab in your internet browser (e.g. Chrome):
Exploring the data, we find that some sentences have multiple associated entity-aspect opinions (as expected), while others have no associated opinion at all.
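For orientation, here is a minimal sketch of that layout. The XML snippet is hand-written for illustration (the ids, review text, and offsets are made up, not copied from the dataset), but it uses the same element and attribute names that the parsing code in this section reads: `sentence`, `text`, `Opinions`, and the `target`, `category`, `polarity`, `from`, and `to` attributes. The helper name `summarize_sentences` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the same shape as the SemEval-2016 Task 5 files.
# The ids, review text, and offsets are illustrative, not real dataset rows.
SAMPLE_XML = """
<Reviews>
  <Review rid="demo">
    <sentences>
      <sentence id="demo:0">
        <text>The pizza was great but the waiter was rude.</text>
        <Opinions>
          <Opinion target="pizza" category="FOOD#QUALITY" polarity="positive" from="4" to="9"/>
          <Opinion target="waiter" category="SERVICE#GENERAL" polarity="negative" from="28" to="34"/>
        </Opinions>
      </sentence>
      <sentence id="demo:1">
        <text>We went on a Tuesday.</text>
      </sentence>
    </sentences>
  </Review>
</Reviews>
"""


def summarize_sentences(xml_string):
    """Return (sentence_id, text, [(target, category, polarity), ...]) tuples."""
    root = ET.fromstring(xml_string)
    rows = []
    for sentence in root.iter('sentence'):
        opinions = sentence.find('Opinions')
        # sentences without an <Opinions> child simply get an empty list
        triples = [] if opinions is None else [
            (op.get('target'), op.get('category'), op.get('polarity'))
            for op in opinions
        ]
        rows.append((sentence.get('id'), sentence.find('text').text, triples))
    return rows


sample_rows = summarize_sentences(SAMPLE_XML)
for row in sample_rows:
    print(row)
```

Note that the `from` and `to` attributes are character offsets into the sentence text, so slicing the first sample sentence with `[4:9]` recovers the target word "pizza".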
We'll organize our data into a pandas dataframe having the following columns:
For sentences with no associated opinion, we'll still create a row entry, and simply insert NaN in the relevant columns.
Let's create a function that extracts and repackages the data from our xml files (xml = "extensible markup language"). For this, we'll use Python's ElementTree module:
def data_xml_to_df(xml_file):
    """
    Converts our raw xml data file into a pandas dataframe; xml tree structure is
    specific to the data files found at http://alt.qcri.org/semeval2016/task5/.
    :param xml_file: xml filetype, the raw data we wish to process
    :return df: pandas dataframe, contains the extracted data sorted into the
                following columns: opinion_id, text, target, category,
                polarity, ote_start, ote_stop
    """
    # parse the xml file
    tree = ET.parse(xml_file)
    root = tree.getroot()
    # initialize variables to populate
    opinion_id = []
    text_content = []
    target = []
    category = []
    polarity = []
    ote_start = []
    ote_stop = []
    # jump directly to the 'sentence' branches and iterate
    for sentence in root.iter('sentence'):
        sentence_id = sentence.get('id')
        sentence_text = sentence.find('text').text
        opinions = sentence.find('Opinions')
        # the number of opinions associated with a sentence varies
        if opinions is None:  # no associated opinions...
            opinion_id.append(sentence_id + ':0')
            text_content.append(sentence_text)
            target.append(np.nan)
            category.append(np.nan)
            polarity.append(np.nan)
            ote_start.append(np.nan)
            ote_stop.append(np.nan)
        else:  # one or more opinions...
            for (i, opinion) in enumerate(opinions):
                opinion_id.append(sentence_id + ':%s' % i)
                text_content.append(sentence_text)
                target.append(opinion.get('target'))
                category.append(opinion.get('category'))
                polarity.append(opinion.get('polarity'))
                ote_start.append(opinion.get('from'))
                ote_stop.append(opinion.get('to'))
    # Now convert to dataframe:
    # (column names are specified upfront to define column order)
    df = pd.DataFrame(columns=['opinion_id',
                               'text',
                               'target',
                               'category',
                               'polarity',
                               'ote_start',
                               'ote_stop'])
    df['opinion_id'] = opinion_id
    df['text'] = text_content
    df['target'] = target
    df['category'] = category
    df['polarity'] = polarity
    df['ote_start'] = ote_start
    df['ote_stop'] = ote_stop
    return df
Now let's use our function to extract the data and preview one of the dataframes:
# specify file paths
data_xml_train = './SemEval2016-Task5/ABSA16_Restaurants_Train_SB1_v2.xml'
data_xml_test = './SemEval2016-Task5/restaurants_trial_english_sl.xml'
# convert to pandas dataframes
df_train = data_xml_to_df(data_xml_train)
df_test = data_xml_to_df(data_xml_test)
# preview
df_train.head(10)
Next, we'll create a new feature, ote, which is merely the text column sliced between the character offsets ote_start and ote_stop:
# generate new OTE feature
df_train['ote'] = df_train.apply(
lambda row: np.nan if math.isnan(float(row['ote_start']))
else (0 if row['ote_start']==row['ote_stop']
else row['text'][int(row['ote_start']):int(row['ote_stop'])]), axis=1)
# preview result
column_list = ['text', 'category', 'target', 'ote_start', 'ote_stop', 'ote']
df_train[column_list].head(5)
However, we discover that this new feature is identical to target, so we won't end up using this variable. Good to know.
Some sentences have no entity#aspect pair, and thus show NaN in the category column, for example:
print('Text:', df_train.iloc[34].text)
print('E#A:', df_train.iloc[34].category)
Let's replace all such NaN values with the string 'NULL':
df_train = df_train.replace(np.nan, 'NULL', regex=True)
df_test = df_test.replace(np.nan, 'NULL', regex=True)
# check:
df_train.iloc[34]
For convenience, let's also separate category into its entity and aspect components:
def df_split_category(dataframe):
    """
    Splits "category" into "entity" and "aspect" columns.
    Specific to our dataset.
    """
    # Note: missing categories were replaced with the string 'NULL'
    # earlier, so we test for 'NULL' here rather than for NaN.
    for i, feature in enumerate(['entity', 'aspect']):
        dataframe[feature] = dataframe.apply(
            lambda row: 'NULL' if row['category'] == 'NULL'
            else row['category'].split('#')[i], axis=1)
    return dataframe
# separate category into entity and aspect
df_train = df_split_category(df_train)
df_test = df_split_category(df_test)
# preview
feature_list = ['category', 'entity', 'aspect']
df_train[feature_list].head(10)
Let's get a sense of the spectrum of entity-aspect pairs and targets we can have. Note that for ABSA problems, these are pre-defined up front.
# combine test and train datasets
df_combined = pd.concat([df_train, df_test])
# extract instances
entity_list = []
aspect_list = []
category_list = []
target_list = []
for index, row in df_combined.iterrows():
    entity_list.append(row['entity'])
    aspect_list.append(row['aspect'])
    category_list.append(row['category'])
    target_list.append(row['target'])
# collect by unique entries
entity_counts = Counter(entity_list)
aspect_counts = Counter(aspect_list)
category_counts = Counter(category_list)
target_counts = Counter(target_list)
We find six possible entities (plus 'NULL'):
entity_counts
And five possible aspects (plus 'NULL'):
aspect_counts
We can also see how these aspects are distributed among entities:
# generate cross-tabulation
table = pd.crosstab(df_combined['entity'], df_combined['aspect'])
print('\n', table)
# plot the aspect fractions for each entity
plt.rcParams['font.size'] = 12
table_fractions = table.div(table.sum(1).astype(float), axis=0)
g = table_fractions.plot(kind="bar", stacked=True,
colormap='viridis', figsize=(8,5))
plt.xticks(rotation=0)
plt.xlabel('Entity', weight='bold')
plt.ylabel('Fraction of Total', weight='bold')
plt.tight_layout()
leg = plt.legend(title='Aspect', loc=9, bbox_to_anchor=(1.20, 1.0))
We see that ambience, location, and service have only one possible aspect; restaurant has two; food and drinks have three. The number of unique combinations is small enough that we might consider working with just category, rather than entity and aspect separately; on the other hand, the number of data samples available for each unique category (e.g. DRINKS#PRICES) is smaller than if we consider entity and aspect separately.
category_counts
For target, we find that most entries are unique and highly specific. Ideally, we want our model to identify OTE terms based on generic word patterns (or sentence structures), rather than by merely memorizing specific words.
target_counts
for i in [2, 3, 7, 23, 42]:
    print(df_train['text'][i], '\n')
To aid model performance, we want to obtain the simplest representation of the text without losing useful information. The cleaning and pre-processing steps will include:
- Expanding all English contractions.
- Removing markup and punctuation (emoticons such as ":)" are detected first and kept as separate tokens).
- Converting all words to lowercase.
- Removing numbers (e.g. '3').
- Replacing slang/abbreviated words with their standard form (e.g. 'u' --> 'you').
- Lemmatizing words into their base form (e.g. computing, computed --> compute).
In this case I am choosing not to remove stopwords, i.e. very common function words (is, and, or). The reason is that these provide useful information about sentence structure which will be valuable when identifying entity#aspect targets.
Contraction Expansion:
We begin with contractions. Manually creating a dictionary to expand contractions can lead to ambiguities (e.g. I'd -> I had, or I would?), therefore we'll use the pycontractions library for a more accurate context-based contraction expansion with the aid of Google's Word2Vec model.
# load up the Word2Vec model for context-based contraction expansion
cont = Contractions('/home/rpm/ML-models/GoogleNews-vectors-negative300.bin')
cont.load_models()
# test it out:
print('Before:', df_train['text'][42], '\n')
print('After:', list(cont.expand_texts([df_train['text'][42]], precise=True)))
# apply to all reviews
df_train['text'] = list(cont.expand_texts(df_train['text'], precise=True))
df_test['text'] = list(cont.expand_texts(df_test['text'], precise=True))
# release references and 'garbage collect' to free memory
del cont
gc.collect()
Cleaning and Lemmatization:
Next we'll apply the remaining pre-processing steps and tokenize our review sentences into words by whitespace. Note that we also need to save a copy of our text strings prior to lemmatization in order to generate Part-of-Speech (POS) tags later.
def apply_substitutions(text_tokenized):
    # we'll only apply a single slang substitution here,
    # but many more are possible...
    # (compare with ==, not `is`: identity checks on strings are unreliable)
    text_tokenized = ['you' if word == 'u' else word for word in text_tokenized]
    return text_tokenized

def preprocessor(text):
    # remove HTML-style markup tags
    text = re.sub(r'<[^>]*>', '', text)
    # find emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # convert to lowercase, replace runs of non-word characters with a
    # space, then re-append the emoticons (minus any '-') as separate tokens
    text = (re.sub(r'[\W]+', ' ', text.lower())
            + ' ' + ' '.join(emoticons).replace('-', ''))
    # strip numbers
    text = re.sub(r'\d+', '', text)
    # tokenize by whitespace
    text_tokenized = text.split()
    # apply word substitutions (for slang correction)
    text_tokenized = apply_substitutions(text_tokenized)
    return text_tokenized

# for lemmatization after pre-processing
lemmatizer = WordNetLemmatizer()

def lemmatize(text_tokenized):
    return [lemmatizer.lemmatize(word, pos='v') for word in text_tokenized]
# generate dataframe entries containing the cleaned text
df_train['text_cln'] = df_train['text'].apply(preprocessor)
df_test['text_cln'] = df_test['text'].apply(preprocessor)
# generate dataframe entries containing cleaned AND lemmatized text
df_train['text_cln_lm'] = df_train['text_cln'].apply(lemmatize)
df_test['text_cln_lm'] = df_test['text_cln'].apply(lemmatize)
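The single 'u' --> 'you' substitution used in apply_substitutions can be generalized with a lookup dictionary. The following is a minimal sketch under that assumption: the `SLANG_MAP` entries and the function name `apply_substitutions_dict` are hypothetical examples, not a list taken from the dataset, and a real mapping would need to be checked for ambiguous abbreviations.

```python
# Hypothetical slang dictionary: these entries are illustrative examples,
# not a list taken from the original notebook or from the dataset.
SLANG_MAP = {
    'u': 'you',
    'r': 'are',
    'thx': 'thanks',
    'pls': 'please',
}


def apply_substitutions_dict(text_tokenized, slang_map=SLANG_MAP):
    """Replace each token found in slang_map; leave other tokens untouched."""
    return [slang_map.get(word, word) for word in text_tokenized]


substituted = apply_substitutions_dict(['thx', 'u', 'guys', 'r', 'great'])
print(substituted)
```

Because the lookup uses `dict.get` with the word itself as the default, tokens that are not in the dictionary pass through unchanged.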
Let's view an example of a sentence before and after lemmatization:
print('Before Lemmatization:\n', df_train['text_cln'][2])
print('\nAfter Lemmatization:\n', df_train['text_cln_lm'][2])
Removing Empty Sentences:
Note that there seem to be some 'empty' sentences, e.g. for which the original input was only a number or some other stripped character:
df_train.iloc[842].head()
We'll remove these from our dataset:
def remove_empty_sentences(data):
    return data.loc[data['text_cln_lm'].apply(len) != 0]

df_train = remove_empty_sentences(df_train)
df_test = remove_empty_sentences(df_test)
Viewing the Pre-Processed Text:
Finally, let's examine the sentences we previewed in the beginning to see their final, post-processed form:
for i in [2, 3, 7, 23, 42]:
    print(df_train['text_cln_lm'][i], '\n')
We have now stripped the text to its bare essentials. Data quality is still limited by imperfect grammatical style and sometimes poor articulation ("human error", as HAL-9000 would put it), but we won't attempt to correct this.
Next we'll map each word in our review text corpus to a unique integer. We'll also add an integer to account for out-of-vocabulary words.
# count word occurrences
counts = Counter()
df_combined = pd.concat([df_train, df_test])
for sentence in df_combined['text_cln_lm']:
    counts.update(sentence)
# create mapping (unique word to integer, descending frequency)
word_counts = sorted(counts, key=counts.get, reverse=True)
word_to_index = {word: index for index, word in enumerate(word_counts, 2)}
# we'll have to add a pad token to our word index (so that our
# embedding matrix treats this as part of the vocabulary)
word_to_index['<PAD>'] = 0
# here's a token reserved for out-of-vocabulary (OOV) words
word_to_index['<OOV>'] = 1
# also create an inverse mapping (ints to words)
index_to_word = {v: k for k, v in word_to_index.items()}
index_to_word[0] = '<PAD>' # for sequence padding
index_to_word[1] = '<OOV>' # for out-of-vocab words
# create mapping functions
def map_word_to_int(tokenized_text):
    return [word_to_index[word] for word in tokenized_text]

def map_int_to_word(text_ints):
    return [index_to_word[index] for index in text_ints]
# apply mapping
df_train['text_ints'] = df_train['text_cln_lm'].apply(map_word_to_int)
df_test['text_ints'] = df_test['text_cln_lm'].apply(map_word_to_int)
# preview and check
print(map_int_to_word(df_train['text_ints'].iloc[1]), '\n')
print(df_train['text_ints'].iloc[1])
print('\nIndex for "we" is:', word_to_index['we'])
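The map_word_to_int function above looks every word up directly, so it would raise a KeyError for a word that is not in the vocabulary, which can happen when new reviews arrive at inference time. Below is a minimal sketch of a fallback to the reserved `<OOV>` index. The helper names `build_vocab` and `map_word_to_int_safe` and the toy sentences are hypothetical, introduced only for illustration.

```python
from collections import Counter


def build_vocab(tokenized_sentences):
    """Map words to integers by descending frequency; 0 = <PAD>, 1 = <OOV>."""
    counts = Counter()
    for sentence in tokenized_sentences:
        counts.update(sentence)
    ordered = sorted(counts, key=counts.get, reverse=True)
    vocab = {word: index for index, word in enumerate(ordered, 2)}
    vocab['<PAD>'] = 0
    vocab['<OOV>'] = 1
    return vocab


def map_word_to_int_safe(tokenized_text, vocab):
    """Like map_word_to_int, but unseen words map to <OOV> instead of raising."""
    oov = vocab['<OOV>']
    return [vocab.get(word, oov) for word in tokenized_text]


toy_vocab = build_vocab([['the', 'food', 'be', 'great'],
                         ['the', 'service', 'be', 'slow']])
# 'sushi' never appeared in the toy corpus, so it receives the <OOV> index
toy_ints = map_word_to_int_safe(['the', 'sushi', 'be', 'great'], toy_vocab)
print(toy_ints)
```

The same pattern would apply to the POS tag mapping later on, should the tagger ever emit a tag that was not seen when the dictionary was built.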
Let's examine the top 10 and bottom 10 words by occurrence. In a simple bag-of-words model, the least frequently occurring words would often be considered the most informative, whereas words like "the", "be", and "and" would convey little useful information and be discarded. For our more complex task, however, we are keeping these so-called "stopwords" due to their importance in elucidating sentence structure and word relationships.
print('*** Top 10 Most Frequent Words ***')
for word in word_counts[:10]:
    line = ['Word:', word, 'Occurrences:', counts[word]]
    print('{:<0} {:<16} {:<0} {:<0}'.format(*line))
print('\n*** Bottom 10 Words ***')
for word in word_counts[-10:]:
    line = ['Word:', word, 'Occurrences:', counts[word]]
    print('{:<0} {:<16} {:<0} {:<0}'.format(*line))
Rather than one-hot-encoding our words to generate vectors whose dimensionality matches the number of unique words in our vocabulary, we'll instead use "embedding". Embedding lets us represent our vocabulary as real-valued vectors of much smaller dimensionality which are easier for our machine learning models to handle. Furthermore, embedding is able to capture information about semantic relations (e.g. the most famous example being 'king - man + woman = queen'). Given the small size of our dataset, we'll use pre-trained vectors from the popular GloVe embedding model.
The GloVe model we use is based on the Wikipedia 2014 + Gigaword 5 dataset (6B tokens, 400K vocab, uncased; available as 50d, 100d, 200d, and 300d vectors). You can obtain this model by issuing the terminal command $ wget http://nlp.stanford.edu/data/glove.6B.zip in the desired directory and unzipping its contents.
An embedding matrix is basically a lookup table, where we give the index of the unique word, and get a corresponding d-dimensional vector representing it.
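To make the lookup-table idea concrete, here is a toy sketch in plain Python. The vocabulary, the 3-dimensional vectors, and the function name `embed` are all made up for illustration; real GloVe rows have 50 to 300 values and are loaded from file rather than typed in.

```python
# Toy 3-dimensional embedding matrix: one row per vocabulary index.
# The numbers are made up for illustration only.
toy_word_to_index = {'<PAD>': 0, '<OOV>': 1, 'food': 2, 'great': 3}
toy_embedding_matrix = [
    [0.0, 0.0, 0.0],    # row 0: <PAD>
    [0.1, -0.2, 0.3],   # row 1: <OOV>
    [0.7, 0.1, -0.4],   # row 2: food
    [0.2, 0.9, 0.5],    # row 3: great
]


def embed(tokens, word_to_index, embedding_matrix):
    """Look up one vector per token; unknown tokens use the <OOV> row."""
    oov = word_to_index['<OOV>']
    return [embedding_matrix[word_to_index.get(tok, oov)] for tok in tokens]


vectors = embed(['great', 'food', 'sushi'], toy_word_to_index, toy_embedding_matrix)
for vec in vectors:
    print(vec)
```

An embedding layer in a neural network performs essentially this row lookup, only on whole batches of integer sequences at once.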
def get_embedding_matrix(model_path, word_to_index):
    # load the embeddings into a dictionary
    embeddings = {}
    with open(model_path, 'r', encoding='utf-8') as file:
        for line in file:
            values = line.strip().split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings[word] = vector
    # initialize the embedding matrix as a random distribution
    embedding_size = len(next(iter(embeddings.values())))  # len of 1st vector
    vocab_size = len(word_to_index)  # can be changed to a hyperparameter if desired
    np.random.seed(777)  # for reproducibility
    embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, embedding_size))
    # replace row with corresponding embedding vector
    # if a given word is within the embedding vocab
    num_loaded = 0
    for word, index in word_to_index.items():
        vec = embeddings.get(word)
        if vec is not None and index < vocab_size:
            embedding_matrix[index] = vec
            num_loaded += 1
    # finish and print how many words were loaded
    # (%d rather than %f, since both values are integer counts)
    print('Successfully loaded pretrained embeddings for '
          '%d/%d words.' % (num_loaded, vocab_size))
    print('Embedding vectors for non-loaded words were '
          'randomly initialized between -1 and 1.')
    embedding_matrix = embedding_matrix.astype(np.float32)
    return embedding_matrix

# we'll use an embedding size of 100 dimensions
# (GloVe also has 50d, 200d, and 300d models)
embedding_matrix = get_embedding_matrix(
    model_path='/home/rpm/ML-models/glove.6B.100d.txt',
    word_to_index=word_to_index)
print('Embedding Matrix Shape:', embedding_matrix.shape)
# define vocab_size and embedding_size variables for later use
vocab_size = embedding_matrix.shape[0]
embedding_size = embedding_matrix.shape[1]
print('Vocabulary Size:', vocab_size)
print('Embedding Size:', embedding_size)
Words in our corpus vocabulary that are not found in the GloVe embedding model (e.g. proper nouns or misspelled words) have been assigned a randomly-initialized embedding vector.
We want to ensure that the feature vectors of word-index sequences fed to our model all have the same length. First let's find the largest sequence length occurring in our dataset:
# first let's find the largest sequence size
df_combined = pd.concat([df_train, df_test])
max_length = 0
for index, row in df_combined.iterrows():
    seq_len = len(row['text_cln_lm'])
    if seq_len > max_length:
        max_length = seq_len
print('Max sequence length is:', max_length)
We will set the sequence length to 62 and right-pad with zeros (padding='post', so the padding comes after the words). We'll also save the original unpadded sequence length for future reference (this is sometimes used in applying sequence masking layers). Conveniently, the tensorflow.python.keras.preprocessing.sequence module has a function called pad_sequences that we can use to easily perform our padding.
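To show what the 'post' and 'pre' options mean, here is a plain-Python sketch of both behaviours. It is not the Keras implementation, and the function names `pad_post` and `pad_pre` are hypothetical. Keras defaults to 'pre' for both padding and truncating, which is why 'post' has to be requested explicitly.

```python
def pad_post(seq, maxlen, value=0):
    """Truncate from the end, then pad at the end (both 'post')."""
    seq = list(seq)[:maxlen]
    return seq + [value] * (maxlen - len(seq))


def pad_pre(seq, maxlen, value=0):
    """Keep the last maxlen items, then pad at the front (both 'pre')."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq


short_post = pad_post([5, 8, 3], 6)
short_pre = pad_pre([5, 8, 3], 6)
long_post = pad_post([1, 2, 3, 4, 5, 6, 7, 8], 6)

print(short_post)  # padding after the words
print(short_pre)   # padding before the words
print(long_post)   # sequence cut to the first 6 items
```

With 'post' padding the word at position i of a sentence stays at position i of the padded array, which keeps the tag arrays built later aligned with the original word positions.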
sequence_size = 62
# generate data arrays containing our padded sequences
X_train = sequence.pad_sequences(df_train['text_ints'],
maxlen=sequence_size,
truncating='post',
padding='post',
value=word_to_index['<PAD>'])
X_test = sequence.pad_sequences(df_test['text_ints'],
maxlen=sequence_size,
truncating='post',
padding='post',
value=word_to_index['<PAD>'])
# record the original unpadded sequence length
X_len_train = np.array([len(row) for row in df_train['text_ints']])
X_len_test = np.array([len(row) for row in df_test['text_ints']])
# checks
print('X_train dimensions:', X_train.shape)
print('X_test dimensions:', X_test.shape)
print('\nX_len_train dimensions:', X_len_train.shape)
print('X_len_test dimensions:', X_len_test.shape)
# preview for the first instance
print('\nFirst padded sequence in training set:')
print(X_train[0])
print('\nCorresponding words:')
print(map_int_to_word(X_train[0]))
We will create sequence tags to identify the location of the OTE words (a.k.a. 'target') within each sentence, using the IOB2 scheme. Each sequence position is labelled with 'O', 'B', or 'I' depending on whether it is [O]utside, at the [B]eginning of, or [I]nside a group of OTE words, respectively.
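As a small illustration of the scheme, the sketch below reads IOB2 tags back into groups of target words, which is the direction needed at inference time. The function name `iob2_to_spans` and the example sentence are hypothetical, and the handling of a stray 'I' tag (treated like 'O') is an assumption of this sketch.

```python
def iob2_to_spans(tokens, tags):
    """Collect the token groups marked by IOB2 tags ('B' starts, 'I' continues)."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == 'B':
            # a new target begins; close the previous one if it is still open
            if current:
                spans.append(current)
            current = [token]
        elif tag == 'I' and current:
            current.append(token)
        else:
            # 'O' (or a stray 'I' with no open span) closes any open span
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans


tokens = ['the', 'spicy', 'tuna', 'roll', 'be', 'great',
          'but', 'the', 'waiter', 'be', 'rude']
tags = ['O', 'B', 'I', 'I', 'O', 'O',
        'O', 'O', 'B', 'O', 'O']
spans = iob2_to_spans(tokens, tags)
print(spans)
```

Here the three-word target is marked B, I, I and the single-word target is marked with a lone B, so two adjacent targets can always be told apart.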
First we'll apply a preprocessor to our target strings to ensure they have the same linguistic form as our text sequences:
# apply preprocessor and lemmatization to the target strings
df_train['target_seq'] = df_train['target'].apply(preprocessor)
df_train['target_seq'] = df_train['target_seq'].apply(lemmatize)
df_test['target_seq'] = df_test['target'].apply(preprocessor)
df_test['target_seq'] = df_test['target_seq'].apply(lemmatize)
Next we'll generate y_train_IOB2 and y_test_IOB2 sequence tags for our X_train and X_test word sequences. Our approach here, in essence, is similar to "sliding" the target sequence across the text sequence and looking for the index where they match:
# ensure we are still working with an equal number of entries
# (in case this ever changes with future script modifications)
if df_train['target_seq'].shape[0] != X_train.shape[0]:
    raise ValueError('Row dimension mismatch '
                     'between text and target sequences')
if df_test['target_seq'].shape[0] != X_test.shape[0]:
    raise ValueError('Row dimension mismatch '
                     'between text and target sequences')
# a helper function for the get_IOB2_tags() function
def get_closest_seq_index(text_seq, ote_start, buffer=0):
    '''
    Makes an educated guess at the sequence index for which
    we expect to find the first aspect target word. Does this
    by cumulatively adding the characters in each sequence
    step and stopping at the index where this sum matches or
    exceeds ote_start.
    Note that "text_seq" here needs to be a sequence of strings,
    i.e. already converted from token integers back to words
    using our map_int_to_word function defined earlier.
    '''
    sum_ = 0
    for index in range(len(text_seq)):
        sum_ += len(text_seq[index])
        if sum_ >= (int(ote_start) - buffer):
            return index
    # if we didn't trigger a return:
    raise ValueError('Sum of characters never reaches'
                     ' ote_start: increase buffer value.')
def get_IOB2_tags(text_int_seq, target_seq, ote_start, ote_stop):
    """
    Returns a 1 dimensional numpy array of IOB2 tags for the input
    text sequence. Requires use of our map_int_to_word function
    defined earlier during the text pre-processing steps.
    :param text_int_seq: numpy array of tokenized words as a sequence
                         of integers
    :param target_seq: sequence of aspect target words (string format,
                       not converted to integers) taken directly from
                       df_DATA['target_seq'] (where DATA = train or
                       test)
    :param ote_start: original starting character index of the aspect
                      target sequence before word pre-processing; taken
                      directly from df_DATA['ote_start'], where it is
                      stored as a string value (rather than an integer)
    :param ote_stop: original ending character index of the aspect
                     target sequence before word pre-processing; taken
                     directly from df_DATA['ote_stop'], where it is
                     stored as a string value (rather than an integer)
    """
    # map our integer tokens back to words
    text_seq = map_int_to_word(text_int_seq)
    # initialize tags to 'O' for 'Outside Target'
    tag_seq = ['O' for word in text_seq]
    tag_seq = np.array(tag_seq)
    # if there is no target within the sentence, return all 'O'
    # (offsets are stored as strings, so compare ote_stop as a string)
    if (target_seq == ['null'] or str(ote_stop) == '0'
            or ote_start == 'NULL'):
        return tag_seq
    # sift through text_seq with a window size matching
    # the length of target_seq, and record matches:
    matches = []
    target_seq_len = len(target_seq)
    # 1. matching a single word
    if target_seq_len == 1:
        target_seq = target_seq[0]  # unpack the list so we compare strings
        for index_ in range(len(text_seq)):
            if text_seq[index_] == target_seq:
                matches.append(index_)
        if len(matches) == 1:
            tag_seq[matches[0]] = 'B'
        elif len(matches) > 1:
            # use ote_start to guess where the index should be
            approximated_index = get_closest_seq_index(text_seq,
                                                       ote_start,
                                                       buffer=5)
            # pick the entry whose value is closest to this index
            closest_index = min(matches,
                                key=lambda x: abs(x - approximated_index))
            tag_seq[closest_index] = 'B'
        else:
            # this should never happen, but here is a safeguard
            raise ValueError('matches is empty for sequence:',
                             text_seq, 'for target sequence:',
                             target_seq)
    # 2. matching a sequence
    else:
        for index_ in range(len(text_seq) - target_seq_len + 1):
            if text_seq[index_:(index_ + target_seq_len)] == target_seq:
                matches.append(index_)
        if len(matches) == 1:
            matched_index = matches[0]
            tag_seq[matched_index:(matched_index + target_seq_len)] = 'I'
            tag_seq[matched_index] = 'B'
        elif len(matches) > 1:
            # use ote_start to guess where the index should be
            approximated_index = get_closest_seq_index(text_seq,
                                                       ote_start,
                                                       buffer=5)
            # pick the entry whose value is closest to this index
            closest_index = min(matches,
                                key=lambda x: abs(x - approximated_index))
            tag_seq[closest_index:(closest_index + target_seq_len)] = 'I'
            tag_seq[closest_index] = 'B'
        else:
            # this should never happen, but here is a safeguard
            raise ValueError('matches is empty for sequence:',
                             text_seq, 'for target sequence:',
                             target_seq)
    return tag_seq
def IOB2_tag_generator(df_data, X_data):
    y_tags = np.empty(X_data.shape, dtype=str)  # initialize array
    for index in range(df_data.shape[0]):
        # note: for some reason if we use df_data.iterrows(), on index
        # 844 we get an error because the target_seq returned by row
        # repeats the one for index 843 -- source of issue remains unknown
        y_tags[index, :] = get_IOB2_tags(
            text_int_seq=X_data[index, :],
            target_seq=df_data['target_seq'].iloc[index],
            ote_start=df_data['ote_start'].iloc[index],
            ote_stop=df_data['ote_stop'].iloc[index])
    return y_tags
# now generate the tags
y_train_IOB2 = IOB2_tag_generator(df_train, X_train)
y_test_IOB2 = IOB2_tag_generator(df_test, X_test)
# check sizes
names = ['X_train', 'y_train_IOB2', 'X_test', 'y_test_IOB2']
for arr, name_ in zip([X_train, y_train_IOB2, X_test, y_test_IOB2], names):
    line = ['Name:', name_, 'Shape:', str(arr.shape)]
    print('{:<0} {:<16} {:<0} {:<0}'.format(*line))
# map tags to integer values (required for training)
# (we will one-hot encode them later within our estimators)
IOB2_tag_to_int = {'O': 0,
'I': 1,
'B': 2,
}
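The sliding-window idea used by get_IOB2_tags can be seen in isolation in the sketch below. It is a simplified stand-in rather than the function above: the names `find_target_positions` and `tag_first_match` are hypothetical, and when the target occurs more than once this sketch simply takes the first occurrence, whereas get_IOB2_tags uses ote_start to pick the closest one.

```python
def find_target_positions(text_seq, target_seq):
    """Return every start index where target_seq occurs inside text_seq."""
    n = len(target_seq)
    if n == 0:
        return []
    return [i for i in range(len(text_seq) - n + 1)
            if text_seq[i:i + n] == target_seq]


def tag_first_match(text_seq, target_seq):
    """Tag the first occurrence of target_seq with 'B' then 'I'; rest stay 'O'."""
    tags = ['O'] * len(text_seq)
    positions = find_target_positions(text_seq, target_seq)
    if positions:
        start = positions[0]
        tags[start] = 'B'
        for i in range(start + 1, start + len(target_seq)):
            tags[i] = 'I'
    return tags


toy_text = ['the', 'tuna', 'roll', 'be', 'fresh', '<PAD>', '<PAD>']
toy_positions = find_target_positions(toy_text, ['tuna', 'roll'])
toy_tags = tag_first_match(toy_text, ['tuna', 'roll'])
print(toy_positions)
print(toy_tags)
```

Padding positions are left as 'O', matching how the tag arrays above are initialized.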
The code below lets us manually double-check a selection of entries to verify that the tags are being generated correctly:
# examine the text and tag sequences side-by-side
index_temp = 12
termination_index = 16
print('Target Sequence:', df_train['target_seq'].iloc[index_temp])
for position in range(len(y_train_IOB2[index_temp, :])):
    print(map_int_to_word([X_train[index_temp, position]]),
          y_train_IOB2[index_temp, position])
    if position >= termination_index:
        break
To provide our models with more information about sentence structure and context, we will generate Part-of-Speech (POS) tags using the Stanford POS tagger, which labels each word with a tag based on the Penn Treebank tag set. Some examples:
- 'NNP' = proper noun, singular
- 'VBD' = verb, past tense
- 'JJS' = adjective, superlative
The Stanford POS tag set consists of 45 tags, plus one that we'll add for padding. We'll use these tags later as an additional input feature during model training.
Let's load the tagger and view a quick example:
# define paths to POS model files
POS_MODEL_PATH = '/home/rpm/ML-models/english-bidirectional-distsim.tagger'
POS_JAR_PATH = '/home/rpm/ML-models/stanford-postagger.jar'
# initialize tagger object
st_POS = StanfordPOSTagger(POS_MODEL_PATH,
path_to_jar=POS_JAR_PATH)
# test it out
POS_tags_temp = st_POS.tag(
'What is the airspeed of an unladen swallow ?'.split())
# print word-tag tuple pairs:
print('Word--Tag Tuples:')
print(POS_tags_temp)
# to obtain only the tags:
print('\nTags Only:')
tags_only_temp = [list(t) for t in zip(*POS_tags_temp)][1]
print(tags_only_temp)
Now let's generate the POS tags for our padded training and test datasets. To get the correct tags, we'll have to apply the POS tagger to the pre-lemmatized text. Note that the POS tag generation below can take up to an hour to complete on some machines.
def get_POS_tags(df_row):
    POS_word_tag_pairs = st_POS.tag(df_row)
    tags_only = [list(t) for t in zip(*POS_word_tag_pairs)][1]
    return tags_only

print('Generating POS tags, may take a while...')
df_train['POS_tag_lists'] = df_train['text_cln'].apply(get_POS_tags)
df_test['POS_tag_lists'] = df_test['text_cln'].apply(get_POS_tags)
Let's view a sample of the POS tags from the first 5 training entries:
df_train['POS_tag_lists'].head(5)
Now let's create dictionaries for mapping our POS tags to integer numbers (and vice versa). We'll only create entries for tags we actually use (a subset of the 46; e.g. we don't use the integer or punctuation tags).
# count tag occurrences
counts_POS_tags = Counter()
df_combined = pd.concat([df_train, df_test])
for sentence in df_combined['POS_tag_lists']:
    counts_POS_tags.update(sentence)
# create dictionary mapping
POS_tag_counts = sorted(counts_POS_tags, key=counts_POS_tags.get, reverse=True)
POS_tag_to_index = {tag: index for index, tag in enumerate(POS_tag_counts, 1)}
POS_tag_to_index['<PAD>'] = 0
# also create an inverse mapping (ints to tags)
index_to_POS_tag = {v: k for k, v in POS_tag_to_index.items()}
index_to_POS_tag[0] = '<PAD>' # for sequence padding
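To make the resulting mapping concrete, here is a small self-contained run of the same idea on toy data. The tag lists below are made up for illustration and simply stand in for `df_combined['POS_tag_lists']`; the names `toy_tag_lists`, `tag_to_index`, and `index_to_tag` are hypothetical.

```python
from collections import Counter

# toy tag lists standing in for df_combined['POS_tag_lists']
toy_tag_lists = [['DT', 'NN', 'VBD', 'JJ'],
                 ['NN', 'VBD', 'RB', 'JJ'],
                 ['NN']]

tag_counts = Counter()
for tags in toy_tag_lists:
    tag_counts.update(tags)

# the most frequent tag gets index 1; index 0 is reserved for padding
ranked_tags = [tag for tag, _count in tag_counts.most_common()]
tag_to_index = {tag: i for i, tag in enumerate(ranked_tags, 1)}
tag_to_index['<PAD>'] = 0
index_to_tag = {i: tag for tag, i in tag_to_index.items()}

print(tag_to_index['NN'])  # 1
print(index_to_tag[0])     # <PAD>
```

Reserving index 0 for `<PAD>` keeps the padding value consistent with the zero-padded word sequences.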
Next we'll generate padded POS tag arrays:
# create mapping functions
def map_POS_to_int(POS_tag_list):
    return [POS_tag_to_index[tag] for tag in POS_tag_list]

def map_int_to_POS(POS_int_list):
    return [index_to_POS_tag[index] for index in POS_int_list]
# apply mapping
df_train['POS_int_lists'] = df_train['POS_tag_lists'].apply(map_POS_to_int)
df_test['POS_int_lists'] = df_test['POS_tag_lists'].apply(map_POS_to_int)
# generate padded sequence arrays
sequence_size = 62
X_POS_train = sequence.pad_sequences(
df_train['POS_int_lists'],
maxlen=sequence_size,
truncating='post',
padding='post',
value=POS_tag_to_index['<PAD>']).astype(np.int32)
X_POS_test = sequence.pad_sequences(
df_test['POS_int_lists'],
maxlen=sequence_size,
truncating='post',
padding='post',
value=POS_tag_to_index['<PAD>']).astype(np.int32)
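If the effect of `truncating='post'` and `padding='post'` is not obvious, the following pure-Python sketch mimics that behaviour for a single sequence. `pad_post` is a hypothetical helper written only for illustration; it is not part of Keras.

```python
def pad_post(int_list, maxlen, pad_value=0):
    """Truncate from the end, then pad at the end, to exactly maxlen items."""
    clipped = list(int_list[:maxlen])
    return clipped + [pad_value] * (maxlen - len(clipped))

print(pad_post([3, 1, 4], maxlen=5))           # [3, 1, 4, 0, 0]
print(pad_post([3, 1, 4, 1, 5, 9], maxlen=5))  # [3, 1, 4, 1, 5]
```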
Finally, let's run some quick checks on our POS tag output to ensure everything works as expected:
print('Original Cleaned Text:')
print(df_train['text_cln'][0])
print('\nPadded POS Tags:')
print(map_int_to_POS(X_POS_train[0]))
print('\nPadded POS Ints:')
print(X_POS_train[0])
print('\nOne-Hot Encoded Vector for first Tag:')
def get_POS_onehot(X_POS_ints):
    return K.utils.to_categorical(X_POS_ints, len(index_to_POS_tag))

print(get_POS_onehot(X_POS_train[0])[0])
In our dataset, sentences containing multiple aspect-entity pairs (and hence multiple OTEs) have been logged as separate entries. When training our model for IOB2 tagging, we should combine such sentences into single test or training instances, otherwise our network will be incorrectly penalized when it identifies multiple OTEs within the same sentence.
First, let's determine how many instances of repeated text sequences there are in our dataset:
def count_sentence_repeats(X_data_):
    last_sequence = np.zeros(X_data_[0,:].shape)
    num_repeats = 0
    for sequence in X_data_:
        if np.array_equal(sequence, last_sequence):
            num_repeats += 1
        last_sequence = sequence
    return num_repeats
num_X_train_repeats = count_sentence_repeats(X_train)
num_X_test_repeats = count_sentence_repeats(X_test)
print("Number of Repeated Training Sentences:",
num_X_train_repeats)
print("Number of Repeated Test Sentences:",
num_X_test_repeats)
print("\nExpected Size of Train Set After Repeats Removed:",
X_train.shape[0] - num_X_train_repeats)
print("Expected Size of Test Set After Repeats Removed:",
X_test.shape[0] - num_X_test_repeats)
Now let's create a new dataset where we've combined IOB2 tags for repeated sentences.
# function that maps IOB2 sequences to int sequences
def get_IOB2_ints(tag_sequence):
return np.vectorize(IOB2_tag_to_int.get)(tag_sequence)
# dictionary to convert IOB2 ints back to string tags
int_to_IOB2_tag = {v: k for k, v in IOB2_tag_to_int.items()}
# sentence/tag merge function
def merge_repeated_sentences(X_data, X_POS_data,
X_len_data, y_data_IOB2):
"""
For repeated sentences, we'll first merge all of the
associated IOB2 tags together (to capture ALL OTEs
within that sentence) and remove repeats. Hence,
each sentence in our dataset will be unique.
"""
X_data_uniq = []
X_POS_data_uniq = []
X_len_data_uniq = []
y_data_IOB2_uniq_ints = []
# loop through all sentence sequences
last_sequence = np.zeros(X_data[0,:].shape)
for index, sequence in enumerate(X_data):
if np.array_equal(sequence, last_sequence) is False:
# the first time we are seeing this sequence...
X_data_uniq.append(sequence)
X_POS_data_uniq.append(X_POS_data[index])
X_len_data_uniq.append(X_len_data[index])
y_data_IOB2_uniq_ints.append(
get_IOB2_ints(y_data_IOB2[index]))
else:
# we saw this sequence last time
# so add this one's IOB2 tag to the
# tag of the last entry, AFTER checking
# that we aren't adding two identical
# OTE-containing tag sequences together
# (results in out-of-range integer of 4)
# (some data entries with repeat sentences
# also repeat the OTE if there are two
# associated aspect:entities).
seq_sum = y_data_IOB2_uniq_ints[-1] + get_IOB2_ints(
y_data_IOB2[index])
if not any(i == 4 for i in seq_sum):
y_data_IOB2_uniq_ints[-1] += get_IOB2_ints(
y_data_IOB2[index])
last_sequence = sequence
# debugging check to avoid overlapping IOB2 tags
# where we have the same OTE in the same sentence
# for two different x entries
# # see if we have any integer entries out of range [0, 2]
# for i, seq in enumerate(y_data_IOB2_uniq_ints):
# for j, elem in enumerate(seq):
# if elem not in [0, 1, 2]:
# print('\nGot value', elem)
# print('at index', i, '-', j)
# convert IOB2 ints back to string tags
int_to_IOB2_tag = {v: k for k, v in IOB2_tag_to_int.items()}
y_data_IOB2_uniq = np.vectorize(
int_to_IOB2_tag.get)(y_data_IOB2_uniq_ints)
# convert all to numpy arrays
X_data_uniq = np.array(X_data_uniq)
X_POS_data_uniq = np.array(X_POS_data_uniq)
X_len_data_uniq = np.array(X_len_data_uniq)
y_data_IOB2_uniq = np.array(y_data_IOB2_uniq)
return (X_data_uniq, X_POS_data_uniq,
X_len_data_uniq, y_data_IOB2_uniq)
# apply merge function to our data
(X_train_uniq,
X_POS_train_uniq,
X_len_train_uniq,
y_train_IOB2_uniq) = merge_repeated_sentences(
X_train, X_POS_train, X_len_train, y_train_IOB2)
(X_test_uniq,
X_POS_test_uniq,
X_len_test_uniq,
y_test_IOB2_uniq) = merge_repeated_sentences(
X_test, X_POS_test, X_len_test, y_test_IOB2)
# print data shapes
print('X_train_uniq Shape', X_train_uniq.shape)
print('X_POS_train_uniq Shape', X_POS_train_uniq.shape)
print('X_len_train_uniq Shape', X_len_train_uniq.shape)
print('y_train_IOB2_uniq Shape', y_train_IOB2_uniq.shape)
print('\nX_test_uniq Shape', X_test_uniq.shape)
print('X_POS_test_uniq Shape', X_POS_test_uniq.shape)
print('X_len_test_uniq Shape', X_len_test_uniq.shape)
print('y_test_IOB2_uniq Shape', y_test_IOB2_uniq.shape)
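To illustrate what the merge is meant to accomplish, here is a toy example on a single sentence. Note that this sketch is an alternative formulation, not the function above: instead of summing integer tags, it works on the string tags directly and keeps any non-'O' tag. The sentence, the tags, and the helper name `merge_iob2_tags` are all made up for illustration.

```python
def merge_iob2_tags(tags_a, tags_b):
    """Merge two IOB2 tag lists for the same sentence.

    Keeps any non-'O' tag from tags_a, otherwise takes the tag from tags_b.
    """
    assert len(tags_a) == len(tags_b), 'tag lists must have equal length'
    return [a if a != 'O' else b for a, b in zip(tags_a, tags_b)]

# "the service was slow but the spring rolls were great"
entry_1 = ['O', 'B', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']  # OTE: service
entry_2 = ['O', 'O', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O']  # OTE: spring rolls

print(merge_iob2_tags(entry_1, entry_2))
# ['O', 'B', 'O', 'O', 'O', 'O', 'B', 'I', 'O', 'O']
```

Merging an entry with itself returns it unchanged, so repeated OTEs (same sentence, same target, different aspect-entity) need no special handling in this formulation.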
In what follows, I use the term "category" to refer to the "entity" and "attribute" to refer to the "aspect", as I find these terms easier to understand. I may still use entity and aspect interchangeably with category and attribute, respectively.
First we'll define dictionaries for each category (CAT), attribute (ATTR), and polarity (POL), which will be used later when translating to and from one-hot-encoded labels.
# dictionaries
CAT_to_int = {'AMBIENCE': 0,
'DRINKS': 1,
'FOOD': 2,
'LOCATION': 3,
'RESTAURANT': 4,
'SERVICE': 5,
'NULL': 6}
ATTR_to_int = {'GENERAL':0,
'PRICES': 1,
'QUALITY': 2,
'STYLE_OPTIONS': 3,
'MISCELLANEOUS': 4,
'NULL': 5}
POL_to_int = {'positive': 0,
'neutral': 1,
'negative': 2,
'NULL': 3}
# inverse dictionaries
int_to_CAT = {v: k for k, v in CAT_to_int.items()}
int_to_ATTR = {v: k for k, v in ATTR_to_int.items()}
int_to_POL = {v: k for k, v in POL_to_int.items()}
Next we write a function that extracts the CAT, ATTR, and POL labels from our dataframes, replacing null or NaN entries with 'NULL':
def get_CAT_ATTR_POL_labels(df_data):
    # extract from dataframe
    y_CAT = df_data.category.str.split('#', expand=True)[0].values
    y_ATTR = df_data.category.str.split('#', expand=True)[1].values
    y_POL = df_data.polarity.values
    # replace NaN with 'NULL'
    for i in range(len(y_CAT)):
        if type(y_CAT[i]) is not str:
            y_CAT[i] = 'NULL'
        if type(y_ATTR[i]) is not str:
            y_ATTR[i] = 'NULL'
        if type(y_POL[i]) is not str or y_POL[i] == 'NULL':
            y_POL[i] = 'NULL'
    return (y_CAT, y_ATTR, y_POL)
# applying this to our dataframes
(y_CAT_train,
y_ATTR_train,
y_POL_train) = get_CAT_ATTR_POL_labels(df_train)
(y_CAT_test,
y_ATTR_test,
y_POL_test) = get_CAT_ATTR_POL_labels(df_test)
# checking the output
print('Displaying data samples...')
print('\ny_CAT_train:', y_CAT_train[5:10])
print('\ny_ATTR_train:', y_ATTR_train[5:10])
print('\ny_POL_train:', y_POL_train[5:10])
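For a single label, the split-and-fill logic boils down to the following. This is a minimal sketch that assumes the `ENTITY#ATTRIBUTE` string format implied by the `split('#')` call above; `split_category` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def split_category(category):
    """Split 'ENTITY#ATTRIBUTE' into a pair, using 'NULL' for missing values."""
    if not isinstance(category, str):
        return ('NULL', 'NULL')  # e.g. NaN or None
    entity, _sep, attribute = category.partition('#')
    return (entity or 'NULL', attribute or 'NULL')

print(split_category('FOOD#QUALITY'))  # ('FOOD', 'QUALITY')
print(split_category(float('nan')))    # ('NULL', 'NULL')
```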
We'll pause here to save our data to file, so that we don't need to run the computationally expensive data preprocessing again in future notebook instances. (Note that you may still need to re-run certain function definitions from earlier cells.)
SAVING FILES
def save_data(FILEPATH):
'''
Saves relevant arrays, dataframes, and dictionaries for
future retrieval, via the "pickle" library.
'''
# create filepath if necessary
if not os.path.exists(os.path.dirname(FILEPATH)):
os.makedirs(os.path.dirname(FILEPATH))
# we'll save our data as pickled dictionaries
# -- variables
variables_to_file = {
'X_train': X_train,
'X_test': X_test,
'X_POS_train': X_POS_train,
'X_POS_test': X_POS_test,
'X_len_train': X_len_train,
'X_len_test': X_len_test,
'X_train_uniq': X_train_uniq,
'X_test_uniq': X_test_uniq,
'X_POS_train_uniq': X_POS_train_uniq,
'X_POS_test_uniq': X_POS_test_uniq,
'X_len_train_uniq': X_len_train_uniq,
'X_len_test_uniq': X_len_test_uniq,
'embedding_matrix': embedding_matrix,
'y_train_IOB2': y_train_IOB2,
'y_test_IOB2': y_test_IOB2,
'y_train_IOB2_uniq': y_train_IOB2_uniq,
'y_test_IOB2_uniq': y_test_IOB2_uniq,
'y_CAT_train': y_CAT_train,
'y_ATTR_train': y_ATTR_train,
'y_POL_train': y_POL_train,
'y_CAT_test': y_CAT_test,
'y_ATTR_test': y_ATTR_test,
'y_POL_test': y_POL_test}
outfile = open(os.path.join(FILEPATH, 'variables.pkl'),'wb')
pickle.dump(variables_to_file, outfile)
outfile.close()
# -- dictionaries
dictionaries_to_file = {
'word_to_index': word_to_index,
'index_to_word': index_to_word,
'POS_tag_to_index': POS_tag_to_index,
'index_to_POS_tag': index_to_POS_tag,
'IOB2_tag_to_int': IOB2_tag_to_int,
'int_to_IOB2_tag': int_to_IOB2_tag,
'CAT_to_int': CAT_to_int,
'ATTR_to_int': ATTR_to_int,
'POL_to_int': POL_to_int,
'int_to_CAT': int_to_CAT,
'int_to_ATTR': int_to_ATTR,
'int_to_POL': int_to_POL}
outfile = open(os.path.join(FILEPATH, 'dictionaries.pkl'),'wb')
pickle.dump(dictionaries_to_file, outfile)
outfile.close()
# -- dataframes
df_train.to_pickle(os.path.join(FILEPATH, 'df_train.pkl'))
df_test.to_pickle(os.path.join(FILEPATH, 'df_test.pkl'))
print('*** Successfully saved data to specified directory. ***')
return None
save_data(FILEPATH=os.path.join(os.getcwd(), 'data/'))
LOADING FILES
Function for loading the pre-processed data:
def load_data(FILEPATH):
assert os.path.exists(os.path.dirname(FILEPATH)), (
'Specified directory does not exist! (%s)' % FILEPATH)
# unpickle data
# -- variables
infile = open(os.path.join(FILEPATH, 'variables.pkl'),'rb')
var_dict = pickle.load(infile)
infile.close()
# -- dictionaries
infile = open(os.path.join(FILEPATH, 'dictionaries.pkl'),'rb')
dict_dict = pickle.load(infile)
infile.close()
# -- dataframes
df_train = pd.read_pickle(os.path.join(FILEPATH, 'df_train.pkl'))
df_test = pd.read_pickle(os.path.join(FILEPATH, 'df_test.pkl'))
df_dict = {'df_train': df_train, 'df_test': df_test}
print('*** Successfully retrieved data from specified directory. ***')
return (var_dict, dict_dict, df_dict)
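A small stylistic variation on the save/load functions above is to open files with a `with` block, which closes the file even if pickling raises an exception. The sketch below is self-contained and writes to a temporary directory; `save_pickle` and `load_pickle` are hypothetical helper names, and the dictionary content is a cut-down stand-in for our real data.

```python
import os
import pickle
import tempfile

def save_pickle(obj, path):
    # the with-block closes the file even if pickling raises
    with open(path, 'wb') as outfile:
        pickle.dump(obj, outfile)

def load_pickle(path):
    with open(path, 'rb') as infile:
        return pickle.load(infile)

# round-trip check in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'dictionaries.pkl')
    save_pickle({'POL_to_int': {'positive': 0, 'neutral': 1}}, demo_path)
    restored = load_pickle(demo_path)

print(restored['POL_to_int']['neutral'])  # 1
```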
Now load data and perform various checks:
# load data
(var_data, dict_data, df_data) = load_data(
FILEPATH=os.path.join(os.getcwd(), 'data/'))
# assign numpy variables
X_train = var_data['X_train']
X_test = var_data['X_test']
X_POS_train = var_data['X_POS_train']
X_POS_test = var_data['X_POS_test']
X_len_train = var_data['X_len_train']
X_len_test = var_data['X_len_test']
X_train_uniq = var_data['X_train_uniq']
X_test_uniq = var_data['X_test_uniq']
X_POS_train_uniq = var_data['X_POS_train_uniq']
X_POS_test_uniq = var_data['X_POS_test_uniq']
X_len_train_uniq = var_data['X_len_train_uniq']
X_len_test_uniq = var_data['X_len_test_uniq']
embedding_matrix = var_data['embedding_matrix']
y_train_IOB2 = var_data['y_train_IOB2']
y_test_IOB2 = var_data['y_test_IOB2']
y_train_IOB2_uniq = var_data['y_train_IOB2_uniq']
y_test_IOB2_uniq = var_data['y_test_IOB2_uniq']
y_CAT_train = var_data['y_CAT_train']
y_ATTR_train = var_data['y_ATTR_train']
y_POL_train = var_data['y_POL_train']
y_CAT_test = var_data['y_CAT_test']
y_ATTR_test = var_data['y_ATTR_test']
y_POL_test = var_data['y_POL_test']
# assign dataframes
df_train = df_data['df_train']
df_test = df_data['df_test']
# assign dictionaries
word_to_index = dict_data['word_to_index']
index_to_word = dict_data['index_to_word']
POS_tag_to_index = dict_data['POS_tag_to_index']
index_to_POS_tag = dict_data['index_to_POS_tag']
IOB2_tag_to_int = dict_data['IOB2_tag_to_int']
int_to_IOB2_tag = dict_data['int_to_IOB2_tag']
CAT_to_int = dict_data['CAT_to_int']
ATTR_to_int = dict_data['ATTR_to_int']
POL_to_int = dict_data['POL_to_int']
int_to_CAT = dict_data['int_to_CAT']
int_to_ATTR = dict_data['int_to_ATTR']
int_to_POL = dict_data['int_to_POL']
# remove unnecessary variables
del [var_data, dict_data, df_data]
# print shapes as a check
print('\nX_train Shape:', X_train.shape)
print('X_POS_train Shape:', X_POS_train.shape)
print('X_len_train Shape:', X_len_train.shape)
print('y_train_IOB2 Shape:', y_train_IOB2.shape)
print('y_CAT_train Shape:', y_CAT_train.shape)
print('y_ATTR_train Shape:', y_ATTR_train.shape)
print('y_POL_train Shape:', y_POL_train.shape)
print('\nX_test Shape:', X_test.shape)
print('X_POS_test Shape:', X_POS_test.shape)
print('X_len_test Shape:', X_len_test.shape)
print('y_test_IOB2 Shape:', y_test_IOB2.shape)
print('y_CAT_test Shape:', y_CAT_test.shape)
print('y_ATTR_test Shape:', y_ATTR_test.shape)
print('y_POL_test Shape:', y_POL_test.shape)
print('\nX_train_uniq Shape:', X_train_uniq.shape)
print('X_POS_train_uniq Shape:', X_POS_train_uniq.shape)
print('y_train_IOB2_uniq Shape:', y_train_IOB2_uniq.shape)
print('X_len_train_uniq Shape:', X_len_train_uniq.shape)
print('\nX_test_uniq Shape:', X_test_uniq.shape)
print('X_POS_test_uniq Shape:', X_POS_test_uniq.shape)
print('y_test_IOB2_uniq Shape:', y_test_IOB2_uniq.shape)
print('X_len_test_uniq Shape:', X_len_test_uniq.shape)
print('\nembedding_matrix Shape:', embedding_matrix.shape)
This is the first model in our pipeline.
Goal: Identify all opinion target expressions (OTEs, the target words of any opinions expressed) within a given sentence.
Our model will use a stack of 3 convolutional layers, having a kernel size of 3 words, with 50 feature maps per layer. We will avoid pooling to preserve sequence length. Dropout will be applied after each convolution for regularization. At the output, we use a dense layer that gets applied to each sequence element. I've based this on an architecture described in this paper.
If you are new to building neural networks, you may want to check out my tutorial on how to quickly implement deep learning architectures in Keras for rapid prototyping.
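One way to get a feel for the architecture described above is to compute how many words each output position can "see". For stacked convolutions with stride 1 and no dilation, the receptive field grows by `kernel_size - 1` per layer. A quick sketch (`receptive_field` is a hypothetical helper name):

```python
def receptive_field(num_layers, kernel_size):
    """Receptive field (in tokens) of stacked stride-1, undilated 1D convolutions."""
    return 1 + num_layers * (kernel_size - 1)

print(receptive_field(num_layers=3, kernel_size=3))  # 7
```

So with 3 layers and a kernel size of 3 words, each predicted tag depends on a 7-word context centred on the tagged word.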
def pretrained_embedding_initializer(
shape=embedding_matrix.shape,
dtype=tf.float32,
partition_info=None):
assert dtype is tf.float32
return embedding_matrix
def build_cnn_IOB2_tagger(num_filter_channels=50,
kernel_size=3,
dropout_rate=0.5,
num_conv_layers=3,
include_POS=False):
# define input sequence tensor shape
seq_len = X_train_uniq[0].shape[0]
seq_input = K.layers.Input(shape=(seq_len, ),
dtype='int32',
name='seq_input')
# first layer is embedding layer
seq_embedded = Embedding(
input_dim=embedding_matrix.shape[0],
output_dim=embedding_matrix.shape[1],
embeddings_initializer=pretrained_embedding_initializer
)(seq_input)
# adding in POS data if included
if include_POS is True:
POS_input = K.layers.Input(
shape=(seq_len, len(index_to_POS_tag)),
dtype='float32',
name='POS_input')
x = K.layers.Concatenate()([seq_embedded, POS_input])
else:
x = seq_embedded
# generate convolution layers
for i in range(num_conv_layers):
x = Conv1D(
filters=num_filter_channels,
kernel_size=kernel_size,
strides=1,
padding='same',
data_format='channels_last',
activation='relu',
use_bias=True,
kernel_initializer='glorot_uniform',
bias_initializer='zeros')(x)
x = Dropout(rate=dropout_rate)(x)
# time-distributed dense layer that maps each
# sequence element to the IOB2 tags: I, O, or B
# --> note, this is a fully-connected layer that
# gets applied to every sequence element
tag_probas = TimeDistributed(
Dense(3, activation='softmax'),
input_shape=(seq_len, num_filter_channels))(x)
# assemble
if include_POS is True:
model = K.Model(inputs=[seq_input, POS_input],
outputs=tag_probas)
else:
model = K.Model(inputs=seq_input, outputs=tag_probas)
return model
K.backend.clear_session() # resets the computational graph
model_IOB2_tagger = build_cnn_IOB2_tagger()
# model_IOB2_tagger = build_cnn_IOB2_tagger(include_POS=True)
model_IOB2_tagger.summary()
Below we make some additional useful function definitions. This includes a custom accuracy function, since we want to determine per-sequence accuracy (i.e. that an entire tag sequence is correct) rather than the per-sequence-element accuracy.
# function that maps IOB2 sequences to int sequences
def get_IOB2_ints(tag_sequence):
    return np.vectorize(IOB2_tag_to_int.get)(tag_sequence)

# map int sequences to IOB2 sequences
def get_IOB2_tags_from_ints(int_sequence):
    input_dtype = type(int_sequence)
    assert input_dtype == type(np.zeros(1)), (
        'Expected numpy array input but got %s' % input_dtype)
    return np.vectorize(int_to_IOB2_tag.get)(int_sequence)

# one-hot encoder
def get_IOB2_onehot(y_data):
    y_data_ints = get_IOB2_ints(y_data)
    return K.utils.to_categorical(y_data_ints, 3)

# custom accuracy function
def categorical_accuracy_per_sequence(y_true, y_pred):
    return KB.mean(KB.min(KB.cast(KB.equal(KB.argmax(y_true, axis=-1),
                                           KB.argmax(y_pred, axis=-1)),
                                  'float32'), axis=-1))
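The difference between the two accuracy measures is easiest to see on a toy example. The sketch below uses plain Python lists instead of Keras tensors; the helper names and the tag sequences are made up for illustration.

```python
def per_element_accuracy(true_seqs, pred_seqs):
    """Fraction of individual tags that match."""
    pairs = [(t, p)
             for ts, ps in zip(true_seqs, pred_seqs)
             for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)

def per_sequence_accuracy(true_seqs, pred_seqs):
    """Fraction of sequences in which EVERY tag matches."""
    matches = [list(ts) == list(ps) for ts, ps in zip(true_seqs, pred_seqs)]
    return sum(matches) / len(matches)

y_true = [['O', 'B', 'I', 'O'], ['B', 'O', 'O', 'O']]
y_pred = [['O', 'B', 'O', 'O'], ['B', 'O', 'O', 'O']]  # one wrong tag

print(per_element_accuracy(y_true, y_pred))   # 0.875
print(per_sequence_accuracy(y_true, y_pred))  # 0.5
```

A single wrong tag costs one eighth of the per-element score but invalidates the whole sequence, which is why the per-sequence metric is the stricter (and, for OTE extraction, the more meaningful) of the two.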
Having built our model in Keras, we can easily compile it with our custom accuracy function, and train it using the following code block:
RANDOM_SEED = 777
np.random.seed(RANDOM_SEED) # for reproducibility
tf.set_random_seed(RANDOM_SEED)
model_IOB2_tagger.compile(
loss='categorical_crossentropy',
optimizer='rmsprop', # what the paper uses
metrics=['accuracy',
categorical_accuracy_per_sequence])
history_IOB2_tagger = model_IOB2_tagger.fit(
x=X_train_uniq,
y=get_IOB2_onehot(y_train_IOB2_uniq),
batch_size=128,
epochs=125,
verbose=1,
callbacks=None,
validation_data=(X_test_uniq,
get_IOB2_onehot(y_test_IOB2_uniq)),
shuffle=True,
initial_epoch=0)
Let's write some functions to visualize the training history and evaluate performance:
# function for plotting metrics versus
# training epoch from a history object
def plot_histories(history_):
# summarize history for accuracy
plt.plot(history_.history['categorical_accuracy_per_sequence'])
plt.plot(history_.history['val_categorical_accuracy_per_sequence'])
plt.title('model accuracy (per sequence)')
plt.ylabel('sequence-wise accuracy')
plt.xlabel('epoch')
plt.legend(['training', 'validation'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history_.history['loss'])
plt.plot(history_.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.ylim((0, history_.history['loss'][1]))
plt.legend(['training', 'validation'], loc='upper left')
plt.show()
return None
# plot history and evaluate accuracy
def plot_and_evaluate(model_, history_):
plot_histories(history_)
scores_ = model_.evaluate(
x=X_test_uniq,
y=get_IOB2_onehot(y_test_IOB2_uniq),
batch_size=None,
verbose=1)
print('Per-Sequence Test Accuracy:', scores_[2])
return scores_
scores = plot_and_evaluate(model_IOB2_tagger, history_IOB2_tagger)
After roughly 120 epochs, we achieve close to 100% accuracy on our test set, but only about 90% accuracy on our training set. This is unusual and could be an artefact of our small dataset size. Our network might be memorizing keywords rather than learning about word relationships, and perhaps the distribution of these keywords is unbalanced and favours correct inference in the test set, whereas the more difficult cases are predominantly in the training set. Part of the gap may also come from dropout, which is active while the training metrics are computed but disabled during evaluation.
We could mitigate this by shuffling the training and test datasets together and performing k-fold cross-validation.
However, the focus of this tutorial is on building the pipeline, rather than optimizing our model performance (which can be done later), so we choose to proceed as-is.
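For reference, the k-fold idea mentioned above could be sketched as follows. This is only an index-splitting helper in pure Python (`k_fold_indices` is a hypothetical name, and the seed simply reuses the value from earlier); actually retraining the model on each fold is left out.

```python
import random

def k_fold_indices(num_samples, k, seed=777):
    """Yield (train_indices, val_indices) pairs for k shuffled folds."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # every k-th shuffled index goes into the same fold
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i
                 for idx in fold]
        yield train, val

for train_idx, val_idx in k_fold_indices(num_samples=10, k=5):
    print(len(train_idx), len(val_idx))  # 8 2 on every fold
```

In our case the indices would be drawn over the concatenation of the training and test sets, with repeated sentences merged beforehand so that the same sentence never lands in both splits.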
def are_tags_equal(tags_1, tags_2):
# note: we cannot use np.equal for string comparison due
# to a present python-numpy compatibility bug, so we will
# need to convert the tags to integers first
return np.min(np.equal(get_IOB2_ints(tags_1),
get_IOB2_ints(tags_2)), axis=-1)
def view_IOB2_predictions(model_, X_data, y_data,
num_samples=5, condition=None):
# generate predictions
preds = model_.predict(X_data)
# map predictions to IOB2 tags
pred_tags = get_IOB2_tags_from_ints(np.argmax(preds, axis=-1))
# apply condition
if condition is None:
print('\nDisplaying Randomly-Selected Samples.')
inds = np.arange(len(preds))
elif condition == 'correct':
print('\nDisplaying Samples with condition "Correct".')
inds = np.arange(len(preds))[
are_tags_equal(y_data, pred_tags)]
elif condition == 'incorrect':
print('\nDisplaying Samples with condition "Incorrect".')
inds = np.arange(len(preds))[np.logical_not(
are_tags_equal(y_data, pred_tags))]
else:
raise ValueError(
'condition must be: "None", "correct" or "incorrect"')
if num_samples > len(inds):
num_samples = len(inds)
print('*Only %s sample(s) of condition %s exist.'
% (num_samples, condition))
print('[Format: word - predicted tag (true tag)]')
# randomly draw examples from our predictions
for i, index in enumerate(np.random.choice(inds,
size=num_samples,
replace=False)):
this_X = X_data[index]
this_y_pred = pred_tags[index]
this_y_true = y_data[index]
# sample number and correct/incorrect sequence status
if are_tags_equal(this_y_pred, this_y_true) == True:
outcome_str = 'correct prediction'
else:
outcome_str = 'incorrect prediction'
print('\n-- Sample # %s (%s) --\n' % (i+1, outcome_str))
# display sequence and predicted tags
for j in range(len(this_X)):
print(index_to_word[this_X[j]],
'-', this_y_pred[j],
'(%s)' % this_y_true[j])
if this_X[j] == 0.0:
break
return None
view_IOB2_predictions(model_IOB2_tagger, X_test_uniq, y_test_IOB2_uniq,
num_samples=3, condition='correct')
Now let's feed in our own custom sentence and see what our tagger infers! First, some useful function definitions:
def raw_sentence_to_padded_sequence(sentence, sequence_size=62):
print('\nOriginal Text:')
print(sentence)
print('\nPreprocessed Text:')
sentence_proc = lemmatize(preprocessor(sentence))
print(sentence_proc)
print('\nCorresponding Integer Tokens:')
sentence_ints = map_word_to_int(sentence_proc)
print(sentence_ints)
print('\nFinal Padded Sequence:')
sentence_padded_seq = sequence.pad_sequences(
[sentence_ints],
maxlen=sequence_size,
truncating='post',
padding='post',
value=word_to_index['<PAD>'])
print(sentence_padded_seq[0])
return sentence_padded_seq
def predict_and_display_IOB2_tags(model_, X_data_, shown_index=0):
preds_ = model_.predict(X_data_)
preds_tag_ = get_IOB2_tags_from_ints(np.argmax(preds_, axis=-1))
# display sequence and predicted tags
print('\nPredicted Tags:')
for j in range(len(X_data_[shown_index])):
print(index_to_word[X_data_[shown_index][j]],
'-', preds_tag_[shown_index][j])
if X_data_[shown_index][j] == 0.0:
break
return None
Now let's feed it a custom review sentence and see what IOB2 tags it predicts:
my_rev_1 = "The french fries were soggy, and the sandwich was dry."
my_rev_padded_seq_1 = raw_sentence_to_padded_sequence(my_rev_1)
predict_and_display_IOB2_tags(model_IOB2_tagger, my_rev_padded_seq_1)
Let's try another one:
my_rev_2 = "Service was slow, and the decor was gross, but the food was delicious."
my_rev_padded_seq_2 = raw_sentence_to_padded_sequence(my_rev_2)
predict_and_display_IOB2_tags(model_IOB2_tagger, my_rev_padded_seq_2)
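Once we have a tag for every word, recovering the actual OTE phrases is a matter of grouping each 'B' with the 'I' tags that follow it. Here is a minimal sketch of that decoding step; `extract_OTEs` is a hypothetical helper, and the words and tags below are hand-written for illustration rather than actual model output.

```python
def extract_OTEs(words, tags):
    """Group words into OTE spans: 'B' opens a span, 'I' extends it."""
    spans, current = [], []
    for word, tag in zip(words, tags):
        if tag == 'B':
            if current:
                spans.append(current)
            current = [word]
        elif tag == 'I' and current:
            current.append(word)
        else:
            # an 'O' tag (or a stray 'I' with no open span) closes the span
            if current:
                spans.append(current)
            current = []
    if current:
        spans.append(current)
    return spans

words = ['the', 'french', 'fries', 'were', 'soggy',
         'and', 'the', 'sandwich', 'was', 'dry']
tags = ['O', 'B', 'I', 'O', 'O',
        'O', 'O', 'B', 'O', 'O']

print(extract_OTEs(words, tags))  # [['french', 'fries'], ['sandwich']]
```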
Having labelled OTEs within our review sentences, we now seek to assign these an entity (category), aspect (attribute), and polarity.
Goal: Given a sentence sequence and an IOB2 tag for a single OTE within that sentence, identify (a) the entity:aspect (e.g. FOOD:QUALITY) and (b) the sentiment polarity (positive, negative, or neutral).
- Inputs: Review sentence word sequences, distance-to-OTE tags (computed from IOB2 tags with OTEs separated), and POS tags.
- Outputs: entity (category), aspect (attribute), and sentiment polarity.
Before we proceed, we'll further preprocess our data to achieve the following:
We achieve this with the following code:
# category-aspect-polarity preprocessor
def CAP_extraction_preprocessor(X_data,
X_POS_data,
X_len_data,
y_IOB2_tags,
sequence_length,
window_size,
y_CAT_data=None,
y_ATTR_data=None,
y_POL_data=None):
# check assumptions
assert window_size < X_data.shape[1], (
'Window size must be smaller than sequence length.')
# find any rows that lack OTEs
reject_indices = []
count = 0
for i in range(y_IOB2_tags.shape[0]):
B_index = np.where(y_IOB2_tags[i]=='B')[0]
if len(B_index) == 0:
reject_indices.append(i)
# proceed only with rows containing at least one OTE
if len(reject_indices) > 0 :
print('Attention: omitted %s rows that were missing OTEs.'
% len(reject_indices))
X_data = np.delete(X_data, reject_indices, axis=0)
X_POS_data = np.delete(X_POS_data, reject_indices, axis=0)
y_IOB2_tags = np.delete(y_IOB2_tags, reject_indices, axis=0)
if y_CAT_data is not None:
y_CAT_data = np.delete(y_CAT_data, reject_indices, axis=0)
if y_ATTR_data is not None:
y_ATTR_data = np.delete(y_ATTR_data, reject_indices, axis=0)
if y_POL_data is not None:
y_POL_data = np.delete(y_POL_data, reject_indices, axis=0)
# initialize truncated arrays
X_data_tr = np.zeros((X_data.shape[0], window_size),
dtype=int)
X_POS_data_tr = np.zeros((X_POS_data.shape[0], window_size),
dtype=int)
# size by the filtered X_data (X_len_data itself is not filtered above)
X_len_data_tr = np.zeros(X_data.shape[0], dtype=int)
y_IOB2_tags_tr = np.empty((y_IOB2_tags.shape[0], window_size),
dtype=str)
# compute distance vectors and truncate to window size
X_OTE_dist = np.zeros(X_data.shape)
X_OTE_dist_tr = np.zeros((X_data.shape[0], window_size))
dist_ref_vec = np.arange(0, y_IOB2_tags.shape[1])
window_ref_vec = np.arange(0, window_size)
window_middle_index = math.floor(window_size/2)
for i in range(y_IOB2_tags.shape[0]):
# ** get distance vector **
# (equals zero for B and I tags)
B_index = np.where(y_IOB2_tags[i]=='B')[0]
I_index = np.where(y_IOB2_tags[i]=='I')[0]
if len(B_index) > 1:
print('WARNING: multiple OTEs detected for')
print('IOB2 tag vector located at index %s;' % i)
print('proceeding with the first one.')
# suffices for cases where we just have the B tag
X_OTE_dist[i, :] = dist_ref_vec - B_index[0]
# # if 'I' tags are present, offset dist so that
# # positive distances are counted from the LAST 'I'
# num_I = len(I_index)
# if num_I > 0:
# for elem in range(B_index[0] + 1, y_IOB2_tags.shape[1]):
# X_OTE_dist[i, elem] = X_OTE_dist[i, elem] - num_I
# X_OTE_dist[i, I_index] = 0
# ** truncate data sequences to specified window size **
# (window is centered on B_index where possible)
window_index_shift = B_index[0] - window_middle_index
# case 1: window limited by sequence start
if window_index_shift < 0:
keep_indices = window_ref_vec
# case 2: window limited by sequence end
elif (window_size + window_index_shift) > sequence_length:
keep_indices = window_ref_vec + (sequence_length - window_size)
# case 3: window fits within sequence
else:
keep_indices = window_ref_vec + window_index_shift
# ** populate this array row **
X_data_tr[i] = X_data[i, keep_indices]
X_OTE_dist_tr[i] = X_OTE_dist[i, keep_indices]
X_POS_data_tr[i] = X_POS_data[i, keep_indices]
y_IOB2_tags_tr[i] = y_IOB2_tags[i, keep_indices]
# ** update X_len **
# -- Since we've clipped the sequence, we must
# -- update X_len to be equal to the new number
# -- of non-padded entries
first_padded_index = np.where(X_data_tr[i] == 0)[0]
if len(first_padded_index) == 0:
X_len_data_tr[i] = window_size
else:
X_len_data_tr[i] = first_padded_index[0]
# generate dictionary for returned data
return_data = {'X': X_data_tr.astype(int),
'X_len': X_len_data_tr,
'X_POS': X_POS_data_tr,
'X_OTE_dist': X_OTE_dist_tr,
'y_IOB2_tags': y_IOB2_tags_tr}
if y_CAT_data is not None:
return_data['y_CAT'] = y_CAT_data
if y_ATTR_data is not None:
return_data['y_ATTR'] = y_ATTR_data
if y_POL_data is not None:
return_data['y_POL'] = y_POL_data
return return_data
print('**Preprocessing Training Set**')
CAP_train_data = CAP_extraction_preprocessor(
X_data=X_train,
X_POS_data=X_POS_train,
X_len_data=X_len_train,
y_IOB2_tags=y_train_IOB2,
sequence_length=62,
window_size=20,
y_CAT_data=y_CAT_train,
y_ATTR_data=y_ATTR_train,
y_POL_data=y_POL_train)
print('\n**Preprocessing Test Set**')
CAP_test_data = CAP_extraction_preprocessor(
X_data=X_test,
X_POS_data=X_POS_test,
X_len_data=X_len_test,
y_IOB2_tags=y_test_IOB2,
sequence_length=62,
window_size=20,
y_CAT_data=y_CAT_test,
y_ATTR_data=y_ATTR_test,
y_POL_data=y_POL_test)
Let's also create a helpful function for displaying and checking the output of our data preprocessing (for verification and debugging purposes):
def display_CAP_data_sequences(CAP_data_dict, row_num):
print('\nDisplaying data sequences for row %s:' % row_num)
print('-- CAT:', CAP_data_dict['y_CAT'][row_num])
print('-- ATTR:', CAP_data_dict['y_ATTR'][row_num])
print('-- POL:', CAP_data_dict['y_POL'][row_num])
print('-- X_len:', CAP_data_dict['X_len'][row_num])
pt = PrettyTable(['Seq. #',
'Text',
'IOB2 Tag',
'Dist.',
'POS Tag'])
for i in range(CAP_data_dict['X'][row_num].shape[0]):
pt.add_row([i,
map_int_to_word(CAP_data_dict['X'][row_num])[i],
CAP_data_dict['y_IOB2_tags'][row_num][i],
CAP_data_dict['X_OTE_dist'][row_num][i],
map_int_to_POS(CAP_data_dict['X_POS'][row_num])[i]])
print(pt)
return None
Now we can examine some examples of the preprocessed output to see that our windowing and distance vectors are being applied correctly.
# display two examples (same sentence, different OTEs)
display_CAP_data_sequences(CAP_train_data, row_num=21)
display_CAP_data_sequences(CAP_train_data, row_num=22)
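The windowing and distance logic can also be checked in isolation. The sketch below reproduces the three window cases from the preprocessor (limited by sequence start, limited by sequence end, or fully inside the sequence) as a single clamping step, using plain Python lists. `window_indices` and `distance_vector` are hypothetical helper names.

```python
def window_indices(B_index, sequence_length, window_size):
    """Indices of a window centred on B_index, clamped to the sequence bounds."""
    start = B_index - window_size // 2
    start = max(0, min(start, sequence_length - window_size))
    return list(range(start, start + window_size))

def distance_vector(B_index, sequence_length):
    """Signed token distance to the OTE's first word (0 at the 'B' tag)."""
    return [i - B_index for i in range(sequence_length)]

print(window_indices(B_index=1, sequence_length=10, window_size=4))  # [0, 1, 2, 3]
print(window_indices(B_index=5, sequence_length=10, window_size=4))  # [3, 4, 5, 6]
print(window_indices(B_index=9, sequence_length=10, window_size=4))  # [6, 7, 8, 9]
print(distance_vector(B_index=2, sequence_length=5))                 # [-2, -1, 0, 1, 2]
```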
Lastly, we define some one-hot encoders:
# one-hot encoders
def get_CAT_onehot(y_CAT_data):
    y_CAT_int = np.vectorize(CAT_to_int.get)(y_CAT_data)
    return K.utils.to_categorical(y_CAT_int, len(CAT_to_int))

def get_ATTR_onehot(y_ATTR_data):
    y_ATTR_int = np.vectorize(ATTR_to_int.get)(y_ATTR_data)
    return K.utils.to_categorical(y_ATTR_int, len(ATTR_to_int))

def get_POL_onehot(y_POL_data):
    y_POL_int = np.vectorize(POL_to_int.get)(y_POL_data)
    return K.utils.to_categorical(y_POL_int, len(POL_to_int))

def get_POS_onehot(X_POS_ints):
    return K.utils.to_categorical(X_POS_ints, len(index_to_POS_tag))
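For readers unfamiliar with one-hot encoding, here is what these encoders produce, written out in plain Python for the polarity labels. The `one_hot` helper and `POL_to_int_demo` are hypothetical names; the dictionary simply copies the values of `POL_to_int` defined earlier.

```python
def one_hot(labels, label_to_int):
    """Encode each label as a list with a single 1.0 at its integer index."""
    num_classes = len(label_to_int)
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label_to_int[label]] = 1.0
        rows.append(row)
    return rows

POL_to_int_demo = {'positive': 0, 'neutral': 1, 'negative': 2, 'NULL': 3}

print(one_hot(['negative', 'positive'], POL_to_int_demo))
# [[0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
```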
We'll create a combined model that provides inference for all three classes (category, attribute, and polarity) concurrently. This is not necessarily the best approach, but we'll proceed with it for simplicity so that we can focus on building a complete inference pipeline.
Here are the details:
- We use a stacked architecture, inspired by this paper, consisting of 3 conv layers followed by an LSTM layer; the reasoning is that conv layers are particularly good at local feature extraction, whereas LSTMs allow us to capture relationships between these features over larger distances within a sentence.
- We choose to embed our distance vectors, which has been found to provide improved results in similar inference problems (according to the papers found here, here, and here).
- The inputs will be the CAP-preprocessed versions of X (tokenized text sequences), X_POS (Part-of-Speech tags), and X_dist (distance-to-OTE term), as well as the pretrained embedding_matrix.
- Outputs will be y_CAT, y_ATTR, and y_POL, which are the inferred category (entity), attribute (aspect), and polarity respectively.
For a nice intro to LSTMs, see this excellent post by Christopher Olah.
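One practical detail when embedding the distance vectors mentioned above: an embedding layer is a lookup table indexed by non-negative integers, whereas our distances are signed (negative to the left of the OTE). A common workaround is to clip the distances and shift them so the smallest value maps to index 0. The sketch below shows that idea only; `distance_to_embedding_index` is a hypothetical helper, and the choice of `max_distance` is my own assumption rather than something taken from the code in this post.

```python
def distance_to_embedding_index(distance, max_distance):
    """Clip a signed distance to [-max_distance, max_distance], then shift to >= 0."""
    clipped = max(-max_distance, min(distance, max_distance))
    return clipped + max_distance

max_distance = 19  # assumption: window_size - 1 for a window of 20 words

print([distance_to_embedding_index(d, max_distance)
       for d in (-25, -3, 0, 3, 25)])
# [0, 16, 19, 22, 38]
```

With this scheme the embedding table would need `2 * max_distance + 1` rows (39 in this example) to cover every possible index.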
def pretrained_embedding_initializer(
shape=embedding_matrix.shape,
dtype=tf.float32,
partition_info=None):
assert dtype is tf.float32
return embedding_matrix
def build_CAP_extraction_network(
window_size=20,
num_conv_layers=3,
num_conv_filter_channels=50,
conv_kernel_size=3,
conv_dropout_rate=0.3,
num_LSTM_units=100,
num_dense_units=100):
# define input sequence tensor shapes
# -- text sequence
seq_input = K.layers.Input(shape=(window_size, ),
dtype='int32',
name='seq_input')
# -- POS tags (one-hot encoded)
POS_input = K.layers.Input(shape=(window_size, len(index_to_POS_tag)),
dtype='float32',
name='POS_input')
# -- distance vector
dist_input = K.layers.Input(shape=(window_size, ),
dtype='float32',
name='dist_input')
# create embedding layers for text and distance
# -- for text sequence (zero-padding is not masked; see mask_zero below)
seq_embedded = Embedding(
input_dim=embedding_matrix.shape[0],
output_dim=embedding_matrix.shape[1],
mask_zero=False, # not currently supported by conv1D
embeddings_initializer=pretrained_embedding_initializer
)(seq_input)
# -- for distance vector
dist_embedded = Embedding(
input_dim=window_size,
output_dim=10,
embeddings_initializer='uniform'
)(dist_input)
# concatenate input features
x = K.layers.Concatenate()([seq_embedded,
dist_embedded,
POS_input])
# generate convolution layers
for i in range(num_conv_layers):
x = Conv1D(
filters=num_conv_filter_channels,
kernel_size=conv_kernel_size,
strides=1,
padding='same',
data_format='channels_last',
activation='relu',
use_bias=True,
kernel_initializer='glorot_uniform',
bias_initializer='zeros')(x)
x = Dropout(rate=conv_dropout_rate)(x)
# apply LSTM layer
# -- we want only the final hidden state,
# -- therefore return_sequences=False
x_LSTM_out = K.layers.LSTM(
units=num_LSTM_units,
activation='tanh',
recurrent_activation='hard_sigmoid',
use_bias=True,
kernel_initializer='glorot_uniform',
recurrent_initializer='orthogonal',
bias_initializer='zeros',
dropout=0.0,
return_sequences=False, # refers to *hidden* states
return_state=False, # refers to *cell* states
stateful=False)(x)
# output branch associated with CAT prediction
x_CAT = Dense(
num_dense_units,
activation='relu')(x_LSTM_out)
y_CAT_probas = Dense(
len(CAT_to_int),
activation='softmax',
name='CAT_output')(x_CAT)
# output branch associated with ATTR prediction
x_ATTR = Dense(
num_dense_units,
activation='relu')(x_LSTM_out)
y_ATTR_probas = Dense(
len(ATTR_to_int),
activation='softmax',
name='ATTR_output')(x_ATTR)
# output branch associated with POL prediction
x_POL = Dense(
num_dense_units,
activation='relu')(x_LSTM_out)
y_POL_probas = Dense(
len(POL_to_int),
activation='softmax',
name='POL_output')(x_POL)
# assemble
model = K.Model(
inputs=[seq_input, POS_input, dist_input],
outputs=[y_CAT_probas, y_ATTR_probas, y_POL_probas])
return model
model_CAP_extraction = build_CAP_extraction_network()
model_CAP_extraction.summary()
# visualizing our model
plot_model(model_CAP_extraction, to_file='model_CAP_extraction_plot.png',
show_shapes=True, show_layer_names=True)
Image('model_CAP_extraction_plot.png')
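As a rough cross-check of the model summary, the sketch below traces the per-sample tensor shapes through the stack. The embedding width (300) and the number of POS tags (45) are placeholder values chosen for illustration, not values taken from this dataset, and `trace_shapes` is a hypothetical helper rather than a Keras function.

```python
def trace_shapes(window_size, embed_dim, dist_embed_dim, num_pos_tags,
                 num_conv_layers, conv_filters, lstm_units, dense_units):
    """Return (layer name, per-sample output shape) pairs for the stack."""
    shapes = []
    # the three feature sets are concatenated along the last axis
    features = embed_dim + dist_embed_dim + num_pos_tags
    shapes.append(('concatenate', (window_size, features)))
    # stride 1 with 'same' padding keeps the sequence length unchanged
    for i in range(num_conv_layers):
        shapes.append(('conv1d_%d' % (i + 1), (window_size, conv_filters)))
    # return_sequences=False keeps only the final LSTM output
    shapes.append(('lstm', (lstm_units,)))
    # each of the three output branches starts with a dense layer
    shapes.append(('dense (per branch)', (dense_units,)))
    return shapes


example = trace_shapes(window_size=20, embed_dim=300, dist_embed_dim=10,
                       num_pos_tags=45, num_conv_layers=3, conv_filters=50,
                       lstm_units=100, dense_units=100)
for name, shape in example:
    print('%-20s %s' % (name, shape))
```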
Note that when training RNNs, such as the LSTM layer, we typically mask our sequences so that padded entries are ignored. However, because our LSTM is stacked on top of convolutional layers, there is currently no straightforward way to do this in Keras (v2.2.2), since conv layers have not yet been made compatible with mask objects.
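To illustrate what a mask would encode, here is a small sketch in plain Python, assuming the padding token has index 0 as in the embedding comment above. It is not a workaround for the Keras limitation; `padding_mask` and `last_real_timestep` are hypothetical helpers that only show which timesteps a masked recurrent layer would skip.

```python
PAD_ID = 0  # assumed index of the '<PAD>' token


def padding_mask(token_ids, pad_id=PAD_ID):
    """True for real tokens, False for padded positions."""
    return [[tok != pad_id for tok in seq] for seq in token_ids]


def last_real_timestep(mask_row):
    """Index of the last unmasked position, or None if all are padding."""
    real = [i for i, keep in enumerate(mask_row) if keep]
    return real[-1] if real else None


batch = [[12, 7, 31, 0, 0],
         [5, 0, 0, 0, 0]]
mask = padding_mask(batch)
print(mask[0])                                # [True, True, True, False, False]
print([last_real_timestep(m) for m in mask])  # [2, 0]
```

With masking, a recurrent layer would take its final output at the last real timestep; without it, as here, the LSTM also processes the padded positions.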
We train in the usual way, but this time we specify our multiple inputs and outputs by referring to the specific layer names using a dictionary.
np.random.seed(RANDOM_SEED) # for reproducibility
tf.set_random_seed(RANDOM_SEED)
model_CAP_extraction.compile(
loss='categorical_crossentropy',
optimizer='rmsprop', # what the paper uses
metrics=['accuracy'])
history_CAP_exctraction = model_CAP_extraction.fit(
x={'seq_input': CAP_train_data['X'],
'POS_input': get_POS_onehot(CAP_train_data['X_POS']),
'dist_input': CAP_train_data['X_OTE_dist']},
y={'CAT_output': get_CAT_onehot(CAP_train_data['y_CAT']),
'ATTR_output': get_ATTR_onehot(CAP_train_data['y_ATTR']),
'POL_output': get_POL_onehot(CAP_train_data['y_POL'])},
batch_size=128,
epochs=125,
verbose=1,
callbacks=None,
validation_data=(
{'seq_input': CAP_test_data['X'],
'POS_input': get_POS_onehot(CAP_test_data['X_POS']),
'dist_input': CAP_test_data['X_OTE_dist']},
{'CAT_output': get_CAT_onehot(CAP_test_data['y_CAT']),
'ATTR_output': get_ATTR_onehot(CAP_test_data['y_ATTR']),
'POL_output': get_POL_onehot(CAP_test_data['y_POL'])}),
shuffle=True,
initial_epoch=0)
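Because the model has three softmax outputs compiled with a single loss name, Keras applies categorical cross-entropy to each output and minimizes their weighted sum (all weights default to 1). The sketch below reproduces that arithmetic for a single sample in plain Python. The probabilities are made-up values, and `combined_loss` is an illustrative helper rather than part of Keras, which additionally averages over the batch.

```python
import math


def categorical_crossentropy(y_true_onehot, y_pred_probas, eps=1e-7):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps))
                for t, p in zip(y_true_onehot, y_pred_probas))


def combined_loss(targets, predictions, loss_weights=None):
    """Weighted sum of the per-output losses, keyed by output layer name."""
    if loss_weights is None:
        loss_weights = {name: 1.0 for name in targets}
    per_output = {name: categorical_crossentropy(targets[name],
                                                 predictions[name])
                  for name in targets}
    total = sum(loss_weights[name] * per_output[name] for name in targets)
    return total, per_output


targets = {'CAT_output': [0, 1, 0],
           'ATTR_output': [1, 0],
           'POL_output': [0, 0, 1]}
predictions = {'CAT_output': [0.2, 0.5, 0.3],
               'ATTR_output': [0.5, 0.5],
               'POL_output': [0.1, 0.1, 0.8]}
total, per_output = combined_loss(targets, predictions)
print('total loss: %.4f' % total)
for name in sorted(per_output):
    print('%s: %.4f' % (name, per_output[name]))
```

If one output should dominate training (for example, polarity), the `loss_weights` argument of `compile` plays the role of the `loss_weights` dictionary here.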
As before, we'll create some functions for visualizing the results:
def plot_CAP_extraction_histories(history_):
    fig, ax = plt.subplots(nrows=2, ncols=3,
                           sharex=False, sharey=False,
                           figsize=(14, 8))
    # one column per output, one row per metric
    outputs = [('CAT', 'CATEGORY'), ('ATTR', 'ASPECT'), ('POL', 'POLARITY')]
    metrics = [('acc', 'Accuracy'), ('loss', 'Loss')]
    for row, (metric_key, metric_label) in enumerate(metrics):
        for col, (output_key, output_label) in enumerate(outputs):
            history_key = '%s_output_%s' % (output_key, metric_key)
            ax[row, col].plot(history_.history[history_key])
            ax[row, col].plot(history_.history['val_' + history_key])
            ax[row, col].set_title('%s %s' % (output_label, metric_label))
            ax[row, col].set_ylabel(metric_label)
            ax[row, col].set_xlabel('Epoch #')
            ax[row, col].legend(['training set', 'test set'],
                                loc='upper left')
    plt.tight_layout()
    plt.show()
    return None
def evaluate_CAP_performance(model_):
    scores_ = model_.evaluate(
        x={'seq_input': CAP_test_data['X'],
           'POS_input': get_POS_onehot(CAP_test_data['X_POS']),
           'dist_input': CAP_test_data['X_OTE_dist']},
        y={'CAT_output': get_CAT_onehot(CAP_test_data['y_CAT']),
           'ATTR_output': get_ATTR_onehot(CAP_test_data['y_ATTR']),
           'POL_output': get_POL_onehot(CAP_test_data['y_POL'])},
        batch_size=None,
        verbose=1)
    print('\nCATEGORY Accuracy: %.4f' % scores_[4])
    print('ASPECT Accuracy: %.4f' % scores_[5])
    print('POLARITY Accuracy: %.4f' % scores_[6])
    return scores_
plot_CAP_extraction_histories(history_CAP_exctraction)
scores_CAP = evaluate_CAP_performance(model_CAP_extraction)
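The hard-coded indices 4 to 6 in `evaluate_CAP_performance` rely on the order in which `evaluate` returns its values: the total loss, then the three per-output losses, then the three per-output accuracies. A slightly more defensive sketch pairs each score with its name instead. Keras models expose these names as `model.metrics_names`; the list written out below is the ordering assumed here, and `label_scores` is a hypothetical helper.

```python
def label_scores(metric_names, scores):
    """Pair each evaluation score with its metric name."""
    if len(metric_names) != len(scores):
        raise ValueError('expected %d scores, got %d'
                         % (len(metric_names), len(scores)))
    return dict(zip(metric_names, scores))


# assumed ordering; in practice pass model_CAP_extraction.metrics_names
names = ['loss',
         'CAT_output_loss', 'ATTR_output_loss', 'POL_output_loss',
         'CAT_output_acc', 'ATTR_output_acc', 'POL_output_acc']
dummy_scores = [1.5, 0.6, 0.5, 0.4, 0.80, 0.85, 0.90]  # made-up numbers
labelled = label_scores(names, dummy_scores)
print(labelled['POL_output_acc'])  # 0.9
```

In the notebook this would be called as `label_scores(model_CAP_extraction.metrics_names, scores_CAP)`.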
We obtain fairly high test accuracies; meanwhile, our training accuracies for category and attribute are 5-10% lower. Once again, this may be due to the small dataset size and how the vocabulary is distributed between the train and test sets, as hypothesized earlier. So we don't yet know whether this is because our network understands word feature relations or because it has merely memorized keywords. In a more realistic setting, we would train our network on a larger corpus and perhaps include some pretrained layers via transfer learning. But what we're interested in for this tutorial is building the pipeline, so we'll proceed without further experimentation.
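One quick way to probe the vocabulary hypothesis is to measure how much of the test vocabulary also appears in the training set. The sketch below assumes each sentence is available as a list of (lemmatized) tokens; the toy sentences and the helper name `vocabulary_overlap` are illustrative only.

```python
def vocabulary_overlap(train_sentences, test_sentences):
    """Fraction of test word types and test tokens seen during training."""
    train_vocab = {tok for sent in train_sentences for tok in sent}
    test_tokens = [tok for sent in test_sentences for tok in sent]
    if not test_tokens:
        raise ValueError('test set contains no tokens')
    test_vocab = set(test_tokens)
    type_overlap = len(test_vocab & train_vocab) / float(len(test_vocab))
    token_overlap = (sum(tok in train_vocab for tok in test_tokens)
                     / float(len(test_tokens)))
    return type_overlap, token_overlap


train = [['the', 'pizza', 'was', 'great'], ['service', 'was', 'slow']]
test = [['the', 'pasta', 'was', 'slow']]
print(vocabulary_overlap(train, test))  # (0.75, 0.75)
```

A very high overlap would be consistent with keyword memorization being enough to score well on the test set, although it would not prove it.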
Finally, we'll create a function that takes a review sentence as input and runs the complete pipeline to identify each OTE and infer its category (entity), aspect (attribute), and polarity.
When creating this pipeline, we'll also need to take tagged sequences with multiple OTEs (from the output of model 1) and separate them so that we feed only one OTE at a time to model 2. We achieve this using the functions below:
def predict_CAP(model_, CAP_data):
    pred_ = model_.predict(
        x={'seq_input': CAP_data['X'],
           'POS_input': get_POS_onehot(CAP_data['X_POS']),
           'dist_input': CAP_data['X_OTE_dist']},
        batch_size=1)
    # take the most probable class for each of the three outputs
    CAT_pred_ = int_to_CAT[np.argmax(pred_[0], axis=-1)[0]]
    ATTR_pred_ = int_to_ATTR[np.argmax(pred_[1], axis=-1)[0]]
    POL_pred_ = int_to_POL[np.argmax(pred_[2], axis=-1)[0]]
    CAP_preds_ = {'CAT': CAT_pred_,
                  'ATTR': ATTR_pred_,
                  'POL': POL_pred_}
    return CAP_preds_
def display_sequence_results(CAP_data_dict):
    row_num = 0
    print('SEQUENCES:')
    pt = PrettyTable(['Seq. #',
                      'Text',
                      'IOB2 Tag',
                      'Dist.',
                      'POS Tag'])
    for i in range(CAP_data_dict['X'][row_num].shape[0]):
        pt.add_row([i,
                    map_int_to_word(CAP_data_dict['X'][row_num])[i],
                    CAP_data_dict['y_IOB2_tags'][row_num][i],
                    CAP_data_dict['X_OTE_dist'][row_num][i],
                    map_int_to_POS(CAP_data_dict['X_POS'][row_num])[i]])
    print(pt)
    return None
def inference_pipeline(raw_text, show_sequences=False):
    """
    Notes:
    * must use the same sequence and window sizes that were
      used for model training.
    * accepts only a single sentence at a time (not batches).
    """
    SEQUENCE_LENGTH = 62
    WINDOW_SIZE = 20
    print('\n*** Input sentence: ***')
    print(raw_text)
    # apply preprocessor to clean text
    text_cln = preprocessor(raw_text)
    # obtain Part-of-Speech tags
    text_POS_tags = np.array(get_POS_tags(text_cln))
    # lemmatize cleaned text
    text_cln_lm = lemmatize(text_cln)
    # convert words to integer tokens
    text_cln_lm_ints = map_word_to_int(text_cln_lm)
    # pad text sequences
    text_int_pad_seq = sequence.pad_sequences(
        [text_cln_lm_ints],
        maxlen=SEQUENCE_LENGTH,
        truncating='post',
        padding='post',
        value=word_to_index['<PAD>'])
    # record the pre-padding sentence length (capped at the padded length);
    # len(text_int_pad_seq) would give the batch size (1) instead
    text_orig_len = np.array(
        [min(len(text_cln_lm_ints), SEQUENCE_LENGTH)])
    # pad POS tag sequences
    text_POS_int_pad_seq = sequence.pad_sequences(
        [map_POS_to_int(text_POS_tags)],
        maxlen=SEQUENCE_LENGTH,
        truncating='post',
        padding='post',
        value=POS_tag_to_index['<PAD>']).astype(np.int32)
    # predict IOB2 tags
    IOB2_int_pred = model_IOB2_tagger.predict(text_int_pad_seq)
    IOB2_tag_pred = get_IOB2_tags_from_ints(
        np.argmax(IOB2_int_pred, axis=-1))
    # find OTE indices
    B_indices = np.where(IOB2_tag_pred[0]=='B')[0]
    # create separate IOB2 tags if there are multiple OTEs
    if len(B_indices) > 1:
        IOB2_temp = np.full(
            (len(B_indices), SEQUENCE_LENGTH), 'O', dtype=str)
        for OTE_id in range(len(B_indices)):
            # set "B" tag for this OTE
            this_B = B_indices[OTE_id]
            IOB2_temp[OTE_id, this_B] = 'B'
            # check if any "I" tags follow the "B" (look ahead at most
            # 9 tokens, without running past the end of the sequence)
            for next_index in range(1, 10):
                if this_B + next_index >= SEQUENCE_LENGTH:
                    break
                if IOB2_tag_pred[0, this_B + next_index] == 'I':
                    IOB2_temp[OTE_id, this_B + next_index] = 'I'
                else:
                    break
        # reassign IOB2 tag variable
        IOB2_tag_pred = IOB2_temp
    results = []
    print('\nDetected %s OTE(s)' % len(B_indices))
    for i in range(IOB2_tag_pred.shape[0]):
        # apply CAP extraction preprocessor
        CAP_data = CAP_extraction_preprocessor(
            X_data=text_int_pad_seq,
            X_POS_data=text_POS_int_pad_seq,
            X_len_data=text_orig_len,
            y_IOB2_tags=IOB2_tag_pred[i].reshape(1, SEQUENCE_LENGTH),
            sequence_length=SEQUENCE_LENGTH,
            window_size=WINDOW_SIZE)
        # perform CAP extraction
        CAP_pred = predict_CAP(model_CAP_extraction, CAP_data)
        # also isolate the OTE
        OTE_B_indices = np.where(IOB2_tag_pred[i]=='B')[0]
        OTE_I_indices = np.where(IOB2_tag_pred[i]=='I')[0]
        OTE_indices = np.concatenate((OTE_B_indices, OTE_I_indices))
        this_OTE = map_int_to_word(text_int_pad_seq[0, OTE_indices])
        # print results
        print('\n========================')
        print('Results for OTE # %s' % (i+1))
        print('========================')
        print('INFERRED:')
        print('* OTE (Target Words):', this_OTE)
        print('* Category (Entity):', CAP_pred['CAT'])
        print('* Aspect (Attribute):', CAP_pred['ATTR'])
        print('* Polarity:', CAP_pred['POL'])
        if show_sequences is True:
            display_sequence_results(CAP_data)
        # collect for export
        these_results = {'OTE': this_OTE,
                         'CAT': CAP_pred['CAT'],
                         'ATTR': CAP_pred['ATTR'],
                         'POL': CAP_pred['POL']}
        results.append(these_results)
    return results
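The OTE-separation step inside `inference_pipeline` can also be viewed in isolation. The sketch below works on a plain list of IOB2 tags and returns one tag sequence per OTE, each containing a single `B` tag plus any `I` tags that immediately follow it. It is a simplified stand-in: unlike the pipeline, it uses Python lists rather than NumPy arrays and does not cap the look-ahead at nine tokens. `split_iob2_by_ote` is a hypothetical name.

```python
def split_iob2_by_ote(tags):
    """Split one IOB2 tag sequence into one sequence per OTE."""
    per_ote = []
    for start, tag in enumerate(tags):
        if tag != 'B':
            continue
        # start from an all-'O' sequence and copy over just this OTE
        single = ['O'] * len(tags)
        single[start] = 'B'
        pos = start + 1
        while pos < len(tags) and tags[pos] == 'I':
            single[pos] = 'I'
            pos += 1
        per_ote.append(single)
    return per_ote


tags = ['O', 'B', 'I', 'O', 'O', 'B', 'O']
for ote_tags in split_iob2_by_ote(tags):
    print(ote_tags)
# ['O', 'B', 'I', 'O', 'O', 'O', 'O']
# ['O', 'O', 'O', 'O', 'O', 'B', 'O']
```

A sequence with no `B` tag yields an empty list, which a caller could use to skip the second model entirely when no OTE is detected.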
Now let's test this out on a few custom review sentences. First a simple one:
my_rev_1 = "The french fries were soggy, and the sandwich was dry."
results = inference_pipeline(my_rev_1, show_sequences=True)
Now a more complicated one:
my_rev_2 = "Service was terribly slow and the restaurant \
was noisy, but the waiter was friendly and the calamari was very delicious."
CAP_data_temp = inference_pipeline(my_rev_2)
However, our model isn't perfect. Here's a case where it makes some incorrect inferences, perhaps an indication that it has relied too heavily on keywords and hasn't made good use of the distance feature vector.
my_rev_3 = "The coffee was excellent, but the desert was too expensive."
CAP_data_temp = inference_pipeline(my_rev_3)
In summary, this tutorial demonstrated how to build an inference pipeline for finding opinion target expressions (OTEs) and classifying these based on entity, aspect, and sentiment polarity. By now, you should be familiar with how to:
- clean and preprocess text data
- make use of pre-trained embedding vectors
- generate Part-of-Speech (POS) tags
- perform sequence-to-sequence inference for IOB2 tag generation
- define custom (per-sequence) accuracy metrics
Having developed these tools and built a basic pipeline, we are ready to go back and optimize our models. Better yet, we can train our models on a larger dataset such as the USAGE review corpus, which also labels opinion terms that can be used for relational extraction, allowing us to implement more powerful architectures such as the one described here.