

Innovation Lab EPO QA
Applying deep learning to automate the allocation of categories to patent abstracts
W@btech Innovation Lab developers have built a process that uses natural language processing to train a deep learning model to automatically allocate category symbols to patents for European Patent Office QA processes
W@btech has been working with the European Patent Office (EPO) on a critical upgrade of its patent allocation system, of which it has in-depth knowledge. Developers at the Infotel Innovation Lab have extracted Cooperative Patent Classification (CPC) data and used natural language processing to train and refine a deep learning model that automates the process of ‘reading’ patent abstracts to determine which of some 250,000 symbols should be allocated to each patent.
• The training process involves a number of programming and assessment steps: Load data > Preprocess and embed data with rules and weights > Define model > Train > Evaluate > Re-train and refine
• The key is to iterate and refine, beginning as simply as possible (fewer records, minimal viable ‘success’), then adding complexity.
Goal of Project & Key Success Factors
GOAL Patent categorization is a job done by expert examiners, who read abstracts (concise summaries of the inventions in applications) and allocate defined symbols. The project goal is to develop an algorithm that automates the process of scanning patent abstracts and assigning allocation symbols, so as to flag discrepancies in the allocation system, with the potential for audit cost savings.
SOLUTION Extract and process CPC patent data, consisting of abstracts and correctly allocated symbols from some 1 million records, to use as the basis for training, testing and refining deep-learning models before applying them to unseen (unallocated) data.
SUCCESS FACTORS – the system is able to allocate symbols at > 95% accuracy for the first 5 categorization levels in the allocation symbol system, providing a fit-for-purpose automated system.
How to train your algorithm
Pre-process data
Refine & Pre Process
PREPROCESS → Refine abstracts data (remove padding words, punctuation etc.) → Standardize (lower case) → Attach labels to symbols (required by FastText) → Create array from string of raw unique data → Index words as integers using the Wiktionary semi-structured dataset (maps lexicographic to numeric data) → Machine learning format (key: value pairs)
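The preprocessing chain above can be sketched in a few lines of Python. This is illustrative only (the helper names, the small stop-word set and the sample abstract are assumptions, not the Lab's actual code); it shows cleaning and lower-casing an abstract, attaching a FastText-style `__label__` prefix to its symbol, and indexing words as integers.

```python
# Minimal sketch of the preprocessing steps (assumed helpers, not the
# Lab's actual code).
import re

STOP = {"a", "an", "the", "for", "and", "of", "to", "in"}  # stand-in stop-word list

def preprocess(abstract: str) -> list[str]:
    # Standardize to lower case and strip punctuation
    text = re.sub(r"[^\w\s]", " ", abstract.lower())
    # Remove padding (stop) words
    return [w for w in text.split() if w not in STOP]

def index_words(tokens: list[str], word_index: dict[str, int]) -> list[int]:
    # Map each word to an integer id, growing the index as new words appear
    return [word_index.setdefault(w, len(word_index) + 1) for w in tokens]

word_index: dict[str, int] = {}
tokens = preprocess("A device for measuring, and recording, blood pressure.")
label = "__label__" + "A61B"   # FastText requires labelled symbols
ids = index_words(tokens, word_index)
print(label, tokens, ids)
```

The resulting `{word: integer}` pairs are the key:value machine-learning format the pipeline feeds forward.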
PREPARE TRAINING DATASETS → shuffle (randomize) array → split into sets – training (90%), test (5%), validation (5%)
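The shuffle-and-split step is mechanical; a minimal sketch (the seed and record format are assumptions):

```python
# Shuffle the records, then split 90/5/5 into training / test / validation.
import random

def split_dataset(records, seed=42):
    records = records[:]                     # leave the caller's list intact
    random.Random(seed).shuffle(records)     # randomize order reproducibly
    n = len(records)
    n_train = int(n * 0.90)
    n_test = int(n * 0.05)
    return (records[:n_train],
            records[n_train:n_train + n_test],
            records[n_train + n_test:])

train, test, val = split_dataset(list(range(1000)))
print(len(train), len(test), len(val))   # 900 50 50
```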
Embed
EMBED SYNONYMS MATRIX using Stanford University’s GloVe word embeddings to map synonym groups (allows words of similar meaning to be treated alike). The process involves generating an embedding matrix of 100 dimensions, which is then applied to the word set.
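Building the 100-dimension embedding matrix typically means mapping each indexed word to its GloVe vector; a sketch under the assumption that the GloVe file (e.g. the standard `glove.6B.100d.txt`) has already been parsed into a `{word: vector}` dict — the toy vectors below are placeholders:

```python
# Sketch of building the 100-dimension embedding matrix (toy data, not
# real GloVe vectors).
import numpy as np

EMBED_DIM = 100  # the 100-dimension GloVe vectors mentioned above

def build_embedding_matrix(word_index, glove_vectors):
    # Rows are indexed by the integer ids assigned during preprocessing;
    # words absent from GloVe keep a zero row.
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
    for word, i in word_index.items():
        vec = glove_vectors.get(word)
        if vec is not None:
            matrix[i] = vec
    return matrix

# Toy stand-in for the parsed GloVe file (assumption)
glove = {"device": np.ones(EMBED_DIM), "pressure": np.full(EMBED_DIM, 0.5)}
word_index = {"device": 1, "pressure": 2, "zorbified": 3}
m = build_embedding_matrix(word_index, glove)
print(m.shape)   # (4, 100)
```

Words with similar meanings sit close together in this vector space, which is what lets synonym groups be used interchangeably during training.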
Calculate weighting
WEIGHTING CALCULATIONS: Data must be correctly weighted so repetitions in abstracts do not ‘weigh’ too heavily and bias the training algorithm. Spread data by applying weighting calculations: simplistically, for AAAAA and BCDEF → reduce A’s weighting
GOTCHA: This weighting works if dealing with a single value (1:1 abstract : symbol). But more than one symbol may be allocated, so weighting needs to accommodate exponential possibilities with a matrix to account for the number of times a word could appear, and the number of times it’s applied to symbol(s).
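Both ideas can be sketched simply — the exact formulas are assumptions standing in for the Lab's calculations: down-weight a word by its repeat count within one abstract, and, for the multi-symbol case, build a word × symbol co-occurrence matrix:

```python
# Sketch of the two weighting ideas (illustrative formulas, not the
# Lab's actual calculations).
from collections import Counter
import numpy as np

def word_weights(tokens):
    # A word appearing k times in one abstract gets weight 1/k per
    # occurrence, so "AAAAA" contributes no more in total than one "B".
    counts = Counter(tokens)
    return {w: 1.0 / k for w, k in counts.items()}

def word_symbol_matrix(records, word_index, symbol_index):
    # Multi-label case: count how often each word co-occurs with each
    # allocated symbol, so weighting can account for both frequencies.
    m = np.zeros((len(word_index), len(symbol_index)))
    for tokens, symbols in records:
        for w in tokens:
            for s in symbols:
                m[word_index[w], symbol_index[s]] += 1
    return m

print(word_weights(["a", "a", "a", "a", "a"]))   # {'a': 0.2}
print(word_weights(["b", "c", "d", "e", "f"]))   # each word weighted 1.0
```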
Train, evaluate, train
Define model
DEFINE CONVOLUTIONAL NEURAL NETWORK MODEL. Convolution is a process of taking input data, assigning importance (learnable weights and biases) and adding convolution layers, filtering and downsampling to make data nodes as small and unique as possible. This process reduces the number of parameters involved without losing features critical for getting a good prediction.
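A toy illustration of the filter-then-downsample idea (not the Lab's model): a one-dimensional convolution slides a learnable filter over the sequence, and max-pooling keeps only the strongest response in each window, shrinking the data while preserving the salient features:

```python
# Toy 1D convolution + max-pool downsampling (illustration only).
import numpy as np

def conv1d(x, kernel):
    # Slide the filter over the sequence (valid convolution)
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool(x, size=2):
    # Downsample: keep only the strongest response in each window
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

signal = np.array([0., 1., 0., 0., 3., 0., 0., 1.])
features = conv1d(signal, np.array([1., 1.]))   # filter responses, length 7
small = max_pool(features)                      # downsampled, length 3
print(len(signal), "->", len(features), "->", len(small))   # 8 -> 7 -> 3
```

Stacking several such layers is what shrinks the representation without losing the features needed for a good prediction.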
PREPARE MODEL FOR TRAINING:
Import data → Create base model → Add convolution layers → Apply embedding
Train
The TRAINING DATASET provides the basis for learning (this text: these symbols); convolutions tune the model, accuracy is checked against the test dataset, and the model is refined using the validation dataset.
TRAIN Compile model → Start training process → save to BIN FILE (aka BRAIN!)
BEGIN SIMPLE: Target the first letter of allocation symbols for initial training using a smaller dataset → 95–99% accuracy; then refine further (classes, sub-classes, groups and sub-groups) → 75%; then increase the training set size.
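The "begin simple" targets can be derived from the symbols themselves. A CPC symbol such as `A61B5/021` is hierarchical (section letter, two-digit class, sub-class letter, then group/sub-group), so the coarse training target is just a prefix; the slicing below is an illustrative sketch of that idea:

```python
# Illustrative extraction of coarse-to-fine training targets from a
# CPC-style symbol (the level/prefix scheme is an assumption).
CUTS = {1: 1, 2: 3, 3: 4}  # prefix lengths: section, class, sub-class

def target_at_level(symbol: str, level: int) -> str:
    # Level 1 is the single section letter used for initial training;
    # deeper levels target classes, sub-classes, then the full symbol
    # (groups and sub-groups).
    return symbol[:CUTS[level]] if level in CUTS else symbol

print(target_at_level("A61B5/021", 1))   # 'A'
print(target_at_level("A61B5/021", 2))   # 'A61'
print(target_at_level("A61B5/021", 3))   # 'A61B'
```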
Evaluate
VISUALIZE AND VALIDATE
Data stored in history → Generate a graph comparing Training with Validation (to assess overfit or underfit) and refine the model by iterating the process.
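The overfit check amounts to comparing the two accuracy curves stored in the training history; a minimal sketch, assuming a Keras-style history dict with made-up numbers (in practice the two curves would be plotted, e.g. with matplotlib):

```python
# Sketch of the overfit check: compare training vs validation accuracy
# per epoch (toy numbers, Keras-style history format assumed).
history = {
    "accuracy":     [0.62, 0.78, 0.88, 0.94, 0.97],
    "val_accuracy": [0.60, 0.74, 0.80, 0.81, 0.80],
}

def overfit_gap(history):
    # A growing gap between the curves signals overfitting; both curves
    # staying low signals underfitting.
    return [round(a - v, 2) for a, v in zip(history["accuracy"],
                                            history["val_accuracy"])]

print(overfit_gap(history))   # [0.02, 0.04, 0.08, 0.13, 0.17]
```

A gap that widens with each epoch, as in this toy history, is the cue to refine the model and iterate.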
Under the hood