Adding Missing Words to Regular Expressions

Thomas Rebele, Katerina Tzompanaki, Fabian Suchanek

Abstract

Regular expressions are textual patterns used in data-intensive applications to extract data of specific interest. However, even hand-crafted regular expressions may fail to match all the intended words. In this paper, we propose a novel way to learn a regular expression starting from an original one and a set of missing (non-matched) words. Our method finds an approximate match between each missing word and the regular expression, and adds disjunctions for the unmatched parts. Our goal is to improve the recall of the initial regular expression without deteriorating its precision. We demonstrate the effectiveness and generality of our technique through experiments on various datasets.
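
To make the idea of "adding disjunctions for the unmatched parts" concrete, here is a minimal sketch in Python. It is not the algorithm from the paper. It assumes the original expression is a plain literal string with no operators, and it uses the alignment from the standard library's difflib as a stand-in for the approximate matching step. The function name add_missing_word is hypothetical.

```python
import re
from difflib import SequenceMatcher


def add_missing_word(literal: str, missing: str) -> str:
    """Build a regex that matches both `literal` and `missing`.

    Assumption: `literal` is a plain string, not a full regular expression.
    The alignment comes from difflib, not from the paper's matching method.
    """
    parts = []
    matcher = SequenceMatcher(a=literal, b=missing, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        old, new = literal[i1:i2], missing[j1:j2]
        if tag == "equal":
            # Both words agree here: keep the original text unchanged.
            parts.append(re.escape(old))
        else:
            # The words differ here: offer both variants as a disjunction.
            # One side may be empty (pure insertion or deletion).
            parts.append(f"(?:{re.escape(old)}|{re.escape(new)})")
    return "".join(parts)


repaired = add_missing_word("gray", "grey")
print(repaired)
```

The repaired expression still accepts the original word and now also accepts the missing one, while parts that already agreed stay untouched. This is the intuition behind improving recall without giving up precision. A real implementation would have to handle operators such as character classes, quantifiers, and existing disjunctions in the original expression.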

Publication

In Pacific-Asia Conference on Knowledge Discovery and Data Mining.

Date

June 2018
