
Publication

Storing Data using Natural Language Addressing

Book - Dissertation

Large unstructured or semi-structured datasets require a high level of computational sophistication, because operations that are easy at a small scale (moving data between machines or in and out of storage, visualizing the data, or displaying results) can all require substantial algorithmic ingenuity. As a data set becomes increasingly massive, it may be infeasible to gather it in one place and analyze it as a whole. Thus, there may be a need for algorithms that operate in a distributed fashion, analyzing subsets of the data and aggregating those results to understand the complete set. One aspect of this is the challenge of data assimilation, in which we wish to use new data to update model parameters without reanalyzing the entire data set. This is essential when new waves of data continue to arrive, or subsets are analyzed in isolation from one another, and one aims to improve the model and inferences in an adaptive fashion, for example with streaming algorithms [NRC, 2013].

In response to these problems, a new idea was proposed first in [Mitov, 2011] and then in [Markov et al, 2013]: a method for effectively building and storing pattern sets in multi-layer structures during association rule mining, using the possibilities of multi-dimensional numbered information spaces. The main algorithm was called “MPGN”, an abbreviation of “Multi-layer Pyramidal Growing Networks of information spaces”. The main goal was to extend the possibilities of network structures by using a special kind of multi-layer memory structure called “pyramids”, which permits defining and realizing new opportunities. The bottleneck of MPGN was the need to search through billions of values of the association rules' features in order to convert instances into numbered arrays (vectors). This is part of the preprocessing step of the algorithm (see page 97 of [Mitov, 2011]). The process of numbering took considerable time.
After numbering, the MPGN algorithm showed very good results. This work aims to solve the problem of searching in big index structures by proposing a special kind of hashing, so-called “multi-layer hashing”, i.e. by recursively applying the same specialized hash function to build the hash tables and to resolve collisions in them. In other words, the main idea is to apply the specialized hash functions in depth for as long as needed. This approach is called “Natural Language Addressing” (NLA) [Ivanova et al, 2012a; Ivanova et al, 2013a; Ivanova et al, 2013d].

The common-sense meaning of the concept “address” is a description of the location (of a person or organization), as written or printed on mail as directions for delivery [AHD, 2009]; the conventional form by which the location of a building is described [Collins, 2003]; or a sign in front of a house or business carrying the conventional form by which its location is described [WordNet, 2012]. We use the concept “address” in the sense accepted in computer science: the code that identifies where a piece of information is stored [WordNet, 2012]; a name or number used in information storage or retrieval that is assigned to a specific memory location; or the memory location identified by this name or number [AHD, 2009].

Natural Language Addressing (NLA, also written NL-addressing) is a way to access information using natural language words as paths to the information. For this purpose, the internal encoding of the letters is used to generate the corresponding path. The idea is very simple: it is based on the computer's internal representation of a word as a string of codes in some encoding system (ASCII, Unicode, etc.). For example, the ASCII encoding of the word “accession” is (97, 99, 99, 101, 115, 115, 105, 111, 110).
This sequence of codes can be used as an array for multi-layer hashing, indicating a path to the point where the corresponding information may be stored. The main problem with this approach is that words have different lengths and, in addition, several words may form one phrase and thus be treated as a single concept. This means that we need tools for managing multi-layer hashing with variable path lengths in an integrated structure.

Due to the complexity of the MPGN algorithm and the corresponding program system realized in [Mitov, 2011], their redesign and reprogramming to use NLA should be done only after the efficiency of an NLA realization has been proven. For this reason we consider several types of semi-structured data: small datasets (dictionaries, thesauruses, ontologies), and middle-size and large RDF triple or quadruple datasets; we present the corresponding experiments and a practical implementation. Accordingly, the PhD research aims to propose an information model for NL-addressing and a corresponding access method, as well as tools for working in this style, their main principles, and their storing functions.

The results presented in this work were implemented at the V.M. Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine, Kiev. They have been used for storing dictionaries, thesauruses, ontologies, and RDF graphs extracted from multiple documents in its own databases as well as from different internet sources.
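
The addressing idea described in the abstract can be illustrated with a minimal sketch. It is not the implementation from the dissertation: it assumes that each layer of the multi-layer hash is a plain Python dictionary keyed by one character code, and the names `word_to_path`, `NLAStore`, `put`, and `get` are hypothetical, introduced only for this example. Python's built-in `ord` returns Unicode code points, which coincide with ASCII codes for ASCII characters.

```python
from typing import Any, Dict, List, Optional

# Hypothetical marker key for the value stored at the end of a path.
# Character codes are non-negative, so -1 cannot clash with them.
_VALUE = -1


def word_to_path(text: str) -> List[int]:
    """Turn a word or phrase into its path: one character code per layer."""
    return [ord(ch) for ch in text]


class NLAStore:
    """Illustrative nested-dictionary store addressed by words or phrases."""

    def __init__(self) -> None:
        self._root: Dict[int, Any] = {}

    def put(self, text: str, value: Any) -> None:
        # Descend one layer per character code, creating layers as needed.
        node = self._root
        for code in word_to_path(text):
            node = node.setdefault(code, {})
        node[_VALUE] = value

    def get(self, text: str) -> Optional[Any]:
        # Follow the same path; return None if it ends early or holds no value.
        node = self._root
        for code in word_to_path(text):
            node = node.get(code)
            if node is None:
                return None
        return node.get(_VALUE)


print(word_to_path("accession"))
# [97, 99, 99, 101, 115, 115, 105, 111, 110]

store = NLAStore()
store.put("accession", "the act of acceding")
store.put("access", "the right to enter")                # shorter path, shared prefix
store.put("access point", "a place where entry is possible")  # phrase as one concept
print(store.get("access"))
```

Because the path is simply as long as the word or phrase, words of different lengths and multi-word concepts share one structure, which is the variable-path-length requirement mentioned above. A disk-based realization would need a persistent layer structure in place of in-memory dictionaries.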
Number of pages: 340
Year of publication: 2014
Accessibility: Open