IDEA #8G8GM2 Semantic Search and Text Analysis Engine (ATASys Project – Automatic Text Analysis System)

Development of ATASys started in 1984. Even then, it was clear that the growing complexity of the technical infrastructure of developed countries would cause an avalanche-like growth of textual information and would require adequate automated and automatic data-processing facilities. The greatest difficulty lies in processing unstructured data for search, translation and analytics.

The fundamental difference between ATASys and similar systems is that it was not a software project but a linguistic one. Professional programmers handled auxiliary tasks concerning the interaction of application programs with the operating system. The problem formulation, the development of algorithms and the programming of text operations were carried out by specialists in grammar and text semantics, lexicographers, structural linguists and translators. COBOL was chosen as the main programming language. As the project progressed, the methodology of industrial programming developed alongside it, and COBOL's capabilities were extended to work with natural-language texts.

Because of the complexity of the software, the labor-intensive information support and other circumstances, the prototype of a search system based on surface semantics was not tested until 2006. It considered about 50 aspects (“islands of meaning”). On the same array of control texts (about 32,000 documents), the prototype's search accuracy was tens of times higher than Google's.

In 2014, we tested a prototype of a search system that uses lexical semantics. It took approximately 3,200 semantic hues into account. The search query was expanded automatically with synonyms and close associations. This achieved not only a radical reduction in information noise but also a reduction in the number of refining search sessions.
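The automatic query expansion mentioned for the 2014 prototype can be illustrated with a minimal Python sketch. It is not the ATASys implementation, which is written in COBOL and whose lexical data is not described here; the `ASSOCIATIONS` table and the `expand_query` function are hypothetical names chosen for illustration only.

```python
# Hypothetical synonym/association table. The real ATASys lexical
# resources are not described in the text; these entries are invented.
ASSOCIATIONS = {
    "car": ["automobile", "vehicle"],
    "fast": ["quick", "rapid"],
}


def expand_query(query, associations=ASSOCIATIONS):
    """Return the query terms followed by their listed synonyms.

    Order is preserved and duplicates are dropped.
    """
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *associations.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

Calling `expand_query("fast car")` under these assumptions yields the original terms plus their listed associations, which a search step could then match against documents.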
The user gained control over the search parameters: the minimum relevance of results, as a percentage; the number of search results; and the option to receive not merely a link to the document but the text of the most relevant document itself, together with its annotation and abstract.

In 2017, we completed a detailed design of ATASys, which includes deep, multi-level analysis of unstructured texts. The structure of the system is hierarchical and multi-level. The first level consists of four subsystems:
• Management, instrumental and informational subsystem common to all languages.
• Text preparation subsystem: digitization, noise removal, semantic recognition of printed texts.
• Subsystem for analysis of written texts.
• Subsystem for analysis of spoken text.

ATASys Application Area

ATASys proved to be the key to solving a number of important tasks:
• Semantic Web.
• Semantic translation of written texts.
• Semantic translation of spoken texts.
• Analytical processing of texts (annotation, abstracting).
• Qualitative analytics of news data.
• Primary real-time analysis of data for specialized analytics.
• Automatic preliminary examination of patent applications.
• Semantic search in multilingual patent databases at the Interlingua level.
• Automatic checking of patent and other information for plagiarism.
• Extension of corporate DBMS with functions for search and translation of unstructured texts.
• Electronic libraries, archives and museums (LAM):
o automatic formation of extended electronic catalogues;
o automatic formation of semantic versions of LAM;
o semantic search within LAM.
• Preservation of cultural heritage.

Each of these tasks can be solved well only with a systemic approach, since all applications require a single information base, that is, the most complete information about each language supported by the system: morphology, syntax, semantics, informatics, and pragmatics.
An industrial version of the system, based on the latest prototype, can only be created by a large IT company.

The Role of COBOL in the Development of ATASys and of Other Systems with Complex Information Analysis Algorithms

Creating the final version of the ATASys prototype with a small team would have been impossible without COBOL and without developing an industrial programming technology around this language. The architect of our project, who had 10 years of prior COBOL programming experience, noted the evident advantages of this language:
• Standardization of language specifications at the ISO level.
• Continuity of standards (upward compatibility).
• Maximum degree of self-documentation.
• Continuity of application software.
• Operation with data structures of arbitrary complexity.
• Explicit and complete description of data in files and in RAM.
• Large library of built-in procedures and functions.
• Platform independence of source code.
• A number of other properties that are important for projects with a long life cycle.

The principal feature of the project is that it is linguistic. The main developers of the algorithms and software are not professional programmers but experts in computational linguistics.
• ATASys provides all types of text processing based on multi-level semantic analysis of the text: semantic search, semantic translation, automatic creation of the Semantic Web. Levels of analysis used: sign; abbreviation; abridgment; alphanumeric complexes; complex systems; words; (semi-)expressions; phrases; simple sentences; complex sentences; paragraphs; sections (levels 1 to 6); extended bibliographic description of the document; the entire document; package of documents.
• The user's request is automatically expanded to cover semantically similar texts that are relevant to the request.
• The search result is presented not as a list of links but as a set of documents available for immediate reading.
• The relevance of a document is estimated by a complex algorithm that relates the semantics of the query to the semantics of the document. Relevance is quantified on a 100% scale.
• Relevance is calculated not for the document as a whole but for its fragments, represented by paragraphs. This makes it possible to correctly match the information the user requested with the presence of that information in the text.
• Together with the found document, the system generates its full synopsis (“summary”), which displays many relevant paragraphs, and a minimal set of excerpts (“annotation”), which shows the most significant fragments. In effect, the search engine simulates for the user the work of an expert (the referent).
• In addition, the user can read the beginning of the document, which is enough to give a good overview of its subject. In the future, this mode will be served by an automatically extended bibliographic description.
• The maximum number of documents available for review is specified by the user, not by the search engine.
• Search across foreign-language documents is implemented: the user can specify any combination of languages in which documents should be sought. In general, both the query and the required documents can be multilingual.
• Quantifying relevance reduces information noise in the search results by several orders of magnitude.
• The semantic indexing implemented in the search engine solves the problem of automatically creating a multilingual semantic Internet.
• Semantic search creates the information base needed for full semantic translation.
• Semantic translation is the key to creating high-quality speech analysis systems: translating speech into normalized written text, analyzing this text, and synthesizing spoken text.
Without these steps, based on the deep semantics of speech and writing, there is no chance of high-quality automatic processing of spoken language on an industrial scale.
• To manage this cumbersome and algorithmically complex project effectively, a unique project management system was developed.
• Extending the list of supported languages requires minimal additions to the search engine software.
• Natural languages change constantly in both semantics and vocabulary. Continuously accounting for these changes determines the long life span of the system as a commercial product.
• The high intellectual complexity of the product virtually eliminates the possibility of a competing system. The developer of the full-scale system becomes the unrivaled leader in the field of information technology.
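
The paragraph-level relevance scoring and the user-set limits described in the list above can be sketched as follows. This is only an illustration under stated assumptions: the score here is a plain word-overlap percentage standing in for the semantic algorithm, which is not published in this description, and every name (`tokenize`, `paragraph_relevance`, `search`, `min_relevance`, `max_results`) is hypothetical. ATASys itself is implemented in COBOL; Python is used here only for brevity.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def paragraph_relevance(query_terms, paragraph):
    """Percentage (0-100) of query terms that occur in the paragraph.

    A stand-in for semantic scoring: simple word overlap, no semantics.
    """
    if not query_terms:
        return 0.0
    return 100.0 * len(query_terms & tokenize(paragraph)) / len(query_terms)


def search(query, documents, min_relevance=50.0, max_results=10):
    """Score each paragraph, keep documents with a paragraph at or above
    the user's relevance threshold, and return at most max_results."""
    query_terms = tokenize(query)
    results = []
    for name, text in documents.items():
        # Paragraphs are assumed to be separated by blank lines.
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        scored = [(paragraph_relevance(query_terms, p), p) for p in paragraphs]
        relevant = [(s, p) for s, p in scored if s >= min_relevance]
        if not relevant:
            continue
        best_score, best_paragraph = max(relevant, key=lambda sp: sp[0])
        results.append({
            "document": name,
            "relevance": best_score,              # best paragraph's score
            "annotation": best_paragraph,         # most significant fragment
            "summary": [p for _, p in relevant],  # all relevant paragraphs
        })
    results.sort(key=lambda r: r["relevance"], reverse=True)
    return results[:max_results]
```

In this sketch the user, not the engine, sets both the relevance threshold and the number of returned documents, and each result carries the text fragments themselves rather than a link.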
For more information or to license this innovation: