Online analytical processing explained

In computing, online analytical processing, or OLAP, is an approach to quickly answer multi-dimensional analytical (MDA) queries.[1] The term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).[2] OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing and data mining.[3] Typical applications of OLAP include business reporting for sales, marketing, management reporting, business process management (BPM),[4] budgeting and forecasting, financial reporting and similar areas, with new applications emerging, such as agriculture.

OLAP tools enable users to analyse multidimensional data interactively from multiple perspectives. OLAP consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.[5] Consolidation involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends. By contrast, the drill-down is a technique that allows users to navigate through the details. For instance, users can view the sales by individual products that make up a region's sales. Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP cube and view (dicing) the slices from different viewpoints. These viewpoints are sometimes called dimensions (such as looking at the same sales by salesperson, or by date, or by customer, or by product, or by region, etc.).
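The three operations can be illustrated with a minimal sketch over an in-memory list of records. The records, the field names and the helpers `aggregate` and `slice_rows` are hypothetical and chosen only for illustration; real OLAP tools work against a cube rather than a flat list.

```python
from collections import defaultdict

# Hypothetical sales records: (region, office, product, amount).
SALES = [
    ("East", "Boston", "Widget", 100.0),
    ("East", "Boston", "Gadget", 50.0),
    ("East", "New York", "Widget", 200.0),
    ("West", "Seattle", "Widget", 75.0),
    ("West", "Seattle", "Gadget", 25.0),
]
FIELDS = ("region", "office", "product", "amount")


def aggregate(rows, by):
    """Sum the amount, grouped by the named dimension fields."""
    idx = [FIELDS.index(name) for name in by]
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[3]
    return dict(totals)


def slice_rows(rows, **fixed):
    """Keep only rows whose dimensions match the given fixed values."""
    return [
        row for row in rows
        if all(row[FIELDS.index(k)] == v for k, v in fixed.items())
    ]


# Roll-up: offices are consolidated into regions.
by_region = aggregate(SALES, by=["region"])
# Drill-down: one region's total is broken into its products.
east_by_product = aggregate(slice_rows(SALES, region="East"), by=["product"])
# Slice on one product, then view the slice along the office dimension.
widgets_by_office = aggregate(slice_rows(SALES, product="Widget"), by=["office"])

print(by_region)
print(east_by_product)
print(widgets_by_office)
```

The same `aggregate` helper serves all three operations; only the filter and the grouping dimension change, which is what makes interactive navigation between views cheap to express.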

Databases configured for OLAP use a multidimensional data model, allowing for complex analytical and ad hoc queries with a rapid execution time.[6] They borrow aspects of navigational databases, hierarchical databases and relational databases.

OLAP is typically contrasted with OLTP (online transaction processing), which is generally characterized by much simpler queries, in larger volume, that process transactions rather than serve business intelligence or reporting. Whereas OLAP systems are mostly optimized for reads, OLTP systems have to process all kinds of queries (read, insert, update and delete).

Overview of OLAP systems

At the core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). It consists of numeric facts called measures that are categorized by dimensions. The measures are placed at the intersections of the hypercube, which is spanned by the dimensions as a vector space. The usual interface to manipulate an OLAP cube is a matrix interface, like Pivot tables in a spreadsheet program, which performs projection operations along the dimensions, such as aggregation or averaging.

The cube metadata is typically created from a star schema or snowflake schema or fact constellation of tables in a relational database. Measures are derived from the records in the fact table and dimensions are derived from the dimension tables.

Each measure can be thought of as having a set of labels, or metadata, associated with it. A dimension is what describes these labels; it provides information about the measure.

A simple example would be a cube that contains a store's sales as a measure, and Date/Time as a dimension. Each Sale has a Date/Time label that describes more about that sale.

For example:

Sales Fact Table
+-------------+----------+
| sale_amount | time_id  |
+-------------+----------+            Time Dimension
|     2008.10 |     1234 |----+     +---------+-------------------+
+-------------+----------+    |     | time_id | timestamp         |
                              |     +---------+-------------------+
                              +---->|    1234 | 20080902 12:35:43 |
                                    +---------+-------------------+
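The relationship between the fact table and the time dimension can be sketched with Python's built-in `sqlite3` module. The table names `sales_fact` and `time_dim` are assumptions made for this illustration; the column names and values follow the example above.

```python
import sqlite3

# In-memory database holding one fact row and its dimension row.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, timestamp TEXT);
    CREATE TABLE sales_fact (
        sale_amount REAL,
        time_id INTEGER REFERENCES time_dim(time_id)
    );
    INSERT INTO time_dim VALUES (1234, '20080902 12:35:43');
    INSERT INTO sales_fact VALUES (2008.10, 1234);
    """
)

# The measure (sale_amount) picks up its label by joining on time_id.
row = conn.execute(
    """
    SELECT f.sale_amount, d.timestamp
    FROM sales_fact AS f
    JOIN time_dim AS d ON d.time_id = f.time_id
    """
).fetchone()
print(row)
```

The fact table stores only the measure and a key; everything descriptive about the sale lives in the dimension table and is attached at query time.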

Multidimensional databases

Multidimensional structure is defined as "a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data".[5] The structure is broken into cubes, and data is stored and accessed within the confines of each cube. "Each cell within a multidimensional structure contains aggregated data related to elements along each of its dimensions".[5] Even when data is manipulated, it remains easy to access, stays interrelated, and continues to constitute a compact database format. Multidimensional structure is quite popular for analytical databases that use online analytical processing (OLAP) applications.[5] Analytical databases use this structure because of its ability to deliver answers to complex business queries swiftly. Unlike other models, it lets data be viewed from different angles, which gives a broader perspective on a problem.[7]

Aggregations

It has been claimed that for complex queries OLAP cubes can produce an answer in around 0.1% of the time required for the same query on OLTP relational data.[8][9] The most important mechanism in OLAP which allows it to achieve such performance is the use of aggregations. Aggregations are built from the fact table by changing the granularity on specific dimensions and aggregating data up along these dimensions, using an aggregate function (or aggregation function). The number of possible aggregations is determined by every possible combination of dimension granularities.
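As a rough illustration of how quickly the number of possible aggregations grows, the following sketch counts the granularity combinations for three dimensions. The hierarchies are hypothetical, and the sketch assumes each dimension has a single linear hierarchy that ends in an "ALL" level (the dimension fully aggregated away).

```python
from itertools import product
from math import prod

# Hypothetical dimension hierarchies, listed from finest to coarsest level.
HIERARCHIES = {
    "time": ["day", "month", "year", "ALL"],
    "geography": ["office", "region", "ALL"],
    "product": ["item", "category", "ALL"],
}

# Every aggregation picks exactly one level per dimension.
combinations = list(product(*HIERARCHIES.values()))
total = prod(len(levels) for levels in HIERARCHIES.values())

# One combination (finest level everywhere) is the base data itself.
print(total, "granularity combinations,", total - 1, "of them aggregations")
```

With only three small dimensions there are already 4 x 3 x 3 = 36 combinations; adding dimensions or levels multiplies the count, which is why fully precomputing every aggregation is often impractical.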

The combination of all possible aggregations and the base data contains the answers to every query which can be answered from the data.[10]

Because there are usually many aggregations that can be calculated, often only a predetermined number are fully calculated; the remainder are solved on demand. The problem of deciding which aggregations (views) to calculate is known as the view selection problem. View selection can be constrained by the total size of the selected set of aggregations, the time to update them from changes in the base data, or both. The objective of view selection is typically to minimize the average time to answer OLAP queries, although some studies also minimize the update time. View selection is NP-complete. Many approaches to the problem have been explored, including greedy algorithms, randomized search, genetic algorithms and the A* search algorithm.
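A greedy approach to view selection can be sketched as follows. This is a simplified illustration, not a definitive algorithm: it assumes a linear cost model in which answering a query costs as many rows as the smallest materialized view able to answer it, it limits the number of extra views rather than their total size, and the row counts in `SIZE` are hypothetical.

```python
DIMS = ("part", "supplier", "customer")

# Hypothetical row counts for each group-by (view) over three dimensions.
SIZE = {
    frozenset(DIMS): 6_000_000,
    frozenset({"part", "customer"}): 6_000_000,
    frozenset({"part", "supplier"}): 800_000,
    frozenset({"supplier", "customer"}): 6_000_000,
    frozenset({"part"}): 200_000,
    frozenset({"supplier"}): 10_000,
    frozenset({"customer"}): 100_000,
    frozenset(): 1,
}


def greedy_select(size, k):
    """Pick up to k views to materialize in addition to the base view."""
    top = max(size, key=len)              # base view, always materialized
    cost = {v: size[top] for v in size}   # current cost of answering each view
    chosen = [top]

    def benefit(v):
        # A view v can answer any view w that groups by a subset of v's dimensions.
        return sum(max(0, cost[w] - size[v]) for w in size if w <= v)

    for _ in range(k):
        candidates = [v for v in size if v not in chosen]
        if not candidates:
            break
        best = max(candidates, key=benefit)
        if benefit(best) == 0:
            break
        chosen.append(best)
        for w in size:
            if w <= best:
                cost[w] = min(cost[w], size[best])
    return chosen


selected = greedy_select(SIZE, 2)
print([sorted(v) for v in selected])
```

At each step the sketch materializes the view that saves the most rows across all queries it can answer. Here the first pick is the (part, supplier) view, because it is much smaller than the base data and serves four group-bys.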

Some aggregation functions can be computed for the entire OLAP cube by precomputing values for each cell, and then computing the aggregation for a roll-up of cells by aggregating these aggregates, applying a divide and conquer algorithm to the multidimensional problem to compute them efficiently. For example, the overall sum of a roll-up is just the sum of the sub-sums in each cell. Functions that can be decomposed in this way are called decomposable aggregation functions, and include COUNT, MAX, MIN, and SUM, which can be computed for each cell and then directly aggregated; these are known as self-decomposable aggregation functions.

In other cases, the aggregate function can be computed by computing auxiliary numbers for cells, aggregating these auxiliary numbers, and finally computing the overall number at the end; examples include AVERAGE (tracking sum and count, dividing at the end) and RANGE (tracking max and min, subtracting at the end). In still other cases, the aggregate function cannot be computed without analyzing the entire set at once, though in some cases approximations can be computed; examples include DISTINCT COUNT, MEDIAN, and MODE. For example, the median of a set is not the median of medians of subsets. These latter functions are difficult to implement efficiently in OLAP, as they require computing the aggregate function on the base data, either computing them online (slow) or precomputing them for possible roll-ups (large space).
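The three classes of aggregate function can be illustrated with a small sketch. The values and their partition into cells are hypothetical.

```python
from statistics import median

# Hypothetical base values, partitioned into three cube cells.
cells = [[1, 2, 9], [4, 5], [6, 7, 8, 100]]
flat = [x for cell in cells for x in cell]

# Self-decomposable: the aggregate of the per-cell aggregates is the answer.
total = sum(sum(c) for c in cells)
count = sum(len(c) for c in cells)
largest = max(max(c) for c in cells)
smallest = min(min(c) for c in cells)

# Decomposable through auxiliary values, combined only at the end.
average = total / count
value_range = largest - smallest

# Not decomposable: combining per-cell medians gives the wrong answer.
median_of_medians = median(median(c) for c in cells)
true_median = median(flat)

print(total, count, average, value_range)
print(median_of_medians, true_median)
```

The per-cell medians are 2, 4.5 and 7.5, whose median is 4.5, whereas the median of all nine values is 6. The sum, count, average and range computed from per-cell values match those computed from the flat data.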

Types

OLAP systems have been traditionally categorized using the following taxonomy.[11]

Multidimensional OLAP (MOLAP)

MOLAP (multi-dimensional online analytical processing) is the classic form of OLAP and is sometimes referred to as just OLAP. MOLAP stores data in optimized multi-dimensional array storage, rather than in a relational database.

Some MOLAP tools require the pre-computation and storage of derived data, such as consolidations – the operation known as processing. Such MOLAP tools generally utilize a pre-calculated data set referred to as a data cube. The data cube contains all the possible answers to a given range of questions. As a result, they have a very fast response to queries. On the other hand, updating can take a long time depending on the degree of pre-computation. Pre-computation can also lead to what is known as data explosion.

Other MOLAP tools, particularly those that implement the functional database model, do not pre-compute derived data but make all calculations on demand, other than those that were previously requested and stored in a cache.

Advantages of MOLAP

Disadvantages of MOLAP

Products

Examples of commercial products that use MOLAP are Cognos Powerplay, Oracle Database OLAP Option, MicroStrategy, Microsoft Analysis Services, Essbase, TM1, Jedox, and icCube.

Relational OLAP (ROLAP)

ROLAP works directly with relational databases and does not require pre-computation. The base data and the dimension tables are stored as relational tables and new tables are created to hold the aggregated information. It depends on a specialized schema design. This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement. ROLAP tools do not use pre-calculated data cubes but instead pose the query to the standard relational database and its tables in order to bring back the data required to answer the question. ROLAP tools feature the ability to ask any question because the methodology is not limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.
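The equivalence between slicing and adding a "WHERE" clause can be sketched with Python's built-in `sqlite3` module. The `sales` table, its columns and the helper `totals_by` are hypothetical; an actual ROLAP tool would generate SQL against a carefully designed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE sales (region TEXT, product TEXT, year INTEGER, amount REAL);
    INSERT INTO sales VALUES
      ('East', 'Widget', 2007, 100.0),
      ('East', 'Gadget', 2007, 50.0),
      ('East', 'Widget', 2008, 120.0),
      ('West', 'Widget', 2008, 75.0),
      ('West', 'Gadget', 2008, 25.0);
    """
)

ALLOWED = {"region", "product", "year"}


def totals_by(dimension, **filters):
    """Total sales per value of one dimension; each filter adds a WHERE term."""
    if dimension not in ALLOWED or not set(filters) <= ALLOWED:
        raise ValueError("unknown dimension")
    sql = f"SELECT {dimension}, SUM(amount) FROM sales"
    if filters:
        sql += " WHERE " + " AND ".join(f"{name} = ?" for name in filters)
    sql += f" GROUP BY {dimension} ORDER BY {dimension}"
    return conn.execute(sql, tuple(filters.values())).fetchall()


print(totals_by("region"))                               # whole data set
print(totals_by("region", year=2008))                    # slice on year
print(totals_by("product", year=2008, region="West"))    # dice on two dimensions
```

Column names are checked against a fixed set before being placed in the SQL text, and filter values are passed as bound parameters, so only the query shape changes from one slice to the next.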

While ROLAP uses a relational database source, generally the database must be carefully designed for ROLAP use. A database which was designed for OLTP will not function well as a ROLAP database. Therefore, ROLAP still involves creating an additional copy of the data. However, since it is a database, a variety of technologies can be used to populate the database.

Advantages of ROLAP

Disadvantages of ROLAP

Performance of ROLAP

In the OLAP industry, ROLAP is usually perceived as being able to scale for large data volumes but suffering from slower query performance than MOLAP. The OLAP Survey, the largest independent survey across all major OLAP products, conducted over six years (2001 to 2006), consistently found that companies using ROLAP reported slower performance than those using MOLAP, even when data volumes were taken into consideration.

However, as with any survey there are a number of subtle issues that must be taken into account when interpreting the results.

Downside of flexibility

Some companies select ROLAP because they intend to re-use existing relational database tables—these tables will frequently not be optimally designed for OLAP use. The superior flexibility of ROLAP tools allows this less-than-optimal design to work, but performance suffers. MOLAP tools in contrast would force the data to be re-loaded into an optimal OLAP design.

Hybrid OLAP (HOLAP)

The undesirable trade-off between additional ETL cost and slow query performance has ensured that most commercial OLAP tools now use a "Hybrid OLAP" (HOLAP) approach, which allows the model designer to decide which portion of the data will be stored in MOLAP and which portion in ROLAP.

There is no clear agreement across the industry as to what constitutes "Hybrid OLAP", except that a database will divide data between relational and specialized storage.[12] For example, for some vendors, a HOLAP database will use relational tables to hold the larger quantities of detailed data and use specialized storage for at least some aspects of the smaller quantities of more-aggregate or less-detailed data. HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the capabilities of both approaches. HOLAP tools can utilize both pre-calculated cubes and relational data sources.

Vertical partitioning

In this mode HOLAP stores aggregations in MOLAP for fast query performance, and detailed data in ROLAP to optimize time of cube processing.

Horizontal partitioning

In this mode HOLAP stores some slice of the data, usually the more recent one (i.e. sliced by the Time dimension), in MOLAP for fast query performance, and older data in ROLAP. Moreover, some dices can be stored in MOLAP and others in ROLAP, leveraging the fact that in a large cuboid there will be dense and sparse subregions.[13]
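Horizontal partitioning by time can be sketched as a router that sends each query to one of two stores. Everything here is a stand-in chosen for illustration: a dictionary of pre-aggregated totals plays the MOLAP role, an in-memory SQLite table plays the ROLAP role, and `CUTOFF_YEAR` is a hypothetical boundary.

```python
import sqlite3

CUTOFF_YEAR = 2008  # hypothetical boundary: this year and later live in "MOLAP"

# Stand-in for the MOLAP store: pre-aggregated totals per (year, region).
molap = {(2008, "East"): 120.0, (2008, "West"): 100.0}

# Stand-in for the ROLAP store: detailed rows for older data.
rolap = sqlite3.connect(":memory:")
rolap.executescript(
    """
    CREATE TABLE sales (year INTEGER, region TEXT, amount REAL);
    INSERT INTO sales VALUES
      (2006, 'East', 40.0),
      (2007, 'East', 100.0),
      (2007, 'East', 50.0),
      (2007, 'West', 30.0);
    """
)


def total_sales(year, region):
    """Answer from the pre-aggregated store when possible, else from SQL."""
    if year >= CUTOFF_YEAR:
        return molap.get((year, region), 0.0)
    row = rolap.execute(
        "SELECT COALESCE(SUM(amount), 0.0) FROM sales WHERE year = ? AND region = ?",
        (year, region),
    ).fetchone()
    return row[0]


print(total_sales(2008, "East"))  # served from the pre-aggregated store
print(total_sales(2007, "East"))  # aggregated on demand from detailed rows
```

Recent data is answered by a lookup, while older data is aggregated on demand, which mirrors the trade-off between query speed and processing cost described above.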

Products

The first product to provide HOLAP storage was Holos, but the technology also became available in other commercial products such as Microsoft Analysis Services, Oracle Database OLAP Option, MicroStrategy and SAP AG BI Accelerator. The hybrid OLAP approach combines ROLAP and MOLAP technology, benefiting from the greater scalability of ROLAP and the faster computation of MOLAP. For example, a HOLAP server may store large volumes of detailed data in a relational database, while aggregations are kept in a separate MOLAP store. Microsoft SQL Server 7.0 OLAP Services supports a hybrid OLAP server.

Comparison

Each type has certain benefits, although there is disagreement about the specifics of the benefits between providers.

Other types

The following acronyms are also sometimes used, although they are not as widespread as the ones above:

APIs and query languages

Unlike relational databases, which had SQL as the standard query language and widespread APIs such as ODBC, JDBC and OLE DB, there was no such unification in the OLAP world for a long time. The first real standard API was the OLE DB for OLAP specification from Microsoft, which appeared in 1997 and introduced the MDX query language. Several OLAP vendors, both server and client, adopted it. In 2001 Microsoft and Hyperion announced the XML for Analysis specification, which was endorsed by most of the OLAP vendors. Since this also used MDX as a query language, MDX became the de facto standard.[23] Since September 2011, LINQ can be used to query SSAS OLAP cubes from Microsoft .NET.[24]

Products

History

The first product that performed OLAP queries was Express, which was released in 1970 (and acquired by Oracle in 1995 from Information Resources).[25] However, the term did not appear until 1993, when it was coined by Edgar F. Codd, who has been described as "the father of the relational database". Codd's paper[1] resulted from a short consulting assignment which Codd undertook for the former Arbor Software (later Hyperion Solutions, and in 2007 acquired by Oracle), as a sort of marketing coup.

The company had released its own OLAP product, Essbase, a year earlier. As a result, Codd's "twelve laws of online analytical processing" were explicit in their reference to Essbase. There was some ensuing controversy, and when Computerworld learned that Codd was paid by Arbor, it retracted the article. The OLAP market experienced strong growth in the late 1990s, with dozens of commercial products going into market. In 1998, Microsoft released its first OLAP server, Microsoft Analysis Services, which drove wide adoption of OLAP technology and moved it into the mainstream.

Product comparison

See main article: Comparison of OLAP servers.

OLAP clients

OLAP clients include many spreadsheet programs like Excel, web applications, SQL, dashboard tools, etc. Many clients support interactive data exploration where users select dimensions and measures of interest. Some dimensions are used as filters (for slicing and dicing the data) while others are selected as the axes of a pivot table or pivot chart. Users can also vary the aggregation level of the displayed view (for drilling down or rolling up). Clients can also offer a variety of graphical widgets such as sliders, geographic maps, heat maps and more, which can be grouped and coordinated as dashboards. An extensive list of clients appears in the visualization column of the comparison of OLAP servers table.

Market structure

Below is a list of top OLAP vendors in 2006, with figures in millions of US Dollars.[26]

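+----------------+-------------------------------+----------------------+
| Vendor         | Global Revenue (US$ millions) | Consolidated company |
+----------------+-------------------------------+----------------------+
|                | 1,806                         | Microsoft            |
|                | 1,077                         | Oracle               |
|                | 735                           | IBM                  |
|                | 416                           | SAP                  |
|                | 416                           | MicroStrategy        |
|                | 330                           | SAP                  |
| Cartesis (SAP) | 210                           | SAP                  |
|                | 205                           | IBM                  |
|                | 199                           | Infor                |
|                | 159                           | Oracle               |
| Others         | 152                           | Others               |
| Total          | 5,700                         |                      |
+----------------+-------------------------------+----------------------+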

Open source

Notes and References

  1. Web site: Providing OLAP (On-line Analytical Processing) to User-Analysts: An IT Mandate . Codd & Date, Inc . Codd E.F. . Codd S.B. . Salley C.T. . 1993 . 2008-03-05 .
  2. Web site: 1997 . OLAP Council White Paper . 2008-03-18 . OLAP Council.
  3. Book: Business Intelligence for Telecommunications . CRC Press . Deepak Pareek . 2007 . 294 pp . 978-0-8493-8792-0 . 2008-03-18.
  4. Book: Business Process Management:A Data Cube To Analyze Business Process Simulation Data For Decision Making . . Apostolos Benisis . 2010 . 204 pp . 978-3-639-22216-6.
  5. O'Brien, J. A., & Marakas, G. M. (2009). Management information systems (9th ed.). Boston, MA: McGraw-Hill/Irwin.
  6. Web site: Introduction to OLAP – Slice, Dice and Drill! . Data Warehousing Review . Hari Mailvaganam . 2007 . 2008-03-18.
  7. Williams, C., Garza, V.R., Tucker, S, Marcus, A.M. (1994, January 24). Multidimensional models boost viewing options. InfoWorld, 16(4)
  8. Web site: MicroStrategy, Incorporated . 1995 . The Case for Relational OLAP . 2008-03-20.
  9. Surajit Chaudhuri . Umeshwar Dayal . An overview of data warehousing and OLAP technology . SIGMOD Rec. . 26 . 1 . 1997 . 65 . 10.1145/248603.248616 . 10.1.1.211.7178 . 8125630 .
  10. Gray . Jim . Jim Gray (computer scientist) . Chaudhuri . Surajit . Layman . Andrew . Reichart . Don . Venkatrao . Murali . Pellow . Frank . Pirahesh . Hamid . Data Cube: Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals . J. Data Mining and Knowledge Discovery . 1 . 1 . 29–53 . 1997 . 2008-03-20. 10.1023/A:1009726021843 . cs/0701155 . 12502175 .
  11. Web site: OLAP architectures . OLAP Report . Nigel Pendse . 2006-06-27 . 2008-03-17 . https://web.archive.org/web/20080124155954/http://www.olapreport.com/Architectures.htm . January 24, 2008 .
  12. Bach Pedersen . Torben . S. Jensen . Christian . Multidimensional Database Technology . Distributed Systems Online . 34 . 12 . 0018-9162 . 40–46 . December 2001 . 10.1109/2.970558 .
  13. cs/0702143. 10.1016/j.ins.2005.09.005 . Attribute value reordering for efficient hybrid OLAP . 2006 . Kaser . Owen . Lemire . Daniel . Information Sciences . 176 . 16 . 2304–2336 .
  14. News: This Week in Graph and Entity Analytics. 2016-12-07. Datanami. 2018-03-08.
  15. News: Cambridge Semantics Announces AnzoGraph Support for Amazon Neptune and Graph Databases. 2018-02-15. Database Trends and Applications. 2018-03-08.
  16. Web site: Multi-Dimensional, Phrase-Based Summarization in Text Cubes . Tao. Fangbo . Zhuang. Honglei . Yu. Chi Wang. Qi. Wang . Taylor. Cassidy . Lance. Kaplan . Clare. Voss. Han . Jiawei . 2016.
  17. Liem. David A.. Murali. Sanjana. Sigdel. Dibakar. Shi. Yu. Wang. Xuan. Shen. Jiaming. Choi. Howard. Caufield. John H.. Wang. Wei. Ping. Peipei. Han. Jiawei. 2018-10-01. Phrase mining of textual data to analyze extracellular matrix protein patterns across cardiovascular disease. American Journal of Physiology. Heart and Circulatory Physiology. 315. 4. H910–H924. 10.1152/ajpheart.00175.2018. 1522-1539. 29775406. 6230912.
  18. Book: Lee . S. . Kim . N. . Kim . J. . 2014 IEEE Fourth International Conference on Big Data and Cloud Computing . A Multi-dimensional Analysis and Data Cube for Unstructured Text and Social Media . 2014 . 761–764 . 10.1109/BDCloud.2014.117. 978-1-4799-6719-3 . 229585 .
  19. Ding . B. . Lin. X.C.. Han. J.. Zhai. C.. Srivastava. A.. Oza. N.C.. Efficient Keyword-Based Search for Top-K Cells in Text Cube . IEEE Transactions on Knowledge and Data Engineering . December 2011 . 23 . 12 . 1795–1810 . 10.1109/TKDE.2011.34. 13960227 .
  20. Book: Ding . B. . Zhao . B. . Lin . C.X. . Han . J. . Zhai . C. . 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010) . TopCells: Keyword-based search of top-k aggregated documents in text cube . 2010 . 381–384 . 10.1109/ICDE.2010.5447838. 978-1-4244-5445-7 . 10.1.1.215.7504 . 14649087 .
  21. Book: Lin . C.X. . Ding . B. . Han . K. . Zhu . F. . Zhao . B. . 2008 Eighth IEEE International Conference on Data Mining . Text Cube: Computing IR Measures for Multidimensional Text Database Analysis . IEEE Data Mining . 2008 . 905–910 . 10.1109/icdm.2008.135. 978-0-7695-3502-9 . 1522480 . https://ink.library.smu.edu.sg/sis_research/1008 .
  22. Book: Liu . X. . Tang . K. . Hancock . J. . Han . J. . Song . M. . Xu . R. . Pokorny . B. . Greenberg . A.M. . Kennedy . W.G. . Bos . N.D. . A Text Cube Approach to Human, Social and Cultural Behavior in the Twitter Stream . Springer . Berlin, Heidelberg . 978-3-642-37209-4 . 321–330 . 7812 . Social Computing, Behavioral-Cultural Modeling and Prediction. SBP 2013. Lecture Notes in Computer Science. 2013-03-21 .
  23. Web site: Commentary: OLAP API wars . OLAP Report . Nigel Pendse . 2007-08-23 . 2008-03-18 . https://web.archive.org/web/20080528220113/http://www.olapreport.com/Comment_APIs.htm . May 28, 2008 .
  24. Web site: SSAS Entity Framework Provider for LINQ to SSAS OLAP.
  25. Web site: The origins of today's OLAP products . OLAP Report . 2007-08-23 . Nigel Pendse . November 27, 2007 . https://web.archive.org/web/20071221044811/http://www.olapreport.com/origins.htm . December 21, 2007 .
  26. Web site: OLAP Market . OLAP Report . Nigel Pendse . 2006 . 2008-03-17.
  27. News: Yegulalp . Serdar . 2015-06-11 . LinkedIn fills another SQL-on-Hadoop niche . InfoWorld . 2016-11-19.
  28. Web site: Apache Doris . Github . Apache Doris Community . 5 April 2023.
  29. Web site: An in-process SQL OLAP database management system . 2022-12-10 . DuckDB .
  30. Web site: Anand . Chillar . 2022-11-17 . Common Crawl On Laptop - Extracting Subset Of Data . 2022-12-10 . Avil Page .