Thursday, 6 September 2012

The Importance of In-memory Databases


It has been predicted that in-memory computing will be one of the Top 10 technologies of 2012. In-memory databases (IMDBs) are a critical part of this paradigm. Through this introductory article, let’s get acquainted with the basics of IMDBs. We will look at what they are, why they were developed, and the key differences between IMDBs and traditional disk DBs.
Gartner has predicted in-memory computing to be one of the top 10 strategic technologies of 2012. In-memory computing is expected to have a disruptive impact on the data warehousing domain in the coming two years. A series of such products has been released in the last two years, one of the most famous being SAP’s HANA. Real-time analytics and sub-second response times for enterprise applications require high-performance data management systems. This, in turn, has led to a surge in the importance of IMDBs.
Network bandwidth has increased dramatically and multi-core processors are available even in mobile phones. However, disk I/O speed has not been increasing at the same rate, which has crippled traditional databases. IMDBs are the first step towards high-performance databases. With ever-growing RAM sizes and the ability to address more RAM (with 64-bit address spaces), IMDBs are in vogue.

What is an in-memory database?

Wikipedia defines an in-memory database as, “An in-memory database (IMDB; also, main memory database system or MMDB) is a database management system that primarily relies on main memory for computer data storage. It is contrasted with database management systems that employ a disk storage mechanism.”
Margaret H Eich has given a very simple definition of an IMDB/MMDB: a database whose primary data store is main memory.
IMDBs are architected and designed differently from traditional RDBMSs. They have simplified algorithms and mechanisms, which are built with the awareness that all data is going to be in RAM. Figure 1 is a simple illustration of the differences.
Figure 1: IMDB vs Disk DB
IMDBs are also different from simplified storage algorithms like hash-tables or trees, in that IMDBs are “databases” or RDBMSs. Most IMDBs are ACID-compliant relational databases offering SQL. They have all the properties of a traditional RDBMS, but are tuned for data to be in-memory.
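To make the "database, not just a hash table" point concrete, here is a minimal sketch in Python. It assumes SQLite's ":memory:" mode as a stand-in for a dedicated IMDB product (SQLite is not one of the products discussed in this article), and the accounts table and the transfer helper are hypothetical names chosen for illustration. It shows a SQL transaction over RAM-resident data that either applies fully or not at all.

```python
import sqlite3

# Assumption: SQLite's ":memory:" mode stands in for a dedicated IMDB product.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Credit dst and debit src as one atomic unit; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


def balances(conn):
    """Return {account id: balance} for every account."""
    return dict(conn.execute("SELECT id, balance FROM accounts ORDER BY id"))


transfer(conn, 1, 2, 30)   # succeeds
transfer(conn, 1, 2, 500)  # overdraws account 1, so nothing is applied
print(balances(conn))
```

A plain dictionary would give the same lookup speed, but it would not undo the half-finished second transfer; that guarantee is what makes this a database.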

Getting deeper into the IMDB

IMDBs are not a new phenomenon. The great DB scientist Jim Gray conceptualised this technology 30 years ago and also predicted that in the 2000s these technologies would become widely accepted. IMDBs have existed in the telecom and defense domains since the early 1980s. However, these were built and maintained as internal components. IBM built one of the first in-memory engines (IMS/VS FastPath) way back in 1978.
The first significant commercial IMDB offering was TimesTen (later acquired by Oracle). Since then, almost all top DB vendors have come to offer an IMDB product, such as IBM’s SolidDB and Sybase’s ASE.

Myths about IMDBs

Consider the following assertions:
  • Given the same amount of RAM, disk DBs can perform at the same speed as IMDBs (by using caching technology).
  • If a RAM disk is created and a traditional disk DB is deployed on it, it delivers the same performance as an in-memory database.
  • If the system crashes, all the data stored in IMDBs will be lost.
  • SSDs and flash storage are getting better and better. Using these technologies along with traditional disk DBs yields the same performance as an in-memory database.
  • Since RAM size is limited, sizes of IMDBs are also limited.
  • Finally, IMDBs are not special; they are traditional disk DBs made to run on RAM.
Naturally, all of these are myths. It is close to impossible to create an in-memory DB from a traditional disk database by just changing the OS or hardware environment.
So, internally, how different is an IMDB from a traditional disk DB?
IMDBs are architected and designed keeping in mind the fact that all data is in memory. This actually leads to much simpler design as compared to disk DBs. There are six areas of difference:
  1. Query optimisation: In disk DBs, the I/O cost factor dominates the optimisation. However, in IMDBs there is no such single dominant cost factor, which makes query optimisation very tricky. This is generally solved by assigning constant costs to operations and falling back on rule-based optimisation.
  2. Indexing: More memory-friendly data structures and algorithms are used for indexing. While most disk DBs use B-Tree as a primary indexing data structure/algorithm, IMDBs tend to use T-Tree as a primary indexing data structure/algorithm.
  3. Internal data representation: Compactness of representation dominates concerns for IMDBs. With all data being in memory, IMDBs tend to use direct memory pointers heavily. This is very typical of the IMDB memory page, index data or relation representations.
  4. Durability and recovery: Contrary to popular belief, IMDBs are durable. They use algorithms similar to disk DBs for persistence. However, the buffer management, which is the biggest performance bottleneck for disk DBs, is eliminated. During database loading, IMDBs tend to take a bit more time as they have to load the complete data into memory. Hence, recovery is a bit slower.
  5. Access methodology: Generally, disk DBs offer client-server access over sockets as the primary access method. However, with no disk I/O, socket access would itself become the bottleneck if it were the only method an IMDB offered. Hence, most IMDBs tend to offer shared-memory access as the primary method. In a few cases, JDBC/ODBC interfaces are also supported.
  6. Concurrency control: Due to inherent speed in processing, IMDBs can take coarser locks and also do less to persist them. However, disk DBs take finer locks and take elaborate measures to persist them.
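The durability point (item 4) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular IMDB product implements persistence: the class name TinyDurableStore and its one-JSON-record-per-line log format are hypothetical, and real systems add checkpoints and log truncation. The data lives in a dictionary in RAM, every write is appended to a log on disk first, and recovery replays the log to rebuild the in-memory state.

```python
import json
import os
import tempfile


class TinyDurableStore:
    """Toy key-value store: data lives in RAM, durability comes from a log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._recover()
        self._log = open(log_path, "a", encoding="utf-8")

    def _recover(self):
        # Startup cost grows with the log: everything is reloaded into RAM.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as log:
            for line in log:
                key, value = json.loads(line)
                self.data[key] = value

    def put(self, key, value):
        # Log first, then update memory, so an acknowledged write
        # can be replayed after a crash.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.data[key] = value

    def get(self, key):
        # Reads are served straight from RAM; there is no buffer manager.
        return self.data.get(key)

    def close(self):
        self._log.close()


path = os.path.join(tempfile.mkdtemp(), "store.log")
store = TinyDurableStore(path)
store.put("a", 1)
store.put("a", 2)
store.close()  # simulate a shutdown or crash

recovered = TinyDurableStore(path)  # replays the log into a fresh dict
print(recovered.get("a"))
recovered.close()
```

The replay loop in _recover is also why recovery tends to be slower for an IMDB: the whole data set has to be back in memory before the database can serve requests.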
Figure 2 is a typical architecture for an in-memory database.
Figure 2: IMDB architecture

Typical applications of IMDBs

IMDBs are applicable in all domains that require real-time performance and very low latency. Four domains typically use IMDBs: telecom, financial segments, enterprises, and e-commerce and Web applications. In these spaces, IMDBs are used in a variety of applications (refer to Figure 3).
Figure 3: IMDB usecases

Key offerings in the IMDB space

The IMDB space is dominated by commercial players. Some of the most important ones are Oracle TimesTen, IBM SolidDB, Sybase ASE, ENEA Polyhedra and McObject ExtremeDB. There are also some notable open source solutions. Let us take a look at two of the FOSS IMDBs: CSQL and MonetDB.

CSQL

CSQL is an open source main-memory high-performance RDBMS developed in India. It is one of the fastest open source IMDBs. It is designed to provide high performance on simple SQL queries and DML statements that involve only one table. It supports only the limited set of features used by most real-time applications, such as INSERT, UPDATE and DELETE on a single table, and SELECT with local predicates on a single table.
It provides multiple interfaces such as JDBC, ODBC and other SQL APIs. CSQL offers atomicity, consistency and isolation. It is typically recommended for use as a cache for existing disk-based commercial databases.

MonetDB

MonetDB is an open source high-performance DBMS developed at the National Research Institute for Mathematics and Computer Science in the Netherlands. It was designed to provide high performance on complex queries against large databases, e.g., combining tables with hundreds of columns and multi-million rows.
MonetDB is one of the first database systems to focus its query optimisation effort on exploiting CPU caches. Development of MonetDB began in the 1990s and it was released as open source in 2004. MonetDB has been successfully applied in high-performance applications for data mining, OLAP, GIS, XML Query, and text and multimedia retrieval.

How do IMDBs change the rules of the game?

While existing applications and schemas can directly benefit from IMDB due to performance improvement, if some things are not carefully handled, the full benefit of IMDB is not achieved. Areas that require significant changes at both the conceptual as well as implementation levels are application design, database schema design and data design (partitioning of data).

Application design

Taking advantage of performance benefits offered by an IMDB means redesigning applications to take advantage of the specific strengths of in-memory technology. The main way to achieve this is to push work that is currently done in the application layer down to the database.
This not only allows developers to take advantage of special operations offered by the DBMS, but also reduces the amount of data that must be transferred between application layers. This can lead to substantial performance improvements and can open up new application areas.
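As a small illustration of pushing work down to the database, the sketch below computes the same per-region totals in two ways. It again assumes SQLite's ":memory:" mode as a stand-in for an IMDB, and the orders table and both function names are hypothetical. The first version pulls every row into the application and aggregates there; the second lets the database aggregate and transfers only one row per region.

```python
import sqlite3

# Assumption: SQLite's ":memory:" mode stands in for an IMDB product.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("north", 10), ("south", 5), ("north", 7), ("south", 1)],
)
conn.commit()


def totals_in_application(conn):
    """Pull every row out and aggregate in application code."""
    totals = {}
    for region, amount in conn.execute("SELECT region, amount FROM orders"):
        totals[region] = totals.get(region, 0) + amount
    return totals


def totals_in_database(conn):
    """Let the database aggregate; only one row per region is transferred."""
    query = "SELECT region, SUM(amount) FROM orders GROUP BY region"
    return dict(conn.execute(query))


print(totals_in_application(conn))
print(totals_in_database(conn))
```

Both functions return the same result, but the second moves far less data between layers as the table grows, which is the effect described above.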

Database schema design

An IMDB is most beneficial when the data fits into a single database. This requires efforts to conserve space, so the more elaborate normalisation procedures are not suitable. Also, using precise and apt data types saves storage space. Reduced redundancy, carefully formed columns, precise index creation and efficient data management will help an IMDB yield better performance.
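The effect of choosing a precise data type can be estimated with a little arithmetic. The sketch below uses Python's array module purely to measure raw column sizes; the readings column is hypothetical, and a real IMDB's storage format will add its own overheads. It assumes the common platform sizes of 8 bytes for the "q" type code and 2 bytes for "h".

```python
from array import array

# Hypothetical column: 100,000 sensor readings, all in the range 0..999.
readings = [i % 1000 for i in range(100_000)]

wide = array("q", readings)    # signed 64-bit integers
narrow = array("h", readings)  # signed 16-bit integers, enough for 0..999


def payload_bytes(column):
    """Raw bytes needed to hold the column's values."""
    return column.itemsize * len(column)


print(payload_bytes(wide), payload_bytes(narrow))
```

On typical platforms the narrow column needs a quarter of the memory for exactly the same values, which matters when RAM is the primary store.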

Data design

One of the key changes when moving from traditional disk DBs to in-memory DBs is the space available, so data partitioning and storage assume great significance. An IMDB requires as much related data as possible in a single process space. Too much distribution will introduce network I/O into the processing, thereby degrading performance.
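One common way to keep related data together is to partition every table on the same key. The sketch below is a minimal illustration of that idea, assuming four hypothetical IMDB processes and rows tagged with a customer ID; the function names and sample rows are made up for the example. Because customers and their orders are hashed on the same ID, a query about one customer never has to cross the network.

```python
from collections import defaultdict
from zlib import crc32

NUM_PARTITIONS = 4  # hypothetical number of IMDB processes


def partition_for(customer_id):
    """Stable hash so every row for one customer maps to the same partition."""
    return crc32(str(customer_id).encode("utf-8")) % NUM_PARTITIONS


def place(rows):
    """Group (table, customer_id, payload) rows by their partition."""
    partitions = defaultdict(list)
    for table, customer_id, payload in rows:
        partitions[partition_for(customer_id)].append(
            (table, customer_id, payload)
        )
    return partitions


rows = [
    ("customers", 42, "Asha"),
    ("orders", 42, "order-1"),
    ("orders", 42, "order-2"),
    ("customers", 7, "Ravi"),
    ("orders", 7, "order-3"),
]
placed = place(rows)

for partition, items in sorted(placed.items()):
    print(partition, items)
```

Partitioning the orders table on order ID instead would scatter one customer's data across processes and bring back the network I/O the paragraph above warns about.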

Summing up

IMDBs, combined with various current hardware trends, have the ability to change the performance of enterprise and other applications drastically. This, in turn, will result in tremendous value generation to businesses. Such enhanced performance will also foster the evolution of innovative applications and services.
Do send in your feedback and queries, which can be addressed in our forthcoming articles in the IMDB series.
