Friday, September 24, 2010

In-memory BI is not the future. It’s the past.

In recent times, one of the most popular subjects related to the field of Business Intelligence (BI) has been In-memory BI technology. The subject gained popularity largely due to the success of QlikTech, provider of the in-memory-based QlikView BI product. Following QlikTech’s lead, many other BI vendors have jumped on the in-memory “hype wagon,” including the software giant, Microsoft, which has been aggressively marketing PowerPivot, their own in-memory database engine.

The increasing hype surrounding in-memory BI has caused BI consultants, analysts and even vendors to spew out endless articles, blog posts and white papers on the subject, many of which have also gone the extra mile to describe in-memory technology as the future of business intelligence, the death blow to the data warehouse and the swan song of OLAP technology.

I find one of these in my inbox every couple of weeks.

Just so it is clear - the concept of in-memory business intelligence is not new. It has been around for many years. The only reason it became widely known recently is that it wasn’t feasible before 64-bit computing became commonly available. Before 64-bit processors, the maximum amount of RAM a computer could utilize was barely 4GB, which is hardly enough to accommodate even the simplest of multi-user BI solutions. Only when 64-bit systems became cheap enough did it become possible to consider in-memory technology a practical option for BI.

The success of QlikTech and the relentless activities of Microsoft’s marketing machine have managed to confuse many in terms of what role in-memory technology plays in BI implementations. And that is why many of the articles out there, which are written by marketers or market analysts who are not proficient in the internal workings of database technology (and assume their readers aren’t either), are usually filled with inaccuracies and, in many cases, pure nonsense.

The purpose of this article is to put both in-memory and disk-based BI technologies in perspective, explain the differences between them and finally lay out, in simple terms, why disk-based BI technology isn’t on its way to extinction. Rather, disk-based BI technology is evolving into something that will significantly limit the use of in-memory technology in typical BI implementations.

But before we get to that, for the sake of those who are not very familiar with in-memory BI technology, here’s a brief introduction to the topic.

Disk and RAM

Generally speaking, your computer has two types of data storage mechanisms – disk (often called a hard disk) and RAM (random access memory). The important differences between them (for this discussion) are outlined in the following table:

                          Disk                               RAM
Read speed                Much slower                        Much faster
Available capacity        Large (15-100x more than RAM)      Small
Cost per GB               Cheap                              Roughly 320x the cost of disk
When the power goes down  Data is preserved                  Data is instantly lost

Most modern computers have 15-100 times more available disk storage than they do RAM. My laptop, for example, has 8GB of RAM and 300GB of available disk space. However, reading data from disk is much slower than reading the same data from RAM. This is one of the reasons why 1GB of RAM costs approximately 320 times as much as 1GB of disk space.

Another important distinction is what happens to the data when the computer is powered down: data stored on disk is unaffected (which is why your saved documents are still there the next time you turn on your computer), but data residing in RAM is instantly lost. So, while you don’t have to re-create your disk-stored Microsoft Word documents after a reboot, you do have to re-load the operating system, re-launch the word processor and reload your document. This is because applications and their internal data are partly, if not entirely, stored in RAM while they are running.

Disk-based Databases and In-memory Databases

Now that we have a general idea of the basic differences between disk and RAM, what are the differences between disk-based and in-memory databases? Well, all data is always kept on hard disks (so that it is preserved even when the power goes down). When we talk about whether a database is disk-based or in-memory, we are talking about where the data resides while it is actively being queried by an application: with disk-based databases, the data is queried while stored on disk; with in-memory databases, the data being queried is first loaded into RAM.

Disk-based databases are engineered to efficiently query data residing on the hard drive. At a very basic level, these databases assume that the entire dataset cannot fit inside the relatively small amount of RAM available, and therefore must perform very efficient disk reads in order for queries to return within a reasonable time frame. The engineers of such databases have the benefit of practically unlimited storage, but must face the challenge of relying on relatively slow disk operations.

On the other hand, in-memory databases work under the opposite assumption that the data can, in fact, fit entirely inside the RAM. The engineers of in-memory databases benefit from utilizing the fastest storage system a computer has (RAM), but have much less of it at their disposal.

That is the fundamental trade-off between disk-based and in-memory technologies: faster reads over a limited amount of data versus slower reads over practically unlimited amounts of data. These are two critical considerations for business intelligence applications, as it is important both to have fast query response times and to have access to as much data as possible.

The Data Challenge

A business intelligence solution (almost) always has a single data store at its center. This data store is usually called a database, data warehouse, data mart or OLAP cube. This is where the data that can be queried by the BI application is stored.

The challenge of creating this data store with traditional disk-based technologies is what gave in-memory technology its 15 minutes (ok, maybe 30 minutes) of fame. Storing the entire data model inside RAM allowed in-memory databases to bypass some of the challenges their disk-based counterparts face, namely the issue of query response times, or ‘slow queries.’

Disk-based BI

By ‘traditional disk-based’ technologies, we typically mean relational database management systems (RDBMS) such as SQL Server, Oracle, MySQL and many others. It’s true that getting a BI solution to perform well with these types of databases as its backbone is far more challenging than simply shoving the entire data model into RAM, where performance gains are immediate simply because RAM is so much faster than disk.

It’s commonly thought that relational databases are too slow for BI queries over data in (or close to) its raw form because they are disk-based. The truth, however, is that the problem lies in how they use the disk and how often they use it.

Relational databases were designed with transactional processing in mind. But having a single database support both high-performance insertions and updates of transactions (i.e., rows in a table) and the types of queries typically executed in BI solutions (e.g., aggregating, grouping, joining) is impossible. These are two mutually exclusive engineering goals; they require completely different architectures at their very core. You simply can’t use the same approach to achieve both optimally.

In addition, SQL, the standard language used to extract transactions from relational databases, is syntactically designed for the efficient fetching of rows, while BI queries rarely need to scan or retrieve entire rows of data. This makes it nearly impossible to formulate an efficient BI query using SQL syntax.

So while relational databases are great as the backbone of operational applications such as CRM, ERP or Web sites, where transactions are frequently and simultaneously inserted, they are a poor choice for supporting analytic applications which usually involve simultaneous retrieval of partial rows along with heavy calculations.
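
To make the row-versus-column point concrete, here is a small, purely illustrative Python sketch (not taken from any particular database engine). Summing a single column over a row-oriented layout touches every field of every record, whereas a columnar layout lets the query read only the column it actually needs; the data and layout below are hypothetical.

```python
# Toy illustration of row-oriented vs. column-oriented access for one
# aggregation. This is a conceptual sketch, not any vendor's implementation.

rows = [  # row store: each record is stored (and fetched) as a whole
    {"region": "EU", "product": "A", "amount": 120.0},
    {"region": "US", "product": "B", "amount": 80.0},
    {"region": "EU", "product": "B", "amount": 45.5},
]

# Row-oriented scan: every field of every row is touched just to sum one column.
total_row_store = sum(r["amount"] for r in rows)

# Column-oriented layout: each field is stored contiguously, so the same
# query reads only the "amount" column and skips the rest entirely.
columns = {
    "region": ["EU", "US", "EU"],
    "product": ["A", "B", "B"],
    "amount": [120.0, 80.0, 45.5],
}
total_column_store = sum(columns["amount"])

assert total_row_store == total_column_store  # same answer, far less data touched
```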

In-memory BI

In-memory databases approach the querying problem by loading the entire dataset into RAM. In so doing, they remove the need to access the disk to run queries, thus gaining an immediate and substantial performance advantage (simply because scanning data in RAM is orders of magnitude faster than reading it from disk). Some of these databases introduce additional optimizations which further improve performance. Most of them also employ compression techniques to represent even more data in the same amount of RAM.
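
As a rough illustration of the kind of compression involved, here is a minimal Python sketch of dictionary (token-based) encoding. The formats real in-memory engines use are proprietary and far more sophisticated; the function names and sample data below are made up for the example.

```python
# Minimal sketch of dictionary (token) encoding: repeated column values are
# replaced by small integer tokens so that more data fits into the same RAM.

def dictionary_encode(column):
    dictionary = {}              # value -> token
    tokens = []
    for value in column:
        if value not in dictionary:
            dictionary[value] = len(dictionary)
        tokens.append(dictionary[value])
    # reverse lookup table used to turn tokens back into values
    reverse = [None] * len(dictionary)
    for value, token in dictionary.items():
        reverse[token] = value
    return tokens, reverse

def dictionary_decode(tokens, reverse):
    return [reverse[t] for t in tokens]

country = ["Germany", "France", "Germany", "Germany", "France", "Spain"]
tokens, reverse = dictionary_encode(country)
print(tokens)                    # [0, 1, 0, 0, 1, 2] -- small ints instead of strings
assert dictionary_decode(tokens, reverse) == country
```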

Regardless of what fancy footwork is used with an in-memory database, storing the entire dataset in RAM has a serious implication: the amount of data you can query with in-memory technology is limited by the amount of free RAM available, and there will always be much less available RAM than available disk space.

The bottom line is that this limited memory space hinders the quality and effectiveness of your BI application: the more historical data you can access and/or the more fields you can query, the better the analysis, insight and, well, intelligence you can get.

You could add more and more RAM, but then the hardware you require becomes exponentially more expensive. The fact that 64-bit computers are cheap and can theoretically support unlimited amounts of RAM does not mean they actually do in practice. A standard desktop-class (read: cheap) computer with standard hardware physically supports up to 12GB of RAM today. If you need more, you move up to a different class of computer which costs about twice as much and will allow you up to 64GB. Beyond 64GB, you can no longer use what is categorized as a personal computer; you will require a full-blown server, which brings you into very expensive computing territory.

It is also important to understand that the amount of RAM you need is affected not only by the amount of data you have, but also by the number of people querying it simultaneously. Having 5-10 people using the same in-memory BI application could easily double the amount of RAM required, because of the intermediate calculations that must be performed to generate each user's query results. A key success factor in most BI solutions is having a large number of users, so you need to tread carefully when considering in-memory technology for real-world BI. Otherwise, your hardware costs may spiral beyond what you are willing or able to spend (today, or in the future as your needs increase).
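
To get a feel for the arithmetic, here is a back-of-envelope sizing sketch in Python. The 20% per-concurrent-user overhead and the 2GB allowance for the OS and application are illustrative assumptions only (the overhead figure comes up in the comments below), not a formula published by any vendor.

```python
# Rough, assumption-laden RAM estimate for a purely in-memory BI deployment.
# per_user_overhead and os_and_app_gb are illustrative guesses, not vendor figures.

def estimated_ram_gb(compressed_data_gb, concurrent_users,
                     per_user_overhead=0.20, os_and_app_gb=2.0):
    working_space = compressed_data_gb * per_user_overhead * concurrent_users
    return compressed_data_gb + working_space + os_and_app_gb

print(estimated_ram_gb(10, 1))    # ~14 GB for a single user
print(estimated_ram_gb(10, 10))   # ~32 GB once 10 users query concurrently
```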

There are other implications to keeping your data model in memory, such as having to re-load it from disk to RAM every time the computer reboots, and not being able to use the computer for much else, because its RAM is all used up by that one data model.

A Note about QlikView and PowerPivot In-memory Technologies

QlikTech is the most active in-memory BI player out there, so their QlikView in-memory technology is worth addressing in its own right. It has been repeatedly described as “unique, patented associative technology” but, in fact, there is nothing “associative” about QlikView’s in-memory technology. QlikView uses a simple tabular data model, stored entirely in memory, with basic token-based compression applied to it. In QlikView’s case, the word associative relates to the functionality of its user interface, not to how the data model is physically stored. Associative databases are a completely different beast and have nothing in common with QlikView’s technology.

PowerPivot uses a similar concept, but is engineered somewhat differently because it’s meant to be used largely within Excel. In this respect, PowerPivot relies on a columnar approach to storage that is better suited both for the types of calculations conducted in Excel 2010 and for compression. Quality of compression is a significant differentiator between in-memory technologies, as better compression means you can store more data in the same amount of RAM (i.e., more data is available for users to query). In its current version, however, PowerPivot is still very limited in the amounts of data it supports and requires a ridiculous amount of RAM.

The Present and Future Technologies

The destiny of BI lies in technologies that leverage the respective benefits of both disk and RAM to deliver fast query responses and extensive multi-user access without monstrous hardware requirements. Obviously, these technologies cannot be based on relational databases, but neither can they be designed around the assumption of a massive amount of RAM, which is a very scarce resource.

These types of technologies are not theoretical anymore and are already utilized by businesses worldwide. Some are designed to distribute different portions of complex queries across multiple cheaper computers (this is a good option for cloud-based BI systems) and some are designed to take advantage of 21st-century hardware (multi-core architectures, upgraded CPU cache sizes, etc.) to extract more juice from off-the-shelf computers.

A Final Note: ElastiCube Technology

The technology developed by the company I co-founded, SiSense, belongs to the latter category. That is, SiSense utilizes technology which combines the best of disk-based and in-memory solutions, essentially eliminating the downsides of each. SiSense’s BI product, Prism, enables a standard PC to deliver a much wider variety of BI solutions, even when very large amounts of data, large numbers of users and/or large numbers of data sources are involved, as is the case in typical BI projects.

When we began our research at SiSense, our technological assumption was that it is possible to achieve in-memory-class query response times, even for hundreds of users simultaneously accessing massive data sets, while keeping the data (mostly) stored on disk. The result of our hybrid disk-based/in-memory technology is a BI solution based on what we now call ElastiCube, after which this blog is named. You can read more about this technological approach, which we call Just-in-Time In-memory Processing, at our BI Software Evolved technology page.
By: Elad Israeli | The ElastiCube Chronicles - Business Intelligence Blog

23 comments:

  1. The goal of BI isn't to see how much data you can stuff into a box. The measure of usefulness involves how easily and quickly you can ask, refine, and reframe questions of the data. By that measure QlikView is an excellent tool.

    Thanks for the blog!

  2. Thanks for your comment, TB.

    This post is not about a particular product, but rather about the future of backbone BI technology.

    QlikView's success is not debatable and I am by no means saying QlikView is a bad product. The point I am trying to make, with regard to QlikView in particular, is that you do not need in-memory technology to achieve the same results, and that holds great benefit for future BI solutions.
    Thanks again for your comment.

  3. Good article, I am a huge QlikView fan but only became one because I was open to learning about (and trying) new technology. It will be interesting to learn more about Prism. The real key, though, is not just performance, but each step in the development and implementation cycle. To quote a recent CITO Research article, "to have maximum impact, BI (solutions) must be fast. They must be created fast, perform fast, and adapt fast..."

    -KS

  4. Absolutely true, KS.

    Performance is just one of the recurring issues in typical BI implementations. But it's an example of how important the underlying technology is to simplifying the BI process. New technology that is not designed around real-life practical constraints is better left within the confines of academic studies.

    The exciting thing is that there is a lot of BI experience out there, and coupled with the amazing new hardware available in any run-of-the-mill computer, it is very possible to create fast and adaptive BI without compromising on scalability or incurring heavy costs.

    We have plenty of material about Prism on our website (http://www.sisense.com). You can also download it there. I am also available at elad (-a-t-) sisense.com for any inquiry.

    Regards,
    Elad

  5. Elad,

    What an interesting blog. Thank you for sharing. I would like to understand a bit better how you calculate the 20% RAM overhead for each user who queries the in-memory data concurrently.
    Do you have a formula to calculate it? The reason I am asking is that we are now deploying Panorama on PowerPivot and we are being asked to size the implementation.
    As you know, our offering is usually a combination of OLAP (disk) and in-memory (RAM), hence the problem is smaller than with traditional in-memory vendors such as QlikView.
    Also, to the best of our knowledge, PowerPivot is a bit better at handling large datasets, which is supposed to decrease the overhead per concurrent user.

    Also, we would love to see more interesting technologies opening their platforms for different BI suites to run on; the same suggestion I made to QlikTech goes for SiSense as well. But again, it is a business decision, not always a technical one. We have helped different companies become a better platform for BI, so why wouldn't we help you too. Good luck!

    Eynav Azaria, www.Panorama.com

  6. Good article.
    I'm an extensive user of Qlik.
    The battle is between the amount (and not the speed) of data management and h/w evolution.

    You probably have a good niche solution.

    regards

  7. Thanks, PC.

    Not sure about it being a niche solution ;-)

    Elad

  8. Eynav,

    Thank you for the kind words.

    There is no magic formula you can use to calculate how much RAM you would need for intermediate queries. It depends on a lot of things. Since QlikView usually prefers to do most calculations during the actual data import stage, the 20% value is what I hear them (numerous times) recommend to their customers in order to accommodate the simple click-to-filter functionality of their UI.

    The way QlikView works is somewhat different from what PowerPivot does, so you will have to test and see. I agree that PowerPivot's technology is better engineered for multi-user support (if you consider in-memory technology a good multi-user solution in the first place), but Microsoft hasn't done much yet to capitalize on this. Right now QlikView's technology is better for multi-user deployments simply because they've been doing it longer and have worked out many issues Microsoft has yet to understand.

    Elad

  9. I have a Solid State Disk myself and it is so extremely fast that I don't need in-memory BI anymore.
    Maybe it is a nice idea to blog about "The need to analyze in memory disappears with the advent of SSD's"?

  10. That's an interesting idea. Thanks.

  11. Call it what you want, this is old milk in new bottles!

  12. tse, not sure what you're trying to say. Care to elaborate?

  13. Another advantage of PowerPivot is the integration with SharePoint; end users like it.

  14. Sure.

    However, that is limited to customers who already have SharePoint (and SQL 2008 R2). For those who don't, there are much easier, more accessible ways of getting the same thing. Using SharePoint with PowerPivot is overkill for those who don't already have them installed and/or the expertise to use them.

  15. Hey, and what about the NoSQL movement these days, with Cassandra or Hadoop, combined with the new SSD storage?

    At our company we are using QlikView, but I would like to know about new alternatives (like Jaspersoft, Pentaho, Oracle alternatives and yours, Prism).

  16. I am planning articles both on the BI aspects of NoSQL and SSD storage. Stay tuned.

    In a nutshell:

    The traditional origins of NoSQL are not in BI; its focus is on massive scalability for applications that must handle ridiculously heavy read/write loads. Because of this original purpose, NoSQL is often synonymous with distributed database architecture (Cassandra, MongoDB, etc.).

    In BI, a distributed architecture introduces more problems than solutions. But because BI focuses more on 'read' than on 'read/write', and thanks to multi-core 64-bit computers, it is possible to apply certain aspects of NoSQL to BI solutions without forcing a significant learning curve on a DBA whose familiarity is with SQL-based databases. More on that in the article.

    As far as SSD is concerned - we've actually been doing some testing of our own software on SSD and you do get performance gains. But I think its impact on RDBMS would be limited, at least in the BI space. Once you reach a certain point, I/O is no longer the main bottleneck; memory and CPU cache are.

    Elad

  17. PC: We've been doing surveys of BI users for ten years and our results show that the need for large data sets tends to be exaggerated, and that fast performance is a determinant of project success.

    Eynav: Data compression is closely related to cognition. The ability to predict how well data can be compressed (without actually compressing it) would be AI in my book. ;-)

    Anon: Interesting point - solid state disks are fast, but they are not the same as in-memory, because the data still has to be loaded from the "disk" into memory for the CPU to access it.

    Elad: Great post. As I mention in my blog, columnar architecture often gets confused with in-memory.
    http://bi-trends.blogspot.com/2011/06/column-oriented-databases-are-not-same.html

  18. Ever since I published this post, I've been talking to several people about how problematic the RAM restriction is in pure in-memory technology.

    It depends on who you ask, I suppose, but I did find something interesting.

    Google estimates 210 monthly searches for the exact term "out of memory qlikview", using the Google Adwords Keyword Tool.

    The number itself may be misleading as Google's estimates are not very accurate. But the fact Google even has this term recorded says a lot about people's search patterns.

    Another interesting thing is that this number was 140 a couple of months ago.

    As time marches on and data volumes grow, I believe more and more people are going to hit the wall - not just in QlikView, but in any BI technology that utilizes RAM like it's free.

  19. In the late nineties and early 2000s, I did a lot of work tuning Oracle databases. One of my principles was similar to the concepts here: read the disks once, and do the rest of the work in memory. The results consistently sped response times up from hours to minutes, and even from hours down to seconds. And that was on much slower hardware. See the long presentation I made at the Oracle User's Group on the subject:
    http://rodgersnotes.wordpress.com/2010/09/14/oracle-performance-tuning/

    Another principle is to avoid loops. Even if you already have a lot of data in memory, doing row by row loop de loops on a dataset is a great way to go slow. Much of the secret is in how you code.

    It's better to first change bad data models and fix convoluted code (see also my work on cartesian products in SQL).

    Adding new hardware and software onto poor data models and code is a really expensive fix.

  20. Hi, nice article.
    But I have a question, if you don't mind. You said "there is nothing “associative” about QlikView’s in-memory technology",
    and I have read in the QlikView concepts and introduction whitepaper that:
    "QlikView then extracts the data and builds a powerful, associative database that enhances the way users work with the information."
    So does QlikView transfer the data into an associative database in RAM and handle the data model as an associative database design, or what?
    Regards

  21. Qlik uses the term 'associative' to describe how they do joins and how their UI works. The database they use is not 'associative' in the 'associative database' sense of the term. In fact, the database they use is a simple table store.

    HTH,
    Elad

  22. It's been over 3 years since your observations with regard to the longevity and usefulness of in-memory technologies. Curious if your opinion has changed given the continued success of Qlik, Tableau and Tibco Spotfire, Gartner's creation of a new BI genre and Magic Quadrant for data discovery, and the continued investment the likes of Microsoft, SAP, SAS and MicroStrategy are making to close the "business data discovery" chasm between their traditional BI offerings and the product capabilities of the "new genre of data discovery" leaders.

    Lowering RAM costs, more effective data compression routines, and more effective chipsets and architectures opening up the parallel pipes between CPU and RAM with increasing cache sizes and transfer speeds, along with the leverage of non-volatile RAM stores in hybrid configs - all are compelling.

    Thoughts?

  23. Hi Mike,

    Thanks for your comments.

    Not only have I not changed my mind, I am now even more confident that In-Memory technology is just an interim technology.

    QlikView has been struggling continuously with the limitations their own technology imposes - in particular, in its ability to penetrate the enterprise. This is a known fact. Their growth has slowed down significantly.

    Tableau's success has nothing to do with in-memory; in fact, their technology is not really in-memory technology, at least not in the sense of QlikView or SAP HANA.

    SAP HANA claims to crunch 0.5TB for about $500K in hardware and software. The same benchmarks can be achieved for 10% of that cost with alternative technology (like In-Chip). The other MISO giants are just jumping on the hype bandwagon (too late, as usual).

    If you've researched the internet, you'll notice that we're pretty much betting the company on In-Memory having a very bleak future. RAM storage will not be able to catch up with the speed at which data grows, and there's really no reason to use an in-memory approach when you can achieve exactly the same results without imposing a RAM limit.

    Thanks again,
    Elad

