Tableau 6.0 is out, and according to Tableau Software’s CEO, one of its main features is a new data engine. Here’s an excerpt from one of the articles covering Tableau’s latest release:
"Our new Tableau Data Engine achieves instant query response on hundreds of millions of data rows, even on hardware as basic as a corporate laptop... No other platform allows companies to choose in-memory analytics on gigabytes of data …" Christian Chabot, CEO of Tableau Software, said in a statement.
These are bombastic claims indeed, and the emphasized segments of the CEO’s quote (instant query response on hundreds of millions of rows, and in-memory analytics) are particularly interesting. So, with the help of my friend, colleague and brilliant database technologist Eldad Farkash, I decided to put these claims to a real-life test.
Since this data engine was claimed to utilize in-memory technology, we set up a 64-bit computer with an adequate amount of RAM (hardly a corporate laptop) and used a real customer’s data set consisting of 560 million rows of raw internet traffic data. To keep the test simple, we imported just a single text field out of this entire data set.
Initial Findings:
1. Surprisingly, and unlike what Tableau’s CEO claims, Tableau’s new data engine is not really in-memory technology. In fact, the entire data set is stored on disk after it is imported, and RAM is hardly utilized.
2. It took Tableau 6.0 approximately 5 hours to import this single text field, of which 1.5 hours was the import itself and the rest a process Tableau calls ‘Column Optimization’, which we believe creates an index very similar to that of a regular relational database. For comparison, it took QlikView 50 minutes and ElastiCube 30 minutes to import the same field: a 6x to 10x difference. All products were using their default settings.
3. Once the import process completed, we asked Tableau to count how many distinct values existed in that field, a query commonly required for business intelligence purposes (a rough sketch of this type of query follows below). That query took 30 minutes to return. For comparison, it took both QlikView and ElastiCube approximately 10 seconds to return the same result: a 180x difference. Again, both products were used with their default settings.
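To make it concrete, here is a minimal Python sketch of the kind of distinct-count query we timed. It is purely illustrative, not the actual test harness: it assumes the field has been exported to a plain text file with one value per line, and the file name is hypothetical.

```python
# Minimal sketch of a distinct-count query over a single text field.
# Illustrative only: assumes the field was exported to a plain text
# file, one value per line; "field.txt" is a hypothetical name.
import time

def distinct_count(path):
    """Return the number of distinct values in a one-column text file."""
    seen = set()
    with open(path) as f:
        for line in f:
            seen.add(line.rstrip("\n"))
    return len(seen)

start = time.time()
n = distinct_count("field.txt")
print(f"{n:,} distinct values in {time.time() - start:.1f}s")
```

Note that when the field contains many distinct values, the hash set in this naive approach can by itself consume gigabytes of RAM, which is exactly the kind of workload a purpose-built column store is supposed to handle gracefully.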
Initial Conclusions:
Tableau’s new data engine is a step up from their previous engine, which was quite similar to the one Microsoft Access used in Office 2007. That is good news for individual analysts working with non-trivial amounts of data, as earlier versions of Tableau were quite poor in this respect. This release, I imagine, also helps Tableau against Spotfire (TIBCO), which until now was the only pure visualization player that could claim to have technology aimed at handling larger data sets.
From a practical perspective, however, the handling of hundreds of millions of rows of data, as well as the reference to in-memory analytics, is more marketing fluff geared towards riding the in-memory hype than a true depiction of what this technology is or what it is capable of. Tableau’s data engine is not in the same league as in-memory technology, or pure columnar technologies like ElastiCube, when it comes to import times or query response times. In fact, by our measurements it is slower by one to two orders of magnitude.
Do I have this right? You're the CEO of a product that competes with Tableau. You slice their CEO's quote, leaving out key points. You run one test that is probably optimized for your product.
I'm a Tableau user and I've been using their new product analyzing 212 million records. It is fast like nothing else. I didn't have to write any scripts or programs. It just worked.
Jim,
Thank you for your comment.
SiSense and Tableau are hardly competitors. Tableau is a visualization tool, while SiSense is a development environment for creating enterprise-grade BI solutions with centralized data repositories. SiSense competes for the same deals as QlikView, IBM, SAP, etc.
My sole interest in Tableau's new release is the technological aspect of their new data engine, not their product (note the title of this post). The portions I removed from the Tableau CEO's quote are those that have no bearing on the data engine being discussed here.
The test we conducted is on ONE field, which is about the simplest test there is. There is no way to 'optimize' a single field for anyone's product. You'll notice that in this post I am also comparing QlikView's technology (a direct competitor of SiSense) to Tableau's, precisely to avoid claims of subjectivity. As you may have noticed, QlikView has done quite well on this benchmark, so any claims of bias are invalid.
As for your own experience: publish your data set's properties, the hardware used and your benchmark results, and I'll be happy to review them. I can't really comment on claims like 'It is fast like nothing else', unless 'nothing else' means Tableau 5.2, in which case I would totally take your word for it.
Thanks again,
Elad
I've used the new Tableau on hundreds of millions of rows and I agree with Jim - for the data I use it is very fast.
Which makes me wonder about your test. What sort of text field were you trying to insert? Was it just half a billion lines of web log entries, all different? Did you try anything else, like half a billion rows of numbers? Half a billion rows of dates? Half a billion rows of country names?
That's the kind of "real world" data I am trying to analyze and so far it has been amazing.
Brenda,
Thanks for your comment.
As for your question: the field consisted of 12-character strings with roughly 70M unique values. I could perform the same test on a field with significantly fewer unique values, but then the results would be misleading, because even a simple RDBMS could handle that within the same time frame as Tableau's engine by performing an index scan (which I believe is what Tableau's engine actually does). A rough sketch of how to synthesize a field with similar characteristics follows below.
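For anyone who wants to experiment with a field like this, here is a Python sketch (not the script we actually used) that synthesizes fixed-width strings with a controlled number of unique values. All names and scales in it are hypothetical; the example is scaled down but preserves the roughly 8:1 row-to-unique ratio of the 560M/70M field.

```python
# Sketch of a synthetic stand-in for the field described above:
# fixed-width 12-character strings with a controlled number of
# (approximately) unique values. Hypothetical and scaled down;
# the original test used 560M rows with ~70M unique values.
import random
import string

def make_field(rows, uniques, width=12, seed=42):
    rng = random.Random(seed)
    alphabet = string.ascii_uppercase + string.digits
    # Build a pool of candidate values; collisions are negligible
    # given the 36^12 key space.
    pool = ["".join(rng.choice(alphabet) for _ in range(width))
            for _ in range(uniques)]
    # Draw rows uniformly from the pool.
    return [rng.choice(pool) for _ in range(rows)]

# 1M rows with 125K uniques keeps the ~8:1 row-to-unique ratio
# of the original 560M/70M field.
values = make_field(1_000_000, 125_000)
```

The cardinality is the parameter that matters here: as noted above, a low-cardinality field would make almost any engine look fast.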
It's important to understand that the purpose of this post is not to hand out a definitive 'fast' or 'slow' grade. This is a technological post that aims to *compare* existing data engines with Tableau's new data engine (i.e. slowER or fastER).
Fast or slow really depends on what you're doing and what you expect within the scope of the use case involved. A 30-minute response time for a distinct count, like the one I've shown, could be considered 'fast' or 'slow' too, depending on that context.