Many applications start out with a good idea, and even a good architecture for demonstrating that idea or serving hundreds to thousands of clients. If the application becomes popular, though, you’ll soon have to adopt horizontal scaling techniques.
One of the best ways to scale is to separate database reads and writes. Sharding is a great way to add capacity: you can place new read or write shards into load-balancing rotation, or geographically closer to clients. Read shards can usually tolerate slightly-behind-realtime replication, and don’t present much of a challenge until the data outgrows the allotted storage. As read or write data grows beyond a certain size, queries slow down, and upgrading all those disks in all those shard servers is expensive and difficult. Also, if writes are only stored in specific write shards (as opposed to being replicated network-wide), it is difficult to report on business data; and long term, it’s generally not a good idea to keep all that historical data in production anyway.
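As a minimal sketch of what read/write splitting looks like at the application layer (the class and connection names here are hypothetical, not tied to any particular framework or the setup described above):

```python
import random

class QueryRouter:
    """Sends writes to the primary and load-balances reads across replicas."""

    def __init__(self, primary, read_replicas):
        self.primary = primary
        self.read_replicas = read_replicas

    def pick(self, sql):
        # Only plain SELECTs are safe on (possibly lagging) read shards;
        # everything else must hit the write side.
        if sql.lstrip().upper().startswith("SELECT"):
            return random.choice(self.read_replicas)
        return self.primary

router = QueryRouter("primary-db", ["replica-1", "replica-2"])
```

A real router would also pin a session to the primary right after its own writes, to avoid reading stale data from a lagging replica.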
It’s therefore critical to build data aggregation processes that compile and store the data important to business intelligence. Note that not all data needs to be siphoned off and stored long term; it’s equally important to decide what data you don’t need to keep, and make sure it gets purged periodically from your production shards.
I’ll talk more later about effective ways to do this.
When you ask people about business intelligence, some think it’s something only big companies have or can afford. Mention of those two words will often provoke stories about how their company spent millions on a data warehouse project that was ultimately a failure.
While it’s true that BI is often not handled well, every business needs it and, indeed, already has it in some form. Whether it’s a spreadsheet, accounting software reports (from QuickBooks or Great Plains, say), or reports pulled directly from your production databases, most if not all businesses need to look at measures of how they are doing.
There are definite advantages to being more systematic about it, though. Versions of spreadsheets can get lost or confused. Access databases are prone to corruption and can’t handle large volumes of data. And by looking at single silos of information (accounting, web traffic, etc.) you run the risk of missing connections and failing to see the big picture.
Lots of companies will offer you Business Intelligence and Data Warehousing solutions. Most of us have heard of multi-million dollar data warehouse projects gone sour. It doesn’t have to be that way, and you can get there from wherever you are.
Going Open Source
There are a lot of BI solutions out there. Some are good and easy; others are good but complex and require a lot of expertise to implement properly. Most of them are very, very expensive. We decided to evaluate, and then implement, an open source BI solution called Pentaho. In upcoming posts you’ll see some tips and impressions. We also opted for paid support, but I have to say about half of our problems have been solved by Googling. The documentation is a bit incomplete, but getting better. I’d say it’s a good choice when you can’t afford a more packaged solution, and it’s also good if you want to get into the guts and implement something truly unique.
What you get (for free!)
- Pentaho Data Integration (PDI) – This is an ETL workflow tool similar to SSIS. It actually has several more useful modules out of the box, and easily talks to multiple types of database systems. We find it useful for pulling data from a SQL server system into a MySQL business intelligence database.
- Pentaho Report Server
- Pentaho Design Studio and Metadata Editor (for building data models)
- There are also the Mondrian analysis tools and Weka data mining
The enterprise ($) version adds an interactive analyzer tool, dashboards, and several other things.

Other things I like:
- Connectivity to multiple databases. While you can configure ODBC connections for other DBs in SSIS, it’s finicky; Pentaho includes models for most of them (dialogs that you fill in with connection-specific properties).
- Works with open standards for mail, etc. for easy enterprise integration.
- Cross-platform. This is especially cool, as you can develop on Windows or Mac laptops and deploy on Unix servers.
- Free! (Well, most of it.)
- Google-able support. There’s a fairly large population of users who post in forums and on blogs like this one.
On the downside:
- It’s a bit of a fast-moving train. It takes some discipline to settle on a specific version, especially when you see new features appearing all the time. Unlike commercial products with one or two releases a year, this moves fast. The supported enterprise version moves more slowly and is better QA’d.
- New features often aren’t QA’d well. It’s best to give them bake time, or try them (and give feedback) but don’t deploy to production until they are baked.
- It’s easier to get help on new features than when you run into a problem with old ones. The community, including the paid support folks, seems to have “moved on.”
- Performance optimization is uneven. Some features are really efficient and fast, and others are, well, not. This is probably a function of “baked-ness” and how much a particular feature is used. For some functions, like insert-update, it’s better to do it in SQL anyway.
- Some functions aren’t free. There’s a really cool analyzer mode that’s only available in the enterprise version. It’s still cheaper than anything else out there, so this isn’t really a disadvantage; just be aware that if you need to cheap out, you can’t have it all!
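On that insert-update point, pushing the work down into a single SQL statement is usually far faster than row-by-row logic in the ETL tool. A minimal sketch, using SQLite via Python for a runnable example (the table and column names are hypothetical; MySQL’s equivalent is `INSERT ... ON DUPLICATE KEY UPDATE`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO dim_customer VALUES (1, 'Acme')")

# One statement handles both new rows and changed rows ("insert-update"),
# instead of the ETL tool probing each row to decide insert vs. update.
rows = [(1, "Acme Corp"), (2, "Globex")]
conn.executemany(
    "INSERT INTO dim_customer (id, name) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
    rows,
)
conn.commit()
```

The database resolves each conflict on the primary key in place, so the batch stays set-based instead of degenerating into per-row round trips.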
That’s all for now. I’ll publish some specific tips in coming weeks.
Just a quick tip.
Data is valuable, but as it grows in your production system, it can slow everything down.
If a table has grown to millions of rows and you decide to archive it off your servers, and you have replication or sharding in place, selecting into archive tables and then deleting millions of rows can cause your replication to back up.

Another strategy, though it requires downtime, is much quicker:
- Make sure nothing is updating the table you are archiving.
- Select only the data you want to KEEP into a new table with an equivalent schema and equivalent indexing.
- Rename the original table to indicate its age.
- Rename the new table to the original table name.
- Turn your updates back on.
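The steps above translate to SQL roughly as follows. This is a runnable sketch using SQLite via Python; the table names are hypothetical, I use an id cutoff as a stand-in for a date window, and on MySQL you’d do the two renames as one atomic `RENAME TABLE events TO events_archive, events_new TO events`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("e%d" % i,) for i in range(1000)])

# 1. Updates to `events` are assumed stopped outside this script.
# 2. Copy only the rows to KEEP into a table with the same schema/indexes.
conn.execute("CREATE TABLE events_new (id INTEGER PRIMARY KEY, payload TEXT)")
conn.execute("INSERT INTO events_new SELECT * FROM events WHERE id > 900")
# 3. Rename the original table out of the way...
conn.execute("ALTER TABLE events RENAME TO events_archive")
# 4. ...and rename the new table to the original name.
conn.execute("ALTER TABLE events_new RENAME TO events")
conn.commit()
# 5. Turn updates back on.
```

The win is that you only copy the small set of rows you’re keeping; the millions of old rows leave via a cheap rename instead of a massive DELETE that replication has to replay.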
Then, going forward, you can use the siphoning-off approach on the smaller chunks of data that fall off the end of your retention window (30 days, 90 days, whatever).
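That ongoing siphoning can itself be replication-friendly if you delete in small batches with a commit per batch, so replicas never have to apply one enormous DELETE. A sketch, again with hypothetical names and SQLite standing in for MySQL (a real version would filter on a date column and sleep between batches to let replication catch up):

```python
import sqlite3

def purge_in_batches(conn, cutoff_id, batch_size=100):
    """Delete rows past the retention window in small, separately
    committed chunks instead of one giant statement."""
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events WHERE id <= ? LIMIT ?)",
            (cutoff_id, batch_size),
        )
        conn.commit()  # small commits let replicas keep pace
        if cur.rowcount == 0:
            break

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany("INSERT INTO events (payload) VALUES (?)",
                 [("e%d" % i,) for i in range(500)])
purge_in_batches(conn, cutoff_id=250)
```

On MySQL you could use `DELETE ... WHERE created < ? LIMIT n` directly; the subquery form above is just SQLite’s way of expressing the same batched delete.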