Managing Your Data

I assume for this article that you know what a spreadsheet is and that you have some idea about how a database will provide for larger scale data management over spreadsheets. I'm writing this under the assumption that you are someone who is outgrowing your spreadsheet solution and don't know where to go from here.

Individuals and small businesses with data management needs typically stick things in a spreadsheet and use it as a pseudo-database. On a small scale, this works fine. Modern spreadsheets have several features that help them to step up and be data management tools by sorting and manipulating the data they contain. The main question here is, "Would a huge RDBMS (Relational Database Management System) ever return money for my investment on such a small scale?" The answer is, probably not. But a client database may make sense.

You may want to read my analysis of spreadsheets before moving on.

Large businesses typically have enterprise level database solutions on multiple servers with custom built clients to access and manipulate their data. There's really nothing less that will do the jobs large corporations need. The choices large businesses are faced with have to do with the level of customization they need and which company they'd like to provide their actual database software. The only real question here is, "Which product will meet my needs?" Cost is always a factor for comparing solutions, but a large server-side RDBMS is absolutely essential in these cases.

You may want to read my analysis of database servers at this point.

The obvious answer is to use a client database, but if you are planning on growth or have a lot of people already, you may want to skip that step and build a database server now. Client databases operate as databases, but run on your own machine. You can put the actual database file on a shared drive or a server, but run the actual software on each person's computer. This is often, but not always, the best option for a small business.

You may want to read my analysis of client databases once I write it.

Spreadsheets are simply not an option to manage millions of records; and server-side databases are too expensive for tracking your hundred or so customers on a small scale. Most of the real dilemmas arise between these two scales. Small businesses are typically in a realm where Excel just isn't cutting it, but spending the thousands of dollars on licensing fees and even more on the talent to manage larger systems isn't feasible either. Should you get a client-level database program like Microsoft Access or Filemaker? Or should you spend more and put something on a server? Or should you spend the time on pushing the limits of your spreadsheet? If your data consists of items common to many businesses, like a customer list, you could look into off-the-shelf contact management software. And there are other tools for other other common needs. This document will attempt to help clear up this confusion, but I'll just be looking at managing your own data here.

The two clear reasons you really need to move beyond spreadsheets are: first, volume of data; and second, number of users. Technically, most spreadsheets will allow you to store more information than you should, but if you're storing thousands of records for the purpose of storage more than for the purpose of analysis, you ought to consider a database. However, if you're the only person needing access to this data, and other people only see it in reports you provide, you could stretch it out a little more. Just back it up frequently. Once you start sharing your data (not just its products, like reports) you will quickly grow tired of messages from the operating system telling you that someone else has the file open and you only get it read-only. You may also run into the problems of people making copies of the file, making changes, then wondering whether to overwrite the original, which may or may not have other peoples' changes saved. If you're sharing a spreadsheet, consider very strongly getting a database.

There are other issues that may cause you to grow out of a database. If you need to provide different views of your data, or pull custom reports from that data, or do complicated sorting or analysis of the data, your specific needs may push you into the database camp. Some analysis is more easily done in your spreadsheet than in a database program, however. You can typically do anything you can imagine with a database, but at the expense of more work to create the tools. Excel can do some pretty cool analysis with pivot tables, and that tool comes included with your copy of Excel. But to do the same thing in a database, you either need to ask Excel to import the data from your database (which makes a database seem like a bad idea, but can actually be a great idea as you'll learn soon) or build or buy tools to do that with the data you keep in your database. This brings up one of the biggest trade-offs to a database upgrade. You can buy unlimited capabilities, but each capability can cost more time and effort in addition to the database expense itself. One thing to remember, however, is that if you are sold on a database for any of the strengths it provides, you can almost always pull data into your spreadsheet to do analysis. This ought to provide you with a crutch as you move into more complicated tools, or even a permanent tool in your repertoire.

There is obviously a large swath of gray space where you could use either a spreadsheet or a database to achieve your objectives. For the purpose of moving on to a discussion of databases, we'll assume that you are sharing your data with other people, causing the problems mentioned previously, and that you have tens of thousands of orders to track and putting them in your spreadsheet is just becoming too cumbersome. We've got a clear and urgent need for a more powerful tool. The next logical step is to purchase a client database. These can be had for substantially less than a server database; they don't require a server; and they're geared to ordinary users, not engineers. The two biggest reasons for using one of these tools are, surprise, the same reasons for moving beyond a spreadsheet. Multiple users can have the database open at the same time, and the database engine will keep your fingers out of each others' work. You won't step on each other unless you're both trying to edit the same exact record at the same time. Imagine your spreadsheet allowing you to all have the file open at the same time, inserting changes into all of your open spreadsheets, and updating them in real-time. These databases can also hold a tremendous amount of information. You should expect to have the database tool very easily import your spreadsheet data into a single table, where you can start using it.

Now you have successfully upgraded your data storage from spreadsheet to database ... now what? The first thing you'll notice that's slightly different from the database is that you don't have a bunch of extra empty cells hanging around. The database likes to arrange its data in rows and columns like a spreadsheet, but calls the rows records and the columns fields. You can add and delete records very easily, but won't want to add or delete fields very often. You'll want to design your data structures up front, knowing what all the fields are going to be for a type of record, and sticking to it. Imagine that you have never before worried about whether an order was made as a gift or not. But now you want to offer a gift wrapping service. It would make sense to add a field to track whether each order will be a gift. You may have previously just typed gift in the next spreadsheet column to record this. But now you need to modify the design of the table to add the gift field. Once done, you'll be able to either type gift as before, or better, click a yes/no box to indicate that the order is a gift. And since you're checking boxes now, lets discuss one of the big benefits of databases that also has a cost.

Most databases end up with what we call a front-end or a client. When you move into databases, you separate your data from your presentation of it. You can build a front-end in your client database software with the tools it provides for you. This will usually be a form that includes some fill-in boxes and check boxes to read and change the data stored in your tables. Rather than looking row by row through your spreadsheet, you will now see everything from that row, or now called a record, displayed on a form that can have fields above or below one another, to the right or left of one another, or even invisible. At first glance, this seems nice, but upon further reflection, you'll see that you can provide two different forms that both access the same data in the same table. Perhaps you have one form for entering new orders that makes the data entry of specific fields efficient. And you can create another form that displays the order information, but only allows updates to comments on the order. Multiple views of the data can be used in hundreds of ways to create efficiencies that are untouchable in the spreadsheet world. The only drawback is that you have to take the time to build the forms.

Another benefit of separating your data from your presentation, sometimes called "data layer" and "presentation layer" is that you can put buttons on your form that will do calculations or automate tasks. If you had someone call and cancel an order back in the spreadsheet days, you may have had a procedure to follow that may have deleted the row from the spreadsheet (bad for record-keeping) or added a new column that said CANCEL in bold letters or something similar. Then you may have had to send an email to your friend who packed and shipped the orders. In your form, you can have a single cancel button that can be programmed to first, do whatever you need to do in your table to mark the order cancelled, then automatically kick out a form email to someone without access to the database to inform them of the cancellation. One more form to look at the same data. Of course, the better option would just be to have the guy in shipping pull his orders from the database one at a time. If the order is marked cancelled, he'd never have to pack it in the first place. The email would only go out if he'd already packed the order and he needed to avoid shipping it.

Another presentation to sit atop your data is in the form of reports. A report is read-only, typically produced after something has happened to provide information about it. Lets say you want a report every day that tells you how many orders were received that day, how many were filled, how many were shipped, and even how many are still outstanding from previous days. Creating this in a spreadsheet was possible, but time consuming. More importantly, it required you to move around your actual data in the spreadsheet, risking the deletion or breaking apart of records. With a database, you build a report once, calculating numbers of records that match certain dates and criteria. You save the report, then simply open it each evening with fresh, reliable, up-to-date information. You can build dozens of reports to provide information to different people with different interests in the data. This efficiency of access to the information is huge as a business grows.

You've been using this client database for a while (months or years) and your business is growing. You may have noticed that you often have to sit and wait while records come up in your front-end. You notice that those new reps you hired are sitting around waiting on the database to give them the order information they need to answer customer questions. Reports that used to pop up in less than a second are now taking nearly a minute. You may have outgrown your client database at this point. It will continue to work, but as more and more people are simultaneously accessing the same file on your network, the slower that access will be. The data that is being pumped into the database is causing the size of the file to swell also, not causing any storage issues since you have large disk drives to hold the file, but causing the larger file to be passed around the network hundreds of times each day as people access the data. You need a server with a solid database engine. This will keep the database file in a secure place where it can be coddled and backed up since it's the life blood of your business, and it will ensure that only necessary data will be put out on a wire to clog your network.

With a server handling your database needs, you are going to get multiple improvements over your client database. First, the server is doing all of the processing by itself. If ten people give it ten requests within the same second, the server will serve each request, knowing about the other nine. Formerly, each client was trying to process its own data and had to figure out what the implications of the other requests would be. This lack of centralization can cause a lot of extra time and processing to happen on each of your client computers. Second, the server is only going to put information onto the network that is needed. Rather than having a bunch of employees grabbing for the whole database file repeatedly, each client will request a record, and receive a record. This will drastically improve your network congestion levels. It will also cut down on processing and memory load on each client machine. Records will start coming up much more quickly and reports won't clock for minutes anymore.

The decision to move from a client database to a server database is based on even more subjective criteria than the decision to move from a spreadsheet to a client database. There were two things to look for in the first case, quantity of data and quantity of users. These two things both factor in here, too, but in a vaguer way. You could say in the first case that if you're sharing your spreadsheet, you ought to move to a database. But here, you're clearly sharing your database and probably among more than just two or three people. Here, you need to look at the delays and frustrations and put a cost on them; then look at the cost of upgrading to a server database and see how they compare.

You'll be able to move your tables, that have probably changed since the spreadsheet days, straight to the database server and use them as you always have. One possible cost here is that the presentation layer you built in your client database may or may not scale up to the server database. If not, you could be stuck rebuilding your presentation layer with different tools. This could be a fantastic opportunity to redesign things for ever more efficiency or to scale better in the future, but it can cause some headaches and frustrations to throw away previous work. And now you have a whole new decision to make. Once you reach this stage, you are spoiled for choice among front-end technologies. There are literally so many options for building database front-ends, you will never be able to analyze them all. You can build things with web technologies, perhaps allowing for automated web-access for your customers to their own orders. You can build clients that run on your employees' computers with any of the many packages or build your own from scratch. You can do reporting in much the same way, or you can use any of the dozens of packages that provide reporting and analysis of your data. You've reached the big time. The only place to go from here is a server farm rather than a single server, but the differences there are for your IT department to manage; it all looks the same from where you sit.

You have probably noticed that a spreadsheet can be managed, in fact is best managed, by a single user. A client database ought to have someone who is responsible for building the forms, modifying tables when necessary, etc. But that person doesn't need to be a full-time database admin. It's probably just an employee with some technical interest or ability. It may just be the employee with the best understanding of the processes employed at your organization. We're still talking in terms of a handful of people at most. A client database could even be reasonably employed by the same single person who wanted more out of his/her spreadsheet. But once you get to the server-side database, you will probably need a programmer to build and maintain your presentation layer. You will need a database administrator to manage your data layer and database security. You could even need a network administrator to handle backups and network maintenance and security issues. These could be part-time or outsourced positions, but the cost will go up along with the capabilities.

I am going to provide a comparison of tools at each of these three levels and link to them from here when they're done. I hope this has been educational. Feel free to add comments or ask questions below. I'll try to stay on top of them.