Weighing risks of storing files in a content management system

I was cleaning out a desk drawer last weekend and found a few old 3 1/2-inch floppy disks. The discovery made me realize that in order to read the data off those disks, I would have to pull the floppy drive from an old computer and install it in a functioning computer -- and hope the new computer had the appropriate data connector, or that I could find an adapter.

The discovery of the floppies, and the realization that the data couldn't easily be read, parallels a situation I face today as I debate moving my document and media files to a content management system. The risk today, just like with the floppy disks, is that I might end up storing important data in a format that later becomes unreadable.

I have been contemplating moving the bulk of my personal text files, and perhaps even multimedia files, to an Alfresco CMS running on my home server in order to access my documents conveniently from anywhere. I have been using Evernote to store text notes and web clips, and I looked at Google Docs as an option. Both are great services and dead-simple to manage, but these online services don't provide some of the conveniences of Alfresco.

Alfresco is a free, open source Java web application that is slowly becoming a Swiss army knife for managing content. One of Alfresco's compelling features is the wide variety of file-access protocols it offers to manipulate documents stored in its repository. Alfresco's documents can be accessed via its web client, Java APIs, and CMIS, sure. But more interesting for my current needs is that documents stored in Alfresco also can be read, written and deleted from other computers on the LAN using a CIFS/SMB shared drive, over the web using WebDAV, using NFS, and even FTP.

Because Alfresco can expose its managed content using so many industry-standard protocols, I thought storing my files inside Alfresco would make it easier for me to access my documents no matter where I was, without adding the need for a specialized client application or web connection. I could use a CIFS shared drive at home to access documents from any of my home computers. I could access the documents securely from work using WebDAV over SSL. And I could access documents from a friend's house by logging into my home server from a web browser using Alfresco's native web application. My documents would be stored at home, but also available "in the cloud." I could even use Alfresco's feature to emulate a SharePoint server to version and share my MS Word and other Office documents from Office applications.

Making my documents this accessible would be convenient and (on the geeky side of) cool. I do have a strong concern about whether I want to risk exposing my documents to Internet hackers. But the longer-term concern is whether I will someday find myself wanting to access my files without using Alfresco. Will my content repository one day become the equivalent of a floppy disk?

The risk of storing data in a format that later becomes unreadable is not new, and the problem grows as more of our lives become digitized. I remember a few years ago hearing Grady Booch describe his work preserving seminal software for the Computer History Museum and his labor of love, the Handbook of Software Architecture. He mentioned that software that should be preserved for historical and educational reasons is sometimes stored in once-popular paper or magnetic formats that are difficult to read today. The Library of Congress has been concerned with what digital formats it should use in order to store its electronic archives.

Alfresco stores its content as regular files, which is good. However, those files are named using globally unique identifiers rather than their original file names. The stored documents are mixed with other files used by Alfresco for versioning and other purposes in a series of numbered subdirectories. Do I want to rely on Alfresco being the required middleman to give me the files I need? Applying the sustainability factors the Library of Congress uses to evaluate digital formats, I would rate Alfresco's storage like this, with High meaning good for sustainability:
  • Disclosure: High
    The files are stored in your file system's native format, such as ext3 or NTFS. Alfresco itself is open source, and it runs under Java, which can be run on nearly any modern operating system.
  • Adoption: Low
    Despite Alfresco being powerful and free, the file organization and metadata formats are unique to Alfresco.
  • Transparency: High
    Alfresco stores files as regular files, albeit buried within its own directory organization scheme, and the file metadata is stored in a relational database of your choosing.
  • Self-documentation: Low
    Alfresco separates a file's contents from its metadata using its own storage scheme. Reuniting the two requires Alfresco.
  • External dependencies: Medium-to-High
    Retrieving file data with its metadata requires Java, a web application server, Alfresco, and the database used to store the metadata. Since I use the open source MySQL as my database, and all other dependencies are open source, the external dependencies can be easily assembled. But it would be a pain.
  • Impact of patents: High
    I think all the technology needed to retrieve the data is unencumbered by patents.
  • Technical protection mechanisms: High
    Alfresco's files are stored on the file system without alteration, so no translation or decryption is needed.
Overall, Alfresco scores well for sustainability. The data should be retrievable for the foreseeable future by anyone properly motivated. However, its low Self-documentation and Adoption scores concern me. For example, say a fire completely destroys our home. As we run outdoors, my wife wisely remembers to grab my external backup USB drive. That hard drive -- hopefully -- contains a backup of the Alfresco repository file system and the Alfresco MySQL database containing the content repository metadata, such as file name, file path, and creation date. Whew! My data are safe.

But, hmm, how do I find that fire insurance policy while I'm at the local motel's shared lobby computer? Yes, it will be possible to find the file I need through search tools, or by opening the files one by one -- and in the case of binary image files, by changing the file extension from Alfresco's ".bin" to whatever format the file really contains so I can open it with the proper application. But getting my files out of the Alfresco repository, with the file names and directory structure that make the files usable, will not be as easy as plugging the disk drive into someone else's computer and opening the file with a text editor. It is in unexpected situations like this that I will wish I had kept my files stored as regular files in regular directories and just used Samba.
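To give a feel for what that manual recovery might involve, here is a minimal Python sketch that walks a copy of the content store and guesses each ".bin" file's real type from its first few bytes. The function names and the short signature table are my own illustration and have nothing to do with Alfresco itself; a real recovery would need a far longer list of signatures, and plain text files have no signature at all.

```python
from pathlib import Path
from typing import Dict, Optional

# A handful of well-known file signatures ("magic numbers") and the
# extension each one implies. Illustrative only, far from complete.
SIGNATURES = [
    (b"%PDF-", ".pdf"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"PK\x03\x04", ".zip"),  # also .docx, .xlsx and .odt, which are zip containers
]


def guess_extension(header: bytes) -> Optional[str]:
    """Return the extension implied by the leading bytes, or None if unknown."""
    for magic, extension in SIGNATURES:
        if header.startswith(magic):
            return extension
    return None


def report_bin_files(store_root: str) -> Dict[str, Optional[str]]:
    """Map every .bin file under store_root to its guessed extension."""
    results = {}
    for path in Path(store_root).rglob("*.bin"):
        with path.open("rb") as handle:
            header = handle.read(16)  # long enough for every signature above
        results[str(path)] = guess_extension(header)
    return results
```

Even with a report like this in hand, I would still have only the file types. The original file names and folder locations live in the database, which is exactly the problem.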

That's where I am now, weighing the advantages of storing my files in a content management system versus the disadvantages and risks. I'm guessing many businesses go through this same struggle whenever they adopt a content management system for their documents. Once a company switches to a content management system, it must jump in with both feet and live with the benefits and problems of storing its documents inside an electronic vault controlled by a piece of non-standard software. At least with Alfresco, the process is reversible through its CIFS interface, and less scary because of its open source nature.

Maybe my solution will be to use Alfresco but to back up my content repository using the CIFS interface. That way, my backups are independent of Alfresco and I preserve the files with their original names and directory locations. I'd lose any extra Alfresco metadata stored with the files, any versioning, any software triggers or rules associated with the files. But I'd still enjoy Alfresco's benefits on my live file system. If you have faced and solved a similar situation when using a content management system, your comments are welcome.
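For the curious, a backup through the CIFS interface could be as simple as the Python sketch below. It assumes the Alfresco share is already mounted as an ordinary directory; the mount point and backup path in the usage comment are hypothetical, and mirror_tree is my own helper name. The script knows nothing about Alfresco. It only copies whatever the mounted drive presents, which is why names and folders survive while versions and extra metadata do not.

```python
import shutil
from pathlib import Path
from typing import List


def mirror_tree(source_root: str, backup_root: str) -> List[str]:
    """Copy new or changed files from source_root into backup_root.

    Returns the relative paths of the files that were copied.
    Files deleted from the source are NOT removed from the backup.
    """
    source = Path(source_root)
    backup = Path(backup_root)
    copied = []
    for path in sorted(source.rglob("*")):
        relative = path.relative_to(source)
        target = backup / relative
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
            continue
        if target.exists():
            src_stat, dst_stat = path.stat(), target.stat()
            # Skip files that look unchanged: same size and not newer.
            if (src_stat.st_size == dst_stat.st_size
                    and src_stat.st_mtime <= dst_stat.st_mtime):
                continue
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves modification times
        copied.append(relative.as_posix())
    return copied


# Example call with hypothetical paths:
# mirror_tree("/mnt/alfresco", "/media/usb-backup/documents")
```

Run from a nightly scheduled job, something like this would leave the USB drive holding plain files in plain folders, readable on any computer.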