Weighing risks of storing files in a content management system

I was cleaning out a desk drawer last weekend and found a few old 3 1/2 inch floppy disks. The discovery made me realize that in order to read the data off those disks, I would have to pull the floppy drive from an old computer and install it in a functioning computer -- and hope the new computer had the appropriate data connector, or that I could find an adapter.

The discovery of the floppies, and the realization that the data couldn't easily be read, parallels a situation I face today as I debate moving my document and media files to a content management system. The risk today, just as with the floppy disks, is that I might end up storing important data in a format that later becomes unreadable.

I have been contemplating moving the bulk of my personal text files, and perhaps even multimedia files, to an Alfresco CMS running on my home server in order to access my documents conveniently from anywhere. I have been using Evernote to store text notes and web clips, and I looked at Google Docs as an option. Both are great services and dead simple to manage, but these online services don't provide some of the conveniences of Alfresco.

Alfresco is a free, open source Java web application that is slowly becoming a Swiss army knife for managing content. One of Alfresco's compelling features is the wide variety of file-access protocols it offers to manipulate documents stored in its repository. Alfresco's documents can be accessed via its web client, Java APIs, and CMIS, sure. But more interesting for my current needs is that documents stored in Alfresco also can be read, written and deleted from other computers on the LAN using a CIFS/SMB shared drive, over the web using WebDAV, using NFS, and even FTP.

Because Alfresco can expose its managed content using so many industry-standard protocols, I thought storing my files inside Alfresco would make it easier for me to access my documents no matter where I was, without adding the need for a specialized client application or web connection. I could use a CIFS shared drive at home to access documents from any of my home computers. I could access the documents securely from work using WebDAV over SSL. And I could access documents from a friend's house by logging into my home server from a web browser using Alfresco's native web application. My documents would be stored at home, but also available "in the cloud." I could even use Alfresco's feature to emulate a SharePoint server to version and share my MS Word and other Office documents from Office applications.

Making my documents this accessible would be convenient and (on the geeky side of) cool. I do have a strong concern about whether I want to risk exposing my documents to Internet hackers. But the longer-term concern is whether I will someday find myself wanting to access my files without using Alfresco. Will my content repository one day become the equivalent of a floppy disk?

The risk of storing data in a format that later becomes unreadable is not new, and the problem grows as more of our lives become digitized. I remember a few years ago hearing Grady Booch describe his work preserving seminal software for the Computer History Museum and his labor of love, the Handbook of Software Architecture. He mentioned that software that should be preserved for historical and educational reasons is sometimes stored in once-popular paper or magnetic formats that are difficult to read today. The Library of Congress has been concerned with what digital formats it should use in order to store its electronic archives.

Alfresco stores its content as regular files, which is good. However, those files are named using globally unique identifiers rather than the original file name. The stored documents are mixed with other files used by Alfresco for versioning and other purposes in a series of numbered subdirectories. Do I want to rely on Alfresco being the required middleman to give me the files I need? Using the digital media sustainability factors used by the Library of Congress to rate digital preservation, I would rate Alfresco's storage like this, with High meaning good for sustainability:
  • Disclosure: High
    The files are stored in your native file system format, such as ext3 or NTFS. Alfresco itself is open source, and it runs under Java, which can be run on nearly any modern operating system.
  • Adoption: Low
    Despite Alfresco being powerful and free, the file organization and metadata formats are unique to Alfresco.
  • Transparency: High
    Alfresco stores files as regular files, albeit buried within its own directory organization scheme, and the file metadata is stored in a relational database of your choosing.
  • Self-documentation: Low
    Alfresco separates file contents from its metadata using a proprietary storage scheme. Reuniting the two requires Alfresco.
  • External dependencies: Medium-to-High
    Retrieving file data with its metadata requires Java, a web application server, Alfresco, and the database used to store the metadata. Since I use the open source MySQL as my database, and all other dependencies are open source, the external dependencies can be easily assembled. But it would be a pain.
  • Impact of patents: High
    I think all the technology needed to retrieve the data is unencumbered by patents.
  • Technical protection mechanisms: High
    Alfresco's files are stored on the file system without alteration, so no translation or decryption is needed.
Overall, Alfresco scores well for sustainability. The data should be retrievable for the foreseeable future by anyone properly motivated. However, its low Self-documentation and Adoption scores concern me. For example, say a fire completely destroys our home. While we run outdoors, my wife wisely remembers to grab my external backup USB drive. That hard drive -- hopefully -- contains a backup of the Alfresco repository file system and the Alfresco MySQL database containing the content repository metadata, like file name, file path, date created. Whew! My data are safe.

But, hmm, how do I find that fire insurance policy while I'm at the local motel's shared lobby computer? Yes, it will be possible to find the file I need through search tools, or by opening the files one by one -- and in the case of binary image files, by changing the file extension from Alfresco's ".bin" to whatever format the file really contains so I can open it with the proper application. But getting my files out of the Alfresco repository, with the file names and directory structure that make the files usable, will not be as easy as plugging the disk drive into someone else's computer and opening the file with a text editor. It is in unexpected situations like this that I will wish I had kept my files stored as regular files in regular directories and just used Samba.
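
To make the ".bin" renaming step concrete, here is a rough sketch of how the guessing could be scripted. It assumes nothing about Alfresco beyond what is described above (content files ending in ".bin" somewhere under a store directory). The function names guess_ext and recover_bins are my own, and only a handful of well-known file signatures are recognized. The standard "file" utility does the same job far more thoroughly where it is installed.

```shell
#!/bin/sh
# Print a likely extension for one file, judged by its first four bytes.
# Only a few signatures are covered; anything else yields "unknown".
guess_ext() (
    # Leading bytes as lowercase hex with no separators, e.g. "25504446".
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        25504446) echo pdf ;;      # "%PDF"
        89504e47) echo png ;;      # PNG signature
        ffd8ff*)  echo jpg ;;      # JPEG start-of-image marker
        47494638) echo gif ;;      # "GIF8"
        504b0304) echo zip ;;      # ZIP container (also .docx, .odt, ...)
        *)        echo unknown ;;
    esac
)

# Copy every *.bin under a store directory into an output directory,
# swapping ".bin" for the guessed extension. Originals are left untouched.
recover_bins() (
    store=$1
    out=$2
    mkdir -p "$out" || exit 1
    find "$store" -type f -name '*.bin' | while IFS= read -r f; do
        base=$(basename "$f" .bin)
        cp -p "$f" "$out/$base.$(guess_ext "$f")"
    done
)

# Hypothetical usage; both paths are placeholders:
#   recover_bins /media/usbdrive/alf_data/contentstore ~/recovered
```

The recovered copies still carry their identifier-style names. Putting the original names and folders back would need the metadata from the database.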

That's where I am now, weighing the advantages of storing my files in a content management system versus the disadvantages and risks. I'm guessing many businesses go through this same struggle whenever they adopt a content management system for their documents. Once a company switches to a content management system, it must jump in with both feet and live with the benefits and problems of storing its documents inside an electronic vault controlled by a piece of non-standard software. At least with Alfresco, the process is reversible through its CIFS interface, and less scary because of its open source nature.

Maybe my solution will be to use Alfresco but to back up my content repository using the CIFS interface. That way, my backups are independent of Alfresco and I preserve the files with their original names and directory locations. I'd lose any extra Alfresco metadata stored with the files, any versioning, and any software triggers or rules associated with the files. But I'd still enjoy Alfresco's benefits on my live file system. If you have faced and solved a similar situation when using a content management system, your comments are welcome.
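
A minimal sketch of what such a backup might look like, assuming the Alfresco CIFS share is already mounted somewhere on the backup machine. The function name backup_share, the server and share names, and every path shown are placeholders rather than anything Alfresco prescribes.

```shell
#!/bin/sh
# Copy everything under a mounted share into a dated backup directory,
# keeping names, directory layout and timestamps.
# Prints the directory it created.
backup_share() (
    src=$1            # mount point of the share (assumed already mounted)
    dest_root=$2      # where the dated backup directories are kept
    dest="$dest_root/$(date +%Y-%m-%d)"
    mkdir -p "$dest" || exit 1
    # "src/." copies the contents of src rather than the directory itself.
    cp -Rp "$src/." "$dest/" || exit 1
    echo "$dest"
)

# Hypothetical usage; server name, share name and paths are placeholders:
#   sudo mount -t cifs //homeserver/Alfresco /mnt/alfresco -o username=tom
#   backup_share /mnt/alfresco /media/usbdrive/alfresco-backup
```

A plain copy like this carries over only what the share exposes as ordinary files and folders, which is exactly the trade-off described above.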

Finally got Tomboy working in Fedora 10

After installing Fedora 10 last month, I finally got the Tomboy note-taking application working. I began using Tomboy in Fedora 8, and have several notes stored in Tomboy notebooks. When Tomboy broke in Fedora 10, I put it on my to-do list to figure out how to get it working. I figured the fix would be as easy as re-installing Tomboy. It wasn't.

Fedora 10 was released three months ago tomorrow. That's why I was surprised to find that reinstalling / upgrading to the latest Tomboy from the Fedora repository didn't fix the bug. Before I fixed the problem, trying to run Tomboy would give me an error like:
** (Tomboy:4816): WARNING **: The following assembly referenced from
/usr/lib/tomboy/Tomboy.exe could not be loaded:
Assembly:   Mono.Addins    (assemblyref_index=8)
Version:    0.3.0.0
Public Key: 0738eb9f132ed756
The assembly was not found in the Global Assembly Cache, a path listed in the
MONO_PATH environment variable, or in the location of the executing assembly
(/usr/lib/tomboy).
Until I saw the error, I didn't even know Tomboy was a .NET application running under Mono. I searched around for a solution to the problem and found the bug has been reported three times to Red Hat Bugzilla, but still no one has solved it. The solution, fortunately, was pretty simple, and was mentioned by Austin Acton in a bug comment. The solution also was mentioned on this blog post by Mark Ito (I'm assuming that's his name from the subdomain).

The solution is to install mono-addins from the 'fedora' repository.
sudo yum install mono-addins
For such an easy fix, you have to wonder why this 5-month-old bug with high severity is still open. Tomboy comes as part of the standard Fedora 10 install. It must not be as easy as making the tomboy package dependent on the mono-addins package.
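
For anyone who hits a similar Mono error with a different assembly, the name of the missing piece can be pulled straight out of the warning text. This is only a sketch: missing_assemblies is a name I made up, and the yum query in the comments assumes Fedora's Mono packages advertise their assemblies as "mono(Name)" capabilities, which I have not verified for every package.

```shell
#!/bin/sh
# Read a Mono "could not be loaded" warning on stdin and print the name
# of each missing assembly, one per line.
missing_assemblies() {
    awk '$1 == "Assembly:" { print $2 }'
}

# Hypothetical usage:
#   tomboy 2>&1 | missing_assemblies
#   rpm -q mono-addins                      # is the package installed?
#   yum whatprovides 'mono(Mono.Addins)'    # assumed capability naming
```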

Installing Sun Java JDK 6 Update 12 on Fedora 10

When I set out to install Sun's latest Java development kit on my newly upgraded Fedora 10 development box, I discovered the previous instructions I had used on Fedora 8 from the Fedora FAQ no longer cover installing the Sun JDK. The instructions now refer only to OpenJDK using the java-1.6.0-openjdk package. After a short search, I found a newer installation technique, but unfortunately had to tweak it because it didn't work with JDK 6u12.

The best instructions I found for installing the Sun JDK on Fedora were from Fedora developer Paul Howarth at www.city-fan.org/tips/SunJava6OnFedora. Paul's instructions and his modified jpackage Java 6 RPM package are fantastically helpful. He details how to custom-build Java installation RPMs by rebuilding his RPM with the Sun Microsystems Java 1.6 "bin" installer.

The only roadblock to success was that Paul built his RPM for Java 6 update 7. The RPM spec file doesn't work if you run it with Sun's latest (as of this writing) jdk-6u12-linux-i586.bin file. My first attempt to follow Paul's instructions got me this:
[tom@development Download]$ rpmbuild --rebuild java-1.6.0-sun-1.6.0.7-1.1.cf.nosrc.rpm
Installing java-1.6.0-sun-1.6.0.7-1.1.cf.nosrc.rpm
warning: InstallSourcePackage at: psm.c:246: Header V3 DSA signature: NOKEY, key ID b56a8bac
warning: user paul does not exist - using root
warning: group paul does not exist - using root
warning: user paul does not exist - using root
warning: group paul does not exist - using root
warning: user paul does not exist - using root
warning: group paul does not exist - using root
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.W96jt0
+ umask 022
+ cd /home/tom/rpmbuild/BUILD
+ LANG=C
+ export LANG
+ unset DISPLAY
+ rm -rf /home/tom/rpmbuild/BUILD/jdk1.6.0_07
+ export MORE=10000
+ MORE=10000
+ sh /home/tom/rpmbuild/SOURCES/jdk-6u7-linux-i586.bin
sh: /home/tom/rpmbuild/SOURCES/jdk-6u7-linux-i586.bin: No such file or directory
error: Bad exit status from /var/tmp/rpm-tmp.W96jt0 (%prep)
The warnings are harmless. But as you can see, during the "prep" stage, rpmbuild is expecting the bin file to be called jdk-6u7-linux-i586.bin instead of the bin file for update 12. I optimistically hoped I might be able to get around this snag by renaming the newer file to the older name:
[tom@development Download]$ mv ~/rpmbuild/SOURCES/jdk-6u12-linux-i586.bin ~/rpmbuild/SOURCES/jdk-6u7-linux-i586.bin
But that just got me one step farther:
[tom@development Download]$ rpmbuild --rebuild java-1.6.0-sun-1.6.0.7-1.1.cf.nosrc.rpm
Installing java-1.6.0-sun-1.6.0.7-1.1.cf.nosrc.rpm
warning: InstallSourcePackage at: psm.c:246: Header V3 DSA signature: NOKEY, key ID b56a8bac
warning: user paul does not exist - using root
warning: group paul does not exist - using root
warning: user paul does not exist - using root
warning: group paul does not exist - using root
warning: user paul does not exist - using root
warning: group paul does not exist - using root
Executing(%prep): /bin/sh -e /var/tmp/rpm-tmp.1AYQKX
+ umask 022
+ cd /home/tom/rpmbuild/BUILD
+ LANG=C
+ export LANG
+ unset DISPLAY
+ rm -rf /home/tom/rpmbuild/BUILD/jdk1.6.0_07
+ export MORE=10000
+ MORE=10000
+ sh /home/tom/rpmbuild/SOURCES/jdk-6u7-linux-i586.bin
+ cd /home/tom/rpmbuild/BUILD
+ cd jdk1.6.0_07
/var/tmp/rpm-tmp.1AYQKX: line 33: cd: jdk1.6.0_07: No such file or directory
error: Bad exit status from /var/tmp/rpm-tmp.1AYQKX (%prep)
The rpmbuild was able to find and run Sun's (renamed) shell script, but then failed when it tried to switch to the non-existent jdk1.6.0_07 directory in BUILD.

To solve the problem, I had to edit the RPM "spec" file and make two small changes to account for the updated version. Then I continued with Paul's instructions, except using the modified spec file in place of directly using his RPM file. I got the idea of editing the spec file from a blog posting by Nick Lothian.

Here are my modifications to Paul's instructions:
  • Follow Paul's instructions up to and including running the rpmbuild command under the section "Build Java RPM Packages."

  • Begin Detour: After you get the error (shown above) that says "jdk-6u7-linux-i586.bin: No such file or directory," you won't have the RPM files but you will have an RPM spec file stored in ~/rpmbuild/SPECS, called java-1.6.0-sun.spec.

  • Edit this ~/rpmbuild/SPECS/java-1.6.0-sun.spec file by:
    Changing this line (line 37 in my spec file):
    %define buildver        7
    
    to say:
    %define buildver        12
    
    so the buildver is 12 instead of 7, and changing this line (line 45 in my spec file):
    %define toplevel_dir    jdk%{javaver}_0%{buildver}
    
    to say:
    %define toplevel_dir    jdk%{javaver}_%{buildver}
    
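    That is, remove the "0" (zero) right before the %{buildver} variable. That second change stumped me at first because Paul apparently had to add zero padding in the directory name to get "07" when he was working with Update 7. With a two-digit update number the hard-coded zero would produce "012", while Sun's installer unpacks into jdk1.6.0_12.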

  • Run rpmbuild again against the spec file instead of the source RPM, using this command:
    [tom@development Download]$ rpmbuild -ba ~/rpmbuild/SPECS/java-1.6.0-sun.spec
    
    This command should succeed with building new RPMs for Sun's JDK.

  • End Detour. Continue with Paul's instructions under "Remove Any Old Cruft."
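
The two spec file edits in the detour can also be scripted. The sketch below assumes the spec file contains the two lines exactly as quoted above. The function name patch_spec is my own, and the second substitution is only right for two-digit update numbers such as 12.

```shell
#!/bin/sh
# Apply the two edits to a spec file in place.
# Usage: patch_spec SPEC_FILE NEW_BUILDVER
patch_spec() (
    spec=$1
    buildver=$2
    tmp="$spec.tmp"
    # 1st expression: set buildver to the new update number.
    # 2nd expression: drop the hard-coded "0" before %{buildver}.
    sed -e "s/^%define[[:space:]]*buildver[[:space:]].*\$/%define buildver        $buildver/" \
        -e 's/_0%{buildver}/_%{buildver}/' \
        "$spec" > "$tmp" && mv "$tmp" "$spec"
)

# Hypothetical usage (keep a copy of the original first):
#   cp ~/rpmbuild/SPECS/java-1.6.0-sun.spec ~/rpmbuild/SPECS/java-1.6.0-sun.spec.orig
#   patch_spec ~/rpmbuild/SPECS/java-1.6.0-sun.spec 12
```
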
Now that I have the Sun JDK installed, I'm curious whether I can find situations where OpenJDK performs differently from Sun's JDK. Thanks to the handy "alternatives" Linux command that lets me easily switch between the different JDK versions, I'll be able to test my Java applications within both environments. After a sudo alternatives --config java, I have:
[tom@development ~]$ java -version
java version "1.6.0_12"
Java(TM) SE Runtime Environment (build 1.6.0_12-b04)
Java HotSpot(TM) Server VM (build 11.2-b01, mixed mode)
Success.
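
For side-by-side testing it can be handy to call each JDK by its full path rather than switching the system default back and forth. Here is a small sketch, assuming only that each "java" binary reports its version the way the output above does. The function name java_versions and the paths in the usage comment are placeholders; "alternatives --display java" lists the paths actually registered on a given machine.

```shell
#!/bin/sh
# Print "<path>: <version>" for each java executable given as an argument.
# "java -version" writes to stderr, hence the 2>&1.
java_versions() (
    for java in "$@"; do
        version=$("$java" -version 2>&1 \
            | sed -n 's/^.* version "\([^"]*\)".*$/\1/p' \
            | head -n 1)
        echo "$java: ${version:-unknown}"
    done
)

# Hypothetical usage; the paths are placeholders:
#   java_versions /usr/lib/jvm/java-1.6.0-sun/bin/java \
#                 /usr/lib/jvm/java-1.6.0-openjdk/bin/java
```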