Marine Biodiversity Center Database Architecture

Dean Pentcheff
dean2@biol.sc.edu

Rationale
Assumptions
Data Processing
   Offline core databases
   Data transport to the online databases
   Presentation on the World Wide Web
Sample databases
Sample hardware and software

 

Rationale

We wished to make it simple for users to create and maintain databases using PC/Mac desktop database programs, then make the contents of those databases available on the World Wide Web.

Desktop database programs emphasize ease of use, particularly in simple database design and data entry. These qualities are essential to encourage biologists to create and populate databases.

On the other hand, desktop databases are poor choices for making data available on the web, since they tend to have inflexible "web publishing" back ends, and it is unwise to host web-accessible content on research desktop computers (both for security and availability reasons).

For this reason, more "industrially-oriented" database engines are more appropriate for the web end of the operation. These include commercial systems such as Oracle or Sybase, as well as freely available systems such as MySQL and PostgreSQL.

To gain the best of both worlds, we have developed an architecture that uses desktop databases as the core data repositories, then transports the data to an online database. That database is accessed using a web front end. This architecture has been achieved for a relatively low investment of capital and runs on reasonably modest hardware.

Furthermore, it decouples the selection of the desktop database software from the database software needed for web presentation. Users are free to adopt whatever technology is available, familiar, or affordable for either end of the operation. Changing either the desktop database or the presentation database can be done independently.

Assumptions

This architecture makes some assumptions:

  1. Small database tables. Since the design requires completely reconstructing the on-line database tables on each update, it is appropriate only for small databases. On our hardware (see below) it takes about a minute per 10,000 records when recreating a representative database of about 10 fields and 500 bytes per record. Note that we have made no effort to optimize database loading speeds.
  2. Infrequent updates. Again, because updating requires reconstructing the online tables, we've assumed that updates are less frequent than once per day.
  3. No direct input from the web. There is no provision for directly gathering data from web pages into the core database. The core database is maintained on the PC/Mac desktop and has a one-way flow of data to the web front end. Data are input only on the PC/Mac desktop databases.

These assumptions are true for a wide range of applications. Many small-scale databases that are essentially publications of controlled data (as contrasted with web-interactive data compilations) fit them well.

Data processing

Offline core databases

Core databases are maintained by their curators on Mac or PC desktop computers using Filemaker Pro software. Note that if users were familiar with other programs (e.g. Microsoft Access), they would serve just as well. Filemaker has the advantage of being fully cross-platform (between Windows and Mac machines) and has a very friendly database design and layout interface.

Databases can be simple tables, or can be multi-table relational systems. The application should dictate the database design. This figure shows a sample data-entry page for one database table in Filemaker on a Macintosh (click on it for a fullsize version).

Desktop databases are backed up on a regular (daily) basis to ensure data integrity.

Data transport to the online databases

Whenever an update to the web database is needed, the database table is exported to a comma-separated-text (CSV) file by the desktop database software. For ease and consistency, one can define a script or macro to do that job (dependent on the software).

Next, a web page on the web server is accessed from the desktop computer. That web page permits one or more of the database tables (now in CSV format) to be uploaded to the server. This page is password-protected and accessible only to appropriate personnel. A sample page is shown here (click on it for a fullsize version).

Clicking the "Upload" button on the form causes the files to be uploaded, processed, and loaded into the online database.

The web server presenting the upload page is the Apache web server running in the Linux operating system on a surplus 200 MHz PC. The server computer hosts the data processing program, as well as the online database engine and the web server itself.

The program called by the upload page to do the work is written in the Perl language (well-suited to textual manipulations). Each data table is described in a plain-text configuration file on the server which lists the characteristics of each field (field name, data type, and database indexing requirements).

The processing program has two main tasks: perform "fixups" on the data, and load the table into the online database. Because of differing ideas about "special" character codes, line-endings, date formats, and other application-specific details, the incoming records need a general massaging to standardize the data. Once that has been achieved, the processing program connects to the online database, drops the existing table (if it exists) and loads in all the records from the new version. These operations are done using the standard database interface system within Perl, in this case communicating with the MySQL database.

Because of the application-specific nature of the data "fixups", it seems inevitable that the processing program will need to be adapted for different data input scenarios (Filemaker vs. Access, etc.).

The strategy of completely replacing the database tables with each update avoids all the complexity of selectively updating only new or changed records. Of course, it is this strategy that puts upper limits (hardware-speed-dependent) on the practical size of the database.

Presentation on the World Wide Web

Once the data have been transported from the desktop database to the MySQL database on the Linux server, they are available to the web server for presentation to clients over the web. We have chosen to design the data-presentation pages using the widely-used PHP language for creating dynamic web pages. PHP is a markup language that can be embedded within the HTML coding of a web page. PHP has the ability to perform database queries and generate page content dynamically, based on user requests from web browsers.

Note, however, that any other data access and formatting system could be used. At this point, the data reside in a network-accessible SQL-style database. Many tools exist that can read such databases, perform queries, and produce output in a multitude of formats. In a proprietary software environment, for example, Microsoft®'s Active Server Pages (ASP) could be used instead of PHP.

The presentation of the data can be specified precisely, depending on the desired outcome. On the sample page shown here (click on the figure for a fullsize version), all the page contents are dynamically generated from the database and presented using HTML with embedded PHP code. The paragraphs of text come from database text fields, and the tables are generated from database queries. Currently, the pages are generated dynamically on each request. To support higher levels of web traffic, it would certainly be possible to pre-generate and cache pages that do not change between database updates.

It is also possible to provide more generalized search interfaces that query the online database and then format and present the results. The example here shows a literature/web resource database searched using free-text search keys.

As mentioned in the assumptions, no provision has been made for web-users to insert data into the databases, since the core databases on the PC/Mac desktop computers never receive data from the online copies. The web server presenting the pages uses read-only access to the databases. A major advantage of this strategy is that it is not possible for malicious users to corrupt the core databases. At worst, the online versions could be damaged, but that damage cannot propagate back to the core databases.

Sample databases

Here are links to the two initial databases where we have applied this method:

Sample hardware and software

As of July 2009, our hardware and software implementing this scheme is detailed below. Undoubtedly the software version numbers actually in use will change with time.

ComponentDescriptionVersionCost
Workstations for hosting core databases Mac and Windows PCs MacOS X, Windows XP, Vista US$500 - US$2,000
Desktop database software Filemaker Pro 9.0 approx. US$200 per copy
Data manipulation software Perl 5.8.8 Open Source Software (no cost)
Web server hardware AMD Athlon 1.8MHz PC Shuttle surplus, so no direct cost
Web server operating system Linux (Ubuntu) 8.04.3 (Hardy Heron) Open Source Software (no cost)
Online database engine MySQL 5.0.51a Open Source Software (no cost)
Web server software Apache (with PHP) 2.2.8 Open Source Software (no cost)
Hence, the total hardware and software costs were those required to give each core database administrator a desktop computer and a copy of a desktop database software package. Many (perhaps nearly all) researchers already have regular access to a desktop computer.

 

Creative Commons License
All content on this website (including text, photographs, and any other original work), unless otherwise noted, is licensed under a
Creative Commons License.