Fedora Performance

Eric Melz, October 25, 2009

eric@emelz.com

Overview

This document describes Fedora performance issues. We discuss the architecture of Fedora, then we look at some experiments which illustrate the impact of various data access patterns on Fedora performance.

Fedora Architecture

The Fedora repository is designed to manage large numbers of digital objects. Digital objects have data and metadata. The data is the actual content (images, audio, etc.), and the metadata represents facts about the data. Metadata is represented using RDF.

Data is stored as files on the filesystem. Fedora objects have one or more datastreams, and each datastream is stored as a separate file. The metadata is also stored as a file in the filesystem. There is a file for each digital object which contains structural information such as the datastreams associated with the object, as well as the RDF metadata, represented in XML format.

Indexing of objects is performed using two different systems. The first system is a standard relational database such as MySQL, which contains several tables tracking the Fedora objects. One of these tables is used to facilitate retrieval of objects based on Dublin Core (DC) metadata. There are a dozen or so standard DC predicates, and each predicate has a column in the table. For example, there is a "title" column and a "description" column in this table.
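The column-per-predicate layout can be sketched as follows. This is only an illustration: it uses SQLite (from the Python standard library) in place of MySQL, and the table name dc_index, its columns, and the sample rows are hypothetical, not Fedora's actual schema.

```python
import sqlite3

# Hypothetical, simplified stand-in for the DC index table.
# Table name, column names, and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dc_index (pid TEXT PRIMARY KEY, title TEXT, description TEXT)"
)
conn.executemany(
    "INSERT INTO dc_index (pid, title, description) VALUES (?, ?, ?)",
    [
        ("demo:1", "Freakonomics", "sample description 1"),
        ("demo:2", "Moby Dick", "sample description 2"),
    ],
)
conn.commit()


def find_pids_by_title(connection, title):
    """Return the pids of all objects whose DC title matches exactly."""
    rows = connection.execute(
        "SELECT pid FROM dc_index WHERE title = ? ORDER BY pid", (title,)
    )
    return [pid for (pid,) in rows]


print(find_pids_by_title(conn, "Freakonomics"))
```

Because each DC predicate is an ordinary column, a lookup by title is a plain SQL query against a single table.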

The second system used to index objects is the Mulgara triple store. Mulgara saves each subject-predicate-object RDF triple in the triple store (e.g., (pid dc:title "Freakonomics")). RDF queries written in a language such as iTQL or SPARQL can be performed against the triple store to retrieve objects.
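To make the triple model concrete, here is a toy in-memory sketch of pattern matching over subject-predicate-object triples. It does not use Mulgara or any real query engine; the match function and the sample triples are hypothetical. The query at the end is roughly what a SPARQL pattern such as { ?s dc:title "Freakonomics" } expresses.

```python
# Toy triple store: a set of (subject, predicate, object) tuples.
# This is an illustration only and is not how Mulgara stores data.
triples = {
    ("demo:1", "dc:title", "Freakonomics"),
    ("demo:1", "dc:creator", "Steven Levitt"),
    ("demo:2", "dc:title", "Moby Dick"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return sorted(
        t for t in store
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# Find every triple whose predicate is dc:title and whose object is "Freakonomics".
print(match(triples, predicate="dc:title", obj="Freakonomics"))
```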

Additionally, the Lucene text-search engine is integrated with the Mulgara triple store. When triples containing text literals are inserted into Mulgara, Lucene indices are also updated.

Performance Implications

Each component of Fedora has different performance characteristics. The MySQL database is fairly mature and performs well on all types of operations: inserts, updates, and deletes.

The story with Mulgara and Lucene is a bit more complicated. When a triple is inserted into Mulgara, it is first buffered in memory. At various points, the triple buffer is flushed to a persistent store. One effect of this buffering strategy is that newly inserted triples may not be visible to queries until the buffer has been flushed. It turns out that flushing the triple buffer is fairly expensive, so applications must choose between poor performance (flushing before each query) and somewhat stale query results (not flushing).
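The trade-off can be illustrated with a toy model of a buffered store. The class below is a hypothetical sketch, not Mulgara's implementation: inserts go into an in-memory buffer, and queries only see triples that have been flushed.

```python
class BufferedTripleStore:
    """Toy model: writes are buffered; queries read only flushed triples."""

    def __init__(self):
        self._buffer = []          # inserted but not yet visible to queries
        self._persistent = set()   # flushed and visible to queries

    def insert(self, triple):
        self._buffer.append(triple)

    def flush(self):
        # In the real system this is the expensive step.
        self._persistent.update(self._buffer)
        self._buffer.clear()

    def query(self, predicate, obj):
        """Return subjects that have the given predicate and object."""
        return sorted(
            s for (s, p, o) in self._persistent if p == predicate and o == obj
        )


store = BufferedTripleStore()
store.insert(("demo:1", "dc:title", "Freakonomics"))
stale = store.query("dc:title", "Freakonomics")  # buffer not flushed yet
store.flush()
fresh = store.query("dc:title", "Freakonomics")  # triple is now visible
print(stale, fresh)
```

In this model the first query returns nothing even though the triple has been inserted, which is the stale-result behavior described above.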

Similarly, writes to Lucene are very expensive. Each time a triple is sent to Lucene, the text index is recomputed and this can take a long time.

Experiments

In this section we describe a series of experiments which illustrate various Fedora performance characteristics.  The experiments were performed on a modern Linux workstation.  The Fedora version tested was 3.2.  Triplestore indexing and the Lucene index were enabled for Fedora.  The object repository contained approximately 100,000 objects.

Experiment 1- testIngestThenQuery

In this experiment we ingested 5 objects. After each ingest, we ran a query to retrieve the object. Performance in this case is poor: each ingest/retrieve cycle takes approximately 35 seconds.

testIngestThenQuery:
--------------------
Created 1256348517127: 35625ms
Created 1256348552717: 32971ms
Created 1256348585689: 36285ms
Created 1256348621977: 33910ms
Created 1256348655884: 33845ms
Total Elapsed: 172659

Experiment 2- testIngestOnly

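In this experiment we ingested 5 objects but did not perform any retrievals. Compared to Experiment 1, performance is greatly improved: each ingest takes approximately 172ms on average, representing about a 200x speedup. Note that the average is dominated by the final ingest (664ms); the other four each took about 50ms.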

testIngestOnly:
---------------
Created 1256348689899: 56ms
Created 1256348689955: 55ms
Created 1256348690011: 46ms
Created 1256348690058: 46ms
Created 1256348690104: 664ms
Total Elapsed: 872
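The averages and the speedup can be recomputed from the per-object timings in the two logs. The sketch below copies the log lines for Experiments 1 and 2 from this document; the helper name cycle_times_ms is hypothetical.

```python
import re

# Log lines copied from the Experiment 1 and Experiment 2 output.
EXPERIMENT_1_LOG = """\
Created 1256348517127: 35625ms
Created 1256348552717: 32971ms
Created 1256348585689: 36285ms
Created 1256348621977: 33910ms
Created 1256348655884: 33845ms
"""

EXPERIMENT_2_LOG = """\
Created 1256348689899: 56ms
Created 1256348689955: 55ms
Created 1256348690011: 46ms
Created 1256348690058: 46ms
Created 1256348690104: 664ms
"""


def cycle_times_ms(log):
    """Extract the per-object elapsed times (in ms) from a log excerpt."""
    return [int(ms) for ms in re.findall(r":\s*(\d+)ms", log)]


exp1 = cycle_times_ms(EXPERIMENT_1_LOG)
exp2 = cycle_times_ms(EXPERIMENT_2_LOG)
mean1 = sum(exp1) / len(exp1)
mean2 = sum(exp2) / len(exp2)
speedup = mean1 / mean2

print(f"Experiment 1 mean: {mean1:.1f} ms")
print(f"Experiment 2 mean: {mean2:.1f} ms")
print(f"Speedup: {speedup:.0f}x")
```

The means computed this way use the per-object timings only, so they differ slightly from figures derived from the "Total Elapsed" lines, which also include time spent between cycles.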

Experiment 3- testIngestThenQueryDontForceFlush

In this experiment we repeated Experiment 1, with the only difference being that we did not force Mulgara to flush its triples. Performance was similar to Experiment 2. However, the retrieval results did not reflect the recent insertions (note the pidPrime=-1; it should be equal to the pid that was just inserted).

testIngestThenQueryDontForceFlush:
----------------------------------
Created 1256351077429 (pidPrime=-1): 802ms
Created 1256351078201 (pidPrime=-1): 49ms
Created 1256351078251 (pidPrime=-1): 45ms
Created 1256351078299 (pidPrime=-1): 53ms
Created 1256351078350 (pidPrime=-1): 49ms
Total Elapsed: 1022

Experiment 4- Retrieving many objects

In this experiment we executed a triplestore query which retrieved a subset of the repository objects. The query returned 85,767 objects in about 7 seconds. This corresponds to a retrieval time of about 82 microseconds per object.
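The per-object figure follows directly from the two numbers above. A quick check, taking "about 7 seconds" as exactly 7.0 (an assumption, since the precise elapsed time is not given):

```python
objects_returned = 85767
elapsed_seconds = 7.0  # assumption: "about 7 seconds" taken as exactly 7.0

# Time per object in microseconds, and the equivalent throughput.
micros_per_object = elapsed_seconds / objects_returned * 1_000_000
objects_per_second = objects_returned / elapsed_seconds

print(f"{micros_per_object:.0f} microseconds per object")
print(f"{objects_per_second:.0f} objects per second")
```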

Experiment 5- testIngestThenQueryUsingDb

In this experiment we performed 5 ingest+retrieval cycles as in Experiment 1, but rather than using the Mulgara triple store to retrieve objects, we used the MySQL database. This method produced an average of 217ms per ingest/retrieval cycle, which is only slightly greater than the result of Experiment 2 (172ms per ingest), in which no retrievals were performed. This experiment supports the hypothesis that when ingests are interspersed with retrievals, it is significantly more efficient to use the database than the triplestore for retrieval.

testIngestThenQueryUsingDb:
---------------------------
Created 1256446084459: 341ms
Created 1256446084749: 207ms
Created 1256446084955: 176ms
Created 1256446085135: 167ms
Created 1256446085300: 176ms
Total Elapsed: 1088

Scalability

One important question is, how many objects can Fedora ingest before it starts to keel over? Will ingest and retrieval times be acceptable with many objects? This question has been partially answered by a study done by Kai Strnad and Matthias Razum (http://smartech.gatech.edu/handle/1853/28433). The authors ingested a very large number (14M) of objects into Fedora. They observed ingest rates of about 10 objects/second, a number which held constant throughout the ingest process. Unfortunately, the authors did not address the critical questions of retrieval times, or of various access patterns involving Mulgara, Lucene, etc.

Conclusions

Fedora performance is highly dependent on the data access pattern. If data needs to be retrieved using non-DC predicates, which requires querying the triple store, then "online" updates, in which updates are interspersed with retrievals, are not recommended. However, "batch" updates, in which a set of objects is ingested and then later retrieved, are feasible for non-DC predicates.
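A rough cost model shows why the access pattern matters so much. The sketch below uses the approximate timings from Experiments 1 and 2 and rests on two assumptions that were not measured: that the extra time in Experiment 1 is entirely due to one forced flush per cycle, and that the cost of a flush does not depend on how many triples are buffered. The function names are hypothetical.

```python
INGEST_S = 0.17         # approx. time per ingest (Experiment 2)
ONLINE_CYCLE_S = 34.5   # approx. time per ingest+query cycle (Experiment 1)

# Assumption: the difference is the cost of one forced flush, and that
# cost is constant regardless of how many triples are buffered.
FLUSH_S = ONLINE_CYCLE_S - INGEST_S


def online_seconds(n):
    """Estimated time to ingest n objects, flushing after every ingest."""
    return n * (INGEST_S + FLUSH_S)


def batch_seconds(n):
    """Estimated time to ingest n objects, flushing once at the end."""
    return n * INGEST_S + FLUSH_S


for n in (5, 100, 1000):
    print(f"n={n}: online ~{online_seconds(n):.0f}s, batch ~{batch_seconds(n):.0f}s")
```

Under these assumptions the online pattern pays the flush cost n times while the batch pattern pays it once, so the gap grows linearly with the number of objects. The real flush cost may well grow with the amount of buffered data, so the batch estimates should be read as a lower bound.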

In contrast, object retrieval using DC predicates does not incur the same performance penalty as retrieval using non-DC predicates. That is, retrieval using DC predicates as the search criteria may be performed in either online or batch mode with acceptable performance. For this reason, it is recommended that "key" data such as IDs be represented using DC predicates. "Auxiliary" data, such as descriptions, may be represented using non-DC predicates.
