Eric Melz, October 25, 2009
This article examines Fedora performance issues. We first describe the architecture of Fedora, then look at some experiments that illustrate the impact of various data access patterns on Fedora performance.
The Fedora repository is designed to manage large numbers of digital objects. Digital objects have data and metadata. The data is the actual content (images, audio, etc.), and the metadata represents facts about the data. Metadata is represented using RDF.
Data is stored as files on the filesystem. Fedora objects have one or more datastreams, and each datastream is stored as a separate file. The metadata is also stored as a file in the filesystem. There is a file for each digital object which contains structural information such as the datastreams associated with the object, as well as the RDF metadata, represented in XML format.
Indexing of objects is performed using two different systems. The first system is a standard relational database such as MySQL, which contains several tables tracking the Fedora objects. One of these tables is used to facilitate retrieval of objects based on Dublin Core (DC) metadata. There are a dozen or so standard DC predicates, and each predicate has a column in the table. For example, there is a "title" column and a "description" column in this table.
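To make the one-column-per-predicate layout concrete, here is a minimal sketch using Python's built-in sqlite3 module in place of MySQL. The table name dc_index, the pid value, and the helper function are hypothetical and chosen only for illustration; Fedora's real schema is not reproduced here.

```python
import sqlite3

# Hypothetical, simplified stand-in for the Dublin Core index table.
# Table name, columns, and sample values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dc_index (pid TEXT PRIMARY KEY, title TEXT, description TEXT)"
)
conn.execute(
    "INSERT INTO dc_index VALUES (?, ?, ?)",
    ("demo:1", "Freakonomics", "A book about economics"),
)
conn.commit()


def find_pids_by_title(connection, title):
    """Return the pids of all objects whose DC title matches exactly."""
    rows = connection.execute(
        "SELECT pid FROM dc_index WHERE title = ?", (title,)
    ).fetchall()
    return [row[0] for row in rows]


print(find_pids_by_title(conn, "Freakonomics"))  # ['demo:1']
```

Because each DC predicate maps to a plain column, a lookup by title is an ordinary SQL query against a single table.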
The second system used to index objects is the Mulgara triple store. Mulgara saves each subject-predicate-object RDF triple in the triple store (e.g., (pid dc:title "Freakonomics")). RDF queries written in a query language such as iTQL or SPARQL can be performed against the triple store to retrieve objects.
Additionally, the Lucene text-search engine is integrated with the Mulgara triple store. When triples containing text literals are inserted into Mulgara, Lucene indices are also updated.
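The subject-predicate-object model described above can be illustrated with a small in-memory sketch. This is plain Python pattern matching, not iTQL or SPARQL and not Mulgara's API; the Triple type, the match function, and the sample pids are assumptions made for the example.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Hypothetical sample data.
store = [
    Triple("demo:1", "dc:title", "Freakonomics"),
    Triple("demo:1", "dc:description", "A book about economics"),
    Triple("demo:2", "dc:title", "Another title"),
]

# Which objects have the title "Freakonomics"?
hits = match(store, predicate="dc:title", obj="Freakonomics")
print([t.subject for t in hits])  # ['demo:1']
```

A real triple store answers the same kind of pattern query, but against persistent indices rather than a Python list.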
Each component of Fedora has different performance characteristics. The MySQL database is fairly mature and performs well on all types of operations: inserts, updates, and deletes.
The story with Mulgara and Lucene is a bit more complicated. When a triple is inserted into Mulgara, it is first buffered in memory. At various points, the triple buffer is flushed to a persistent store. One effect of this buffering strategy is that newly inserted triples may not be visible to queries until the next flush. It turns out that flushing the triple buffer is fairly expensive, so applications must choose between poor performance and somewhat stale query results.
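The trade-off between flushing and staleness can be shown with a toy model. The class below is a sketch written for this article, not Mulgara's implementation; the class name, method names, and the assumption that queries see only flushed triples are illustrative simplifications.

```python
class BufferedTripleStore:
    """Toy model: inserts go to a buffer, queries read only flushed data."""

    def __init__(self):
        self._persistent = set()  # triples visible to queries
        self._buffer = []         # triples inserted but not yet flushed
        self.flush_count = 0      # number of (expensive) flushes performed

    def insert(self, triple, flush=False):
        self._buffer.append(triple)
        if flush:
            self.flush()

    def flush(self):
        # In a real store this is the costly step (disk writes, index updates).
        self._persistent.update(self._buffer)
        self._buffer.clear()
        self.flush_count += 1

    def query(self, predicate):
        # Sorted so that results are deterministic.
        return sorted(t for t in self._persistent if t[1] == predicate)


demo = BufferedTripleStore()
demo.insert(("demo:1", "dc:title", "Freakonomics"))
print(demo.query("dc:title"))  # [] : the triple is still in the buffer

demo.flush()
print(demo.query("dc:title"))  # [('demo:1', 'dc:title', 'Freakonomics')]
```

Flushing after every insert keeps queries current but pays the flush cost each time; flushing rarely is cheap but lets queries return stale results in between.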
Similarly, writes to Lucene are very expensive. Each time a triple is sent to
Lucene, the text index is recomputed and this can take a long time.
In Experiment 1 we ingested 5 objects. After each ingest, we ran a query against the triple store to retrieve the new object, forcing Mulgara to flush its triples first. Performance in this case is not good: each ingest/retrieve cycle takes approximately 35 seconds.
In Experiment 2 we ingested 5 objects but did not perform any retrievals.
Compared to Experiment 1, performance is greatly improved - each ingest takes
approximately 172ms, representing about a 200x speedup.
In Experiment 3 we repeated Experiment 1, with the only difference being that we did not force Mulgara to flush its triples. Performance was similar to Experiment 2. However, the retrieval results did not reflect the recent insertions (note the pidPrime=-1; it should be equal to the pid that was just inserted).
In Experiment 4 we executed a triplestore query which retrieved a subset of the repository objects. The query returned 85767 objects in about 7 seconds. This represents a retrieval rate of about 82 microseconds per object.
In Experiment 5 we performed 5 ingest+retrievals as in Experiment 1, but rather than using the Mulgara triple store to retrieve objects, we used the MySQL database. This method produced an average of 217ms per ingest/retrieval cycle, which is only slightly greater than the result of Experiment 2 (172ms per ingest), in which no retrievals were performed. This experiment confirms the hypothesis that when ingests are interspersed with retrievals, it is significantly more efficient to use the database than the triplestore for retrieval.
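The ratios implied by these measurements can be checked with a few lines of arithmetic. The input figures are the ones reported above (35 seconds, 172ms, 217ms, and 85767 objects in about 7 seconds); the variable names are ours, and the derived values are only as precise as those rounded inputs.

```python
# Timings reported in the experiments, converted to seconds.
triplestore_cycle_s = 35.0   # ingest + flushed triplestore retrieval
ingest_only_s = 0.172        # ingest with no retrieval
database_cycle_s = 0.217     # ingest + database retrieval

# Bulk triplestore query: objects returned and elapsed time.
bulk_objects = 85767
bulk_query_s = 7.0

speedup_without_retrieval = triplestore_cycle_s / ingest_only_s
speedup_with_database = triplestore_cycle_s / database_cycle_s
database_retrieval_overhead_ms = round((database_cycle_s - ingest_only_s) * 1000)
microseconds_per_object = bulk_query_s / bulk_objects * 1e6

print(f"Ingest only vs. flushed triplestore cycle: {speedup_without_retrieval:.0f}x")
print(f"Database cycle vs. flushed triplestore cycle: {speedup_with_database:.0f}x")
print(f"Added cost of a database retrieval: {database_retrieval_overhead_ms} ms")
print(f"Bulk triplestore retrieval: {microseconds_per_object:.1f} microseconds/object")
```

The figures suggest that the cost lies in the per-ingest flush rather than in triplestore reads as such: a single bulk query is cheap per object, while retrieving through the database adds roughly 45ms to each ingest.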
One important question is, how many objects can Fedora ingest before it starts to keel over? Will ingest and retrieval times be acceptable with many objects? This question has been partially answered by a study done by Kai Strnad and Matthias Razum (http://smartech.gatech.edu/handle/1853/28433). The authors ingested a very large number (14M) of objects into Fedora. They observed ingest rates of about 10 objects/second, a number which held constant throughout the ingest process. Unfortunately, the authors did not address the critical questions of retrieval times, or of various access patterns involving Mulgara, Lucene, etc.
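As a rough sense of scale, the reported figures imply the following total ingest time. This is back-of-the-envelope arithmetic on the numbers quoted above, assuming the rate stays constant and the ingest runs continuously; it is not a figure taken from the study itself.

```python
objects = 14_000_000        # objects ingested in the study
rate_per_second = 10        # approximate observed ingest rate

total_seconds = objects / rate_per_second
total_days = total_seconds / 86_400  # seconds per day

print(f"{total_seconds:,.0f} seconds, or about {total_days:.1f} days")
# 1,400,000 seconds, or about 16.2 days
```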
Fedora performance is highly dependent on the data access pattern. If data needs to be retrieved using non-DC predicates, "online" updates, in which updates are interspersed with retrievals, are not recommended. However, "batch" updates, in which a set of objects is ingested and then later retrieved, are feasible for non-DC predicates.
In contrast, object retrieval using DC predicates does not incur the same performance penalty as retrieval using non-DC predicates. That is, retrieval using DC predicates as the search criteria may be performed in either online or batch mode with acceptable performance. For this reason, it is recommended that "key" data such as IDs be represented using DC predicates. "Auxiliary" data, such as descriptions, may be represented using non-DC predicates.