Glacially slow indexing in 1.5

Description

From John Mark Ockerbloom at Penn, 8/15/12: We're running VIVO 1.5 in test on a virtual server. Recently we loaded a large batch of publication data, many records with abstracts, from PubMed and an internal database, using the RDF upload feature. The publication file was quite large (a 667 MB N3 file for ~200k publications; since we'd originally generated it before we installed 1.5, it includes precomputed inverse statements). Despite the size, the data went in relatively quickly. (I checked back a few hours after I uploaded the file and could find data entities from both the start and end of the file.) We gave the Tomcat Java processes a good amount of memory, which seems to have helped.
However, once we started building an index of this data, performance suffered badly. The indexer reported "Number of individuals to be indexed: 548159 by 10 worker threads", but the time per individual started at 22900 msec and went up from there. Our virtual server admin reported that the server was hitting the SAN disks very hard (enough to disrupt other virtual servers). When we finally shut things down entirely, it had indexed 32300 individuals at a rate of 36512 msec per individual, with a load average of over 14. Has anyone else seen behavior like this, and if so, what can be done to fix it? (Indexing the smaller datasets we loaded earlier went much more quickly.)
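
To put the reported rate in perspective, here is a rough back-of-the-envelope projection in Python. It uses only the figures quoted in the report. The report does not say whether "msec per individual" is measured per worker thread or as overall wall-clock time, so the sketch computes both readings; both are assumptions, as is the premise that the rate stays flat at its last reported value.

```python
# Figures quoted in the report above.
TOTAL_INDIVIDUALS = 548_159
INDEXED_AT_SHUTDOWN = 32_300
MS_PER_INDIVIDUAL = 36_512
WORKER_THREADS = 10

MS_PER_DAY = 24 * 60 * 60 * 1000


def projected_days(individuals: int, ms_each: float, threads: int = 1) -> float:
    """Days to index `individuals` at `ms_each` ms apiece, split evenly over `threads`."""
    return individuals * ms_each / threads / MS_PER_DAY


remaining = TOTAL_INDIVIDUALS - INDEXED_AT_SHUTDOWN

# Assumption A: the reported rate is overall wall-clock time per individual.
days_if_wall_clock = projected_days(remaining, MS_PER_INDIVIDUAL)

# Assumption B: the rate is per worker thread and all 10 threads run fully in parallel.
days_if_per_thread = projected_days(remaining, MS_PER_INDIVIDUAL, WORKER_THREADS)

print(f"Remaining individuals: {remaining}")
print(f"Wall-clock reading: about {days_if_wall_clock:.0f} days")
print(f"Per-thread reading: about {days_if_per_thread:.0f} days")
```

Under either reading, finishing the rebuild would take on the order of weeks to months even if the rate held steady, and the report says it was still rising.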

Status

Assignee

Brian Caruso

Reporter

Jon Corson-Rikert

Browser Version

None

Team

None

Time tracking

12h

Components

Fix versions

Affects versions

v1.5

Priority

Critical