Opened 9 years ago

Closed 8 years ago

Last modified 8 years ago

#772 closed enhancement (duplicate)

Parallelised import

Reported by: Twan Goosen
Owned by:
Priority: major
Milestone:
Component: VLO importer
Version:
Keywords:
Cc: Twan Goosen

Description

Experiments have indicated that most of the mapping and Solr import process (after initialisation) can be carried out in parallel threads, considerably reducing the total time a full import takes.

Implementation suggestion:

  • Create a queue that has a configurable number of worker threads at its disposal
  • The 'driver' thread processes the directories with metadata records and passes them to this queue, if need be in batches, with a callback to perform post-processing (such as committing to Solr or processing the hierarchy information)
  • When a thread becomes available, process the first record on the queue
  • When the queue is empty and the driver thread is done as well, perform any required global finalisation and exit
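The steps above could be sketched roughly as follows, using the standard java.util.concurrent utilities. This is only an illustration under assumptions: processRecord, runImport, the list-of-lists representation of directories and the counters are hypothetical placeholders, not part of the VLO importer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelImportSketch {

    /** Hypothetical stand-in for mapping one record and sending it to Solr. */
    static void processRecord(String record, AtomicInteger processed) {
        processed.incrementAndGet();
    }

    /**
     * Queues every record of every directory on a fixed pool of workers.
     * Returns {records processed, per-directory callbacks run}.
     */
    static int[] runImport(List<List<String>> directories, int workerCount) {
        ExecutorService pool = Executors.newFixedThreadPool(workerCount);
        AtomicInteger processed = new AtomicInteger();
        AtomicInteger callbacksRun = new AtomicInteger();
        List<CompletableFuture<Void>> batches = new ArrayList<>();
        try {
            // Driver: walk the directories and queue one task per record.
            for (List<String> directory : directories) {
                CompletableFuture<?>[] records = directory.stream()
                        .map(r -> CompletableFuture.runAsync(
                                () -> processRecord(r, processed), pool))
                        .toArray(CompletableFuture[]::new);
                // Per-batch callback, e.g. a Solr commit or hierarchy processing.
                batches.add(CompletableFuture.allOf(records)
                        .thenRun(callbacksRun::incrementAndGet));
            }
            // Wait until the queue is drained and every callback has run.
            CompletableFuture.allOf(batches.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        // Global finalisation (e.g. a last commit) would go here.
        return new int[] {processed.get(), callbacksRun.get()};
    }

    public static void main(String[] args) {
        int[] result = runImport(
                List.of(List.of("a1.cmdi", "a2.cmdi"), List.of("b1.cmdi")), 4);
        System.out.println(result[0] + " records, " + result[1] + " callbacks");
    }
}
```

The number of workers is the only tuning parameter here; the driver never blocks on individual records, only on the final join.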

Alternative: start a thread for each directory so that hierarchy post-processing can happen in the same thread, with a queue for directories so that at most N threads are active at any given time. The downside is that load balancing across threads will be less optimal (large directories may become a bottleneck; maybe give them priority?)
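A minimal sketch of this alternative, assuming the number of records per directory is known up front: one job per directory on a pool of at most N threads, with the largest directories submitted first so that they are less likely to form a long tail at the end. The names submissionOrder and importDirectories and the map of record counts are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class DirectoryJobSketch {

    /** Largest directories first; the order of equally sized ones is unspecified. */
    static List<String> submissionOrder(Map<String, Integer> recordCounts) {
        List<String> order = new ArrayList<>(recordCounts.keySet());
        order.sort(Comparator.comparingInt(
                (String dir) -> recordCounts.get(dir)).reversed());
        return order;
    }

    /** Runs one job per directory on at most maxThreads threads; returns jobs completed. */
    static int importDirectories(Map<String, Integer> recordCounts, int maxThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(maxThreads);
        AtomicInteger completed = new AtomicInteger();
        for (String directory : submissionOrder(recordCounts)) {
            pool.submit(() -> {
                // Process all records of this directory, then do its hierarchy
                // post-processing here, in the same thread (both omitted).
                completed.incrementAndGet();
            });
        }
        pool.shutdown(); // no new jobs; already queued jobs still run
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("import did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return completed.get();
    }
}
```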

Change History (10)

comment:1 Changed 9 years ago by DefaultCC Plugin

Cc: Twan Goosen added

comment:2 Changed 9 years ago by Oliver Schonefeld

The CMDI Validator uses such a worker approach: see CMDIValidator/trunk/cmdi-validator-core/src/main/java/eu/clarin/cmdi/validator/ThreadedCMDIValidatorProcessor.java. It's built on and slightly abuses the ExecutorService, but does so only to initialize the actual Validator (Worker) object once. Maybe this can serve as inspiration. (NB: When you call shutdown() on an Executor instance, the worker threads will get interrupted; I use this to cleanly shut down the workers.)

I would try to avoid spawning a thread per directory; rather, spawn a number of workers and try to reuse as many data structures as you can.
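One way to get a "set up the worker object once per thread and reuse it" effect with a plain ExecutorService is a ThreadLocal. This is a different technique from whatever ThreadedCMDIValidatorProcessor does internally, and the Worker class and process method below are hypothetical.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ReusableWorkerSketch {

    /** Hypothetical worker object that is expensive to set up. */
    static class Worker {
        Worker(AtomicInteger created) {
            created.incrementAndGet(); // count how often setup happens
        }

        void handle(String file) {
            // validate or map the file (omitted)
        }
    }

    /** Processes all files and returns how many Worker objects were created. */
    static int process(List<String> files, int threadCount) {
        AtomicInteger created = new AtomicInteger();
        // Each pool thread lazily creates its own Worker on first use and reuses it.
        ThreadLocal<Worker> worker = ThreadLocal.withInitial(() -> new Worker(created));
        ExecutorService pool = Executors.newFixedThreadPool(threadCount);
        for (String file : files) {
            pool.execute(() -> worker.get().handle(file));
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("processing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return created.get();
    }
}
```

However many files are queued, the number of Worker objects created never exceeds the number of pool threads.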

comment:3 Changed 9 years ago by Twan Goosen

Thanks for the pointer; I was indeed thinking of looking into the Java concurrency utilities for the implementation. And you're right about not spawning new threads for each directory. I guess that should have been 'job' (or the like) rather than 'thread'.

comment:4 Changed 9 years ago by Oliver Schonefeld

I guess sending a single file to a worker for processing will scale best. Something like a "job" object per directory (= data provider) is certainly needed.

I'm not entirely sure how the building of the relation tree is handled right now. I guess one could also build up the tree while handling single files:

  • pass the "job" object along with the files to the worker
  • worker can record the appropriate relation information to that object
    • carefully do locking/critical sections and try to keep it to a minimum
    • beware the size of the tree in memory; last resort: serializing parts to disk
  • after all files are processed, update the relation information
    • maybe, once the hierarchy is established, it can also be done in parallel
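The "job" object idea could look something like the sketch below. It makes several assumptions: relations are simplified to a parent id per file, ImportJob and processFiles are hypothetical names, and a concurrent map stands in for explicit locking to keep the critical section small.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RelationJobSketch {

    /** Hypothetical per-directory job that collects relation information. */
    static class ImportJob {
        private final Map<String, Set<String>> childrenByParent = new ConcurrentHashMap<>();

        /** Safe to call from several workers at once. */
        void recordRelation(String parent, String child) {
            childrenByParent
                    .computeIfAbsent(parent, k -> ConcurrentHashMap.newKeySet())
                    .add(child);
        }

        Map<String, Set<String>> relations() {
            return childrenByParent;
        }
    }

    /** Handles each file on a worker; parentByFile maps a file to its (assumed) parent id. */
    static ImportJob processFiles(Map<String, String> parentByFile, int workers) {
        ImportJob job = new ImportJob();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        parentByFile.forEach((file, parent) ->
                // The job object travels with the file to the worker.
                pool.execute(() -> job.recordRelation(parent, file)));
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("processing did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // All files are processed; the relation update step would start here.
        return job;
    }
}
```

This keeps the whole relation map in memory, so the concern about tree size raised above still applies.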

comment:5 Changed 9 years ago by Twan Goosen

Decided in the developer video conference to put this on the '3.5 or later' milestone

comment:6 Changed 8 years ago by Twan Goosen

Milestone: VLO-3.5 or later → VLO-4.0 or later

Milestone renamed

comment:7 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.0 or later → VLO-4.1 or later

Milestone renamed

comment:8 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.1 or later → VLO-4.1 (temp)

Move to temporary milestone (eventually migrate to https://github.com/clarin-eric/VLO/milestone/1)

comment:9 Changed 8 years ago by Twan Goosen

Resolution: duplicate
Status: new → closed

Migrated to GitHub as issue #29

comment:10 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.1 (temp)

Ticket retargeted after milestone deleted
