Opened 9 years ago

Closed 9 years ago

#739 closed defect (fixed)

Multiple MdProfile header elements break entire import

Reported by: Twan Goosen
Owned by:
Priority: major
Milestone:
Component: VLO importer
Version:
Keywords:
Cc: keesjan.vandelooij@mpi.nl, Twan Goosen

Description

On 2015-02-27 the VLO import finished unsuccessfully after processing only a portion of the harvested files.

The cause turned out to be a number of files (from mpi.nl, example attached) with an invalid CMDI header: each contained two MdProfile elements instead of one, e.g.:

   <Header>
      <MdCreator>Vis00003</MdCreator>
      <MdCreationDate>2013-04-03+01:00</MdCreationDate>
      <MdSelfLink>http://hdl.handle.net/1839/00-196A81B7-6396-4B66-BFCD-2A9CE9C88CF9</MdSelfLink>
      <MdProfile>clarin.eu:cr1:p_1366895758243</MdProfile>
      <MdCollectionDisplayName>TLA: DiscAn</MdCollectionDisplayName>
      <MdProfile>clarin.eu:cr1:p_1361876010653</MdProfile>
   </Header>

(the offending files have been fixed since)
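
Records like the one above could be detected before they reach Solr. The following is a minimal sketch using only the JDK's DOM parser; the class and method names are made up for illustration and are not part of the VLO importer. It counts the MdProfile elements that are direct children of a Header element, regardless of namespace.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the VLO code base.
final class MdProfileCheck {

    /** Returns the number of MdProfile elements directly below any Header element. */
    static int countMdProfiles(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed so getLocalName() is populated
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            int count = 0;
            // "*" matches Header elements in any namespace (or in none)
            NodeList headers = doc.getElementsByTagNameNS("*", "Header");
            for (int i = 0; i < headers.getLength(); i++) {
                NodeList children = headers.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node child = children.item(j);
                    if (child.getNodeType() == Node.ELEMENT_NODE
                            && "MdProfile".equals(child.getLocalName())) {
                        count++;
                    }
                }
            }
            return count;
        } catch (Exception e) {
            // Unparseable input is reported as an argument error in this sketch
            throw new IllegalArgumentException("Could not parse record", e);
        }
    }
}
```

A record for which countMdProfiles returns anything other than 1 could then be logged and skipped, or reduced to a single profile, before it is sent to the index.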

When the importer reached this file, it hit a SolrServerException that traces back to the error message

multiple values encountered for non multiValued field _componentProfile: [DiscAn_Case, DiscAn_TextCorpus]

on the Solr backend (larger log snippets from importer and solr instance below). This exception bubbled up and the import process was gracefully terminated.

The direct cause, as can be inferred from the error message, is the storing of multiple values (two actually) in a field that, according to the Solr schema, should have only one.

Even though the actual error was in the metadata record in this case, it would be good to make the importer robust against this kind of mistake. Solr server exceptions could possibly be caught further down the processing chain, but in some cases they are genuinely fatal and should halt the import. A more sophisticated solution would be to limit the number of values for a specific field according to the facet mapping definition, although this information is currently not present there.
Could Solr also be configured to be more robust (less strict) in cases like this?
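
To illustrate the first idea (catching rejections further down the chain while still letting fatal errors halt the import), here is a rough sketch. It deliberately does not use the SolrJ API: the Sender interface and RejectedException are hypothetical stand-ins for "send these documents to the index" and "the backend refused the data" (such as the Bad Request seen here). It also assumes that re-sending a document that was already accepted is harmless.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these types exist in the VLO importer or in SolrJ.
final class TolerantBatchSender {

    /** Stand-in for whatever pushes documents to the index. */
    interface Sender<D> {
        void send(List<D> docs) throws RejectedException;
    }

    /** Signals that the backend refused the data, as opposed to being unavailable. */
    static final class RejectedException extends Exception {
        RejectedException(String message) {
            super(message);
        }
    }

    /**
     * Sends the whole batch; if it is rejected, resends the documents one by one
     * and returns those that were rejected individually. Any other (runtime)
     * exception is treated as fatal and simply propagates to the caller.
     */
    static <D> List<D> sendTolerantly(List<D> batch, Sender<D> sender) {
        List<D> rejected = new ArrayList<>();
        try {
            sender.send(batch);
            return rejected; // whole batch accepted
        } catch (RejectedException wholeBatchRejected) {
            // fall through and isolate the offending document(s)
        }
        for (D doc : batch) {
            try {
                sender.send(Collections.singletonList(doc));
            } catch (RejectedException e) {
                rejected.add(doc);
            }
        }
        return rejected;
    }
}
```

The caller can log the returned documents and carry on with the next batch, so that a single bad record costs one document rather than the remainder of the import.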


LOG SNIPPETS

from vlo-importer.log:

2015-02-27 05:21:32,385 ERROR [org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer#handleError:399] - error
org.apache.solr.common.SolrException: Bad Request



request: http://catalog.clarin.eu:8077/solr/vlo/core0/update?wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
2015-02-27 05:21:35,257 INFO [eu.clarin.cmdi.vlo.importer.MetadataImporter#sendDocs:413] - Sending 128 docs to solr server. Total number of docs updated till now: 136832
2015-02-27 05:21:35,258 ERROR [eu.clarin.cmdi.vlo.importer.MetadataImporter#startImport:162] - error updating files:

org.apache.solr.client.solrj.SolrServerException: org.apache.solr.common.SolrException: Bad Request



request: http://catalog.clarin.eu:8077/solr/vlo/core0/update?wt=javabin&version=2
        at eu.clarin.cmdi.vlo.importer.MetadataImporter.sendDocs(MetadataImporter.java:417)
        at eu.clarin.cmdi.vlo.importer.MetadataImporter.updateDocument(MetadataImporter.java:359)
        at eu.clarin.cmdi.vlo.importer.MetadataImporter.processCmdi(MetadataImporter.java:291)
        at eu.clarin.cmdi.vlo.importer.MetadataImporter.startImport(MetadataImporter.java:146)
        at eu.clarin.cmdi.vlo.importer.MetadataImporter.main(MetadataImporter.java:514)
Caused by: org.apache.solr.common.SolrException: Bad Request



request: http://catalog.clarin.eu:8077/solr/vlo/core0/update?wt=javabin&version=2
        at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

from solr.log:

2015-02-27 05:21:32,062 ERROR [org.apache.solr.core.SolrCore#log:108] - org.apache.solr.common.SolrException: ERROR: [doc=http_58__47__47_hdl.handle.net_47_1839_47_00-196A81B7-6396-4B66-BFCD-2A9CE9C88CF9] multiple values encountered for non multiValued field _componentProfile: [DiscAn_Case, DiscAn_TextCorpus]
        at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:90)
        at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:77)
        at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:215)
        at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
        at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:595)
        at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
        at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
        at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
        at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
        at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
        at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
        at java.lang.Thread.run(Thread.java:745)

Attachments (1)

oai_www_mpi_nl_1839_00_196A81B7_6396_4B66_BFCD_2A9CE9C88CF9.xml (59.8 KB) - added by Twan Goosen 9 years ago.
offending CMDI record


Change History (3)

comment:1 Changed 9 years ago by DefaultCC Plugin

Cc: Twan Goosen added

Changed 9 years ago by Twan Goosen

offending CMDI record

comment:2 Changed 9 years ago by teckart@informatik.uni-leipzig.de

Resolution: fixed
Status: new → closed

Added the missing attribute allowMultipleValues ("false") to the field "_componentProfile" in facetConcepts.xml (r6086). The facet definition in facetConcepts.xml is now consistent with the Solr schema.
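
For illustration only, a flag such as allowMultipleValues could be applied along the following lines before a document is sent to Solr. This is a sketch under assumptions, not the importer's actual code: the class name, the map-based document representation, the choice to keep the first value, and the default of allowing multiple values for unlisted fields are all made up here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not taken from the VLO importer.
final class FacetValueLimiter {

    /**
     * Returns a copy of the document's fields in which every field flagged with
     * allowMultipleValues == false carries at most one value (the first one seen).
     * Fields without a flag are assumed to allow multiple values.
     */
    static Map<String, List<String>> limit(Map<String, List<String>> fields,
                                           Map<String, Boolean> allowMultipleValues) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            List<String> values = entry.getValue();
            boolean multiAllowed =
                    allowMultipleValues.getOrDefault(entry.getKey(), Boolean.TRUE);
            if (!multiAllowed && values.size() > 1) {
                // Keep only the first value; which value to keep is an arbitrary choice here
                result.put(entry.getKey(), Collections.singletonList(values.get(0)));
            } else {
                result.put(entry.getKey(), new ArrayList<>(values));
            }
        }
        return result;
    }
}
```

With the values from the error message above, a document carrying [DiscAn_Case, DiscAn_TextCorpus] in _componentProfile would end up with only DiscAn_Case in that field, which the Solr schema accepts.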
