Opened 9 years ago
Closed 9 years ago
#739 closed defect (fixed)
Multiple MdProfile header elements break entire import
Reported by: | Twan Goosen | Owned by: | |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | VLO importer | Version: | |
Keywords: | | Cc: | keesjan.vandelooij@mpi.nl, Twan Goosen |
Description
On 2015-02-27 the VLO import finished unsuccessfully after processing only a portion of the harvested files.
The cause turned out to be a number of files (from mpi.nl, example attached) with an erroneous CMDI header: each contained two MdProfile elements, e.g.:
    <Header>
      <MdCreator>Vis00003</MdCreator>
      <MdCreationDate>2013-04-03+01:00</MdCreationDate>
      <MdSelfLink>http://hdl.handle.net/1839/00-196A81B7-6396-4B66-BFCD-2A9CE9C88CF9</MdSelfLink>
      <MdProfile>clarin.eu:cr1:p_1366895758243</MdProfile>
      <MdCollectionDisplayName>TLA: DiscAn</MdCollectionDisplayName>
      <MdProfile>clarin.eu:cr1:p_1361876010653</MdProfile>
    </Header>
(the offending files have been fixed since)
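A check like the following could flag such records before they reach Solr. This is only a minimal sketch: the class HeaderCheck and its method are hypothetical and not part of the VLO importer, and it assumes the header elements are unprefixed, as in the example above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of the VLO importer.
class HeaderCheck {
    /**
     * Counts the elements with the given tag name in an XML string.
     * Assumes unprefixed element names (the parser is not namespace aware).
     */
    static int countElements(String xml, String tagName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (Exception e) {
            // Unparseable input is reported as a bad argument rather than a count
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

A record for which HeaderCheck.countElements(xml, "MdProfile") returns more than 1 could then be logged and skipped instead of being sent to Solr.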
When the importer reached this file, it hit a SolrServerException that traces back to the error message
multiple values encountered for non multiValued field _componentProfile: [DiscAn_Case, DiscAn_TextCorpus]
on the Solr backend (larger log snippets from importer and solr instance below). This exception bubbled up and the import process was gracefully terminated.
The direct cause, as can be inferred from the error message, is the storing of multiple values (two actually) in a field that, according to the Solr schema, should have only one.
Even though the actual error was in the metadata record in this case, it would be good to make the importer robust against this kind of mistake. Solr server exceptions could possibly be caught further down the processing chain, so that a single bad record does not abort the run, but in some cases these exceptions are actually fatal and should halt the import. A more sophisticated solution would limit the number of values for a specific field according to the facet mapping definition, although this information is currently not present there.
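The value-limiting idea could look roughly like this. It is a sketch under assumptions, not the importer's actual code: the class SingleValueGuard, its method, and the representation of a document as a map from field name to values are all hypothetical, and the set of single-valued fields would have to come from the facet mapping or the Solr schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the VLO importer.
class SingleValueGuard {
    /**
     * Returns a copy of the field map in which every field named in
     * singleValuedFields keeps only its first value. Each discarded value
     * is appended to 'dropped' as "field=value" so that it can be logged.
     */
    static Map<String, List<String>> enforce(Map<String, List<String>> fields,
                                             Set<String> singleValuedFields,
                                             List<String> dropped) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : fields.entrySet()) {
            String name = entry.getKey();
            List<String> values = entry.getValue();
            if (singleValuedFields.contains(name) && values.size() > 1) {
                // Record everything after the first value, then keep the first only
                for (String extra : values.subList(1, values.size())) {
                    dropped.add(name + "=" + extra);
                }
                result.put(name, new ArrayList<>(values.subList(0, 1)));
            } else {
                result.put(name, new ArrayList<>(values));
            }
        }
        return result;
    }
}
```

Keeping the first value is an arbitrary choice here. For the record in this ticket it would store DiscAn_Case and report DiscAn_TextCorpus as dropped, which keeps the import running but hides the problem in the source record unless the dropped values are logged.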
Perhaps Solr can also be configured to be more lenient in cases like this?
LOG SNIPPETS
from vlo-importer.log:
2015-02-27 05:21:32,385 ERROR [org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer#handleError:399] - error
org.apache.solr.common.SolrException: Bad Request
request: http://catalog.clarin.eu:8077/solr/vlo/core0/update?wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
2015-02-27 05:21:35,257 INFO [eu.clarin.cmdi.vlo.importer.MetadataImporter#sendDocs:413] - Sending 128 docs to solr server. Total number of docs updated till now: 136832
2015-02-27 05:21:35,258 ERROR [eu.clarin.cmdi.vlo.importer.MetadataImporter#startImport:162] - error updating files:
org.apache.solr.client.solrj.SolrServerException: org.apache.solr.common.SolrException: Bad Request
request: http://catalog.clarin.eu:8077/solr/vlo/core0/update?wt=javabin&version=2
    at eu.clarin.cmdi.vlo.importer.MetadataImporter.sendDocs(MetadataImporter.java:417)
    at eu.clarin.cmdi.vlo.importer.MetadataImporter.updateDocument(MetadataImporter.java:359)
    at eu.clarin.cmdi.vlo.importer.MetadataImporter.processCmdi(MetadataImporter.java:291)
    at eu.clarin.cmdi.vlo.importer.MetadataImporter.startImport(MetadataImporter.java:146)
    at eu.clarin.cmdi.vlo.importer.MetadataImporter.main(MetadataImporter.java:514)
Caused by: org.apache.solr.common.SolrException: Bad Request
request: http://catalog.clarin.eu:8077/solr/vlo/core0/update?wt=javabin&version=2
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer$Runner.run(ConcurrentUpdateSolrServer.java:240)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
from solr.log:
2015-02-27 05:21:32,062 ERROR [org.apache.solr.core.SolrCore#log:108] - org.apache.solr.common.SolrException: ERROR: [doc=http_58__47__47_hdl.handle.net_47_1839_47_00-196A81B7-6396-4B66-BFCD-2A9CE9C88CF9] multiple values encountered for non multiValued field _componentProfile: [DiscAn_Case, DiscAn_TextCorpus]
    at org.apache.solr.update.DocumentBuilder.toDocument(DocumentBuilder.java:90)
    at org.apache.solr.update.AddUpdateCommand.getLuceneDocument(AddUpdateCommand.java:77)
    at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:215)
    at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:69)
    at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:51)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:595)
    at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:435)
    at org.apache.solr.update.processor.LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:100)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:710)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:413)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:103)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:293)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:861)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:606)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
    at java.lang.Thread.run(Thread.java:745)
Attachments (1)
Change History (3)
comment:1 Changed 9 years ago by
Cc: | Twan Goosen added |
---|
Changed 9 years ago by
Attachment: | oai_www_mpi_nl_1839_00_196A81B7_6396_4B66_BFCD_2A9CE9C88CF9.xml added |
---|
comment:2 Changed 9 years ago by
Resolution: | → fixed |
---|---|
Status: | new → closed |
Added the missing attribute allowMultipleValues ("false") to the field "_componentProfile" in facetConcepts.xml (r6086). The facet definition in the facetConcepts file is now consistent with the Solr schema.
offending CMDI record