Opened 6 years ago

Last modified 5 years ago

#1052 assigned task

Do a full ResourceProxy link availability check

Reported by: matej.durco@oeaw.ac.at Owned by: can.yilmaz@oeaw.ac.at
Priority: major Milestone:
Component: MetadataCuration Version:
Keywords: Cc: wolfgang.sauer@oeaw.ac.at, Twan Goosen, menzo.windhouwer@di.huc.knaw.nl, Dieter Van Uytvanck

Description (last modified by matej.durco@oeaw.ac.at)

Actual availability/accessibility of the resources referenced in the CMDI records is a big issue. Many links return 404, lead to a login page, etc.

Even though link checking can in principle be automated, with around 6 million resource proxies in the 1.6 million CMDI records in the VLO this is not easily done.

The curation module actually has link checking built in, but it is deactivated for the compilation of collection scores because it would take forever.

However we could start a run of the curation module with link-checking activated and let it run "forever", so that we get a comprehensive overview even though it's not completely up to date.

Then there is the question of what to do with the results (ideally feed them somehow into the VLO -> availability facet, or at least make them part of the harvester dashboard), but this has to be decided once we have seen and investigated the actual numbers.

The first step would be to run this on a small sample and see whether the resulting information compiled by the curation module into the XML reports is sufficient for further processing and analysis.

Then we can start a full run and compile the results for internal investigation.

Wolfgang@ACDH will kick off this task and report back here.

Change History (45)

comment:1 Changed 6 years ago by matej.durco@oeaw.ac.at

Description: modified (diff)

comment:2 Changed 6 years ago by matej.durco@oeaw.ac.at

Owner: set to wolfgang.sauer@oeaw.ac.at
Status: new → assigned

comment:3 Changed 6 years ago by wolfgang.sauer@oeaw.ac.at

Owner: changed from wolfgang.sauer@oeaw.ac.at to can.yilmaz@oeaw.ac.at

comment:4 Changed 6 years ago by matej.durco@oeaw.ac.at

This feature is now implemented as part of the curation module.
Can, Wolfgang, please provide further details.

Formal specification / Documentation pending.

comment:5 Changed 6 years ago by wolfgang.sauer@oeaw.ac.at

ok - if Can doesn't have the time to write down a brief description of the current workflow (which was basically taken from Davor) before going on vacation, I will do it next week

comment:6 Changed 6 years ago by can.yilmaz@onb.ac.at

Core module link checking with database:
If the database boolean is set to true in the properties file, the core module will only check the database for URL validation. If not, it will check the links by itself (using the linkChecker module's checkLink method).

There are two collections in the MongoDB database (collections in MongoDB are equivalent to tables in SQL). linksChecked consists of the response details of checked links; linksToBeChecked consists of the URLs that haven't been checked yet. If the database flag is set to true, the core module connects to the database and checks whether the URL is already present. If it is, the stored result is added to the report; if not, the URL is added to the linksToBeChecked collection. The report has, among others, three fields: total number of links, total number of unique links and total number of checked links. The last one is used for all calculations (like the ratio of valid links: (total number of checked links - total number of broken links) / total number of checked links).
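
For illustration, a minimal control-flow sketch of this lookup-or-queue behaviour and of the ratio used in the report. LinkDatabase, CheckedLink and the method names are hypothetical stand-ins, not the module's actual API.

import java.util.Optional;

public class CoreModuleDbLookupSketch {

    interface LinkDatabase {
        Optional<CheckedLink> findCheckedResult(String url); // stored response details, if any
        void addToLinksToBeChecked(String url);              // queue the URL for the link checker
    }

    static class CheckedLink {
        int status;
        long durationMs;
    }

    private final LinkDatabase db;

    CoreModuleDbLookupSketch(LinkDatabase db) {
        this.db = db;
    }

    /** Returns the stored result if the URL was already checked, otherwise queues it. */
    Optional<CheckedLink> lookupOrQueue(String url) {
        Optional<CheckedLink> result = db.findCheckedResult(url);
        if (result.isEmpty()) {
            db.addToLinksToBeChecked(url); // will be picked up by the link checker later
        }
        return result;
    }

    /** Ratio of valid links = (checked - broken) / checked, as used in the report. */
    static double ratioOfValidLinks(long checkedLinks, long brokenLinks) {
        return checkedLinks == 0 ? 0.0 : (double) (checkedLinks - brokenLinks) / checkedLinks;
    }
}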

Linkchecker database interaction:
The Link Checker runs in an endless loop. First it starts to loop through all the links from the linksToBeChecked collection. It handles collections in a parallelized way: a thread is created for every collection, and inside that thread all the links are checked sequentially. This means that, theoretically, checking all URLs should take about as long as the biggest collection (the one with the most links). The Link Checker also deletes URLs from linksToBeChecked after checking them. After all threads are finished, it checks whether linksToBeChecked is empty (it could be that the core module added new links while the URL checking was going on). If it is not empty, it starts the whole process again. If it is empty, it copies all the URLs from linksChecked into linksToBeChecked, thus ensuring that the loop is always endless and linksChecked is always kept up to date. Whenever the Link Checker checks a link that was already in linksChecked, it replaces it, so that the information in the database is always the most recent.
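
As an illustration of this loop (not the actual implementation), a simplified sketch. LinkStore is a hypothetical abstraction over the linksToBeChecked / linksChecked collections, and checkLink stands for the linkChecker module's method described further below.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public abstract class LinkCheckerLoopSketch {

    interface LinkStore {
        List<String> collections();                        // one entry per provider/collection
        List<String> linksToBeChecked(String collection);  // queued URLs of that collection
        void moveToChecked(String collection, String url, int status, long durationMs);
        boolean linksToBeCheckedIsEmpty();
        void copyCheckedBackToQueue();                     // linksChecked -> linksToBeChecked
    }

    private final LinkStore store;

    protected LinkCheckerLoopSketch(LinkStore store) {
        this.store = store;
    }

    /** Actual HTTP check, e.g. the checkLink method described further below. */
    protected abstract int checkLink(String url);

    public void runForever() throws InterruptedException {
        while (true) {
            // one thread per collection; within a collection, links are checked sequentially
            ExecutorService pool = Executors.newCachedThreadPool();
            for (String collection : store.collections()) {
                pool.execute(() -> {
                    for (String url : store.linksToBeChecked(collection)) {
                        long start = System.currentTimeMillis();
                        int status = checkLink(url);
                        // replaces any older entry and removes the URL from the queue
                        store.moveToChecked(collection, url, status, System.currentTimeMillis() - start);
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);

            // the core module may have queued new links while the checking was running
            if (store.linksToBeCheckedIsEmpty()) {
                store.copyCheckedBackToQueue(); // keeps the loop endless and the results fresh
            }
        }
    }
}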

When an instance is given as input, the same database algorithm applies. That means that the first time the instance is validated, the total number of checked links will always be 0. Once the link checker has had the time to check those links, the next run of the same instance will deliver correct results. But the time the linkChecker needs to get to this instance's URLs is unpredictable (it could currently be busy checking a big collection's URLs).

HTTP linkchecker algorithm:
Apache Commons HttpClient is used for the checkLink method in the LinkChecker module. This is a recursive method that deals with redirects. Its parameters are as follows: checkLink(String url, int redirectFollowLevel, long durationPassed, String originalURL). First it executes a HEAD request; if the response is a redirect, the method calls itself with redirectFollowLevel+1. There is a redirectFollowLimit set in the properties file which this level can't exceed. If the HEAD fails, it executes a GET request with the same (recursive) redirect handling. If the GET request also fails, the link is deemed broken.
A successful response is defined as having a status code of 200 or 304. A redirect response is defined as having a status code of 301, 302, 303, 307 or 308. All other status codes are deemed failures. The timeout is also defined in the properties file.

PS: Apache Commons HttpClient handles redirects by itself by default. I have however disabled this because the code I inherited from Davor handled redirects itself. So if we have time to refactor, letting HttpClient handle redirects by itself would make the code cleaner and shorter.
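
For readers who prefer code, a minimal sketch of the HEAD-then-GET logic with manual redirect following, assuming the Apache HttpComponents HttpClient 4.x API (the module may use a different HttpClient version); the class name and the simplified signature are illustrative, not the actual module code.

import java.net.URI;

import org.apache.http.HttpResponse;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpLinkCheckerSketch {

    // assumed defaults, matching the properties listed below
    private static final int REDIRECT_FOLLOW_LIMIT = 5;
    private static final int TIMEOUT_MS = 5000;

    // automatic redirect handling is disabled; redirects are followed manually below
    private final CloseableHttpClient client = HttpClients.custom()
            .disableRedirectHandling()
            .setDefaultRequestConfig(RequestConfig.custom()
                    .setConnectTimeout(TIMEOUT_MS)
                    .setSocketTimeout(TIMEOUT_MS)
                    .build())
            .build();

    /** Returns the final status code; 200/304 count as success, everything else as broken. */
    public int checkLink(String url) throws Exception {
        int status = follow(new HttpHead(url), 0);
        if (!isSuccess(status)) {
            // HEAD failed: retry with GET, with the same recursive redirect handling
            status = follow(new HttpGet(url), 0);
        }
        return status;
    }

    private int follow(HttpUriRequest request, int redirectFollowLevel) throws Exception {
        HttpResponse response = client.execute(request);
        int status = response.getStatusLine().getStatusCode();
        String location = response.getFirstHeader("Location") != null
                ? response.getFirstHeader("Location").getValue() : null;
        EntityUtils.consumeQuietly(response.getEntity()); // release the connection

        if (isRedirect(status) && location != null && redirectFollowLevel < REDIRECT_FOLLOW_LIMIT) {
            HttpUriRequest next = request instanceof HttpHead
                    ? new HttpHead(URI.create(location))
                    : new HttpGet(URI.create(location));
            return follow(next, redirectFollowLevel + 1);
        }
        return status;
    }

    private boolean isSuccess(int status) {
        return status == 200 || status == 304;
    }

    private boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303 || status == 307 || status == 308;
    }
}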

Default values:
redirectFollowLimit=5

# in ms
timeout=5000

This is all implemented in the branch "linkChecker-issue23".

Last edited 6 years ago by can.yilmaz@onb.ac.at (previous) (diff)

comment:7 Changed 6 years ago by Twan Goosen

Thanks, this is very useful. Perhaps it can be turned into a wiki page (either here or some other place like the GitHub repository). And I think it would be good to summarise the most important functional aspects (ideally enumerated in terms of MUST, SHOULD, MAY [NOT]) at the top, so that it's easy to be informed about these without having to understand the implementation details, and also to evaluate the implementation as it evolves. I'm thinking mainly about the fact that the most recent status of a link can be requested, that there is a processing queue, and what responses are considered valid.

comment:8 Changed 6 years ago by matej.durco@oeaw.ac.at

I second Twan's remark.
Issue comments are not the place to document functionality.
Nevertheless, thanks Can for putting this down. It's a good start.
Maybe Wolfgang can take over and turn it into a wiki page along the lines of what Twan suggested.

comment:9 Changed 6 years ago by wolfgang.sauer@oeaw.ac.at

We should discuss how to treat the case where the link target is password-protected and the request is redirected to a portal page for login (see e.g. http://hdl.handle.net/10932/00-0372-5BAB-E74D-AC01-9).

Currently the linkchecker logs an error with the message that the link can't be checked. But we could achieve a higher quality of link checking if we either ask the data provider for some interface for link checking or enable the linkchecker to log in automatically. That would mean registering a user and storing its credentials in the configuration file of the linkchecker.

comment:10 in reply to:  9 Changed 6 years ago by Twan Goosen

Replying to wolfgang.sauer@…:

> We should discuss how to treat the case where the link target is password-protected and the request is redirected to a portal page for login (see e.g. http://hdl.handle.net/10932/00-0372-5BAB-E74D-AC01-9).

> Currently the linkchecker logs an error with the message that the link can't be checked.

If I go to this example URL, I get a chain of redirects 303 -> 302 -> 302 ... and then end up at a page that is served with status code 200. So what exactly is the error? Is it due to the redirect, or the content type of the last page?

> But we could achieve a higher quality of link checking if we either ask the data provider for some interface for link checking or enable the linkchecker to log in automatically. That would mean registering a user and storing its credentials in the configuration file of the linkchecker.

For some cases that might indeed be a nice solution, although it adds technical and administrative complexity. Also, I'm not sure if many providers would be willing to expose protected resources for this purpose. What could be done instead is to see if we can agree on a semantically less ambiguous response for such cases, i.e. an equivalent of the regular 401/403. Maybe there is an existing convention for this, but I'm not so well informed. In any case there doesn't seem to be an (official) www-authenticate header value, although to me that seems to be the most elegant solution.

comment:11 Changed 6 years ago by Dieter Van Uytvanck

Thanks all for the ideas on this. I just discussed the issue with Twan and we agreed that interpreting whether a URL is working or not should not be the responsibility of the Link Checker itself. Instead it should just store the following details:

  • final HTTP response code (after following up to 10 redirects)
  • number of redirects encountered
  • timeout: boolean
  • time to load: milliseconds
  • expected mime type (optional, provided as input parameter)
  • retrieved mime type (from Content-Type header, if available)
  • Content-Length (if available)

The application querying the Link Checker database could then infer if the link is working/useful.

Suggesting to use the www-authenticate header is a nice idea, but in practice it will be difficult to impose this. So we should start using the pure HTTP codes to at least find out the clear categories: 40? vs 20? vs other return codes.
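
As a simple illustration, one stored link-check result covering exactly the details listed above could look roughly like this (the class and field names are hypothetical; the actual database schema may differ):

public class CheckedLinkRecord {
    String url;
    int finalStatusCode;       // final HTTP response code (after following up to 10 redirects)
    int redirectCount;         // number of redirects encountered
    boolean timedOut;          // timeout flag
    long timeToLoadMs;         // time to load, in milliseconds
    String expectedMimeType;   // optional, provided as input parameter
    String retrievedMimeType;  // from the Content-Type header, if available
    Long contentLength;        // from the Content-Length header, if available (null otherwise)
}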

comment:12 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

Hi, it looks like the link checker is causing a fair bit of load on our repository server (archive.mpi.nl). The HEAD requests are coming in roughly one per second so that's not unreasonable one would think, but still we see an increased load that peaks every now and then, and during those peaks it causes a noticeable performance degradation.

Would it be possible to lower the frequency a bit, to see whether that helps?

As a general feature, it would be nice if the link checker would use a site's robots.txt and have a recognisable user agent, such that we could define the crawl delay ourselves.

comment:13 Changed 6 years ago by matej.durco@oeaw.ac.at

Hi, very sorry for that! It was done with the best intentions.

We stopped it for the moment and will make it more polite (less aggressive and less anonymous) before continuing.
We will inform you before we start again - ideally in a coordinated manner, so that you can observe online and give us feedback right away.

comment:14 Changed 6 years ago by matej.durco@oeaw.ac.at

On the bright side, we have already checked almost 4 million links, some 950,000 to go.

comment:15 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

Thanks Matej. We should also look into why this is a problem in the first place, since one HEAD request per second should be doable. But for now the load has immediately dropped to almost zero.

comment:16 Changed 6 years ago by matej.durco@oeaw.ac.at

Paul, regarding the crawl-delay directive in robots.txt:
Do I understand correctly that we would still need to implement it on our side in order to honour it? You can only set it to a value you deem feasible and hope that the crawler is nice enough to take it into consideration, right?

comment:17 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

Hi Matej, yes, it would be up to the crawler (the link checker in this case) to implement and respect it. Most legitimate web crawlers do respect it; for the ones that don't, you have to come up with other measures ;)

comment:18 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

One other point to consider then, if you implement such a feature, is how often you check for changes to the robots.txt file, as you don't want to do this before every request. This interval varies per crawler, from a couple of hours to a full day or so, but maybe in this case one hour would be good.
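
A possible way to combine both suggestions (honouring Crawl-delay, without re-fetching robots.txt before every request) is sketched below; everything here is hypothetical, including the one-hour refresh interval and the fallback delay, and a real implementation should also match the User-agent sections instead of taking the first Crawl-delay it finds.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CrawlDelaySketch {

    private static final long REFRESH_INTERVAL_MS = 60 * 60 * 1000L; // re-read roughly hourly
    private static final long DEFAULT_DELAY_MS = 1000L;              // assumed fallback

    private static final class CachedDelay {
        final long delayMs;
        final long fetchedAt;
        CachedDelay(long delayMs, long fetchedAt) {
            this.delayMs = delayMs;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Map<String, CachedDelay> cache = new ConcurrentHashMap<>();

    public long crawlDelayMs(String host) {
        CachedDelay cached = cache.get(host);
        long now = System.currentTimeMillis();
        if (cached == null || now - cached.fetchedAt > REFRESH_INTERVAL_MS) {
            cached = new CachedDelay(fetchCrawlDelayMs(host), now);
            cache.put(host, cached);
        }
        return cached.delayMs;
    }

    private long fetchCrawlDelayMs(String host) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL("https://" + host + "/robots.txt").openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                String trimmed = line.trim().toLowerCase();
                if (trimmed.startsWith("crawl-delay:")) {
                    // the directive is given in seconds
                    double seconds = Double.parseDouble(trimmed.substring("crawl-delay:".length()).trim());
                    return (long) (seconds * 1000);
                }
            }
        } catch (Exception e) {
            // no robots.txt, unreachable, or unparsable: fall back to the default
        }
        return DEFAULT_DELAY_MS;
    }
}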

comment:19 Changed 6 years ago by wolfgang.sauer@oeaw.ac.at

@Paul: we're going to restart our linkchecker now with a fixed delay of 3000 ms and the user agent »CLARIN Curation LinkChecker@acdh.oeaw.ac.at«.
In fact we decided to use a fixed delay for all links to be checked, since already with a crawl-delay of 10 seconds, as defined in your robots.txt, checking all links would take too long. For other servers we might even encounter a higher crawl-delay.
Please send us a message if you face any problems because of the linkchecker.

comment:20 Changed 6 years ago by menzo.windhouwer@di.huc.knaw.nl

The link checker doesn't have to be idle during that time, does it? It just can't check a URL from the same server during the server-specific delay. So if your queue has enough URLs from other servers, there is no need to wait ... but maybe I'm missing something.

comment:21 Changed 6 years ago by matej.durco@oeaw.ac.at

@Menzo:
Yeah, the crawl-delay is applied per server/provider/collection. Every collection is run in a separate thread in parallel.
But still there are repositories like TLA or BAS with more than 1 million links to be checked. These will take crawl-delay * |links| time.
It seems that all of the not yet checked URLs are actually those of the TLA.

1,000,000 / 86,400 = 11.57 days (with one check per second).
Thus with a 3-second delay it will be around 35 days.

comment:22 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

Hi guys, restarting it late Friday afternoon is not exactly what I had in mind when you said "coordinated manner"... but OK, the server seems to be coping fine now with the current pace, after an initial period of high load that lasted a couple of hours. I'll keep an eye on it and will let you know if there are any further problems.

comment:23 Changed 6 years ago by Twan Goosen

My €0.02: like Menzo, I think that a lot could be gained by altering the approach somewhat and adopting something like a global queue that is processed by a number of threads which take random items from the queue plus a bit of logic that guarantees a minimal amount of time between two requests to the same host or domain. In any case, I really think that pacing requests and other restrictions by the target hosts should always be respected. If these are deemed 'unreasonable' by a certain standard, better to just skip that collection.
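
To make the idea concrete, here is an illustrative sketch of per-host request spacing that a pool of workers pulling from a shared queue could use; the class and method names are hypothetical.

import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PerHostThrottleSketch {

    // next point in time at which a request to the given host is allowed
    private final Map<String, Long> nextAllowedRequest = new ConcurrentHashMap<>();

    /**
     * Returns true if a request to this URL's host may be sent now, and in that
     * case reserves the next slot; returns false if the host was hit too recently.
     */
    public boolean tryAcquire(String url, long crawlDelayMs) {
        String host = URI.create(url).getHost();
        boolean[] acquired = {false};
        nextAllowedRequest.compute(host, (h, next) -> {
            long now = System.currentTimeMillis();
            if (next == null || next <= now) {
                acquired[0] = true;
                return now + crawlDelayMs; // reserve the slot for this worker
            }
            return next;                   // too soon: the caller should pick another URL
        });
        return acquired[0];
    }
}

A worker would call tryAcquire(url, delay) before sending a request and, if it returns false, put the URL back (or take the next one from the queue) instead of sleeping.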

comment:24 Changed 6 years ago by wolfgang.sauer@oeaw.ac.at

Is it clear to everybody that we in fact have two different processes which are processing links? With the current setting, the report-generation process, which runs 3 times a week, extracts links (URLs) from the CMDI records and looks them up in the db. If a link is not in the db already, it writes the link to the db without any further processing. If it is in the db, it reads the status and linkchecker results from the db.
On the other hand we have the linkchecker, which runs permanently with one thread per collection. Some collections contain only a few files (and therefore a few links), while others contain hundreds of thousands of files. As a consequence some threads will finish much earlier than others, and at the end the whole application runs for a long time with only one single thread until all the links are loaded again to be rechecked.
However, in my view a random approach as suggested by Twan would not necessarily mean progress, since, if the number of threads is lower than the number of collections, one cycle of link checking would take even longer. If it is equal to or higher than the number of collections, it would not speed up the cycle but just cause more frequent checking of small collections.
Respecting the restrictions of each target host could be done by looking for the robots.txt at the first access of each host in a checking cycle - this could be coded in a couple of hours.
BUT, if we find more settings which are unreasonable for our case (and I think a delay of 10 secs is unreasonable for link checking), the results of our link checking process would not bring us much further, since then most of the links could not be checked and we would have to discuss individual settings with each data provider.
However, we can discuss this tomorrow in our video conference.

comment:25 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

I haven't really followed the history of this link-checker approach, but to me it sounds a bit like a last resort in case this information cannot be obtained otherwise. At least if the goal is to inform the user, rather than to validate the repository.

Each repository already knows which of their resources are open and which ones require authentication, so it should not be necessary to figure this out in such a rudimentary way, in my opinion. If we could agree on a format in which this information can be exchanged, it would save a lot of time and resources, in particular if the link checker is supposed to be running continuously (in which case I would probably end up blocking it at some point, since I'm still seeing periods of significantly increased load).

comment:26 Changed 6 years ago by matej.durco@oeaw.ac.at

@Paul:
I agree that restarting on Friday afternoon was not exactly nice and coordinated.
That was a short circuit on my side. Sorry for that.

The motivation for the whole link-checker was the ongoing work on metadata curation and the conclusion that the most important information for the user is actually the availability of the resources, combined with the anecdotal evidence that a Resource link oftentimes leads e.g. to a 404. Thus we proposed to systematically check all the (Resource) links we encounter in the CMDI files, to get first-hand information about their availability.
We had the URL-checking feature already built into the previous version of the curation module, but given the network latency and time-outs it turned out to be prohibitively slow for the curation module to do its work in a timely manner (all the other checks are done on the local copy of the CMDI record and are accordingly much faster).
Thus we decided to decouple this one curation aspect, let the link checking run asynchronously in a low-priority process, and let the curation module read upon request whatever information regarding the availability of the resources is available for a given CMDI record.
This information may therefore be incomplete or outdated, but we felt it's still much better than none.

Thus the motivation for the URL checking, as with the whole curation module, is twofold: a feedback loop to the providers and assessment information for the users.
I am sure TLA would provide us with any information we would request, but that isn't the case with the providers in general, leaving us "on our own" with our curation efforts.
Also I believe there is value in an "independent" third-party assessment of ... whatever assets, in this case CMDI files and the resources behind them.

Having said all that, the last thing we want to do is annoy data providers (and their administrators), so we will do all we can to find a consensual solution.
We indeed found the crawl-delay of 10 seconds a bit too restrictive given the ~1 million records of the TLA (it means it takes ~115 days to go through the repository once), but in the end there is no real hurry, and consent from the provider is much more important to us than getting through the list faster.
Thus the conclusion is: we will adjust to adhere to whatever robots.txt advises us, and maybe optionally approach the providers and ask them if there is any chance they would be willing to be hit more often.

BTW, you can see the first results of our checks in the curation module:
https://curate.acdh.oeaw.ac.at/#!Collections
If you look at the column Ratio Of Valid Links (and sort by it), you see that there seems to be value to our work, revealing that in most repositories there indeed are problems with the links. It is too early to blame anybody; we need to inspect the findings first. It may well be that there are errors on the side of the link-checker, and many of the "invalid" links may actually be 401 Unauthorized errors.
But with the link-checker we now have a tool to go about this systematically.

Again, a disclaimer: the link-checker is still in the development/fine-tuning stage, and it's a tricky thing to do.
So please be patient and give us feedback (as you just did, Paul, thanks for that), so that we can optimize.

comment:27 Changed 6 years ago by matej.durco@oeaw.ac.at

To the suggestions from Twan and Menzo:
I believe the cleanest way is as we have it now:
Each collection (provider) gets its own thread. That way we can fairly easily ensure that the provider-specific crawl-delay is respected.
As Wolfgang said, the "problem" is big repositories like TLA or BAS, and it makes no sense to "attack" these from multiple parallel threads. We would only get Paul more upset and end up being blacklisted. ;)

As a first measure we applied a global crawl-delay, but given the feedback, we will implement the robots.txt-respecting per-collection approach.

comment:28 in reply to:  27 Changed 6 years ago by Twan Goosen

Replying to matej.durco@…:

> To the suggestions from Twan and Menzo:
> I believe the cleanest way is as we have it now:
> Each collection (provider) gets its own thread. That way we can fairly easily ensure that the provider-specific crawl-delay is respected.
> As Wolfgang said, the "problem" is big repositories like TLA or BAS, and it makes no sense to "attack" these from multiple parallel threads. We would only get Paul more upset and end up being blacklisted. ;)

Admittedly I have not looked at the code yet. But from the descriptions Wolfgang and Can have given, I concluded that each thread would check a link, then pause for a second (or 3, or 10), then check the next link, and so on. It seems obvious to me that although that would be a straightforward implementation, it cannot be the most efficient way of going about it. So perhaps it's a misrepresentation of the actual logic. We will have a telco shortly, so we can discuss the details then :)

comment:29 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

@matej thanks for the explanation. I think there is certainly a value to the independent assessment, I'm just not too thrilled by the idea of it running continuously given the increased loads we're seeing (which I haven't actually found an explanation for so far). The crawl delay of 10 seconds was just an initial response to the problem before we had figured out which "bot" was the culprit. We're also getting crawled by various other bots, both known ones as well as more obscure ones, but so far they all seem to be respecting the robots.txt

So I think we could experiment a bit with the delay value, but then it would be good if the robots.txt was re-evaluated more frequently than once every run, like every couple of hours or so. It's also possible to have specific delay values per "bot", so the value for the link-checker does not have to be the same as the overall one.

Last edited 6 years ago by Paul.Trilsbeek@mpi.nl (previous) (diff)

comment:30 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

Sorry guys but we again had a period of very high load that interfered with the normal operation of the server. I have now blocked the link-checker based on the user agent string, so any results after 20:15 today (roughly) are not valid (you'll get a 403).

comment:31 Changed 6 years ago by matej.durco@oeaw.ac.at

Sorry to hear that, thank you for reporting.
I stopped the link-checker for now (around 23:00).
We will evaluate the situation in the coming days and keep you updated.
We won't start the link-checker again without prior notification and consent from this group
(or we may omit TLA).

Still I am wondering what could cause this high load with one direct resource-HEAD-request every three seconds...

But to show that we are not only bothering you with DoS attacks, here are some numbers gathered in the meantime:

For TLA:

linksChecked:     378607
linksToBeChecked: 867240

(Strangely, this does not add up to the No Of Unique Links stated by the curation report for the TLA, which is 1,811,515. This striking mismatch needs to be investigated further.)

And here are the counts, average and maximum response times per status code (again just for TLA):

HTTP status code   count    avg_responseTime   max_responseTime
0                  60848    0                  0
200                245167   1688.1             28544
401                11       1129.9             2162
403                19896    270.2              17559
404                19746    866.5              12665
500                32877    944.8              7607
503                62       785.4              5135

I guess the 403s have all been generated since the ban.

(For those interested, the underlying MongoDB query:)

db.linksChecked.aggregate([
  { $match: { "collection": "The_Language_Archive" } },
  { $group: { _id: { "status": "$status" },
              count: { $sum: 1 },
              avg_resp: { $avg: "$duration" },
              max_resp: { $max: "$duration" } } },
  { $sort: { "_id.status": 1 } }
]);

And here is a sample of two resources that took longer than 20 seconds to serve:
https://hdl.handle.net/1839/00-0000-0000-0000-5783-9 - 28544 ms
https://hdl.handle.net/1839/00-0000-0000-0008-1263-C - 26519 ms

Although I would suspect that it's not specifically these resources that are problematic; rather, they just happened to be queried when the server load was already too high, which is why serving them took so long.

comment:32 Changed 6 years ago by matej.durco@oeaw.ac.at

And while I am at it, adding to the long comment-queue some overall numbers:

linksChecked:     4 059 185
linksToBeChecked:   885 273 
HTTP status code   count     avg_responseTime   max_responseTime
200                3555457   441.63             29736
400                13        154                410
401                13154     450.11             7026
403                30410     664.51             17559
404                31750     761.84             12665
405                176       252.37             4793
406                1         173                173
410                4         493.25             1011
451                1         324                324
500                283404    378                42769
503                1511      200.44             5135
520                1         258                258

I didn't know there was a status code 451 - and that is actually a nice one:

HTTP 451 Unavailable For Legal Reasons

This is just to give you a first taste.
I think this is the direction for the kind of representation that should be integrated into the curation module.

comment:33 Changed 6 years ago by can.yilmaz@onb.ac.at

We are ready to start the linkchecker again with some updates to it and to the curation module. Among these updates is a collection-specific crawl delay. We have set the crawl-delay for TLA to 10 seconds. I would like to ask Paul whether it is OK to start it like that and whether he can remove the block on the linkchecker. Thanks.

comment:34 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

OK, let's give it a try. I have removed the block (commented it out, so I can reinstate it quickly if necessary ;)

Is this now a fixed crawl delay, or are you reading the robots.txt once in a while? Because if we see no performance issues with 10 seconds, we could try making it a bit smaller.

comment:35 Changed 6 years ago by can.yilmaz@onb.ac.at

The crawl delay is fixed and for now only set for TLA. If you see no performance issues, we can try to lower the delay later. I will start it tomorrow when I come in to work so we don't leave it unmonitored through the whole night. Please inform us as soon as possible if there are any problems tomorrow.

comment:36 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

Hi, have you started it yet? I'm not seeing any request with the link checker user agent so far.

comment:37 Changed 6 years ago by can.yilmaz@onb.ac.at

Hello Paul, sorry for not informing you. We haven't started yet, because there were some complications with the system. I think it is a good idea to postpone it to Monday.

Last edited 6 years ago by can.yilmaz@onb.ac.at (previous) (diff)

comment:38 Changed 6 years ago by André Moreira

Cc: menzo.windhouwer@di.huc.knaw.nl added; Menzo Windhouwer removed

comment:39 Changed 6 years ago by Paul.Trilsbeek@mpi.nl

OK, no problem. Have a good weekend.

comment:40 Changed 6 years ago by wolfgang.sauer@oeaw.ac.at

We're going to start the linkchecker again right now

comment:41 Changed 6 years ago by can.yilmaz@onb.ac.at

We have discovered a bug which limited the URLs being checked and resulted in no requests being sent to TLA this week. It is now fixed and requests should be coming soon. I'm sorry that it is a Friday again, but I guess Paul can just block it like last time if there are any problems.

comment:42 Changed 5 years ago by matej.durco@oeaw.ac.at

We have another DoS-complaint for the link-checker.

This time it's Europeana.
The message has been relayed by our computing centre and is from
https://pro.europeana.eu/person/ash-marriott

They state:
linkchecker.acdh.oeaw.ac.at is making 3 requests/second, and they ask us to reduce it to 30/minute.

Can, please make a corresponding custom configuration for Europeana.

Also, I wonder how it comes to 3/second; I thought we had set the default to 1/s?

Best regards,
Sebastian Semler

comment:43 Changed 5 years ago by wolfgang.sauer@oeaw.ac.at

Set a delay of 2000 ms for the following collections and restarted the linkchecker:

  • 08804_Ag_EU_ETravel_DebBooks
  • 2022411_Ag_RO_Elocal_audioinb
  • 9200366_Ag_EU_TEL_a0641_Newspapers_Slovenia
  • 92076_Ag_EU_TEL_a0497_DutchBooksOnline
  • 2021006_Ag_FI_NDL_ephemera_tb
  • 9200301_Ag_EU_TEL_a0611_Newspapers_Finland
  • 9200384_Ag_EU_TEL_a0613_Newspapers_ONB
  • 92099_Ag_EU_TEL_a1080_Europeana_Regia_France
  • 2022402_Ag_RO_Elocal_arhivele
  • 9200360_Ag_EU_TEL_a0639_Newspapers_Luxembourg
  • 92068_Ag_Slovenia_ETravel

comment:44 Changed 5 years ago by Twan Goosen

Can this only be configured at the level of the collection, not the OAI endpoint? FYI: we will for sure change the set of collections we harvest from Europeana, so this problem can be expected to pop up again once we do, unless it can be configured in a more targeted way.

comment:45 Changed 5 years ago by wolfgang.sauer@oeaw.ac.at

For the moment, yes. But we can handle the Europeana case separately in the future with a general crawl delay for all included collections. This might also be advisable for the report generation.

I will discuss this tomorrow with Can when we're both in the office.
