Opened 8 years ago

Closed 8 years ago

#835 closed defect (worksforme)

Generate sitemap

Reported by: Twan Goosen
Owned by: Twan Goosen
Priority: major
Milestone: VLO-3.4
Component: VLO web app
Version:
Keywords:
Cc: teckart@informatik.uni-leipzig.de, go.sugimoto@oeaw.ac.at, Kai Zimmer

Description

Create a sitemap that provides links to all record pages for Google and other search engines. There are two options: either generate a static sitemap on (or after) import, or generate (parts of) the sitemap on the fly at request time. In both cases, one approach would be to create a sitemap per collection or per provider and link to these from a sitemap index file.

Some useful information can be found at:

Change History (11)

comment:1 Changed 8 years ago by DefaultCC Plugin

Cc: teckart@informatik.uni-leipzig.de added

comment:2 Changed 8 years ago by go.sugimoto@oeaw.ac.at

As of 6th of January, the current state of Google indexing:

  • Only c. 37,000 records have been indexed, while we now have c. 870,000 in production. Indexing started on September 6th 2015, in case you happen to remember what you configured around that date. There is no doubt that we should create sitemaps and submit them as soon as possible to increase discoverability and attract more traffic.
  • Test sitemaps have been created by ACDH, with one URL per record. A sitemap index links to dozens of sitemaps, each of which lists up to 50,000 URLs (Google's limit per sitemap).
  • If time permits, we might also want to discuss changing the record URIs/URLs in the long run.
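
The layout described above (one URL per record, at most 50,000 URLs per sitemap, all sitemaps linked from an index) can be sketched roughly as follows. This is only an illustration, not the ACDH tool: the function names and the `vlo.example.org` URLs are made up for the example, and only the standard library is used.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_SITEMAP = 50_000  # per-file limit of the sitemap protocol


def chunk(urls, size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive slices of at most `size` URLs."""
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_sitemap(urls):
    """Return a <urlset> document with one <url><loc> entry per record URL."""
    root = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        ET.SubElement(ET.SubElement(root, "url"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


def build_index(sitemap_urls):
    """Return a <sitemapindex> document linking to the individual sitemaps."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        ET.SubElement(ET.SubElement(root, "sitemap"), "loc").text = url
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical record URLs; the real VLO URL scheme may differ.
    records = [f"https://vlo.example.org/record/{i}" for i in range(120_000)]
    sitemaps = [build_sitemap(part) for part in chunk(records)]
    index = build_index(
        f"https://vlo.example.org/sitemap-{n}.xml" for n in range(len(sitemaps))
    )
    print(len(sitemaps), "sitemaps referenced from the index")
```

Grouping by collection or provider instead of by fixed-size slices would only change how the URL list is partitioned before `build_sitemap` is called; the 50,000 cap would still apply to each file.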

comment:3 Changed 8 years ago by go.sugimoto@oeaw.ac.at

Cc: go.sugimoto@oeaw.ac.at added

comment:4 Changed 8 years ago by Twan Goosen

Davor wrote a sitemap generation tool for the VLO, currently available at https://github.com/acdh-oeaw/vlo-sitemap-gen (to be integrated with the VLO code base)

comment:5 Changed 8 years ago by Twan Goosen

Milestone: VLO-3.5 or later → VLO-4.0 or later

Milestone renamed

comment:6 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.0 or later → VLO-4.1 or later

Milestone renamed

comment:7 Changed 8 years ago by Twan Goosen

Milestone: VLO-4.1 or later → VLO-3.4
Resolution: fixed
Status: new → closed

Available and in use as of VLO 3.4.0

comment:8 Changed 8 years ago by Kai Zimmer

Cc: Kai Zimmer added
Resolution: fixed
Status: closed → reopened

comment:9 Changed 8 years ago by Kai Zimmer

As of 3rd of June 2016, when searching for 'site:vlo.clarin.eu', Google states that it has indexed 44,200 documents, but when I go to the last page of results (including all duplicates), only 54 (!) documents are available. It seems Google doesn't find the content attractive...

comment:10 Changed 8 years ago by Kai Zimmer

some suggestions:

add to robots.txt
Disallow: /wicket
Disallow: /*jsessionid
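
To see which paths these two rules would cover, here is a rough sketch of wildcard matching as Google documents it for robots.txt (`*` matches any run of characters, a trailing `$` anchors the end of the URL). It ignores `Allow` rules, rule precedence and user-agent groups, and the sample paths are invented for illustration.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, disallow_patterns):
    """True if any pattern matches the path from its first character."""
    return any(robots_pattern_to_regex(p).match(path) for p in disallow_patterns)


RULES = ["/wicket", "/*jsessionid"]

if __name__ == "__main__":
    # Hypothetical paths, not taken from the actual VLO.
    for path in ["/wicket/resource/app.css",
                 "/record;jsessionid=ABC123?docId=1",
                 "/record?docId=1"]:
        print(path, "->", "blocked" if is_disallowed(path, RULES) else "allowed")
```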

for sitemaps

  • focus on one document type (index page OR cmdi) to lower the number of total URLs
  • add a changefreq element (probably monthly)
  • resubmit the new sitemap via Google Search console
Last edited 8 years ago by Kai Zimmer
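
For reference, a sitemap entry carrying a change frequency could be produced along these lines. This is a minimal sketch with a made-up helper name and URL; in the sitemap protocol `changefreq` is a child element of `url`, and only the values listed below are valid.

```python
from xml.etree import ElementTree as ET

# Values permitted for <changefreq> by the sitemap protocol.
VALID_CHANGEFREQ = {"always", "hourly", "daily", "weekly",
                    "monthly", "yearly", "never"}


def url_entry(loc, changefreq="monthly"):
    """Build a <url> element with <loc> and <changefreq> children."""
    if changefreq not in VALID_CHANGEFREQ:
        raise ValueError(f"unsupported changefreq: {changefreq!r}")
    url = ET.Element("url")
    ET.SubElement(url, "loc").text = loc
    ET.SubElement(url, "changefreq").text = changefreq
    return url


if __name__ == "__main__":
    # Hypothetical record URL.
    entry = url_entry("https://vlo.example.org/record/1")
    print(ET.tostring(entry, encoding="unicode"))
```

Search engines treat this value as a hint rather than a command, so it may not change how often records are actually crawled.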

comment:11 Changed 8 years ago by Twan Goosen

Resolution: worksforme
Status: reopened → closed

Thanks Kai, for your analysis and suggestions. Indeed, there seem to be issues with the way Google sees or understands the VLO and our previous attempts to tackle these have not been very successful. The latest release of the VLO at least produces proper 404s for records that do not exist anymore, which should lead to a cleaner index but not necessarily more results.

I have some hopes that the fix described in GitHub issue #4 will make things more attractive for Google.

I'm not sure whether disallowing '/wicket' and '/*jsessionid' will actually have the desired effect. I will remove the parameter logic that I configured via Webmaster Tools, as it probably doesn't do much good either.

The sitemap gets re-read and processed by Google frequently, as I have been able to confirm from the data in Webmaster Tools. Google does appear to be quite selective in the number of records it picks up, however. It might help to split the sitemap into smaller files, although there is no indication or documentation suggesting that this will actually help.

I'm closing this ticket as there are no concrete actions remaining other than those mentioned here and in the GitHub ticket.
