Opened 8 years ago
Closed 8 years ago
#835 closed defect (worksforme)
Generate sitemap
Reported by: | Twan Goosen | Owned by: | Twan Goosen |
---|---|---|---|
Priority: | major | Milestone: | VLO-3.4 |
Component: | VLO web app | Version: | |
Keywords: | Cc: | teckart@informatik.uni-leipzig.de, go.sugimoto@oeaw.ac.at, Kai Zimmer |
Description
Create a sitemap that provides links to all record pages for Google and other search engines. There are two options: either generate the sitemap statically on (or after) import, or generate (parts of) the sitemap on the fly at request time. In both cases, one possibility would be to create a sitemap per collection or per provider and link to these from a sitemap index file.
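A minimal sketch of the "sitemap per provider plus sitemap index" idea, using only the Python standard library. The function names, the `sitemap-<provider>.xml` file naming and the example URLs are assumptions for illustration, not the VLO's actual layout; provider names are assumed to be safe to use in file names.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(page_urls):
    """Serialise one sitemap file (<urlset>) listing the given page URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for page_url in page_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page_url
    return ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Serialise a sitemap index (<sitemapindex>) pointing at the sitemap files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for sitemap_url in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = sitemap_url
    return ET.tostring(index, encoding="unicode")


def sitemaps_per_provider(records, base_url):
    """Group (provider, record_url) pairs into one sitemap per provider.

    Returns the index XML and a dict mapping (hypothetical) file names
    to the XML of each per-provider sitemap.
    """
    by_provider = defaultdict(list)
    for provider, record_url in records:
        by_provider[provider].append(record_url)
    files = {
        f"sitemap-{provider}.xml": build_urlset(urls)
        for provider, urls in sorted(by_provider.items())
    }
    index_xml = build_sitemap_index(f"{base_url}/{name}" for name in files)
    return index_xml, files
```

The same functions could be called once after import (static variant) or from a request handler (on-the-fly variant); only the caller changes.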
Some useful information can be found at:
Change History (11)
comment:1 Changed 8 years ago by
Cc: | teckart@informatik.uni-leipzig.de added |
---|
comment:2 Changed 8 years ago by
comment:3 Changed 8 years ago by
Cc: | go.sugimoto@oeaw.ac.at added |
---|
comment:4 Changed 8 years ago by
Davor wrote a sitemap generation tool for the VLO, currently available at https://github.com/acdh-oeaw/vlo-sitemap-gen (to be integrated with the VLO code base)
comment:7 Changed 8 years ago by
Milestone: | VLO-4.1 or later → VLO-3.4 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Available and in use as of VLO 3.4.0
comment:8 Changed 8 years ago by
Cc: | Kai Zimmer added |
---|---|
Resolution: | fixed |
Status: | closed → reopened |
comment:9 Changed 8 years ago by
As of 3 June 2016, a Google search for 'site:vlo.clarin.eu' states that 44,200 documents are indexed, but when I go to the last page of results (including all duplicates), only 54 (!) documents are actually available. It seems Google doesn't find the content attractive...
comment:10 Changed 8 years ago by
Some suggestions:
Add to robots.txt:
Disallow: /wicket
Disallow: /*jsessionid
For the sitemaps:
- focus on one document type (index page OR CMDI) to lower the total number of URLs
- add a changefreq element (probably monthly)
- resubmit the new sitemap via Google Search Console
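To see which paths the two proposed Disallow rules would cover, here is a rough sketch of wildcard-style robots.txt matching, where `*` matches any run of characters and a trailing `$` anchors the end of the path. It deliberately ignores Allow rules and longest-match precedence, and the example paths are made up, so treat it as an approximation rather than a statement of how any particular crawler behaves.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors the
    pattern to the end of the path. Without '$' the pattern is a prefix match.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path, disallow_patterns):
    """Return True if any Disallow pattern matches the start of the path."""
    return any(
        robots_pattern_to_regex(pattern).match(path) is not None
        for pattern in disallow_patterns
    )


# The two rules suggested above.
RULES = ["/wicket", "/*jsessionid"]

# Hypothetical paths, for illustration only.
for path in ["/wicket/resource/x.js", "/record;jsessionid=ABC123", "/record?docId=1"]:
    print(path, is_disallowed(path, RULES))
```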
comment:11 Changed 8 years ago by
Resolution: | → worksforme |
---|---|
Status: | reopened → closed |
Thanks Kai, for your analysis and suggestions. Indeed, there seem to be issues with the way Google sees or understands the VLO and our previous attempts to tackle these have not been very successful. The latest release of the VLO at least produces proper 404s for records that do not exist anymore, which should lead to a cleaner index but not necessarily more results.
I have some hopes that the fix described in GitHub issue #4 will make things more attractive for Google.
I'm not sure whether disallowing '/wicket' and '/*jsessionid' will actually have the desired effect. I will remove the parameter logic that I configured via webmaster tools, as it probably doesn't do much good either.
The sitemap gets re-read and processed by Google frequently, as I have been able to confirm based on the data in the webmaster tools. Google does appear to be quite selective in the number of records it picks up, however. Splitting the sitemap up into smaller parts might help, although there is no indication or documentation that suggests it actually will.
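If splitting were tried, the chunking itself is simple. The sketch below cuts a flat list of record URLs into groups no larger than a given size; the default of 50,000 is the per-file URL limit from the sitemaps.org protocol, and the function name is an assumption. Each chunk would then be written out as its own sitemap file and listed in the sitemap index.

```python
from itertools import islice

# Maximum number of URLs per sitemap file according to the sitemaps.org protocol.
MAX_URLS_PER_SITEMAP = 50_000


def chunk_urls(urls, chunk_size=MAX_URLS_PER_SITEMAP):
    """Yield consecutive lists of at most chunk_size URLs from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(urls)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

A smaller `chunk_size` gives smaller sitemap files; whether that changes how many records Google picks up is exactly the open question above.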
I'm closing this ticket as there are no concrete actions remaining other than those mentioned here and in the GitHub ticket.
As of 6th of January, the current state of Google indexing: