Search optimisation of our Data Catalogue

Jeannine Beeken and Veerle Van den Eynden of the UK Data Service report on enhancements to, and optimisation of, the search functionality of the data catalogue.

Since the launch of our new content and access management system and data catalogue a year ago, we have optimised the search functionality of the data catalogue and how search results are retrieved, displayed and sorted. This was informed by observing and testing how users actually search nowadays.

Here we describe the enhancements made and explain the logic behind the search functionality and sorting order, so that users can search for and find data efficiently.

Visit the data catalogue

The data catalogue contains records for individual datasets, as well as for series of data collections that result from longitudinal or repeated surveys, such as the Labour Force Survey, Family Resources Survey, etc.

The default search when visiting the catalogue is for individual datasets (called studies) and the default display and sorting order of data is by ‘Most recently released’, displaying all studies in our collection.

Searching for data (studies)

When searching for specific words, the sorting order of search results automatically changes to ‘Relevance’. Other options for sorting the search results are ‘Title (A-Z)’ and ‘Title (Z-A)’.

Searches are case-insensitive, so searching for ‘IMF’ produces the same results as searching for ‘imf’ or ‘Imf’.

A search for ‘energy prices’ looks both for the phrase “energy prices” (the words are adjacent and in that order) and for ‘energy AND prices’ (the words are not adjacent, and can be in any order). The results list displays all studies for which the search terms are found at least once in the searched fields of the metadata record.
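The two kinds of match, together with the case-insensitivity mentioned earlier, can be illustrated with a minimal Python sketch. This is not the catalogue's actual implementation; the function names `tokenize`, `contains_phrase`, `contains_all` and `matches` are hypothetical and chosen only for illustration.

```python
import re

def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def contains_phrase(tokens: list[str], terms: list[str]) -> bool:
    """True if the terms occur adjacent to each other and in the given order."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))

def contains_all(tokens: list[str], terms: list[str]) -> bool:
    """True if every term occurs somewhere in the text, in any order (AND)."""
    return set(terms) <= set(tokens)

def matches(query: str, field_text: str) -> bool:
    """A field matches on a phrase hit or on an AND hit."""
    terms = tokenize(query)
    tokens = tokenize(field_text)
    return contains_phrase(tokens, terms) or contains_all(tokens, terms)

print(matches("Energy Prices", "Domestic energy prices survey"))
print(matches("energy prices", "Prices of household energy"))
print(matches("energy prices", "Energy consumption"))
```

In this sketch a phrase hit always implies an AND hit, so keeping the two checks separate only matters if a search engine wants to rank phrase hits differently from scattered hits.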

When searching for well-known study abbreviations and acronyms, for example QLFS, CLOSER, GBHD or WAS, the system searches for both the abbreviation/acronym and the full term of the study or dataset.
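One plausible way to picture this is a lookup table that adds the full term to the query whenever a known abbreviation is entered. The sketch below assumes such a table; the name `ACRONYMS`, its two entries and the function `expand_query` are illustrative and do not describe how the catalogue stores its abbreviations.

```python
# Illustrative lookup table (an assumption, not the catalogue's real list).
ACRONYMS = {
    "closer": "cohort and longitudinal studies enhancement resources",
    "qlfs": "quarterly labour force survey",
}

def expand_query(query: str) -> list[str]:
    """Return the query itself plus the full term if it is a known acronym."""
    key = query.strip().lower()
    variants = [key]
    if key in ACRONYMS:
        variants.append(ACRONYMS[key])
    return variants

print(expand_query("CLOSER"))
print(expand_query("energy"))
```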

Figure 1 shows how the search term ‘CLOSER’ finds datasets resulting from the Cohort and Longitudinal Studies Enhancement Resources (CLOSER) project.

Figure 1. Search results for ‘CLOSER’

Searches for singular or plural terms and for gerunds yield the same number of results (for example, ‘tax’ or ‘taxes’; ‘nurse’, ‘nurses’ and ‘nursing’), but the display order may differ. This is thanks to an English stemmer in the search functionality. Ideally, a search on the stem ‘cat’ would also identify results containing ‘cats’, ‘catlike’ and ‘catty’.

The sorting order of the results will correspond with the specific search term. When searching for ‘tax’, the results containing the singular form will be higher up in the ranking than the results with the plural form. When searching for ‘taxes’, the results containing the plural form will appear first.
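The combined effect of stemming and of favouring the exact form typed can be sketched as follows. The suffix rules here are a deliberately crude toy, not the English stemmer the catalogue uses, and the scores 2 and 1 are arbitrary values chosen for illustration.

```python
SUFFIXES = ("ing", "es", "s", "e")  # toy rules, checked in this order

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def form_score(query: str, title: str) -> int:
    """2 if the exact query form occurs, 1 if only the stem matches, else 0."""
    words = title.lower().split()
    if query.lower() in words:
        return 2
    if stem(query) in {stem(w) for w in words}:
        return 1
    return 0

def rank(query: str, titles: list[str]) -> list[str]:
    """Keep every stem match, listing exact-form matches first."""
    hits = [t for t in titles if form_score(query, t) > 0]
    return sorted(hits, key=lambda t: form_score(query, t), reverse=True)

titles = ["Local taxes survey", "Income tax records", "Housing costs"]
print(rank("tax", titles))
print(rank("taxes", titles))
```

Both queries return the same two titles, only in a different order, which mirrors the behaviour described above.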

Refining with filters

Five filters are available to refine the search results further: date, topic, data type, access category and country. The ‘Date from – Date to’ filter defaults to the range from ‘440’ (the year of our earliest dataset) to the current year, and can be modified by the user.
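As a rough illustration, the five filters and the default date range could be modelled like this. The class `Filters`, the function `passes` and the record keys (`year`, `topic`, `data_type`, `access`, `country`) are hypothetical names, and treating each study as having a single year is a simplifying assumption.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional

EARLIEST_YEAR = 440  # default 'Date from' value mentioned in the text

@dataclass
class Filters:
    date_from: int = EARLIEST_YEAR
    date_to: int = field(default_factory=lambda: date.today().year)
    topic: Optional[str] = None
    data_type: Optional[str] = None
    access: Optional[str] = None
    country: Optional[str] = None

def passes(study: dict, f: Filters) -> bool:
    """True if the study falls in the date range and matches every set filter."""
    if not (f.date_from <= study["year"] <= f.date_to):
        return False
    for key in ("topic", "data_type", "access", "country"):
        wanted = getattr(f, key)
        if wanted is not None and study.get(key) != wanted:
            return False
    return True

study = {"year": 2009, "topic": "Economics", "country": "United Kingdom"}
print(passes(study, Filters()))                # defaults let everything through
print(passes(study, Filters(date_from=2010)))  # narrowed date range
print(passes(study, Filters(country="France")))
```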

Metadata fields that are searched

Searches are carried out across different metadata fields in the metadata records of the dataset (Table 1). Search results will include all datasets for which the search term is found in at least one of the searched metadata fields of the dataset. Some metadata fields are given a boosting weight (indicated in brackets below) to facilitate the sorting of results by relevance.

Title (50)	Country
Study number (15)	Geographical coverage
Abstract (10)	Spatial unit
Alternative title (10)	Town / village
Topic (10)	Other geography
Primary investigator (5)	Sampling procedure
Keyword (2)	Population
Data collector	Time period
Depositor	Time dimension
Sponsor	Kind of data
Grant number	Data type
Data producer	Language of study description
Series number	Language of study documentation
Subtitle	Data access tool
Type of key dataset

Table 1. Metadata fields that are searched (with boost weight) for relevance ranking

What relevance means

Each search result gets a relevance score based on various factors. The result with the highest score comes first in the relevance ranking of the search results.

The relevance score is determined by the following:

The number of hits of a search term in all the metadata elements that are searched in each dataset record. A study record where the search term is found ten times scores higher than a study where the search term is found five times.
The relative weight of the specific metadata fields where the search term is found. Some key fields, such as title, study number, abstract, topic, data creator and keyword, are given a higher weight to help relevance ranking. The relative weight of each field is shown in Table 1. This ensures that search results where both ‘energy’ and ‘prices’ are found in the study title will come at the top of the search results, higher than search results that have one word in the title, or that have the search terms in other areas of the metadata record.
The relative length (as number of words) of the metadata field in which the search term is found. If ‘energy’ is found in a study title of just three words it scores higher than if it is found in a title of ten words. If ‘energy prices’ is found in an abstract of 200 words it scores higher than in an abstract of 1000 words.
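The three factors above can be combined in a toy scoring function. The boost values are taken from Table 1; everything else is an assumption made for illustration: the snake_case field names, a weight of 1 for unboosted fields, and dividing by the square root of the field length as the length normalisation. The catalogue's real formula is not given here and may well differ.

```python
import math
import re

# Boost weights from Table 1; field names are hypothetical identifiers.
BOOSTS = {
    "title": 50, "study_number": 15, "abstract": 10,
    "alternative_title": 10, "topic": 10,
    "primary_investigator": 5, "keyword": 2,
}
DEFAULT_BOOST = 1  # assumption: unboosted fields count with weight 1

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def field_score(term: str, name: str, text: str) -> float:
    """boost * number of hits, damped by the length of the field."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = tokens.count(term.lower())
    boost = BOOSTS.get(name, DEFAULT_BOOST)
    return boost * hits / math.sqrt(len(tokens))  # assumed normalisation

def relevance(term: str, record: dict) -> float:
    """Sum the field scores over all searched fields of one record."""
    return sum(field_score(term, name, text) for name, text in record.items())

short = {"title": "Energy price survey"}
long_ = {"title": "A survey of household energy use in ten large cities"}
print(relevance("energy", short) > relevance("energy", long_))
```

With these assumptions a hit in a three-word title outscores the same hit in a ten-word title, and a hit in the title outscores the same hit in the abstract.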

Within groups of search results that have the same score, the results are ordered in descending order of version date, which is the date that a dataset was last published or updated. That means that within a series of studies that have the same or a similar title and the same or a similar abstract, if the various studies have the same relevance score, then the results will show in descending order of year with the most recently published first.

If, however, the relevance scores are not the same for all studies in a series, then the order of results will not be by year. Also, if the chronology of publishing or updates of a series of studies does not correspond with the order in which the studies were carried out (for example if the Crime Survey for England and Wales 2016-2017 was published after the Crime Survey for England and Wales 2017-2018 and both have the same relevance score), then based on version date the studies would be in reverse order of year.
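The tie-breaking rule amounts to sorting on two keys: relevance score first, version date second, both descending. In the sketch below the scores and version dates are invented purely to reproduce the Crime Survey scenario; the record keys `score` and `version_date` are hypothetical.

```python
from datetime import date

def sort_results(results: list[dict]) -> list[dict]:
    """Order by score, then by version date, both descending."""
    return sorted(
        results,
        key=lambda r: (r["score"], r["version_date"]),
        reverse=True,
    )

# Invented values: the 2016-2017 study was published later than 2017-2018.
results = [
    {"title": "Crime Survey for England and Wales, 2017-2018",
     "score": 3.0, "version_date": date(2019, 1, 10)},
    {"title": "Crime Survey for England and Wales, 2016-2017",
     "score": 3.0, "version_date": date(2019, 6, 1)},
    {"title": "Another study", "score": 5.0, "version_date": date(2001, 1, 1)},
]

for r in sort_results(results):
    print(r["score"], r["version_date"], r["title"])
```

The higher-scoring study comes first regardless of its age, and the two equally scored surveys appear in order of version date rather than survey year.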

Why does the order of search results sometimes seem unexpected?

Through the various enhancements described above, and the careful setting of boost weights for particular metadata fields, we have tried to make the relevance ranking as intuitive as possible.

If the order of search results seems unexpected, it may be influenced by the textual content of the metadata (for example the text of the abstract) or the year when the study was last updated.

Sometimes very old studies in the collection are also updated, which gives them a recent version date. This can make such older studies appear high in the relevance ranking.

You are welcome to report any unexpected sorting of results to us via our helpdesk, so that we can try to improve it.

Searching for series

The data catalogue also offers the option to search only through series of data collections. A series typically results from a study that is repeated at regular intervals, for example annually.

Within a series, the individual studies are often grouped into generics. There may be different generics because the series has changed significantly at one stage. Or there may be generics by access categories, whereby special licence access datasets and controlled access datasets are grouped in a separate generic from the standard access datasets.

For example, the Family Resources Survey series, with number SN 200017, contains two generics:

generic GN33283, Family Resources Survey, 1993- (standard access)
generic GN33457, Family Resources Survey, 2005- and Households Below Average Income, 1994-: Safe Room Access.

Expanding a generic, for example GN33283 (standard access), lists one or more studies, for example study SN 6523, Family Resources Survey, 2008-2009.

Searching for series is largely the same as searching for studies. There are a few differences:

The metadata fields ‘Series title’ and ‘Series number’ are searched instead of ‘Title’ and ‘Study number’, with the same boosting factors.
Within each series, the generics are ordered in ascending order of generic number. Within each generic, the studies are in descending order of year. Generics and studies are listed via the ‘Access data’ tab.

For example, if ‘Understanding Society’ is selected in the search results, then the ‘Access data’ tab lists the generics, with the most open category of data (generic) listed above the ‘Special Licence Access’ and ‘Secure Access’ generics.
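The ordering inside a series can be sketched as a nested sort: generics ascending by number, studies within each generic descending by year. The generic numbers below come from the Family Resources Survey example; the study years and the record layout (`number`, `studies`, `year`) are illustrative assumptions.

```python
def order_series(generics: list[dict]) -> list[dict]:
    """Sort generics by ascending number and their studies by descending year."""
    ordered = sorted(generics, key=lambda g: g["number"])
    return [
        {**g, "studies": sorted(g["studies"], key=lambda s: s["year"], reverse=True)}
        for g in ordered
    ]

# Generic numbers from the text; study years are made up for illustration.
series = [
    {"number": 33457, "studies": [{"year": 2005}, {"year": 2006}]},
    {"number": 33283, "studies": [{"year": 2008}, {"year": 2007}, {"year": 2009}]},
]

for generic in order_series(series):
    print(f"GN{generic['number']}", [s["year"] for s in generic["studies"]])
```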

About the author

Jeannine Beeken is the co-ordinator of the UK Data Service Data Catalogue. She is co-lead of the UK Data Service Metadata Group, responsible for the revision and update of the UK Data Service metadata schema and pipeline, from pre-ingest to data catalogue, including the pull and push of metadata to systems such as DataCite (DOIs), UKRI Research Gateway, Google Data Search, Euro Question Bank (CESSDA-EQB), ORCID and OAI-PMH. Jeannine is also co-lead of the User Group, which suggests and decides on how to enhance the user experience for the UKDS Data Catalogue, including registration, access and download, as well as searching, retrieval and sorting of results.
Jeannine also works as part of these CESSDA teams: CESSDA Metadata Office and CESSDA Euro Question Bank, as well as being part of the Oral History & Technology projects, especially the linguistic component.

Veerle Van den Eynden manages the Research Data Management team for the UK Data Service. This team provides expertise, guidance and training on data management and data sharing to researchers, to promote good data practices and optimise data sharing. She combines this with a position as Research Data Manager for the Global Challenges project Drugs and (Dis)order at the School of Oriental and African Studies.

Data Impact blog