The importance of a quality dataset for automated tasks
We will now focus on automating tasks with trained NLP models, where the quality of the input dataset used to build them plays a key role. The most common tasks fall into two groups - categorization and extraction - which differ in their function. For example, some sort emails into predefined categories, while others assign a sentiment to each email.
Others determine the topics of individual messages within ongoing conversations. Interesting examples include tasks that find terms significant to the content of a message (so-called entities), or that identify parts of speech and their roles in a sentence.
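To make these task types concrete, the sketch below shows how a categorization model and an entity-extraction model might be called using the Hugging Face transformers pipelines; the model names are illustrative placeholders, not the models actually used.

    from transformers import pipeline

    # Categorization: assign one of the predefined categories to an email.
    # Both model names below are hypothetical placeholders.
    classifier = pipeline("text-classification",
                          model="your-org/czech-email-categories")

    # Extraction: find significant terms (entities) in a message.
    ner = pipeline("token-classification",
                   model="your-org/czech-ner",
                   aggregation_strategy="simple")  # merge sub-word pieces

    email = "Dobrý den, chtěl bych reklamovat fakturu č. 2023/118."
    print(classifier(email))  # e.g. [{'label': 'complaint', 'score': 0.97}]
    print(ner(email))         # e.g. [{'entity_group': 'DOC_ID', 'word': '2023/118', ...}]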
The importance of a well-made dataset
Each task has its own learning process. As a rule, a training dataset with "correct answers" is required as input. Such a dataset is crucial for the quality of the resulting task - it must be correctly labelled, balanced in the representation of the individual categories, and contain a sufficient amount of labelled data. Obtaining, labelling, or otherwise creating this dataset for a particular task is one of the most challenging parts of the entire learning process.
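Balance can be verified with a quick look at the label distribution before training. A minimal sketch follows - the file name and column name are illustrative:

    import csv
    from collections import Counter

    # Count how many examples each category has in the labelled data.
    # "emails_labelled.csv" and its "label" column are hypothetical.
    with open("emails_labelled.csv", encoding="utf-8") as f:
        labels = [row["label"] for row in csv.DictReader(f)]

    counts = Counter(labels)
    total = sum(counts.values())
    for label, n in counts.most_common():
        print(f"{label:20s} {n:6d}  ({n / total:.1%})")

A heavily skewed distribution signals that the rare categories need more labelled examples (or re-sampling) before the model is trained.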
For this reason, it is necessary to follow current trends in natural language processing (NLP). These are based on large language models pretrained for a specific language - in our case, Czech. A good language model carries strong knowledge of the language itself and forms a robust basis for building specific automated tasks. In addition, it significantly reduces the required size of the input training datasets, and tasks built on top of it achieve higher overall success rates.
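Building a task on top of a pretrained model can look roughly like the sketch below, which uses the Hugging Face transformers library; ufal/robeczech-base is one publicly available Czech model, and the tiny inline dataset only stands in for a real labelled one:

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    model_name = "ufal/robeczech-base"  # a pretrained Czech language model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=2)  # e.g. two email categories

    # Stand-in training data; a real dataset would be far larger.
    ds = Dataset.from_dict({
        "text": ["Fakturu uhradím zítra.", "Zboží dorazilo poškozené."],
        "label": [0, 1],
    }).map(lambda batch: tokenizer(batch["text"], truncation=True),
           batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="email-categorizer",
                               num_train_epochs=3),
        train_dataset=ds,
        tokenizer=tokenizer,  # enables padding of batches during training
    )
    trainer.train()

Because the pretrained model already "knows" Czech, only the task-specific layer and a modest amount of labelled data are needed to reach usable accuracy.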
Active learning - the practice of training models with minimal requirements on the input dataset size - also cannot be overlooked: it can achieve comparable model success rates while significantly reducing the demand on human annotators.
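A simplified sketch of such an active-learning loop (uncertainty sampling) follows; the tiny data and the ask_human step are illustrative stand-ins:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def ask_human(text):
        # Stand-in for the real annotation step performed by a person.
        return int(input(f"Label for {text!r} (0/1): "))

    # Tiny illustrative data; real pools hold thousands of texts.
    labelled_texts = ["úhrada faktury", "stížnost na doručení"]
    labelled_y = [0, 1]  # 0 = invoice, 1 = complaint
    pool = ["reklamace zboží", "faktura za služby", "dotaz na smlouvu"]

    vectorizer = TfidfVectorizer()
    for _ in range(3):  # a few labelling rounds
        X = vectorizer.fit_transform(labelled_texts + pool)
        X_lab, X_pool = X[:len(labelled_texts)], X[len(labelled_texts):]
        model = LogisticRegression().fit(X_lab, labelled_y)

        # Ask a human to label only the example the model is least sure of.
        probabilities = model.predict_proba(X_pool)
        idx = int(np.argmin(probabilities.max(axis=1)))
        text = pool.pop(idx)
        labelled_texts.append(text)
        labelled_y.append(ask_human(text))

Instead of labelling the whole pool up front, the annotator labels only the most informative examples, which is where the saving in human effort comes from.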
In practice, we then create a separate model for each task. The input datasets essentially define the behaviour of the resulting task and therefore cannot simply be taken over from general sources or from other organisations - datasets always differ in content, topics, language used, and other factors. Models for individual tasks are therefore usually published in Trask as services with a documented interface.
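For illustration, consuming such a service could look like the call below; the endpoint URL, payload, and response shape are hypothetical and stand in for the interface documented with each service:

    import requests

    response = requests.post(
        "https://nlp.example.com/email-categorizer/v1/predict",
        json={"text": "Dobrý den, chtěl bych ukončit smlouvu."},
        timeout=10,
    )
    response.raise_for_status()
    print(response.json())  # e.g. {"label": "contract_termination", "score": 0.93}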