By Daniel Benito, Chief Product Officer at Bitext
When implementing conversation flows for a chatbot, one of the most time-consuming tasks is preparing training data for the Natural Language Understanding (NLU) component. This involves coming up with examples of the many ways in which a user could formulate a request or answer a question.
Different users will use different language registers, ranging from formal (“can you please help me…”) to informal (“can u help me…”); some may use only keywords (“cancel order”), and very often they’ll introduce spelling or grammar errors (“remve an item from the cart”), among other phenomena.
While there are tools to assist with all of the different steps of creating a bot, covering all the different ways in which a user could express the same content is still something that must be done manually.
At Bitext, we are tackling this problem using Natural Language Generation (NLG) to generate synthetic training data. Our advanced NLG technology ensures that the different linguistic phenomena are accounted for, allowing us to tailor the training data to the linguistic profiles of the typical users of each bot.
Customer Support Intent Detection Training Dataset for Rasa
To showcase our NLG technology, we have recently published a dataset designed to train intent recognition models in the Rasa NLU platform.
The dataset contains 8,175 training utterances, between 290 and 324 per intent, and has the following specifications:
- Customer Service domain
- 11 categories or intent groups
- 27 intents assigned to one of the 11 categories
- 8,175 utterances assigned to the 27 intents
The dataset also reflects commonly occurring linguistic phenomena of real-life chatbots, such as:
- spelling mistakes
- run-on words
- punctuation errors…
Categories and Intents
The categories and intents covered by the dataset are:
- ACCOUNT: create_account, delete_account, edit_account, recover_password, registration_problems, switch_account
- CANCELLATION_FEE: check_cancellation_fee
- CONTACT: contact_customer_service, contact_human_agent
- DELIVERY: delivery_options, delivery_period
- FEEDBACK: complaint, review
- INVOICE: check_invoice, get_invoice
- NEWSLETTER: newsletter_subscription
- ORDER: cancel_order, change_order, place_order, track_order
- PAYMENT: check_payment_methods, payment_issue
- REFUND: check_refund_policy, get_refund, track_refund
- SHIPPING_ADDRESS: change_shipping_address, set_up_shipping_address
These intents have been selected from Bitext’s collection of 20 domain-specific datasets (banking, retail, utilities…), covering the intents that are common across all 20 domains. For a full list of domains see here.
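As a quick sanity check, the category list above can be written down as a plain Python mapping and compared with the published figures (11 categories, 27 intents, 8,175 utterances). This is only a sketch: the names `INTENTS_BY_CATEGORY` and `category_of` are illustrative and are not part of the dataset or of Rasa.

```python
# Category-to-intent mapping, transcribed from the list in this article.
# The variable and function names are illustrative, not part of the dataset.
INTENTS_BY_CATEGORY = {
    "ACCOUNT": [
        "create_account", "delete_account", "edit_account",
        "recover_password", "registration_problems", "switch_account",
    ],
    "CANCELLATION_FEE": ["check_cancellation_fee"],
    "CONTACT": ["contact_customer_service", "contact_human_agent"],
    "DELIVERY": ["delivery_options", "delivery_period"],
    "FEEDBACK": ["complaint", "review"],
    "INVOICE": ["check_invoice", "get_invoice"],
    "NEWSLETTER": ["newsletter_subscription"],
    "ORDER": ["cancel_order", "change_order", "place_order", "track_order"],
    "PAYMENT": ["check_payment_methods", "payment_issue"],
    "REFUND": ["check_refund_policy", "get_refund", "track_refund"],
    "SHIPPING_ADDRESS": ["change_shipping_address", "set_up_shipping_address"],
}

TOTAL_UTTERANCES = 8175  # figure stated in the dataset specifications


def category_of(intent: str) -> str:
    """Return the category that a given intent belongs to."""
    for category, intents in INTENTS_BY_CATEGORY.items():
        if intent in intents:
            return category
    raise KeyError(f"unknown intent: {intent}")


# Flatten the mapping into a single list of intents.
all_intents = [i for intents in INTENTS_BY_CATEGORY.values() for i in intents]
average_per_intent = TOTAL_UTTERANCES / len(all_intents)

if __name__ == "__main__":
    print(len(INTENTS_BY_CATEGORY), "categories")
    print(len(all_intents), "intents")
    print(f"{average_per_intent:.1f} utterances per intent on average")
```

The average works out to roughly 303 utterances per intent, which is consistent with the stated range of 290 to 324.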
The dataset is freely downloadable from GitHub.
Download it, use it to bootstrap the training of your own customer service chatbots, and let us know how they turn out.
We will soon be uploading a version that also includes entity/slot annotations for most of these intents. Stay tuned for this and other updates.
Other Formats and Linguistic Tags
The dataset is also available in CSV format and will be made available in other NLU platform-specific formats soon.
The CSV version includes tags that indicate the type of language variation that each utterance expresses. These tags allow conversational designers to customize training datasets for user profiles with different uses of language. Through these tags, many different datasets can be created to make the resulting assistant more accurate and robust. For example, a bot that sells sneakers would mainly target a younger population that uses more colloquial language, while a classic retail banking bot should be able to handle more formal or polite language.
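The following sketch shows how tag-based filtering of this kind could look in Python. It rests on several assumptions: the column names (`utterance`, `intent`, `tags`), the tag names (`POLITE`, `COLLOQUIAL`, `KEYWORD`, `TYPO`), and the `|` separator are all hypothetical and may not match the published CSV. The sample rows reuse the example phrasings from the beginning of this article and are not taken from the dataset; their intent labels are assigned only for illustration.

```python
import csv
import io

# Illustrative rows only, NOT taken from the published CSV.
# Column names, tag names, and the "|" tag separator are hypothetical.
SAMPLE_CSV = """utterance,intent,tags
can you please help me,contact_customer_service,POLITE
can u help me,contact_customer_service,COLLOQUIAL
cancel order,cancel_order,KEYWORD
remve an item from the cart,change_order,TYPO
"""


def select_by_tags(rows, wanted, excluded=frozenset()):
    """Keep rows that carry at least one wanted tag and no excluded tag."""
    selected = []
    for row in rows:
        tags = set(row["tags"].split("|"))
        if tags & wanted and not tags & excluded:
            selected.append(row)
    return selected


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# Profile for a bot aimed at users who write informally.
sneaker_bot = select_by_tags(
    rows, wanted={"COLLOQUIAL", "KEYWORD", "TYPO"}, excluded={"POLITE"}
)

# Profile for a bot that mostly sees formal or polite requests.
banking_bot = select_by_tags(rows, wanted={"POLITE"})

if __name__ == "__main__":
    print([r["utterance"] for r in sneaker_bot])
    print([r["utterance"] for r in banking_bot])
```

Each profile yields its own subset of the same source data, which can then be converted to the training format of the target NLU platform.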
You can find more details about the linguistic tags here.