Voicebank Usage Tutorial with Arpasing Assistant

This tutorial will explain how to use an Arpasing voicebank in UTAU, using Arpasing Assistant.

It is in your best interest to have the voicebank fully OTO'd before making USTs with it, because adjusting the timing of notes depends on the preutterance values.

If you don't yet have the Arpasing Assistant plugin, you can download it from here.

Put plain English words into the UST, select the notes, and run the Arpasing Assistant plugin to convert them. Alternatively, you can enter plain Arpabet phonemes into a note, separated by spaces. If you put "/u" at the end of a word, the note will be divided evenly among its diphones, instead of each diphone getting a different length.
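To make the idea of diphones and the "/u" even split concrete, here is a minimal Python sketch. It is not the plugin's actual code: the function names are made up for illustration, the 480-tick note length is just an example value, and the sketch leaves out the boundary notes (such as [- C]) that appear at the start of a phrase.

```python
def to_diphones(phonemes):
    """Pair each phoneme with the one after it, e.g. s t ae -> [s t][t ae]."""
    return [f"{a} {b}" for a, b in zip(phonemes, phonemes[1:])]


def split_evenly(total_ticks, count):
    """Split a note length into `count` whole-tick parts that add up exactly."""
    base, extra = divmod(total_ticks, count)
    # Hand out the leftover ticks one at a time to the earliest parts.
    return [base + (1 if i < extra else 0) for i in range(count)]


# Hypothetical input: the phonemes of "stand" typed into one 480-tick note.
phonemes = "s t ae n d".split()
diphones = to_diphones(phonemes)
lengths = split_evenly(480, len(diphones))

for alias, length in zip(diphones, lengths):
    print(f"[{alias}] {length} ticks")
```

Without "/u", the plugin gives each diphone its own length instead, so expect the real output to differ from this even split.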

The notes you select for conversion MUST have another note after them. The best way to ensure this is by putting a rest at the end of the UST.

To do multi-syllable words via word input, you will have to delete every note of the word except the first syllable's, and extend that first note to cover the combined length of all the notes. Put the entire word in the note, and convert as normal. From there you will have to put the notes back at the correct pitch and fix the timing.
However, doing multi-syllable words via phoneme input is a lot simpler, as you can simply break the phonemes across multiple existing notes before conversion.

FOR MULTIPITCH: If the voicebank you're using is multipitch, there are issues with Arpasing Assistant. The best thing to do is to temporarily remove all pitches with suffixes from the voicebank. The pitch without a suffix should be the only one remaining, and should not be inside a subfolder. Then, run Arpasing Assistant as usual. Once the UST has been converted, put the other pitches back in.

When converting words to diphone notes, Arpasing Assistant will select the best copy of a particular diphone based on context. For example, if you input the word "stand", it will choose a [t ae] note from a "stae" sample rather than from a "tae" sample. The context of word-like recordings creates natural pronunciation. The numeric suffixes exist to distinguish these copies from one another, so don't delete the numbers.

While it is recommended to use a completely untuned UST, you can adjust the volume level of notes and flags prior to conversion, and they will be retained. Pitchbends and vibrato get messed up and aligned to the wrong notes, so it's recommended to remove them entirely first.

Listen to the UST as it is right now to get an idea of how it sounds, and what sections you need to fix. Chances are that some notes will be off time, and there will be pronunciations that you are unhappy with.
Find the words or syllables where the pronunciation isn't what you want it to be. If you're not familiar with Arpabet, you can reference this article when editing.

Let's fix the timing. The most glaring problem is with multi-syllable words. Because they had to be combined into a single note prior to conversion, the timing is no longer what was intended. Hold down Ctrl and drag the ends of the notes to change their lengths without pushing the other notes around. Align them so that the core CV of each syllable starts at the intended position. Fix the pitches while you're at it, by selecting all the notes of one syllable and moving them up or down.
Look for phrases that start with consonants. (They would be the first syllable after what used to be a rest.) The [- C] note is often very short, so hold Ctrl and drag its left side to make it longer. Adjust the length so that it starts where the envelope of the next note starts.

From there you can go through note by note fixing the timing of all the small notes, by changing them to the length of the preutterance envelope of the next note. This is why you want your bank to already be fully OTO'd. You probably want to set your Quantize to 64th notes.
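If you want a rough idea of how long a small note should be before dragging it, the arithmetic can be sketched in Python. This assumes UTAU's usual resolution of 480 ticks per quarter note, which makes a 64th note 30 ticks. The tempo and preutterance values below are made-up examples, and the function names are only for illustration.

```python
TICKS_PER_QUARTER = 480                      # assumed UTAU note resolution
SIXTY_FOURTH = TICKS_PER_QUARTER * 4 // 64   # 30 ticks per 64th note


def ms_to_ticks(ms, tempo_bpm):
    """Convert milliseconds to ticks at the given tempo."""
    # One quarter note lasts 60000 / tempo_bpm milliseconds.
    return ms * tempo_bpm * TICKS_PER_QUARTER / 60000.0


def quantize(ticks, grid=SIXTY_FOURTH):
    """Snap a length to the nearest grid step, never shorter than one step."""
    return max(grid, round(ticks / grid) * grid)


# Hypothetical example: the next note has an 80 ms preutterance at 120 BPM.
raw = ms_to_ticks(80, 120)
print(raw, "ticks before snapping")
print(quantize(raw), "ticks on a 64th-note grid")
```

The snapped value is only a starting point. Judge the final length by ear and by where the envelope of the next note actually begins.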
As you go along you may also be modifying pronunciation. For example, "don't want" will become [d ow][ow n][n t][t w][w aa] etc. This is far too many notes for a short space, and in practice the phrase often sounds like "don want" when sung. So, remove the [n t] and change the [t w] to [n w].
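The "don't want" edit amounts to dropping one phoneme and re-pairing the rest. The Python sketch below shows that on paper. The helper name is hypothetical and has nothing to do with how the plugin works internally. In UTAU you would still make the change by hand on the notes.

```python
def to_diphones(phonemes):
    """Pair each phoneme with the one after it, e.g. d ow n -> [d ow][ow n]."""
    return [f"{a} {b}" for a, b in zip(phonemes, phonemes[1:])]


# Phonemes for "don't want" as written.
written = "d ow n t w aa n t".split()

# Drop the "t" of "don't" (index 3) to get the sung form "don want".
sung = written[:3] + written[4:]

print("written:", "".join(f"[{d}]" for d in to_diphones(written)))
print("sung:   ", "".join(f"[{d}]" for d in to_diphones(sung)))
```

Dropping that single phoneme removes [n t] and turns [t w] into [n w], which is the same result as the manual edit described above.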

From here, you can tune as normal.