rsca: a reversible sound change applier

last updated 10 May 2006 in substance, 4 February 2012 in picayunety

rsca is a command-line program which applies sound changes to sets of words. It is by no means the first such sound change applier: this honor (to my knowledge) goes to Mark Rosenfelder's sounds. Among other sound change appliers, rsca draws particular inspiration from Geoff's Sound Change Applier and Henrik Theiling's schcompile, all of which are worth checking out. There is also IPA Zounds (which, as I learned after writing this one, supports running in reverse but isn't so ostentatious about it, proving this a reinvention of the wheel. Ah well). Nope, IPA Zounds' prerequisites have succumbed to bitrot. But VSCA looks good too.

So why yet another sound change applier? rsca's main selling point (and the feature from which it takes its name) is that it's reversible: not only can it apply sound changes, it can unapply them and reconstruct the set of all proto-words yielding a given word. I've also tried to make rsca fairly general in its sound-changing capabilities, and given it support for X-SAMPA and similar notations. (And, I'll admit, I partly did it for the chance to demonstrate my transducer approach...)

rsca is implemented in C++, and is currently in version 0.1. For the moment it can be obtained only as raw source (24K), which builds with make to yield an executable rsca. g++, flex, and bison are required.

Comments, questions, bug reports, feature requests (if you're feeling lucky...) to me at gmail with username 000024.

Table of contents

Using rsca
Sound change files
    Modifier characters, or how to write phones
    Sound categories
Sound changes
    Basic changes
    Sound categories
    Matching categories
    Regular expressions
    Changes with multiple blanks
    Parallelized changes
    Sound change options
    Constraints
Other topics
    Errors
    rsca's implementation of sound changes
    Conflicts in determinization
    Going in reverse
    Wishlist

Using rsca

In its default mode of behaviour, rsca emulates the *nix filter ideal: it takes input words on standard input, one per line, and returns the output words with no extraneous formatting, one line per input word (note that this doesn't imply "one word per line"!). rsca is invoked as
rsca [options] <sound change file>
where sound change file is a file specifying the sound changes (of no mandated extension; but if you use one, it must be supplied) and the options are as follows.

Sound change files

A sound change file consists of three components: Interspersed anywhere among these may be lines of the following forms: Whitespace (tabs and spaces) is critical in the definition of sound changes, but aside from this rsca is fairly lenient about whitespace.

In the following discussion, examples of text which could appear in a sound change file are presented in fixed pitch font and often appear blockquoted,

like this .
In such displays, anything in roman text (like) is a sequence of characters to appear literally in the file, while anything in italic text (this) is a placeholder, to be substituted as appropriate.

Modifier characters, or how to write phones

rsca is limited to an eight-bit character set. It does not natively support any encoding of Unicode, although you could hack UTF-8 by converting each character you use to an appropriate seven-bit representation at the beginning and back at the end, or by defining all bytes in the range 0x80–0xbf to be mod10.

In compensation, it has flexible ideas of what constitutes a phone: you're not restricted to single characters. In particular, rsca handles X-SAMPA and its derivatives such as CXS and Z-SAMPA.

This flexibility is provided by modifier character definitions. Each definition is a line of the form

modmn character list;
where mn is a pair of digits, one of 01, 10, 02, or 11, and character list is a list of characters. This defines each character in character list to be a modifier character which combines the m phones on its left and the n on its right into a single new phone. (I'll tend to use phone in this document when concentrating on a phone's textual representation, and sound at other times; this is indefensible from a linguistic point of view, but it's still a nice distinction to make.)

Thus, for example, the modifier character definitions for X-SAMPA could look like

mod10 \`~=:
mod11 _
This means that \ and ` and ~ and = and : can be attached to the end of any phone, yielding a new phone (so that K\ and t` and r\= and 3`:~ are all single phones), and that _ goes between two phones to yield a new one (so that t_h and J\_t_?\ are also single phones).
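To make the grouping concrete, here is a small Python sketch of how a word might be split into phones under the two definitions above. It is only my reading of the mod10 and mod11 rules as described here, not rsca's actual code; the function name split_phones and its parameters are made up for the illustration, and mod01 and mod02 are left out.

```python
# Hypothetical helper, not part of rsca: split a word into phones
# using mod10 (attach to the phone on the left) and mod11 (join the
# phone on the left with the phone on the right) modifier characters.
XSAMPA_MOD10 = "\\`~=:"
XSAMPA_MOD11 = "_"

def split_phones(word, mod10=XSAMPA_MOD10, mod11=XSAMPA_MOD11):
    phones = []
    join_next = False  # True right after a mod11 character
    for ch in word:
        if ch in mod10 and phones:
            phones[-1] += ch          # modifier sticks to the previous phone
        elif ch in mod11 and phones:
            phones[-1] += ch          # joiner sticks to the previous phone...
            join_next = True          # ...and pulls in the next one too
        elif join_next:
            phones[-1] += ch
            join_next = False
        else:
            phones.append(ch)         # an ordinary character starts a new phone
    return phones

print(split_phones("t_hat`"))  # ['t_h', 'a', 't`']
```

With these definitions, K\ and 3`:~ and J\_t_?\ each come out as a single phone, as described above.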

I haven't been able to provide a mod20, to support the CXS and Z-SAMPA ) for affricates and coarticulates (as in ts)). You'll have to use the tiebar _ (t_s), or perhaps a 02 modifier (maybe {ts or something; ( and ) have other functions anyway).

The characters []()+*/| have special meanings to rsca, as does the double underscore __. All other characters, and single _, should be safe to use as or within phones.

Sound categories

Sound category definitions have the form
category name = sound list
This defines a category named category name, containing all of the sounds in sound list. Sound category names are unrestricted, but names which contain whitespace or start with a caret ^ or consist entirely of digits won't work very well.

For example, you might define categories for some voiceless and voiced stops and fricatives by

vlstop = ptk
vdstop = bdg
vlfric = fsx
vdfric = vzG
You could also do this instead:
vl = ptkfsx
vd = bdgvzG
stop = ptkbdg
fric = fsxvzG
and then use the techniques of referencing categories to get "voiceless fricative" and such.

Sound changes

rsca supports a fair number of extensions to your basic sound change. I'll go through all of these in this section, starting with the simplest form.

Sound changes are applied to words in the order they're listed in the sound change file, from top to bottom.

Basic changes

At its simplest, a sound change looks like
before after env
and means "before becomes after in the environment env". The whitespace here is critical: before, after, and env are delimited by whitespace, so they themselves must not contain any (except when it's otherwise called for).

The environment env contains a blank __, representing the part that changes. Note that the blank consists of two underscores, diverging from the standard choice of one (since a single underscore is a frequent X-SAMPA modifier character). A few examples:

s S __i
expresses "s changes to S before i", and
kw k_w __
"kw changes to k_w unconditionally".

0 is interpreted as zero (the empty string), and # represents a word boundary. So

h 0 a__a
"h is lost between as"
d t __#
"d becomes t word-finally"
0 j #__i
"j is epenthesised before a word-initial i".

As a general principle to keep in mind, instead of writing a change in which before and after share common material, it's often better to move this material to the environment. So the version of "h is lost between as" given above is preferable to

aha aa __ .
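For readers who think in regexps, the examples above can be approximated with Python's re module, using lookarounds for the environment. This is only a rough analogy under the assumption that every phone is a single character; it is not how rsca works internally. It does show why moving shared material into the environment matters: the two statements of "h is lost between as" disagree on a word like ahaha.

```python
import re

# Rough Python analogues of the rsca changes above (single-character
# phones assumed). Lookarounds play the role of the environment.
def s_to_S_before_i(w):      # s S __i
    return re.sub(r"s(?=i)", "S", w)

def lose_h_between_a(w):     # h 0 a__a
    return re.sub(r"(?<=a)h(?=a)", "", w)

def final_devoicing(w):      # d t __#
    return re.sub(r"d$", "t", w)

def j_epenthesis(w):         # 0 j #__i
    return re.sub(r"^(?=i)", "j", w)

def lose_h_naive(w):         # aha aa __  (shared material NOT in env)
    return re.sub(r"aha", "aa", w)

print(lose_h_between_a("ahaha"))  # aaa
print(lose_h_naive("ahaha"))      # aaha: the second h survives
```

In the naive version the middle a is consumed by the first match, so the second aha is never seen.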

Sound categories

The sound categories defined above can be employed in a sound change. The simplest employment consists of a category name, enclosed in brackets [ ] (like in featural notation); this matches any single sound contained in the category. If you use a category reference in the after part and one in the before part, then each phone in before changes to the corresponding one in after, in the order they were listed in the definition. So, if you have defined
stop = ptkbdg
fric = fsxvzG
then
[stop] [fric] [vowel]__[vowel]
means "stops become (corresponding) fricatives intervocalically". Further examples (with appropriate definitions):
[unaspir]h [aspir] __
"unaspirated stops fuse with a following h to become aspirated stops"
[stop] 0 __#
"stops are lost word-finally".
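The positional correspondence between two categories can be pictured as a lookup table built by pairing up the two sound lists. The following Python sketch does this for the intervocalic change above, assuming single-character phones and a vowel category aeiou, which is my own assumption since none was defined here. The function name lenite is invented for the example.

```python
# Sketch of  [stop] [fric] [vowel]__[vowel]  with positional pairing.
# The vowel inventory is an assumption made for this illustration.
def lenite(word, stop="ptkbdg", fric="fsxvzG", vowel="aeiou"):
    table = dict(zip(stop, fric))   # p->f, t->s, k->x, b->v, d->z, g->G
    out = []
    for i, ph in enumerate(word):
        # The environment is checked against the original word, so the
        # change applies everywhere at once.
        between_vowels = (0 < i < len(word) - 1
                          and word[i - 1] in vowel
                          and word[i + 1] in vowel)
        out.append(table[ph] if between_vowels and ph in table else ph)
    return "".join(out)

print(lenite("apata"))  # afasa
```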

You can name two or more categories separated by whitespace between the brackets; this gives you the intersection of all the categories named, i.e. the set of all sounds in all of the categories. So if you set up definitions like

vl = ptkfsx
vd = bdgvzG
stop = ptkbdg
fric = fsxvzG
[vl stop] will give you a voiceless stop. If you precede a category name by a caret ^, the category is complemented, i.e. you get everything not in the category: so [^stop] is any non-stop, and [vd ^stop] is a voiced non-stop. You can use two carets to exclude a specific sound: [^^@] is anything but @, [stop ^^t ^^d] is a stop other than t or d. (If you want the union of categories, or to include single sounds, you'll be needing the regular expression character |.) An empty set of brackets [] matches any sound (but not a word boundary #).
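The bracket notation is just set algebra, which a few lines of Python can mimic. One caveat: this sketch takes complements relative to a small finite universe of phones that I've made up for the example, whereas rsca itself treats the set of possible phones as unbounded. The helper name resolve is likewise hypothetical.

```python
# Set-algebra sketch of category references such as [vl stop] or [vd ^stop].
cats = {
    "vl": set("ptkfsx"),
    "vd": set("bdgvzG"),
    "stop": set("ptkbdg"),
    "fric": set("fsxvzG"),
}
# Assumed finite universe: every categorised phone plus two extras.
universe = set().union(*cats.values()) | {"@", "a"}

def resolve(spec, cats=cats, universe=universe):
    """Return the set of phones matched by the contents of one [ ... ]."""
    result = set(universe)            # [] alone matches any sound
    for term in spec.split():
        if term.startswith("^^"):
            result -= {term[2:]}      # ^^x excludes the single sound x
        elif term.startswith("^"):
            result -= cats[term[1:]]  # ^name complements a category
        else:
            result &= cats[term]      # plain name: intersect
    return result

print(sorted(resolve("vl stop")))       # ['k', 'p', 't']
print(sorted(resolve("stop ^^t ^^d")))  # ['b', 'g', 'k', 'p']
```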

When categories appear in after, there isn't quite the flexibility described above. There you can only use a single, noncomplemented, category name, or nothing (in which case the sound is not changed). If you use a category name, it is taken to correspond to the first category named in before (which had itself better not be complemented). So, to change voiceless stops to voiced, you need to write

[vl stop] [vd] __
and not
[stop vl] [vd] __ .
Unfortunately, in this version of rsca, the upper limit of one category name in after prevents you from changing, say, voiceless stops to voiced fricatives in one fell swoop, if you set up the categories this way.

It's worth mentioning a few more tricks to do with categories. Firstly, for many-to-one changes, you can repeat entries in a category:

vowel = aeiou
raised-vowel = oiiuu
[vowel] [raised-vowel] __n
raises vowels before n, leaving already high ones alone. Secondly, if you want, say, a voiceless and a voiced category that you can both convert between and reference independently, but some voiced sounds don't have voiceless counterparts (or vice versa), you can use some fake sound to pad out the categories appropriately, for instance
vl = ptkfsx$$$$$$$$$$$$
vd = bdgvzGmnNrljwieaou

As a final aside, through these categories rsca provides half-hearted support for distinctive features, in the fashion suggested by calling the two categories above -voice and +voice. But this certainly isn't full-hearted support: that would at least entail a friendlier way to define features, and a smarter strategy for changing the value of a feature (in particular one that can change multiple features at once).

Matching categories

The distinction laid out in the discussion above between categories used in after and categories used elsewhere is actually more general.

Any usage of a category can reference any other usage of a category. This reference is established by placing a number before the other category usage (we call it the labelled category), and placing the same number preceded by a hash # before the referencing category. For example, if a sound change contains [3 stop] and [#3 fric], then the first matches a stop and the second only matches the corresponding fricative. If you leave out a category name in the reference, it matches the same sound: so [1 vowel ^^@] and [#1] match a vowel other than @ and a second instance of that same vowel.

These category references are quite powerful for stating many kinds of sound changes for which the simpler tools above don't quite suffice. Among these are assimilations:

stop = ptkbdg
nasal = mnNmnN
[nasal] [#1 nasal] __[1 stop]
renders "nasals assimilate to following stops" (which may take a bit of thought to see). Again,
? [#1] [1 stop]__
"glottal stops assimilate completely to a preceding stop".
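If the nasal assimilation example takes a bit of thought, it may help to see it spelled out procedurally. In this Python sketch (single-character phones assumed, function name invented), the labelled [1 stop] supplies an index into the stop list, and the reference [#1 nasal] picks the nasal at that same index.

```python
# Sketch of  [nasal] [#1 nasal] __[1 stop]
# The nasal list is repeated so that it lines up with all six stops.
def assimilate(word, stop="ptkbdg", nasal="mnNmnN"):
    out = list(word)
    for i in range(len(word) - 1):
        if word[i] in nasal and word[i + 1] in stop:
            # Position of the following stop selects the matching nasal.
            out[i] = nasal[stop.index(word[i + 1])]
    return "".join(out)

print(assimilate("anpa"))  # ampa
print(assimilate("amga"))  # aNga
```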

A few more assorted examples:

[1 vowel][2 liquid] [#2][#1] __
"liquids metathesise with preceding vowels"
0 [#1] [vowel]__[1 vl stop][vowel]
"voiceless stops geminate between vowels"
[#1] 0 __[1 cons]
"all geminate consonants simplify"
0 i #__[1 cons][#1]
"i is inserted before initial geminates"
[1 vowel][vowel] [#1 long] __
"two consecutive vowels coalesce into a long version of the first".

If a category usage is not given a numeric label or reference, and it occurs in before or after, it behaves just as if it bore the label 0 or reference #0, respectively (this is how categories in before and after are made to match up by default).

Some points on (it pains me to say...) restrictions on labels and references should be noted, since rsca generates a particularly wide variety of errors concerning them.

rsca understandably doesn't permit you to use a reference when there are two categories labelled with that number: so for instance, you can't put [1 stop], [1 fric], and [#1] in the same sound change. Much more arbitrarily, you can't use two references with the same number either: so using [1 stop], [#1 fric], [#1] together is out. Even though I haven't yet thought of a sound change which requires multiple references to state, this is still an arbitrary restriction, and you can expect it to be lifted in a future version of rsca.

Categories with labels cannot occur in after. Aside from this, these labels and references may appear anywhere, subject to restrictions which only gain force when we add regular expressions to the mix.

Regular expressions

rsca supports a simple set of regular expressions (and I had to work for these regular expressions! not like some sound change appliers, who get them for free just for using Perl or Python... :-) ). If you're familiar with regular expressions, you may wish to skip ahead after reading that rsca supports the operators | * + / () and catenation, where / is a nonstandard name for the more usual ? (since ? is so useful for glottal stop).

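Regular expressions are built up from smaller regular expressions using these operators. Here X and Y are smaller regular expressions, which may be single sounds.

X|Y matches either X or Y
XY matches X followed by Y (catenation)
X* matches zero or more repetitions of X
X+ matches one or more repetitions of X
X/ matches X or nothing, i.e. an optional X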

These are listed from lowest to highest precedence. Parentheses ( ) may be used to override the precedence, so that for instance ([stop]|[nasal])* matches zero or more stops or nasals.

Regular expressions may be used in env only.

These operators are key to a lot of rsca's expressivity. Some examples:

[vowel] [long-vowel] __[vd fric]|r
"vowels lengthen before a voiced fricative or r"
a e __[^vowel]*[front]
"a mutates to e when the next vowel is front", i.e. "...after zero or more nonvowels and then a front vowel"
u o __[cons]+#
"u becomes o in a final closed syllable", i.e. "...after one or more consonants and then the end of the word"
[vowel] [nasal-vowel] __[glide]/[nasal]
"vowels nasalize before a nasal, preceded by an optional glide"
and if syllable breaks . have been inserted in the appropriate spots,
[vowel] [stressed-vowel] #[^^.]*__
"vowels in the initial syllable receive stress", i.e. "vowels such that there are no syllable breaks between the start of the word and here..."

We can finally explicate the danger alluded to above regarding references and labels in regular expressions. What's ruled out are the sort of things typified by

? [#1] __[1 stop]/
and
? [#1] __[1 stop]+
in which there might not be a sound matched by the labelled category to reference, or there might be more than one. Less defensibly (again), rsca also rules out cases where it's the reference, not the labelled category, in this situation. If you violate these restrictions, rsca will issue the error message
group n couldn't be joined
where n is the label of the transgressing categories.

Changes with multiple blanks

Sound changes in rsca allow for multiple changing parts which are separated by arbitrary non-changing material. Such changes still have the familiar form
before after env ;
now, however, the blank __ will appear multiple times in env, and before and after will be divided into the same number of pieces with two pipes || between each piece. So a two-blank change will look like
before1||before2 after1||after2 env0__env1__env2 .
Changes with multiple blanks are useful where you might otherwise wish to write a change which has a regular expression in before and after; it's often the case that these parts don't change, and this allows you to move them to the environment instead.

Some examples:

[glide]||0 0||[] __[cons ^glide]+__[vowel]
"glides metathesise across one or more non-glide consonants before a vowel"
'||[vowel] 0||[stressed-vowel] __[cons]*__
"syllables preceded by a stress mark ' lose the stress mark, and their vowel becomes stressed".

Parallelized changes

rsca allows multiple sound changes to take place at the same time, instead of any one occurring before or after another; such changes are said to be parallelized. A set of parallelized changes looks like
before1 after1 env1 //
before2 after2 env2 //
  ... etc ...
beforeN afterN envN
with a double virgule // after every change but the last.

You can use parallelized changes to swap two classes of sound without an intermediary:

[high-tone] [low-tone] __ //
[low-tone] [high-tone] __
"low tone and high tone are swapped".
They're also useful, for instance, for stating push chains in their natural order (this example is due to Benct Philip Jonsson, and assumes the length mark : is treated as a separate phone):
a: o: __ //
o: u: __ //
u y __
"a: is rounded and raised to o:, o: is raised to u:, u of any length is fronted to y".
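The difference between parallelized and sequential application is easy to see in a toy Python model. The sketch below handles only unconditional one-phone changes, uses H and L as stand-ins for tones, and ignores vowel length in the push chain; all of these are simplifications of mine, and the function names are hypothetical.

```python
# Toy model: unconditional single-phone changes only.
def apply_parallel(word, rules):
    """Every phone is looked up once, against the original word."""
    table = dict(rules)
    return "".join(table.get(ph, ph) for ph in word)

def apply_sequential(word, rules):
    """Each rule sees the output of the rule before it."""
    for before, after in rules:
        word = word.replace(before, after)
    return word

swap = [("H", "L"), ("L", "H")]
chain = [("a", "o"), ("o", "u"), ("u", "y")]

print(apply_parallel("HLLH", swap))     # LHHL
print(apply_sequential("HLLH", swap))   # HHHH
print(apply_parallel("aou", chain))     # ouy
print(apply_sequential("aou", chain))   # yyy
```

Run in sequence, the swap collapses both tones into one and the push chain merges all three vowels, which is exactly what the // notation is there to avoid.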

Sound change options

Before each sound change or parallel group of sound changes, you can specify a number of options. Each option is placed on a separate line of the form
keyword (argument) ,
preceding the statement of the change; not every kind of option takes an argument. The option keywords are recognizable because they all begin with a bracket [ (and don't end with a matching ]). All of these options, of course, are optional.

The valid options are as follows. I haven't introduced enough material to present some of them yet; they're presented here, but discussion of their function is postponed to a more appropriate place.

Constraints

Finally, rsca understands one type of "sound change" that doesn't specify a change at all, namely constraints. A constraint looks like
* constraint
and has the meaning that no word may appear at that point in the derivation of which some part matches constraint. The constraint may contain any of the earlier-discussed constructs which could occur in the environment env of a normal sound change.

If a word which matches, or violates, the constraint, comes up during derivation, rsca will spill forth a warning, and the word will not be processed any further; the eventual output will be an empty line (unless there were other possibilities from a sporadic change). The warning can be suppressed with the -q command line option.

Constraints really come into their own when rsca is used in reverse. They're still useful in forward derivation, though, to make sure that assumptions you've made about the changing phonotactics at some point in the sequence of changes really do hold true.

Some examples:

* u
"u is not a valid phone"
* [cons][cons][cons][cons]
"four consonants in a row cannot occur"
* [ejective][]*[ejective]
"more than one ejective cannot appear in a word"
* [short][vd obst][cons]*(.|#)
"vowels before a coda voiced obstruent cannot be short" (assuming syllable breaks . are present).
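One way to think of a constraint is as a filter over the candidate words alive at that point in the derivation. The Python sketch below expresses the first two constraints above as ordinary Python regular expressions and drops any word that matches one. The consonant inventory and the function name survivors are assumptions for the example, and phones are taken to be single characters.

```python
import re

CONS = "ptkbdgmnszrl"  # hypothetical consonant inventory

# Python-regexp stand-ins for two of the constraints above.
CONSTRAINTS = [
    "u",                # * u
    f"[{CONS}]{{4}}",   # * [cons][cons][cons][cons]
]

def survivors(candidates, constraints=CONSTRAINTS):
    """Keep only the words in which no constraint matches anywhere."""
    return [w for w in candidates
            if not any(re.search(c, w) for c in constraints)]

print(survivors(["astra", "arstkla", "tuna"]))  # ['astra']
```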

The options discussed above apply equally well to constraints, although many of them are unuseful (like sporadic; a sporadic constraint has absolutely no effect).

Constraints can also be parallelized. This doesn't do anything that putting them in sequence wouldn't, but does allow you to, for instance, give the set of constraints a unitary name: so

[name vowel harmony
* [front][]*[back] //
* [back][]*[front]
is a possible statement of total front-back vowel harmony. It's not legal to join a constraint to a sound change with //.

Other topics

Errors

Frustratingly, rsca has very limited patience: if it detects something it doesn't like in a sound change file, it will issue an error message and give up then and there, without even trying to process subsequent lines of the file.

When rsca reports an error, it will say

file:line: error message ,
meaning that the error was discovered on line line of file file. I like to think that the error messages are relatively clear and understandable (well, except for the annoyingly vague parse error), so I haven't gone through them here. A few of them are discussed elsewhere:

rsca's implementation of sound changes

rsca implements sound changes by converting them to nondeterministic finite state transducers. This is quite an appropriate, indeed I'd say the appropriate, model for sound changes. Many sound change appliers are implemented with regular expressions, which, somewhere far behind the scenes, get converted to finite state automata; but what we really want are finite state automata with the capability to output, and these are exactly finite state transducers. The nondeterminism lets us go backwards and do constraints and sporadic changes and other nice stuff.

On the other hand, there are probably more bugs lurking in my transducer code than in any regexp library... and I shudder to even think of trying to seriously optimise my transducer code, so you can expect rsca to be slower than the alternative.

This implementation has a few consequences for the way rsca's sound changes behave:

Conflicts in determinization

Occasionally, rsca will reject a sound change, saying
conflict in determinisation (sound change may have ambiguous cases) .
These conflicts are perhaps the downside of insisting that sound changes happen everywhere at once. rsca signals a conflict when it thinks a change could apply in two overlapping places, and is unsure what should happen when this overlap occurs. Conflicts can also be generated when the same sound is affected by two parallel rules, when rsca is equally unsure what to do. A conflict can, however, mean that rsca is just too stupid to figure out that there's no possibility of any strange behaviour.

A typical change that could earn this warning is

aa a: __
(suppose for the following discussion that a: is one phone). What, then, should be done if a word happens to contain aaa? Left to right would suggest a:a, right to left aa:; but then there are two instances of aa, so maybe we should cough up two a:a:s; and one a: doesn't look that bad either...
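The first two of these readings can be reproduced with a pair of naive scanners in Python. This is purely an illustration of the ambiguity, with words given as lists of phones and function names I've invented; it says nothing about what rsca itself would do with the change.

```python
# Two equally defensible ways of applying  aa a: __  to a list of phones.
def merge_left_to_right(phones):
    out, i = [], 0
    while i < len(phones):
        if phones[i:i + 2] == ["a", "a"]:
            out.append("a:")   # consume both a's, emit one long vowel
            i += 2
        else:
            out.append(phones[i])
            i += 1
    return out

def merge_right_to_left(phones):
    # Same scan, run over the reversed word and flipped back.
    return merge_left_to_right(phones[::-1])[::-1]

print(merge_left_to_right(["a", "a", "a"]))  # ['a:', 'a']
print(merge_right_to_left(["a", "a", "a"]))  # ['a', 'a:']
```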

There are a couple of things you can do to get your change through. One is to rewrite the change to move common material from before and after into env. This, for instance, is the right thing to do with the conflict-producing change

ii i __ ,
which goes down fine if rewritten as
i 0 __i
(or the same with the i before the blank).

If you can't do this, you can get a second opinion on the conflict by using the option [flip-conflicts. This causes rsca to look at the change right-to-left instead of left-to-right when compiling it into a transducer. It has no effect on the behaviour of the change when it succeeds. [flip-conflicts is most likely to work when there's context on the right of the blank in env; an example for which it succeeds is

tt ss __i .

If this doesn't work, the last resort is to use the option [ignore-conflicts. This, as you might have guessed, instructs rsca just to ignore its misgivings and plunge ahead with the change. The catch: when [ignore-conflicts is used, rsca's behaviour on that sound change is undefined. That is, rsca will do whatever it likes with the change, especially with the overlapping cases. You can generally expect it to do something sensible with the normal non-overlapping cases, but even this can't be relied upon (it seems, for instance, to be fond of simply ignoring parallelized changes that would otherwise raise a conflict). rsca's behaviour will at least be the same between multiple uses of the same change: so if you try it out and it seems to do what you expect, it's probably safe to use.

Going in reverse

At last, we come to (one of) the raison(s) d'être of rsca: applying sound changes in reverse, to reconstruct all possible proto-words which yield a given word. rsca is kicked into reverse by supplying it with the -r command line option. When operating in reverse, it takes the sound changes specified in the sound changes file and unapplies them in turn, from bottom to top of the sound change file (as you'd expect). It outputs all the words it finds on a single line, separated by single spaces, or a blank line if it finds that the input word is impossible to generate.

This is more evidence for something I've already hinted at above with sporadic changes and constraints (and nondeterminism), so I'll come clean: in either forward or reverse operation, rsca actually maintains a set of all words that the input word could have become (or come from) at the current point in the sound changes. This set may have one member, or a few, or a great many, or none. In forward operation, this isn't very noticeable, because sound changes generally have exactly one possible output for each input. In reverse operation, though, it becomes a lot more prominent: any given word can easily have many ancestors, or none, under a given sound change.

Combine that effect with any reasonably sized list of sound changes, and you'll soon run into quite a huge explosion of possible reconstructed forms, most of them quite ridiculous. For instance, running the rsca translation of Mark Rosenfelder's Latin to Portuguese example in reverse on the Portuguese distrito gives a list of 1120 possible "Latin" etyma:

diiistericto diiisterictom diiisterictoms diiisterictos diiisterictu diiisterictum diiisterictums diiisterictus diiisteriicto diiisteriictom diiisteriictoms diiisteriictos diiisteriictu diiisteriictum diiisteriictums diiisteriictus diiisteriiipto diiisteriiiptom diiisteriiiptoms diiisteriiiptos ... (1080 skipped) ... divivistriviiptu divivistriviiptum divivistriviiptums divivistriviiptus divivistrivipto divivistriviptom divivistriviptoms divivistriviptos divivistriviptu divivistriviptum divivistriviptums divivistriviptus divivistrivivipto divivistriviviptom divivistriviviptoms divivistriviviptos divivistriviviptu divivistriviviptum divivistriviviptums divivistriviviptus

This problem can be made much more manageable by means of some well-placed constraints in your sound change file, to rule out the reconstructed words which violate the phonotactics of the language at any point during the changes. It's advisable to place these constraints as late as possible in the file, so that the bad candidates are removed as early as possible in reconstruction. For example, I've introduced three choice constraints (which are commented out), all rather narrow in scope, into the above sound change file:

* ms#

at the beginning, to rule out Latin final ms;

* ii

just before the first deletion of intervocalic material, which isn't really right since Latin had ii sequences, but it'll do for the example; and

* iii

just before the change which deletes is, 'cause iii is just silly. These reduce the 1120 reconstructed forms to just 72 (distericto disterictom disterictos disterictu disterictum ... divistriviptus), all of them looking rather more Latinate, and districtus is easy to pick out among them.

Since constraints are intended to be used heavily for this sort of pruning, the warning about words failing constraints is suppressed when running rsca in reverse.

rsca can offer a particularly strong guarantee on the correctness of the words it reconstructs, with respect to forward application of the sound changes. That is, if rsca is given a word in reconstruction mode and returns a set of ancestors, and then rsca is turned around and run in forward mode, every one of these ancestor words will yield the original word as output, and no other words will. Big deal, I hear you say, that's only to be expected... but this is actually another benefit of using transducers, I say, gloating over the fact that this doesn't seem anywhere near so easy to achieve with regular expressions :-)

Actually, there are two exceptions to the correctness of reversibility. (Everything else should be fine, even changes with [ignore-conflicts.)

Firstly, some changes can altogether not be reversed. These are the changes which transform an infinite set of sounds into the same thing, for instance

[^vowel] 0 __#
intended to delete non-vowels at the end of the word. The problem with these is that there's an infinite set of ancestors for any given word: any phone that's not a vowel can be inserted at the end of the word, and there are infinitely many such phones (rsca believes this even if you didn't define any modifier characters). rsca will refuse to run in reverse if such changes are present, giving the error message
change cannot be reversed .
The solution, of course, is to restate the change using a finite set (defining a new category if you have to). For instance, the above is fixed by defining a consonant category cons:
[cons] 0 __# .

The other exception is provided by deletion. A sound change like

h 0 __
is perfectly innocuous in the forward direction, but in the reverse direction it again yields infinite sets of ancestors. Even a reflex like a could be the outcome of a or ha or hha or hhha or hhhha or ..., to say nothing of hah and hhhahhhhhhhh and the rest. To cope with this, rsca assigns to each sound change a parameter giving the maximal number of times it's allowed to undelete material at any one point, when applying the change in reverse. This parameter defaults to 1, but can be set to any other nonnegative integer for a specific change, using the [undelete option. Thus, given the above change, rsca will reconstruct a to
a ah ha hah ;
but
[undelete 3
h 0 __
would give you back
a ah ahh ahhh ha hah hahh hahhh hha hhah hhahh hhahhh hhha hhhah hhhahh hhhahhh ,
and
[undelete 0
h 0 __
would simply return a.
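The effect of the undelete parameter can be mimicked by brute force. The Python sketch below handles only an unconditional deletion of one single-character phone, and enumerates every way of reinserting up to undelete copies of it at each point in the word; the function name and parameters are hypothetical, and rsca's transducers certainly don't work this way internally.

```python
from itertools import product

def unapply_deletion(word, deleted="h", undelete=1):
    """All ancestors of word under the change  deleted 0 __ ."""
    if deleted in word:
        return []  # the change would have removed it: no ancestors at all
    gaps = len(word) + 1  # before each phone, plus the end of the word
    results = set()
    for counts in product(range(undelete + 1), repeat=gaps):
        pieces = []
        for i, n in enumerate(counts):
            pieces.append(deleted * n)       # reinsert n copies here
            if i < len(word):
                pieces.append(word[i])
        results.add("".join(pieces))
    return sorted(results)

print(" ".join(unapply_deletion("a", undelete=1)))  # a ah ha hah
print(len(unapply_deletion("a", undelete=3)))       # 16
```

Applying the deletion forwards to any of these ancestors (here, just removing every h) gives back the original word, which is the round-trip property described above. The number of candidates grows as (undelete + 1) to the power of the number of insertion points.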

Be careful with large values of the undelete parameter: they can lead very easily to explosions, especially if the deletion is unconditional.

Wishlist

Just for fun, and maybe as a taste of future versions, here's a list of some additional features, in rough order from simplest and sanest to most complicated and outlandish, that I've considered or dreamt about adding to rsca in the future...

