API documentation: exported types and functions

Types

Orthography.OrthographicSystem — Type

An abstract type for orthographic systems.

Orthography.TokenCategory — Type

An abstract type for token categories.

Orthography.LexicalToken — Type

Category of alphabetic tokens.

Orthography.NumericToken — Type

Category of numeric tokens.

Orthography.PunctuationToken — Type

Category of punctuation tokens.

Functions

Public functions implemented for all subtypes of OrthographicSystem.

Orthography.codepoints — Function

Delegate to a specific method based on the orthography trait value of the argument's type.

codepoints(x)

It is an error to invoke the codepoints function on anything but an orthographic system.

codepoints(_, x)

Orthographic systems must implement codepoints.

codepoints(_, ortho)

Implement codepoints function for SimpleAscii.

codepoints(ortho)

Implement codepoints function for SimpleAscii.

codepoints(ortho)
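
The delegation described for codepoints follows a trait-dispatch pattern that is common in Julia. The following is a minimal, self-contained sketch of that pattern, not the package's source. Every name in it (DemoTrait, IsDemoOrthography, NotDemoOrthography, DemoSystem, DemoAscii, DemoUnfinished, demotrait, democodepoints) is hypothetical and chosen only for illustration.

```julia
# Hypothetical trait values (illustrative names, not the package's internals).
abstract type DemoTrait end
struct IsDemoOrthography <: DemoTrait end
struct NotDemoOrthography <: DemoTrait end

# Hypothetical stand-ins for an abstract orthographic system and two subtypes.
abstract type DemoSystem end
struct DemoAscii <: DemoSystem
    codepoints::String
end
struct DemoUnfinished <: DemoSystem end  # deliberately implements nothing

# Trait function: a type is not an orthography unless it subtypes DemoSystem.
demotrait(::Type) = NotDemoOrthography()
demotrait(::Type{<:DemoSystem}) = IsDemoOrthography()

# One-argument entry point: delegate on the trait value of the argument's type.
democodepoints(x::T) where {T} = democodepoints(demotrait(T), x)

# Anything that is not an orthographic system is an error.
democodepoints(::NotDemoOrthography, x) =
    throw(DomainError(x, "not an orthographic system"))

# Fallback reached by orthographic systems that lack their own method.
democodepoints(::IsDemoOrthography, x) =
    throw(DomainError(x, "must implement democodepoints"))

# Concrete implementation; more specific than the generic entry point.
democodepoints(ortho::DemoAscii) = ortho.codepoints
```

In this sketch, democodepoints(DemoAscii("abc")) reaches the concrete method directly, while democodepoints(42) and democodepoints(DemoUnfinished()) both pass through the entry point and end in one of the two error methods.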

Orthography.tokentypes — Function

Delegate to a specific method based on the orthography trait value of the argument's type.

tokentypes(x)

It is an error to invoke the tokentypes function on anything but an orthographic system.

tokentypes(_, x)

Orthographic systems must implement tokentypes.

tokentypes(_, ortho, s)

Implement tokentypes function for SimpleAscii.

tokentypes(ortho)

Implement tokentypes function for WSTokenizer.

tokentypes(ortho)

Orthography.validcp — Function

True if ch appears in the list of all valid characters (codepoints) for this orthography.

validcp(ch, ortho)

ch is a string, possibly including more than one Julia Char, that represents a single character in the orthographic system ortho.

Orthography.validstring — Function

True if every character in s is valid in the orthographic system ortho.

validstring(s, ortho)
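
The distinction between a Julia Char and a character of the orthography can be illustrated with a short sketch. It assumes that a "character" corresponds to a Unicode grapheme cluster and that the valid characters are held in a Set of strings. Both are assumptions made for illustration, and the helper names demo_validcp and demo_validstring are hypothetical. The package's own implementation may differ.

```julia
using Unicode  # standard library; provides graphemes

# Hypothetical helper: is the single character `ch` in the allowed set?
demo_validcp(ch::AbstractString, valid::Set{String}) = String(ch) in valid

# Hypothetical helper: is every grapheme of `s` in the allowed set?
demo_validstring(s::AbstractString, valid::Set{String}) =
    all(g -> demo_validcp(g, valid), graphemes(s))

# "e" followed by a combining acute accent: two Julia Chars, one character.
eacute = "e\u0301"
valid = Set(["b", eacute])

ok = demo_validstring("b" * eacute, valid)   # every grapheme is allowed
bad = demo_validstring("bx", valid)          # "x" is not in the set
```

Here length(eacute) is 2, yet the string counts as one character for validation, which is the situation the note on validcp describes.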

Orthography.tokenize — Function

Delegate to a specific method based on the orthography trait value of the argument's type.

tokenize(s, x)

It is an error to invoke the tokenize function on anything but an orthographic system.

tokenize(_, s, x)

Orthographic systems must implement tokenize.

tokenize(_, s, ortho)

Tokenize citable passage psg using the tokenizer of the given orthographic system.

tokenize(psg, ortho; edition, exemplar)

The return value is a list of pairings of a CitablePassage and a token category. Each resulting passage is citable at the level of the token.

Tokenize corpus c using the tokenizer of the given orthographic system.

tokenize(c, ortho; edition, exemplar)

The return value is a list of pairings of a CitablePassage and a token category. Each resulting passage is citable at the level of the token.

Tokenize document doc using the tokenizer of the given orthographic system.

tokenize(doc, ortho; edition, exemplar)

The return value is a list of pairings of a CitablePassage and a token category. Each resulting passage is citable at the level of the token.

Implement tokenize function for SimpleAscii orthography.

tokenize(s, o)

Implement tokenize function for WSTokenizer orthography.

tokenize(s, o)
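
To make the idea of pairing each token with a category concrete, here is a minimal sketch of a whitespace tokenizer over plain strings. It is not the package's SimpleAscii or WSTokenizer implementation: the category types (DemoCategory, DemoLexical, DemoNumeric, DemoPunctuation) and the functions demo_classify and demo_tokenize are hypothetical stand-ins, and the classification rules are an assumption.

```julia
# Hypothetical stand-ins for the token category types documented above.
abstract type DemoCategory end
struct DemoLexical <: DemoCategory end
struct DemoNumeric <: DemoCategory end
struct DemoPunctuation <: DemoCategory end

# Classify one token by inspecting all of its characters.
function demo_classify(tok::AbstractString)
    if all(isletter, tok)
        DemoLexical()
    elseif all(isdigit, tok)
        DemoNumeric()
    elseif all(ispunct, tok)
        DemoPunctuation()
    else
        throw(ArgumentError("cannot classify token: $tok"))
    end
end

# Split on whitespace and pair each token's text with its category.
demo_tokenize(s::AbstractString) =
    [(String(t), demo_classify(t)) for t in split(s)]

tokens = demo_tokenize("abc 42 !")
```

The result is a vector of (text, category) tuples, analogous in shape to the pairings that the citable-text methods of tokenize are documented to return.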

Working with text corpora

Orthography.corpus_histo — Function

Create an ordered dictionary counting the text values of tokens in corpus c. Optional parameters let you filter the results to include only tokens of a specified type, and normalize the text value of tokens before counting.

corpus_histo(c, ortho; filterby, normalizer)
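
The effect of normalizing before counting can be sketched without the package. The function demo_histo below is hypothetical: it works on a plain vector of strings instead of a corpus, returns a sorted vector of pairs instead of an ordered dictionary, and assumes ordering by descending count. None of this is confirmed as the behavior of corpus_histo.

```julia
# Hypothetical helper: count token strings after applying `normalizer`.
function demo_histo(tokens; normalizer = identity)
    counts = Dict{String,Int}()
    for t in tokens
        key = normalizer(t)
        counts[key] = get(counts, key, 0) + 1
    end
    # Assumed ordering: most frequent first, ties broken alphabetically.
    sort(collect(counts); by = p -> (-p.second, p.first))
end

raw = demo_histo(["The", "the", "cat"])
folded = demo_histo(["The", "the", "cat"]; normalizer = lowercase)
```

Without a normalizer, "The" and "the" are counted separately. With lowercase as the normalizer they are merged into a single entry.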

Other utilities

Orthography.nfkc — Function

Shorthand function to normalize string s to Unicode form NFKC.

nfkc(s)
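
A shorthand of this kind can be written as a one-line wrapper around Julia's standard Unicode library. The sketch below uses the hypothetical name demo_nfkc to avoid implying that it is the package's own definition.

```julia
using Unicode  # standard library; provides Unicode.normalize

# Hypothetical equivalent of a shorthand for NFKC normalization.
demo_nfkc(s::AbstractString) = Unicode.normalize(s, :NFKC)

# NFKC replaces compatibility characters such as the "fi" ligature (U+FB01)
# and composes a base letter with a following combining accent.
ligature = demo_nfkc("\ufb01sh")
composed = demo_nfkc("e\u0301")
```

Normalizing both the valid character list and any input text to the same form avoids mismatches between visually identical strings.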

Example implementation

Orthography.SimpleAscii — Type

An orthographic system for a basic alphabetic subset of the ASCII character set.

Orthography.simpleAscii — Function

Construct a SimpleAscii with correct member values.

simpleAscii()