API documentation: exported types and functions
Types
Orthography.OrthographicSystem — Type
An abstract type for orthographic systems.
Orthography.TokenCategory — Type
An abstract type for token categories.
Orthography.LexicalToken — Type
Category of alphabetic tokens.
Orthography.NumericToken — Type
Category of numeric tokens.
Orthography.PunctuationToken — Type
Category of punctuation tokens.
Functions
Public functions implemented for all subtypes of OrthographicSystem.
Orthography.codepoints — Function
Delegate to specific functions based on the type's orthography trait value.
codepoints(x)
It is an error to invoke the codepoints function on anything but an orthographic system.
codepoints(_, x)
Orthographic systems must implement codepoints.
codepoints(_, ortho)
Implement codepoints function for SimpleAscii.
codepoints(ortho)
Implement codepoints function for SimpleAscii.
codepoints(ortho)
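The delegation described in these entries resembles Julia's trait-based dispatch idiom. The following is a minimal, self-contained sketch of that idiom, not the package's source code: every name beginning with Demo or demo is hypothetical, and the trait mechanism the package actually uses may differ.

```julia
# Hypothetical trait values (assumption: not the package's own names).
abstract type DemoTrait end
struct IsDemoOrthography <: DemoTrait end
struct NotDemoOrthography <: DemoTrait end

# By default, a type does not carry the trait.
demotrait(::Type) = NotDemoOrthography()

# Two hypothetical types that opt in to the trait.
struct DemoAscii end
struct DemoUnfinished end
demotrait(::Type{DemoAscii}) = IsDemoOrthography()
demotrait(::Type{DemoUnfinished}) = IsDemoOrthography()

# Public entry point: look up the trait value of the argument's type and delegate.
democodepoints(x::T) where {T} = democodepoints(demotrait(T), x)

# Anything without the trait is rejected.
democodepoints(::NotDemoOrthography, x) =
    throw(ArgumentError("$(typeof(x)) is not an orthographic system"))

# A type with the trait but no method of its own ends up here.
democodepoints(::IsDemoOrthography, x) =
    throw(ArgumentError("$(typeof(x)) must implement democodepoints"))

# Concrete implementation: more specific than the generic entry point,
# so it is selected directly for DemoAscii values.
democodepoints(ortho::DemoAscii) = join('a':'z')

println(democodepoints(DemoAscii()))
```

In this sketch, calling democodepoints(42) or democodepoints(DemoUnfinished()) raises an ArgumentError, matching the two error cases documented above.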
Orthography.tokentypes — Function
Delegate to specific functions based on the type's orthography trait value.
tokentypes(x)
It is an error to invoke the tokentypes function on anything but an orthographic system.
tokentypes(_, x)
Orthographic systems must implement tokentypes.
tokentypes(_, ortho, s)
Implement tokentypes function for SimpleAscii.
tokentypes(ortho)
Implement tokentypes function for WSTokenizer.
tokentypes(ortho)
Orthography.validcp — Function
True if ch appears in the list of all valid characters (codepoints) for this orthography.
validcp(ch, ortho)
ch is a string, possibly including more than one Julia Char, but representing a single character in the orthographic system ortho.
Orthography.validstring — Function
True if all characters in s are valid in the orthographic system ortho.
validstring(s, ortho)
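To illustrate why a single character is passed as a string rather than a Char, here is a minimal sketch of character and string validation. It is not the package's implementation: DemoOrthography, demo_validcp and demo_validstring are hypothetical names, and splitting the input into grapheme clusters with the standard library's Unicode.graphemes is an assumption about how a string might be divided into characters.

```julia
using Unicode: graphemes

# Hypothetical stand-in for an orthographic system: a list of valid characters,
# each stored as a String because one character may span several Julia Chars.
struct DemoOrthography
    validchars::Vector{String}
end

# True if the single character ch is in the list of valid characters.
demo_validcp(ch::AbstractString, ortho::DemoOrthography) = ch in ortho.validchars

# True if every grapheme cluster of s is a valid character.
# (Assumption: grapheme clusters approximate "characters" in the orthography.)
demo_validstring(s::AbstractString, ortho::DemoOrthography) =
    all(g -> demo_validcp(g, ortho), graphemes(s))

# The last entry is "e" followed by a combining acute accent: two Chars, one character.
demo_ortho = DemoOrthography(["a", "b", "c", "e\u0301"])

println(length("e\u0301"))                       # number of Julia Chars in that entry
println(demo_validcp("e\u0301", demo_ortho))
println(demo_validstring("cab!", demo_ortho))
```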
Orthography.tokenize — Function
Delegate to specific functions based on the type's orthography trait value.
tokenize(s, x)
It is an error to invoke the tokenize function on anything but an orthographic system.
tokenize(_, s, x)
Orthographic systems must implement tokenize.
tokenize(_, s, ortho)
Tokenize citable node psg using the tokenizer of the given orthographic system.
tokenize(psg, ortho; edition, exemplar)
The return value is a list of pairings of a CitablePassage and a token category. The citable node is citable at the level of the token.
Tokenize corpus c using the tokenizer of the given orthographic system.
tokenize(c, ortho; edition, exemplar)
The return value is a list of pairings of a CitablePassage and a token category. The citable node is citable at the level of the token.
Tokenize document doc using the tokenizer of the given orthographic system.
tokenize(doc, ortho; edition, exemplar)
The return value is a list of pairings of a CitablePassage and a token category. The citable node is citable at the level of the token.
Implement tokenize function for SimpleAscii orthography.
tokenize(s, o)
Implement tokenize function for WSTokenizer orthography.
tokenize(s, o)
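As a rough illustration of pairing each token with a token category, the sketch below splits a string on whitespace and classifies every chunk. It is a simplified stand-in under stated assumptions, not the behavior of SimpleAscii or WSTokenizer: the Demo types and the demo_category and demo_tokenize functions are hypothetical, and the classification rules (all digits, all punctuation, otherwise lexical) are chosen only for the example.

```julia
# Hypothetical category types, loosely mirroring the three documented token categories.
abstract type DemoCategory end
struct DemoLexical <: DemoCategory end
struct DemoNumeric <: DemoCategory end
struct DemoPunctuation <: DemoCategory end

# Classify one whitespace-delimited chunk (assumed rules, for illustration only).
function demo_category(chunk::AbstractString)
    if all(isdigit, chunk)
        DemoNumeric()
    elseif all(ispunct, chunk)
        DemoPunctuation()
    else
        DemoLexical()
    end
end

# Split on whitespace and pair each chunk with its category.
demo_tokenize(s::AbstractString) =
    [(String(chunk), demo_category(chunk)) for chunk in split(s)]

demo_tokens = demo_tokenize("Chapter 12 : begin")
for (text, category) in demo_tokens
    println(text, " => ", typeof(category))
end
```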
Working with text corpora:
Orthography.corpus_histo — Function
Create an ordered dictionary of counts for the text values of tokens in corpus c. The optional parameters filterby and normalizer let you restrict the results to tokens of a specified type and normalize the text value of tokens before counting.
corpus_histo(c, ortho; filterby, normalizer)
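The counting logic can be sketched without the package. In the example below, demo_histo is a hypothetical helper that works on plain (text, category) tuples rather than a corpus, uses Symbols as categories, and returns a sorted vector of pairs in place of an ordered dictionary. The keyword names filterby and normalizer follow the signature above, but their default values here are assumptions.

```julia
# Hypothetical helper: tokens are (text, category) tuples with Symbol categories.
function demo_histo(tokens; filterby = nothing, normalizer = identity)
    counts = Dict{String,Int}()
    for (text, category) in tokens
        # Skip tokens of other categories when a filter is given.
        if filterby !== nothing && category != filterby
            continue
        end
        key = normalizer(text)
        counts[key] = get(counts, key, 0) + 1
    end
    # Most frequent first; ties are broken alphabetically so the order is deterministic.
    sort(collect(counts); by = p -> (-p.second, p.first))
end

demo_sample = [("The", :lexical), ("the", :lexical), ("cat", :lexical), (",", :punctuation)]
demo_counts = demo_histo(demo_sample; filterby = :lexical, normalizer = lowercase)
println(demo_counts)
```

Normalizing with lowercase before counting merges "The" and "the" into one entry, and filtering on :lexical drops the comma.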
Other utilities
Orthography.nfkc — Function
Shorthand function to normalize string s to Unicode form NFKC.
nfkc(s)
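A comparable shorthand can be written with Julia's standard library, where Unicode.normalize accepts :NFKC as a normalization form. The name demo_nfkc below is hypothetical, and whether the package's nfkc is defined the same way is an assumption.

```julia
using Unicode

# Hypothetical equivalent of the shorthand, built on the standard library.
demo_nfkc(s::AbstractString) = Unicode.normalize(s, :NFKC)

# NFKC replaces compatibility characters and composes combining sequences.
println(demo_nfkc("\ufb01"))           # the single-codepoint "fi" ligature becomes the two letters "fi"
println(length(demo_nfkc("e\u0301")))  # "e" + combining acute composes to one Char
```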
Example implementation
Orthography.SimpleAscii — Type
An orthographic system for a basic alphabetic subset of the ASCII character set.
Orthography.simpleAscii — Function
Construct a SimpleAscii with correct member values.
simpleAscii()