class documentation

class PunktToken(object): (source)

Constructor: PunktToken(tok, **params)


Stores a token of text with annotations produced during sentence boundary detection.

Method __init__ Undocumented
Method __repr__ A string representation of the token that can reproduce it with eval(), which lists all the token's non-default annotations.
Method __str__ A string representation akin to that used by Kiss and Strunk.
Class Variable __slots__ Undocumented
Instance Variable period_final Undocumented
Instance Variable tok Undocumented
Instance Variable type Undocumented
Property first_case Undocumented
Property first_lower True if the token's first character is lowercase.
Property first_upper True if the token's first character is uppercase.
Property is_alpha True if the token text is all alphabetic.
Property is_ellipsis True if the token text is that of an ellipsis.
Property is_initial True if the token text is that of an initial.
Property is_non_punct True if the token is either a number or is alphabetic.
Property is_number True if the token text is that of a number.
Property type_no_period The type with its final period removed if it has one.
Property type_no_sentperiod The type with its final period removed if it is marked as a sentence break.
Method _get_type Returns a case-normalized representation of the token.
Constant _RE_ALPHA Undocumented
Constant _RE_ELLIPSIS Undocumented
Constant _RE_INITIAL Undocumented
Constant _RE_NUMERIC Undocumented
Class Variable _properties Undocumented
def __init__(self, tok, **params): (source)

Undocumented

def __repr__(self): (source)

A string representation of the token that can reproduce it with eval(), which lists all the token's non-default annotations.
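
The eval() round trip described above can be illustrated with a small stand-in class. This is only a sketch: TokSketch and its two annotation names are hypothetical and chosen for illustration, and this is not the PunktToken implementation. The idea is that annotations left at their default (None here) are omitted from the output, and everything else is written as keyword arguments.

```python
class TokSketch:
    """Hypothetical stand-in used to illustrate an eval()-able repr."""

    # Annotation names are assumptions made for this sketch.
    _flags = ("sentbreak", "abbr")

    def __init__(self, tok, **params):
        self.tok = tok
        # Every annotation starts at its default value.
        for name in self._flags:
            setattr(self, name, None)
        # Keyword arguments override the defaults.
        for name, value in params.items():
            setattr(self, name, value)

    def __repr__(self):
        # List only the annotations that differ from the default.
        extras = "".join(
            f", {name}={getattr(self, name)!r}"
            for name in self._flags
            if getattr(self, name) is not None
        )
        return f"TokSketch({self.tok!r}{extras})"


original = TokSketch("Mr.", abbr=True)
clone = eval(repr(original))  # rebuilds an equivalent object from the repr
print(repr(original))
```

Because the repr is valid constructor syntax, the rebuilt object carries the same token text and the same non-default annotations as the original.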

def __str__(self): (source)

A string representation akin to that used by Kiss and Strunk.
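
As a rough illustration of a Kiss-and-Strunk style rendering, the sketch below appends one marker per active annotation to the token text. The function name render_token, the annotation names, and the marker strings `<A>`, `<E>` and `<S>` are assumptions made for this example and are not confirmed by this page.

```python
def render_token(tok, abbr=False, ellipsis=False, sentbreak=False):
    """Return the token text followed by a marker for each active annotation.

    The marker strings are assumed for illustration.
    """
    out = tok
    if abbr:
        out += "<A>"  # token was judged to be an abbreviation
    if ellipsis:
        out += "<E>"  # token was judged to be an ellipsis
    if sentbreak:
        out += "<S>"  # token ends a sentence
    return out


print(render_token("Mr.", abbr=True))
print(render_token("home.", sentbreak=True))
```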

__slots__ = (source)

Undocumented

period_final = (source)

Undocumented

tok = (source)

Undocumented

type = (source)

Undocumented

@property
first_case = (source)

Undocumented

@property
first_lower = (source)

True if the token's first character is lowercase.

@property
first_upper = (source)

True if the token's first character is uppercase.

@property
is_alpha = (source)

True if the token text is all alphabetic.

@property
is_ellipsis = (source)

True if the token text is that of an ellipsis.

@property
is_initial = (source)

True if the token text is that of an initial.

@property
is_non_punct = (source)

True if the token is either a number or is alphabetic.

@property
is_number = (source)

True if the token text is that of a number.

@property
type_no_period = (source)

The type with its final period removed if it has one.

@property
type_no_sentperiod = (source)

The type with its final period removed if it is marked as a sentence break.
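
The two period-stripping properties can be sketched as plain functions. This is a minimal sketch under stated assumptions, not the library code: the function signatures are hypothetical, and leaving a lone "." untouched (the length guard) is an assumption.

```python
def type_no_period(tok_type):
    """Drop a single trailing period, if the type has one."""
    # Assumption: a type that is only "." is returned unchanged.
    if len(tok_type) > 1 and tok_type.endswith("."):
        return tok_type[:-1]
    return tok_type


def type_no_sentperiod(tok_type, sentbreak):
    """Drop the trailing period only when the token is marked as a sentence break."""
    return type_no_period(tok_type) if sentbreak else tok_type


print(type_no_period("etc."))
print(type_no_sentperiod("etc.", sentbreak=False))
print(type_no_sentperiod("home.", sentbreak=True))
```

The second function differs from the first only in that an unmarked token keeps its period, which matters for abbreviations such as "etc." that do not end a sentence.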

def _get_type(self, tok): (source)

Returns a case-normalized representation of the token.
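
One plausible reading of "case-normalized" is sketched below: the token is lowercased, and tokens that look like numbers are collapsed to a single placeholder so that all numbers share one type. The regular expression is the _RE_NUMERIC value listed on this page; the standalone function get_type and the placeholder string are assumptions for illustration.

```python
import re

# Pattern copied from the _RE_NUMERIC constant documented on this page.
_RE_NUMERIC = re.compile(r'^-?[\.,]?\d[\d,\.-]*\.?$')

# Placeholder string is an assumption made for this sketch.
NUMBER_PLACEHOLDER = "##number##"


def get_type(tok):
    """Lowercase the token; replace a fully numeric token with a placeholder."""
    return _RE_NUMERIC.sub(NUMBER_PLACEHOLDER, tok.lower())


print(get_type("Hello"))
print(get_type("3.14"))
```

Since the pattern is anchored at both ends, a token is replaced only when it is numeric as a whole; a mixed token such as "B2B" is merely lowercased.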

_RE_ALPHA = (source)

Undocumented

Value
re.compile(r'[^\W\d]+$',
           re.UNICODE)
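
The following sketch shows what this pattern accepts when applied with match(), which anchors at the start of the string; whether the class uses match() or search() is an assumption here. Note that `[^\W\d]` means "a word character that is not a digit", so the underscore is accepted as well as letters.

```python
import re

# Pattern copied from the value shown above.
_RE_ALPHA = re.compile(r'[^\W\d]+$', re.UNICODE)

samples = ["hello", "naïve", "abc123", "x_y", "don't"]
# Map each sample to whether the whole string matches.
results = {text: bool(_RE_ALPHA.match(text)) for text in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```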
_RE_ELLIPSIS = (source)

Undocumented

Value
re.compile(r'\.\.+$')
_RE_INITIAL = (source)

Undocumented

Value
re.compile(r'[^\W\d]\.$',
           re.UNICODE)
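
The pattern describes a single non-digit word character followed by a period. The sketch below assumes it is applied with match(), so the token must be exactly two characters long, as in the initial of a name.

```python
import re

# Pattern copied from the value shown above.
_RE_INITIAL = re.compile(r'[^\W\d]\.$', re.UNICODE)

samples = ["A.", "é.", "Ab.", "1.", "A"]
# Only a single letter-like character followed by a period matches.
results = {text: bool(_RE_INITIAL.match(text)) for text in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```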
_RE_NUMERIC = (source)

Undocumented

Value
re.compile(r'^-?[\.,]?\d[\d,\.-]*\.?$')
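
The numeric pattern is fairly permissive: an optional minus sign, an optional leading period or comma, one digit, then any mix of digits, commas, periods and hyphens, and an optional final period. The sketch below probes it with a few sample strings, which are chosen for illustration.

```python
import re

# Pattern copied from the value shown above.
_RE_NUMERIC = re.compile(r'^-?[\.,]?\d[\d,\.-]*\.?$')

samples = ["42", "-3.5", "1,000,000", ".5", "2019-01-01", "12.", "abc", "-", "1e5", "$5"]
# The pattern is anchored at both ends, so the whole string must be numeric.
results = {text: bool(_RE_NUMERIC.match(text)) for text in samples}

for text, ok in results.items():
    print(f"{text!r}: {ok}")
```

Date-like strings such as "2019-01-01" are accepted because hyphens are allowed after the first digit, while scientific notation and currency symbols are not.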
_properties: list[str] = (source)

Undocumented