nvm.aux_str package
Submodules
nvm.aux_str.aux_str module
- nvm.aux_str.aux_str.is_ascii(s)[source]
Check if the characters in string s are in ASCII.
- Parameters
s (str) – String to be checked if it contains only ASCII characters.
- Returns
True if s contains only ASCII characters.
- Return type
bool
Examples
>>> from nvm.aux_str import is_ascii
>>> assert is_ascii("abc 123")
>>> assert not is_ascii("abc 123 ×")
>>> assert not is_ascii("abc 123 ")
- nvm.aux_str.aux_str.is_ascii_alt(s)[source]
Check if the characters in string s are in ASCII, U+0-U+7F.
- Parameters
s (str) – String to be checked if it contains only ASCII characters.
- Returns
True if s contains only ASCII characters.
- Return type
bool
Examples
>>> from nvm.aux_str import is_ascii_alt
>>> assert is_ascii_alt("abc 123")
>>> assert not is_ascii_alt("abc 123 ×")
>>> assert not is_ascii_alt("abc 123 ")
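The two checks above can be approximated in plain Python. The sketch below is not the library's implementation; the function names `is_ascii_by_code_point` and `is_ascii_by_encoding` are hypothetical and only illustrate two common ways to test whether every character falls in the range U+0000 to U+007F.

```python
def is_ascii_by_code_point(s: str) -> bool:
    """Hypothetical sketch: compare each code point against U+007F."""
    return all(ord(ch) <= 0x7F for ch in s)


def is_ascii_by_encoding(s: str) -> bool:
    """Hypothetical sketch: let the ASCII codec decide."""
    try:
        s.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


samples = ["abc 123", "abc 123 \u00d7", "abc 123\u00a0", ""]
for sample in samples:
    # Both approaches should agree on every input.
    print(repr(sample), is_ascii_by_code_point(sample), is_ascii_by_encoding(sample))
```

On Python 3.7 and later, the built-in `str.isascii()` method gives the same answer for these inputs.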
- nvm.aux_str.aux_str.clean_str(text, mappings=[{' ': ['\\n', '\\r', '\\t']}, {'-': ['−', '–', '—', '―', '﹣', '-']}])[source]
Clean a string by replacing unwanted text with the desired replacement.
This function can be used to clean text from redundant whitespace characters and other common problems.
- Parameters
text (str) – Text to be cleaned.
mappings (List[Dict[str, List[Union[str, Pattern[str]]]]], default=[{' ': ['\n', '\r', '\t']}, {'-': ['−', '–', '—', '―', '﹣', '-']}]) – List of mappings used for text cleaning, given as a list of dictionaries. Each dictionary key is the replacement string; its value is a list of string patterns or compiled regexes whose matches are replaced by that key. The default value is sourced from nvm.aux_str.clean_str_mappings.CLEAN_STR_MAPPINGS_TINY.
- Returns
Clean text.
- Return type
str
Examples
To clean a string use:
>>> from nvm.aux_str import clean_str
>>> text_dirty = " one two three\t \n\n\r four... "
>>> text_clean = clean_str(text=text_dirty)
>>> # print(text_dirty)
>>> print(text_clean)
"one two three four..."
This function can be applied to a pandas DataFrame column, for example:
>>> # let df0 be a dataframe that contains text column "text"
>>> # to clean its content in place we may run
>>> text_field = "text"
>>> df0[text_field] = df0[text_field].apply(clean_str)
The mappings argument should be a list of dictionaries that define string-pattern- or regex-based replacements used for text cleaning. Each dictionary key is a string used as the replacement for matches of the patterns listed in the corresponding value (List[Dict[str, List[Union[str, Pattern[str]]]]]).
For example, to replace all occurrences of LF (Line Feed, "\n"), CR (Carriage Return, "\r") and HT (Horizontal Tab, "\t") with " " (space), and to replace all occurrences of some dash-like characters with "-", the following mapping can be used:
>>> mappings = [
>>>     {
>>>         " ": [  # Unicode Character 'SPACE' (U+0020)
>>>             "\n",  # LF (Line Feed)
>>>             "\r",  # CR (Carriage Return)
>>>             "\t",  # HT (Horizontal Tab)
>>>         ],
>>>     },
>>>     {
>>>         "-": [  # Unicode Character 'HYPHEN-MINUS' (U+002D), chr(45)
>>>             "\u2212",  # Unicode Character 'MINUS SIGN' (U+2212)
>>>             "\u2013",  # Unicode Character 'EN DASH' (U+2013), chr(8211)
>>>             "\u2014",  # Unicode Character 'EM DASH' (U+2014)
>>>             "\u2015",  # Unicode Character 'HORIZONTAL BAR' (U+2015)
>>>             "\uFE63",  # Unicode Character 'SMALL HYPHEN-MINUS' (U+FE63)
>>>             "\uFF0D",  # Unicode Character 'FULLWIDTH HYPHEN-MINUS' (U+FF0D)
>>>         ],
>>>     },
>>> ]
Note
Hint: an empty string can be used to remove text matching a regex, for example:
>>> import re
>>> mappings = [{"": [re.compile(r"[0-9]")]}]  # remove digits
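To make the structure of a mappings list concrete, here is a minimal sketch of how such a list could be applied to a string. The helper `apply_mappings_sketch` is hypothetical and is not the code behind `clean_str`, which may perform additional steps (the documented outputs suggest it also trims and collapses whitespace). The sketch assumes that plain strings are replaced literally and that compiled patterns are applied with `sub`.

```python
import re
from typing import Dict, List, Union

# Same shape as the documented type: replacement -> list of patterns.
Mappings = List[Dict[str, List[Union[str, re.Pattern]]]]


def apply_mappings_sketch(text: str, mappings: Mappings) -> str:
    """Hypothetical helper: apply every replacement in list order."""
    for mapping in mappings:
        for replacement, patterns in mapping.items():
            for pattern in patterns:
                if isinstance(pattern, re.Pattern):
                    # Compiled regex: substitute every match.
                    text = pattern.sub(replacement, text)
                else:
                    # Plain string: literal replacement.
                    text = text.replace(pattern, replacement)
    return text


demo_mappings: Mappings = [
    {" ": ["\n", "\r", "\t"]},
    {"-": ["\u2212", "\u2013", "\u2014"]},
    {"": [re.compile(r"[0-9]")]},  # empty key removes the matches
]
result = apply_mappings_sketch("a\u2013b\tc 42", demo_mappings)
print(repr(result))  # 'a-b c '
```

Because the dictionaries are processed in order, a later mapping sees the output of the earlier ones, so the order of entries in the list can matter.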
nvm.aux_str also provides a few useful mappings:
>>> # Import example mappings:
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_TINY
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_LARGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_HUGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_SPACE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_DROP_HASHTAGS
>>> # Display sample mapping as JSON:
>>> import srsly
>>> print(srsly.json_dumps(CLEAN_STR_MAPPINGS_TINY, indent=2))
[
{
" ":[
"\n",
"\r",
"\t"
]
},
{
"-":[
"\u2212",
"\u2013",
"\u2014",
"\u2015",
"\ufe63",
"\uff0d"
]
}
]
Note that we used the json_dumps function from the srsly library to get indented JSON output.
Drop hashtags
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_DROP_HASHTAGS as map0
>>> from nvm.aux_str import clean_str
>>> text_dirty = " #one\ntwo\n\tthree #3443 #three434 #44ok \t #four... five #hashTag comose text"
>>> text_clean = clean_str(text=text_dirty, mappings=map0)
>>> # print(text_dirty)
>>> print(text_clean)
"two three #3443 ... five comose text"
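The output above keeps the purely numeric tag `#3443` while dropping tags that contain letters. The contents of `CLEAN_STR_MAPPINGS_DROP_HASHTAGS` are not shown here, so the following is only a guess at a rule with the same effect on this input: the pattern `HASHTAG_WITH_LETTER` and the helper `drop_hashtags_sketch` are hypothetical, and the whitespace collapsing step is an assumption as well.

```python
import re

# Assumed rule: a hashtag is dropped only if it contains at least one ASCII letter.
HASHTAG_WITH_LETTER = re.compile(r"#\w*[A-Za-z]\w*")


def drop_hashtags_sketch(text: str) -> str:
    """Hypothetical helper: remove letter-bearing hashtags, then tidy whitespace."""
    without_tags = HASHTAG_WITH_LETTER.sub("", text)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(without_tags.split())


text_dirty = " #one\ntwo\n\tthree #3443 #three434 #44ok \t #four... five #hashTag comose text"
text_clean = drop_hashtags_sketch(text_dirty)
print(text_clean)  # two three #3443 ... five comose text
```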
nvm.aux_str.clean_str_mappings module
This module contains some useful mappings for the nvm.aux_str.clean_str function.
Examples
>>> # Import example mappings:
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_TINY
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_LARGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_HUGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_SPACE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_DROP_HASHTAGS
>>> # Display sample mappings as JSON:
>>> import srsly
>>> print(srsly.json_dumps(CLEAN_STR_MAPPINGS_TINY, indent=2))
[
{
" ":[
"\n",
"\r",
"\t"
]
},
{
"-":[
"\u2212",
"\u2013",
"\u2014",
"\u2015",
"\ufe63",
"\uff0d"
]
}
]
>>> # Use mappings to clean string:
>>> from nvm.aux_str import clean_str
>>> text_dirty = " one two three\t \n\n\r four... "
>>> text_clean = clean_str(
>>> text=text_dirty,
>>> mappings=CLEAN_STR_MAPPINGS_TINY,
>>> )
>>> # print(text_dirty)
>>> print(text_clean)
"one two three four..."
nvm.aux_str.now module
- nvm.aux_str.now.now(tz0='Europe/Berlin', fm0='%Y%m%dT%H%M%S')[source]
Get date and time as string (e.g., “20230201T070809”).
- Parameters
tz0 (Union[str, pytz.BaseTzInfo]) – Timezone (defaults to “Europe/Berlin”).
fm0 (str) – Output string format (defaults to “%Y%m%dT%H%M%S”).
- Returns
Date and time as string (e.g., “20230201T070809”).
- Return type
str
Examples
>>> from nvm import now
>>> now()
"20220607T024010"
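For readers who want to see how the two parameters interact, the sketch below reproduces the documented behavior with the standard library only. It is not the library's implementation: `now_sketch` is a hypothetical name, and it uses `zoneinfo` (Python 3.9+) where the documented signature mentions pytz.

```python
from datetime import datetime, timezone, tzinfo
from typing import Union
from zoneinfo import ZoneInfo  # standard library since Python 3.9


def now_sketch(
    tz0: Union[str, tzinfo] = "Europe/Berlin",
    fm0: str = "%Y%m%dT%H%M%S",
) -> str:
    """Hypothetical stand-in for now(): current time in tz0, formatted with fm0."""
    # Accept either an IANA timezone name or a ready-made tzinfo object.
    zone = ZoneInfo(tz0) if isinstance(tz0, str) else tz0
    return datetime.now(tz=zone).strftime(fm0)


# Passing a tzinfo object avoids relying on the system timezone database.
stamp = now_sketch(tz0=timezone.utc)
print(stamp)  # 15 characters in the default format, e.g. 20230201T070809
```

The default format sorts chronologically as plain text, which makes it convenient for file names and log prefixes.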
nvm.aux_str.regex module
This module contains some useful regular expressions.
Examples
>>> from nvm.aux_str.regex import REGEX_ABC_DASH_XYZ_ASTERISK as re0
>>> re0.pattern
'^[a-z]+(\-[a-z]+)*\*?$'
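The pattern shown above accepts one or more lowercase words joined by single dashes, optionally followed by one asterisk. The sketch below copies the pattern string into a local variable (`PATTERN_COPY` and `matches` are illustrative names, not part of the package) so that it runs without nvm installed.

```python
import re

# Local copy of the pattern string shown above.
PATTERN_COPY = re.compile(r"^[a-z]+(\-[a-z]+)*\*?$")


def matches(name: str) -> bool:
    """Return True if the whole string fits the pattern."""
    return PATTERN_COPY.match(name) is not None


candidates = ["abc", "abc-xyz", "abc-xyz*", "Abc", "abc-", "abc--xyz", "*abc", "abc1"]
accepted = [name for name in candidates if matches(name)]
rejected = [name for name in candidates if not matches(name)]
print(accepted)  # ['abc', 'abc-xyz', 'abc-xyz*']
print(rejected)  # ['Abc', 'abc-', 'abc--xyz', '*abc', 'abc1']
```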