nvm.aux_str package

Submodules

nvm.aux_str.aux_str module

nvm.aux_str.aux_str.is_ascii(s)

Check whether all characters in the string s are ASCII.

Parameters: s (str) – String to check.
Returns: True if s contains only ASCII characters, False otherwise.
Return type: bool

Examples

>>> from nvm.aux_str import is_ascii
>>> assert is_ascii("abc 123")
>>> assert not is_ascii("abc 123 ×")
>>> assert not is_ascii("abc 123 ")
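For illustration only, a check of this kind can be sketched by comparing code points. The name is_ascii_sketch is hypothetical, and this is not necessarily how is_ascii is implemented in nvm.

```python
def is_ascii_sketch(s: str) -> bool:
    """Return True if every character of s has a code point below 128."""
    # Hypothetical helper, not part of nvm: ASCII covers code points 0..127.
    return all(ord(ch) < 128 for ch in s)


print(is_ascii_sketch("abc 123"))    # True
print(is_ascii_sketch("abc 123 ×"))  # False
```

On Python 3.7 and later, the built-in str.isascii() method gives the same answer for any string.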

nvm.aux_str.aux_str.is_ascii_alt(s)

Check whether all characters in the string s are ASCII, that is, in the range U+0000 to U+007F.

Parameters: s (str) – String to check.
Returns: True if s contains only ASCII characters, False otherwise.
Return type: bool

Examples

>>> from nvm.aux_str import is_ascii_alt
>>> assert is_ascii_alt("abc 123")
>>> assert not is_ascii_alt("abc 123 ×")
>>> assert not is_ascii_alt("abc 123 ")
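One possible way to express the U+0000 to U+007F range check is a regular expression over that range. The names ASCII_ONLY and is_ascii_range_sketch below are hypothetical; the actual implementation of is_ascii_alt may differ.

```python
import re

# Hypothetical pattern: zero or more characters in the range U+0000..U+007F.
ASCII_ONLY = re.compile(r"[\x00-\x7F]*")


def is_ascii_range_sketch(s: str) -> bool:
    """Return True if the whole string consists of characters in U+0000..U+007F."""
    return ASCII_ONLY.fullmatch(s) is not None


print(is_ascii_range_sketch("abc 123"))    # True
print(is_ascii_range_sketch("abc 123 ×"))  # False
```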

nvm.aux_str.aux_str.clean_str(text, mappings=[{' ': ['\n', '\r', '\t']}, {'-': ['−', '–', '—', '―', '﹣', '－']}])

Clean a string by replacing unwanted text with the desired replacements.

This function can be used to clean text from redundant whitespace characters and other common problems.

Parameters

text (str) – Text to be cleaned.
mappings (List[Dict[str, List[Union[str, Pattern[str]]]]], default=[{' ': ['\n', '\r', '\t']}, {'-': ['−', '–', '—', '―', '﹣', '－']}]) – List of mappings used for text cleaning, given as a list of dictionaries. Each dictionary key is the replacement string; its value is a list of string patterns or compiled regexes whose matches are replaced by that key. The default value is sourced from nvm.aux_str.clean_str_mappings.CLEAN_STR_MAPPINGS_TINY.

Returns

Clean text.

Return type

str

Examples

To clean a string use:

>>> from nvm.aux_str import clean_str
>>> text_dirty = "  one two  three\t \n\n\r four...  "
>>> text_clean = clean_str(text=text_dirty)
>>> # print(text_dirty)
>>> print(text_clean)
one two three four...

This function can be applied to a pandas DataFrame column, for example:

>>> # let df0 be a dataframe that contains text column "text"
>>> # to clean its content in place we may run
>>> text_field = "text"
>>> df0[text_field] = df0[text_field].apply(clean_str)

The mappings argument should be a list of dictionaries that define string- or regex-based replacements used for text cleaning (type List[Dict[str, List[Union[str, Pattern[str]]]]]). Each dictionary key is the replacement string, and the corresponding value is a list of patterns whose matches are replaced by that key.

For example, to replace all occurrences of LF (Line Feed, "\n"), CR (Carriage Return, "\r"), and HT (Horizontal Tab, "\t") with " " (space), and to replace all occurrences of some dash-like characters with "-", the following mappings can be used:

>>> mappings = [
...     {
...         " ": [  # Unicode Character 'SPACE' (U+0020)
...             "\n",  # LF (Line Feed)
...             "\r",  # CR (Carriage Return)
...             "\t",  # HT (Horizontal Tab)
...         ],
...     },
...     {
...         "-": [  # Unicode Character 'HYPHEN-MINUS' (U+002D), chr(45)
...             "\u2212",  # Unicode Character 'MINUS SIGN' (U+2212)
...             "\u2013",  # Unicode Character 'EN DASH' (U+2013), chr(8211)
...             "\u2014",  # Unicode Character 'EM DASH' (U+2014)
...             "\u2015",  # Unicode Character 'HORIZONTAL BAR' (U+2015)
...             "\uFE63",  # Unicode Character 'SMALL HYPHEN-MINUS' (U+FE63)
...             "\uFF0D",  # Unicode Character 'FULLWIDTH HYPHEN-MINUS' (U+FF0D)
...         ],
...     },
... ]

Note

Hint: an empty string can be used to remove text matching a regex, for example:

>>> import re
>>> mappings = [{"": [re.compile(r"[0-9]")]}]  # remove digits
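The replacement logic described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual clean_str implementation: the name apply_mappings_sketch is hypothetical, and the final step (collapsing whitespace runs and trimming the ends) is assumed from the example output shown earlier.

```python
import re
from typing import Dict, List, Union

# Same shape as the documented mappings argument.
Mappings = List[Dict[str, List[Union[str, re.Pattern]]]]


def apply_mappings_sketch(text: str, mappings: Mappings) -> str:
    """Apply each mapping in order, then normalize whitespace (assumed step)."""
    for mapping in mappings:
        for replacement, patterns in mapping.items():
            for pattern in patterns:
                if isinstance(pattern, re.Pattern):
                    # Compiled regex: replace every match with the key.
                    text = pattern.sub(replacement, text)
                else:
                    # Plain string: replace every literal occurrence.
                    text = text.replace(pattern, replacement)
    # Assumption: collapse runs of whitespace and strip both ends.
    return " ".join(text.split())


tiny = [
    {" ": ["\n", "\r", "\t"]},
    {"-": ["\u2212", "\u2013", "\u2014", "\u2015", "\uFE63", "\uFF0D"]},
]
print(apply_mappings_sketch("  one two  three\t \n\n\r four...  ", tiny))
# one two three four...
print(apply_mappings_sketch("a1b22c", [{"": [re.compile(r"[0-9]")]}]))
# abc
```

Because the mappings are applied in list order, a later mapping sees the output of the earlier ones, so the order of dictionaries can matter when patterns overlap.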

nvm.aux_str also provides a few useful mappings:

>>> # Import example mappings:
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_TINY
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_LARGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_HUGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_SPACE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_DROP_HASHTAGS
>>> # Display sample mapping as JSON:
>>> import srsly
>>> print(srsly.json_dumps(CLEAN_STR_MAPPINGS_TINY, indent=2))
[
  {
    " ":[
      "\n",
      "\r",
      "\t"
    ]
  },
  {
    "-":[
      "\u2212",
      "\u2013",
      "\u2014",
      "\u2015",
      "\ufe63",
      "\uff0d"
    ]
  }
]

Note that we used json_dumps function from the srsly library to get indented JSON output.

To drop hashtags, use the CLEAN_STR_MAPPINGS_DROP_HASHTAGS mappings:

>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_DROP_HASHTAGS as map0
>>> from nvm.aux_str import clean_str
>>> text_dirty = "  #one\ntwo\n\tthree #3443 #three434 #44ok \t #four... five #hashTag comose text"
>>> text_clean = clean_str(text=text_dirty, mappings=map0)
>>> # print(text_dirty)
>>> print(text_clean)
two three #3443 ... five comose text

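The output above suggests that hashtags containing at least one letter are removed, while purely numeric ones such as "#3443" are kept. A regex with that behavior could look like the sketch below. The pattern HASHTAG_WITH_LETTER and the function drop_hashtags_sketch are hypothetical; the actual contents of CLEAN_STR_MAPPINGS_DROP_HASHTAGS are not shown here and may differ.

```python
import re

# Hypothetical pattern: "#" followed by a word that contains at least one
# ASCII letter; purely numeric tags such as "#3443" do not match.
HASHTAG_WITH_LETTER = re.compile(r"#(?=\w*[A-Za-z])\w+")


def drop_hashtags_sketch(text: str) -> str:
    """Remove matching hashtags, then collapse whitespace (assumed step)."""
    without_tags = HASHTAG_WITH_LETTER.sub("", text)
    return " ".join(without_tags.split())


text_dirty = "  #one\ntwo\n\tthree #3443 #three434 #44ok \t #four... five #hashTag comose text"
print(drop_hashtags_sketch(text_dirty))
# two three #3443 ... five comose text
```
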
nvm.aux_str.clean_str_mappings module

This module contains some useful mappings for the nvm.aux_str.clean_str function.

Examples

>>> # Import example mappings:
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_TINY
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_LARGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_HUGE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_SPACE
>>> from nvm.aux_str import CLEAN_STR_MAPPINGS_DROP_HASHTAGS
>>> # Display sample mappings as JSON:
>>> import srsly
>>> print(srsly.json_dumps(CLEAN_STR_MAPPINGS_TINY, indent=2))
[
  {
    " ":[
      "\n",
      "\r",
      "\t"
    ]
  },
  {
    "-":[
      "\u2212",
      "\u2013",
      "\u2014",
      "\u2015",
      "\ufe63",
      "\uff0d"
    ]
  }
]
>>> # Use mappings to clean string:
>>> from nvm.aux_str import clean_str
>>> text_dirty = "  one two  three\t \n\n\r four...  "
>>> text_clean = clean_str(
...     text=text_dirty,
...     mappings=CLEAN_STR_MAPPINGS_TINY,
... )
>>> # print(text_dirty)
>>> print(text_clean)
one two three four...

nvm.aux_str.now module

nvm.aux_str.now.now(tz0='Europe/Berlin', fm0='%Y%m%dT%H%M%S')

Get date and time as string (e.g., “20230201T070809”).

Parameters

tz0 (Union[str, pytz.BaseTzInfo]) – Timezone (defaults to “Europe/Berlin”).
fm0 (str) – Output string format (defaults to “%Y%m%dT%H%M%S”).

Returns

Date and time as string (e.g., “20230201T070809”).

Return type

str

Examples

>>> from nvm import now
>>> now()
'20220607T024010'

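To show what the default format string produces, here is a minimal sketch using only the standard library. The function now_sketch is hypothetical and uses UTC from datetime.timezone instead of a pytz timezone, so it illustrates the format rather than reproducing the library's behavior.

```python
from datetime import datetime, timezone, tzinfo


def now_sketch(tz: tzinfo = timezone.utc, fmt: str = "%Y%m%dT%H%M%S") -> str:
    """Return the current date and time in tz, formatted with fmt."""
    # Assumption: UTC by default, whereas the documented default is Europe/Berlin.
    return datetime.now(tz).strftime(fmt)


# A fixed moment makes the format easy to read: year, month, day, "T",
# then hour, minute, second, all zero-padded.
fixed = datetime(2023, 2, 1, 7, 8, 9, tzinfo=timezone.utc)
print(fixed.strftime("%Y%m%dT%H%M%S"))  # 20230201T070809
print(now_sketch())
```
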
nvm.aux_str.regex module

This module contains some useful regular expressions.

Examples

>>> from nvm.aux_str.regex import REGEX_ABC_DASH_XYZ_ASTERISK as re0
>>> re0.pattern
'^[a-z]+(\-[a-z]+)*\*?$'
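To make the pattern above concrete, the sketch below compiles the same pattern string locally (rather than importing it from nvm) and tries it on a few candidate strings. The variable names are illustrative only.

```python
import re

# Same pattern string as shown above: lowercase words joined by single
# hyphens, optionally followed by one trailing asterisk.
pattern = re.compile(r"^[a-z]+(\-[a-z]+)*\*?$")

candidates = ["abc", "abc-xyz", "abc-xyz*", "abc-", "Abc", "abc--xyz", "*"]
results = {c: bool(pattern.match(c)) for c in candidates}

for candidate, matched in results.items():
    print(f"{candidate!r}: {matched}")
```

Here "abc", "abc-xyz", and "abc-xyz*" match, while "abc-", "Abc", "abc--xyz", and "*" do not.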
