This script is part of the Base Pack.

Pre-process text data for the machine learning text classifier.

Script Data

Script Type | python3
Cortex XSOAR Version | 5.0.0

Used In

This script is used in the following playbooks and scripts.

  • DBot Create Phishing Classifier V2
  • DBot Create Phishing Classifier V2 From File
  • Get Mails By Folder Pathes
  • Get Mails By Folder Paths

Inputs


Argument Name | Description
input | The input file entry ID or the file content (as a string).
removeShortTextThreshold | Sample texts whose total word count is less than or equal to this number are ignored.
dedupThreshold | Remove emails whose similarity is greater than this threshold. Range: 0-1, where 1 means completely identical.
textFields | A comma-separated list of incident field names containing the text to process. Use "|" to choose the first non-empty value from a list of fields.
inputType | The input type.
preProcessType | Text pre-processing type. The default is "json".
cleanHTML | Whether to remove HTML tags. Default is "true".
whitelistFields | A comma-separated list of fields inside the JSON by which to filter.
hashSeed | If non-empty, hash every word with this seed.
outputFormat | The output file format.
outputOriginalTextFields | Whether to add the original text fields to the output. Default is "false".
language | The language of the input text. Default is "Any". Can be "Any", "English", "German", "French", "Spanish", "Portuguese", "Italian", "Dutch", or "Other". If "Any" or "Other" is selected, the script preprocesses the entire input, regardless of its actual language. If a specific language is selected, the script filters out text in any other language from the output.
tokenizationMethod | Tokenization method for text. Only required when the language argument is set to "Other". Can be "tokenizer", "byWords", or "byLetters". Default is "tokenizer".
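
The interplay of textFields, removeShortTextThreshold, and dedupThreshold can be illustrated with a minimal Python sketch. This is not the script's own code: the helper names `resolve_text` and `filter_samples` are hypothetical, the incident field names are only examples, and `difflib.SequenceMatcher` is used as a stand-in similarity measure, since this page does not state how the script computes similarity.

```python
from difflib import SequenceMatcher


def resolve_text(incident, text_fields):
    """Build one text per incident from a textFields-style expression.

    Commas separate groups; inside a group, "|" selects the first
    non-empty field. (Hypothetical helper, not part of the script.)
    """
    parts = []
    for group in text_fields.split(","):
        for name in group.split("|"):
            value = str(incident.get(name.strip()) or "").strip()
            if value:
                parts.append(value)
                break  # first non-empty field in the group wins
    return " ".join(parts)


def filter_samples(incidents, text_fields, remove_short_text_threshold, dedup_threshold):
    """Drop short texts and near-duplicates (assumed semantics)."""
    kept = []
    for incident in incidents:
        text = resolve_text(incident, text_fields)
        # Ignore samples whose word count is <= the threshold.
        if len(text.split()) <= remove_short_text_threshold:
            continue
        # Stand-in similarity in the range 0-1, where 1 means identical.
        if any(SequenceMatcher(None, text, other).ratio() > dedup_threshold
               for other in kept):
            continue
        kept.append(text)
    return kept


# Illustrative data only; the field names are assumptions.
incidents = [
    {"emailsubject": "Reset your password",
     "emailbody": "Click the link below to reset your password today"},
    {"emailsubject": "Reset your password",
     "emailbody": "Click the link below to reset your password today"},
    {"emailsubject": "Hi", "emailbody": "", "emailbodyhtml": ""},
    {"emailsubject": "Invoice attached", "emailbody": "",
     "emailbodyhtml": "Please review the attached invoice before Friday"},
]
samples = filter_samples(incidents, "emailsubject,emailbody|emailbodyhtml", 5, 0.99)
```

With this sample data, `samples` holds two texts: the exact duplicate and the one-word message are dropped, and the last incident falls back to its second field because the first one in the "|" group is empty.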


Outputs

Path | Description | Type
DBotPreProcessTextData.Filename | The output file name. | String
DBotPreProcessTextData.TextField | The original text field inside the file. | String
DBotPreProcessTextData.TextFieldProcessed | The processed text field inside the JSON file. | String
DBotPreProcessTextData.FileFormat | The output file format. | String