diff --git a/docs/user_guide/discretisation/ArbitraryDiscretiser.rst b/docs/user_guide/discretisation/ArbitraryDiscretiser.rst index aac67201a..736635721 100644 --- a/docs/user_guide/discretisation/ArbitraryDiscretiser.rst +++ b/docs/user_guide/discretisation/ArbitraryDiscretiser.rst @@ -6,21 +6,22 @@ ArbitraryDiscretiser ==================== The :class:`ArbitraryDiscretiser()` sorts the variable values into contiguous intervals -which limits are arbitrarily defined by the user. Thus, you must provide a dictionary -with the variable names as keys and a list with the limits of the intervals as values, -when setting up the discretiser. +whose limits are arbitrarily defined by the user. -The :class:`ArbitraryDiscretiser()` works only with numerical variables. The discretiser -will check that the variables entered by the user are present in the train set and cast -as numerical. +.. note:: + You must provide a dictionary + with the variable names as keys and a list with the limits of the intervals as values, + when setting up the discretiser. -Example -------- -Let's take a look at how this transformer works. First, let's load a dataset and plot a -histogram of a continuous variable. We use the california housing dataset that comes +Python implementation +--------------------- + +Let's take a look at how this transformer works. We'll use the california housing dataset that comes with Scikit-learn. +Let's load the dataset: + .. code:: python import numpy as np @@ -31,6 +32,10 @@ with Scikit-learn. X, y = fetch_california_housing( return_X_y=True, as_frame=True) +Let's plot a histogram of a continuous variable. + +.. code:: python + X['MedInc'].hist(bins=20) plt.xlabel('MedInc') plt.ylabel('Number of obs') @@ -99,7 +104,7 @@ If we return the interval values as integers, the discretiser has the option to the transformed variable as integer or as object. Why would we want the transformed variables as object? 
-Categorical encoders in Feature-engine are designed to work with variables of type +Categorical encoders in feature-engine are designed to work with variables of type object by default. Thus, if you wish to encode the returned bins further, say to try and obtain monotonic relationships between the variable and the target, you can do so seamlessly by setting `return_object` to True. You can find an example of how to use @@ -108,56 +113,12 @@ this functionality `here `_ -- `Jupyter notebook - Discretiser plus Mean Encoding `_ - For more details about this and other feature engineering methods check out these resources: +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. 
\ No newline at end of file diff --git a/docs/user_guide/discretisation/DecisionTreeDiscretiser.rst b/docs/user_guide/discretisation/DecisionTreeDiscretiser.rst index 283bd7800..c055c2c4a 100644 --- a/docs/user_guide/discretisation/DecisionTreeDiscretiser.rst +++ b/docs/user_guide/discretisation/DecisionTreeDiscretiser.rst @@ -5,27 +5,27 @@ DecisionTreeDiscretiser ======================= -Discretization consists of transforming continuous variables into discrete features by creating +Discretisation consists of transforming continuous variables into discrete features by creating a set of contiguous intervals, or bins, that span the range of the variable values. -Discretization is a common data preprocessing step in many data science projects, as it simplifies +Discretisation is a common data preprocessing step in many data science projects, as it simplifies continuous attributes and has the potential to improve model performance or speed up model training. -Decision tree discretization +Decision tree discretisation ---------------------------- Decision trees make decisions based on discrete partitions over continuous features. During training, a decision tree evaluates all possible feature values to find the best cut-point, that is, -the feature value at which the split maximizes the information gain, or in other words, reduces the +the feature value at which the split maximises the information gain, or in other words, reduces the impurity. It repeats the procedure at each node until it allocates all samples to certain leaf nodes or end nodes. Hence, classification and regression trees can naturally find the optimal limits -of the intervals to maximize class coherence. +of the intervals to maximise class coherence. -Discretization with decision trees consists of using a decision tree algorithm to identify the optimal +Discretisation with decision trees consists of using a decision tree algorithm to identify the optimal partitions for each continuous variable. 
After finding the optimal partitions, we sort the variable's values into those intervals. -Discretization with decision trees is a supervised discretization method, in that, the interval +Discretisation with decision trees is a supervised discretisation method, in that, the interval limits are found based on class or target coherence. In simpler words, we need the target variable to train the decision trees. @@ -42,10 +42,10 @@ Limitations - We need to tune some of the decision tree parameters to obtain the optimal number of intervals. -Decision tree discretizer +Decision tree discretiser ------------------------- -The :class:`DecisionTreeDiscretiser()` applies discretization based on the interval limits found +The :class:`DecisionTreeDiscretiser()` applies discretisation based on the interval limits found by decision trees algorithms. It uses decision trees to find the optimal interval limits. Next, it sorts the variable into those intervals. @@ -53,14 +53,14 @@ The transformed variable can either have the limits of the intervals as values, representing the interval into which the value was sorted, or alternatively, the prediction of the decision tree. In any case, the number of values of the variable will be finite. -In theory, decision tree discretization creates discrete variables with a monotonic relationship +In theory, decision tree discretisation creates discrete variables with a monotonic relationship with the target, and hence, the transformed features would be more suitable to train linear models, like linear or logistic regression. Original idea ------------- -The method of decision tree discretization is based on the winning solution of the KDD 2009 competition: +The method of decision tree discretisation is based on the winning solution of the KDD 2009 competition: `Niculescu-Mizil, et al. "Winning the KDD Cup Orange Challenge with Ensemble Selection". JMLR: Workshop and Conference Proceedings 7: 23-34. 
KDD 2009 @@ -77,14 +77,14 @@ on the performance of linear models. Code examples ------------- -In the following sections, we will do decision tree discretization to showcase the functionality of -the :class:`DecisionTreeDiscretiser()`. We will discretize 2 numerical variables of the Ames house +In the following sections, we will do decision tree discretisation to showcase the functionality of +the :class:`DecisionTreeDiscretiser()`. We will discretise 2 numerical variables of the Ames house prices dataset using decision trees. First, we will transform the variables using the predictions of the decision trees, next, we will return the interval limits, and finally, we will return the bin order. -Discretization with the predictions of the decision tree +Discretisation with the predictions of the decision tree ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ First we load the data and separate it into a training set and a test set: @@ -136,9 +136,9 @@ In the following output we see the predictor variables of the house prices datas We set up the decision tree discretiser to find the optimal intervals using decision trees. -The :class:`DecisionTreeDiscretiser()` will optimize the depth of the decision tree classifier +The :class:`DecisionTreeDiscretiser()` will optimise the depth of the decision tree classifier or regressor by default and using cross-validation. That's why we need to select the appropriate -metric for the optimization. In this example, we are using decision tree regression, so we select +metric for the optimisation. In this example, we are using decision tree regression, so we select the mean squared error metric. We specify in the `bin_output` that we want to replace the continuous attribute values with the @@ -211,8 +211,8 @@ The `binner_dict_` stores the details of each decision tree. 
scoring='neg_mean_squared_error')} -With decision tree discretization, each bin, that is, each prediction value in this case, does not -necessarily contain the same number of observations. Let's check that out with a visualization: +With decision tree discretisation, each bin, that is, each prediction value in this case, does not +necessarily contain the same number of observations. Let's check that out with a visualisation: .. code:: python @@ -239,7 +239,7 @@ Rounding the prediction value ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Sometimes, the output of the prediction can have multiple values after the comma, which makes the -visualization and interpretation a bit uncomfortable. Fortunately, we can round those values through +visualisation and interpretation a bit uncomfortable. Fortunately, we can round those values through the `precision` parameter: .. code:: python @@ -266,7 +266,7 @@ the `precision` parameter: In this example, we are predicting house prices, which is a continuous target. The procedure for classification models is identical, we just need to set the parameter `regression` to False. -Discretization with interval limits +Discretisation with interval limits ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In this section, instead of replacing the original variable values with the predictions of the @@ -314,7 +314,7 @@ of the decision trees: 4576.0, inf]} -The :class:`DecisionTreeDiscretiser()` will use these limits with `pandas.cut` to discretize the +The :class:`DecisionTreeDiscretiser()` will use these limits with `pandas.cut` to discretise the continuous variable values during transform: .. code:: python @@ -337,7 +337,7 @@ In the following output we see the interval limits into which the values of the To train machine learning algorithms we would follow that up with any categorical data encoding method. 
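The idea described in the preceding sections — letting a shallow tree find the cut-points and then binning with `pandas.cut` — can be sketched with plain scikit-learn and pandas. This is only an illustration on simulated data; the variable names and parameters are assumptions, not Feature-engine's actual implementation:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Toy data: one continuous feature and a continuous target
rng = np.random.default_rng(0)
X = pd.DataFrame({"feat": rng.uniform(0, 100, size=500)})
y = X["feat"] * 3 + rng.normal(0, 10, size=500)

# Fit a shallow tree on the single feature
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X[["feat"]], y)

# The tree's internal split thresholds become the interval limits
# (scikit-learn marks leaf nodes with threshold == -2)
thresholds = sorted(t for t in tree.tree_.threshold if t != -2)
limits = [-np.inf] + thresholds + [np.inf]

# Sort the variable into those intervals
binned = pd.cut(X["feat"], bins=limits)
print(binned.value_counts().sort_index())
```

Tuning `max_depth` controls how many cut-points (and hence bins) the tree can produce.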
-Discretization with ordinal numbers +Discretisation with ordinal numbers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ In the last part of this guide, we will replace the variable values with the number of bin into @@ -384,7 +384,7 @@ The `binner_dict_` will also contain the limits of the intervals: inf]} When we apply transform, :class:`DecisionTreeDiscretiser()` will use these limits with `pandas.cut` to -discretize the continuous variable: +discretise the continuous variable: .. code:: python @@ -408,62 +408,19 @@ were sorted: Additional considerations ------------------------- -Decision tree discretization uses scikit-learn's DecisionTreeRegressor or DecisionTreeClassifier under +Decision tree discretisation uses scikit-learn's DecisionTreeRegressor or DecisionTreeClassifier under the hood to find the optimal interval limits. These models do not support missing data. Hence, we need -to replace missing values with numbers before proceeding with the disrcretization. +to replace missing values with numbers before proceeding with the discretisation. Tutorials, books and courses ---------------------------- -Check also for more details on how to use this transformer: - -- `Jupyter notebook `_ -- `tree_pipe in cell 21 of this Kaggle kernel `_ - -For tutorials about this and other discretization methods and feature engineering techniques check out our online course: - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. 
figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +For tutorials about this and other discretisation methods and feature engineering techniques check out our online course: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/discretisation/EqualFrequencyDiscretiser.rst b/docs/user_guide/discretisation/EqualFrequencyDiscretiser.rst index 6d7f3e46c..8b488db8d 100644 --- a/docs/user_guide/discretisation/EqualFrequencyDiscretiser.rst +++ b/docs/user_guide/discretisation/EqualFrequencyDiscretiser.rst @@ -5,32 +5,34 @@ EqualFrequencyDiscretiser ========================= -Equal frequency discretization consists of dividing continuous attributes into equal-frequency bins. These bins -contain roughly the same number of observations, with boundaries set at specific quantile values determined by the desired -number of bins. +Equal frequency discretisation involves dividing continuous attributes into bins that +each contain approximately the same number of observations. The boundaries of these +bins are determined by specific **quantile values**, based on the desired number of bins. 
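The quantile-based boundaries described above can be sketched with plain pandas. This is a minimal illustration on simulated data (the variable and seed are assumptions, not part of the guide):

```python
import numpy as np
import pandas as pd

# Simulated skewed variable
rng = np.random.default_rng(42)
x = pd.Series(rng.exponential(scale=1000, size=1000))

# Equal-frequency binning: the bin boundaries are quantiles of the data
binned, edges = pd.qcut(x, q=5, retbins=True, labels=False)

# Each of the 5 bins holds roughly the same number of observations
print(binned.value_counts().sort_index())
```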
-Equal frequency discretization ensures a uniform distribution of data points across the range of values, enhancing the -handling of skewed data and outliers. +.. tip:: + + This method ensures a more uniform distribution of data points across the value range, + which helps in handling skewed data and outliers more effectively. -Discretization is a common data preprocessing technique used in data science. It's also known as binning data (or simply "binning"). Advantages and Limitations -------------------------- -Equal frequency discretization has some advantages and shortcomings: +Equal frequency discretisation has advantages and limitations: Advantages ~~~~~~~~~~ -Some advantages of equal frequency binning: +Advantages of equal frequency binning include: - **Algorithm Efficiency:** Enhances the performance of data mining and machine learning algorithms by providing a simplified representation of the dataset. - **Outlier Management:** Efficiently mitigates the effect of outliers by grouping them into the extreme bins. -- **Data Smoothing:** Helps smooth the data, reduces noise, and improves the model's ability to generalize. +- **Data Smoothing:** Helps smooth the data, reduces noise, and improves the model's ability to generalise. - **Improved value distribution:** Returns an uniform distribution of values across the value range. -Equal frequency discretization improves the data distribution, **optimizing the spread of values**. This is particularly -beneficial for datasets with skewed distributions (see the Python example code). +Equal frequency discretisation improves the data distribution, **optimising the spread of values**. This is particularly +beneficial for datasets with skewed distributions (see the Python implementation section +for an example). Limitations ~~~~~~~~~~~ @@ -44,10 +46,10 @@ would potentially impact the model's performance in this scenario. 
EqualFrequencyDiscretiser ------------------------- -Feature-engine's :class:`EqualFrequencyDiscretiser` applies equal frequency discretization to numerical variables. It uses -the `pandas.qcut()` function under the hood, to determine the interval limits. +Feature-engine's :class:`EqualFrequencyDiscretiser` applies equal frequency discretisation to numerical variables. It uses +the `pandas.qcut()` function under the hood to determine the interval limits. -You can specify the variables to be discretized by passing their names in a list when you set up the transformer. Alternatively, +You can specify the variables to be discretised by passing their names in a list when setting up the transformer. Alternatively, :class:`EqualFrequencyDiscretiser` will automatically infer the data types to compute the interval limits for all numeric variables. **Optimal number of intervals:** With :class:`EqualFrequencyDiscretiser`, the user defines the number of bins. Smaller intervals @@ -56,62 +58,66 @@ may be required if the variable is highly skewed or not continuous. **Integration with scikit-learn:** :class:`EqualFrequencyDiscretiser` and all other feature-engine transformers seamlessly integrate with scikit-learn `pipelines `_. -Python code example ------------------- +Python implementation +--------------------- In this section, we'll show the main functionality of :class:`EqualFrequencyDiscretiser` Load dataset ~~~~~~~~~~~~ -In this example, we'll use the Ames House Prices' Dataset. First, let's load the dataset and split it into train and +We'll use the Ames house prices dataset. First, let's load the dataset and split it into train and test sets: .. 
code:: python - import matplotlib.pyplot as plt - from sklearn.datasets import fetch_openml - from sklearn.model_selection import train_test_split + import matplotlib.pyplot as plt + from sklearn.datasets import fetch_openml + from sklearn.model_selection import train_test_split - from feature_engine.discretisation import EqualFrequencyDiscretiser + from feature_engine.discretisation import EqualFrequencyDiscretiser - # Load dataset - X, y = fetch_openml(name='house_prices', version=1, return_X_y=True, as_frame=True) - X.set_index('Id', inplace=True) + # Load dataset + X, y = fetch_openml( + name='house_prices', version=1, return_X_y=True, as_frame=True) + X.set_index('Id', inplace=True) - # Separate into train and test sets - X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) + # Separate into train and test sets + X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=0.3, random_state=42) -Equal-frequency Discretization +Equal-frequency Discretisation ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -In this example, let's discretize two variables, LotArea and GrLivArea, into 10 intervals of approximately equal +Let's discretise two variables, LotArea and GrLivArea, into 10 intervals of approximately equal number of observations. .. code:: python - - # List the target numeric variables to be transformed - TARGET_NUMERIC_FEATURES= ['LotArea','GrLivArea'] - # Set up the discretization transformer - disc = EqualFrequencyDiscretiser(q=10, variables=TARGET_NUMERIC_FEATURES) + # Set up the discretisation transformer + disc = EqualFrequencyDiscretiser( + q=10, + variables=['LotArea','GrLivArea'], + ) - # Fit the transformer - disc.fit(X_train) + # Fit the transformer + disc.fit(X_train) +.. note:: -Note that if we do not specify the variables (default=`None`), :class:`EqualFrequencyDiscretiser` will automatically -infer the data types to compute the interval limits for all numeric variables. 
+ If we do not specify the variables to discretise, :class:`EqualFrequencyDiscretiser` will automatically + infer the data types and compute the interval limits for all numeric variables. -With the `fit()` method, the discretizer learns the bin boundaries and saves them into a dictionary so we can use them -to transform unseen data: +With the `fit()` method, the discretiser learns the bin boundaries and saves them into a dictionary so we can use them +to transform new data: .. code:: python # Learned limits for each variable disc.binner_dict_ +In the following output, we see the interval limits calculated for each variable: .. code:: python @@ -139,12 +145,16 @@ to transform unseen data: inf]} -Note that the lower and upper boundaries are set to -inf and inf, respectively. his behavior ensures that the transformer -will be able to allocate to the extreme bins values that are smaller or greater than the observed minimum and maximum +Note that the lower and upper boundaries are set to -inf and inf, respectively. This behavior ensures that the transformer +is able to allocate to the extreme bins values that are smaller or greater than the observed minimum and maximum values in the training set. -:class:`EqualFrequencyDiscretiser` will not work in the presence of missing values. Therefore, we should either remove or -impute missing values before fitting the transformer. +.. note:: + + :class:`EqualFrequencyDiscretiser` will not work in the presence of missing values. Therefore, we should either remove or + impute missing values before fitting the transformer. + +Let's now discretise the variables in the training and test sets: .. code:: python @@ -153,12 +163,12 @@ impute missing values before fitting the transformer. test_t = disc.transform(X_test) -Let's visualize the first rows of the raw data and the transformed data: +Let's visualise the first rows of the raw data: .. 
code:: python # Raw data - print(X_train[TARGET_NUMERIC_FEATURES].head()) + print(X_train[['LotArea','GrLivArea']].head()) Here we see the original variables: @@ -172,12 +182,14 @@ Here we see the original variables: 933 11670 1905 436 10667 1661 +Let's display the transformed variables now: + .. code:: python - # Transformed data - print(train_t[TARGET_NUMERIC_FEATURES].head()) + # Transformed data + print(train_t[['LotArea','GrLivArea']].head()) -Here we observe the variables after discretization: +Here we observe the variables after discretisation: .. code:: python @@ -190,9 +202,11 @@ Here we observe the variables after discretisation: 436 6 6 -The transformed data now contains discrete values corresponding to the ordered computed buckets (0 being the first and q-1 the last). +The transformed data now contains discrete values corresponding to the ordered computed +buckets (0 being the first and q-1 the last). -Now, let's visualize the plots for equal-width intervals with a histogram and the transformed data with equal-frequency discretiser: +Now, let's visualise the data with a histogram of the original distribution next to a bar +plot of the discretised variable: .. code:: python @@ -211,17 +225,19 @@ Now, let's visualize the plots for equal-width intervals with a histogram and th plt.tight_layout(w_pad=2) plt.show() -As we see in the following image, the intervals contain approximately the same number of observations: +As we see in the following image, the intervals contain approximately the same number of +observations, resulting in a uniform distribution: .. image:: ../../images/equalfrequencydiscretisation_gaussian.png -Finally, as the default value for the `return_object` parameter is `False`, the transformer outputs integer variables: +Note that the discretised variables are of type integer by default: .. 
code:: python - train_t[TARGET_NUMERIC_FEATURES].dtypes + train_t[['LotArea','GrLivArea']].dtypes +We see that the dtype of the variable after discretisation is integer: .. code:: python @@ -233,8 +249,8 @@ Finally, as the default value for the `return_object` parameter is `False`, the Return variables as object ~~~~~~~~~~~~~~~~~~~~~~~~~~ -Categorical encoders in Feature-engine are designed to work by default with variables of type object. Therefore, to further -encode the discretised output with Feature-engine, we can set `return_object=True` instead. This will return the transformed +Categorical encoders in feature-engine are designed to work by default with variables of type object. Therefore, to further +encode the discretised output with feature-engine's encoders, we can set `return_object=True` instead. This will return the transformed variables as object. Let's say we want to obtain monotonic relationships between the variable and the target. We can do that seamlessly by setting @@ -248,27 +264,23 @@ If we want to output the intervals limits instead of integers, we can set `retur .. code:: python - # Set up the discretization transformer + # Set up the discretisation transformer disc = EqualFrequencyDiscretiser( q=10, - variables=TARGET_NUMERIC_FEATURES, + variables=['LotArea','GrLivArea'], return_boundaries=True) # Fit the transformer disc.fit(X_train) - # Transform test set & visualize limit + # Transform test set & visualise limit test_t = disc.transform(X_test) - # Visualize output (boundaries) - print(test_t[TARGET_NUMERIC_FEATURES].head()) + # Visualise output (boundaries) + print(test_t[['LotArea','GrLivArea']].head()) The transformed variables now show the interval limits in the output. We can immediately see that the bin width for these -intervals varies. In other words, they don't have the same width, contrarily to what we see with :ref:`equal width discretization `. 
- -Unlike the variables discretized into integers, these variables cannot be used to train machine learning models; however, -they are still highly helpful for data analysis in this format, and they may be sent to any Feature-engine encoder for -additional processing. +intervals varies. In other words, they don't have the same width, contrary to what we see with :ref:`equal width discretisation `. .. code:: python @@ -280,11 +292,14 @@ additional processing. 523 (-inf, 5000.0] (1601.6, 1717.7] 1037 (12208.2, 14570.7] (1601.6, 1717.7] +Unlike the variables discretised into integers, these variables cannot be used to train machine learning models; however, +they are still very useful for data analysis in this format, and they may be sent to any Feature-engine encoder for +additional processing. Binning skewed data ~~~~~~~~~~~~~~~~~~~ -Let's now show the benefits of equal frequency discretization for skewed variables. We'll +Let's now show the benefits of equal frequency discretisation for skewed variables. We'll start by importing the libraries and classes: .. code:: python @@ -311,17 +326,17 @@ one that is skewed: # Create dataframe with simulated data X = pd.DataFrame({'feature1': normal_data, 'feature2': skewed_data}) -Let's discretize both variables into 5 equal frequency bins: +Let's discretise both variables into 5 equal frequency bins: .. code:: python - # Instantiate discretizer + # Instantiate discretiser disc = EqualFrequencyDiscretiser(q=5) - # Transform simulated data - X_transformed = disc.fit_transform(X) wait + # Transform simulated data X_transformed = disc.fit_transform(X) -Let's plot the original distribution and the distribution after discretization for the variable that was normally +Let's plot the original distribution and the distribution after discretisation for the variable that was normally distributed: .. 
code:: python @@ -338,12 +353,12 @@ distributed: plt.show() -In the following image, we see that after the discretization there is an even distribution of the values across +In the following image, we see that after the discretisation there is an even distribution of the values across the value range, hence, the variable does no look normally distributed any more. .. image:: ../../images/equalfrequencydiscretisation_gaussian.png -Let's now plot the original distribution and the distribution after discretization for the variable that was skewed: +Let's now plot the original distribution and the distribution after discretisation for the variable that was skewed: .. code:: python @@ -359,7 +374,7 @@ Let's now plot the original distribution and the distribution after discretizati plt.show() -In the following image, we see that after the discretization there is an even distribution of the values across +In the following image, we see that after the discretisation there is an even distribution of the values across the value range. .. image:: ../../images/equalfrequencydiscretisation_skewed.png @@ -369,7 +384,7 @@ See Also For alternative binning techniques, check out the following resources: -- Further feature-engine :ref:`discretizers / binning methods ` +- Further feature-engine :ref:`discretisers / binning methods ` - Scikit-learn's `KBinsDiscretizer `_. Check out also: @@ -380,56 +395,12 @@ Check out also: Additional resources -------------------- -Check also for more details on how to use this transformer: - -- `Jupyter notebook `_ -- `Jupyter notebook - Discretizer plus Weight of Evidence encoding `_ - For more details about this and other feature engineering methods check out these resources: +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. -.. 
figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. diff --git a/docs/user_guide/discretisation/EqualWidthDiscretiser.rst b/docs/user_guide/discretisation/EqualWidthDiscretiser.rst index 1e630ac1a..8333992a1 100644 --- a/docs/user_guide/discretisation/EqualWidthDiscretiser.rst +++ b/docs/user_guide/discretisation/EqualWidthDiscretiser.rst @@ -5,7 +5,7 @@ EqualWidthDiscretiser ===================== -Equal width discretization consist of dividing continuous variables into intervals of equal width, calculated +Equal width discretisation consists of dividing continuous variables into intervals of equal width, calculated using the following formula: .. math:: bin_{width} = ( max(X) - min(X) ) / bins Here, `bins` is the number of intervals specified by the user and `max(X)` and `min(X)` are the minimum and maximum values -of the variable to discretize. +of the variable to discretise. -Discretization is a common data preprocessing technique used in data science. It's also known as data binning (or simply -"binning"). 
Advantages and Limitations -------------------------- -Equal binning discretization has some advantages and also shortcomings. +Equal binning discretisation has some advantages and also limitations. Advantages ~~~~~~~~~~ -Some advantages of equal width binning: +Advantages of equal width binning include: - **Algorithm Efficiency:** Enhances the performance of data mining and machine learning algorithms by providing a simplified representation of the dataset. - **Outlier Management:** Efficiently mitigates the effect of outliers by grouping them into the extreme bins, thus preserving the integrity of the main data distribution. -- **Data Smoothing:** Helps smooth the data, reduces noise, and improves the model's ability to generalize. +- **Data Smoothing:** Helps smooth the data, reduces noise, and improves the model's ability to generalise. Limitations ~~~~~~~~~~~ -On the other hand, equal width discretzation can lead to a loss of information by aggregating data into broader categories. +On the other hand, equal width discretisation can lead to a loss of information by aggregating data into broader categories. This is particularly concerning if the data in the same bin has predictive information about the target. Let's consider a binary classifier task using a decision tree model. A bin with a high proportion of both target categories would @@ -44,12 +42,12 @@ potentially impact the model's performance in this scenario. EqualWidthDiscretiser --------------------- -Feture-engine's :class:`EqualWidthDiscretiser()` applies equal width discretization to numerical variables. It uses +Feature-engine's :class:`EqualWidthDiscretiser()` applies equal width discretisation to numerical variables. It uses the `pandas.cut()` function under the hood to find the interval limits and then sort the continuous variables into the bins. -You can specify the variables to be discretised by passing their names in a list when you set up the transformer. 
Alternatively,
-:class:`EqualWidthDiscretiser()` will automatically infer the data types to compute the interval limits for all numeric
+You can specify the variables to be discretised by passing their names in a list when you set up the transformer. Alternatively,
+:class:`EqualWidthDiscretiser()` will automatically infer the data types and compute the interval limits for all numeric
variables.

**Optimal number of intervals:** With :class:`EqualWidthDiscretiser()`, the user defines the number of bins. Smaller intervals
@@ -58,8 +56,8 @@ may be required if the variable is highly skewed or not continuous.

**Integration with scikit-learn:** :class:`EqualWidthDiscretiser()` and all other Feature-engine transformers seamlessly
integrate with scikit-learn `pipelines `_.

-Python code example
--------------------
+Python implementation
+---------------------

In this section, we'll show the main functionality of :class:`EqualWidthDiscretiser()`.

@@ -71,48 +69,49 @@ test sets:

.. code:: python

-    import matplotlib.pyplot as plt
-    from sklearn.datasets import fetch_openml
-    from sklearn.model_selection import train_test_split
+    import matplotlib.pyplot as plt
+    from sklearn.datasets import fetch_openml
+    from sklearn.model_selection import train_test_split

-    from feature_engine.discretisation import EqualFrequencyDiscretiser
+    from feature_engine.discretisation import EqualWidthDiscretiser

-    # Load dataset
-    X, y = fetch_openml(name='house_prices', version=1, return_X_y=True, as_frame=True)
-    X.set_index('Id', inplace=True)
+    # Load dataset
+    X, y = fetch_openml(
+        name='house_prices', version=1, return_X_y=True, as_frame=True)
+    X.set_index('Id', inplace=True)

-    # Separate into train and test sets
-    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
+    # Separate into train and test sets
+    X_train, X_test, y_train, y_test = train_test_split(
+        X, y, test_size=0.3, random_state=42)

-Equal-width Discretization
+Equal-width
Discretisation
~~~~~~~~~~~~~~~~~~~~~~~~~~

-In this example, let's discretize two variables, LotArea and GrLivArea, into 10 intervals of equal width:
+In this example, let's discretise two variables, LotArea and GrLivArea, into 10 intervals of equal width:

.. code:: python

-    # List the target numeric variables for equal-width discretization
-    TARGET_NUMERIC_FEATURES= ['LotArea','GrLivArea']
-
-    # Set up the discretization transformer
-    disc = EqualWidthDiscretiser(bins=10, variables=TARGET_NUMERIC_FEATURES)
+    # Set up the discretisation transformer
+    disc = EqualWidthDiscretiser(bins=10, variables=['LotArea','GrLivArea'])

    # Fit the transformer
    disc.fit(X_train)

+.. note::

-Note that if we do not specify the variables (default=`None`), :class:`EqualWidthDiscretiser` will automatically infer
-the data types to compute the interval limits for all numeric variables.
+    If we do not specify the variables (default=`None`), :class:`EqualWidthDiscretiser` will automatically infer
+    the data types to compute the interval limits for all numeric variables.

-With the `fit()` method, the discretizer learns the bin boundaries and saves them into a dictionary so we can use them
-to transform unseen data:
+With the `fit()` method, the discretiser learns the bin boundaries and saves them into a dictionary so we can use them
+to transform new data:

.. code:: python

    # Learned limits for each variable
    disc.binner_dict_

+In the following dictionary, we see the interval limits determined for each variable:

.. code:: python

@@ -141,24 +140,28 @@ to transform unseen data:

Note that the lower and upper boundaries are set to -inf and inf, respectively. This behavior ensures that the transformer
-will be able to allocate to the extreme bins values that are smaller or greater than the observed minimum and maximum
+is able to allocate, to the extreme bins, values that are smaller or greater than the observed minimum and maximum
values in the training set.
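The effect of the open-ended extreme bins can be sketched with plain pandas (hypothetical interval limits, for illustration; this is not Feature-engine's internal code):

```python
import numpy as np
import pandas as pd

# Hypothetical interval limits learned on a training set, with the
# outer boundaries replaced by -inf and inf.
limits = [-np.inf, 5000, 10000, 15000, np.inf]

# New data includes a value far above the training maximum.
new_data = pd.Series([1200, 7500, 14000, 2_000_000])

# Because the extreme bins are open-ended, every value is allocated to a bin.
binned = pd.cut(new_data, bins=limits, labels=False)
print(binned.tolist())  # [0, 1, 2, 3]
```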
-:class:`EqualWidthDiscretiser` will not work in the presence of missing values. Therefore, we should either remove or -impute missing values before fitting the transformer. +.. note:: + + :class:`EqualWidthDiscretiser` will not work in the presence of missing values. Therefore, we should either remove or + impute missing values before fitting the transformer. + +Let's now discretise the variables in the training and test sets: .. code:: python - # Transform the data (data discretization) + # Transform the data (data discretisation) train_t = disc.transform(X_train) test_t = disc.transform(X_test) -Let's visualize the first rows of the raw data and the transformed data: +Let's display the first rows of the raw data: .. code:: python # Raw data - print(X_train[TARGET_NUMERIC_FEATURES].head()) + print(X_train[['LotArea','GrLivArea']].head()) Here we see the original variables: @@ -172,13 +175,14 @@ Here we see the original variables: 933 11670 1905 436 10667 1661 +Let's display the first rows of the transformed data: .. code:: python # Transformed data - print(train_t[TARGET_NUMERIC_FEATURES].head()) + print(train_t[['LotArea','GrLivArea']].head()) -Here we observe the variables after discretization: +Here we observe the variables after discretisation: .. code:: python @@ -209,15 +213,18 @@ histogram: | -Equal width discretization does not improve the spread of values over the value range. If the variable is skewed, it will -still be skewed after the discretization. +.. note:: + + Equal width discretisation does not improve the spread of values over the value range. If the variable is skewed, it will + still be skewed after the discretisation. -Finally, since the default value for the `return_object` parameter is `False`, the transformer outputs integer variables: +By default, the data type of the transformed variables is integer. Let's check that out: .. 
code:: python

-    train_t[TARGET_NUMERIC_FEATURES].dtypes
+    train_t[['LotArea','GrLivArea']].dtypes

+In the following output, we see that the discretised variables are of type integer:

.. code:: python

@@ -229,8 +236,8 @@ Finally, since the default value for the `return_object` parameter is `False`, t
Return variables as object
~~~~~~~~~~~~~~~~~~~~~~~~~~

-Categorical encoders in Feature-engine are designed to work by default with variables of type object. Therefore, to
-further encode the discretized output with Feature-engine's encoders, we can set `return_object=True` instead. This will
+Categorical encoders in feature-engine are designed to work by default with variables of type object. Therefore, to
+further encode the discretised output with feature-engine's encoders, we can set `return_object=True` instead. This will
return the transformed variables as object.

Let's say we want to obtain monotonic relationships between the variable and the target. We can do that seamlessly by
@@ -244,24 +251,22 @@ If we want to output the intervals limits instead of integers, we can set `retur

.. code:: python

-    # Set up the discretization transformer
+    # Set up the discretisation transformer
-    disc = EqualFrequencyDiscretiser(
+    disc = EqualWidthDiscretiser(
         bins=10,
-        variables=TARGET_NUMERIC_FEATURES,
+        variables=['LotArea','GrLivArea'],
         return_boundaries=True)

    # Fit the transformer
    disc.fit(X_train)

-    # Transform test set & visualize limit
+    # Transform test set & visualise limit
    test_t = disc.transform(X_test)

-    # Visualize output (boundaries)
-    print(test_t[TARGET_NUMERIC_FEATURES].head())
+    # Visualise output (boundaries)
+    print(test_t[['LotArea','GrLivArea']].head())

-In the following output we see that the transformed variables now display the interval limits.
While we can't use these
-variables to train machine learning models, as opposed to the variables discretized into integers, they are very useful
-in this format for data analysis, and they can also be passed on to any Feature-engine encoder for further processing.
+In the following output we see that the transformed variables now display the interval limits.

.. code:: python

@@ -273,13 +278,16 @@ in this format for data analysis, and they can also be passed on to any Feature-

    523   (-inf, 22694.5]  (1395.6, 1926.4]
    1037  (-inf, 22694.5]  (1395.6, 1926.4]

+Unlike the variables discretised into integers, we can't use these variables directly to train machine learning
+models. In this format, however, they are very useful for data analysis, and they can also be passed on to any
+Feature-engine encoder for further processing.

See Also
--------

For alternative binning techniques, check out the following resources:

-- Further feature-engine :ref:`discretizers / binning methods `
+- Further feature-engine :ref:`discretisers / binning methods `
- Scikit-learn's `KBinsDiscretizer `_.

Check out also:
@@ -289,56 +297,12 @@ Check out also:

Additional resources
--------------------

-Check also for more details on how to use this transformer:
-
-- `Jupyter notebook `_
-- `Jupyter notebook - Discretizer plus Ordinal encoding `_
-
For more details about this and other feature engineering methods check out these resources:

+- `Feature Engineering for Machine Learning `_, online course.
+- `Feature Engineering for Time Series Forecasting `_, online course.
+- `Python Feature Engineering Cookbook `_, book.

-..
figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. diff --git a/docs/user_guide/discretisation/GeometricWidthDiscretiser.rst b/docs/user_guide/discretisation/GeometricWidthDiscretiser.rst index ce9626595..ec33f279d 100644 --- a/docs/user_guide/discretisation/GeometricWidthDiscretiser.rst +++ b/docs/user_guide/discretisation/GeometricWidthDiscretiser.rst @@ -118,7 +118,7 @@ The `binner_dict_` stores the interval limits identified for each variable. With increasing width discretisation, each bin does not necessarily contain the same number of observations. This transformer is suitable for variables with right skewed distributions. -Let's compare the variable distribution before and after the discretization: +Let's compare the variable distribution before and after the discretisation: .. code:: python @@ -149,56 +149,12 @@ this functionality `here `_ -- `Jupyter notebook - Geometric Discretiser plus Mean encoding `_ - For more details about this and other feature engineering methods check out these resources: +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. -.. 
figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/discretisation/index.rst b/docs/user_guide/discretisation/index.rst index 559dc751a..35a8b020a 100644 --- a/docs/user_guide/discretisation/index.rst +++ b/docs/user_guide/discretisation/index.rst @@ -5,11 +5,9 @@ Discretisation ============== -Feature-engine's variable discretisation transformers transform continuous numerical -variables into discrete variables. The discrete variables will contain contiguous -intervals in the case of the equal frequency and equal width transformers. The -Decision Tree discretiser will return a discrete variable, in the sense that the -new feature takes a finite number of values. +Data discretisation, also known as binning, is the process of grouping continuous +variable values into adjacent intervals. This procedure transforms continuous +variables into discrete ones and is commonly used in data science and machine learning. 
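The two most common flavours of binning discussed in this guide can be contrasted with plain pandas (an illustrative sketch on synthetic data, not Feature-engine code):

```python
import numpy as np
import pandas as pd

# A right-skewed continuous variable.
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10, size=1000))

# Equal-width binning: intervals of identical size; on skewed data,
# most observations pile up in the first bins.
equal_width = pd.cut(x, bins=5)

# Equal-frequency binning: interval limits at quantiles, so every
# bin holds a similar number of observations.
equal_frequency = pd.qcut(x, q=5)

print(equal_width.value_counts().sort_index().tolist())
print(equal_frequency.value_counts().sort_index().tolist())
```

The design choice between the two is the usual trade-off: equal-width intervals are easy to interpret, while equal-frequency intervals spread skewed values more evenly across the range.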
The following illustration shows the process of discretisation: @@ -17,17 +15,91 @@ The following illustration shows the process of discretisation: :align: center :width: 500 +.. tip:: -With discretisation, sometimes we can obtain a more homogeneous value spread from an -originally skewed variable. But this is not always possible. + With discretisation, we can often make the value spread of skewed variables more + homogeneous across the value range. -**Discretisation plus encoding** +In discretisation, we convert continuous variables into discrete features. This involves +calculating the boundaries of contiguous intervals that cover the entire range of +variable values. The original values are then sorted into these intervals. -Very often, after we discretise the numerical continuous variables into discrete intervals -we want to proceed their engineering as if they were categorical. This is common practice. -Throughout the user guide, we point to jupyter notebooks that showcase this functionality. +A key challenge in discretisation is determining the thresholds or boundaries that define +the intervals into which the continuous values are sorted. To address this, +various discretisation methods are available, each with its own advantages and limitations. -**Discretisers** +How is Discretisation Useful? +----------------------------- + +Several regression and classification models, such as decision trees and Naive Bayes, +perform better with discrete values. + +Decision trees make decisions based on discrete attribute partitions. A decision tree +evaluates all feature values during training to determine the optimal cut-point. +Consequently, the more values a feature has, the longer the decision tree's training +time. Therefore, discretising continuous features can speed up the training process. + +Discretisation also offers additional benefits. Discrete values are easier for people +to interpret. 
Moreover, when observations are sorted into bins with equal frequency, +skewed values become more evenly distributed across the range. + +Furthermore, discretisation can reduce the impact of outliers by grouping them into +the lower or upper intervals, along with the other values in the distribution. This +approach helps prevent outliers from biasing the coefficients in linear regression models. + +Overall, discretisation of continuous features simplifies the data, accelerates the +learning process, and can lead to more accurate results. + +Shortcomings of Discretisation +------------------------------ + +Discretisation can lead to information loss, for instance, by combining +values that are strongly associated with different target classes into the same bin. + +.. note:: + + The goal of a discretisation algorithm is to determine the fewest possible intervals + without significant information loss. The algorithm's task, then, is to identify the + optimal cut-points for those intervals. + + This brings up the question of how to discretise variables in machine learning. + +Discretisation Methods +---------------------- + +The most popular discretisation algorithms are equal-width and equal-frequency +discretisation. These are unsupervised techniques, as they determine the interval +limits without considering the target variable. + +Another unsupervised method consists of using +k-means to find the interval limits. In all of these methods, the user must specify +the number of bins into which the continuous data will be sorted in advance. + +On the other hand, decision tree-based discretisation techniques can automatically +determine the cut-points and the optimal number of divisions. This is a supervised +method, as it uses the target variable to guide the determination of interval limits. + +Feature-engine's Discretisers +----------------------------- + +Feature-engine's discretisation transformers transform continuous variables into +discrete features. 
They use different logic to determine the limits of those intervals.

+**Summary of Feature-engine's discretisers**
+
+===================================== ========================================================================
+ Transformer                          Functionality
+===================================== ========================================================================
+:class:`EqualFrequencyDiscretiser()`  Sorts values into intervals with a similar number of observations.
+:class:`EqualWidthDiscretiser()`      Sorts values into intervals of equal width.
+:class:`ArbitraryDiscretiser()`       Sorts values into intervals predefined by the user.
+:class:`DecisionTreeDiscretiser()`    Replaces values by predictions of a decision tree, which are discrete.
+:class:`GeometricWidthDiscretiser()`  Sorts values into geometrical intervals of increasing width.
+===================================== ========================================================================
+
+
+Discretisers
+------------

.. toctree::
   :maxdepth: 1
diff --git a/docs/user_guide/encoding/CountFrequencyEncoder.rst b/docs/user_guide/encoding/CountFrequencyEncoder.rst
index 3a14b1adb..d5cb9e7e9 100644
--- a/docs/user_guide/encoding/CountFrequencyEncoder.rst
+++ b/docs/user_guide/encoding/CountFrequencyEncoder.rst
@@ -37,7 +37,7 @@ This, of course, can result in the loss of information by placing two categories
are otherwise different in the same pot. But on the other hand, if we are using count
encoding or frequency encoding, we have reasons to believe that the count or the
frequency are a good indicator of predictive performance or somehow capture data insight, so that
-categories with similar counts would show similar patterns or behaviors.
+categories with similar counts would show similar patterns or behaviours.
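The idea behind count and frequency encoding can be sketched with plain pandas (a toy example; Feature-engine's :class:`CountFrequencyEncoder()` automates this):

```python
import pandas as pd

df = pd.DataFrame(
    {"city": ["London", "Paris", "London", "Rome", "London", "Paris"]}
)

# Count encoding: replace each category with the number of times it appears.
counts = df["city"].value_counts()
df["city_count"] = df["city"].map(counts)

# Frequency encoding: replace each category with its fraction of the observations.
frequencies = df["city"].value_counts(normalize=True)
df["city_freq"] = df["city"].map(frequencies)

print(df)
```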
Count and Frequency encoding with Feature-engine ------------------------------------------------ @@ -163,7 +163,7 @@ Now, we can go ahead and encode the variables: print(train_t.head()) We see the resulting dataframe where the categorical features are now replaced with -integer values corresponding to the category counts: +the category counts: .. code:: python @@ -251,65 +251,12 @@ values. Additional resources -------------------- -In the following notebook, you can find more details into the :class:`CountFrequencyEncoder()` -functionality and example plots with the encoded variables: +For tutorials about this and other feature engineering methods check out these resources: -- `Jupyter notebook `_ - -For more details about this and other feature engineering methods check out these resources: - - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -.. figure:: ../../images/fetsf.png - :width: 300 - :figclass: align-center - :align: right - :target: https://www.trainindata.com/p/feature-engineering-for-forecasting - - Feature Engineering for Time Series Forecasting - - -| -| -| -| -| -| -| -| -| -| - -Our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. Both our book and courses are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +alike. 
By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/encoding/DecisionTreeEncoder.rst b/docs/user_guide/encoding/DecisionTreeEncoder.rst index 9794b68ab..0ee863288 100644 --- a/docs/user_guide/encoding/DecisionTreeEncoder.rst +++ b/docs/user_guide/encoding/DecisionTreeEncoder.rst @@ -13,18 +13,21 @@ We can also replace the categories with the predictions made by a decision tree on that category value. The process consists of fitting a decision tree using a single feature to predict the -target. The decision tree will try to find a relationship between these variables, if -one exists, and then we'll use the predictions as mappings to replace the categories. +target. The decision tree will try to find a relationship between these variables and +then we'll use the predictions as mappings to replace the categories. The advantage of this procedure is that it captures some information about the relationship -between the variables during the encoding. And if there is a relationship between the +between the variables during the encoding. If there is a relationship between the categorical feature and the target, the resulting encoded variable would have a monotonic relationship with the target, which can be useful for linear models. On the downside, it could cause overfitting, and it adds computational complexity to the -pipeline because we are fitting a tree per feature. If you plan to encode your features -with decision trees, make sure you have appropriate validation strategies and train the -decision trees with regularization. +pipeline because we are fitting a tree per feature. + +.. tip:: + + If you plan to encode your features with decision trees, make sure you have + appropriate validation strategies and train the decision trees with regularisation. DecisionTreeEncoder ------------------- @@ -54,8 +57,8 @@ of the decision tree for the category. 
The motivation for the :class:`DecisionTreeEncoder()` is to try and create monotonic relationships between the categorical variables and the target. -Python example --------------- +Python implementation +--------------------- Let's look at an example using the Titanic Dataset. First, let's load the data and separate it into train and test: @@ -97,7 +100,7 @@ We will encode the following categorical variables: We set up the encoder to encode the variables above with 3 fold cross-validation, using a grid search to find the optimal depth of the decision tree (this is the default -behaviour of the :class:`DecisionTreeEncoder()`). In this example, we optimize the +behaviour of the :class:`DecisionTreeEncoder()`). In this example, we optimise the tree using the roc-auc metric. .. code:: python @@ -265,7 +268,7 @@ Collisions ---------- This encoder can lead to collisions. Collisions are instances where different categories -are encoded with the same number. It is useful to reduce cardinality. On the other hand, +are encoded with the same number. Collisions reduce cardinality. On the other hand, if the mappings are not meaningful we might lose the information contained in those categories. @@ -418,51 +421,21 @@ In the following image we also see a monotonic relationship after the encoding: .. image:: ../../images/lotshape-price-per-cat-enc.png -Note -~~~~ +.. note:: -Not every encoding will result in monotonic relationshops. For that to occur there needs to -be some sort of relationship between the target and the categories that can be captured by -the decision tree. Use with caution. + Not every encoding will result in monotonic relationships. For that to occur there needs to + be some sort of relationship between the target and the categories that can be captured by + the decision tree. Use with caution. 
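The procedure described above — fit a tree on a single encoded feature and use its predictions as the category mappings — can be sketched with scikit-learn on toy data (an illustration of the general idea, not Feature-engine's exact implementation):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: one categorical feature and a binary target.
df = pd.DataFrame({
    "cabin": ["A", "A", "B", "B", "B", "C", "C", "C"],
    "target": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Map the categories to arbitrary integers so the tree can split on them.
codes = df["cabin"].astype("category").cat.codes.to_frame()

# Fit a shallow (regularised) tree on this single feature.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(codes, df["target"])

# Use the tree's probability predictions as the mappings for each category.
df["cabin_enc"] = tree.predict_proba(codes)[:, 1]
print(df.drop_duplicates(subset="cabin"))
```

Each category is replaced by the target probability in its leaf, so categories with similar target behaviour receive similar (or identical) encodings.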
Additional resources -------------------- -In the following notebook, you can find more details into the :class:`DecisionTreeEncoder()` -functionality and example plots with the encoded variables: - -- `Jupyter notebook `_ - -For more details about this and other feature engineering methods check out these resources: - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 +For tutorials about this and other feature engineering methods check out these resources: - Python Feature Engineering Cookbook +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/encoding/MeanEncoder.rst b/docs/user_guide/encoding/MeanEncoder.rst index ef6d37fb1..c39690af3 100644 --- a/docs/user_guide/encoding/MeanEncoder.rst +++ b/docs/user_guide/encoding/MeanEncoder.rst @@ -92,7 +92,7 @@ rare categories together by using the :class:`RareLabelEncoder`. 
Alternative Python implementations of mean encoding --------------------------------------------------- -In Feature-engine, we blend the probabilities considering the target variability and the +In feature-engine, we blend the probabilities considering the target variability and the category frequency. In the original paper, there are alternative formulations to determine the blending. If you want to check those out, use the transformers from the Python library Category encoders: @@ -137,8 +137,8 @@ Mean encoding and machine learning Feature-engine's :class:`MeanEncoder()` can perform mean encoding for regression and binary classification datasets. At the moment, we do not support multi-class targets. -Python examples ---------------- +Python implementation +--------------------- In the following sections, we'll show the functionality of :class:`MeanEncoder()` using the Titanic Dataset. @@ -151,9 +151,11 @@ First, let's load the libraries, functions and classes: from feature_engine.datasets import load_titanic from feature_engine.encoding import MeanEncoder -To avoid data leakage, it is important to separate the data into training and test sets. -The mean target values, with or without smoothing, will be determined using the training -data only. +.. note:: + + To avoid data leakage, it is important to separate the data into training and test sets. + The mean target values, with or without smoothing, will be determined using the training + data only. 
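A generic smoothed (blended) mean encoding can be sketched as follows. The weighting scheme here — a fixed smoothing weight `m` — is a simplified assumption for illustration and is not Feature-engine's exact blending formula:

```python
import pandas as pd

df = pd.DataFrame({
    "cabin": ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 1, 0],
})

# Hypothetical smoothing weight: larger values pull rare categories
# towards the global target mean.
m = 2
global_mean = df["target"].mean()

# Per-category target mean and count, computed on the training data only.
stats = df.groupby("cabin")["target"].agg(["mean", "count"])

# Blend the category mean with the global mean according to the
# category frequency.
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["cabin_enc"] = df["cabin"].map(smoothed)
print(smoothed.round(3).to_dict())
```

The rare category "C" (one observation) is pulled strongly towards the global mean, while the frequent category "B" keeps an encoding close to its own mean.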
Let's load and split the data: @@ -185,7 +187,7 @@ We see the resulting dataframe containing 3 categorical columns: sex, cabin and Simple mean encoding --------------------- +~~~~~~~~~~~~~~~~~~~~ Let's set up the :class:`MeanEncoder()` to replace the categories in the categorical features with the target mean, without smoothing: @@ -249,7 +251,7 @@ with the target mean values: Mean encoding with smoothing ----------------------------- +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By default, :class:`MeanEncoder()` determines the mean target values without blending. If we want to apply smoothing to control the cardinality of the variable and avoid @@ -325,10 +327,10 @@ Below we see the resulting dataframe with the encoded features: We can now use this dataframes to train machine learning models for regression or classification. -Mean encoding variables with numerical values ---------------------------------------------- +Mean encoding numerical variables +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -:class:`MeanEncoder()`, and all Feature-engine encoders, have been designed to work with +:class:`MeanEncoder()`, and all feature-engine encoders, have been designed to work with variables of type object or categorical by default. If you want to encode variables that are numeric, you need to instruct the transformer to ignore the data type: @@ -344,72 +346,22 @@ are numeric, you need to instruct the transformer to ignore the data type: After encoding the features we can use the data sets to train machine learning algorithms. -Last thing to note before closing in is that mean encoding does not increase the -dimensionality of the resulting dataframes: from 1 categorical feature, we obtain 1 -encoded variable. Hence, this encoding method is suitable for predictive modeling that -uses models that are sensitive to the size of the feature space. +.. 
note::
+
+    Finally, note that mean encoding does not increase the
+    dimensionality of the resulting dataframes: from 1 categorical feature, we obtain 1
+    encoded variable. Hence, this encoding method is suitable for predictive modelling
+    with models that are sensitive to the size of the feature space.

Additional resources
--------------------

-In the following notebook, you can find more details into the :class:`MeanEncoder()`
-functionality and example plots with the encoded variables:
-
-- `Jupyter notebook `_
-
For tutorials about this and other feature engineering methods check out these resources:

-
-.. figure:: ../../images/feml.png
-   :width: 300
-   :figclass: align-center
-   :align: left
-   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning
-
-   Feature Engineering for Machine Learning
-
-.. figure:: ../../images/fetsf.png
-   :width: 300
-   :figclass: align-center
-   :align: right
-   :target: https://www.trainindata.com/p/feature-engineering-for-forecasting
-
-   Feature Engineering for Time Series Forecasting
-
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-
-Or read our book:
-
-.. figure:: ../../images/cookbook.png
-   :width: 200
-   :figclass: align-center
-   :align: left
-   :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587
-
-   Python Feature Engineering Cookbook
-
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|

+- `Feature Engineering for Machine Learning `_, online course.
+- `Feature Engineering for Time Series Forecasting `_, online course.
+- `Python Feature Engineering Cookbook `_, book.

Both our book and courses are suitable for beginners and more advanced data scientists
-alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
+alike. By purchasing them you are supporting `Sole `_,
+the main developer of feature-engine.
\ No newline at end of file
diff --git a/docs/user_guide/encoding/OneHotEncoder.rst b/docs/user_guide/encoding/OneHotEncoder.rst
index 1408e3f1b..007bfd244 100644
--- a/docs/user_guide/encoding/OneHotEncoder.rst
+++ b/docs/user_guide/encoding/OneHotEncoder.rst
@@ -6,12 +6,11 @@
OneHotEncoder
=============

-One-hot encoding is a method used to represent categorical data, where each category
-is represented by a binary variable. The binary variable takes the value 1 if the
-category is present and 0 otherwise. The binary variables are also known as dummy
-variables.
+With one-hot encoding, each category is represented by a binary variable. The binary
+variable takes the value 1 if the category is present and 0 otherwise. The binary
+variables are also known as dummy variables.

-To represent the categorical feature "is-smoker" with categories "Smoker" and
+For example, to represent the categorical feature "is-smoker" with categories "Smoker" and
"Non-smoker", we can generate the dummy variable "Smoker", which takes 1 if the person
smokes and 0 otherwise. We can also generate the variable "Non-smoker", which takes 1
if the person does not smoke and 0 otherwise.
@@ -81,10 +80,10 @@ models evaluate all features during fit, thus, with k-1 they have all the inform
about the original categorical variable.

There are a few occasions in which we may prefer to encode the categorical variables
-with k binary variables.
+with k binary variables: when training decision tree based models, and when performing
+feature selection.

-Encode into k dummy variables if training decision trees based models or performing
-feature selection. Decision tree based models and many feature selection algorithms
+Decision tree based models and many feature selection algorithms
evaluate variables or groups of variables separately. Thus, if encoding into k-1, the
last category will not be examined. In other words, we lose the information contained
in that category.
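The difference between k and k-1 dummies can be sketched with `pandas.get_dummies()` (for illustration; Feature-engine's :class:`OneHotEncoder()` exposes this choice through its `drop_last` parameter):

```python
import pandas as pd

df = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# k dummies: one binary variable per category.
k_dummies = pd.get_dummies(df, columns=["embarked"])

# k-1 dummies: the dropped category is implied when all the
# remaining dummies take the value 0.
k_minus_one = pd.get_dummies(df, columns=["embarked"], drop_first=True)

print(k_dummies.columns.tolist())
print(k_minus_one.columns.tolist())
```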
@@ -150,16 +149,16 @@ are those with the greatest number of observations. The remaining categories wil
 zeroes in each one of the derived dummies. This behaviour is useful when the categorical
 variables are highly cardinal to control the expansion of the feature space.

-**Note**
+.. note::

-The parameter `drop_last` is ignored when encoding the most popular categories.
+    The parameter `drop_last` is ignored when encoding the most popular categories.

 Python implementation
 ---------------------

-Let's look at an example of one hot encoding, using Feature-engine's :class:`OneHotEncoder()`
-utilizing the Titanic Dataset.
+Let's look at an example of one hot encoding, using feature-engine's :class:`OneHotEncoder()`
+with the Titanic Dataset.

 We'll start by importing the libraries, functions and classes, and loading the data into
 a pandas dataframe and dividing it into a training and a testing set:

@@ -200,6 +199,9 @@ Let's explore the cardinality of 4 of the categorical features:

     X_train[['sex', 'pclass', 'cabin', 'embarked']].nunique()

+We see that the variable sex has 2 categories, pclass has 3 categories, the variable
+cabin has 9 categories, and the variable embarked has 4 categories:
+
 .. code:: python

     sex         2
     pclass      3
     cabin       9
     embarked    4
     dtype: int64

-We see that the variable sex has 2 categories, pclass has 3 categories, the variable
-cabin has 9 categories, and the variable embarked has 4 categories.

 Let's now set up the OneHotEncoder to encode 2 of the categorical variables into k-1
 dummy variables:

@@ -230,15 +230,16 @@ attribute `encoder_dict_`.

     encoder.encoder_dict_

+The `encoder_dict_` contains the categories that will be represented by dummy variables
+for each categorical variable:
+
 ..
code:: python

    {'cabin': ['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T'], 'embarked': ['S', 'C', 'Q']}

-The `encoder_dict_` contains the categories that will be represented by dummy variables
-for each categorical variable.
-
-With transform, we go ahead and encode the variables. Note that by default, the
+With transform, we go ahead and encode the variables. Note that by default,
 :class:`OneHotEncoder()` drops the original categorical variables, which are now
 represented by the one-hot array.

@@ -301,17 +302,21 @@ categories.

 We can find the categorical variables like this:

     encoder.variables_

+Below we see all categorical variables in the Titanic dataset:
+
 .. code:: python

     ['sex', 'cabin', 'embarked']

-And we can identify the unique categories for each variables like this:
+We can identify the unique categories for each variable like this:

 .. code:: python

     encoder.encoder_dict_

+Below, we see the unique categories per variable:
+
 .. code:: python

     {'sex': ['female'],
@@ -327,7 +332,7 @@ We can now encode the categorical variables:

     print(train_t.head())

-And here we see the resulting dataframe:
+Here we see the resulting dataframe:

 .. code:: python

@@ -356,7 +361,7 @@ And here we see the resulting dataframe:

 Encoding variables of type numeric
 ----------------------------------

-By default, Feature-engine's :class:`OneHotEncoder()` will only encode categorical
+By default, feature-engine's :class:`OneHotEncoder()` will only encode categorical
 features. If you attempt to encode a variable of numeric dtype, it will raise an error.
To avoid this error, you can instruct the encoder to ignore the data type format as
follows:

@@ -399,7 +404,7 @@ the transformer into 2 dummies:

 Encoding binary variables into 1 dummy
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-With Feature-engine's :class:`OneHotEncoder()` we can encode all categorical variables
+With feature-engine's :class:`OneHotEncoder()` we can encode all categorical variables
 into k dummies and the binary variables into k-1 by setting the encoder as follows:

 .. code:: python

@@ -415,7 +420,7 @@ into k dummies and the binary variables into k-1 by setting the encoder as follo

     print(train_t.head())

-As we see in the following input, for the variable sex, we have only have 1 dummy,
+As we see in the following output, for the variable sex, we only have 1 dummy,
 and for all the rest we have k dummies:

 .. code:: python

@@ -526,57 +531,15 @@ For alternative encoding methods used in data science check the :class:`OrdinalE
 and other encoders included in the :ref:`encoding module `.

-Tutorials, books and courses
-----------------------------
-
-For more details into :class:`OneHotEncoder()`'s functionality visit:
-
-- `Jupyter notebook `_
-
-For tutorials about this and other data preprocessing methods check out our online course:
-
-.. figure:: ../../images/feml.png
-   :width: 300
-   :figclass: align-center
-   :align: left
-   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning
-
-   Feature Engineering for Machine Learning
-
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-
-Or read our book:
-
-.. figure:: ../../images/cookbook.png
-   :width: 200
-   :figclass: align-center
-   :align: left
-   :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587
-
-   Python Feature Engineering Cookbook
-
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-
-Both our book and course are suitable for beginners and more advanced data scientists
-alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
\ No newline at end of file +Additional resources +-------------------- + +For tutorials about this and other feature engineering methods check out these resources: + +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. + +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/encoding/OrdinalEncoder.rst b/docs/user_guide/encoding/OrdinalEncoder.rst index b7283a9bf..18510d57c 100644 --- a/docs/user_guide/encoding/OrdinalEncoder.rst +++ b/docs/user_guide/encoding/OrdinalEncoder.rst @@ -50,8 +50,11 @@ Ordered encoding attempts to define a monotonic relationship between the encoded method helps machine learning algorithms, particularly linear models (like linear regression), better capture and learn the relationship between the encoded feature and the target. -Keep in mind that ordered ordinal encoding will create a monotonic relationship between the encoded variable and the target -variable **only** when *there is* an intrinsic relationship between the categories and the target variable. +.. note:: + + Keep in mind that ordered ordinal encoding will create a monotonic relationship between the encoded variable and the target + variable **only** when *there is* an intrinsic relationship between the categories and the target variable. + Unseen categories ----------------- @@ -60,9 +63,12 @@ Ordinal encoding can't inherently deal with unseen categories. **Unseen categories** are categorical values that appear in test, validation, or live data but were not present in the training data. These categories are problematic because the encoding methods generate mappings only for categories present -in the training data. 
This means that we would lack encodings for any new, unseen category values. Unseen categories cause -errors during inference time (the phase when the machine learning model is used to make predictions on new data) because our -feature engineering pipeline is unable to convert that value into a number. +in the training data. This means that we would lack encodings for any new, unseen category values. + +.. note:: + + Unseen categories cause errors during inference time (the phase when the machine learning model is used to make predictions on new data) because our + feature engineering pipeline is unable to convert that value into a number. Ordinal encoding by itself does not deal with unseen categories. However, we could replace the unseen category with an arbitrary value, such as -1 (remember that ordinal encoding starts at 0). This procedure might work well for linear models @@ -88,11 +94,14 @@ unseen categories; and it is not suitable for a large number of categories, i.e. Ordinal encoding vs label encoding ---------------------------------- -Ordinal encoding is sometimes also referred to as label encoding. They follow the same procedure. Scikit-learn provides -2 different transformers: the OrdinalEncoder and the LabelEncoder. Both replace values, that is, categories, with ordinal -data. The OrdinalEncoder is designed to transform the predictor variables (those in the training set), while the LabelEncoder -is designed to transform the target variable. The end result of both transformers is the same; the original values are -replaced by ordinal numbers. +Ordinal encoding is sometimes also referred to as label encoding. They follow the same procedure. + +.. tip:: + + Scikit-learn provides 2 different transformers: the `OrdinalEncoder` and the `LabelEncoder`. Both replace categories with ordinal + data. 
However, `OrdinalEncoder` is designed to transform the predictor variables (those in the training set), while `LabelEncoder`
+    is designed to transform the target variable. The end result of both transformers is the same; the original values are
+    replaced by ordinal numbers.

 In our view, this has raised some confusion as to whether label encoding and ordinal encoding consist of different
 ways of preprocessing categorical data. Some argue that label encoding consists of replacing categories with numbers assigned
@@ -104,7 +113,7 @@ OrdinalEncoder

 Feature-engine's :class:`OrdinalEncoder()` implements ordinal encoding. That is, it encodes categorical features by
 replacing each category with a unique number ranging from 0 to k-1, where 'k' is the distinct number of categories in
-the dataset.
+the variable.

 :class:`OrdinalEncoder()` supports both **arbitrary** and **ordered** encoding methods. The desired approach can be
 specified using the `encoding_method` parameter that accepts either **"arbitrary"** or **"ordered"**. If not defined,
@@ -125,14 +134,14 @@ category, in which case it will be encoded as `np.nan`, or encode it into -1. Yo

 Python Implementation
 ---------------------

-In the rest of the page, we'll show different ways how we can use ordinal encoding through Feature-engine's
+In the rest of the page, we'll show different ways in which we can use ordinal encoding through feature-engine's
 :class:`OrdinalEncoder()`.

 Arbitrary ordinal encoding
 ~~~~~~~~~~~~~~~~~~~~~~~~~~

-We'll show how ordinal encoding is implemented by Feature-engine's :class:`OrdinalEncoder()` using the **Titanic Dataset**.
+We'll show how ordinal encoding is implemented by feature-engine's :class:`OrdinalEncoder()` using the **Titanic Dataset**.
Let's load the dataset and split it into train and test sets: @@ -168,17 +177,19 @@ We see the Titanic dataset below: 686 3 female 22.000000 0 0 7.7250 M Q -Let's set up the :class:`OrdinalEncoder()` to encode the categorical variables `cabin', `embarked`, and `sex` with +Let's set up the :class:`OrdinalEncoder()` to encode the categorical variables cabin, embarked, and sex with integers assigned arbitrarily: .. code:: python - encoder = OrdinalEncoder( - encoding_method='arbitrary', - variables=['cabin', 'embarked', 'sex']) + encoder = OrdinalEncoder( + encoding_method='arbitrary', + variables=['cabin', 'embarked', 'sex']) -:class:`OrdinalEncoder()` will encode **all** categorical variables in the training set by default, unless we specify -which variables to encode, as we did in the previous code block. +.. tip:: + + :class:`OrdinalEncoder()` will encode **all** categorical variables in the training set by default, unless we specify + which variables to encode, as we did in the previous code block. Let's fit the encoder so that it learns the mappings for each category: @@ -304,9 +315,10 @@ If you want to see the resulting dataframe, go ahead and execute `train_t.head() Ordered ordinal encoding ~~~~~~~~~~~~~~~~~~~~~~~~ -Ordered encoding consists of assigning the integers based on the mean target. +Ordered encoding consists of assigning the integers based on the mean target value. + We will use the **California Housing Dataset** to demonstrate ordered encoding. This dataset contains numeric features -such as *MedInc*, *HouseAge* and *AveRooms*, among others. The target variable is *MedHouseVal* i.e., the median house +such as *MedInc*, *HouseAge* and *AveRooms*, among others. The target variable is *MedHouseVal*, i.e., the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000). Let's first set up the dataset. 
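The logic behind ordered encoding can be sketched first with plain pandas on made-up data (an illustration of the idea only, not feature-engine's implementation): categories are ranked by their mean target value and then mapped to consecutive integers in that order.

```python
import pandas as pd

# Hypothetical toy data to illustrate the ordering logic.
X = pd.Series(["a", "b", "a", "c", "b", "c", "c", "a"], name="var")
y = pd.Series([1, 0, 1, 0, 1, 0, 0, 1])

# Rank categories by their mean target, lowest first, and map to 0, 1, 2, ...
order = y.groupby(X).mean().sort_values().index
mapping = {category: i for i, category in enumerate(order)}

X_encoded = X.map(mapping)
```

Here category "c" has the lowest mean target and receives 0, while "a" has the highest and receives the largest integer, which is what creates the monotonic relationship discussed above.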
@@ -411,8 +423,10 @@ variable to the `fit()` method: X_train_t = ordered_encoder.fit_transform(X_train, y_train) X_test_t = ordered_encoder.transform(X_test) -Note that we first fit the encoder on the training data and then transformed both the training and test data, using the -mappings learned from the training set. +.. tip:: + + Note that we first fit the encoder on the training data and then transformed both the training and test data, using the + mappings learned from the training set. Let's display the resulting dataframe: @@ -510,56 +524,12 @@ The power of ordinal ordered encoder resides in its intrinsic capacity of findin Additional resources -------------------- -In the following notebook, you can find more details into the :class:`OrdinalEncoder()`'s -functionality and example plots with the encoded variables: - -- `Jupyter notebook `_ +For tutorials about this and other feature engineering methods check out these resources: -For more details about this and other feature engineering methods check out these resources and tutorials: +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. 
\ No newline at end of file
+Both our book and courses are suitable for beginners and more advanced data scientists
+alike. By purchasing them you are supporting `Sole `_,
+the main developer of feature-engine.
\ No newline at end of file
diff --git a/docs/user_guide/encoding/RareLabelEncoder.rst b/docs/user_guide/encoding/RareLabelEncoder.rst
index 8ad1d724c..25e843cff 100644
--- a/docs/user_guide/encoding/RareLabelEncoder.rst
+++ b/docs/user_guide/encoding/RareLabelEncoder.rst
@@ -5,43 +5,51 @@
 RareLabelEncoder
 ================

-The :class:`RareLabelEncoder()` groups infrequent categories into one new category
-called 'Rare' or a different string indicated by the user. We need to specify the
-minimum percentage of observations a category should have to be preserved and the
-minimum number of unique categories a variable should have to be re-grouped.
+:class:`RareLabelEncoder()` groups infrequent categories into one new category
+called 'Rare' or a different string indicated by the user.

-**tol**
+This transformer requires 2 parameters:

-In the parameter `tol` we indicate the minimum proportion of observations a category
-should have, not to be grouped. In other words, categories which frequency, or proportion
-of observations is <= `tol` will be grouped into a unique term.
+- The minimum frequency a category should have not to be grouped.
+- The minimum cardinality a variable should have to be processed.

-**n_categories**
+Category frequency
+~~~~~~~~~~~~~~~~~~

-In the parameter `n_categories` we indicate the minimum cardinality of the categorical
-variable in order to group infrequent categories. For example, if `n_categories=5`,
-categories will be grouped only in those categorical variables with more than 5 unique
-categories. The rest of the variables will be ignored.
+The parameter `tol` specifies the minimum proportion of observations a category must
+have to remain ungrouped.
In other words, categories with a frequency <= `tol` are +grouped into a single category. -This parameter is useful when we have big datasets and do not have time to examine all -categorical variables individually. This way, we ensure that variables with low cardinality -are not reduced any further. +Variable cardinality +~~~~~~~~~~~~~~~~~~~~ -**max_n_categories** +The parameter `n_categories` specifies the minimum cardinality a categorical variable must +have for infrequent categories to be grouped. -In the parameter `max_n_categories` we indicate the maximum number of unique categories -that we want in the encoded variable. If `max_n_categories=5`, then the most popular 5 -categories will remain in the variable after the encoding, all other will be grouped into -a single category. +For example, if `n_categories = 5`, grouping is applied only to categorical variables +with more than five unique categories. Variables with five or fewer categories are left +unchanged. -This parameter is useful if we are going to perform one hot encoding at the back of it, -to control the expansion of the feature space. +.. tip:: -**Example** + This parameter is useful for large datasets, where it may not be practical to examine all + categorical variables individually. It ensures that variables with low cardinality are + not reduced further. -Let's look at an example using the Titanic Dataset. -First, let's load the data and separate it into train and test: +Encoding popular categories +~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The parameter `max_n_categories` specifies the maximum number of unique categories +allowed in the encoded variable. If `max_n_categories = 5`, the five most frequent +categories are retained after encoding, and all others are grouped into a single category. + +Python implementation +--------------------- + +Let's explore how to use :class:`RareLabelEncoder()` using the Titanic Dataset. + +Let's load the data and separate it into train and test: .. 
code:: python @@ -74,7 +82,7 @@ We see the resulting data below: 1193 3 male 29.881135 0 0 7.7250 M Q 686 3 female 22.000000 0 0 7.7250 M Q -Let's explore the number of uniue categories in the variable `"cabin"`. +Let's explore the number of unique categories in the variable `"cabin"`. .. code:: python @@ -87,8 +95,9 @@ We see the number of unique categories in the output below: array(['M', 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'], dtype=object) Now, we set up the :class:`RareLabelEncoder()` to group categories shown by less than 3% -of the observations into a new group or category called 'Rare'. We will group the -categories in the indicated variables if they have more than 2 unique categories each. +of the observations into a new group called 'Rare'. We will group the +categories in the variables cabin, pclass and embarked, only if they have more than 2 +unique categories each. .. code:: python @@ -103,8 +112,8 @@ categories in the indicated variables if they have more than 2 unique categories encoder.fit(X_train) With `fit()`, the :class:`RareLabelEncoder()` finds the categories present in more than -3% of the observations, that is, those that will not be grouped. These categories can -be found in the `encoder_dict_` attribute. +3% of the observations, that is, those that will not be grouped. These categories are stored +in the `encoder_dict_` attribute. .. code:: python @@ -149,12 +158,14 @@ category: .. code:: python - from feature_engine.encoding import RareLabelEncoder import pandas as pd + from feature_engine.encoding import RareLabelEncoder data = {'var_A': ['A'] * 10 + ['B'] * 10 + ['C'] * 2 + ['D'] * 1} data = pd.DataFrame(data) data['var_A'].value_counts() +In the following output, we see the number of observations per category: + .. 
code:: python

    A    10
    B    10
    C     2
    D     1
    Name: var_A, dtype: int64

@@ -163,14 +174,15 @@
-In this block of code, we group the categories only for variables with more than 3
-unique categories and then we plot the result:
+Now, we group categories only for variables with more than 3 unique categories:

 .. code:: python

     rare_encoder = RareLabelEncoder(tol=0.05, n_categories=3)
     rare_encoder.fit_transform(data)['var_A'].value_counts()

+Note that the variable was left unchanged because it has exactly 3 unique categories:
+
 .. code:: python

    A    10
@@ -188,6 +200,9 @@ the 'Rare' group:

     Xt = rare_encoder.fit_transform(data)
     Xt['var_A'].value_counts()

+In the following output we see that the 2 most infrequent categories have been grouped into
+a new category called `Rare`:
+
 .. code:: python

    A       10
    B       10
    Rare     3
    Name: var_A, dtype: int64

@@ -195,75 +210,31 @@
-Tips
-----
+Considerations
+--------------

-The :class:`RareLabelEncoder()` can be used to group infrequent categories and like this
+:class:`RareLabelEncoder()` can be used to group infrequent categories and hence
 control the expansion of the feature space if using one hot encoding.

-Some categorical encodings will also return NAN if a category is present in the test
-set, but was not seen in the train set. This inconvenient can usually be avoided if we
+Some categorical encodings will return NAN if a category is present in the test
+set, but was not seen in the train set. This inconvenience can be mitigated if we
 group rare labels before training the encoders.

-Some categorical encoders will also return NAN if there is not enough observations for
-a certain category. For example the :class:`WoEEncoder()` and the :class:`PRatioEncoder()`.
-This behaviour can be also prevented by grouping infrequent labels before the encoding
-with the :class:`RareLabelEncoder()`.
+Some categorical encoders will return NAN if there are not enough observations for
+a certain category to calculate the mapping, for example :class:`WoEEncoder()`. These
+types of errors can be prevented by grouping infrequent labels before the encoding with
+:class:`RareLabelEncoder()`.

 Additional resources
 --------------------

-In the following notebook, you can find more details into the :class:`RareLabelEncoder()`
-functionality and example plots with the encoded variables:
-
-- `Jupyter notebook `_
-
-For more details about this and other feature engineering methods check out these resources:
-
-
-.. figure:: ../../images/feml.png
-   :width: 300
-   :figclass: align-center
-   :align: left
-   :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning
-
-   Feature Engineering for Machine Learning
-
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-
-Or read our book:
-
-.. figure:: ../../images/cookbook.png
-   :width: 200
-   :figclass: align-center
-   :align: left
-   :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587
-
-   Python Feature Engineering Cookbook
-
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-|
-
-Both our book and course are suitable for beginners and more advanced data scientists
-alike. By purchasing them you are supporting Sole, the main developer of Feature-engine.
\ No newline at end of file
+For tutorials about this and other feature engineering methods check out these resources:
+
+- `Feature Engineering for Machine Learning `_, online course.
+- `Feature Engineering for Time Series Forecasting `_, online course.
+- `Python Feature Engineering Cookbook `_, book.
+
+Both our book and courses are suitable for beginners and more advanced data scientists
+alike. By purchasing them you are supporting `Sole `_,
+the main developer of feature-engine.
\ No newline at end of file
diff --git a/docs/user_guide/encoding/StringSimilarityEncoder.rst b/docs/user_guide/encoding/StringSimilarityEncoder.rst
index 211a50e79..7f1031eab 100644
--- a/docs/user_guide/encoding/StringSimilarityEncoder.rst
+++ b/docs/user_guide/encoding/StringSimilarityEncoder.rst
@@ -6,12 +6,12 @@
 StringSimilarityEncoder
 =======================

-The :class:`StringSimilarityEncoder()` replaces categorical variables with a set of float
+:class:`StringSimilarityEncoder()` replaces categorical variables with a set of float
 variables that capture the similarity between the category names. The new variables have
 values between 0 and 1, where 0 indicates no similarity and 1 is an exact match between
 the names of the categories.

-To calculate the similarity between the categories, :class:`StringSimilarityEncoder()`
+To calculate the similarity between categories, :class:`StringSimilarityEncoder()`
 uses Gestalt pattern matching. Under the hood, :class:`StringSimilarityEncoder()` uses
 the `quick_ratio` method from the `SequenceMatcher()` from `difflib`.

@@ -27,8 +27,8 @@ For example, the similarity between the categories "dog" and "dig" is 0.66. T
 is the total number of elements in both categories, that is 6. There are 2 matches between
 the words, the letters d and g, so: 2 * M / T = 2 * 2 / 6 = 0.66.

-Output of the :class:`StringSimilarityEncoder()`
-------------------------------------------------
+Understanding the output of string similarity encoding
+------------------------------------------------------

 Let's create a dataframe with the categories "dog", "dig" and "cat":

@@ -59,7 +59,6 @@ Let's now encode the variable:

 We see the encoded variables below:

-
 ..
code:: python

       words_dog  words_dig  words_cat
    0   1.000000   0.666667        0.0
    1   0.666667   1.000000        0.0
    2   0.000000   0.000000        1.0

+In the first variable, we see the string similarity between all categories and dog, in the
+second variable we have the string similarity between all categories and dig, and in the
+final column we observe the string similarity between all categories and cat.
+
+.. note::
+
+    :class:`StringSimilarityEncoder()` returns k variables to represent 1 categorical
+    feature, where k is the number of unique categories of that variable.

 Note that :class:`StringSimilarityEncoder()` replaces the original variables by the
 distance variables.

-:class:`StringSimilarityEncoder()` vs One-hot encoding
-------------------------------------------------------
+String similarity encoding vs one-hot encoding
+----------------------------------------------

 String similarity encoding is similar to one-hot encoding, in the sense that each
 category is encoded as a new variable. But the values, instead of 1 or 0, are the similarity
@@ -83,26 +90,38 @@
 Encoding only popular categories
 --------------------------------

 The :class:`StringSimilarityEncoder()` can also create similarity variables for the *n* most popular
-categories, *n* being determined by the user. For example, if we encode only the 6 more popular categories, by
+categories, *n* being determined by the user.
+
+For example, if we encode only the 6 most popular categories, by
 setting the parameter `top_categories=6`, the transformer will add variables only
-for the 6 most frequent categories. The most frequent categories are those with the largest
-number of observations. This behaviour is useful when the categorical variables are highly cardinal,
-to control the expansion of the feature space.
+for the 6 most frequent categories.
+
+The most frequent categories are those with the largest
+number of observations.
+
+..
note::

+
+    This behaviour is useful when the categorical variables are highly cardinal,
+    to control the expansion of the feature space.

-Specifying how :class:`StringSimilarityEncoder()` should deal with missing values
----------------------------------------------------------------------------------
+Missing values
+--------------

-The :class:`StringSimilarityEncoder()` has three options for dealing with missing values, which can be
+:class:`StringSimilarityEncoder()` has three options for dealing with missing values, which can be
 specified with the parameter `missing_values`:

-    1. Ignore NaNs (option `ignore`) - will leave the NaN in the resulting dataframe after transformation.
-    Could be useful, if the next step in the pipeline is imputation or if the machine learning algorithm
-    can handle missing data out-of-the-box.
-    2. Impute NaNs (option `impute`) - will impute NaN with an empty string, and then calculate the similarity
-    between the empty string and the variable's categories. Most of the time, the similarity value will be
-    0 in resulting dataframe. This is the default option.
-    3. Raise an error (option `raise`) - will raise an error if NaN is present during `fit`, `transform` or
-    `fit_transform`. Could be useful for debugging and monitoring purposes.
+1. Ignore NaNs (option `ignore`) - will leave the NaN in the resulting dataframe after transformation.
+
+Could be useful if the next step in the pipeline is imputation or if the machine learning algorithm
+can handle missing data out-of-the-box.
+
+2. Impute NaNs (option `impute`) - will impute NaN with an empty string, and then calculate the similarity between the empty string and the variable's categories.
+
+Most of the time, the similarity value will be 0 in the resulting dataframe. This is the default option.
+
+3. Raise an error (option `raise`) - will raise an error if NaN is present during `fit`, `transform` or `fit_transform`.
+
+Could be useful for debugging and monitoring purposes.
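The behaviour behind the `impute` option can be checked directly with `difflib`, which the encoder relies on under the hood: an empty string shares no characters with any category, so its similarity is 0. A small sketch of the similarity calculation (not the encoder itself):

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    # Gestalt pattern matching: 2 * M / T, as in the example with "dog" and "dig".
    return SequenceMatcher(None, a, b).quick_ratio()


print(similarity("dog", "dig"))  # letters d and g match: 2 * 2 / 6
print(similarity("", "dog"))     # empty string: no matches, similarity 0
```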
Important

@@ -114,12 +133,12 @@ string similarity to the seen categories.

 No text preprocessing is applied by :class:`StringSimilarityEncoder()`. Be mindful of
 preparing string categorical variables if needed.

-:class:`StringSimilarityEncoder()` works with categorical variables by default. And it has the option to
+:class:`StringSimilarityEncoder()` works with categorical variables by default. It has the option to
 encode numerical variables as well. This is useful when the values of the numerical
 variables are more useful as strings than as numbers. For example, for variables like barcode.

-Examples
---------
+Python implementation
+---------------------

 Let's look at an example using the Titanic Dataset. First we load the data and divide it
 into a train and a test set:

@@ -214,7 +233,9 @@ are stored in the attribute `encoder_dict_`.

 .. code:: python

-    encoder.encoder_dict_
+    encoder.encoder_dict_
+
+In the following output, we see the most frequent categories per variable:

 .. code:: python

@@ -227,8 +248,11 @@ are stored in the attribute `encoder_dict_`.

 The `encoder_dict_` contains the categories that will derive similarity variables for
 each categorical variable.

-With transform, we go ahead and encode the variables. Note that the
-:class:`StringSimilarityEncoder()` will drop the original variables.
+With transform, we go ahead and encode the variables:
+
+.. note::
+
+    The :class:`StringSimilarityEncoder()` will drop the original variables.

 .. code:: python

@@ -264,11 +288,15 @@ Below, we see the resulting dataframe:

     393       0.0   0.437500     0.666667     0.666667

-More details
-------------
+Additional resources
+--------------------

-For more details into :class:`StringSimilarityEncoder()`'s functionality visit:
+For tutorials about feature engineering methods check out these resources:

-- `Jupyter notebook `_
+- `Feature Engineering for Machine Learning `_, online course.
+- `Feature Engineering for Time Series Forecasting `_, online course.
+- `Python Feature Engineering Cookbook `_, book. -All notebooks can be found in a `dedicated repository `_. +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/encoding/WoEEncoder.rst b/docs/user_guide/encoding/WoEEncoder.rst index e501635d8..2807a131c 100644 --- a/docs/user_guide/encoding/WoEEncoder.rst +++ b/docs/user_guide/encoding/WoEEncoder.rst @@ -41,9 +41,12 @@ with three categories (A1, A2, and A3). The dataset has the following characteri - Category A3 has 5 positive cases and 50 negative cases. First, we find out the number of instances with a positive target value (1) per category, -and then we divide that by the total number of positive cases in the data. Then we determine -the number of instances with target value of 0 per category and divide that by the total -number of negative instances in the dataset: +and then we divide that by the total number of positive cases in the data (20 in our example). + +After that, we determine the number of instances with target value of 0 per category and +divide that by the total number of negative instances in the dataset (80 in our example). + +So that: - For category A1, we have 10 positive cases and 15 negative cases, resulting in a positive ratio of 10/20 and a negative ratio of 15/80. This means that the positive ratio is 0.5 and the negative ratio is 0.1875. - For category A2, we have 5 positive cases out of 20 positive cases, giving us a positive ratio of 5/20 and a negative ratio of 15/80. This results in a positive ratio of 0.25 and a negative ratio of 0.1875. @@ -62,7 +65,7 @@ the WoE values: 0.98, 0.28, -0.91. 
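The arithmetic above can be reproduced in a few lines of Python (a sketch of the formula only, not feature-engine's implementation):

```python
from math import log

# Positive and negative case counts per category, from the example above.
pos = {"A1": 10, "A2": 5, "A3": 5}
neg = {"A1": 15, "A2": 15, "A3": 50}

total_pos = sum(pos.values())  # 20
total_neg = sum(neg.values())  # 80

# WoE = ln( (positives in category / all positives) / (negatives in category / all negatives) )
woe = {c: log((pos[c] / total_pos) / (neg[c] / total_neg)) for c in pos}
```

Up to rounding, this reproduces the WoE values quoted above for A1, A2 and A3.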
Characteristics of the WoE -------------------------- -The beauty of the WoE, is that we can directly understand the impact of the category on +The beauty of the WoE is that we can directly understand the impact of the category on the probability of success (target variable being 1): - If WoE values are negative, there are more negative cases than positive cases for the category. @@ -90,7 +93,7 @@ Uses of the WoE In general, we use the WoE to encode both categorical and numerical variables. For continuous variables, we first need to do binning, that is, sort the variables into discrete intervals. You can do this by preprocessing the variable using any of -Feature-engine's discretizers. +feature-engine's discretisers. Some authors have extended the Weight of Evidence approach to neural networks and other algorithms, and although they have shown good results, the predictive modeling performance @@ -110,7 +113,7 @@ always takes 1 or 0). In practice, this happens mostly when a category has a low in the dataset, that is, when only very few observations show that category. To overcome this limitation, consider using a variable transformation method to group -those categories together, for example by using Feature-engine's :class:`RareLabelEncoder()`. +those categories together, for example by using feature-engine's :class:`RareLabelEncoder()`. Taking into account the above considerations, conducting a detailed exploratory data analysis (EDA) is essential as part of the data science and model-building process. @@ -144,7 +147,7 @@ raise an error. If you want to encode numerical, for example discrete variables, :class:`WoEEncoder()` does not handle missing values automatically, so make sure to replace them with a suitable value before the encoding. You can impute missing values -with Feature-engine's imputers. +with feature-engine's imputers. :class:`WoEEncoder()` will ignore unseen categories by default, in which case, they will be replaced by np.nan after the encoding.
You have the option to make the encoder raise @@ -271,8 +274,8 @@ WoE in categorical and numerical variables In the previous example, we encoded only the variables 'cabin', 'pclass', 'embarked', and left the rest of the variables untouched. In the following example, we will use -Feature-engine's pipeline to transform variables in sequence. We'll group rare categories -in categorical variables. Next, we'll discretize numerical variables. And finally, we'll +feature-engine's pipeline to transform variables in sequence. We'll group rare categories +in categorical variables. Next, we'll discretise numerical variables. And finally, we'll encode them all with the WoE. First, let's load the data and separate it into train and test: @@ -318,7 +321,7 @@ Let's define lists with the categorical and numerical variables: numerical_features = ['fare', 'age'] all = categorical_features + numerical_features -Now, we will set up the pipeline to first discretize the numerical variables, then group +Now, we will set up the pipeline to first discretise the numerical variables, then group rare labels and low frequency intervals into a common group, and finally encode all variables with the WoE: @@ -365,7 +368,7 @@ We see the resulting dataframe below: 1193 0.012075 686 0.012075 -Finally, we can visualize the values of the WoE encoded variables respect to the original +Finally, we can visualise the values of the WoE encoded variables with respect to the original values to corroborate the sigmoid function shape, which is the expected behavior of the WoE: @@ -482,8 +485,8 @@ used for feature selection for binary classification problems. Weight of Evidence and Information Value within Feature-engine ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -If you're asking yourself whether Feature-engine allows you to automate this process, -the answer is: of course!
You can utilize the :class:`SelectByInformationValue()` class +If you're asking yourself whether feature-engine allows you to automate this process, +the answer is: of course! You can utilise the :class:`SelectByInformationValue()` class and it will handle all these steps for you. Again, remember the given considerations. References @@ -497,57 +500,12 @@ References Additional resources -------------------- -In the following notebooks, you can find more details into the :class:`WoEEncoder()` -functionality and example plots with the encoded variables: - -- `WoE in categorical variables `_ -- `WoE in numerical variables `_ - -For more details about this and other feature engineering methods check out these resources: - - -.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook +For tutorials about this and other feature engineering methods check out these resources: -| -| -| -| -| -| -| -| -| -| -| -| -| +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book. -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. \ No newline at end of file +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. 
\ No newline at end of file diff --git a/docs/user_guide/encoding/index.rst b/docs/user_guide/encoding/index.rst index 2b82c0a11..8a0f9a406 100644 --- a/docs/user_guide/encoding/index.rst +++ b/docs/user_guide/encoding/index.rst @@ -7,19 +7,18 @@ Categorical Encoding ==================== Categorical encoding is the process of converting categorical variables into numeric -features. It is an important feature engineering step in most data science projects, -as it ensures that machine learning algorithms can appropriately handle and interpret -categorical data. +features. It is an important feature engineering step, as it ensures that machine +learning algorithms can appropriately handle and interpret categorical data. There are various categorical encoding methods that we can use to encode categorical -features. One hot encoding and ordinal encoding are the most well known, but other +features. One-hot encoding and ordinal encoding are the most well known, but other encoding techniques can help tackle high cardinality and rare categories before and after training machine learning models. Feature-engine's categorical encoders replace the variables' categorical values by estimated or arbitrary numerical values through various encoding methods. In this page, we will discuss categorical features and the importance of categorical encoding in more -detail, and then introduce the various encoding techniques supported by Feature-engine. +detail, and then introduce the various encoding techniques supported by feature-engine. Categorical features -------------------- @@ -55,14 +54,14 @@ We can identify categorical features by inspecting their data types. With pandas we can obtain the data types of all variables in a dataframe; features with non-numeric data types such as *string*, *object* or *categorical* are, in general, categorical. 
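The dtype-based check described above can be sketched with plain pandas (the column names below are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "size": ["small", "large", "medium"],         # object dtype -> categorical
    "city": pd.Categorical(["NY", "SF", "NY"]),   # categorical dtype
    "price": [9.99, 24.50, 14.00],                # float dtype -> numerical
})

# Features with non-numeric dtypes (object, category) are, in general, categorical.
categorical_cols = df.select_dtypes(include=["object", "category"]).columns.tolist()
print(categorical_cols)  # ['size', 'city']
```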
-Categorical features can also be numeric, however, like for example the features *Store ID*, +Categorical features can also be numeric, for example, the features *Store ID*, *SKU ID* or *Zip Code*. Although these variables have numeric values, they are categorical. Cardinality ----------- -**Cardinality** refers to the number of unique categories of a categorical variable. F -or example, the cardinality of the variable 'size', which takes the values 'small', +**Cardinality** refers to the number of unique categories of a categorical variable. For +example, the cardinality of the variable 'size', which takes the values 'small', 'medium' and 'large' is 3. A categorical variable is said to have a **low cardinality** when the number of distinct @@ -79,7 +78,7 @@ Unseen categories ----------------- **Unseen categories** are categorical values that appear in the test or validation -datasets, or even in live data after model deployment, that were not present in the +datasets, or even in live data after model deployment, but were not present in the training data, and therefore were not **seen** by the machine learning model. Unseen categories are challenging for various reasons. Firstly, when we create mappings @@ -88,7 +87,7 @@ mappings for those categories *present* in the training set. Hence, we'd lack a for a new, unseen value. We may feel tempted to replace unseen categories by 0, or an arbitrary value, or just have -0s in all dummy variables if we used one hot encoding, but this may make the machine learning +0s in all dummy variables if we used one-hot encoding, but this may make the machine learning model behave unexpectedly leading to inaccurate predictions. Ideally, we want to account for the potential presence of unseen categories during the @@ -116,13 +115,13 @@ Overfitting Overfitting occurs when a machine learning model learns the noise and random fluctuations present in the training data in addition to the underlying relationships. 
This results in a -model that performs exceptionally well on the training data but fails to generalize on unseen +model that performs exceptionally well on the training data but fails to generalise on unseen data (i.e., the model shows low performance on the validation data set). High cardinality features can lead to overfitting, particularly in tree-based models such as decision trees or random forests. Overfitting occurs because tree-based models will try to perform extensive splitting on the high cardinality feature, making the final tree overly -complex. This often leads to poor generalization. Reducing cardinality, often helps mitigate +complex. This often leads to poor generalisation. Reducing cardinality often helps mitigate the problem. Encoding pipeline @@ -138,47 +137,47 @@ from the **training data**. Encoding methods ---------------- -There are various methods to transform categorical variables into numerical features. One hot +There are various methods to transform categorical variables into numerical features. One-hot encoding and ordinal encoding are the most commonly used, but other methods can mitigate high cardinality and account for unseen categories. In the rest of this page, we'll introduce various methods for encoding categorical data, and -highlight the Feature-engine transformer that can carry out this transformation. +highlight the feature-engine transformer that can carry out this transformation. -One hot encoding +One-hot encoding ~~~~~~~~~~~~~~~~ One-hot encoding (OHE) consists of replacing categorical variables by a set of binary variables each representing one of the unique categories in the variable. The binary variable takes the value 1, if the observation shows the category, or alternatively, 0.
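The idea can be sketched with `pandas.get_dummies` (Feature-engine's :class:`OneHotEncoder` offers the same transformation with extra options; the variable name below is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"size": ["small", "medium", "large", "small"]})

# One binary column per unique category; 1 where the observation shows it, else 0.
dummies = pd.get_dummies(df["size"], prefix="size", dtype=int)
print(dummies.columns.tolist())  # ['size_large', 'size_medium', 'size_small']
```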
-One hot encoding is particularly suitable for linear models because it treats each category +One-hot encoding is particularly suitable for linear models because it treats each category independently, and linear models can process binary variables effectively. -One hot encoding, however, increases the dimensionality of the dataset, as it adds a new +One-hot encoding, however, increases the dimensionality of the dataset, as it adds a new variable per category. Hence, OHE may not be suitable for encoding high cardinality features, as it can drastically increase the dimensionality of the dataset, often leading to a set of variables that are highly correlated or even identical. -Feature-engine's :class:`OneHotEncoder` implements one hot encoding. +Feature-engine's :class:`OneHotEncoder` implements one-hot encoding. -One hot encoding of frequent categories +One-hot encoding of frequent categories ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To prevent a massive increase of the feature space, some data scientists create binary -variables through one hot encoding of the **most frequent categories** in the variable. +variables through one-hot encoding of the **most frequent categories** in the variable. Less frequent values are treated collectively and represented as 0 in all the binary variables created for the frequent categories. -One hot encoding of frequent categories can then help tackle high cardinality and also +One-hot encoding of frequent categories can then help tackle high cardinality and also unseen categories, because unseen categories will be also encoded as an infrequent value. -Feature-engine's :class:`OneHotEncoder` can implement one hot encoding of frequent categories. +Feature-engine's :class:`OneHotEncoder` can implement one-hot encoding of frequent categories. Ordinal Encoding ~~~~~~~~~~~~~~~~ In ordinal encoding, each category is replaced with an integer value. These numbers are, -in general, assigned arbitrarily. 
With Feature-engine's :class:`OrdinalEncoder`, we have the +in general, assigned arbitrarily. With feature-engine's :class:`OrdinalEncoder`, we have the option to assign integers arbitrarily, or alternatively, ordered based on the mean target value per category. @@ -269,7 +268,7 @@ tree model. In decision tree encoding, a decision tree is trained using the categorical feature to predict the target variable. The decision tree splits the data based of each category, eventually leading the observations to a final leaf. Each category is then replaced by -the prediction of the tree, which consists of the mean target value calculated over the +the prediction of the tree, which consists of the mean target value calculated using the observations that ended in that leaf. Feature-engine's :class:`DecisionTreeEncoder` implements decision tree encoding. @@ -311,7 +310,8 @@ blending similar categories together. Feature-engine's :class:`StringSimilarityEncoder` implements string similarity encoding. -**Summary of Feature-engine's encoders characteristics** +Summary of feature-engine's encoders characteristics +---------------------------------------------------- =================================== ============ ================= ============== =============================================================== Transformer Regression Classification Multi-class Description @@ -331,9 +331,10 @@ Feature-engine's categorical encoders work only with categorical variables by de From version 1.1.0, you have the option to set the parameter `ignore_format` to `False`, and make the transformers also accept numerical variables as input. -**Monotonicity** +Monotonicity +~~~~~~~~~~~~ -Most Feature-engine's encoders will return, or attempt to return monotonic relationships +Most feature-engine's encoders will return, or attempt to return monotonic relationships between the encoded variable and the target. 
A monotonic relationship is one in which the variable value increases as the values in the other variable increase, or decrease. See the following illustration as examples: @@ -345,15 +346,17 @@ See the following illustration as examples: Monotonic relationships tend to help improve the performance of linear models and build shallower decision trees. -**Regression vs Classification** +Regression vs Classification +~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most Feature-engine's encoders are suitable for both regression and classification, with +Most of feature-engine's encoders are suitable for both regression and classification, with the exception of the :class:`WoEEncoder()` which is designed solely for **binary** classification. -**Multi-class classification** +Multi-class classification +~~~~~~~~~~~~~~~~~~~~~~~~~~ -Finally, some Feature-engine's encoders can handle multi-class targets off-the-shelf for +Finally, some feature-engine's encoders can handle multi-class targets off-the-shelf, for example the :class:`OneHotEncoder()`, the :class:`CountFrequencyEncoder()` and the :class:`DecisionTreeEncoder()`. @@ -365,7 +368,7 @@ defeat the purpose of these encoding techniques. Alternative encoding techniques ------------------------------- -In addition to the categorical encoding methods supported by Feature-engine, there are +In addition to the categorical encoding methods supported by feature-engine, there are other methods like feature hashing or binary encoding. These methods are supported by the Python library category encoders. For the time being, we decided not to support these transformations because they return features that are not easy to interpret. And hence, @@ -377,52 +380,13 @@ Additional resources For tutorials about this and other feature engineering methods check out these resources: +- `Feature Engineering for Machine Learning `_, online course. +- `Feature Engineering for Time Series Forecasting `_, online course. +- `Python Feature Engineering Cookbook `_, book.
-.. figure:: ../../images/feml.png - :width: 300 - :figclass: align-center - :align: left - :target: https://www.trainindata.com/p/feature-engineering-for-machine-learning - - Feature Engineering for Machine Learning - -| -| -| -| -| -| -| -| -| -| - -Or read our book: - -.. figure:: ../../images/cookbook.png - :width: 200 - :figclass: align-center - :align: left - :target: https://www.packtpub.com/en-us/product/python-feature-engineering-cookbook-9781835883587 - - Python Feature Engineering Cookbook - -| -| -| -| -| -| -| -| -| -| -| -| -| - -Both our book and course are suitable for beginners and more advanced data scientists -alike. By purchasing them you are supporting Sole, the main developer of Feature-engine. +Both our book and courses are suitable for beginners and more advanced data scientists +alike. By purchasing them you are supporting `Sole `_, +the main developer of feature-engine. Encoders -------- @@ -435,6 +399,7 @@ Encoders CountFrequencyEncoder MeanEncoder WoEEncoder + StringSimilarityEncoder DecisionTreeEncoder RareLabelEncoder - StringSimilarityEncoder + diff --git a/docs/user_guide/imputation/AddMissingIndicator.rst b/docs/user_guide/imputation/AddMissingIndicator.rst index 628ebff12..27d3a7c87 100644 --- a/docs/user_guide/imputation/AddMissingIndicator.rst +++ b/docs/user_guide/imputation/AddMissingIndicator.rst @@ -111,4 +111,4 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. \ No newline at end of file +the main developer of feature-engine. 
\ No newline at end of file diff --git a/docs/user_guide/imputation/ArbitraryNumberImputer.rst b/docs/user_guide/imputation/ArbitraryNumberImputer.rst index a17e15e74..f87f83e62 100644 --- a/docs/user_guide/imputation/ArbitraryNumberImputer.rst +++ b/docs/user_guide/imputation/ArbitraryNumberImputer.rst @@ -128,4 +128,4 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. \ No newline at end of file +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/imputation/CategoricalImputer.rst b/docs/user_guide/imputation/CategoricalImputer.rst index 024da71b2..925e970d2 100644 --- a/docs/user_guide/imputation/CategoricalImputer.rst +++ b/docs/user_guide/imputation/CategoricalImputer.rst @@ -321,4 +321,4 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. +the main developer of feature-engine. diff --git a/docs/user_guide/imputation/DropMissingData.rst b/docs/user_guide/imputation/DropMissingData.rst index 761af9d99..c5e16dd7b 100644 --- a/docs/user_guide/imputation/DropMissingData.rst +++ b/docs/user_guide/imputation/DropMissingData.rst @@ -408,7 +408,7 @@ When we dropna from a dataframe, we then need to realign the target. We saw prev that we can do that by using the method `transform_x_y`. We can align the target with the resulting dataframe automatically from within a -pipeline as well, by utilizing Feature-engine's pipeline. +pipeline as well, by utilizing feature-engine's pipeline. 
Let's start by importing the necessary libraries: @@ -550,7 +550,7 @@ instead, check out our :ref:`missing data imputation ` tr Drop columns with nan ^^^^^^^^^^^^^^^^^^^^^ -At the moment, Feature-engine does not have transformers that will find columns with a +At the moment, feature-engine does not have transformers that will find columns with a certain percentage of missing values and drop them. Instead, you can find those columns manually, and then drop them with the help of `DropFeatures` from the selection module. @@ -573,4 +573,4 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. \ No newline at end of file +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/imputation/EndTailImputer.rst b/docs/user_guide/imputation/EndTailImputer.rst index 77e0484fe..23b4a17ae 100644 --- a/docs/user_guide/imputation/EndTailImputer.rst +++ b/docs/user_guide/imputation/EndTailImputer.rst @@ -120,4 +120,4 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. \ No newline at end of file +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/imputation/MeanMedianImputer.rst b/docs/user_guide/imputation/MeanMedianImputer.rst index 2520a5e34..90dbf5ec1 100644 --- a/docs/user_guide/imputation/MeanMedianImputer.rst +++ b/docs/user_guide/imputation/MeanMedianImputer.rst @@ -204,7 +204,7 @@ Imputing missing values alongside missing indicators ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Mean or median imputation are commonly done alongside adding missing indicators. 
-We can add missing indicators with :class:`AddMissingIndicator()` from Feature-engine. +We can add missing indicators with :class:`AddMissingIndicator()` from feature-engine. We can chain :class:`AddMissingIndicator()` with :class:`MeanMedianImputer()` using a `scikit-learn pipeline `_. @@ -310,4 +310,4 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. \ No newline at end of file +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/imputation/RandomSampleImputer.rst b/docs/user_guide/imputation/RandomSampleImputer.rst index 972569421..03a1d7a9a 100644 --- a/docs/user_guide/imputation/RandomSampleImputer.rst +++ b/docs/user_guide/imputation/RandomSampleImputer.rst @@ -151,4 +151,4 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. \ No newline at end of file +the main developer of feature-engine. \ No newline at end of file diff --git a/docs/user_guide/imputation/index.rst b/docs/user_guide/imputation/index.rst index 3d940010b..54f035a78 100644 --- a/docs/user_guide/imputation/index.rst +++ b/docs/user_guide/imputation/index.rst @@ -76,7 +76,7 @@ have missing values. In this scenario, we can predict the missing grade values b on existing grade data, using age and IQ as predictors. Subsequently, we can apply the same regression imputation approach to the other variables (age and IQ) in subsequent iterations. -Feature-engine currenty supports univariate imputation strategies. For multivariate imputation, check out Scikit-learn's `iterative imputer `_. 
+Feature-engine currently supports univariate imputation strategies. For multivariate imputation, check out Scikit-learn's `iterative imputer `_. Feature-engine's imputation methods @@ -331,7 +331,7 @@ For tutorials about missing data imputation methods check out these resources: Both our book and courses are suitable for beginners and more advanced data scientists alike. By purchasing them you are supporting `Sole `_, -the main developer of Feature-engine. +the main developer of feature-engine. Imputers
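Mean imputation alongside a missing indicator, as discussed for :class:`MeanMedianImputer()` and :class:`AddMissingIndicator()`, can be sketched with plain pandas (the column name is illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 40.0, np.nan]})

# First flag the missing entries, then fill them with the mean of observed values.
df["age_na"] = df["age"].isna().astype(int)
df["age"] = df["age"].fillna(df["age"].mean())

print(df["age"].tolist())  # [25.0, 32.5, 40.0, 32.5]
```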