

  • Balance Sheet Report in PowerBI

    Investopedia defined financial statements as written records that convey a company’s business activities and financial performance. Last time out, we created an income statement in PowerBI. Check here for a recap. Now we will look at the balance sheet, one of the three financial statements used to determine a company’s economic performance.

    So what is a balance sheet? In lay terms, a balance sheet is simply a statement that shows what a company owns (assets), what it owes (liabilities) and what it is worth (equity) at a particular point in time. A company’s assets range from cash in the bank, office furniture and accounts receivable to patents, trademarks, buildings, equipment, technology etc. A company’s liabilities include accounts payable, i.e. what is owed to suppliers (whenever a company buys on credit, it results in accounts payable), short- or long-term loans etc. Concerning equity, consider Mr A, who has the following assets: a laptop ($2,000) and a phone ($1,000). But Mr A made a part payment of $500 for the phone and has agreed to pay the balance in 30 days. Currently, Mr A is worth $2,500 because he still has a $500 debt to settle.

    So, how do we create a balance sheet using PowerBI? Let’s go into PowerBI to see this. First, we need a balance sheet template to import into PowerBI. Sample data and a template have been provided here. The following are the tables needed for the balance sheet report. The balance sheet data has been transformed, as the format provided is unsuitable for the model.

    Plot twist. We must create a relationship between the calendar table and the balance sheet data, but the only column that links them is the Year column, which would create a many-to-many relationship, and we don’t want that. What do you think is the way forward? Try to think of a solution before scrolling down. There are two ways to go about this:
    1. Convert the Year column to a date column, which gives you the first day of every month like this, and establish a relationship using the dates.
    2. Use the TREATAS function. This is my approach.

    The TREATAS function creates a virtual relationship between the tables specified. VALUES ( 'Calendar'[Year] ) is used because unique values are required. With the aid of the VALUES function, we have established a virtual one-to-many relationship between the Year column on the calendar table and the Year column on the balance sheet data.

    If we look at the balance sheet template we imported, the itemized rows contain the same values as the balance sheet sub-category column. We also have rows like Total current assets, Total fixed assets etc., which are not included in the balance sheet data. Hence the need to calculate the totals separately and embed them in a single measure, as was done for the income statement. Below are the calculations for the totals:
    1. Total current assets
    2. Total fixed assets
    3. Total other assets
    4. Total assets
    5. Total current liabilities
    6. Total long-term liabilities
    7. Total owner’s equity

    All totals and subtotals have been calculated. The end is near. All that is left is to wrap our totals in a single DAX measure, as seen below. Note that the FORMAT function is used in the DAX to override the currency data type for the measure, because a debt ratio can’t be ₦0.7.

    The recipe for the balance sheet is ready. All that is needed is to create the table, make design adjustments, and you will have your balance sheet ready. See the full report here. Adios
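    The DAX in this post is shown as screenshots; below is a minimal sketch of the pattern described above. Table, column and item names ('Balance Sheet Data', [Sub-Category], 'Template'[Item]) are assumptions for illustration, not the exact names from the report.

        BS Value =
        CALCULATE (
            SUM ( 'Balance Sheet Data'[Value] ),
            TREATAS ( VALUES ( 'Calendar'[Year] ), 'Balance Sheet Data'[Year] )
        )

        Total Current Assets =
        CALCULATE ( [BS Value], 'Balance Sheet Data'[Sub-Category] = "Current Assets" )

        Balance Sheet Values =
        SWITCH (
            SELECTEDVALUE ( 'Template'[Item] ),
            "Total current assets", [Total Current Assets],
            -- ...one branch per remaining total/subtotal...
            [BS Value]  -- itemized rows fall through to the base measure
        )

    Ratio rows such as the debt ratio are why the post wraps the result in FORMAT(): it overrides the measure's currency data type so the ratio isn't displayed as ₦0.7.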

  • Sentiment Analysis in Excel

    Analysis of social media streams is usually restricted to fundamental, count-based metrics, i.e. how many people tweeted about an event, brand or product, who tweeted the most, where those tweets are coming from etc. This is like scratching the surface and missing out on the high-value insights waiting to be discovered. So what should you do to help a brand take full advantage of the captured data? Simple. Carry out sentiment analysis.

    MonkeyLearn defined sentiment analysis as a natural language processing (NLP) technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs.

    Why sentiment analysis? Social media is an effective tool for reaching out to new customers and engaging with current ones. Positive customer feedback and social media posts encourage other customers to purchase from your company. The opposite is also true: negative social media posts or reviews can be costly for your company. Understanding how your customers feel about your brand or products is critical. This data can assist you in improving the customer experience or identifying and correcting problems with your products or services. A positive customer service experience can make or break a business. Customers expect their concerns to be addressed quickly, efficiently, and professionally. Sentiment analysis can assist companies in streamlining and improving their customer service experience. How customers feel about a brand can impact sales, churn rates, and how likely they are to recommend that brand to others. In summary, sentiment analysis lets you understand how customers feel about a product, event, brand etc. It is not enough to run a count-based analysis; that is just scratching the surface. Consider the two scenarios below.

    Scenario 1: Mr A tells you that Mr B and Mr C said something about you. Intuitively, you might want to ask what was said, even if you eventually don't ask. You would want to know whether what was said about you was good or bad.

    Scenario 2: Datafest Africa held an event recently, and I'm sure some people are either in the process of scraping the data or have already done it. They should have at least 1,000 tweets. That's a minimum of 1,000 pieces of feedback. Reporting that 1,000 people tweeted about the event doesn't aid much with process improvements. It doesn't say much about what they did right. Those tweets could be either positive or negative. If I were on the team, I would want to know people's perceptions, not just the number-based metrics. For instance, say the team wants to hold the next event in another location. You can back up your choice of location with data by telling the group that we had lots of positive feedback from this location, based on tweets made there during the previous event. I'm not sure of the level of data you can scrape from Twitter, but I'm assuming the tweeter's location would be one of the fields. With sentiment analysis, you can take your insights a level higher.

    The question remains: I don't know Python, so how can I do this? The good news is you don't have to worry about a single line of code. Excel has an add-in that lets you do this. Let's go into Excel:
    1. Go to the Insert tab.
    2. Select Get Add-ins.
    3. Search for Azure Machine Learning and add the add-in.
    4. Once added, a pop-up section appears on the right side of your Excel. Select Text Sentiment Analysis.

    The image below comes up. The input range is the column containing the feedback/comments, while the output is where you want the analysis result. You get two results (two new columns): Sentiment, which can be positive, negative or neutral, and Score, which ranges from 0 to 1; the more positive a comment, the closer the number tends towards 1, and vice versa. Let's do a quick demo in Excel. Note that no model is 100% accurate; if a model is 100% accurate, there has been data leakage. Classification models, for example, aim to minimize false positives and false negatives as much as possible, not eradicate them. I changed the comments, and we have a more accurate sentiment. I want you to take your analysis a step higher and not just leave it at count-based metrics. Adios

  • Descriptive Statistics

    I’m here to help you avoid the mistakes I made early in my career. At some point, we were all told to learn stats. But just learning stats is not sufficient; interpretation and context are also critical. Consider the scenario below.

    The image below shows the annual salary of 7 employees in an organization. If you are asked what the average salary of this company is, is your first thought to calculate the mean? That would give a misleading answer, and I will show you why. If we were to calculate the mean of all salaries, we would get $86,429, which is not a reflection of the actual data. If you skim through the salaries, $86,429 is far-fetched, as no team member except “G”, an outlier, earns close to that amount. The median, in this case, is a better reflection of what the typical salary looks like. Note that the $86,429 can’t be said to be wrong. This is one way to lie with data. Say they were my team members, and I’m the employer, “G”. The labour union is protesting that my employees are being underpaid, and I’m being interviewed by one of the popular news channels. If I say the average salary in my organization is $86,429, their protest looks unjustified. The statement in literal terms is not wrong, but I just lied with data, because the single outlier of $400,000 means the mean is no longer an accurate reflection of the data. The distribution of the data is essential to the interpretation of your statistical analysis. I wrote about how the distribution of data dictates how I treat missing values here.

    Let’s dive into descriptive statistics. So what are descriptive statistics? Investopedia defined descriptive statistics as brief informational coefficients that summarize a given data set, which can represent the entire population or a sample of a population. Before we go on, let’s break down the difference between a sample and a population. Consider this scenario: as a result of food inflation, you are surveying the prices of food across the country. Your population is everyone selling food items in all markets (both major and minor). That seems unrealistic to achieve: going to all markets and asking every vendor about the prices of food items requires a hefty amount of resources. That’s where your sample comes in. A sample is just a subset of your population. Instead of going to all the markets in the country for your survey, you go to selected markets and vendors, not necessarily all vendors, and then use the data from that subset to make inferences about the entire population. For instance, I can go to selected markets here in Abuja and then use that data to say this is the average price of rice in Abuja. I don’t need to visit every nook and cranny of Abuja. Enough deviation; let’s get back on track.

    Descriptive statistics can be broadly grouped into the following:
    Measures of frequency — tell you how often something occurs, e.g. count, percentage and frequency.
    Measures of central tendency — capture the central aspect of the data, e.g. mean, median and mode.
    Measures of dispersion — capture how spread out or dispersed the data is, e.g. range, variance and standard deviation.

    Measures of Central Tendency
    This set of summary measures aims to summarize all the data in a collection using a single value corresponding to the middle or centre of the distribution. They include the mode, mean and median. The mean refers to the average of the data. The median of a set of ordered observations is the middle number that divides the data into two parts, while the mode is the most frequent number. If you notice extreme observations (outliers) in your data, then the median is a better summary statistic to report than the mean. Also, comparing the mean with the median tells you a bit about the distribution of your data. If mean > median, the data is skewed to the right (i.e. positively skewed). A quick hack for understanding skewness is the length of the tail: here, there is a long tail on the right side compared to the left. If mean < median, the data is skewed to the left (i.e. negatively skewed); there is a long tail on the left side compared to the right.

    Measures of Dispersion
    Earlier, we said this set of measures captures how spread out or dispersed the data is. To better understand this, consider the salaries of two different organizations. Let’s plot the salaries. Notice how the salaries in Firm 1 are more closely knitted together than in Firm 2, even though the mean and median are equal. That’s what the measures of dispersion help us understand, i.e. how spread out the values are. How can this difference in spread be converted into a descriptive statistic that means something? One method is calculating the range, which is simply the difference between the data’s maximum and minimum values. Firm 1’s highest and lowest salaries are $36,000 and $30,700, respectively, giving us a range of $5,300. The range of wages in Firm 2 is $16,200. A wider range denotes more data spread or dispersion.

    One more comparable measure follows the range: the interquartile range, or IQR. This describes the central 50% of the data, leaving 25% to the right and 25% to the left. The first quartile is the number that 25% of the observations fall at or below. The third quartile is the number that 75% of the observations are less than or equal to. Incidentally, the second quartile corresponds to the median. The zeroth quartile is the lowest value in the data, and the fourth quartile is the highest. The interquartile range, or IQR, is the third quartile less the first quartile. Why do we prefer the interquartile range over the range? Suppose you are working with sample data and trying to make inferences about the population the sample originates from. A sample’s range is not always representative of the population it comes from. On the other hand, the interquartile range of a sample is representative of the population’s interquartile range.

    Let’s look at a box plot, which is an excellent tool for displaying various descriptive statistics. Box plots are sometimes known as box-and-whisker plots. The two whiskers at the top and bottom indicate your data’s maximum and minimum values, respectively. The rectangle between the first and third quartiles is your interquartile range. The dot represents the mean, and the horizontal line within the box represents the median. As a result, you get a practical data summary in a single graphic. A box plot is handy when making comparisons—for example, the earnings of male and female football stars. From the box plot, we can deduce that football’s highest-paid female player makes almost as much as the sport’s highest-paid male player. However, the spread or dispersion of female stars’ earnings is greater than that of male stars. The range, i.e. the difference between the maximum and minimum, serves as proof, and so does the height of the rectangular box, the interquartile range. Another intriguing finding is that female stars’ mean earnings are higher than their median earnings, while for male stars the mean income is lower than the median income. This implies that, compared to male stars, the earnings distribution for female stars is significantly more right-skewed. A few female football players with very high salaries pull the mean above the median income.

    We have covered two measures of dispersion, the range and the interquartile range. Let’s see how they differ from the typical standard deviation. Consider the salary scenario for the seven employees below. The range and interquartile range describe the spread or dispersion of data as the difference between some high value and some low value: the range is the difference between the data’s maximum and minimum values, and the interquartile range is the difference between a high value, the third quartile, and a low value, the first quartile. The standard deviation, however, first determines the data’s mean, then calculates the deviation between each data point and the mean, and then combines these deviations into a single measure. N in the formula is the total number of observations: the standard deviation formula sums the squares of the deviations, divides by N, and then takes the square root of the result.

    Now we have calculated the standard deviation, but the number is useless without interpretation. For this, consider the salary data below. The standard deviation is calculated in Excel as shown. What does the $5,013 mean? To understand this, we need to recall the rule of thumb: approximately 68% of data lie within one standard deviation of the mean, and approximately 95% lie within two standard deviations of the mean. Therefore, our data shows that 68% of salaries were $33,500 ± $5,013 (one standard deviation from the mean), and 95% of salaries were $33,500 ± (2 × $5,013), i.e. two standard deviations from the mean. Variance, another measure of dispersion, is the square of the standard deviation. That marks the end of this piece. Adios
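    As a quick reference for the description above, the mean and the population standard deviation (the form that divides by N) can be written as:

        \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
        \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( x_i - \bar{x} \right)^2}

    Variance is then simply \sigma^2. Plugging the salary example into the rule of thumb: $33,500 ± $5,013 gives a 68% band of $28,487 to $38,513.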

  • Income (P&L) Statement in PowerBI

    One of the three statements used in corporate accounting and finance is the income statement. The statement shows the business’s revenue, gross profit, selling and administrative costs, other expenses and income, taxes paid, and net profit over a certain period. If I were an accountant, I wouldn’t want to create an income statement every month, quarter or year. How, then, can I make my life and work easy? PowerBI is the solution: create a report once and let it auto-refresh as you have more data. Download the sample data here.

    The following tables are needed for the income statement report and are loaded into PowerBI. The company expenses data looks somewhat like this, which is unsuitable for our data model; hence the need to unpivot. After unpivoting, the data is suitable for the data model. The final model looks like this. The format I usually stick to is dimension tables at the top with fact tables underneath, supporting tables at the bottom, and DAX measure tables always on the right. That way, I have a clear overview of all tables in my model.

    An income statement speaks to the organization’s revenue and expenses. The tricky part is merging the sales data and the company expenses data. Bear in mind that the two have different granularity, i.e. the sales data is on a day-to-day basis while the expense data is monthly. How, then, do we ensure they have the same granularity? Create a supporting table from the sales data. To do this, let’s first calculate our total revenue using the Sales table. Next, we create a supporting table from the sales table whose granularity is a monthly interval, using the DAX seen below to create a new table. The SUMMARIZE function is similar to SQL’s GROUP BY: it groups the specified table based on columns in that table or related tables. In our case, we are grouping the sales table by the sales channel and the Month & Year column on the calendar table. Now we have a sales table with the same granularity as the company expenses table.

    Then we combine both the sales data and the company expenses data into what will be our income statement data using the UNION function. The UNION function returns a table that contains all rows from both tables. The caveats with this function include:
    1. The two tables must have the same number of columns.
    2. Columns are combined by position in their respective tables, irrespective of the column content; hence the need for another SUMMARIZE to ensure the data merges appropriately.
    3. The column names in the returned table will match the names in the first specified table, i.e. the Company Expenses column names will be inherited.

    The result is seen below. The Type and Type Index columns were added using an IF statement; the reason I added these columns will be seen later on. If we look at the income statement template, we have the following items:
    Total Revenue
    Total COGS
    Total Gross Profit
    Gross Profit %
    Total Other Expenses
    Total Net Profit
    Net Profit %
    Since they are not included in our data, we have to write DAX measures to calculate them individually:
    1. Total Revenue
    2. Total COGS: in accounting, expenses are recorded as negative values, hence the need to multiply the calculation’s output by -1.
    3. Total Gross Profit
    4. Gross Profit %
    5. Total Other Expenses
    6. Total Net Profit: an addition is done here because COGS and Other Expenses have negative values.
    7. Net Profit %
    Now we have created all the individual totals.
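    The steps above are shown as screenshots in the original post; here is a minimal sketch of the SUMMARIZE and UNION patterns described, with assumed table and column names (Sales[Revenue], Sales[Channel], 'Company Expenses' etc.):

        Sales by Month =
        SUMMARIZE (
            Sales,
            'Calendar'[Month & Year],
            Sales[Channel],
            "Value", SUM ( Sales[Revenue] )
        )

        Income Statement Data =
        -- both tables must have the same number of columns, matched by position;
        -- the Company Expenses column names are inherited
        UNION ( 'Company Expenses', 'Sales by Month' )

        Total COGS =
        CALCULATE (
            SUM ( 'Income Statement Data'[Value] ),
            'Income Statement Data'[Category] = "COGS"
        ) * -1  -- expenses are recorded as negative values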
    We need to reference all these measures in a single DAX measure. We are almost there; all that is left is to calculate the actual values. The DAX below will help us achieve this. It returns the row item values on the income statement template. The recipe to prepare our income statement is fully ready, so let’s create our income statement.

    You know the sweet thing with PowerBI? You can run comparisons: this year vs last year, Q2 vs Q1. Let’s take a quick look at the DAX that lets us compare this year vs last year:
    1. Total Revenue LY
    2. Total COGS LY
    3. Total Gross Profit LY
    4. Gross Profit % LY
    5. Total Other Expenses LY
    6. Total Net Profit LY
    7. Net Profit % LY
    8. Actuals LY
    What is the difference between this year’s and last year’s performance? I know what you are thinking: why not use this? Sadly, you will get the error below. This is a result of the FORMAT function used in our DAX measure. What is the percentage of this difference? All that is left is to add these new measures to your table, and you are done. View the completed report here.

    Next year, next quarter, or next month, you don’t have to spend minutes or hours creating an income statement. All you need to do is ensure the data source is updated. Your finance manager no longer needs to wait days to get the report, and the award is yours. Congratulations on winning the employee of the month/year award.
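    For the year-over-year comparison described above, here is a minimal sketch of the standard pattern, assuming the measure names used in this post and a 'Calendar'[Date] column:

        Total Revenue LY =
        CALCULATE ( [Total Revenue], SAMEPERIODLASTYEAR ( 'Calendar'[Date] ) )

        Revenue vs LY % =
        -- build comparisons from the raw measures, not the FORMAT-wrapped Actuals
        -- measure: FORMAT returns text, which is what triggers the error mentioned above
        DIVIDE ( [Total Revenue] - [Total Revenue LY], [Total Revenue LY] )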

  • Cashflow Statement in PowerBI

    The cash flow statement is just one of the three financial statements. Investopedia defined the cash flow statement (CFS) as a financial statement that summarizes the movement of cash and cash equivalents (CCE) that come in and go out of a company. This is pretty straightforward compared to the income statement and balance sheet created in PowerBI. If you want to follow along, you can download the data used here. The tables required for the cash flow report are listed below. It is necessary to unpivot the data, as it is unsuitable for the existing model. The transformed data then looks somewhat like this.

    Remember, the goal was to build a single report comprising all three financial statements, not three standalone reports. The overall model looks somewhat like the one below. The model diagram shows no established relationship between the calendar table and the cash flow data table. Interesting? How do we establish a relationship, since the cash flow data is on an annual basis? Using the Year column will create a many-to-many relationship, which is not what we are interested in. What a headache. The solution to this headache is the TREATAS function, which creates a virtual relationship between the tables specified.

    A quick peek at the cash flow data reveals a column called the cash flow sub-category. Hence we need to write a DAX measure that calculates the cash value for each category. VALUES ( 'Calendar'[Year] ) is used because unique values are required. With the aid of the VALUES function, we have established a virtual one-to-many relationship between the Year column on the calendar table and the Year column on the cash flow data. (A sketch of this measure follows at the end of this post.)

    From the cashflow template imported into PowerBI, the cashflow sub-category has been circled in blue. The DAX created above will calculate the itemized cash balances for each of them. Rows 4–8, 13–15, 17–19, 23–25, and 27–29 will serve as filter context for the CS Values measure created. If you are unclear about how filter context works, I have explained that here.

    Since the itemized rows in the cash flow statement are sorted, next on the agenda are the totals and sub-totals. These each need their own measure, which will then be encompassed in a single DAX measure. The following measures are used to calculate the totals and sub-totals:
    1. Cash receipt from — operations
    2. Cash receipt from — investing
    3. Cash receipt from — financing
    4. Cash paid for — operations (note that this was multiplied by -1 because, in the data provided, the values were recorded with a positive sign, while in an accounting statement, outflows are recorded with a negative sign)
    5. Cash paid for — investing
    6. Cash paid for — financing

    The DAX measure below encompasses all sub-totals and totals. All that is left is adding the necessary dimensions and the DAX measure onto a table, and we have completed the cash flow statement. You can be more productive at work by developing these financial statements in PowerBI and setting up a refresh schedule. Adios
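    As referenced above, here is a minimal sketch of the TREATAS measure and the totals pattern. Table, column and category names ('Cashflow Data', [Sub-Category], [Value]) are assumptions for illustration:

        CS Values =
        CALCULATE (
            SUM ( 'Cashflow Data'[Value] ),
            TREATAS ( VALUES ( 'Calendar'[Year] ), 'Cashflow Data'[Year] )
        )

        Cash Receipt From Operations =
        CALCULATE ( [CS Values], 'Cashflow Data'[Sub-Category] = "Receipt - Operations" )

        Cash Paid For Operations =
        -- the source data records outflows with a positive sign, hence the -1
        CALCULATE ( [CS Values], 'Cashflow Data'[Sub-Category] = "Paid - Operations" ) * -1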

  • How Dynamic Metrics Revolutionize Data Analysis in Business Reports

    I consider dynamic metrics to be metrics calculated based on the selection of the report user. Picture this: your company has tasked you with creating a report that gives a comprehensive overview of its business performance, but with a catch — you only have three metrics to show: revenue generated, the associated costs of running the business, and profit over time. But what if you could make the report dynamic, so that the metric shown adjusts to the user’s preference? That’s where dynamic metrics come into play.

    Why dynamic metrics? Consider the following scenario. Company A, a client of yours, generates thousands of rows of data daily and has tasked you with creating a report detailing an overview of the business’s performance. Say you have just three metrics to show in your report:
    Revenue generated
    Associated costs of running the business
    Profit over time

    For a report that needs to be kept updated, performance is something to always look out for while developing. One of the many things that slow down a report is the number of charts. Note that it is not the only thing, and other things can be done to improve performance. But in this scenario, the report has to connect to a large volume of data, and I want to minimize the number of seconds the charts add to the report’s load time without dialling down on the information reported. That’s where dynamic metrics come in, because the more charts there are, the more seconds are added to the report’s load time.

    To follow along with creating dynamic metrics, download the data used here. Below is the data model for the report. The metrics table contains the metrics we would like to show dynamically, while the Key and Supporting Measures tables hold our DAX measures. You can either create the metrics table via the Enter Data feature or import the table from an Excel sheet; I prefer the Enter Data feature. Below is the metrics table used. The index is for sorting on the report page.

    Then, we must create a measure that refers to whichever item is selected from the metrics table. This is done using the SELECTEDVALUE() function. Now, we create the individual measures that calculate the metrics:
    1. Total Revenue
    2. Total Cost
    3. Profit

    The recipe is almost done. All that is left is to incorporate the three measures into a single DAX measure, as seen below. The SWITCH function is a nested if-statement, but cleaner and easier to read. Recall that [Metric] refers to the metric selected on the metrics table. The DAX says that if Revenue is selected, use the revenue measure; if Cost is selected, use the cost measure; if Profit is selected, use the profit measure; and if nothing is selected, default to the total revenue measure. Pretty straightforward, right?

    How, then, do we use this measure in the report? Let’s go into PowerBI to see this. The beauty of this is that it works like any other DAX measure. For instance, you can still make month-on-month comparisons, year-on-year comparisons etc.:
    1. Last year’s value for the selected metric
    2. Periodic comparisons: note that the month-on-month growth here is this month (Nov 2022) vs the same month last year (Nov 2021), not this month (Nov 2022) vs the previous month (Oct 2022). The month column on the calendar table provides context to the measure.
    3. Daily average run rate
    You can see all three measures are based on the selected metric, meaning the calculations will be done based on the chosen metric.
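    Here is a minimal sketch of the SELECTEDVALUE/SWITCH pattern described above, assuming a Metrics table with a [Metric Name] column and the three base measures:

        Metric = SELECTEDVALUE ( Metrics[Metric Name] )

        Selected Metric Value =
        SWITCH (
            [Metric],
            "Revenue", [Total Revenue],
            "Cost", [Total Cost],
            "Profit", [Profit],
            [Total Revenue]  -- default when nothing is selected
        )

        Selected Metric LY =
        CALCULATE ( [Selected Metric Value], SAMEPERIODLASTYEAR ( 'Calendar'[Date] ) )

    Because Selected Metric Value behaves like any other measure, the periodic comparisons and daily average run rate mentioned above are built on top of it in the usual way.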
    You can view the completed report here. That would be all. Adios.

  • Moving your machine learning model into production

    As machine learning enthusiasts, we were all told to go to Kaggle and engage in competitions. But after that, what next? Should it end at just making predictions on the test data and submitting them? In real-life scenarios, will we always be supplied with Excel sheets to make predictions on? What do we need to do? Read along to find out.

    Background Information
    Kowepe bank of Nigeria conducted marketing campaigns via phone calls with their clients. These campaigns prompt their clients to subscribe to a specific financial product of the bank (a term deposit). The survey was conducted with selected individuals that the bank feels best represent their clients, to minimize the cost of a complete client survey. You have been contacted as a data scientist to find patterns and build predictive models on this dataset to forecast the percentage of potential customers for the bank’s financial products.

    Objective: identify customers that will subscribe or not.

    To start, you first import your data. Next is a bit of data exploration. Check for missing values. Coincidentally, everyone filled out the survey entirely, but this is computer-generated data; it doesn’t depict real-life scenarios, where you might well have missing values. Please read here on how I handle missing values. The essence of exploratory data analysis (EDA) is to understand your data, and understanding your data includes understanding its distribution. To understand the distribution, we need to run a few statistics. I wrote about the interpretation of these statistics here. Our descriptive statistics show the presence of outliers. This is better seen visually, and the code below does that for us. I’m a lazy guy: rather than write separate code to plot the distribution of each variable, I wrote a for loop that iterates through each of the items in the numeric_features list created above and then builds the boxplot for each. I did the same for my categorical variables, but with a bar chart. Another quick hack for EDA is to use pandas profiling, which creates something like a dashboard for your EDA.

    Not all columns are necessary for machine learning, even though they are available. You need a way to identify which are relevant to your model. One way is correlation, i.e. keeping the input variables that are correlated with your target variable. Rather than use my intuition, I’d rather give statistics control. Let’s run a quick check on the data type of all variables. About 11 columns have the object data type, meaning they are categorical variables, but correlation doesn’t work with strings. Machine learning models can’t be trained on string variables either, so we need to convert them into numbers. For the conversion, LabelEncoder() is employed. This assigns numbers, starting from 0, to the elements of each categorical variable. Why didn’t we replace them one by one with the replace method? Any guesses? If you are moving your model into production, any data processing done before training the model also has to be done in production. For example, one of the categorical variables is marital status (married or single). Algorithms can’t work with strings, so we convert them to 0 and 1. In production, you can’t ask users to input 0 or 1 as their marital status, so you would also have to write a line of code to replace the input in production, and the lines of code soon become bulky. But with LabelEncoder() assigned to a marital_encoder variable, the mapping is both applied and stored; hence, in production, I only need to call the marital_encoder variable. See the data before encoding and after encoding. If we check the data info, only one column still has the object type, customer_id, and it will be dropped.

    Earlier, I said correlation would determine which columns are dropped, so let’s plot a correlation matrix. Looking through the matrix for what is or isn’t correlated seems like a lot of work, and as I said, I’m a lazy guy. So let’s define a function that lists the columns whose correlation score is greater than a threshold we specify. We are setting a threshold of 0.01. The score is low, right? Now that we have identified which columns have an existing relationship, we move to splitting the data for training the model. We first separate the features (x) from the output (y), then split the data into train and test sets, and define a function to test the performance of each model.

    Then the sweet part: model training. The first model is Logistic Regression. We make predictions and test performance. Please take a quick look at the confusion matrix and try to understand what it is saying. We have all heard of True Positive, True Negative, False Positive and False Negative. What do they mean? Let’s use the image below to break it down:
    True positive — a lady who is pregnant is correctly diagnosed as pregnant.
    False positive — a man is diagnosed as pregnant, which doesn’t make complete sense. Also known as a type 1 error.
    False negative — a pregnant lady is incorrectly diagnosed as not pregnant. Also known as a type 2 error.
    True negative — a man is diagnosed as not pregnant, which makes perfect sense.
    The goal is to minimize both type 1 and type 2 errors in your model. The same train-then-evaluate cycle is repeated for the other models: Random Forest, Support Vector Machine (SVM), KNN, Decision Tree and Gradient Boost Classifier. Out of all the models, the Gradient Boost performed best, so we save that model. Notice that the encoders created during preprocessing are saved along with the model.

    Let’s test the model on data it has not seen before. For this, we load back the saved model, then load the test data. The same set of columns used in creating the model is needed for testing it. This is the exciting thing about not using replace(): if we had used it, we would have had to rewrite the replacement for each column again. The test data has been transformed: no strings, just numbers. Then we simply call model.predict().

    We’ve been able to export the model and use it on data it has not seen before. But it doesn’t end here. A model left in your notebook is of little benefit. The goal of building a machine learning model is to solve a problem, and an ML model can only do this when it is actively used in production. So how do we do this? For educational purposes, you can always use Streamlit. I’ve hosted my machine learning apps on Streamlit for over a year, and it’s free. Let’s take a quick peek at how I moved a model into production. First, we load the essential libraries, then define a function to load the model and encoders. With the model sorted, we create the layout of the ML app, starting by defining our categorical variables. In a production environment, users won’t select 0 or 1 as marital status. We have grouped our variables: socio-economic variables, contacts, and others. Streamlit gives you several options when designing the front end, but I have stuck with the selectbox (dropdown) and slider; you can play around with others. Once a user selects an option, that selection is stored in the appropriate variable. But if the user selects marital status as single, your model doesn’t understand “single”; it only understands 0s and 1s. Hence the need for the transformation, which is the essence of the encoder we loaded with the model. The rest is just me designing the front end. Upload your files to GitHub, connect them to Streamlit, and that’s it: you have moved your model from your notebook into production. You can view the machine learning app here and the notebooks here. Those Kaggle competitions shouldn’t end in your Jupyter notebook anymore. Move them into production. Thanks for coming to my TedTalk. Adios

  • Create a Star/Snowflake Schema and DAX measures in Excel

    You are not wrong if you call PowerBI an upgraded Excel. The major components of PowerBI are Power Query, Power Pivot and Power View: Power Query is used for ETL purposes, Power Pivot is used to establish relationships between tables, and Power View aids visualization. Little did you know that these components have also existed in Excel since Excel 2010. Safe to say, underrate Excel at your own peril. Two things we will be going through are:
    1. Create a data model
    2. Write DAX measures
    The data used can be downloaded here.

    To create the date table, the following M code was used (download here). Power Query was used to add the date table. To do this, go to the Data tab and select Get Data > From Other Sources > Blank Query. Note: I’m using Office 365. On selecting Blank Query, the Power Query window opens. Click Advanced Editor and paste the M code, then include the start and end dates. Do a Close & Load, and we have our calendar table.

    The data model contains 9 tables, including the calendar table. How, then, do we establish relationships between these tables? Here comes Power Pivot with the solution. If you don’t have Power Pivot enabled, here’s a quick guide: while Excel is open, select File > Options, then select Add-ins. At the bottom, where we have Manage, click the dropdown and select COM Add-ins. Ensure Power Pivot is ticked and then select OK. Power Pivot will be added to your Excel.

    Create Data Model
    Let’s go back to establishing relationships between the tables. To add tables to the data model, select any cell within your table and then click Add to Data Model. Once done, the Power Pivot window comes up, as seen in the image below, indicating that the table has been added. Close the window and repeat this step for the subsequent tables. Once all tables have been added, select Diagram View to link the tables together. See the tables in diagram mode. Note that the essence of this article is not to explain what data models are. All that is left is to drag and drop related columns onto each other, and we have an established model. To use your data model for analysis, click on PivotTable in Power Pivot. It gives you several options: you can create pivot tables or single/multiple charts. Let’s go for a pivot table. On selecting that, our model gets loaded while retaining the relationships between the tables. Let’s run a test and see the outcome. The image below is a table containing variables from 3 different tables: Year from the calendar table, category name from the category table and quantity from the order details table. Simple and straightforward, right? Nothing complicated.

    DAX Measures
    The beauty of establishing relationships is that, rather than using generic columns, you can write DAX measures that use the relationships between the tables. To do this, select Measure > New Measure. On selecting it, you get the dialogue box below; input the details. Note that total sales were created in two ways, just to show that relationships created in Excel work fine with DAX, as both measures return the same output. Let’s do one more thing: create a sales column in the order details table without using DAX, and check whether they all return the same values. Exciting, yeah!!! To add this column, select the Manage icon, which takes you back to the Power Pivot window. Select Data View, click Add Column, type in your column name and the sales formula in the cell below, then load the model by clicking on PivotTable once more.

    Now we have sales calculated in 3 different ways. Let’s see if the output remains the same. From the image above, we can see that the outcome is the same irrespective of the approach taken to calculate sales. You can even write a CALCULATE function; the same DAX that works in PowerBI also works here (output below). You can even write variables (output below). Punch your calculator and see if the YOY growth is accurate. Excel is an amazing tool, and it is now way more than what it used to be. Like I said, underrate Excel at your own peril. I hope you have picked up one or two things. See you next time.
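    The measure screenshots aren't reproduced here, so below is a minimal sketch of the kind of DAX described above. Table and column names ('Order Details', Products[Unit Price], 'Calendar'[Date]) are assumptions:

        Total Sales 1 :=
        SUMX (
            'Order Details',
            'Order Details'[Quantity] * RELATED ( Products[Unit Price] )
        )

        Total Sales 2 := SUM ( 'Order Details'[Sales] )  -- uses the calculated column instead

        YOY Growth :=
        VAR current_sales = [Total Sales 1]
        VAR last_year_sales =
            CALCULATE ( [Total Sales 1], SAMEPERIODLASTYEAR ( 'Calendar'[Date] ) )
        RETURN
            DIVIDE ( current_sales - last_year_sales, last_year_sales )

    Both sales measures return the same output because Power Pivot resolves RELATED() through the relationships created in Diagram View, exactly as PowerBI would.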

  • Descriptive Statistics: Covariance and Correlation

    Last time out, we talked about the measures of central tendency and dispersion. We will now focus on measures of association, i.e. covariance and correlation. A scatter plot is a chart that helps us understand the relationship between two variables. Consider the chart below. From the chart, we can say that as cost increases, revenue also increases. What if we want to quantify this relationship, i.e. put some numbers to it? That’s where the measures of association come in. They tell you the direction of the relationship and quantify its strength. What are measures of association? These are any of a number of coefficients that can be employed to quantify the relationship between two or more variables.

    Covariance
    Covariance measures the direction of the relationship between two variables. Variables move together when the covariance is positive, i.e. as one increases, so does the other, and inversely when the covariance is negative. Let’s see how to do this in Excel. One drawback of covariance is that it is greatly affected by the units of measurement. As a result, the covariance measure is not the best choice if one wishes to determine how strongly two variables are related. The measurement is acceptable as long as we are only interested in the direction of the relationship: in other words, what happens to the other variable when one rises or falls? The covariance, however, cannot be read directly for the strength of the relationship. For that, we need another metric for measuring the relationship between two variables: the correlation coefficient. The values for covariance range from -∞ to +∞.

    Correlation
    Similar to covariance, correlation is used to quantify the relationship between two variables, but unlike covariance, it is not affected by the unit of measurement. The values here range from -1 to 1, i.e. the stronger the positive relationship, the closer the number tends towards 1. Correlations of >+0.5 or <-0.5 are often regarded as indicating a strong positive or strong negative association between two variables.

    It is crucial to realize that these metrics show us how two variables vary with one another, i.e. how an increase or reduction in one variable relates to a change in the other variable. Say we examined the heights and weights of a few Olympic competitors and concluded that there was a significant positive relationship between an athlete’s weight and height. This correlation does not, however, demonstrate that an increase in height is the cause of weight increase. This is where causality, a different idea from correlation, enters the picture. Unfortunately, this distinction is frequently overlooked. To establish causation, we must rule out other variables that might have caused the variable in question to change. For example, there is a strong relationship between smoking and lung cancer, but that correlation alone does not prove that smoking causes cancer. Over time, however, meticulously controlled research has demonstrated that the link between smoking and cancer is more than a simple correlation: it has been conclusively proven that smoking does cause cancer. Studies had to account for various potential cancer-causing factors to demonstrate a clear causal relationship between smoking and cancer. I hope this has been informative. See you next time. Adios.
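    As a quick reference for the two measures above, the population covariance and the correlation coefficient (Pearson's r) can be written as:

        \operatorname{cov}(X,Y) = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}), \qquad
        r = \frac{\operatorname{cov}(X,Y)}{\sigma_X \, \sigma_Y}

    Dividing the covariance by the product of the two standard deviations is exactly what strips out the units and confines the result to the range -1 to 1. In Excel, COVARIANCE.P and CORREL compute these directly.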

  • Use Business Days to Calculate Revenue Projection

    I made a PowerBI report for a client, and they requested a revenue projection. Sometimes, business problems require simple solutions; you don’t always have to take the most complicated approach. A more complex solution would have been opening my Jupyter notebook to do a model.fit and model.predict. The client’s data source is live data. Hence, the projections needed to be recurrent, i.e. it is currently the month of October, a projection is calculated, and as the days proceed, the revenue generated so far is factored in when recalculating the monthly projection. As new data is added, projections are recalculated. How, then, do we achieve this? Deploy DAX. That’s where the magic lies. The critical things needed include:
    1. Calendar table
    2. Public holidays table
    3. Working days to date, exclusive of public holidays, i.e. how many days count as business days from the beginning of each month to the current date, excluding public holidays
    4. Total working days, i.e. how many days count as business days from the beginning of each month to the end of that month, excluding public holidays
    5. Revenue budget (a yearly budget was provided)
    Let’s write the DAX in PowerBI.

    Data Model
    The model used looks somewhat like this. The calendar table was created using the CALENDARAUTO() function, and I added other columns. The public holiday table was provided by the client.

    Total Working Days
    Let’s get our total working days using the following DAX. The first part of the DAX measure, the variable calendar_working_days, counts the number of days from Monday to Friday on the calendar table, excluding weekends but including public holidays that fall on weekdays. The second part of the DAX measure, the variable public_holiday_working_days, counts the days from Monday to Friday on the public holiday table, excluding weekends. The RETURN statement returns the working days exclusive of public holidays. (A sketch of this measure appears at the end of this post.)

    Working Days to Date
    The DAX here works similarly to Total Working Days, the only difference being that the calendar table is filtered not to exceed the current day. You can view the output of both measures for 2022 below. Today is a Saturday, which isn’t considered a working day, so you have zero. October has 21 working days, but the 3rd of Oct. is a public holiday on the public holiday table, so you have 20.

    Revenue Projections
    The following conditions were applied when calculating the revenue projection. If the month exceeds the current month, budget = projection, i.e. as of today, the current month is October, so for Nov and Dec, budget = projection. If month ≤ current month, use the following formulas:
    Daily Average = Total Revenue / Working Days to Date. For example, say today is the 1st of Oct., it is a working day, and revenue of $5,000 was generated. The daily average will be 5,000 / 1 = $5,000.
    Actual Projected = Daily Average * Total Working Days. In Nigeria, October has two public holidays that fall on a Monday; hence, the total working days will be 19. From the example above, the actual projection will be 5,000 * 19 = $95,000, i.e., by the end of October, the expected revenue should be around $95,000. Although, as the days proceed, this projected revenue can fall or rise, as it depends on how much is generated. Below is the resulting DAX measure.

    Plot Twist
    The client returned and said: Abdullah, the current day should only be counted as a business day at the close of work, i.e. if Monday is not a public holiday, it should be added as a business day only at the close of work. The DAX below has been adjusted to meet that requirement; the total working days measure remains the same, as does the projection measure. Based on the readjusted DAX, the working days to date for October will remain zero until 4 pm on Tuesday, since Monday is a public holiday and a business day is only counted for that particular day at the close of work, which is 4 pm in this case. That, my friend, is how I tackled a real-life situation dealing with a revenue projection.
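    The measure itself is shown as a screenshot in the post; below is a minimal sketch of the Total Working Days logic described above. Table and column names ('Calendar'[Date], 'Public Holidays'[Date]) are assumptions:

        Total Working Days =
        VAR calendar_working_days =
            COUNTROWS (
                FILTER ( 'Calendar', WEEKDAY ( 'Calendar'[Date], 2 ) <= 5 )  -- Mon-Fri only
            )
        VAR public_holiday_working_days =
            COUNTROWS (
                FILTER ( 'Public Holidays', WEEKDAY ( 'Public Holidays'[Date], 2 ) <= 5 )
            )
        RETURN
            calendar_working_days - public_holiday_working_days

    Working Days to Date follows the same shape, with an extra 'Calendar'[Date] <= TODAY() condition in the first filter so the count never runs past the current day.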

  • Cross-Sell Analytics

    My discussion with a friend led to a story about how a store used analytics to boost revenue by placing beer close to diapers. The study was done by a retail grocery store. The findings were that men between 30 and 40 years of age, shopping between 5 pm and 7 pm on Fridays, who purchased diapers were most likely to also have beer in their carts, which motivated the store to move beer close to diapers. This is cross-sell analytics, i.e. understanding what other product a customer who bought product A also bought.

    So what is cross-selling? Adobe Business defined it as encouraging customers to purchase products or services in addition to the original items they intended to purchase. Often the cross-sold items are complementary to one another, so customers have more of a reason to buy both. By the definition above, a cross-sell analysis aims to identify which products customers can be encouraged to purchase in addition to their initial purchase. How do we carry out this analysis? Let’s go into PowerBI to see this.

    The model is a pretty straightforward one. You can download the data I used here if you want to follow along. Let’s get back on track. Remember, cross-sell analytics is about identifying what products can be recommended to a customer based on an initial purchase, i.e. we have to compare the products in the inventory against each other. Hence we need two separate product tables. My product table is just a single-column table, so creating another one is pretty straightforward. You can also load in two separate product tables; it doesn’t matter how it’s done. Now our model looks like this. You might be tempted to create a relationship between the comparison product table and the sales table. Let me leave that as a mystery so you can discover what would happen if you did, but do that after creating your measures so you can see the before and after.

    The model is ready. All that is left is to write the DAX measures. First, we need to calculate the number of customers that bought a product. Then, customers who bought more than one product, i.e. product A and any other product. Let’s break down this DAX together (a sketch of the full measure appears at the end of this post). Lines 2–3 create a virtual table listing the customers who bought a product, i.e. for product A, the customers who bought product A; the product column adds context. Lines 5–13 also create a virtual table of customers who bought certain products. By default, DAX measures use the relationships established on the Model page unless you override them. The ALL function here is used to override the relationship between the products and sales tables, i.e. we are telling PowerBI not to use the products–sales relationship but to use the connection specified in the TREATAS function instead. Lines 15–19 are pretty straightforward. Lines 21–31 are where we specify what to return, and this is where the magic happens.

    Remember, lines 2–13 created two virtual tables. Consider this example: if we are trying to determine how many customers bought products A and B, we need a list of customers that bought A and another list of customers that bought B, and then compare both lists to identify the customers on both. Customers on both lists are the ones that purchased both products. That is what the INTERSECT function does here: it compares both virtual tables and then creates another virtual table with the list of customers present in both. If you understand SQL queries, think of it like an INNER JOIN. If you don’t, see the Venn diagram below. Yellow is for customers who bought product A, while blue is for customers who bought product B. Green is for customers who bought both products. The INTERSECT function gives you the green, i.e. the list of customers who bought both products. The INTERSECT function is then wrapped in COUNTROWS because we are interested in the number of customers, not the list of customers.

    Since we have everything in place, let’s use the measures in PowerBI. I’m a fan of percentages, as they are way easier to picture. You have built the table; how do you now run your analytics? Let’s go into PowerBI one last time. You can see you don’t need to start worrying about Python code to do this. PowerBI is my personal nuclear weapon; it can do and undo. You can view the full report here. See you next time. Adios
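    The measure itself appears in the post as a screenshot (the line numbers above refer to it); here is a minimal sketch of the same INTERSECT/TREATAS pattern, with assumed table and column names (Sales[Customer ID], 'Comparison Products'[Product Name]):

        Customers Who Bought Both =
        VAR customers_of_selected_product =
            VALUES ( Sales[Customer ID] )
        VAR customers_of_comparison_product =
            CALCULATETABLE (
                VALUES ( Sales[Customer ID] ),
                ALL ( Products ),  -- override the products-sales relationship
                TREATAS (
                    VALUES ( 'Comparison Products'[Product Name] ),
                    Sales[Product Name]
                )
            )
        RETURN
            COUNTROWS (
                INTERSECT ( customers_of_selected_product, customers_of_comparison_product )
            )

    The percentage version would simply divide this count by the number of customers who bought the selected product.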

  • Colour combinations for PowerBI Reports

    Had a chat with a friend, and the issue of colours came up. That brought about the need for this article. Some of us are bad with colours; that’s who we are. I am one. The best we can do is find a way around this with technology. I believe that if there is a problem technology can’t solve now, it’s just a matter of time before something is created or invented. So, what do you do if you are bad with colours like me? The answer is simple: deploy technology to solve the problem. You don’t need sleepless nights worrying about which combination of colours is suitable for your report. There are two websites you can deploy for this.

    Some people have the gift of mixing different colours. If I come across one of their reports and I love the colour combination, I either save it as an image somewhere or take a screenshot. Then what next? Take the image to either of those websites and get the colour codes from there. Straightforward. People post dashboards almost every day; I let them worry about colour combinations for me. There is no award for best colour combination. Let’s take a quick tour of how I go about it. For now, we will use Google search.
    Step 1 — Search for a lovely report online.
    Step 2 — Upload the saved image and extract the colour codes.

    You can also create your own colour combination. Note that while using coolors.co to create your own matching colours, you don’t need to lock in two colours; you can lock in just one, and it will suggest four other colours that match the one you locked in. You can also decide to use a colour picker.

    Do you know the most exciting part? You don’t need to go through this step of selecting colours every time; you can create a reusable theme that you can always import. This is how. First, export the palette: select Code and download. Open the downloaded file, and you have something like this. What we need are these sets of numbers. Here is the theme template file you will edit and load into PowerBI: PowerBI theme template.txt. Paste the colours generated into the square bracket, then put a # sign in front of each colour code to make it look like this. By saving this, you’ve created a theme template that you can always reuse. (A sketch of what the edited file looks like follows below.)

    All you need to do is go to the View tab in PowerBI, select the drop-down arrow and click Browse for themes, then locate the theme template and select it. If you can’t find it, set the file type to All Files. Notice how the colours in my PowerBI are precisely the same as those extracted on the websites. For reuse purposes, you can now always import this theme and use the same set of colours. With those two websites, you don’t need to worry about colours. If you see any design you feel is nice, screenshot it and save it somewhere; it doesn’t always have to be a dashboard.
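    For illustration, the edited theme file described above is a small JSON document. Below is a minimal sketch of its shape; the "name" is arbitrary and the colour codes are placeholders for whatever your palette generated:

        {
            "name": "My Reusable Theme",
            "dataColors": [ "#1B998B", "#2D3047", "#FFFD82", "#FF9B71", "#E84855" ]
        }

    Once saved, importing it via View > Browse for themes, as described above, applies the palette to the whole report.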
