Data Mining – Learn Data Mining, OLAP, OLTP and Clustering

Hi friends, let’s discuss the important concept of Data Mining and the four common tasks of data mining: data clustering, data classification, regression and association rule learning.

This is an important topic to learn and a popular career option these days. Lots of people are trying their luck in this field by mastering data analysis skills. It is a growing field, and openings for professional data analysts, business analysts and data scientists keep rising.

Hope you guys have checked my previous post on Malicious Programs/Malwares for answering the questions related to this section.

Future Scope of Data Analyst:

Apart from the future perspective, data mining is an important topic for the various government exam vacancies for computer science professionals. So, I have tried to collect every important part of the data mining topic in this blog.

If you guys want any other topic to be covered, please let me know by adding a comment. Now, let’s first understand what data mining and data analysis are.


What is Data Mining and the use of Data Mining?

Data mining is the process of extracting patterns from data. It is an important tool used by modern businesses to derive information from data. Data mining is currently used in marketing, profiling, fraud detection, scientific discovery, etc.

Tasks of Data Mining:

  1. Data Clustering: It is the task of discovering groups and structures in the data that are similar in some way. Data clustering is performed without using known structures in data.
  2. Data Classification: Data classification is the task of generalizing known structures to apply to new data. Common classification algorithms are:

     • Decision tree learning
     • Nearest neighbor
     • Naïve Bayesian classification
     • Neural networks
     • Support Vector Machines

3. Regression: With regression we attempt to find a function that models the data with the least error. There are different strategies for building regression models.
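As a small illustration, here is a sketch of simple linear regression (fitting y = a*x + b by minimizing squared error) in plain Python; the data points are made up:

```python
# Closed-form least-squares fit of y = a*x + b (simple linear regression).
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The slope that minimizes the squared error between predictions and data.
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

a, b = fit_line([1, 2, 3, 4], [2, 4, 6, 8])
print(a, b)  # 2.0 0.0
```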

4. Association Rule Learning: This learning is used to search for relationships between variables. A big example of association rule learning:

With the help of association rule learning, Amazon displays items frequently bought together as recommendations. This helps customers and increases sales.
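As a tiny illustration (a sketch with made-up shopping baskets, not Amazon's actual system), the two basic measures behind association rules, support and confidence, can be computed like this:

```python
# Support and confidence for one rule: {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # Of the transactions containing lhs, how many also contain rhs?
    return support(lhs | rhs) / support(lhs)

print(support({"bread", "butter"}))         # 0.5
print(confidence({"bread"}, {"butter"}))    # 0.666... (2 out of 3 bread baskets)
```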

Approaches to Data Mining Problems:

  1. Discovery of sequential patterns
  2. Analysis of patterns in time series
  3. Discovery of classification rules
  4. Neural Networks
  5. Genetic Algorithms
  6. Clustering and Segmentation

Goals of Data Mining and Knowledge Discovery:

  1. Prediction: Data mining can show how certain attributes within the data will behave in the future.
  2. Identification: Data mining can be used to identify the existence of an item.
  3. Classification: Data mining can partition the data so that different classes or categories can be identified.
  4. Optimization: Data mining can be used to optimize the use of limited resources such as time, space, money or materials to maximize output.

What is OLTP (Online Transaction Processing)?

In order to understand OLTP, it is very important to be aware of transactions and transaction systems. So, what is a transaction? What are the properties of a transaction system? Let’s analyze the theory of transactions and then we will cover OLTP.

Transaction and Transaction System:

A transaction is nothing but an interaction between different users or different systems or between a user and a system.

Transaction systems: Every organization needs some online application system to handle its day-to-day activities. Some examples of transaction systems are: salary processing, library, banking and airline systems.

Transaction Properties:

Every transaction follows the ACID properties. This is an important section, and government exams pick multiple questions from it.


Atomicity: This means a transaction should either completely succeed or completely fail.

Consistency: A transaction must preserve database consistency; it must transform the database from one consistent state to another.

Isolation: This simply means the transactions of one user should not interfere with the transactions of other users in the database.

Durability: Once a transaction is complete (committed), its changes should be permanently written to the database and visible to all transactions that follow it.

I hope the ACID properties are clear to you guys. Please let me know if you need more information on this with examples.
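Speaking of examples, here is a small sketch of atomicity in action using Python’s built-in sqlite3 module (the table and values are made up): a money transfer that "crashes" halfway is rolled back, so the half-done debit never sticks.

```python
import sqlite3

# Atomicity sketch: both updates inside the transaction commit together
# or are undone together.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT, balance INTEGER)")
con.execute("INSERT INTO account VALUES ('A', 100), ('B', 100)")
con.commit()

try:
    con.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
    raise RuntimeError("power failure mid-transaction")  # simulate a crash
    con.execute("UPDATE account SET balance = balance + 50 WHERE name = 'B'")
    con.commit()  # (never reached, the 'crash' happened first)
except RuntimeError:
    con.rollback()  # the half-done debit is undone

print(list(con.execute("SELECT balance FROM account")))
# [(100, ), (100, )]  (both balances unchanged, the transaction failed as a whole)
```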

Ever wondered how multiple transactions of different users can be processed simultaneously?? If yes, check the below magic:

Concurrency: Concurrency allows two different independent processes to run simultaneously and thus creates parallelism. This is what makes full use of the fast processing speed of computers.

Learn SQL Basics to get into Data Analytics

What is Deadlock?

Deadlock is a situation where one transaction is waiting for another transaction to release a resource it needs, and vice versa. This creates a circular wait and the system halts: each transaction waits forever for the other to release its resource.

How to handle Deadlock?

The simple rule for handling deadlock is: if a deadlock occurs, one of the participating transactions must be rolled back to allow the other to proceed. This way, the transactions can still complete. There are different kinds of schemes available to decide which transaction should be rolled back.

This decision depends on multiple factors, such as:

  1. The run time of the transaction.
  2. The data already updated by the transaction.
  3. The data remaining to be updated by the transaction.
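Before any rollback decision, the database first has to notice the deadlock. A common approach is a wait-for graph: an edge T1 -> T2 means "T1 waits for a lock held by T2", and a cycle means deadlock. A minimal sketch (the transaction names are made up):

```python
# Deadlock detection sketch: depth-first search for a cycle in a wait-for graph.
def has_cycle(waits_for):
    visited, on_stack = set(), set()

    def dfs(node):
        visited.add(node)
        on_stack.add(node)
        for nxt in waits_for.get(node, ()):
            # A neighbor already on the current DFS path means a cycle.
            if nxt in on_stack or (nxt not in visited and dfs(nxt)):
                return True
        on_stack.discard(node)
        return False

    return any(dfs(n) for n in waits_for if n not in visited)

# T1 waits for T2 and T2 waits for T1: the classic deadlock.
print(has_cycle({"T1": ["T2"], "T2": ["T1"]}))  # True
print(has_cycle({"T1": ["T2"], "T2": []}))      # False
```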

I have tried to cover this section completely, friends. Learn these concepts about data science and you will be able to solve every question related to the data mining section.

In order to master this section, please check my next post of the Previous Year questions of Data Mining section.

Not sure about Computer Networking concepts? Need to score good marks in Computer network section? If yes, do read my next post on Why Python is the best language for Data Science and Machine Learning. Till then, C yaa friends 🙂

Intelenet Global Services Analyst Interview Experience

Hola friends, I know this post is very late. Sorry for that. Check out my blog for real interview experiences at different companies. So, today I have brought you the Intelenet Global Services Analyst Interview Experience.


Data Analyst position in Intelenet Global Services:

The position was Data Analyst, which is counted as a sub-portion of the sexiest job of the 21st century. A data analyst’s work involves collecting, cleaning, transforming and analyzing data.

To Learn SQL Basics, follow the blog.

Job Description:

The job description for this post at Intelenet Global Services included Python, R or SAS and knowledge of statistics….yeah, the core things of data science. It also asks for knowledge of some BI tool like Tableau, which would be desirable for you 😉

How does Intelenet Global Services Recruit:

So, the recruitment process at Intelenet Global can start with a consultancy, an employee referral or through their company database, if your resume is that smart 😉

My friend got a call from a consultancy, and the consultant explained the job description at Intelenet Global Services. Intelenet is a business process outsourcing company offering services in multiple sectors, and this profile was Data Analyst. The position was for the UK shift and no cab facility is provided with it… Huh.. :/ When I heard this, it was a bit upsetting for me and my friend.

So, my friend agreed to the terms specified by the consultant, and they agreed upon a date for the interview.

The Interview Day:

My friend went to the Gurgaon Intelenet Global Services office. He found the office entrance a bit sad.. but the internal bay system and the people there were good. The first round was with a BI specialist, and she was very friendly. It was more of an HR kind of interview with some basic questions like:

  • Tell me about yourself
  • Your current role
  • The most complex project you have done so far and your responsibilities in it
  • What kind of tools you use
  • Your reason for a job change
  • Why Intelenet Global Services
  • Any BI tool experience (by the way, they use Tableau in most cases for reporting purposes – specific to the Gurgaon office). So if you know any BI tool, that will be an advantage for you 😀.

Then she will explain what tools they use and what the role is all about.

If HR finds your candidature relevant, she may call you for a coding round…

The Second Round of Interview with Intelenet Global Services

The second round was a coding round, and the questions were related to advanced Python concepts. It wasn’t a pen-and-paper test.

So, if you are someone who only knows how to read a simple CSV file as a dataframe, this profile might not be for you. If you have only ever worked with a single data type inside a dataframe column, this place might not be for you either.

Guys, you need to be very good with Python concepts if you want to clear their coding round.

Also, some basic knowledge of SQL is required to master the test. Don’t know anything about SQL?.. Aah.. don’t worry.. check my posts on SQL Basics and Advanced SQL functions to master data management. Any feedback is welcome 🤗

Here are some hints for this coding round:

1. Try to master how to read multiple files at the same time and append them to each other, like reading all the Excel or CSV files from a directory and then appending them into one file.
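A minimal sketch of this first hint using only the standard library; interviewers often expect the pandas version (concatenating the frames returned by read_csv), and the file names and contents below are made up for the demo:

```python
import csv
import glob
import os
import tempfile

# Read every CSV in a directory and append all rows into one list.
def append_csvs(directory):
    rows = []
    for path in sorted(glob.glob(os.path.join(directory, "*.csv"))):
        with open(path, newline="") as f:
            rows.extend(csv.reader(f))
    return rows

# Demo with two throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as d:
    for name, data in [("a.csv", "1,x\n"), ("b.csv", "2,y\n")]:
        with open(os.path.join(d, name), "w") as f:
            f.write(data)
    combined = append_csvs(d)

print(combined)  # [['1', 'x'], ['2', 'y']]
```

With pandas, the same idea is pd.concat(map(pd.read_csv, paths)).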

2. Extracting elements from a nested list: writing a generic function to extract nested list items, checking prime/even-odd, and splitting the data into two new lists based on that function.
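One way this second hint might be approached (a sketch; the prime check and the sample list are my own):

```python
# Generic recursive flatten for arbitrarily nested lists.
def flatten(items):
    for item in items:
        if isinstance(item, list):
            yield from flatten(item)
        else:
            yield item

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

nested = [1, [2, [3, 4]], [5, 6]]
numbers = list(flatten(nested))
# Split into two new lists based on the predicate.
primes = [n for n in numbers if is_prime(n)]
composites = [n for n in numbers if not is_prime(n)]
print(primes, composites)  # [2, 3, 5] [1, 4, 6]
```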

3. Handling a dataframe whose columns contain different data types separated by different delimiters… My friend was totally blown away by this question 😉
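For this third hint, a sketch of splitting one messy column on several delimiters at once with re.split (the sample values and the delimiter set are assumptions):

```python
import re

# A "column" where values are packed together with different delimiters:
# comma, semicolon and pipe. re.split handles all three in one pass.
raw_column = ["10,20;30", "40|50", "60"]

split_values = [
    [int(v) for v in re.split(r"[,;|]", cell)]
    for cell in raw_column
]
print(split_values)  # [[10, 20, 30], [40, 50], [60]]
```

In pandas, the same pattern works with the Series.str.split method using a regular expression.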

4. The fourth and last question was based on your knowledge of dictionaries and lists: writing a function that compares the keys of a dictionary with the values in a list and then deletes those values from the list.
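A sketch of one reasonable reading of that last question (the function name and sample data are my own): drop from a list every element that appears as a key in a dictionary.

```python
# Keep only the list values that are NOT keys of the dictionary.
def drop_dict_keys(values, mapping):
    return [v for v in values if v not in mapping]

prices = {"apple": 10, "mango": 25}
basket = ["apple", "banana", "mango", "kiwi"]
print(drop_dict_keys(basket, prices))  # ['banana', 'kiwi']
```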

So, guys, master your Python basic and advanced skills before appearing for Intelenet Global Services.

The sad part: the consultant had informed him that this position was for statistical analysis, clustering, classification, etc… However, there wasn’t any question related to these topics. But this doesn’t mean these topics will not be asked in further interview rounds, since my friend was asked to leave for the day… Aah.. I hate this line.

All the best guys if you are going to appear for the data analyst profile. I hope this post will help you 🙂 Let me know if you have some other questions here… I am here to help you..

Thank you

How to check Duplicates based on Multiple Fields – SQL or SAS

You have a big dataset that contains duplicates, but you need to check duplicates based on a combination of two or three columns. How can you check that using SQL or SAS? You can use the following approaches to check duplicates based on multiple fields or columns in SQL or SAS:

Query: There is a dataset called “Country_Polulation” with the following fields: 1. Name 2. Age 3. CityName 4. DOB 5. Year

You need to check for duplicate records – for example, if the Name and DOB of two records are the same, that is a duplicate. Follow these methods to check duplicates based on multiple fields:

Let’s say we are checking duplicates for the year 2018.

1. Using Having Clause:

The Having clause checks whether a Name and DOB combination occurs more than once; only those combinations pass from the inner query to the outer query. If you need to learn the basics of SQL, follow these articles: SQL Basics and Learn Basic Functions of SQL

Select A.Name, A.DOB
from Country_Polulation as A
where A.Year = 2018
and (A.Name, A.DOB) in
(
Select B.Name, B.DOB
from Country_Polulation as B
where B.Year = 2018
Group by B.Name, B.DOB
Having count(*) > 1
)
order by A.Name

(The row-value form (Name, DOB) IN (…) works in MySQL and PostgreSQL; on databases without it, rewrite the filter as a join on Name and DOB.)

2. Check Duplicates in SAS:

You will have to use a Proc SQL step for this task if you want to retrieve all the duplicate records. Basically, find the Name/DOB combinations that occur more than once, and join the table back to them to pull every matching row.

Proc SQL;
Create table Dupe_Records as
Select A.*
from Country_Polulation as A
inner join
( Select B.Name, B.DOB, count(*) as Quantity
from Country_Polulation as B
where B.Year = 2018
group by B.Name, B.DOB
having count(*) > 1
) as C
on A.Name = C.Name and A.DOB = C.DOB
where A.Year = 2018
order by A.Name, A.DOB;
Quit;

This Dupe_Records table will contain all the duplicate records, ordered by Name and DOB in ascending order. The reason we couldn’t simply put the Name and DOB conditions in a Having clause in SAS is that SAS deals differently with dates; DOB had a DATETIME18. format here when I tried the same.

3. Check Duplicates in Excel:

If the dataset is of small size, you can follow the mentioned steps:

  • Using Home Tab:

Home Tab -> Click on Conditional Formatting -> Click on Highlight Cells Rules -> Duplicate Values -> Choose the colors to highlight the duplicate text

  • Using Data Tab:

Select the data range or the columns in which you need to check duplicates -> Data Tab -> Click on Remove Duplicates in Data Tools

This will give you the option of selecting which columns to consider for dupes. Select the desired columns and the duplicates will be removed.

Hope this post will help you 🙂

What is Big Data and the V’s of Big Data


Big data has become quite a buzzword these days, and we all hear that Data Scientist is the sexiest job of the 21st century. But what is big data? What are the 4 V’s of Big Data, and what is the difference between data and big data (Data vs Big Data)? What is the job scope in the big data industry, and what is the future of jobs in it? Let’s understand this concept:

What is Big Data and Comparison with Data :

Big data is basically a dataset so large that it cannot be analyzed, maintained and studied by normal RDBMS and traditional computer systems. Thus, there is no specific definition of the size of big data. You must be wondering, then, why there is a need to understand big data concepts:

Need of Big Data Concepts:

Today we are living in a digital world wherein most things have become digital, from shopping online to studying online. All those eCommerce websites give you recommendations of products and provide discounts on those recommended products. Have you ever wondered how this happens? Here comes the magic of big data analytics 🙂

But before that, do you also want to learn the basics of SQL? If yes, check this tutorial here. Also, check interview tips for different analytical companies.

Characteristics of Big Data (4 vs of Big Data) :

There are 4 characteristics of big data that together create the difference and make Data vs Big Data a discussion:


The 4 V’s:

  1. Volume – The scale of the data in size. It is very important to analyze the volume so the hardware and software can be scaled and managed accordingly for faster data retrieval.
  2. Velocity – The speed with which the data grows each day; analyzing streaming data falls here.
  3. Variety – In the digital world, data can be text, images, audio and video, and this variety increases the uncertainty in the data.
  4. Veracity – The biases, noise and abnormality present in big data; here we analyze the uncertainty present in the data.

Why Choose JLL? JLL Python Interview Questions


JLL is a real estate company, and soon it will be dealing with new data science tools. One of my friends attended the JLL interview for a Python Data Scientist role. I am discussing all the Python interview questions here. But first of all, let’s see why JLL.

Want to know: Why Python is the Best Programming Language for Data Science and Machine Learning?

Why Choose JLL?

JLL (Jones Lang LaSalle) is an American professional services and investment management company specializing in real estate. It values the diversity of its employees and provides them with good growth opportunities. It is also aiming to become a customer data science company soon. So, why not join it 🙂


JLL Python Interview Questions and Interview Process

So, there are four to five rounds in total if you are going for the JLL Python interview. The very first round is a telephonic one wherein you will be asked about your current company and job role. The interviewer was very supportive in my friend’s case.

The interviewer will explain the job role and what a JLL Python data scientist does in the company. The interviewer will ask simple managerial-type questions, and you’ll have to rate yourself on particular skills.

Give correct ratings for your skills; the interviewer will make a note of them for future rounds.

Further questions will be about things like your notice period, and a further date will be intimated to you if the interviewer finds you a right candidate.

JLL Python Interview Questions and Answers

There will be a technical test based on SQL and Python. You’ll find the test simple if you know your skills :p. The interviewer will also ask you to write on the test how much time you took to complete it. It took my friend around 15 mins to complete the SQL test and 25 mins to complete the JLL Python interview questions.

JLL Python Technical Test Question and Answers

The python test questions were based on basic understanding of Python language. The questions include following things:

  1. Python Lists and Subsetting Lists
  2. Lambda functions in Python
  3. Self in Python
  4. Truncating in Python
  5. Sum of lists
  6. Sum of arrays
  7. Concatenations
  8. Data type operations
  9. Local variable vs dynamic variable concept

The Python test is an MCQ-style test with around 25-30 questions, wherein you either have to calculate the output of the given code or write the answer on your own. So, strengthen your skills.
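For practice, here are quick sketches of two of the topics above, list subsetting and lambda functions (the sample data is made up):

```python
# List subsetting: slicing out the top three scores.
scores = [40, 75, 90, 55, 80]
top_three = sorted(scores, reverse=True)[:3]
print(top_three)  # [90, 80, 75]

# A lambda used as a sort key: order names by length.
names = ["Priya", "Al", "Samantha"]
by_length = sorted(names, key=lambda s: len(s))
print(by_length)  # ['Al', 'Priya', 'Samantha']
```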

JLL SQL Test Questions and Answers

The JLL SQL test was very easy for my friend. According to him, the questions were based on the following:

  1. Select queries
  2. SQL Joins
  3. How to create views
  4. Selecting a particular output using where clause
  5. Inner queries
  6. Inserting values into table
  7. Trimming values
  8. Round function of SQL

Find out my next post on further JLL interview rounds if you want to grab a job at JLL.