SAS Dataset Basics : What is SAS and SAS Programming? Learn SAS Software

Learn SAS Programming and basics of SAS software.

Hi all, I know I am a bit late (actually very late) in posting this blog post. Today we will be learning SAS, and I will cover some basics. The very first questions are: what is SAS, and what is SAS software? We will go through a short introduction to SAS and then learn how to create a SAS dataset.

So, in order to learn it, there are multiple things we will cover, starting with what SAS is and why it is used.

What is SAS and SAS Software:

SAS is a statistical and business analytics package developed by SAS Institute. SAS software helps you derive insights from data and visualize the output for a client. Yeah, too much in one statement…?? 😉 No problem, we’ll understand all of these things here.

Firstly, let’s check what makes up SAS and what its primary components are:

Primary components of SAS Programming:

1. SAS Library – This is simply a SAS folder in which you keep all your SAS datasets, or any files you would like the system to read.

2. SAS Editor Section – Here you write your SAS code. SAS code is quite SQL-like. If you want to learn SQL as well, check my article: Learn Basics of SQL

3. SAS Log Section – This component gives you all the information about the changes your code has made, along with information about your imported SAS dataset: for example, the count of observations and variables, and the data type and length of each field. This component is very important for checking data type issues while merging data or transferring it from SAS to Teradata.

4. SAS Debug Section – This section enables the SAS developer to debug any issue. Various options let you investigate a problem by filtering and sorting records.

How to Create Library in SAS Programming:

– This is the very first step you will perform in SAS Enterprise Guide or Base SAS. There is a default SAS library called WORK, which keeps all temporary data. So, there are two types of libraries:

1. WORK
2. User Defined Library

Work – Keeps the temporary data of that particular SAS session only. Once you log off, all the data is erased. So, you had better save it to a user-defined library before losing it… LOL

User-Defined Library – You can create a library with the syntax given below:

Libname [Your Library Name] “Location where you want to save your files or SAS datasets”;

Example: libname AA “C:/Users/John/Documents/New Datasets/”;

This will save all SAS datasets in the New Datasets folder.

Now let’s learn the three major components of SAS code:

Steps of SAS Programming:

1. Data Step
2. Proc Step
3. Run Statement

So, for today, that is all, folks, on the basics of SAS Programming :p

Refer to my detailed blog on how to find duplicates in your data using SQL, SAS, or Excel. Also, if you want to learn the basics of Python, please refer to that post.

If you want to learn more about it, let me know in comments. Thanks 🙂

Why is Python the Best Language for Machine Learning and Data Science?

Python for Data Science and ML

Let us begin by reading a quote from Guido van Rossum on Python. It goes like this: “The joy of coding Python should be in seeing short, concise, readable classes that express a lot of action in a small amount of clear code — not in reams of trivial code that bores the reader to death.” So let’s see why Python is the best language for Machine Learning and Data Science.

First, let’s see why Python is best for Machine Learning and then let’s see why it is best for Data Science.

Why is Python Best for Machine Learning?

Python is the most popular programming language for Machine Learning (ML). Python for ML has the following advantages. First, Python has a great collection of libraries. Some of them are:

  • NumPy: This is used for scientific computing.
  • Scikit-learn: It has tools for data mining and analysis that make ML in Python remarkably usable.
  • Pandas: A package that provides developers with high-performance data structures and data analysis tools. Moreover, it helps developers reduce project implementation time.
  • SciPy: This is used for advanced computation.
  • PyBrain: It is used exclusively for Machine Learning.
  • Seaborn and Matplotlib: Seaborn is an excellent visualization library aimed at statistical plotting. Matplotlib is the most commonly used 2D Python visualization library.
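As a tiny, hedged illustration of the first library on that list (assuming NumPy is installed; the numbers are made up):

```python
import numpy as np

# A small array of example measurements (invented values).
data = np.array([2.0, 4.0, 6.0, 8.0])

# NumPy does the numeric heavy lifting: vectorized math, no loops.
mean = data.mean()          # arithmetic mean of the array
scaled = data / data.max()  # normalize every value to the 0..1 range

print(mean)    # 5.0
print(scaled)  # [0.25 0.5  0.75 1.  ]
```

The same vectorized style carries over to Pandas, SciPy, and the other libraries above.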

Secondly, Python enables a moderate learning curve. Python is very accessible and easy to learn and use. Moreover, it focuses on code readability and is a versatile and well-structured language.

Thirdly, Python is a general-purpose programming language, which makes it a good choice when project requirements go beyond data analysis alone.

Fourthly, Python is easy to integrate. Python fits well into business environments. Moreover, it is easy to integrate with lower-level languages such as C, C++, and Java. Likewise, a Python-based stack is easy to incorporate into the work of a data scientist.

Fifthly, less code. ML involves a huge number of algorithms, and Python makes testing them simpler for developers. It can implement the same logic with as little as one-fifth of the code required in other object-oriented programming (OOP) languages.

Sixthly, it is easy to create prototypes with Python. As Python requires less coding, you can create prototypes and test your concepts quickly and easily.

Seventhly, Python supports both object-oriented and procedural programming models. Significantly, classes and objects in object-oriented programming help model the real world, while functions in procedural programming enable you to reuse code.
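To make that contrast concrete, here is a minimal sketch showing both styles side by side (`Account` and `apply_interest` are invented names for illustration, not from any library):

```python
# Object-oriented style: a class models a real-world entity.
class Account:
    def __init__(self, owner, balance=0):
        self.owner = owner
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance

# Procedural style: a plain reusable function, no class required.
def apply_interest(balance, rate):
    return balance * (1 + rate)

acct = Account("Alice")
acct.deposit(100)
total = apply_interest(acct.balance, 0.5)
print(total)  # 150.0
```

The two styles mix freely in one program, which is part of Python's flexibility.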

Lastly, the advantage of portability. Code written in Python can run on other platforms, an idea summarized as Write Once, Run Anywhere (WORA).

All the above points in Python’s favor make it a vital part of the teaching curriculum in many Python training institutes. Now let’s see Python’s benefits for Data Science.

Why is Python Best for Data Science?

Let’s get straight to the tech part. The Python libraries for Data Science are similar to those for ML. They are NumPy, Matplotlib, Scikit-learn, Seaborn, and Pandas.

Basically, Python is good for Data Science for the following reasons,

  • Python is a flexible, open-source language.
  • With its simple, easy-to-read syntax, Python cuts development time in half.
  • Python powerfully enables data manipulation, analysis, and visualization.
  • It provides good libraries for scientific computation.

Let’s point out some solid factors that make Python a valuable choice for Data Science projects.

Less is More

Python requires fewer lines of code. It automatically infers and associates data types and follows an indentation-based nesting structure. With Python, there is practically no limit to the data processing you can do. For good hands-on practice, please check Learn Basics of SQL.
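As a small illustration of that conciseness (the values are made up):

```python
# Python infers types at runtime and uses indentation for nesting,
# so a filter-and-transform step fits in one readable line.
values = [3, "7", 10.5, "2"]          # mixed types, inferred at runtime
numbers = [float(v) for v in values]  # one line replaces an explicit loop

total = 0
for n in numbers:
    if n > 2:          # nesting expressed purely by indentation
        total += n

print(total)  # 20.5
```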

Moreover, Python is faster with the Anaconda platform. Hence it is fast in both development and execution.

Python is Compatible with Hadoop

Hadoop is a popular open-source big data platform, and Python’s inherent compatibility with it is another reason to prefer Python over other languages. Importantly, the Pydoop package offers access to the HDFS API for Hadoop and hence allows you to write Hadoop MapReduce programs and applications.

Moreover, Pydoop also offers a MapReduce API for solving complex problems with minimal programming effort. This API can be used seamlessly to apply advanced data science concepts like ‘Counters’ and ‘Record Readers’.

Python is Good for Data Visualization

APIs like Plotly and libraries like Matplotlib, ggplot, Pygal, and NetworkX can produce breathtaking data visualizations. Moreover, you can use TabPy to integrate with Tableau, and win32com and Pythoncom to integrate with QlikView.

Python has a lot of Deep Learning Frameworks

There are several deep learning frameworks, such as Caffe, TensorFlow, PyTorch, Keras, and MXNet. You can pick whichever of these tools fits your project and build deep learning architectures with a few lines of Python code.

Python is Good for Writing Scraping Software

Python has a variety of tools for scraping data and the largest community support for doing so. Moreover, you can choose among many different scraping ecosystems, such as Scrapy, BeautifulSoup, or Requests.

Scrapy, for example, can handle a lot of the dirty work for you by providing a structure for your spiders. Using Scrapy, you can write web spiders in minutes.
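Scrapy and BeautifulSoup are third-party packages; as a dependency-free sketch (not how a production scraper would be written), Python's standard-library `html.parser` can already pull links out of HTML:

```python
from html.parser import HTMLParser

# Minimal link extractor using only the standard library.
# Real projects would reach for Scrapy or BeautifulSoup instead.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":                    # every <a ...> tag
            for name, value in attrs:
                if name == "href":        # collect its href attribute
                    self.links.append(value)

html = '<p><a href="/page1">One</a> <a href="/page2">Two</a></p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/page1', '/page2']
```

Scrapy adds scheduling, crawling, and export pipelines on top of this kind of parsing.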

Python is Versatile

Being a general-purpose programming language, Python is a quick and powerful tool with a lot of capabilities. From building web services to data mining, Python is a programming language that helps you solve data problems end to end.

Python is Good for Building Analytics Tools

When it comes to creating a web service that lets others find outliers in their datasets, Python is a good way forward. This is even more important as self-service analytics becomes more widespread.

Python is Best for Deep Learning

Plenty of packages, such as Theano, Keras, and TensorFlow, make it really easy to create neural networks in Python. While some of these packages are being ported to R, the support available in Python is far more advanced.

For all the above reasons, Python classes conducted by private professional institutes emphasize the leading role of Python over other languages for use in Data Science.

Conclusion

Let us put the advantages of Python for Machine Learning and Data Science in a nutshell. They are:

  • A great library ecosystem
  • It has a low entry barrier
  • Flexibility
  • Platform independence
  • Readability
  • Great visualization options
  • Community support and
  • Growing popularity

Let us end this article with another attribute to Python. Google’s Peter Norvig has this to say about Python, “Python has been an important part of Google since the beginning, and remains so as the system grows and evolves. Today dozens of Google engineers use Python, and we’re looking for more people with skills in this language.”

If you are new to programming, then start with Learn Basics of SQL

SQL Basics and Different Databases

What is a Database?

A database is an organized collection of related information. In daily life we deal with lots of data, and with today’s internet technology more and more data is being produced. We have multiple database management systems available to manage, store, and update this enormous amount of data in a convenient way, e.g. Oracle, Sybase, and Microsoft SQL Server.

DBMS and SQL

A DBMS (Database Management System) is a collection of software tools used to manage, update, and retrieve the data in a database. SQL (Structured Query Language) is the language used to communicate with the DBMS.

DBMS

All queries here have been executed on Microsoft SQL Server Management Studio (SSMS) version 17.0. SSMS is a client tool, not the server; it is used as a tool to connect to the database server.

Settings: Local Host

Connect: Database Engine

Use SQL authentication username and password

SQL Databases:

In Microsoft SQL Server Management Studio you will find two types of databases:

  1. System Database
  2. User created Database

– A system database can’t be deleted.

SQL Command Types:
  1. DDL (Data Definition language) – Used to define/create database object
  2. DML (Data manipulation language) – Used to insert values, update and delete values from the database object created by DDL commands.
  3. TCL (Transaction Control language) – Used to control transactions through Commit and Rollback commands

SQL DDL Commands – data definition language (Create, Alter and Drop commands)

1. Creating a database:

Database can be created either using GUI or through SQL query in SSMS.

The Create statement is used for this purpose: Create [Database Object] [Database Object name]

Ex. Create Database db1 (this statement will create a database with name db1)

Whenever we create a database, two files are created with it: 1. an .MDF file (contains the actual data) and 2. an .LDF file (contains the transaction log).

2. Modify a Database:

The Alter statement is used to alter a SQL database object.

Alter Command:  Alter [Database Object] [Database Object name] Modify Col1 = Value

E.g. Alter Database db1 Modify Name = db2 (this will change the name of the database)

Renaming through a stored procedure: sp_renameDB [Old database name], [New database name]

e.g. sp_renameDB db1, db2

3. Dropping a Database:

The Drop statement is used to delete a database completely from the system (the .mdf and .ldf files are deleted with it).

Drop command: Drop [Database Object] [Database Object name]

e.g. Drop Database db1 (this will delete database db1)

Note – If a database is in use by any other user, it cannot be dropped. Make sure nobody else is using the database, or an error will be generated.

To resolve this, put the database in single-user mode with this command:

Alter Database db1 set SINGLE_USER with Rollback Immediate

(Rollback Immediate rolls back any open transactions so the database can be dropped immediately.)

SQL DML Queries : Insert, Update, delete

1. Create a Table:

Command: Create Table [table name] ([column name] [data type of column] [constraint])

e.g. Create table t1(ID int NOT NULL Primary Key, Gender nvarchar(20) NOT NULL)

This command will create a table named t1 with 2 columns, ID and Gender, of int and nvarchar data types respectively. nvarchar is a Unicode data type and stores 2 bytes per character, while varchar stores 1 byte per character.
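The commands in this article target SQL Server, but the shape of CREATE TABLE carries across engines. As a runnable sketch, here it is via Python's built-in sqlite3 module (SQLite uses TEXT rather than nvarchar; the table name t1 comes from the example above):

```python
import sqlite3

# An in-memory SQLite database stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t1 (ID INTEGER NOT NULL PRIMARY KEY, "
    "Gender TEXT NOT NULL)"
)

# Confirm the table exists by querying SQLite's catalog table.
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)  # ['t1']
```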

In order to store the table in a particular database use the following command:

Use [database name]

Go

Create table command….

Primary Key – Cannot be null and must be unique. It uniquely identifies each row in the table.

Foreign Key – It can contain null values, and it references a primary key present in another table (basically, the column in which it looks for a value). A foreign key is used to establish a relationship between two tables and to enforce database integrity.

Create a Foreign key relation –

Alter table [table name] add constraint [constrain name] foreign key(foreign key column name) references [PrimaryColumn Table Name] (primary key column)

e.g. Alter table tb1 add constraint tb1_genderid foreign key(tb1) references tb(id)

Note – The constraint name should be meaningful, like tableName_columnName.
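The enforcement a foreign key buys you can be seen in a small sqlite3 sketch (SQLite used purely for illustration; it needs foreign key checking switched on explicitly, and the table names mirror the example above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite: enable FK checks
conn.execute("CREATE TABLE tb (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE tb1 (genderid INTEGER, "
    "FOREIGN KEY (genderid) REFERENCES tb (id))"
)
conn.execute("INSERT INTO tb (id) VALUES (1)")
conn.execute("INSERT INTO tb1 (genderid) VALUES (1)")  # OK: 1 exists in tb

# A value with no matching primary key is rejected.
try:
    conn.execute("INSERT INTO tb1 (genderid) VALUES (99)")
    ok = True
except sqlite3.IntegrityError:
    ok = False
print(ok)  # False
```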

2. Select all values of a table:

Command: Select * from [table name]

To list all tables of a database in SQL Server, query the catalog views:

Select * from INFORMATION_SCHEMA.TABLES (the INFORMATION_SCHEMA views act as the data dictionary; the dual table seen in some tutorials is Oracle-specific)

3. Insert values in a table :

Insert command is used to insert values in a table: Insert into [table name] (col 1, col 2, …) Values(col 1 value, col 2 value,…)

e.g. Insert into a1(id, name, gender) values(11, 'ss', 'male') (note that SQL string literals use single quotes)

4. Adding a Default value in a column:

We can assign default values to a column rather than assigning Null values:

Alter table [table_name] add constraint constraint_name Default [default value] For [column name]

e.g. Alter table tb1 add constraint tb1_gender default 2 for gender

This command will assign the default value 2 to the gender column if a value is not explicitly provided.

5. Adding a New column into table:

Command: Alter table [table name] add [column name] [column data type] [NULL|NOT NULL] constraint [constraint name] default [default value]

Alter table tb1 add Address nvarchar(50) Not Null constraint tb1_address default 'xyz'

This command will add one column, Address, to table tb1 that doesn’t accept null values. Also, the default value 'xyz' will be assigned to it.

6. Dropping a Constraint:

Command: Alter table [table name] Drop Constraint [constraint name]

e.g. Alter table tb1 drop constraint tb1_gender

This will drop the constraint.

7. Delete a Table record:

To delete a table record, we use delete command:

Delete from [table name] where column1='column value'

Note: The Where clause is used to put a condition on the selection.

However, you can’t delete a record if it is referenced from another table. Cascading referential integrity constraints imposed on foreign keys control what happens in that case.

8. Cascade Referential Integrity Constraint:

We can choose options if a foreign key constraint reference is getting deleted. Four options are there:

  1. No Action: This simply generates an error if a record is deleted from the primary key table while a related value exists in the foreign key table.
  2. Cascade: This option deletes all the foreign key records related to the deleted primary key record.
  3. Set NULL: This option sets the dependent foreign key values to Null.
  4. Set Default: This option sets the dependent foreign key values to the default value provided for the column.
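Option 2 (Cascade) can be sketched with sqlite3 (SQLite for illustration only; `parent`/`child` are invented table names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite: enable FK checks
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child (pid INTEGER "
    "REFERENCES parent (id) ON DELETE CASCADE)"
)
conn.execute("INSERT INTO parent (id) VALUES (1)")
conn.execute("INSERT INTO child (pid) VALUES (1)")

# Deleting the parent row cascades to its child rows.
conn.execute("DELETE FROM parent WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM child").fetchone()[0]
print(remaining)  # 0
```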

9. Adding a Check Constraint:

This constraint is used to enforce value checks on a column, for example that the value in the age column must be > 4.

Command: Alter table [table name] add constraint [constraint name] check (boolean expression)

e.g. Alter table tb1 add constraint tb1_age_check check(AGE>0 AND AGE<30)

This command will only let you add age between 0 and 30 in the Age column.

Note: The check constraint returns a Boolean value that determines whether the value is entered in the table. It also lets you insert Null values, because for Null values the check constraint returns “Unknown”.
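Both behaviours (rejection of out-of-range values, acceptance of Null) show up in a small sqlite3 sketch of the same check constraint (SQLite used for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tb1 (age INTEGER CHECK (age > 0 AND age < 30))")

conn.execute("INSERT INTO tb1 (age) VALUES (25)")    # passes the check
conn.execute("INSERT INTO tb1 (age) VALUES (NULL)")  # Unknown -> allowed
try:
    conn.execute("INSERT INTO tb1 (age) VALUES (40)")  # fails the check
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

count = conn.execute("SELECT COUNT(*) FROM tb1").fetchone()[0]
print(rejected, count)  # True 2
```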

10. Identity Column:

It is a column property in SSMS.

Identity column is a column to which values are automatically assigned. There are different properties linked to this column. Please set up following values while creating identity column:

Identity Seed: A value with which the identity column value starts

Identity Increment: The value with which identity column value is incremented.

Command: Create table stu(id int identity(1,1) Primary Key)

This command will create a stu table having id as an identity column. The id column here starts at 1 and is incremented by 1.
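The identity behaviour can be sketched with SQLite's auto-assigned INTEGER PRIMARY KEY, which stands in for identity(1,1) here (SQL Server syntax differs; SQLite has no seed/increment parameters):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# An INTEGER PRIMARY KEY in SQLite auto-assigns ids starting at 1,
# incrementing by 1, much like identity(1,1) in SQL Server.
conn.execute("CREATE TABLE stu (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO stu (name) VALUES ('a')")  # id not supplied
conn.execute("INSERT INTO stu (name) VALUES ('b')")

ids = [r[0] for r in conn.execute("SELECT id FROM stu ORDER BY id")]
print(ids)  # [1, 2]
```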

11. Setting up External Values/Explicit Value to Identity Column:

To set up external value in Identity column, add the following command before inserting values in table:

Command: Set IDENTITY_INSERT [table name] ON

Insert into table name(column list) values(1, '23', …etc)

12. Setting Off External Values/Explicit Value to Identity Column:

Command: Set IDENTITY_INSERT [table name] OFF

After this, the identity column goes back to being assigned automatically, and explicit values for it are rejected again.

Note: To reset the identity column value, use the DBCC CHECKIDENT command, e.g. DBCC CHECKIDENT ('table_name', RESEED, 0).

13. Unique Key Constraint:

A unique key constraint is used to enforce unique values in a column. There is a slight difference between a primary key and a unique key.

Primary key values = Unique+Not Null

Unique constraint value = Unique + values can be null

Command: Alter table table_name add constraint constraint_name unique(column name)

or

Create table Stu(Name varchar(20) Unique)
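The primary key vs unique key difference above can be demonstrated with sqlite3 (SQLite for illustration; note that SQLite, unlike SQL Server, allows multiple NULLs in a unique column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stu (Name TEXT UNIQUE)")
conn.execute("INSERT INTO Stu (Name) VALUES ('John')")
conn.execute("INSERT INTO Stu (Name) VALUES (NULL)")  # NULL is allowed

# A duplicate non-NULL value violates the unique constraint.
try:
    conn.execute("INSERT INTO Stu (Name) VALUES ('John')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```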

14. Applying a Trigger:

Firstly, let’s try to understand what a trigger is. A trigger is a SQL instruction, or set of instructions, that causes an action once a specific condition occurs. For example: inserting a row into table 2 when a row is inserted into table 1.

Command:

Create Trigger [trigger_name] on [table_name] for Insert/Update/Delete

as

begin

[instructions]

end
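The example given above (copy a row into table 2 whenever table 1 gets one) can be sketched with sqlite3; SQLite's trigger syntax is close to T-SQL's, and `table1`/`table2`/`copy_row` are invented names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER)")
conn.execute("CREATE TABLE table2 (id INTEGER)")
# After every insert into table1, copy the new id into table2.
conn.execute(
    "CREATE TRIGGER copy_row AFTER INSERT ON table1 "
    "BEGIN INSERT INTO table2 (id) VALUES (NEW.id); END"
)

conn.execute("INSERT INTO table1 (id) VALUES (7)")
copied = conn.execute("SELECT id FROM table2").fetchone()[0]
print(copied)  # 7
```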

15. Selecting values from table:

Select is a command used to retrieve records from a table.

  1. To fetch all records from a table:

Command: Select * from [table_name]

e.g. Select * from emp

2. Select specific columns from a table:

Command: Select [col_name_1], [col_name_2]… from [table_name]

e.g. Select name, age, id from Employee

3. Fetch all distinct records from a table:

Command: Select distinct [column_name] from [table_name]

e.g. Select distinct name from Employee

This command will fetch the distinct values of the Name column from the Employee table.

4. Fetch record matching a specific condition:

Where is used to apply a specific condition in the SQL command.

Command: Select * from table_name where column_name = condition value

e.g. Select name, id from employee where name='John'

This command will fetch all the records from the table where the name column value is John.

5. Fetch record not matching a specific condition (column value):

Command: Select * from table_name where col_name <> Column value

“<>” signifies as not equal to here. We can also use “!=” to compare values.

6. OR operator in SQL:

OR operator is used to specify two or more conditions together.

Command: Select * from table_name where col1=value OR col2=value

e.g. Select name, age, salary from Employee where name='John' OR name='Nick'

This SQL command will fetch all the table records where the name is either John or Nick.

7. AND Operator in SQL:

AND operator is used to specify two and more conditions together.

Command: Select * from table_name where col1=value AND col2=value

e.g. Select name, age, salary from Employee where name='John' AND age=30

This SQL command will fetch all the table records where the name is John and the age is 30.

8. IN Operator in SQL:

IN operator is used to retrieve records where condition matches more than 1 value. (And you don’t want to use OR multiple times in a sql command)

Command: Select * from table_name where col_name IN(value1, value2, value3…)

e.g. Select * from Employee where age IN (21, 25, 30)

This command will fetch all the table records where the age is either 21, 25, or 30.
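The WHERE, OR/AND, and IN examples above come together in one runnable sqlite3 sketch (SQLite for illustration; the Employee rows are invented, and SQL string literals use single quotes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO Employee (name, age) VALUES (?, ?)",
    [("John", 30), ("Nick", 25), ("Amy", 40)],
)

# IN matches any of the listed values, replacing a chain of ORs.
names = [r[0] for r in conn.execute(
    "SELECT name FROM Employee WHERE age IN (21, 25, 30) ORDER BY name")]
print(names)  # ['John', 'Nick']
```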

SQL Wildcards

SQL supports various kinds of wildcard characters to facilitate data retrieval in multiple ways: for example, % matches any string of zero or more characters, and _ matches any single character. Please refer to the image for all SQL wildcard characters.

The DataBird
