SQL Basics

Huseyin Baytar
5 min readDec 10, 2023

--

Structured Query Language is formed from the initial letters of SQL. It is the common query language for all databases. It emerged in the 1970s and remains popular to this day. Like modern English commands, a query language has been created to be as close as possible to the spoken language.

To procure the raw material of data, which is the basis of data science projects, it is essential to have a good understanding of the SQL language. When looking at job postings related to data, it is evident that SQL and Python are prioritized.

Fundamentally, a database, which is the raw material of data science projects, is any structure that holds data in lists in tables and rows. Databases consist of tables, columns, rows, and indexes.

Database management systems (DBMS) require resources when reading data from a database. Our computer’s resources, such as CPU and RAM, are used to analyze and query raw data. In contrast, in a database management system, we communicate with a database server. We instruct it to retrieve data using a language it understands, and the system delivers the requested data using its own resources. This database management system responds not only to us but also to many clients.

A database server is not hardware; it is software, also known as a database management system. For example, when a client computer wants to query data from a database server, they must first establish a connection, being on the same network. Once this connection is established, the next step is to execute the sent SQL command. For instance, to retrieve a list of customers, we can say

SELECT * FROM Customer

The concept of a Relational Database Management System (RDMS) refers to database systems structured to eliminate duplicate data. It is designed to reduce the effort and labor associated with continuously entering repetitive data. RDMS aims to prevent unnecessary occupation of space and resource waste caused by redundant data. RDMS entered our lives to prevent data integrity from being compromised due to human error during data entry and the difficulty of making retroactive updates to data.

Integer data types

  • Bigint: Ranges from -2⁶³ to 2⁶³ — 1, occupying 8 bytes.
  • Integer: Ranges from -2³¹ to 2³¹ — 1, occupying 4 bytes.
  • Smallint: Ranges from -2¹⁵ to 2¹⁵ — 1, occupying 2 bytes.
  • Tinyint: Ranges from 0 to 255, occupying 1 byte.
  • Bit: Takes values of 0 or 1.

String Data Types

  • Char: Occupies as many bytes as the defined length (up to 8000).
  • Varchar: Occupies the length of the entered value plus 2 bytes (0 to 8000).
  • Varchar(max): Ranges from 0 to 2,147,483,647, occupying up to 2³¹-1 bytes.
  • Text: Ranges from 0 to 2,147,483,647, occupying up to 2,147,483,647 bytes.
  • Ntext: Ranges from 0 to 1,073,741,823, occupying up to 1,073,741,823 bytes.

Decimal data Types

  • Decimal/Numeric: Ranges from -10³⁸ + 1 to 10³⁸ — 1; disk space depends on precision, 5 bytes for precision 1 to 9.
  • Money: Ranges from -922,337,203,685,477.5808 to positive max, occupying 8 bytes.
  • Smallmoney: Ranges from -214,748.3648 to the opposite max, occupying 4 bytes.
  • Float: Up to 7 digits, 4 bytes; up to 15 digits, 8 bytes.
  • Real: 4 bytes.

Date Time Data Types

  • Date: 4 bytes.
  • Smalldate: 3 bytes.
  • Datetime: 8 bytes, plus additional for time.
  • Datetime2: Between 6 and 8 bytes, with additional for milliseconds.
  • Datetimeoffset: Between 9 and 10 bytes, with additional for timezone range.
  • Time: Stores only hour information, 5 bytes if used as default.

Other Data Types

  • Image: No longer supported by SQL Server, used for storing binary files independently, up to 2 GB.
  • Binary: Occupies as many bytes as defined.
  • Varbinary: Ranges from 0 to 8 KB, occupying defined length plus 2 bytes.
  • Varbinary(max): Can store up to 2 GB.
  • XML: Used for storing XML data.
  • Table: Used to store a result set for later use.
  • Uniqueidentifier: Holds data guaranteeing global uniqueness.
  • Hierarchyid: Used to represent positions in hierarchical structures.
  • Geography: Stores coordinates based on the world’s coordinate system.
  • Geometry: Stores coordinates based on the Euclidean coordinate system, without accounting for the Earth’s curvature.

SQL Language

We can divide it into 2 parts:

  1. DML Commands (Data Manipulation)
  • SELECT: Retrieves records from tables in the database.
  • INSERT: Adds new records to a table.
  • UPDATE: Modifies data in one or more fields of a table.
  • DELETE: Removes records from a table.
  • TRUNCATE: Empties the contents of a table.

These commands are used to manipulate data within the database.

2. Database Manipulation Commands

  • CREATE: Creates a new database object.
  • ALTER: Modifies the properties of a database object.
  • DROP: Deletes a database object.

These commands are used to create, modify, and delete database objects.

Aggregate Functions and Group By

When retrieving data, aggregate functions and the GROUP BY clause are used to extract a summarized portion of the data.

Aggregate Functions (e.g., MIN, MAX, SUM, AVG, COUNT)

  • Primary Key: Uniquely identifies each record in a table, often denoted as “id.”
  • Foreign Key: Refers to a value in another table, establishing a relationship.

JOIN Types:

Suppose we have USERS and ADDRESS tables. Some users may not have an address or may have multiple addresses. In such cases:

  • Inner Join: Takes the intersection of both sets.
  • Left (Outer) Join: Fills rows on both sides; users and addresses. If there is information in one table but not in the other, it will be filled, and if there is a user with no address, the address row will be empty.
  • Right (Outer) Join: Similar to left join, but if there is information in the address table and not in the users table, the user row will be empty.
  • Full (Outer) Join: Combines intersection and unique rows from both sides.

In corporate settings, left join is often recommended.

Subquery is another method in relational databases where one query is embedded within another, treating the inner query as a column. It is less performant and occupies more space compared to using joins.

--

--